mlx.Copy shares the backing buffer with its source (via
copy_shared_buffer) rather than allocating independent storage.
When used to snapshot a slice of the KV cache, the snapshot array
holds the entire original cache buffer alive through the shared
data pointer — even after eval detaches the computation graph.
Replace Copy with Contiguous in Snapshot and Split. Contiguous
allocates a compact buffer when the source buffer is significantly
larger than the logical slice (Contiguous::eval checks
buffer_size > nbytes + 16384), which is always the case for KV
cache slices.
Previously, a partial match within a node's edge would truncate the path
to the parent snapshot - effectively making all cache types behave as
recurrent caches. Caches with only transformer layers can rewind to
arbitrary boundary so this restores this capability to improve cache
hits
Enable multiple conversations to reuse cached computations when they
share token prefixes (e.g. the same system prompt). A prefix trie
tracks shared regions so switching between conversations only
recomputes tokens that diverge. Inactive conversation state is paged
from active GPU memory to other memory and restored on demand, with LRU
eviction to keep memory usage bounded.
* prefer rocm v6 on windows
Avoid building with v7 - more changes are needed
* MLX: add header vendoring and remove go build tag
This switches to using a vendoring approach for the mlx-c headers so that Go
can build without requiring a cmake first. This enables building the new MLX
based code by default. Every time cmake runs, the headers are refreshed, so we
can easily keep them in sync when we bump mlx versions. Basic Windows
and Linux support are verified.
* ci: harden for flaky choco repo servers
CI sometimes fails due to choco not actually installing cache. Since it just speeds up the build, we can proceed without.
* review comments
This change adds support for qwen3.5-next-moe models (qwen3-next/qwen3.5-next/qwen3-coder) to the MLX runner. It also:
* introduces recurrent cache support and related MLX ops
* updates pipeline/runner integration and adds tests
* properly quantizes stacked expert tensors
* a Gated Delta Metal kernel for fast SSM inference
* adds new MLX calls for Conv1d, DepthwideConv1d, Contiguous, Exp, Log, SoftmaxAxis
Pass subprocess stdout/stderr through to the parent's stderr directly
instead of re-wrapping each line with slog. The subprocess already
writes structured slog output, so the re-wrapping produced nested
timestamps, levels, and message fields that were hard to read.
Also downgrade verbose KV cache debug logs to trace level.
The KV cache previously used a tree structure which could
store multiple divergent sequences, which is good for cache
reuse. However, this is typically used in conjunction with
paged attention so each node in the tree can store just a
chunk of the KV cache and they can be stitched together later.
We don't currently do this, so the cache was storing copies of
the full cache for each past sequence.
This redundancy plus the lack of resource limits, caused significant
memory use as a conversation grew. Instead, this changes to store
a single entry for the cache, which can be prefix matched. Although
it is less ideal for multiple users, it largely matches Ollama's
current behavior. It can be improved as additional pieces are fleshed
out.
The previous approach tracked array lifecycles through reference
counting, where each array recorded its inputs and a reference count
that was decremented as dependents were freed. This is not really
necessary as MLX tracks references internally. It is also error
prone as it is easy to create new arrays and forget to free them
when the Go variable goes out of scope.
Instead, we can pin just the arrays we want (typically outputs and
specific intermediates, like the cache). All other arrays are freed
by default when we run sweep. This avoids most causes of memory leaks
while still giving the freedom to save what we want.
This change adds a new MLX based runner which includes:
* Method-based MLX bindings
* Subprocess-based MLX runner (x/mlxrunner)
* KV cache with tree management
* A basic sampler
The GLM4-MoE-Lite model has been ported to use the new bindings.
---------
Co-authored-by: Michael Yang <git@mxy.ng>