ollama

mirror of https://github.com/ollama/ollama.git synced 2026-03-27 02:58:43 +07:00

Files

Jesse Gross 96e36c0d90 mlxrunner: share KV cache across conversations with common prefixes

Enable multiple conversations to reuse cached computations when they
share token prefixes (e.g. the same system prompt). A prefix trie
tracks shared regions so switching between conversations only
recomputes tokens that diverge. Inactive conversation state is paged
from active GPU memory to other memory and restored on demand, with LRU
eviction to keep memory usage bounded.
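
The prefix-sharing idea in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the mlxrunner implementation: the names `prefixTrie`, `insert`, and `longestPrefix` are hypothetical, tokens are assumed to be `int32` IDs, and the trie stores no KV tensors. Paging to other memory and LRU eviction are left out entirely.

```go
package main

import "fmt"

// node is one token position in a hypothetical prefix trie.
// A real cache would also hold (or point to) KV state for this position.
type node struct {
	children map[int32]*node
}

func newNode() *node { return &node{children: make(map[int32]*node)} }

// prefixTrie records token sequences whose computations are assumed cached.
type prefixTrie struct{ root *node }

func newPrefixTrie() *prefixTrie { return &prefixTrie{root: newNode()} }

// insert marks every prefix of tokens as cached.
func (t *prefixTrie) insert(tokens []int32) {
	n := t.root
	for _, tok := range tokens {
		child, ok := n.children[tok]
		if !ok {
			child = newNode()
			n.children[tok] = child
		}
		n = child
	}
}

// longestPrefix returns how many leading tokens are already cached,
// i.e. how much of a conversation can be reused without recomputation.
func (t *prefixTrie) longestPrefix(tokens []int32) int {
	n := t.root
	for i, tok := range tokens {
		child, ok := n.children[tok]
		if !ok {
			return i
		}
		n = child
	}
	return len(tokens)
}

func main() {
	trie := newPrefixTrie()

	// Two conversations that share the same (made-up) system prompt tokens.
	system := []int32{1, 2, 3, 4}
	convA := append(append([]int32{}, system...), 10, 11)
	convB := append(append([]int32{}, system...), 20, 21, 22)

	trie.insert(convA)

	// Switching to conversation B only needs the tokens after the shared prefix.
	shared := trie.longestPrefix(convB)
	fmt.Printf("reuse %d tokens, recompute %d\n", shared, len(convB)-shared)

	trie.insert(convB)
}
```

In this sketch, switching from conversation A to conversation B reuses the four system-prompt tokens and recomputes only the three tokens where B diverges.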

2026-03-18 16:06:33 -07:00

cache

mlxrunner: share KV cache across conversations with common prefixes

2026-03-18 16:06:33 -07:00

mlx

mlxrunner: share KV cache across conversations with common prefixes

2026-03-18 16:06:33 -07:00

model

mlx: quantized embeddings, fast SwiGLU, and runtime fixes (#14884)

2026-03-17 11:21:38 -07:00

sample

mlxrunner: fix Slice(0, 0) returning full dimension instead of empty