ollama

mirror of https://github.com/ollama/ollama.git synced 2026-03-27 02:58:43 +07:00

Author	SHA1	Message	Date
Patrick Devine	de5cb7311f	mlx: add mxfp4/mxfp8/nvfp4 importing (#15015 ) This change allows importing bf16 and converting to mxfp4/mxfp8/nvfp4 and also importing fp8 and converting directly to mxfp8.	2026-03-24 13:45:44 -07:00
Jesse Gross	95ee7fbd29	mlxrunner: panic on double unpin	2026-03-23 17:44:19 -07:00
Jesse Gross	ec55536734	mlxrunner: show time since last used in cache dump tree	2026-03-23 17:44:19 -07:00
Jesse Gross	77491439c2	mlxrunner: support partial match on pure transformer caches Previously, a partial match within a node's edge would truncate the path to the parent snapshot - effectively making all cache types behave as recurrent caches. Caches with only transformer layers can rewind to arbitrary boundary so this restores this capability to improve cache hits	2026-03-23 17:44:19 -07:00
Daniel Hiltgen	c2b0bb7a52	mlx: update as of 3/23 (#14789 ) * mlx: update to HEAD on 3/23 Also fixes a few misc vendoring bugs uncovered with this first update. This also renames the version files to make them clearer. * CUDA Fast Gated Delta kernel * mlx: detect eval errors and panic On model errors or missing kernels, don't mask the error, bubble it up.	2026-03-23 11:28:44 -07:00
Jesse Gross	d7c176ab91	llm, mlxrunner: fix done channel value consumed by first receiver Receiving from a buffered chan error consumes the value, so only the first caller (WaitUntilRunning, HasExited, or Close) sees the signal. Subsequent receivers block or take the wrong branch. Replace with a closed chan struct{} which can be received from any number of times, and store the error in a separate field.	2026-03-19 17:44:28 -07:00
Jesse Gross	0ff7d724ff	mlx: fix subprocess log deadlock The stderr reader used bufio.Scanner which has a 64KB max line size. If the subprocess wrote a line exceeding this limit, the scanner would stop reading, the OS pipe buffer would fill, and the subprocess would deadlock. Replace the scanner with a statusWriter that wraps io.Copy. The writer forwards all stderr to os.Stderr while capturing the last short line (≤256 bytes) for error reporting, avoiding both the deadlock and the need to buffer arbitrarily long lines.	2026-03-19 17:44:28 -07:00
Jesse Gross	96e36c0d90	mlxrunner: share KV cache across conversations with common prefixes Enable multiple conversations to reuse cached computations when they share token prefixes (e.g. the same system prompt). A prefix trie tracks shared regions so switching between conversations only recomputes tokens that diverge. Inactive conversation state is paged from active GPU memory to other memory and restored on demand, with LRU eviction to keep memory usage bounded.	2026-03-18 16:06:33 -07:00
Jesse Gross	6f8ddbb26b	mlxrunner: fix Slice(0, 0) returning full dimension instead of empty Slice used cmp.Or to resolve a zero stop value to the dimension size, intended to support open-ended slices like a[i:]. This made Slice(0, 0) indistinguishable from Slice(), so any slice with a zero stop would silently include the entire dimension instead of being empty. Replace cmp.Or with an explicit End sentinel and resolve negative indices against the dimension size, matching Python/PyTorch semantics.	2026-03-18 16:06:33 -07:00
Patrick Devine	d727aacd04	mlx: quantized embeddings, fast SwiGLU, and runtime fixes (#14884 ) Add QuantizedEmbedding and EmbeddingLayer interface so models can use quantized embedding weights and expose tied output projections. This change updates gemma3, glm4_moe_lite, llama, qwen3, and qwen3_5 to use the new interface.	2026-03-17 11:21:38 -07:00
Patrick Devine	fa69b833cd	mlx: add prequantized tensor packing + changes for qwen35 (#14878 ) This change adds a tensorImportTransform interface for model-specific tensor transformations during safetensors import. This allows importing and modifying the standard HF based weights as well as the mlx-community derived pre-quantized safetensors repos to be directly imported into `ollama create`. Right now this only works with Qwen3.5 importing which does tensor renaming, norm weight shifting (it adds +1 to each value of the norm vectors), conv1d transposition, and casts to BF16s for F32 based vectors.	2026-03-17 11:21:18 -07:00
Jesse Gross	bbbad97686	sched: Model eviction for MLX MLX runners (image generation and LLM) previously bypassed the scheduler's standard load path via a separate loadMLX method. This meant they skipped VRAM fitting checks and couldn't participate in model eviction. Now all model types flow through the same load function. Model eviction for MLX is based on weights as KV cache and compute graph are dynamic. This means that eviction does not take into account the worst case memory and models can still compete for memory but it is a significant improvement.	2026-03-16 17:40:29 -07:00
Daniel Hiltgen	539741199e	mlx: perf improvements (#14768 ) * mlx: perf improvements Fix nn.go to call mlx_fast_layer_norm instead of manually implementing (mean, subtract, variance, rsqrt, multiply, add — 6 ops) Fix llama.go, gemma3.go to remove RepeatKV to tile K/V tensors to match the Q head count, since scaled_dot_product_attention natively handles GQA (it just requires n_q_heads % n_kv_heads == 0) * review comments	2026-03-12 12:01:28 -07:00
Daniel Hiltgen	c222735c02	mlx: only log load errors when MLX is needed (#14764 ) This suppresses irrelevant/noisy errors in the GGML runner.	2026-03-11 10:31:31 -07:00
Daniel Hiltgen	87d21c7fc0	MLX: harden for init failures (#14777 ) The CLI now links to the lazy-load MLX code, but that still happens in init functions. On internal MLX errors, the CLI exits before it has a chance to start. This change re-wires the MLX error handling so it doesn't exit by default. The MLX based runners currently expect exits on failure, so they re-initialize the default error handling. We can refine error handling for better go stack traces in the future.	2026-03-10 22:52:23 -07:00
Daniel Hiltgen	62d1f01ab4	ci: Fix windows build (#14754 ) Instead of relying on sh for wildcard, do it in Go for better windows compatibility.	2026-03-09 19:27:59 -07:00
Daniel Hiltgen	10e51c5177	MLX: add header vendoring and remove go build tag (#14642 ) * prefer rocm v6 on windows Avoid building with v7 - more changes are needed * MLX: add header vendoring and remove go build tag This switches to using a vendoring approach for the mlx-c headers so that Go can build without requiring a cmake first. This enables building the new MLX based code by default. Every time cmake runs, the headers are refreshed, so we can easily keep them in sync when we bump mlx versions. Basic Windows and Linux support are verified. * ci: harden for flaky choco repo servers CI sometimes fails due to choco not actually installing cache. Since it just speeds up the build, we can proceed without. * review comments	2026-03-09 17:24:45 -07:00
Patrick Devine	3e06bde643	mlx: get parameters from modelfile during model creation (#14747 )	2026-03-09 15:33:24 -07:00
Patrick Devine	d126467d5d	x/mlxrunner: replace sampler interface chain with single stateful Sampler (#14652 ) - Collapse MLX sampling state into a single sample.Sampler struct (options + history). - Replace interface-based sampler chain (TopP, TopK, penalty, etc.) with function-based transforms. - Update request/pipeline wiring to use *sample.Sampler, seed history from prompt tokens, and append generated tokens each step. - Implement top_p, min_p, repeat_penalty, and frequency_penalty	2026-03-07 17:50:57 -08:00
Patrick Devine	e790dc435b	mlx: int4 groupsize 64 (#14682 ) Change affine 4bit integers to use groupsize 64	2026-03-06 16:39:47 -08:00
Jeffrey Morgan	4eab60c1e2	Reapply "don't require pulling stubs for cloud models" again (#14608 ) * Revert "Revert "Reapply "don't require pulling stubs for cloud models"" (#14606)" This reverts commit `39982a954e`. * fix test + do cloud lookup only when seeing cloud models --------- Co-authored-by: ParthSareen <parth.sareen@ollama.com>	2026-03-06 14:27:47 -08:00
Bruce MacDonald	1af850e6e3	parsers: repair unclosed arg_value tags in GLM tool calls (#14656 ) GLM models sometimes omits </arg_value> closing tags in tool call XML, causing xml.Unmarshal to fail with "element <arg_value> closed by </tool_call>". This is a known issue across the GLM family. Sanitize the input to fix closing arg_key values so encoding/xml can handle it.	2026-03-06 14:08:34 -08:00
Jeffrey Morgan	39982a954e	Revert "Reapply "don't require pulling stubs for cloud models"" (#14606 ) This reverts commit `799e51d419`.	2026-03-03 20:56:10 -08:00
Patrick Devine	e9f6ea232f	Add qwen3.5-next-moe support to MLX runner and models (#14417 ) This change adds support for qwen3.5-next-moe models (qwen3-next/qwen3.5-next/qwen3-coder) to the MLX runner. It also: * introduces recurrent cache support and related MLX ops * updates pipeline/runner integration and adds tests * properly quantizes stacked expert tensors * a Gated Delta Metal kernel for fast SSM inference * adds new MLX calls for Conv1d, DepthwideConv1d, Contiguous, Exp, Log, SoftmaxAxis	2026-03-03 16:39:22 -08:00
Patrick Devine	110eff01a9	chore: remove old imagegen LLMs models (#14597 ) These models are implemented in the x/mlxrunner instead.	2026-03-03 13:23:40 -08:00
Jeffrey Morgan	799e51d419	Reapply "don't require pulling stubs for cloud models" This reverts commit `97d2f05a6d`.	2026-03-03 13:17:10 -08:00
Jeffrey Morgan	97d2f05a6d	Revert "don't require pulling stubs for cloud models (#14574 )" (#14596 ) This reverts commit `8207e55ec7`.	2026-03-03 12:51:23 -08:00
Devon Rifkin	8207e55ec7	don't require pulling stubs for cloud models (#14574 ) * don't require pulling stubs for cloud models This is a first in a series of PRs that will better integrate Ollama's cloud into the API and CLI. Previously we used to have a layer of indirection where you'd first have to pull a "stub" model that contains a reference to a cloud model. With this change, you don't have to pull first, you can just use a cloud model in various routes like `/api/chat` and `/api/show`. This change respects <https://github.com/ollama/ollama/pull/14221>, so if cloud is disabled, these models won't be accessible. There's also a new, simpler pass-through proxy that doesn't convert the requests ahead of hitting the cloud models, which they themselves already support various formats (e.g., `v1/chat/completions` or Open Responses, etc.). This will help prevent issues caused by double converting (e.g., `v1/chat/completions` converted to `api/chat` on the client, then calling cloud and converting back to a `v1/chat/completions` response instead of the cloud model handling the original `v1/chat/completions` request first). There's now a notion of "source tags", which can be mixed with existing tags. So instead of having different formats like`gpt-oss:20b-cloud` vs. `kimi-k2.5:cloud` (`-cloud` suffix vs. `:cloud`), you can now specify cloud by simply appending `:cloud`. This PR doesn't change model resolution yet, but sets us up to allow for things like omitting the non-source tag, which would make something like `ollama run gpt-oss:cloud` work the same way that `ollama run gpt-oss` already works today. More detailed changes: - Added a shared model selector parser in `types/modelselector`: - supports `:cloud` and `:local` - accepts source tags in any position - supports legacy `:<tag>-cloud` - rejects conflicting source tags - Integrated selector handling across server inference/show routes: - `GenerateHandler`, `ChatHandler`, `EmbedHandler`, `EmbeddingsHandler`, `ShowHandler` - Added explicit-cloud passthrough proxy for ollama.com: - same-endpoint forwarding for `/api/`, `/v1/`, and `/v1/messages` - normalizes `model` (and `name` for `/api/show`) before forwarding - forwards request headers except hop-by-hop/proxy-managed headers - uses bounded response-header timeout - handles auth failures in a friendly way - Preserved cloud-disable behavior (`OLLAMA_NO_CLOUD`) - Updated create flow to support `FROM ...:cloud` model sources (though this flow uses the legacy proxy still, supporting Modelfile overrides is more complicated with the direct proxy approach) - Updated CLI/TUI/config cloud detection to use shared selector logic - Updated CLI preflight behavior so explicit cloud requests do not auto-pull local stubs What's next? - Cloud discovery/listing and cache-backed `ollama ls` / `/api/tags` - Modelfile overlay support for virtual cloud models on OpenAI/Anthropic request families - Recommender/default-selection behavior for ambiguous model families - Fully remove the legacy flow Fixes: https://github.com/ollama/ollama/issues/13801 * consolidate pull logic into confirmAndPull helper pullIfNeeded and ShowOrPull shared identical confirm-and-pull logic. Extract confirmAndPull to eliminate the duplication. * skip local existence checks for cloud models ModelExists and the TUI's modelExists both check the local model list, which causes cloud models to appear missing. Return true early for explicit cloud models so the TUI displays them beside the integration name and skips re-prompting the model picker on relaunch. * support optionally pulling stubs for newly-style names We now normalize names like `<family>:<size>:cloud` into legacy-style names like `<family>:<size>-cloud` for pulling and deleting (this also supports stripping `:local`). Support for pulling cloud models is temporary, once we integrate properly into `/api/tags` we won't need this anymore. * Fix server alias syncing * Update cmd/cmd.go Co-authored-by: Parth Sareen <parth.sareen@ollama.com> * address comments * improve some naming --------- Co-authored-by: ParthSareen <parth.sareen@ollama.com>	2026-03-03 10:46:33 -08:00
Jesse Gross	ad16bffc7d	mlx: Remove peak memory from the API This is still in flux so it is better to just log it for now.	2026-03-02 15:56:18 -08:00
Jesse Gross	c1e3ef4bcc	mlxrunner: Refcount pinned tensors Otherwise, it is error prone to manage multiple components working with the same tensor.	2026-03-02 15:56:06 -08:00
Jesse Gross	a60b9adcce	mlxrunner: Fix prompt eval timing and count metrics Only the last token's processing time is included in prompt processing, giving an artificially high rate. In addition, the number of tokens only included the tokens that miss the cache, instead of our historic total tokens.	2026-02-27 17:29:47 -08:00
Jesse Gross	a16f96658b	mlxrunner: Enforce model context limit Currently, context length is unbounded - the cache will keep growing forever independent of the model's trained context length. This caps it and enforces semantics similar to most cloud services: - Long prompts will result in an error, not truncation. - Generation that exceeds the context will be stopped	2026-02-27 17:29:47 -08:00
Jesse Gross	18ab09b431	mlxrunner: Propagate pipeline errors to client via api.StatusError Errors that occur during pipeline processing are currently only logged but not sent back to the client. Rather than using HTTP status codes as we have historically done, this serializes errors as messages to allow sending them at any time during the stream.	2026-02-27 17:29:47 -08:00
Jesse Gross	638faeac54	mlxrunner: Report actual memory usage from runner The MLX runner previously reported a static VRAM estimate that was computed at load time and consisted only of the weights. This is strictly less than the actual memory usage, as it does not include the KV cache or compute graph.	2026-02-27 17:29:47 -08:00
Jesse Gross	dd5eb6337d	mlxrunner: Fix panic on full KV cache hit When the entire prompt was already cached (e.g. repeated prompt), findRemaining returned an empty slice, causing FromValues to panic on an index-out-of-range accessing a zero-length byte slice. Fix by always keeping at least one token to re-evaluate so the pipeline can seed token generation. Also reject empty prompts early rather than panicking.	2026-02-27 11:07:03 -08:00
Patrick Devine	79917cf80b	show peak memory usage (#14485 )	2026-02-26 18:38:27 -08:00
Jesse Gross	0f23b7bff5	mlxrunner: Cancel in-flight requests when the client disconnects Currently, a canceled request can result in computation continuing in the background to completion. It can also trigger a deadlock when there is nobody to read the output tokens and the pipeline cannot continue to the next request.	2026-02-25 14:00:42 -08:00
Jesse Gross	4e57d2094e	mlxrunner: Simplify pipeline memory and cache management Particularly in error cases, it can be difficult to ensure that all pinned memory is unpinned, MLX buffers are released and cache state is consistent. This encapsulates those pieces and sets up proper deferrals so that this happens automatically on exit.	2026-02-25 14:00:42 -08:00
Daniel Hiltgen	f4f0a4a471	update mlx-c bindings to 0.5.0 (#14380 ) * chore: update mlx-c bindings to 0.5.0 (#14303) * linux: use gcc 11 --------- Co-authored-by: Patrick Devine <patrick@infrahq.com>	2026-02-23 16:44:29 -08:00
Jesse Gross	f20dc6b698	mlx: don't default to affine quantization for unquantized models Otherwise the BF16 version of models trigger segfaults when they call into quantized kernels.	2026-02-23 15:03:53 -08:00
Jesse Gross	8daf47fb3a	mlxrunner: Fix duplicate log prefixes and reduce log noise Pass subprocess stdout/stderr through to the parent's stderr directly instead of re-wrapping each line with slog. The subprocess already writes structured slog output, so the re-wrapping produced nested timestamps, levels, and message fields that were hard to read. Also downgrade verbose KV cache debug logs to trace level.	2026-02-23 14:09:20 -08:00
Jesse Gross	5c73c4e2ee	mlxrunner: Simplify KV cache to single-entry prefix matching The KV cache previously used a tree structure which could store multiple divergent sequences, which is good for cache reuse. However, this is typically used in conjunction with paged attention so each node in the tree can store just a chunk of the KV cache and they can be stitched together later. We don't currently do this, so the cache was storing copies of the full cache for each past sequence. This redundancy plus the lack of resource limits, caused significant memory use as a conversation grew. Instead, this changes to store a single entry for the cache, which can be prefix matched. Although it is less ideal for multiple users, it largely matches Ollama's current behavior. It can be improved as additional pieces are fleshed out.	2026-02-23 09:50:07 -08:00
Jesse Gross	5daf59cc66	mlxrunner: Fix memory leaks with pin/sweep lifecycle management The previous approach tracked array lifecycles through reference counting, where each array recorded its inputs and a reference count that was decremented as dependents were freed. This is not really necessary as MLX tracks references internally. It is also error prone as it is easy to create new arrays and forget to free them when the Go variable goes out of scope. Instead, we can pin just the arrays we want (typically outputs and specific intermediates, like the cache). All other arrays are freed by default when we run sweep. This avoids most causes of memory leaks while still giving the freedom to save what we want.	2026-02-23 09:50:07 -08:00
Jeffrey Morgan	8b4e5a82a8	mlx: remove noisy error output from dynamic library loading (#14346 ) The recent change in #14322 added tryLoadByName() which attempts to load libmlxc.dylib via rpath before searching directories. This is an optimization for Homebrew installations where rpath is correctly set. However, when rpath isn't set (which is the common case for app bundle installations), dlopen fails and the CHECK macro prints an error to stderr: ERROR - dynamic.c:21 - CHECK failed: handle->ctx != NULL This error is misleading because it's an expected failure path - the code correctly falls back to searching the executable directory and loads the library successfully. The error message causes user confusion and makes it appear that something is broken. Replace the CHECK macro with a simple return code so the C code fails silently. The Go code already handles error logging appropriately: tryLoadByName() fails silently (intentional fallback), while tryLoadFromDir() logs via slog.Error() when explicit path loading fails.	2026-02-20 23:46:07 -08:00
Patrick Devine	97323d1c68	consolidate the tokenizer (#14327 ) This change adds a new x/tokenizer package which includes: * New BPE and SentencePiece tokenizers * Removing the dependency on the imagegen tokenizers * Fixes to multibyte decoding in the pipeline * Various correctness and benchmark tests Not included in this PR is the WordPiece tokenizer for BERT models which will be added when we add embedding models. The imagegen tokenizers will also be removed in a follow-up PR.	2026-02-19 15:55:45 -08:00
natl-set	458dd1b9d9	mlx: try loading library via rpath before searching directories (#14322 ) The existing code manually searches directories for libmlxc.* and passes full paths to dlopen, bypassing the binary's rpath. This means MLX libraries installed via package managers (e.g., Homebrew) aren't found even when rpath is correctly set at link time. This change adds a fallback that tries loading via rpath first (using just the library name), before falling back to the existing directory search. This follows standard Unix/macOS conventions and works with any installation that sets rpath. Fixes library loading on macOS with Homebrew-installed mlx-c without requiring OLLAMA_LIBRARY_PATH environment variable. Co-authored-by: Natl <nat@MacBook-Pro.local>	2026-02-19 10:55:02 -08:00
Patrick Devine	0759fface9	Revert "chore: update mlx-c bindings to 0.5.0 (#14303 )" (#14316 ) This reverts commit `f01a9a7859`.	2026-02-18 17:01:25 -08:00
Patrick Devine	f01a9a7859	chore: update mlx-c bindings to 0.5.0 (#14303 )	2026-02-17 16:48:16 -08:00
Patrick Devine	9aefd2dfee	model: add qwen3 support to mlxrunner (#14293 )	2026-02-17 13:58:49 -08:00
Patrick Devine	3a88f7eb20	bugfix: add missing linear layer factory (#14289 )	2026-02-16 17:22:20 -08:00

1 2

96 Commits