ollama

mirror of https://github.com/ollama/ollama.git synced 2026-03-27 02:58:43 +07:00

Author	SHA1	Message	Date
Jesse Gross	d1151e18a1	mlx: fix KV cache snapshot memory leak mlx.Copy shares the backing buffer with its source (via copy_shared_buffer) rather than allocating independent storage. When used to snapshot a slice of the KV cache, the snapshot array holds the entire original cache buffer alive through the shared data pointer — even after eval detaches the computation graph. Replace Copy with Contiguous in Snapshot and Split. Contiguous allocates a compact buffer when the source buffer is significantly larger than the logical slice (Contiguous::eval checks buffer_size > nbytes + 16384), which is always the case for KV cache slices.	2026-03-25 17:26:34 -07:00
rick	ebbce136c7	ggml: force flash attention off for grok	2026-03-25 16:15:49 -07:00
Devon Rifkin	26b9f53f8e	api/show: overwrite basename for copilot chat (#15062 ) Copilot Chat prefers to use `general.basename` in the built-in Ollama integration, but this name isn't usually shown directly to users (and there may be many models that share this name). Instead we pass back `req.Model`, which for this extension is the value that we return from `/api/tags` v0.18.3-rc2 v0.18.3	2026-03-25 14:02:22 -07:00
Eva H	7575438366	cmd: ollama launch vscode (#15060 ) Co-authored-by: Parth Sareen <parth.sareen@ollama.com>	2026-03-25 16:37:02 -04:00
Eva H	7d7c90d702	tui: add left arrow back navigation in model selector (#14940 )	2026-03-25 11:53:48 -07:00
Daniel Hiltgen	4fda69809a	ci: fix windows cgo compiler error (#15046 ) v0.18.3-rc1	2026-03-24 16:45:36 -07:00
Daniel Hiltgen	c9b5da6b0c	integration: improve ability to test individual models (#14948 ) * integration: improve ability to test individual models Add OLLAMA_TEST_MODEL env var to run integration tests against a single model. Enhance vision tests: multi-turn chat with cached image tokens, object counting, spatial reasoning, detail recognition, scene understanding, OCR, and multi-image comparison. Add tool calling stress tests with complex agent-style prompts, large system messages, and multi-turn tool response handling. * review comments	2026-03-24 14:28:23 -07:00
Patrick Devine	de5cb7311f	mlx: add mxfp4/mxfp8/nvfp4 importing (#15015 ) This change allows importing bf16 and converting to mxfp4/mxfp8/nvfp4 and also importing fp8 and converting directly to mxfp8. v0.18.3-rc0	2026-03-24 13:45:44 -07:00
Jesse Gross	95ee7fbd29	mlxrunner: panic on double unpin	2026-03-23 17:44:19 -07:00
Jesse Gross	ec55536734	mlxrunner: show time since last used in cache dump tree	2026-03-23 17:44:19 -07:00
Jesse Gross	77491439c2	mlxrunner: support partial match on pure transformer caches Previously, a partial match within a node's edge would truncate the path to the parent snapshot - effectively making all cache types behave as recurrent caches. Caches with only transformer layers can rewind to an arbitrary boundary, so this restores that capability to improve cache hits.	2026-03-23 17:44:19 -07:00
Parth Sareen	b166b36cd2	docs: update Claude Code with Telegram guide (#15026 )	2026-03-23 16:31:21 -07:00
Daniel Hiltgen	c2b0bb7a52	mlx: update as of 3/23 (#14789 ) * mlx: update to HEAD on 3/23 Also fixes a few misc vendoring bugs uncovered with this first update. This also renames the version files to make them clearer. * CUDA Fast Gated Delta kernel * mlx: detect eval errors and panic On model errors or missing kernels, don't mask the error, bubble it up.	2026-03-23 11:28:44 -07:00
Bruce MacDonald	22c2bdbd8a	docs: nemoclaw integration (#14962 ) --------- Co-authored-by: ParthSareen <parth.sareen@ollama.com>	2026-03-20 15:27:37 -07:00
Bruce MacDonald	6df6d097d9	launch: skip openclaw gateway health check when no daemon install (#14984 )	2026-03-20 15:20:14 -07:00
Jesse Gross	d7c176ab91	llm, mlxrunner: fix done channel value consumed by first receiver Receiving from a buffered chan error consumes the value, so only the first caller (WaitUntilRunning, HasExited, or Close) sees the signal. Subsequent receivers block or take the wrong branch. Replace with a closed chan struct{} which can be received from any number of times, and store the error in a separate field.	2026-03-19 17:44:28 -07:00
Jesse Gross	0ff7d724ff	mlx: fix subprocess log deadlock The stderr reader used bufio.Scanner which has a 64KB max line size. If the subprocess wrote a line exceeding this limit, the scanner would stop reading, the OS pipe buffer would fill, and the subprocess would deadlock. Replace the scanner with a statusWriter that wraps io.Copy. The writer forwards all stderr to os.Stderr while capturing the last short line (≤256 bytes) for error reporting, avoiding both the deadlock and the need to buffer arbitrarily long lines.	2026-03-19 17:44:28 -07:00
Devon Rifkin	46cb7795e1	add ability to turn on debug request logging (#14106 ) If `OLLAMA_DEBUG_LOG_REQUESTS` is set, then on server startup a temp folder will be created. Upon any inference request, the body will be logged to a file in this folder, as well as a small shell script to "replay" the request using cURL. This is just intended for debugging scenarios, not as something to turn on normally.	2026-03-19 17:08:17 -07:00
Bruce MacDonald	126d8db7f3	parsers: robust xml tool repair (#14961 ) Previous xml repair for glm was a good start, but we need to go further and repair any incorrect open or closing tags Co-authored-by: Dongluo Chen <dongluo.chen@gmail.com>	2026-03-19 11:24:48 -07:00
Eva H	3f3a24b418	app: fix desktop app stuck loading when OLLAMA_HOST is an unspecified bind address (#14885 )	2026-03-19 12:57:57 -04:00
Jesse Gross	96e36c0d90	mlxrunner: share KV cache across conversations with common prefixes Enable multiple conversations to reuse cached computations when they share token prefixes (e.g. the same system prompt). A prefix trie tracks shared regions so switching between conversations only recomputes tokens that diverge. Inactive conversation state is paged from active GPU memory to other memory and restored on demand, with LRU eviction to keep memory usage bounded.	2026-03-18 16:06:33 -07:00
Jesse Gross	6f8ddbb26b	mlxrunner: fix Slice(0, 0) returning full dimension instead of empty Slice used cmp.Or to resolve a zero stop value to the dimension size, intended to support open-ended slices like a[i:]. This made Slice(0, 0) indistinguishable from Slice(), so any slice with a zero stop would silently include the entire dimension instead of being empty. Replace cmp.Or with an explicit End sentinel and resolve negative indices against the dimension size, matching Python/PyTorch semantics.	2026-03-18 16:06:33 -07:00
Eva H	b5e7888414	cmd/launch: skip redundant config writes when model unchanged (#14941 )	2026-03-18 17:36:52 -04:00
Parth Sareen	eab4d22269	docs: update claude code and openclaw for web search (#14922 )	2026-03-18 14:18:49 -07:00
Bruce MacDonald	5759c2d2d2	launch: fix openclaw not picking up newly selected model (#14943 ) Sessions with a stale model field were not updated when the primary changed, so the old model continued to be used. v0.18.2-rc1 v0.18.2	2026-03-18 13:20:10 -07:00
Bruce MacDonald	42b1c2642b	docs: update minimax-m2.5 references to m2.7 (#14942 )	2026-03-18 12:59:28 -07:00
Bruce MacDonald	727d69ddf3	tui: fix signin on headless Linux systems (#14627 ) Defensively handle environments without a display server to ensure signin remains usable on headless VMs and SSH sessions. - Skip calling xdg-open when neither DISPLAY nor WAYLAND_DISPLAY is set, preventing silent failures or unexpected browser handlers - Render the signin URL as plain text instead of wrapping it in OSC 8 hyperlink escape sequences, which can be garbled or hidden by terminals that don't support them	2026-03-18 11:11:17 -07:00
Jesse Gross	f622b0c5fc	launch: disable claude attribution header to preserve KV cache Claude Code sends an x-anthropic-billing-header that changes on every request. This is embedded in the system prompt and consequently breaks the KV cache for every request. Given the size of the prompts that Claude Code uses, this has a significant performance impact.	2026-03-17 20:48:03 -07:00
Bruce MacDonald	5d0000634c	cmd/launch: check for both npm and git before installing OpenClaw (#14888 ) The OpenClaw installer requires git in addition to npm. Update the dependency check to detect both and provide specific install guidance for whichever dependencies are missing.	2026-03-17 18:20:05 -07:00
Parth Sareen	676d9845ba	launch: register websearch for openclaw (#14914 ) v0.18.2-rc0	2026-03-17 15:03:15 -07:00
Devon Rifkin	e37a9b4c01	cloud_proxy: for the web_search legacy path, flush on newlines (#14897 ) `WebSearchAnthropicWriter` expects a single object per write. The new transparent proxy will instead send it whatever bytes it sees. This cloud-model + local-orchestration + cloud-search is a temporary code path, so instead of making the web search code more robust to this, I put an adapter in the middle that will flush line-by-line to preserve the old behavior.	2026-03-17 13:30:17 -07:00
Patrick Devine	d727aacd04	mlx: quantized embeddings, fast SwiGLU, and runtime fixes (#14884 ) Add QuantizedEmbedding and EmbeddingLayer interface so models can use quantized embedding weights and expose tied output projections. This change updates gemma3, glm4_moe_lite, llama, qwen3, and qwen3_5 to use the new interface.	2026-03-17 11:21:38 -07:00
Patrick Devine	fa69b833cd	mlx: add prequantized tensor packing + changes for qwen35 (#14878 ) This change adds a tensorImportTransform interface for model-specific tensor transformations during safetensors import. This allows importing and modifying the standard HF based weights as well as the mlx-community derived pre-quantized safetensors repos to be directly imported into `ollama create`. Right now this only works with Qwen3.5 importing which does tensor renaming, norm weight shifting (it adds +1 to each value of the norm vectors), conv1d transposition, and casts to BF16s for F32 based vectors.	2026-03-17 11:21:18 -07:00
Jesse Gross	bbbad97686	sched: Model eviction for MLX MLX runners (image generation and LLM) previously bypassed the scheduler's standard load path via a separate loadMLX method. This meant they skipped VRAM fitting checks and couldn't participate in model eviction. Now all model types flow through the same load function. Model eviction for MLX is based on weights as KV cache and compute graph are dynamic. This means that eviction does not take into account the worst case memory and models can still compete for memory but it is a significant improvement.	2026-03-16 17:40:29 -07:00
Parth Sareen	bcf6d55b54	launch: fix web search, add web fetch, and enable both for local (#14886 ) v0.18.1-rc1 v0.18.1	2026-03-16 16:26:19 -07:00
easonysliu	810d4f9c22	runner: fix swallowed error in allocModel graph reservation In allocModel(), the first call to reserveWorstCaseGraph(true) had its error silently discarded — `return nil` was used instead of `return err`. This meant that if the prompt-sized graph reservation failed (e.g. due to insufficient memory), the error was swallowed, allocModel reported success, and the model appeared to load correctly. Subsequent inference would then fail in unexpected ways because the worst-case graph was never properly reserved. Fix: return the actual error so the caller can handle the failure (retry with reduced parallelism, report OOM, etc.). Co-Authored-By: Claude (claude-opus-4-6) <noreply@anthropic.com>	2026-03-16 15:48:45 -07:00
Bruce MacDonald	856c047a6c	cmd/launch: skip --install-daemon when systemd is unavailable (#14883 ) In container environments without systemd, `openclaw onboard --install-daemon` exits non-zero because it cannot create a systemd user service. This causes `ollama launch openclaw` to abort even though the gateway can be started as a foreground child process. Only pass --install-daemon when systemd user services are reachable (Linux with /run/systemd/system present and XDG_RUNTIME_DIR set). On all other platforms the flag is still included by default. v0.18.1-rc0	2026-03-16 13:50:04 -07:00
Daniel Hiltgen	79c1e93c00	bench: improve benchmarking tool (#14240 ) New features: - Warmup phase to eliminate cold-start outliers - time-to-first-token measured in each epoch - VRAM/memory tracking to identify CPU spillover - Controlled prompt length - Defaults to 6 epochs and 200 tokens max Benchstat fixes: - ns/request instead of ns/op — non-standard unit created a separate group instead of grouping with timing metrics - Token count as the N field — benchstat interprets N as iteration count for statistical weighting, not as a token count	2026-03-15 11:47:31 -07:00
Parth Sareen	f8b657c967	cmd/launch: add guards for headless mode (#14837 )	2026-03-14 00:10:02 -07:00
Bruce MacDonald	10fefe0d57	config: use native OpenClaw Ollama onboarding (#14829 ) OpenClaw now accepts the Ollama onboarding flags directly upstream, so rely on its wizard state instead of the legacy integration onboarding flag. Update first-run setup to pass the Ollama auth and model flags during onboarding, perform a best-effort update before onboarding when needed, and drop the stale test that asserted persistence of the old onboarding flag.	2026-03-13 16:28:40 -07:00
Daniel Hiltgen	2f9a68f9e9	rocm: doc driver constraints (#14833 )	2026-03-13 15:53:35 -07:00
Bruce MacDonald	3980c0217d	server: decompress zstd request bodies in cloud passthrough middleware (#14827 ) When a zstd-compressed request (e.g. from Codex CLI) hits /v1/responses with a cloud model the request failed. Fix by decompressing zstd bodies before model extraction, so cloud models are detected and proxied directly without the writer being wrapped. v0.18.0-rc2 v0.18.0	2026-03-13 15:06:47 -07:00
Parth Sareen	870599f5da	launch: remove warning for default policy (#14830 )	2026-03-13 15:01:38 -07:00
Bruce MacDonald	abf8e8e9c8	middleware: handle non-JSON error responses gracefully (#14828 ) writeError in both OpenAI and Anthropic middleware writers would return a raw json.SyntaxError when the error payload wasn't valid JSON (e.g. "invalid character 'e' looking for beginning of value"). Fall back to using the raw bytes as the error message instead. Also use the actual HTTP status code rather than hardcoding 500, so error types map correctly	2026-03-13 14:50:49 -07:00
Shivam Tiwari	f3f31a8192	anthropic: close thinking block before tool_use when no text in between (#14825 ) Root cause: StreamConverter.Process() only incremented contentIndex when closing a thinking block if text content was present. When a model emitted thinking followed directly by a tool_use block (no text in between), thinkingDone was never set and contentIndex was not incremented, causing the tool_use content_block_start to reuse index 0. Clients expecting sequential indices would then fail to find the tool content block. Fix: In the tool call loop, close any open thinking block (thinkingStarted && !thinkingDone) and increment contentIndex before opening the tool_use block, mirroring the existing logic that closes an open text block. Fixes #14816 v0.18.0-rc1	2026-03-13 13:12:05 -07:00
Devon Rifkin	9e7ba835da	cmd: still populate `ollama ls` when using `ollama run <model:cloud>` (#14824 ) This is temporary until `api/tags` supports cloud natively v0.18.0-rc0	2026-03-13 12:24:45 -07:00
Parth Sareen	347f17b8d1	launch: add compact window for claude code (#14823 )	2026-03-13 12:09:23 -07:00
Devon Rifkin	081b9eb423	api/create: always propagate `:cloud` source for cloud models (#14822 ) Otherwise, using `/save` would try to run the local model instead	2026-03-13 11:58:00 -07:00
Parth Sareen	bb867c6fdb	launch: fix headless --yes integration flow and policy scoping (#14815 )	2026-03-13 11:45:36 -07:00
Cadu	81f4506a61	docs: document reasoning_effort support in OpenAI-compatible API (#14821 ) Add reasoning_effort and reasoning to the supported features and request fields for /v1/chat/completions. These fields control thinking on thinking-capable models but were previously undocumented. Closes #14820	2026-03-13 10:57:14 -07:00


5222 Commits