Root cause: StreamConverter.Process() only closed a thinking block (and
incremented contentIndex) when text content was present. When a model emitted
thinking followed directly by a tool_use block (no text in between),
thinkingDone was never set and contentIndex was not incremented, causing the
tool_use content_block_start to reuse index 0. Clients expecting sequential
indices would then fail to find the tool content block.
Fix: In the tool call loop, close any open thinking block (thinkingStarted &&
!thinkingDone) and increment contentIndex before opening the tool_use block,
mirroring the existing logic that closes an open text block.
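A minimal, runnable sketch of the fix under hypothetical names (the real
StreamConverter emits stream events rather than printing; field names are
modeled on the description above):

```go
package main

import "fmt"

// converter mirrors the relevant StreamConverter state (names assumed).
type converter struct {
	thinkingStarted bool
	thinkingDone    bool
	contentIndex    int
}

// startToolUse shows the fix: close any still-open thinking block and advance
// contentIndex before opening the tool_use block, mirroring the text-block path.
func (c *converter) startToolUse() {
	if c.thinkingStarted && !c.thinkingDone {
		fmt.Printf("content_block_stop index=%d (thinking)\n", c.contentIndex)
		c.thinkingDone = true
		c.contentIndex++
	}
	fmt.Printf("content_block_start index=%d type=tool_use\n", c.contentIndex)
}

func main() {
	c := &converter{thinkingStarted: true} // thinking emitted, no text after it
	c.startToolUse()
	// content_block_stop index=0 (thinking)
	// content_block_start index=1 type=tool_use
}
```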
Fixes #14816
Add reasoning_effort and reasoning to the supported features and
request fields for /v1/chat/completions. These fields control
thinking on thinking-capable models but were previously undocumented.
Closes #14820
Use 7z compression (better compression ratio) if it's found on the PATH. That
alone isn't sufficient to get us under 2G, so MLX is now split out as a
discrete download. Fix CI so it will fail if artifacts fail to upload.
The CLI now links in the lazy-load MLX code, but that initialization still
happens in init functions, so on internal MLX errors the CLI exits before it
has a chance to start. This change rewires the MLX error handling so it
doesn't exit by default. The MLX-based runners currently expect exits on
failure, so they re-initialize the default error handling. We can refine
error handling for better Go stack traces in the future.
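A sketch of the rewiring idea, with assumed names (the actual hook into the
MLX C layer is more involved):

```go
package mlx

import (
	"log/slog"
	"os"
)

// onError is the default handler: report the error, don't exit, so init-time
// MLX failures no longer kill the CLI before it starts.
var onError = func(msg string) {
	slog.Error("mlx internal error", "error", msg)
}

// SetErrorHandler lets callers override the default behavior.
func SetErrorHandler(h func(string)) { onError = h }

// ExitOnError restores the old exit-on-failure behavior; the MLX-based
// runners install this until their error handling is refined.
func ExitOnError(msg string) {
	slog.Error("mlx internal error", "error", msg)
	os.Exit(1)
}

// reportError is where internal MLX errors would be funneled.
func reportError(msg string) { onError(msg) }
```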
* prefer rocm v6 on windows
Avoid building with v7 - more changes are needed
* MLX: add header vendoring and remove go build tag
This switches to a vendoring approach for the mlx-c headers so that Go can
build without requiring a CMake run first. This enables building the new MLX
based code by default. Every time CMake runs, the headers are refreshed, so we
can easily keep them in sync when we bump MLX versions. Basic Windows and
Linux support is verified.
* ci: harden for flaky choco repo servers
CI sometimes fails because choco doesn't actually install ccache. Since it only speeds up the build, we can proceed without it.
* review comments
- Collapse MLX sampling state into a single sample.Sampler struct (options + history).
- Replace interface-based sampler chain (TopP, TopK, penalty, etc.) with function-based transforms (see the sketch after this list).
- Update request/pipeline wiring to use *sample.Sampler, seed history from prompt tokens, and append generated tokens each step.
- Implement top_p, min_p, repeat_penalty, and frequency_penalty
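A compile-ready sketch of the shape of this refactor, with hypothetical
signatures (the real sample package differs in detail):

```go
package sample

// Transform rewrites logits in place and returns them; transforms compose by
// plain function chaining instead of an interface hierarchy.
type Transform func(logits []float32) []float32

// Sampler collapses options and sampling history into a single struct.
type Sampler struct {
	History    []int32 // seeded from prompt tokens; generated tokens appended each step
	Transforms []Transform
}

// Temperature is a plain function transform instead of an interface impl.
func Temperature(t float32) Transform {
	return func(logits []float32) []float32 {
		for i := range logits {
			logits[i] /= t
		}
		return logits
	}
}

// RepeatPenalty dampens logits of tokens already seen in the history
// (applied per occurrence in this sketch, for brevity).
func (s *Sampler) RepeatPenalty(p float32) Transform {
	return func(logits []float32) []float32 {
		for _, tok := range s.History {
			if l := logits[tok]; l > 0 {
				logits[tok] = l / p
			} else {
				logits[tok] = l * p
			}
		}
		return logits
	}
}

// Apply runs the transform chain over one decode step's logits.
func (s *Sampler) Apply(logits []float32) []float32 {
	for _, t := range s.Transforms {
		logits = t(logits)
	}
	return logits
}
```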
Previously we were printing misleading errors for expected cases like
clients disconnecting. Now we only debug-log when that happens (which can
still help when we're figuring out why an integration isn't working). Other
errors now get a proper warning.
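A minimal sketch of the split, with a hypothetical helper name:

```go
package server

import (
	"context"
	"errors"
	"log/slog"
)

// logStreamErr demotes expected client disconnects to debug while keeping a
// proper warning for everything else.
func logStreamErr(err error) {
	if errors.Is(err, context.Canceled) {
		// the client went away; only interesting when debugging an integration
		slog.Debug("client disconnected", "error", err)
		return
	}
	slog.Warn("failed to stream response", "error", err)
}
```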
Our Dockerfile leverages parallel stages for more efficient builds. However,
our old parallel settings were naive and led to under- or over-utilization
depending on the capabilities of your build system.
This change switches to using Ninja for all our Docker CMake builds to
leverage its smarter parallel logic. We tell Ninja to target a load of nproc
so the build stages share the load on the system, aiming for full CPU use
without oversaturation.
The GPU parallelism settings are also adjusted to 4 to avoid a long tail for
the last few GPU targets as they work through the long list of GPU
architectures.
This also fixes the Dockerfile to move the Vulkan install to just the stage
that needs it instead of blocking most other GPU installs. This should speed
up CI, which always has a clean build cache.
GLM models sometimes omit </arg_value> closing tags in tool call XML, causing
xml.Unmarshal to fail with "element <arg_value> closed by </tool_call>". This
is a known issue across the GLM family.
Sanitize the input to restore the missing closing tags so encoding/xml can
handle it.
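A minimal sketch of the sanitization idea (regex and names are assumptions,
not the shipped implementation):

```go
package main

import (
	"fmt"
	"regexp"
)

// missingClose finds an <arg_value> that runs straight into </tool_call>
// without its own closing tag; properly closed values never match because
// the lazy body can't cross a "</" sequence.
var missingClose = regexp.MustCompile(`(<arg_value>(?:[^<]|<[^/])*?)(</tool_call>)`)

func sanitizeToolCallXML(s string) string {
	return missingClose.ReplaceAllString(s, "$1</arg_value>$2")
}

func main() {
	in := "<tool_call><arg_key>city</arg_key><arg_value>Paris</tool_call>"
	fmt.Println(sanitizeToolCallXML(in))
	// <tool_call><arg_key>city</arg_key><arg_value>Paris</arg_value></tool_call>
}
```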
Remove the /v1 suffix from the OpenClaw provider baseUrl so it uses
the native Ollama API instead of the OpenAI-compatible endpoint. The
/v1 endpoint may break tool calling in OpenClaw.
This change adds support for qwen3.5-next-moe models (qwen3-next/qwen3.5-next/qwen3-coder) to the MLX runner. It also:
* introduces recurrent cache support and related MLX ops
* updates pipeline/runner integration and adds tests
* properly quantizes stacked expert tensors
* adds a Gated Delta Metal kernel for fast SSM inference
* adds new MLX calls for Conv1d, DepthwiseConv1d, Contiguous, Exp, Log, SoftmaxAxis
* don't require pulling stubs for cloud models
This is the first in a series of PRs that will better integrate Ollama's
cloud into the API and CLI. Previously there was a layer of indirection
where you'd first have to pull a "stub" model that contains a reference to
a cloud model. With this change, you don't have to pull
first, you can just use a cloud model in various routes like `/api/chat`
and `/api/show`. This change respects
<https://github.com/ollama/ollama/pull/14221>, so if cloud is disabled,
these models won't be accessible.
There's also a new, simpler pass-through proxy that doesn't convert requests
before they hit the cloud models, since the cloud models themselves already
support various formats (e.g., `v1/chat/completions` or Open Responses).
This helps prevent issues caused by double conversion (e.g.,
`v1/chat/completions` converted to `api/chat` on the client, then sent to
the cloud and converted back to a `v1/chat/completions` response, instead of
the cloud model handling the original `v1/chat/completions` request
directly).
There's now a notion of "source tags", which can be mixed with existing
tags. So instead of having different formats like `gpt-oss:20b-cloud` vs.
`kimi-k2.5:cloud` (`-cloud` suffix vs. `:cloud`), you can now specify
cloud by simply appending `:cloud`. This PR doesn't change model
resolution yet, but sets us up to allow for things like omitting the
non-source tag, which would make something like `ollama run
gpt-oss:cloud` work the same way that `ollama run gpt-oss` already works
today.
More detailed changes:
- Added a shared model selector parser in `types/modelselector` (see the sketch after this list):
- supports `:cloud` and `:local`
- accepts source tags in any position
- supports legacy `:<tag>-cloud`
- rejects conflicting source tags
- Integrated selector handling across server inference/show routes:
- `GenerateHandler`, `ChatHandler`, `EmbedHandler`,
`EmbeddingsHandler`, `ShowHandler`
- Added explicit-cloud passthrough proxy for ollama.com:
- same-endpoint forwarding for `/api/*`, `/v1/*`, and `/v1/messages`
- normalizes `model` (and `name` for `/api/show`) before forwarding
- forwards request headers except hop-by-hop/proxy-managed headers
- uses bounded response-header timeout
- handles auth failures in a friendly way
- Preserved cloud-disable behavior (`OLLAMA_NO_CLOUD`)
- Updated create flow to support `FROM ...:cloud` model sources (this flow
  still uses the legacy proxy, since supporting Modelfile overrides is more
  complicated with the direct proxy approach)
- Updated CLI/TUI/config cloud detection to use shared selector logic
- Updated CLI preflight behavior so explicit cloud requests do not
auto-pull local stubs
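A compile-ready sketch of the selector parsing rules, with hypothetical
exported names (`Parse`, `Source`) standing in for whatever
`types/modelselector` actually exports:

```go
package modelselector

import (
	"errors"
	"strings"
)

// Source is the backend a model name explicitly requests.
type Source int

const (
	SourceUnspecified Source = iota
	SourceLocal
	SourceCloud
)

var ErrConflictingSources = errors.New("conflicting source tags")

// Parse splits a name like "gpt-oss:20b:cloud" into the bare model name and
// the requested source: ":cloud"/":local" are accepted in any tag position,
// the legacy ":<tag>-cloud" form is understood, and conflicts are rejected.
func Parse(name string) (model string, src Source, err error) {
	kept := make([]string, 0, 4)
	for _, p := range strings.Split(name, ":") {
		s := SourceUnspecified
		switch {
		case p == "cloud":
			s = SourceCloud
		case p == "local":
			s = SourceLocal
		case strings.HasSuffix(p, "-cloud"): // legacy, e.g. "20b-cloud"
			s = SourceCloud
			if t := strings.TrimSuffix(p, "-cloud"); t != "" {
				kept = append(kept, t)
			}
		default:
			kept = append(kept, p)
		}
		if s == SourceUnspecified {
			continue
		}
		if src != SourceUnspecified && src != s {
			return "", SourceUnspecified, ErrConflictingSources
		}
		src = s
	}
	return strings.Join(kept, ":"), src, nil
}
```

With this, `gpt-oss:20b-cloud`, `gpt-oss:20b:cloud`, and `gpt-oss:cloud:20b`
all resolve to the cloud source, while `gpt-oss:cloud:local` errors out.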
What's next?
- Cloud discovery/listing and cache-backed `ollama ls` / `/api/tags`
- Modelfile overlay support for virtual cloud models on OpenAI/Anthropic
request families
- Recommender/default-selection behavior for ambiguous model families
- Fully remove the legacy flow
Fixes: https://github.com/ollama/ollama/issues/13801
* consolidate pull logic into confirmAndPull helper
pullIfNeeded and ShowOrPull shared identical confirm-and-pull logic.
Extract confirmAndPull to eliminate the duplication.
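A sketch of the extracted helper, with assumed callbacks so it stays
self-contained:

```go
package cmd

import (
	"context"
	"fmt"
)

// confirmAndPull is the single confirm-then-pull path that pullIfNeeded and
// ShowOrPull previously duplicated (signature assumed for illustration).
func confirmAndPull(ctx context.Context, name string,
	confirm func(prompt string) bool,
	pull func(ctx context.Context, name string) error) error {
	if !confirm(fmt.Sprintf("pull model %q?", name)) {
		return fmt.Errorf("model %q not pulled", name)
	}
	return pull(ctx, name)
}
```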
* skip local existence checks for cloud models
ModelExists and the TUI's modelExists both check the local model list,
which causes cloud models to appear missing. Return true early for
explicit cloud models so the TUI displays them beside the integration
name and skips re-prompting the model picker on relaunch.
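A sketch of the early return, reusing the hypothetical `Parse` from the
selector sketch above:

```go
package tui

import "github.com/ollama/ollama/types/modelselector"

// modelExists treats explicit cloud models as present: they never appear in
// the local model list, so checking it would wrongly report them missing.
func modelExists(name string, local []string) bool {
	if _, src, err := modelselector.Parse(name); err == nil && src == modelselector.SourceCloud {
		return true
	}
	for _, m := range local {
		if m == name {
			return true
		}
	}
	return false
}
```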
* support optionally pulling stubs for new-style names
We now normalize names like `<family>:<size>:cloud` into legacy-style
names like `<family>:<size>-cloud` for pulling and deleting (this also
supports stripping `:local`). Support for pulling cloud models is
temporary; once we integrate properly into `/api/tags` we won't need
this anymore.
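A sketch of that normalization (helper name assumed):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeForPull rewrites new-style "<family>:<size>:cloud" names to the
// legacy "<family>:<size>-cloud" form for pull/delete, and strips ":local".
func normalizeForPull(name string) string {
	if s := strings.TrimSuffix(name, ":cloud"); s != name {
		return s + "-cloud"
	}
	return strings.TrimSuffix(name, ":local")
}

func main() {
	fmt.Println(normalizeForPull("gpt-oss:20b:cloud")) // gpt-oss:20b-cloud
	fmt.Println(normalizeForPull("llama3.2:3b:local")) // llama3.2:3b
}
```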
* Fix server alias syncing
* Update cmd/cmd.go
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
* address comments
* improve some naming
---------
Co-authored-by: ParthSareen <parth.sareen@ollama.com>
The "(local)" qualifier is unnecessary since there's only one Ollama
provider. Existing configs with the old name are migrated automatically;
custom names are left unchanged.