* don't require pulling stubs for cloud models
This is a first in a series of PRs that will better integrate Ollama's
cloud into the API and CLI. Previously we used to have a layer of
indirection where you'd first have to pull a "stub" model that contains
a reference to a cloud model. With this change, you don't have to pull
first, you can just use a cloud model in various routes like `/api/chat`
and `/api/show`. This change respects
<https://github.com/ollama/ollama/pull/14221>, so if cloud is disabled,
these models won't be accessible.
There's also a new, simpler pass-through proxy that doesn't convert the
requests ahead of hitting the cloud models, which they themselves
already support various formats (e.g., `v1/chat/completions` or Open
Responses, etc.). This will help prevent issues caused by double
converting (e.g., `v1/chat/completions` converted to `api/chat` on the
client, then calling cloud and converting back to a
`v1/chat/completions` response instead of the cloud model handling the
original `v1/chat/completions` request first).
There's now a notion of "source tags", which can be mixed with existing
tags. So instead of having different formats like`gpt-oss:20b-cloud` vs.
`kimi-k2.5:cloud` (`-cloud` suffix vs. `:cloud`), you can now specify
cloud by simply appending `:cloud`. This PR doesn't change model
resolution yet, but sets us up to allow for things like omitting the
non-source tag, which would make something like `ollama run
gpt-oss:cloud` work the same way that `ollama run gpt-oss` already works
today.
More detailed changes:
- Added a shared model selector parser in `types/modelselector`:
- supports `:cloud` and `:local`
- accepts source tags in any position
- supports legacy `:<tag>-cloud`
- rejects conflicting source tags
- Integrated selector handling across server inference/show routes:
- `GenerateHandler`, `ChatHandler`, `EmbedHandler`,
`EmbeddingsHandler`, `ShowHandler`
- Added explicit-cloud passthrough proxy for ollama.com:
- same-endpoint forwarding for `/api/*`, `/v1/*`, and `/v1/messages`
- normalizes `model` (and `name` for `/api/show`) before forwarding
- forwards request headers except hop-by-hop/proxy-managed headers
- uses bounded response-header timeout
- handles auth failures in a friendly way
- Preserved cloud-disable behavior (`OLLAMA_NO_CLOUD`)
- Updated create flow to support `FROM ...:cloud` model sources (though
this flow uses the legacy proxy still, supporting Modelfile overrides
is more complicated with the direct proxy approach)
- Updated CLI/TUI/config cloud detection to use shared selector logic
- Updated CLI preflight behavior so explicit cloud requests do not
auto-pull local stubs
What's next?
- Cloud discovery/listing and cache-backed `ollama ls` / `/api/tags`
- Modelfile overlay support for virtual cloud models on OpenAI/Anthropic
request families
- Recommender/default-selection behavior for ambiguous model families
- Fully remove the legacy flow
Fixes: https://github.com/ollama/ollama/issues/13801
* consolidate pull logic into confirmAndPull helper
pullIfNeeded and ShowOrPull shared identical confirm-and-pull logic.
Extract confirmAndPull to eliminate the duplication.
* skip local existence checks for cloud models
ModelExists and the TUI's modelExists both check the local model list,
which causes cloud models to appear missing. Return true early for
explicit cloud models so the TUI displays them beside the integration
name and skips re-prompting the model picker on relaunch.
* support optionally pulling stubs for newly-style names
We now normalize names like `<family>:<size>:cloud` into legacy-style
names like `<family>:<size>-cloud` for pulling and deleting (this also
supports stripping `:local`). Support for pulling cloud models is
temporary, once we integrate properly into `/api/tags` we won't need
this anymore.
* Fix server alias syncing
* Update cmd/cmd.go
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
* address comments
* improve some naming
---------
Co-authored-by: ParthSareen <parth.sareen@ollama.com>
When trying to use cloud model with OLLAMA_HOST="ollama.com" while not signed in a helpful error message is displayed when the user is not signed in telling them they must sign in to use cloud models. This should be the same experience for models which specify a remote instance.
TeaCache:
- Timestep embedding similarity caching for diffusion models
- Polynomial rescaling with configurable thresholds
- Reduces transformer forward passes by ~30-50%
FP8 quantization:
- Support for FP8 quantized models (8-bit weights with scales)
- QuantizedMatmul on Metal, Dequantize on CUDA
- Client-side quantization via ollama create --quantize fp8
Other bug fixes:
- Fix `/api/show` API for image generation models
- Server properly returns model info (architecture, parameters, quantization)
- Memory allocation optimizations
- CLI improvements for image generation
Co-authored-by: A-Akhil <akhilrahul70@gmail.com>
This PR introduces a new ollama embed command that allows users to generate embeddings directly from the command line.
Added ollama embed MODEL [TEXT...] command for generating text embeddings
Supports both direct text arguments and stdin piping for scripted workflows
Outputs embeddings as JSON arrays (one per line)
There are two bugs when using `/load <model>` for a model that doesn't exist, namely:
1. it will not restore the current model settings if the current model is a thinking model; and
2. it will crash is the current model is a non-thinking model
This bug fix saves the current runOptions and then restores them if the model load
doesn't happen. It also fixes the crash happening for non-thinking models.
* auth: fix problems with the ollama keypairs
This change adds several fixes including:
- reading in the pubkey files correctly
- fixing the push unit test to create a keypair file in a temp directory
- not return 500 errors for normal status error
With support for multimodal models becoming more varied and common it is important for clients to be able to easily see what capabilities a model has. Retuning these from the show endpoint will allow clients to easily see what a model can do.
This fixes the case where a FROM line in previous modelfile points to a
file which may/may not be present in a different ollama instance. We
shouldn't be relying on the filename though and instead just check if
the FROM line was instead a valid model name and point to that instead.
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
After a user pushes their model it is not clear what to do next. Add a link
to the output of `ollama push` that tells the user where their model can now
be found.