`WebSearchAnthropicWriter` expects a single object per write. The new
transparent proxy will instead send it whatever bytes it sees. This
cloud-model + local-orchestration + cloud-search is a temporary code
path, so instead of making the web search code more robust to this, I
put an adapter in the middle that will flush line-by-line to preserve
the old behavior.
Add QuantizedEmbedding and EmbeddingLayer interface so models can
use quantized embedding weights and expose tied output projections.
This change updates gemma3, glm4_moe_lite, llama, qwen3, and qwen3_5
to use the new interface.
This change adds a tensorImportTransform interface for model-specific
tensor transformations during safetensors import. This allows importing
and modifying the standard HF based weights as well as the mlx-community
derived pre-quantized safetensors repos to be directly
imported into `ollama create`. Right now this only works with Qwen3.5
importing which does tensor renaming, norm weight shifting (it
adds +1 to each value of the norm vectors), conv1d transposition,
and casts to BF16s for F32 based vectors.
MLX runners (image generation and LLM) previously bypassed the
scheduler's standard load path via a separate loadMLX method. This meant
they skipped VRAM fitting checks and couldn't participate in model
eviction.
Now all model types flow through the same load function. Model eviction
for MLX is based on weights as KV cache and compute graph are dynamic.
This means that eviction does not take into account the worst case
memory and models can still compete for memory but it is a significant
improvement.
In allocModel(), the first call to reserveWorstCaseGraph(true) had its
error silently discarded — `return nil` was used instead of `return err`.
This meant that if the prompt-sized graph reservation failed (e.g. due
to insufficient memory), the error was swallowed, allocModel reported
success, and the model appeared to load correctly. Subsequent inference
would then fail in unexpected ways because the worst-case graph was
never properly reserved.
Fix: return the actual error so the caller can handle the failure
(retry with reduced parallelism, report OOM, etc.).
Co-Authored-By: Claude (claude-opus-4-6) <noreply@anthropic.com>
In container environments without systemd, `openclaw onboard
--install-daemon` exits non-zero because it cannot create a systemd
user service. This causes `ollama launch openclaw` to abort even
though the gateway can be started as a foreground child process.
Only pass --install-daemon when systemd user services are reachable
(Linux with /run/systemd/system present and XDG_RUNTIME_DIR set).
On all other platforms the flag is still included by default.
New features:
- Warmup phase to eliminate cold-start outliers
- time-to-first-token measured in each epoch
- VRAM/memory tracking to identify CPU spillover
- Controlled prompt length
- Defaults to 6 epochs and 200 tokens max
Benchstat fixes:
- ns/request instead of ns/op — non-standard unit created a separate group instead of grouping with timing metrics
- Token count as the N field — benchstat interprets N as iteration count for statistical weighting, not as a token count
OpenClaw now accepts the Ollama onboarding flags directly upstream, so rely on its wizard state instead of the legacy integration onboarding flag.
Update first-run setup to pass the Ollama auth and model flags during onboarding, perform a best-effort update before onboarding when needed, and drop the stale test that asserted persistence of the old onboarding flag.
When a zstd-compressed request (e.g. from Codex CLI) hits /v1/responses
with a cloud model the request failed.
Fix by decompressing zstd bodies before
model extraction, so cloud models are detected and proxied directly
without the writer being wrapped.
writeError in both OpenAI and Anthropic middleware writers would return
a raw json.SyntaxError when the error payload wasn't valid JSON (e.g.
"invalid character 'e' looking for beginning of value"). Fall back to
using the raw bytes as the error message instead.
Also use the actual HTTP status code rather than hardcoding 500, so
error types map correctly
Root cause: StreamConverter.Process() only incremented contentIndex when
closing a thinking block if text content was present. When a model emitted
thinking followed directly by a tool_use block (no text in between),
thinkingDone was never set and contentIndex was not incremented, causing the
tool_use content_block_start to reuse index 0. Clients expecting sequential
indices would then fail to find the tool content block.
Fix: In the tool call loop, close any open thinking block (thinkingStarted &&
!thinkingDone) and increment contentIndex before opening the tool_use block,
mirroring the existing logic that closes an open text block.
Fixes#14816
Add reasoning_effort and reasoning to the supported features and
request fields for /v1/chat/completions. These fields control
thinking on thinking-capable models but were previously undocumented.
Closes#14820
Use 7z compression (better compression rate) if found in path. That
alone isn't sufficient to get us under 2G, so MLX is now split out as a
discrete download. Fix CI so it will fail if artifacts fail to upload.
The CLI now links to the lazy-load MLX code, but that still happens in
init functions. On internal MLX errors, the CLI exits before it has a
chance to start. This change re-wires the MLX error handling so it
doesn't exit by default. The MLX based runners currently expect exits
on failure, so they re-initialize the default error handling. We can
refine error handling for better go stack traces in the future.
* prefer rocm v6 on windows
Avoid building with v7 - more changes are needed
* MLX: add header vendoring and remove go build tag
This switches to using a vendoring approach for the mlx-c headers so that Go
can build without requiring a cmake first. This enables building the new MLX
based code by default. Every time cmake runs, the headers are refreshed, so we
can easily keep them in sync when we bump mlx versions. Basic Windows
and Linux support are verified.
* ci: harden for flaky choco repo servers
CI sometimes fails due to choco not actually installing cache. Since it just speeds up the build, we can proceed without.
* review comments
- Collapse MLX sampling state into a single sample.Sampler struct (options + history).
- Replace interface-based sampler chain (TopP, TopK, penalty, etc.) with function-based transforms.
- Update request/pipeline wiring to use *sample.Sampler, seed history from prompt tokens, and append generated tokens each step.
- Implement top_p, min_p, repeat_penalty, and frequency_penalty
Previously we were printing out bad errors for expected cases like
clients disconnecting. Now we only debug log when that happens (which
still might help in cases where we're figuring out why an integration
isn't working). For other errors, we print out a proper warning now
Our Dockerfile leverages parallel stages for more efficient builds. However,
our old parallel settings were naive and lead to under/over utilization
depending on the capabilities of your build system.
This change switches to using Ninja for all our docker cmake builds to leverage
its smarter parallel logic. We tell Ninja to target a load of nproc so each of
the build stages will share the load on the system aiming for full CPU use
without oversaturation.
The GPU parallelism settings are also adjusted to 4 to avoid a long-tail for
the last few GPU targets as they work through the long list of GPU
architectures.
This also fixes the Dockerfile to move Vulkan install to just the stage that
needs it instead of blocking most other GPU installs. This should speed up CI
which always has a clean build cache.
GLM models sometimes omits </arg_value> closing tags in tool call XML, causing xml.Unmarshal to fail with "element <arg_value> closed by </tool_call>".
This is a known issue across the GLM family.
Sanitize the input to fix closing arg_key values so encoding/xml can handle it.