* WIP - MLX backend with gemma3
* MLX: add cmake and go tag build toggles
To build the new MLX backend code:
cmake --preset MLX
cmake --build --preset MLX --parallel
cmake --install build --component MLX
go build -tags mlx .
Note: the main.go entrypoint for the MLX engine will change in a follow up commit.
* add experimental image generation runtime
* add experimental image generation runtime
* MLX: wire up cuda build for linux
* MLX: get dependencies correct and dedup
This is still too large for a unified github artifact, but is now "correct" for the mlx_cuda_v13
directory.
* fix relative link bug in dedup
* Add darwin build and readme
* add go build tag for mlx dependent code and wire up build_darwin.sh
* lint cleanup
* macos: build mlx for x86
This will be CPU only.
* cuda build instructions and fix drift from mlx bump
* stale comment
* Delete agent helper doc
* Clean up readme.md
* Revise README for tokenizer clarity and details
Updated README to clarify tokenizer functionality and removed correctness section.
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
* bf16
* tests
* gpt-oss
* enable gptoss for engine
* rough estimate
* convert to mxfp4
* handle safetensors U8
* clamp glu/linear
* update tokenizer
* MXFP4 support
This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.
* Unit tests for MXFP4 support
This exercises various operations and shapes on both CPU and GPU (if detected
on the system)
* cuda graph
* unit test adjustments
* cuda: optimize memory access
Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
* mac: fix crash on old macos versions
cblas_sgemm is only supported on v13.3 and up, however bf16 is
only supported on v14+ so we were falling back to ggml-blas and
crashing on bf16 tensors. Checking for the function being null
seems to be the simplest way to condittionally avoid registering the
backend.
* server: Minimum context length for gptoss
This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.
* ggml: Multiply by numParallel for gptoss sliding window
When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.
* gpt-oss integration
includes harmony parser and thinking levels, etc.
* fix sync
* fix tests
* fix lint
---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>