Mirror of https://github.com/ollama/ollama.git, synced 2026-03-28 03:08:44 +07:00.
TeaCache:

- Timestep embedding similarity caching for diffusion models
- Polynomial rescaling with configurable thresholds
- Reduces transformer forward passes by ~30-50%

FP8 quantization:

- Support for FP8 quantized models (8-bit weights with scales)
- `QuantizedMatmul` on Metal, `Dequantize` on CUDA
- Client-side quantization via `ollama create --quantize fp8`

Other bug fixes:

- Fix `/api/show` API for image generation models
- Server properly returns model info (architecture, parameters, quantization)
- Memory allocation optimizations
- CLI improvements for image generation
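The TeaCache idea above can be sketched in a few lines: compare the current timestep embedding to the previous one, pass the distance through a rescaling polynomial, accumulate it, and reuse the cached transformer output while the accumulated value stays under a threshold. This is a minimal illustration under assumed names; `relL1`, `rescale`, the coefficients, and the threshold are hypothetical, not the engine's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// relL1 computes the mean relative L1 distance between two embedding vectors.
func relL1(a, b []float64) float64 {
	var num, den float64
	for i := range a {
		num += math.Abs(a[i] - b[i])
		den += math.Abs(b[i])
	}
	if den == 0 {
		return 0
	}
	return num / den
}

// rescale evaluates a polynomial (coefficients highest degree first) on the
// raw distance via Horner's method. The coefficients are illustrative; the
// real ones are fitted per model.
func rescale(x float64, coeffs []float64) float64 {
	y := 0.0
	for _, c := range coeffs {
		y = y*x + c
	}
	return y
}

// shouldSkip accumulates rescaled distances and reports whether the cached
// transformer output can be reused for this denoising step. The accumulator
// resets whenever a full forward pass is required.
func shouldSkip(acc *float64, prev, cur []float64, threshold float64, coeffs []float64) bool {
	*acc += rescale(relL1(cur, prev), coeffs)
	if *acc < threshold {
		return true // embeddings barely moved: reuse cached output
	}
	*acc = 0 // run the full forward pass and reset
	return false
}

func main() {
	prev := []float64{1, 2, 3}
	cur := []float64{1.01, 2.02, 3.03} // small drift between steps
	acc := 0.0
	// {1, 0} is the identity polynomial, so the raw distance is used directly.
	fmt.Println(shouldSkip(&acc, prev, cur, 0.1, []float64{1, 0}))
}
```

Because consecutive timestep embeddings change slowly through most of the schedule, many steps fall under the threshold, which is where the ~30-50% reduction in forward passes comes from.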
# MLX Engine
Experimental MLX backend for running models on Apple Silicon and CUDA.
## Build

```shell
go build -tags mlx -o engine ./x/imagegen/cmd/engine
```
## Text Generation

```shell
./engine -model /path/to/model -prompt "Hello" -max-tokens 100
```

Options:

- `-temperature`: sampling temperature (default 0.7)
- `-top-p`: nucleus sampling (default 0.9)
- `-top-k`: top-k sampling (default 40)

Supported architectures: Llama, Gemma3, GPT-OSS
## Image Generation

```shell
./engine -zimage -model /path/to/z-image -prompt "a cat" -output cat.png
```

Options:

- `-width`, `-height`: image dimensions (default 1024x1024)
- `-steps`: denoising steps (default 9)
- `-seed`: random seed (default 42)
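The `-seed` flag works the way diffusion samplers generally do: the seed initializes the RNG that draws the initial latent noise, so the same seed, prompt, and step count reproduce the same image. A minimal sketch under an assumed helper name (`initialLatents` is illustrative, not the engine's API):

```go
package main

import (
	"fmt"
	"math/rand"
)

// initialLatents draws the starting Gaussian noise for a diffusion sampler
// from a seeded RNG, making generation deterministic for a given seed.
func initialLatents(seed int64, n int) []float64 {
	rng := rand.New(rand.NewSource(seed))
	latents := make([]float64, n)
	for i := range latents {
		latents[i] = rng.NormFloat64() // standard normal noise
	}
	return latents
}

func main() {
	a := initialLatents(42, 4)
	b := initialLatents(42, 4)
	fmt.Println(a[0] == b[0] && a[3] == b[3]) // same seed, same noise
}
```

Varying only `-seed` therefore changes the starting noise (and thus the image) while every other setting stays fixed.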