# Image Generation in Ollama (Experimental)

Generate images from text prompts using local AI models.

## Quick Start

```bash
# Run with a prompt
ollama run z-image "a sunset over mountains"
Generating: step 30/30
Image saved to: /tmp/ollama-image-1704067200.png
```

On macOS, the generated image will automatically open in Preview.

## Supported Models

| Model | VRAM Required | Notes |
|-------|---------------|-------|
| z-image | ~12GB | Based on Flux architecture |

## CLI Usage

```bash
# Generate an image
ollama run z-image "a cat playing piano"

# Check if model is running
ollama ps

# Stop the model
ollama stop z-image
```

## API

### OpenAI-Compatible Endpoint

```bash
POST /v1/images/generations
```

**Request:**
```json
{
  "model": "z-image",
  "prompt": "a sunset over mountains",
  "size": "1024x1024",
  "response_format": "b64_json"
}
```

**Response:**
```json
{
  "created": 1704067200,
  "data": [
    {
      "b64_json": "iVBORw0KGgo..."
    }
  ]
}
```

### Example: cURL

```bash
curl http://localhost:11434/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "z-image",
    "prompt": "a white cat",
    "size": "1024x1024"
  }'
```

### Example: Save to File

```bash
curl -s http://localhost:11434/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "z-image",
    "prompt": "a white cat",
    "size": "1024x1024"
  }' | jq -r '.data[0].b64_json' | base64 -d > image.png
```
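
The same decoding step can be done without `jq` and `base64`. The following is a minimal Python sketch that assumes the response body has the shape shown under **Response** above and that the payload is a PNG (the CLI output and the `iVBORw0KGgo...` prefix suggest this, but it is not stated explicitly). The helper names `decode_first_image` and `looks_like_png` are illustrative, not part of Ollama.

```python
import base64
import json

# Every PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_first_image(response_body: str) -> bytes:
    """Return the raw bytes of the first image in a b64_json response body."""
    payload = json.loads(response_body)
    encoded = payload["data"][0]["b64_json"]
    return base64.b64decode(encoded)


def looks_like_png(image_bytes: bytes) -> bool:
    """Cheap sanity check before writing the bytes to disk (assumes PNG output)."""
    return image_bytes.startswith(PNG_SIGNATURE)
```

Writing the result of `decode_first_image` to a file opened in binary mode gives the same `image.png` as the shell pipeline above.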

### Streaming Progress

Enable streaming to receive progress updates via SSE:

```bash
curl http://localhost:11434/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "z-image", "prompt": "a sunset", "stream": true}'
```

Events:
```
event: progress
data: {"step": 1, "total": 30}

event: progress
data: {"step": 2, "total": 30}
...

event: done
data: {"created": 1704067200, "data": [{"b64_json": "..."}]}
```
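
A client has to split this stream into events before it can act on them. The sketch below parses `event:`/`data:` lines into `(event, payload)` pairs. It assumes each event ends with a blank line, as in the sample above, and that every `data:` field holds JSON; `parse_sse` is a hypothetical helper name, and the function handles only the fields shown here rather than the full SSE format.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, decoded_json) for each event in an SSE line stream."""
    event, data_parts = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield event, json.loads("\n".join(data_parts))
            event, data_parts = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
    # Flush a final event that was not followed by a blank line.
    if data_parts:
        yield event, json.loads("\n".join(data_parts))
```

A caller would typically print `step`/`total` for `progress` events and decode `data[0].b64_json` when the `done` event arrives.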

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | string | required | Model name |
| prompt | string | required | Text description of image |
| size | string | "1024x1024" | Image dimensions (WxH) |
| n | int | 1 | Number of images (currently only 1 supported) |
| response_format | string | "b64_json" | "b64_json" or "url" |
| stream | bool | false | Enable progress streaming |
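
The table can be mirrored in a small client-side helper that fills in the defaults and rejects values the table rules out. This is a sketch only: `build_request` and `parse_size` are hypothetical names, and how the server itself responds to invalid values is not covered here.

```python
from typing import Tuple


def parse_size(size: str) -> Tuple[int, int]:
    """Split a "WxH" string such as "1024x1024" into (width, height)."""
    width, sep, height = size.lower().partition("x")
    if not sep or not width.isdigit() or not height.isdigit():
        raise ValueError(f"size must look like WxH, got {size!r}")
    w, h = int(width), int(height)
    if w == 0 or h == 0:
        raise ValueError("width and height must be positive")
    return w, h


def build_request(model: str, prompt: str, size: str = "1024x1024", n: int = 1,
                  response_format: str = "b64_json", stream: bool = False) -> dict:
    """Build a request body using the defaults from the parameter table."""
    if not model or not prompt:
        raise ValueError("model and prompt are required")
    parse_size(size)  # validates the format; raises ValueError on bad input
    if n != 1:
        raise ValueError("only one image per request is currently supported")
    if response_format not in ("b64_json", "url"):
        raise ValueError('response_format must be "b64_json" or "url"')
    return {"model": model, "prompt": prompt, "size": size, "n": n,
            "response_format": response_format, "stream": stream}
```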

## Requirements

- macOS with Apple Silicon (M1/M2/M3/M4)
- CUDA: tested on CUDA 12 Blackwell, more testing coming soon
- Sufficient VRAM (see model table above)
- Ollama built with MLX support

## Limitations

- macOS only (uses MLX backend)
- Single image per request
- Fixed step count (30 steps)
- Modelfiles not yet supported (use `ollama create` from model directory)

---

# Tensor Model Storage Format

Tensor models store each tensor as a separate blob with metadata in the manifest. This enables faster downloads (parallel fetching) and deduplication (shared tensors are stored once).

## Manifest Structure

The manifest follows the standard Ollama format with tensor-specific layer metadata:

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": { "digest": "sha256:...", "size": 1234 },
  "layers": [
    {
      "mediaType": "application/vnd.ollama.image.tensor",
      "digest": "sha256:25b36eed...",
      "size": 49807448,
      "name": "text_encoder/model.layers.0.mlp.down_proj.weight",
      "dtype": "BF16",
      "shape": [2560, 9728]
    },
    {
      "mediaType": "application/vnd.ollama.image.json",
      "digest": "sha256:abc123...",
      "size": 512,
      "name": "text_encoder/config.json"
    }
  ]
}
```

Each tensor layer includes:
- `name`: Path-style tensor name (e.g., `text_encoder/model.layers.0.mlp.down_proj.weight`)
- `dtype`: Data type (BF16, F32, etc.)
- `shape`: Tensor dimensions

Config layers use the same path-style naming (e.g., `tokenizer/tokenizer.json`).
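
To make the layer metadata concrete, here is a short sketch that picks the tensor layers out of a parsed manifest and groups them by component, using the first path segment of `name`. It relies only on the fields shown in the example manifest; the function names are illustrative and are not Ollama APIs.

```python
from typing import Dict, List

TENSOR_MEDIA_TYPE = "application/vnd.ollama.image.tensor"


def tensor_layers(manifest: dict) -> List[dict]:
    """Return only the layers that describe tensors (config layers are skipped)."""
    return [layer for layer in manifest.get("layers", [])
            if layer.get("mediaType") == TENSOR_MEDIA_TYPE]


def group_by_component(manifest: dict) -> Dict[str, List[str]]:
    """Map a component such as "text_encoder" to the tensor names under it."""
    groups: Dict[str, List[str]] = {}
    for layer in tensor_layers(manifest):
        component, _, tensor_name = layer["name"].partition("/")
        groups.setdefault(component, []).append(tensor_name)
    return groups


def total_tensor_bytes(manifest: dict) -> int:
    """Sum the blob sizes of all tensor layers."""
    return sum(layer["size"] for layer in tensor_layers(manifest))
```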

## Blob Format

Each tensor blob is a minimal safetensors file:

```
[8 bytes: header size (uint64 LE)]
[~80 bytes: JSON header, padded to 8-byte alignment]
[N bytes: raw tensor data]
```

The header contains a single tensor named `"data"`:

```json
{"data":{"dtype":"BF16","shape":[2560,9728],"data_offsets":[0,49807360]}}
```
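
The layout above can be reproduced in a few lines. This sketch builds and reads back such a blob. It assumes the JSON header is written without whitespace and padded with trailing spaces, which the safetensors format permits; whether Ollama pads in exactly this way is an assumption. `build_header`, `wrap_tensor`, and `unwrap_tensor` are hypothetical names.

```python
import json
import struct
from typing import List, Tuple


def build_header(dtype: str, shape: List[int], nbytes: int) -> bytes:
    """Compact JSON header for one tensor named "data", padded to 8 bytes."""
    header = json.dumps(
        {"data": {"dtype": dtype, "shape": shape, "data_offsets": [0, nbytes]}},
        separators=(",", ":"),
    ).encode("utf-8")
    # Assumption: pad with spaces up to the next multiple of 8.
    return header + b" " * (-len(header) % 8)


def wrap_tensor(dtype: str, shape: List[int], raw: bytes) -> bytes:
    """Return [8-byte LE header size][padded JSON header][raw tensor data]."""
    header = build_header(dtype, shape, len(raw))
    return struct.pack("<Q", len(header)) + header + raw


def unwrap_tensor(blob: bytes) -> Tuple[str, List[int], bytes]:
    """Inverse of wrap_tensor: return (dtype, shape, raw data)."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    entry = json.loads(blob[8:8 + header_len])["data"]
    start, end = entry["data_offsets"]
    base = 8 + header_len
    return entry["dtype"], entry["shape"], blob[base + start:base + end]
```

For the example tensor, the compact JSON is 73 bytes and pads to 80, so the framing is 8 + 80 = 88 bytes. Added to the 49,807,360 bytes of BF16 data (2560 × 9728 × 2), this gives 49,807,448 bytes, matching the `size` in the manifest example.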

## Why Include the Header?

The ~88-byte safetensors header (an 8-byte length prefix plus a ~80-byte JSON header) enables MLX's native `mlx_load_safetensors` function, which:

1. **Uses mmap** - Maps file directly into memory, no copies
2. **Zero-copy to GPU** - MLX reads directly from mapped pages
3. **No custom code** - Standard MLX API, battle-tested

Without the header, we'd need custom C++ code to create MLX arrays from raw mmap'd data. MLX's public API doesn't expose this: it always copies when creating arrays from external pointers.

The overhead is negligible: 88 bytes per tensor = ~100KB total for a 13GB model (0.0007%).

## Why Per-Tensor Blobs?

**Deduplication**: Blobs are content-addressed by SHA256. If two models share identical tensors (same weights, dtype, shape), they share the same blob file.

Example: Model A and Model B both use the same text encoder. The text encoder's 400 tensors are stored once, referenced by both manifests.

```
~/.ollama/models/
  blobs/
    sha256-25b36eed...  <- shared by both models
    sha256-abc123...
  manifests/
    library/model-a/latest  <- references sha256-25b36eed
    library/model-b/latest  <- references sha256-25b36eed
```

## Import Flow

```
cd ./weights/Z-Image-Turbo
ollama create z-image

1. Scan component directories (text_encoder/, transformer/, vae/)
2. For each .safetensors file:
   - Extract individual tensors
   - Wrap each in minimal safetensors format (88B header + data)
   - Write to blob store (SHA256 content-addressed)
   - Add layer entry to manifest with path-style name
3. Copy config files (*.json) as config layers
4. Write manifest
```
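
Step 2 of the flow can be sketched as follows: read a multi-tensor safetensors buffer, then re-wrap each tensor as its own single-tensor blob keyed by a path-style name. This is an illustrative approximation that works on in-memory bytes, not the actual `ollama create` code. The function names and the space-padding choice are assumptions.

```python
import json
import struct
from typing import Dict


def read_safetensors(buf: bytes) -> Dict[str, dict]:
    """Parse a safetensors buffer into {tensor_name: {dtype, shape, raw}}."""
    (header_len,) = struct.unpack_from("<Q", buf, 0)
    header = json.loads(buf[8:8 + header_len])
    base = 8 + header_len  # data_offsets are relative to the end of the header
    tensors = {}
    for name, entry in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        start, end = entry["data_offsets"]
        tensors[name] = {"dtype": entry["dtype"], "shape": entry["shape"],
                         "raw": buf[base + start:base + end]}
    return tensors


def wrap_single(dtype: str, shape: list, raw: bytes) -> bytes:
    """Wrap one tensor as a minimal safetensors blob with a tensor named "data"."""
    header = json.dumps(
        {"data": {"dtype": dtype, "shape": shape, "data_offsets": [0, len(raw)]}},
        separators=(",", ":"),
    ).encode("utf-8")
    header += b" " * (-len(header) % 8)  # assumption: space-pad to 8-byte alignment
    return struct.pack("<Q", len(header)) + header + raw


def split_into_blobs(component: str, buf: bytes) -> Dict[str, bytes]:
    """Return {"component/tensor_name": blob_bytes} for every tensor in the file."""
    return {f"{component}/{name}": wrap_single(t["dtype"], t["shape"], t["raw"])
            for name, t in read_safetensors(buf).items()}
```

Each returned blob would then be written to the content-addressed store and recorded as a manifest layer under its path-style name.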

## FP8 Quantization

Z-Image supports FP8 quantization to reduce memory usage by ~50% while maintaining image quality.

### Usage

```bash
cd ./weights/Z-Image-Turbo
ollama create z-image-fp8 --quantize fp8
```

This quantizes weights during import. The resulting model will be ~15GB instead of ~31GB.
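
The roughly 50% figure follows from the element width: BF16 stores 2 bytes per weight and FP8 stores 1, plus a small amount for scales. The estimate below is a back-of-the-envelope sketch. The per-group scale layout (`group_size=32`, 2-byte scales) is a purely hypothetical assumption chosen for illustration; the actual layout used by Ollama is not described here.

```python
import math
from typing import Sequence


def bf16_bytes(shape: Sequence[int]) -> int:
    """Size of a BF16 tensor: 2 bytes per element."""
    return math.prod(shape) * 2


def fp8_bytes(shape: Sequence[int], group_size: int = 32, scale_bytes: int = 2) -> int:
    """Estimated size of an FP8 tensor: 1 byte per element plus per-group scales.

    group_size and scale_bytes are hypothetical values, not Ollama's settings.
    """
    elements = math.prod(shape)
    groups = math.ceil(elements / group_size)
    return elements + groups * scale_bytes
```

Under these assumed settings, the example `[2560, 9728]` tensor shrinks from 49,807,360 bytes to 26,460,160 bytes, about 53% of the original. Tensors left unquantized would move the whole-model ratio further from exactly one half.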