Add Docker Model Runner documentation

For configuration, IDE integrations, inference engines, and Open WebUI

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
This commit is contained in:
Eric Curtin
2026-01-06 16:46:13 +00:00
parent ccd16bb273
commit 8bf6cd030e
8 changed files with 1460 additions and 96 deletions

View File

@@ -77,7 +77,7 @@ Common configuration options include:
> as small as feasible for your specific needs.
- `runtime_flags`: A list of raw command-line flags passed to the inference engine when the model is started.
For example, if you use llama.cpp, you can pass any of [the available parameters](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
See [Configuration options](/manuals/ai/model-runner/configuration.md) for commonly used parameters and examples.
- Platform-specific options may also be available via extension attributes `x-*`
> [!TIP]
@@ -364,5 +364,7 @@ services:
- [`models` top-level element](/reference/compose-file/models.md)
- [`models` attribute](/reference/compose-file/services.md#models)
- [Docker Model Runner documentation](/manuals/ai/model-runner.md)
- [Compose Model Runner documentation](/manuals/ai/compose/models-and-compose.md)
- [Docker Model Runner documentation](/manuals/ai/model-runner/_index.md)
- [Configuration options](/manuals/ai/model-runner/configuration.md) - Context size and runtime parameters
- [Inference engines](/manuals/ai/model-runner/inference-engines.md) - llama.cpp and vLLM details
- [API reference](/manuals/ai/model-runner/api-reference.md) - OpenAI and Ollama-compatible APIs

View File

@@ -6,7 +6,7 @@ params:
group: AI
weight: 30
description: Learn how to use Docker Model Runner to manage and run AI models.
keywords: Docker, ai, model runner, docker desktop, docker engine, llm, openai, llama.cpp, vllm, cpu, nvidia, cuda, amd, rocm, vulkan
keywords: Docker, ai, model runner, docker desktop, docker engine, llm, openai, ollama, llama.cpp, vllm, cpu, nvidia, cuda, amd, rocm, vulkan, cline, continue, cursor
aliases:
- /desktop/features/model-runner/
- /model-runner/
@@ -21,7 +21,7 @@ large language models (LLMs) and other AI models directly from Docker Hub or any
OCI-compliant registry.
With seamless integration into Docker Desktop and Docker
Engine, you can serve models via OpenAI-compatible APIs, package GGUF files as
Engine, you can serve models via OpenAI and Ollama-compatible APIs, package GGUF files as
OCI Artifacts, and interact with models from both the command line and graphical
interface.
@@ -33,10 +33,13 @@ with AI models locally.
## Key features
- [Pull and push models to and from Docker Hub](https://hub.docker.com/u/ai)
- Serve models on OpenAI-compatible APIs for easy integration with existing apps
- Support for both llama.cpp and vLLM inference engines (vLLM currently supported on Linux x86_64/amd64 with NVIDIA GPUs only)
- Serve models on [OpenAI and Ollama-compatible APIs](api-reference.md) for easy integration with existing apps
- Support for both [llama.cpp and vLLM inference engines](inference-engines.md) (vLLM on Linux x86_64/amd64 and Windows WSL2 with NVIDIA GPUs)
- Package GGUF and Safetensors files as OCI Artifacts and publish them to any container registry
- Run and interact with AI models directly from the command line or from the Docker Desktop GUI
- [Connect to AI coding tools](ide-integrations.md) like Cline, Continue, Cursor, and Aider
- [Configure context size and model parameters](configuration.md) to tune performance
- [Set up Open WebUI](openwebui-integration.md) for a ChatGPT-like web interface
- Manage local models and display logs
- Display prompt and response details
- Conversational context support for multi-turn interactions
@@ -82,9 +85,28 @@ locally. They load into memory only at runtime when a request is made, and
unload when not in use to optimize resources. Because models can be large, the
initial pull may take some time. After that, they're cached locally for faster
access. You can interact with the model using
[OpenAI-compatible APIs](api-reference.md).
[OpenAI and Ollama-compatible APIs](api-reference.md).
Docker Model Runner supports both [llama.cpp](https://github.com/ggerganov/llama.cpp) and [vLLM](https://github.com/vllm-project/vllm) as inference engines, providing flexibility for different model formats and performance requirements. For more details, see the [Docker Model Runner repository](https://github.com/docker/model-runner).
### Inference engines
Docker Model Runner supports two inference engines:
| Engine | Best for | Model format |
|--------|----------|--------------|
| [llama.cpp](inference-engines.md#llamacpp) | Local development, resource efficiency | GGUF (quantized) |
| [vLLM](inference-engines.md#vllm) | Production, high throughput | Safetensors |
llama.cpp is the default engine and works on all platforms. vLLM requires NVIDIA GPUs and is supported on Linux x86_64 and Windows with WSL2. See [Inference engines](inference-engines.md) for detailed comparison and setup.
### Context size
Models have a configurable context size (context length) that determines how many tokens they can process. The default varies by model but is typically 2,048-8,192 tokens. You can adjust this per-model:
```console
$ docker model configure --context-size 8192 ai/qwen2.5-coder
```
See [Configuration options](configuration.md) for details on context size and other parameters.
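As a quick illustration of why context size matters, you can estimate whether a prompt fits within a model's context window. The four-characters-per-token ratio below is a common rule of thumb for English text, not an exact tokenizer count:

```python
# Rough sketch: estimate whether a prompt fits in a model's context window.
# The 4-characters-per-token ratio is an approximation; real token counts
# vary by model and tokenizer.

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate the token count of a prompt."""
    return max(1, round(len(text) / chars_per_token))

def fits_context(prompt: str, context_size: int, reserve_for_output: int = 512) -> bool:
    """Check whether a prompt leaves room for the reply within the context."""
    return estimate_tokens(prompt) + reserve_for_output <= context_size

print(fits_context("Explain Docker Model Runner in one paragraph.", 2048))
```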
> [!TIP]
>
@@ -120,4 +142,9 @@ Thanks for trying out Docker Model Runner. To report bugs or request features, [
## Next steps
[Get started with DMR](get-started.md)
- [Get started with DMR](get-started.md) - Enable DMR and run your first model
- [API reference](api-reference.md) - OpenAI and Ollama-compatible API documentation
- [Configuration options](configuration.md) - Context size and runtime parameters
- [Inference engines](inference-engines.md) - llama.cpp and vLLM details
- [IDE integrations](ide-integrations.md) - Connect Cline, Continue, Cursor, and more
- [Open WebUI integration](openwebui-integration.md) - Set up a web chat interface

View File

@@ -1,30 +1,37 @@
---
title: DMR REST API
description: Reference documentation for the Docker Model Runner REST API endpoints and usage examples.
description: Reference documentation for the Docker Model Runner REST API endpoints, including OpenAI and Ollama compatibility.
weight: 30
keywords: Docker, ai, model runner, rest api, openai, endpoints, documentation
keywords: Docker, ai, model runner, rest api, openai, ollama, endpoints, documentation, cline, continue, cursor
---
Once Model Runner is enabled, new API endpoints are available. You can use
these endpoints to interact with a model programmatically.
these endpoints to interact with a model programmatically. Docker Model Runner
provides compatibility with both OpenAI and Ollama API formats.
### Determine the base URL
## Determine the base URL
The base URL to interact with the endpoints depends
on how you run Docker:
The base URL to interact with the endpoints depends on how you run Docker and
which API format you're using.
{{< tabs >}}
{{< tab name="Docker Desktop">}}
- From containers: `http://model-runner.docker.internal/`
- From host processes: `http://localhost:12434/`, assuming TCP host access is
enabled on the default port (12434).
| Access from | Base URL |
|-------------|----------|
| Containers | `http://model-runner.docker.internal` |
| Host processes (TCP) | `http://localhost:12434` |
> [!NOTE]
> TCP host access must be enabled. See [Enable Docker Model Runner](get-started.md#enable-docker-model-runner-in-docker-desktop).
{{< /tab >}}
{{< tab name="Docker Engine">}}
- From containers: `http://172.17.0.1:12434/` (with `172.17.0.1` representing the host gateway address)
- From host processes: `http://localhost:12434/`
| Access from | Base URL |
|-------------|----------|
| Containers | `http://172.17.0.1:12434` |
| Host processes | `http://localhost:12434` |
> [!NOTE]
> The `172.17.0.1` interface may not be available by default to containers
@@ -35,77 +42,139 @@ on how you run Docker:
> extra_hosts:
> - "model-runner.docker.internal:host-gateway"
> ```
> Then you can access the Docker Model Runner APIs at http://model-runner.docker.internal:12434/
> Then you can access the Docker Model Runner APIs at `http://model-runner.docker.internal:12434/`
{{< /tab >}}
{{</tabs >}}
### Available DMR endpoints
### Base URLs for third-party tools
- Create a model:
When configuring third-party tools that expect OpenAI-compatible APIs, use these base URLs:
```text
POST /models/create
```
| Tool type | Base URL format |
|-----------|-----------------|
| OpenAI SDK / clients | `http://localhost:12434/engines/v1` |
| Ollama-compatible clients | `http://localhost:12434` |
- List models:
See [IDE and tool integrations](ide-integrations.md) for specific configuration examples.
```text
GET /models
```
## Supported APIs
- Get a model:
Docker Model Runner supports multiple API formats:
```text
GET /models/{namespace}/{name}
```
| API | Description | Use case |
|-----|-------------|----------|
| [OpenAI API](#openai-compatible-api) | OpenAI-compatible chat completions, embeddings | Most AI frameworks and tools |
| [Ollama API](#ollama-compatible-api) | Ollama-compatible endpoints | Tools built for Ollama |
| [DMR API](#dmr-native-endpoints) | Native Docker Model Runner endpoints | Model management |
- Delete a local model:
## OpenAI-compatible API
```text
DELETE /models/{namespace}/{name}
```
DMR implements the OpenAI API specification for maximum compatibility with existing tools and frameworks.
### Available OpenAI endpoints
### Endpoints
DMR supports the following OpenAI endpoints:
- [List models](https://platform.openai.com/docs/api-reference/models/list):
```text
GET /engines/llama.cpp/v1/models
```
- [Retrieve model](https://platform.openai.com/docs/api-reference/models/retrieve):
```text
GET /engines/llama.cpp/v1/models/{namespace}/{name}
```
- [List chat completions](https://platform.openai.com/docs/api-reference/chat/list):
```text
POST /engines/llama.cpp/v1/chat/completions
```
- [Create completions](https://platform.openai.com/docs/api-reference/completions/create):
```text
POST /engines/llama.cpp/v1/completions
```
- [Create embeddings](https://platform.openai.com/docs/api-reference/embeddings/create):
```text
POST /engines/llama.cpp/v1/embeddings
```
To call these endpoints via a Unix socket (`/var/run/docker.sock`), prefix their path
with `/exp/vDD4.40`.
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/engines/v1/models` | GET | [List models](https://platform.openai.com/docs/api-reference/models/list) |
| `/engines/v1/models/{namespace}/{name}` | GET | [Retrieve model](https://platform.openai.com/docs/api-reference/models/retrieve) |
| `/engines/v1/chat/completions` | POST | [Create chat completion](https://platform.openai.com/docs/api-reference/chat/create) |
| `/engines/v1/completions` | POST | [Create completion](https://platform.openai.com/docs/api-reference/completions/create) |
| `/engines/v1/embeddings` | POST | [Create embeddings](https://platform.openai.com/docs/api-reference/embeddings/create) |
> [!NOTE]
> You can omit `llama.cpp` from the path. For example: `POST /engines/v1/chat/completions`.
> You can optionally include the engine name in the path: `/engines/llama.cpp/v1/chat/completions`.
> This is useful when running multiple inference engines.
### Model name format
When specifying a model in API requests, use the full model identifier including the namespace:
```json
{
"model": "ai/smollm2",
"messages": [...]
}
```
Common model name formats:
- Docker Hub models: `ai/smollm2`, `ai/llama3.2`, `ai/qwen2.5-coder`
- Tagged versions: `ai/smollm2:360M-Q4_K_M`
- Custom models: `myorg/mymodel`
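The formats above all follow the `[namespace/]name[:tag]` pattern. This illustrative helper (not part of DMR) splits a model reference into its parts; the `latest` default tag is an assumption for the sketch:

```python
# Illustrative helper that splits a model reference of the form
# [namespace/]name[:tag] into its parts. The "latest" default tag
# is an assumption, not documented DMR behavior.

def parse_model_ref(ref: str) -> dict:
    namespace, _, rest = ref.partition("/")
    if not rest:  # no slash: bare model name
        namespace, rest = "", ref
    name, _, tag = rest.partition(":")
    return {"namespace": namespace, "name": name, "tag": tag or "latest"}

print(parse_model_ref("ai/smollm2:360M-Q4_K_M"))
# {'namespace': 'ai', 'name': 'smollm2', 'tag': '360M-Q4_K_M'}
```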
### Supported parameters
The following OpenAI API parameters are supported:
| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | string | Required. The model identifier. |
| `messages` | array | Required for chat completions. The conversation history. |
| `prompt` | string | Required for completions. The prompt text. |
| `max_tokens` | integer | Maximum tokens to generate. |
| `temperature` | float | Sampling temperature (0.0-2.0). |
| `top_p` | float | Nucleus sampling parameter (0.0-1.0). |
| `stream` | boolean | Enable streaming responses. |
| `stop` | string/array | Stop sequences. |
| `presence_penalty` | float | Presence penalty (-2.0 to 2.0). |
| `frequency_penalty` | float | Frequency penalty (-2.0 to 2.0). |
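Putting these parameters together, a chat-completion request body looks like the following sketch. Field names follow the OpenAI API; the values are arbitrary examples:

```python
import json

# Sketch of a chat-completion request body using the parameters listed
# above. Send this as the JSON POST body to /engines/v1/chat/completions.

payload = {
    "model": "ai/smollm2",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize GGUF in one sentence."},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False,
    "stop": ["\n\n"],
}

body = json.dumps(payload)
print(body[:40])
```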
### Limitations and differences from OpenAI
Be aware of these differences when using DMR's OpenAI-compatible API:
| Feature | DMR behavior |
|---------|--------------|
| API key | Not required. DMR ignores the `Authorization` header. |
| Function calling | Supported with llama.cpp for compatible models. |
| Vision | Supported for multi-modal models (e.g., LLaVA). |
| JSON mode | Supported via `response_format: {"type": "json_object"}`. |
| Logprobs | Supported. |
| Token counting | Uses the model's native token encoder, which may differ from OpenAI's. |
## Ollama-compatible API
DMR also provides Ollama-compatible endpoints for tools and frameworks built for Ollama.
### Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tags` | GET | List available models |
| `/api/show` | POST | Show model information |
| `/api/chat` | POST | Generate chat completion |
| `/api/generate` | POST | Generate completion |
| `/api/embeddings` | POST | Generate embeddings |
### Example: Chat with Ollama API
```bash
curl http://localhost:12434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "ai/smollm2",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
```
### Example: List models
```bash
curl http://localhost:12434/api/tags
```
## DMR native endpoints
These endpoints are specific to Docker Model Runner for model management:
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/models/create` | POST | Pull/create a model |
| `/models` | GET | List local models |
| `/models/{namespace}/{name}` | GET | Get model details |
| `/models/{namespace}/{name}` | DELETE | Delete a local model |
## REST API examples
@@ -116,7 +185,7 @@ To call the `chat/completions` OpenAI endpoint from within another container usi
```bash
#!/bin/sh
curl http://model-runner.docker.internal/engines/llama.cpp/v1/chat/completions \
curl http://model-runner.docker.internal/engines/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai/smollm2",
@@ -149,21 +218,21 @@ To call the `chat/completions` OpenAI endpoint from the host via TCP:
```bash
#!/bin/sh
curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai/smollm2",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Please write 500 words about the fall of Rome."
}
]
}'
curl http://localhost:12434/engines/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai/smollm2",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Please write 500 words about the fall of Rome."
}
]
}'
```
### Request from the host using a Unix socket
@@ -174,7 +243,7 @@ To call the `chat/completions` OpenAI endpoint through the Docker socket from th
#!/bin/sh
curl --unix-socket $HOME/.docker/run/docker.sock \
localhost/exp/vDD4.40/engines/llama.cpp/v1/chat/completions \
localhost/exp/vDD4.40/engines/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai/smollm2",
@@ -190,3 +259,65 @@ curl --unix-socket $HOME/.docker/run/docker.sock \
]
}'
```
### Streaming responses
To receive streaming responses, set `stream: true`:
```bash
curl http://localhost:12434/engines/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai/smollm2",
"stream": true,
"messages": [
{"role": "user", "content": "Count from 1 to 10"}
]
}'
```
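With `stream: true`, the response arrives as server-sent events: each event line has the form `data: <json chunk>`, and the stream ends with `data: [DONE]`. This sketch reassembles the text from a captured stream; the chunk shape (`choices[0].delta.content`) follows the OpenAI streaming format:

```python
import json

# Parse a captured SSE stream from a streaming chat completion.
# Chunk field names follow the OpenAI streaming format.

sample_stream = """\
data: {"choices": [{"delta": {"content": "Hello"}}]}

data: {"choices": [{"delta": {"content": " world"}}]}

data: [DONE]
"""

def collect_stream(raw: str) -> str:
    text = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue
        chunk = line[len("data: "):]
        if chunk == "[DONE]":  # end-of-stream sentinel
            break
        delta = json.loads(chunk)["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

print(collect_stream(sample_stream))  # Hello world
```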
## Using with OpenAI SDKs
### Python
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:12434/engines/v1",
api_key="not-needed" # DMR doesn't require an API key
)
response = client.chat.completions.create(
model="ai/smollm2",
messages=[
{"role": "user", "content": "Hello!"}
]
)
print(response.choices[0].message.content)
```
### Node.js
```javascript
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:12434/engines/v1',
apiKey: 'not-needed',
});
const response = await client.chat.completions.create({
model: 'ai/smollm2',
messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(response.choices[0].message.content);
```
## What's next
- [IDE and tool integrations](ide-integrations.md) - Configure Cline, Continue, Cursor, and other tools
- [Configuration options](configuration.md) - Adjust context size and runtime parameters
- [Inference engines](inference-engines.md) - Learn about llama.cpp and vLLM options

View File

@@ -0,0 +1,305 @@
---
title: Configuration options
description: Configure context size, runtime parameters, and model behavior in Docker Model Runner.
weight: 35
keywords: Docker, ai, model runner, configuration, context size, context length, tokens, llama.cpp, parameters
---
Docker Model Runner provides several configuration options to tune model behavior,
memory usage, and inference performance. This guide covers the key settings and
how to apply them.
## Context size (context length)
The context size determines the maximum number of tokens a model can process in
a single request, including both the input prompt and generated output. This is
one of the most important settings affecting memory usage and model capabilities.
### Default context size
By default, Docker Model Runner uses a context size that balances capability with
resource efficiency:
| Engine | Default behavior |
|--------|------------------|
| llama.cpp | 4096 tokens |
| vLLM | Uses the model's maximum trained context size |
> [!NOTE]
> The actual default varies by model. Most models support between 2,048 and 8,192
> tokens by default. Some newer models support 32K, 128K, or even larger contexts.
### Configure context size
You can adjust context size per model using the `docker model configure` command:
```console
$ docker model configure --context-size 8192 ai/qwen2.5-coder
```
Or in a Compose file:
```yaml
models:
llm:
model: ai/qwen2.5-coder
context_size: 8192
```
### Context size guidelines
| Context size | Typical use case | Memory impact |
|--------------|------------------|---------------|
| 2,048 | Simple queries, short code snippets | Low |
| 4,096 | Standard conversations, medium code files | Moderate |
| 8,192 | Long conversations, larger code files | Higher |
| 16,384+ | Extended documents, multi-file context | High |
> [!IMPORTANT]
> Larger context sizes require more memory (RAM/VRAM). If you experience out-of-memory
> errors, reduce the context size. As a rough guide, each additional 1,000 tokens
> requires approximately 100-500 MB of additional memory, depending on the model size.
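A back-of-the-envelope calculation based on the rule of thumb above can help you pick a context size before trying it. This is purely illustrative; the midpoint rate of 300 MB per 1,000 tokens is an assumption, so measure real usage on your own hardware:

```python
# Estimate additional memory when growing the context beyond a baseline,
# using the rough 100-500 MB per 1,000 tokens guideline. The 300 MB/1k
# midpoint rate is an assumption for illustration only.

def extra_memory_mb(context_tokens: int, baseline_tokens: int = 4096,
                    mb_per_1k: float = 300.0) -> float:
    """Rough extra memory (MB) for a context larger than the baseline."""
    extra = max(0, context_tokens - baseline_tokens)
    return extra / 1000 * mb_per_1k

print(f"{extra_memory_mb(16384):.0f} MB")  # estimate for a 16K context
```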
### Check a model's maximum context
To see a model's configuration including context size:
```console
$ docker model inspect ai/qwen2.5-coder
```
> [!NOTE]
> The `docker model inspect` command shows the model's maximum supported context
> length (for example, `gemma3.context_length`), not the configured context size.
> The configured size is the value you set with `docker model configure --context-size`;
> it is the limit actually used during inference and must not exceed the model's
> maximum supported context length.
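If you script against the inspect output, you can pull out the maximum context length programmatically. The JSON shape and key name below (`gemma3.context_length` under a `config` object) are a hypothetical example; the exact structure varies by model:

```python
import json

# Hypothetical sketch: extract the maximum context length from
# `docker model inspect` output. The key name and nesting shown here
# are assumptions; inspect the real output for your model first.

sample_inspect = json.loads("""
{
  "config": {
    "gemma3.context_length": 131072
  }
}
""")

max_context = next(
    (v for k, v in sample_inspect.get("config", {}).items()
     if k.endswith(".context_length")),
    None,
)
print(max_context)
```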
## Runtime flags
Runtime flags let you pass parameters directly to the underlying inference engine.
This provides fine-grained control over model behavior.
### Using runtime flags
Runtime flags can be provided through multiple mechanisms:
#### Using Docker Compose
In a Compose file:
```yaml
models:
llm:
model: ai/qwen2.5-coder
context_size: 4096
runtime_flags:
- "--temp"
- "0.7"
- "--top-p"
- "0.9"
```
#### Using the command line
With the `docker model configure` command:
```console
$ docker model configure --runtime-flag "--temp" --runtime-flag "0.7" --runtime-flag "--top-p" --runtime-flag "0.9" ai/qwen2.5-coder
```
### Common llama.cpp parameters
These are the most commonly used llama.cpp parameters; for typical use cases,
you won't need to consult the llama.cpp documentation.
#### Sampling parameters
| Flag | Description | Default | Range |
|------|-------------|---------|-------|
| `--temp` | Temperature for sampling. Lower = more deterministic, higher = more creative | 0.8 | 0.0-2.0 |
| `--top-k` | Limit sampling to top K tokens. Lower = more focused | 40 | 1-100 |
| `--top-p` | Nucleus sampling threshold. Lower = more focused | 0.9 | 0.0-1.0 |
| `--min-p` | Minimum probability threshold | 0.05 | 0.0-1.0 |
| `--repeat-penalty` | Penalty for repeating tokens | 1.1 | 1.0-2.0 |
**Example: Deterministic output (for code generation)**
```yaml
runtime_flags:
- "--temp"
- "0"
- "--top-k"
- "1"
```
**Example: Creative output (for storytelling)**
```yaml
runtime_flags:
- "--temp"
- "1.2"
- "--top-p"
- "0.95"
```
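To make the effect of these flags concrete, here is what the sampling pipeline does, sketched on a toy distribution: temperature rescales logits before the softmax, top-k keeps the K most probable tokens, and top-p (nucleus sampling) keeps the smallest set whose cumulative probability reaches p. This is an illustration of the technique, not llama.cpp's actual implementation:

```python
import math

# Illustrative temperature / top-k / top-p filtering on toy logits.
# Not llama.cpp's code; a sketch of what the flags control.

def filter_logits(logits: dict, temp: float = 0.8, top_k: int = 40, top_p: float = 0.9):
    # Temperature: divide logits by temp, then softmax into probabilities.
    scaled = {t: l / temp for t, l in logits.items()}
    z = sum(math.exp(l) for l in scaled.values())
    probs = sorted(((t, math.exp(l) / z) for t, l in scaled.items()),
                   key=lambda x: -x[1])
    # Top-k: keep only the K most probable tokens.
    probs = probs[:top_k]
    # Top-p: keep tokens until cumulative probability reaches the threshold.
    kept, cum = [], 0.0
    for tok, p in probs:
        kept.append(tok)
        cum += p
        if cum >= top_p:
            break
    return kept

print(filter_logits({"the": 2.0, "a": 1.0, "cat": 0.5, "zzz": -3.0},
                    temp=1.0, top_p=0.9))  # ['the', 'a', 'cat']
```

Lowering `--temp` or `--top-k` shrinks the candidate set (more deterministic); raising them widens it (more creative), which is exactly what the two presets above trade off.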
#### Performance parameters
| Flag | Description | Default | Notes |
|------|-------------|---------|-------|
| `--threads` | CPU threads for generation | Auto | Set to the number of performance cores |
| `--threads-batch` | CPU threads for batch processing | Auto | Usually same as `--threads` |
| `--batch-size` | Batch size for prompt processing | 512 | Higher = faster prompt processing |
| `--mlock` | Lock model in memory | Off | Prevents swapping, requires sufficient RAM |
| `--no-mmap` | Disable memory mapping | Off | May improve performance on some systems |
**Example: Optimized for multi-core CPU**
```yaml
runtime_flags:
- "--threads"
- "8"
- "--batch-size"
- "1024"
```
#### GPU parameters
| Flag | Description | Default | Notes |
|------|-------------|---------|-------|
| `--n-gpu-layers` | Layers to offload to GPU | All (if GPU available) | Reduce if running out of VRAM |
| `--main-gpu` | GPU to use for computation | 0 | For multi-GPU systems |
| `--split-mode` | How to split across GPUs | layer | Options: `none`, `layer`, `row` |
**Example: Partial GPU offload (limited VRAM)**
```yaml
runtime_flags:
- "--n-gpu-layers"
- "20"
```
#### Advanced parameters
| Flag | Description | Default |
|------|-------------|---------|
| `--rope-scaling` | RoPE scaling method | Auto |
| `--rope-freq-base` | RoPE base frequency | Model default |
| `--rope-freq-scale` | RoPE frequency scale | Model default |
| `--no-prefill-assistant` | Disable assistant pre-fill | Off |
| `--reasoning-budget` | Token budget for reasoning models | 0 (disabled) |
### vLLM parameters
When using the vLLM backend, different parameters are available.
Use `--hf_overrides` to pass HuggingFace model config overrides as JSON:
```console
$ docker model configure --hf_overrides '{"rope_scaling": {"type": "dynamic", "factor": 2.0}}' ai/model-vllm
```
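Because `--hf_overrides` takes raw JSON, building the argument programmatically avoids shell-quoting mistakes. A minimal sketch (the model name is the hypothetical one from the example above):

```python
import json

# Build the --hf_overrides JSON argument instead of hand-writing it,
# which avoids quoting mistakes in the shell. Model name is hypothetical.

overrides = {"rope_scaling": {"type": "dynamic", "factor": 2.0}}
arg = json.dumps(overrides)

cmd = ["docker", "model", "configure", "--hf_overrides", arg, "ai/model-vllm"]
print(arg)
```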
## Configuration presets
Here are complete configuration examples for common use cases.
### Code completion (fast, deterministic)
```yaml
models:
coder:
model: ai/qwen2.5-coder
context_size: 4096
runtime_flags:
- "--temp"
- "0.1"
- "--top-k"
- "1"
- "--batch-size"
- "1024"
```
### Chat assistant (balanced)
```yaml
models:
assistant:
model: ai/llama3.2
context_size: 8192
runtime_flags:
- "--temp"
- "0.7"
- "--top-p"
- "0.9"
- "--repeat-penalty"
- "1.1"
```
### Creative writing (high temperature)
```yaml
models:
writer:
model: ai/llama3.2
context_size: 8192
runtime_flags:
- "--temp"
- "1.2"
- "--top-p"
- "0.95"
- "--repeat-penalty"
- "1.0"
```
### Long document analysis (large context)
```yaml
models:
analyzer:
model: ai/qwen2.5-coder:14B
context_size: 32768
runtime_flags:
- "--mlock"
- "--batch-size"
- "2048"
```
### Low memory system
```yaml
models:
efficient:
model: ai/smollm2:360M-Q4_K_M
context_size: 2048
runtime_flags:
- "--threads"
- "4"
```
## Environment-based configuration
When you use models with Compose, connection details for each model are injected into service containers as environment variables:
| Variable | Description |
|----------|-------------|
| `LLM_URL` | Auto-injected URL of the model endpoint |
| `LLM_MODEL` | Auto-injected model identifier |
See [Models and Compose](/manuals/ai/compose/models-and-compose.md) for details on how these are populated.
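Inside a container, application code can read these variables with explicit fallbacks for local runs outside Compose. A minimal sketch; the fallback URL assumes TCP host access on the default port, and the fallback model is an arbitrary example:

```python
import os

# Read the Compose-injected variables with fallbacks for local runs.
# The fallback URL and model name are assumptions for illustration.

def model_endpoint(env=os.environ) -> tuple[str, str]:
    url = env.get("LLM_URL", "http://localhost:12434/engines/v1")
    model = env.get("LLM_MODEL", "ai/smollm2")
    return url, model

url, model = model_endpoint()
print(url, model)
```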
## Reset configuration
Configuration set via `docker model configure` persists until the model is removed.
To reset configuration:
```console
$ docker model configure --context-size -1 ai/qwen2.5-coder
```
Using `-1` resets to the default value.
## What's next
- [Inference engines](inference-engines.md) - Learn about llama.cpp and vLLM
- [API reference](api-reference.md) - API parameters for per-request configuration
- [Models and Compose](/manuals/ai/compose/models-and-compose.md) - Configure models in Compose applications

View File

@@ -221,6 +221,10 @@ In Docker Desktop, to inspect the requests and responses for each model:
## Related pages
- [Interact with your model programmatically](./api-reference.md)
- [Models and Compose](../compose/models-and-compose.md)
- [Docker Model Runner CLI reference documentation](/reference/cli/docker/model)
- [API reference](./api-reference.md) - OpenAI and Ollama-compatible API documentation
- [Configuration options](./configuration.md) - Context size and runtime parameters
- [Inference engines](./inference-engines.md) - llama.cpp and vLLM details
- [IDE integrations](./ide-integrations.md) - Connect Cline, Continue, Cursor, and more
- [Open WebUI integration](./openwebui-integration.md) - Set up a web chat interface
- [Models and Compose](../compose/models-and-compose.md) - Use models in Compose applications
- [Docker Model Runner CLI reference](/reference/cli/docker/model) - Complete CLI documentation

View File

@@ -0,0 +1,283 @@
---
title: IDE and tool integrations
description: Configure popular AI coding assistants and tools to use Docker Model Runner as their backend.
weight: 40
keywords: Docker, ai, model runner, cline, continue, cursor, vscode, ide, integration, openai, ollama
---
Docker Model Runner can serve as a local backend for popular AI coding assistants
and development tools. This guide shows how to configure common tools to use
models running in DMR.
## Prerequisites
Before configuring any tool:
1. [Enable Docker Model Runner](get-started.md#enable-docker-model-runner) in Docker Desktop or Docker Engine.
2. Enable TCP host access:
- Docker Desktop: Enable **host-side TCP support** in Settings > AI, or run:
```console
$ docker desktop enable model-runner --tcp 12434
```
- Docker Engine: TCP is enabled by default on port 12434.
3. Pull a model:
```console
$ docker model pull ai/qwen2.5-coder
```
## Cline (VS Code)
[Cline](https://github.com/cline/cline) is an AI coding assistant for VS Code.
### Configuration
1. Open VS Code and go to the Cline extension settings.
2. Select **OpenAI Compatible** as the API provider.
3. Configure the following settings:
| Setting | Value |
|---------|-------|
| Base URL | `http://localhost:12434/engines/v1` |
| API Key | `not-needed` (or any placeholder value) |
| Model ID | `ai/qwen2.5-coder` (or your preferred model) |
> [!IMPORTANT]
> The base URL must include `/engines/v1` at the end. Do not include a trailing slash.
### Troubleshooting Cline
If Cline fails to connect:
1. Verify DMR is running:
```console
$ docker model status
```
2. Test the endpoint directly:
```console
$ curl http://localhost:12434/engines/v1/models
```
3. Check that CORS is configured if running a web-based version:
- In Docker Desktop Settings > AI, add your origin to **CORS Allowed Origins**
## Continue (VS Code / JetBrains)
[Continue](https://continue.dev) is an open-source AI code assistant that works with VS Code and JetBrains IDEs.
### Configuration
Edit your Continue configuration file (`~/.continue/config.json`):
```json
{
"models": [
{
"title": "Docker Model Runner",
"provider": "openai",
"model": "ai/qwen2.5-coder",
"apiBase": "http://localhost:12434/engines/v1",
"apiKey": "not-needed"
}
]
}
```
### Using Ollama provider
Continue also supports the Ollama provider, which works with DMR:
```json
{
"models": [
{
"title": "Docker Model Runner (Ollama)",
"provider": "ollama",
"model": "ai/qwen2.5-coder",
"apiBase": "http://localhost:12434"
}
]
}
```
## Cursor
[Cursor](https://cursor.sh) is an AI-powered code editor.
### Configuration
1. Open Cursor Settings (Cmd/Ctrl + ,).
2. Navigate to **Models** > **OpenAI API Key**.
3. Configure:
| Setting | Value |
|---------|-------|
| OpenAI API Key | `not-needed` |
| Override OpenAI Base URL | `http://localhost:12434/engines/v1` |
4. In the model drop-down, enter your model name: `ai/qwen2.5-coder`
> [!NOTE]
> Some Cursor features may require models with specific capabilities (e.g., function calling).
> Use capable models like `ai/qwen2.5-coder` or `ai/llama3.2` for best results.
## Zed
[Zed](https://zed.dev) is a high-performance code editor with AI features.
### Configuration
Edit your Zed settings (`~/.config/zed/settings.json`):
```json
{
"language_models": {
"openai": {
"api_url": "http://localhost:12434/engines/v1",
"available_models": [
{
"name": "ai/qwen2.5-coder",
"display_name": "Qwen 2.5 Coder (DMR)",
"max_tokens": 8192
}
]
}
}
}
```
## Open WebUI
[Open WebUI](https://github.com/open-webui/open-webui) provides a ChatGPT-like interface for local models.
See [Open WebUI integration](openwebui-integration.md) for detailed setup instructions.
## Aider
[Aider](https://aider.chat) is an AI pair programming tool for the terminal.
### Configuration
Set environment variables or use command-line flags:
```bash
export OPENAI_API_BASE=http://localhost:12434/engines/v1
export OPENAI_API_KEY=not-needed
aider --model openai/ai/qwen2.5-coder
```
Or in a single command:
```console
$ aider --openai-api-base http://localhost:12434/engines/v1 \
--openai-api-key not-needed \
--model openai/ai/qwen2.5-coder
```
## LangChain
### Python
```python
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="http://localhost:12434/engines/v1",
api_key="not-needed",
model="ai/qwen2.5-coder"
)
response = llm.invoke("Write a hello world function in Python")
print(response.content)
```
### JavaScript/TypeScript
```typescript
import { ChatOpenAI } from "@langchain/openai";
const model = new ChatOpenAI({
configuration: {
baseURL: "http://localhost:12434/engines/v1",
},
apiKey: "not-needed",
modelName: "ai/qwen2.5-coder",
});
const response = await model.invoke("Write a hello world function");
console.log(response.content);
```
## LlamaIndex
```python
from llama_index.llms.openai_like import OpenAILike
llm = OpenAILike(
api_base="http://localhost:12434/engines/v1",
api_key="not-needed",
model="ai/qwen2.5-coder"
)
response = llm.complete("Write a hello world function")
print(response.text)
```
## Common issues
### "Connection refused" errors
1. Ensure Docker Model Runner is enabled and running:
```console
$ docker model status
```
2. Verify TCP access is enabled:
```console
$ curl http://localhost:12434/engines/v1/models
```
3. Check if another service is using port 12434.
### "Model not found" errors
1. Verify the model is pulled:
```console
$ docker model list
```
2. Use the full model name including namespace (e.g., `ai/qwen2.5-coder`, not just `qwen2.5-coder`).
### Slow responses or timeouts
1. On the first request, the model must load into memory, so it takes longer. Subsequent requests are faster.
2. Consider using a smaller model or adjusting the context size:
```console
$ docker model configure --context-size 4096 ai/qwen2.5-coder
```
3. Check available system resources (RAM, GPU memory).
### CORS errors (web-based tools)
If using browser-based tools, add the origin to CORS allowed origins:
1. Docker Desktop: Settings > AI > CORS Allowed Origins
2. Add your tool's URL (e.g., `http://localhost:3000`)
## Recommended models by use case
| Use case | Recommended model | Notes |
|----------|-------------------|-------|
| Code completion | `ai/qwen2.5-coder` | Optimized for coding tasks |
| General assistant | `ai/llama3.2` | Good balance of capabilities |
| Small/fast | `ai/smollm2` | Low resource usage |
| Embeddings | `ai/all-minilm` | For RAG and semantic search |
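For the embeddings row, the typical pattern is to rank documents by cosine similarity between vectors. The sketch below shows only that post-processing step; the vectors are illustrative placeholders, not real `ai/all-minilm` output, which you would fetch from the `/engines/v1/embeddings` endpoint.

```python
# Sketch: ranking documents by cosine similarity, as you would with
# vectors returned by the embeddings endpoint. The numbers below are
# illustrative placeholders, not real model output.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.3, 0.5]
docs = {
    "doc-a": [0.1, 0.3, 0.5],   # same direction as the query -> similarity 1.0
    "doc-b": [0.5, 0.1, -0.2],
}
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked[0])  # doc-a is the closest match
```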
## What's next
- [API reference](api-reference.md) - Full API documentation
- [Configuration options](configuration.md) - Tune model behavior
- [Open WebUI integration](openwebui-integration.md) - Set up a web interface

---
title: Inference engines
description: Learn about the llama.cpp and vLLM inference engines in Docker Model Runner.
weight: 50
keywords: Docker, ai, model runner, llama.cpp, vllm, inference, gguf, safetensors, cuda, gpu
---
Docker Model Runner supports two inference engines: **llama.cpp** and **vLLM**.
Each engine has different strengths, supported platforms, and model format
requirements. This guide helps you choose the right engine and configure it for
your use case.
## Engine comparison
| Feature | llama.cpp | vLLM |
|---------|-----------|------|
| **Model formats** | GGUF | Safetensors (Hugging Face) |
| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64 only |
| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only |
| **CPU inference** | Yes | No |
| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited |
| **Memory efficiency** | High (with quantization) | Moderate |
| **Throughput** | Good | High (with batching) |
| **Best for** | Local development, resource-constrained environments | Production, high throughput |
## llama.cpp
[llama.cpp](https://github.com/ggerganov/llama.cpp) is the default inference
engine in Docker Model Runner. It's designed for efficient local inference and
supports a wide range of hardware configurations.
### Platform support
| Platform | GPU support | Notes |
|----------|-------------|-------|
| macOS (Apple Silicon) | Metal | Automatic GPU acceleration |
| Windows (x64) | NVIDIA CUDA | Requires NVIDIA drivers 576.57+ |
| Windows (ARM64) | Adreno OpenCL | Qualcomm 6xx series and later |
| Linux (x64) | NVIDIA, AMD, Vulkan | Multiple backend options |
| Linux | CPU only | Works on any x64/ARM64 system |
### Model format: GGUF
llama.cpp uses the GGUF format, which supports efficient quantization for reduced
memory usage without significant quality loss.
#### Quantization levels
| Quantization | Bits per weight | Memory usage | Quality |
|--------------|-----------------|--------------|---------|
| Q2_K | ~2.5 | Lowest | Reduced |
| Q3_K_M | ~3.5 | Very low | Acceptable |
| Q4_K_M | ~4.5 | Low | Good |
| Q5_K_M | ~5.5 | Moderate | Excellent |
| Q6_K | ~6.5 | Higher | Excellent |
| Q8_0 | 8 | High | Near-original |
| F16 | 16 | Highest | Original |
**Recommended**: Q4_K_M offers the best balance of quality and memory usage for
most use cases.
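As a rough sizing aid, weight memory scales with parameter count times bits per weight. The sketch below applies that rule of thumb; it is an approximation only, and runtime overhead (KV cache, activations) comes on top and grows with context size.

```python
# Rough rule of thumb (a sketch, not an exact formula): weight memory in
# bytes is roughly parameter_count * bits_per_weight / 8. KV cache and
# activation memory are extra and grow with context size.

def approx_weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# A 3B-parameter model at Q4_K_M (~4.5 bits/weight):
print(f"{approx_weight_memory_gb(3, 4.5):.1f} GB")  # ~1.7 GB for weights alone
# The same model at F16 needs several times that:
print(f"{approx_weight_memory_gb(3, 16):.1f} GB")   # 6.0 GB
```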
#### Pulling quantized models
Models on Docker Hub often include quantization in the tag:
```console
$ docker model pull ai/llama3.2:3B-Q4_K_M
```
### Using llama.cpp
llama.cpp is the default engine. No special configuration is required:
```console
$ docker model run ai/smollm2
```
To explicitly specify llama.cpp when running models:
```console
$ docker model run ai/smollm2 --backend llama.cpp
```
### llama.cpp API endpoints
When using llama.cpp, API calls use the llama.cpp engine path:
```text
POST /engines/llama.cpp/v1/chat/completions
```
Or without the engine prefix:
```text
POST /engines/v1/chat/completions
```
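Either path can be called with plain HTTP. The sketch below builds a request against the llama.cpp engine path, assuming Model Runner is listening on `localhost:12434` with TCP access enabled; the body follows the OpenAI chat completions schema, and the network call is guarded behind an environment variable so the sketch runs without a live server.

```python
# Sketch: calling the llama.cpp engine path with only the standard
# library, assuming TCP access on localhost:12434.
import json
import os
import urllib.request

BASE = "http://localhost:12434/engines/llama.cpp/v1"

def build_chat_request(model: str, prompt: str) -> dict:
    # Request body follows the OpenAI chat completions schema.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("ai/smollm2", "Say hello")

# Guarded so the sketch runs without a live Model Runner:
if os.environ.get("DMR_LIVE"):
    req = urllib.request.Request(
        f"{BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```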
## vLLM
[vLLM](https://github.com/vllm-project/vllm) is a high-performance inference
engine optimized for production workloads with high throughput requirements.
### Platform support
| Platform | GPU | Support status |
|----------|-----|----------------|
| Linux x86_64 | NVIDIA CUDA | Supported |
| Windows with WSL2 | NVIDIA CUDA | Supported (Docker Desktop 4.54+) |
| macOS | - | Not supported |
| Linux ARM64 | - | Not supported |
| AMD GPUs | - | Not supported |
> [!IMPORTANT]
> vLLM requires an NVIDIA GPU with CUDA support. It does not support CPU-only
> inference.
### Model format: Safetensors
vLLM works with models in Safetensors format, the standard format for
Hugging Face models. These models typically use more memory than quantized GGUF
models but may offer better quality and faster inference on powerful hardware.
### Setting up vLLM
#### Docker Engine (Linux)
Install the Model Runner with vLLM backend:
```console
$ docker model install-runner --backend vllm --gpu cuda
```
Verify the installation:
```console
$ docker model status
Docker Model Runner is running
Status:
llama.cpp: running llama.cpp version: c22473b
vllm: running vllm version: 0.11.0
```
#### Docker Desktop (Windows with WSL2)
1. Ensure you have:
- Docker Desktop 4.54 or later
- NVIDIA GPU with updated drivers
- WSL2 enabled
2. Install vLLM backend:
```console
$ docker model install-runner --backend vllm --gpu cuda
```
### Running models with vLLM
vLLM models are typically tagged with a `-vllm` suffix:
```console
$ docker model run ai/smollm2-vllm
```
To specify the vLLM backend explicitly:
```console
$ docker model run ai/model --backend vllm
```
### vLLM API endpoints
When using vLLM, specify the engine in the API path:
```text
POST /engines/vllm/v1/chat/completions
```
### vLLM configuration
#### Hugging Face overrides
Use `--hf_overrides` to pass model configuration overrides:
```console
$ docker model configure --hf_overrides '{"max_model_len": 8192}' ai/model-vllm
```
#### Common vLLM settings
| Setting | Description | Example |
|---------|-------------|---------|
| `max_model_len` | Maximum context length | 8192 |
| `gpu_memory_utilization` | Fraction of GPU memory to use | 0.9 |
| `tensor_parallel_size` | GPUs for tensor parallelism | 2 |
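Because the overrides are passed as a JSON string, building them from a dict and serializing avoids shell-quoting mistakes. A minimal sketch, using the `max_model_len` key from the table above:

```python
# Sketch: build the --hf_overrides JSON from a dict so quoting stays
# correct, then print the full command to run.
import json

overrides = {
    "max_model_len": 8192,
}
flag = json.dumps(overrides)
print(f"docker model configure --hf_overrides '{flag}' ai/model-vllm")
```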
### vLLM and llama.cpp performance comparison
| Scenario | Recommended engine |
|----------|-------------------|
| Single user, local development | llama.cpp |
| Multiple concurrent requests | vLLM |
| Limited GPU memory | llama.cpp (with quantization) |
| Maximum throughput | vLLM |
| CPU-only system | llama.cpp |
| Apple Silicon Mac | llama.cpp |
| Production deployment | vLLM (if hardware supports it) |
## Running both engines
You can run both llama.cpp and vLLM simultaneously. Docker Model Runner routes
requests to the appropriate engine based on the model or explicit engine selection.
Check which engines are running:
```console
$ docker model status
Docker Model Runner is running
Status:
llama.cpp: running llama.cpp version: c22473b
vllm: running vllm version: 0.11.0
```
### Engine-specific API paths
| Engine | API path |
|--------|----------|
| llama.cpp | `/engines/llama.cpp/v1/...` |
| vLLM | `/engines/vllm/v1/...` |
| Auto-select | `/engines/v1/...` |
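A small helper can map the table above to concrete endpoint URLs; this is a sketch, where `"auto"` omits the engine segment so Model Runner selects the engine itself.

```python
# Sketch: build engine-specific Model Runner endpoint URLs from the
# table above. "auto" omits the engine segment entirely.

def endpoint(engine: str, path: str, host: str = "http://localhost:12434") -> str:
    prefix = "/engines/v1" if engine == "auto" else f"/engines/{engine}/v1"
    return f"{host}{prefix}/{path.lstrip('/')}"

print(endpoint("llama.cpp", "chat/completions"))
# http://localhost:12434/engines/llama.cpp/v1/chat/completions
print(endpoint("auto", "models"))
# http://localhost:12434/engines/v1/models
```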
## Managing inference engines
### Install an engine
```console
$ docker model install-runner --backend <engine> [--gpu <type>]
```
Options:
- `--backend`: `llama.cpp` or `vllm`
- `--gpu`: `cuda`, `rocm`, `vulkan`, or `metal` (depends on platform)
### Reinstall an engine
```console
$ docker model reinstall-runner --backend <engine>
```
### Check engine status
```console
$ docker model status
```
### View engine logs
```console
$ docker model logs
```
## Packaging models for each engine
### Package a GGUF model (llama.cpp)
```console
$ docker model package --gguf ./model.gguf --push myorg/mymodel:Q4_K_M
```
### Package a Safetensors model (vLLM)
```console
$ docker model package --safetensors ./model/ --push myorg/mymodel-vllm
```
## Troubleshooting
### vLLM won't start
1. Verify NVIDIA GPU is available:
```console
$ nvidia-smi
```
2. Check Docker has GPU access:
```console
$ docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```
3. Verify you're on a supported platform (Linux x86_64 or Windows WSL2).
### llama.cpp is slow
1. Ensure GPU acceleration is working (check logs for Metal/CUDA messages).
2. Try a more aggressive quantization:
```console
$ docker model pull ai/model:Q4_K_M
```
3. Reduce context size:
```console
$ docker model configure --context-size 2048 ai/model
```
### Out of memory errors
1. Use a smaller quantization (Q4 instead of Q8).
2. Reduce context size.
3. For vLLM, adjust `gpu_memory_utilization`:
```console
$ docker model configure --hf_overrides '{"gpu_memory_utilization": 0.8}' ai/model
```
## What's next
- [Configuration options](configuration.md) - Detailed parameter reference
- [API reference](api-reference.md) - API documentation
- [GPU support](/manuals/desktop/features/gpu.md) - GPU configuration for Docker Desktop

---
title: Open WebUI integration
description: Set up Open WebUI as a ChatGPT-like interface for Docker Model Runner.
weight: 45
keywords: Docker, ai, model runner, open webui, openwebui, chat interface, ollama, ui
---
[Open WebUI](https://github.com/open-webui/open-webui) is an open-source,
self-hosted web interface that provides a ChatGPT-like experience for local
AI models. You can connect it to Docker Model Runner to get a polished chat
interface for your models.
## Prerequisites
- Docker Model Runner enabled with TCP access
- A model pulled (e.g., `docker model pull ai/llama3.2`)
## Quick start with Docker Compose
The easiest way to run Open WebUI with Docker Model Runner is using Docker Compose.
Create a `compose.yaml` file:
```yaml
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://host.docker.internal:12434
- WEBUI_AUTH=false
extra_hosts:
- "host.docker.internal:host-gateway"
volumes:
- open-webui:/app/backend/data
volumes:
open-webui:
```
Start the services:
```console
$ docker compose up -d
```
Open your browser to [http://localhost:3000](http://localhost:3000).
## Configuration options
### Environment variables
| Variable | Description | Default |
|----------|-------------|---------|
| `OLLAMA_BASE_URL` | URL of Docker Model Runner | Required |
| `WEBUI_AUTH` | Enable authentication | `true` |
| `OPENAI_API_BASE_URL` | Use OpenAI-compatible API instead | - |
| `OPENAI_API_KEY` | API key (use any value for DMR) | - |
### Using OpenAI-compatible API
If you prefer to use the OpenAI-compatible API instead of the Ollama API:
```yaml
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
ports:
- "3000:8080"
environment:
- OPENAI_API_BASE_URL=http://host.docker.internal:12434/engines/v1
- OPENAI_API_KEY=not-needed
- WEBUI_AUTH=false
extra_hosts:
- "host.docker.internal:host-gateway"
volumes:
- open-webui:/app/backend/data
volumes:
open-webui:
```
## Network configuration
### Docker Desktop
On Docker Desktop, `host.docker.internal` automatically resolves to the host machine.
The previous example works without modification.
### Docker Engine (Linux)
On Docker Engine, you may need to configure the network differently:
```yaml
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
network_mode: host
environment:
- OLLAMA_BASE_URL=http://localhost:12434
- WEBUI_AUTH=false
volumes:
- open-webui:/app/backend/data
volumes:
open-webui:
```
Or use the host gateway:
```yaml
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://172.17.0.1:12434
- WEBUI_AUTH=false
volumes:
- open-webui:/app/backend/data
volumes:
open-webui:
```
## Using Open WebUI
### Select a model
1. Open [http://localhost:3000](http://localhost:3000)
2. Select the model drop-down in the top-left
3. Select from your pulled models (they appear with `ai/` prefix)
### Pull models through the UI
Open WebUI can pull models directly:
1. Select the model drop-down
2. Enter a model name: `ai/llama3.2`
3. Select the download icon
### Chat features
Open WebUI provides:
- Multi-turn conversations with context
- Message editing and regeneration
- Code syntax highlighting
- Markdown rendering
- Conversation history and search
- Export conversations
## Complete example with multiple models
This example sets up Open WebUI with Docker Model Runner and pre-pulls several models:
```yaml
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://host.docker.internal:12434
- WEBUI_AUTH=false
- DEFAULT_MODELS=ai/llama3.2
extra_hosts:
- "host.docker.internal:host-gateway"
volumes:
- open-webui:/app/backend/data
depends_on:
model-setup:
condition: service_completed_successfully
model-setup:
image: docker:cli
volumes:
- /var/run/docker.sock:/var/run/docker.sock
command: >
sh -c "
docker model pull ai/llama3.2 &&
docker model pull ai/qwen2.5-coder &&
docker model pull ai/smollm2
"
volumes:
open-webui:
```
## Enabling authentication
For multi-user setups or security, enable authentication:
```yaml
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://host.docker.internal:12434
- WEBUI_AUTH=true
extra_hosts:
- "host.docker.internal:host-gateway"
volumes:
- open-webui:/app/backend/data
volumes:
open-webui:
```
On first visit, you'll create an admin account.
## Troubleshooting
### Models don't appear in the drop-down
1. Verify Docker Model Runner is accessible:
```console
$ curl http://localhost:12434/api/tags
```
2. Check that models are pulled:
```console
$ docker model list
```
3. Verify the `OLLAMA_BASE_URL` is correct and accessible from the container.
### "Connection refused" errors
1. Ensure TCP access is enabled for Docker Model Runner.
2. On Docker Desktop, verify `host.docker.internal` resolves:
```console
$ docker run --rm alpine ping -c 1 host.docker.internal
```
3. On Docker Engine, try using `network_mode: host` or the explicit host IP.
### Slow response times
1. First requests load the model into memory, which takes time.
2. Subsequent requests are much faster.
3. If consistently slow, consider:
- Using a smaller model
- Reducing context size
- Checking GPU acceleration is working
### CORS errors
If running Open WebUI on a different host:
1. In Docker Desktop, go to Settings > AI
2. Add the Open WebUI URL to **CORS Allowed Origins**
## Customization
### Custom system prompts
Open WebUI supports setting system prompts per model. Configure these in the UI under Settings > Models.
### Model parameters
Adjust model parameters in the chat interface:
1. Select the settings icon next to the model name
2. Adjust temperature, top-p, max tokens, etc.
These settings are passed through to Docker Model Runner.
## Running on a different port
To run Open WebUI on a different port:
```yaml
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
ports:
- "8080:8080" # Change first port number
# ... rest of config
```
## What's next
- [API reference](api-reference.md) - Learn about the APIs Open WebUI uses
- [Configuration options](configuration.md) - Tune model behavior
- [IDE integrations](ide-integrations.md) - Connect other tools to DMR