Add Docker Model Runner documentation
For configuration, IDE integrations, inference engines, and Open WebUI.

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
@@ -77,7 +77,7 @@ Common configuration options include:

> as small as feasible for your specific needs.

- `runtime_flags`: A list of raw command-line flags passed to the inference engine when the model is started.

For example, if you use llama.cpp, you can pass any of [the available parameters](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).

See [Configuration options](/manuals/ai/model-runner/configuration.md) for commonly used parameters and examples.

- Platform-specific options may also be available via extension attributes `x-*`

> [!TIP]
@@ -364,5 +364,7 @@ services:

- [`models` top-level element](/reference/compose-file/models.md)
- [`models` attribute](/reference/compose-file/services.md#models)
- [Docker Model Runner documentation](/manuals/ai/model-runner.md)
- [Compose Model Runner documentation](/manuals/ai/compose/models-and-compose.md)
- [Docker Model Runner documentation](/manuals/ai/model-runner/_index.md)
- [Configuration options](/manuals/ai/model-runner/configuration.md) - Context size and runtime parameters
- [Inference engines](/manuals/ai/model-runner/inference-engines.md) - llama.cpp and vLLM details
- [API reference](/manuals/ai/model-runner/api-reference.md) - OpenAI and Ollama-compatible APIs
@@ -6,7 +6,7 @@ params:
group: AI
weight: 30
description: Learn how to use Docker Model Runner to manage and run AI models.
keywords: Docker, ai, model runner, docker desktop, docker engine, llm, openai, llama.cpp, vllm, cpu, nvidia, cuda, amd, rocm, vulkan
keywords: Docker, ai, model runner, docker desktop, docker engine, llm, openai, ollama, llama.cpp, vllm, cpu, nvidia, cuda, amd, rocm, vulkan, cline, continue, cursor
aliases:
- /desktop/features/model-runner/
- /model-runner/

@@ -21,7 +21,7 @@ large language models (LLMs) and other AI models directly from Docker Hub or any
OCI-compliant registry.

With seamless integration into Docker Desktop and Docker
Engine, you can serve models via OpenAI-compatible APIs, package GGUF files as
Engine, you can serve models via OpenAI and Ollama-compatible APIs, package GGUF files as
OCI Artifacts, and interact with models from both the command line and graphical
interface.

@@ -33,10 +33,13 @@ with AI models locally.
## Key features

- [Pull and push models to and from Docker Hub](https://hub.docker.com/u/ai)
- Serve models on OpenAI-compatible APIs for easy integration with existing apps
- Support for both llama.cpp and vLLM inference engines (vLLM currently supported on Linux x86_64/amd64 with NVIDIA GPUs only)
- Serve models on [OpenAI and Ollama-compatible APIs](api-reference.md) for easy integration with existing apps
- Support for both [llama.cpp and vLLM inference engines](inference-engines.md) (vLLM on Linux x86_64/amd64 and Windows WSL2 with NVIDIA GPUs)
- Package GGUF and Safetensors files as OCI Artifacts and publish them to any Container Registry
- Run and interact with AI models directly from the command line or from the Docker Desktop GUI
- [Connect to AI coding tools](ide-integrations.md) like Cline, Continue, Cursor, and Aider
- [Configure context size and model parameters](configuration.md) to tune performance
- [Set up Open WebUI](openwebui-integration.md) for a ChatGPT-like web interface
- Manage local models and display logs
- Display prompt and response details
- Conversational context support for multi-turn interactions
@@ -82,9 +85,28 @@ locally. They load into memory only at runtime when a request is made, and
unload when not in use to optimize resources. Because models can be large, the
initial pull may take some time. After that, they're cached locally for faster
access. You can interact with the model using
[OpenAI-compatible APIs](api-reference.md).
[OpenAI and Ollama-compatible APIs](api-reference.md).

Docker Model Runner supports both [llama.cpp](https://github.com/ggerganov/llama.cpp) and [vLLM](https://github.com/vllm-project/vllm) as inference engines, providing flexibility for different model formats and performance requirements. For more details, see the [Docker Model Runner repository](https://github.com/docker/model-runner).

### Inference engines

Docker Model Runner supports two inference engines:

| Engine | Best for | Model format |
|--------|----------|--------------|
| [llama.cpp](inference-engines.md#llamacpp) | Local development, resource efficiency | GGUF (quantized) |
| [vLLM](inference-engines.md#vllm) | Production, high throughput | Safetensors |

llama.cpp is the default engine and works on all platforms. vLLM requires NVIDIA GPUs and is supported on Linux x86_64 and Windows with WSL2. See [Inference engines](inference-engines.md) for a detailed comparison and setup.

### Context size

Models have a configurable context size (context length) that determines how many tokens they can process. The default varies by model but is typically 2,048-8,192 tokens. You can adjust this per model:

```console
$ docker model configure --context-size 8192 ai/qwen2.5-coder
```

See [Configuration options](configuration.md) for details on context size and other parameters.

> [!TIP]
>
@@ -120,4 +142,9 @@ Thanks for trying out Docker Model Runner. To report bugs or request features, [

## Next steps

[Get started with DMR](get-started.md)
- [Get started with DMR](get-started.md) - Enable DMR and run your first model
- [API reference](api-reference.md) - OpenAI and Ollama-compatible API documentation
- [Configuration options](configuration.md) - Context size and runtime parameters
- [Inference engines](inference-engines.md) - llama.cpp and vLLM details
- [IDE integrations](ide-integrations.md) - Connect Cline, Continue, Cursor, and more
- [Open WebUI integration](openwebui-integration.md) - Set up a web chat interface
@@ -1,30 +1,37 @@
|
||||
---
|
||||
title: DMR REST API
|
||||
description: Reference documentation for the Docker Model Runner REST API endpoints and usage examples.
|
||||
description: Reference documentation for the Docker Model Runner REST API endpoints, including OpenAI and Ollama compatibility.
|
||||
weight: 30
|
||||
keywords: Docker, ai, model runner, rest api, openai, endpoints, documentation
|
||||
keywords: Docker, ai, model runner, rest api, openai, ollama, endpoints, documentation, cline, continue, cursor
|
||||
---
|
||||
|
||||
Once Model Runner is enabled, new API endpoints are available. You can use
|
||||
these endpoints to interact with a model programmatically.
|
||||
these endpoints to interact with a model programmatically. Docker Model Runner
|
||||
provides compatibility with both OpenAI and Ollama API formats.
|
||||
|
||||
### Determine the base URL
|
||||
## Determine the base URL
|
||||
|
||||
The base URL to interact with the endpoints depends
|
||||
on how you run Docker:
|
||||
The base URL to interact with the endpoints depends on how you run Docker and
|
||||
which API format you're using.
|
||||
|
||||
{{< tabs >}}
|
||||
{{< tab name="Docker Desktop">}}
|
||||
|
||||
- From containers: `http://model-runner.docker.internal/`
|
||||
- From host processes: `http://localhost:12434/`, assuming TCP host access is
|
||||
enabled on the default port (12434).
|
||||
| Access from | Base URL |
|
||||
|-------------|----------|
|
||||
| Containers | `http://model-runner.docker.internal` |
|
||||
| Host processes (TCP) | `http://localhost:12434` |
|
||||
|
||||
> [!NOTE]
|
||||
> TCP host access must be enabled. See [Enable Docker Model Runner](get-started.md#enable-docker-model-runner-in-docker-desktop).
|
||||
|
||||
{{< /tab >}}
|
||||
{{< tab name="Docker Engine">}}
|
||||
|
||||
- From containers: `http://172.17.0.1:12434/` (with `172.17.0.1` representing the host gateway address)
|
||||
- From host processes: `http://localhost:12434/`
|
||||
| Access from | Base URL |
|
||||
|-------------|----------|
|
||||
| Containers | `http://172.17.0.1:12434` |
|
||||
| Host processes | `http://localhost:12434` |
|
||||
|
||||
> [!NOTE]
|
||||
> The `172.17.0.1` interface may not be available by default to containers
|
||||
@@ -35,77 +42,139 @@ on how you run Docker:
|
||||
> extra_hosts:
|
||||
> - "model-runner.docker.internal:host-gateway"
|
||||
> ```
|
||||
> Then you can access the Docker Model Runner APIs at http://model-runner.docker.internal:12434/
|
||||
> Then you can access the Docker Model Runner APIs at `http://model-runner.docker.internal:12434/`
|
||||
|
||||
{{< /tab >}}
|
||||
{{</tabs >}}
|
||||
|
||||
### Available DMR endpoints
|
||||
### Base URLs for third-party tools
|
||||
|
||||
- Create a model:
|
||||
When configuring third-party tools that expect OpenAI-compatible APIs, use these base URLs:
|
||||
|
||||
```text
|
||||
POST /models/create
|
||||
```
|
||||
| Tool type | Base URL format |
|
||||
|-----------|-----------------|
|
||||
| OpenAI SDK / clients | `http://localhost:12434/engines/v1` |
|
||||
| Ollama-compatible clients | `http://localhost:12434` |
|
||||
|
||||
- List models:
|
||||
See [IDE and tool integrations](ide-integrations.md) for specific configuration examples.
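To quickly verify that the OpenAI-compatible base URL is reachable from your host, you can list the available models. This is a minimal check and assumes TCP host access is enabled on the default port 12434:

```console
$ curl http://localhost:12434/engines/v1/models
```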
|
||||
|
||||
```text
|
||||
GET /models
|
||||
```
|
||||
## Supported APIs
|
||||
|
||||
- Get a model:
|
||||
Docker Model Runner supports multiple API formats:
|
||||
|
||||
```text
|
||||
GET /models/{namespace}/{name}
|
||||
```
|
||||
| API | Description | Use case |
|
||||
|-----|-------------|----------|
|
||||
| [OpenAI API](#openai-compatible-api) | OpenAI-compatible chat completions, embeddings | Most AI frameworks and tools |
|
||||
| [Ollama API](#ollama-compatible-api) | Ollama-compatible endpoints | Tools built for Ollama |
|
||||
| [DMR API](#dmr-native-endpoints) | Native Docker Model Runner endpoints | Model management |
|
||||
|
||||
- Delete a local model:
|
||||
## OpenAI-compatible API
|
||||
|
||||
```text
|
||||
DELETE /models/{namespace}/{name}
|
||||
```
|
||||
DMR implements the OpenAI API specification for maximum compatibility with existing tools and frameworks.
|
||||
|
||||
### Available OpenAI endpoints
|
||||
### Endpoints
|
||||
|
||||
DMR supports the following OpenAI endpoints:
|
||||
|
||||
- [List models](https://platform.openai.com/docs/api-reference/models/list):
|
||||
|
||||
```text
|
||||
GET /engines/llama.cpp/v1/models
|
||||
```
|
||||
|
||||
- [Retrieve model](https://platform.openai.com/docs/api-reference/models/retrieve):
|
||||
|
||||
```text
|
||||
GET /engines/llama.cpp/v1/models/{namespace}/{name}
|
||||
```
|
||||
|
||||
- [List chat completions](https://platform.openai.com/docs/api-reference/chat/list):
|
||||
|
||||
```text
|
||||
POST /engines/llama.cpp/v1/chat/completions
|
||||
```
|
||||
|
||||
- [Create completions](https://platform.openai.com/docs/api-reference/completions/create):
|
||||
|
||||
```text
|
||||
POST /engines/llama.cpp/v1/completions
|
||||
```
|
||||
|
||||
|
||||
- [Create embeddings](https://platform.openai.com/docs/api-reference/embeddings/create):
|
||||
|
||||
```text
|
||||
POST /engines/llama.cpp/v1/embeddings
|
||||
```
|
||||
|
||||
To call these endpoints via a Unix socket (`/var/run/docker.sock`), prefix their path
|
||||
with `/exp/vDD4.40`.
|
||||
| Endpoint | Method | Description |
|
||||
|----------|--------|-------------|
|
||||
| `/engines/v1/models` | GET | [List models](https://platform.openai.com/docs/api-reference/models/list) |
|
||||
| `/engines/v1/models/{namespace}/{name}` | GET | [Retrieve model](https://platform.openai.com/docs/api-reference/models/retrieve) |
|
||||
| `/engines/v1/chat/completions` | POST | [Create chat completion](https://platform.openai.com/docs/api-reference/chat/create) |
|
||||
| `/engines/v1/completions` | POST | [Create completion](https://platform.openai.com/docs/api-reference/completions/create) |
|
||||
| `/engines/v1/embeddings` | POST | [Create embeddings](https://platform.openai.com/docs/api-reference/embeddings/create) |
|
||||
|
||||
> [!NOTE]
|
||||
> You can omit `llama.cpp` from the path. For example: `POST /engines/v1/chat/completions`.
|
||||
> You can optionally include the engine name in the path: `/engines/llama.cpp/v1/chat/completions`.
|
||||
> This is useful when running multiple inference engines.
|
||||
|
||||
### Model name format
|
||||
|
||||
When specifying a model in API requests, use the full model identifier including the namespace:
|
||||
|
||||
```json
|
||||
{
|
||||
"model": "ai/smollm2",
|
||||
"messages": [...]
|
||||
}
|
||||
```
|
||||
|
||||
Common model name formats:
|
||||
- Docker Hub models: `ai/smollm2`, `ai/llama3.2`, `ai/qwen2.5-coder`
|
||||
- Tagged versions: `ai/smollm2:360M-Q4_K_M`
|
||||
- Custom models: `myorg/mymodel`
|
||||
|
||||
### Supported parameters
|
||||
|
||||
The following OpenAI API parameters are supported:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `model` | string | Required. The model identifier. |
|
||||
| `messages` | array | Required for chat completions. The conversation history. |
|
||||
| `prompt` | string | Required for completions. The prompt text. |
|
||||
| `max_tokens` | integer | Maximum tokens to generate. |
|
||||
| `temperature` | float | Sampling temperature (0.0-2.0). |
|
||||
| `top_p` | float | Nucleus sampling parameter (0.0-1.0). |
|
||||
| `stream` | boolean | Enable streaming responses. |
|
||||
| `stop` | string/array | Stop sequences. |
|
||||
| `presence_penalty` | float | Presence penalty (-2.0 to 2.0). |
|
||||
| `frequency_penalty` | float | Frequency penalty (-2.0 to 2.0). |
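For example, the following request combines several of these parameters in a single chat completion. This is an illustrative sketch; the model name and parameter values are placeholders you can adjust:

```bash
curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [
      {"role": "user", "content": "Summarize the fall of Rome in two sentences."}
    ],
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 200,
    "stream": false
  }'
```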
|
||||
|
||||
### Limitations and differences from OpenAI
|
||||
|
||||
Be aware of these differences when using DMR's OpenAI-compatible API:
|
||||
|
||||
| Feature | DMR behavior |
|
||||
|---------|--------------|
|
||||
| API key | Not required. DMR ignores the `Authorization` header. |
|
||||
| Function calling | Supported with llama.cpp for compatible models. |
|
||||
| Vision | Supported for multi-modal models (e.g., LLaVA). |
|
||||
| JSON mode | Supported via `response_format: {"type": "json_object"}`. |
|
||||
| Logprobs | Supported. |
|
||||
| Token counting | Uses the model's native token encoder, which may differ from OpenAI's. |
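For example, to request JSON output as described in the table above, set `response_format` in the request body. This is a sketch; how reliably the output is valid JSON depends on the model you use:

```bash
curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "response_format": {"type": "json_object"},
    "messages": [
      {"role": "user", "content": "Return a JSON object with fields name and role for a fictional user."}
    ]
  }'
```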
|
||||
|
||||
## Ollama-compatible API
|
||||
|
||||
DMR also provides Ollama-compatible endpoints for tools and frameworks built for Ollama.
|
||||
|
||||
### Endpoints
|
||||
|
||||
| Endpoint | Method | Description |
|
||||
|----------|--------|-------------|
|
||||
| `/api/tags` | GET | List available models |
|
||||
| `/api/show` | POST | Show model information |
|
||||
| `/api/chat` | POST | Generate chat completion |
|
||||
| `/api/generate` | POST | Generate completion |
|
||||
| `/api/embeddings` | POST | Generate embeddings |
|
||||
|
||||
### Example: Chat with Ollama API
|
||||
|
||||
```bash
|
||||
curl http://localhost:12434/api/chat \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "ai/smollm2",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Hello!"}
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
### Example: List models
|
||||
|
||||
```bash
|
||||
curl http://localhost:12434/api/tags
|
||||
```
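
### Example: Generate a completion

The `/api/generate` endpoint takes a prompt rather than a message list. A minimal sketch, assuming the standard Ollama request format:

```bash
curl http://localhost:12434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "prompt": "Write a haiku about containers."
  }'
```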
|
||||
|
||||
## DMR native endpoints
|
||||
|
||||
These endpoints are specific to Docker Model Runner for model management:
|
||||
|
||||
| Endpoint | Method | Description |
|
||||
|----------|--------|-------------|
|
||||
| `/models/create` | POST | Pull/create a model |
|
||||
| `/models` | GET | List local models |
|
||||
| `/models/{namespace}/{name}` | GET | Get model details |
|
||||
| `/models/{namespace}/{name}` | DELETE | Delete a local model |
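For example, the following commands list local models and delete one, using the TCP endpoint from the host (assuming TCP host access is enabled; the model name is a placeholder):

```bash
curl http://localhost:12434/models

curl -X DELETE http://localhost:12434/models/ai/smollm2
```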
|
||||
|
||||
## REST API examples
|
||||
|
||||
@@ -116,7 +185,7 @@ To call the `chat/completions` OpenAI endpoint from within another container usi
|
||||
```bash
|
||||
#!/bin/sh
|
||||
|
||||
curl http://model-runner.docker.internal/engines/llama.cpp/v1/chat/completions \
|
||||
curl http://model-runner.docker.internal/engines/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "ai/smollm2",
|
||||
@@ -149,21 +218,21 @@ To call the `chat/completions` OpenAI endpoint from the host via TCP:
|
||||
```bash
|
||||
#!/bin/sh
|
||||
|
||||
curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "ai/smollm2",
|
||||
"messages": [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a helpful assistant."
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Please write 500 words about the fall of Rome."
|
||||
}
|
||||
]
|
||||
}'
|
||||
curl http://localhost:12434/engines/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "ai/smollm2",
|
||||
"messages": [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a helpful assistant."
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Please write 500 words about the fall of Rome."
|
||||
}
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
### Request from the host using a Unix socket
|
||||
@@ -174,7 +243,7 @@ To call the `chat/completions` OpenAI endpoint through the Docker socket from th
|
||||
#!/bin/sh
|
||||
|
||||
curl --unix-socket $HOME/.docker/run/docker.sock \
|
||||
localhost/exp/vDD4.40/engines/llama.cpp/v1/chat/completions \
|
||||
localhost/exp/vDD4.40/engines/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "ai/smollm2",
|
||||
@@ -190,3 +259,65 @@ curl --unix-socket $HOME/.docker/run/docker.sock \
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
### Streaming responses
|
||||
|
||||
To receive streaming responses, set `stream: true`:
|
||||
|
||||
```bash
|
||||
curl http://localhost:12434/engines/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "ai/smollm2",
|
||||
"stream": true,
|
||||
"messages": [
|
||||
{"role": "user", "content": "Count from 1 to 10"}
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
## Using with OpenAI SDKs
|
||||
|
||||
### Python
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI(
|
||||
base_url="http://localhost:12434/engines/v1",
|
||||
api_key="not-needed" # DMR doesn't require an API key
|
||||
)
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="ai/smollm2",
|
||||
messages=[
|
||||
{"role": "user", "content": "Hello!"}
|
||||
]
|
||||
)
|
||||
|
||||
print(response.choices[0].message.content)
|
||||
```
|
||||
|
||||
### Node.js
|
||||
|
||||
```javascript
|
||||
import OpenAI from 'openai';
|
||||
|
||||
const client = new OpenAI({
|
||||
baseURL: 'http://localhost:12434/engines/v1',
|
||||
apiKey: 'not-needed',
|
||||
});
|
||||
|
||||
const response = await client.chat.completions.create({
|
||||
model: 'ai/smollm2',
|
||||
messages: [{ role: 'user', content: 'Hello!' }],
|
||||
});
|
||||
|
||||
console.log(response.choices[0].message.content);
|
||||
```
|
||||
|
||||
## What's next
|
||||
|
||||
- [IDE and tool integrations](ide-integrations.md) - Configure Cline, Continue, Cursor, and other tools
|
||||
- [Configuration options](configuration.md) - Adjust context size and runtime parameters
|
||||
- [Inference engines](inference-engines.md) - Learn about llama.cpp and vLLM options
|
||||
|
||||
305 content/manuals/ai/model-runner/configuration.md Normal file
@@ -0,0 +1,305 @@
|
||||
---
|
||||
title: Configuration options
|
||||
description: Configure context size, runtime parameters, and model behavior in Docker Model Runner.
|
||||
weight: 35
|
||||
keywords: Docker, ai, model runner, configuration, context size, context length, tokens, llama.cpp, parameters
|
||||
---
|
||||
|
||||
Docker Model Runner provides several configuration options to tune model behavior,
|
||||
memory usage, and inference performance. This guide covers the key settings and
|
||||
how to apply them.
|
||||
|
||||
## Context size (context length)
|
||||
|
||||
The context size determines the maximum number of tokens a model can process in
|
||||
a single request, including both the input prompt and generated output. This is
|
||||
one of the most important settings affecting memory usage and model capabilities.
|
||||
|
||||
### Default context size
|
||||
|
||||
By default, Docker Model Runner uses a context size that balances capability with
|
||||
resource efficiency:
|
||||
|
||||
| Engine | Default behavior |
|
||||
|--------|------------------|
|
||||
| llama.cpp | 4096 tokens |
|
||||
| vLLM | Uses the model's maximum trained context size |
|
||||
|
||||
> [!NOTE]
|
||||
> The actual default varies by model. Most models support between 2,048 and 8,192
|
||||
> tokens by default. Some newer models support 32K, 128K, or even larger contexts.
|
||||
|
||||
### Configure context size
|
||||
|
||||
You can adjust context size per model using the `docker model configure` command:
|
||||
|
||||
```console
|
||||
$ docker model configure --context-size 8192 ai/qwen2.5-coder
|
||||
```
|
||||
|
||||
Or in a Compose file:
|
||||
|
||||
```yaml
|
||||
models:
|
||||
llm:
|
||||
model: ai/qwen2.5-coder
|
||||
context_size: 8192
|
||||
```
|
||||
|
||||
### Context size guidelines
|
||||
|
||||
| Context size | Typical use case | Memory impact |
|
||||
|--------------|------------------|---------------|
|
||||
| 2,048 | Simple queries, short code snippets | Low |
|
||||
| 4,096 | Standard conversations, medium code files | Moderate |
|
||||
| 8,192 | Long conversations, larger code files | Higher |
|
||||
| 16,384+ | Extended documents, multi-file context | High |
|
||||
|
||||
> [!IMPORTANT]
|
||||
> Larger context sizes require more memory (RAM/VRAM). If you experience out-of-memory
|
||||
> errors, reduce the context size. As a rough guide, each additional 1,000 tokens
|
||||
> requires approximately 100-500 MB of additional memory, depending on the model size.
|
||||
|
||||
### Check a model's maximum context
|
||||
|
||||
To see a model's configuration including context size:
|
||||
|
||||
```console
|
||||
$ docker model inspect ai/qwen2.5-coder
|
||||
```
|
||||
|
||||
> [!NOTE]
|
||||
> The `docker model inspect` command shows the model's maximum supported context length
|
||||
> (e.g., `gemma3.context_length`), not the configured context size. The configured context
|
||||
> size is what you set with `docker model configure --context-size` and represents the
|
||||
> actual limit used during inference, which should be less than or equal to the model's
|
||||
> maximum supported context length.
|
||||
|
||||
## Runtime flags
|
||||
|
||||
Runtime flags let you pass parameters directly to the underlying inference engine.
|
||||
This provides fine-grained control over model behavior.
|
||||
|
||||
### Using runtime flags
|
||||
|
||||
Runtime flags can be provided through multiple mechanisms:
|
||||
|
||||
#### Using Docker Compose
|
||||
|
||||
In a Compose file:
|
||||
|
||||
```yaml
|
||||
models:
|
||||
llm:
|
||||
model: ai/qwen2.5-coder
|
||||
context_size: 4096
|
||||
runtime_flags:
|
||||
- "--temp"
|
||||
- "0.7"
|
||||
- "--top-p"
|
||||
- "0.9"
|
||||
```
|
||||
|
||||
#### Using the command line
|
||||
|
||||
With the `docker model configure` command:
|
||||
|
||||
```console
|
||||
$ docker model configure --runtime-flag "--temp" --runtime-flag "0.7" --runtime-flag "--top-p" --runtime-flag "0.9" ai/qwen2.5-coder
|
||||
```
|
||||
|
||||
### Common llama.cpp parameters
|
||||
|
||||
These are the most commonly used llama.cpp parameters; for typical use cases you won't need to consult the llama.cpp documentation.
|
||||
|
||||
#### Sampling parameters
|
||||
|
||||
| Flag | Description | Default | Range |
|
||||
|------|-------------|---------|-------|
|
||||
| `--temp` | Temperature for sampling. Lower = more deterministic, higher = more creative | 0.8 | 0.0-2.0 |
|
||||
| `--top-k` | Limit sampling to top K tokens. Lower = more focused | 40 | 1-100 |
|
||||
| `--top-p` | Nucleus sampling threshold. Lower = more focused | 0.9 | 0.0-1.0 |
|
||||
| `--min-p` | Minimum probability threshold | 0.05 | 0.0-1.0 |
|
||||
| `--repeat-penalty` | Penalty for repeating tokens | 1.1 | 1.0-2.0 |
|
||||
|
||||
**Example: Deterministic output (for code generation)**
|
||||
|
||||
```yaml
|
||||
runtime_flags:
|
||||
- "--temp"
|
||||
- "0"
|
||||
- "--top-k"
|
||||
- "1"
|
||||
```
|
||||
|
||||
**Example: Creative output (for storytelling)**
|
||||
|
||||
```yaml
|
||||
runtime_flags:
|
||||
- "--temp"
|
||||
- "1.2"
|
||||
- "--top-p"
|
||||
- "0.95"
|
||||
```
|
||||
|
||||
#### Performance parameters
|
||||
|
||||
| Flag | Description | Default | Notes |
|
||||
|------|-------------|---------|-------|
|
||||
| `--threads` | CPU threads for generation | Auto | Set to number of performance cores |
|
||||
| `--threads-batch` | CPU threads for batch processing | Auto | Usually same as `--threads` |
|
||||
| `--batch-size` | Batch size for prompt processing | 512 | Higher = faster prompt processing |
|
||||
| `--mlock` | Lock model in memory | Off | Prevents swapping, requires sufficient RAM |
|
||||
| `--no-mmap` | Disable memory mapping | Off | May improve performance on some systems |
|
||||
|
||||
**Example: Optimized for multi-core CPU**
|
||||
|
||||
```yaml
|
||||
runtime_flags:
|
||||
- "--threads"
|
||||
- "8"
|
||||
- "--batch-size"
|
||||
- "1024"
|
||||
```
|
||||
|
||||
#### GPU parameters
|
||||
|
||||
| Flag | Description | Default | Notes |
|
||||
|------|-------------|---------|-------|
|
||||
| `--n-gpu-layers` | Layers to offload to GPU | All (if GPU available) | Reduce if running out of VRAM |
|
||||
| `--main-gpu` | GPU to use for computation | 0 | For multi-GPU systems |
|
||||
| `--split-mode` | How to split across GPUs | layer | Options: `none`, `layer`, `row` |
|
||||
|
||||
**Example: Partial GPU offload (limited VRAM)**
|
||||
|
||||
```yaml
|
||||
runtime_flags:
|
||||
- "--n-gpu-layers"
|
||||
- "20"
|
||||
```
|
||||
|
||||
#### Advanced parameters
|
||||
|
||||
| Flag | Description | Default |
|
||||
|------|-------------|---------|
|
||||
| `--rope-scaling` | RoPE scaling method | Auto |
|
||||
| `--rope-freq-base` | RoPE base frequency | Model default |
|
||||
| `--rope-freq-scale` | RoPE frequency scale | Model default |
|
||||
| `--no-prefill-assistant` | Disable assistant pre-fill | Off |
|
||||
| `--reasoning-budget` | Token budget for reasoning models | 0 (disabled) |
|
||||
|
||||
### vLLM parameters
|
||||
|
||||
When using the vLLM backend, different parameters are available.
|
||||
|
||||
Use `--hf_overrides` to pass HuggingFace model config overrides as JSON:
|
||||
|
||||
```console
|
||||
$ docker model configure --hf_overrides '{"rope_scaling": {"type": "dynamic", "factor": 2.0}}' ai/model-vllm
|
||||
```
|
||||
|
||||
## Configuration presets
|
||||
|
||||
Here are complete configuration examples for common use cases.
|
||||
|
||||
### Code completion (fast, deterministic)
|
||||
|
||||
```yaml
|
||||
models:
|
||||
coder:
|
||||
model: ai/qwen2.5-coder
|
||||
context_size: 4096
|
||||
runtime_flags:
|
||||
- "--temp"
|
||||
- "0.1"
|
||||
- "--top-k"
|
||||
- "1"
|
||||
- "--batch-size"
|
||||
- "1024"
|
||||
```
|
||||
|
||||
### Chat assistant (balanced)
|
||||
|
||||
```yaml
|
||||
models:
|
||||
assistant:
|
||||
model: ai/llama3.2
|
||||
context_size: 8192
|
||||
runtime_flags:
|
||||
- "--temp"
|
||||
- "0.7"
|
||||
- "--top-p"
|
||||
- "0.9"
|
||||
- "--repeat-penalty"
|
||||
- "1.1"
|
||||
```
|
||||
|
||||
### Creative writing (high temperature)
|
||||
|
||||
```yaml
|
||||
models:
|
||||
writer:
|
||||
model: ai/llama3.2
|
||||
context_size: 8192
|
||||
runtime_flags:
|
||||
- "--temp"
|
||||
- "1.2"
|
||||
- "--top-p"
|
||||
- "0.95"
|
||||
- "--repeat-penalty"
|
||||
- "1.0"
|
||||
```
|
||||
|
||||
### Long document analysis (large context)
|
||||
|
||||
```yaml
|
||||
models:
|
||||
analyzer:
|
||||
model: ai/qwen2.5-coder:14B
|
||||
context_size: 32768
|
||||
runtime_flags:
|
||||
- "--mlock"
|
||||
- "--batch-size"
|
||||
- "2048"
|
||||
```
|
||||
|
||||
### Low memory system
|
||||
|
||||
```yaml
|
||||
models:
|
||||
efficient:
|
||||
model: ai/smollm2:360M-Q4_K_M
|
||||
context_size: 2048
|
||||
runtime_flags:
|
||||
- "--threads"
|
||||
- "4"
|
||||
```
|
||||
|
||||
## Environment-based configuration
|
||||
|
||||
You can also configure models via environment variables in containers:
|
||||
|
||||
| Variable | Description |
|
||||
|----------|-------------|
|
||||
| `LLM_URL` | Auto-injected URL of the model endpoint |
|
||||
| `LLM_MODEL` | Auto-injected model identifier |
|
||||
|
||||
See [Models and Compose](/manuals/ai/compose/models-and-compose.md) for details on how these are populated.
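For illustration, a process inside the service container could call the injected endpoint directly. This is a minimal sketch that assumes `LLM_URL` points at the OpenAI-compatible base URL (ending in `/engines/v1`):

```bash
# LLM_URL and LLM_MODEL are injected into the service container by Compose.
curl "$LLM_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$LLM_MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}"
```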
|
||||
|
||||
## Reset configuration
|
||||
|
||||
Configuration set via `docker model configure` persists until the model is removed.
|
||||
To reset configuration:
|
||||
|
||||
```console
|
||||
$ docker model configure --context-size -1 ai/qwen2.5-coder
|
||||
```
|
||||
|
||||
Using `-1` resets to the default value.
|
||||
|
||||
## What's next
|
||||
|
||||
- [Inference engines](inference-engines.md) - Learn about llama.cpp and vLLM
|
||||
- [API reference](api-reference.md) - API parameters for per-request configuration
|
||||
- [Models and Compose](/manuals/ai/compose/models-and-compose.md) - Configure models in Compose applications
|
||||
@@ -221,6 +221,10 @@ In Docker Desktop, to inspect the requests and responses for each model:
|
||||
|
||||
## Related pages
|
||||
|
||||
- [Interact with your model programmatically](./api-reference.md)
|
||||
- [Models and Compose](../compose/models-and-compose.md)
|
||||
- [Docker Model Runner CLI reference documentation](/reference/cli/docker/model)
|
||||
- [API reference](./api-reference.md) - OpenAI and Ollama-compatible API documentation
|
||||
- [Configuration options](./configuration.md) - Context size and runtime parameters
|
||||
- [Inference engines](./inference-engines.md) - llama.cpp and vLLM details
|
||||
- [IDE integrations](./ide-integrations.md) - Connect Cline, Continue, Cursor, and more
|
||||
- [Open WebUI integration](./openwebui-integration.md) - Set up a web chat interface
|
||||
- [Models and Compose](../compose/models-and-compose.md) - Use models in Compose applications
|
||||
- [Docker Model Runner CLI reference](/reference/cli/docker/model) - Complete CLI documentation
|
||||
283 content/manuals/ai/model-runner/ide-integrations.md Normal file
@@ -0,0 +1,283 @@
|
||||
---
|
||||
title: IDE and tool integrations
|
||||
description: Configure popular AI coding assistants and tools to use Docker Model Runner as their backend.
|
||||
weight: 40
|
||||
keywords: Docker, ai, model runner, cline, continue, cursor, vscode, ide, integration, openai, ollama
|
||||
---
|
||||
|
||||
Docker Model Runner can serve as a local backend for popular AI coding assistants
|
||||
and development tools. This guide shows how to configure common tools to use
|
||||
models running in DMR.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Before configuring any tool:
|
||||
|
||||
1. [Enable Docker Model Runner](get-started.md#enable-docker-model-runner) in Docker Desktop or Docker Engine.
|
||||
2. Enable TCP host access:
|
||||
- Docker Desktop: Enable **host-side TCP support** in Settings > AI, or run:
|
||||
```console
|
||||
$ docker desktop enable model-runner --tcp 12434
|
||||
```
|
||||
- Docker Engine: TCP is enabled by default on port 12434.
|
||||
3. Pull a model:
|
||||
```console
|
||||
$ docker model pull ai/qwen2.5-coder
|
||||
```
|
||||
|
||||
## Cline (VS Code)
|
||||
|
||||
[Cline](https://github.com/cline/cline) is an AI coding assistant for VS Code.
|
||||
|
||||
### Configuration
|
||||
|
||||
1. Open VS Code and go to the Cline extension settings.
|
||||
2. Select **OpenAI Compatible** as the API provider.
|
||||
3. Configure the following settings:
|
||||
|
||||
| Setting | Value |
|
||||
|---------|-------|
|
||||
| Base URL | `http://localhost:12434/engines/v1` |
|
||||
| API Key | `not-needed` (or any placeholder value) |
|
||||
| Model ID | `ai/qwen2.5-coder` (or your preferred model) |
|
||||
|
||||
> [!IMPORTANT]
|
||||
> The base URL must include `/engines/v1` at the end. Do not include a trailing slash.
|
||||
|
||||
### Troubleshooting Cline
|
||||
|
||||
If Cline fails to connect:
|
||||
|
||||
1. Verify DMR is running:
|
||||
```console
|
||||
$ docker model status
|
||||
```
|
||||
|
||||
2. Test the endpoint directly:
|
||||
```console
|
||||
$ curl http://localhost:12434/engines/v1/models
|
||||
```
|
||||
|
||||
3. Check that CORS is configured if running a web-based version:
|
||||
- In Docker Desktop Settings > AI, add your origin to **CORS Allowed Origins**
|
||||
|
||||
## Continue (VS Code / JetBrains)
|
||||
|
||||
[Continue](https://continue.dev) is an open-source AI code assistant that works with VS Code and JetBrains IDEs.
|
||||
|
||||
### Configuration
|
||||
|
||||
Edit your Continue configuration file (`~/.continue/config.json`):
|
||||
|
||||
```json
|
||||
{
|
||||
"models": [
|
||||
{
|
||||
"title": "Docker Model Runner",
|
||||
"provider": "openai",
|
||||
"model": "ai/qwen2.5-coder",
|
||||
"apiBase": "http://localhost:12434/engines/v1",
|
||||
"apiKey": "not-needed"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Using Ollama provider
|
||||
|
||||
Continue also supports the Ollama provider, which works with DMR:
|
||||
|
||||
```json
|
||||
{
|
||||
"models": [
|
||||
{
|
||||
"title": "Docker Model Runner (Ollama)",
|
||||
"provider": "ollama",
|
||||
"model": "ai/qwen2.5-coder",
|
||||
"apiBase": "http://localhost:12434"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## Cursor
|
||||
|
||||
[Cursor](https://cursor.sh) is an AI-powered code editor.
|
||||
|
||||
### Configuration
|
||||
|
||||
1. Open Cursor Settings (Cmd/Ctrl + ,).
|
||||
2. Navigate to **Models** > **OpenAI API Key**.
|
||||
3. Configure:
|
||||
|
||||
| Setting | Value |
|
||||
|---------|-------|
|
||||
| OpenAI API Key | `not-needed` |
|
||||
| Override OpenAI Base URL | `http://localhost:12434/engines/v1` |
|
||||
|
||||
4. In the model drop-down, enter your model name: `ai/qwen2.5-coder`
|
||||
|
||||
> [!NOTE]
|
||||
> Some Cursor features may require models with specific capabilities (e.g., function calling).
|
||||
> Use capable models like `ai/qwen2.5-coder` or `ai/llama3.2` for best results.
|
||||
|
||||
## Zed
|
||||
|
||||
[Zed](https://zed.dev) is a high-performance code editor with AI features.
|
||||
|
||||
### Configuration
|
||||
|
||||
Edit your Zed settings (`~/.config/zed/settings.json`):
|
||||
|
||||
```json
|
||||
{
|
||||
"language_models": {
|
||||
"openai": {
|
||||
"api_url": "http://localhost:12434/engines/v1",
|
||||
"available_models": [
|
||||
{
|
||||
"name": "ai/qwen2.5-coder",
|
||||
"display_name": "Qwen 2.5 Coder (DMR)",
|
||||
"max_tokens": 8192
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Open WebUI
|
||||
|
||||
[Open WebUI](https://github.com/open-webui/open-webui) provides a ChatGPT-like interface for local models.
|
||||
|
||||
See [Open WebUI integration](openwebui-integration.md) for detailed setup instructions.
|
||||
|
||||
## Aider
|
||||
|
||||
[Aider](https://aider.chat) is an AI pair programming tool for the terminal.
|
||||
|
||||
### Configuration
|
||||
|
||||
Set environment variables or use command-line flags:
|
||||
|
||||
```bash
|
||||
export OPENAI_API_BASE=http://localhost:12434/engines/v1
|
||||
export OPENAI_API_KEY=not-needed
|
||||
|
||||
aider --model openai/ai/qwen2.5-coder
|
||||
```
|
||||
|
||||
Or in a single command:
|
||||
|
||||
```console
|
||||
$ aider --openai-api-base http://localhost:12434/engines/v1 \
|
||||
--openai-api-key not-needed \
|
||||
--model openai/ai/qwen2.5-coder
|
||||
```
|
||||
|
||||
## LangChain
|
||||
|
||||
### Python
|
||||
|
||||
```python
|
||||
from langchain_openai import ChatOpenAI
|
||||
|
||||
llm = ChatOpenAI(
|
||||
base_url="http://localhost:12434/engines/v1",
|
||||
api_key="not-needed",
|
||||
model="ai/qwen2.5-coder"
|
||||
)
|
||||
|
||||
response = llm.invoke("Write a hello world function in Python")
|
||||
print(response.content)
|
||||
```
|
||||
|
||||
### JavaScript/TypeScript
|
||||
|
||||
```typescript
|
||||
import { ChatOpenAI } from "@langchain/openai";
|
||||
|
||||
const model = new ChatOpenAI({
|
||||
configuration: {
|
||||
baseURL: "http://localhost:12434/engines/v1",
|
||||
},
|
||||
apiKey: "not-needed",
|
||||
modelName: "ai/qwen2.5-coder",
|
||||
});
|
||||
|
||||
const response = await model.invoke("Write a hello world function");
|
||||
console.log(response.content);
|
||||
```
|
||||
|
||||
## LlamaIndex
|
||||
|
||||
```python
|
||||
from llama_index.llms.openai_like import OpenAILike
|
||||
|
||||
llm = OpenAILike(
|
||||
api_base="http://localhost:12434/engines/v1",
|
||||
api_key="not-needed",
|
||||
model="ai/qwen2.5-coder"
|
||||
)
|
||||
|
||||
response = llm.complete("Write a hello world function")
|
||||
print(response.text)
|
||||
```
|
||||
|
||||
## Common issues
|
||||
|
||||
### "Connection refused" errors
|
||||
|
||||
1. Ensure Docker Model Runner is enabled and running:
|
||||
```console
|
||||
$ docker model status
|
||||
```
|
||||
|
||||
2. Verify TCP access is enabled:
|
||||
```console
|
||||
$ curl http://localhost:12434/engines/v1/models
|
||||
```
|
||||
|
||||
3. Check if another service is using port 12434.
|
||||
|
||||
### "Model not found" errors
|
||||
|
||||
1. Verify the model is pulled:
|
||||
```console
|
||||
$ docker model list
|
||||
```
|
||||
|
||||
2. Use the full model name including namespace (e.g., `ai/qwen2.5-coder`, not just `qwen2.5-coder`).
|
||||
|
||||
### Slow responses or timeouts
|
||||
|
||||
1. For first requests, models need to load into memory. Subsequent requests are faster.
|
||||
|
||||
2. Consider using a smaller model or adjusting the context size:
|
||||
```console
|
||||
$ docker model configure --context-size 4096 ai/qwen2.5-coder
|
||||
```
|
||||
|
||||
3. Check available system resources (RAM, GPU memory).
|
||||
|
||||
### CORS errors (web-based tools)
|
||||
|
||||
If using browser-based tools, add the origin to CORS allowed origins:
|
||||
|
||||
1. Docker Desktop: Settings > AI > CORS Allowed Origins
|
||||
2. Add your tool's URL (e.g., `http://localhost:3000`)
|
||||
|
||||
## Recommended models by use case
|
||||
|
||||
| Use case | Recommended model | Notes |
|
||||
|----------|-------------------|-------|
|
||||
| Code completion | `ai/qwen2.5-coder` | Optimized for coding tasks |
|
||||
| General assistant | `ai/llama3.2` | Good balance of capabilities |
|
||||
| Small/fast | `ai/smollm2` | Low resource usage |
|
||||
| Embeddings | `ai/all-minilm` | For RAG and semantic search |
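For example, an embeddings model such as `ai/all-minilm` can be called through the OpenAI-compatible embeddings endpoint. This is a sketch; pull the model first with `docker model pull ai/all-minilm`:

```bash
curl http://localhost:12434/engines/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/all-minilm",
    "input": "Docker Model Runner serves models locally."
  }'
```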
|
||||
|
||||
## What's next
|
||||
|
||||
- [API reference](api-reference.md) - Full API documentation
|
||||
- [Configuration options](configuration.md) - Tune model behavior
|
||||
- [Open WebUI integration](openwebui-integration.md) - Set up a web interface
|
||||
319 content/manuals/ai/model-runner/inference-engines.md Normal file
@@ -0,0 +1,319 @@
|
||||
---
|
||||
title: Inference engines
|
||||
description: Learn about the llama.cpp and vLLM inference engines in Docker Model Runner.
|
||||
weight: 50
|
||||
keywords: Docker, ai, model runner, llama.cpp, vllm, inference, gguf, safetensors, cuda, gpu
|
||||
---
|
||||
|
||||
Docker Model Runner supports two inference engines: **llama.cpp** and **vLLM**.
|
||||
Each engine has different strengths, supported platforms, and model format
|
||||
requirements. This guide helps you choose the right engine and configure it for
|
||||
your use case.
|
||||
|
||||
## Engine comparison
|
||||
|
||||
| Feature | llama.cpp | vLLM |
|
||||
|---------|-----------|------|
|
||||
| **Model formats** | GGUF | Safetensors, HuggingFace |
|
||||
| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64, Windows (WSL2) |
|
||||
| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only |
|
||||
| **CPU inference** | Yes | No |
|
||||
| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited |
|
||||
| **Memory efficiency** | High (with quantization) | Moderate |
|
||||
| **Throughput** | Good | High (with batching) |
|
||||
| **Best for** | Local development, resource-constrained environments | Production, high throughput |
|
||||
|
||||
## llama.cpp
|
||||
|
||||
[llama.cpp](https://github.com/ggerganov/llama.cpp) is the default inference
|
||||
engine in Docker Model Runner. It's designed for efficient local inference and
|
||||
supports a wide range of hardware configurations.
|
||||
|
||||
### Platform support
|
||||
|
||||
| Platform | GPU support | Notes |
|
||||
|----------|-------------|-------|
|
||||
| macOS (Apple Silicon) | Metal | Automatic GPU acceleration |
|
||||
| Windows (x64) | NVIDIA CUDA | Requires NVIDIA drivers 576.57+ |
|
||||
| Windows (ARM64) | Adreno OpenCL | Qualcomm 6xx series and later |
|
||||
| Linux (x64) | NVIDIA, AMD, Vulkan | Multiple backend options |
|
||||
| Linux | CPU only | Works on any x64/ARM64 system |
|
||||
|
||||
### Model format: GGUF
|
||||
|
||||
llama.cpp uses the GGUF format, which supports efficient quantization for reduced
|
||||
memory usage without significant quality loss.
|
||||
|
||||
#### Quantization levels
|
||||
|
||||
| Quantization | Bits per weight | Memory usage | Quality |
|
||||
|--------------|-----------------|--------------|---------|
|
||||
| Q2_K | ~2.5 | Lowest | Reduced |
|
||||
| Q3_K_M | ~3.5 | Minimal | Acceptable |
|
||||
| Q4_K_M | ~4.5 | Low | Good |
|
||||
| Q5_K_M | ~5.5 | Moderate | Excellent |
|
||||
| Q6_K | ~6.5 | Higher | Excellent |
|
||||
| Q8_0 | 8 | High | Near-original |
|
||||
| F16 | 16 | Highest | Original |
|
||||
|
||||
**Recommended**: Q4_K_M offers the best balance of quality and memory usage for
|
||||
most use cases.
|
||||
|
||||
#### Pulling quantized models
|
||||
|
||||
Models on Docker Hub often include quantization in the tag:
|
||||
|
||||
```console
|
||||
$ docker model pull ai/llama3.2:3B-Q4_K_M
|
||||
```
|
||||
|
||||
### Using llama.cpp
|
||||
|
||||
llama.cpp is the default engine. No special configuration is required:
|
||||
|
||||
```console
|
||||
$ docker model run ai/smollm2
|
||||
```
|
||||
|
||||
To explicitly specify llama.cpp when running models:
|
||||
|
||||
```console
|
||||
$ docker model run ai/smollm2 --backend llama.cpp
|
||||
```
|
||||
|
||||
### llama.cpp API endpoints
|
||||
|
||||
When using llama.cpp, API calls use the llama.cpp engine path:
|
||||
|
||||
```text
|
||||
POST /engines/llama.cpp/v1/chat/completions
|
||||
```
|
||||
|
||||
Or without the engine prefix:
|
||||
|
||||
```text
|
||||
POST /engines/v1/chat/completions
|
||||
```
|
||||
|
||||
## vLLM
|
||||
|
||||
[vLLM](https://github.com/vllm-project/vllm) is a high-performance inference
|
||||
engine optimized for production workloads with high throughput requirements.
|
||||
|
||||
### Platform support
|
||||
|
||||
| Platform | GPU | Support status |
|
||||
|----------|-----|----------------|
|
||||
| Linux x86_64 | NVIDIA CUDA | Supported |
|
||||
| Windows with WSL2 | NVIDIA CUDA | Supported (Docker Desktop 4.54+) |
|
||||
| macOS | - | Not supported |
|
||||
| Linux ARM64 | - | Not supported |
|
||||
| AMD GPUs | - | Not supported |
|
||||
|
||||
> [!IMPORTANT]
|
||||
> vLLM requires an NVIDIA GPU with CUDA support. It does not support CPU-only
|
||||
> inference.
|
||||
|
||||
### Model format: Safetensors
|
||||
|
||||
vLLM works with models in Safetensors format, which is the standard format for
|
||||
HuggingFace models. These models typically use more memory than quantized GGUF
|
||||
models but may offer better quality and faster inference on powerful hardware.
|
||||
|
||||
### Setting up vLLM
|
||||
|
||||
#### Docker Engine (Linux)
|
||||
|
||||
Install the Model Runner with vLLM backend:
|
||||
|
||||
```console
|
||||
$ docker model install-runner --backend vllm --gpu cuda
|
||||
```
|
||||
|
||||
Verify the installation:
|
||||
|
||||
```console
|
||||
$ docker model status
|
||||
Docker Model Runner is running
|
||||
|
||||
Status:
|
||||
llama.cpp: running llama.cpp version: c22473b
|
||||
vllm: running vllm version: 0.11.0
|
||||
```
|
||||
|
||||
#### Docker Desktop (Windows with WSL2)
|
||||
|
||||
1. Ensure you have:
|
||||
- Docker Desktop 4.54 or later
|
||||
- NVIDIA GPU with updated drivers
|
||||
- WSL2 enabled
|
||||
|
||||
2. Install vLLM backend:
|
||||
```console
|
||||
$ docker model install-runner --backend vllm --gpu cuda
|
||||
```
|
||||
|
||||
### Running models with vLLM
|
||||
|
||||
vLLM models are typically tagged with `-vllm` suffix:
|
||||
|
||||
```console
|
||||
$ docker model run ai/smollm2-vllm
|
||||
```
|
||||
|
||||
To specify the vLLM backend explicitly:
|
||||
|
||||
```console
|
||||
$ docker model run ai/model --backend vllm
|
||||
```
|
||||
|
||||
### vLLM API endpoints
|
||||
|
||||
When using vLLM, specify the engine in the API path:
|
||||
|
||||
```text
|
||||
POST /engines/vllm/v1/chat/completions
|
||||
```
|
||||
|
||||
### vLLM configuration
|
||||
|
||||
#### HuggingFace overrides
|
||||
|
||||
Use `--hf_overrides` to pass model configuration overrides:
|
||||
|
||||
```console
|
||||
$ docker model configure --hf_overrides '{"max_model_len": 8192}' ai/model-vllm
|
||||
```
|
||||
|
||||
#### Common vLLM settings
|
||||
|
||||
| Setting | Description | Example |
|
||||
|---------|-------------|---------|
|
||||
| `max_model_len` | Maximum context length | 8192 |
|
||||
| `gpu_memory_utilization` | Fraction of GPU memory to use | 0.9 |
|
||||
| `tensor_parallel_size` | GPUs for tensor parallelism | 2 |
|
||||
|
||||
### vLLM and llama.cpp performance comparison
|
||||
|
||||
| Scenario | Recommended engine |
|
||||
|----------|-------------------|
|
||||
| Single user, local development | llama.cpp |
|
||||
| Multiple concurrent requests | vLLM |
|
||||
| Limited GPU memory | llama.cpp (with quantization) |
|
||||
| Maximum throughput | vLLM |
|
||||
| CPU-only system | llama.cpp |
|
||||
| Apple Silicon Mac | llama.cpp |
|
||||
| Production deployment | vLLM (if hardware supports it) |
|
||||
|
||||
## Running both engines
|
||||
|
||||
You can run both llama.cpp and vLLM simultaneously. Docker Model Runner routes
|
||||
requests to the appropriate engine based on the model or explicit engine selection.
|
||||
|
||||
Check which engines are running:
|
||||
|
||||
```console
|
||||
$ docker model status
|
||||
Docker Model Runner is running
|
||||
|
||||
Status:
|
||||
llama.cpp: running llama.cpp version: c22473b
|
||||
vllm: running vllm version: 0.11.0
|
||||
```
|
||||
|
||||
### Engine-specific API paths
|
||||
|
||||
| Engine | API path |
|
||||
|--------|----------|
|
||||
| llama.cpp | `/engines/llama.cpp/v1/...` |
|
||||
| vLLM | `/engines/vllm/v1/...` |
|
||||
| Auto-select | `/engines/v1/...` |
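For example, to target the vLLM engine explicitly, include `vllm` in the path. This sketch assumes a vLLM-packaged model such as `ai/smollm2-vllm` is already pulled:

```bash
curl http://localhost:12434/engines/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2-vllm",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```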
|
||||
|
||||
## Managing inference engines
|
||||
|
||||
### Install an engine
|
||||
|
||||
```console
|
||||
$ docker model install-runner --backend <engine> [--gpu <type>]
|
||||
```
|
||||
|
||||
Options:
|
||||
- `--backend`: `llama.cpp` or `vllm`
|
||||
- `--gpu`: `cuda`, `rocm`, `vulkan`, or `metal` (depends on platform)
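For example, to install the llama.cpp backend with Vulkan acceleration (an illustrative combination; pick the GPU type that matches your hardware and platform):

```console
$ docker model install-runner --backend llama.cpp --gpu vulkan
```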
|
||||
|
||||
### Reinstall an engine
|
||||
|
||||
```console
|
||||
$ docker model reinstall-runner --backend <engine>
|
||||
```
|
||||
|
||||
### Check engine status
|
||||
|
||||
```console
|
||||
$ docker model status
|
||||
```
|
||||
|
||||
### View engine logs
|
||||
|
||||
```console
|
||||
$ docker model logs
|
||||
```
|
||||
|
||||
## Packaging models for each engine
|
||||
|
||||
### Package a GGUF model (llama.cpp)
|
||||
|
||||
```console
|
||||
$ docker model package --gguf ./model.gguf --push myorg/mymodel:Q4_K_M
|
||||
```
|
||||
|
||||
### Package a Safetensors model (vLLM)
|
||||
|
||||
```console
|
||||
$ docker model package --safetensors ./model/ --push myorg/mymodel-vllm
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### vLLM won't start
|
||||
|
||||
1. Verify NVIDIA GPU is available:
|
||||
```console
|
||||
$ nvidia-smi
|
||||
```
|
||||
|
||||
2. Check Docker has GPU access:
|
||||
```console
|
||||
$ docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
|
||||
```
|
||||
|
||||
3. Verify you're on a supported platform (Linux x86_64 or Windows WSL2).
|
||||
|
||||
### llama.cpp is slow
|
||||
|
||||
1. Ensure GPU acceleration is working (check logs for Metal/CUDA messages).
|
||||
|
||||
2. Try a more aggressive quantization:
|
||||
```console
|
||||
$ docker model pull ai/model:Q4_K_M
|
||||
```
|
||||
|
||||
3. Reduce context size:
|
||||
```console
|
||||
$ docker model configure --context-size 2048 ai/model
|
||||
```
|
||||
|
||||
### Out of memory errors
|
||||
|
||||
1. Use a smaller quantization (Q4 instead of Q8).
|
||||
2. Reduce context size.
|
||||
3. For vLLM, adjust `gpu_memory_utilization`:
|
||||
```console
|
||||
$ docker model configure --hf_overrides '{"gpu_memory_utilization": 0.8}' ai/model
|
||||
```
|
||||
|
||||
## What's next
|
||||
|
||||
- [Configuration options](configuration.md) - Detailed parameter reference
|
||||
- [API reference](api-reference.md) - API documentation
|
||||
- [GPU support](/manuals/desktop/features/gpu.md) - GPU configuration for Docker Desktop
|
||||
293 content/manuals/ai/model-runner/openwebui-integration.md Normal file
@@ -0,0 +1,293 @@
|
||||
---
|
||||
title: Open WebUI integration
|
||||
description: Set up Open WebUI as a ChatGPT-like interface for Docker Model Runner.
|
||||
weight: 45
|
||||
keywords: Docker, ai, model runner, open webui, openwebui, chat interface, ollama, ui
|
||||
---
|
||||
|
||||
[Open WebUI](https://github.com/open-webui/open-webui) is an open-source,
|
||||
self-hosted web interface that provides a ChatGPT-like experience for local
|
||||
AI models. You can connect it to Docker Model Runner to get a polished chat
|
||||
interface for your models.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Docker Model Runner enabled with TCP access
|
||||
- A model pulled (e.g., `docker model pull ai/llama3.2`)
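You can confirm both prerequisites with a quick check from the host (assuming the default TCP port 12434); the response should list your pulled models:

```console
$ curl http://localhost:12434/api/tags
```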
|
||||
|
||||
## Quick start with Docker Compose
|
||||
|
||||
The easiest way to run Open WebUI with Docker Model Runner is using Docker Compose.
|
||||
|
||||
Create a `compose.yaml` file:
|
||||
|
||||
```yaml
|
||||
services:
|
||||
open-webui:
|
||||
image: ghcr.io/open-webui/open-webui:main
|
||||
ports:
|
||||
- "3000:8080"
|
||||
environment:
|
||||
- OLLAMA_BASE_URL=http://host.docker.internal:12434
|
||||
- WEBUI_AUTH=false
|
||||
extra_hosts:
|
||||
- "host.docker.internal:host-gateway"
|
||||
volumes:
|
||||
- open-webui:/app/backend/data
|
||||
|
||||
volumes:
|
||||
open-webui:
|
||||
```
|
||||
|
||||
Start the services:
|
||||
|
||||
```console
|
||||
$ docker compose up -d
|
||||
```
|
||||
|
||||
Open your browser to [http://localhost:3000](http://localhost:3000).
|
||||
|
||||
## Configuration options
|
||||
|
||||
### Environment variables
|
||||
|
||||
| Variable | Description | Default |
|
||||
|----------|-------------|---------|
|
||||
| `OLLAMA_BASE_URL` | URL of Docker Model Runner | Required |
|
||||
| `WEBUI_AUTH` | Enable authentication | `true` |
|
||||
| `OPENAI_API_BASE_URL` | Use OpenAI-compatible API instead | - |
|
||||
| `OPENAI_API_KEY` | API key (use any value for DMR) | - |
|
||||
|
||||
### Using OpenAI-compatible API
|
||||
|
||||
If you prefer to use the OpenAI-compatible API instead of the Ollama API:
|
||||
|
||||
```yaml
|
||||
services:
|
||||
open-webui:
|
||||
image: ghcr.io/open-webui/open-webui:main
|
||||
ports:
|
||||
- "3000:8080"
|
||||
environment:
|
||||
- OPENAI_API_BASE_URL=http://host.docker.internal:12434/engines/v1
|
||||
- OPENAI_API_KEY=not-needed
|
||||
- WEBUI_AUTH=false
|
||||
extra_hosts:
|
||||
- "host.docker.internal:host-gateway"
|
||||
volumes:
|
||||
- open-webui:/app/backend/data
|
||||
|
||||
volumes:
|
||||
open-webui:
|
||||
```
|
||||
|
||||
## Network configuration
|
||||
|
||||
### Docker Desktop
|
||||
|
||||
On Docker Desktop, `host.docker.internal` automatically resolves to the host machine.
|
||||
The previous example works without modification.
|
||||
|
||||
### Docker Engine (Linux)
|
||||
|
||||
On Docker Engine, you may need to configure the network differently:
|
||||
|
||||
```yaml
|
||||
services:
|
||||
open-webui:
|
||||
image: ghcr.io/open-webui/open-webui:main
|
||||
network_mode: host
|
||||
environment:
|
||||
- OLLAMA_BASE_URL=http://localhost:12434
|
||||
- WEBUI_AUTH=false
|
||||
volumes:
|
||||
- open-webui:/app/backend/data
|
||||
|
||||
volumes:
|
||||
open-webui:
|
||||
```
|
||||
|
||||
Or use the host gateway:
|
||||
|
||||
```yaml
|
||||
services:
|
||||
open-webui:
|
||||
image: ghcr.io/open-webui/open-webui:main
|
||||
ports:
|
||||
- "3000:8080"
|
||||
environment:
|
||||
- OLLAMA_BASE_URL=http://172.17.0.1:12434
|
||||
- WEBUI_AUTH=false
|
||||
volumes:
|
||||
- open-webui:/app/backend/data
|
||||
|
||||
volumes:
|
||||
open-webui:
|
||||
```
|
||||
|
||||
## Using Open WebUI
|
||||
|
||||
### Select a model
|
||||
|
||||
1. Open [http://localhost:3000](http://localhost:3000)
|
||||
2. Select the model drop-down in the top-left
|
||||
3. Select from your pulled models (they appear with `ai/` prefix)
|
||||
|
||||
### Pull models through the UI
|
||||
|
||||
Open WebUI can pull models directly:
|
||||
|
||||
1. Select the model drop-down
|
||||
2. Enter a model name: `ai/llama3.2`
|
||||
3. Select the download icon
|
||||
|
||||
### Chat features
|
||||
|
||||
Open WebUI provides:
|
||||
|
||||
- Multi-turn conversations with context
|
||||
- Message editing and regeneration
|
||||
- Code syntax highlighting
|
||||
- Markdown rendering
|
||||
- Conversation history and search
|
||||
- Export conversations
|
||||
|
||||
## Complete example with multiple models
|
||||
|
||||
This example sets up Open WebUI with Docker Model Runner and pre-pulls several models:
|
||||
|
||||
```yaml
|
||||
services:
|
||||
open-webui:
|
||||
image: ghcr.io/open-webui/open-webui:main
|
||||
ports:
|
||||
- "3000:8080"
|
||||
environment:
|
||||
- OLLAMA_BASE_URL=http://host.docker.internal:12434
|
||||
- WEBUI_AUTH=false
|
||||
- DEFAULT_MODELS=ai/llama3.2
|
||||
extra_hosts:
|
||||
- "host.docker.internal:host-gateway"
|
||||
volumes:
|
||||
- open-webui:/app/backend/data
|
||||
depends_on:
|
||||
model-setup:
|
||||
condition: service_completed_successfully
|
||||
|
||||
model-setup:
|
||||
image: docker:cli
|
||||
volumes:
|
||||
- /var/run/docker.sock:/var/run/docker.sock
|
||||
command: >
|
||||
sh -c "
|
||||
docker model pull ai/llama3.2 &&
|
||||
docker model pull ai/qwen2.5-coder &&
|
||||
docker model pull ai/smollm2
|
||||
"
|
||||
|
||||
volumes:
|
||||
open-webui:
|
||||
```
|
||||
|
||||
## Enabling authentication
|
||||
|
||||
For multi-user setups or security, enable authentication:
|
||||
|
||||
```yaml
|
||||
services:
|
||||
open-webui:
|
||||
image: ghcr.io/open-webui/open-webui:main
|
||||
ports:
|
||||
- "3000:8080"
|
||||
environment:
|
||||
- OLLAMA_BASE_URL=http://host.docker.internal:12434
|
||||
- WEBUI_AUTH=true
|
||||
extra_hosts:
|
||||
- "host.docker.internal:host-gateway"
|
||||
volumes:
|
||||
- open-webui:/app/backend/data
|
||||
|
||||
volumes:
|
||||
open-webui:
|
||||
```
|
||||
|
||||
On first visit, you'll create an admin account.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Models don't appear in the drop-down
|
||||
|
||||
1. Verify Docker Model Runner is accessible:
|
||||
```console
|
||||
$ curl http://localhost:12434/api/tags
|
||||
```
|
||||
|
||||
2. Check that models are pulled:
|
||||
```console
|
||||
$ docker model list
|
||||
```
|
||||
|
||||
3. Verify the `OLLAMA_BASE_URL` is correct and accessible from the container.
|
||||
|
||||
### "Connection refused" errors
|
||||
|
||||
1. Ensure TCP access is enabled for Docker Model Runner.
|
||||
|
||||
2. On Docker Desktop, verify `host.docker.internal` resolves:
|
||||
```console
|
||||
$ docker run --rm alpine ping -c 1 host.docker.internal
|
||||
```
|
||||
|
||||
3. On Docker Engine, try using `network_mode: host` or the explicit host IP.
|
||||
|
||||
### Slow response times
|
||||
|
||||
1. First requests load the model into memory, which takes time.
|
||||
|
||||
2. Subsequent requests are much faster.
|
||||
|
||||
3. If consistently slow, consider:
|
||||
- Using a smaller model
|
||||
- Reducing context size
|
||||
- Checking GPU acceleration is working
|
||||
|
||||
### CORS errors
|
||||
|
||||
If running Open WebUI on a different host:
|
||||
|
||||
1. In Docker Desktop, go to Settings > AI
|
||||
2. Add the Open WebUI URL to **CORS Allowed Origins**
|
||||
|
||||
## Customization
|
||||
|
||||
### Custom system prompts
|
||||
|
||||
Open WebUI supports setting system prompts per model. Configure these in the UI under Settings > Models.
|
||||
|
||||
### Model parameters
|
||||
|
||||
Adjust model parameters in the chat interface:
|
||||
|
||||
1. Select the settings icon next to the model name
|
||||
2. Adjust temperature, top-p, max tokens, etc.
|
||||
|
||||
These settings are passed through to Docker Model Runner.
|
||||
|
||||
## Running on a different port
|
||||
|
||||
To run Open WebUI on a different port:
|
||||
|
||||
```yaml
|
||||
services:
|
||||
open-webui:
|
||||
image: ghcr.io/open-webui/open-webui:main
|
||||
ports:
|
||||
- "8080:8080" # Change first port number
|
||||
# ... rest of config
|
||||
```
|
||||
|
||||
## What's next
|
||||
|
||||
- [API reference](api-reference.md) - Learn about the APIs Open WebUI uses
|
||||
- [Configuration options](configuration.md) - Tune model behavior
|
||||
- [IDE integrations](ide-integrations.md) - Connect other tools to DMR
|
||||