Mirror of https://github.com/open-webui/docs.git, synced 2026-03-26 13:18:42 +07:00
docs: performance best practices, scaling, and tool calling legacy/native
- Mark Default (prompt-based) tool calling as legacy; recommend Native Mode
- Document KV cache breaking in Default Mode, system tools only in Native
- Updated comparison table with Status, KV Cache, System Tools rows
- Add Content Extraction Engine section (pypdf memory leak warning, Tika/Docling)
- Add Embedding Engine section (SentenceTransformers RAM warning at scale)
- Add Common Anti-Patterns section with 6 real-world scaling mistakes
- Add Redis Tuning subsection (timeout, maxclients, single instance sufficiency)
- Expand Profile 3 with content extraction, embeddings, tool calling, Redis
- New Step 6 in scaling guide: Fix Content Extraction & Embeddings
- Quick reference table: add Ext. Content Extraction and Ext. Embeddings columns
- Add CONTENT_EXTRACTION_ENGINE and RAG_EMBEDDING_ENGINE to minimum env vars
@@ -89,14 +89,22 @@ You can also let your LLM auto-select the right Tools using the [**AutoTool Filt
Open WebUI offers two distinct ways for models to interact with tools: a standard **Default Mode** and a high-performance **Native Mode (Agentic Mode)**. Choosing the right mode depends on your model's capabilities and your performance requirements.

### 🟡 Default Mode (Prompt-based) — Legacy

:::warning Legacy Mode
Default Mode is maintained purely for **backwards compatibility** with older or smaller models that lack native function-calling support. It is considered **legacy** and should not be used when your model supports native tool calling. New deployments should use **Native Mode** exclusively.
:::

In Default Mode, Open WebUI manages tool selection by injecting a specific prompt template that guides the model to output a tool request.

- **Compatibility**: Works with **practically any model**, including older or smaller local models that lack native function-calling support.
- **Flexibility**: Highly customizable via prompt templates.
- **Caveats**:
  - Can be slower (requires extra tokens) and less reliable for complex, multi-step tool chaining.
  - **Breaks KV cache**: The injected prompt changes every turn, preventing LLM engines from reusing cached key-value pairs. This increases latency and cost for every message in the conversation.
  - Does not support built-in system tools (memory, notes, channels, etc.).

### 🟢 Native Mode (Agentic Mode / System Function Calling) — Recommended

Native Mode (also called **Agentic Mode**) leverages the model's built-in capability to handle tool definitions and return structured tool calls (JSON). This is the **recommended mode** for all models that support it — which includes the vast majority of modern models (2024+).

:::warning Model Quality Matters
**Agentic tool calling requires high-quality models to work reliably.** While small local models may technically support function calling, they often struggle with the complex reasoning required for multi-step tool usage. For best results, use frontier models like **GPT-5**, **Claude 4.5 Sonnet**, **Gemini 3 Flash**, or **MiniMax M2.5**. Small local models may produce malformed JSON or fail to follow the strict state management required for agentic behavior.

@@ -104,9 +112,11 @@ Native Mode (also called **Agentic Mode**) leverages the model's built-in capabi
#### Why use Native Mode (Agentic Mode)?

- **Speed & Efficiency**: Lower latency as it avoids bulky prompt-based tool selection.
- **KV Cache Friendly**: Tool definitions are sent as structured parameters (not injected into the prompt), so they don't invalidate the KV cache between turns. This can significantly reduce latency and token costs.
- **Reliability**: Higher accuracy in following tool schemas (with quality models).
- **Multi-step Chaining**: Essential for **Agentic Research** and **Interleaved Thinking** where a model needs to call multiple tools in succession.
- **Autonomous Decision-Making**: Models can decide when to search, which tools to use, and how to combine results.
- **System Tools**: Only Native Mode unlocks the [built-in system tools](#built-in-system-tools-nativeagentic-mode) (memory, notes, knowledge, channels, etc.).

#### How to Enable Native Mode (Agentic Mode)

Native Mode can be enabled at two levels:

@@ -164,11 +174,14 @@ These models excel at multi-step reasoning, proper JSON formatting, and autonomo
**This is a DeepSeek model/API issue**, not an Open WebUI issue. Open WebUI correctly sends tools in standard OpenAI format — the malformed output originates from DeepSeek's non-standard internal format.
:::

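To make "standard OpenAI format" concrete, a tool definition in that format looks roughly like the sketch below. The tool name and parameters here are hypothetical examples, not an Open WebUI schema:

```
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
      "type": "object",
      "properties": {
        "city": { "type": "string" }
      },
      "required": ["city"]
    }
  }
}
```

In Native Mode, objects like this are passed in the request's `tools` array rather than injected into the prompt text, which is why the KV cache stays valid between turns.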
| Feature | Default Mode (Legacy) | Native Mode (Recommended) |
|:---|:---|:---|
| **Status** | Legacy / backwards compat | ✅ Recommended |
| **Latency** | Medium/High | Low |
| **KV Cache** | ❌ Can break cache | ✅ Cache-friendly |
| **Model Compatibility** | Universal | Requires Tool-Calling Support |
| **Logic** | Prompt-based (Open WebUI) | Model-native (API/Ollama) |
| **System Tools** | ❌ Not available | ✅ Full access |
| **Complex Chaining** | ⚠️ Limited | ✅ Excellent |

### Built-in System Tools (Native/Agentic Mode)

@@ -75,6 +75,7 @@ ENABLE_WEBSOCKET_SUPPORT=true
- If you're using Redis Sentinel for high availability, also set `REDIS_SENTINEL_HOSTS` and consider setting `REDIS_SOCKET_CONNECT_TIMEOUT=5` to prevent hangs during failover.
- For AWS ElastiCache or other managed Redis Cluster services, set `REDIS_CLUSTER=true`.
- Make sure your Redis server has `timeout 1800` and a high enough `maxclients` (10000+) to prevent connection exhaustion over time.
- A **single Redis instance** is sufficient for the vast majority of deployments, even with thousands of users. You almost certainly do not need Redis Cluster unless you have specific HA/bandwidth requirements. If you think you need Redis Cluster, first check whether your connection count and memory usage are caused by fixable configuration issues (see [Common Anti-Patterns](/troubleshooting/performance#%EF%B8%8F-common-anti-patterns)).
- Without Redis in a multi-instance setup, you will experience [WebSocket 403 errors](/troubleshooting/multi-replica#2-websocket-403-errors--connection-failures), [configuration sync issues](/troubleshooting/multi-replica#3-model-not-found-or-configuration-mismatch), and intermittent authentication failures.

For a complete step-by-step Redis setup (Docker Compose, Sentinel, Cluster mode, verification), see the [Redis WebSocket Support](/tutorials/integrations/redis) tutorial. For WebSocket and CORS issues behind reverse proxies, see [Connection Errors](/troubleshooting/connection-error#-https-tls-cors--websocket-issues).
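As a concrete starting point, a single-Redis setup might look like the Docker Compose sketch below. The image tags, service names, and mounted config path are assumptions for illustration; verify the `WEBSOCKET_*` variable names against the environment configuration reference for your version:

```
services:
  redis:
    image: redis:7-alpine
    command: ["redis-server", "/etc/redis/redis.conf"]
    volumes:
      # redis.conf contains: timeout 1800, maxclients 10000
      - ./redis.conf:/etc/redis/redis.conf:ro

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - ENABLE_WEBSOCKET_SUPPORT=true
      - WEBSOCKET_MANAGER=redis
      - WEBSOCKET_REDIS_URL=redis://redis:6379/0
    depends_on:
      - redis
```

A single instance configured this way handles the connection churn from multiple Open WebUI replicas without needing Cluster or Sentinel.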
@@ -232,7 +233,40 @@ Each provider has its own set of environment variables for credentials and bucke
---

## Step 6 — Fix Content Extraction & Embeddings

**When:** You process documents regularly (RAG, knowledge bases) and are running in production.

:::danger These Defaults Cause Memory Leaks at Scale
The default content extraction engine (pypdf) and the default embedding engine (SentenceTransformers) are the **two most common causes of memory leaks** in production Open WebUI deployments. Fixing these is just as important as switching to PostgreSQL or adding Redis.
:::

**What to do:**

1. **Switch the content extraction engine** to an external service:

   ```
   CONTENT_EXTRACTION_ENGINE=tika
   TIKA_SERVER_URL=http://tika:9998
   ```

2. **Switch the embedding engine** to an external provider:

   ```
   RAG_EMBEDDING_ENGINE=openai
   # or for self-hosted:
   RAG_EMBEDDING_ENGINE=ollama
   ```

**Key things to know:**

- The default content extractor (pypdf) has known, unavoidable **memory leaks** that cause the Open WebUI process's memory usage to grow continuously. An external extractor (Tika, Docling) runs in its own process/container, isolating these leaks.
- The default SentenceTransformers embedding model loads ~500MB per worker process. With 8 workers, that's 4GB of RAM just for embeddings. External embedding eliminates this.
- For detailed guidance and configuration options, see [Content Extraction Engine](/troubleshooting/performance#content-extraction-engine) and [Embedding Engine](/troubleshooting/performance#embedding-engine) in the Performance guide.

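Running Tika as a sidecar container is straightforward. A hedged Compose sketch, where the image tag and network layout are assumptions for illustration:

```
services:
  tika:
    image: apache/tika:latest-full
    ports:
      - "9998:9998"

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - CONTENT_EXTRACTION_ENGINE=tika
      - TIKA_SERVER_URL=http://tika:9998
```

You can sanity-check that the Tika container is reachable with `curl http://localhost:9998/tika`, which returns a short server greeting when the service is up.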
---

## Step 7 — Add Observability

**When:** You want to monitor performance, troubleshoot issues, and understand how your deployment is behaving at scale.

@@ -300,6 +334,14 @@ ENABLE_WEBSOCKET_SUPPORT=true
```
# S3_BUCKET_NAME=my-openwebui-bucket
# S3_REGION_NAME=us-east-1

# Content Extraction (do NOT use default pypdf in production)
CONTENT_EXTRACTION_ENGINE=tika
TIKA_SERVER_URL=http://tika:9998

# Embeddings (do NOT use default SentenceTransformers at scale)
RAG_EMBEDDING_ENGINE=openai
# or: RAG_EMBEDDING_ENGINE=ollama

# Workers (let orchestrator scale, keep workers at 1)
UVICORN_WORKERS=1
```

@@ -311,13 +353,13 @@ ENABLE_DB_MIGRATIONS=false
## Quick Reference: When Do I Need What?

| Scenario | PostgreSQL | Redis | External Vector DB | Ext. Content Extraction | Ext. Embeddings | Shared Storage |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| Single user / evaluation | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Small team (< 50 users, single instance) | Recommended | ✗ | ✗ | Recommended | ✗ | ✗ |
| Multiple Uvicorn workers | **Required** | **Required** | **Required** | **Strongly Recommended** | **Strongly Recommended** | ✗ (same filesystem) |
| Multiple instances / HA | **Required** | **Required** | **Required** | **Strongly Recommended** | **Strongly Recommended** | **Optional** (NFS or S3) |
| Large scale (1000+ users) | **Required** | **Required** | **Required** | **Strongly Recommended** | **Strongly Recommended** | **Optional** (NFS or S3) |

:::note About "External Vector DB"
The default ChromaDB uses a local SQLite backend that crashes under multi-process access. "External Vector DB" means either a client-server database (PGVector, Milvus, Qdrant, Pinecone) or ChromaDB running as a separate HTTP server. See [Step 4](#step-4--switch-to-an-external-vector-database) for details.
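For reference, switching to an external vector database is an env-var change. A sketch using Qdrant, where the URL is an assumption for your environment (check the env reference for the exact variable names in your version):

```
VECTOR_DB=qdrant
QDRANT_URI=http://qdrant:6333
# QDRANT_API_KEY=...   # only if your Qdrant instance requires auth
```

The same pattern applies to the other backends (`VECTOR_DB=milvus`, `VECTOR_DB=pgvector`, etc.), each with its own connection variables.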
@@ -134,6 +134,40 @@ For multi-user setups, the choice of Vector DB matters.
* `ENABLE_MILVUS_MULTITENANCY_MODE=True`
* `ENABLE_QDRANT_MULTITENANCY_MODE=True`

### Content Extraction Engine

:::danger Default Content Extractor Causes Memory Leaks
The **default content extraction engine** uses Python libraries including **pypdf**, which are known to have **persistent memory leaks** during document ingestion. In production deployments with regular document uploads, this will cause Open WebUI's memory usage to grow continuously until the process is killed or the container is restarted.

This is the **#1 cause of unexplained memory growth** in production deployments.
:::

**Recommendation**: Switch to an external content extraction engine for any deployment that processes documents regularly:

| Engine | Best For | Configuration |
|---|---|---|
| **Apache Tika** | General-purpose, widely used, handles most document types | `CONTENT_EXTRACTION_ENGINE=tika` + `TIKA_SERVER_URL=http://tika:9998` |
| **Docling** | High-quality extraction with layout-aware parsing | `CONTENT_EXTRACTION_ENGINE=docling` |
| **External Loader** | Recommended for production and custom extraction pipelines | `CONTENT_EXTRACTION_ENGINE=external` + `EXTERNAL_DOCUMENT_LOADER_URL=...` |

Using an external extractor moves the memory-intensive parsing out of the Open WebUI process entirely, eliminating this class of memory leaks.

### Embedding Engine

:::warning SentenceTransformers at Scale
The **default SentenceTransformers** embedding engine (all-MiniLM-L6-v2) loads a machine learning model into the Open WebUI process memory. While lightweight enough for personal use, at scale this model:

- **Consumes significant RAM** (~500MB+ per worker process)
- **Blocks the event loop** during embedding operations on older versions
- **Multiplies with workers** — each Uvicorn worker loads its own copy of the model

For multi-user or production deployments, **offload embeddings to an external service**.
:::

- **Recommended**: Use `RAG_EMBEDDING_ENGINE=openai` (for cloud embeddings via OpenAI, Azure, or compatible APIs) or `RAG_EMBEDDING_ENGINE=ollama` (for self-hosted embedding via Ollama with models like `nomic-embed-text`).
- **Env Var**: `RAG_EMBEDDING_ENGINE=openai`
- **Effect**: The embedding model is no longer loaded into the Open WebUI process, freeing hundreds of MB of RAM per worker.

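For the self-hosted route, a hedged sketch of an Ollama-backed embedding configuration. The Ollama URL is an assumption for your environment; verify the `RAG_OLLAMA_*` variable names against the env reference:

```
RAG_EMBEDDING_ENGINE=ollama
RAG_OLLAMA_BASE_URL=http://ollama:11434
RAG_EMBEDDING_MODEL=nomic-embed-text
```

Pull the embedding model on the Ollama side first (`ollama pull nomic-embed-text`) so the first document upload doesn't stall on a model download.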
### Optimizing Document Chunking

The way your documents are chunked directly impacts both storage efficiency and retrieval quality.

@@ -359,12 +393,43 @@ If resource usage is critical, disable automated features that constantly trigge
*Target: Many concurrent users, Stability > Persistence.*

1. **Database**: **PostgreSQL** (Mandatory).
2. **Content Extraction**: **Tika** or **Docling** (Mandatory — default pypdf leaks memory). See [Content Extraction Engine](#content-extraction-engine).
3. **Embeddings**: **External** — `RAG_EMBEDDING_ENGINE=openai` or `ollama` (Mandatory — default SentenceTransformers consumes too much RAM at scale). See [Embedding Engine](#embedding-engine).
4. **Tool Calling**: **Native Mode** (strongly recommended — Default Mode is legacy and breaks KV cache). See [Tool Calling Modes](/features/extensibility/plugin/tools#tool-calling-modes-default-vs-native).
5. **Workers**: `THREAD_POOL_SIZE=2000` (Prevent timeouts).
6. **Streaming**: `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7` (Reduce CPU/Net/DB writes).
7. **Chat Saving**: `ENABLE_REALTIME_CHAT_SAVE=False`.
8. **Vector DB**: **Milvus**, **Qdrant**, or **PGVector**. **Do not use ChromaDB's default local mode** — its SQLite backend will crash under multi-worker/multi-replica access.
9. **Task Model**: External/Hosted (Offload compute).
10. **Caching**: `ENABLE_BASE_MODELS_CACHE=True`, `MODELS_CACHE_TTL=300`, `ENABLE_QUERIES_CACHE=True`.
11. **Redis**: Single instance with `timeout 1800` and high `maxclients` (10000+). See [Redis Tuning](#redis-tuning) below.

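Pulled together, the Profile 3 checklist above amounts to an environment sketch like this. The database URL and service hostnames are placeholders you must adapt:

```
# Profile 3 sketch (assumes a Tika sidecar at tika:9998 and OpenAI embeddings)
DATABASE_URL=postgresql://user:password@postgres:5432/openwebui
THREAD_POOL_SIZE=2000
CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7
ENABLE_REALTIME_CHAT_SAVE=False
CONTENT_EXTRACTION_ENGINE=tika
TIKA_SERVER_URL=http://tika:9998
RAG_EMBEDDING_ENGINE=openai
ENABLE_BASE_MODELS_CACHE=True
MODELS_CACHE_TTL=300
ENABLE_QUERIES_CACHE=True
```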
#### Redis Tuning

A single Redis instance is sufficient for the vast majority of deployments, including those with thousands of users. **You almost certainly do not need Redis Cluster or Redis Sentinel** unless you have specific HA requirements.

Common Redis configuration issues that cause unnecessary scaling:

| Issue | Symptom | Fix |
|---|---|---|
| **Stale connections** | Redis runs out of connections or memory grows indefinitely | Set `timeout 1800` in redis.conf (kills idle connections after 30 minutes) |
| **Low maxclients** | `max number of clients reached` errors | Set `maxclients 10000` or higher |
| **No connection limits** | Open WebUI pods may accumulate connections that never close | Combine `timeout` with connection pool limits in your Redis client config |

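The two redis.conf settings above are easy to stage and check before deploying. A small sketch (the config path is an assumption; adjust for your setup):

```shell
# Write the recommended settings to a local redis.conf
cat > redis.conf <<'EOF'
timeout 1800
maxclients 10000
EOF

# Verify the settings are present before mounting the file into the container
grep -q '^timeout 1800' redis.conf && echo "timeout OK"
grep -q '^maxclients 10000' redis.conf && echo "maxclients OK"

# Against a running server, you could confirm the live values with:
#   redis-cli CONFIG GET timeout
#   redis-cli CONFIG GET maxclients
```

Mount the resulting file into your Redis container (e.g. at `/etc/redis/redis.conf`) and start the server with it as the config argument.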
---

## ⚠️ Common Anti-Patterns

These are real-world mistakes that cause organizations to massively over-provision infrastructure:

| Anti-Pattern | What Happens | Fix |
|---|---|---|
| **Using default content extractor in production** | pypdf leaks memory → containers restart constantly → you add more replicas to compensate | Switch to Tika or Docling (`CONTENT_EXTRACTION_ENGINE=tika`) |
| **Running SentenceTransformers at scale** | Each worker loads ~500MB embedding model → RAM usage explodes → you add more machines | Use external embeddings (`RAG_EMBEDDING_ENGINE=openai` or `ollama`) |
| **Redis Cluster when single Redis suffices** | Too many replicas → too many connections → Redis can't handle them → you deploy Redis Cluster to compensate | Fix the root cause (fewer replicas, `timeout 1800`, `maxclients 10000`) |
| **Scaling replicas to mask memory leaks** | Leaky processes → OOM kills → auto-scaler adds more pods → more Redis connections → Redis overwhelmed | Fix the leaks first (content extraction, embedding engine), then right-size |
| **Using Default (prompt-based) tool calling** | Injected prompts may break KV cache → higher latency → more resources needed per request | Switch to Native Mode for all capable models |
| **Not configuring Redis stale connection timeout** | Connections accumulate forever → Redis OOM → you deploy Redis Cluster | Add `timeout 1800` to redis.conf |

---

@@ -384,6 +449,7 @@ For detailed information on all available variables, see the [Environment Config
| `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE` | [Streaming Chunk Size](/reference/env-configuration#chat_response_stream_delta_chunk_size) |
| `THREAD_POOL_SIZE` | [Thread Pool Size](/reference/env-configuration#thread_pool_size) |
| `RAG_EMBEDDING_ENGINE` | [Embedding Engine](/reference/env-configuration#rag_embedding_engine) |
| `CONTENT_EXTRACTION_ENGINE` | [Content Extraction Engine](/reference/env-configuration#content_extraction_engine) |
| `AUDIO_STT_ENGINE` | [STT Engine](/reference/env-configuration#audio_stt_engine) |
| `ENABLE_IMAGE_GENERATION` | [Image Generation](/reference/env-configuration#enable_image_generation) |
| `ENABLE_AUTOCOMPLETE_GENERATION` | [Autocomplete](/reference/env-configuration#enable_autocomplete_generation) |