docs: performance best practices, scaling, and tool calling legacy/native

- Mark Default (prompt-based) tool calling as legacy; recommend Native Mode
  - Document KV cache breaking in Default Mode, system tools only in Native
  - Updated comparison table with Status, KV Cache, System Tools rows
- Add Content Extraction Engine section (pypdf memory leak warning, Tika/Docling)
- Add Embedding Engine section (SentenceTransformers RAM warning at scale)
- Add Common Anti-Patterns section with 6 real-world scaling mistakes
- Add Redis Tuning subsection (timeout, maxclients, single instance sufficiency)
- Expand Profile 3 with content extraction, embeddings, tool calling, Redis
- New Step 6 in scaling guide: Fix Content Extraction & Embeddings
- Quick reference table: add Ext. Content Extraction and Ext. Embeddings columns
- Add CONTENT_EXTRACTION_ENGINE and RAG_EMBEDDING_ENGINE to minimum env vars
This commit is contained in:
Classic
2026-02-28 21:31:11 +01:00
parent f2fba91aa7
commit db1ca1a342
3 changed files with 140 additions and 19 deletions

@@ -89,14 +89,22 @@ You can also let your LLM auto-select the right Tools using the [**AutoTool Filt
Open WebUI offers two distinct ways for models to interact with tools: a standard **Default Mode** and a high-performance **Native Mode (Agentic Mode)**. Choosing the right mode depends on your model's capabilities and your performance requirements.
### 🟡 Default Mode (Prompt-based)
### 🟡 Default Mode (Prompt-based) — Legacy
:::warning Legacy Mode
Default Mode is maintained purely for **backwards compatibility** with older or smaller models that lack native function-calling support. It is considered **legacy** and should not be used when your model supports native tool calling. New deployments should use **Native Mode** exclusively.
:::
In Default Mode, Open WebUI manages tool selection by injecting a specific prompt template that guides the model to output a tool request.
- **Compatibility**: Works with **practically any model**, including older or smaller local models that lack native function-calling support.
- **Flexibility**: Highly customizable via prompt templates.
- **Caveat**: Can be slower (requires extra tokens) and less reliable for complex, multi-step tool chaining.
- **Caveats**:
- Can be slower (requires extra tokens) and less reliable for complex, multi-step tool chaining.
- **Breaks KV cache**: The injected prompt changes every turn, preventing LLM engines from reusing cached key-value pairs. This increases latency and cost for every message in the conversation.
- Does not support built-in system tools (memory, notes, channels, etc.).
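The KV-cache caveat can be illustrated with a rough sketch. Prefix caching only helps with the longest common prefix between consecutive requests, so a tool-selection template that changes every turn truncates the reusable prefix almost immediately. The prompt strings below are hypothetical, not Open WebUI's actual templates:

```python
# Rough illustration of why per-turn prompt injection defeats prefix caching.
# The prompt templates below are hypothetical, not Open WebUI's actual ones.

def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared prefix that a KV cache could reuse between requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = "You are a helpful assistant."

# Default Mode: a tool-selection template is re-injected and varies per turn,
# so turn 2 diverges from turn 1 early in the context.
turn1_default = system + "\n[TOOLS turn=1: weather, calculator]\nUser: Hi"
turn2_default = system + "\n[TOOLS turn=2: weather, calculator]\nUser: Hi\nAssistant: Hello!\nUser: Weather?"

# Native Mode: tool definitions travel as structured API parameters, so the
# prompt grows append-only and the entire previous context stays cacheable.
turn1_native = system + "\nUser: Hi"
turn2_native = system + "\nUser: Hi\nAssistant: Hello!\nUser: Weather?"

reuse_default = common_prefix_len(turn1_default, turn2_default)
reuse_native = common_prefix_len(turn1_native, turn2_native)

assert reuse_native >= len(turn1_native)   # full previous turn is reusable
assert reuse_default < len(turn1_default)  # injection breaks the prefix early
```

The same logic applies at the token level in real inference engines: everything after the first differing token must be recomputed each turn.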
### 🟢 Native Mode (Agentic Mode / System Function Calling)
Native Mode (also called **Agentic Mode**) leverages the model's built-in capability to handle tool definitions and return structured tool calls (JSON). This is the **recommended mode** for high-performance agentic workflows.
### 🟢 Native Mode (Agentic Mode / System Function Calling) — Recommended
Native Mode (also called **Agentic Mode**) leverages the model's built-in capability to handle tool definitions and return structured tool calls (JSON). This is the **recommended mode** for all models that support it — which includes the vast majority of modern models (2024+).
:::warning Model Quality Matters
**Agentic tool calling requires high-quality models to work reliably.** While small local models may technically support function calling, they often struggle with the complex reasoning required for multi-step tool usage. For best results, use frontier models like **GPT-5**, **Claude 4.5 Sonnet**, **Gemini 3 Flash**, or **MiniMax M2.5**. Small local models may produce malformed JSON or fail to follow the strict state management required for agentic behavior.
@@ -104,9 +112,11 @@ Native Mode (also called **Agentic Mode**) leverages the model's built-in capabi
#### Why use Native Mode (Agentic Mode)?
- **Speed & Efficiency**: Lower latency as it avoids bulky prompt-based tool selection.
- **KV Cache Friendly**: Tool definitions are sent as structured parameters (not injected into the prompt), so they don't invalidate the KV cache between turns. This can significantly reduce latency and token costs.
- **Reliability**: Higher accuracy in following tool schemas (with quality models).
- **Multi-step Chaining**: Essential for **Agentic Research** and **Interleaved Thinking** where a model needs to call multiple tools in succession.
- **Autonomous Decision-Making**: Models can decide when to search, which tools to use, and how to combine results.
- **System Tools**: Only Native Mode unlocks the [built-in system tools](#built-in-system-tools-nativeagentic-mode) (memory, notes, knowledge, channels, etc.).
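Concretely, a native tool call follows the standard OpenAI-style chat-completions wire format: tool schemas go in as structured parameters, and the model returns a structured `tool_calls` entry instead of free text. A hedged sketch (the tool name and arguments are hypothetical):

```python
import json

# What the client sends: tool schemas as structured parameters, not prompt text.
request_tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# What a capable model returns: a structured tool_call, not prose to be parsed.
response_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'},
    }],
}

# The caller decodes the JSON arguments and dispatches the matching tool.
call = response_message["tool_calls"][0]["function"]
args = json.loads(call["arguments"])
assert call["name"] == "get_weather"
assert args == {"city": "Berlin"}
```

Because the schemas never enter the prompt text, the conversation prefix stays stable between turns, which is exactly what makes this mode KV-cache friendly.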
#### How to Enable Native Mode (Agentic Mode)
Native Mode can be enabled at two levels:
@@ -164,11 +174,14 @@ These models excel at multi-step reasoning, proper JSON formatting, and autonomo
**This is a DeepSeek model/API issue**, not an Open WebUI issue. Open WebUI correctly sends tools in standard OpenAI format — the malformed output originates from DeepSeek's non-standard internal format.
:::
| Feature | Default Mode | Native Mode |
| Feature | Default Mode (Legacy) | Native Mode (Recommended) |
|:---|:---|:---|
| **Status** | Legacy / backwards compat | ✅ Recommended |
| **Latency** | Medium/High | Low |
| **KV Cache** | ❌ Can break cache | ✅ Cache-friendly |
| **Model Compatibility** | Universal | Requires Tool-Calling Support |
| **Logic** | Prompt-based (Open WebUI) | Model-native (API/Ollama) |
| **System Tools** | ❌ Not available | ✅ Full access |
| **Complex Chaining** | ⚠️ Limited | ✅ Excellent |
### Built-in System Tools (Native/Agentic Mode)

@@ -75,6 +75,7 @@ ENABLE_WEBSOCKET_SUPPORT=true
- If you're using Redis Sentinel for high availability, also set `REDIS_SENTINEL_HOSTS` and consider setting `REDIS_SOCKET_CONNECT_TIMEOUT=5` to prevent hangs during failover.
- For AWS Elasticache or other managed Redis Cluster services, set `REDIS_CLUSTER=true`.
- Make sure your Redis server has `timeout 1800` and a high enough `maxclients` (10000+) to prevent connection exhaustion over time.
- A **single Redis instance** is sufficient for the vast majority of deployments, even with thousands of users. You almost certainly do not need Redis Cluster unless you have specific HA/bandwidth requirements. If you think you need Redis Cluster, first check whether your connection count and memory usage are caused by fixable configuration issues (see [Common Anti-Patterns](/troubleshooting/performance#%EF%B8%8F-common-anti-patterns)).
- Without Redis in a multi-instance setup, you will experience [WebSocket 403 errors](/troubleshooting/multi-replica#2-websocket-403-errors--connection-failures), [configuration sync issues](/troubleshooting/multi-replica#3-model-not-found-or-configuration-mismatch), and intermittent authentication failures.
For a complete step-by-step Redis setup (Docker Compose, Sentinel, Cluster mode, verification), see the [Redis WebSocket Support](/tutorials/integrations/redis) tutorial. For WebSocket and CORS issues behind reverse proxies, see [Connection Errors](/troubleshooting/connection-error#-https-tls-cors--websocket-issues).
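A hedged sketch of the minimal env wiring for a multi-instance setup (the `REDIS_URL` and `WEBSOCKET_*` variable names and the `redis:6379` host are assumptions — verify them against the Environment Configuration reference for your version):

```
# All instances must point at the same Redis (host/port are examples)
REDIS_URL=redis://redis:6379/0
ENABLE_WEBSOCKET_SUPPORT=true
WEBSOCKET_MANAGER=redis
WEBSOCKET_REDIS_URL=redis://redis:6379/0
```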
@@ -232,7 +233,40 @@ Each provider has its own set of environment variables for credentials and bucke
---
## Step 6 — Add Observability
## Step 6 — Fix Content Extraction & Embeddings
**When:** You process documents regularly (RAG, knowledge bases) and are running in production.
:::danger These Defaults Cause Memory Leaks at Scale
The default content extraction engine (pypdf) and default embedding engine (SentenceTransformers) are the **two most common causes of memory leaks** in production Open WebUI deployments. Fixing these is just as important as switching to PostgreSQL or adding Redis.
:::
**What to do:**
1. **Switch the content extraction engine** to an external service:
```
CONTENT_EXTRACTION_ENGINE=tika
TIKA_SERVER_URL=http://tika:9998
```
2. **Switch the embedding engine** to an external provider:
```
RAG_EMBEDDING_ENGINE=openai
# or for self-hosted:
RAG_EMBEDDING_ENGINE=ollama
```
**Key things to know:**
- The default content extractor (pypdf) has **known memory leaks** that cause the Open WebUI process's memory usage to grow continuously during document ingestion. An external extractor (Tika, Docling) runs in its own process/container, isolating those leaks from the main application.
- The default SentenceTransformers embedding model loads ~500MB per worker process. With 8 workers, that's 4GB of RAM just for embeddings. External embedding eliminates this.
- For detailed guidance and configuration options, see [Content Extraction Engine](/troubleshooting/performance#content-extraction-engine) and [Embedding Engine](/troubleshooting/performance#embedding-engine) in the Performance guide.
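As a sketch, the Tika sidecar can be run and verified like this (image tag, network name, and container name are assumptions — adapt them to your Compose/Kubernetes setup):

```
# Run Apache Tika next to Open WebUI (example names/ports)
docker network create openwebui-net
docker run -d --name tika --network openwebui-net -p 9998:9998 apache/tika:latest-full

# Verify Tika responds before wiring Open WebUI to it
curl -s http://localhost:9998/tika

# Then set in Open WebUI's environment:
#   CONTENT_EXTRACTION_ENGINE=tika
#   TIKA_SERVER_URL=http://tika:9998
```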
---
## Step 7 — Add Observability
**When:** You want to monitor performance, troubleshoot issues, and understand how your deployment is behaving at scale.
@@ -300,6 +334,14 @@ ENABLE_WEBSOCKET_SUPPORT=true
# S3_BUCKET_NAME=my-openwebui-bucket
# S3_REGION_NAME=us-east-1
# Content Extraction (do NOT use default pypdf in production)
CONTENT_EXTRACTION_ENGINE=tika
TIKA_SERVER_URL=http://tika:9998
# Embeddings (do NOT use default SentenceTransformers at scale)
RAG_EMBEDDING_ENGINE=openai
# or: RAG_EMBEDDING_ENGINE=ollama
# Workers (let orchestrator scale, keep workers at 1)
UVICORN_WORKERS=1
@@ -311,13 +353,13 @@ ENABLE_DB_MIGRATIONS=false
## Quick Reference: When Do I Need What?
| Scenario | PostgreSQL | Redis | External Vector DB | Shared Storage |
|---|:---:|:---:|:---:|:---:|
| Single user / evaluation | ✗ | ✗ | ✗ | ✗ |
| Small team (< 50 users, single instance) | Recommended | ✗ | ✗ | ✗ |
| Multiple Uvicorn workers | **Required** | **Required** | **Required** | ✗ (same filesystem) |
| Multiple instances / HA | **Required** | **Required** | **Required** | **Optional** (NFS or S3) |
| Large scale (1000+ users) | **Required** | **Required** | **Required** | **Optional** (NFS or S3) |
| Scenario | PostgreSQL | Redis | External Vector DB | Ext. Content Extraction | Ext. Embeddings | Shared Storage |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| Single user / evaluation | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Small team (< 50 users, single instance) | Recommended | ✗ | ✗ | Recommended | ✗ | ✗ |
| Multiple Uvicorn workers | **Required** | **Required** | **Required** | **Strongly Recommended** | **Strongly Recommended** | ✗ (same filesystem) |
| Multiple instances / HA | **Required** | **Required** | **Required** | **Strongly Recommended** | **Strongly Recommended** | **Optional** (NFS or S3) |
| Large scale (1000+ users) | **Required** | **Required** | **Required** | **Strongly Recommended** | **Strongly Recommended** | **Optional** (NFS or S3) |
:::note About "External Vector DB"
The default ChromaDB uses a local SQLite backend that crashes under multi-process access. "External Vector DB" means either a client-server database (PGVector, Milvus, Qdrant, Pinecone) or ChromaDB running as a separate HTTP server. See [Step 4](#step-4--switch-to-an-external-vector-database) for details.

@@ -134,6 +134,40 @@ For multi-user setups, the choice of Vector DB matters.
* `ENABLE_MILVUS_MULTITENANCY_MODE=True`
* `ENABLE_QDRANT_MULTITENANCY_MODE=True`
### Content Extraction Engine
:::danger Default Content Extractor Causes Memory Leaks
The **default content extraction engine** runs in-process Python libraries, notably **pypdf**, that are known to leak memory persistently during document ingestion. In production deployments with regular document uploads, this causes Open WebUI's memory usage to grow continuously until the process is killed or the container is restarted.
This is the **#1 cause of unexplained memory growth** in production deployments.
:::
**Recommendation**: Switch to an external content extraction engine for any deployment that processes documents regularly:
| Engine | Best For | Configuration |
|---|---|---|
| **Apache Tika** | General-purpose, widely used, handles most document types | `CONTENT_EXTRACTION_ENGINE=tika` + `TIKA_SERVER_URL=http://tika:9998` |
| **Docling** | High-quality extraction with layout-aware parsing | `CONTENT_EXTRACTION_ENGINE=docling` |
| **External Loader** | Recommended for production and custom extraction pipelines | `CONTENT_EXTRACTION_ENGINE=external` + `EXTERNAL_DOCUMENT_LOADER_URL=...` |
Using an external extractor moves the memory-intensive parsing out of the Open WebUI process entirely, eliminating this class of memory leaks.
### Embedding Engine
:::warning SentenceTransformers at Scale
The **default SentenceTransformers** embedding engine (all-MiniLM-L6-v2) loads a machine learning model into the Open WebUI process memory. While lightweight enough for personal use, at scale this model:
- **Consumes significant RAM** (~500MB+ per worker process)
- **Blocks the event loop** during embedding operations on older versions
- **Multiplies with workers** — each Uvicorn worker loads its own copy of the model
For multi-user or production deployments, **offload embeddings to an external service**.
:::
- **Recommended**: Use `RAG_EMBEDDING_ENGINE=openai` (for cloud embeddings via OpenAI, Azure, or compatible APIs) or `RAG_EMBEDDING_ENGINE=ollama` (for self-hosted embedding via Ollama with models like `nomic-embed-text`).
- **Env Var**: `RAG_EMBEDDING_ENGINE=openai`
- **Effect**: The embedding model is no longer loaded into the Open WebUI process, freeing hundreds of MB of RAM per worker.
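The arithmetic behind the warning is simple: each Uvicorn worker loads its own copy of the model, so in-process embedding memory scales linearly with worker count, while an external service pays the cost once, outside the Open WebUI process:

```python
# Per-worker RAM cost of in-process embeddings vs. an external service.
model_mb_per_worker = 500   # approx. footprint of the default embedding model
workers = 8

in_process_mb = model_mb_per_worker * workers   # every worker loads a copy
external_mb = 0                                  # model lives in the external service

assert in_process_mb == 4000   # ~4 GB just for embeddings, as noted above
assert external_mb == 0
```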
### Optimizing Document Chunking
The way your documents are chunked directly impacts both storage efficiency and retrieval quality.
@@ -359,12 +393,43 @@ If resource usage is critical, disable automated features that constantly trigge
*Target: Many concurrent users, Stability > Persistence.*
1. **Database**: **PostgreSQL** (Mandatory).
2. **Workers**: `THREAD_POOL_SIZE=2000` (Prevent timeouts).
3. **Streaming**: `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7` (Reduce CPU/Net/DB writes).
4. **Chat Saving**: `ENABLE_REALTIME_CHAT_SAVE=False`.
5. **Vector DB**: **Milvus**, **Qdrant**, or **PGVector**. **Do not use ChromaDB's default local mode** — its SQLite backend will crash under multi-worker/multi-replica access.
6. **Task Model**: External/Hosted (Offload compute).
7. **Caching**: `ENABLE_BASE_MODELS_CACHE=True`, `MODELS_CACHE_TTL=300`, `ENABLE_QUERIES_CACHE=True`.
2. **Content Extraction**: **Tika** or **Docling** (Mandatory — default pypdf leaks memory). See [Content Extraction Engine](#content-extraction-engine).
3. **Embeddings**: **External** — `RAG_EMBEDDING_ENGINE=openai` or `ollama` (Mandatory — default SentenceTransformers consumes too much RAM at scale). See [Embedding Engine](#embedding-engine).
4. **Tool Calling**: **Native Mode** (strongly recommended — Default Mode is legacy and breaks KV cache). See [Tool Calling Modes](/features/extensibility/plugin/tools#tool-calling-modes-default-vs-native).
5. **Workers**: `THREAD_POOL_SIZE=2000` (Prevent timeouts).
6. **Streaming**: `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7` (Reduce CPU/Net/DB writes).
7. **Chat Saving**: `ENABLE_REALTIME_CHAT_SAVE=False`.
8. **Vector DB**: **Milvus**, **Qdrant**, or **PGVector**. **Do not use ChromaDB's default local mode** — its SQLite backend will crash under multi-worker/multi-replica access.
9. **Task Model**: External/Hosted (Offload compute).
10. **Caching**: `ENABLE_BASE_MODELS_CACHE=True`, `MODELS_CACHE_TTL=300`, `ENABLE_QUERIES_CACHE=True`.
11. **Redis**: Single instance with `timeout 1800` and high `maxclients` (10000+). See [Redis Tuning](#redis-tuning) below.
#### Redis Tuning
A single Redis instance is sufficient for the vast majority of deployments, including those with thousands of users. **You almost certainly do not need Redis Cluster or Redis Sentinel** unless you have specific HA requirements.
Common Redis configuration issues that cause unnecessary scaling:
| Issue | Symptom | Fix |
|---|---|---|
| **Stale connections** | Redis runs out of connections or memory grows indefinitely | Set `timeout 1800` in redis.conf (kills idle connections after 30 minutes) |
| **Low maxclients** | `max number of clients reached` errors | Set `maxclients 10000` or higher |
| **No connection limits** | Open WebUI pods may accumulate connections that never close | Combine `timeout` with connection pool limits in your Redis client config |
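The two server-side fixes from the table, as a minimal `redis.conf` sketch (the values are this guide's recommendations; tune `maxclients` to your actual replica and worker count):

```
# redis.conf — single-instance tuning for Open WebUI
timeout 1800        # drop connections idle for 30 minutes
maxclients 10000    # headroom for many replicas/workers
```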
---
## ⚠️ Common Anti-Patterns
These are real-world mistakes that cause organizations to massively over-provision infrastructure:
| Anti-Pattern | What Happens | Fix |
|---|---|---|
| **Using default content extractor in production** | pypdf leaks memory → containers restart constantly → you add more replicas to compensate | Switch to Tika or Docling (`CONTENT_EXTRACTION_ENGINE=tika`) |
| **Running SentenceTransformers at scale** | Each worker loads ~500MB embedding model → RAM usage explodes → you add more machines | Use external embeddings (`RAG_EMBEDDING_ENGINE=openai` or `ollama`) |
| **Redis Cluster when single Redis suffices** | Too many replicas → too many connections → Redis can't handle them → you deploy Redis Cluster to compensate | Fix the root cause (fewer replicas, `timeout 1800`, `maxclients 10000`) |
| **Scaling replicas to mask memory leaks** | Leaky processes → OOM kills → auto-scaler adds more pods → more Redis connections → Redis overwhelmed | Fix the leaks first (content extraction, embedding engine), then right-size |
| **Using Default (prompt-based) tool calling** | Injected prompts may break KV cache → higher latency → more resources needed per request | Switch to Native Mode for all capable models |
| **Not configuring Redis stale connection timeout** | Connections accumulate forever → Redis OOM → you deploy Redis Cluster | Add `timeout 1800` to redis.conf |
---
@@ -384,6 +449,7 @@ For detailed information on all available variables, see the [Environment Config
| `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE` | [Streaming Chunk Size](/reference/env-configuration#chat_response_stream_delta_chunk_size) |
| `THREAD_POOL_SIZE` | [Thread Pool Size](/reference/env-configuration#thread_pool_size) |
| `RAG_EMBEDDING_ENGINE` | [Embedding Engine](/reference/env-configuration#rag_embedding_engine) |
| `CONTENT_EXTRACTION_ENGINE` | [Content Extraction Engine](/reference/env-configuration#content_extraction_engine) |
| `AUDIO_STT_ENGINE` | [STT Engine](/reference/env-configuration#audio_stt_engine) |
| `ENABLE_IMAGE_GENERATION` | [Image Generation](/reference/env-configuration#enable_image_generation) |
| `ENABLE_AUTOCOMPLETE_GENERATION` | [Autocomplete](/reference/env-configuration#enable_autocomplete_generation) |