docs: performance best practices, scaling, and tool calling legacy/native

- Mark Default (prompt-based) tool calling as legacy; recommend Native Mode
  - Document KV cache breaking in Default Mode, system tools only in Native
  - Updated comparison table with Status, KV Cache, System Tools rows
- Add Content Extraction Engine section (pypdf memory leak warning, Tika/Docling)
- Add Embedding Engine section (SentenceTransformers RAM warning at scale)
- Add Common Anti-Patterns section with 6 real-world scaling mistakes
- Add Redis Tuning subsection (timeout, maxclients, single instance sufficiency)
- Expand Profile 3 with content extraction, embeddings, tool calling, Redis
- New Step 6 in scaling guide: Fix Content Extraction & Embeddings
- Quick reference table: add Ext. Content Extraction and Ext. Embeddings columns
- Add CONTENT_EXTRACTION_ENGINE and RAG_EMBEDDING_ENGINE to minimum env vars
This commit is contained in:
Classic
2026-02-28 21:31:11 +01:00
parent f2fba91aa7
commit db1ca1a342
3 changed files with 140 additions and 19 deletions

@@ -89,14 +89,22 @@ You can also let your LLM auto-select the right Tools using the [**AutoTool Filt
Open WebUI offers two distinct ways for models to interact with tools: a standard **Default Mode** and a high-performance **Native Mode (Agentic Mode)**. Choosing the right mode depends on your model's capabilities and your performance requirements.
### 🟡 Default Mode (Prompt-based)
### 🟡 Default Mode (Prompt-based) — Legacy
:::warning Legacy Mode
Default Mode is maintained purely for **backwards compatibility** with older or smaller models that lack native function-calling support. It is considered **legacy** and should not be used when your model supports native tool calling. New deployments should use **Native Mode** exclusively.
:::
In Default Mode, Open WebUI manages tool selection by injecting a specific prompt template that guides the model to output a tool request.
- **Compatibility**: Works with **practically any model**, including older or smaller local models that lack native function-calling support.
- **Flexibility**: Highly customizable via prompt templates.
- **Caveat**: Can be slower (requires extra tokens) and less reliable for complex, multi-step tool chaining.
- **Caveats**:
- Can be slower (requires extra tokens) and less reliable for complex, multi-step tool chaining.
- **Breaks KV cache**: The injected prompt changes every turn, preventing LLM engines from reusing cached key-value pairs. This increases latency and cost for every message in the conversation.
- Does not support built-in system tools (memory, notes, channels, etc.).
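The KV-cache caveat can be illustrated with a rough sketch. Prefix caching only helps with the longest common prefix between consecutive requests, so a tool-selection template that changes every turn truncates the reusable prefix almost immediately. The prompt strings below are hypothetical, not Open WebUI's actual templates:

```python
# Rough illustration of why per-turn prompt injection defeats prefix caching.
# The prompt templates below are hypothetical, not Open WebUI's actual ones.

def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared prefix that a KV cache could reuse between requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = "You are a helpful assistant."

# Default Mode: a tool-selection template is re-injected and varies per turn,
# so turn 2 diverges from turn 1 early in the context.
turn1_default = system + "\n[TOOLS turn=1: weather, calculator]\nUser: Hi"
turn2_default = system + "\n[TOOLS turn=2: weather, calculator]\nUser: Hi\nAssistant: Hello!\nUser: Weather?"

# Native Mode: tool definitions travel as structured API parameters, so the
# prompt grows append-only and the entire previous context stays cacheable.
turn1_native = system + "\nUser: Hi"
turn2_native = system + "\nUser: Hi\nAssistant: Hello!\nUser: Weather?"

reuse_default = common_prefix_len(turn1_default, turn2_default)
reuse_native = common_prefix_len(turn1_native, turn2_native)

assert reuse_native >= len(turn1_native)   # full previous turn is reusable
assert reuse_default < len(turn1_default)  # injection breaks the prefix early
```

The same logic applies at the token level in real inference engines: everything after the first differing token must be recomputed each turn.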
### 🟢 Native Mode (Agentic Mode / System Function Calling)
Native Mode (also called **Agentic Mode**) leverages the model's built-in capability to handle tool definitions and return structured tool calls (JSON). This is the **recommended mode** for high-performance agentic workflows.
### 🟢 Native Mode (Agentic Mode / System Function Calling) — Recommended
Native Mode (also called **Agentic Mode**) leverages the model's built-in capability to handle tool definitions and return structured tool calls (JSON). This is the **recommended mode** for all models that support it — which includes the vast majority of modern models (2024+).
:::warning Model Quality Matters
**Agentic tool calling requires high-quality models to work reliably.** While small local models may technically support function calling, they often struggle with the complex reasoning required for multi-step tool usage. For best results, use frontier models like **GPT-5**, **Claude 4.5 Sonnet**, **Gemini 3 Flash**, or **MiniMax M2.5**. Small local models may produce malformed JSON or fail to follow the strict state management required for agentic behavior.
@@ -104,9 +112,11 @@ Native Mode (also called **Agentic Mode**) leverages the model's built-in capabi
#### Why use Native Mode (Agentic Mode)?
- **Speed & Efficiency**: Lower latency as it avoids bulky prompt-based tool selection.
- **KV Cache Friendly**: Tool definitions are sent as structured parameters (not injected into the prompt), so they don't invalidate the KV cache between turns. This can significantly reduce latency and token costs.
- **Reliability**: Higher accuracy in following tool schemas (with quality models).
- **Multi-step Chaining**: Essential for **Agentic Research** and **Interleaved Thinking** where a model needs to call multiple tools in succession.
- **Autonomous Decision-Making**: Models can decide when to search, which tools to use, and how to combine results.
- **System Tools**: Only Native Mode unlocks the [built-in system tools](#built-in-system-tools-nativeagentic-mode) (memory, notes, knowledge, channels, etc.).
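Concretely, a native tool call follows the standard OpenAI-style chat-completions wire format: tool schemas go in as structured parameters, and the model returns a structured `tool_calls` entry instead of free text. A hedged sketch (the tool name and arguments are hypothetical):

```python
import json

# What the client sends: tool schemas as structured parameters, not prompt text.
request_tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# What a capable model returns: a structured tool_call, not prose to be parsed.
response_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'},
    }],
}

# The caller decodes the JSON arguments and dispatches the matching tool.
call = response_message["tool_calls"][0]["function"]
args = json.loads(call["arguments"])
assert call["name"] == "get_weather"
assert args == {"city": "Berlin"}
```

Because the schemas never enter the prompt text, the conversation prefix stays stable between turns, which is exactly what makes this mode KV-cache friendly.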
#### How to Enable Native Mode (Agentic Mode)
Native Mode can be enabled at two levels:
@@ -164,11 +174,14 @@ These models excel at multi-step reasoning, proper JSON formatting, and autonomo
**This is a DeepSeek model/API issue**, not an Open WebUI issue. Open WebUI correctly sends tools in standard OpenAI format — the malformed output originates from DeepSeek's non-standard internal format.
:::
| Feature | Default Mode | Native Mode |
| Feature | Default Mode (Legacy) | Native Mode (Recommended) |
|:---|:---|:---|
| **Status** | Legacy / backwards compat | ✅ Recommended |
| **Latency** | Medium/High | Low |
| **KV Cache** | ❌ Can break cache | ✅ Cache-friendly |
| **Model Compatibility** | Universal | Requires Tool-Calling Support |
| **Logic** | Prompt-based (Open WebUI) | Model-native (API/Ollama) |
| **System Tools** | ❌ Not available | ✅ Full access |
| **Complex Chaining** | ⚠️ Limited | ✅ Excellent |
### Built-in System Tools (Native/Agentic Mode)

@@ -75,6 +75,7 @@ ENABLE_WEBSOCKET_SUPPORT=true
- If you're using Redis Sentinel for high availability, also set `REDIS_SENTINEL_HOSTS` and consider setting `REDIS_SOCKET_CONNECT_TIMEOUT=5` to prevent hangs during failover.
- For AWS Elasticache or other managed Redis Cluster services, set `REDIS_CLUSTER=true`.
- Make sure your Redis server has `timeout 1800` and a high enough `maxclients` (10000+) to prevent connection exhaustion over time.
- A **single Redis instance** is sufficient for the vast majority of deployments, even with thousands of users. You almost certainly do not need Redis Cluster unless you have specific HA/bandwidth requirements. If you think you need Redis Cluster, first check whether your connection count and memory usage are caused by fixable configuration issues (see [Common Anti-Patterns](/troubleshooting/performance#%EF%B8%8F-common-anti-patterns)).
- Without Redis in a multi-instance setup, you will experience [WebSocket 403 errors](/troubleshooting/multi-replica#2-websocket-403-errors--connection-failures), [configuration sync issues](/troubleshooting/multi-replica#3-model-not-found-or-configuration-mismatch), and intermittent authentication failures.
For a complete step-by-step Redis setup (Docker Compose, Sentinel, Cluster mode, verification), see the [Redis WebSocket Support](/tutorials/integrations/redis) tutorial. For WebSocket and CORS issues behind reverse proxies, see [Connection Errors](/troubleshooting/connection-error#-https-tls-cors--websocket-issues).
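A hedged sketch of the minimal env wiring for a multi-instance setup (the `REDIS_URL` and `WEBSOCKET_*` variable names and the `redis:6379` host are assumptions — verify them against the Environment Configuration reference for your version):

```
# All instances must point at the same Redis (host/port are examples)
REDIS_URL=redis://redis:6379/0
ENABLE_WEBSOCKET_SUPPORT=true
WEBSOCKET_MANAGER=redis
WEBSOCKET_REDIS_URL=redis://redis:6379/0
```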
@@ -232,7 +233,40 @@ Each provider has its own set of environment variables for credentials and bucke
---
## Step 6 — Add Observability
## Step 6 — Fix Content Extraction & Embeddings
**When:** You process documents regularly (RAG, knowledge bases) and are running in production.
:::danger These Defaults Cause Memory Leaks at Scale
The default content extraction engine (pypdf) and default embedding engine (SentenceTransformers) are the **two most common causes of memory leaks** in production Open WebUI deployments. Fixing these is just as important as switching to PostgreSQL or adding Redis.
:::
**What to do:**
1. **Switch the content extraction engine** to an external service:
```
CONTENT_EXTRACTION_ENGINE=tika
TIKA_SERVER_URL=http://tika:9998
```
2. **Switch the embedding engine** to an external provider:
```
RAG_EMBEDDING_ENGINE=openai
# or for self-hosted:
RAG_EMBEDDING_ENGINE=ollama
```
**Key things to know:**
- The default content extractor (pypdf) has **known memory leaks** that cause the Open WebUI process's memory usage to grow continuously during document ingestion. An external extractor (Tika, Docling) runs in its own process/container, isolating those leaks from the main application.
- The default SentenceTransformers embedding model loads ~500MB per worker process. With 8 workers, that's 4GB of RAM just for embeddings. External embedding eliminates this.
- For detailed guidance and configuration options, see [Content Extraction Engine](/troubleshooting/performance#content-extraction-engine) and [Embedding Engine](/troubleshooting/performance#embedding-engine) in the Performance guide.
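As a sketch, the Tika sidecar can be run and verified like this (image tag, network name, and container name are assumptions — adapt them to your Compose/Kubernetes setup):

```
# Run Apache Tika next to Open WebUI (example names/ports)
docker network create openwebui-net
docker run -d --name tika --network openwebui-net -p 9998:9998 apache/tika:latest-full

# Verify Tika responds before wiring Open WebUI to it
curl -s http://localhost:9998/tika

# Then set in Open WebUI's environment:
#   CONTENT_EXTRACTION_ENGINE=tika
#   TIKA_SERVER_URL=http://tika:9998
```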
---
## Step 7 — Add Observability
**When:** You want to monitor performance, troubleshoot issues, and understand how your deployment is behaving at scale.
@@ -300,6 +334,14 @@ ENABLE_WEBSOCKET_SUPPORT=true
# S3_BUCKET_NAME=my-openwebui-bucket
# S3_REGION_NAME=us-east-1
# Content Extraction (do NOT use default pypdf in production)
CONTENT_EXTRACTION_ENGINE=tika
TIKA_SERVER_URL=http://tika:9998
# Embeddings (do NOT use default SentenceTransformers at scale)
RAG_EMBEDDING_ENGINE=openai
# or: RAG_EMBEDDING_ENGINE=ollama
# Workers (let orchestrator scale, keep workers at 1)
UVICORN_WORKERS=1
@@ -311,13 +353,13 @@ ENABLE_DB_MIGRATIONS=false
## Quick Reference: When Do I Need What?
| Scenario | PostgreSQL | Redis | External Vector DB | Shared Storage |
|---|:---:|:---:|:---:|:---:|
| Single user / evaluation | ✗ | ✗ | ✗ | ✗ |
| Small team (< 50 users, single instance) | Recommended | ✗ | ✗ | ✗ |
| Multiple Uvicorn workers | **Required** | **Required** | **Required** | ✗ (same filesystem) |
| Multiple instances / HA | **Required** | **Required** | **Required** | **Optional** (NFS or S3) |
| Large scale (1000+ users) | **Required** | **Required** | **Required** | **Optional** (NFS or S3) |
| Scenario | PostgreSQL | Redis | External Vector DB | Ext. Content Extraction | Ext. Embeddings | Shared Storage |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| Single user / evaluation | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Small team (< 50 users, single instance) | Recommended | ✗ | ✗ | Recommended | ✗ | ✗ |
| Multiple Uvicorn workers | **Required** | **Required** | **Required** | **Strongly Recommended** | **Strongly Recommended** | ✗ (same filesystem) |
| Multiple instances / HA | **Required** | **Required** | **Required** | **Strongly Recommended** | **Strongly Recommended** | **Optional** (NFS or S3) |
| Large scale (1000+ users) | **Required** | **Required** | **Required** | **Strongly Recommended** | **Strongly Recommended** | **Optional** (NFS or S3) |
:::note About "External Vector DB"
The default ChromaDB uses a local SQLite backend that crashes under multi-process access. "External Vector DB" means either a client-server database (PGVector, Milvus, Qdrant, Pinecone) or ChromaDB running as a separate HTTP server. See [Step 4](#step-4--switch-to-an-external-vector-database) for details.

@@ -134,6 +134,40 @@ For multi-user setups, the choice of Vector DB matters.
* `ENABLE_MILVUS_MULTITENANCY_MODE=True`
* `ENABLE_QDRANT_MULTITENANCY_MODE=True`
### Content Extraction Engine
:::danger Default Content Extractor Causes Memory Leaks
The **default content extraction engine** runs in-process Python libraries, notably **pypdf**, that are known to leak memory persistently during document ingestion. In production deployments with regular document uploads, this causes Open WebUI's memory usage to grow continuously until the process is killed or the container is restarted.
This is the **#1 cause of unexplained memory growth** in production deployments.
:::
**Recommendation**: Switch to an external content extraction engine for any deployment that processes documents regularly:
| Engine | Best For | Configuration |
|---|---|---|
| **Apache Tika** | General-purpose, widely used, handles most document types | `CONTENT_EXTRACTION_ENGINE=tika` + `TIKA_SERVER_URL=http://tika:9998` |
| **Docling** | High-quality extraction with layout-aware parsing | `CONTENT_EXTRACTION_ENGINE=docling` |
| **External Loader** | Recommended for production and custom extraction pipelines | `CONTENT_EXTRACTION_ENGINE=external` + `EXTERNAL_DOCUMENT_LOADER_URL=...` |
Using an external extractor moves the memory-intensive parsing out of the Open WebUI process entirely, eliminating this class of memory leaks.
### Embedding Engine
:::warning SentenceTransformers at Scale
The **default SentenceTransformers** embedding engine (all-MiniLM-L6-v2) loads a machine learning model into the Open WebUI process memory. While lightweight enough for personal use, at scale this model:
- **Consumes significant RAM** (~500MB+ per worker process)
- **Blocks the event loop** during embedding operations on older versions
- **Multiplies with workers** — each Uvicorn worker loads its own copy of the model
For multi-user or production deployments, **offload embeddings to an external service**.
:::
- **Recommended**: Use `RAG_EMBEDDING_ENGINE=openai` (for cloud embeddings via OpenAI, Azure, or compatible APIs) or `RAG_EMBEDDING_ENGINE=ollama` (for self-hosted embedding via Ollama with models like `nomic-embed-text`).
- **Env Var**: `RAG_EMBEDDING_ENGINE=openai`
- **Effect**: The embedding model is no longer loaded into the Open WebUI process, freeing hundreds of MB of RAM per worker.
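The arithmetic behind the warning is simple: each Uvicorn worker loads its own copy of the model, so in-process embedding memory scales linearly with worker count, while an external service pays the cost once, outside the Open WebUI process:

```python
# Per-worker RAM cost of in-process embeddings vs. an external service.
model_mb_per_worker = 500   # approx. footprint of the default embedding model
workers = 8

in_process_mb = model_mb_per_worker * workers   # every worker loads a copy
external_mb = 0                                  # model lives in the external service

assert in_process_mb == 4000   # ~4 GB just for embeddings, as noted above
assert external_mb == 0
```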
### Optimizing Document Chunking
The way your documents are chunked directly impacts both storage efficiency and retrieval quality.
@@ -359,12 +393,43 @@ If resource usage is critical, disable automated features that constantly trigge
*Target: Many concurrent users, Stability > Persistence.*
1. **Database**: **PostgreSQL** (Mandatory).
2. **Workers**: `THREAD_POOL_SIZE=2000` (Prevent timeouts).
3. **Streaming**: `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7` (Reduce CPU/Net/DB writes).
4. **Chat Saving**: `ENABLE_REALTIME_CHAT_SAVE=False`.
5. **Vector DB**: **Milvus**, **Qdrant**, or **PGVector**. **Do not use ChromaDB's default local mode** — its SQLite backend will crash under multi-worker/multi-replica access.
6. **Task Model**: External/Hosted (Offload compute).
7. **Caching**: `ENABLE_BASE_MODELS_CACHE=True`, `MODELS_CACHE_TTL=300`, `ENABLE_QUERIES_CACHE=True`.
2. **Content Extraction**: **Tika** or **Docling** (Mandatory — default pypdf leaks memory). See [Content Extraction Engine](#content-extraction-engine).
3. **Embeddings**: **External** — `RAG_EMBEDDING_ENGINE=openai` or `ollama` (Mandatory — default SentenceTransformers consumes too much RAM at scale). See [Embedding Engine](#embedding-engine).
4. **Tool Calling**: **Native Mode** (strongly recommended — Default Mode is legacy and breaks KV cache). See [Tool Calling Modes](/features/extensibility/plugin/tools#tool-calling-modes-default-vs-native).
5. **Workers**: `THREAD_POOL_SIZE=2000` (Prevent timeouts).
6. **Streaming**: `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7` (Reduce CPU/Net/DB writes).
7. **Chat Saving**: `ENABLE_REALTIME_CHAT_SAVE=False`.
8. **Vector DB**: **Milvus**, **Qdrant**, or **PGVector**. **Do not use ChromaDB's default local mode** — its SQLite backend will crash under multi-worker/multi-replica access.
9. **Task Model**: External/Hosted (Offload compute).
10. **Caching**: `ENABLE_BASE_MODELS_CACHE=True`, `MODELS_CACHE_TTL=300`, `ENABLE_QUERIES_CACHE=True`.
11. **Redis**: Single instance with `timeout 1800` and high `maxclients` (10000+). See [Redis Tuning](#redis-tuning) below.
#### Redis Tuning
A single Redis instance is sufficient for the vast majority of deployments, including those with thousands of users. **You almost certainly do not need Redis Cluster or Redis Sentinel** unless you have specific HA requirements.
Common Redis configuration issues that cause unnecessary scaling:
| Issue | Symptom | Fix |
|---|---|---|
| **Stale connections** | Redis runs out of connections or memory grows indefinitely | Set `timeout 1800` in redis.conf (kills idle connections after 30 minutes) |
| **Low maxclients** | `max number of clients reached` errors | Set `maxclients 10000` or higher |
| **No connection limits** | Open WebUI pods may accumulate connections that never close | Combine `timeout` with connection pool limits in your Redis client config |
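The two server-side fixes from the table, as a minimal `redis.conf` sketch (the values are this guide's recommendations; tune `maxclients` to your actual replica and worker count):

```
# redis.conf — single-instance tuning for Open WebUI
timeout 1800        # drop connections idle for 30 minutes
maxclients 10000    # headroom for many replicas/workers
```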
---
## ⚠️ Common Anti-Patterns
These are real-world mistakes that cause organizations to massively over-provision infrastructure:
| Anti-Pattern | What Happens | Fix |
|---|---|---|
| **Using default content extractor in production** | pypdf leaks memory → containers restart constantly → you add more replicas to compensate | Switch to Tika or Docling (`CONTENT_EXTRACTION_ENGINE=tika`) |
| **Running SentenceTransformers at scale** | Each worker loads ~500MB embedding model → RAM usage explodes → you add more machines | Use external embeddings (`RAG_EMBEDDING_ENGINE=openai` or `ollama`) |
| **Redis Cluster when single Redis suffices** | Too many replicas → too many connections → Redis can't handle them → you deploy Redis Cluster to compensate | Fix the root cause (fewer replicas, `timeout 1800`, `maxclients 10000`) |
| **Scaling replicas to mask memory leaks** | Leaky processes → OOM kills → auto-scaler adds more pods → more Redis connections → Redis overwhelmed | Fix the leaks first (content extraction, embedding engine), then right-size |
| **Using Default (prompt-based) tool calling** | Injected prompts may break KV cache → higher latency → more resources needed per request | Switch to Native Mode for all capable models |
| **Not configuring Redis stale connection timeout** | Connections accumulate forever → Redis OOM → you deploy Redis Cluster | Add `timeout 1800` to redis.conf |
---
@@ -384,6 +449,7 @@ For detailed information on all available variables, see the [Environment Config
| `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE` | [Streaming Chunk Size](/reference/env-configuration#chat_response_stream_delta_chunk_size) |
| `THREAD_POOL_SIZE` | [Thread Pool Size](/reference/env-configuration#thread_pool_size) |
| `RAG_EMBEDDING_ENGINE` | [Embedding Engine](/reference/env-configuration#rag_embedding_engine) |
| `CONTENT_EXTRACTION_ENGINE` | [Content Extraction Engine](/reference/env-configuration#content_extraction_engine) |
| `AUDIO_STT_ENGINE` | [STT Engine](/reference/env-configuration#audio_stt_engine) |
| `ENABLE_IMAGE_GENERATION` | [Image Generation](/reference/env-configuration#enable_image_generation) |
| `ENABLE_AUTOCOMPLETE_GENERATION` | [Autocomplete](/reference/env-configuration#enable_autocomplete_generation) |