diff --git a/docs/features/extensibility/plugin/tools/index.mdx b/docs/features/extensibility/plugin/tools/index.mdx
index c9a87af9..36d94bce 100644
--- a/docs/features/extensibility/plugin/tools/index.mdx
+++ b/docs/features/extensibility/plugin/tools/index.mdx
@@ -89,14 +89,22 @@ You can also let your LLM auto-select the right Tools using the [**AutoTool Filt
 Open WebUI offers two distinct ways for models to interact with tools: a standard **Default Mode** and a high-performance **Native Mode (Agentic Mode)**. Choosing the right mode depends on your model's capabilities and your performance requirements.
 
-### 🟡 Default Mode (Prompt-based)
+### 🟡 Default Mode (Prompt-based) — Legacy
+
+:::warning Legacy Mode
+Default Mode is maintained purely for **backwards compatibility** with older or smaller models that lack native function-calling support. It is considered **legacy** and should not be used when your model supports native tool calling. New deployments should use **Native Mode** exclusively.
+:::
+
 In Default Mode, Open WebUI manages tool selection by injecting a specific prompt template that guides the model to output a tool request.
 
 - **Compatibility**: Works with **practically any model**, including older or smaller local models that lack native function-calling support.
 - **Flexibility**: Highly customizable via prompt templates.
-- **Caveat**: Can be slower (requires extra tokens) and less reliable for complex, multi-step tool chaining.
+- **Caveats**:
+  - Can be slower (requires extra tokens) and less reliable for complex, multi-step tool chaining.
+  - **Breaks KV cache**: The injected prompt changes every turn, preventing LLM engines from reusing cached key-value pairs. This increases latency and cost for every message in the conversation.
+  - Does not support built-in system tools (memory, notes, channels, etc.).
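The "Breaks KV cache" caveat is easiest to see with a toy example. The sketch below is hypothetical (the `[TOOLS]` template format and tool names are invented for illustration, not Open WebUI's actual prompt template): it emulates a Default-Mode-style template that re-injects the tool list ahead of the chat history, so any change to the enabled tools invalidates the cached prefix for every token after the injection point.

```python
import json
import os.path

def build_prompt(system: str, tools: list, history: list) -> str:
    # Hypothetical prompt-based template: the tool spec is injected between
    # the system prompt and the growing chat history.
    spec = json.dumps(tools, sort_keys=True)
    return f"{system}\n[TOOLS]{spec}[/TOOLS]\n" + "\n".join(history)

# Turn 1: one tool enabled. Turn 2: a second tool has been toggled on.
p1 = build_prompt("You are helpful.", [{"name": "web_search"}], ["User: hi"])
p2 = build_prompt(
    "You are helpful.",
    [{"name": "web_search"}, {"name": "get_weather"}],
    ["User: hi", "Assistant: hello!", "User: weather in Paris?"],
)

# An inference engine can only reuse KV-cache entries for the shared prefix.
# The injected spec changed, so the reusable prefix ends before the history:
# every earlier message must be re-processed from scratch on turn 2.
shared = os.path.commonprefix([p1, p2])
print("User: hi" in shared)  # False: none of the history is cache-reusable
```

In Native Mode the tool list travels as a structured `tools` parameter instead of prompt text, so the message prefix (and its KV cache) survives tool changes.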
 
-### 🟢 Native Mode (Agentic Mode / System Function Calling)
-Native Mode (also called **Agentic Mode**) leverages the model's built-in capability to handle tool definitions and return structured tool calls (JSON). This is the **recommended mode** for high-performance agentic workflows.
+### 🟢 Native Mode (Agentic Mode / System Function Calling) — Recommended
+Native Mode (also called **Agentic Mode**) leverages the model's built-in capability to handle tool definitions and return structured tool calls (JSON). This is the **recommended mode** for all models that support it — which includes the vast majority of modern models (2024+).
 
 :::warning Model Quality Matters
 **Agentic tool calling requires high-quality models to work reliably.** While small local models may technically support function calling, they often struggle with the complex reasoning required for multi-step tool usage. For best results, use frontier models like **GPT-5**, **Claude 4.5 Sonnet**, **Gemini 3 Flash**, or **MiniMax M2.5**. Small local models may produce malformed JSON or fail to follow the strict state management required for agentic behavior.
@@ -104,9 +112,11 @@ Native Mode (also called **Agentic Mode**) leverages the model's built-in capabi
 #### Why use Native Mode (Agentic Mode)?
 
 - **Speed & Efficiency**: Lower latency as it avoids bulky prompt-based tool selection.
+- **KV Cache Friendly**: Tool definitions are sent as structured parameters (not injected into the prompt), so they don't invalidate the KV cache between turns. This can significantly reduce latency and token costs.
 - **Reliability**: Higher accuracy in following tool schemas (with quality models).
 - **Multi-step Chaining**: Essential for **Agentic Research** and **Interleaved Thinking** where a model needs to call multiple tools in succession.
 - **Autonomous Decision-Making**: Models can decide when to search, which tools to use, and how to combine results.
+- **System Tools**: Only Native Mode unlocks the [built-in system tools](#built-in-system-tools-nativeagentic-mode) (memory, notes, knowledge, channels, etc.).
 
 #### How to Enable Native Mode (Agentic Mode)
 Native Mode can be enabled at two levels:
@@ -164,11 +174,14 @@ These models excel at multi-step reasoning, proper JSON formatting, and autonomo
 **This is a DeepSeek model/API issue**, not an Open WebUI issue. Open WebUI correctly sends tools in standard OpenAI format — the malformed output originates from DeepSeek's non-standard internal format.
 :::
 
-| Feature | Default Mode | Native Mode |
+| Feature | Default Mode (Legacy) | Native Mode (Recommended) |
 |:---|:---|:---|
+| **Status** | Legacy / backwards compat | ✅ Recommended |
 | **Latency** | Medium/High | Low |
+| **KV Cache** | ❌ Can break cache | ✅ Cache-friendly |
 | **Model Compatibility** | Universal | Requires Tool-Calling Support |
 | **Logic** | Prompt-based (Open WebUI) | Model-native (API/Ollama) |
+| **System Tools** | ❌ Not available | ✅ Full access |
 | **Complex Chaining** | ⚠️ Limited | ✅ Excellent |
 
 ### Built-in System Tools (Native/Agentic Mode)
diff --git a/docs/getting-started/advanced-topics/scaling.md b/docs/getting-started/advanced-topics/scaling.md
index ce2fba4d..d1c44338 100644
--- a/docs/getting-started/advanced-topics/scaling.md
+++ b/docs/getting-started/advanced-topics/scaling.md
@@ -75,6 +75,7 @@ ENABLE_WEBSOCKET_SUPPORT=true
 - If you're using Redis Sentinel for high availability, also set `REDIS_SENTINEL_HOSTS` and consider setting `REDIS_SOCKET_CONNECT_TIMEOUT=5` to prevent hangs during failover.
 - For AWS Elasticache or other managed Redis Cluster services, set `REDIS_CLUSTER=true`.
 - Make sure your Redis server has `timeout 1800` and a high enough `maxclients` (10000+) to prevent connection exhaustion over time.
+- A **single Redis instance** is sufficient for the vast majority of deployments, even with thousands of users. You almost certainly do not need Redis Cluster unless you have specific HA/bandwidth requirements. If you think you need Redis Cluster, first check whether your connection count and memory usage are caused by fixable configuration issues (see [Common Anti-Patterns](/troubleshooting/performance#%EF%B8%8F-common-anti-patterns)).
 - Without Redis in a multi-instance setup, you will experience [WebSocket 403 errors](/troubleshooting/multi-replica#2-websocket-403-errors--connection-failures), [configuration sync issues](/troubleshooting/multi-replica#3-model-not-found-or-configuration-mismatch), and intermittent authentication failures.
 
 For a complete step-by-step Redis setup (Docker Compose, Sentinel, Cluster mode, verification), see the [Redis WebSocket Support](/tutorials/integrations/redis) tutorial. For WebSocket and CORS issues behind reverse proxies, see [Connection Errors](/troubleshooting/connection-error#-https-tls-cors--websocket-issues).
@@ -232,7 +233,40 @@ Each provider has its own set of environment variables for credentials and bucke
 
 ---
 
-## Step 6 — Add Observability
+## Step 6 — Fix Content Extraction & Embeddings
+
+**When:** You process documents regularly (RAG, knowledge bases) and are running in production.
+
+:::danger These Defaults Cause Memory Leaks at Scale
+The default content extraction engine (pypdf) and default embedding engine (SentenceTransformers) are the **two most common causes of memory leaks** in production Open WebUI deployments. Fixing these is just as important as switching to PostgreSQL or adding Redis.
+:::
+
+**What to do:**
+
+1. **Switch the content extraction engine** to an external service:
+
+```
+CONTENT_EXTRACTION_ENGINE=tika
+TIKA_SERVER_URL=http://tika:9998
+```
+
+2. **Switch the embedding engine** to an external provider:
+
+```
+RAG_EMBEDDING_ENGINE=openai
+# or for self-hosted:
+RAG_EMBEDDING_ENGINE=ollama
+```
+
+**Key things to know:**
+
+- The default content extractor (pypdf) has **known, unavoidable memory leaks** that cause the Open WebUI process's memory usage to grow continuously. An external extractor (Tika, Docling) runs in its own process/container, isolating these leaks.
+- The default SentenceTransformers embedding model loads ~500MB per worker process. With 8 workers, that's 4GB of RAM just for embeddings. External embedding eliminates this.
+- For detailed guidance and configuration options, see [Content Extraction Engine](/troubleshooting/performance#content-extraction-engine) and [Embedding Engine](/troubleshooting/performance#embedding-engine) in the Performance guide.
+
+---
+
+## Step 7 — Add Observability
 
 **When:** You want to monitor performance, troubleshoot issues, and understand how your deployment is behaving at scale.
 
@@ -300,6 +334,14 @@ ENABLE_WEBSOCKET_SUPPORT=true
 # S3_BUCKET_NAME=my-openwebui-bucket
 # S3_REGION_NAME=us-east-1
 
+# Content Extraction (do NOT use default pypdf in production)
+CONTENT_EXTRACTION_ENGINE=tika
+TIKA_SERVER_URL=http://tika:9998
+
+# Embeddings (do NOT use default SentenceTransformers at scale)
+RAG_EMBEDDING_ENGINE=openai
+# or: RAG_EMBEDDING_ENGINE=ollama
+
 # Workers (let orchestrator scale, keep workers at 1)
 UVICORN_WORKERS=1
 
@@ -311,13 +353,13 @@ ENABLE_DB_MIGRATIONS=false
 
 ## Quick Reference: When Do I Need What?
 
-| Scenario | PostgreSQL | Redis | External Vector DB | Shared Storage |
-|---|:---:|:---:|:---:|:---:|
-| Single user / evaluation | ✗ | ✗ | ✗ | ✗ |
-| Small team (< 50 users, single instance) | Recommended | ✗ | ✗ | ✗ |
-| Multiple Uvicorn workers | **Required** | **Required** | **Required** | ✗ (same filesystem) |
-| Multiple instances / HA | **Required** | **Required** | **Required** | **Optional** (NFS or S3) |
-| Large scale (1000+ users) | **Required** | **Required** | **Required** | **Optional** (NFS or S3) |
+| Scenario | PostgreSQL | Redis | External Vector DB | Ext. Content Extraction | Ext. Embeddings | Shared Storage |
+|---|:---:|:---:|:---:|:---:|:---:|:---:|
+| Single user / evaluation | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
+| Small team (< 50 users, single instance) | Recommended | ✗ | ✗ | Recommended | ✗ | ✗ |
+| Multiple Uvicorn workers | **Required** | **Required** | **Required** | **Strongly Recommended** | **Strongly Recommended** | ✗ (same filesystem) |
+| Multiple instances / HA | **Required** | **Required** | **Required** | **Strongly Recommended** | **Strongly Recommended** | **Optional** (NFS or S3) |
+| Large scale (1000+ users) | **Required** | **Required** | **Required** | **Strongly Recommended** | **Strongly Recommended** | **Optional** (NFS or S3) |
 
 :::note About "External Vector DB"
 The default ChromaDB uses a local SQLite backend that crashes under multi-process access. "External Vector DB" means either a client-server database (PGVector, Milvus, Qdrant, Pinecone) or ChromaDB running as a separate HTTP server. See [Step 4](#step-4--switch-to-an-external-vector-database) for details.
diff --git a/docs/troubleshooting/performance.md b/docs/troubleshooting/performance.md
index 62eecd00..5b7cc77b 100644
--- a/docs/troubleshooting/performance.md
+++ b/docs/troubleshooting/performance.md
@@ -134,6 +134,40 @@ For multi-user setups, the choice of Vector DB matters.
 * `ENABLE_MILVUS_MULTITENANCY_MODE=True`
 * `ENABLE_QDRANT_MULTITENANCY_MODE=True`
 
+### Content Extraction Engine
+
+:::danger Default Content Extractor Causes Memory Leaks
+The **default content extraction engine** uses Python libraries including **pypdf**, which are known to have **persistent memory leaks** during document ingestion. In production deployments with regular document uploads, this will cause Open WebUI's memory usage to grow continuously until the process is killed or the container is restarted.
+
+This is the **#1 cause of unexplained memory growth** in production deployments.
+:::
+
+**Recommendation**: Switch to an external content extraction engine for any deployment that processes documents regularly:
+
+| Engine | Best For | Configuration |
+|---|---|---|
+| **Apache Tika** | General-purpose, widely used, handles most document types | `CONTENT_EXTRACTION_ENGINE=tika` + `TIKA_SERVER_URL=http://tika:9998` |
+| **Docling** | High-quality extraction with layout-aware parsing | `CONTENT_EXTRACTION_ENGINE=docling` |
+| **External Loader** | Recommended for production and custom extraction pipelines | `CONTENT_EXTRACTION_ENGINE=external` + `EXTERNAL_DOCUMENT_LOADER_URL=...` |
+
+Using an external extractor moves the memory-intensive parsing out of the Open WebUI process entirely, eliminating this class of memory leaks.
+
+### Embedding Engine
+
+:::warning SentenceTransformers at Scale
+The **default SentenceTransformers** embedding engine (all-MiniLM-L6-v2) loads a machine learning model into the Open WebUI process memory. While lightweight enough for personal use, at scale this model:
+
+- **Consumes significant RAM** (~500MB+ per worker process)
+- **Blocks the event loop** during embedding operations on older Open WebUI versions
+- **Multiplies with workers** — each Uvicorn worker loads its own copy of the model
+
+For multi-user or production deployments, **offload embeddings to an external service**.
+:::
+
+- **Recommended**: Use `RAG_EMBEDDING_ENGINE=openai` (for cloud embeddings via OpenAI, Azure, or compatible APIs) or `RAG_EMBEDDING_ENGINE=ollama` (for self-hosted embedding via Ollama with models like `nomic-embed-text`).
+- **Env Var**: `RAG_EMBEDDING_ENGINE=openai`
+- **Effect**: The embedding model is no longer loaded into the Open WebUI process, freeing hundreds of MB of RAM per worker.
+
 ### Optimizing Document Chunking
 
 The way your documents are chunked directly impacts both storage efficiency and retrieval quality.
@@ -359,12 +393,43 @@ If resource usage is critical, disable automated features that constantly trigge
 
 *Target: Many concurrent users, Stability > Persistence.*
 
 1. **Database**: **PostgreSQL** (Mandatory).
-2. **Workers**: `THREAD_POOL_SIZE=2000` (Prevent timeouts).
-3. **Streaming**: `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7` (Reduce CPU/Net/DB writes).
-4. **Chat Saving**: `ENABLE_REALTIME_CHAT_SAVE=False`.
-5. **Vector DB**: **Milvus**, **Qdrant**, or **PGVector**. **Do not use ChromaDB's default local mode** — its SQLite backend will crash under multi-worker/multi-replica access.
-6. **Task Model**: External/Hosted (Offload compute).
-7. **Caching**: `ENABLE_BASE_MODELS_CACHE=True`, `MODELS_CACHE_TTL=300`, `ENABLE_QUERIES_CACHE=True`.
+2. **Content Extraction**: **Tika** or **Docling** (Mandatory — default pypdf leaks memory). See [Content Extraction Engine](#content-extraction-engine).
+3. **Embeddings**: **External** — `RAG_EMBEDDING_ENGINE=openai` or `ollama` (Mandatory — default SentenceTransformers consumes too much RAM at scale). See [Embedding Engine](#embedding-engine).
+4. **Tool Calling**: **Native Mode** (strongly recommended — Default Mode is legacy and breaks KV cache). See [Tool Calling Modes](/features/extensibility/plugin/tools#tool-calling-modes-default-vs-native).
+5. **Workers**: `THREAD_POOL_SIZE=2000` (Prevent timeouts).
+6. **Streaming**: `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7` (Reduce CPU/Net/DB writes).
+7. **Chat Saving**: `ENABLE_REALTIME_CHAT_SAVE=False`.
+8. **Vector DB**: **Milvus**, **Qdrant**, or **PGVector**. **Do not use ChromaDB's default local mode** — its SQLite backend will crash under multi-worker/multi-replica access.
+9. **Task Model**: External/Hosted (Offload compute).
+10. **Caching**: `ENABLE_BASE_MODELS_CACHE=True`, `MODELS_CACHE_TTL=300`, `ENABLE_QUERIES_CACHE=True`.
+11. **Redis**: Single instance with `timeout 1800` and high `maxclients` (10000+). See [Redis Tuning](#redis-tuning) below.
+
+#### Redis Tuning
+
+A single Redis instance is sufficient for the vast majority of deployments, including those with thousands of users. **You almost certainly do not need Redis Cluster or Redis Sentinel** unless you have specific HA requirements.
+
+Common Redis configuration issues that cause unnecessary scaling:
+
+| Issue | Symptom | Fix |
+|---|---|---|
+| **Stale connections** | Redis runs out of connections or memory grows indefinitely | Set `timeout 1800` in redis.conf (kills idle connections after 30 minutes) |
+| **Low maxclients** | `max number of clients reached` errors | Set `maxclients 10000` or higher |
+| **No connection limits** | Open WebUI pods may accumulate connections that never close | Combine `timeout` with connection pool limits in your Redis client config |
+
+---
+
+## ⚠️ Common Anti-Patterns
+
+These are real-world mistakes that cause organizations to massively over-provision infrastructure:
+
+| Anti-Pattern | What Happens | Fix |
+|---|---|---|
+| **Using default content extractor in production** | pypdf leaks memory → containers restart constantly → you add more replicas to compensate | Switch to Tika or Docling (`CONTENT_EXTRACTION_ENGINE=tika`) |
+| **Running SentenceTransformers at scale** | Each worker loads ~500MB embedding model → RAM usage explodes → you add more machines | Use external embeddings (`RAG_EMBEDDING_ENGINE=openai` or `ollama`) |
+| **Redis Cluster when single Redis suffices** | Too many replicas → too many connections → Redis can't handle them → you deploy Redis Cluster to compensate | Fix the root cause (fewer replicas, `timeout 1800`, `maxclients 10000`) |
+| **Scaling replicas to mask memory leaks** | Leaky processes → OOM kills → auto-scaler adds more pods → more Redis connections → Redis overwhelmed | Fix the leaks first (content extraction, embedding engine), then right-size |
+| **Using Default (prompt-based) tool calling** | Injected prompts may break KV cache → higher latency → more resources needed per request | Switch to Native Mode for all capable models |
+| **Not configuring Redis stale connection timeout** | Connections accumulate forever → Redis OOM → you deploy Redis Cluster | Add `timeout 1800` to redis.conf |
 
 ---
 
@@ -384,6 +449,7 @@ For detailed information on all available variables, see the [Environment Config
 | `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE` | [Streaming Chunk Size](/reference/env-configuration#chat_response_stream_delta_chunk_size) |
 | `THREAD_POOL_SIZE` | [Thread Pool Size](/reference/env-configuration#thread_pool_size) |
 | `RAG_EMBEDDING_ENGINE` | [Embedding Engine](/reference/env-configuration#rag_embedding_engine) |
+| `CONTENT_EXTRACTION_ENGINE` | [Content Extraction Engine](/reference/env-configuration#content_extraction_engine) |
 | `AUDIO_STT_ENGINE` | [STT Engine](/reference/env-configuration#audio_stt_engine) |
 | `ENABLE_IMAGE_GENERATION` | [Image Generation](/reference/env-configuration#enable_image_generation) |
 | `ENABLE_AUTOCOMPLETE_GENERATION` | [Autocomplete](/reference/env-configuration#enable_autocomplete_generation) |
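The anti-patterns above are mechanical enough to check automatically. The sketch below is illustrative only: the `lint_scaling_config` helper and its rules are this guide's recommendations encoded by hand, not an Open WebUI API, and it assumes an unset engine variable means the built-in default is in use.

```python
def lint_scaling_config(env: dict) -> list:
    """Warn about the anti-patterns described above. Illustrative helper:
    an unset engine variable is assumed to mean the built-in default."""
    warnings = []
    if env.get("CONTENT_EXTRACTION_ENGINE") in (None, ""):
        warnings.append("Default content extractor (pypdf): switch to tika or docling.")
    if env.get("RAG_EMBEDDING_ENGINE") in (None, ""):
        warnings.append("Default SentenceTransformers embeddings: use openai or ollama.")
    if int(env.get("UVICORN_WORKERS", "1")) > 1:
        warnings.append("Multiple Uvicorn workers: scale via replicas/instances instead.")
    return warnings

# A config that still relies on the defaults this guide warns about:
issues = lint_scaling_config({"UVICORN_WORKERS": "4"})
for issue in issues:
    print(issue)  # all three warnings fire
```

Running the same check against a fixed config (`CONTENT_EXTRACTION_ENGINE=tika`, `RAG_EMBEDDING_ENGINE=openai`, one worker) returns no warnings.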