By default, Open WebUI runs background tasks like title generation, tagging, and autocomplete. These can slow down your main chat model if they share the same resources.

**Recommendation**: Use a **very fast, small, and cheap NON-REASONING model** for these tasks. Avoid using large reasoning models (like o1, r1, or Claude) as they are too slow and expensive for simple background tasks.

**Configuration:**

There are two separate settings in **Admin Panel > Settings > Interface**. The system intelligently selects which one to use based on the model you are currently chatting with:

* **Task Model (External)**: Used when you are chatting with an external model (e.g., OpenAI).
* **Task Model (Local)**: Used when you are chatting with a locally hosted model (e.g., Ollama).

**Best Options (2025):**

* **External/Cloud**: `gpt-5-nano`, `gemini-2.5-flash-lite`, `llama-3.1-8b-instant` (OpenAI/Google/Groq/OpenRouter).
* **Local**: `qwen3:1b`, `gemma3:1b`, `llama3.2:3b`.

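Both settings can also be pinned at deploy time through the `TASK_MODEL` and `TASK_MODEL_EXTERNAL` environment variables listed in the reference table at the end of this page. A minimal sketch, assuming a standard Docker install; the model names are illustrative:

```bash
# A minimal sketch, assuming a standard Docker install.
# TASK_MODEL          -> used when chatting with a local (Ollama) model
# TASK_MODEL_EXTERNAL -> used when chatting with an external (API) model
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e TASK_MODEL="llama3.2:3b" \
  -e TASK_MODEL_EXTERNAL="gpt-5-nano" \
  ghcr.io/open-webui/open-webui:main
```
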
### 2. Caching & Latency Optimization

#### Model List Caching
Drastically reduces startup time and API calls to external providers.

- **Env Vars**: `ENABLE_BASE_MODELS_CACHE=True`, `MODELS_CACHE_TTL=300`
  * *Note*: Sets a 5-minute cache for external API responses.

#### Search Query Caching
Reuses the LLM-generated web search queries for RAG retrieval within the same chat turn. This prevents redundant LLM calls when multiple retrieval features act on the same user prompt.

- **Env Var**: `ENABLE_QUERIES_CACHE=True`

For example, if the LLM generates "US News 2025" as a web search query and this setting is enabled, the same query is reused for RAG instead of being regenerated, saving inference cost and API calls and thus improving performance.

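If you manage settings in a file rather than inline flags, the caching options from this section can live in an env file (a sketch; the filename is arbitrary):

```bash
# webui.env: a sketch of the caching settings from this section.
# Keep comments on their own lines, since docker's --env-file does not
# strip trailing comments from values.

# Cache the model list fetched from external providers:
ENABLE_BASE_MODELS_CACHE=True
# Cache external API responses for 5 minutes (value in seconds):
MODELS_CACHE_TTL=300
# Reuse LLM-generated search queries within the same chat turn:
ENABLE_QUERIES_CACHE=True
```

Load it with `docker run -d -p 3000:8080 --env-file webui.env -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:main`.
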
---

## 📦 Database Optimization

For high-scale deployments, your database configuration is the single most critical factor for stability.

### PostgreSQL (Mandatory for Scale)
For any multi-user or high-concurrency setup, **PostgreSQL is mandatory**. SQLite (the default) is not designed for high concurrency and will become a bottleneck (database locking errors).

- **Variable**: `DATABASE_URL`
- **Example**: `postgresql://user:password@localhost:5432/webui`

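A sketch of wiring this up with a Postgres container on a shared Docker network; the service names, credentials, and volumes are placeholders:

```bash
# A sketch: run Postgres alongside Open WebUI and point DATABASE_URL at it.
# Credentials and names are placeholders; change them for real deployments.
docker network create webui-net

docker run -d --name postgres --network webui-net \
  -e POSTGRES_USER=webui -e POSTGRES_PASSWORD=secret -e POSTGRES_DB=webui \
  -v pgdata:/var/lib/postgresql/data \
  postgres:16

docker run -d --name open-webui --network webui-net -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e DATABASE_URL="postgresql://webui:secret@postgres:5432/webui" \
  ghcr.io/open-webui/open-webui:main
```
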
### Chat Saving Strategy
By default, Open WebUI saves chats in **real-time**. This ensures no data loss but creates massive database write pressure, because *every single chunk* of text received from the LLM triggers a database update.

- **Env Var**: `ENABLE_REALTIME_CHAT_SAVE=False`
- **Effect**: Chats are saved only when the generation is complete (or periodically).
- **Recommendation**: **Highly Recommended** for any high-user setup to reduce DB load substantially.
- **Note**: For single-user local setups, you may leave this `True` to prevent data loss during crashes.

### Vector Database (RAG)
For multi-user setups, the choice of Vector DB matters.

- **ChromaDB**: **NOT RECOMMENDED** for multi-user environments due to performance limitations and locking issues. Acceptable for standalone local instances.
- **Recommendations**:
  * **Milvus** or **Qdrant**: Best for improved scale and performance.
  * **PGVector**: Excellent choice if you are already using PostgreSQL.
- **Multitenancy**: If using Milvus or Qdrant, enabling multitenancy offers better resource sharing.
  * `ENABLE_MILVUS_MULTITENANCY_MODE=True`
  * `ENABLE_QDRANT_MULTITENANCY_MODE=True`

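As a concrete illustration, a sketch of running Qdrant next to Open WebUI with multitenancy enabled; the `VECTOR_DB` and `QDRANT_URI` variable names are assumed from the [Environment Configuration](/getting-started/env-configuration) guide, and the network/volume names are placeholders:

```bash
# A sketch: Qdrant as the vector store, with multitenancy for better
# resource sharing across users. Names and ports are placeholders.
docker run -d --name qdrant --network webui-net \
  -v qdrant-data:/qdrant/storage \
  qdrant/qdrant

docker run -d --name open-webui --network webui-net -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e VECTOR_DB=qdrant \
  -e QDRANT_URI="http://qdrant:6333" \
  -e ENABLE_QDRANT_MULTITENANCY_MODE=True \
  ghcr.io/open-webui/open-webui:main
```
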
---

## ⚡ High-Concurrency & Network Optimization

For setups with many simultaneous users, these settings are crucial to prevent bottlenecks.

#### Batch Streaming Tokens
By default, Open WebUI streams *every single token* arriving from the LLM. High-frequency streaming increases network IO and CPU usage on the server. If real-time saving is enabled, it also destroys database performance (you can disable it with `ENABLE_REALTIME_CHAT_SAVE=False`).

Increasing the chunk size buffers these updates, sending them to the client in larger groups. The only downside is a slightly choppier UI experience when streaming the response, but it can make a big difference in performance.

- **Env Var**: `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7`
  * *Recommendation*: Set to **5-10** for high-concurrency instances.

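A sketch combining the two streaming-related settings from this section for a busy instance:

```bash
# A sketch for high-concurrency instances: flush tokens to clients in
# batches of ~7 and defer chat saves until generation completes.
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7 \
  -e ENABLE_REALTIME_CHAT_SAVE=False \
  ghcr.io/open-webui/open-webui:main
```
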
### 1. Offload Auxiliary Models (Local Deployments Only)

Open WebUI loads local ML models for features like RAG and STT. **This section is only relevant if you are running models LOCALLY.**

#### RAG Embeddings
- **Low-Spec Recommendation**:
  * **Option A (Easiest)**: Keep the default **SentenceTransformers** (all-MiniLM-L6-v2). It is lightweight, runs on CPU, and is significantly more efficient than running a full Ollama instance on the same Raspberry Pi.
  * **Option B (Best Performance)**: Use an **External API** (OpenAI/Cloud), as in the sketch below.
  * **Avoid**: Do NOT run Ollama for embeddings on the same low-spec device as Open WebUI; it will kill performance.

- **Configuration**:
  * **Admin Panel**: `Settings > Documents > Embedding Model Engine`
  * **Env Var**: `RAG_EMBEDDING_ENGINE`

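A sketch of Option B, offloading embeddings to an external API; the `RAG_OPENAI_API_BASE_URL` and `RAG_OPENAI_API_KEY` variable names are assumptions to verify against the [Environment Configuration](/getting-started/env-configuration) guide, and the key is a placeholder:

```bash
# Option B sketch: external embeddings, so the low-spec device never
# loads a local embedding model. Variable names below are assumptions.
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e RAG_EMBEDDING_ENGINE=openai \
  -e RAG_OPENAI_API_BASE_URL="https://api.openai.com/v1" \
  -e RAG_OPENAI_API_KEY="sk-..." \
  ghcr.io/open-webui/open-webui:main
```
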
---

## 🚀 Recommended Configuration Profiles

### Profile 1: Maximum Privacy (Weak Hardware/RPi)

### Profile 2
*Target: Max Quality & Speed, Local + External APIs.*

1. **Embeddings**: `RAG_EMBEDDING_ENGINE=openai` (or `ollama` with `nomic-embed-text` on a fast server).
2. **Task Model**: `gpt-5-nano` or `llama-3.1-8b-instant`.
3. **Caching**: `MODELS_CACHE_TTL=300`.
4. **Database**: `ENABLE_REALTIME_CHAT_SAVE=True` (Persistence is usually preferred over raw write speed here).
5. **Vector DB**: PGVector (recommended) or ChromaDB.

### Profile 3: High Scale / Enterprise
*Target: Many concurrent users, Stability > Persistence.*

5. **Vector DB**: **Milvus**, **Qdrant**, or **PGVector**. **Avoid ChromaDB.**
6. **Task Model**: External/Hosted (Offload compute).
7. **Caching**: `ENABLE_BASE_MODELS_CACHE=True`, `MODELS_CACHE_TTL=300`, `ENABLE_QUERIES_CACHE=True`.

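Pulling the Profile 3 settings shown above into one place, a sketch of a single high-scale deployment command; the database and vector-store endpoints reuse the placeholders from the earlier sketches:

```bash
# Profile 3 sketch: stability-first settings for many concurrent users.
# Endpoints and credentials are placeholders from the earlier sketches.
docker run -d --name open-webui --network webui-net -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e DATABASE_URL="postgresql://webui:secret@postgres:5432/webui" \
  -e ENABLE_REALTIME_CHAT_SAVE=False \
  -e CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7 \
  -e VECTOR_DB=qdrant \
  -e QDRANT_URI="http://qdrant:6333" \
  -e ENABLE_QDRANT_MULTITENANCY_MODE=True \
  -e ENABLE_BASE_MODELS_CACHE=True \
  -e MODELS_CACHE_TTL=300 \
  -e ENABLE_QUERIES_CACHE=True \
  ghcr.io/open-webui/open-webui:main
```
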
---

## 🔗 Environment Variable References

For detailed information on all available variables, see the [Environment Configuration](/getting-started/env-configuration) guide.

| Variable | Description & Link |
| :--- | :--- |
| `TASK_MODEL` | [Task Model (Local)](/getting-started/env-configuration#task_model) |
| `TASK_MODEL_EXTERNAL` | [Task Model (External)](/getting-started/env-configuration#task_model_external) |
| `ENABLE_BASE_MODELS_CACHE` | [Cache Model List](/getting-started/env-configuration#enable_base_models_cache) |
| `MODELS_CACHE_TTL` | [Model Cache TTL](/getting-started/env-configuration#models_cache_ttl) |
| `ENABLE_QUERIES_CACHE` | [Queries Cache](/getting-started/env-configuration#enable_queries_cache) |
| `DATABASE_URL` | [Database URL](/getting-started/env-configuration#database_url) |
| `ENABLE_REALTIME_CHAT_SAVE` | [Realtime Chat Save](/getting-started/env-configuration#enable_realtime_chat_save) |
| `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE` | [Streaming Chunk Size](/getting-started/env-configuration#chat_response_stream_delta_chunk_size) |
| `THREAD_POOL_SIZE` | [Thread Pool Size](/getting-started/env-configuration#thread_pool_size) |
| `RAG_EMBEDDING_ENGINE` | [Embedding Engine](/getting-started/env-configuration#rag_embedding_engine) |
| `AUDIO_STT_ENGINE` | [STT Engine](/getting-started/env-configuration#audio_stt_engine) |
| `ENABLE_IMAGE_GENERATION` | [Image Generation](/getting-started/env-configuration#enable_image_generation) |
| `ENABLE_AUTOCOMPLETE_GENERATION` | [Autocomplete](/getting-started/env-configuration#enable_autocomplete_generation) |