folders / projects and missing env vars

This commit is contained in:
DrMelone
2025-12-28 18:13:29 +01:00
parent 4301e8a95b
commit c0601b5277
8 changed files with 369 additions and 419 deletions


@@ -1,205 +0,0 @@
---
sidebar_position: 12
title: "Improve Local LLM Performance with Dedicated Task Models"
---
## Improve Performance with Dedicated Task Models
Open-WebUI provides several automated features—such as title generation, tag creation, autocomplete, and search query generation—to enhance the user experience. However, these features can generate multiple simultaneous requests to your local model, which may impact performance on systems with limited resources.
This guide explains how to optimize your setup by configuring a dedicated, lightweight task model or by selectively disabling automation features, ensuring that your primary chat functionality remains responsive and efficient.
---
:::tip
## Why Does Open-WebUI Feel Slow?
By default, Open-WebUI has several background tasks that can make it feel like magic but can also place a heavy load on local resources:
- **Title Generation**
- **Tag Generation**
- **Autocomplete Generation** (this function triggers on every keystroke)
- **Search Query Generation**
Each of these features makes asynchronous requests to your model. For example, continuous calls from the autocomplete feature can significantly delay responses on devices with limited memory or processing power, such as a Mac with 32GB of RAM running a 32B quantized model.
Optimizing the task model can help isolate these background tasks from your main chat application, improving overall responsiveness.
:::
---
## ⚡ How to Optimize Task Model Performance
Follow these steps to configure an efficient task model:
### Step 1: Access the Admin Panel
1. Open Open-WebUI in your browser.
2. Navigate to the **Admin Panel**.
3. Click on **Settings** in the sidebar.
### Step 2: Configure the Task Model
1. Go to **Interface > Set Task Model**.
2. Choose one of the following options based on your needs:
- **Lightweight Local Model (Recommended)**
- Select a compact model such as **Llama 3.2 3B** or **Qwen2.5 3B**.
- These models offer rapid responses while consuming minimal system resources.
- **Hosted API Endpoint (For Maximum Speed)**
- Connect to a hosted API service to handle task processing.
- This can be very cheap. For example, OpenRouter offers Llama and Qwen models at less than **1.5 cents per million input tokens**.
:::tip OpenRouter Recommendation
When using **OpenRouter**, we highly recommend configuring the **Model IDs (Allowlist)** in the connection settings. Importing thousands of models can clutter your selector and degrade admin panel performance.
:::
- **Disable Unnecessary Automation**
- If certain automated features are not required, disable them to reduce extraneous background calls—especially features like autocomplete.
![Local Model Configuration Set to Qwen2.5:3b](/images/tutorials/tips/set-task-model.png)
### Step 3: Save Your Changes and Test
1. Save the new configuration.
2. Interact with your chat interface and observe the responsiveness.
3. If necessary, adjust by further disabling unused automation features or experimenting with different task models.
---
## 🚀 Recommended Setup for Local Models
| Optimization Strategy | Benefit | Recommended For |
|---------------------------------|------------------------------------------|----------------------------------------|
| **Lightweight Local Model** | Minimizes resource usage | Systems with limited hardware |
| **Hosted API Endpoint** | Offers the fastest response times | Users with reliable internet/API access|
| **Disable Automation Features** | Maximizes performance by reducing load | Those focused on core chat functionality|
Implementing these recommendations can greatly improve the responsiveness of Open-WebUI while allowing your local models to efficiently handle chat interactions.
---
## ⚙️ Environment Variables for Performance
You can also configure performance-related settings via environment variables. Add these to your Docker Compose file or `.env` file.
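If you start Open-WebUI with `docker run` rather than Docker Compose, the same variables can be passed as `-e` flags. A sketch using the standard quick-start image and port mapping (substitute your own values):

```shell
# Pass performance-related env vars directly to the container.
# Image name, port mapping, and volume follow the quick-start defaults;
# the variable values here are illustrative.
docker run -d -p 3000:8080 \
  -e TASK_MODEL=llama3.2:3b \
  -e ENABLE_AUTOCOMPLETE_GENERATION=False \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```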
:::tip
Many of these settings can also be configured directly in the **Admin Panel > Settings** interface. Environment variables are useful for initial deployment configuration or when managing settings across multiple instances.
:::
### Task Model Configuration
Set a dedicated lightweight model for background tasks:
```bash
# For Ollama models
TASK_MODEL=llama3.2:3b
# For OpenAI-compatible endpoints
TASK_MODEL_EXTERNAL=gpt-4o-mini
```
### Disable Unnecessary Features
```bash
# Disable automatic title generation
ENABLE_TITLE_GENERATION=False
# Disable follow-up question suggestions
ENABLE_FOLLOW_UP_GENERATION=False
# Disable autocomplete suggestions (triggers on every keystroke - high impact!)
ENABLE_AUTOCOMPLETE_GENERATION=False
# Disable automatic tag generation
ENABLE_TAGS_GENERATION=False
# Disable search query generation for RAG (if not using web search)
ENABLE_SEARCH_QUERY_GENERATION=False
# Disable retrieval query generation
ENABLE_RETRIEVAL_QUERY_GENERATION=False
```
### Enable Caching and Optimization
```bash
# Cache model list responses (seconds) - reduces API calls
MODELS_CACHE_TTL=300
# Cache LLM-generated search queries - eliminates duplicate LLM calls when both web search and RAG are active
ENABLE_QUERIES_CACHE=True
# Convert base64 images to file URLs - reduces response size and database strain
ENABLE_CHAT_RESPONSE_BASE64_IMAGE_URL_CONVERSION=True
# Batch streaming tokens to reduce CPU load (recommended: 5-10 for high concurrency)
CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=5
# Enable gzip compression for HTTP responses (enabled by default)
ENABLE_COMPRESSION_MIDDLEWARE=True
```
### Database and Persistence
```bash
# Disable real-time chat saving for better performance (trades off data persistence)
ENABLE_REALTIME_CHAT_SAVE=False
```
### Network Timeouts
```bash
# Request timeout in seconds (default: 300) - raise this if slow models time out
AIOHTTP_CLIENT_TIMEOUT=600
# Faster timeout for model list fetching (default: 10 seconds)
AIOHTTP_CLIENT_TIMEOUT_MODEL_LIST=5
```
### RAG Performance
```bash
# Embed documents in larger batches for faster processing (requires sufficient resources)
RAG_EMBEDDING_BATCH_SIZE=100
```
### High Concurrency Settings
For larger instances with many concurrent users:
```bash
# Increase thread pool size (default is 40)
THREAD_POOL_SIZE=500
```
:::info
For a complete list of environment variables, see the [Environment Variable Configuration](/getting-started/env-configuration) documentation.
:::
---
## 💡 Additional Tips
- **Monitor System Resources:** Use your operating system's tools (such as Activity Monitor on macOS or Task Manager on Windows) to keep an eye on resource usage.
- **Reduce Parallel Model Calls:** Limiting background automation prevents simultaneous requests from overwhelming your LLM.
- **Experiment with Configurations:** Test different lightweight models or hosted endpoints to find the optimal balance between speed and functionality.
- **Stay Updated:** Regular updates to Open-WebUI often include performance improvements and bug fixes, so keep your software current.
---
## Related Guides
- [Reduce RAM Usage](/tutorials/tips/reduce-ram-usage) - For memory-constrained environments like Raspberry Pi
- [SQLite Database Overview](/tutorials/tips/sqlite-database) - Database schema, encryption, and advanced configuration
- [Environment Variable Configuration](/getting-started/env-configuration) - Complete list of all configuration options
---
By applying these configuration changes, you'll support a more responsive and efficient Open-WebUI experience, allowing your local LLM to focus on delivering high-quality chat interactions without unnecessary delays.


@@ -0,0 +1,191 @@
---
sidebar_position: 10
title: "Optimization, Performance & RAM Usage"
---
# Optimization, Performance & RAM Usage
This guide provides a comprehensive overview of strategies to optimize Open WebUI. Your ideal configuration depends heavily on your specific deployment goals. Consider which of these scenarios describes you best:
1. **Maximum Privacy on Weak Hardware (e.g., Raspberry Pi)**:
   * *Goal*: Keep everything local; minimize resource usage.
* *Trade-off*: You must use lightweight local models (SentenceTransformers) and disable heavy features to prevent crashes.
2. **Maximum Quality for Single User (e.g., Desktop)**:
* *Goal*: Best possible experience with high speed and quality.
* *Strategy*: Leverage external APIs (OpenAI/Anthropic) for embeddings and task models to offload compute from your local machine.
3. **High Scale for Many Users (e.g., Enterprise/Production)**:
* *Goal*: Stability and concurrency.
* *Strategy*: Requires dedicated Vector DBs (Milvus/Qdrant), increased thread pools, caching to handle load, and **PostgreSQL** instead of SQLite.
---
## ⚡ Performance Tuning (Speed & Responsiveness)
If Open WebUI feels slow or unresponsive, especially during chat generation or high concurrency, specialized optimizations can significantly improve the user experience.
### 1. Dedicated Task Models
By default, Open WebUI automates background tasks like title generation, tagging, and autocomplete. These run in the background and can slow down your main chat model if they share the same resources.
**Recommendation**: Use a **very fast, small, and cheap NON-REASONING model** for these tasks. Avoid using large reasoning models (like o1 or r1) as they are too slow and expensive for simple background tasks.
**Configuration:**
There are two separate settings in **Admin Panel > Settings > Interface**. The system intelligently selects which one to use based on the model you are currently chatting with:
* **Task Model (External)**: Used when you are chatting with an external model (e.g., OpenAI, Anthropic).
* **Task Model (Local)**: Used when you are chatting with a locally hosted model (e.g., Ollama).
**Best Options:**
* **External/Cloud**: `gpt-4o-mini`, `gemini-2.0-flash` (or `gemini-1.5-flash`), `llama-3.1-8b-instant` (Groq/OpenRouter).
* **Local**: `llama3.2:3b`, `qwen2.5:3b`.
### 2. Caching & Latency Optimization
Configure these settings to reduce latency and external API usage.
#### Model Caching
Drastically reduces startup time and API calls to external providers.
- **Admin Panel**: `Settings > Connections > Cache Base Model List`
- **Env Var**: `ENABLE_BASE_MODELS_CACHE=True`
  * *Note*: Caches the list of models in memory. The cache only refreshes on app restart or when you click **Save** in the Connections settings.
- **Env Var**: `MODELS_CACHE_TTL=300`
* *Note*: Sets a 5-minute cache for external API responses.
#### Search Query Caching
Eliminates redundant LLM calls by caching the search queries the LLM generates. If a user asks the same question, the system reuses the previously generated search query instead of calling the LLM again.
- **Env Var**: `ENABLE_QUERIES_CACHE=True`
### 3. High-Concurrency & Network Optimization
For setups with many simultaneous users, these settings are crucial to prevent bottlenecks.
#### Batch Streaming Tokens
By default, Open WebUI streams *every single token* arriving from the LLM. This not only increases network load but also **triggers a database write for every token** (if real-time saving is on). Increasing the chunk size batches these updates, significantly reducing CPU load, network overhead, and database write frequency.
* *Effect*: If the provider sends tokens one by one, setting this to 5-10 will buffer them and send them to the client (and DB) in groups.
- **Env Var**: `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7`
* *Recommendation*: Set to **5-10** for high-concurrency instances.
#### Thread Pool Size
Defines the number of worker threads available for handling requests.
* **Default**: 40
* **High-Traffic Recommendation**: **2000+**
* **Warning**: **NEVER decrease this value.** Even on low-spec hardware, an idle thread pool does not consume significant resources. Setting this too low (e.g., 10) **WILL cause application freezes** and request timeouts.
- **Env Var**: `THREAD_POOL_SIZE=2000`
---
## 📉 Resource Efficiency (Reducing RAM)
If deploying on memory-constrained devices (Raspberry Pi, small VPS), use these strategies to prevent the application from crashing due to OOM (Out of Memory) errors.
### 1. Offload Auxiliary Models (Local Deployments Only)
Open WebUI loads local ML models for features like RAG and STT. **This section is only relevant if you are running models LOCALLY.** If you use external APIs, these models are not loaded.
#### RAG Embeddings
- **Low-Spec Recommendation**:
* **Option A (Easiest)**: Keep the default **SentenceTransformers** (all-MiniLM-L6-v2). It is lightweight, runs on CPU, and is significantly more efficient than running a full Ollama instance on the same Raspberry Pi.
* **Option B (Best Performance)**: Use an **External API** (OpenAI/Cloud).
* **Avoid**: Do NOT run Ollama for embeddings on the same low-spec device as Open WebUI; it will kill performance.
- **Configuration**:
* **Admin Panel**: `Settings > Documents > Embedding Model Engine`
* **Env Var**: `RAG_EMBEDDING_ENGINE=openai` (to offload completely)
#### Speech-to-Text (STT)
Local Whisper models are heavy (~500MB+ RAM).
- **Recommendation**: Use **WebAPI** (browser-based). It uses the user's device for transcription, consuming zero server RAM.
- **Configuration**:
* **Admin Panel**: `Settings > Audio > STT Engine`
* **Env Var**: `AUDIO_STT_ENGINE=webapi`
### 2. Disable Unused Features
Prevent the application from loading **local** models you don't use.
- **Image Generation**: `ENABLE_IMAGE_GENERATION=False` (Admin: `Settings > Images`)
- **Code Interpreter**: `ENABLE_CODE_INTERPRETER=False` (Admin: `Settings > Tools`)
### 3. Disable Background Tasks
If resource usage is critical, disable automated features that constantly trigger model inference.
**Recommendation order (Highest Impact first):**
1. **Autocomplete**: `ENABLE_AUTOCOMPLETE_GENERATION=False` (**High Impact**: Triggers on every keystroke!)
* Admin: `Settings > Interface > Autocomplete`
2. **Follow-up Questions**: `ENABLE_FOLLOW_UP_GENERATION=False`
* Admin: `Settings > Interface > Follow-up`
3. **Title Generation**: `ENABLE_TITLE_GENERATION=False`
* Admin: `Settings > Interface > Chat Title`
4. **Tag Generation**: `ENABLE_TAGS_GENERATION=False`
---
## 📦 Database Optimization
### PostgreSQL (Mandatory for Scale)
For any multi-user or high-concurrency setup, **PostgreSQL is mandatory**. SQLite (the default) is not designed for high concurrency and will become a bottleneck.
- **Action**: Use the `DATABASE_URL` environment variable to connect to a Postgres instance.
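As a sketch (the hostname, credentials, and database name are placeholders for your own Postgres instance):

```shell
# Standard SQLAlchemy-style connection string; replace the placeholders
# with your actual Postgres host, credentials, and database name
DATABASE_URL=postgresql://openwebui:secret@db.example.internal:5432/openwebui
```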
### Chat Saving Strategy
By default, Open WebUI saves chats in real-time. This ensures no data loss but creates massive database write pressure (especially with token streaming).
- **Env Var**: `ENABLE_REALTIME_CHAT_SAVE=False`
- **Effect**: Chats are saved only when the generation is complete (or periodically).
- **Recommendation**: **Highly Recommended** for any high-user setup.
- **Note**: For single-user local setups, you may leave this `True` to prevent data loss during crashes.
### Vector Database (RAG)
For multi-user setups, the choice of Vector DB matters.
- **ChromaDB**: **NOT RECOMMENDED** for multi-user environments due to performance limitations and locking issues. Acceptable for standalone local instances.
- **Recommendations**:
* **Milvus** or **Qdrant**: Best for improved scale and performance.
* **PGVector**: Excellent choice if you are already using PostgreSQL.
- **Multitenancy**: If using Milvus/Qdrant, enabling multitenancy offers better resource sharing.
* `ENABLE_MILVUS_MULTITENANCY_MODE=True`
* `ENABLE_QDRANT_MULTITENANCY_MODE=True`
---
## 🚀 Recommended Configuration Profiles
### Profile 1: Maximum Privacy (Weak Hardware/RPi)
*Target: 100% Local, Raspberry Pi / <4GB RAM.*
1. **Embeddings**: Default (SentenceTransformers) - *Runs on CPU, lightweight.*
2. **Audio**: `AUDIO_STT_ENGINE=webapi` - *Zero server load.*
3. **Task Model**: Disable or use tiny model (`llama3.2:1b`).
4. **Scaling**: Keep default `THREAD_POOL_SIZE` (40).
5. **Disable**: Image Gen, Code Interpreter, Autocomplete, Follow-ups.
6. **Database**: SQLite is fine.
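Pulled together, the steps above correspond roughly to this `.env` sketch (variable names are taken from the sections earlier in this guide; treat it as a starting point, not a definitive configuration):

```shell
# Profile 1: maximum privacy on weak hardware
AUDIO_STT_ENGINE=webapi
TASK_MODEL=llama3.2:1b
ENABLE_IMAGE_GENERATION=False
ENABLE_CODE_INTERPRETER=False
ENABLE_AUTOCOMPLETE_GENERATION=False
ENABLE_FOLLOW_UP_GENERATION=False
# Embeddings, THREAD_POOL_SIZE, and the database stay at their defaults
```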
### Profile 2: Single User Enthusiast
*Target: Max Quality & Speed, Local + External APIs.*
1. **Embeddings**: `RAG_EMBEDDING_ENGINE=openai` (or `ollama` with `nomic-embed-text` on a fast server).
2. **Task Model**: `gpt-4o-mini` or `llama-3.1-8b-instant`.
3. **Caching**: `MODELS_CACHE_TTL=300`.
4. **Database**: `ENABLE_REALTIME_CHAT_SAVE=True` (Persistence is usually preferred over raw write speed here).
5. **Vector DB**: ChromaDB or PGVector.
### Profile 3: High Scale / Enterprise
*Target: Many concurrent users, Stability > Persistence.*
1. **Database**: **PostgreSQL** (Mandatory).
2. **Workers**: `THREAD_POOL_SIZE=2000` (Prevent timeouts).
3. **Streaming**: `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7` (Reduce CPU/Net/DB writes).
4. **Chat Saving**: `ENABLE_REALTIME_CHAT_SAVE=False`.
5. **Vector DB**: **Milvus**, **Qdrant**, or **PGVector**. **Avoid ChromaDB.**
6. **Task Model**: External/Hosted (Offload compute).
7. **Caching**: `ENABLE_BASE_MODELS_CACHE=True`, `MODELS_CACHE_TTL=300`, `ENABLE_QUERIES_CACHE=True`.
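Combined, this profile maps to an `.env` sketch like the following (the Postgres connection string is a placeholder; all values are drawn from the sections above):

```shell
# Profile 3: high scale / enterprise
DATABASE_URL=postgresql://user:password@postgres-host:5432/openwebui  # placeholder credentials
THREAD_POOL_SIZE=2000
CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7
ENABLE_REALTIME_CHAT_SAVE=False
ENABLE_BASE_MODELS_CACHE=True
MODELS_CACHE_TTL=300
ENABLE_QUERIES_CACHE=True
TASK_MODEL_EXTERNAL=gpt-4o-mini
```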


@@ -1,168 +0,0 @@
---
sidebar_position: 10
title: "Reduce RAM Usage"
---
## Reduce RAM Usage
If you are deploying Open WebUI in a RAM-constrained environment (such as a Raspberry Pi, small VPS, or shared hosting), there are several strategies to significantly reduce memory consumption.
On a Raspberry Pi 4 (arm64) with version v0.3.10, these optimizations reduced idle memory consumption from >1GB to ~200MB (as observed with `docker container stats`).
---
## Quick Start
Set the following environment variables for immediate RAM savings:
```bash
# Use external embedding instead of local SentenceTransformers
RAG_EMBEDDING_ENGINE=ollama
# Use external Speech-to-Text instead of local Whisper
AUDIO_STT_ENGINE=openai
```
:::tip
These settings can also be configured in the **Admin Panel > Settings** interface: set the RAG embedding engine to Ollama or OpenAI, and Speech-to-Text to OpenAI or WebAPI.
:::
---
## Why Does Open WebUI Use So Much RAM?
Much of the memory consumption comes from locally loaded ML models. Even when using an external LLM (OpenAI or separate Ollama instance), Open WebUI may load additional models for:
| Feature | Default | RAM Impact | Solution |
|---------|---------|------------|----------|
| **RAG Embedding** | Local SentenceTransformers | ~500-800MB | Use Ollama or OpenAI embeddings |
| **Speech-to-Text** | Local Whisper | ~300-500MB | Use OpenAI or WebAPI |
| **Reranking** | Disabled | ~200-400MB when enabled | Keep disabled or use external |
| **Image Generation** | Disabled | Variable | Keep disabled if not needed |
---
## ⚙️ Environment Variables for RAM Reduction
### Offload Embedding to External Service
The biggest RAM saver is using an external embedding engine:
```bash
# Option 1: Use Ollama for embeddings (if you have Ollama running separately)
RAG_EMBEDDING_ENGINE=ollama
# Option 2: Use OpenAI for embeddings
RAG_EMBEDDING_ENGINE=openai
OPENAI_API_KEY=your-api-key
```
### Offload Speech-to-Text
Local Whisper models consume significant RAM:
```bash
# Use OpenAI's Whisper API
AUDIO_STT_ENGINE=openai
# Or use browser-based WebAPI (no external service needed)
AUDIO_STT_ENGINE=webapi
```
### Disable Unused Features
Disable features you don't need to prevent model loading:
```bash
# Disable image generation (prevents loading image models)
ENABLE_IMAGE_GENERATION=False
# Disable code execution (reduces overhead)
ENABLE_CODE_EXECUTION=False
# Disable code interpreter
ENABLE_CODE_INTERPRETER=False
```
### Reduce Background Task Overhead
These settings reduce memory usage from background operations:
```bash
# Disable autocomplete (high resource usage)
ENABLE_AUTOCOMPLETE_GENERATION=False
# Disable automatic title generation
ENABLE_TITLE_GENERATION=False
# Disable tag generation
ENABLE_TAGS_GENERATION=False
# Disable follow-up suggestions
ENABLE_FOLLOW_UP_GENERATION=False
```
### Database and Cache Optimization
```bash
# Disable real-time chat saving (reduces database overhead)
ENABLE_REALTIME_CHAT_SAVE=False
# Reduce thread pool size for low-resource systems
THREAD_POOL_SIZE=10
```
### Vector Database Multitenancy
If using Milvus or Qdrant, enable multitenancy mode to reduce RAM:
```bash
# For Milvus
ENABLE_MILVUS_MULTITENANCY_MODE=True
# For Qdrant
ENABLE_QDRANT_MULTITENANCY_MODE=True
```
---
## 🚀 Recommended Minimal Configuration
For extremely RAM-constrained environments, use this combined configuration:
```bash
# Offload ML models to external services
RAG_EMBEDDING_ENGINE=ollama
AUDIO_STT_ENGINE=openai
# Disable all non-essential features
ENABLE_IMAGE_GENERATION=False
ENABLE_CODE_EXECUTION=False
ENABLE_CODE_INTERPRETER=False
ENABLE_AUTOCOMPLETE_GENERATION=False
ENABLE_TITLE_GENERATION=False
ENABLE_TAGS_GENERATION=False
ENABLE_FOLLOW_UP_GENERATION=False
# Reduce worker overhead
THREAD_POOL_SIZE=10
```
---
## 💡 Additional Tips
- **Monitor Memory Usage**: Use `docker container stats` or `htop` to monitor RAM consumption
- **Restart After Changes**: Environment variable changes require a container restart
- **Fresh Deployments**: Some environment variables only take effect on fresh deployments without an existing `config.json`
- **Consider Alternatives**: For very constrained systems, consider running Open WebUI on a more capable machine and accessing it remotely
---
## Related Guides
- [Improve Local LLM Performance](/tutorials/tips/improve-performance-local) - For optimizing performance without reducing features
- [Environment Variable Configuration](/getting-started/env-configuration) - Complete list of all configuration options