---
sidebar_position: 10
title: "Optimization, Performance & RAM Usage"
---


This guide provides a comprehensive overview of strategies to optimize Open WebUI. Your ideal configuration depends heavily on your specific deployment goals. Consider which of these scenarios describes you best:

1. **Maximum Privacy on Weak Hardware** (e.g., Raspberry Pi):
   - **Goal:** Keep everything local; minimize resource usage.
   - **Trade-off:** You must use lightweight local models (SentenceTransformers) and disable heavy features to prevent crashes.
2. **Maximum Quality for a Single User** (e.g., Desktop):
   - **Goal:** The best possible experience with high speed and quality.
   - **Strategy:** Leverage external APIs (OpenAI/Anthropic) for embeddings and task models to offload compute from your local machine.
3. **High Scale for Many Users** (e.g., Enterprise/Production):
   - **Goal:** Stability and concurrency.
   - **Strategy:** Requires dedicated vector DBs (Milvus/Qdrant), increased thread pools, caching to handle load, and PostgreSQL instead of SQLite.

## Performance Tuning (Speed & Responsiveness)

If Open WebUI feels slow or unresponsive, especially during chat generation or under high concurrency, the optimizations below can significantly improve the user experience.

### 1. Dedicated Task Models

By default, Open WebUI automates background tasks such as title generation, tagging, and autocomplete. If these tasks share resources with your main chat model, they can slow it down.

**Recommendation:** Use a fast, small, and cheap **non-reasoning** model for these tasks. Avoid large reasoning models (such as o1 or r1); they are too slow and expensive for simple background work.

**Configuration:** There are two separate settings under **Admin Panel > Settings > Interface**. The system automatically selects which one to use based on the model you are currently chatting with:

- **Task Model (External):** Used when you are chatting with an external model (e.g., OpenAI, Anthropic).
- **Task Model (Local):** Used when you are chatting with a locally hosted model (e.g., Ollama).

**Best options** (an environment-based sketch follows the list):

- **External/Cloud:** `gpt-4o-mini`, `gemini-2.0-flash` (or `gemini-1.5-flash`), `llama-3.1-8b-instant` (Groq/OpenRouter).
- **Local:** `llama3.2:3b`, `qwen2.5:3b`.
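If you manage your deployment through environment variables rather than the Admin Panel, a minimal sketch follows. It assumes a Docker deployment and the `TASK_MODEL` / `TASK_MODEL_EXTERNAL` variable names from Open WebUI's environment-variable reference; verify both against your version:

```bash
# Minimal sketch, assuming a Docker deployment on port 3000.
# TASK_MODEL          -> used with locally hosted (e.g., Ollama) chat models
# TASK_MODEL_EXTERNAL -> used with external API chat models
docker run -d --name open-webui -p 3000:8080 \
  -e TASK_MODEL="llama3.2:3b" \
  -e TASK_MODEL_EXTERNAL="gpt-4o-mini" \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```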

### 2. Caching & Latency Optimization

Configure these settings to reduce latency and external API usage.

#### Model Caching

Drastically reduces startup time and API calls to external providers.

- **Admin Panel:** Settings > Connections > Cache Base Model List
- **Env Var:** `ENABLE_BASE_MODELS_CACHE=True`
  - Caches the list of models in memory. It only refreshes on app restart or when you click **Save** in the Connections settings.
- **Env Var:** `MODELS_CACHE_TTL=300`
  - Caches external API responses for 5 minutes.

#### Search Query Caching

Caching the search queries generated by the LLM eliminates redundant LLM calls: if a user asks the same question again, the system reuses the previously generated search query instead of calling the LLM.

- **Env Var:** `ENABLE_QUERIES_CACHE=True`
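A combined sketch of the caching flags from this section, assuming a Docker deployment:

```bash
# Sketch: enable model-list caching, a 5-minute external-API cache,
# and search-query caching in one deployment.
docker run -d --name open-webui -p 3000:8080 \
  -e ENABLE_BASE_MODELS_CACHE=True \
  -e MODELS_CACHE_TTL=300 \
  -e ENABLE_QUERIES_CACHE=True \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```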

### 3. High-Concurrency & Network Optimization

For setups with many simultaneous users, these settings are crucial to prevent bottlenecks.

#### Batch Streaming Tokens

By default, Open WebUI streams every single token arriving from the LLM. This not only increases network load but also triggers a database write for every token (if real-time saving is on). Increasing the chunk size batches these updates, significantly reducing CPU load, network overhead, and database write frequency.

- **Effect:** If the provider sends tokens one by one, setting this to 5-10 buffers them and sends them to the client (and database) in groups.
- **Env Var:** `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7`
  - **Recommendation:** Set to 5-10 for high-concurrency instances.

#### Thread Pool Size

Defines the number of worker threads available for handling requests.

- **Default:** 40
- **High-Traffic Recommendation:** 2000+
- **Warning:** **Never** decrease this value. Even on low-spec hardware, an idle thread pool does not consume significant resources, but setting it too low (e.g., 10) **will** cause application freezes and request timeouts.
- **Env Var:** `THREAD_POOL_SIZE=2000` (see the combined sketch below)
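A minimal sketch combining the two high-concurrency settings above, again assuming a Docker deployment:

```bash
# Sketch: batch streamed tokens and enlarge the worker pool
# for a high-concurrency instance.
docker run -d --name open-webui -p 3000:8080 \
  -e THREAD_POOL_SIZE=2000 \
  -e CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```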

## 📉 Resource Efficiency (Reducing RAM)

If you are deploying on memory-constrained devices (Raspberry Pi, small VPS), use these strategies to prevent out-of-memory (OOM) crashes.

### 1. Offload Auxiliary Models (Local Deployments Only)

Open WebUI loads local ML models for features like RAG and STT. This section is only relevant if you are running those models **locally**; if you use external APIs, they are not loaded at all.

#### RAG Embeddings

- **Low-Spec Recommendation:**
  - **Option A (Easiest):** Keep the default SentenceTransformers model (`all-MiniLM-L6-v2`). It is lightweight, runs on CPU, and is significantly more efficient than running a full Ollama instance on the same Raspberry Pi.
  - **Option B (Best Performance):** Use an external API (OpenAI/cloud).
  - **Avoid:** Do **not** run Ollama for embeddings on the same low-spec device as Open WebUI; it will kill performance.
- **Configuration:**
  - **Admin Panel:** Settings > Documents > Embedding Model Engine
  - **Env Var:** `RAG_EMBEDDING_ENGINE=openai` (to offload completely; sketch below)
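A `.env`-style sketch of the offload. `RAG_EMBEDDING_ENGINE` is documented above; `RAG_OPENAI_API_KEY` and `RAG_EMBEDDING_MODEL` are assumptions based on Open WebUI's environment-variable reference, so verify them for your version:

```bash
# .env sketch: offload embeddings to OpenAI.
RAG_EMBEDDING_ENGINE=openai
RAG_OPENAI_API_KEY=sk-...                    # assumption: API key variable name
RAG_EMBEDDING_MODEL=text-embedding-3-small   # assumption: embedding model variable name
```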

#### Speech-to-Text (STT)

Local Whisper models are heavy (roughly 500 MB of RAM or more).

- **Recommendation:** Use the WebAPI (browser-based) engine. It transcribes on the user's device, costing zero server RAM.
- **Configuration:**
  - **Admin Panel:** Settings > Audio > STT Engine
  - **Env Var:** `AUDIO_STT_ENGINE=webapi`

### 2. Disable Unused Features

Prevent the application from loading local models you don't use.

- **Image Generation:** `ENABLE_IMAGE_GENERATION=False` (Admin: Settings > Images)
- **Code Interpreter:** `ENABLE_CODE_INTERPRETER=False` (Admin: Settings > Tools)

### 3. Disable Background Tasks

If resource usage is critical, disable automated features that constantly trigger model inference.

**Recommended order (highest impact first; a combined sketch follows the list):**

1. **Autocomplete:** `ENABLE_AUTOCOMPLETE_GENERATION=False` (high impact: triggers on every keystroke!)
   - Admin: Settings > Interface > Autocomplete
2. **Follow-up Questions:** `ENABLE_FOLLOW_UP_GENERATION=False`
   - Admin: Settings > Interface > Follow-up
3. **Title Generation:** `ENABLE_TITLE_GENERATION=False`
   - Admin: Settings > Interface > Chat Title
4. **Tag Generation:** `ENABLE_TAGS_GENERATION=False`
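Putting sections 1-3 together, a minimal low-RAM `.env` sketch (all variable names appear above):

```bash
# .env sketch for a memory-constrained device: offload STT to the
# browser and switch off every optional model and background task.
AUDIO_STT_ENGINE=webapi
ENABLE_IMAGE_GENERATION=False
ENABLE_CODE_INTERPRETER=False
ENABLE_AUTOCOMPLETE_GENERATION=False
ENABLE_FOLLOW_UP_GENERATION=False
ENABLE_TITLE_GENERATION=False
ENABLE_TAGS_GENERATION=False
```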

## 📦 Database Optimization

### PostgreSQL (Mandatory for Scale)

For any multi-user or high-concurrency setup, PostgreSQL is mandatory. SQLite (the default) is not designed for high concurrency and will become a bottleneck.

- **Action:** Use the `DATABASE_URL` environment variable to connect to a Postgres instance (example below).
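A sketch using the standard `postgresql://` URL form; the host, credentials, and database name are placeholders to substitute with your own:

```bash
# Hypothetical Postgres connection string for Open WebUI.
DATABASE_URL="postgresql://openwebui:change-me@db.example.internal:5432/openwebui"
```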

### Chat Saving Strategy

By default, Open WebUI saves chats in real-time. This ensures no data loss but creates massive database write pressure (especially with token streaming).

- **Env Var:** `ENABLE_REALTIME_CHAT_SAVE=False`
- **Effect:** Chats are saved only when generation completes (or periodically).
- **Recommendation:** Highly recommended for any high-user setup.
- **Note:** For single-user local setups, you may leave this set to `True` to prevent data loss during crashes.

### Vector Database (RAG)

For multi-user setups, the choice of Vector DB matters.

- **ChromaDB:** **Not recommended** for multi-user environments due to performance limitations and locking issues. Acceptable for standalone local instances.
- **Recommendations:**
  - **Milvus** or **Qdrant:** Best for improved scale and performance.
  - **PGVector:** Excellent choice if you are already using PostgreSQL.
- **Multitenancy:** If you use Milvus/Qdrant, enabling multitenancy offers better resource sharing:
  - `ENABLE_MILVUS_MULTITENANCY_MODE=True`
  - `ENABLE_QDRANT_MULTITENANCY_MODE=True`
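A sketch for switching the vector store to Qdrant. The multitenancy flag is documented above; `VECTOR_DB` and `QDRANT_URI` are the commonly documented names for selecting and pointing at the backend, so treat them as assumptions and verify against your version:

```bash
# Sketch: use Qdrant as the vector store, with multitenancy enabled.
VECTOR_DB=qdrant                                 # assumption: backend selector variable
QDRANT_URI=http://qdrant.example.internal:6333   # assumption: your Qdrant endpoint
ENABLE_QDRANT_MULTITENANCY_MODE=True
```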

## Profile 1: Maximum Privacy (Weak Hardware/RPi)

**Target:** 100% local; Raspberry Pi or <4 GB RAM.

1. **Embeddings:** Default (SentenceTransformers); runs on CPU, lightweight.
2. **Audio:** `AUDIO_STT_ENGINE=webapi`; zero server load.
3. **Task Model:** Disable, or use a tiny model (`llama3.2:1b`).
4. **Scaling:** Keep the default `THREAD_POOL_SIZE` (40).
5. **Disable:** Image generation, code interpreter, autocomplete, follow-ups.
6. **Database:** SQLite is fine. (A full `docker run` sketch follows.)
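A sketch tying the profile together, assuming a Docker deployment; every variable name comes from the sections above:

```bash
# Profile 1 sketch: privacy-first, low-RAM single-board deployment.
docker run -d --name open-webui -p 3000:8080 \
  -e AUDIO_STT_ENGINE=webapi \
  -e ENABLE_IMAGE_GENERATION=False \
  -e ENABLE_CODE_INTERPRETER=False \
  -e ENABLE_AUTOCOMPLETE_GENERATION=False \
  -e ENABLE_FOLLOW_UP_GENERATION=False \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```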

## Profile 2: Single-User Enthusiast

**Target:** Maximum quality and speed; local + external APIs.

1. **Embeddings:** `RAG_EMBEDDING_ENGINE=openai` (or `ollama` with `nomic-embed-text` on a fast server).
2. **Task Model:** `gpt-4o-mini` or `llama-3.1-8b-instant`.
3. **Caching:** `MODELS_CACHE_TTL=300`.
4. **Database:** `ENABLE_REALTIME_CHAT_SAVE=True` (persistence is usually preferred over raw write speed here).
5. **Vector DB:** ChromaDB or PGVector. (A `.env` sketch follows.)
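A `.env` sketch of the profile. `TASK_MODEL_EXTERNAL` is an assumed variable name (see the Dedicated Task Models section); the rest appear earlier in this guide:

```bash
# Profile 2 sketch (.env): quality-first single-user setup.
RAG_EMBEDDING_ENGINE=openai
MODELS_CACHE_TTL=300
ENABLE_REALTIME_CHAT_SAVE=True
TASK_MODEL_EXTERNAL=gpt-4o-mini   # assumption: env name for the external task model
```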

## Profile 3: High Scale / Enterprise

**Target:** Many concurrent users; stability > persistence.

1. **Database:** PostgreSQL (mandatory).
2. **Workers:** `THREAD_POOL_SIZE=2000` (prevents timeouts).
3. **Streaming:** `CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7` (reduces CPU, network, and DB writes).
4. **Chat Saving:** `ENABLE_REALTIME_CHAT_SAVE=False`.
5. **Vector DB:** Milvus, Qdrant, or PGVector. Avoid ChromaDB.
6. **Task Model:** External/hosted (offloads compute).
7. **Caching:** `ENABLE_BASE_MODELS_CACHE=True`, `MODELS_CACHE_TTL=300`, `ENABLE_QUERIES_CACHE=True`. (A consolidated `.env` sketch follows.)
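A consolidated `.env` sketch of the profile. The `DATABASE_URL` value is a placeholder, and `VECTOR_DB` / `QDRANT_URI` are assumed variable names (see the Vector Database section); adapt all three to your environment:

```bash
# Profile 3 sketch (.env): high-concurrency production deployment.
DATABASE_URL="postgresql://openwebui:change-me@db.example.internal:5432/openwebui"
THREAD_POOL_SIZE=2000
CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE=7
ENABLE_REALTIME_CHAT_SAVE=False
VECTOR_DB=qdrant                                 # assumption: backend selector variable
QDRANT_URI=http://qdrant.example.internal:6333   # assumption: your Qdrant endpoint
ENABLE_BASE_MODELS_CACHE=True
MODELS_CACHE_TTL=300
ENABLE_QUERIES_CACHE=True
```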