---
sidebar_position: 10
title: "Scaling & HA"
---

# Multi-Replica, High Availability & Concurrency Troubleshooting

This guide addresses common issues encountered when deploying Open WebUI in **multi-replica** environments (e.g., Kubernetes, Docker Swarm) or when using **multiple workers** (`UVICORN_WORKERS > 1`) for increased concurrency.

## Core Requirements Checklist

Before troubleshooting specific errors, ensure your deployment meets these requirements for a multi-replica setup. Items 1-5 are **absolute requirements**: missing any of them will cause instability, login loops, or data loss. Item 6 is an optional optimization.

1.  **Shared Secret Key:** [`WEBUI_SECRET_KEY`](/reference/env-configuration#webui_secret_key) **MUST** be identical on all replicas.
2.  **External Database:** You **MUST** use an external PostgreSQL database (see [`DATABASE_URL`](/reference/env-configuration#database_url)). SQLite is **NOT** supported for multiple instances.
3.  **Redis for WebSockets:** [`ENABLE_WEBSOCKET_SUPPORT=True`](/reference/env-configuration#enable_websocket_support) and [`WEBSOCKET_MANAGER=redis`](/reference/env-configuration#websocket_manager) with a valid [`WEBSOCKET_REDIS_URL`](/reference/env-configuration#websocket_redis_url) are required.
4.  **Shared Storage:** All replicas must map `data/` to the same underlying persistent storage, ideally a ReadWriteMany (RWX) volume. This is critical for RAG (uploads/vectors) and generated images.
5.  **External Vector Database (Required):** The default ChromaDB uses a local SQLite-backed `PersistentClient` that is **not safe for multi-worker or multi-replica deployments**. SQLite connections are not fork-safe, and concurrent writes from multiple processes will crash workers instantly. You **must** use a dedicated external Vector DB (e.g., [PGVector](/reference/env-configuration#pgvector_db_url), Milvus, Qdrant) via [`VECTOR_DB`](/reference/env-configuration#vector_db), or run ChromaDB as a [separate HTTP server](/reference/env-configuration#chroma_http_host).
6.  **Database Session Sharing (Optional):** For PostgreSQL deployments with adequate resources, consider enabling [`DATABASE_ENABLE_SESSION_SHARING=True`](/reference/env-configuration#database_enable_session_sharing) to improve performance under high concurrency.
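
The required items above can be sketched as a small preflight check. This is only an illustration: the function name is hypothetical, it treats unset variables as missing even where Open WebUI may apply a suitable default, and it assumes PostgreSQL URLs start with `postgres` and that an unset `VECTOR_DB` means local ChromaDB.

```python
import os
from typing import Mapping


def check_multi_replica_env(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems found in the given environment mapping."""
    problems = []

    if not env.get("WEBUI_SECRET_KEY"):
        problems.append("WEBUI_SECRET_KEY is not set (must be identical on all replicas)")

    # Assumption: PostgreSQL URLs start with "postgres" (e.g. postgresql://).
    if not env.get("DATABASE_URL", "").startswith("postgres"):
        problems.append("DATABASE_URL does not point to PostgreSQL")

    if env.get("ENABLE_WEBSOCKET_SUPPORT", "").lower() != "true":
        problems.append("ENABLE_WEBSOCKET_SUPPORT is not True")
    if env.get("WEBSOCKET_MANAGER") != "redis":
        problems.append("WEBSOCKET_MANAGER is not 'redis'")
    if not env.get("WEBSOCKET_REDIS_URL"):
        problems.append("WEBSOCKET_REDIS_URL is not set")

    # Assumption: an unset VECTOR_DB means the default local ChromaDB is in use.
    vector_db = env.get("VECTOR_DB", "chroma")
    if vector_db == "chroma" and not env.get("CHROMA_HTTP_HOST"):
        problems.append("local ChromaDB in use; set VECTOR_DB or CHROMA_HTTP_HOST")

    return problems


if __name__ == "__main__":
    # Check the environment of the current container.
    for problem in check_multi_replica_env(os.environ):
        print(f"WARNING: {problem}")
```

Running it inside each replica only shows whether the variables are present. It cannot confirm that `WEBUI_SECRET_KEY` has the same value everywhere, so compare that separately.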

---

## Common Issues

### 1. Login Loops / 401 Unauthorized Errors

**Symptoms:**
- You log in successfully, but the next click logs you out.
- You see "Unauthorized" or "401" errors in the browser console immediately after login.
- "Error decrypting tokens" appears in logs.

**Cause:**
Each replica is using a different `WEBUI_SECRET_KEY`. When Replica A issues a session token (JWT), Replica B rejects it because it cannot verify the signature with its own different key.

**Solution:**
Set the `WEBUI_SECRET_KEY` environment variable to the **same** strong, random string on all backend replicas.

```yaml
# Example in Kubernetes/Compose
env:
  - name: WEBUI_SECRET_KEY
    value: "your-super-secure-static-key-here"
```
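
One way to produce such a value is Python's standard `secrets` module. The sketch below is an illustration only; the function name is hypothetical and the 32-byte length is an assumed default, not a documented Open WebUI requirement.

```python
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random string suitable for use as a shared secret.

    num_bytes=32 is an assumed default (256 bits of randomness).
    """
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Run once, then set the printed value as WEBUI_SECRET_KEY on ALL replicas.
    print(generate_secret_key())
```

Generate the key once and distribute the same value to every replica. Generating it separately in each container would reproduce the login-loop problem described above.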

### 2. WebSocket 403 Errors / Connection Failures

**Symptoms:**
- Chat stops responding or hangs.
- Browser console shows `WebSocket connection failed: 403 Forbidden` or `Connection closed`.
- Logs show `engineio.server: https://your-domain.com is not an accepted origin`.

**Cause:**
- **CORS:** The origin sent by the browser (your public URL in front of the load balancer or ingress) is not in the list of allowed origins.
- **Missing Redis:** WebSockets are defaulting to in-memory, so events on Replica A (e.g., LLM generation finish) are not broadcast to the user connected to Replica B.

**Solution:**
1.  **Configure CORS:** Ensure [`CORS_ALLOW_ORIGIN`](/reference/env-configuration#cors_allow_origin) includes every origin users access, including both the `http://` and `https://` variants of your public domain where applicable.

    If you see logs like `engineio.base_server:_log_error_once:354 - https://yourdomain.com is not an accepted origin`, you must update this variable. It accepts a **semicolon-separated list** of allowed origins.

    **Example:**
    ```bash
    CORS_ALLOW_ORIGIN="https://chat.yourdomain.com;http://chat.yourdomain.com;https://yourhostname;http://localhost:3000"
    ```
    *Add every IP address, domain, and hostname that users might use to access your Open WebUI.*
2.  **Enable Redis for WebSockets:**
    Ensure these variables are set on **all** replicas:
    ```bash
    ENABLE_WEBSOCKET_SUPPORT=True
    WEBSOCKET_MANAGER=redis
    WEBSOCKET_REDIS_URL=redis://your-redis-host:6379/0
    ```
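
To sanity-check a `CORS_ALLOW_ORIGIN` value before deploying, the following sketch splits the semicolon-separated list and tests whether a given browser origin appears in it. The helper names are hypothetical, and the exact-match comparison and `*` wildcard handling are assumptions for illustration, not Open WebUI's actual implementation.

```python
def parse_allowed_origins(value: str) -> list[str]:
    """Split a semicolon-separated CORS_ALLOW_ORIGIN value into a clean list."""
    return [item.strip() for item in value.split(";") if item.strip()]


def is_origin_allowed(origin: str, cors_allow_origin: str) -> bool:
    """Return True if the origin is listed, or if the wildcard "*" is present."""
    allowed = parse_allowed_origins(cors_allow_origin)
    return "*" in allowed or origin in allowed


if __name__ == "__main__":
    value = "https://chat.yourdomain.com;http://localhost:3000"
    print(is_origin_allowed("https://chat.yourdomain.com", value))  # True
    print(is_origin_allowed("http://chat.yourdomain.com", value))   # False
```

The second call returns `False` because the scheme differs. This is why both the `http://` and `https://` variants need to be listed when users can reach the site through either.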

### 3. "Model Not Found" or Configuration Mismatch

**Symptoms:**
- You enable a model or change a setting in the Admin UI, but other users (or you, after a refresh) don't see the change.
- Chats fail with "Model not found" intermittently.

**Cause:**
- **Configuration Sync:** Replicas are not synced. Open WebUI uses Redis Pub/Sub to broadcast configuration changes (like toggling a model) to all other instances.
- **Missing Redis:** If `REDIS_URL` is not set, configuration changes stay local to the instance where the change was made.

**Solution:**
Set `REDIS_URL` to point to your shared Redis instance. This enables the Pub/Sub mechanism for real-time config syncing.

```bash
REDIS_URL=redis://your-redis-host:6379/0
```

### 4. Database Corruption / "Locked" Errors

**Symptoms:**
- Logs show `database is locked` or severe SQL errors.
- Data saved on one instance disappears on another.

**Cause:**
Using **SQLite** with multiple replicas. SQLite is a file-based database whose file locking is not designed for concurrent writes from multiple containers, and it is especially unreliable on network filesystems.

**Solution:**
Migrate to **PostgreSQL**. Update your connection string:

```bash
DATABASE_URL=postgresql://user:password@postgres-host:5432/openwebui
```

### 5. Uploaded Files or RAG Knowledge Inaccessible

**Symptoms:**
- You upload a file (for RAG) on one instance, but the model cannot find it later.
- Generated images appear as broken links.

**Cause:**
The `/app/backend/data` directory is not shared or is not consistent across replicas. If User A uploads a file to Replica 1, and the next request hits Replica 2, Replica 2 won't have the file physically on disk.

**Solution:**
- **Kubernetes:** Use a `PersistentVolumeClaim` with `ReadWriteMany` (RWX) access mode if your storage provider supports it (e.g., NFS, CephFS, AWS EFS).
- **Docker Swarm/Compose:** Mount a shared volume (e.g., NFS mount) to `/app/backend/data` on all containers.
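
A quick way to confirm that replicas really see the same directory is to have each one write a marker file and then list the markers that are visible. The sketch below is a hypothetical diagnostic, not part of Open WebUI; the `.replica-markers` directory name and both function names are assumptions.

```python
import socket
from pathlib import Path
from typing import Optional

MARKER_DIR_NAME = ".replica-markers"  # hypothetical directory name


def write_marker(data_dir: str, replica_name: Optional[str] = None) -> Path:
    """Write a small marker file named after this replica and return its path."""
    name = replica_name or socket.gethostname()
    marker_dir = Path(data_dir) / MARKER_DIR_NAME
    marker_dir.mkdir(parents=True, exist_ok=True)
    marker = marker_dir / f"{name}.marker"
    marker.write_text(name)
    return marker


def visible_replicas(data_dir: str) -> list[str]:
    """List the replicas whose marker files are visible from this container."""
    marker_dir = Path(data_dir) / MARKER_DIR_NAME
    if not marker_dir.is_dir():
        return []
    return sorted(p.stem for p in marker_dir.glob("*.marker"))


if __name__ == "__main__":
    # Run in every replica; each should eventually list all replica names.
    write_marker("/app/backend/data")
    print(visible_replicas("/app/backend/data"))
```

If a replica only ever lists its own name while others are running, the volume is not shared. Remove the marker directory once the check is done.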

### 6. Worker Crashes During Document Upload (ChromaDB + Multi-Worker)

**Symptoms:**
- Logs show the following sequence, all within the same second:
  ```
  save_docs_to_vector_db:1619 - adding to collection file-id
  INFO:     Waiting for child process [pid]
  INFO:     Child process [pid] died
  ```
- Workers die immediately during RAG document ingestion.
- The crash is instant (not a timeout).

**Cause:**
The default ChromaDB configuration uses a local `PersistentClient` backed by **SQLite**. When Uvicorn forks multiple workers (`UVICORN_WORKERS > 1`), each worker process inherits a copy of the same SQLite database connection, and all of those copies point at the same file on disk (`data/vector_db/`).

When two workers attempt to write to the collection simultaneously (e.g., during document upload), SQLite's file-level locking fails across forked processes. The result is either a database lock error or a segfault from corrupted internal state inherited across the `fork()` call, which kills the worker process instantly.

This is a [well-known SQLite limitation](https://www.sqlite.org/howtocorrupt.html#_carrying_an_open_database_connection_across_a_fork_): open database connections must not be carried across a `fork()`.

**Solution:**
You **must** stop using the default local ChromaDB with multiple workers. Pick one of these options:

| Option | Change | Tradeoff |
|--------|--------|----------|
| **Keep 1 worker** | Set `UVICORN_WORKERS=1` (the default) | Simplest, but limits concurrency |
| **Use ChromaDB HTTP mode** | Set [`CHROMA_HTTP_HOST`](/reference/env-configuration#chroma_http_host) / [`CHROMA_HTTP_PORT`](/reference/env-configuration#chroma_http_port) to point to a separate Chroma server | Each worker connects via HTTP instead of SQLite — fully fork-safe |
| **Switch vector DB** | Set [`VECTOR_DB`](/reference/env-configuration#vector_db) to `pgvector`, `milvus`, `qdrant`, etc. | These are client-server databases, inherently multi-process safe |

**Recommended fix** — run ChromaDB as a separate server:

```bash
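# Run the Chroma server as a separate process or container
chroma run --host 0.0.0.0 --port 8000 --path /data/vector_db

# Then set these env vars for Open WebUI
# (use the Chroma server's hostname if it does not run on the same host)
CHROMA_HTTP_HOST=localhost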
CHROMA_HTTP_PORT=8000
UVICORN_WORKERS=4
```

### 7. Slow Performance in Cloud vs. Local Kubernetes

**Symptoms:**
- Open WebUI performs well locally but experiences significant degradation or timeouts when deployed to cloud providers (AKS, EKS, GKE).
- Performance drops sharply under concurrent load despite adequate resource allocation.

**Cause:**
This is typically caused by infrastructure latency (network latency to the database, or disk I/O latency when using SQLite), which is inherently higher in cloud environments than with local NVMe/SSD storage and local networks.

**Solution:**
Refer to the **[Cloud Infrastructure Latency](/troubleshooting/performance#%EF%B8%8F-cloud-infrastructure-latency)** section in the Performance Guide for a detailed breakdown of diagnosis and mitigation strategies.

If you need more tips for performance improvements, check out the full [Optimization & Performance Guide](/troubleshooting/performance).

### 8. Optimizing Database Performance

For PostgreSQL deployments with adequate resources, consider these optimizations:

#### Database Session Sharing

Enabling session sharing can improve performance under high concurrency:

```bash
DATABASE_ENABLE_SESSION_SHARING=True
```

See [DATABASE_ENABLE_SESSION_SHARING](/reference/env-configuration#database_enable_session_sharing) for details.

#### Connection Pool Sizing

If you experience `QueuePool limit reached` errors or connection timeouts under high concurrency, increase the pool size:

```bash
# Example starting values; raise them if the errors persist
DATABASE_POOL_SIZE=15
DATABASE_POOL_MAX_OVERFLOW=20
```

**Important:** The combined total (`DATABASE_POOL_SIZE` + `DATABASE_POOL_MAX_OVERFLOW`), summed across all Open WebUI processes, should remain well below your database's `max_connections` limit. PostgreSQL defaults to 100 max connections, so aim to keep the overall total under roughly 50-80 to leave room for other clients and maintenance operations.

:::warning Pool Size Multiplies with Concurrency

**Each Open WebUI process maintains its own independent connection pool.** This applies to multiple replicas (Kubernetes pods, Docker Swarm replicas) *and* multiple Uvicorn workers within each replica.

The actual maximum number of database connections is:

```
Total connections = (DATABASE_POOL_SIZE + DATABASE_POOL_MAX_OVERFLOW) × Total processes
```

Where `Total processes = Number of replicas × UVICORN_WORKERS per replica`.

For example, with `DATABASE_POOL_SIZE=15`, `DATABASE_POOL_MAX_OVERFLOW=20`, 3 replicas, and 2 workers each, you could open up to **210 connections** (35 × 6 processes), more than double PostgreSQL's default limit of 100.

:::

See [DATABASE_POOL_SIZE](/reference/env-configuration#database_pool_size) for details.
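
The pool arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and `reserved` is an assumed number of connections kept free for other clients and maintenance.

```python
def max_db_connections(pool_size: int, max_overflow: int,
                       replicas: int, workers_per_replica: int) -> int:
    """Upper bound on connections opened by all Open WebUI processes."""
    total_processes = replicas * workers_per_replica
    return (pool_size + max_overflow) * total_processes


def per_process_budget(max_connections: int, reserved: int,
                       replicas: int, workers_per_replica: int) -> int:
    """Largest (pool size + overflow) per process that stays within the limit.

    `reserved` is a hypothetical number of connections kept free for
    other clients and maintenance.
    """
    total_processes = replicas * workers_per_replica
    return (max_connections - reserved) // total_processes


if __name__ == "__main__":
    # The example from the warning above: 35 connections x 6 processes.
    print(max_db_connections(15, 20, replicas=3, workers_per_replica=2))  # 210
    # With max_connections=100 and 20 reserved, each of 6 processes gets 13.
    print(per_process_budget(100, 20, replicas=3, workers_per_replica=2))  # 13
```

Working backwards from the budget is often more useful than working forwards: decide how many connections the database can spare, divide by the number of processes, and split the result between `DATABASE_POOL_SIZE` and `DATABASE_POOL_MAX_OVERFLOW`.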

---

## Deployment Best Practices

### Updates and Migrations

:::danger Critical: Avoid Concurrent Migrations
**Always ensure only one process is running database migrations when upgrading Open WebUI versions.**
:::

Database migrations run automatically on startup. If multiple replicas (or multiple workers within a single container) start simultaneously with a new version, they may try to run migrations concurrently, potentially leading to race conditions or database schema corruption.

**Safe Update Procedure:**

There are two ways to safely handle migrations in a multi-replica environment:

#### Option 1: Designate a Master Migration Pod (Recommended)
1.  Identify one pod/replica as the "master" for migrations.
2.  Set `ENABLE_DB_MIGRATIONS=True` (default) on the master pod.
3.  Set `ENABLE_DB_MIGRATIONS=False` on all other pods.
4.  When updating, the master pod will handle the database schema update while other pods skip the migration step.

#### Option 2: Scale Down During Update
1.  **Scale Down:** Set replicas to `1` (and ensure `UVICORN_WORKERS=1`).
2.  **Update Image:** Update the image or version.
3.  **Wait for Health Check:** Wait for the single instance to start fully and complete migrations.
4.  **Scale Up:** Increase replicas back to your desired count.


### Session Affinity (Sticky Sessions)

While Open WebUI is designed to be stateless with proper Redis configuration, enabling **Session Affinity** (Sticky Sessions) at your Load Balancer / Ingress level can improve performance and reduce occasional jitter in WebSocket connections.

- **Nginx Ingress:** `nginx.ingress.kubernetes.io/affinity: "cookie"`
- **AWS ALB:** Enable Target Group Stickiness.

---

## Related Documentation

- [Environment Variable Configuration](/reference/env-configuration)
- [Optimization, Performance & RAM Usage](/troubleshooting/performance)
- [Troubleshooting Connection Errors](/troubleshooting/connection-error)
- [Logging Configuration](/getting-started/advanced-topics/logging)