---
sidebar_position: 3
title: "Audio"
---
# Audio Troubleshooting Guide
This page covers common issues with Speech-to-Text (STT) and Text-to-Speech (TTS) functionality in Open WebUI, along with their solutions.
## Where to Find Audio Settings
### Admin Settings (Server-Wide)
Admins can configure server-wide audio defaults:
1. Click your **profile icon** (bottom-left corner)
2. Select **Admin Panel**
3. Click **Settings** in the top navigation
4. Select the **Audio** tab
Here you can configure:
- **Speech-to-Text Engine** — Choose between local Whisper, OpenAI, Azure, Deepgram, or Mistral
- **Whisper Model** — Select model size for local STT (tiny, base, small, medium, large)
- **Text-to-Speech Engine** — Choose between OpenAI-compatible, ElevenLabs, Azure, local Transformers, or disable backend TTS (browser-only)
- **TTS Voice** — Select the default voice
- **API Keys and Base URLs** — Configure external service connections
### User Settings (Per-User)
Individual users can customize their audio experience:
1. Click your **profile icon** (bottom-left corner)
2. Select **Settings**
3. Click the **Audio** tab
User-level options include:
- **STT Engine Override** — Use "Web API" for browser-based speech recognition
- **STT Language** — Set preferred language for transcription
- **TTS Engine** — Choose "Browser Kokoro" for local in-browser TTS
- **TTS Voice** — Select from available voices
- **Auto-playback** — Automatically play AI responses
- **Playback Speed** — Adjust audio speed
- **Conversation Mode** — Enable hands-free voice interaction
:::tip
User settings override admin defaults. If you're having issues, check both locations to ensure settings aren't conflicting.
:::
## Quick Setup Guide
### Fastest Setup: OpenAI (Paid)
If you have an OpenAI API key, this is the simplest setup:
**In Admin Panel → Settings → Audio:**
- **STT Engine:** `OpenAI` | **Model:** `whisper-1`
- **TTS Engine:** `OpenAI` | **Model:** `tts-1` | **Voice:** `alloy`
- Enter your OpenAI API key in both sections
Or via environment variables:
```yaml
environment:
  - AUDIO_STT_ENGINE=openai
  - AUDIO_STT_OPENAI_API_KEY=sk-...
  - AUDIO_TTS_ENGINE=openai
  - AUDIO_TTS_OPENAI_API_KEY=sk-...
  - AUDIO_TTS_MODEL=tts-1
  - AUDIO_TTS_VOICE=alloy
```
→ See full guides: [Speech-to-Text](/category/speech-to-text) | [Text-to-Speech](/category/text-to-speech)
### Free Setup: Local Whisper + Edge TTS
For a completely free setup:
**STT:** Leave engine empty (uses built-in Whisper running on the backend)
```yaml
environment:
  - WHISPER_MODEL=base # Options: tiny, base, small, medium, large
```
**TTS:** Use OpenAI Edge TTS (free Microsoft voices)
```yaml
services:
  openai-edge-tts:
    image: travisvn/openai-edge-tts:latest
    ports:
      - "5050:5050"
  open-webui:
    environment:
      - AUDIO_TTS_ENGINE=openai
      - AUDIO_TTS_OPENAI_API_BASE_URL=http://openai-edge-tts:5050/v1
      - AUDIO_TTS_OPENAI_API_KEY=not-needed
```
→ See full guide: [OpenAI Edge TTS](/features/media-generation/audio/text-to-speech/openai-edge-tts-integration)
### Browser-Only Setup (No Backend Config Needed)
For basic functionality without any server-side audio processing:
**In User Settings → Audio:**
- **STT Engine:** `Web API` (uses the browser's built-in speech recognition; **does not call the backend STT endpoint**)
- **TTS Engine:** `Web API` or `Browser Kokoro` (uses browser's built-in text-to-speech or client-side Kokoro; **does not call the backend TTS endpoint**)
:::note
When the admin leaves `AUDIO_TTS_ENGINE` as an empty string (the default), no backend TTS service is available. All TTS is handled client-side. Similarly, if users select "Web API" for STT in their user settings, the backend's local Whisper is not used.
:::
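If you manage the server with Docker Compose, the browser-only default can be made explicit. This is a documentary sketch only: an empty `AUDIO_TTS_ENGINE` is already the default, so the line changes nothing functionally.

```yaml
services:
  open-webui:
    environment:
      # Empty string = no backend TTS; clients fall back to browser voices
      - AUDIO_TTS_ENGINE=
```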
## Microphone Access Issues
### Understanding Secure Contexts 🔒
For security reasons, accessing the microphone is restricted to pages served over HTTPS or locally from `localhost`. This requirement is meant to safeguard your data by ensuring it is transmitted over secure channels.
### Common Permission Issues 🚫
Chromium-based browsers (Chrome, Brave, Microsoft Edge, Opera, Vivaldi) and Firefox block microphone access on non-HTTPS URLs. This typically becomes an issue when you access the server from another device on the same network (e.g., opening a desktop server from a mobile phone).
### Solutions for Non-HTTPS Connections
1. **Set Up HTTPS (Recommended):**
- Configure your server to support HTTPS. This not only resolves permission issues but also enhances the security of your data transmissions.
- You can use a reverse proxy like Nginx or Caddy with Let's Encrypt certificates.
2. **Temporary Browser Flags (Use with caution):**
- These settings force your browser to treat certain insecure URLs as secure. This is useful for development purposes but poses significant security risks.
**Chromium-based Browsers (e.g., Chrome, Brave):**
- Open `chrome://flags/#unsafely-treat-insecure-origin-as-secure`
- Enter your non-HTTPS address (e.g., `http://192.168.1.35:3000`)
- Restart the browser to apply the changes
**Firefox-based Browsers:**
- Open `about:config`
- Search and modify (or create) the string value `dom.securecontext.allowlist`
- Add your IP addresses separated by commas (e.g., `http://127.0.0.1:8080`)
:::warning
While browser flags offer a quick fix, they bypass important security checks which can expose your device and data to vulnerabilities. Always prioritize proper security measures, especially when planning for a production environment.
:::
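As a sketch of the recommended HTTPS route (option 1 above), a minimal Caddyfile fronting Open WebUI could look like this. The domain and upstream port are placeholders; for a publicly reachable domain, Caddy obtains and renews Let's Encrypt certificates automatically:

```
# Caddyfile — replace chat.example.com and the port with your own values
chat.example.com {
    reverse_proxy localhost:3000
}
```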
### Microphone Not Working
If the microphone icon doesn't respond even on HTTPS:
1. **Check browser permissions:** Ensure your browser has microphone access for the site
2. **Check system permissions:** On Windows/Mac, ensure the browser has microphone access in system settings
3. **Check browser compatibility:** Some browsers have limited STT support
4. **Try a different browser:** Chrome typically has the best support for web audio APIs
---
## Text-to-Speech (TTS) Issues
### TTS Loading Forever / Not Working
If clicking the play button on chat responses causes endless loading, try the following solutions:
#### 1. Hugging Face Dataset Library Conflict (Local Transformers TTS)
**Symptoms:**
- TTS keeps loading forever
- Container logs show: `RuntimeError: Dataset scripts are no longer supported, but found cmu-arctic-xvectors.py`
**Cause:** This occurs when using local Transformers TTS (`AUDIO_TTS_ENGINE=transformers`). The `datasets` library is pulled in as an indirect dependency of the `transformers` package and isn't pinned to a specific version in Open WebUI's requirements. Newer versions of `datasets` removed support for dataset loading scripts, causing this error when loading speaker embeddings.
**Solutions:**
**Temporary fix** (lost when the container is recreated):
```bash
docker exec open-webui bash -lc "pip install datasets==3.6.0" && docker restart open-webui
```
**Permanent fix using environment variable:**
Add this to your `docker-compose.yml`:
```yaml
environment:
  - EXTRA_PIP_PACKAGES=datasets==3.6.0
```
**Verify the installed version:**
```bash
docker exec open-webui bash -lc "pip show datasets"
```
:::tip
Consider using an external TTS service like [OpenAI Edge TTS](/features/media-generation/audio/text-to-speech/openai-edge-tts-integration) or [Kokoro](/features/media-generation/audio/text-to-speech/Kokoro-FastAPI-integration) instead of local Transformers TTS to avoid these dependency conflicts.
:::
#### 2. Using External TTS Instead of Local
If you continue to have issues with local TTS, configuring an external TTS service is often more reliable. See the example Docker Compose configuration below that uses `openai-edge-tts`:
```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - AUDIO_TTS_ENGINE=openai
      - AUDIO_TTS_OPENAI_API_KEY=your-api-key-here
      - AUDIO_TTS_OPENAI_API_BASE_URL=http://openai-edge-tts:5050/v1
    depends_on:
      - openai-edge-tts
    # ... other configuration
  openai-edge-tts:
    image: travisvn/openai-edge-tts:latest
    ports:
      - "5050:5050"
    environment:
      - API_KEY=your-api-key-here
    restart: unless-stopped
```
### TTS Voice Not Found / No Audio Output
**Checklist:**
1. Verify the TTS engine is correctly configured in **Admin Panel → Settings → Audio**
2. Check that the voice name matches an available voice for your chosen engine
3. For external TTS services, verify the API Base URL is accessible from the Open WebUI container
4. Check container logs for any error messages
### Docker Networking Issues with TTS
If Open WebUI can't reach your TTS service:
**Problem:** Using `localhost` in the API Base URL doesn't work from within Docker.
**Solutions:**
- Use `host.docker.internal` instead of `localhost` (works on Docker Desktop for Windows/Mac)
- Use the container name if both services are on the same Docker network (e.g., `http://openai-edge-tts:5050/v1`)
- Use the host machine's IP address
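For the `host.docker.internal` route, a hedged compose sketch follows. Note that on Linux you must map the name explicitly via `extra_hosts`; Docker Desktop on Windows/Mac resolves it natively (the port is an example):

```yaml
services:
  open-webui:
    environment:
      # TTS service running on the Docker host, outside the compose network
      - AUDIO_TTS_OPENAI_API_BASE_URL=http://host.docker.internal:5050/v1
    extra_hosts:
      # Required on Linux; Docker Desktop provides this name out of the box
      - "host.docker.internal:host-gateway"
```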
---
## Speech-to-Text (STT) Issues
### Whisper STT Not Working / Compute Type Error
**Symptoms:**
- Error message: `Error transcribing chunk: Requested int8 compute type, but the target device or backend do not support efficient int8 computation`
- STT fails to process audio, often showing a persistent loading spinner or a red error toast.
**Cause:** This typically occurs when using the `:cuda` Docker image with an NVIDIA GPU that doesn't support the required `int8` compute operations (common on older Maxwell or Pascal architecture GPUs). In version **v0.6.43**, a regression caused the compute type to be incorrectly defaulted or hardcoded to `int8` in some scenarios.
**Solutions:**
#### 1. Upgrade to the Latest Version (Recommended)
The most reliable fix is to upgrade to the latest version of Open WebUI. Recent updates ensure that `WHISPER_COMPUTE_TYPE` is correctly respected and provide optimized defaults for CUDA environments.
#### 2. Manually Set Compute Type
If you are on an affected version or still experiencing issues on GPU, explicitly set the compute type to `float16`:
```yaml
environment:
  - WHISPER_COMPUTE_TYPE=float16
```
#### 3. Switch to the Standard Image
If your GPU is very old or compatibility issues persist, switch to the standard (CPU-based) image:
```bash
# Instead of:
# ghcr.io/open-webui/open-webui:cuda
# Use:
ghcr.io/open-webui/open-webui:main
```
:::info
The CUDA image primarily accelerates RAG embedding/reranking models and Whisper STT. For smaller models like Whisper, CPU mode often provides comparable performance without the compatibility issues.
:::
#### Compute Type Reference
**Available compute types (from faster-whisper):**
| Compute Type | Best For | Notes |
|--------------|----------|-------|
| `int8` | **CPU (default)** | Fastest, but doesn't work on older GPUs |
| `float16` | **CUDA/GPU (recommended)** | Best balance of speed and compatibility for GPUs |
| `int8_float16` | GPU with hybrid precision | Uses int8 for weights, float16 for computation |
| `float32` | Maximum compatibility | Slowest, but works on all hardware |
:::info Default Behavior
- **CPU mode:** Defaults to `int8` for best performance
- **CUDA mode:** The `:cuda` image may default to `int8`, which can cause errors on older GPUs. Set `float16` explicitly for GPUs.
:::
### STT Not Recognizing Speech Correctly
**Tips for better recognition:**
1. **Set the correct language:**
```yaml
environment:
  - WHISPER_LANGUAGE=en # Use ISO 639-1 language code
```
2. **Try a larger Whisper model** for better accuracy (at the cost of speed):
```yaml
environment:
  - WHISPER_MODEL=medium # Options: tiny, base, small, medium, large
```
3. **Check microphone permissions** in your browser (see above)
4. **Use the Web API engine** as an alternative:
- Go to user settings (not admin panel)
- Under STT Settings, try switching Speech-to-Text Engine to "Web API"
- This uses the browser's built-in speech recognition
---
## ElevenLabs Integration
ElevenLabs is natively supported in Open WebUI. To configure:
1. Go to **Admin Panel → Settings → Audio**
2. Select **ElevenLabs** as the TTS engine
3. Enter your ElevenLabs API key
4. Select the voice and model
5. Save settings
**Using environment variables:**
```yaml
environment:
  - AUDIO_TTS_ENGINE=elevenlabs
  - AUDIO_TTS_API_KEY=sk_... # Your ElevenLabs API key
  - AUDIO_TTS_VOICE=EXAVITQu4vr4xnSDxMaL # Voice ID from ElevenLabs dashboard
  - AUDIO_TTS_MODEL=eleven_multilingual_v2
```
:::note
You can find your Voice ID in the ElevenLabs dashboard under the voice settings. Common model options are `eleven_multilingual_v2` or `eleven_monolingual_v1`.
:::
---
## General Debugging Tips
### Check Container Logs
```bash
# View Open WebUI logs
docker logs open-webui -f
# View logs for external TTS service (if applicable)
docker logs openai-edge-tts -f
```
### Check Browser Console
1. Open browser developer tools (F12 or right-click → Inspect)
2. Go to the Console tab
3. Look for error messages when attempting to use audio features
### Verify Service Health
For external TTS services, test directly:
```bash
# Test OpenAI Edge TTS
curl -X POST http://localhost:5050/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{"input": "Hello, this is a test.", "voice": "alloy"}' \
  --output test.mp3
```
### Network Connectivity
Verify the Open WebUI container can reach external services:
```bash
# Enter the container
docker exec -it open-webui bash
# Test connectivity (if curl is available)
curl http://your-tts-service:port/health
```
---
## Quick Reference: Environment Variables
### TTS Environment Variables
| Variable | Description |
|----------|-------------|
| `AUDIO_TTS_ENGINE` | TTS engine: `""` (empty, disables backend TTS - uses browser), `openai`, `elevenlabs`, `azure`, `transformers` |
| `AUDIO_TTS_MODEL` | TTS model to use (default: `tts-1`) |
| `AUDIO_TTS_VOICE` | Default voice for TTS (default: `alloy`) |
| `AUDIO_TTS_API_KEY` | API key for ElevenLabs or Azure TTS |
| `AUDIO_TTS_OPENAI_API_BASE_URL` | Base URL for OpenAI-compatible TTS |
| `AUDIO_TTS_OPENAI_API_KEY` | API key for OpenAI-compatible TTS |
### STT Environment Variables
| Variable | Description |
|----------|-------------|
| `WHISPER_MODEL` | Whisper model: `tiny`, `base`, `small`, `medium`, `large` (default: `base`) |
| `WHISPER_COMPUTE_TYPE` | Compute type: `int8`, `float16`, `int8_float16`, `float32` (default: `int8`) |
| `WHISPER_LANGUAGE` | ISO 639-1 language code (empty = auto-detect) |
| `WHISPER_VAD_FILTER` | Enable Voice Activity Detection filter (default: `False`) |
| `AUDIO_STT_ENGINE` | STT engine: `""` (empty, uses local Whisper), `openai`, `azure`, `deepgram`, `mistral` |
| `AUDIO_STT_OPENAI_API_BASE_URL` | Base URL for OpenAI-compatible STT |
| `AUDIO_STT_OPENAI_API_KEY` | API key for OpenAI-compatible STT |
| `DEEPGRAM_API_KEY` | Deepgram API key |
For a complete list of audio environment variables, see [Environment Variable Configuration](/reference/env-configuration#audio).
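Putting the tables together, a fully local (no paid APIs) configuration might look like the following sketch, combining the built-in Whisper STT defaults with the Edge TTS sidecar shown earlier. Service names, the language code, and ports are examples, not requirements:

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      # STT: empty engine = built-in local Whisper on the backend
      - AUDIO_STT_ENGINE=
      - WHISPER_MODEL=base
      - WHISPER_LANGUAGE=en
      # TTS: OpenAI-compatible endpoint served by the sidecar below
      - AUDIO_TTS_ENGINE=openai
      - AUDIO_TTS_OPENAI_API_BASE_URL=http://openai-edge-tts:5050/v1
      - AUDIO_TTS_OPENAI_API_KEY=not-needed
    depends_on:
      - openai-edge-tts
  openai-edge-tts:
    image: travisvn/openai-edge-tts:latest
    ports:
      - "5050:5050"
```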
---
## Still Having Issues?
If you've tried the above solutions and still experience problems:
1. **Search existing issues** on GitHub for similar problems
2. **Check the discussions** for community solutions
3. **Create a new issue** with:
- Open WebUI version
- Docker image being used
- Complete error logs
- Detailed steps to reproduce
- Your environment details (OS, GPU if applicable)