🎤 docs: add custom speech config, browser TTS/STT features, and dynamic speech tab settings (#61)

* initial commit

* chore: Update STT and TTS configuration documentation

* docs(stt_tts): small improvements
This commit is contained in:
Marco Beretta
2024-07-05 17:41:09 +03:00
committed by GitHub
parent 9f2ee14300
commit 01f94bf8ff

@@ -6,237 +6,222 @@ description: Configuration of the Speech to Text (STT) and Text to Speech (TTS)
# Speech to Text (STT) and Text to Speech (TTS)
<Callout type="info" title="Upcoming STT/TTS Enhancements" collapsible>
The Google cloud STT/TTS and Deepgram services are beeing planned to add in the future
The Google Cloud STT/TTS and Deepgram services are being planned for future integration.
</Callout>
## STT
## Speech Introduction
The Speech-to-Text (STT) feature allows you to convert spoken words into written text.
To enable the STT (already configured), click on the STT button (near the send button) and start speaking. Otherwise, you can also use the key combination: ++Ctrl+Alt+L++ to start the transcription.
The Speech Configuration includes settings for both Speech-to-Text (STT) and Text-to-Speech (TTS) under a unified `speech:` section. Additionally, there is a new `speechTab` menu for user-specific settings.
There are many different STT services available, but here's a list of some of the most popular ones:
## Speech Tab (optional)
### Local STT
- Browser-based
- Whisper (tested on LocalAI and HomeAssistant)
### Cloud STT
- OpenAI Whisper (via API calls)
- Azure Whisper (via API calls)
- All the other OpenAI compatible STT services (via API calls)
### Browser-based
No setup required, just click make sure that the "STT button" in the speech settings tab is enabled and in the engine dropdown "Browser" is selected. When clicking the button, the browser will ask for permission to use the microphone. Once permission is granted, you can start speaking and the text will be displayed in the chat window in real-time. When you're done speaking, click the button again to stop the transcription or wait for the timeout to stop the transcription automatically
### Whisper local
<Callout type="warning" title="Compatibility Testing" collapsible>
Whisper local has been tested only on LocalAI and HomeAssistant's whisper docker image, but it should work on any other local whisper instance
</Callout>
To use the Whisper local STT service, you need to have a local whisper instance running. You can find more information on how to set up a local whisper instance with LocalAI in the [LocalAI's documentation](https://localai.io/features/audio-to-text/). Once you have a local whisper instance running, you can configure the STT service as followed:
in the `librechat.yaml` add this configuration:
The `speechTab` menu provides customizable options for conversation and advanced modes, as well as detailed settings for STT and TTS. The values configured here become the default settings for users.
```yaml
stt:
openai:
url: 'http://host.docker.internal:8080/v1/audio/transcriptions'
model: 'whisper'
speech:
  speechTab:
    conversationMode: true
    advancedMode: false
    speechToText:
      engineSTT: "OpenAI Whisper"
      languageSTT: "en"
      autoTranscribeAudio: true
      decibelValue: 30
      autoSendText: true
    textToSpeech:
      engineTTS: "OpenAI"
      voice: "alloy"
      languageTTS: "en"
      automaticPlayback: true
      playbackRate: 1.0
      cacheTTS: true
```
where, `url` it's the url of the whisper instance, the apiKey points to the .env and `model` is the model that you want to use for the transcription
## STT (Speech-to-Text)
### OpenAI Whisper
The Speech-to-Text (STT) feature converts spoken words into written text. To enable STT, click on the STT button (near the send button) or use the key combination ++Ctrl+Alt+L++ to start the transcription.
Create an OpenAI api key at [OpenAI's website](https://platform.openai.com/account/api-keys)
### Available STT Services
Then, in the `librechat.yaml` file, add the following configuration:
- **Local STT**
- Browser-based
- Whisper (tested on LocalAI)
- **Cloud STT**
- OpenAI Whisper
- Azure Whisper
- Other OpenAI-compatible STT services
### Configuring Local STT
- #### Browser-based
No setup required. Ensure the "Speech To Text" switch in the speech settings tab is enabled and "Browser" is selected in the engine dropdown.
- #### Whisper Local
Requires a local Whisper instance.
```yaml
stt:
openai:
apiKey: '${STT_API_KEY}'
model: 'whisper-1'
speech:
  stt:
    openai:
      url: 'http://host.docker.internal:8080/v1/audio/transcriptions'
      model: 'whisper'
```
<Callout type="abstract" title="Understanding Guide" collapsible>
### Configuring Cloud STT
if you want to understand more about these variables check the [Whisper local](#whisper-local) section
- #### OpenAI Whisper
</Callout>
### Azure Whisper (WIP)
in the `librechat.yaml` file, add the following configuration to your already existing Azure configuration:
<Callout type="i" title="don't have an Azure configuration yet?" collapsible>
if you don't have one, you can find more information on how to set up an Azure STT service in the [Azure's documentation](https://docs.librechat.ai/install/configuration/azure_openai.html)
</Callout>
```yaml filename="librechat.yaml"
models:
whisper:
deploymentName: whisper-01
```yaml
speech:
  stt:
    openai:
      apiKey: '${STT_API_KEY}'
      model: 'whisper-1'
```
<Callout type="abstract" title="Understanding Guide" collapsible>
- #### Azure Whisper (WIP)
if you want to understand more about these variables check the [Whisper local](#whisper-local) section
```yaml
speech:
  stt:
    azure:
      instanceName: 'instanceName'
      apiKey: '${STT_API_KEY}'
      deploymentName: 'deploymentName'
      apiVersion: 'apiVersion'
```
</Callout>
- #### OpenAI compatible
### OpenAI compatible STT services
Refer to the OpenAI Whisper section, adjusting the `url` and `model` as needed.
check the [OpenAI Whisper](#openai-whisper) section, just change the `url` and `model` variables to the ones that you want to use
Example:
```yaml
speech:
  stt:
    openai:
      url: 'http://host.docker.internal:8080/v1/audio/transcriptions'
      model: 'whisper'
```
## TTS
## TTS (Text-to-Speech)
The Text-to-Speech (TTS) feature allows you to convert written text into spoken words. There are many different TTS services available, but here's a list of some of the most popular ones:
The Text-to-Speech (TTS) feature converts written text into spoken words. Various TTS services are available:
### Local TTS
### Available TTS Services
- Browser-based
- Piper (tested on LocalAI)
- Coqui (tested on LocalAI)
- **Local TTS**
- Browser-based
- Piper (tested on LocalAI)
- Coqui (tested on LocalAI)
- **Cloud TTS**
- OpenAI TTS
- ElevenLabs
- Other OpenAI/ElevenLabs-compatible TTS services
### Cloud TTS
### Configuring Local TTS
- OpenAI TTS
- ElevenLabs
- All the other OpenAI compatible TTS services
- #### Browser-based
### Browser-based
No setup required. Ensure the "Text To Speech" switch in the speech settings tab is enabled and "Browser" is selected in the engine dropdown.
No setup required, just click make sure that the "TTS button" in the speech settings tab is enabled and in the engine dropdown "Browser" is selected. When clicking the button, it will start speaking, click the button again to stop the speech or wait for the speech to finish
- #### Piper
### Piper
Requires a local Piper instance.
<Callout type="warning" title="Compatibility Testing" collapsible>
Piper has been tested only on LocalAI, but it should work on any other local piper instance
</Callout>
To use the Piper local TTS service, you need to have a local piper instance running. You can find more information on how to set up a local piper instance with LocalAI in the [LocalAI's documentation](https://localai.io/features/text-to-audio/#piper). Once you have a local piper instance running, you can configure the TTS service as followed:
In the `librechat.yaml` add this configuration:
```yaml filename="librechat.yaml"
tts:
localai:
url: "http://host.docker.internal:8080/tts"
apiKey: "EMPTY"
voices: [
"en-us-amy-low.onnx",
"en-us-danny-low.onnx",
"en-us-libritts-high.onnx",
"en-us-ryan-high.onnx",
```yaml
speech:
tts:
localai:
url: "http://host.docker.internal:8080/tts"
apiKey: "EMPTY"
voices: [
"en-us-amy-low.onnx",
"en-us-danny-low.onnx",
"en-us-libritts-high.onnx",
"en-us-ryan-high.onnx",
]
backend: "piper"
backend: "piper"
```
Voices are just an example, you can find more information about the voices in the [LocalAI's documentation](https://localai.io/features/text-to-audio/#piper)
- #### Coqui
### Coqui
Requires a local Coqui instance.
<Callout type="warning" title="Compatibility Testing" collapsible>
Coqui has been tested only on LocalAI, but it should work on any other local coqui instance
</Callout>
To use the Coqui local TTS service, you need to have a local coqui instance running. You can find more information on how to set up a local coqui instance with LocalAI in the [LocalAI's documentation](https://localai.io/features/text-to-audio/#-coqui). Once you have a local coqui instance running, you can configure the TTS service as followed:
in the `librechat.yaml` add this configuration:
```yaml filename="librechat.yaml"
tts:
localai:
url: 'http://localhost:8080/v1/audio/synthesize'
voices: ['tts_models/en/ljspeech/glow-tts', 'tts_models/en/ljspeech/tacotron2', 'tts_models/en/ljspeech/waveglow']
backend: 'coqui'
```yaml
speech:
  tts:
    localai:
      url: 'http://localhost:8080/v1/audio/synthesize'
      voices: ['tts_models/en/ljspeech/glow-tts', 'tts_models/en/ljspeech/tacotron2', 'tts_models/en/ljspeech/waveglow']
      backend: 'coqui'
```
voices are just an example, you can find more information about the voices in the [LocalAI's documentation](https://localai.io/features/text-to-audio/#-coqui)
### Configuring Cloud TTS
### OpenAI TTS
- #### OpenAI TTS
Create an OpenAI api key at [OpenAI's website](https://platform.openai.com/account/api-keys)
Then, in the `librechat.yaml` file, add the following configuration:
```yaml filename="librechat.yaml"
tts:
openai:
apiKey: '${TTS_API_KEY}'
model: 'tts-1'
voices: ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']
```yaml
speech:
  tts:
    openai:
      apiKey: '${TTS_API_KEY}'
      model: 'tts-1'
      voices: ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']
```
you can choose between the `tts-1` and the `tts-1-hd` models, more information about the models can be found in the [OpenAI's documentation](https://platform.openai.com/docs/guides/text-to-speech/audio-quality)
- #### ElevenLabs
the `voice` variable can be `alloy`, `echo`, `fable` etc... more information about the voices can be found in the [OpenAI's documentation](https://platform.openai.com/docs/guides/text-to-speech/voice-options)
### ElevenLabs
Create an ElevenLabs api key at [ElevenLabs's website](https://elevenlabs.io/)
Then, click on the "Voices" tab, and copy the ID of the voices you want to use. If you haven't already added one, click on the "Voice library" where you can find a lot of pre-made voices, add one and copy the ID of the voice that you want to use by clicking the "ID" button
in the `librechat.yaml` file, add the following configuration:
```yaml filename="librechat.yaml"
tts:
elevenlabs:
apiKey: '${TTS_API_KEY}'
model: 'eleven_multilingual_v2'
voices: ['202898wioas09d2', 'addwqr324tesfsf', '3asdasr3qrq44w', 'adsadsa']
```yaml
speech:
  tts:
    elevenlabs:
      apiKey: '${TTS_API_KEY}'
      model: 'eleven_multilingual_v2'
      voices: ['202898wioas09d2', 'addwqr324tesfsf', '3asdasr3qrq44w', 'adsadsa']
```
- **model:** model is the model that you want to use for the synthesis (not the voice), you can find more information about the models in the [ElevenLabs's documentation](https://elevenlabs.io/docs/api-reference/get-models)
Additional ElevenLabs-specific parameters can be added as follows:
- **voices:** list all the voice IDs you want. Add them first to your Elevenlabs account here: https://elevenlabs.io/app/voice-lab
if you want to add custom parameters, you can add them in the `librechat.yaml` file as follows:
<Callout type="warning" title="only for ElevenLabs" collapsible>
these parameters under the `voice_settings` and the pronunciation_dictionary_locators are only for ElevenLabs
</Callout>
```yaml filename="librechat.yaml"
voice_settings:
similarity_boost: '' # number
stability: '' # number
style: '' # number
use_speaker_boost: #boolean
pronunciation_dictionary_locators: [''] # list of strings (array)
```yaml
voice_settings:
  similarity_boost: '' # number
  stability: '' # number
  style: '' # number
  use_speaker_boost: # boolean
pronunciation_dictionary_locators: [''] # list of strings (array)
```
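As a rough sketch of where these optional parameters might sit, the block below nests them inside the `elevenlabs` entry used in the ElevenLabs example. The placement under `elevenlabs` and every numeric value are assumptions for illustration only and are not confirmed by this change; check the ElevenLabs API reference for the accepted ranges before relying on them.

```yaml
speech:
  tts:
    elevenlabs:
      apiKey: '${TTS_API_KEY}'
      model: 'eleven_multilingual_v2'
      voices: ['202898wioas09d2']  # placeholder voice ID from the example
      # Assumed placement: ElevenLabs-only options next to the other keys
      voice_settings:
        similarity_boost: 0.75  # illustrative value
        stability: 0.5          # illustrative value
        style: 0.0              # illustrative value
        use_speaker_boost: true
      pronunciation_dictionary_locators: []  # list of strings; empty if unused
```

These keys apply only to ElevenLabs, so they would be omitted from `openai` or `localai` entries.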
### OpenAI compatible TTS services
- #### OpenAI compatible
check the [OpenAI TTS](#openai-tts) section, just change the `url` variable to the ones that you want to use. It should be a complete url:
```yaml filename="librechat.yaml"
tts:
openai:
apiKey: 'sk-xxx'
model: 'tts-1'
voices: ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']
url: "https://api.compatible.com/v1/audio/speech"
Refer to the OpenAI TTS section, adjusting the `url` variable as needed.
Example:
```yaml
speech:
  tts:
    openai:
      url: 'http://host.docker.internal:8080/v1/audio/synthesize'
      apiKey: '${TTS_API_KEY}'
      model: 'tts-1'
      voices: ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']
```
### ElevenLabs compatible TTS services
- #### ElevenLabs compatible
check the [ElevenLabs](#elevenlabs) section, just change the `url` variable to the ones that you want to use
Refer to the ElevenLabs section, adjusting the `url` variable as needed.
Example:
```yaml
speech:
  tts:
    elevenlabs:
      url: 'http://host.docker.internal:8080/v1/audio/synthesize'
      apiKey: '${TTS_API_KEY}'
      model: 'eleven_multilingual_v2'
      voices: ['202898wioas09d2', 'addwqr324tesfsf', '3asdasr3qrq44w', 'adsadsa']
```
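Since STT, TTS, and the `speechTab` defaults all live under the single `speech:` key, a combined `librechat.yaml` fragment could look like the sketch below. It only reuses keys and values shown in the individual examples; the exact nesting of `speechToText` and `textToSpeech` under `speechTab` is an assumption, and the choice of OpenAI for both engines is just one possible combination.

```yaml
# Sketch: one `speech:` section holding STT, TTS, and user-facing defaults
speech:
  stt:
    openai:
      apiKey: '${STT_API_KEY}'  # read from the environment (.env)
      model: 'whisper-1'
  tts:
    openai:
      apiKey: '${TTS_API_KEY}'  # read from the environment (.env)
      model: 'tts-1'
      voices: ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']
  speechTab:
    conversationMode: true
    advancedMode: false
    speechToText:
      engineSTT: "OpenAI Whisper"
      languageSTT: "en"
    textToSpeech:
      engineTTS: "OpenAI"
      voice: "alloy"  # should be one of the voices listed under tts.openai
      languageTTS: "en"
```

Keeping the default `voice` in `speechTab` aligned with the `voices` list under `tts` avoids pointing users at a voice the configured service does not offer.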