AllTalk TTS for Text generation webUI

Table of Contents

Getting Started

AllTalk is a web interface based around the Coqui TTS voice cloning/speech generation system. To generate TTS, you can use the provided Gradio interface or interact with the server using JSON/cURL commands.

Note: When loading up a new character in Text generation webUI it may look like nothing is happening for 20-30 seconds. It is actually processing the introduction section (greeting message) of the text, and once that is completed, it will appear. You can see the activity occurring in the console window. Refreshing the page multiple times will only force the TTS engine to keep re-generating the text, so please just wait and check the console if needed.

Note: Ensure that your RP character card has asterisks around anything that is narration and double quotes around anything spoken. There is a complication ONLY with the greeting card, so ensuring it has the correct use of quotes and asterisks will help make sure the greeting sounds correct. I will aim to address this issue in a future update. In the Text-generation-webUI parameters menu > character tab > greeting, make sure that anything that is the narrator is in asterisks and anything spoken is in double quotes, then hit the save (💾) button.

AllTalk Minor Bug Fixes Changelog & Known issues

If I squash any minor bugs or find any issues, I will try to apply an update ASAP. If you think something isn't working correctly or you have a problem, check the links below first.

AllTalk
Github link here
Update instructions link here
Help and issues link here
TTS Generator link here

Text generation webUI
Web interface link here
Documentation link here

Back to top of page

Server Information

Back to top of page

Demo/Test TTS

If you want to generate bulk quantities of TTS and have control over them, please see the AllTalk TTS Generator below.

Back to top of page

AllTalk TTS Generator

AllTalk TTS Generator is the solution for converting large volumes of text into speech using the voice of your choice. Whether you're creating audio content or just want to hear text read aloud, the TTS Generator is equipped to handle it all efficiently.

The TTS Generator is available at this link

Quick Start

Once you have sent text off to be generated, either as a stream or as a WAV file, the TTS server will remain busy until the process has completed. As such, think carefully about how much you want to send to the server.

If you are generating WAV files and populating the queue, you can generate one batch of text to speech, then input your next batch of text and it will continue adding to the list.

TTS Generation Modes

With WAV chunks you can play back either “In Browser”, which is the web page you are on, or “On Server”, which is through the console/terminal where AllTalk is running. Only “In Browser” generation can play back smoothly and populate the Generated TTS List. Setting the Volume will affect the volume level played back both “In Browser” and “On Server”.

Playback and List Management

Exporting Your Audio

Customization and Preferences

Interface and Accessibility

Notes on Usage

Back to top of page

Using Voice Samples

Where are the sample voices stored?

Voice samples are stored in /alltalk_tts/voices/ and should be named using the format name.wav

Where are the outputs stored & automatic output WAV file deletion?

Voice outputs are stored in /alltalk_tts/outputs/

You can configure automatic maintenance deletion of old wav files by setting Del WAV's older than in the settings above.

When Disabled, your output WAV files will be left untouched. When set to 1 Day or greater, output WAV files older than that time period will be automatically deleted when AllTalk starts up.

Where are the models stored?

This extension will download the 2.0.2 model to /alltalk_tts/models/

This TTS engine will also download the latest available model and store it wherever your OS normally stores it (Windows/Linux/Mac).

How do I create a new voice sample?

To create a new voice sample, you need to make a WAV file that is 22050Hz, mono, 16-bit and between 6 and 30 seconds long, though 8 to 10 seconds is usually good enough. The model can handle samples of up to 30 seconds, however I've not noticed any improvement in voice output from much longer clips.

You want to find a nice clear selection of audio, so let's say you wanted to clone your favourite celebrity. You may go looking for an interview where they are talking. Pay close attention to the audio you are listening to and trying to sample. Are there noises in the background, hiss on the soundtrack, a low hum, some quiet music playing? The better the quality of the audio, the better the final TTS result. Don't forget, the AI that processes the sounds can hear everything in your sample and will use it in the voice it's trying to recreate.

Try to make your clip one of nice flowing speech, like the included example files: no big pauses, gaps or other sounds. Preferably pick a sample in which the person you are trying to copy shows a little vocal range and emotion in their voice. Also, try to avoid a clip starting or ending with breathy sounds (breathing in/out etc.).

Editing your sample!

So, you've downloaded your favourite celebrity interview off YouTube; from here you need to chop it down to 6 to 30 seconds in length and resample it.

If you need to clean it up or do audio processing, volume level changes etc., do this before down-sampling.

Using the latest version of Audacity select/highlight your 6 to 30 second clip and:

                  Tracks > Resample to 22050Hz then
                  Tracks > Mix > Stereo to Mono then
                  File > Export Audio saving it as a WAV of 22050Hz.

Save your generated wav file in the /alltalk_tts/voices/ folder.
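
If you prefer to script the conversion rather than use Audacity, below is a minimal Python sketch using the librosa and soundfile packages (these packages and the file names are illustrative assumptions, not AllTalk requirements):

    # Convert any source clip to the 22050Hz, mono, 16-bit WAV format described above.
    # Requires: pip install librosa soundfile
    import librosa
    import soundfile as sf

    # Load the clip, resampling to 22050Hz and downmixing to mono in one step
    audio, sample_rate = librosa.load("raw_clip.wav", sr=22050, mono=True)

    # Write a 16-bit PCM WAV into the AllTalk voices folder
    sf.write("alltalk_tts/voices/mynewvoice.wav", audio, sample_rate, subtype="PCM_16")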

It's worth mentioning that using AI-generated audio clips may introduce unwanted sounds, as the clip is already a copy/simulation of a voice.

Why doesn't it sound like XXX Person?

You may be interested in trying Finetuning of the model (covered below). Otherwise, the cause is usually the voice sample itself.

Some samples just never seem to work correctly, so maybe try a different sample. Always remember though, this is an AI model attempting to re-create a voice, so you will never get a 100% match.

Back to top of page

Text Not inside

This only affects the Narrator function. Most AI models should use asterisks or double quotes to differentiate between the narrator and the character; however, many models sometimes switch between using asterisks, double quotes, or nothing at all for the text they output.

This leaves a bit of a mess, because sometimes un-marked text is narration and sometimes it's the character talking, leaving no clear way to know where to split sentences and which voice to use. Whilst there is no 100% solution at the moment, many models lean more one way or the other as to what that unmarked text will be (character or narrator).

As such, the "Text not inside" function at least gives you the choice to set how you want the TTS engine to handle un-marked text. For example, given *He paused* "Hello there" and then he walked away, the trailing unmarked clause will be spoken by whichever voice (character or narrator) you select here.

When the AI doesn't use an asterisk or a quote

Back to top of page

Low VRAM

The Low VRAM option is a crucial feature designed to enhance performance under constrained VRAM conditions, as the TTS models require 2GB-3GB of VRAM to run effectively. It manages the relocation of the Text-to-Speech (TTS) model between your system's Random Access Memory (RAM) and VRAM, moving it between the two on the fly. This is very useful for people who have smaller graphics cards and use all their VRAM to load in their LLM.

When you don't have enough VRAM free after loading your LLM into VRAM, your GPU is left with so little working space that it has to swap bits of the TTS model in and out, which causes a horrible slowdown.

Note: An Nvidia Graphics card is required for the LowVRAM option to work, as you will just be using system RAM otherwise. 

How It Works:

In Low VRAM mode, the entire TTS model is kept in your system RAM. When the TTS engine requires VRAM for processing, the entire model moves into VRAM, causing your LLM to unload/displace some layers and ensuring the TTS engine has the working space it needs.

Post-TTS processing, the model moves back to system RAM, freeing up VRAM space for your Language Model (LLM) to load back in the missing layers. This adds about 1-2 seconds to both text generation by the LLM and the TTS engine.

By transferring the entire model between RAM and VRAM, the Low VRAM option avoids fragmentation, ensuring the TTS model remains cohesive and has all the working space it needs in your GPU, without having to work on small bits of the TTS model at a time (which causes terrible slowdown).

This creates a TTS generation performance boost for low VRAM users and is particularly beneficial for users with less than 2GB of free VRAM after loading their LLM, delivering a substantial 5-10x improvement in TTS generation speed.
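
Conceptually, the swap described above works something like the simplified PyTorch sketch below (an illustration of the technique, not AllTalk's actual code):

    import torch

    def generate_with_low_vram(tts_model, inputs):
        """Sketch of the Low VRAM swap: the model lives in RAM and only visits VRAM per request."""
        tts_model.to("cuda")           # move the whole TTS model into VRAM
        with torch.no_grad():
            audio = tts_model(inputs)  # generate while the model is on the GPU
        tts_model.to("cpu")            # move the model back to system RAM
        torch.cuda.empty_cache()       # release cached VRAM so the LLM can reload its layers
        return audio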

How Low VRAM Works

Back to top of page

SillyTavern Support

Important note for Text-generation-webui users

You HAVE to disable Enable TTS within the Text-generation-webui AllTalk interface, otherwise Text-generation-webui will also generate TTS due to the way it sends out text. You can do this each time you start up Text-generation-webui or set it in the start-up settings at the top of this page.

Quick Tips

TTS Generation Methods in SillyTavern

You have two types of audio generation options: Streaming and Standard.

The Streaming Audio Generation method is designed for speed and is best suited for situations where you just want quick audio playback. This method, however, is limited to one voice per TTS generation request, so it cannot utilize the AllTalk narrator function, making it a straightforward but less nuanced option.

On the other hand, the Standard Audio Generation method provides a richer auditory experience. It's slightly slower than the Streaming method but compensates for this with its ability to split text into multiple voices. This functionality is particularly useful in scenarios where differentiating between character dialogues and narration can enhance the storytelling and delivery. The inclusion of the AllTalk narrator functionality in the Standard method allows for a more layered and immersive experience, making it ideal for content where depth and variety in voice narration add significant value.

In summary, the choice between Streaming and Standard methods in AllTalk TTS depends on what you want. Streaming is great for quick and simple audio generation, while Standard is preferable for a more dynamic and engaging audio experience.

Changing the model, DeepSpeed, or Low VRAM settings each takes about 15 seconds, so you should only change one at a time and wait for Ready before changing the next setting. To set these options long term, you can apply the settings at the top of this page.

AllTalk Narrator

Only available with the Standard Audio Generation method.

Usage Notes:

Troubleshooting:

Back to top of page

Finetuning (Training the model)

If you have a voice that the model doesn't quite reproduce correctly, or indeed you just want to improve the reproduced voice, then finetuning is a way to train your "XTTSv2 local" model (stored in /alltalk_tts/models/xxxxx/) on a specific voice. For this you will need an Nvidia graphics card and a small portion of the Nvidia CUDA Toolkit 11.8 (see the requirements below).

How will this work/How complicated is it?

Everything has been done to make this as simple as possible. At its simplest, you can literally just download a large chunk of audio from an interview and tell the finetuning to strip through it, find the spoken parts and build your dataset. You can click 4 buttons, then copy a few files, and you are done. At its more complicated end you will clean up the audio a little beforehand, but it's still only 4 buttons and copying a few files.

The audio you will use:

I would suggest that if it's in an interview format, you cut out the interviewer speaking in Audacity or your chosen audio editing package. You don't have to worry about being perfect with your cuts; the finetuning Step 1 will go and find spoken audio and cut it out for you. If there is music over the spoken parts, for best quality you would cut out those parts, though it's not 100% necessary. As always, try to avoid bad quality audio with noises in it (humming sounds, hiss etc.). You can try something like Audioenhancer to clean up noisier audio. There is no need to down-sample any of the audio; all of that is handled for you. Just give the finetuning some good quality audio to work with.

Important requirements CUDA 11.8:

As mentioned, you must have a small portion of the Nvidia CUDA Toolkit 11.8 installed. Not higher or lower versions: specifically 11.8. You do not have to uninstall any other versions, change any graphics drivers, reinstall torch or anything like that. There are instructions within the finetuning interface on doing this, or you can also find them on this link here

Starting Finetuning:

Ensure you have followed the instructions on setting up the Nvidia CUDA Toolkit 11.8 here or the below procedure will fail.

The below instructions are also available online here

  1. Close all other applications that are using your GPU/VRAM and copy your audio samples into:

    /alltalk_tts/finetune/put-voice-samples-in-here/

  2. In a command prompt/terminal window you need to move into your Text generation webUI folder:

    cd text-generation-webui

  3. Start the Text generation webUI Python environment for your OS:

    cmd_windows.bat, ./cmd_linux.sh, cmd_macos.sh or cmd_wsl.bat

  4. You can double check your search path environment still works correctly with nvcc --version. It should report back 11.8:

    Cuda compilation tools, release 11.8.

  5. Move into your extensions folder:

    cd extensions

  6. Move into the alltalk_tts folder:

    cd alltalk_tts

  7. Install the finetune requirements file: pip install -r requirements_finetune.txt

  8. Type python finetune.py and it should start up.

  9. Follow the on-screen instructions when the web interface starts up.

  10. When you have finished finetuning, the final tab will tell you what to do with your files and how to move your newly trained model to the correct location on disk.

Back to top of page

DeepSpeed

DeepSpeed provides a 2x-3x speed boost for Text-to-Speech and AI tasks. It's all about making AI and TTS happen faster and more efficiently.

DeepSpeed on vs off

DeepSpeed only works with the XTTSv2 Local model and will deactivate when other models are selected, even if the checkbox still shows as being selected.

Note: DeepSpeed/AllTalk may warn you if the Nvidia CUDA Toolkit and the CUDA_HOME environment variable aren't set correctly. On Linux you need CUDA_HOME configured correctly; on Windows, if you use the pre-built wheel, it is fine without.

Note: You do not need to set Text-generation-webUI's --deepspeed setting for AllTalk to be able to use DeepSpeed.

Back to top of page

DeepSpeed Setup - Linux

➡️DeepSpeed requires an Nvidia Graphics card!⬅️

  1. Preferably use your built-in package manager to install the CUDA Toolkit. Alternatively, download and install the Nvidia CUDA Toolkit for Linux Nvidia Cuda Toolkit 11.8 or 12.1
  2. Open a terminal console.
  3. Install libaio-dev (however your Linux version installs things) e.g. sudo apt install libaio-dev
  4. Move into your Text generation webUI folder e.g. cd text-generation-webui
  5. Start the Text generation webUI Python environment ./cmd_linux.sh
  6. Text generation webUI overwrites the CUDA_HOME environment variable each time you run ./cmd_linux.sh or ./start_linux.sh, so you will need to either permanently change it within the Python environment OR set CUDA_HOME each time you run ./cmd_linux.sh. Details on setting it each time are in the next step. Below is a link to Conda's manual on changing environment variables permanently, though it's possible that changing it permanently could affect other extensions; you would have to test. Conda manual - Environment variables
  7. You can temporarily set the CUDA_HOME environment with (Standard Ubuntu path below, but it could vary on other Linux flavours):

    export CUDA_HOME=/etc/alternatives/cuda every time you run ./cmd_linux.sh

    If you try to start DeepSpeed with the CUDA_HOME path set incorrectly, expect an error similar to:

             [Errno 2] No such file or directory: /home/yourname/text-generation-webui/installer_files/env/bin/nvcc

  8. Now install deepspeed with pip install deepspeed
  9. You can now start Text generation webUI python server.py ensuring to activate your extensions.

    Just to reiterate, starting Text-generation-webUI with ./start_linux.sh will overwrite the CUDA_HOME variable unless you have permanently changed it. Hence, each time you would start with ./cmd_linux.sh, set the environment variable manually with export CUDA_HOME=/etc/alternatives/cuda, and then run python server.py, unless you have permanently set CUDA_HOME within Text-generation-webUI's standard Python environment.

    Removal - If it became necessary to uninstall DeepSpeed, you can do so with ./cmd_linux.sh and then pip uninstall deepspeed

Back to top of page

DeepSpeed Setup - Windows

➡️DeepSpeed requires an Nvidia Graphics card!⬅️

The atsetup utility for Windows can install DeepSpeed for you (standalone users who installed via the atsetup utility will already have DeepSpeed installed).

DeepSpeed v11.2 will work on the current default text-generation-webui Python 3.11 environment! You have two options for how to set up DeepSpeed on Windows: a quick way (Option 1) and a long way (Option 2).

Thanks to @S95Sedan, who managed to get DeepSpeed v11.2 working on Windows by making some edits to the original Microsoft DeepSpeed v11.2 installation. The original post is here.

OPTION 1 - Pre-Compiled Wheel Deepspeed v11.2 (Python 3.11 and 3.10)

  1. Download the correct wheel version for your Python/CUDA from here and save the file inside your text-generation-webui folder.
  2. Open a command prompt window and move into your text-generation-webui folder, then start the Python environment for text-generation-webui with cmd_windows.bat
  3. With the file saved in the text-generation-webui folder, type the following, replacing your-version with the name of the file you have:

    pip install "deepspeed-0.11.2+your-version-win_amd64.whl"

  4. This should install cleanly and you should now have DeepSpeed v11.2 installed within the Python 3.11/3.10 environment of text-generation-webui.
  5. When you start up text-generation-webui, and AllTalk starts, you should see:

                [AllTalk Startup] DeepSpeed Detected

  6. Within AllTalk, you will now have a checkbox for Activate DeepSpeed, though remember you can only change one setting every 15 or so seconds, so don't try to activate DeepSpeed and LowVRAM simultaneously. When you are happy it works, you can set the default start-up settings in the settings page.

    Removal - If it became necessary to uninstall DeepSpeed, you can do so with cmd_windows.bat and then pip uninstall deepspeed


OPTION 2 - Manual build of DeepSpeed v11.2 (Python 3.11 and 3.10)

Due to the complexity of this, and the complicated formatting, instructions can be found on this link

Back to top of page

TTS Models/Methods

It's worth noting that all models and methods can and do sound different from one another. Many people complained about the quality of audio produced by the 2.0.3 model, so this extension downloads the 2.0.2 model to your models folder and gives you the choice to use 2.0.2 (API Local and XTTSv2 Local) or the most current model, 2.0.3 (API TTS). As/when a new model is released by Coqui, it will be downloaded by the TTS service on startup and stored wherever the TTS service keeps new models for your operating system.

Back to top of page

Model Temperature and Repetition Settings

It is recommended not to modify these settings unless you fully comprehend their effects. A general overview is provided below for reference.

Changes to these settings won't take effect until you restart AllTalk/Text generation webUI.

These settings only affect API Local and XTTSv2 Local methods.

Repetition Penalty:

In the context of text-to-speech (TTS), the Repetition Penalty influences how the model handles the repetition of sounds, phonemes, or intonation patterns. A higher value discourages the model from repeating the same sounds and patterns, while a lower value permits more repetition; pushing it too far in either direction can make the speech sound unnatural.

The factory setting for repetition penalty is 10.0

Temperature:

Temperature influences the randomness of the generated speech. Lower values make the delivery more consistent and predictable, while higher values introduce more variation in intonation and pacing, at the risk of odd-sounding output.

The factory setting for temperature is 0.70

Temperature and Repetition Settings Examples:

Factory settings should be fine for most people; however, if you choose to experiment, setting extremely high or low values, especially without a good understanding of their effects, may lead to flat-sounding or very strange-sounding output. It's advisable to make adjustments incrementally and observe the impact on the generated speech to find a balance that suits your desired outcome.
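
If you do want to change the defaults, the corresponding keys live in confignew.json (see the configuration table below). A minimal sketch of editing them programmatically, assuming you run it from the alltalk_tts folder; remember a restart is needed afterwards:

    import json

    # Read the existing configuration, adjust the two model settings, write it back.
    with open("confignew.json", "r", encoding="utf-8") as f:
        config = json.load(f)

    config["local_temperature"] = "0.70"         # factory default
    config["local_repetition_penalty"] = "10.0"  # factory default

    with open("confignew.json", "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)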

Back to top of page

Start-up Checks

AllTalk performs a variety of checks on startup and will output warning messages to the console should you need to do something, such as update your TTS version.

A basic environment check is performed to ensure everything should work, e.g. is the model already downloaded, are the configuration files set correctly, etc.

AllTalk will download the XTTS model (version 2.0.2) into your models folder. Many people didn't like the quality of the 2.0.3 model; however, the latest model will still be accessible via the API TTS setting (2.0.3 at the time of writing), so you have the best of both worlds.

Back to top of page

Custom TTS Models and Model path

It's possible to set a custom model for the API Local and XTTSv2 Local methods, or indeed point them at the same model that API TTS uses (wherever it is stored on your OS of choice).

Many people did not like the sound quality of the Coqui 2.0.3 model, and as such AllTalk downloads the 2.0.2 model separately from the 2.0.3 model that the TTS service downloads and manages.

Typically the 2.0.2 model is stored in your /alltalk_tts/models folder and it is downloaded on first start-up of the AllTalk_tts extension. However, you may either want to use a custom model version of your choosing, or re-point AllTalk to a different path on your system, or even point it so that API Local and XTTSv2 Local both use the same model that API TTS is using.

If you do choose to change the location, there are a couple of things to note.

To change the model path, there are at minimum two settings you need to alter in the modeldownload.json file: base_path and model_path.

You would edit the settings in the modeldownload.json file as follows (make a backup of your current file first, just in case):

        Windows path example: c:\\mystuff\\mydownloads\\myTTSmodel\\{files in here}
        base_path would be "c:\\mystuff\\mydownloads"
        model_path would be "myTTSmodel"

Note: On Windows systems, you have to specify a double backslash \\ for each folder level in the path (as above)

        Linux path example: /home/myaccount/myTTSmodel/{files in here}
        base_path would be "/home/myaccount"
        model_path would be "myTTSmodel"

Once you restart AllTalk, it will check this path for the files and output any details at the console.
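
If you want to sanity-check the folder yourself before restarting, here is a small sketch (the paths are the example values above; the file list comes from modeldownload.json):

    from pathlib import Path

    # Confirm the custom model folder contains the files AllTalk expects
    model_dir = Path("c:/mystuff/mydownloads") / "myTTSmodel"
    required = ["LICENSE.txt", "README.md", "config.json", "model.pth", "vocab.json"]
    missing = [name for name in required if not (model_dir / name).exists()]
    print("All model files present" if not missing else f"Missing files: {missing}")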

When you are happy it's working correctly, you are welcome to delete the models folder stored at /alltalk_tts/models.

If you wish to change the files that the model downloader pulls at startup, you can further edit the modeldownload.json and change the https addresses within the files_to_download section, e.g.

"files_to_download": {
        "LICENSE.txt": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/LICENSE.txt?download=true",
        "README.md": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/README.md?download=true",
        "config.json": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/config.json?download=true",
        "model.pth": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/model.pth?download=true",
        "vocab.json": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/vocab.json?download=true"
 }

Back to top of page

Configuration file settings

confignew.json file:

Key | Default Value | Explanation
"activate" | true | Sets activation state within Text-generation-webUI.
"autoplay" | true | Sets autoplay within Text generation webUI.
"branding" | "AllTalk" | Used to change the default name. Requires a trailing space, e.g. "Mybrand ".
"deepspeed_activate" | false | Sets DeepSpeed activation on startup.
"delete_output_wavs" | "Disabled" | Sets the age at which output WAV files are deleted.
"ip_address" | "127.0.0.1" | Sets default IP address.
"language" | "English" | Sets default language for Text-generation-webUI TTS.
"low_vram" | false | Sets default setting for Low VRAM mode.
"local_temperature" | "0.70" | Sets default model temperature for API Local and XTTSv2 Local.
"local_repetition_penalty" | "10.0" | Sets default model repetition penalty for API Local and XTTSv2 Local.
"tts_model_loaded" | true | AllTalk internal use only. Do not change.
"tts_model_name" | "tts_models/multilingual/multi-dataset/xtts_v2" | Sets default model that API TTS looks for through the TTS service (separate from API Local and XTTSv2 Local).
"narrator_enabled" | true | Sets default narrator on/off in Text-generation-webUI TTS.
"narrator_voice" | "female_02.wav" | Sets default WAV to use for the narrator in Text-generation-webUI TTS.
"port_number" | "7851" | Sets default port number for AllTalk.
"output_folder_wav" | "extensions/alltalk_tts/outputs/" | Sets default output path Text-generation-webUI should use for finding outputs.
"output_folder_wav_standalone" | "outputs/" | Sets default output path in standalone mode.
"remove_trailing_dots" | false | Sets trailing dot removal before generating TTS.
"show_text" | true | Sets whether text is displayed below the audio in Text-generation-webUI.
"tts_method_api_local" | false | Sets API Local as the default model/method for TTS.
"tts_method_api_tts" | false | Sets API TTS as the default model/method for TTS.
"tts_method_xtts_local" | true | Sets XTTSv2 Local as the default model/method for TTS.
"voice" | "female_01.wav" | Sets default voice for TTS.

modeldownload.json file:

Key | Value | Description
"base_path" | "models" | Sets the local model base path for API Local and XTTSv2 Local.
"model_path" | "xttsv2_2.0.2" | Sets the local model folder (below the base path) for API Local and XTTSv2 Local.
"files_to_download" | (see table below) | Sets the model files required to be downloaded into {base_path}/{model_path}/ and where to download them from.

File | Download URL
"LICENSE.txt" | https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/LICENSE.txt?download=true
"README.md" | https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/README.md?download=true
"config.json" | https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/config.json?download=true
"model.pth" | https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/model.pth?download=true
"vocab.json" | https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/vocab.json?download=true

JSON calls & CURL Commands

Overview

The Text-to-Speech (TTS) Generation API allows you to generate speech from text input using various configuration options. This API supports both character and narrator voices, providing flexibility for creating dynamic and engaging audio content.

TTS Generation Endpoint

Example command line

Standard TTS speech Example (standard text) generating a time-stamped file

curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=All of this is text spoken by the character. This is text not inside quotes, though that doesnt matter in the slightest" -d "text_filtering=standard" -d "character_voice_gen=female_01.wav" -d "narrator_enabled=false" -d "narrator_voice_gen=male_01.wav" -d "text_not_inside=character" -d "language=en" -d "output_file_name=myoutputfile" -d "output_file_timestamp=true" -d "autoplay=true" -d "autoplay_volume=0.8"

Narrator Example (standard text) generating a time-stamped file

curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=*This is text spoken by the narrator* \"This is text spoken by the character\". This is text not inside quotes." -d "text_filtering=standard" -d "character_voice_gen=female_01.wav" -d "narrator_enabled=true" -d "narrator_voice_gen=male_01.wav" -d "text_not_inside=character" -d "language=en" -d "output_file_name=myoutputfile" -d "output_file_timestamp=true" -d "autoplay=true" -d "autoplay_volume=0.8"

Note that if the text to be generated contains double quotes, you will need to escape them with \" (please see the narrator example).
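
The same narrator request can be sent from Python with the requests library (a sketch; requests is an assumption, and the parameters mirror the curl examples above):

    import requests

    payload = {
        "text_input": '*This is text spoken by the narrator* "This is text spoken by the character".',
        "text_filtering": "standard",
        "character_voice_gen": "female_01.wav",
        "narrator_enabled": "true",
        "narrator_voice_gen": "male_01.wav",
        "text_not_inside": "character",
        "language": "en",
        "output_file_name": "myoutputfile",
        "output_file_timestamp": "true",
        "autoplay": "true",
        "autoplay_volume": "0.8",
    }
    # curl's -d sends form-encoded data, which maps to requests' data= argument
    response = requests.post("http://127.0.0.1:7851/api/tts-generate", data=payload)
    print(response.json())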

Request Parameters

TTS Generation Response

The API returns a JSON object with the following properties:

Example JSON TTS Generation Response:

{"status": "generate-success", "output_file_path": "C:\\text-generation-webui\\extensions\\alltalk_tts\\outputs\\myoutputfile_1703149973.wav", "output_file_url": "http://127.0.0.1:7851/audio/myoutputfile_1703149973.wav"}

Switching Model

curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=API%20Local"
curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=API%20TTS"
curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=XTTSv2%20Local"

Switch between the three models respectively.

JSON return {"status": "model-success"}
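
From Python, the same switch looks like this (a sketch; requests handles the encoding of the space in the method name):

    import requests

    # Ask the server to reload with a different model/method
    response = requests.post("http://127.0.0.1:7851/api/reload",
                             params={"tts_method": "XTTSv2 Local"})
    print(response.json())  # expected: {"status": "model-success"}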

Switch DeepSpeed

curl -X POST "http://127.0.0.1:7851/api/deepspeed?new_deepspeed_value=True"

Replace True with False to disable DeepSpeed mode.

JSON return {"status": "deepspeed-success"}

Switching Low VRAM

curl -X POST "http://127.0.0.1:7851/api/lowvramsetting?new_low_vram_value=True"

Replace True with False to disable Low VRAM mode.

JSON return {"status": "lowvram-success"}

Ready Endpoint

Check if the Text-to-Speech (TTS) service is ready to accept requests.

curl -X GET "http://127.0.0.1:7851/api/ready"
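
Because generation requests keep the server busy, it can be useful to poll this endpoint before sending work. A minimal sketch (the 80-second cap mirrors the startup timeout mentioned in the Debugging section; the exact response body is not assumed):

    import time
    import requests

    def wait_until_ready(base_url="http://127.0.0.1:7851", timeout=80):
        """Poll /api/ready until the server responds, up to `timeout` seconds."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                response = requests.get(f"{base_url}/api/ready", timeout=5)
                if response.ok:
                    return response.text
            except requests.ConnectionError:
                pass  # server not up yet; retry
            time.sleep(2)
        raise TimeoutError("TTS server did not become ready in time")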

Voices List Endpoint

Retrieve a list of available voices for generating speech.

curl -X GET "http://127.0.0.1:7851/api/voices"

JSON return: {"voices": ["voice1.wav", "voice2.wav", "voice3.wav"]}

Preview Voice Endpoint

Generate a preview of a specified voice with hardcoded settings.

curl -X POST "http://127.0.0.1:7851/api/previewvoice/" -F "voice=female_01.wav"

Replace female_01.wav with the name of the voice sample you want to hear.

JSON return: {"status": "generate-success", "output_file_path": "/path/to/outputs/api_preview_voice.wav", "output_file_url": "http://127.0.0.1:7851/audio/api_preview_voice.wav"}
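
Note that this endpoint takes multipart form data (curl's -F flag) rather than the URL-encoded form used by /api/tts-generate. In Python's requests, a (None, value) tuple in files sends a plain multipart text field (a sketch):

    import requests

    # Request a preview of a voice; the (None, ...) tuple sends a multipart text field
    response = requests.post("http://127.0.0.1:7851/api/previewvoice/",
                             files={"voice": (None, "female_01.wav")})
    print(response.json())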

Back to top of page

Debugging and TTS Generation Information

Command line outputs are more verbose to assist in understanding backend processes and debugging.

It's possible during startup that you will get a warning message such as:

            [AllTalk Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 80 seconds maximum

This is normal behaviour if the subprocess is taking a while to start. However, if there is an issue starting the subprocess, you may see multiples of this message and it will time out after 80 seconds, resulting in the TTS engine not starting. If this happens, it is likely that you are not in the correct Python environment, or not in one that has a TTS engine installed, though the system will output a warning about that ahead of this message.

Typically, the command line console will output any warning or error messages. If you need to reset your default configuration, the settings are all listed above in the configuration details.

Back to top of page

Thanks & References

Coqui TTS Engine

Extension coded by

Thanks to & Text generation webUI

Thanks to

Back to top of page