AllTalk is a web interface built around the Coqui TTS voice cloning/speech generation system. To generate TTS, you can use the provided Gradio interface or interact with the server using JSON/cURL commands.
Note: When loading a new character in Text generation webUI, it may look like nothing is happening for 20-30 seconds. It is actually processing the introduction section (greeting message) of the text, and once that is complete, the character will appear. You can see the activity occurring in the console window. Refreshing the page multiple times will force the TTS engine to keep re-generating the text, so please just wait and check the console if needed.
Note: Ensure that your RP character card has asterisks around anything that is narration and double quotes around anything spoken. There is a complication ONLY with the greeting card, so making sure it uses quotes and asterisks correctly will help ensure the greeting sounds right. I aim to address this issue in a future update. In the Text-generation-webUI Parameters menu > Character tab > Greeting, make sure that anything that is the narrator is in asterisks and anything spoken is in double quotes, then hit the save (💾) button.
AllTalk Minor Bug Fixes Changelog & Known issues
If I squash any minor bugs or find any issues, I will try to apply an update ASAP. If you think something isn't working correctly or you have a problem, check the links below first.
AllTalk
- GitHub link: here
- Update instructions link: here
- Help and issues link: here
- TTS Generator link: here

Text generation webUI
- Web interface link: here
- Documentation link: here
http://127.0.0.1:7851
http://127.0.0.1:7851/ready
If you want to generate bulk quantities of TTS and have control over them, please see the AllTalk TTS Generator below.
AllTalk TTS Generator is the solution for converting large volumes of text into speech using the voice of your choice. Whether you're creating audio content or just want to hear text read aloud, the TTS Generator is equipped to handle it all efficiently.
The TTS Generator is available at this link
Once you have sent text off to be generated, either as a stream or as a wav file generation, the TTS server will remain busy until this process has completed. As such, think carefully about how much you want to send to the server.
If you are generating wav files and populating the queue, you can generate one lot of text to speech, then input your next lot of text and it will continue adding to the list.
With wav chunks you can either playback “In Browser” which is the web page you are on, or “On Server” which is through the console/terminal where AllTalk is running from. Only generation “In Browser” can play back smoothly and populate the Generated TTS List. Setting the Volume will affect the volume level played back both “In Browser” and “On Server”.
Voice samples are stored in /alltalk_tts/voices/ and should be named using the following format: name.wav
Voice outputs are stored in /alltalk_tts/outputs/
You can configure automatic maintenance deletion of old wav files by setting Del WAV's older than in the settings above.
When Disabled, your output wav files will be left untouched. When set to 1 Day or greater, output wav files older than that time period will be automatically deleted on start-up of AllTalk.
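The cleanup behaviour described above can be sketched as follows. This is an illustrative sketch only, not AllTalk's actual code; the function name and folder are assumptions:

```python
import os
import time

def delete_old_wavs(folder: str, max_age_days: int) -> list:
    """Delete .wav files older than max_age_days; return the names removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        # Only touch wav files whose last-modified time predates the cutoff
        if name.endswith(".wav") and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```

Run against an outputs folder, this removes only wav files older than the configured number of days and leaves everything else untouched.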
This extension will download the 2.0.2 model to /alltalk_tts/models/
This TTS engine will also download the latest available model and store it wherever your OS normally stores it (Windows/Linux/Mac).
To create a new voice sample, you need to make a wav file that is 22050Hz, mono, 16-bit and between 6 and 30 seconds long, though 8 to 10 seconds is usually good enough. The model can handle up to 30-second samples; however, I've not noticed any improvement in voice output from much longer clips.
You want to find a nice clear selection of audio. Let's say you wanted to clone your favourite celebrity: you might go looking for an interview where they are talking. Pay close attention to the audio you are listening to and trying to sample. Are there noises in the background, hiss on the soundtrack, a low hum, some quiet music playing? The better the quality of the audio, the better the final TTS result. Don't forget, the AI that processes the sounds can hear everything in your sample and will use it in the voice it is trying to recreate.
Try to make your clip one of nice flowing speech, like the included example files: no big pauses, gaps or other sounds. Preferably pick a sample where the person you are trying to copy shows a little vocal range and emotion in their voice. Also, try to avoid a clip starting or ending with breathy sounds (breathing in/out etc.).
So, you’ve downloaded your favourite celebrity interview off YouTube, from here you need to chop it down to 6 to 30 seconds in length and resample it.
If you need to clean it up, do audio processing, volume level changes etc., do this before down-sampling.
Using the latest version of Audacity, select/highlight your 6 to 30 second clip and: Tracks > Resample to 22050Hz, then Tracks > Mix > Stereo to Mono, then File > Export Audio, saving it as a WAV of 22050Hz.
Save your generated wav file in the /alltalk_tts/voices/ folder.
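A quick way to confirm a sample meets the 22050Hz, mono, 16-bit, 6-30 second spec is to inspect its header. A minimal sketch using Python's standard wave module (the function name is my own, not part of AllTalk):

```python
import wave

def check_voice_sample(path: str) -> dict:
    """Report whether a wav file matches the 22050Hz, mono, 16-bit spec."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "rate_ok": rate == 22050,               # 22050Hz sample rate
            "mono_ok": w.getnchannels() == 1,       # single channel
            "bits_ok": w.getsampwidth() == 2,       # 16-bit = 2 bytes per sample
            "length_ok": 6 <= frames / rate <= 30,  # 6 to 30 seconds long
        }
```

If any value comes back False, re-export the clip from Audacity with the settings above before dropping it into /alltalk_tts/voices/.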
It's worth mentioning that using AI-generated audio clips may introduce unwanted sounds, as the clip is already a copy/simulation of a voice.
If a voice is not reproducing well, you may be interested in trying finetuning of the model. Otherwise, some samples just never seem to work correctly, so maybe try a different sample. Always remember, though, this is an AI model attempting to re-create a voice, so you will never get a 100% match.
This only affects the Narrator function. Most AI models should use asterisks or double quotes to differentiate between the narrator and the character; however, many models sometimes switch between using asterisks and double quotes, or sometimes use nothing at all, for the text they output. This leaves a bit of a mess, because sometimes un-marked text is narration and sometimes it is the character talking, leaving no clear way to know where to split sentences and which voice to use. While there is no 100% solution at the moment, many models lean more one way or the other as to what that unmarked text will be (character or narrator).
As such, the "Text not inside" function at least gives you the choice of how you want the TTS engine to handle un-marked text.
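To illustrate the splitting problem, a simple splitter might tag asterisked spans as narrator, quoted spans as character, and fall back to a "text not inside" default for everything else. This is an illustrative sketch, not AllTalk's actual implementation:

```python
import re

def split_narration(text: str, text_not_inside: str = "character") -> list:
    """Split text into (speaker, content) pairs.

    *...* -> narrator, "..." -> character, anything else -> text_not_inside.
    """
    parts = []
    for match in re.finditer(r'\*([^*]+)\*|"([^"]+)"|([^*"]+)', text):
        narrator, character, unmarked = match.groups()
        if narrator:
            parts.append(("narrator", narrator.strip()))
        elif character:
            parts.append(("character", character.strip()))
        elif unmarked and unmarked.strip():
            # un-marked text: route to whichever voice the user chose
            parts.append((text_not_inside, unmarked.strip()))
    return parts
```

With text_not_inside="narrator", the same un-marked span would be routed to the narrator voice instead, which is exactly the choice the setting exposes.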

The Low VRAM option is a crucial feature designed to enhance performance under constrained VRAM conditions, as the TTS models require 2GB-3GB of VRAM to run effectively. It strategically manages the relocation of the Text-to-Speech (TTS) model between your system's Random Access Memory (RAM) and VRAM, moving it between the two on the fly. This is very useful for people who have smaller graphics cards and will use all their VRAM to load in their LLM.
If you don't have enough VRAM free after loading your LLM into VRAM (see the Normal Mode example below), your GPU has so little working space that it must swap parts of the TTS model in and out, which causes severe slowdown.
Note: An Nvidia Graphics card is required for the LowVRAM option to work, as you will just be using system RAM otherwise.
The Low VRAM mode intelligently orchestrates the relocation of the entire TTS model and stores the TTS model in your system RAM. When the TTS engine requires VRAM for processing, the entire model seamlessly moves into VRAM, causing your LLM to unload/displace some layers, ensuring optimal performance of the TTS engine.
Post-TTS processing, the model moves back to system RAM, freeing up VRAM space for your Language Model (LLM) to load back in the missing layers. This adds about 1-2 seconds to both text generation by the LLM and the TTS engine.
By transferring the entire model between RAM and VRAM, the Low VRAM option avoids fragmentation, ensuring the TTS model remains cohesive and has all the working space it needs in your GPU, without having to just work on small bits of the TTS model at a time (which causes terrible slow down).
This creates a TTS generation performance Boost for Low VRAM Users and is particularly beneficial for users with less than 2GB of free VRAM after loading their LLM, delivering a substantial 5-10x improvement in TTS generation speed.

You HAVE to disable Enable TTS within the Text-generation-webui AllTalk interface, otherwise Text-generation-webui will also generate TTS due to the way it sends out text. You can do this each time you start up Text-generation-webui or set it in the start-up settings at the top of this page.
You have 2 types of audio generation options, Streaming and Standard.
The Streaming Audio Generation method is designed for speed and is best suited for situations where you just want quick audio playback. This method, however, is limited to using just one voice per TTS generation request. This means a limitation of the Streaming method is the inability to utilize the AllTalk narrator function, making it a straightforward but less nuanced option.
On the other hand, the Standard Audio Generation method provides a richer auditory experience. It's slightly slower than the Streaming method but compensates for this with its ability to split text into multiple voices. This functionality is particularly useful in scenarios where differentiating between character dialogues and narration can enhance the storytelling and delivery. The inclusion of the AllTalk narrator functionality in the Standard method allows for a more layered and immersive experience, making it ideal for content where depth and variety in voice narration add significant value.
In summary, the choice between Streaming and Standard methods in AllTalk TTS depends on what you want. Streaming is great for quick and simple audio generation, while Standard is preferable for a more dynamic and engaging audio experience.
Changing the model, DeepSpeed, or Low VRAM each takes about 15 seconds, so you should only change one at a time and wait for Ready before changing the next setting. To set these options long term, you can apply the settings at the top of this page.
Only available with the Standard Audio Generation method.
If you have a voice that the model doesn't quite reproduce correctly, or you just want to improve the reproduced voice, then finetuning is a way to train your "XTTSv2 local" model (stored in /alltalk_tts/models/xxxxx/) on a specific voice. For this you will need:
Everything has been done to make this as simple as possible. At its simplest, you can literally just download a large chunk of audio from an interview and tell the finetuning to strip through it, find the spoken parts and build your dataset. You can literally click 4 buttons, then copy a few files, and you are done. At its more complicated end, you will clean up the audio a little beforehand, but it is still only 4 buttons and copying a few files.
I would suggest that if it is in an interview format, you cut out the interviewer speaking in Audacity or your chosen audio editing package. You don't have to worry about being perfect with your cuts; finetuning Step 1 will go and find spoken audio and cut it out for you. If there is music over the spoken parts, for best quality you would cut those parts out, though it is not 100% necessary. As always, try to avoid bad quality audio with noises in it (humming sounds, hiss etc.). You can try something like Audioenhancer to try to clean up noisier audio. There is no need to down-sample any of the audio; all of that is handled for you. Just give the finetuning some good quality audio to work with.
As mentioned, you must have a small portion of the Nvidia CUDA Toolkit 11.8 installed. Not higher or lower versions; specifically 11.8. You do not have to uninstall any other versions, change any graphics drivers, reinstall torch or anything like that. There are instructions within the finetuning interface on doing this, or you can also find them on this link here
Ensure you have followed the instructions on setting up the Nvidia CUDA Toolkit 11.8 here or the below procedure will fail.
The below instructions are also available online here
Close all other applications that are using your GPU/VRAM and copy your audio samples into:
/alltalk_tts/finetune/put-voice-samples-in-here/
In a command prompt/terminal window you need to move into your Text generation webUI folder:
cd text-generation-webui
Start the Text generation webUI Python environment for your OS:
cmd_windows.bat, ./cmd_linux.sh, cmd_macos.sh or cmd_wsl.bat
You can double check your search path environment still works correctly with nvcc --version. It should report back 11.8:
Cuda compilation tools, release 11.8.
Move into your extensions folder:
cd extensions
Move into the alltalk_tts folder:
cd alltalk_tts
Install the finetune requirements file: pip install -r requirements_finetune.txt
Type python finetune.py and it should start up.
Follow the on-screen instructions when the web interface starts up.
When you have finished finetuning, the final tab will tell you what to do with your files and how to move your newly trained model to the correct location on disk.
DeepSpeed provides a 2x-3x speed boost for Text-to-Speech and AI tasks. It's all about making AI and TTS happen faster and more efficiently.

DeepSpeed only works with the XTTSv2 Local model and will deactivate when other models are selected, even if the checkbox still shows as being selected.
Note: DeepSpeed/AllTalk may warn you if the Nvidia CUDA Toolkit and the CUDA_HOME environment variable aren't set correctly. On Linux you need CUDA_HOME configured correctly; on Windows, if you use the pre-built wheel, it is fine without.
Note: You do not need to set Text-generation-webui's --deepspeed setting for AllTalk to be able to use DeepSpeed.
➡️DeepSpeed requires an Nvidia Graphics card!⬅️
The atsetup utility for Windows can install DeepSpeed for you (Standalone users who installed via the atsetup utility will already have DeepSpeed installed).
DeepSpeed v11.2 will work in the current default text-generation-webui Python 3.11 environment! You have two options for how to set up DeepSpeed on Windows: a quick way (Option 1) and a long way (Option 2).
Thanks to @S95Sedan, who managed to get DeepSpeed 11.2 working on Windows by making some edits to the original Microsoft DeepSpeed v11.2 installation. The original post is here.
OPTION 1 - Pre-Compiled Wheel Deepspeed v11.2 (Python 3.11 and 3.10)
It's worth noting that all models and methods can and do sound different from one another. Many people complained about the quality of audio produced by the 2.0.3 model, so this extension will download the 2.0.2 model to your models folder and give you the choice to use 2.0.2 (API Local and XTTSv2 Local) or use the most current model 2.0.3 (API TTS). As/When a new model is released by Coqui it will be downloaded by the TTS service on startup and stored wherever the TTS service keeps new models for your operating system.
It is recommended not to modify these settings unless you fully comprehend their effects. A general overview is provided below for reference.
Changes to these settings won't take effect until you restart AllTalk/Text generation webUI.
These settings only affect API Local and XTTSv2 Local methods.
In the context of text-to-speech (TTS), the Repetition Penalty influences how the model handles the repetition of sounds, phonemes, or intonation patterns. Here's how it works:
The factory setting for repetition penalty is 10.0
Temperature influences the randomness of the generated speech. Here's how it affects the output:
The factory setting for temperature is 0.70
Factory settings should be fine for most people, however if you choose to experiment, setting extremely high or low values, especially without a good understanding of their effects, may lead to flat-sounding output or very strange-sounding output. It's advisable to experiment with adjustments incrementally and observe the impact on the generated speech to find a balance that suits your desired outcome.
AllTalk performs a variety of checks on startup and will output warning messages to the console should you need to do something, such as update your TTS version.
A basic environment check to ensure everything should work e.g. is the model already downloaded, are the configuration files set correctly etc.
AllTalk will download the Xtts model (version 2.0.2) into your models folder. Many people didn't like the quality of the 2.0.3 model, however the latest model will be accessible on the API TTS setting (2.0.3 at the time of writing) so you have the best of both worlds.
It is possible to set a custom model for the API Local and XTTSv2 Local methods, or indeed to point them at the same model that API TTS uses (wherever it is stored on your OS of choice).
Many people did not like the sound quality of the Coqui 2.0.3 model, and as such AllTalk downloads the 2.0.2 model separately from the 2.0.3 model that the TTS service downloads and manages.
Typically the 2.0.2 model is stored in your /alltalk_tts/models folder and it is downloaded on first start-up of the AllTalk_tts extension. However, you may either want to use a custom model version of your choosing, or re-point AllTalk to a different path on your system, or even point it so that API Local and XTTSv2 Local both use the same model that API TTS is using.
If you do choose to change the location there are a couple of things to note.
To change the model path, there are at minimum 2x settings you need to alter in the modeldownload.json file, base_path and model_path.
You would edit the settings in the modeldownload.json file as follows (make a backup of your current file in case):
Windows path example: c:\\mystuff\\mydownloads\\myTTSmodel\\{files in here}
base_path would be "c:\\mystuff\\mydownloads"
model_path would be: "myTTSmodel"
Note: On Windows systems, you have to specify a double backslash \\ for each folder level in the path (as above)
Linux path example: /home/myaccount/myTTSmodel/{files in here}
base_path would be "/home/myaccount"
model_path would be: "myTTSmodel"
Once you restart AllTalk, it will check this path for the files and output any details at the console.
When you are happy it's working correctly, you are welcome to delete the models folder stored at /alltalk_tts/models.
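The resulting search path is simply base_path joined with model_path. A quick sketch of how the two settings combine (the helper name is my own; the values come from the examples above):

```python
import json
import os.path

def resolve_model_dir(config_text: str) -> str:
    """Read base_path and model_path from modeldownload.json content and join them."""
    cfg = json.loads(config_text)
    return os.path.join(cfg["base_path"], cfg["model_path"])

# The default settings would resolve to models/xttsv2_2.0.2:
default_config = '{"base_path": "models", "model_path": "xttsv2_2.0.2"}'
```

Whatever you set, AllTalk checks base_path/model_path for the model files on startup, so both values must be updated together.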
If you wish to change the files that the modeldownloader pulls at startup, you can further edit modeldownload.json and change the https addresses within this file's files_to_download section, e.g.
"files_to_download": {
"LICENSE.txt": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/LICENSE.txt?download=true",
"README.md": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/README.md?download=true",
"config.json": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/config.json?download=true",
"model.pth": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/model.pth?download=true",
"vocab.json": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/vocab.json?download=true"
}
confignew.json file:
| Key | Default Value | Explanation |
|---|---|---|
| "activate" | true | Sets activation state within Text-generation-webUI. |
| "autoplay" | true | Sets autoplay within Text generation webUI. |
| "branding" | "AllTalk" | Used to change the default name. Requires a space e.g. "Mybrand ". |
| "deepspeed_activate" | false | Sets DeepSpeed activation on startup. |
| "delete_output_wavs" | "Disabled" | Sets duration of outputs to delete. |
| "ip_address" | "127.0.0.1" | Sets default IP address. |
| "language" | "English" | Sets default language for Text-generation-webUI TTS. |
| "low_vram" | false | Sets default setting for LowVRAM mode. |
| "local_temperature" | "0.70" | Sets default model temp for API Local and XTTSv2 Local. |
| "local_repetition_penalty" | "10.0" | Sets default model repetition for API Local and XTTSv2 Local. |
| "tts_model_loaded" | true | AllTalk internal use only. Do not change. |
| "tts_model_name" | "tts_models/multilingual/multi-dataset/xtts_v2" | Sets default model that API TTS is looking for through the TTS service (separate to API Local and XTTSv2 Local). |
| "narrator_enabled" | true | Sets default narrator on/off in Text-generation-webUI TTS. |
| "narrator_voice" | "female_02.wav" | Sets default wav to use for narrator in Text-generation-webUI TTS. |
| "port_number" | "7851" | Sets default port number for AllTalk. |
| "output_folder_wav" | "extensions/alltalk_tts/outputs/" | Sets default output path Text-generation-webUI should use for finding outputs. |
| "output_folder_wav_standalone" | "outputs/" | Sets default output path in standalone mode. |
| "remove_trailing_dots" | false | Sets trailing dot removal pre-generating TTS. |
| "show_text" | true | Sets if text should be displayed below audio in Text-generation-webUI. |
| "tts_method_api_local" | false | Sets API Local as the default model/method for TTS. |
| "tts_method_api_tts" | false | Sets API TTS as the default model/method for TTS. |
| "tts_method_xtts_local" | true | Sets XTTSv2 Local as the default model/method for TTS. |
| "voice" | "female_01.wav" | Sets default voice for TTS. |
modeldownload.json file:
| Key | Value | Description |
|---|---|---|
| "base_path" | "models" | Sets local model base path for API Local and XTTSv2 Local. |
| "model_path" | "xttsv2_2.0.2" | Sets local model folder for API Local and XTTSv2 Local below the base path. |
| "files_to_download" | (see JSON above) | Sets the model files required to be downloaded into \base_path\model_path\ and where to download them from. |
The Text-to-Speech (TTS) Generation API allows you to generate speech from text input using various configuration options. This API supports both character and narrator voices, providing flexibility for creating dynamic and engaging audio content.
Standard TTS speech Example (standard text) generating a time-stamped file
curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=All of this is text spoken by the character. This is text not inside quotes, though that doesnt matter in the slightest" -d "text_filtering=standard" -d "character_voice_gen=female_01.wav" -d "narrator_enabled=false" -d "narrator_voice_gen=male_01.wav" -d "text_not_inside=character" -d "language=en" -d "output_file_name=myoutputfile" -d "output_file_timestamp=true" -d "autoplay=true" -d "autoplay_volume=0.8"
Narrator Example (standard text) generating a time-stamped file
curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=*This is text spoken by the narrator* \"This is text spoken by the character\". This is text not inside quotes." -d "text_filtering=standard" -d "character_voice_gen=female_01.wav" -d "narrator_enabled=true" -d "narrator_voice_gen=male_01.wav" -d "text_not_inside=character" -d "language=en" -d "output_file_name=myoutputfile" -d "output_file_timestamp=true" -d "autoplay=true" -d "autoplay_volume=0.8"
Note that if the text to be generated contains double quotes, you will need to escape them with \" (please see the narrator example).
text_input: The text you want the TTS engine to produce. Use escaped double quotes for character speech and asterisks for narrator speech if using the narrator function. Example:
-d "text_input=*This is text spoken by the narrator* \"This is text spoken by the character\". This is text not inside quotes."
text_filtering: Filter for text. Options:
-d "text_filtering=none"
-d "text_filtering=standard"
-d "text_filtering=html"
character_voice_gen: The WAV file name for the character's voice.
-d "character_voice_gen=female_01.wav"
narrator_enabled: Enable or disable the narrator function. If true, minimum text filtering is set to standard. Anything between double quotes is considered the character's speech, and anything between asterisks is considered the narrator's speech.
-d "narrator_enabled=true"
-d "narrator_enabled=false"
narrator_voice_gen: The WAV file name for the narrator's voice.
-d "narrator_voice_gen=male_01.wav"
text_not_inside: Specify the handling of lines not inside double quotes or asterisks, for the narrator feature. Options:
-d "text_not_inside=character"
-d "text_not_inside=narrator"
language: Choose the language for TTS. Options:
-d "language=en"
output_file_name: The name of the output file (excluding the .wav extension).
-d "output_file_name=myoutputfile"
output_file_timestamp: Add a timestamp to the output file name. If true, each file will have a unique timestamp; otherwise, the same file name will be overwritten each time you generate TTS.
-d "output_file_timestamp=true"
-d "output_file_timestamp=false"
autoplay: Enable or disable autoplay. Still needs to be specified in the JSON request.
-d "autoplay=true"
-d "autoplay=false"
autoplay_volume: Set the autoplay volume. Should be between 0.1 and 1.0. Still needs to be specified in the JSON request.
-d "autoplay_volume=0.8"
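The same request can be assembled programmatically. A minimal sketch using only the Python standard library, url-encoding the same form fields the curl examples send (the helper name and default values are taken from the standard example above; actually sending it requires a running AllTalk server):

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_tts_request(text_input: str, **options) -> Request:
    """Build a POST request matching the curl examples for /api/tts-generate."""
    fields = {
        "text_input": text_input,
        "text_filtering": "standard",
        "character_voice_gen": "female_01.wav",
        "narrator_enabled": "false",
        "narrator_voice_gen": "male_01.wav",
        "text_not_inside": "character",
        "language": "en",
        "output_file_name": "myoutputfile",
        "output_file_timestamp": "true",
        "autoplay": "true",
        "autoplay_volume": "0.8",
    }
    fields.update(options)  # override any default field per call
    data = urlencode(fields).encode("utf-8")
    return Request("http://127.0.0.1:7851/api/tts-generate", data=data, method="POST")
```

Passing the request to urllib.request.urlopen() would then send it; note that urlencode handles the quote-escaping that the curl examples do by hand.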
The API returns a JSON object with the following properties:
Example JSON TTS Generation Response:
{"status": "generate-success", "output_file_path": "C:\\text-generation-webui\\extensions\\alltalk_tts\\outputs\\myoutputfile_1703149973.wav", "output_file_url": "http://127.0.0.1:7851/audio/myoutputfile_1703149973.wav"}
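The response is plain JSON, so it can be parsed directly. A small sketch using the example response above:

```python
import json

# The example response body from the documentation above
response_text = (
    '{"status": "generate-success", '
    '"output_file_path": "C:\\\\text-generation-webui\\\\extensions\\\\alltalk_tts'
    '\\\\outputs\\\\myoutputfile_1703149973.wav", '
    '"output_file_url": "http://127.0.0.1:7851/audio/myoutputfile_1703149973.wav"}'
)

result = json.loads(response_text)
# A successful generation reports "generate-success" plus where the wav was saved
status = result["status"]
wav_url = result["output_file_url"]
```

output_file_path is where the wav landed on disk, while output_file_url can be fetched over HTTP from the AllTalk server.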
curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=API%20Local"
curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=API%20TTS"
curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=XTTSv2%20Local"
Switch between the 3 models respectively.
JSON return {"status": "model-success"}
curl -X POST "http://127.0.0.1:7851/api/deepspeed?new_deepspeed_value=True"
Replace True with False to disable DeepSpeed mode.
JSON return {"status": "deepspeed-success"}
curl -X POST "http://127.0.0.1:7851/api/lowvramsetting?new_low_vram_value=True"
Replace True with False to disable Low VRAM mode.
JSON return {"status": "lowvram-success"}
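The %20 in the reload URLs is just a URL-encoded space, so the three control endpoints can be built generically. A minimal sketch (the helper names are my own, not part of AllTalk's API):

```python
from urllib.parse import quote

BASE = "http://127.0.0.1:7851"

def reload_url(tts_method: str) -> str:
    """Build the /api/reload URL for one of the three model/method names."""
    return f"{BASE}/api/reload?tts_method={quote(tts_method)}"

def deepspeed_url(enabled: bool) -> str:
    """Build the /api/deepspeed URL; True enables DeepSpeed mode."""
    return f"{BASE}/api/deepspeed?new_deepspeed_value={enabled}"

def lowvram_url(enabled: bool) -> str:
    """Build the /api/lowvramsetting URL; True enables Low VRAM mode."""
    return f"{BASE}/api/lowvramsetting?new_low_vram_value={enabled}"
```

Each URL would be sent as a POST, matching the curl examples above.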
Check if the Text-to-Speech (TTS) service is ready to accept requests.
curl -X GET "http://127.0.0.1:7851/api/ready"
Retrieve a list of available voices for generating speech.
curl -X GET "http://127.0.0.1:7851/api/voices"
JSON return: {"voices": ["voice1.wav", "voice2.wav", "voice3.wav"]}
Generate a preview of a specified voice with hardcoded settings.
curl -X POST "http://127.0.0.1:7851/api/previewvoice/" -F "voice=female_01.wav"
Replace female_01.wav with the name of the voice sample you want to hear.
JSON return: {"status": "generate-success", "output_file_path": "/path/to/outputs/api_preview_voice.wav", "output_file_url": "http://127.0.0.1:7851/audio/api_preview_voice.wav"}
Command line outputs are more verbose to assist in understanding backend processes and debugging.
It is possible during startup that you get a warning message such as:

[AllTalk Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 80 seconds maximum

This is normal behaviour if the subprocess is taking a while to start. However, if there is an issue starting the subprocess, you may see multiples of this message and it will time out after 80 seconds, resulting in the TTS engine not starting. It is likely that you are not in the correct Python environment, or not in one that has a TTS engine inside it; if that is the case, the system will output a warning about it ahead of this message.
Typically, the command line console will output any warning or error messages. If you need to reset your default configuration, the settings are all listed above in the configuration details.