AllTalk is a web interface built around the Coqui TTS voice cloning/speech generation system. To generate TTS, you can use the provided Gradio interface or interact with the server using JSON/cURL commands.
Note: When loading a new character in Text generation webUI, it may look like nothing is happening for 20-30 seconds. It is actually processing the introduction section (greeting message) of the text, and once that is complete, the character will appear. You can see the activity occurring in the console window. Refreshing the page multiple times will force the TTS engine to keep re-generating the text, so please just wait and check the console if needed.
Note: Ensure that your RP character card has asterisks around anything that is narration and double quotes around anything spoken. There is a complication ONLY with the greeting card, so making sure it uses quotes and asterisks correctly will help ensure the greeting sounds right. I aim to address this issue in a future update. In the Text-generation-webUI Parameters menu > Character tab > Greeting, make sure that anything that is the narrator is in asterisks and anything spoken is in double quotes, then hit the save (💾) button.
AllTalk Minor Bug Fixes Changelog & Known issues
If I squash any minor bugs or find any issues, I will try to apply an update ASAP. If you think something isn't working correctly or you have a problem, check the links below first.
AllTalk
- GitHub link: here
- Update instructions link: here
- Help and issues link: here
- TTS Generator link: here

Text generation webUI
- Web interface link: here
- Documentation link: here
http://127.0.0.1:7851
http://127.0.0.1:7851/ready
If you want to generate bulk quantities of TTS and have control over them, please see the AllTalk TTS Generator below.
AllTalk TTS Generator is the solution for converting large volumes of text into speech using the voice of your choice. Whether you're creating audio content or just want to hear text read aloud, the TTS Generator is equipped to handle it all efficiently.
The TTS Generator is available at this link
Once you have sent text off to be generated, either as a stream or as a wav file generation, the TTS server will remain busy until this process has completed. As such, think carefully about how much you want to send to the server.
If you are generating wav files and populating the queue, you can generate one lot of text to speech, then input your next lot of text and it will continue adding to the list.
With wav chunks you can either playback “In Browser” which is the web page you are on, or “On Server” which is through the console/terminal where AllTalk is running from. Only generation “In Browser” can play back smoothly and populate the Generated TTS List. Setting the Volume will affect the volume level played back both “In Browser” and “On Server”.
Voice samples are stored in /alltalk_tts/voices/ and should be named using the following format: name.wav
Voice outputs are stored in /alltalk_tts/outputs/
You can configure automatic maintenance deletion of old wav files by setting Del WAV's older than in the settings above.
When Disabled, your output wav files will be left untouched. When set to 1 Day or greater, output wav files older than that time period will be automatically deleted on start-up of AllTalk.
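The cleanup behaviour described above can be sketched as follows. This is an illustrative sketch only, not AllTalk's actual code; the function name and folder are assumptions:

```python
import os
import time

def delete_old_wavs(folder: str, max_age_days: int) -> list:
    """Delete .wav files older than max_age_days; return the names removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        # Only touch wav files whose last-modified time predates the cutoff
        if name.endswith(".wav") and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```

Run against an outputs folder, this removes only wav files older than the configured number of days and leaves everything else untouched.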
This extension will download the 2.0.2 model to /alltalk_tts/models/
This TTS engine will also download the latest available model and store it wherever your OS normally stores it (Windows/Linux/Mac).
To create a new voice sample, you need to make a wav file that is 22050Hz, mono, 16-bit and between 6 and 30 seconds long, though 8 to 10 seconds is usually good enough. The model can handle up to 30-second samples; however, I've not noticed any improvement in voice output from much longer clips.
You want to find a nice clear selection of audio. Let's say you wanted to clone your favourite celebrity: you might go looking for an interview where they are talking. Pay close attention to the audio you are listening to and trying to sample. Are there noises in the background, hiss on the soundtrack, a low hum, some quiet music playing? The better the quality of the audio, the better the final TTS result. Don't forget, the AI that processes the sounds can hear everything in your sample and will use it in the voice it is trying to recreate.
Try to make your clip one of nice flowing speech, like the included example files: no big pauses, gaps or other sounds. Preferably pick a sample where the person you are trying to copy shows a little vocal range and emotion in their voice. Also, try to avoid a clip starting or ending with breathy sounds (breathing in/out etc.).
So, you’ve downloaded your favourite celebrity interview off YouTube, from here you need to chop it down to 6 to 30 seconds in length and resample it.
If you need to clean it up, do audio processing, volume level changes etc., do this before down-sampling.
Using the latest version of Audacity, select/highlight your 6 to 30 second clip and: Tracks > Resample to 22050Hz, then Tracks > Mix > Stereo to Mono, then File > Export Audio, saving it as a WAV of 22050Hz.
Save your generated wav file in the /alltalk_tts/voices/ folder.
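A quick way to confirm a sample meets the 22050Hz, mono, 16-bit, 6-30 second spec is to inspect its header. A minimal sketch using Python's standard wave module (the function name is my own, not part of AllTalk):

```python
import wave

def check_voice_sample(path: str) -> dict:
    """Report whether a wav file matches the 22050Hz, mono, 16-bit spec."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "rate_ok": rate == 22050,               # 22050Hz sample rate
            "mono_ok": w.getnchannels() == 1,       # single channel
            "bits_ok": w.getsampwidth() == 2,       # 16-bit = 2 bytes per sample
            "length_ok": 6 <= frames / rate <= 30,  # 6 to 30 seconds long
        }
```

If any value comes back False, re-export the clip from Audacity with the settings above before dropping it into /alltalk_tts/voices/.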
It's worth mentioning that using AI-generated audio clips may introduce unwanted sounds, as the clip is already a copy/simulation of a voice.
If a voice is not reproducing well, you may be interested in trying finetuning of the model. Otherwise, some samples just never seem to work correctly, so maybe try a different sample. Always remember, though, this is an AI model attempting to re-create a voice, so you will never get a 100% match.
This only affects the Narrator function. Most AI models should use asterisks or double quotes to differentiate between the narrator and the character; however, many models sometimes switch between using asterisks and double quotes, or sometimes use nothing at all, for the text they output. This leaves a bit of a mess, because sometimes un-marked text is narration and sometimes it is the character talking, leaving no clear way to know where to split sentences and which voice to use. While there is no 100% solution at the moment, many models lean more one way or the other as to what that unmarked text will be (character or narrator).
As such, the "Text not inside" function at least gives you the choice of how you want the TTS engine to handle un-marked text.
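To illustrate the splitting problem, a simple splitter might tag asterisked spans as narrator, quoted spans as character, and fall back to a "text not inside" default for everything else. This is an illustrative sketch, not AllTalk's actual implementation:

```python
import re

def split_narration(text: str, text_not_inside: str = "character") -> list:
    """Split text into (speaker, content) pairs.

    *...* -> narrator, "..." -> character, anything else -> text_not_inside.
    """
    parts = []
    for match in re.finditer(r'\*([^*]+)\*|"([^"]+)"|([^*"]+)', text):
        narrator, character, unmarked = match.groups()
        if narrator:
            parts.append(("narrator", narrator.strip()))
        elif character:
            parts.append(("character", character.strip()))
        elif unmarked and unmarked.strip():
            # un-marked text: route to whichever voice the user chose
            parts.append((text_not_inside, unmarked.strip()))
    return parts
```

With text_not_inside="narrator", the same un-marked span would be routed to the narrator voice instead, which is exactly the choice the setting exposes.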

The Low VRAM option is a crucial feature designed to enhance performance under constrained VRAM conditions, as the TTS models require 2GB-3GB of VRAM to run effectively. It strategically manages the relocation of the Text-to-Speech (TTS) model between your system's Random Access Memory (RAM) and VRAM, moving it between the two on the fly. This is very useful for people who have smaller graphics cards and will use all their VRAM to load in their LLM.
If you don't have enough VRAM free after loading your LLM into VRAM (see the Normal Mode example below), your GPU has so little working space that it must swap parts of the TTS model in and out, which causes severe slowdown.
Note: An Nvidia Graphics card is required for the LowVRAM option to work, as you will just be using system RAM otherwise.
The Low VRAM mode intelligently orchestrates the relocation of the entire TTS model and stores the TTS model in your system RAM. When the TTS engine requires VRAM for processing, the entire model seamlessly moves into VRAM, causing your LLM to unload/displace some layers, ensuring optimal performance of the TTS engine.
Post-TTS processing, the model moves back to system RAM, freeing up VRAM space for your Language Model (LLM) to load back in the missing layers. This adds about 1-2 seconds to both text generation by the LLM and the TTS engine.
By transferring the entire model between RAM and VRAM, the Low VRAM option avoids fragmentation, ensuring the TTS model remains cohesive and has all the working space it needs in your GPU, without having to just work on small bits of the TTS model at a time (which causes terrible slow down).
This creates a TTS generation performance Boost for Low VRAM Users and is particularly beneficial for users with less than 2GB of free VRAM after loading their LLM, delivering a substantial 5-10x improvement in TTS generation speed.

You HAVE to disable Enable TTS within the Text-generation-webui AllTalk interface, otherwise Text-generation-webui will also generate TTS due to the way it sends out text. You can do this each time you start up Text-generation-webui or set it in the start-up settings at the top of this page.
You have 2 types of audio generation options, Streaming and Standard.
The Streaming Audio Generation method is designed for speed and is best suited for situations where you just want quick audio playback. This method, however, is limited to using just one voice per TTS generation request. This means a limitation of the Streaming method is the inability to utilize the AllTalk narrator function, making it a straightforward but less nuanced option.
On the other hand, the Standard Audio Generation method provides a richer auditory experience. It's slightly slower than the Streaming method but compensates for this with its ability to split text into multiple voices. This functionality is particularly useful in scenarios where differentiating between character dialogues and narration can enhance the storytelling and delivery. The inclusion of the AllTalk narrator functionality in the Standard method allows for a more layered and immersive experience, making it ideal for content where depth and variety in voice narration add significant value.
In summary, the choice between Streaming and Standard methods in AllTalk TTS depends on what you want. Streaming is great for quick and simple audio generation, while Standard is preferable for a more dynamic and engaging audio experience.
Changing the model, DeepSpeed, or Low VRAM each takes about 15 seconds, so you should only change one at a time and wait for Ready before changing the next setting. To set these options long term, you can apply the settings at the top of this page.
Only available with the Standard Audio Generation method.
If you have a voice that the model doesn't quite reproduce correctly, or you just want to improve the reproduced voice, then finetuning is a way to train your "XTTSv2 local" model (stored in /alltalk_tts/models/xxxxx/) on a specific voice. For this you will need:
Everything has been done to make this as simple as possible. At its simplest, you can literally just download a large chunk of audio from an interview and tell the finetuning to strip through it, find the spoken parts and build your dataset. You can literally click 4 buttons, then copy a few files, and you are done. At its more complicated end, you will clean up the audio a little beforehand, but it is still only 4 buttons and copying a few files.
I would suggest that if it is in an interview format, you cut out the interviewer speaking in Audacity or your chosen audio editing package. You don't have to worry about being perfect with your cuts; finetuning Step 1 will go and find spoken audio and cut it out for you. If there is music over the spoken parts, for best quality you would cut those parts out, though it is not 100% necessary. As always, try to avoid bad quality audio with noises in it (humming sounds, hiss etc.). You can try something like Audioenhancer to try to clean up noisier audio. There is no need to down-sample any of the audio; all of that is handled for you. Just give the finetuning some good quality audio to work with.
As mentioned, you must have a small portion of the Nvidia CUDA Toolkit 11.8 installed. Not higher or lower versions; specifically 11.8. You do not have to uninstall any other versions, change any graphics drivers, reinstall torch or anything like that. There are instructions within the finetuning interface on doing this, or you can also find them on this link here
Ensure you have followed the instructions on setting up the Nvidia CUDA Toolkit 11.8 here or the below procedure will fail.
The below instructions are also available online here
Close all other applications that are using your GPU/VRAM and copy your audio samples into:
/alltalk_tts/finetune/put-voice-samples-in-here/
In a command prompt/terminal window you need to move into your Text generation webUI folder:
cd text-generation-webui
Start the Text generation webUI Python environment for your OS:
cmd_windows.bat, ./cmd_linux.sh, cmd_macos.sh or cmd_wsl.bat
You can double check your search path environment still works correctly with nvcc --version. It should report back 11.8:
Cuda compilation tools, release 11.8.
Move into your extensions folder:
cd extensions
Move into the alltalk_tts folder:
cd alltalk_tts
Install the finetune requirements file: pip install -r requirements_finetune.txt
Type python finetune.py and it should start up.
Follow the on-screen instructions when the web interface starts up.
When you have finished finetuning, the final tab will tell you what to do with your files and how to move your newly trained model to the correct location on disk.
DeepSpeed provides a 2x-3x speed boost for Text-to-Speech and AI tasks. It's all about making AI and TTS happen faster and more efficiently.

DeepSpeed only works with the XTTSv2 Local model and will deactivate when other models are selected, even if the checkbox still shows as being selected.
Note: DeepSpeed/AllTalk may warn you if the Nvidia CUDA Toolkit and the CUDA_HOME environment variable aren't set correctly. On Linux you need CUDA_HOME configured correctly; on Windows, if you use the pre-built wheel, it is fine without.
Note: You do not need to set Text-generation-webui's --deepspeed setting for AllTalk to be able to use DeepSpeed.
➡️DeepSpeed requires an Nvidia Graphics card!⬅️
The atsetup utility for Windows can install DeepSpeed for you (Standalone users who installed via the atsetup utility will already have DeepSpeed installed).
DeepSpeed v11.2 will work in the current default text-generation-webui Python 3.11 environment! You have two options for how to set up DeepSpeed on Windows: a quick way (Option 1) and a long way (Option 2).
Thanks to @S95Sedan, who managed to get DeepSpeed 11.2 working on Windows by making some edits to the original Microsoft DeepSpeed v11.2 installation. The original post is here.
OPTION 1 - Pre-Compiled Wheel Deepspeed v11.2 (Python 3.11 and 3.10)
It's worth noting that all models and methods can and do sound different from one another. Many people complained about the quality of audio produced by the 2.0.3 model, so this extension will download the 2.0.2 model to your models folder and give you the choice to use 2.0.2 (API Local and XTTSv2 Local) or use the most current model 2.0.3 (API TTS). As/When a new model is released by Coqui it will be downloaded by the TTS service on startup and stored wherever the TTS service keeps new models for your operating system.
It is recommended not to modify these settings unless you fully comprehend their effects. A general overview is provided below for reference.
Changes to these settings won't take effect until you restart AllTalk/Text generation webUI.
These settings only affect API Local and XTTSv2 Local methods.
In the context of text-to-speech (TTS), the Repetition Penalty influences how the model handles the repetition of sounds, phonemes, or intonation patterns. Here's how it works:
The factory setting for repetition penalty is 10.0
Temperature influences the randomness of the generated speech. Here's how it affects the output:
The factory setting for temperature is 0.70
Factory settings should be fine for most people, however if you choose to experiment, setting extremely high or low values, especially without a good understanding of their effects, may lead to flat-sounding output or very strange-sounding output. It's advisable to experiment with adjustments incrementally and observe the impact on the generated speech to find a balance that suits your desired outcome.
AllTalk performs a variety of checks on startup and will output warning messages to the console should you need to do something, such as update your TTS version.
A basic environment check to ensure everything should work e.g. is the model already downloaded, are the configuration files set correctly etc.
AllTalk will download the Xtts model (version 2.0.2) into your models folder. Many people didn't like the quality of the 2.0.3 model, however the latest model will be accessible on the API TTS setting (2.0.3 at the time of writing) so you have the best of both worlds.
It is possible to set a custom model for the API Local and XTTSv2 Local methods, or indeed to point them at the same model that API TTS uses (wherever it is stored on your OS of choice).
Many people did not like the sound quality of the Coqui 2.0.3 model, and as such AllTalk downloads the 2.0.2 model separately from the 2.0.3 model that the TTS service downloads and manages.
Typically the 2.0.2 model is stored in your /alltalk_tts/models folder and it is downloaded on first start-up of the AllTalk_tts extension. However, you may either want to use a custom model version of your choosing, or re-point AllTalk to a different path on your system, or even point it so that API Local and XTTSv2 Local both use the same model that API TTS is using.
If you do choose to change the location there are a couple of things to note.
To change the model path, there are at minimum 2x settings you need to alter in the modeldownload.json file, base_path and model_path.
You would edit the settings in the modeldownload.json file as follows (make a backup of your current file in case):
Windows path example: c:\\mystuff\\mydownloads\\myTTSmodel\\{files in here}
base_path would be "c:\\mystuff\\mydownloads"
model_path would be: "myTTSmodel"
Note: On Windows systems, you have to specify a double backslash \\ for each folder level in the path (as above)
Linux path example: /home/myaccount/myTTSmodel/{files in here}
base_path would be "/home/myaccount"
model_path would be: "myTTSmodel"
Once you restart AllTalk, it will check this path for the files and output any details at the console.
When you are happy it's working correctly, you are welcome to delete the models folder stored at /alltalk_tts/models.
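The resulting search path is simply base_path joined with model_path. A quick sketch of how the two settings combine (the helper name is my own; the values come from the examples above):

```python
import json
import os.path

def resolve_model_dir(config_text: str) -> str:
    """Read base_path and model_path from modeldownload.json content and join them."""
    cfg = json.loads(config_text)
    return os.path.join(cfg["base_path"], cfg["model_path"])

# The default settings would resolve to models/xttsv2_2.0.2:
default_config = '{"base_path": "models", "model_path": "xttsv2_2.0.2"}'
```

Whatever you set, AllTalk checks base_path/model_path for the model files on startup, so both values must be updated together.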
If you wish to change the files that the modeldownloader pulls at startup, you can further edit modeldownload.json and change the https addresses within this file's files_to_download section, e.g.
"files_to_download": {
"LICENSE.txt": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/LICENSE.txt?download=true",
"README.md": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/README.md?download=true",
"config.json": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/config.json?download=true",
"model.pth": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/model.pth?download=true",
"vocab.json": "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/vocab.json?download=true"
}
confignew.json file:
| Key | Default Value | Explanation |
|---|---|---|
| "activate" | true | Sets activation state within Text-generation-webUI. |
| "autoplay" | true | Sets autoplay within Text generation webUI. |
| "branding" | "AllTalk" | Used to change the default name. Requires a space e.g. "Mybrand ". |
| "deepspeed_activate" | false | Sets DeepSpeed activation on startup. |
| "delete_output_wavs" | "Disabled" | Sets duration of outputs to delete. |
| "ip_address" | "127.0.0.1" | Sets default IP address. |
| "language" | "English" | Sets default language for Text-generation-webUI TTS. |
| "low_vram" | false | Sets default setting for LowVRAM mode. |
| "local_temperature" | "0.70" | Sets default model temp for API Local and XTTSv2 Local. |
| "local_repetition_penalty" | "10.0" | Sets default model repetition for API Local and XTTSv2 Local. |
| "tts_model_loaded" | true | AllTalk internal use only. Do not change. |
| "tts_model_name" | "tts_models/multilingual/multi-dataset/xtts_v2" | Sets default model that API TTS is looking for through the TTS service (separate to API Local and XTTSv2 Local). |
| "narrator_enabled" | true | Sets default narrator on/off in Text-generation-webUI TTS. |
| "narrator_voice" | "female_02.wav" | Sets default wav to use for narrator in Text-generation-webUI TTS. |
| "port_number" | "7851" | Sets default port number for AllTalk. |
| "output_folder_wav" | "extensions/alltalk_tts/outputs/" | Sets default output path Text-generation-webUI should use for finding outputs. |
| "output_folder_wav_standalone" | "outputs/" | Sets default output path in standalone mode. |
| "remove_trailing_dots" | false | Sets trailing dot removal pre-generating TTS. |
| "show_text" | true | Sets if text should be displayed below audio in Text-generation-webUI. |
| "tts_method_api_local" | false | Sets API Local as the default model/method for TTS. |
| "tts_method_api_tts" | false | Sets API TTS as the default model/method for TTS. |
| "tts_method_xtts_local" | true | Sets XTTSv2 Local as the default model/method for TTS. |
| "voice" | "female_01.wav" | Sets default voice for TTS. |
modeldownload.json file:
| Key | Value | Description |
|---|---|---|
| "base_path" | "models" | Sets local model base path for API Local and XTTSv2 Local. |
| "model_path" | "xttsv2_2.0.2" | Sets local model folder for API Local and XTTSv2 Local below the base path. |
| "files_to_download" | (see JSON above) | Sets the model files required to be downloaded into \base_path\model_path\ and where to download them from. |
The Text-to-Speech (TTS) Generation API allows you to generate speech from text input using various configuration options. This API supports both character and narrator voices, providing flexibility for creating dynamic and engaging audio content.
Standard TTS speech Example (standard text) generating a time-stamped file
curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=All of this is text spoken by the character. This is text not inside quotes, though that doesnt matter in the slightest" -d "text_filtering=standard" -d "character_voice_gen=female_01.wav" -d "narrator_enabled=false" -d "narrator_voice_gen=male_01.wav" -d "text_not_inside=character" -d "language=en" -d "output_file_name=myoutputfile" -d "output_file_timestamp=true" -d "autoplay=true" -d "autoplay_volume=0.8"
Narrator Example (standard text) generating a time-stamped file
curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=*This is text spoken by the narrator* \"This is text spoken by the character\". This is text not inside quotes." -d "text_filtering=standard" -d "character_voice_gen=female_01.wav" -d "narrator_enabled=true" -d "narrator_voice_gen=male_01.wav" -d "text_not_inside=character" -d "language=en" -d "output_file_name=myoutputfile" -d "output_file_timestamp=true" -d "autoplay=true" -d "autoplay_volume=0.8"
Note that if the text to be generated contains double quotes, you will need to escape them with \" (please see the narrator example).
text_input: The text you want the TTS engine to produce. Use escaped double quotes for character speech and asterisks for narrator speech if using the narrator function. Example:
-d "text_input=*This is text spoken by the narrator* \"This is text spoken by the character\". This is text not inside quotes."
text_filtering: Filter for text. Options:
-d "text_filtering=none"
-d "text_filtering=standard"
-d "text_filtering=html"
character_voice_gen: The WAV file name for the character's voice.
-d "character_voice_gen=female_01.wav"
narrator_enabled: Enable or disable the narrator function. If true, minimum text filtering is set to standard. Anything between double quotes is considered the character's speech, and anything between asterisks is considered the narrator's speech.
-d "narrator_enabled=true"
-d "narrator_enabled=false"
narrator_voice_gen: The WAV file name for the narrator's voice.
-d "narrator_voice_gen=male_01.wav"
text_not_inside: Specify the handling of lines not inside double quotes or asterisks, for the narrator feature. Options:
-d "text_not_inside=character"
-d "text_not_inside=narrator"
language: Choose the language for TTS. Options:
-d "language=en"
output_file_name: The name of the output file (excluding the .wav extension).
-d "output_file_name=myoutputfile"
output_file_timestamp: Add a timestamp to the output file name. If true, each file will have a unique timestamp; otherwise, the same file name will be overwritten each time you generate TTS.
-d "output_file_timestamp=true"
-d "output_file_timestamp=false"
autoplay: Enable or disable autoplay. Still needs to be specified in the JSON request.
-d "autoplay=true"
-d "autoplay=false"
autoplay_volume: Set the autoplay volume. Should be between 0.1 and 1.0. Still needs to be specified in the JSON request.
-d "autoplay_volume=0.8"
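The same request can be assembled programmatically. A minimal sketch using only the Python standard library, url-encoding the same form fields the curl examples send (the helper name and default values are taken from the standard example above; actually sending it requires a running AllTalk server):

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_tts_request(text_input: str, **options) -> Request:
    """Build a POST request matching the curl examples for /api/tts-generate."""
    fields = {
        "text_input": text_input,
        "text_filtering": "standard",
        "character_voice_gen": "female_01.wav",
        "narrator_enabled": "false",
        "narrator_voice_gen": "male_01.wav",
        "text_not_inside": "character",
        "language": "en",
        "output_file_name": "myoutputfile",
        "output_file_timestamp": "true",
        "autoplay": "true",
        "autoplay_volume": "0.8",
    }
    fields.update(options)  # override any default field per call
    data = urlencode(fields).encode("utf-8")
    return Request("http://127.0.0.1:7851/api/tts-generate", data=data, method="POST")
```

Passing the request to urllib.request.urlopen() would then send it; note that urlencode handles the quote-escaping that the curl examples do by hand.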
The API returns a JSON object with the following properties:
Example JSON TTS Generation Response:
{"status": "generate-success", "output_file_path": "C:\\text-generation-webui\\extensions\\alltalk_tts\\outputs\\myoutputfile_1703149973.wav", "output_file_url": "http://127.0.0.1:7851/audio/myoutputfile_1703149973.wav"}
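The response is plain JSON, so it can be parsed directly. A small sketch using the example response above:

```python
import json

# The example response body from the documentation above
response_text = (
    '{"status": "generate-success", '
    '"output_file_path": "C:\\\\text-generation-webui\\\\extensions\\\\alltalk_tts'
    '\\\\outputs\\\\myoutputfile_1703149973.wav", '
    '"output_file_url": "http://127.0.0.1:7851/audio/myoutputfile_1703149973.wav"}'
)

result = json.loads(response_text)
# A successful generation reports "generate-success" plus where the wav was saved
status = result["status"]
wav_url = result["output_file_url"]
```

output_file_path is where the wav landed on disk, while output_file_url can be fetched over HTTP from the AllTalk server.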
curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=API%20Local"
curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=API%20TTS"
curl -X POST "http://127.0.0.1:7851/api/reload?tts_method=XTTSv2%20Local"
Switch between the 3 models respectively.
JSON return {"status": "model-success"}
curl -X POST "http://127.0.0.1:7851/api/deepspeed?new_deepspeed_value=True"
Replace True with False to disable DeepSpeed mode.
JSON return {"status": "deepspeed-success"}
curl -X POST "http://127.0.0.1:7851/api/lowvramsetting?new_low_vram_value=True"
Replace True with False to disable Low VRAM mode.
JSON return {"status": "lowvram-success"}
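The %20 in the reload URLs is just a URL-encoded space, so the three control endpoints can be built generically. A minimal sketch (the helper names are my own, not part of AllTalk's API):

```python
from urllib.parse import quote

BASE = "http://127.0.0.1:7851"

def reload_url(tts_method: str) -> str:
    """Build the /api/reload URL for one of the three model/method names."""
    return f"{BASE}/api/reload?tts_method={quote(tts_method)}"

def deepspeed_url(enabled: bool) -> str:
    """Build the /api/deepspeed URL; True enables DeepSpeed mode."""
    return f"{BASE}/api/deepspeed?new_deepspeed_value={enabled}"

def lowvram_url(enabled: bool) -> str:
    """Build the /api/lowvramsetting URL; True enables Low VRAM mode."""
    return f"{BASE}/api/lowvramsetting?new_low_vram_value={enabled}"
```

Each URL would be sent as a POST, matching the curl examples above.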
Check if the Text-to-Speech (TTS) service is ready to accept requests.
curl -X GET "http://127.0.0.1:7851/api/ready"
Retrieve a list of available voices for generating speech.
curl -X GET "http://127.0.0.1:7851/api/voices"
JSON return: {"voices": ["voice1.wav", "voice2.wav", "voice3.wav"]}
Generate a preview of a specified voice with hardcoded settings.
curl -X POST "http://127.0.0.1:7851/api/previewvoice/" -F "voice=female_01.wav"
Replace female_01.wav with the name of the voice sample you want to hear.
JSON return: {"status": "generate-success", "output_file_path": "/path/to/outputs/api_preview_voice.wav", "output_file_url": "http://127.0.0.1:7851/audio/api_preview_voice.wav"}
Command line outputs are more verbose to assist in understanding backend processes and debugging.
It is possible during startup that you get a warning message such as:

[AllTalk Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 80 seconds maximum

This is normal behaviour if the subprocess is taking a while to start. However, if there is an issue starting the subprocess, you may see multiples of this message and it will time out after 80 seconds, resulting in the TTS engine not starting. It is likely that you are not in the correct Python environment, or not in one that has a TTS engine inside it; if that is the case, the system will output a warning about it ahead of this message.
Typically, the command line console will output any warning or error messages. If you need to reset your default configuration, the settings are all listed above in the configuration details.