TTS-WebUI

A single Gradio + React WebUI with extensions for ACE-Step, Kimi Audio, Piper TTS, GPT-SoVITS, CosyVoice, XTTSv2, DIA, Kokoro, OpenVoice, ParlerTTS, Stable Audio, MMS, StyleTTS2, MAGNet, AudioGen, MusicGen, Tortoise, RVC, Vocos, Demucs, SeamlessM4T, and Bark!

2,313

245


Top Related Projects

TTS

9,796

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

bark

38,091

🔊 Text-Prompted Generative Audio Model

tortoise-tts

14,346

A multi-voice TTS system trained with an emphasis on quality

StyleTTS2

5,682

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

fairseq

31,373

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Quick Overview

TTS Generation WebUI is a web-based interface for text-to-speech generation using various AI models. It provides a user-friendly platform for generating speech from text input, offering multiple voice options and customization features. The project aims to make advanced TTS technology accessible to users without requiring deep technical knowledge.

Pros

Easy-to-use web interface for text-to-speech generation
Supports multiple TTS models and voices
Offers customization options for speech output
Provides a convenient way to experiment with different TTS technologies

Cons

May require significant computational resources for some models
Limited to the specific models and voices implemented in the project
Potential for inconsistent results across different TTS engines
Requires setup and configuration, which might be challenging for non-technical users

Code Examples

# Example 1: Initializing the TTS engine
from tts_generation import TTSEngine

engine = TTSEngine(model="tacotron2", device="cuda")

# Example 2: Generating speech from text
text = "Hello, world! This is a text-to-speech example."
audio = engine.generate_speech(text)

# Example 3: Saving the generated audio to a file
engine.save_audio(audio, "output.wav")

# Example 4: Changing voice settings
engine.set_voice(speaker_id=1, language="en")
audio = engine.generate_speech("This is spoken in a different voice.")

Getting Started

To get started with TTS Generation WebUI:

Clone the repository:

git clone https://github.com/rsxdalv/tts-generation-webui.git

Install dependencies:

cd tts-generation-webui
pip install -r requirements.txt

Run the web interface:
```
python app.py
```
Open a web browser and navigate to http://localhost:7860 to access the TTS Generation WebUI.

Competitor Comparisons

TTS

9,796

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Pros of TTS

More comprehensive and feature-rich TTS library
Better documentation and community support
Offers a wider range of pre-trained models and voices

Cons of TTS

Steeper learning curve for beginners
Requires more computational resources
Less focus on user-friendly web interface

Code Comparison

TTS:

from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")

tts-generation-webui:

from TTS.api import TTS

model = TTS("tts_models/multilingual/multi-dataset/your_tts", gpu=True)
model.tts_to_file(text="Hello world!", speaker_wav="path/to/speaker.wav", language="en", file_path="output.wav")

Both repositories use the TTS library, but tts-generation-webui focuses on providing a web interface for easier use, while TTS offers more flexibility and control over the TTS process.

bark

38,091

🔊 Text-Prompted Generative Audio Model

Pros of Bark

More advanced and versatile text-to-speech model with multilingual support
Capable of generating non-speech sounds and music
Actively maintained by a dedicated AI research company

Cons of Bark

Requires more computational resources and may be slower for generation
Less user-friendly interface, primarily designed for developers
Limited customization options for voice characteristics

Code Comparison

Bark:

from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()
text = "Hello, I'm a spoken audio clip generated using Bark."
audio_array = generate_audio(text)

TTS Generation WebUI:

import gradio as gr
from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
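gr.Interface(fn=lambda text: tts.tts_to_file(text=text, file_path="output.wav"), inputs="text", outputs="audio").launch()  # return a file path, which the audio output accepts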

The Bark code snippet demonstrates its focus on generating audio programmatically, while TTS Generation WebUI emphasizes creating a user interface for text-to-speech conversion. Bark's approach is more flexible but requires more setup, whereas TTS Generation WebUI provides a simpler, more accessible interface for end-users.

tortoise-tts

14,346

A multi-voice TTS system trained with an emphasis on quality

Pros of Tortoise-TTS

More advanced and feature-rich TTS system with higher quality output
Supports multi-voice synthesis and voice cloning capabilities
Offers fine-grained control over various aspects of speech generation

Cons of Tortoise-TTS

Higher computational requirements and slower inference times
More complex setup and usage compared to TTS Generation WebUI
Limited web interface options out of the box

Code Comparison

Tortoise-TTS:

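from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio
tts = TextToSpeech()
wav = tts.tts_with_preset("Hello world!", voice_samples=[load_audio("path/to/sample.wav", 22050)], preset="fast")  # voice_samples takes loaded clips, not paths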

TTS Generation WebUI:

from TTS.api import TTS
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")

The code snippets demonstrate that Tortoise-TTS offers more advanced features like voice cloning, while TTS Generation WebUI provides a simpler interface for basic TTS functionality.

StyleTTS2

5,682

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Pros of StyleTTS2

Offers more advanced voice cloning capabilities
Provides better control over prosody and speaking style
Supports multi-speaker TTS with a single model

Cons of StyleTTS2

Requires more computational resources
Has a steeper learning curve for beginners
Less user-friendly interface compared to tts-generation-webui

Code Comparison

StyleTTS2:

mel = model.style_encoder(style_wav)
text = model.get_text(text, language)
audio = model.infer(text, mel, alpha=alpha, beta=beta, diffusion_steps=steps)

tts-generation-webui:

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
wav = tts.tts(text=text, speaker_wav=speaker_wav, language=language)

StyleTTS2 offers more granular control over the generation process, allowing for style encoding and diffusion steps adjustment. tts-generation-webui provides a simpler interface with fewer parameters, making it more accessible for quick text-to-speech tasks.

fairseq

31,373

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

Comprehensive toolkit for sequence modeling tasks
Supports a wide range of NLP and speech processing tasks
Highly scalable and optimized for performance

Cons of fairseq

Steeper learning curve due to its complexity
Requires more computational resources
Less focused on TTS-specific features

Code Comparison

fairseq:

import torch
from fairseq.models.wav2vec import Wav2VecModel

model = Wav2VecModel.from_pretrained('path/to/model')
wav_input_16khz = torch.randn(1,10000)
z = model.feature_extractor(wav_input_16khz)
c = model.feature_aggregator(z)

tts-generation-webui:

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(text="Hello world!", speaker_wav="path/to/speaker.wav", language="en", file_path="output.wav")  # XTTS v2 needs a reference voice and a language

The fairseq code demonstrates its flexibility for various audio processing tasks, while tts-generation-webui focuses on simplifying the TTS process with a user-friendly API. fairseq offers more control over the underlying model, whereas tts-generation-webui provides a streamlined approach for generating speech from text.


README

TTS WebUI / Harmonica

Download Installer || Installation || Docker Setup || Silly Tavern || Extensions || Feedback / Bug reports

Videos

Models

Text-to-speech	Audio/Music Generation	Audio Conversion/Tools
Bark	MusicGen	RVC
Tortoise	MAGNeT	Demucs
Maha TTS	Stable Audio	Vocos
MMS	(Extension) Riffusion	Whisper
Vall-E X	(Extension) AudioCraft Mac	AP BWE
StyleTTS2	(Extension) AudioCraft Plus	Resemble Enhance
SeamlessM4T		Audio Separator
(Extension) XTTSv2
(Extension) MARS5
(Extension) F5-TTS
(Extension) Parler TTS
(Extension) OpenVoice
(Extension) OpenVoice V2
(Extension) Kokoro TTS
(Extension) DIA
(Extension) CosyVoice
(Extension) GPT-SoVITS
(Extension) Piper TTS
(Extension) Kimi Audio 7B Instruct
(Extension) ACE-Step

Examples

Screenshots

Changelog

June 26:

Fix React UI file size limit of 4MB, now 50MB. Thanks https://github.com/SuperFurias ! (#446)

June 20:

Upgrade Chatterbox to enable compilation for 2-4x speedup.
Fix React UI build errors.
Add 'auto start' option to OpenAI-API.

June 10:

Patch eslint warnings during build.
Fix extension_cuda_toolkit definition.

June 9:

Add CUDA Toolkit extension.
Hotfix for PyTorch 2.7.0 nightly.
Update Docker to 2.7.0

June 8:

Fix decorators for generation.
Refactor server.py code.
Hotfix for docker, thanks https://github.com/chrislawso for reporting.

June 7:

Chatterbox upgrade for streaming.

June 6:

Update DIA Extension for Float16 support.
Improve decorators for streaming usage.

June 4:

Attempt dockerfile fix.
Add interactivity to model unloading button, improve Gradio random seed UI.
Add sample voices.

June 1:

Add presets API.
Add API Preset config to React UI.

May 2025

May 31:

Improve React UI Audio player.
Fix ROCm installation version.

May 30:

Make OpenAI API extension installed by default (extension_kokoro_tts_api).
Add Favicon.
Fix OpenVoice v2 extension.
Improve UI layout for StyleTTS2, MahaTTS, Vall-E-X, Parler TTS

May 29:

Add Chatterbox extension.
Add Kokoro TTS to React UI.
Fix React Build, thanks noaht8um!

May 28:

Restore gr.Tabs to the old style for easier stacking of many tabs.
Integrate custom IconButton.
Fix Gradio's output tab display
Add tutorial section

May 27:

Include gradio==5.5.0 in each installation of extensions. While this might cause some extensions to fail to install, it should prevent extensions from breaking the UI. Please report extensions that fail to install. Thanks to cwlowden for debugging this issue.
Make XTTS-RVC-UI an unrecommended extension.

May 26:

Add fixes for decorators to work with non-'text' inputs.
Clean up .env generator and remove the Bark environment variables from settings.
Add Audio book extension definitions for future use (extensions not available yet).
Fix SeamlessM4T Audio to Audio tab.
Update ACE-Step extension.
Improve Kokoro TTS API.

May 14:

Prepare for Python 3.11 and 3.12 support.

May 12:

Fix deepspeed for Windows. Thank you for the reports!
Improve decorator extensions for future API.
Improve Kokoro TTS API for OpenAI compatibility, now usable with SillyTavern.
Add setup.py for future pip installs. Sync versions.json with setup.py and package.json.
Remove deprecated requirements_* files.
Removed Windows deepspeed until it no longer requires NVCC, thank you https://github.com/lcmiracle for extensive debugging and testing.

May 10:

Fix missing directory bug causing extensions to fail to load. Thanks Discord/Comstock for discovery of the bug.
Add ACE-Step to React UI.
Add emoji to Gradio UI categories for simplicity.
Add enhanced logging for every update and app startup, allowing for easier debugging once issues happen.
Show gr.Info when models are being loaded or unloaded.
Allow users to use React UI together with Gradio auth by specifying GRADIO_AUTH="username:pass" environment variable.

May 7:

Add Piper TTS extension
Add ACE-Step extension

May 6:

Add Kimi Audio 7B Instruct extension
Fix React-Gradio file proxy missing slash
Add Kokoro TTS API extension

April 2025

Apr 25:

Add OpenVoice V2 extension

Apr 24:

Add OpenVoice V1 extension

Apr 23:

Deprecate requirements_* files using direct extension installation instead.
Add proxy for gradio files in React UI.
Added DIA extension.

Apr 22:

Allow newer versions of pip
Remove PyTorch's +cpu for Apple M Series Chip
Installer fixes - fix CUDA repair, CRLF, warn about GCC, terminate if pip fails.

Apr 20:

Fix install/uninstall in extension manager
Add Kokoro TTS extension

Apr 18:

Fix extension manager startup
Convert most models to extensions, install the classic ones by default
Attempt to fix linux installer
Add 'recommended' flag for extensions

Apr 17:

Create extension manager
Warn Windows users if conda is installed
Upgrade Dockerfile to PyTorch 2.6.0

Apr 12:

Upgrade to PyTorch 2.6.0 Cuda 12.4, switch to pip for pytorch install
Add compatibility layer for older models
Fix StyleTTS2 missing NLTK downloader
Reorder TTS tabs
Allow disabled extensions to be configured in config.json
Remove PyTorch CPU via pip option, redundant
Move all core conda packages to init_mamba scripts.
Upgrade the installer to include a web-based UI
Add conda storage optimizer extension
Hotfix: New init_app bug that caused the installer to freeze

Apr 11:

Add AP BWE upscaling extension

Apr 02:

Fix pydantic (#465, #468)
Add --no-react --no-database advanced flags
Add a fix to avoid directory errors on the very first React UI build (#466)

March 2025

Mar 21:

Add CosyVoice extension [Unstable] and GPT-SoVITS [Alpha] extension

Mar 20:

Add executable macOS script for double-click launching
Add unstable CosyVoice extension

Mar 18:

Remove old rvc files
Fix missing torchfcpe dependency for RVC

Mar 17:

Upgrade Google Colab to PyTorch 2.6.0, add Conda to downgrade Python to 3.10
No longer abort when the automatic update fails to fetch the new code (Improving offline support #457)
Upgrade Tortoise to v3.0.1 for transformers 4.49.0 #454
Prevent running in Windows/System32 folder #459

February 2025

Feb 15:

Fix Stable Audio to match the new version

Feb 14:

Pin accelerate>=0.33.0 project wide
Add basic Seamless M4T quantization code

Feb 13:

Fix Stable Audio and Seamless M4T incompatibility
Make Seamless M4T automatically use CUDA if available, otherwise CPU

Feb 10:

Improve installation instructions in README

January 2025

2024


See the 2024 Changelog for a detailed list of changes in 2024.

2023


See the 2023 Changelog for a detailed list of changes in 2023.

Upgrading (For old installations)

In case of issues, feel free to contact the developers.


Upgrading from v6 to new installer

Recommended: Fresh install

Download the new version and run the start_tts_webui.bat (Windows) or start_tts_webui.sh (MacOS, Linux)
Once it is finished, close the server.
Recommended: Copy the old generations to the new directory, such as favorites/ outputs/ outputs-rvc/ models/ collections/ config.json
With caution: you can copy the whole new tts-webui directory over the old one, but there might be some old files that are lost.
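
The recommended copy step above can be sketched in Python. This is only an illustration, assuming the old and new installations sit in two separate directories that you pass in yourself; the helper name `copy_user_data` is hypothetical and not part of the project.

```python
import shutil
from pathlib import Path

# Items named in the upgrade notes above.
ITEMS = ["favorites", "outputs", "outputs-rvc", "models", "collections", "config.json"]


def copy_user_data(old_root: Path, new_root: Path) -> list[str]:
    """Copy the listed items from old_root into new_root; return what was copied."""
    copied = []
    for item in ITEMS:
        src = old_root / item
        if not src.exists():
            continue  # Skip anything the old installation never created.
        dst = new_root / item
        if src.is_dir():
            # dirs_exist_ok merges into a directory the new install already made.
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dst)
        copied.append(item)
    return copied
```

Note that this overwrites files with the same name in the new directory, including config.json, so it is worth running only after the new installation has finished its first start.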

In-place upgrade, can delete some files, tweaks

Update the existing installation using the update_platform script
After the update run the new start_tts_webui.bat (Windows) or start_tts_webui.sh (MacOS, Linux) inside of the tts-webui directory
Once the server starts, check if it works.
With caution: if the new server works, within the one-click-installers directory, delete the old installer_files.

Is there any more optimal way to do this?

Not exactly. The dependencies clash, especially between conda and python (and the dependencies are already in a critical state; moving them to conda is a long way off). Therefore, while it might be possible to just replace the old installer with the new one and run the update, the problems are unpredictable and unfixable. Updating the installer requires a lot of testing, so it's not done lightly.

Extensions

Extensions are available to install from the webui itself, or using React UI. They can also be installed using the extension manager. Internally, extensions are just python packages that are installed using pip. Multiple extensions can be installed at the same time, but there might be compatibility issues between them. After installing or updating an extension, you need to restart the app to load it.
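
Since extensions are ordinary pip packages, the installed ones can be listed from Python. The sketch below assumes that extension packages share an `extension_` name prefix, as the names mentioned in the changelog (`extension_kokoro_tts_api`, `extension_cuda_toolkit`) suggest; that prefix is an assumption, and both function names are hypothetical. Run it with the Python interpreter of the app's own environment.

```python
from importlib.metadata import distributions


def is_extension(name: str, prefix: str = "extension_") -> bool:
    """Return True if a package name matches the assumed extension prefix."""
    # pip may report names with hyphens, so normalize before comparing.
    return name.lower().replace("-", "_").startswith(prefix)


def installed_extensions(prefix: str = "extension_") -> list[str]:
    """List installed distributions whose names match the assumed prefix."""
    names = {dist.metadata["Name"] for dist in distributions()}
    return sorted(n for n in names if n and is_extension(n, prefix))


if __name__ == "__main__":
    for name in installed_extensions():
        print(name)
```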

Updates need to be done manually using the mini-control panel.

Integrations

Silly Tavern

Install the Kokoro TTS API extension
Start the API and test it with Python Requests

(The OpenAI client might not be installed, so Test with Python OpenAI client might fail.)
Once you can see that audio is generated successfully, go to Silly Tavern and add a new TTS API Default provider endpoint: http://localhost:7778/v1/audio/speech
Test it out!

OpenAI Compatible APIs

Using the instructions above, you can install an OpenAI compatible API, and use it with Silly Tavern or other OpenAI compatible clients.
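
For clients other than Silly Tavern, a request to the endpoint can be sketched with the Python standard library. The URL is the one quoted above; the JSON fields (`model`, `input`, `voice`, `response_format`) follow the OpenAI speech API format, and it is an assumption that the extension accepts all of them. The function names are hypothetical, and valid `model` and `voice` values depend on what the installed API exposes.

```python
import json
import urllib.request

# Endpoint quoted in the Silly Tavern steps above.
SPEECH_URL = "http://localhost:7778/v1/audio/speech"


def build_speech_request(text: str, voice: str, model: str) -> urllib.request.Request:
    """Build (but do not send) a POST request in the OpenAI speech format."""
    # "mp3" as response_format is an assumption about what the API supports.
    payload = {"model": model, "input": text, "voice": voice, "response_format": "mp3"}
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, voice: str, model: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text, voice, model)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

Calling `synthesize(...)` only works while the API is running; building the request separately makes it easy to inspect the payload first.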

Installation

The current base installation size is around 10.7 GB. Each model requires an additional 2-8 GB of space.
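
As a rough worked example of these figures, the snippet below estimates the disk space needed for a given number of models. It simply applies the numbers quoted above; actual model sizes vary, and the function name is hypothetical.

```python
BASE_GB = 10.7               # base installation size quoted above
PER_MODEL_GB = (2.0, 8.0)    # per-model range quoted above


def estimate_disk_gb(model_count: int) -> tuple[float, float]:
    """Return a (low, high) disk estimate in GB for the base install plus models."""
    if model_count < 0:
        raise ValueError("model_count must be non-negative")
    low = BASE_GB + model_count * PER_MODEL_GB[0]
    high = BASE_GB + model_count * PER_MODEL_GB[1]
    return round(low, 1), round(high, 1)


low, high = estimate_disk_gb(3)
print(f"3 models: roughly {low}-{high} GB")
```

With three models, for instance, the estimate is 16.7 to 34.7 GB.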

Download the latest version and extract it.
Run start_tts_webui.bat or start_tts_webui.sh to start the server. It will ask you to select the GPU/Chip you are using. Once everything has installed, it will start the Gradio server at http://localhost:7770 and the React UI at http://localhost:3000.
Output log will be available in the installer_scripts/output.log file.
Note: The start script sets up a conda environment and a python virtual environment. Thus you don't need to make a venv before that, and in fact, launching from another venv might break this script.
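
To check whether both servers came up after the start script finishes, a small port probe can help. This is a minimal sketch using only the standard library and the default ports quoted above; it only tests that something accepts TCP connections, not that the UI itself is healthy.

```python
import socket

# Default ports quoted above: Gradio backend and React UI.
DEFAULT_PORTS = {"gradio": 7770, "react": 3000}


def port_is_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in DEFAULT_PORTS.items():
        state = "up" if port_is_open(port) else "not reachable"
        print(f"{name} (port {port}): {state}")
```

If a port is not reachable, installer_scripts/output.log is the place to look next.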

Manual installation

For detailed manual installation instructions, please refer to the Manual Installation Guide.

Docker Setup

tts-webui can also be run inside a Docker container. Using CUDA inside Docker requires the NVIDIA Container Toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). To get started, pull the image from GitHub Container Registry:

docker pull ghcr.io/rsxdalv/tts-webui:main

Once the image has been pulled, it can be started with Docker Compose. The ports are 7770 (env: TTS_PORT) for the Gradio backend and 3000 (env: UI_PORT) for the React front end.

docker compose up -d
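
The port settings above can be pictured as environment variables overriding defaults. The sketch below shows that pattern in Python, using the variable names and default values from the text; how the Compose file actually consumes these variables is not shown here, so treat this as an illustration of the idea only. The function name is hypothetical.

```python
import os
from typing import Mapping

# Defaults quoted above; TTS_PORT and UI_PORT are the variable names from the text.
DEFAULTS = {"TTS_PORT": 7770, "UI_PORT": 3000}


def resolve_ports(env: Mapping[str, str] = os.environ) -> dict[str, int]:
    """Return the effective ports, letting environment variables override defaults."""
    ports = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name, "")
        # An unset or blank variable falls back to the default port.
        ports[name] = int(raw) if raw.strip() else default
    return ports


if __name__ == "__main__":
    ports = resolve_ports()
    print(f"Gradio backend: http://localhost:{ports['TTS_PORT']}")
    print(f"React UI:       http://localhost:{ports['UI_PORT']}")
```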

The container will take some time to generate the first output while models are downloaded in the background. The status of this download can be verified by checking the container logs:

docker logs tts-webui

Building the image yourself

If you wish to build your own docker container, you can use the included Dockerfile:

docker build -t tts-webui .

Please note that the Docker Compose file needs to be edited to use the image you just built.

Compatibility / Errors

Audiocraft is currently only compatible with Linux and Windows. MacOS support still has not arrived, although it might be possible to install manually.

Torch being reinstalled

Due to limitations of the Python package manager (pip), torch can get reinstalled several times. This is a wide-ranging issue with pip and torch.

Red messages in console

Messages like this one:

---- requires ----, but you have ---- which is incompatible.

are completely normal. They appear both because of a limitation of pip and because this Web UI combines many different AI projects. Since the projects are not always compatible with each other, they complain about the other projects being installed. This is normal and expected, and despite the warnings/errors the projects work together in the end. It's not clear whether this situation will ever be fully resolved, but that is the hope.
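
To see which packages these warnings involve, the lines can be pulled out of a saved console log with a regular expression. This is a minimal sketch that assumes the messages follow the pattern shown above; the sample text is made up for illustration, and the function name is hypothetical.

```python
import re

# Matches pip's "X requires Y, but you have Z which is incompatible." warnings.
CONFLICT = re.compile(
    r"(?P<package>\S+) (?P<version>\S+) requires (?P<required>.+?), "
    r"but you have (?P<found>.+?) which is incompatible\."
)


def find_conflicts(log_text: str) -> list[dict[str, str]]:
    """Return one dict per dependency-conflict warning found in log_text."""
    return [m.groupdict() for m in CONFLICT.finditer(log_text)]


# Made-up sample lines, for illustration only.
sample = (
    "Installing collected packages: example\n"
    "pkg-a 1.0 requires dep>=2.0, but you have dep 1.5 which is incompatible.\n"
)
for c in find_conflicts(sample):
    print(f"{c['package']} wants {c['required']}, found {c['found']}")
```

A list like this is mainly useful when reporting an extension that genuinely fails to work, since the warnings alone do not indicate a problem.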

Extra Voices for Bark, Prompt Samples

Bark Readme

README_Bark.md

Info about managing models, caches and system space for AI projects

https://github.com/rsxdalv/tts-webui/discussions/186#discussioncomment-7291274

Open Source Libraries

This project utilizes the following open source libraries:

suno-ai/bark - MIT License
- Description: Inference code for Bark model.
- Repository: suno/bark
tortoise-tts - Apache-2.0 License
- Description: A flexible text-to-speech synthesis library for various platforms.
- Repository: neonbjb/tortoise-tts
ffmpeg - LGPL License
- Description: A complete and cross-platform solution for video and audio processing.
- Repository: FFmpeg
- Use: Encoding Vorbis Ogg files
ffmpeg-python - Apache 2.0 License
- Description: Python bindings for FFmpeg library for handling multimedia files.
- Repository: kkroening/ffmpeg-python
audiocraft - MIT License
- Description: A library for audio generation and MusicGen.
- Repository: facebookresearch/audiocraft
vocos - MIT License
- Description: An improved decoder for encodec samples
- Repository: charactr-platform/vocos
RVC - MIT License
- Description: An easy-to-use Voice Conversion framework based on VITS.
- Repository: RVC-Project/Retrieval-based-Voice-Conversion-WebUI

Ethical and Responsible Use

This technology is intended for enablement and creativity, not for harm.

By engaging with this AI model, you acknowledge and agree to abide by these guidelines, employing the AI model in a responsible, ethical, and legal manner.

Non-Malicious Intent: Do not use this AI model for malicious, harmful, or unlawful activities. It should only be used for lawful and ethical purposes that promote positive engagement, knowledge sharing, and constructive conversations.
No Impersonation: Do not use this AI model to impersonate or misrepresent yourself as someone else, including individuals, organizations, or entities. It should not be used to deceive, defraud, or manipulate others.
No Fraudulent Activities: This AI model must not be used for fraudulent purposes, such as financial scams, phishing attempts, or any form of deceitful practices aimed at acquiring sensitive information, monetary gain, or unauthorized access to systems.
Legal Compliance: Ensure that your use of this AI model complies with applicable laws, regulations, and policies regarding AI usage, data protection, privacy, intellectual property, and any other relevant legal obligations in your jurisdiction.
Acknowledgement: By engaging with this AI model, you acknowledge and agree to abide by these guidelines, using the AI model in a responsible, ethical, and legal manner.

License

Codebase and Dependencies

The codebase is licensed under MIT. However, it's important to note that when installing the dependencies, you will also be subject to their respective licenses. Although most of these licenses are permissive, there may be some that are not. Therefore, it's essential to understand that the permissive license only applies to the codebase itself, not the entire project.

That being said, the goal is to maintain MIT compatibility throughout the project. If you come across a dependency that is not compatible with the MIT license, please feel free to open an issue and bring it to our attention.

Known non-permissive dependencies:

Library	License	Notes
encodec	CC BY-NC 4.0	Newer versions are MIT, but need to be installed manually
diffq	CC BY-NC 4.0	Optional in the future, not necessary to run, can be uninstalled, should be updated with demucs
lameenc	GPL License	Future versions will make it LGPL, but need to be installed manually
unidecode	GPL License	Not mission critical, can be replaced with another library, issue: https://github.com/neonbjb/tortoise-tts/issues/494

Model Weights

Model weights have different licenses, please pay attention to the license of the model you are using.

Most notably:

Bark: MIT
Tortoise: Unknown (Apache-2.0 according to repo, but no license file in HuggingFace)
MusicGen: CC BY-NC 4.0
AudioGen: CC BY-NC 4.0
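
For scripts that pick a model automatically, the list above can be kept as a small lookup table. The sketch below copies only the four entries listed here; it is not an authoritative license database, and the function name is hypothetical.

```python
# License notes copied from the list above; anything else is reported as unlisted.
MODEL_WEIGHT_LICENSES = {
    "Bark": "MIT",
    "Tortoise": "Unknown",
    "MusicGen": "CC BY-NC 4.0",
    "AudioGen": "CC BY-NC 4.0",
}


def license_note(model: str) -> str:
    """Summarize what the list above says about a model's weights."""
    license_name = MODEL_WEIGHT_LICENSES.get(model)
    if license_name is None:
        return f"{model}: not listed, check the model's own license"
    if "-NC" in license_name:
        return f"{model}: {license_name} (non-commercial)"
    if license_name == "Unknown":
        return f"{model}: license unverified, check before use"
    return f"{model}: {license_name}"


for name in MODEL_WEIGHT_LICENSES:
    print(license_note(name))
```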
