Top Related Projects
:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Tacotron 2 - PyTorch implementation with faster-than-realtime inference
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
End-to-End Speech Processing Toolkit
Clone a voice in 5 seconds to generate arbitrary speech in real-time
Quick Overview
Tortoise-TTS is an open-source text-to-speech system that uses a novel architecture to produce high-quality, natural-sounding speech. It leverages large language models and diffusion techniques to generate speech that closely mimics human voices, including the ability to clone voices with minimal training data.
Pros
- Produces highly natural and expressive speech output
- Capable of voice cloning with just a few seconds of reference audio
- Supports multi-speaker synthesis and voice conversion
- Continuously improving through community contributions and research advancements
Cons
- Computationally intensive, requiring significant GPU resources for optimal performance
- Slower inference speed compared to traditional TTS systems
- Limited language support, primarily focused on English
- May require fine-tuning for specific use cases or accents
Code Examples
- Basic text-to-speech generation:
import torchaudio
from tortoise.api import TextToSpeech
tts = TextToSpeech()
text = "Hello, this is a test of Tortoise TTS."
# With no voice samples supplied, Tortoise falls back to random conditioning latents.
wav = tts.tts(text)
# Tortoise generates 24 kHz audio; the package does not ship a player, so save to disk.
torchaudio.save("output.wav", wav.squeeze(0).cpu(), 24000)
- Voice cloning with a reference audio file:
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio
tts = TextToSpeech()
# Conditioning clips are expected at a 22,050 Hz sampling rate.
reference_audio = load_audio("path/to/reference.wav", 22050)
text = "This is a cloned voice speaking."
wav = tts.tts(text, voice_samples=[reference_audio])
torchaudio.save("cloned.wav", wav.squeeze(0).cpu(), 24000)
- Multi-speaker synthesis (there is no batched multi-voice call, so loop over the built-in voices):
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice
tts = TextToSpeech()
text = "Multiple speakers can be synthesized."
for voice in ["tom", "emma", "daniel"]:
    voice_samples, conditioning_latents = load_voice(voice)
    wav = tts.tts_with_preset(text, voice_samples=voice_samples,
                              conditioning_latents=conditioning_latents, preset="fast")
    torchaudio.save(f"{voice}.wav", wav.squeeze(0).cpu(), 24000)
Getting Started
To get started with Tortoise-TTS:
- Install the library:
pip install torchaudio
pip install git+https://github.com/neonbjb/tortoise-tts
- Basic usage:
import torchaudio
from tortoise.api import TextToSpeech
tts = TextToSpeech()
text = "Welcome to Tortoise TTS!"
wav = tts.tts(text)
torchaudio.save("welcome.wav", wav.squeeze(0).cpu(), 24000)
Note: Ensure you have a CUDA-capable GPU for optimal performance. Refer to the project's README for more detailed setup instructions and advanced usage examples.
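If you are unsure whether PyTorch can see your GPU, a quick check (plain PyTorch calls, nothing Tortoise-specific) is:
import torch
# Tortoise will also run on CPU, but inference is far slower there.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))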
Competitor Comparisons
:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Pros of TTS
- More established project with a larger community and longer development history
- Supports a wider range of TTS models and voice conversion techniques
- Better documentation and easier setup process for beginners
Cons of TTS
- Generally slower inference speed compared to Tortoise-TTS
- Less focus on high-quality, natural-sounding speech synthesis
- May require more computational resources for some models
Code Comparison
TTS:
from TTS.api import TTS
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")
Tortoise-TTS:
from tortoise.api import TextToSpeech
tts = TextToSpeech()
wav = tts.tts_with_preset("Hello world!", preset="fast")  # uses a random voice when none is given
Both repositories offer Python APIs for text-to-speech synthesis, but Tortoise-TTS has a simpler interface for generating speech with random voices. TTS provides more flexibility in model selection and configuration.
Tacotron 2 - PyTorch implementation with faster-than-realtime inference
Pros of Tacotron2
- Developed by NVIDIA, a leader in AI and deep learning technologies
- Well-documented and extensively tested in research environments
- Supports multi-speaker synthesis with global style tokens
Cons of Tacotron2
- Requires significant computational resources for training and inference
- May produce less natural-sounding speech compared to newer models
- Limited flexibility in voice customization without extensive fine-tuning
Code Comparison
Tacotron2:
# Assumes the NVIDIA/tacotron2 repository root is on the path; its modules
# (hparams.py, model.py) are top-level files rather than an installable package.
from hparams import create_hparams
from model import Tacotron2
hparams = create_hparams()
model = Tacotron2(hparams)
Tortoise-TTS:
from tortoise.api import TextToSpeech
tts = TextToSpeech()
wav = tts.tts("Hello world")
Tacotron2 focuses on a more traditional approach to TTS, requiring manual setup of hyperparameters and model initialization. Tortoise-TTS, on the other hand, provides a more user-friendly API with simplified function calls for text-to-speech conversion. Tortoise-TTS also includes utility functions for audio handling, making it easier to work with generated speech output.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- More comprehensive and versatile, supporting a wide range of NLP tasks
- Backed by Facebook AI Research, with extensive documentation and community support
- Highly optimized for performance and scalability
Cons of fairseq
- Steeper learning curve due to its complexity and broad scope
- May be overkill for projects focused solely on text-to-speech tasks
- Requires more computational resources for training and inference
Code Comparison
fairseq:
from fairseq.models.transformer import TransformerModel
# from_pretrained takes the model directory and the checkpoint file name.
model = TransformerModel.from_pretrained('/path/to/model', 'checkpoint.pt')
# translate() accepts a raw sentence and handles tokenization internally.
translated = model.translate('Hello world')
tortoise-tts:
from tortoise.api import TextToSpeech
tts = TextToSpeech()
wav = tts.tts("Hello world", voice="random")
Summary
fairseq is a more comprehensive toolkit for various NLP tasks, including machine translation, text generation, and speech processing. It offers greater flexibility and scalability but comes with a steeper learning curve. tortoise-tts, on the other hand, is specifically designed for text-to-speech tasks, making it more straightforward to use for TTS applications but less versatile overall. The choice between the two depends on the specific project requirements and the desired balance between simplicity and versatility.
End-to-End Speech Processing Toolkit
Pros of ESPnet
- Comprehensive toolkit supporting various speech processing tasks (ASR, TTS, speech enhancement, etc.)
- Large community support and regular updates
- Extensive documentation and examples
Cons of ESPnet
- Steeper learning curve due to its broad scope
- Potentially higher computational requirements for some models
- May be overkill for projects focused solely on TTS
Code Comparison
ESPnet (configuration is normally written in YAML; shown here as the equivalent Python dict):
tts_config = {
    "module": "espnet2.tasks.tts:TTSTask",
    "model_conf": {
        "adim": 384,
        "aheads": 4,
        "elayers": 6,
        "eunits": 1536,
        "dlayers": 6,
        "dunits": 1536,
    },
}
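ESPnet2 also exposes a high-level inference API. A minimal sketch, assuming a pretrained model tag (the tag below is an example, and resolving it requires the espnet_model_zoo package):
from espnet2.bin.tts_inference import Text2Speech
# Example model tag; a local config/checkpoint pair works here as well.
tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")
wav = tts("Hello world")["wav"]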
Tortoise-TTS (usage example):
from tortoise.api import TextToSpeech
tts = TextToSpeech()
wav = tts.tts("Hello world", voice="random")
While ESPnet offers more flexibility and configuration options, Tortoise-TTS provides a simpler API for quick TTS implementation. ESPnet's broader scope makes it suitable for various speech processing tasks, whereas Tortoise-TTS is more focused on high-quality TTS with a user-friendly interface.
Clone a voice in 5 seconds to generate arbitrary speech in real-time
Pros of Real-Time-Voice-Cloning
- Designed for real-time voice cloning, allowing for faster processing and immediate results
- Includes a user-friendly GUI for easy interaction and testing
- Supports multi-speaker voice cloning with a single model
Cons of Real-Time-Voice-Cloning
- Generally lower audio quality compared to Tortoise-TTS
- Less flexibility in terms of fine-tuning and customization options
- May require more training data for optimal results
Code Comparison
Real-Time-Voice-Cloning:
from pathlib import Path
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder
# Pretrained model paths, following the repository's saved_models layout.
enc_model_fpath = Path("saved_models/default/encoder.pt")
syn_model_fpath = Path("saved_models/default/synthesizer.pt")
voc_model_fpath = Path("saved_models/default/vocoder.pt")
encoder.load_model(enc_model_fpath)
synthesizer = Synthesizer(syn_model_fpath)
vocoder.load_model(voc_model_fpath)
Tortoise-TTS:
import torchaudio
from tortoise.api import TextToSpeech
tts = TextToSpeech()
wav = tts.tts("Hello world")
torchaudio.save("output.wav", wav.squeeze(0).cpu(), 24000)
The code snippets demonstrate the initialization and basic usage of both libraries. Real-Time-Voice-Cloning requires separate loading of encoder, synthesizer, and vocoder models, while Tortoise-TTS provides a more streamlined API for text-to-speech conversion.
README
TorToiSe
Tortoise is a text-to-speech program built with the following priorities:
- Strong multi-voice capabilities.
- Highly realistic prosody and intonation.
This repo contains all the code needed to run Tortoise TTS in inference mode.
Manuscript: https://arxiv.org/abs/2305.07243
Hugging Face space
A live demo is hosted on Hugging Face Spaces. If you'd like to avoid a queue, please duplicate the Space and add a GPU. Please note that CPU-only spaces do not work for this demo.
https://huggingface.co/spaces/Manmay/tortoise-tts
Install via pip
pip install tortoise-tts
If you would like to install the latest development version, you can also install it directly from the git repository:
pip install git+https://github.com/neonbjb/tortoise-tts
What's in a name?
I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model is insanely slow. It leverages both an autoregressive decoder and a diffusion decoder; both known for their low sampling rates. On a K80, expect to generate a medium sized sentence every 2 minutes.
Well... not so slow anymore: we can now get a 0.25-0.3 RTF on 4GB of VRAM (i.e., a second of audio in roughly a quarter to a third of a second), and with streaming, under 500 ms of latency!
Demos
See this page for a large list of example outputs.
A cool application of Tortoise + GPT-3 (not affiliated with this repository): https://twitter.com/lexman_ai. Unfortunately, this project seems no longer to be active.
Usage guide
Local installation
If you want to use this on your own computer, you must have an NVIDIA GPU.
On Windows, I highly recommend using the Conda installation method. I have been told that if you do not do this, you will spend a lot of time chasing dependency problems.
First, install miniconda: https://docs.conda.io/en/latest/miniconda.html
Then run the following commands, using the Anaconda Prompt as the terminal (or any other terminal configured to work with conda).
This will:
- create a conda environment with the minimal dependencies specified
- activate the environment
- install PyTorch with the command provided here: https://pytorch.org/get-started/locally/
- clone tortoise-tts
- change the current directory to tortoise-tts
- run the tortoise Python install script
conda create --name tortoise python=3.9 numba inflect
conda activate tortoise
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install transformers=4.29.2
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python setup.py install
Optionally, PyTorch can be installed in the base environment so that other conda environments can use it too. To do this, simply run the conda install pytorch... line before activating the tortoise environment.
Note: When you want to use tortoise-tts, you will always have to ensure the tortoise conda environment is activated.
If you are on Windows, you may also need to install pysoundfile: conda install -c conda-forge pysoundfile
Docker
Docker is an easy way to hit the ground running and a good jumping-off point, depending on your use case.
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
docker build . -t tts
docker run --gpus all \
-e TORTOISE_MODELS_DIR=/models \
-v /mnt/user/data/tortoise_tts/models:/models \
-v /mnt/user/data/tortoise_tts/results:/results \
-v /mnt/user/data/.cache/huggingface:/root/.cache/huggingface \
-v /root:/work \
-it tts
This gives you an interactive terminal in an environment that's ready to do some tts. Now you can explore the different interfaces that tortoise exposes for tts.
For example:
cd app
conda activate tortoise
time python tortoise/do_tts.py \
--output_path /results \
--preset ultra_fast \
--voice geralt \
--text "Time flies like an arrow; fruit flies like a bananna."
Apple Silicon
On macOS 13+ with M1/M2 chips you need to install the nightly version of PyTorch. As stated on the official page, you can do:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
Be sure to do that after you activate the environment. If you don't use conda, the commands would look like this:
python3.10 -m venv .venv
source .venv/bin/activate
pip install numba inflect psutil
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
pip install transformers
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
pip install .
Be aware that DeepSpeed is disabled on Apple Silicon, since it does not work there; the --use_deepspeed flag is ignored.
You may need to prepend PYTORCH_ENABLE_MPS_FALLBACK=1 to the commands below to make them work, since MPS does not support all the operations in PyTorch.
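For example, with the do_tts.py invocation shown below:
PYTORCH_ENABLE_MPS_FALLBACK=1 python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast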
do_tts.py
This script allows you to speak a single phrase with one or more voices.
python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
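The --preset flag trades quality for speed. Besides fast and the ultra_fast preset used in the Docker example above, the codebase also ships standard and high_quality presets, e.g.:
python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset high_quality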
Socket streaming
python tortoise/socket_server.py
This will listen on port 5000.
read_fast.py (faster inference)
This script is a faster variant of read.py for reading large amounts of text.
python tortoise/read_fast.py --textfile <your text to be read> --voice random
read.py
This script provides tools for reading large amounts of text.
python tortoise/read.py --textfile <your text to be read> --voice random
This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and output that as well.
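The same clip-by-clip flow can be reproduced with the Python API. A minimal sketch, assuming a naive sentence split (read.py's own splitting is more careful) and a hypothetical story.txt standing in for your --textfile input:
import torch
import torchaudio
from tortoise.api import TextToSpeech

tts = TextToSpeech()
text = open("story.txt").read()
# Naive sentence split, for illustration only.
sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
clips = []
for i, sentence in enumerate(sentences):
    wav = tts.tts_with_preset(sentence, preset="fast").squeeze(0).cpu()
    torchaudio.save(f"clip_{i}.wav", wav, 24000)  # emit clips as they are generated
    clips.append(wav)
# Combine all clips into a single file, as read.py does at the end.
torchaudio.save("combined.wav", torch.cat(clips, dim=-1), 24000)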
Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running read.py with the --regenerate argument.
API
Tortoise can be used programmatically, like so:
from tortoise import api, utils
# clips_paths is a list of paths to short reference clips of the target voice.
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
To use deepspeed:
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(use_deepspeed=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
To use kv cache:
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(kv_cache=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
To run model in float16:
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
For faster runs, use all three:
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(use_deepspeed=True, kv_cache=True, half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
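pcm_audio is a PyTorch tensor; one way to write it to disk (torchaudio, at Tortoise's 24 kHz output rate):
import torchaudio
torchaudio.save("output.wav", pcm_audio.squeeze(0).cpu(), 24000)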
Acknowledgements
This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to credit a few of the amazing folks in the community that have helped make this happen:
- Hugging Face, who wrote the GPT model and the generate API used by Tortoise, and who hosts the model weights.
- Ramesh et al., who authored the DALL-E paper, which is the inspiration behind Tortoise.
- Nichol and Dhariwal, who authored (the revision of) the code that drives the diffusion model.
- Jang et al., who developed and open-sourced univnet, the vocoder this repo uses.
- Kim and Jung, who implemented the univnet PyTorch model.
- lucidrains who writes awesome open source pytorch models, many of which are used here.
- Patrick von Platen whose guides on setting up wav2vec were invaluable to building my dataset.
Notice
Tortoise was built entirely by the author (James Betker) using their own hardware. Their employer was not involved in any facet of Tortoise's development.
License
Tortoise TTS is licensed under the Apache 2.0 license.
If you use this repo or the ideas therein for your research, please cite it! A BibTeX entry can be found in the right pane on GitHub.
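For convenience, here is an entry constructed from the manuscript linked above (verify it against the canonical entry on GitHub):
@misc{betker2023better,
  title={Better speech synthesis through scaling},
  author={Betker, James},
  year={2023},
  eprint={2305.07243},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}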