StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models


Top Related Projects

tortoise-tts

14,619

A multi-voice TTS system trained with an emphasis on quality

TTS

41,775

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

vits

7,703

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

fairseq

31,682

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

unilm

21,586

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Real-Time-Voice-Cloning

54,789

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Quick Overview

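StyleTTS2 is a text-to-speech (TTS) system that uses style diffusion and adversarial training with large speech language models (SLMs) to generate natural, expressive speech. It extends the original StyleTTS by modeling style as a latent variable sampled through a diffusion model, so a suitable style can be generated from the text alone without reference speech.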

Pros

Controllable Speaking Styles: StyleTTS2 allows users to control the speaking style of the generated speech, enabling the creation of more expressive and natural-sounding audio.
High-Quality Audio Output: The project claims to generate high-quality speech that is comparable to human-recorded audio.
Flexible and Customizable: The system is designed to be flexible and customizable, allowing users to fine-tune the speaking styles and other parameters to suit their specific needs.
Potential Applications: The ability to generate speech with different styles could be useful in various applications, such as audiobook narration, virtual assistants, and interactive media.

Cons

Limited Independent Evaluation: The README summarizes the authors' listener-rated results on LJSpeech, VCTK, and LibriTTS but includes no benchmark tables or third-party comparisons, so relative performance is best checked against the paper and the published audio samples.
Complexity: Implementing and using the StyleTTS2 system may require a significant amount of technical expertise, as it involves working with various deep learning models and speech processing techniques.
Computational Requirements: The generation of high-quality speech with controllable styles likely requires significant computational resources, which may limit its accessibility or deployment in resource-constrained environments.
Lack of Detailed Documentation: The project's documentation could be more comprehensive, providing more detailed instructions, examples, and troubleshooting guidance for users.

Code Examples

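The snippets below are illustrative sketches only. The module and class names they use (styletts2.inference, StyleTTS2Inference, StyleTTS2Trainer, preprocess_text) do not appear in the project's README, which instead points to the inference notebooks and the train_first.py and train_second.py scripts: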

from styletts2.inference import StyleTTS2Inference

# Initialize the StyleTTS2 inference model
tts = StyleTTS2Inference(
    model_path="path/to/your/model",
    device="cuda"  # or "cpu" if no GPU is available
)

# Generate speech with a specific style
text = "Hello, this is a sample text."
style = "angry"
audio = tts.generate_audio(text, style)
audio.save("output.wav")

This code demonstrates how to use the StyleTTS2Inference class to generate speech with a specific style.

from styletts2.train import StyleTTS2Trainer

# Initialize the StyleTTS2 trainer
trainer = StyleTTS2Trainer(
    train_dataset="path/to/your/train/dataset",
    val_dataset="path/to/your/val/dataset",
    model_config="path/to/your/model/config.json",
    output_dir="path/to/your/output/directory"
)

# Train the StyleTTS2 model
trainer.train(num_epochs=100)

This code demonstrates how to use the StyleTTS2Trainer class to train a StyleTTS2 model from scratch.

from styletts2.utils import preprocess_text

# Preprocess the input text
text = "This is a sample text to be preprocessed."
preprocessed_text = preprocess_text(text)
print(preprocessed_text)

This code demonstrates how to use the preprocess_text function to prepare the input text for the StyleTTS2 system.

Getting Started

To get started with StyleTTS2, you'll need to follow these steps:

Clone the GitHub repository:

git clone https://github.com/yl4579/StyleTTS2.git

Install the required dependencies:

cd StyleTTS2
pip install -r requirements.txt

Prepare your dataset:
- The data list format is filename.wav|transcription|speaker (see val_list.txt in the repository for an example).
- The bundled text aligner and pitch extractor are pre-trained on 24 kHz audio, so upsample your data to 24 kHz or re-train them.
Train the StyleTTS2 model:
- Adjust the configuration file (./Configs/config.yml) to suit your needs.
- Run first-stage training with train_first.py, then second-stage training with train_second.py.
Generate speech with the trained model:
- Use the `

Competitor Comparisons

tortoise-tts

14,619

A multi-voice TTS system trained with an emphasis on quality

Pros of Tortoise-TTS

Supports multi-voice synthesis with a single model
Offers fine-grained control over speech attributes like speaking rate and emotion
Includes a user-friendly web interface for easy experimentation

Cons of Tortoise-TTS

Generally slower inference speed compared to StyleTTS2
Requires more computational resources for training and inference
Less focus on style transfer capabilities

Code Comparison

Tortoise-TTS:

tts = TextToSpeech()
pcm_audio = tts.tts_with_preset("Hello world!", voice_samples=["path/to/sample.wav"], preset="fast")

StyleTTS2:

synthesizer = StyleTTS2(config_path, model_path)
audio = synthesizer.infer_from_text("Hello world!", speaker="p225", style_text="Happy birthday!")

Both projects aim to provide high-quality text-to-speech synthesis, but they differ in their approaches and features. Tortoise-TTS excels in multi-voice synthesis and fine-grained control, while StyleTTS2 focuses more on style transfer and faster inference. The code examples show that Tortoise-TTS uses a preset system for different synthesis speeds, while StyleTTS2 allows for explicit style text input. Users should consider their specific needs, such as inference speed, voice variety, and style control, when choosing between these two options.

TTS

41,775

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

Pros of TTS

More comprehensive and feature-rich, offering a wider range of TTS models and voice conversion techniques
Better documentation and community support, making it easier for users to get started and troubleshoot issues
Actively maintained with frequent updates and improvements

Cons of TTS

Potentially more complex to set up and use, especially for beginners or those with specific use cases
May require more computational resources due to its broader scope and feature set

Code Comparison

StyleTTS2:

model = StyleTTS2(config)
audio = model.infer(text, speaker_id, style_vec)

TTS:

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")

The code comparison shows that TTS offers a more straightforward API for basic text-to-speech conversion, while StyleTTS2 provides more control over style and speaker parameters. TTS's approach may be more user-friendly for simple use cases, but StyleTTS2 offers finer-grained control for advanced applications.

vits

7,703

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Pros of VITS

Simpler architecture, potentially easier to understand and implement
Faster inference time due to flow-based model
Better handling of long-form text and prosody

Cons of VITS

Less flexibility in controlling voice style and emotion
May produce lower quality audio in some cases
Limited ability to transfer styles between speakers

Code Comparison

VITS:

x, m_p, logs_p, x_mask = self.enc_p(y, y_lengths, sid=sid)
z, logdet = self.flow(x, x_mask, g=g, reverse=reverse)
o = self.dec((z * x_mask)[:,:,:max_len], g=g)

StyleTTS2:

text_emb = self.text_encoder(text, text_lengths)
style_emb = self.style_encoder(mel, mel_lengths)
duration = self.duration_predictor(text_emb, style_emb, text_lengths)
mel_out = self.decoder(text_emb, style_emb, duration, mel_lengths)

Both repositories focus on text-to-speech synthesis, but they employ different approaches. VITS uses a flow-based model, which can lead to faster inference times and better handling of long-form text. However, StyleTTS2 offers more control over voice style and emotion, potentially resulting in higher quality audio in some scenarios. StyleTTS2 also provides better style transfer capabilities between speakers. The code snippets illustrate the different architectures, with VITS using a flow-based approach and StyleTTS2 employing separate encoders for text and style.

fairseq

31,682

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

Broader scope: Supports a wide range of sequence modeling tasks beyond text-to-speech
More extensive documentation and community support
Highly modular and customizable architecture

Cons of fairseq

Steeper learning curve due to its complexity and broader focus
May require more computational resources for training and inference
Less specialized for style transfer in text-to-speech applications

Code Comparison

StyleTTS2:

mel = self.decoder(text_hidden, style_vector, f0)
audio = self.vocoder(mel)

fairseq:

encoder_out = self.encoder(src_tokens, src_lengths)
decoder_out = self.decoder(prev_output_tokens, encoder_out)

StyleTTS2 focuses on generating mel spectrograms and audio from text and style inputs, while fairseq provides a more general-purpose encoder-decoder architecture for various sequence-to-sequence tasks. StyleTTS2's code is more specialized for text-to-speech with style transfer, whereas fairseq's code is more abstract and adaptable to different tasks.

unilm

21,586

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Pros of UniLM

Broader scope: Covers multiple NLP tasks beyond text-to-speech
Larger community and support from Microsoft
More extensive documentation and examples

Cons of UniLM

Less specialized for TTS tasks compared to StyleTTS2
Potentially more complex to set up and use for specific TTS applications
May require more computational resources due to its multi-task nature

Code Comparison

StyleTTS2:

model = StyleTTS2(config)
wav, alignment, *_ = model.inference(text, style_vector)

UniLM:

model = UniLMForConditionalGeneration.from_pretrained("microsoft/unilm-base-cased")
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)

Key Differences

StyleTTS2 focuses specifically on text-to-speech with style transfer
UniLM is a more general-purpose NLP model covering multiple tasks
StyleTTS2 may offer more fine-grained control over speech synthesis
UniLM provides a unified approach to various language tasks

Use Cases

StyleTTS2: Ideal for projects requiring high-quality, stylized TTS
UniLM: Better suited for applications needing multiple NLP capabilities

Community and Development

StyleTTS2: Smaller, more focused community
UniLM: Larger ecosystem, backed by Microsoft's resources

Real-Time-Voice-Cloning

54,789

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pros of Real-Time-Voice-Cloning

Focuses on real-time voice cloning, which may be more suitable for interactive applications
Includes a user-friendly toolbox for voice cloning experiments
Has been around longer, potentially offering more community support and resources

Cons of Real-Time-Voice-Cloning

Less recent updates compared to StyleTTS2, which may impact performance on newer datasets
Primarily designed for English, while StyleTTS2 can be trained on other languages given a PL-BERT model for that language
May require more computational resources for real-time processing

Code Comparison

Real-Time-Voice-Cloning:

def load_model(checkpoint_path):
    model = SpeakerEncoder()
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint["model_state"])
    return model

StyleTTS2:

def build_model(args):
    model = StyleTTS2(
        args.preprocess_config,
        args.model_config,
        args.train_config,
    )
    return model

Both repositories provide methods for loading or building their respective models. Real-Time-Voice-Cloning uses a more straightforward approach with a single SpeakerEncoder model, while StyleTTS2 employs a more complex model structure with multiple configuration files.


README

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.

Paper: https://arxiv.org/abs/2306.07691

Audio samples: https://styletts2.github.io/

Online demo: Hugging Face (thanks to @fakerybakery for the wonderful online demo)

TODO

Training and inference demo code for single-speaker models (LJSpeech)
Test training code for multi-speaker models (VCTK and LibriTTS)
Finish demo code for multispeaker model and upload pre-trained models
Add a finetuning script for new speakers with base pre-trained multispeaker models
Fix DDP (accelerator) for train_second.py (I have tried everything I could to fix this but had no success, so if you are willing to help, please see #7)

Pre-requisites

Python >= 3.7
Clone this repository:

git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2

Install python requirements:

pip install -r requirements.txt

On Windows add:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -U

Also install phonemizer and espeak if you want to run the demo:

pip install phonemizer
sudo apt-get install espeak-ng
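As a quick sanity check after the install steps above, the following minimal sketch reports whether the demo's two extra dependencies can be found. The helper name check_demo_prerequisites is hypothetical, and the check assumes the espeak-ng executable is on your PATH, which may not hold for every installation method.

```python
import importlib.util
import shutil


def check_demo_prerequisites():
    """Report whether the demo's extra dependencies appear to be installed.

    Hypothetical helper, not part of the StyleTTS2 repository.
    """
    return {
        # find_spec returns None when the top-level package is not importable.
        "phonemizer": importlib.util.find_spec("phonemizer") is not None,
        # Assumes the apt package exposes an executable named "espeak-ng".
        "espeak-ng": shutil.which("espeak-ng") is not None,
    }


for name, found in check_demo_prerequisites().items():
    print(f"{name}: {'found' if found else 'missing'}")
```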

Download and extract the LJSpeech dataset, unzip to the data folder and upsample the data to 24 kHz. The text aligner and pitch extractor are pre-trained on 24 kHz data, but you can easily change the preprocessing and re-train them using your own preprocessing. For LibriTTS, you will need to combine train-clean-360 with train-clean-100 and rename the folder train-clean-460 (see val_list_libritts.txt as an example).
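The LibriTTS folder merge described above could be scripted roughly as follows. This is a sketch under assumptions: it presumes both subsets are extracted side by side under one root with speaker directories directly inside each subset, and the function name merge_subsets is hypothetical. It copies rather than moves, so the original folders stay intact at the cost of disk space.

```python
import shutil
from pathlib import Path


def merge_subsets(root, sources=("train-clean-100", "train-clean-360"),
                  target="train-clean-460"):
    """Copy every speaker directory from the source subsets into one folder.

    Hypothetical helper; the README only states that the two subsets must
    be combined into a folder named train-clean-460.
    """
    root = Path(root)
    target_dir = root / target
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in sources:
        for speaker_dir in sorted((root / name).iterdir()):
            if speaker_dir.is_dir():
                # dirs_exist_ok requires Python 3.8+ and makes re-runs safe.
                shutil.copytree(speaker_dir, target_dir / speaker_dir.name,
                                dirs_exist_ok=True)
                copied.append(speaker_dir.name)
    return copied
```

Calling merge_subsets("path/to/LibriTTS") returns the list of speaker folder names that were copied, which can be compared against the data lists before training.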

Training

First stage training:

accelerate launch train_first.py --config_path ./Configs/config.yml

Second stage training (DDP version not working, so the current version uses DP, again see #7 if you want to help):

python train_second.py --config_path ./Configs/config.yml

You can run both consecutively and it will train both the first and second stages. The model will be saved in the format "epoch_1st_%05d.pth" and "epoch_2nd_%05d.pth". Checkpoints and Tensorboard logs will be saved at log_dir.
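To make the checkpoint naming concrete, here is a small sketch that formats names with the patterns quoted above and picks the highest-numbered checkpoint in a folder. The helper names are hypothetical, and the sketch assumes checkpoints sit directly inside log_dir; it does not load or inspect the files.

```python
import re
from pathlib import Path

# Matches the two filename formats quoted in the README.
CKPT_PATTERN = re.compile(r"^epoch_(1st|2nd)_(\d{5,})\.pth$")


def checkpoint_name(stage, epoch):
    """Format a checkpoint filename, e.g. stage "1st" and epoch 7."""
    return "epoch_%s_%05d.pth" % (stage, epoch)


def latest_checkpoint(log_dir, stage):
    """Return the path with the highest epoch number for a stage, or None."""
    best = None
    for path in Path(log_dir).glob("epoch_%s_*.pth" % stage):
        match = CKPT_PATTERN.match(path.name)
        if match is None:
            continue
        epoch = int(match.group(2))
        if best is None or epoch > best[0]:
            best = (epoch, path)
    return None if best is None else best[1]


print(checkpoint_name("1st", 7))
```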

The data list format needs to be filename.wav|transcription|speaker, see val_list.txt as an example. The speaker labels are needed for multi-speaker models because we need to sample reference audio for style diffusion model training.
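A data list in this format can be validated before training with a few lines of code. The sketch below is illustrative: the Utterance type, the parse_line helper, and the sample line are made up for this example and are not taken from the repository.

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    wav: str
    text: str
    speaker: str


def parse_line(line):
    """Split one 'filename.wav|transcription|speaker' entry into its fields."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != 3:
        raise ValueError("expected 3 '|'-separated fields, got %d" % len(parts))
    wav, text, speaker = (part.strip() for part in parts)
    if not wav.endswith(".wav"):
        raise ValueError("first field should be a .wav filename: %r" % wav)
    if not text:
        raise ValueError("transcription is empty for %r" % wav)
    return Utterance(wav, text, speaker)


# Made-up sample entry, shown only to illustrate the field order.
print(parse_line("sample_0001.wav|hello world|0"))
```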

Important Configurations

In config.yml, there are a few important configurations to take care of:

OOD_data: The path for out-of-distribution texts for SLM adversarial training. The format should be text|anything.
min_length: Minimum length of OOD texts for training. This is to make sure the synthesized speech has a minimum length.
max_len: Maximum length of audio for training, measured in frames. Since the default hop size is 300, one frame is approximately 300 / 24000 (0.0125) second. Lower this if you run out of memory.
multispeaker: Set to true if you want to train a multispeaker model. This is needed because the architecture of the denoiser is different for single and multispeaker models.
batch_percentage: This prevents out-of-memory (OOM) issues during SLM adversarial training. If you encounter OOM problems, set a lower number for this.
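The frame arithmetic for max_len can be checked with a short conversion helper. This sketch uses only the hop size (300) and sample rate (24 kHz) stated above; the function names are hypothetical, and the value 400 is an arbitrary example rather than a recommended setting.

```python
HOP_SIZE = 300        # samples per frame, as stated for the default config
SAMPLE_RATE = 24000   # Hz, the rate the pre-trained modules expect


def frames_to_seconds(frames, hop_size=HOP_SIZE, sample_rate=SAMPLE_RATE):
    """Convert a length in frames to seconds of audio."""
    return frames * hop_size / sample_rate


def seconds_to_frames(seconds, hop_size=HOP_SIZE, sample_rate=SAMPLE_RATE):
    """Convert seconds of audio to a whole number of frames (rounded down)."""
    return int(seconds * sample_rate // hop_size)


print(frames_to_seconds(1))     # one frame
print(frames_to_seconds(400))   # an arbitrary example value for max_len
print(seconds_to_frames(5.0))
```

Halving max_len halves the longest audio segment seen in training, which is why lowering it reduces memory use.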

Pre-trained modules

In the Utils folder, there are three pre-trained models:

ASR folder: It contains the pre-trained text aligner, which was pre-trained on English (LibriTTS), Japanese (JVS), and Chinese (AiShell) corpora. It works well for most other languages without fine-tuning, but you can always train your own text aligner with the code here: yl4579/AuxiliaryASR.
JDC folder: It contains the pre-trained pitch extractor, which was pre-trained on English (LibriTTS) corpus only. However, it works well for other languages too because F0 is independent of language. If you want to train on singing corpus, it is recommended to train a new pitch extractor with the code here: yl4579/PitchExtractor.
PLBERT folder: It contains the pre-trained PL-BERT model, which was pre-trained on English (Wikipedia) corpus only. It probably does not work very well on other languages, so you will need to train a different PL-BERT for different languages using the repo here: yl4579/PL-BERT. You can also use the multilingual PL-BERT which supports 14 languages.

Common Issues

Loss becomes NaN: In the first stage, make sure you do not use mixed precision, as it can cause the loss to become NaN for some datasets when the batch size is not set properly (it needs to be more than 16 to work well). In the second stage, experiment with different batch sizes; higher batch sizes are more likely to cause NaN loss values. We recommend a batch size of 16. See issues #10 and #11 for more details.
Out of memory: Lower either batch_size or max_len. See issue #10 for more information.
Non-English dataset: You can train on any language you want, but you will need a pre-trained PL-BERT model for that language. We have a pre-trained multilingual PL-BERT that supports 14 languages. See yl4579/StyleTTS#10 and #70 for examples of training on Chinese datasets.

Finetuning

The script is modified from train_second.py, which uses DP because DDP does not work for train_second.py. Please see the DDP item in the TODO list above (#7) if you are willing to help with this problem.

python train_finetune.py --config_path ./Configs/config_ft.yml

Please make sure you have the LibriTTS checkpoint downloaded and unzipped under the folder. The default configuration config_ft.yml finetunes on LJSpeech with 1 hour of speech data (around 1k samples) for 50 epochs. This took about 4 hours to finish on four NVIDIA A100 GPUs. The quality is slightly worse (similar to NaturalSpeech on LJSpeech) than the LJSpeech model trained from scratch with 24 hours of speech data, which took around 2.5 days to finish on four A100 GPUs. The samples can be found at #65 (comment).

If you are using a single GPU (because the script doesn't work with DDP) and want faster training and lower VRAM usage, you can run the following (thanks to @korakoe for making the script at #100):

accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml

Common Issues

@Kreevoz has made detailed notes on common issues in finetuning, with suggestions for maximizing audio quality: #81. Some of these also apply to training from scratch. @IIEleven11 has also made a guideline for fine-tuning: #128.

Out of memory after joint_epoch: This is likely because your GPU memory is not large enough for the SLM adversarial training run. You may skip it, but the quality could be worse. Setting joint_epoch to a number larger than epochs skips the SLM adversarial training.
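The relationship between joint_epoch and epochs can be illustrated with a tiny sketch. It rests on an assumption that the README does not spell out: that SLM adversarial training is active for every 0-based epoch index at or beyond joint_epoch. The function name and the example numbers are hypothetical.

```python
def slm_adversarial_epochs(epochs, joint_epoch):
    """List the 0-based epoch indices at which SLM adversarial training runs.

    Assumes the stage is active once the epoch index reaches joint_epoch;
    this mirrors the README's description, not a verified reading of the code.
    """
    return [epoch for epoch in range(epochs) if epoch >= joint_epoch]


# joint_epoch below epochs: the adversarial stage runs for the tail epochs.
print(slm_adversarial_epochs(epochs=50, joint_epoch=45))

# joint_epoch above epochs: the list is empty, so the stage is skipped.
print(slm_adversarial_epochs(epochs=50, joint_epoch=60))
```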

Inference

Please refer to Inference_LJSpeech.ipynb (single-speaker) and Inference_LibriTTS.ipynb (multi-speaker) for details. For LibriTTS, you will also need to download reference_audio.zip and unzip it under the demo before running the demo.

The pretrained StyleTTS 2 on LJSpeech corpus in 24 kHz can be downloaded at https://huggingface.co/yl4579/StyleTTS2-LJSpeech/tree/main.
The pretrained StyleTTS 2 model on LibriTTS can be downloaded at https://huggingface.co/yl4579/StyleTTS2-LibriTTS/tree/main.

You can import StyleTTS 2 and run it in your own code. However, the inference depends on a GPL-licensed package, so it is not included directly in this repository. A GPL-licensed fork has an importable script, as well as an experimental streaming API, etc. A fully MIT-licensed package that uses gruut (albeit lower quality due to mismatch between phonemizer and gruut) is also available.

Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices.

Common Issues

High-pitched background noise: This is caused by numerical float differences in older GPUs. For more details, please refer to issue #13. Basically, you will need to use more modern GPUs or do inference on CPUs.
Pre-trained model license: You only need to abide by the above rules if you use the pre-trained models and the voices are NOT in the training set, i.e., your reference speakers are not from any open access dataset. For more details of rules to use the pre-trained models, please see #37.

License

Code: MIT License

Pre-Trained Models: Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices.
