
Rayhane-mamah/Tacotron-2

A TensorFlow implementation of Google's Tacotron 2


Top Related Projects

  • mozilla/TTS: Deep learning for Text to Speech (discussion forum: https://discourse.mozilla.org/c/tts)
  • NVIDIA/tacotron2: Tacotron 2 - PyTorch implementation with faster-than-realtime inference
  • CorentinJ/Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time
  • fatchord/WaveRNN: WaveRNN Vocoder + TTS
  • keithito/tacotron: A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
  • r9y9/wavenet_vocoder: WaveNet vocoder

Quick Overview

Rayhane-mamah/Tacotron-2 is a TensorFlow implementation of Google's Tacotron 2 text-to-speech synthesis model. It aims to reproduce the high-quality speech synthesis results described in the original paper, providing a framework for training and inference of Tacotron 2 models.

Pros

  • Implements the state-of-the-art Tacotron 2 architecture for high-quality speech synthesis
  • Provides flexibility in model configuration and training options
  • Includes pre-trained models and sample audio for quick testing and comparison
  • Supports both single-speaker and multi-speaker synthesis

Cons

  • Requires significant computational resources for training
  • Documentation could be more comprehensive for easier setup and customization
  • May require fine-tuning for optimal results on specific datasets or languages
  • Dependency on older versions of TensorFlow and other libraries

Code Examples

  1. Synthesizing speech from text (a sketch; the module path and API are illustrative and may differ from the actual repository layout):

from synthesizer import Synthesizer  # hypothetical module path

synthesizer = Synthesizer()
wav = synthesizer.synthesize("Hello, world!")

  2. Loading a pre-trained model (the import path and file paths are placeholders):

from tacotron.synthesize import Synthesizer  # hypothetical import

checkpoint_path = "path/to/model/checkpoint"
hparams_path = "path/to/hparams.json"
synthesizer = Synthesizer(checkpoint_path, hparams_path)

  3. Training a new model (example hyperparameter values; see hparams.py in the repository for the real defaults):

from tacotron.train import tacotron_train  # hypothetical import

hparams = {
    "num_mels": 80,
    "num_freq": 1025,
    "sample_rate": 20000,
    "frame_length_ms": 50,
    "frame_shift_ms": 12.5,
    "preemphasis": 0.97,
    "min_level_db": -100,
    "ref_level_db": 20,
}

tacotron_train(hparams)
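As a quick sanity check on hyperparameters like these, the frame timings in milliseconds map directly to STFT sizes in samples. A minimal sketch in plain Python, using the example values from the dict above (the variable names here are just the dict keys, not repository API):

```python
# Convert frame timing hyperparameters (ms) into STFT sizes (samples).
# Values mirror the example hparams dict above.
sample_rate = 20000          # audio samples per second
frame_length_ms = 50         # analysis window length in ms
frame_shift_ms = 12.5        # hop between consecutive frames in ms

win_size = int(sample_rate * frame_length_ms / 1000)   # 1000 samples
hop_size = int(sample_rate * frame_shift_ms / 1000)    # 250 samples
overlap = win_size - hop_size                          # 750 samples of overlap

# num_freq = 1025 frequency bins corresponds to an FFT size of
# 2 * (1025 - 1) = 2048, i.e. the 1000-sample window is zero-padded
# up to 2048 before the FFT.
n_fft = 2 * (1025 - 1)

print(win_size, hop_size, n_fft)  # 1000 250 2048
```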

Getting Started

  1. Clone the repository:

    git clone https://github.com/Rayhane-mamah/Tacotron-2.git
    cd Tacotron-2
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download and prepare a dataset (e.g., LJSpeech):

    python preprocess.py --dataset ljspeech
    
  4. Train the model:

    python train.py
    
  5. Synthesize speech:

    python synthesize.py --text "Hello, world!"
    

Competitor Comparisons

Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Pros of TTS

  • More actively maintained with frequent updates and contributions
  • Broader range of supported models and architectures
  • Better documentation and examples for easier implementation

Cons of TTS

  • Steeper learning curve due to more complex codebase
  • Potentially slower inference time for some models

Code Comparison

TTS:

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")

Tacotron-2 (a sketch; the repository is driven through its synthesize.py script rather than a packaged API, so the import paths below are illustrative):

from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

synthesizer = Synthesizer(checkpoint_path)
wav = synthesizer.synthesize_spectrograms([text])[0]
generated_wav = vocoder.infer_waveform(wav)

Both repositories provide implementations of text-to-speech models, with TTS offering a more comprehensive toolkit and Tacotron-2 focusing specifically on the Tacotron 2 architecture. TTS provides a higher-level API for easier use, while Tacotron-2 requires more manual setup and configuration. TTS supports multiple languages and models, whereas Tacotron-2 is primarily designed for English synthesis using the Tacotron 2 model.

Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Pros of tacotron2

  • Official NVIDIA implementation, potentially better optimized for GPU acceleration
  • Includes pre-trained models for quick start and evaluation
  • More comprehensive documentation and usage instructions

Cons of tacotron2

  • Less flexible architecture, harder to modify or extend
  • Fewer configuration options for fine-tuning the model
  • Limited support for custom datasets and preprocessing

Code Comparison

Tacotron-2:

def create_model(self):
    with tf.variable_scope('Tacotron_model') as scope:
        self.initialize_encoder()
        self.initialize_decoder()
        self.initialize_postnet()

tacotron2:

def forward(self, inputs, input_lengths, mel_inputs=None):
    embedded_inputs = self.embedding(inputs).transpose(1, 2)
    encoder_outputs = self.encoder(embedded_inputs, input_lengths)
    mel_outputs, gate_outputs, alignments = self.decoder(
        encoder_outputs, mel_inputs)

The Tacotron-2 implementation uses TensorFlow and separates model components into distinct initialization methods, while the tacotron2 implementation uses PyTorch and combines the forward pass in a single method. This reflects differences in framework design and implementation approach between the two repositories.

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pros of Real-Time-Voice-Cloning

  • Offers real-time voice cloning capabilities
  • Includes a user-friendly toolbox for voice cloning experiments
  • Supports multi-speaker voice cloning

Cons of Real-Time-Voice-Cloning

  • May require more computational resources for real-time processing
  • Potentially less focus on high-quality synthesis compared to Tacotron-2

Code Comparison

Real-Time-Voice-Cloning (simplified sketch):

def load_model(model_dir: Path):
    # Load the three models
    encoder = SpeakerEncoder(model_dir.joinpath("encoder.pt"))
    synthesizer = Synthesizer(model_dir.joinpath("synthesizer.pt"))
    vocoder = WaveRNN(model_dir.joinpath("vocoder.pt"))
    return encoder, synthesizer, vocoder

Tacotron-2 (abridged; the real hyperparameter definition contains many more parameters):

def create_hparams(hparams_string=None, verbose=False):
    """Create model hyperparameters. Parse nondefault from given string."""
    hparams = tf.contrib.training.HParams(
        # Comma-separated list of cleaners to run on text prior to training and eval.
        cleaners='english_cleaners',
    )
    return hparams

The code snippets highlight the different approaches:

  • Real-Time-Voice-Cloning loads three separate models for encoding, synthesis, and vocoding.
  • Tacotron-2 focuses on creating hyperparameters for the model, emphasizing text preprocessing.

These differences reflect the repositories' distinct goals: real-time voice cloning versus high-quality speech synthesis.


WaveRNN Vocoder + TTS

Pros of WaveRNN

  • Faster inference time due to its efficient architecture
  • Smaller model size, making it more suitable for deployment on resource-constrained devices
  • More flexible, allowing for easy integration with different front-end models

Cons of WaveRNN

  • Potentially lower audio quality compared to Tacotron-2's WaveNet vocoder
  • May require more fine-tuning to achieve optimal results
  • Less extensive documentation and community support

Code Comparison

WaveRNN:

def forward(self, x, mels):
    bsize = x.size(0)
    h1, h2 = self.wavernn(x, mels)
    return self.fc(h1), self.res(h2).view(bsize, -1)

Tacotron-2 (an illustrative sketch in PyTorch style; the repository itself is implemented in TensorFlow):

def inference(self, inputs):
    embedded_inputs = self.embedding(inputs).transpose(1, 2)
    encoder_outputs = self.encoder(embedded_inputs)
    decoder_output, _ = self.decoder(encoder_outputs)
    mel_outputs = self.mel_projection(decoder_output)
    return mel_outputs

The code snippets show the core forward pass for WaveRNN and the inference method for Tacotron-2. WaveRNN focuses on generating waveforms, while Tacotron-2 generates mel spectrograms. WaveRNN's architecture is simpler and more efficient, which contributes to its faster inference time.

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

Pros of Tacotron

  • Simpler implementation, making it easier to understand and modify
  • Faster training time due to less complex architecture
  • More lightweight, requiring less computational resources

Cons of Tacotron

  • Lower audio quality compared to Tacotron-2
  • Less robust to variations in input text and speaking styles
  • Limited ability to handle long sentences or complex pronunciations

Code Comparison

The snippets below are simplified pseudocode illustrating the structural difference, not verbatim repository code.

Tacotron:

def create_model(hparams):
    return Tacotron(hparams)

class Tacotron:
    def __init__(self, hparams):
        self.encoder = Encoder(hparams)
        self.decoder = Decoder(hparams)

Tacotron-2:

def create_model(hparams):
    return Tacotron2(hparams)

class Tacotron2:
    def __init__(self, hparams):
        self.encoder = Encoder(hparams)
        self.decoder = Decoder(hparams)
        self.postnet = Postnet(hparams)

The main difference in the code structure is the addition of the Postnet module in Tacotron-2, which enhances the output quality. Tacotron-2 also includes more advanced attention mechanisms and a more sophisticated decoder, contributing to its improved performance but increased complexity.
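The role of that Postnet can be illustrated with a toy residual refinement in plain NumPy. This is a simplification: the real Postnet is a stack of five 1-D convolutions with 512 channels, batch normalization and tanh activations, and the filters below are made up for the example:

```python
import numpy as np

def toy_postnet(mel, kernels):
    """Apply a stack of per-channel 1-D convolutions and add the result
    back to the input, mimicking Tacotron 2's residual Postnet."""
    x = mel
    for k in kernels:
        # Convolve each mel channel independently with the same filter.
        x = np.stack([np.convolve(x[:, c], k, mode="same")
                      for c in range(x.shape[1])], axis=1)
        x = np.tanh(x)  # the real Postnet uses tanh on all but the last layer
    return mel + x  # residual connection: the Postnet predicts a correction

# A coarse "decoder output": 100 frames x 80 mel channels.
mel = np.random.randn(100, 80)
kernels = [np.ones(5) / 5.0] * 3   # three made-up smoothing filters
refined = toy_postnet(mel, kernels)
print(refined.shape)  # (100, 80)
```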

WaveNet vocoder

Pros of wavenet_vocoder

  • Focuses specifically on WaveNet vocoder implementation, allowing for more specialized and potentially optimized audio synthesis
  • Provides pre-trained models for quick experimentation and deployment
  • Supports real-time generation, which can be crucial for certain applications

Cons of wavenet_vocoder

  • Limited to vocoder functionality, while Tacotron-2 offers a complete end-to-end text-to-speech solution
  • May require more expertise to integrate into a full TTS pipeline
  • Less active development and community support compared to Tacotron-2

Code Comparison

wavenet_vocoder:

model = build_model()
y_hat = model.incremental_forward(c)

Tacotron-2:

model = Tacotron2()
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(text)

The wavenet_vocoder code focuses on generating audio from mel-spectrograms, while Tacotron-2 handles the entire text-to-speech process, including text processing and mel-spectrogram generation.

Both repositories offer valuable tools for speech synthesis, with wavenet_vocoder providing a specialized vocoder solution and Tacotron-2 offering a more comprehensive TTS framework. The choice between them depends on the specific requirements of your project and the level of control you need over the speech synthesis process.


README

Tacotron-2:

A TensorFlow implementation of Google's Tacotron 2, a deep neural network architecture described in the paper Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.

This repository contains additional improvements and experiments beyond the paper; we thus provide a paper_hparams.py file which holds the exact hyperparameters needed to reproduce the paper's results without any extras.

The suggested hparams.py file, used by default, contains hyperparameters with extras that proved to give better results in most cases. Feel free to adjust the parameters as needed.

DIFFERENCES WILL BE HIGHLIGHTED IN DOCUMENTATION SHORTLY.

Repository Structure:

Tacotron-2
├── datasets
├── en_UK		(0)
│   └── by_book
│       └── female
├── en_US		(0)
│   └── by_book
│       ├── female
│       └── male
├── LJSpeech-1.1	(0)
│   └── wavs
├── logs-Tacotron	(2)
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── mel-spectrograms
│   ├── plots
│   ├── taco_pretrained
│   ├── metas
│   └── wavs
├── logs-Wavenet	(4)
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── plots
│   ├── wave_pretrained
│   ├── metas
│   └── wavs
├── logs-Tacotron-2	( * )
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── plots
│   ├── taco_pretrained
│   ├── wave_pretrained
│   ├── metas
│   └── wavs
├── papers
├── tacotron
│   ├── models
│   └── utils
├── tacotron_output	(3)
│   ├── eval
│   ├── gta
│   ├── logs-eval
│   │   ├── plots
│   │   └── wavs
│   └── natural
├── wavenet_output	(5)
│   ├── plots
│   └── wavs
├── training_data	(1)
│   ├── audio
│   ├── linear
│   └── mels
└── wavenet_vocoder
    └── models

The previous tree shows the current state of the repository (separate training, one step at a time).

  • Step (0): Get your dataset; here I have included examples for LJSpeech, en_US and en_UK (from M-AILABS).

  • Step (1): Preprocess your data. This will give you the training_data folder.

  • Step (2): Train your Tacotron model. Yields the logs-Tacotron folder.

  • Step (3): Synthesize/Evaluate the Tacotron model. Gives the tacotron_output folder.

  • Step (4): Train your Wavenet model. Yields the logs-Wavenet folder.

  • Step (5): Synthesize audio using the Wavenet model. Gives the wavenet_output folder.

  • Note: Steps 2, 3, and 4 can be run in a single pass for both Tacotron and WaveNet (Tacotron-2, step ( * )).

Note:

  • Our preprocessing only supports LJSpeech and LJSpeech-like datasets (M-AILABS speech data)! If your dataset is stored differently, you will probably need to write your own preprocessing script.
  • In the tree above, files are not shown and the maximum depth was set to 3 for simplicity.
  • If you train both models at the same time, the repository structure will differ slightly.

Pretrained model and Samples:

Pre-trained models and audio samples will be added at a later date. You can, however, check some preliminary insights into the model's performance (at early stages of training) here. THIS IS VERY OUTDATED, I WILL UPDATE THIS SOON

Model Architecture:

The model described by the authors can be divided into two parts:

  • Spectrogram prediction network
  • Wavenet vocoder
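In terms of data flow, the two parts compose as text → mel spectrogram → waveform. A minimal stub of that hand-off, where the functions are placeholders standing in for the trained networks (not repository code, and the frame/hop numbers are example values):

```python
import numpy as np

HOP_SIZE = 300          # audio samples per mel frame (example value)
NUM_MELS = 80           # mel channels, as in the Tacotron 2 paper

def spectrogram_prediction(text):
    """Stand-in for the Tacotron part: text -> (frames, num_mels) mel spectrogram."""
    num_frames = 10 * len(text)   # fake alignment: ~10 frames per character
    return np.zeros((num_frames, NUM_MELS))

def wavenet_vocoder(mel):
    """Stand-in for the WaveNet part: mel frames -> raw audio samples."""
    return np.zeros(mel.shape[0] * HOP_SIZE)

mel = spectrogram_prediction("Hello, world!")
audio = wavenet_vocoder(mel)
print(mel.shape, audio.shape)  # (130, 80) (39000,)
```

The point of the stub is the interface: the vocoder only ever sees mel frames, which is why the two networks can be trained separately.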

For an in-depth exploration of the model architecture, training procedure and preprocessing logic, refer to our wiki.

Current state:

For an overview of our progress on this project, please refer to this discussion.

Since the two parts of the global model are trained separately, we can start by training the feature prediction model and later use its predictions during WaveNet training.

How to start

  • Machine Setup:

First, you need to have Python 3 installed along with TensorFlow.

Next, you need to install some Linux dependencies to ensure audio libraries work properly:

apt-get install -y libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg libav-tools

Finally, you can install the requirements (if you are not an Anaconda user, replace pip with pip3 and python with python3):

pip install -r requirements.txt

  • Docker:

Alternatively, you can build the Docker image so that everything is set up automatically and run the project inside Docker containers. The Dockerfile is inside the "docker" folder.

The Docker image can be built with:

docker build -t tacotron-2_image docker/

Then containers are runnable with:

docker run -i --name new_container tacotron-2_image

Please report any issues with Docker usage of our models; I'll get to it. Thanks!

Dataset:

We tested the code above on the LJSpeech dataset, which has almost 24 hours of labeled recordings of a single female speaker. (Further information on the dataset is available in the README file included with the download.)

We are also running tests on the new M-AILABS speech dataset, which contains more than 700 hours of speech (more than 80 GB of data) across more than 10 languages.

After downloading the dataset, extract the compressed file, and place the folder inside the cloned repository.

Hparams setting:

Before proceeding, you must pick the hyperparameters that best suit your needs. While it is possible to change hyperparameters from the command line during preprocessing/training, I still recommend making the changes once and for all in the hparams.py file directly.

To pick optimal FFT parameters, I have made a griffin_lim_synthesis_tool notebook that you can use to invert real extracted mel/linear spectrograms and judge how good your preprocessing is. All other options are well explained in hparams.py and have meaningful names, so you can easily experiment with them.
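For reference, Griffin-Lim inverts a magnitude spectrogram by iterating between the time and frequency domains, keeping the known magnitudes and re-estimating only the phase each round. A self-contained NumPy sketch of the idea (this is not the notebook's code, and the window/FFT sizes are arbitrary example values):

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive short-time Fourier transform (Hann window, no padding)."""
    win = np.hanning(n_fft)
    frames = range(0, len(x) - n_fft + 1, hop)
    return np.array([np.fft.rfft(x[i:i + n_fft] * win) for i in frames])

def istft(spec, n_fft=512, hop=128):
    """Overlap-add inverse of the naive STFT above."""
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(spec):
        out[i * hop:i * hop + n_fft] += np.fft.irfft(frame, n_fft) * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=30, n_fft=512, hop=128):
    """Estimate a waveform whose STFT magnitude matches `mag`,
    starting from random phase and refining it each iteration."""
    angles = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        x = istft(mag * angles, n_fft, hop)
        angles = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * angles, n_fft, hop)

# Invert the magnitude spectrogram of a short test tone.
tone = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
mag = np.abs(stft(tone))
recovered = griffin_lim(mag, n_iter=10)
print(recovered.shape)  # (4096,)
```

Listening to such inversions of real extracted spectrograms is exactly how the notebook lets you judge whether your preprocessing parameters retain enough information.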

AWAIT DOCUMENTATION ON HPARAMS SHORTLY!!

Preprocessing

Before running the following steps, please make sure you are inside the Tacotron-2 folder:

cd Tacotron-2

Preprocessing can then be started using:

python preprocess.py

The dataset can be chosen using the --dataset argument. If using the M-AILABS dataset, you need to provide the language, voice, reader, merge_books and book arguments for your custom needs. The default is LJSpeech.

Example M-AILABS:

python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=False --book='northandsouth'

or if you want to use all books for a single speaker:

python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=True

This should take no longer than a few minutes.

Training:

To train both models sequentially (one after the other):

python train.py --model='Tacotron-2'

Feature prediction model can separately be trained using:

python train.py --model='Tacotron'

Checkpoints will be made every 5000 steps and stored under the logs-Tacotron folder.

Naturally, training the wavenet separately is done by:

python train.py --model='WaveNet'

logs will be stored inside logs-Wavenet.

Note:

  • If the model argument is not provided, training defaults to the Tacotron-2 model (both models).
  • Please refer to the train arguments in train.py for the set of options you can use.
  • It is now possible to run WaveNet preprocessing alone using wavenet_preprocess.py.

Synthesis

To synthesize audio in an End-to-End (text to audio) manner (both models at work):

python synthesize.py --model='Tacotron-2'

For the spectrogram prediction network (separately), there are three types of mel spectrograms synthesis:

  • Evaluation (synthesis on custom sentences). This is what we will usually use after having a full end-to-end model.

python synthesize.py --model='Tacotron'

  • Natural synthesis (let the model make predictions alone by feeding last decoder output to the next time step).

python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=False

  • Ground Truth Aligned synthesis (DEFAULT: the model is assisted by true labels in a teacher-forcing manner). This synthesis method is used when predicting the mel spectrograms that will train the WaveNet vocoder (it yields better results, as stated in the paper).

python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True
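The difference between natural and GTA synthesis comes down to what the decoder sees as its previous frame. A toy autoregressive loop makes this concrete (plain Python, not repository code; the step function is a made-up stand-in for one decoder step):

```python
def decode(step_fn, num_steps, targets=None):
    """Run a toy autoregressive decoder.

    targets=None  -> natural synthesis: feed back the model's own output.
    targets given -> ground truth aligned (teacher forcing): feed the true
                     frame instead, keeping generation aligned to the labels.
    """
    outputs, prev = [], 0.0
    for t in range(num_steps):
        out = step_fn(prev)
        outputs.append(out)
        prev = out if targets is None else targets[t]
    return outputs

step = lambda prev: 0.5 * prev + 1.0   # stand-in for one decoder step

natural = decode(step, 4)                             # free-running
gta = decode(step, 4, targets=[2.0, 2.0, 2.0, 2.0])   # teacher-forced
print(natural)  # [1.0, 1.5, 1.75, 1.875]
print(gta)      # [1.0, 2.0, 2.0, 2.0]
```

Because the teacher-forced outputs stay aligned frame-for-frame with the ground truth audio, they make better conditioning features for WaveNet training.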

Synthesizing waveforms conditioned on previously synthesized mel spectrograms (separately) can be done with:

python synthesize.py --model='WaveNet'

Note:

  • If the model argument is not provided, synthesis defaults to the Tacotron-2 model (end-to-end TTS).
  • Please refer to the synthesis arguments in synthesize.py for the set of options you can use.

References and Resources: