NVIDIA/tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Top Related Projects

  • WaveRNN Vocoder + TTS
  • Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
  • Clone a voice in 5 seconds to generate arbitrary speech in real-time
  • WaveNet vocoder
  • A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
  • DeepMind's Tacotron-2 Tensorflow implementation

Quick Overview

NVIDIA/tacotron2 is a PyTorch implementation of Tacotron 2, a neural network architecture for speech synthesis. It converts text to mel spectrograms, which a separate vocoder (such as WaveGlow or WaveNet) then converts to audio. The repository implements the model described in the paper "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions."

Pros

  • High-quality speech synthesis with natural-sounding results
  • Flexible architecture that can be fine-tuned for different voices and languages
  • Includes pre-trained models for quick experimentation
  • Well-documented codebase with clear instructions for training and inference

Cons

  • Requires significant computational resources for training
  • Dependency on specific versions of libraries may cause compatibility issues
  • Limited to generating mel spectrograms; requires additional steps for audio generation
  • May struggle with uncommon words or complex pronunciations

Code Examples

  1. Loading a pre-trained model:
import torch
from tacotron2.model import Tacotron2
from tacotron2.hparams import create_hparams

# Build the model from the default hyperparameters and load the published weights
hparams = create_hparams()
model = Tacotron2(hparams)
checkpoint_path = 'tacotron2_statedict.pt'
model.load_state_dict(torch.load(checkpoint_path)['state_dict'])
model.eval()
  2. Synthesizing mel spectrograms from text:
from tacotron2.text import text_to_sequence

# Encode the text as symbol IDs and run inference to get mel spectrograms
text = "Hello, world!"
sequence = text_to_sequence(text, ['english_cleaners'])
inputs = torch.LongTensor(sequence).unsqueeze(0)
with torch.no_grad():
    mel_outputs, mel_outputs_postnet, _, alignments = model.inference(inputs)
  3. Plotting the generated mel spectrogram:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.imshow(mel_outputs_postnet[0].detach().cpu().numpy(), aspect='auto', origin='lower')
plt.colorbar()
plt.tight_layout()
plt.show()
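  4. Converting the mel spectrogram to audio with WaveGlow. This is a minimal sketch, assuming the published waveglow_256channels_universal_v5.pt checkpoint and the sigma value used in NVIDIA's inference notebook:
# Load the published WaveGlow vocoder checkpoint (filename assumed above)
waveglow = torch.load('waveglow_256channels_universal_v5.pt')['model']
waveglow = waveglow.cuda().eval()

# Vocode the post-net mel spectrogram into an audio waveform
with torch.no_grad():
    audio = waveglow.infer(mel_outputs_postnet.cuda(), sigma=0.666)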

Getting Started

  1. Clone the repository:

    git clone https://github.com/NVIDIA/tacotron2.git
    cd tacotron2
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download pre-trained models:

    wget https://github.com/NVIDIA/tacotron2/releases/download/v1.0/tacotron2_statedict.pt
    
  4. Run inference:

    python inference.py --model='tacotron2' --waveglow_path='waveglow_256channels_universal_v5.pt' --text="Hello, world!"
    

Competitor Comparisons

WaveRNN Vocoder + TTS

Pros of WaveRNN

  • Lighter and faster than Tacotron2, making it more suitable for real-time applications
  • More flexible architecture, allowing for easier customization and experimentation
  • Better support for low-resource environments and edge devices

Cons of WaveRNN

  • May produce slightly lower quality audio compared to Tacotron2 in some cases
  • Less extensive documentation and community support
  • Fewer pre-trained models available out-of-the-box

Code Comparison

WaveRNN:

def forward(self, x, mels):
    x = self.I(x)
    mels = self.mel_upsample(mels.transpose(1, 2))
    mels = self.mel_hidden(mels)
    x = self.rnn1(x, mels)
    return self.fc(x)

Tacotron2:

def forward(self, inputs, input_lengths, mel_inputs=None):
    embedded_inputs = self.embedding(inputs).transpose(1, 2)
    encoder_outputs = self.encoder(embedded_inputs, input_lengths)
    mel_outputs, gate_outputs, alignments = self.decoder(
        encoder_outputs, mel_inputs, memory_lengths=input_lengths)
    return mel_outputs, gate_outputs, alignments

The code snippets show that WaveRNN has a simpler forward pass, focusing on upsampling and RNN processing, while Tacotron2 involves more complex encoder-decoder architecture with attention mechanisms.

Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Pros of TTS

  • More comprehensive and flexible, supporting multiple TTS models and vocoders
  • Active development with frequent updates and community contributions
  • Extensive documentation and tutorials for easier implementation

Cons of TTS

  • Potentially more complex setup due to wider range of options
  • May require more computational resources for some models

Code Comparison

TTS:

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")

Tacotron2:

import torch
from tacotron2.model import Tacotron2
from tacotron2.hparams import create_hparams
from tacotron2.text import text_to_sequence

model = Tacotron2(create_hparams())
text = "Hello world!"
sequence = torch.LongTensor(text_to_sequence(text, ['english_cleaners'])).unsqueeze(0)
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)

Key Differences

  • TTS offers a higher-level API for easier integration
  • Tacotron2 provides more direct access to model internals
  • TTS supports multiple languages and models out-of-the-box
  • Tacotron2 focuses specifically on the Tacotron 2 architecture

Both repositories are valuable for text-to-speech tasks, with TTS offering more versatility and Tacotron2 providing a specialized implementation of a specific model.

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pros of Real-Time-Voice-Cloning

  • Offers real-time voice cloning capabilities
  • Includes a user-friendly interface for easy interaction
  • Combines multiple models (encoder, synthesizer, vocoder) for improved performance

Cons of Real-Time-Voice-Cloning

  • May require more computational resources due to its real-time nature
  • Potentially less optimized for production use compared to Tacotron 2
  • Could have a steeper learning curve for beginners due to its complexity

Code Comparison

Tacotron 2:

# Load a trained Tacotron 2 checkpoint into a constructed model
model = Tacotron2(create_hparams())
model.load_state_dict(torch.load('tacotron2_statedict.pt')['state_dict'])
# Generate a mel-spectrogram from an encoded text sequence
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)

Real-Time-Voice-Cloning:

# Load models
encoder = SpeakerEncoder("encoder.pt")
synthesizer = Synthesizer("synthesizer.pt")
vocoder = WaveRNN("vocoder.pt")
# Generate audio
embed = encoder.embed_utterance(preprocessed_wav)
specs = synthesizer.synthesize_spectrograms([text], [embed])
generated_wav = vocoder.infer_waveform(specs[0])

The code snippets illustrate the different approaches:

  • Tacotron 2 focuses on generating mel-spectrograms
  • Real-Time-Voice-Cloning involves multiple steps: encoding, synthesis, and vocoding

Both repositories offer powerful text-to-speech capabilities, but Real-Time-Voice-Cloning provides more flexibility for voice cloning at the cost of increased complexity.

WaveNet vocoder

Pros of wavenet_vocoder

  • More flexible and adaptable to different languages and datasets
  • Potentially higher quality audio synthesis in some cases
  • Easier to integrate with custom front-end models

Cons of wavenet_vocoder

  • Slower inference time compared to Tacotron 2
  • May require more computational resources for training and inference
  • Less optimized for real-time applications

Code Comparison

wavenet_vocoder:

from wavenet_vocoder import WaveNet

model = WaveNet(
    layers=24,
    stacks=4,
    residual_channels=512,
    gate_channels=512,
    skip_out_channels=256,
    cin_channels=80,
    gin_channels=-1,
    weight_normalization=True,
    n_speakers=None,
    dropout=0.05,
    kernel_size=3
)

Tacotron2:

from tacotron2.model import Tacotron2

model = Tacotron2(
    n_mel_channels=80,
    n_symbols=len(symbols),
    symbols_embedding_dim=512,
    encoder_kernel_size=5,
    encoder_n_convolutions=3,
    encoder_embedding_dim=512,
    attention_rnn_dim=1024,
    attention_dim=128,
    attention_location_n_filters=32,
    attention_location_kernel_size=31
)

The code snippets show the initialization of the main models in each repository. wavenet_vocoder focuses on the WaveNet architecture for audio synthesis, while Tacotron2 implements the full end-to-end text-to-speech model, including the attention mechanism and mel-spectrogram generation.

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

Pros of Tacotron

  • Simpler implementation, easier to understand and modify
  • Supports both character and phoneme inputs
  • More extensive documentation and examples

Cons of Tacotron

  • Generally lower audio quality compared to Tacotron 2
  • Slower inference speed
  • Less robust to out-of-domain text

Code Comparison

Tacotron (TensorFlow):

def decoder_prenet(inputs, is_training, layer_sizes, scope=None):
    x = inputs
    drop_rate = 0.5 if is_training else 0.0
    with tf.variable_scope(scope or 'decoder_prenet'):
        for i, size in enumerate(layer_sizes):
            dense = tf.layers.dense(x, units=size, activation=tf.nn.relu, name='dense_%d' % (i+1))
            x = tf.layers.dropout(dense, rate=drop_rate, training=is_training, name='dropout_%d' % (i+1))
    return x

Tacotron 2 (PyTorch):

class Prenet(nn.Module):
    def __init__(self, in_dim, sizes):
        super(Prenet, self).__init__()
        in_sizes = [in_dim] + sizes[:-1]
        self.layers = nn.ModuleList(
            [LinearNorm(in_size, out_size, bias=False)
             for (in_size, out_size) in zip(in_sizes, sizes)])

    def forward(self, x):
        for linear in self.layers:
            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
        return x
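
Note that the PyTorch Prenet keeps dropout active even at inference time (training=True); this follows the Tacotron 2 paper, which applies pre-net dropout during decoding to introduce output variation, whereas the TensorFlow Tacotron prenet above disables dropout when not training.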

DeepMind's Tacotron-2 Tensorflow implementation

Pros of Tacotron-2

  • Offers more flexibility in model architecture and hyperparameters
  • Includes additional features like multi-speaker support and global style tokens
  • Provides more detailed documentation and explanations of the implementation

Cons of Tacotron-2

  • Less optimized for performance compared to the NVIDIA implementation
  • May require more effort to set up and configure due to additional features
  • Potentially less stable or reliable as it's a community-maintained project

Code Comparison

Tacotron2:

mel_outputs, mel_outputs_postnet, _, alignments = model(inputs)

Tacotron-2:

mel_outputs, mel_outputs_postnet, alignments, stop_tokens = model(inputs)
global_step = sess.run(global_step)

The Tacotron-2 implementation includes additional outputs like stop tokens and explicitly manages the global step, which can be useful for more advanced training scenarios.

Both repositories implement the Tacotron 2 text-to-speech model, but they differ in their focus and features. The NVIDIA version prioritizes performance and simplicity, while the Rayhane-mamah version offers more flexibility and additional features at the cost of potentially increased complexity and setup time.

README

Tacotron 2 (without wavenet)

PyTorch implementation of Natural TTS Synthesis By Conditioning Wavenet On Mel Spectrogram Predictions.

This implementation includes distributed and automatic mixed precision support and uses the LJSpeech dataset.

Distributed and Automatic Mixed Precision support relies on NVIDIA's Apex and AMP.

Visit our website for audio samples using our published Tacotron 2 and WaveGlow models.

Figure: alignment, predicted mel spectrogram, and target mel spectrogram.

Pre-requisites

  1. NVIDIA GPU + CUDA cuDNN

Setup

  1. Download and extract the LJ Speech dataset
  2. Clone this repo: git clone https://github.com/NVIDIA/tacotron2.git
  3. CD into this repo: cd tacotron2
  4. Initialize submodule: git submodule init; git submodule update
  5. Update .wav paths: sed -i -- 's,DUMMY,ljs_dataset_folder/wavs,g' filelists/*.txt (an example filelist line is shown after this list)
    • Alternatively, set load_mel_from_disk=True in hparams.py and update mel-spectrogram paths
  6. Install PyTorch 1.0
  7. Install Apex
  8. Install python requirements or build docker image
    • Install python requirements: pip install -r requirements.txt
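
For reference, each line in the filelists pairs an audio path with its transcript, separated by a pipe; the sed command in step 5 rewrites the DUMMY placeholder to your dataset location (transcript shown here as a placeholder, not copied from the dataset):

    DUMMY/LJ001-0001.wav|<transcript text>
    ljs_dataset_folder/wavs/LJ001-0001.wav|<transcript text>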

Training

  1. python train.py --output_directory=outdir --log_directory=logdir
  2. (OPTIONAL) tensorboard --logdir=outdir/logdir

Training using a pre-trained model

Training using a pre-trained model can lead to faster convergence.
By default, the dataset-dependent text embedding layers are ignored (a sketch of this warm-start loading follows the steps below).

  1. Download our published Tacotron 2 model
  2. python train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start
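
A simplified sketch of what the warm start does internally, assuming the default ignore_layers=['embedding.weight'] entry from hparams.py and an already constructed model:

import torch

# Load the published checkpoint, then drop the dataset-dependent layers
# listed in ignore_layers (by default, the text embedding) before loading.
checkpoint = torch.load('tacotron2_statedict.pt', map_location='cpu')
pretrained_dict = checkpoint['state_dict']
ignore_layers = ['embedding.weight']
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k not in ignore_layers}

model_dict = model.state_dict()
model_dict.update(pretrained_dict)
model.load_state_dict(model_dict)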

Multi-GPU (distributed) and Automatic Mixed Precision Training

  1. python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True

Inference demo

  1. Download our published Tacotron 2 model
  2. Download our published WaveGlow model
  3. jupyter notebook --ip=127.0.0.1 --port=31337
  4. Load inference.ipynb

N.b. When performing Mel-Spectrogram to Audio synthesis, make sure Tacotron 2 and the Mel decoder were trained on the same mel-spectrogram representation.
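
In practice, the audio and STFT settings below must agree between Tacotron 2 and the vocoder; the values shown are the assumed defaults from hparams.py for the published LJSpeech models, so verify them against your own configuration:

# Mel-spectrogram settings shared by Tacotron 2 and the vocoder
# (assumed defaults from hparams.py; verify against your checkpoints)
sampling_rate = 22050
filter_length = 1024
hop_length = 256
win_length = 1024
n_mel_channels = 80
mel_fmin = 0.0
mel_fmax = 8000.0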

Related repos

WaveGlow: Faster than real time Flow-based Generative Network for Speech Synthesis

nv-wavenet: Faster than real time WaveNet.

Acknowledgements

This implementation uses code from the following repos: Keith Ito and Prem Seetharaman, as described in our code.

We are inspired by Ryuichi Yamamoto's Tacotron PyTorch implementation.

We are thankful to the Tacotron 2 paper authors, especially Jonathan Shen, Yuxuan Wang and Zongheng Yang.