NVIDIA/tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Top Related Projects

  • WaveRNN Vocoder + TTS
  • Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
  • Clone a voice in 5 seconds to generate arbitrary speech in real-time
  • WaveNet vocoder
  • A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
  • DeepMind's Tacotron-2 Tensorflow implementation

Quick Overview

NVIDIA/tacotron2 is a PyTorch implementation of Tacotron 2, a neural network architecture for speech synthesis. It converts text to mel spectrograms, which a separate vocoder (such as WaveGlow or WaveNet) then converts to audio. The repository implements the model described in the paper "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions."

Pros

  • High-quality speech synthesis with natural-sounding results
  • Flexible architecture that can be fine-tuned for different voices and languages
  • Includes pre-trained models for quick experimentation
  • Well-documented codebase with clear instructions for training and inference

Cons

  • Requires significant computational resources for training
  • Dependency on specific versions of libraries may cause compatibility issues
  • Limited to generating mel spectrograms; requires additional steps for audio generation
  • May struggle with uncommon words or complex pronunciations

Code Examples

  1. Loading a pre-trained model:
import torch
from tacotron2.model import Tacotron2
from tacotron2.hparams import create_hparams

# Build the model from the default hyperparameters and load the published weights
hparams = create_hparams()
model = Tacotron2(hparams)
checkpoint_path = 'tacotron2_statedict.pt'
model.load_state_dict(torch.load(checkpoint_path)['state_dict'])
model.eval()
  2. Synthesizing mel spectrograms from text:
from tacotron2.text import text_to_sequence

# Encode the text as symbol IDs and run inference to get mel spectrograms
text = "Hello, world!"
sequence = text_to_sequence(text, ['english_cleaners'])
inputs = torch.LongTensor(sequence).unsqueeze(0)
with torch.no_grad():
    mel_outputs, mel_outputs_postnet, _, alignments = model.inference(inputs)
  3. Plotting the generated mel spectrogram:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.imshow(mel_outputs_postnet[0].detach().cpu().numpy(), aspect='auto', origin='lower')
plt.colorbar()
plt.tight_layout()
plt.show()
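  4. Converting the mel spectrogram to audio with WaveGlow. This is a minimal sketch, assuming the published waveglow_256channels_universal_v5.pt checkpoint and the sigma value used in NVIDIA's inference notebook:
# Load the published WaveGlow vocoder checkpoint (filename assumed above)
waveglow = torch.load('waveglow_256channels_universal_v5.pt')['model']
waveglow = waveglow.cuda().eval()

# Vocode the post-net mel spectrogram into an audio waveform
with torch.no_grad():
    audio = waveglow.infer(mel_outputs_postnet.cuda(), sigma=0.666)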

Getting Started

  1. Clone the repository:

    git clone https://github.com/NVIDIA/tacotron2.git
    cd tacotron2
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download pre-trained models:

    wget https://github.com/NVIDIA/tacotron2/releases/download/v1.0/tacotron2_statedict.pt
    
  4. Run inference:

    python inference.py --model='tacotron2' --waveglow_path='waveglow_256channels_universal_v5.pt' --text="Hello, world!"
    

Competitor Comparisons

WaveRNN Vocoder + TTS

Pros of WaveRNN

  • Lighter and faster than Tacotron2, making it more suitable for real-time applications
  • More flexible architecture, allowing for easier customization and experimentation
  • Better support for low-resource environments and edge devices

Cons of WaveRNN

  • May produce slightly lower quality audio compared to Tacotron2 in some cases
  • Less extensive documentation and community support
  • Fewer pre-trained models available out-of-the-box

Code Comparison

WaveRNN:

def forward(self, x, mels):
    x = self.I(x)
    mels = self.mel_upsample(mels.transpose(1, 2))
    mels = self.mel_hidden(mels)
    x = self.rnn1(x, mels)
    return self.fc(x)

Tacotron2:

def forward(self, inputs, input_lengths, mel_inputs=None):
    embedded_inputs = self.embedding(inputs).transpose(1, 2)
    encoder_outputs = self.encoder(embedded_inputs, input_lengths)
    mel_outputs, gate_outputs, alignments = self.decoder(
        encoder_outputs, mel_inputs, memory_lengths=input_lengths)
    return mel_outputs, gate_outputs, alignments

The code snippets show that WaveRNN has a simpler forward pass, focusing on upsampling and RNN processing, while Tacotron2 involves more complex encoder-decoder architecture with attention mechanisms.

Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Pros of TTS

  • More comprehensive and flexible, supporting multiple TTS models and vocoders
  • Active development with frequent updates and community contributions
  • Extensive documentation and tutorials for easier implementation

Cons of TTS

  • Potentially more complex setup due to wider range of options
  • May require more computational resources for some models

Code Comparison

TTS:

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")

Tacotron2:

import torch
from tacotron2.model import Tacotron2
from tacotron2.hparams import create_hparams
from tacotron2.text import text_to_sequence

model = Tacotron2(create_hparams())
text = "Hello world!"
sequence = torch.LongTensor(text_to_sequence(text, ['english_cleaners'])).unsqueeze(0)
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)

Key Differences

  • TTS offers a higher-level API for easier integration
  • Tacotron2 provides more direct access to model internals
  • TTS supports multiple languages and models out-of-the-box
  • Tacotron2 focuses specifically on the Tacotron 2 architecture

Both repositories are valuable for text-to-speech tasks, with TTS offering more versatility and Tacotron2 providing a specialized implementation of a specific model.

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pros of Real-Time-Voice-Cloning

  • Offers real-time voice cloning capabilities
  • Includes a user-friendly interface for easy interaction
  • Combines multiple models (encoder, synthesizer, vocoder) for improved performance

Cons of Real-Time-Voice-Cloning

  • May require more computational resources due to its real-time nature
  • Potentially less optimized for production use compared to Tacotron 2
  • Could have a steeper learning curve for beginners due to its complexity

Code Comparison

Tacotron 2:

# Load a trained Tacotron 2 checkpoint into a constructed model
model = Tacotron2(create_hparams())
model.load_state_dict(torch.load('tacotron2_statedict.pt')['state_dict'])
# Generate a mel-spectrogram from an encoded text sequence
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)

Real-Time-Voice-Cloning:

# Load models
encoder = SpeakerEncoder("encoder.pt")
synthesizer = Synthesizer("synthesizer.pt")
vocoder = WaveRNN("vocoder.pt")
# Generate audio
embed = encoder.embed_utterance(preprocessed_wav)
specs = synthesizer.synthesize_spectrograms([text], [embed])
generated_wav = vocoder.infer_waveform(specs[0])

The code snippets illustrate the different approaches:

  • Tacotron 2 focuses on generating mel-spectrograms
  • Real-Time-Voice-Cloning involves multiple steps: encoding, synthesis, and vocoding

Both repositories offer powerful text-to-speech capabilities, but Real-Time-Voice-Cloning provides more flexibility for voice cloning at the cost of increased complexity.

WaveNet vocoder

Pros of wavenet_vocoder

  • More flexible and adaptable to different languages and datasets
  • Potentially higher quality audio synthesis in some cases
  • Easier to integrate with custom front-end models

Cons of wavenet_vocoder

  • Slower inference time compared to Tacotron 2
  • May require more computational resources for training and inference
  • Less optimized for real-time applications

Code Comparison

wavenet_vocoder:

from wavenet_vocoder import WaveNet

model = WaveNet(
    layers=24,
    stacks=4,
    residual_channels=512,
    gate_channels=512,
    skip_out_channels=256,
    cin_channels=80,
    gin_channels=-1,
    weight_normalization=True,
    n_speakers=None,
    dropout=0.05,
    kernel_size=3
)

Tacotron2:

from tacotron2.model import Tacotron2

model = Tacotron2(
    n_mel_channels=80,
    n_symbols=len(symbols),
    symbols_embedding_dim=512,
    encoder_kernel_size=5,
    encoder_n_convolutions=3,
    encoder_embedding_dim=512,
    attention_rnn_dim=1024,
    attention_dim=128,
    attention_location_n_filters=32,
    attention_location_kernel_size=31
)

The code snippets show the initialization of the main models in each repository. wavenet_vocoder focuses on the WaveNet architecture for audio synthesis, while Tacotron2 implements the full end-to-end text-to-speech model, including the attention mechanism and mel-spectrogram generation.

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

Pros of Tacotron

  • Simpler implementation, easier to understand and modify
  • Supports both character and phoneme inputs
  • More extensive documentation and examples

Cons of Tacotron

  • Generally lower audio quality compared to Tacotron 2
  • Slower inference speed
  • Less robust to out-of-domain text

Code Comparison

Tacotron (TensorFlow):

def decoder_prenet(inputs, is_training, layer_sizes, scope=None):
    x = inputs
    drop_rate = 0.5 if is_training else 0.0
    with tf.variable_scope(scope or 'decoder_prenet'):
        for i, size in enumerate(layer_sizes):
            dense = tf.layers.dense(x, units=size, activation=tf.nn.relu, name='dense_%d' % (i+1))
            x = tf.layers.dropout(dense, rate=drop_rate, training=is_training, name='dropout_%d' % (i+1))
    return x

Tacotron 2 (PyTorch):

class Prenet(nn.Module):
    def __init__(self, in_dim, sizes):
        super(Prenet, self).__init__()
        in_sizes = [in_dim] + sizes[:-1]
        self.layers = nn.ModuleList(
            [LinearNorm(in_size, out_size, bias=False)
             for (in_size, out_size) in zip(in_sizes, sizes)])

    def forward(self, x):
        for linear in self.layers:
            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
        return x
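
Note that the PyTorch Prenet keeps dropout active even at inference time (training=True); this follows the Tacotron 2 paper, which applies pre-net dropout during decoding to introduce output variation, whereas the TensorFlow Tacotron prenet above disables dropout when not training.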

DeepMind's Tacotron-2 Tensorflow implementation

Pros of Tacotron-2

  • Offers more flexibility in model architecture and hyperparameters
  • Includes additional features like multi-speaker support and global style tokens
  • Provides more detailed documentation and explanations of the implementation

Cons of Tacotron-2

  • Less optimized for performance compared to the NVIDIA implementation
  • May require more effort to set up and configure due to additional features
  • Potentially less stable or reliable as it's a community-maintained project

Code Comparison

Tacotron2:

mel_outputs, mel_outputs_postnet, _, alignments = model(inputs)

Tacotron-2:

mel_outputs, mel_outputs_postnet, alignments, stop_tokens = model(inputs)
global_step = sess.run(global_step)

The Tacotron-2 implementation includes additional outputs like stop tokens and explicitly manages the global step, which can be useful for more advanced training scenarios.

Both repositories implement the Tacotron 2 text-to-speech model, but they differ in their focus and features. The NVIDIA version prioritizes performance and simplicity, while the Rayhane-mamah version offers more flexibility and additional features at the cost of potentially increased complexity and setup time.

README

Tacotron 2 (without wavenet)

PyTorch implementation of Natural TTS Synthesis By Conditioning Wavenet On Mel Spectrogram Predictions.

This implementation includes distributed and automatic mixed precision support and uses the LJSpeech dataset.

Distributed and Automatic Mixed Precision support relies on NVIDIA's Apex and AMP.

Visit our website for audio samples using our published Tacotron 2 and WaveGlow models.

Figure: alignment, predicted mel spectrogram, and target mel spectrogram.

Pre-requisites

  1. NVIDIA GPU + CUDA cuDNN

Setup

  1. Download and extract the LJ Speech dataset
  2. Clone this repo: git clone https://github.com/NVIDIA/tacotron2.git
  3. CD into this repo: cd tacotron2
  4. Initialize submodule: git submodule init; git submodule update
  5. Update .wav paths: sed -i -- 's,DUMMY,ljs_dataset_folder/wavs,g' filelists/*.txt (an example filelist line is shown after this list)
    • Alternatively, set load_mel_from_disk=True in hparams.py and update mel-spectrogram paths
  6. Install PyTorch 1.0
  7. Install Apex
  8. Install python requirements or build docker image
    • Install python requirements: pip install -r requirements.txt
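
For reference, each line in the filelists pairs an audio path with its transcript, separated by a pipe; the sed command in step 5 rewrites the DUMMY placeholder to your dataset location (transcript shown here as a placeholder, not copied from the dataset):

    DUMMY/LJ001-0001.wav|<transcript text>
    ljs_dataset_folder/wavs/LJ001-0001.wav|<transcript text>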

Training

  1. python train.py --output_directory=outdir --log_directory=logdir
  2. (OPTIONAL) tensorboard --logdir=outdir/logdir

Training using a pre-trained model

Training using a pre-trained model can lead to faster convergence.
By default, the dataset-dependent text embedding layers are ignored (a sketch of this warm-start loading follows the steps below).

  1. Download our published Tacotron 2 model
  2. python train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start
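
A simplified sketch of what the warm start does internally, assuming the default ignore_layers=['embedding.weight'] entry from hparams.py and an already constructed model:

import torch

# Load the published checkpoint, then drop the dataset-dependent layers
# listed in ignore_layers (by default, the text embedding) before loading.
checkpoint = torch.load('tacotron2_statedict.pt', map_location='cpu')
pretrained_dict = checkpoint['state_dict']
ignore_layers = ['embedding.weight']
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k not in ignore_layers}

model_dict = model.state_dict()
model_dict.update(pretrained_dict)
model.load_state_dict(model_dict)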

Multi-GPU (distributed) and Automatic Mixed Precision Training

  1. python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True

Inference demo

  1. Download our published Tacotron 2 model
  2. Download our published WaveGlow model
  3. jupyter notebook --ip=127.0.0.1 --port=31337
  4. Load inference.ipynb

N.b. When performing Mel-Spectrogram to Audio synthesis, make sure Tacotron 2 and the Mel decoder were trained on the same mel-spectrogram representation.
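
In practice, the audio and STFT settings below must agree between Tacotron 2 and the vocoder; the values shown are the assumed defaults from hparams.py for the published LJSpeech models, so verify them against your own configuration:

# Mel-spectrogram settings shared by Tacotron 2 and the vocoder
# (assumed defaults from hparams.py; verify against your checkpoints)
sampling_rate = 22050
filter_length = 1024
hop_length = 256
win_length = 1024
n_mel_channels = 80
mel_fmin = 0.0
mel_fmax = 8000.0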

Related repos

WaveGlow: Faster than real time Flow-based Generative Network for Speech Synthesis

nv-wavenet: Faster than real time WaveNet.

Acknowledgements

This implementation uses code from the following repos: Keith Ito and Prem Seetharaman, as described in our code.

We are inspired by Ryuichi Yamamoto's Tacotron PyTorch implementation.

We are thankful to the Tacotron 2 paper authors, especially Jonathan Shen, Yuxuan Wang and Zongheng Yang.