Top Related Projects
WaveRNN Vocoder + TTS
:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Clone a voice in 5 seconds to generate arbitrary speech in real-time
WaveNet vocoder
A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
DeepMind's Tacotron-2 Tensorflow implementation
Quick Overview
NVIDIA/tacotron2 is an implementation of Tacotron 2, a neural network architecture for speech synthesis. It converts text to mel spectrograms, which can then be used to generate speech. This repository provides a PyTorch implementation of the model described in the paper "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions."
Pros
- High-quality speech synthesis with natural-sounding results
- Flexible architecture that can be fine-tuned for different voices and languages
- Includes pre-trained models for quick experimentation
- Well-documented codebase with clear instructions for training and inference
Cons
- Requires significant computational resources for training
- Dependency on specific versions of libraries may cause compatibility issues
- Limited to generating mel spectrograms; requires additional steps for audio generation
- May struggle with uncommon words or complex pronunciations
Code Examples
- Loading a pre-trained model:
import torch

from tacotron2.model import Tacotron2
from tacotron2.hparams import create_hparams

hparams = create_hparams()
model = Tacotron2(hparams)
checkpoint_path = 'tacotron2_statedict.pt'
model.load_state_dict(torch.load(checkpoint_path)['state_dict'])
model.eval()  # switch to inference mode
- Synthesizing mel spectrograms from text:
import torch

from tacotron2.text import text_to_sequence

text = "Hello, world!"
sequence = text_to_sequence(text, ['english_cleaners'])
inputs = torch.LongTensor(sequence).unsqueeze(0)  # add a batch dimension
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(inputs)
- Plotting the generated mel spectrogram:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.imshow(mel_outputs_postnet[0].detach().cpu().numpy(), aspect='auto', origin='lower')
plt.colorbar()
plt.tight_layout()
plt.show()
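As the Cons note, these mel spectrograms still need a vocoder to become audio. A minimal sketch of that step with NVIDIA's published WaveGlow checkpoint, following the loading convention used in the repository's inference notebook (the checkpoint filename and the 22050 Hz rate are assumptions based on the published models):
import torch

# Path to the published WaveGlow checkpoint (an assumption; adjust to your download)
waveglow = torch.load('waveglow_256channels_universal_v5.pt')['model']
waveglow.cuda().eval()

with torch.no_grad():
    # mel_outputs_postnet comes from the synthesis example above
    audio = waveglow.infer(mel_outputs_postnet.cuda(), sigma=0.666)
waveform = audio[0].cpu().numpy()  # 1-D waveform, 22050 Hz for the published models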
Getting Started
- Clone the repository:
git clone https://github.com/NVIDIA/tacotron2.git
cd tacotron2
- Install dependencies:
pip install -r requirements.txt
- Download pre-trained models:
wget https://github.com/NVIDIA/tacotron2/releases/download/v1.0/tacotron2_statedict.pt
- Run inference:
python inference.py --model='tacotron2' --waveglow_path='waveglow_256channels_universal_v5.pt' --text="Hello, world!"
Competitor Comparisons
WaveRNN Vocoder + TTS
Pros of WaveRNN
- Lighter and faster than Tacotron2, making it more suitable for real-time applications
- More flexible architecture, allowing for easier customization and experimentation
- Better support for low-resource environments and edge devices
Cons of WaveRNN
- May produce slightly lower quality audio compared to Tacotron2 in some cases
- Less extensive documentation and community support
- Fewer pre-trained models available out-of-the-box
Code Comparison
WaveRNN:
def forward(self, x, mels):
    x = self.I(x)
    mels = self.mel_upsample(mels.transpose(1, 2))
    mels = self.mel_hidden(mels)
    x = self.rnn1(x, mels)
    return self.fc(x)
Tacotron2:
def forward(self, inputs, input_lengths, mel_inputs=None):
    embedded_inputs = self.embedding(inputs).transpose(1, 2)
    encoder_outputs = self.encoder(embedded_inputs, input_lengths)
    mel_outputs, gate_outputs, alignments = self.decoder(
        encoder_outputs, mel_inputs, memory_lengths=input_lengths)
    return mel_outputs, gate_outputs, alignments
The code snippets show that WaveRNN has a simpler forward pass, focusing on upsampling and RNN processing, while Tacotron2 involves more complex encoder-decoder architecture with attention mechanisms.
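Those alignments are easy to inspect: the attention matrix returned by Tacotron2's inference can be plotted directly. A short sketch reusing the alignments tensor from the Code Examples section above (the (decoder steps, encoder steps) layout is an assumption about the returned shape):
import matplotlib.pyplot as plt

# alignments[0]: attention weights for the single batch item
plt.imshow(alignments[0].detach().cpu().numpy().T, aspect='auto', origin='lower')
plt.xlabel('Decoder timestep')
plt.ylabel('Encoder timestep')
plt.title('Tacotron 2 attention alignment')
plt.show()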
:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Pros of TTS
- More comprehensive and flexible, supporting multiple TTS models and vocoders
- Active development with frequent updates and community contributions
- Extensive documentation and tutorials for easier implementation
Cons of TTS
- Potentially more complex setup due to wider range of options
- May require more computational resources for some models
Code Comparison
TTS:
from TTS.api import TTS
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")
Tacotron2:
import torch

from tacotron2_model import Tacotron2
from text import text_to_sequence

model = Tacotron2()
text = "Hello world!"
sequence = torch.LongTensor(text_to_sequence(text, ['english_cleaners'])).unsqueeze(0)
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)
Key Differences
- TTS offers a higher-level API for easier integration
- Tacotron2 provides more direct access to model internals
- TTS supports multiple languages and models out-of-the-box
- Tacotron2 focuses specifically on the Tacotron 2 architecture
Both repositories are valuable for text-to-speech tasks, with TTS offering more versatility and Tacotron2 providing a specialized implementation of a specific model.
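As an illustration of the higher-level API, recent Coqui TTS releases also expose model discovery, so the model name above does not have to be hard-coded; whether list_models is available in this form depends on the installed TTS version (an assumption to verify):
from TTS.api import TTS

# Print the downloadable model names, then pick one for synthesis
print(TTS().list_models())
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")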
Clone a voice in 5 seconds to generate arbitrary speech in real-time
Pros of Real-Time-Voice-Cloning
- Offers real-time voice cloning capabilities
- Includes a user-friendly interface for easy interaction
- Combines multiple models (encoder, synthesizer, vocoder) for improved performance
Cons of Real-Time-Voice-Cloning
- May require more computational resources due to its real-time nature
- Potentially less optimized for production use compared to Tacotron 2
- Could have a steeper learning curve for beginners due to its complexity
Code Comparison
Tacotron 2:
# Load model
model = load_model('tacotron2_model.pth')
# Generate mel-spectrogram
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(text)
Real-Time-Voice-Cloning:
# Load models
encoder = SpeakerEncoder("encoder.pt")
synthesizer = Synthesizer("synthesizer.pt")
vocoder = WaveRNN("vocoder.pt")
# Generate audio
embed = encoder.embed_utterance(preprocessed_wav)
specs = synthesizer.synthesize_spectrograms([text], [embed])
generated_wav = vocoder.infer_waveform(specs[0])
The code snippets illustrate the different approaches:
- Tacotron 2 focuses on generating mel-spectrograms
- Real-Time-Voice-Cloning involves multiple steps: encoding, synthesis, and vocoding
Both repositories offer powerful text-to-speech capabilities, but Real-Time-Voice-Cloning provides more flexibility for voice cloning at the cost of increased complexity.
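A hypothetical final step for the Real-Time-Voice-Cloning pipeline is writing the waveform to disk, for example with soundfile (the 16000 Hz rate reflects the project's default synthesizer configuration and should be checked against your setup):
import soundfile as sf

# generated_wav comes from vocoder.infer_waveform above; the sample rate is an assumption
sf.write("cloned.wav", generated_wav, 16000)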
WaveNet vocoder
Pros of wavenet_vocoder
- More flexible and adaptable to different languages and datasets
- Potentially higher quality audio synthesis in some cases
- Easier to integrate with custom front-end models
Cons of wavenet_vocoder
- Slower inference time compared to Tacotron 2
- May require more computational resources for training and inference
- Less optimized for real-time applications
Code Comparison
wavenet_vocoder:
from wavenet_vocoder import WaveNet
model = WaveNet(
    layers=24,
    stacks=4,
    residual_channels=512,
    gate_channels=512,
    skip_out_channels=256,
    cin_channels=80,
    gin_channels=-1,
    weight_normalization=True,
    n_speakers=None,
    dropout=0.05,
    kernel_size=3,
)
Tacotron2:
from tacotron2.model import Tacotron2
model = Tacotron2(
    n_mel_channels=80,
    n_symbols=len(symbols),
    symbols_embedding_dim=512,
    encoder_kernel_size=5,
    encoder_n_convolutions=3,
    encoder_embedding_dim=512,
    attention_rnn_dim=1024,
    attention_dim=128,
    attention_location_n_filters=32,
    attention_location_kernel_size=31,
)
The code snippets show the initialization of the main models in each repository. wavenet_vocoder focuses on the WaveNet architecture for audio synthesis, while Tacotron2 implements the full end-to-end text-to-speech model, including the attention mechanism and mel-spectrogram generation.
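As a rough sketch of how the two could be combined, a Tacotron 2 mel spectrogram can act as the local conditioning input c of the WaveNet defined above. This assumes wavenet_vocoder's forward(x, c, g) signature and conditioning features already upsampled to the audio rate; all shapes are illustrative:
import torch

batch, timesteps = 1, 16000
x = torch.zeros(batch, 256, timesteps)   # one-hot mu-law audio input (teacher forcing)
mel = torch.randn(batch, 80, timesteps)  # mel features upsampled to the audio rate
with torch.no_grad():
    y_hat = model(x, c=mel, g=None)      # per-sample output distribution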
A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
Pros of Tacotron
- Simpler implementation, easier to understand and modify
- Supports both character and phoneme inputs
- More extensive documentation and examples
Cons of Tacotron
- Generally lower audio quality compared to Tacotron 2
- Slower inference speed
- Less robust to out-of-domain text
Code Comparison
Tacotron (Python):
def decoder_prenet(inputs, is_training, layer_sizes, scope=None):
    x = inputs
    drop_rate = 0.5 if is_training else 0.0
    with tf.variable_scope(scope or 'decoder_prenet'):
        for i, size in enumerate(layer_sizes):
            dense = tf.layers.dense(x, units=size, activation=tf.nn.relu, name='dense_%d' % (i+1))
            x = tf.layers.dropout(dense, rate=drop_rate, training=is_training, name='dropout_%d' % (i+1))
    return x
Tacotron 2 (PyTorch):
class Prenet(nn.Module):
    def __init__(self, in_dim, sizes):
        super(Prenet, self).__init__()
        in_sizes = [in_dim] + sizes[:-1]
        self.layers = nn.ModuleList(
            [LinearNorm(in_size, out_size, bias=False)
             for (in_size, out_size) in zip(in_sizes, sizes)])

    def forward(self, x):
        for linear in self.layers:
            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
        return x
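One detail worth calling out in the PyTorch version: dropout is applied with training=True, so it stays active even at inference time, which Tacotron 2 relies on for output variation. A self-contained sketch of that behavior, with nn.Linear standing in for the repository's LinearNorm:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Prenet(nn.Module):
    def __init__(self, in_dim, sizes):
        super().__init__()
        in_sizes = [in_dim] + sizes[:-1]
        self.layers = nn.ModuleList(
            [nn.Linear(in_size, out_size, bias=False)
             for in_size, out_size in zip(in_sizes, sizes)])

    def forward(self, x):
        for linear in self.layers:
            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
        return x

prenet = Prenet(in_dim=80, sizes=[256, 256]).eval()
x = torch.randn(1, 80)
# Two passes differ even in eval mode because dropout is forced on
print(torch.equal(prenet(x), prenet(x)))  # almost always False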
DeepMind's Tacotron-2 Tensorflow implementation
Pros of Tacotron-2
- Offers more flexibility in model architecture and hyperparameters
- Includes additional features like multi-speaker support and global style tokens
- Provides more detailed documentation and explanations of the implementation
Cons of Tacotron-2
- Less optimized for performance compared to the NVIDIA implementation
- May require more effort to set up and configure due to additional features
- Potentially less stable or reliable as it's a community-maintained project
Code Comparison
Tacotron2:
mel_outputs, mel_outputs_postnet, _, alignments = model(inputs)
Tacotron-2:
mel_outputs, mel_outputs_postnet, alignments, stop_tokens = model(inputs)
global_step = sess.run(global_step)
The Tacotron-2 implementation includes additional outputs like stop tokens and explicitly manages the global step, which can be useful for more advanced training scenarios.
Both repositories implement the Tacotron 2 text-to-speech model, but they differ in their focus and features. The NVIDIA version prioritizes performance and simplicity, while the Rayhane-mamah version offers more flexibility and additional features at the cost of potentially increased complexity and setup time.
README
Tacotron 2 (without wavenet)
PyTorch implementation of Natural TTS Synthesis By Conditioning Wavenet On Mel Spectrogram Predictions.
This implementation includes distributed and automatic mixed precision support and uses the LJSpeech dataset.
Distributed and Automatic Mixed Precision support relies on NVIDIA's Apex and AMP.
Visit our website for audio samples using our published Tacotron 2 and WaveGlow models.
Pre-requisites
- NVIDIA GPU + CUDA cuDNN
Setup
- Download and extract the LJ Speech dataset
- Clone this repo:
git clone https://github.com/NVIDIA/tacotron2.git
- CD into this repo:
cd tacotron2
- Initialize submodule:
git submodule init; git submodule update
- Update .wav paths (an example filelist line is shown after this list):
sed -i -- 's,DUMMY,ljs_dataset_folder/wavs,g' filelists/*.txt
- Alternatively, set load_mel_from_disk=True in hparams.py and update mel-spectrogram paths
- Install PyTorch 1.0
- Install Apex
- Install python requirements or build docker image
- Install python requirements:
pip install -r requirements.txt
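For reference, each filelist line pairs a .wav path with its transcript, separated by a pipe; the sed command above only rewrites the DUMMY prefix. A quick check (the filename follows the repository's filelists directory; the transcript shown is a placeholder):
# Each line looks like: ljs_dataset_folder/wavs/LJ001-0001.wav|<transcript text>
# (before the sed step the path prefix is DUMMY)
with open('filelists/ljs_audio_text_train_filelist.txt') as f:
    print(f.readline().strip())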
Training
python train.py --output_directory=outdir --log_directory=logdir
- (OPTIONAL) tensorboard --logdir=outdir/logdir
Training using a pre-trained model
Training using a pre-trained model can lead to faster convergence.
By default, the dataset-dependent text embedding layers are ignored when warm-starting.
- Download our published Tacotron 2 model
python train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start
Multi-GPU (distributed) and Automatic Mixed Precision Training
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
Inference demo
- Download our published Tacotron 2 model
- Download our published WaveGlow model
- jupyter notebook --ip=127.0.0.1 --port=31337
- Load inference.ipynb
N.b. When performing Mel-Spectrogram to Audio synthesis, make sure Tacotron 2 and the Mel decoder were trained on the same mel-spectrogram representation.
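One way to make that check concrete: the mel representation is defined by a handful of hparams, and these should match whatever the vocoder was trained with (the attribute names follow the repository's hparams.py):
from hparams import create_hparams

hparams = create_hparams()
# These fields define the mel-spectrogram representation
for name in ('sampling_rate', 'filter_length', 'hop_length', 'win_length',
             'n_mel_channels', 'mel_fmin', 'mel_fmax'):
    print(name, getattr(hparams, name))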
Related repos
WaveGlow Faster than real time Flow-based Generative Network for Speech Synthesis
nv-wavenet Faster than real time WaveNet.
Acknowledgements
This implementation uses code from the following repos: Keith Ito, Prem Seetharaman as described in our code.
We are inspired by Ryuichi Yamamoto's Tacotron PyTorch implementation.
We are thankful to the Tacotron 2 paper authors, especially Jonathan Shen, Yuxuan Wang and Zongheng Yang.