Top Related Projects
- TTS: :robot: :speech_balloon: Deep learning for Text to Speech (discussion forum: https://discourse.mozilla.org/c/tts)
- Tacotron2: Tacotron 2 - PyTorch implementation with faster-than-realtime inference
- Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time
- wavenet_vocoder: WaveNet vocoder
- Tacotron: A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
- Tacotron-2: DeepMind's Tacotron-2 Tensorflow implementation
Quick Overview
This repository is a PyTorch implementation of DeepMind's WaveRNN model for efficient audio synthesis. It generates high-quality speech waveforms, is well suited to text-to-speech applications, and provides a complete pipeline for training and inference of WaveRNN models.
Pros
- Fast and efficient audio synthesis
- High-quality output comparable to state-of-the-art models
- Flexible architecture allowing for various configurations
- Includes pre-trained models for quick experimentation
Cons
- Requires significant computational resources for training
- Limited documentation and examples for advanced usage
- May require fine-tuning for specific use cases
- Dependency on older versions of some libraries
Code Examples
- Loading a pre-trained model:

```python
from models.fatchord_version import WaveRNN

# Dimensions follow the example configuration. Note that recent versions of the
# constructor also expect hop_length (which must equal the product of
# upsample_factors, here 5 * 5 * 11 = 275), sample_rate and mode.
model = WaveRNN(rnn_dims=512, fc_dims=512, bits=9, pad=2,
                upsample_factors=(5, 5, 11), feat_dims=80,
                compute_dims=128, res_out_dims=128, res_blocks=10,
                hop_length=275, sample_rate=22050, mode='MOL')
model.load('path/to/pretrained_model.pyt')
```
- Generating audio from mel spectrograms:

```python
import torch

# Assuming 'mel' is your input mel spectrogram
mel = torch.FloatTensor(mel).unsqueeze(0)  # add a batch dimension
output = model.generate(mel, batched=True, target=11000, overlap=550)
```
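To listen to the result, write the generated samples to disk. A minimal sketch, assuming output is a 1-D float array at the model's 22050 Hz sample rate (soundfile is an extra dependency, not part of this repo):

```python
import numpy as np
import soundfile as sf

# 'output' is assumed to hold float samples in [-1, 1]
sf.write('generated.wav', np.asarray(output, dtype=np.float32), 22050)
```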
- Training the model:

```python
import torch.optim as optim

from utils.dataset import get_vocoder_datasets
from utils.distribution import discretized_mix_logistic_loss

# 'model', 'paths', 'batch_size' and 'train_gta' are assumed to be defined
# as in the previous examples / the repo's training script
optimizer = optim.Adam(model.parameters())
train_set, test_set = get_vocoder_datasets(paths, batch_size, train_gta)

for i, (x, y, m) in enumerate(train_set, 1):
    y_hat = model(x, m)                             # predict the next-sample distribution
    loss = discretized_mix_logistic_loss(y_hat, y)  # MoL loss, matching mode='MOL'
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Getting Started
1. Clone the repository:

```bash
git clone https://github.com/fatchord/WaveRNN.git
cd WaveRNN
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Download pre-trained models:

```bash
python scripts/download_models.py
```

4. Generate audio using a pre-trained model:

```python
from utils.text.symbols import symbols
from models.tacotron import Tacotron
from models.fatchord_version import WaveRNN

tts_model = Tacotron(embed_dims=256, num_chars=len(symbols))
tts_model.load('pretrained/tts_model.pyt')

voc_model = WaveRNN(rnn_dims=512, fc_dims=512, bits=9, pad=2,
                    upsample_factors=(5, 5, 11), feat_dims=80,
                    compute_dims=128, res_out_dims=128, res_blocks=10)
voc_model.load('pretrained/voc_model.pyt')

text = "Hello, world!"
_, mel, _ = tts_model.generate(text)
audio = voc_model.generate(mel)
```
Competitor Comparisons
:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Pros of TTS
- More comprehensive TTS toolkit with multiple models and voice conversion
- Active development and regular updates
- Better documentation and examples for ease of use
Cons of TTS
- Higher complexity due to multiple models and features
- Potentially higher resource requirements for training and inference
- Steeper learning curve for beginners
Code Comparison
WaveRNN:

```python
model = Model(rnn_dims=512, fc_dims=512, bits=9, pad=2,
              upsample_factors=(5, 5, 8), feat_dims=80,
              compute_dims=128, res_out_dims=128, res_blocks=10)
```

TTS:

```python
from TTS.utils.synthesizer import Synthesizer

synthesizer = Synthesizer(
    tts_checkpoint="path/to/tts_model.pth",
    tts_config_path="path/to/tts_config.json",
    vocoder_checkpoint="path/to/vocoder_model.pth",
    vocoder_config="path/to/vocoder_config.json"
)
```
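Once constructed, the Synthesizer renders text to a waveform in one call; a brief usage sketch (the tts and save_wav methods follow the TTS Synthesizer API, output path hypothetical):

```python
# Synthesize a waveform from text, then write it to disk
wav = synthesizer.tts("This is a test sentence.")
synthesizer.save_wav(wav, "tts_output.wav")
```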
Summary
TTS offers a more comprehensive toolkit with multiple models and voice conversion capabilities, while WaveRNN focuses specifically on waveform generation. TTS benefits from active development and better documentation but may be more complex for beginners. WaveRNN is simpler and potentially easier to get started with for those focusing solely on waveform synthesis.
Tacotron 2 - PyTorch implementation with faster-than-realtime inference
Pros of Tacotron2
- Highly optimized for NVIDIA GPUs, offering faster training and inference
- Includes pre-trained models for quick experimentation and deployment
- Comprehensive documentation and examples for ease of use
Cons of Tacotron2
- Limited flexibility for customization compared to WaveRNN
- Requires more computational resources, especially for training
- Less suitable for low-resource environments or non-NVIDIA hardware
Code Comparison
WaveRNN:

```python
batch = torch2gpu(batch)
y_hat = model(x, mels)
loss = criterion(y_hat, y)
```

Tacotron2:

```python
y, _, alignments = model(text_padded, input_lengths, mel_padded)
loss = criterion(y, mel_padded)
```
Key Differences
- WaveRNN focuses on efficient waveform generation, while Tacotron2 is a complete text-to-speech system
- Tacotron2 uses a sequence-to-sequence model with attention, whereas WaveRNN employs a recurrent neural network
- WaveRNN is more lightweight and adaptable to various use cases, while Tacotron2 is optimized for high-quality speech synthesis on powerful hardware
Use Cases
- Choose WaveRNN for:
  - Resource-constrained environments
  - Custom vocoder development
  - Flexibility in model architecture
- Choose Tacotron2 for:
  - High-quality speech synthesis on NVIDIA GPUs
  - Quick deployment with pre-trained models
  - Integration with NVIDIA's ecosystem of tools
Clone a voice in 5 seconds to generate arbitrary speech in real-time
Pros of Real-Time-Voice-Cloning
- Offers end-to-end voice cloning capabilities, including speaker encoding, synthesis, and vocoding
- Provides a user-friendly interface for real-time voice cloning demonstrations
- Includes pre-trained models for immediate use
Cons of Real-Time-Voice-Cloning
- May require more computational resources due to its comprehensive nature
- Potentially more complex to set up and customize for specific use cases
- Less focused on specific vocoding techniques compared to WaveRNN
Code Comparison
Real-Time-Voice-Cloning:

```python
from encoder.params_model import model_embedding_size as speaker_embedding_size
from utils.argutils import print_args
from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder import inference as vocoder
```

WaveRNN:

```python
from utils.dataset import get_vocoder_datasets
from utils.dsp import *
from models.fatchord_version import WaveRNN
from utils.paths import Paths
from utils.display import simple_table
```
The code snippets show that Real-Time-Voice-Cloning imports modules for the entire voice cloning pipeline, while WaveRNN focuses specifically on the vocoder component. This reflects the broader scope of Real-Time-Voice-Cloning compared to the more specialized focus of WaveRNN on neural vocoding.
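To make that broader scope concrete, here is a sketch of how the three Real-Time-Voice-Cloning stages compose, based on the imports above (file paths are hypothetical, and constructor arguments vary between releases):

```python
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Load pretrained weights for each stage (paths hypothetical)
encoder.load_model("encoder.pt")
synthesizer = Synthesizer("synthesizer.pt")
vocoder.load_model("vocoder.pt")

# 1) Embed the reference speaker, 2) synthesize a mel spectrogram, 3) vocode it
ref_wav = encoder.preprocess_wav("reference.wav")
embed = encoder.embed_utterance(ref_wav)
specs = synthesizer.synthesize_spectrograms(["Hello, world!"], [embed])
cloned_audio = vocoder.infer_waveform(specs[0])
```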
WaveNet vocoder
Pros of wavenet_vocoder
- Implements the original WaveNet architecture, which can produce high-quality audio
- Supports both autoregressive and non-autoregressive generation
- Well-documented and includes pre-trained models
Cons of wavenet_vocoder
- Slower generation speed compared to WaveRNN
- Higher computational requirements for training and inference
- Less efficient for real-time applications
Code Comparison
WaveRNN:

```python
def forward(self, x, mels):
    x = self.rnn1(x, mels)
    x = self.fc1(x)
    return self.fc2(x)
```

wavenet_vocoder:

```python
def forward(self, x, c=None):
    x = self.first_conv(x)
    skips = None
    for f in self.conv_layers:
        x, h = f(x, c)
        if skips is None:
            skips = h
        else:
            skips += h
    x = skips
    for f in self.last_conv_layers:
        x = f(x)
    return x
```
The code comparison shows that WaveRNN uses a simpler architecture with RNN and fully connected layers, while wavenet_vocoder implements the more complex WaveNet architecture with multiple convolutional layers and skip connections.
A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
Pros of Tacotron
- Implements the complete Tacotron architecture for end-to-end text-to-speech synthesis
- Provides pre-trained models and detailed instructions for training and inference
- Includes a simple web application for demo purposes
Cons of Tacotron
- Focuses solely on the Tacotron model, lacking the flexibility of WaveRNN's modular approach
- May require more computational resources for training and inference compared to WaveRNN
- Less active development and community support in recent years
Code Comparison
Tacotron (model definition):

```python
class Tacotron():
    def __init__(self):
        self.encoder = Encoder()
        self.decoder = Decoder()
        self.postnet = Postnet()
```

WaveRNN (model definition):

```python
class Model(nn.Module):
    def __init__(self, rnn_dims, fc_dims, bits, pad, upsample_factors,
                 feat_dims, compute_dims, res_out_dims, res_blocks):
        super().__init__()
        self.n_classes = 2 ** bits
        self.rnn_dims = rnn_dims
        self.aux_dims = res_out_dims // 4
        self.upsample = UpsampleNetwork(upsample_factors, feat_dims)
        self.I = nn.Linear(feat_dims + self.aux_dims + 1, rnn_dims)
        self.rnn1 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
        self.rnn2 = nn.GRU(rnn_dims + self.aux_dims, rnn_dims, batch_first=True)
        self.fc1 = nn.Linear(rnn_dims + self.aux_dims, fc_dims)
        self.fc2 = nn.Linear(fc_dims + self.aux_dims, fc_dims)
        self.fc3 = nn.Linear(fc_dims, self.n_classes)
```
DeepMind's Tacotron-2 Tensorflow implementation
Pros of Tacotron-2
- Implements the complete Tacotron 2 architecture, including both the text-to-mel spectrogram model and the WaveNet vocoder
- Provides pre-trained models and detailed instructions for training and inference
- Supports both English and Chinese languages
Cons of Tacotron-2
- May require more computational resources due to the full Tacotron 2 implementation
- Less focused on real-time synthesis compared to WaveRNN
Code Comparison
Tacotron-2:

```python
def inference(self, inputs, input_lengths, speaker_embeddings=None):
    batch_size = inputs.size(0)
    embedded_inputs = self.embedding(inputs).transpose(1, 2)
    encoder_outputs = self.encoder(embedded_inputs, input_lengths)
    # ... decoding and postnet steps follow
```

WaveRNN:

```python
def forward(self, x, mels):
    if self.mode == 'RAW':
        return self.forward_raw(x, mels)
    elif self.mode == 'MOL':
        return self.forward_mol(x, mels)
    return None
```
The Tacotron-2 code snippet shows the inference process for the text-to-mel spectrogram model, while the WaveRNN code focuses on the vocoder part, demonstrating different approaches to speech synthesis.
README
WaveRNN
(Update: Vanilla Tacotron One TTS system just implemented - more coming soon!)
PyTorch implementation of DeepMind's WaveRNN model from Efficient Neural Audio Synthesis
Installation
Ensure you have:
- Python >= 3.6
- PyTorch 1 with CUDA

Then install the rest with pip:

```bash
pip install -r requirements.txt
```
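Since training and fast generation assume a CUDA build of PyTorch, it's worth verifying the environment before going further; a quick sanity check:

```python
import torch

print(torch.__version__)          # expect 1.x
print(torch.cuda.is_available())  # expect True if a CUDA GPU is set up
```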
How to Use
Quick Start
If you want to use TTS functionality immediately, you can simply use:

```bash
python quick_start.py
```

This will generate everything in the default sentences.txt file and output to a new 'quick_start' folder, where you can play back the wav files and take a look at the attention plots.

You can also use that script to generate custom TTS sentences and/or use '-u' to generate unbatched audio (better audio quality):

```bash
python quick_start.py -u --input_text "What will happen if I run this command?"
```
Training your own Models
Download the LJSpeech Dataset.

Edit hparams.py, point wav_path at your dataset, and run:

```bash
python preprocess.py
```

or use preprocess.py --path to point directly at the dataset.
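For reference, the wav_path edit is a one-line change; a sketch of the relevant hparams.py line (the path below is hypothetical, and variable names may differ between versions):

```python
# hparams.py
wav_path = '/data/LJSpeech-1.1/wavs/'  # hypothetical: point at your dataset's wav folder
```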
Here's my recommendation on what order to run things:

1 - Train Tacotron with:

```bash
python train_tacotron.py
```

2 - You can let that finish training, or at any point you can use:

```bash
python train_tacotron.py --force_gta
```

This will force Tacotron to create a GTA (Ground Truth Aligned) dataset even if it hasn't finished training, so that WaveRNN can train on the mel spectrograms Tacotron actually produces.

3 - Train WaveRNN with:

```bash
python train_wavernn.py --gta
```

NB: You can always just run train_wavernn.py without --gta if you're not interested in TTS.

4 - Generate sentences with both models using:

```bash
python gen_tacotron.py wavernn
```

This will generate the default sentences. If you want to generate custom sentences, you can use:

```bash
python gen_tacotron.py --input_text "this is whatever you want it to be" wavernn
```

And finally, you can always use --help on any of those scripts to see what options are available :)
Samples
Pretrained Models
Currently there are two pretrained models available in the /pretrained/ folder:
Both are trained on LJSpeech
- WaveRNN (Mixture of Logistics output) trained to 800k steps
- Tacotron trained to 180k steps
References
- Efficient Neural Audio Synthesis (arXiv:1802.08435)
- Tacotron: Towards End-to-End Speech Synthesis (arXiv:1703.10135)
- Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (arXiv:1712.05884)
Acknowledgements
- https://github.com/keithito/tacotron
- https://github.com/r9y9/wavenet_vocoder
- Special thanks to GitHub users G-Wang, geneing & erogol