
fatchord / WaveRNN

WaveRNN Vocoder + TTS


Top Related Projects

  • TTS: :robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
  • Tacotron2: PyTorch implementation of Tacotron 2 with faster-than-realtime inference
  • Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time
  • wavenet_vocoder: WaveNet vocoder
  • Tacotron: A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
  • Tacotron-2: DeepMind's Tacotron-2 TensorFlow implementation

Quick Overview

WaveRNN is a PyTorch implementation of the WaveRNN model for efficient audio synthesis. It is designed to generate high-quality speech waveforms and can be used for text-to-speech applications. The repository provides a complete pipeline for training and inference of WaveRNN models.

Pros

  • Fast and efficient audio synthesis
  • High-quality output comparable to state-of-the-art models
  • Flexible architecture allowing for various configurations
  • Includes pre-trained models for quick experimentation

Cons

  • Requires significant computational resources for training
  • Limited documentation and examples for advanced usage
  • May require fine-tuning for specific use cases
  • Dependency on older versions of some libraries

Code Examples

  1. Loading a pre-trained model:

from models.fatchord_version import WaveRNN

# These hyperparameters must match the checkpoint being loaded; depending on
# the repository version, extra constructor arguments (e.g. hop_length,
# sample_rate, mode) may also be required.
model = WaveRNN(rnn_dims=512, fc_dims=512, bits=9, pad=2,
                upsample_factors=(5, 5, 11), feat_dims=80,
                compute_dims=128, res_out_dims=128, res_blocks=10)

model.load('path/to/pretrained_model.pyt')
  2. Generating audio from mel spectrograms:

import torch

# 'mel' is an input mel spectrogram (feat_dims bins); add a batch dimension first.
mel = torch.FloatTensor(mel).unsqueeze(0)
output = model.generate(mel, batched=True, target=11000, overlap=550)
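
The call returns the generated waveform array. A minimal sketch for writing it to disk, assuming the soundfile package and LJSpeech's 22,050 Hz sample rate (substitute whatever rate your model was trained on):

import soundfile as sf

# 22050 Hz matches LJSpeech; use your model's training sample rate otherwise.
sf.write('generated.wav', output, 22050)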
  3. Training the model:

import torch.optim as optim

from utils.dataset import get_vocoder_datasets
from utils.distribution import discretized_mix_logistic_loss

# 'paths', 'batch_size' and 'train_gta' come from your own configuration.
train_set, test_set = get_vocoder_datasets(paths, batch_size, train_gta)

# Example optimizer; pick a learning rate that suits your setup.
optimizer = optim.Adam(model.parameters(), lr=1e-4)

for i, (x, y, m) in enumerate(train_set, 1):
    y_hat = model(x, m)                             # predict output distribution
    loss = discretized_mix_logistic_loss(y_hat, y)  # mixture-of-logistics loss (MOL mode)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
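
For longer training runs you will want to checkpoint the weights. A minimal sketch using plain torch.save; the path and naming scheme here are only an example:

import torch

# Save the vocoder weights periodically; adjust path and frequency to taste.
torch.save(model.state_dict(), f'checkpoints/wave_rnn_step_{i}.pyt')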

Getting Started

  1. Clone the repository:

    git clone https://github.com/fatchord/WaveRNN.git
    cd WaveRNN
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download pre-trained models:

    python scripts/download_models.py
    
  4. Generate audio using a pre-trained model:

    from utils.text.symbols import symbols
    from models.tacotron import Tacotron
    from models.fatchord_version import WaveRNN
    
    tts_model = Tacotron(embed_dims=256, num_chars=len(symbols))
    tts_model.load('pretrained/tts_model.pyt')
    
    voc_model = WaveRNN(rnn_dims=512, fc_dims=512, bits=9, pad=2,
                        upsample_factors=(5, 5, 11), feat_dims=80,
                        compute_dims=128, res_out_dims=128, res_blocks=10)
    voc_model.load('pretrained/voc_model.pyt')
    
    text = "Hello, world!"
    _, mel, _ = tts_model.generate(text)
    audio = voc_model.generate(mel)
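    # Optional: unbatched generation is slower but typically sounds better
    # (the same trade-off quick_start.py exposes with its -u flag). The
    # target/overlap values below are assumed to match the batched call shown earlier.
    audio = voc_model.generate(mel, batched=False, target=11000, overlap=550)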
    

Competitor Comparisons

TTS: :robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Pros of TTS

  • More comprehensive TTS toolkit with multiple models and voice conversion
  • Active development and regular updates
  • Better documentation and examples for ease of use

Cons of TTS

  • Higher complexity due to multiple models and features
  • Potentially higher resource requirements for training and inference
  • Steeper learning curve for beginners

Code Comparison

WaveRNN:

model = Model(rnn_dims=512, fc_dims=512, bits=9, pad=2,
              upsample_factors=(5,5,8), feat_dims=80,
              compute_dims=128, res_out_dims=128, res_blocks=10)

TTS:

from TTS.utils.synthesizer import Synthesizer
synthesizer = Synthesizer(
    tts_checkpoint="path/to/tts_model.pth",
    tts_config_path="path/to/tts_config.json",
    vocoder_checkpoint="path/to/vocoder_model.pth",
    vocoder_config="path/to/vocoder_config.json"
)

Summary

TTS offers a more comprehensive toolkit with multiple models and voice conversion capabilities, while WaveRNN focuses specifically on waveform generation. TTS benefits from active development and better documentation but may be more complex for beginners. WaveRNN is simpler and potentially easier to get started with for those focusing solely on waveform synthesis.

Tacotron2: PyTorch implementation of Tacotron 2 with faster-than-realtime inference

Pros of Tacotron2

  • Highly optimized for NVIDIA GPUs, offering faster training and inference
  • Includes pre-trained models for quick experimentation and deployment
  • Comprehensive documentation and examples for ease of use

Cons of Tacotron2

  • Limited flexibility for customization compared to WaveRNN
  • Requires more computational resources, especially for training
  • Less suitable for low-resource environments or non-NVIDIA hardware

Code Comparison

WaveRNN:

batch = torch2gpu(batch)
y_hat = model(x, mels)
loss = criterion(y_hat, y)

Tacotron2:

y, _, alignments = model(text_padded, input_lengths, mel_padded)
loss = criterion(y, mel_padded)

Key Differences

  • WaveRNN focuses on efficient waveform generation, while Tacotron2 is a complete text-to-speech system
  • Tacotron2 uses a sequence-to-sequence model with attention, whereas WaveRNN is a recurrent neural network that generates the waveform sample by sample (see the sketch after this list)
  • WaveRNN is more lightweight and adaptable to various use cases, while Tacotron2 is optimized for high-quality speech synthesis on powerful hardware
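
To make the contrast concrete, here is a conceptual sketch (illustrative only, not the repository's code) of the sample-by-sample loop a WaveRNN-style vocoder runs; the dimensions and sampling scheme are simplified:

import torch
import torch.nn as nn

# A recurrent cell conditioned on the current mel frame plus the previously
# generated sample, emitting one quantised audio sample per step.
feat_dims, rnn_dims, n_classes = 80, 512, 512   # 512 classes ~ 9-bit audio

cell = nn.GRUCell(feat_dims + 1, rnn_dims)
head = nn.Linear(rnn_dims, n_classes)

mels = torch.randn(1, 1000, feat_dims)          # mel frames upsampled to the sample rate
h = torch.zeros(1, rnn_dims)
sample = torch.zeros(1, 1)
audio = []
for t in range(mels.size(1)):
    h = cell(torch.cat([mels[:, t], sample], dim=-1), h)
    probs = torch.softmax(head(h), dim=-1)
    idx = torch.multinomial(probs, 1).float()    # sample a quantised level
    sample = idx / (n_classes - 1) * 2.0 - 1.0   # map back to [-1, 1]
    audio.append(sample)
waveform = torch.cat(audio, dim=1)               # shape (1, num_samples)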

Use Cases

  • Choose WaveRNN for:

    • Resource-constrained environments
    • Custom vocoder development
    • Flexibility in model architecture
  • Choose Tacotron2 for:

    • High-quality speech synthesis on NVIDIA GPUs
    • Quick deployment with pre-trained models
    • Integration with NVIDIA's ecosystem of tools

Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pros of Real-Time-Voice-Cloning

  • Offers end-to-end voice cloning capabilities, including speaker encoding, synthesis, and vocoding
  • Provides a user-friendly interface for real-time voice cloning demonstrations
  • Includes pre-trained models for immediate use

Cons of Real-Time-Voice-Cloning

  • May require more computational resources due to its comprehensive nature
  • Potentially more complex to set up and customize for specific use cases
  • Less focused on specific vocoding techniques compared to WaveRNN

Code Comparison

Real-Time-Voice-Cloning

from encoder.params_model import model_embedding_size as speaker_embedding_size
from utils.argutils import print_args
from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder import inference as vocoder

WaveRNN

from utils.dataset import get_vocoder_datasets
from utils.dsp import *
from models.fatchord_version import WaveRNN
from utils.paths import Paths
from utils.display import simple_table

The code snippets show that Real-Time-Voice-Cloning imports modules for the entire voice cloning pipeline, while WaveRNN focuses specifically on the vocoder component. This reflects the broader scope of Real-Time-Voice-Cloning compared to the more specialized focus of WaveRNN on neural vocoding.

wavenet_vocoder: WaveNet vocoder

Pros of wavenet_vocoder

  • Implements the original WaveNet architecture, which can produce high-quality audio
  • Supports both autoregressive and non-autoregressive generation
  • Well-documented and includes pre-trained models

Cons of wavenet_vocoder

  • Slower generation speed compared to WaveRNN
  • Higher computational requirements for training and inference
  • Less efficient for real-time applications

Code Comparison

WaveRNN:

def forward(self, x, mels):
    x = self.rnn1(x, mels)
    x = self.fc1(x)
    return self.fc2(x)

wavenet_vocoder:

def forward(self, x, c=None):
    x = self.first_conv(x)
    skips = None
    for f in self.conv_layers:
        x, h = f(x, c)
        if skips is None:
            skips = h
        else:
            skips += h
    x = skips
    for f in self.last_conv_layers:
        x = f(x)
    return x

The code comparison shows that WaveRNN uses a simpler architecture with RNN and fully connected layers, while wavenet_vocoder implements the more complex WaveNet architecture with multiple convolutional layers and skip connections.
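
For contrast, a similarly simplified sketch (again, not the repository's code) of a WaveNet-style stack of dilated convolutions whose outputs are accumulated as skip connections; gated activations and output layers are omitted:

import torch
import torch.nn as nn

channels, layers = 64, 6
convs = nn.ModuleList(
    nn.Conv1d(channels, channels, kernel_size=3, dilation=2 ** i, padding=2 ** i)
    for i in range(layers)
)

x = torch.randn(1, channels, 16000)   # (batch, channels, samples)
skips = torch.zeros_like(x)
for conv in convs:
    h = torch.tanh(conv(x))           # simplified; real WaveNet uses gated activations
    skips = skips + h                 # accumulate skip connections
    x = x + h                         # residual connection
out = skips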

Tacotron: A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

Pros of Tacotron

  • Implements the complete Tacotron architecture for end-to-end text-to-speech synthesis
  • Provides pre-trained models and detailed instructions for training and inference
  • Includes a simple web application for demo purposes

Cons of Tacotron

  • Focuses solely on the Tacotron model, lacking the flexibility of WaveRNN's modular approach
  • May require more computational resources for training and inference compared to WaveRNN
  • Less active development and community support in recent years

Code Comparison

Tacotron (model definition):

class Tacotron():
  def __init__(self):
    self.encoder = Encoder()
    self.decoder = Decoder()
    self.postnet = Postnet()

WaveRNN (model definition):

class Model(nn.Module):
    def __init__(self, rnn_dims, fc_dims, bits, pad, upsample_factors,
                 feat_dims, compute_dims, res_out_dims, res_blocks):
        super().__init__()
        self.n_classes = 2**bits
        self.rnn_dims = rnn_dims
        self.aux_dims = res_out_dims // 4
        self.upsample = UpsampleNetwork(upsample_factors, feat_dims)
        self.I = nn.Linear(feat_dims + self.aux_dims + 1, rnn_dims)
        self.rnn1 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
        self.rnn2 = nn.GRU(rnn_dims + self.aux_dims, rnn_dims, batch_first=True)
        self.fc1 = nn.Linear(rnn_dims + self.aux_dims, fc_dims)
        self.fc2 = nn.Linear(fc_dims + self.aux_dims, fc_dims)
        self.fc3 = nn.Linear(fc_dims, self.n_classes)

Tacotron-2: DeepMind's Tacotron-2 TensorFlow implementation

Pros of Tacotron-2

  • Implements the complete Tacotron 2 architecture, including both the text-to-mel spectrogram model and the WaveNet vocoder
  • Provides pre-trained models and detailed instructions for training and inference
  • Supports both English and Chinese languages

Cons of Tacotron-2

  • May require more computational resources due to the full Tacotron 2 implementation
  • Less focused on real-time synthesis compared to WaveRNN

Code Comparison

Tacotron-2:

def inference(self, inputs, input_lengths, speaker_embeddings=None):
    batch_size = inputs.size(0)
    embedded_inputs = self.embedding(inputs).transpose(1, 2)
    encoder_outputs = self.encoder(embedded_inputs, input_lengths)

WaveRNN:

def forward(self, x, mels):
    if self.mode == 'RAW':
        return self.forward_raw(x, mels)
    elif self.mode == 'MOL':
        return self.forward_mol(x, mels)
    return None

The Tacotron-2 code snippet shows the inference process for the text-to-mel spectrogram model, while the WaveRNN code focuses on the vocoder part, demonstrating different approaches to speech synthesis.


README

WaveRNN

(Update: Vanilla Tacotron One TTS system just implemented - more coming soon!)

(Diagram: Tacotron with WaveRNN)

PyTorch implementation of DeepMind's WaveRNN model from Efficient Neural Audio Synthesis.

Installation

Ensure you have Python 3.6+ and PyTorch installed (with CUDA if you plan to train on a GPU).

Then install the rest with pip:

pip install -r requirements.txt

How to Use

Quick Start

If you want to use the TTS functionality immediately, you can simply run:

python quick_start.py

This will synthesise everything in the default sentences.txt file and output to a new 'quick_start' folder, where you can play back the wav files and take a look at the attention plots.

You can also use that script to generate custom TTS sentences and/or pass '-u' to generate unbatched audio (slower, but better quality):

python quick_start.py -u --input_text "What will happen if I run this command?"

Training your own Models

(Animation: attention and mel spectrogram training progress)

Download the LJSpeech Dataset.

Edit hparams.py, point wav_path to your dataset and run:

python preprocess.py

or use preprocess.py --path to point directly to the dataset.
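
For example, assuming LJSpeech was extracted so that the wav files live under data/LJSpeech-1.1/wavs (the path is yours to choose), either of the following works:

# in hparams.py
wav_path = 'data/LJSpeech-1.1/wavs/'

# or directly on the command line
python preprocess.py --path data/LJSpeech-1.1/wavs/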


Here's my recommendation on what order to run things:

1 - Train Tacotron with:

python train_tacotron.py

2 - You can let that finish training, or at any point you can run:

python train_tacotron.py --force_gta

This will force Tacotron to create a GTA (ground truth aligned) dataset even if it hasn't finished training.

3 - Train WaveRNN with:

python train_wavernn.py --gta

NB: You can always just run train_wavernn.py without --gta if you're not interested in TTS.

4 - Generate Sentences with both models using:

python gen_tacotron.py wavernn

This will generate the default sentences. If you want to generate custom sentences you can use:

python gen_tacotron.py --input_text "this is whatever you want it to be" wavernn

And finally, you can always use --help on any of those scripts to see what options are available :)

Samples

Can be found here.

Pretrained Models

Currently there are two pretrained models available in the /pretrained/ folder.

Both are trained on LJSpeech:

  • WaveRNN (Mixture of Logistics output) trained to 800k steps
  • Tacotron trained to 180k steps

References

Acknowledgements