
fatchord / WaveRNN

WaveRNN Vocoder + TTS


Top Related Projects

  • TTS: :robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
  • Tacotron2: PyTorch implementation of Tacotron 2 with faster-than-realtime inference
  • Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time
  • wavenet_vocoder: WaveNet vocoder
  • Tacotron: A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
  • Tacotron-2: DeepMind's Tacotron-2 TensorFlow implementation

Quick Overview

WaveRNN is a PyTorch implementation of the WaveRNN model for efficient audio synthesis. It is designed to generate high-quality speech waveforms and can be used for text-to-speech applications. The repository provides a complete pipeline for training and inference of WaveRNN models.

Pros

  • Fast and efficient audio synthesis
  • High-quality output comparable to state-of-the-art models
  • Flexible architecture allowing for various configurations
  • Includes pre-trained models for quick experimentation

Cons

  • Requires significant computational resources for training
  • Limited documentation and examples for advanced usage
  • May require fine-tuning for specific use cases
  • Dependency on older versions of some libraries

Code Examples

  1. Loading a pre-trained model:

from models.fatchord_version import WaveRNN

# These hyperparameters must match the checkpoint being loaded; depending on
# the repository version, extra constructor arguments (e.g. hop_length,
# sample_rate, mode) may also be required.
model = WaveRNN(rnn_dims=512, fc_dims=512, bits=9, pad=2,
                upsample_factors=(5, 5, 11), feat_dims=80,
                compute_dims=128, res_out_dims=128, res_blocks=10)

model.load('path/to/pretrained_model.pyt')
  2. Generating audio from mel spectrograms:

import torch

# 'mel' is an input mel spectrogram (feat_dims bins); add a batch dimension first.
mel = torch.FloatTensor(mel).unsqueeze(0)
output = model.generate(mel, batched=True, target=11000, overlap=550)
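
The call returns the generated waveform array. A minimal sketch for writing it to disk, assuming the soundfile package and LJSpeech's 22,050 Hz sample rate (substitute whatever rate your model was trained on):

import soundfile as sf

# 22050 Hz matches LJSpeech; use your model's training sample rate otherwise.
sf.write('generated.wav', output, 22050)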
  3. Training the model:

import torch.optim as optim

from utils.dataset import get_vocoder_datasets
from utils.distribution import discretized_mix_logistic_loss

# 'paths', 'batch_size' and 'train_gta' come from your own configuration.
train_set, test_set = get_vocoder_datasets(paths, batch_size, train_gta)

# Example optimizer; pick a learning rate that suits your setup.
optimizer = optim.Adam(model.parameters(), lr=1e-4)

for i, (x, y, m) in enumerate(train_set, 1):
    y_hat = model(x, m)                             # predict output distribution
    loss = discretized_mix_logistic_loss(y_hat, y)  # mixture-of-logistics loss (MOL mode)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
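
For longer training runs you will want to checkpoint the weights. A minimal sketch using plain torch.save; the path and naming scheme here are only an example:

import torch

# Save the vocoder weights periodically; adjust path and frequency to taste.
torch.save(model.state_dict(), f'checkpoints/wave_rnn_step_{i}.pyt')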

Getting Started

  1. Clone the repository:

    git clone https://github.com/fatchord/WaveRNN.git
    cd WaveRNN
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download pre-trained models:

    python scripts/download_models.py
    
  4. Generate audio using a pre-trained model:

    from utils.text.symbols import symbols
    from models.tacotron import Tacotron
    from models.fatchord_version import WaveRNN
    
    tts_model = Tacotron(embed_dims=256, num_chars=len(symbols))
    tts_model.load('pretrained/tts_model.pyt')
    
    voc_model = WaveRNN(rnn_dims=512, fc_dims=512, bits=9, pad=2,
                        upsample_factors=(5, 5, 11), feat_dims=80,
                        compute_dims=128, res_out_dims=128, res_blocks=10)
    voc_model.load('pretrained/voc_model.pyt')
    
    text = "Hello, world!"
    _, mel, _ = tts_model.generate(text)
    audio = voc_model.generate(mel)
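    # Optional: unbatched generation is slower but typically sounds better
    # (the same trade-off quick_start.py exposes with its -u flag). The
    # target/overlap values below are assumed to match the batched call shown earlier.
    audio = voc_model.generate(mel, batched=False, target=11000, overlap=550)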
    

Competitor Comparisons

TTS: :robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Pros of TTS

  • More comprehensive TTS toolkit with multiple models and voice conversion
  • Active development and regular updates
  • Better documentation and examples for ease of use

Cons of TTS

  • Higher complexity due to multiple models and features
  • Potentially higher resource requirements for training and inference
  • Steeper learning curve for beginners

Code Comparison

WaveRNN:

model = Model(rnn_dims=512, fc_dims=512, bits=9, pad=2,
              upsample_factors=(5,5,8), feat_dims=80,
              compute_dims=128, res_out_dims=128, res_blocks=10)

TTS:

from TTS.utils.synthesizer import Synthesizer
synthesizer = Synthesizer(
    tts_checkpoint="path/to/tts_model.pth",
    tts_config_path="path/to/tts_config.json",
    vocoder_checkpoint="path/to/vocoder_model.pth",
    vocoder_config="path/to/vocoder_config.json"
)

Summary

TTS offers a more comprehensive toolkit with multiple models and voice conversion capabilities, while WaveRNN focuses specifically on waveform generation. TTS benefits from active development and better documentation but may be more complex for beginners. WaveRNN is simpler and potentially easier to get started with for those focusing solely on waveform synthesis.

Tacotron2: PyTorch implementation of Tacotron 2 with faster-than-realtime inference

Pros of Tacotron2

  • Highly optimized for NVIDIA GPUs, offering faster training and inference
  • Includes pre-trained models for quick experimentation and deployment
  • Comprehensive documentation and examples for ease of use

Cons of Tacotron2

  • Limited flexibility for customization compared to WaveRNN
  • Requires more computational resources, especially for training
  • Less suitable for low-resource environments or non-NVIDIA hardware

Code Comparison

WaveRNN:

batch = torch2gpu(batch)
y_hat = model(x, mels)
loss = criterion(y_hat, y)

Tacotron2:

y, _, alignments = model(text_padded, input_lengths, mel_padded)
loss = criterion(y, mel_padded)

Key Differences

  • WaveRNN focuses on efficient waveform generation, while Tacotron2 is a complete text-to-speech system
  • Tacotron2 uses a sequence-to-sequence model with attention, whereas WaveRNN is a recurrent neural network that generates the waveform sample by sample (see the sketch after this list)
  • WaveRNN is more lightweight and adaptable to various use cases, while Tacotron2 is optimized for high-quality speech synthesis on powerful hardware
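
To make the contrast concrete, here is a conceptual sketch (illustrative only, not the repository's code) of the sample-by-sample loop a WaveRNN-style vocoder runs; the dimensions and sampling scheme are simplified:

import torch
import torch.nn as nn

# A recurrent cell conditioned on the current mel frame plus the previously
# generated sample, emitting one quantised audio sample per step.
feat_dims, rnn_dims, n_classes = 80, 512, 512   # 512 classes ~ 9-bit audio

cell = nn.GRUCell(feat_dims + 1, rnn_dims)
head = nn.Linear(rnn_dims, n_classes)

mels = torch.randn(1, 1000, feat_dims)          # mel frames upsampled to the sample rate
h = torch.zeros(1, rnn_dims)
sample = torch.zeros(1, 1)
audio = []
for t in range(mels.size(1)):
    h = cell(torch.cat([mels[:, t], sample], dim=-1), h)
    probs = torch.softmax(head(h), dim=-1)
    idx = torch.multinomial(probs, 1).float()    # sample a quantised level
    sample = idx / (n_classes - 1) * 2.0 - 1.0   # map back to [-1, 1]
    audio.append(sample)
waveform = torch.cat(audio, dim=1)               # shape (1, num_samples)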

Use Cases

  • Choose WaveRNN for:

    • Resource-constrained environments
    • Custom vocoder development
    • Flexibility in model architecture
  • Choose Tacotron2 for:

    • High-quality speech synthesis on NVIDIA GPUs
    • Quick deployment with pre-trained models
    • Integration with NVIDIA's ecosystem of tools

Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pros of Real-Time-Voice-Cloning

  • Offers end-to-end voice cloning capabilities, including speaker encoding, synthesis, and vocoding
  • Provides a user-friendly interface for real-time voice cloning demonstrations
  • Includes pre-trained models for immediate use

Cons of Real-Time-Voice-Cloning

  • May require more computational resources due to its comprehensive nature
  • Potentially more complex to set up and customize for specific use cases
  • Less focused on specific vocoding techniques compared to WaveRNN

Code Comparison

Real-Time-Voice-Cloning

from encoder.params_model import model_embedding_size as speaker_embedding_size
from utils.argutils import print_args
from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder import inference as vocoder

WaveRNN

from utils.dataset import get_vocoder_datasets
from utils.dsp import *
from models.fatchord_version import WaveRNN
from utils.paths import Paths
from utils.display import simple_table

The code snippets show that Real-Time-Voice-Cloning imports modules for the entire voice cloning pipeline, while WaveRNN focuses specifically on the vocoder component. This reflects the broader scope of Real-Time-Voice-Cloning compared to the more specialized focus of WaveRNN on neural vocoding.

wavenet_vocoder: WaveNet vocoder

Pros of wavenet_vocoder

  • Implements the original WaveNet architecture, which can produce high-quality audio
  • Supports both autoregressive and non-autoregressive generation
  • Well-documented and includes pre-trained models

Cons of wavenet_vocoder

  • Slower generation speed compared to WaveRNN
  • Higher computational requirements for training and inference
  • Less efficient for real-time applications

Code Comparison

WaveRNN:

def forward(self, x, mels):
    x = self.rnn1(x, mels)
    x = self.fc1(x)
    return self.fc2(x)

wavenet_vocoder:

def forward(self, x, c=None):
    x = self.first_conv(x)
    skips = None
    for f in self.conv_layers:
        x, h = f(x, c)
        if skips is None:
            skips = h
        else:
            skips += h
    x = skips
    for f in self.last_conv_layers:
        x = f(x)
    return x

The code comparison shows that WaveRNN uses a simpler architecture with RNN and fully connected layers, while wavenet_vocoder implements the more complex WaveNet architecture with multiple convolutional layers and skip connections.
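
For contrast, a similarly simplified sketch (again, not the repository's code) of a WaveNet-style stack of dilated convolutions whose outputs are accumulated as skip connections; gated activations and output layers are omitted:

import torch
import torch.nn as nn

channels, layers = 64, 6
convs = nn.ModuleList(
    nn.Conv1d(channels, channels, kernel_size=3, dilation=2 ** i, padding=2 ** i)
    for i in range(layers)
)

x = torch.randn(1, channels, 16000)   # (batch, channels, samples)
skips = torch.zeros_like(x)
for conv in convs:
    h = torch.tanh(conv(x))           # simplified; real WaveNet uses gated activations
    skips = skips + h                 # accumulate skip connections
    x = x + h                         # residual connection
out = skips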

Tacotron: A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

Pros of Tacotron

  • Implements the complete Tacotron architecture for end-to-end text-to-speech synthesis
  • Provides pre-trained models and detailed instructions for training and inference
  • Includes a simple web application for demo purposes

Cons of Tacotron

  • Focuses solely on the Tacotron model, lacking the flexibility of WaveRNN's modular approach
  • May require more computational resources for training and inference compared to WaveRNN
  • Less active development and community support in recent years

Code Comparison

Tacotron (model definition):

class Tacotron():
  def __init__(self):
    self.encoder = Encoder()
    self.decoder = Decoder()
    self.postnet = Postnet()

WaveRNN (model definition):

class Model(nn.Module):
    def __init__(self, rnn_dims, fc_dims, bits, pad, upsample_factors,
                 feat_dims, compute_dims, res_out_dims, res_blocks):
        super().__init__()
        self.n_classes = 2**bits
        self.rnn_dims = rnn_dims
        self.aux_dims = res_out_dims // 4
        self.upsample = UpsampleNetwork(upsample_factors, feat_dims)
        self.I = nn.Linear(feat_dims + self.aux_dims + 1, rnn_dims)
        self.rnn1 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
        self.rnn2 = nn.GRU(rnn_dims + self.aux_dims, rnn_dims, batch_first=True)
        self.fc1 = nn.Linear(rnn_dims + self.aux_dims, fc_dims)
        self.fc2 = nn.Linear(fc_dims + self.aux_dims, fc_dims)
        self.fc3 = nn.Linear(fc_dims, self.n_classes)

Tacotron-2: DeepMind's Tacotron-2 TensorFlow implementation

Pros of Tacotron-2

  • Implements the complete Tacotron 2 architecture, including both the text-to-mel spectrogram model and the WaveNet vocoder
  • Provides pre-trained models and detailed instructions for training and inference
  • Supports both English and Chinese languages

Cons of Tacotron-2

  • May require more computational resources due to the full Tacotron 2 implementation
  • Less focused on real-time synthesis compared to WaveRNN

Code Comparison

Tacotron-2:

def inference(self, inputs, input_lengths, speaker_embeddings=None):
    batch_size = inputs.size(0)
    embedded_inputs = self.embedding(inputs).transpose(1, 2)
    encoder_outputs = self.encoder(embedded_inputs, input_lengths)

WaveRNN:

def forward(self, x, mels):
    if self.mode == 'RAW':
        return self.forward_raw(x, mels)
    elif self.mode == 'MOL':
        return self.forward_mol(x, mels)
    return None

The Tacotron-2 code snippet shows the inference process for the text-to-mel spectrogram model, while the WaveRNN code focuses on the vocoder part, demonstrating different approaches to speech synthesis.


README

WaveRNN

(Update: Vanilla Tacotron One TTS system just implemented - more coming soon!)

(Diagram: Tacotron with WaveRNN)

PyTorch implementation of DeepMind's WaveRNN model from Efficient Neural Audio Synthesis.

Installation

Ensure you have Python 3.6+ and PyTorch installed (with CUDA if you plan to train on a GPU).

Then install the rest with pip:

pip install -r requirements.txt

How to Use

Quick Start

If you want to use the TTS functionality immediately, you can simply run:

python quick_start.py

This will synthesise everything in the default sentences.txt file and output to a new 'quick_start' folder, where you can play back the wav files and take a look at the attention plots.

You can also use that script to generate custom TTS sentences and/or pass '-u' to generate unbatched audio (slower, but better quality):

python quick_start.py -u --input_text "What will happen if I run this command?"

Training your own Models

(Animation: attention and mel spectrogram training progress)

Download the LJSpeech Dataset.

Edit hparams.py, point wav_path to your dataset and run:

python preprocess.py

or use preprocess.py --path to point directly to the dataset.
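
For example, assuming LJSpeech was extracted so that the wav files live under data/LJSpeech-1.1/wavs (the path is yours to choose), either of the following works:

# in hparams.py
wav_path = 'data/LJSpeech-1.1/wavs/'

# or directly on the command line
python preprocess.py --path data/LJSpeech-1.1/wavs/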


Here's my recommendation on what order to run things:

1 - Train Tacotron with:

python train_tacotron.py

2 - You can let that finish training, or at any point you can run:

python train_tacotron.py --force_gta

This will force Tacotron to create a GTA (ground truth aligned) dataset even if it hasn't finished training.

3 - Train WaveRNN with:

python train_wavernn.py --gta

NB: You can always just run train_wavernn.py without --gta if you're not interested in TTS.

4 - Generate Sentences with both models using:

python gen_tacotron.py wavernn

This will generate the default sentences. If you want to generate custom sentences you can use:

python gen_tacotron.py --input_text "this is whatever you want it to be" wavernn

And finally, you can always use --help on any of those scripts to see what options are available :)

Samples

Can be found here.

Pretrained Models

Currently there are two pretrained models available in the /pretrained/ folder.

Both are trained on LJSpeech:

  • WaveRNN (Mixture of Logistics output) trained to 800k steps
  • Tacotron trained to 180k steps

References

Acknowledgements