wavenet_vocoder

WaveNet vocoder

2,359

497


Top Related Projects

magenta

19,599

Magenta: Music and Art Generation with Machine Intelligence

tacotron2

5,257

Tacotron 2 - PyTorch implementation with faster-than-realtime inference

melgan-neurips

1,024

GAN-based Mel-Spectrogram Inversion Network for Text-to-Speech Synthesis

ParallelWaveGAN

1,619

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch

TTS

9,935

Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Quick Overview

r9y9/wavenet_vocoder is a PyTorch implementation of the WaveNet vocoder, a neural network that generates raw speech waveforms conditioned on linguistic or acoustic features such as mel-spectrograms. It is not a complete text-to-speech (TTS) system on its own: a separate model such as Tacotron 2 must predict the acoustic features from text. The README lists several papers that used the repository.

Pros

High-quality speech synthesis with natural-sounding output
Flexible and customizable, allowing for fine-tuning and experimentation
Well-documented and actively maintained repository
Includes pre-trained models for quick start and evaluation

Cons

Computationally intensive, requiring significant processing power for training and inference
May require large datasets for optimal performance
Steep learning curve for those new to deep learning and speech synthesis
Potential licensing issues for commercial use (check the repository's license)

Code Examples

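Loading a pre-trained model (a sketch: the keyword names follow the WaveNet constructor shown in the other snippets and may differ between releases, so check the source at the commit you use):

from wavenet_vocoder import builder
import torch

# Hyperparameters must match the ones the checkpoint was trained with.
model = builder.wavenet(
    layers=24,
    stacks=4,
    residual_channels=512,
    gate_channels=512,
    skip_out_channels=256,
    cin_channels=80,
    gin_channels=-1,
    weight_normalization=True,
    dropout=0.05,
    legacy=False)

# map_location lets a GPU-trained checkpoint load on a CPU-only machine.
checkpoint = torch.load("path/to/checkpoint.pth", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])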

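Generating audio from a mel-spectrogram (illustrative: the README documents synthesis.py as a command-line script, so the import below may not work against the installed package):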

import numpy as np
from wavenet_vocoder import synthesis

c = np.load("path/to/mel_spectrogram.npy")
waveform = synthesis.synthesis(model, c=c, fast=True)

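Training a new model (schematic only: num_epochs, dataloader, and the loss are placeholders, and the repository's train.py holds the actual training loop):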

from wavenet_vocoder import WaveNet
from torch import optim

model = WaveNet(
    layers=24,
    stacks=4,
    residual_channels=128,
    gate_channels=256,
    skip_out_channels=128,
    cin_channels=80,
    gin_channels=-1,
    weight_normalization=True,
    dropout=0.05)

optimizer = optim.Adam(model.parameters())
criterion = model.loss()

for epoch in range(num_epochs):
    for batch in dataloader:
        optimizer.zero_grad()
        y_hat = model(batch)
        loss = criterion(y_hat, batch)
        loss.backward()
        optimizer.step()

Getting Started

Install the required dependencies:

pip install torch numpy scipy librosa
git clone https://github.com/r9y9/wavenet_vocoder.git
cd wavenet_vocoder
pip install -e .

Download a pre-trained model:

wget "https://www.dropbox.com/s/zdbfprugbagfp2w/20180510_mixture_lj_checkpoint_step000320000_ema.pth?dl=0" -O checkpoint_step000320000_ema.pth

Generate audio from a mel-spectrogram (illustrative Python calls; the documented route is the command-line workflow in the README below):

import torch
import numpy as np
from wavenet_vocoder import builder
from wavenet_vocoder import synthesis

# Load model
model = builder.wavenet(checkpoint_path="checkpoint_step000320000_ema.pth")
model.eval()

# Load mel-spectrogram
c = np.load("path/to/mel_spectrogram.npy")

# Generate audio
waveform = synthesis.synthesis(model, c=c, fast=True)

Competitor Comparisons

magenta

19,599

Magenta: Music and Art Generation with Machine Intelligence

Pros of Magenta

Broader scope: Covers multiple AI music and art generation projects
More active development and larger community
Extensive documentation and tutorials

Cons of Magenta

Higher complexity due to diverse projects
Steeper learning curve for beginners
May require more computational resources

Code Comparison

Magenta (TensorFlow-based):

import magenta
from magenta.models.melody_rnn import melody_rnn_sequence_generator

generator = melody_rnn_sequence_generator.get_generator()
melody = generator.generate(steps_per_quarter=4, total_steps=32)

wavenet_vocoder (PyTorch-based):

import torch
from wavenet_vocoder import WaveNet

model = WaveNet()
audio = model.generate(mel_spectrogram, batch_size=1)

Summary

Magenta offers a comprehensive suite of AI-powered creative tools, while wavenet_vocoder focuses specifically on audio synthesis. Magenta provides more resources and a wider range of applications but may be more challenging for newcomers. wavenet_vocoder is more specialized and potentially easier to integrate for audio-specific projects. The choice between them depends on the scope of your project and your familiarity with the respective frameworks (TensorFlow vs. PyTorch).

WaveRNN

2,166

WaveRNN Vocoder + TTS

Pros of WaveRNN

Faster generation speed compared to WaveNet vocoder
More memory-efficient, allowing for larger batch sizes during training
Simpler architecture, potentially easier to understand and implement

Cons of WaveRNN

May produce slightly lower quality audio compared to WaveNet vocoder
Less established in the research community, with fewer variations and improvements
Limited pre-trained models available

Code Comparison

WaveRNN:

def forward(self, x, mels):
    x = self.I(x)
    mels = self.mel_upsample(mels.transpose(1, 2))
    mels = self.mel_hidden(mels)
    x = x + mels
    x = self.rnn1(x)
    return self.fc1(x), self.fc2(x)

WaveNet vocoder:

def forward(self, x, c):
    x = self.first_conv(x)
    skips = None
    for f in self.conv_layers:
        x, h = f(x, c)
        if skips is None:
            skips = h
        else:
            skips += h
    x = skips
    for f in self.last_conv_layers:
        x = f(x)
    return x

The code snippets show the core forward pass implementations for both models. WaveRNN uses a simpler RNN-based approach, while WaveNet vocoder employs a more complex convolutional architecture with skip connections.

tacotron2

5,257

Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Pros of tacotron2

Offers a complete end-to-end text-to-speech solution
Includes both text-to-mel-spectrogram and vocoder components
Backed by NVIDIA, potentially offering better performance on GPU hardware

Cons of tacotron2

May require more computational resources due to its comprehensive nature
Less flexibility for mixing and matching different components
Potentially more complex to modify or customize

Code Comparison

wavenet_vocoder:

model = WaveNet(
    layers=20,
    stacks=2,
    residual_channels=128,
    gate_channels=256,
    skip_out_channels=128,
    cin_channels=80,
    gin_channels=-1,
    weight_normalization=True,
    n_speakers=n_speakers,
    dropout=0.05,
    kernel_size=3,
    upsample_conditional_features=True,
    upsample_scales=[4, 4, 4, 4],
    freq_axis_kernel_size=3,
    scalar_input=True,
)

tacotron2:

model = Tacotron2(n_mel_channels=80,
                  n_symbols=len(symbols),
                  symbols_embedding_dim=512,
                  encoder_kernel_size=5,
                  decoder_rnn_dim=1024,
                  prenet_dim=256,
                  max_decoder_steps=1000,
                  gate_threshold=0.5,
                  p_attention_dropout=0.1,
                  p_decoder_dropout=0.1)

melgan-neurips

1,024

GAN-based Mel-Spectrogram Inversion Network for Text-to-Speech Synthesis

Pros of MelGAN-NeurIPS

Faster inference speed due to non-autoregressive architecture
Lighter model with fewer parameters, requiring less computational resources
Simpler training process with fewer hyperparameters to tune

Cons of MelGAN-NeurIPS

Potentially lower audio quality compared to WaveNet Vocoder
Less flexibility in controlling various aspects of the generated audio
May struggle with complex or highly dynamic audio content

Code Comparison

WaveNet Vocoder:

def _generate_audio(mel):
    audio = self.net.generate(mel)
    return audio.cpu().numpy()

MelGAN-NeurIPS:

def inference(self, mel):
    with torch.no_grad():
        audio = self.generator(mel)
    return audio.squeeze().cpu().numpy()

Both repositories focus on neural audio synthesis, but they use different approaches. WaveNet Vocoder employs a more complex, autoregressive model that can produce high-quality audio at the cost of slower inference. MelGAN-NeurIPS, on the other hand, uses a non-autoregressive approach that sacrifices some audio quality for significantly faster generation times and a lighter model.

The code comparison shows that both models take mel-spectrograms as input and generate audio. However, WaveNet Vocoder's generation process is likely more involved due to its autoregressive nature, while MelGAN-NeurIPS can generate audio in a single forward pass through its generator network.
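The difference between the two generation styles described above can be sketched with a toy example. This is not code from either repository: `autoregressive_generate` and `parallel_generate` are hypothetical stand-ins built on NumPy, and the scalar weights `a` and `b` are arbitrary. The point is only that the autoregressive version needs one sequential step per sample, while the non-autoregressive version is a single vectorized call.

```python
import numpy as np


def autoregressive_generate(c, a=0.5, b=1.0):
    """Each sample depends on the previous one, so time steps run sequentially."""
    x = np.zeros(len(c))
    prev = 0.0
    for t in range(len(c)):
        prev = np.tanh(a * prev + b * c[t])
        x[t] = prev
    return x


def parallel_generate(c, b=1.0):
    """Each sample depends only on the conditioning, so one vectorized call suffices."""
    return np.tanh(b * c)


c = np.linspace(-1.0, 1.0, 8)   # stand-in for upsampled conditioning features
ar = autoregressive_generate(c)
par = parallel_generate(c)
```

Setting `a=0.0` removes the dependence on the previous sample, and the loop then produces the same values as the vectorized call.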

ParallelWaveGAN

1,619

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch

Pros of ParallelWaveGAN

Faster inference time due to parallel generation
Improved audio quality with multi-resolution STFT loss
Supports various model architectures (e.g., MelGAN, Multi-band MelGAN)

Cons of ParallelWaveGAN

More complex training process
Potentially higher computational requirements for training
May require more fine-tuning for optimal results

Code Comparison

ParallelWaveGAN:

model = ParallelWaveGANGenerator()
optimizer = torch.optim.Adam(model.parameters())
criterion = MultiResolutionSTFTLoss()

wavenet_vocoder:

model = WaveNet()
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

The main differences in the code are:

Different model architectures (ParallelWaveGANGenerator vs. WaveNet)
Use of MultiResolutionSTFTLoss in ParallelWaveGAN instead of CrossEntropyLoss
ParallelWaveGAN offers more flexibility in model selection and loss functions

Both repositories provide implementations of neural vocoders for text-to-speech synthesis. ParallelWaveGAN offers faster inference and potentially better audio quality but may require more complex training. wavenet_vocoder is simpler to use but may have slower inference times. The choice between the two depends on specific project requirements and available computational resources.

TTS

9,935

Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Pros of TTS

More comprehensive TTS solution, including text processing and vocoding
Actively maintained with regular updates and improvements
Supports multiple languages and voices out-of-the-box

Cons of TTS

Larger and more complex codebase, potentially harder to customize
May require more computational resources due to its full-stack nature
Less focused on specific vocoding techniques compared to wavenet_vocoder

Code Comparison

TTS (example of model initialization):

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")

wavenet_vocoder (example of model usage):

from wavenet_vocoder import WaveNet

model = WaveNet(layers=10, stacks=3, residual_channels=64, gate_channels=64)
y = model(c, g=g, softmax=True, quantize=True)

Both repositories offer powerful tools for speech synthesis, with TTS providing a more complete end-to-end solution and wavenet_vocoder focusing specifically on neural vocoding techniques.


README

WaveNet vocoder

NOTE: This is the development version. If you need a stable version, please check out v0.1.1.

The goal of the repository is to provide an implementation of the WaveNet vocoder, which can generate high quality raw speech samples conditioned on linguistic or acoustic features.

Audio samples are available at https://r9y9.github.io/wavenet_vocoder/.

News

2019/10/31: The repository has been adapted to ESPnet. English, Chinese, and Japanese samples and pretrained models are available there. See https://github.com/espnet/espnet and https://github.com/espnet/espnet#tts-results for details.

Online TTS demo

A notebook intended to be run on https://colab.research.google.com is available:

Tacotron2: WaveNet-based text-to-speech demo

Highlights

Focus on local and global conditioning of WaveNet, which is essential for a vocoder.
16-bit raw audio modeling by mixture distributions: mixture of logistics (MoL), mixture of Gaussians, and single Gaussian distributions are supported.
Various audio samples and pre-trained models
Fast inference by caching intermediate states in convolutions. Similar to arXiv:1611.09482
Integration with ESPNet (https://github.com/espnet/espnet)
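The caching idea in the fast-inference highlight can be illustrated with a toy example. The sketch below is not the repository's implementation: it assumes a single-channel causal dilated convolution with kernel size 2, written in NumPy, and the names `causal_dilated_conv` and `CachedConv` are hypothetical. It checks that feeding samples one at a time through a small cache of past inputs reproduces the full-sequence convolution.

```python
from collections import deque

import numpy as np


def causal_dilated_conv(x, w, dilation):
    """Full-sequence causal convolution: y[t] = w[0] * x[t - dilation] + w[1] * x[t]."""
    padded = np.concatenate([np.zeros(dilation), x])
    return w[0] * padded[:-dilation] + w[1] * x


class CachedConv:
    """The same layer, evaluated one sample at a time with a cache of past inputs."""

    def __init__(self, w, dilation):
        self.w = w
        # The cache holds exactly the last `dilation` inputs (zeros at the start).
        self.cache = deque([0.0] * dilation, maxlen=dilation)

    def step(self, x_t):
        past = self.cache[0]       # this is x[t - dilation]
        self.cache.append(x_t)     # the oldest entry is dropped automatically
        return self.w[0] * past + self.w[1] * x_t


rng = np.random.default_rng(0)
x = rng.standard_normal(32)
w = np.array([0.5, -0.25])

full = causal_dilated_conv(x, w, dilation=4)
layer = CachedConv(w, dilation=4)
incremental = np.array([layer.step(v) for v in x])
```

Each call to `step` touches only two values regardless of how long the sequence already is, which is why caching helps during sample-by-sample generation.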

Pre-trained models

Note: This is not itself a text-to-speech (TTS) model. With a pre-trained model provided here, you can synthesize a waveform given a mel-spectrogram, not raw text. You will need a mel-spectrogram prediction model (such as Tacotron2) to use the pre-trained models for TTS.

Note: The pretrained model for LJSpeech was fine-tuned multiple times and trained for more than 1000k steps in total. Please refer to the issues (#1, #75, #45) to learn how the model was trained.

Model URL	Data	Hyper params URL	Git commit	Steps
link	LJSpeech	link	2092a64	1000k~ steps
link	CMU ARCTIC	link	b1a1076	740k steps

To use the pre-trained models, first check out the specific git commit noted above, i.e.,

git checkout ${commit_hash}

Then follow the "Synthesize from a checkpoint" section in the README. Note that old versions of synthesis.py may not accept the --preset=<json> parameter, and you might have to change hparams.py according to the preset (json) file.

You could try for example:

# Assuming you have downloaded LJSpeech-1.1 at ~/data/LJSpeech-1.1
# pretrained model (20180510_mixture_lj_checkpoint_step000320000_ema.pth)
# hparams (20180510_mixture_lj_checkpoint_step000320000_ema.json)
git checkout 2092a64
python preprocess.py ljspeech ~/data/LJSpeech-1.1 ./data/ljspeech \
  --preset=20180510_mixture_lj_checkpoint_step000320000_ema.json
python synthesis.py --preset=20180510_mixture_lj_checkpoint_step000320000_ema.json \
  --conditional=./data/ljspeech/ljspeech-mel-00001.npy \
  20180510_mixture_lj_checkpoint_step000320000_ema.pth \
  generated

You can find the generated wav file in the generated directory. Wondering how it works? Then take a look at the code :)

Repository structure

The repository consists of 1) a PyTorch library, 2) command line tools, and 3) ESPnet-style recipes. The first is a PyTorch library that provides WaveNet functionality. The second is a set of tools to run WaveNet training/inference, data processing, etc. The last is a set of reproducible recipes combining the WaveNet library and utility tools. Please take a look at them depending on your purpose. If you want to build your WaveNet on your dataset (I guess this is the most likely case), the recipe is the way to go.

Requirements

Python 3
CUDA >= 8.0
PyTorch >= v0.4.0

Installation

git clone https://github.com/r9y9/wavenet_vocoder && cd wavenet_vocoder
pip install -e .

If you only need the library part, you can install it from pypi:

pip install wavenet_vocoder

Getting started

Kaldi-style recipes

The repository provides Kaldi-style recipes to make experiments reproducible and easily manageable. Available recipes are as follows:

mulaw256: WaveNet that uses categorical output distribution. The input is 8-bit mulaw quantized waveform.
mol: Mixture of Logistics (MoL) WaveNet. The input is 16-bit raw audio.
gaussian: Single-Gaussian WaveNet (a.k.a. teacher WaveNet of ClariNet). The input is 16-bit raw audio.
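For readers unfamiliar with the 8-bit mu-law input used by the mulaw256 recipe, here is a minimal NumPy sketch of the standard mu-law companding formula with mu = 255 (256 classes). The function names `mulaw_quantize` and `inv_mulaw_quantize` are chosen for illustration, and the code is not taken from the repository's preprocessing.

```python
import numpy as np


def mulaw_quantize(x, mu=255):
    """Map a waveform in [-1, 1] to integer classes in [0, mu]."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((compressed + 1) / 2 * mu + 0.5).astype(np.int64)


def inv_mulaw_quantize(y, mu=255):
    """Map integer classes in [0, mu] back to a waveform in [-1, 1]."""
    compressed = 2 * y.astype(np.float64) / mu - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu


t = np.linspace(0, 1, 1000, endpoint=False)
wave = 0.8 * np.sin(2 * np.pi * 5 * t)     # synthetic test signal
classes = mulaw_quantize(wave)
restored = inv_mulaw_quantize(classes)
```

The logarithmic compression spends more of the 256 classes on quiet samples, which is why a categorical output over so few classes can still sound reasonable. The mol and gaussian recipes avoid this quantization by modeling 16-bit audio with continuous distributions.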

Every recipe has a run.sh, which specifies all the steps to perform WaveNet training/inference including data preprocessing. Please see run.sh in the egs directory for details.

NOTICE: Global conditioning for multi-speaker WaveNet is not supported in the above recipes (it shouldn't be difficult to implement though). Please check v0.1.12 for the feature, or if you really need the feature, please raise an issue.

Apply recipe to your own dataset

The recipes are designed to be generic so that one can use them for any dataset. To apply recipes to your own dataset, you'd need to put all the wav files in a single flat directory. i.e.,

> tree -L 1 ~/data/LJSpeech-1.1/wavs/ | head
/Users/ryuichi/data/LJSpeech-1.1/wavs/
├── LJ001-0001.wav
├── LJ001-0002.wav
├── LJ001-0003.wav
├── LJ001-0004.wav
├── LJ001-0005.wav
├── LJ001-0006.wav
├── LJ001-0007.wav
├── LJ001-0008.wav
├── LJ001-0009.wav

That's it! The last step is to modify db_root in run.sh or give db_root as a command line argument to run.sh.

./run.sh --stage 0 --stop-stage 0 --db-root ~/data/LJSpeech-1.1/wavs/
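If your dataset stores wav files in nested folders, something like the following could gather them into the single flat directory the recipes expect. This is a sketch using only the Python standard library; `flatten_wavs` is a hypothetical helper, not part of the repository, and it assumes file names are unique across subfolders.

```python
import shutil
from pathlib import Path


def flatten_wavs(src_root, dst_dir):
    """Copy every .wav under src_root into the single flat directory dst_dir."""
    src_root, dst_dir = Path(src_root), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for wav in sorted(src_root.rglob("*.wav")):
        target = dst_dir / wav.name
        if target.exists():
            # Refuse to silently overwrite files that share a name.
            raise FileExistsError(f"duplicate file name: {wav.name}")
        shutil.copy2(wav, target)
        copied.append(target)
    return copied
```

Keep the destination outside the source tree, then pass it to run.sh as db_root.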

Step-by-step

A recipe typically consists of multiple steps. When running a recipe for the first time, it is strongly recommended to run it step by step to understand how it works. To do so, specify stage and stop_stage as follows:

./run.sh --stage 0 --stop-stage 0

./run.sh --stage 1 --stop-stage 1

./run.sh --stage 2 --stop-stage 2

In typical situations, you need to specify CUDA devices explicitly, especially for the training step.

CUDA_VISIBLE_DEVICES="0,1" ./run.sh --stage 2 --stop-stage 2

Docs for command line tools

Command line tools are written with docopt. See each docstring for basic usage.

tojson.py

Dump hyperparameters to a json file.

Usage:

python tojson.py --hparams="parameters you want to override" <output_json_path>
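To make the idea of an override string concrete, here is a rough sketch of how a comma-separated `key=value` string could be merged into default hyperparameters and dumped as JSON. It is not the repository's tojson.py: `parse_overrides` is a hypothetical helper, the default values are illustrative, and list-valued overrides that themselves contain commas are not handled.

```python
import json


def parse_overrides(spec):
    """Parse a string such as "cin_channels=80,gin_channels=-1" into a dict."""
    overrides = {}
    for item in filter(None, (part.strip() for part in spec.split(","))):
        key, _, raw = item.partition("=")
        try:
            value = json.loads(raw)   # numbers, true/false, quoted strings
        except json.JSONDecodeError:
            value = raw               # fall back to a plain string
        overrides[key.strip()] = value
    return overrides


# Illustrative defaults only; the real ones live in the repository's hparams.
defaults = {"cin_channels": 80, "gin_channels": -1, "dropout": 0.05}
hparams = {**defaults, **parse_overrides("cin_channels=-1,dropout=0.1")}
text = json.dumps(hparams, indent=2, sort_keys=True)
```

Writing `text` to a file gives a preset that can be passed back through the `--preset=<json>` option of the other tools.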

preprocess.py

Usage:

python preprocess.py wavallin ${dataset_path} ${out_dir} --preset=<json>

train.py

Note: for multi-GPU training, you had better ensure that batch_size % num_gpu == 0.

Usage:

python train.py --dump-root=${dump-root} --preset=<json>\
  --hparams="parameters you want to override"
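The divisibility condition in the multi-GPU note above can be checked before launching a long run. The helper below is a hypothetical convenience, not something train.py provides; it only encodes the arithmetic from the note.

```python
def per_gpu_batch_size(batch_size, num_gpu):
    """Return the per-device batch size, or raise if the split is uneven."""
    if num_gpu < 1:
        raise ValueError("num_gpu must be at least 1")
    if batch_size % num_gpu != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by num_gpu={num_gpu}"
        )
    return batch_size // num_gpu
```

For example, a batch size of 8 on two devices gives 4 samples per device, while a batch size of 10 on four devices is rejected.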

evaluate.py

Given a directory that contains local conditioning features, synthesize waveforms for them.

Usage:

python evaluate.py ${dump_root} ${checkpoint} ${output_dir} --dump-root="data location"\
    --preset=<json> --hparams="parameters you want to override"

Options:

--num-utterances=<N>: Number of utterances to be generated. If not specified, generate all utterances. This is useful for debugging.

synthesis.py

NOTICE: This is probably not working now. Please use evaluate.py instead.

Synthesize a waveform given a conditioning feature.

Usage:

python synthesis.py ${checkpoint_path} ${output_dir} --preset=<json> --hparams="parameters you want to override"

Important options:

--conditional=<path>: (Required for conditional WaveNet) Path of local conditional features (.npy). If this is specified, number of time steps to generate is determined by the size of conditional feature.
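How the size of the conditioning feature fixes the output length can be sketched as follows. Two things here are assumptions rather than facts from the README: the feature array is laid out as (frames, channels), and the hop size is 256 samples per frame (the real value comes from your preset). Repeating frames is used as a simple stand-in for whatever upsampling the model applies.

```python
import numpy as np


def num_output_samples(c, hop_size):
    """Number of waveform samples implied by a (frames, channels) feature array."""
    return c.shape[0] * hop_size


def upsample_by_repetition(c, hop_size):
    """Repeat each frame hop_size times so features align with samples."""
    return np.repeat(c, hop_size, axis=0)


c = np.zeros((100, 80), dtype=np.float32)   # 100 frames of an 80-bin mel-spectrogram
hop_size = 256                               # assumed value; read it from your preset
aligned = upsample_by_repetition(c, hop_size)
```

With these numbers, 100 frames correspond to 25,600 generated samples, one feature row per sample after alignment.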

Training scenarios

Training un-conditional WaveNet

NOTICE: This is probably not working now. Please check v0.1.1 for the working version.

python train.py --dump-root=./data/cmu_arctic/ \
    --hparams="cin_channels=-1,gin_channels=-1"

You have to disable global and local conditioning by setting gin_channels and cin_channels to negative values.

Training WaveNet conditioned on mel-spectrogram

python train.py --dump-root=./data/cmu_arctic/ --speaker-id=0 \
    --hparams="cin_channels=80,gin_channels=-1"

Training WaveNet conditioned on mel-spectrogram and speaker embedding

NOTICE: This is probably not working now. Please check v0.1.1 for the working version.

python train.py --dump-root=./data/cmu_arctic/ \
    --hparams="cin_channels=80,gin_channels=16,n_speakers=7"
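The three training scenarios above differ mainly in the sign of cin_channels and gin_channels. The small function below restates that convention in code; `conditioning_mode` is a hypothetical name used only for illustration, and it assumes that any non-positive value means the corresponding conditioning is disabled.

```python
def conditioning_mode(cin_channels, gin_channels):
    """Describe which conditioning inputs are active for a given setting."""
    local = cin_channels > 0      # e.g. mel-spectrogram frames
    global_ = gin_channels > 0    # e.g. speaker embedding
    if local and global_:
        return "local+global"
    if local:
        return "local"
    if global_:
        return "global"
    return "unconditional"
```

Calling it with the values from the three commands gives "unconditional" for (-1, -1), "local" for (80, -1), and "local+global" for (80, 16).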

Misc

Monitor with Tensorboard

Logs are dumped in the ./log directory by default. You can monitor them with TensorBoard:

tensorboard --logdir=log

List of papers that used the repository

A Comparison of Recent Neural Vocoders for Speech Signal Reconstruction https://www.isca-speech.org/archive/SSW_2019/abstracts/SSW10_O_1-2.html
WaveGlow: A Flow-based Generative Network for Speech Synthesis https://arxiv.org/abs/1811.00002
WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation https://arxiv.org/abs/1904.02892
Parametric Resynthesis with neural vocoders https://arxiv.org/abs/1906.06762
Representation Mixing for TTS Synthesis https://arxiv.org/abs/1811.07240
A Unified Neural Architecture for Instrumental Audio Tasks https://arxiv.org/abs/1903.00142
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit https://arxiv.org/abs/1910.10909

Thank you very much!! If you find a new one, please submit a PR.
