
descriptinc/melgan-neurips

GAN-based Mel-Spectrogram Inversion Network for Text-to-Speech Synthesis


Top Related Projects

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Tacotron 2 - PyTorch implementation with faster-than-realtime inference


Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

WaveNet vocoder


Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Quick Overview

MelGAN-NeurIPS is the official PyTorch implementation of the MelGAN vocoder, a Generative Adversarial Network (GAN) for efficient, high-fidelity speech synthesis. The project provides a lightweight, fast alternative to autoregressive vocoders such as WaveNet, generating high-quality audio from mel-spectrograms faster than real time.

Pros

  • Fast inference speed, suitable for real-time applications
  • High-quality audio generation from mel-spectrograms
  • Efficient architecture with fewer parameters compared to traditional vocoders
  • Easy to train and fine-tune on custom datasets

Cons

  • May require significant computational resources for training
  • Performance can be sensitive to hyperparameter tuning
  • Limited documentation and examples in the repository
  • Potential instability issues common to GAN training

Code Examples

  1. Loading a pre-trained MelGAN model (the model.generator import path is as used throughout these examples; in the repository itself the model code lives under mel2wav/):
import torch
from model.generator import Generator

model = Generator(80)  # 80 mel channels
model.load_state_dict(torch.load("path/to/pretrained_model.pt"))
model.eval()
  2. Generating audio from a mel-spectrogram (writing the result to a WAV file is sketched after these examples):
import torch

mel_spectrogram = torch.randn(1, 80, 234)  # Example mel-spectrogram
with torch.no_grad():
    audio = model(mel_spectrogram)
  3. Training the MelGAN model:
from model.generator import Generator
from model.discriminator import Discriminator

generator = Generator(80)
discriminator = Discriminator()

# Simplified training sketch: num_epochs and dataloader are assumed to be defined elsewhere,
# and train_step stands in for the actual adversarial updates (see scripts/train.py in the repo)
for epoch in range(num_epochs):
    for mel, audio in dataloader:
        fake_audio = generator(mel)
        d_loss = discriminator.train_step(audio, fake_audio)
        g_loss = generator.train_step(mel, fake_audio, discriminator)
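
To listen to the result of example 2, write the generated tensor to a WAV file. Below is a minimal sketch using torchaudio, which is not a dependency of this repository; the 22050 Hz sample rate is the usual LJSpeech setting and should match whatever rate your model was trained on.

import torchaudio

# `audio` is the generator output from example 2, typically shaped (batch, 1, samples).
waveform = audio.squeeze(0).cpu()  # drop the batch dimension -> (channels, samples)
torchaudio.save("generated.wav", waveform, sample_rate=22050)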

Getting Started

To get started with MelGAN-NeurIPS:

  1. Clone the repository:

    git clone https://github.com/descriptinc/melgan-neurips.git
    cd melgan-neurips
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Prepare your dataset and the train/test file lists (see the "Preparing dataset" section of the README below).

  4. Train the model:

    python scripts/train.py --save_path logs/baseline --path <root_data_folder>
    
  5. For inference, use the provided scripts/generate_from_folder.py, load the pretrained model through PyTorch Hub (a minimal sketch follows; see also the README's PyTorch Hub example), or integrate the generator into your own pipeline as shown in the code examples above.
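
As a quick sanity check before training anything, the pretrained model published through PyTorch Hub can be used directly. The snippet below is a minimal sketch: the mel-spectrogram is a random placeholder (so the output is just noise), and the call mirrors the README's PyTorch Hub example.

import torch

# Load the pretrained MelGAN vocoder via PyTorch Hub.
vocoder = torch.hub.load('descriptinc/melgan-neurips', 'load_melgan')

# Placeholder mel-spectrogram: (batch_size, n_mel_channels, frames). In a real pipeline
# this comes from an acoustic model such as Tacotron 2.
mel = torch.randn(1, 80, 234)
audio = vocoder.inverse(mel)  # -> waveform tensor, roughly (batch_size, samples)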

Competitor Comparisons

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Pros of HiFi-GAN

  • Competitive inference speed; the smaller HiFi-GAN variants are reported to run faster than MelGAN
  • Higher quality audio output with fewer artifacts
  • More efficient training process

Cons of HiFi-GAN

  • Slightly more complex architecture
  • May require more computational resources for training
  • Potentially longer training time due to its multi-period and multi-scale discriminators

Code Comparison

MelGAN:

class ResidualStack(nn.Module):
    def __init__(self, channels, num_res_blocks, kernel_size):
        super(ResidualStack, self).__init__()
        self.num_res_blocks = num_res_blocks
        self.stack = nn.ModuleList()
        for _ in range(num_res_blocks):
            self.stack.append(ResidualBlock(channels, kernel_size))

HiFi-GAN:

class ResBlock1(torch.nn.Module):
    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5)):
        super(ResBlock1, self).__init__()
        self.h = h
        self.convs1 = nn.ModuleList([
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
                               padding=get_padding(kernel_size, dilation[0]))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
                               padding=get_padding(kernel_size, dilation[1]))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
                               padding=get_padding(kernel_size, dilation[2])))
        ])

The code comparison shows that each HiFi-GAN residual block runs several dilated convolutions with different dilation rates (part of its multi-receptive-field design), while MelGAN simply stacks lighter-weight residual blocks. This richer receptive-field structure is one of the factors behind HiFi-GAN's higher audio quality.
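
For contrast, a MelGAN-style residual block looks roughly like the following. This is an illustrative sketch rather than the repository's verbatim code (the actual blocks live in mel2wav/modules.py):

import torch.nn as nn
from torch.nn.utils import weight_norm

class MelGANResBlock(nn.Module):
    """One dilated kernel-size-3 convolution followed by a 1x1 convolution, plus a 1x1 shortcut."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.ReflectionPad1d(dilation),
            weight_norm(nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation)),
            nn.LeakyReLU(0.2),
            weight_norm(nn.Conv1d(channels, channels, kernel_size=1)),
        )
        self.shortcut = weight_norm(nn.Conv1d(channels, channels, kernel_size=1))

    def forward(self, x):
        return self.shortcut(x) + self.block(x)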

Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Pros of Tacotron2

  • More comprehensive end-to-end text-to-speech system
  • Includes both text-to-mel-spectrogram and vocoder components
  • Backed by NVIDIA, potentially offering better support and documentation

Cons of Tacotron2

  • More complex architecture, potentially harder to understand and modify
  • May require more computational resources for training and inference
  • Less focused on the vocoder aspect compared to MelGAN

Code Comparison

MelGAN:

def forward(self, x):
    return self.model(x)

Tacotron2:

def forward(self, inputs, input_lengths, mel_specs=None):
    embedded_inputs = self.embedding(inputs).transpose(1, 2)
    encoder_outputs = self.encoder(embedded_inputs, input_lengths)
    mel_outputs, gate_outputs, alignments = self.decoder(
        encoder_outputs, mel_specs, memory_lengths=input_lengths)
    return mel_outputs, gate_outputs, alignments

The code comparison shows that MelGAN has a simple forward pass focused purely on the vocoder stage, while Tacotron 2's forward method covers the text-to-mel-spectrogram part of the pipeline: text embedding, encoding, and attention-based decoding. A vocoder such as MelGAN (or WaveGlow) is still needed to turn the predicted mel-spectrogram into a waveform.
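
In practice the two projects are complementary: the acoustic model predicts a mel-spectrogram and MelGAN inverts it to audio. The sketch below illustrates the hand-off, reusing the Generator import from the code examples above and a random tensor as a stand-in for Tacotron 2's actual output; the 256x upsampling factor is MelGAN's default hop length and may differ in other configurations.

import torch
from model.generator import Generator  # import path as used in the code examples above

vocoder = Generator(80)  # 80 mel channels, matching Tacotron 2's mel dimension
vocoder.eval()

mel_from_tacotron2 = torch.randn(1, 80, 500)  # stand-in for a predicted mel: (batch, n_mels, frames)
with torch.no_grad():
    audio = vocoder(mel_from_tacotron2)  # roughly (batch, 1, frames * 256) samples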


Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

  • Broader scope: Supports a wide range of sequence modeling tasks, including machine translation, text generation, and speech recognition
  • Extensive documentation and examples: Provides comprehensive guides and tutorials for various use cases
  • Active development and community support: Regular updates and contributions from a large user base

Cons of fairseq

  • Steeper learning curve: More complex architecture due to its versatility
  • Higher resource requirements: May need more computational power for training and inference

Code comparison

melgan-neurips:

model = Generator(args.n_mel_channels, args.ngf, args.n_residual_layers)
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, betas=(0.5, 0.9))

fairseq:

model = TransformerModel.build_model(args, task)
optimizer = fairseq.optim.Adam(args, model.parameters())
criterion = LabelSmoothedCrossEntropyCriterion(args, task)

Summary

While melgan-neurips focuses specifically on MelGAN for efficient audio generation, fairseq offers a more comprehensive toolkit for various sequence modeling tasks. fairseq provides greater flexibility and extensive documentation but may require more resources and have a steeper learning curve. melgan-neurips, being more specialized, might be easier to use for audio-specific tasks but lacks the broader applicability of fairseq.

WaveNet vocoder

Pros of wavenet_vocoder

  • Higher audio quality and more natural-sounding speech synthesis
  • Better handling of long-term dependencies in audio generation
  • More flexible and adaptable to various speech synthesis tasks

Cons of wavenet_vocoder

  • Slower inference time compared to MelGAN
  • Higher computational requirements for training and generation
  • More complex architecture, potentially harder to implement and fine-tune

Code Comparison

wavenet_vocoder:

def _generate_audio(self, mel):
    audio = self.net.generate(mel)
    return audio.cpu().numpy()

MelGAN:

def inference(self, mel):
    with torch.no_grad():
        audio = self.generator(mel)
    return audio.squeeze().cpu().numpy()

The code snippets show the core audio generation functions for both models. wavenet_vocoder generates the waveform autoregressively, one sample at a time, through a deep dilated-convolution network, while MelGAN produces the entire waveform in a single feed-forward pass through its generator. This difference is the heart of the quality-versus-speed trade-off between the two approaches.
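
One rough way to quantify that trade-off is the real-time factor (RTF): generation time divided by audio duration. The sketch below measures it for the MelGAN generator used in the examples above; it assumes a 256-sample hop at 22050 Hz, and the absolute numbers depend entirely on the hardware.

import time
import torch
from model.generator import Generator  # import path as used in the examples above

model = Generator(80)
model.eval()

mel = torch.randn(1, 80, 860)  # roughly 10 seconds of audio at 22050 Hz with a 256-sample hop

with torch.no_grad():
    start = time.time()
    audio = model(mel)
    elapsed = time.time() - start

audio_seconds = audio.shape[-1] / 22050
print(f"RTF = {elapsed / audio_seconds:.3f} (values below 1 mean faster than real time)")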


Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Pros of TTS

  • More comprehensive TTS toolkit with multiple models and voice conversion
  • Active development and regular updates
  • Extensive documentation and examples

Cons of TTS

  • Larger codebase, potentially more complex to use
  • May require more computational resources for training and inference

Code Comparison

TTS:

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")

MelGAN-NeurIPS:

import torch

vocoder = torch.hub.load("descriptinc/melgan-neurips", "load_melgan")
audio = vocoder.inverse(mel_spectrogram)  # mel_spectrogram: (batch, 80, frames)

Key Differences

  • TTS offers a higher-level API for easy text-to-speech conversion
  • MelGAN-NeurIPS focuses specifically on the MelGAN vocoder
  • TTS includes multiple models and voices, while MelGAN-NeurIPS is more specialized

Use Cases

  • TTS: Suitable for projects requiring a complete TTS pipeline with various options
  • MelGAN-NeurIPS: Ideal for researchers or developers focusing on MelGAN vocoder implementation

Community and Support

  • TTS: Larger community, more frequent updates, and extensive documentation
  • MelGAN-NeurIPS: Smaller community, focused on specific MelGAN implementation


README

Official repository for the paper MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Previous works have found that generating coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques. Subjective evaluation metric (Mean Opinion Score, or MOS) shows the effectiveness of the proposed approach for high quality mel-spectrogram inversion. To establish the generality of the proposed techniques, we show qualitative results of our model in speech synthesis, music domain translation and unconditional music synthesis. We evaluate the various components of the model through ablation studies and suggest a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks. Our model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion. Our PyTorch implementation runs more than 100x faster than realtime on a GTX 1080Ti GPU and more than 2x faster than real-time on CPU, without any hardware specific optimization tricks. Blog post with samples and accompanying code coming soon.

Visit our website for samples. You can also try the speech correction application, which is built on an end-to-end speech synthesis pipeline that uses MelGAN.

If you aren't attending the NeurIPS 2019 conference to see our poster, check out the slides instead.

Code organization

├── README.md             <- Top-level README.
├── set_env.sh            <- Set PYTHONPATH and CUDA_VISIBLE_DEVICES.
│
├── mel2wav
│   ├── dataset.py           <- data loader scripts
│   ├── modules.py           <- Model, layers and losses
│   ├── utils.py             <- Utilities to monitor, save, log, schedule etc.
│
├── scripts
│   ├── train.py                    <- training / validation / etc scripts
│   ├── generate_from_folder.py

Preparing dataset

Create a raw data folder with all the samples stored in a wavs/ subfolder, then run these commands to build the train/test file lists (the first 10 files go to the test set, the rest to training):

ls wavs/*.wav | tail -n+11 > train_files.txt
ls wavs/*.wav | head -n10 > test_files.txt

Training Example

source set_env.sh 0
# Set PYTHONPATH and use first GPU
python scripts/train.py --save_path logs/baseline --path <root_data_folder>

PyTorch Hub Example

import torch
vocoder = torch.hub.load('descriptinc/melgan-neurips', 'load_melgan')
vocoder.inverse(mel)  # mel (torch.tensor) of shape (batch_size, 80, timesteps) -> audio waveform
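
The object returned by load_melgan is a thin wrapper around the generator (see mel2wav/interface.py). Based on that interface, a copy-synthesis round trip looks roughly like the sketch below; the audio-to-mel call and the example filename are assumptions to verify against your checkout, and the pretrained model expects 22050 Hz audio.

import torch
import torchaudio

# Load the pretrained vocoder via PyTorch Hub.
vocoder = torch.hub.load('descriptinc/melgan-neurips', 'load_melgan')

# Placeholder filename; resample to 22050 Hz first if your file uses a different rate.
wav, sr = torchaudio.load('example.wav')  # wav: (channels, samples)
wav = wav.mean(dim=0, keepdim=True)       # downmix to mono -> (1, samples)

mel = vocoder(wav)                        # assumed: audio (batch, samples) -> log-mel (batch, 80, frames)
reconstructed = vocoder.inverse(mel)      # log-mel -> waveform (batch, samples)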