
jik876/hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis


Top Related Projects

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch

WaveRNN Vocoder + TTS

Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Quick Overview

HiFi-GAN is a high-fidelity generative adversarial network for efficient and high-quality speech synthesis. It is designed to generate high-quality audio waveforms from mel-spectrograms, making it suitable for text-to-speech applications and other audio generation tasks. The project aims to provide a fast and efficient alternative to existing vocoder models while maintaining high audio quality.

Pros

  • High-quality audio generation with fast inference speed
  • Efficient architecture, suitable for real-time applications
  • Supports both single-speaker and multi-speaker models
  • Well-documented and easy to use

Cons

  • Requires significant computational resources for training
  • May struggle with certain types of audio, such as music or complex environmental sounds
  • Limited to generating audio from mel-spectrograms, not directly from text
  • Potential for mode collapse or other GAN-related issues during training

Code Examples

  1. Loading a pre-trained HiFi-GAN model (the generator is built from the JSON config distributed with each checkpoint):

import json
import torch
from env import AttrDict
from models import Generator

with open("config.json") as f:              # config file distributed with the checkpoint
    h = AttrDict(json.load(f))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Generator(h).to(device)
checkpoint_dict = torch.load("generator_v1", map_location=device)
model.load_state_dict(checkpoint_dict['generator'])
  2. Generating audio from a mel-spectrogram, where x is a mel-spectrogram tensor of shape [batch, num_mels, frames] (a sketch for writing the result to a .wav file follows these examples):

from meldataset import MAX_WAV_VALUE        # 32768.0, for 16-bit PCM output

with torch.no_grad():
    y_g_hat = model(x)                      # waveform, shape [batch, 1, samples]
    audio = y_g_hat.squeeze()
    audio = audio * MAX_WAV_VALUE
    audio = audio.cpu().numpy().astype('int16')
  3. Training a HiFi-GAN model. Training is driven by a JSON configuration file (e.g. config_v1.json) and launched through the train.py script rather than a Python API. An excerpt of the training-related keys:

{
    "resblock": "1",
    "num_gpus": 1,
    "batch_size": 16,
    "learning_rate": 0.0002,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999,
    "seed": 1234,
    "num_workers": 4
}

python train.py --config config_v1.json
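
As a follow-up to example 2, the int16 samples can be written out as a playable .wav file, as the repository's inference script does with scipy. A minimal sketch, assuming the audio array and the config object h (with its sampling_rate field, 22050 for the LJSpeech models) from the examples above:

from scipy.io.wavfile import write

# `audio` is the int16 numpy array from example 2; h.sampling_rate comes from the
# config used to build the generator (22050 Hz for the provided LJSpeech models).
write("generated.wav", h.sampling_rate, audio)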

Getting Started

To get started with HiFi-GAN:

  1. Clone the repository:

    git clone https://github.com/jik876/hifi-gan.git
    cd hifi-gan
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download pre-trained models or prepare your data for training.

  4. Use the provided scripts for inference or training:

    python inference.py --checkpoint_file [PATH_TO_CHECKPOINT] --input_wavs_dir [DIRECTORY_PATH]
    

    or

    python train.py --config config_v1.json
    

Refer to the repository's README for more detailed instructions on data preparation, training, and inference.
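
If you need to create the mel-spectrogram inputs yourself rather than exporting them from a TTS model, the repository's meldataset module exposes the same mel extraction used during training. A minimal sketch, assuming the analysis parameters from config_v1.json (80 mel bands, 1024-point FFT, hop size 256, 22,050 Hz, fmax 8000):

import torch
import librosa
from meldataset import mel_spectrogram   # provided by this repository

# Load a wav at the model's sampling rate and convert it to the mel representation
# the generator expects. Parameter values follow config_v1.json.
wav, sr = librosa.load("example.wav", sr=22050)
wav = torch.FloatTensor(wav).unsqueeze(0)                 # shape [1, samples]
mel = mel_spectrogram(wav, n_fft=1024, num_mels=80, sampling_rate=22050,
                      hop_size=256, win_size=1024, fmin=0, fmax=8000)
# mel has shape [1, 80, frames] and can be passed to the generator as x.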

Competitor Comparisons

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch

Pros of ParallelWaveGAN

  • Supports multi-speaker and multi-language models
  • Includes various vocoder architectures (e.g., MelGAN, Multi-band MelGAN)
  • Provides comprehensive documentation and examples

Cons of ParallelWaveGAN

  • Generally slower inference speed compared to HiFi-GAN
  • May require more computational resources for training
  • Less optimized for real-time applications

Code Comparison

ParallelWaveGAN:

model = ParallelWaveGANGenerator()
optimizer = RAdam(model.parameters())
criterion = MultiResolutionSTFTLoss()

HiFi-GAN:

model = Generator(h)
mpd = MultiPeriodDiscriminator()
msd = MultiScaleDiscriminator()

Both repositories provide high-quality vocoders for text-to-speech applications. ParallelWaveGAN offers more flexibility with multiple architectures and more extensive documentation, while HiFi-GAN focuses on fast inference and real-time use. Note that the snippets above are not equivalent: the ParallelWaveGAN excerpt shows a generator with its optimizer and multi-resolution STFT loss, while the HiFi-GAN excerpt instantiates the generator together with its multi-period and multi-scale discriminators. Choose based on your specific requirements for speed, quality, and supported features.

2,124

WaveRNN Vocoder + TTS

Pros of WaveRNN

  • More flexible architecture, allowing for various configurations and modifications
  • Autoregressive sampling can produce very natural audio, especially for complex waveforms
  • Better suited for low-resource environments due to its recurrent nature

Cons of WaveRNN

  • Slower inference time compared to HiFi-GAN
  • Requires more computational resources for training
  • Less efficient for real-time applications

Code Comparison

WaveRNN:

class WaveRNN(nn.Module):
    def __init__(self, rnn_dims, fc_dims, bits, pad, upsample_factors, feat_dims, compute_dims, res_out_dims, res_blocks):
        super().__init__()
        self.res_blocks = res_blocks
        self.pad = pad
        self.n_classes = 2**bits
        self.rnn_dims = rnn_dims
        self.aux_dims = res_out_dims // 4
        self.upsample = UpsampleNetwork(upsample_factors, feat_dims)

HiFi-GAN:

class Generator(torch.nn.Module):
    def __init__(self, h):
        super(Generator, self).__init__()
        self.h = h
        self.num_kernels = len(h.resblock_kernel_sizes)
        self.num_upsamples = len(h.upsample_rates)
        self.conv_pre = weight_norm(Conv1d(80, h.upsample_initial_channel, 7, 1, padding=3))
        resblock = ResBlock1 if h.resblock == '1' else ResBlock2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Pros of Tacotron2

  • Comprehensive end-to-end text-to-speech system
  • Includes both text-to-mel-spectrogram and vocoder components
  • Backed by NVIDIA, with extensive documentation and support

Cons of Tacotron2

  • Older architecture, may not represent the latest advancements in TTS
  • Can be more complex to set up and train compared to HiFi-GAN

Code Comparison

Tacotron2 (model definition):

class Tacotron2(nn.Module):
    def __init__(self, hparams):
        super(Tacotron2, self).__init__()
        self.mask_padding = hparams.mask_padding
        self.fp16_run = hparams.fp16_run
        self.n_mel_channels = hparams.n_mel_channels
        self.n_frames_per_step = hparams.n_frames_per_step
        self.embedding = nn.Embedding(hparams.n_symbols, hparams.symbols_embedding_dim)

HiFi-GAN (model definition):

class Generator(torch.nn.Module):
    def __init__(self, h):
        super(Generator, self).__init__()
        self.h = h
        self.num_kernels = len(h.resblock_kernel_sizes)
        self.num_upsamples = len(h.upsample_rates)
        self.conv_pre = weight_norm(Conv1d(80, h.upsample_initial_channel, 7, 1, padding=3))
        resblock = ResBlock1 if h.resblock == '1' else ResBlock2

Both repositories provide implementations of text-to-speech models, but they focus on different aspects of the TTS pipeline. Tacotron2 offers a complete end-to-end solution, while HiFi-GAN specializes in high-fidelity audio generation from mel-spectrograms.

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pros of Real-Time-Voice-Cloning

  • Offers end-to-end voice cloning capabilities, including speaker encoding, synthesis, and vocoding
  • Provides a user-friendly interface for real-time voice cloning demonstrations
  • Includes pre-trained models for immediate use

Cons of Real-Time-Voice-Cloning

  • May require more computational resources due to its comprehensive approach
  • Less focused on specific audio quality improvements compared to HiFi-GAN
  • Potentially more complex to integrate into existing projects

Code Comparison

Real-Time-Voice-Cloning:

def load_model(checkpoint_path):
    model = SpeakerEncoder()
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint["model_state"])
    return model

HiFi-GAN:

def load_model(filepath):
    checkpoint = torch.load(filepath)
    generator = Generator(h).to(device)
    generator.load_state_dict(checkpoint['generator'])
    return generator

Both repositories use PyTorch for model loading, but Real-Time-Voice-Cloning loads a speaker-encoder model, while HiFi-GAN loads a generator for high-fidelity waveform generation. HiFi-GAN is a specialized neural vocoder, whereas Real-Time-Voice-Cloning offers a broader end-to-end voice cloning pipeline.


README

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our paper, we proposed HiFi-GAN: a GAN-based model capable of generating high fidelity speech efficiently.
We provide our implementation and pretrained models as open source in this repository.

Abstract: Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.

Visit our demo website for audio samples.

Pre-requisites

  1. Python >= 3.6
  2. Clone this repository.
  3. Install Python requirements. Please refer to requirements.txt.
  4. Download and extract the LJ Speech dataset, then move all wav files to LJSpeech-1.1/wavs.

Training

python train.py --config config_v1.json

To train the V2 or V3 generator, replace config_v1.json with config_v2.json or config_v3.json.
Checkpoints and a copy of the configuration file are saved in the cp_hifigan directory by default.
You can change the path with the --checkpoint_path option.

Validation loss during training with the V1 generator:
[figure: validation loss curve]
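
Because a copy of the configuration file is stored next to the checkpoints, a trained generator can later be rebuilt from that directory alone. A minimal sketch, assuming the default cp_hifigan directory, a config copy named config.json as train.py writes it, and a generator checkpoint named g_<steps> (the step count below is illustrative):

import json
import torch
from env import AttrDict
from models import Generator

# Rebuild the generator from the config copy saved alongside the checkpoints.
with open("cp_hifigan/config.json") as f:
    h = AttrDict(json.load(f))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

generator = Generator(h).to(device)
state = torch.load("cp_hifigan/g_02500000", map_location=device)  # illustrative file name
generator.load_state_dict(state['generator'])
generator.eval()
generator.remove_weight_norm()   # as done in inference.py before synthesis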

Pretrained Model

You can also use the pretrained models we provide.
Download pretrained models
Details of each folder are as follows:

| Folder Name | Generator | Dataset | Fine-Tuned |
|---|---|---|---|
| LJ_V1 | V1 | LJSpeech | No |
| LJ_V2 | V2 | LJSpeech | No |
| LJ_V3 | V3 | LJSpeech | No |
| LJ_FT_T2_V1 | V1 | LJSpeech | Yes (Tacotron2) |
| LJ_FT_T2_V2 | V2 | LJSpeech | Yes (Tacotron2) |
| LJ_FT_T2_V3 | V3 | LJSpeech | Yes (Tacotron2) |
| VCTK_V1 | V1 | VCTK | No |
| VCTK_V2 | V2 | VCTK | No |
| VCTK_V3 | V3 | VCTK | No |
| UNIVERSAL_V1 | V1 | Universal | No |

We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.
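
To continue training from the universal checkpoint, point the --checkpoint_path option at a directory containing both the generator (g_*) and discriminator (do_*) files and train.py will resume from them. For illustration, a minimal sketch of restoring the discriminator weights directly, assuming the do_* checkpoint stores the multi-period and multi-scale discriminator state dicts under the 'mpd' and 'msd' keys as train.py saves them (the file name below is illustrative):

import torch
from models import MultiPeriodDiscriminator, MultiScaleDiscriminator

# Load the discriminator checkpoint that accompanies the universal generator.
state = torch.load("UNIVERSAL_V1/do_02500000", map_location="cpu")

mpd = MultiPeriodDiscriminator()
msd = MultiScaleDiscriminator()
mpd.load_state_dict(state['mpd'])
msd.load_state_dict(state['msd'])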

Fine-Tuning

  1. Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing.
    The file name of the generated mel-spectrogram should match the audio file and the extension should be .npy.
    Example:
    Audio File : LJ001-0001.wav
    Mel-Spectrogram File : LJ001-0001.npy
    
  2. Create an ft_dataset folder and copy the generated mel-spectrogram files into it (a minimal saving sketch follows this list).
  3. Run the following command.
    python train.py --fine_tuning True --config config_v1.json
    
    For other command line options, please refer to the training section.
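
A minimal sketch of steps 1 and 2, assuming you already have a teacher-forced mel-spectrogram from Tacotron2 as a numpy array of shape [num_mels, frames] (the helper below is illustrative, not part of either repository):

import os
import numpy as np

def save_mel_for_finetuning(wav_path, mel, out_dir="ft_dataset"):
    """Save a teacher-forced mel-spectrogram under a name matching the audio file."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(wav_path))[0]     # e.g. LJ001-0001
    np.save(os.path.join(out_dir, base + ".npy"), mel)         # -> ft_dataset/LJ001-0001.npy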

Inference from wav file

  1. Make a test_files directory and copy wav files into the directory.
  2. Run the following command.
    python inference.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files by default.
You can change the path with the --output_dir option.
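
The provided LJSpeech and VCTK models operate at 22,050 Hz (per the configs and the paper). If your test wav files use a different sampling rate, resampling them first avoids pitch and speed artifacts; a hedged sketch using librosa (already a dependency of the mel extraction code) and scipy:

import os
import librosa
import numpy as np
from scipy.io.wavfile import write

def prepare_wav(src_path, dst_dir="test_files", target_sr=22050):
    # Resample to the model's rate and write 16-bit PCM into the test_files directory.
    os.makedirs(dst_dir, exist_ok=True)
    y, _ = librosa.load(src_path, sr=target_sr)        # float32 in [-1, 1], resampled
    pcm = (y * 32767.0).astype(np.int16)
    write(os.path.join(dst_dir, os.path.basename(src_path)), target_sr, pcm)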

Inference for end-to-end speech synthesis

  1. Make a test_mel_files directory and copy generated mel-spectrogram files into the directory.
    You can generate mel-spectrograms using Tacotron2, Glow-TTS, and so forth.
  2. Run the following command.
    python inference_e2e.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files_from_mel by default.
You can change the path with the --output_dir option.

Acknowledgements

We referred to WaveGlow, MelGAN and Tacotron2 to implement this.