
jik876/hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis


Top Related Projects

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch

WaveRNN Vocoder + TTS

Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Quick Overview

HiFi-GAN is a high-fidelity generative adversarial network for efficient and high-quality speech synthesis. It is designed to generate high-quality audio waveforms from mel-spectrograms, making it suitable for text-to-speech applications and other audio generation tasks. The project aims to provide a fast and efficient alternative to existing vocoder models while maintaining high audio quality.

Pros

  • High-quality audio generation with fast inference speed
  • Efficient architecture, suitable for real-time applications
  • Supports both single-speaker and multi-speaker models
  • Well-documented and easy to use

Cons

  • Requires significant computational resources for training
  • May struggle with certain types of audio, such as music or complex environmental sounds
  • Limited to generating audio from mel-spectrograms, not directly from text
  • Potential for mode collapse or other GAN-related issues during training

Code Examples

  1. Loading a pre-trained HiFi-GAN model (the generator is built from the JSON config distributed with each checkpoint):

import json
import torch
from env import AttrDict
from models import Generator

with open("config.json") as f:              # config file distributed with the checkpoint
    h = AttrDict(json.load(f))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Generator(h).to(device)
checkpoint_dict = torch.load("generator_v1", map_location=device)
model.load_state_dict(checkpoint_dict['generator'])
  2. Generating audio from a mel-spectrogram, where x is a mel-spectrogram tensor of shape [batch, num_mels, frames] (a sketch for writing the result to a .wav file follows these examples):

from meldataset import MAX_WAV_VALUE        # 32768.0, for 16-bit PCM output

with torch.no_grad():
    y_g_hat = model(x)                      # waveform, shape [batch, 1, samples]
    audio = y_g_hat.squeeze()
    audio = audio * MAX_WAV_VALUE
    audio = audio.cpu().numpy().astype('int16')
  3. Training a HiFi-GAN model. Training is driven by a JSON configuration file (e.g. config_v1.json) and launched through the train.py script rather than a Python API. An excerpt of the training-related keys:

{
    "resblock": "1",
    "num_gpus": 1,
    "batch_size": 16,
    "learning_rate": 0.0002,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999,
    "seed": 1234,
    "num_workers": 4
}

python train.py --config config_v1.json
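
As a follow-up to example 2, the int16 samples can be written out as a playable .wav file, as the repository's inference script does with scipy. A minimal sketch, assuming the audio array and the config object h (with its sampling_rate field, 22050 for the LJSpeech models) from the examples above:

from scipy.io.wavfile import write

# `audio` is the int16 numpy array from example 2; h.sampling_rate comes from the
# config used to build the generator (22050 Hz for the provided LJSpeech models).
write("generated.wav", h.sampling_rate, audio)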

Getting Started

To get started with HiFi-GAN:

  1. Clone the repository:

    git clone https://github.com/jik876/hifi-gan.git
    cd hifi-gan
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download pre-trained models or prepare your data for training.

  4. Use the provided scripts for inference or training:

    python inference.py --checkpoint_file [PATH_TO_CHECKPOINT] --input_wavs_dir [DIRECTORY_PATH]
    

    or

    python train.py --config config_v1.json
    

Refer to the repository's README for more detailed instructions on data preparation, training, and inference.
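
If you need to create the mel-spectrogram inputs yourself rather than exporting them from a TTS model, the repository's meldataset module exposes the same mel extraction used during training. A minimal sketch, assuming the analysis parameters from config_v1.json (80 mel bands, 1024-point FFT, hop size 256, 22,050 Hz, fmax 8000):

import torch
import librosa
from meldataset import mel_spectrogram   # provided by this repository

# Load a wav at the model's sampling rate and convert it to the mel representation
# the generator expects. Parameter values follow config_v1.json.
wav, sr = librosa.load("example.wav", sr=22050)
wav = torch.FloatTensor(wav).unsqueeze(0)                 # shape [1, samples]
mel = mel_spectrogram(wav, n_fft=1024, num_mels=80, sampling_rate=22050,
                      hop_size=256, win_size=1024, fmin=0, fmax=8000)
# mel has shape [1, 80, frames] and can be passed to the generator as x.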

Competitor Comparisons

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch

Pros of ParallelWaveGAN

  • Supports multi-speaker and multi-language models
  • Includes various vocoder architectures (e.g., MelGAN, Multi-band MelGAN)
  • Provides comprehensive documentation and examples

Cons of ParallelWaveGAN

  • Generally slower inference speed compared to HiFi-GAN
  • May require more computational resources for training
  • Less optimized for real-time applications

Code Comparison

ParallelWaveGAN:

model = ParallelWaveGANGenerator()
optimizer = RAdam(model.parameters())
criterion = MultiResolutionSTFTLoss()

HiFi-GAN:

model = Generator(h)
mpd = MultiPeriodDiscriminator()
msd = MultiScaleDiscriminator()

Both repositories provide high-quality vocoders for text-to-speech applications. ParallelWaveGAN offers more flexibility with multiple architectures and more extensive documentation, while HiFi-GAN focuses on fast inference and real-time use. Note that the snippets above are not equivalent: the ParallelWaveGAN excerpt shows a generator with its optimizer and multi-resolution STFT loss, while the HiFi-GAN excerpt instantiates the generator together with its multi-period and multi-scale discriminators. Choose based on your specific requirements for speed, quality, and supported features.

2,124

WaveRNN Vocoder + TTS

Pros of WaveRNN

  • More flexible architecture, allowing for various configurations and modifications
  • Autoregressive sampling can produce very natural audio, especially for complex waveforms
  • Better suited for low-resource environments due to its recurrent nature

Cons of WaveRNN

  • Slower inference time compared to HiFi-GAN
  • Requires more computational resources for training
  • Less efficient for real-time applications

Code Comparison

WaveRNN:

class WaveRNN(nn.Module):
    def __init__(self, rnn_dims, fc_dims, bits, pad, upsample_factors, feat_dims, compute_dims, res_out_dims, res_blocks):
        super().__init__()
        self.res_blocks = res_blocks
        self.pad = pad
        self.n_classes = 2**bits
        self.rnn_dims = rnn_dims
        self.aux_dims = res_out_dims // 4
        self.upsample = UpsampleNetwork(upsample_factors, feat_dims)

HiFi-GAN:

class Generator(torch.nn.Module):
    def __init__(self, h):
        super(Generator, self).__init__()
        self.h = h
        self.num_kernels = len(h.resblock_kernel_sizes)
        self.num_upsamples = len(h.upsample_rates)
        self.conv_pre = weight_norm(Conv1d(80, h.upsample_initial_channel, 7, 1, padding=3))
        resblock = ResBlock1 if h.resblock == '1' else ResBlock2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Pros of Tacotron2

  • Comprehensive end-to-end text-to-speech system
  • Includes both text-to-mel-spectrogram and vocoder components
  • Backed by NVIDIA, with extensive documentation and support

Cons of Tacotron2

  • Older architecture, may not represent the latest advancements in TTS
  • Can be more complex to set up and train compared to HiFi-GAN

Code Comparison

Tacotron2 (model definition):

class Tacotron2(nn.Module):
    def __init__(self, hparams):
        super(Tacotron2, self).__init__()
        self.mask_padding = hparams.mask_padding
        self.fp16_run = hparams.fp16_run
        self.n_mel_channels = hparams.n_mel_channels
        self.n_frames_per_step = hparams.n_frames_per_step
        self.embedding = nn.Embedding(hparams.n_symbols, hparams.symbols_embedding_dim)

HiFi-GAN (model definition):

class Generator(torch.nn.Module):
    def __init__(self, h):
        super(Generator, self).__init__()
        self.h = h
        self.num_kernels = len(h.resblock_kernel_sizes)
        self.num_upsamples = len(h.upsample_rates)
        self.conv_pre = weight_norm(Conv1d(80, h.upsample_initial_channel, 7, 1, padding=3))
        resblock = ResBlock1 if h.resblock == '1' else ResBlock2

Both repositories provide implementations of text-to-speech models, but they focus on different aspects of the TTS pipeline. Tacotron2 offers a complete end-to-end solution, while HiFi-GAN specializes in high-fidelity audio generation from mel-spectrograms.

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pros of Real-Time-Voice-Cloning

  • Offers end-to-end voice cloning capabilities, including speaker encoding, synthesis, and vocoding
  • Provides a user-friendly interface for real-time voice cloning demonstrations
  • Includes pre-trained models for immediate use

Cons of Real-Time-Voice-Cloning

  • May require more computational resources due to its comprehensive approach
  • Less focused on specific audio quality improvements compared to HiFi-GAN
  • Potentially more complex to integrate into existing projects

Code Comparison

Real-Time-Voice-Cloning:

def load_model(checkpoint_path):
    model = SpeakerEncoder()
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint["model_state"])
    return model

HiFi-GAN:

def load_model(filepath):
    checkpoint = torch.load(filepath)
    generator = Generator(h).to(device)
    generator.load_state_dict(checkpoint['generator'])
    return generator

Both repositories use PyTorch for model loading, but Real-Time-Voice-Cloning loads a speaker-encoder model, while HiFi-GAN loads a generator for high-fidelity waveform generation. HiFi-GAN is a specialized neural vocoder, whereas Real-Time-Voice-Cloning offers a broader end-to-end voice cloning pipeline.


README

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our paper, we proposed HiFi-GAN: a GAN-based model capable of generating high fidelity speech efficiently.
We provide our implementation and pretrained models as open source in this repository.

Abstract: Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.

Visit our demo website for audio samples.

Pre-requisites

  1. Python >= 3.6
  2. Clone this repository.
  3. Install Python requirements. Please refer to requirements.txt.
  4. Download and extract the LJ Speech dataset, then move all wav files to LJSpeech-1.1/wavs.

Training

python train.py --config config_v1.json

To train the V2 or V3 generator, replace config_v1.json with config_v2.json or config_v3.json.
Checkpoints and a copy of the configuration file are saved in the cp_hifigan directory by default.
You can change the path with the --checkpoint_path option.

Validation loss during training with the V1 generator:
[figure: validation loss curve]
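
Because a copy of the configuration file is stored next to the checkpoints, a trained generator can later be rebuilt from that directory alone. A minimal sketch, assuming the default cp_hifigan directory, a config copy named config.json as train.py writes it, and a generator checkpoint named g_<steps> (the step count below is illustrative):

import json
import torch
from env import AttrDict
from models import Generator

# Rebuild the generator from the config copy saved alongside the checkpoints.
with open("cp_hifigan/config.json") as f:
    h = AttrDict(json.load(f))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

generator = Generator(h).to(device)
state = torch.load("cp_hifigan/g_02500000", map_location=device)  # illustrative file name
generator.load_state_dict(state['generator'])
generator.eval()
generator.remove_weight_norm()   # as done in inference.py before synthesis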

Pretrained Model

You can also use the pretrained models we provide.
Download pretrained models
Details of each folder are as follows:

| Folder Name | Generator | Dataset | Fine-Tuned |
|---|---|---|---|
| LJ_V1 | V1 | LJSpeech | No |
| LJ_V2 | V2 | LJSpeech | No |
| LJ_V3 | V3 | LJSpeech | No |
| LJ_FT_T2_V1 | V1 | LJSpeech | Yes (Tacotron2) |
| LJ_FT_T2_V2 | V2 | LJSpeech | Yes (Tacotron2) |
| LJ_FT_T2_V3 | V3 | LJSpeech | Yes (Tacotron2) |
| VCTK_V1 | V1 | VCTK | No |
| VCTK_V2 | V2 | VCTK | No |
| VCTK_V3 | V3 | VCTK | No |
| UNIVERSAL_V1 | V1 | Universal | No |

We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.
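
To continue training from the universal checkpoint, point the --checkpoint_path option at a directory containing both the generator (g_*) and discriminator (do_*) files and train.py will resume from them. For illustration, a minimal sketch of restoring the discriminator weights directly, assuming the do_* checkpoint stores the multi-period and multi-scale discriminator state dicts under the 'mpd' and 'msd' keys as train.py saves them (the file name below is illustrative):

import torch
from models import MultiPeriodDiscriminator, MultiScaleDiscriminator

# Load the discriminator checkpoint that accompanies the universal generator.
state = torch.load("UNIVERSAL_V1/do_02500000", map_location="cpu")

mpd = MultiPeriodDiscriminator()
msd = MultiScaleDiscriminator()
mpd.load_state_dict(state['mpd'])
msd.load_state_dict(state['msd'])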

Fine-Tuning

  1. Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing.
    The file name of the generated mel-spectrogram should match the audio file and the extension should be .npy.
    Example:
    Audio File : LJ001-0001.wav
    Mel-Spectrogram File : LJ001-0001.npy
    
  2. Create an ft_dataset folder and copy the generated mel-spectrogram files into it (a minimal saving sketch follows this list).
  3. Run the following command.
    python train.py --fine_tuning True --config config_v1.json
    
    For other command line options, please refer to the training section.
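
A minimal sketch of steps 1 and 2, assuming you already have a teacher-forced mel-spectrogram from Tacotron2 as a numpy array of shape [num_mels, frames] (the helper below is illustrative, not part of either repository):

import os
import numpy as np

def save_mel_for_finetuning(wav_path, mel, out_dir="ft_dataset"):
    """Save a teacher-forced mel-spectrogram under a name matching the audio file."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(wav_path))[0]     # e.g. LJ001-0001
    np.save(os.path.join(out_dir, base + ".npy"), mel)         # -> ft_dataset/LJ001-0001.npy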

Inference from wav file

  1. Make a test_files directory and copy wav files into the directory.
  2. Run the following command.
    python inference.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files by default.
You can change the path with the --output_dir option.
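
The provided LJSpeech and VCTK models operate at 22,050 Hz (per the configs and the paper). If your test wav files use a different sampling rate, resampling them first avoids pitch and speed artifacts; a hedged sketch using librosa (already a dependency of the mel extraction code) and scipy:

import os
import librosa
import numpy as np
from scipy.io.wavfile import write

def prepare_wav(src_path, dst_dir="test_files", target_sr=22050):
    # Resample to the model's rate and write 16-bit PCM into the test_files directory.
    os.makedirs(dst_dir, exist_ok=True)
    y, _ = librosa.load(src_path, sr=target_sr)        # float32 in [-1, 1], resampled
    pcm = (y * 32767.0).astype(np.int16)
    write(os.path.join(dst_dir, os.path.basename(src_path)), target_sr, pcm)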

Inference for end-to-end speech synthesis

  1. Make a test_mel_files directory and copy generated mel-spectrogram files into the directory.
    You can generate mel-spectrograms using Tacotron2, Glow-TTS, and so forth.
  2. Run the following command.
    python inference_e2e.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files_from_mel by default.
You can change the path with the --output_dir option.

Acknowledgements

We referred to WaveGlow, MelGAN and Tacotron2 to implement this.