vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Quick Overview

The jaywalnut310/vits repository is the authors' PyTorch implementation of VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), a parallel end-to-end text-to-speech (TTS) model that generates natural-sounding speech directly from text.

Pros

High-Quality Speech Generation: In the authors' subjective evaluation on the LJ Speech dataset, VITS achieved a mean opinion score (MOS) comparable to ground-truth recordings and outperformed the best publicly available TTS systems.
Versatile and Customizable: The repository covers single-speaker (LJ Speech) and multi-speaker (VCTK) training, and can be trained on your own datasets after preprocessing.
Efficient and Fast Inference: The model samples in parallel rather than autoregressively, which makes real-time synthesis feasible on modern hardware.
Open-Source: The code is publicly available on GitHub, together with links to pretrained models and an interactive Colab demo.
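
One way to check the real-time claim in the list above on your own hardware is to measure the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where values below 1.0 mean faster than real time. The sketch below is a minimal, framework-independent helper. The `synthesize` argument is a hypothetical stand-in for whatever inference function you use, and the 22050 Hz default is an assumption based on the sampling rate the README uses for its datasets.

```python
import time


def real_time_factor(synthesize, text, sampling_rate=22050):
    """Return (rtf, audio) for a single call to `synthesize(text)`.

    `synthesize` is a hypothetical callable that returns a sequence of
    audio samples at `sampling_rate`; it is not part of the repository.
    """
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    duration = len(audio) / sampling_rate  # seconds of audio produced
    if duration == 0:
        raise ValueError("synthesize() returned no samples")
    return elapsed / duration, audio
```

For a stable number, average the RTF over several sentences and discard the first call, which often includes one-time warm-up costs.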

Cons

Complexity: Implementing and training the VITS model can be a complex and computationally intensive process, requiring significant expertise in machine learning and TTS systems.
Limited Pre-Trained Models: The authors provide pretrained models only for the datasets used in the README (LJ Speech and VCTK), so other languages or speaking styles require training your own model.
Dependency on PyTorch: The project is built using PyTorch, which may be a limitation for users who prefer other deep learning frameworks.
Hardware Sensitivity: Training and inference speed depend heavily on the available GPU and system configuration, so slower hardware may not reach real-time synthesis.

Code Examples

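import torch
import commons, utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

# Build the generator from the LJ Speech config and load a downloaded checkpoint
hps = utils.get_hparams_from_file("./configs/ljs_base.json")
net_g = SynthesizerTrn(len(symbols), hps.data.filter_length // 2 + 1,
                       hps.train.segment_size // hps.data.hop_length, **hps.model).eval()
utils.load_checkpoint("/path/to/pretrained_ljs.pth", net_g, None)

# Generate speech from text
ids = text_to_sequence("Hello, this is a sample text-to-speech output.", hps.data.text_cleaners)
ids = commons.intersperse(ids, 0) if hps.data.add_blank else ids
x = torch.LongTensor(ids).unsqueeze(0)
with torch.no_grad():
    audio = net_g.infer(x, torch.LongTensor([x.size(1)]), noise_scale=0.667, noise_scale_w=0.8)[0][0, 0].numpy()

This code follows the repository's inference.ipynb: it builds the SynthesizerTrn generator from configs/ljs_base.json, loads a pretrained checkpoint (the path is a placeholder for the file you downloaded), and synthesizes a waveform on the CPU. Run it from the repository root with the README's pre-requisites installed and the monotonic_align extension built.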
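
A synthesized waveform is typically a sequence of floats in the range [-1, 1], so it has to be written to disk before it can be played outside a notebook. The following is a minimal sketch using only the Python standard library. `save_wav` is a hypothetical helper, not part of the repository, and the 22050 Hz default is an assumption based on the sampling rate the README uses for its datasets.

```python
import struct
import wave


def save_wav(path, samples, sampling_rate=22050):
    """Write float samples in [-1.0, 1.0] as 16-bit mono PCM."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16 bit
        wav.setframerate(sampling_rate)
        wav.writeframes(bytes(frames))
```

Calling `save_wav("sample.wav", audio)` on the array returned by the model produces a file most audio players can open. For long outputs, a vectorized writer such as `scipy.io.wavfile.write` will be faster.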

# Build the Cython Monotonic Alignment Search extension (needed once)
cd monotonic_align
python setup.py build_ext --inplace
cd ..

# Train the single-speaker model on LJ Speech; -c selects the config, -m names the run
python train.py -c configs/ljs_base.json -m ljs_base

The repository does not ship a trainer class. Training is launched from the command line with train.py (single speaker) or train_ms.py (multi-speaker), as in the README's training example. For a custom dataset, prepare the filelists with preprocess.py first.

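import torch
import commons, utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols
# Build the multi-speaker generator from the VCTK config and load a downloaded checkpoint
hps = utils.get_hparams_from_file("./configs/vctk_base.json")
net_g = SynthesizerTrn(len(symbols), hps.data.filter_length // 2 + 1,
                       hps.train.segment_size // hps.data.hop_length,
                       n_speakers=hps.data.n_speakers, **hps.model).eval()
utils.load_checkpoint("/path/to/pretrained_vctk.pth", net_g, None)
# Generate speech from text in the voice of speaker ID 4
ids = text_to_sequence("This is another sample text-to-speech output.", hps.data.text_cleaners)
ids = commons.intersperse(ids, 0) if hps.data.add_blank else ids
x = torch.LongTensor(ids).unsqueeze(0)
with torch.no_grad():
    audio = net_g.infer(x, torch.LongTensor([x.size(1)]), sid=torch.LongTensor([4]))[0][0, 0].numpy()

This code demonstrates multi-speaker inference with a pretrained VCTK model, again following inference.ipynb. The checkpoint path is a placeholder, and the speaker ID passed as sid selects one of the voices the model was trained on.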

Getting Started

To get started with the jaywalnut310/vits project, follow these steps:

Clone the repository:

git clone https://github.com/jaywalnut310/vits.git

Install the required dependencies:

cd vits
pip install -r requirements.txt

Download a pre-trained VITS model:

Download a checkpoint from the pretrained models link in the README and note its local path. The repository has no from_pretrained helper, so the file has to be fetched manually.

Generate speech from text:

Build the Monotonic Alignment Search extension (see Pre-requisites in the README), then open inference.ipynb, point the checkpoint path at the downloaded file, and run the cells. Assuming Jupyter is installed:

jupyter notebook inference.ipynb

(Optional) Train the model yourself, for example on LJ Speech:

python train.py -c configs/ljs_base.json -m ljs_base

README

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Jaehyeon Kim, Jungil Kong, and Juhee Son

In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
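
For readers less familiar with the term, the "variational inference" in the abstract refers to maximizing a lower bound on the conditional log-likelihood of the audio. In generic conditional-VAE notation (an assumption made here for illustration, with $x$ the target speech, $c$ the text condition, and $z$ the latent variable), the bound has the form:

```latex
\log p_\theta(x \mid c) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[
  \log p_\theta(x \mid z)
  - \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid c)}
\right]
```

The first term rewards accurate reconstruction of the speech, and the second is the KL divergence between the approximate posterior and the text-conditioned prior. This is only the variational core: the full training objective includes further terms, such as those from the adversarial training and the duration predictor mentioned above, so refer to the paper for the exact formulation.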

Visit our demo for audio samples.

We also provide the pretrained models.

** Update note: Thanks to Rishikesh (ऋषिकेश), our interactive TTS demo is now available on Colab Notebook.

[Figures: VITS at training | VITS at inference]

Pre-requisites

1. Python >= 3.6
2. Clone this repository
3. Install Python requirements. Please refer to requirements.txt
    1. You may need to install espeak first: apt-get install espeak
4. Download datasets
    1. Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
    2. For the multi-speaker setting, download and extract the VCTK dataset, and downsample the wav files to 22050 Hz. Then rename or create a link to the dataset folder: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
5. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.

# Cython version of Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace
# Return to the repository root before running the other scripts
cd ..

# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK have been already provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt
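
For a custom dataset, the preprocessing step needs filelists like the ones named in the commands above. Judging from the `--text_index` values (1 for the `audio_text` lists, 2 for the `audio_sid_text` lists), each line holds delimiter-separated fields with the transcript in that column. The sketch below builds such lines. The `|` delimiter and both helper functions are assumptions for illustration, so compare the output against the provided files under filelists/ before relying on it.

```python
def make_filelist_line(wav_path, text, speaker_id=None, sep="|"):
    """Format one filelist entry.

    Single-speaker:  wav_path|text             (text at index 1)
    Multi-speaker:   wav_path|speaker_id|text  (text at index 2)
    """
    if sep in text or "\n" in text:
        raise ValueError("transcript must not contain the separator or newlines")
    if speaker_id is None:
        fields = [wav_path, text]
    else:
        fields = [wav_path, str(speaker_id), text]
    return sep.join(fields)


def write_filelist(path, entries):
    """Write (wav_path, text) or (wav_path, speaker_id, text) tuples to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            if len(entry) == 2:
                line = make_filelist_line(entry[0], entry[1])
            else:
                line = make_filelist_line(entry[0], entry[2], speaker_id=entry[1])
            f.write(line + "\n")
```

Rejecting transcripts that contain the separator keeps a stray character from silently shifting the text into the wrong column.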

Training Example

# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base

# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base
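
In the commands above, `-c` points at a JSON config and `-m` names the run. If a setting needs to change (for example, a smaller batch size on a single GPU), one cautious approach is to derive a new config file instead of editing the provided one. The sketch below does this with the standard library. The `derive_config` helper, the `train.batch_size` key, and the `configs/my_run.json` file name are assumptions for illustration, so check the key names against the config you are copying.

```python
import json


def derive_config(src_path, dst_path, overrides):
    """Copy a JSON config, applying {"section.key": value} overrides.

    Raises KeyError if an overridden key is not already present, so a
    typo does not silently create a new, unused setting.
    """
    with open(src_path, encoding="utf-8") as f:
        config = json.load(f)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node[name]
        if leaf not in node:
            raise KeyError(dotted_key)
        node[leaf] = value
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config
```

A run could then look like `derive_config("configs/ljs_base.json", "configs/my_run.json", {"train.batch_size": 16})` followed by `python train.py -c configs/my_run.json -m my_run`.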

Inference Example

See inference.ipynb
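
Before synthesis, the input text is converted to a sequence of integer symbol IDs. When the config enables blank insertion, a blank token is additionally placed between every pair of symbols and at both ends. The standalone sketch below illustrates that interleaving step. It is a re-implementation for illustration only, the blank ID of 0 is an assumption, and the notebook remains the reference for the actual pipeline.

```python
def intersperse(ids, blank=0):
    """Return `ids` with `blank` placed before, between, and after every element."""
    result = [blank] * (len(ids) * 2 + 1)
    result[1::2] = ids  # original IDs land on the odd positions
    return result


print(intersperse([5, 9, 2]))  # [0, 5, 0, 9, 0, 2, 0]
```

A sequence of n symbols therefore becomes 2n + 1 tokens, which is worth keeping in mind when comparing input lengths against model limits.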
