vits
VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Quick Overview
The jaywalnut310/vits repository is a PyTorch implementation of VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), an end-to-end text-to-speech (TTS) model that generates natural-sounding speech directly from text input.
Pros
- High-Quality Speech Generation: In the authors' subjective evaluation (mean opinion score) on LJ Speech, VITS outperformed the best publicly available TTS systems and scored comparably to ground-truth recordings.
- Versatile and Customizable: The model can be fine-tuned on various datasets, allowing for the creation of custom voices and speaking styles.
- Efficient and Fast Inference: The model is designed for efficient inference, enabling real-time speech generation on modern hardware.
- Open-Source: The code is publicly available on GitHub, and the authors also provide pretrained models, a demo page with audio samples, and an interactive Colab notebook.
Cons
- Complexity: Implementing and training the VITS model can be a complex and computationally intensive process, requiring significant expertise in machine learning and TTS systems.
- Limited Pre-Trained Models: The README covers only two datasets, LJ Speech (single speaker) and VCTK (multi-speaker), so other languages or speaking styles require training on your own data.
- Dependency on PyTorch: The project is built using PyTorch, which may be a limitation for users who prefer other deep learning frameworks.
- Potential Performance Issues: Depending on the hardware and system configuration, the model may experience performance issues during inference or training.
Code Examples
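# NOTE: illustrative pseudocode. The README documents no `vits` Python package,
# so VITSModel, from_pretrained and generate_speech are hypothetical names.
from vits.model import VITSModel
# Load a pre-trained VITS model (hypothetical API)
model = VITSModel.from_pretrained("jaywalnut310/vits-ljspeech")
# Generate speech from text
text = "Hello, this is a sample text-to-speech output."
audio = model.generate_speech(text)
This sketch shows the intended workflow: load a pre-trained VITS model, then synthesize audio from a text string. The class and method names are placeholders rather than the repository's API; the README's own inference example is the inference.ipynb notebook.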
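# NOTE: illustrative pseudocode. LJSpeechDataset and VITSTrainer are hypothetical
# names; the README trains with train.py (LJ Speech) and train_ms.py (VCTK).
from vits.dataset import LJSpeechDataset
from vits.trainer import VITSTrainer
# Load the LJSpeech dataset
dataset = LJSpeechDataset("path/to/ljspeech")
# Create a VITS trainer and train the model
trainer = VITSTrainer(dataset, model_path="path/to/save/model")
trainer.train(num_epochs=100)
This sketch shows the general shape of training VITS on a dataset such as LJ Speech. The VITSTrainer class is a placeholder: the commands documented in the README are python train.py -c configs/ljs_base.json -m ljs_base for LJ Speech and python train_ms.py -c configs/vctk_base.json -m vctk_base for VCTK.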
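# NOTE: illustrative pseudocode. VITSInference is a hypothetical wrapper class,
# not a module documented in the README.
from vits.inference import VITSInference
# Create an inference interface
inference = VITSInference("path/to/vits/model")
# Generate speech from text
text = "This is another sample text-to-speech output."
audio = inference.generate_speech(text)
This sketch shows how an inference wrapper around a pre-trained VITS checkpoint could look. VITSInference is a placeholder name; for working inference code, see inference.ipynb in the repository.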
Getting Started
To get started with the jaywalnut310/vits project, follow these steps:
- Clone the repository:
git clone https://github.com/jaywalnut310/vits.git
- Install the required dependencies:
cd vits
pip install -r requirements.txt
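- Download a pre-trained VITS model (the README links pretrained checkpoints; the calls below are illustrative, since the README documents no vits Python package):
from vits.model import VITSModel  # hypothetical module name
model = VITSModel.from_pretrained("jaywalnut310/vits-ljspeech")
- Generate speech from text (the README's own inference example is inference.ipynb):
text = "Hello, this is a sample text-to-speech output."
audio = model.generate_speech(text)
- (Optional) Fine-tune the VITS model on a custom dataset (the README's training entry points are train.py and train_ms.py; the classes below are hypothetical):
from vits.dataset import LJSpeechDataset
from vits.trainer import VITSTrainer
dataset = LJSpeechDataset("path/to/ljspeech")
trainer = VITSTrainer(dataset, model_path="path/to/save/model")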
trainer
README
VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Jaehyeon Kim, Jungil Kong, and Juhee Son
In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
Visit our demo for audio samples.
We also provide the pretrained models.
** Update note: Thanks to Rishikesh (ऋषिकेश), our interactive TTS demo is now available on Colab Notebook.
[Figures: VITS at training; VITS at inference]
Pre-requisites
- Python >= 3.6
- Clone this repository
- Install Python requirements. Please refer to requirements.txt
- You may need to install espeak first:
apt-get install espeak
- Download datasets
- Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder:
ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
- For the multi-speaker setting, download and extract the VCTK dataset, and downsample the wav files to 22050 Hz. Then rename or create a link to the dataset folder:
ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
- Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace
# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK have been already provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt
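Before preprocessing or training on your own data, it can help to confirm that every audio file named in a filelist actually exists behind the DUMMY1/DUMMY2 links and has the expected 22050 Hz sample rate. The following is a minimal sketch using only the Python standard library, not part of the repository. It assumes filelist rows are `|`-separated with the audio path in the first column and the text in the column given by `--text_index` (an inference from names such as ljs_audio_text_train_filelist.txt), and that the wav files are plain PCM. The helper names `read_filelist` and `find_audio_problems` are hypothetical.

```python
import wave
from pathlib import Path


def read_filelist(path, text_index, sep="|"):
    """Return (audio_path, text) pairs from a filelist.

    Assumption: one row per line, fields separated by `sep`, audio path
    in column 0 and text in column `text_index`.
    """
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            fields = line.split(sep)
            if len(fields) <= text_index:
                raise ValueError(
                    f"{path}:{line_no}: expected at least {text_index + 1} fields"
                )
            rows.append((fields[0], fields[text_index]))
    return rows


def find_audio_problems(rows, root=".", expected_rate=22050):
    """Return (audio_path, reason) for missing, unreadable, or wrong-rate files."""
    problems = []
    for audio_path, _text in rows:
        full_path = Path(root) / audio_path
        if not full_path.is_file():
            problems.append((audio_path, "missing"))
            continue
        try:
            with wave.open(str(full_path), "rb") as wav:
                rate = wav.getframerate()
        except wave.Error as exc:
            # The wave module only reads uncompressed PCM files.
            problems.append((audio_path, f"unreadable: {exc}"))
            continue
        if rate != expected_rate:
            problems.append((audio_path, f"sample rate {rate} Hz"))
    return problems
```

Run from the repository root, something like `find_audio_problems(read_filelist("filelists/vctk_audio_sid_text_train_filelist.txt", text_index=2))` would list any VCTK files that still need downsampling or are missing from the DUMMY2 link.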
Training Example
# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base
# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base
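Both commands take a JSON config via -c and a model name via -m. When experimenting (for example, with a smaller batch size on a single GPU), one option is to derive a variant config rather than edit the shipped file. The sketch below is a generic JSON helper, not part of the repository. The function name `write_config_variant` is hypothetical, and the dotted key `train.batch_size` used in the note afterwards is an assumption about the config layout that should be checked against configs/ljs_base.json.

```python
import json
from pathlib import Path


def write_config_variant(base_path, out_path, overrides):
    """Copy a JSON config to `out_path`, replacing selected nested values.

    `overrides` maps dotted keys (e.g. "train.batch_size") to new values.
    Only keys that already exist in the base config may be overridden, so
    a misspelled key fails loudly instead of being silently ignored.
    """
    config = json.loads(Path(base_path).read_text(encoding="utf-8"))
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node[name]  # raises KeyError for an unknown section
        if leaf not in node:
            raise KeyError(f"{dotted_key!r} is not present in {base_path}")
        node[leaf] = value
    Path(out_path).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Under those assumptions, `write_config_variant("configs/ljs_base.json", "configs/ljs_small.json", {"train.batch_size": 16})` followed by `python train.py -c configs/ljs_small.json -m ljs_small` would start a run from the modified settings while leaving the original config untouched.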
Inference Example
See inference.ipynb
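For readers who want a script rather than a notebook, the following sketch outlines single-speaker synthesis as recalled from the repository's notebook. Treat it as an approximation to verify against inference.ipynb, not as the authors' reference code. The module, class, and function names (`commons`, `utils`, `SynthesizerTrn`, `text_to_sequence`, `symbols`) and the default noise and length scales are recalled, not quoted from the README. The wrapper name `synthesize` is hypothetical. It must be run from the repository root, after the Monotonic Alignment Search extension has been built.

```python
def synthesize(text, config_path, checkpoint_path, device="cpu",
               noise_scale=0.667, noise_scale_w=0.8, length_scale=1.0):
    """Synthesize `text` with a single-speaker VITS checkpoint.

    Returns (waveform as a NumPy array, sampling rate in Hz).
    Imports are local so this file can be loaded outside the repository.
    """
    import torch

    # Repository-local modules (assumed names; check inference.ipynb).
    import commons
    import utils
    from models import SynthesizerTrn
    from text import text_to_sequence
    from text.symbols import symbols

    hps = utils.get_hparams_from_file(config_path)

    # Build the generator with sizes derived from the config.
    net_g = SynthesizerTrn(
        len(symbols),
        hps.data.filter_length // 2 + 1,
        hps.train.segment_size // hps.data.hop_length,
        **hps.model,
    ).to(device)
    net_g.eval()
    utils.load_checkpoint(checkpoint_path, net_g, None)

    # Convert text to symbol ids, optionally interleaving blank tokens.
    ids = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        ids = commons.intersperse(ids, 0)
    x = torch.LongTensor(ids).unsqueeze(0).to(device)
    x_lengths = torch.LongTensor([x.size(1)]).to(device)

    with torch.no_grad():
        audio = net_g.infer(
            x,
            x_lengths,
            noise_scale=noise_scale,
            noise_scale_w=noise_scale_w,
            length_scale=length_scale,
        )[0][0, 0]
    return audio.cpu().float().numpy(), hps.data.sampling_rate
```

A call such as `synthesize("VITS is awesome!", "configs/ljs_base.json", "/path/to/pretrained_ljs.pth")` would then return a waveform that can be written to disk with any WAV writer. Multi-speaker (VCTK) checkpoints additionally need a speaker id, which this sketch does not handle.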