Real-Time-Voice-Cloning
Clone a voice in 5 seconds to generate arbitrary speech in real-time
Top Related Projects
- TTS (mozilla/TTS): :robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
- fairseq (facebookresearch/fairseq): Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
- tacotron2 (NVIDIA/tacotron2): Tacotron 2 - PyTorch implementation with faster-than-realtime inference
- wavenet_vocoder (r9y9/wavenet_vocoder): WaveNet vocoder
- dc_tts (Kyubyong/dc_tts): A TensorFlow Implementation of DC-TTS: yet another text-to-speech model
Quick Overview
Real-Time Voice Cloning is an open-source project that implements a deep learning framework for voice cloning. It allows users to clone a voice from a few seconds of audio and use it for text-to-speech synthesis in real time. The project is built from three main components: a speaker encoder, a synthesizer, and a vocoder.
Pros
- Achieves high-quality voice cloning with just a few seconds of audio input
- Supports real-time synthesis, making it suitable for interactive applications
- Provides a user-friendly interface for demonstration and testing
- Offers flexibility for researchers and developers to experiment and improve the model
Cons
- Requires significant computational resources for training and real-time synthesis
- May have ethical concerns regarding potential misuse of voice cloning technology
- Limited to the English language in its current implementation
- Requires careful fine-tuning and dataset preparation for optimal results
Code Examples
- Loading the pretrained models:
from pathlib import Path

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Point these paths at the downloaded pretrained weights
encoder.load_model(Path("encoder.pt"))
synthesizer = Synthesizer(Path("synthesizer.pt"))
vocoder.load_model(Path("vocoder.pt"))
- Encoding a speaker's voice:
from pathlib import Path

# Load a few seconds of reference audio, preprocess it, and compute
# the speaker embedding used to condition the synthesizer
reference_audio = Path("path/to/reference_audio.wav")
encoder_wav = encoder.preprocess_wav(reference_audio)
embed = encoder.embed_utterance(encoder_wav)
- Synthesizing and vocalizing new speech:
texts = ["Hello, this is a cloned voice.", "Voice cloning is amazing!"]
embeds = [embed] * len(texts)

# Generate mel spectrograms conditioned on the speaker embedding,
# then convert the first spectrogram to a waveform with the vocoder
specs = synthesizer.synthesize_spectrograms(texts, embeds)
generated_wav = vocoder.infer_waveform(specs[0])
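To keep or play back the result, the generated waveform can be written to disk. A minimal sketch, assuming the soundfile package is installed and that the synthesizer's output sample rate is exposed as Synthesizer.sample_rate (16 kHz by default in this project):

import soundfile as sf

# Save the generated waveform using the synthesizer's output sample rate
sf.write("cloned_speech.wav", generated_wav, Synthesizer.sample_rate)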
Getting Started
- Clone the repository:
git clone https://github.com/CorentinJ/Real-Time-Voice-Cloning.git
cd Real-Time-Voice-Cloning
- Install dependencies:
pip install -r requirements.txt
- Download pretrained models:
python download_pretrained_models.py
- Run the demo:
python demo_cli.py
Follow the prompts to provide a reference audio file and input text for voice cloning and synthesis.
Competitor Comparisons
:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Pros of TTS
- More comprehensive and feature-rich, offering a wider range of TTS models and voice synthesis techniques
- Better documentation and community support, making it easier for developers to integrate and customize
- Actively maintained by Mozilla, ensuring regular updates and improvements
Cons of TTS
- Requires more computational resources and setup time compared to Real-Time-Voice-Cloning
- Less focused on real-time voice cloning, which may be a drawback for specific use cases
- Steeper learning curve due to its broader scope and more complex architecture
Code Comparison
Real-Time-Voice-Cloning:
encoder = SpeakerEncoder("encoder/saved_models/pretrained.pt")
synthesizer = Synthesizer("synthesizer/saved_models/pretrained/pretrained.pt")
vocoder = WaveRNN("vocoder/saved_models/pretrained/pretrained.pt")
TTS:
from TTS.utils.synthesizer import Synthesizer
synthesizer = Synthesizer(
    tts_checkpoint="path/to/tts_model.pth",
    tts_config_path="path/to/tts_config.json",
    vocoder_checkpoint="path/to/vocoder_model.pth",
    vocoder_config="path/to/vocoder_config.json",
)
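For context, a hedged usage sketch of the TTS Synthesizer above; the tts() and save_wav() helpers are assumed from the TTS project's inference API, not from Real-Time-Voice-Cloning:

# Generate a waveform from text and write it to disk
# (method names per the TTS project's Synthesizer API; verify against your installed version)
wav = synthesizer.tts("Hello, this is a test sentence.")
synthesizer.save_wav(wav, "tts_output.wav")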
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- Broader scope: Supports a wide range of sequence-to-sequence tasks, including machine translation, text summarization, and speech recognition
- Extensive documentation and examples: Provides comprehensive guides and tutorials for various use cases
- Active development and community support: Regularly updated with new features and improvements
Cons of fairseq
- Steeper learning curve: Requires more in-depth knowledge of NLP and deep learning concepts
- Higher computational requirements: May need more powerful hardware for training and inference
- Less focused on voice cloning: Not specifically designed for real-time voice cloning tasks
Code Comparison
Real-Time-Voice-Cloning:
from encoder.params_model import model_embedding_size as speaker_embedding_size
from utils.argutils import print_args
from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder import inference as vocoder
fairseq:
from fairseq import checkpoint_utils, options, tasks, utils
from fairseq.data import encoders
from fairseq.token_generation_constraints import pack_constraints, unpack_constraints
from fairseq_cli.generate import get_symbols_to_strip_from_output
Both repositories provide powerful tools for speech and language processing, but they have different focuses. Real-Time-Voice-Cloning is specifically designed for voice cloning tasks, while fairseq offers a more versatile platform for various sequence-to-sequence tasks. The code snippets demonstrate the different import structures and functionalities of each project.
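To illustrate fairseq's broader sequence-to-sequence scope, here is a minimal sketch of loading one of its published translation models through torch.hub. The model name and options follow fairseq's own examples and are assumptions here; the weights (and tokenizer dependencies) are fetched on first use.

import torch

# Load a pretrained English-to-German transformer from fairseq's torch.hub
# entry points and translate a sentence (model name per fairseq's examples)
en2de = torch.hub.load("pytorch/fairseq", "transformer.wmt19.en-de.single_model",
                       tokenizer="moses", bpe="fastbpe")
en2de.eval()
print(en2de.translate("Voice cloning is amazing!"))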
Tacotron 2 - PyTorch implementation with faster-than-realtime inference
Pros of tacotron2
- Developed by NVIDIA, leveraging their expertise in deep learning and GPU optimization
- Focuses on high-quality speech synthesis with attention to prosody and naturalness
- Provides pre-trained models for immediate use and experimentation
Cons of tacotron2
- Limited to text-to-speech synthesis, not designed for voice cloning
- Requires more computational resources and training time
- Less user-friendly for non-technical users or quick prototyping
Code Comparison
Real-Time-Voice-Cloning:
encoder = VoiceEncoder()
embed = encoder.embed_utterance(wav)
specs = synthesizer.synthesize_spectrograms([text], [embed])
generated_wav = vocoder.infer_waveform(specs[0])
tacotron2:
text = torch.LongTensor(text_to_sequence(text, ['english_cleaners']))[None, :]
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(text)
audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)
Both repositories offer powerful speech synthesis capabilities, but Real-Time-Voice-Cloning is more focused on voice cloning and real-time applications, while tacotron2 emphasizes high-quality text-to-speech synthesis. Real-Time-Voice-Cloning provides a more user-friendly interface for quick voice cloning tasks, while tacotron2 offers a robust foundation for advanced speech synthesis research and development.
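The tacotron2 snippet above assumes that model, waveglow, and text_to_sequence are already set up. A hedged sketch of that setup via torch.hub, using entry points published in NVIDIA's DeepLearningExamples (names and the model_math option assumed from that repository; a CUDA GPU is expected):

import torch

# Load NVIDIA's pretrained Tacotron 2 and WaveGlow checkpoints
# (hub entry names per NVIDIA/DeepLearningExamples; downloads weights on first use)
tacotron2 = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_tacotron2", model_math="fp32")
waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_waveglow", model_math="fp32")
tacotron2 = tacotron2.to("cuda").eval()
waveglow = waveglow.to("cuda").eval()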
WaveNet vocoder
Pros of wavenet_vocoder
- Focuses specifically on WaveNet-based vocoder implementation
- Provides a lightweight and modular codebase
- Supports multiple datasets and languages
Cons of wavenet_vocoder
- Limited to vocoder functionality, not a complete voice cloning solution
- Requires more technical expertise to use and integrate
- Less active development and community support
Code Comparison
wavenet_vocoder:
def _assert_tensor_shape(x, shape):
    assert x.shape == shape, (
        "Shape of tensor {} does not match expected shape {}".format(
            x.shape, shape))
Real-Time-Voice-Cloning:
def load_preprocess_wav(fpath):
    wav = librosa.load(fpath, sr=sampling_rate)[0]
    if len(wav.shape) == 2:
        wav = wav.mean(-1)
    return wav
The wavenet_vocoder code focuses on tensor shape validation, while Real-Time-Voice-Cloning includes audio preprocessing functionality. This reflects the different scopes of the projects, with wavenet_vocoder being more focused on the vocoder implementation and Real-Time-Voice-Cloning providing a more comprehensive voice cloning solution.
A TensorFlow Implementation of DC-TTS: yet another text-to-speech model
Pros of dc_tts
- Simpler architecture, potentially easier to understand and implement
- Faster inference time due to its lightweight design
- Focuses specifically on text-to-speech, which may be beneficial for certain use cases
Cons of dc_tts
- Limited to text-to-speech functionality, lacking voice cloning capabilities
- May produce less natural-sounding speech compared to more advanced models
- Requires pre-trained model weights, which might not be as flexible for custom voices
Code Comparison
dc_tts:
def text2mel(text):
    cleaner_names = [x.strip() for x in hparams.cleaners.split(',')]
    seq = text_to_sequence(text, cleaner_names)
    mel = np.zeros((len(seq), hparams.n_mels), dtype=np.float32)
    return mel
Real-Time-Voice-Cloning:
def synthesize_spectrograms(texts, embeddings, return_alignments=False):
    specs = []
    for text, embed in zip(texts, embeddings):
        spec = synthesizer.synthesize_spectrograms([text], [embed])[0]
        specs.append(spec)
    return specs
The code snippets demonstrate the different approaches:
- dc_tts focuses on converting text to mel spectrograms
- Real-Time-Voice-Cloning uses embeddings for voice cloning and synthesis
README
Real-Time Voice Cloning
This repository is an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real time. This was my master's thesis.
SV2TTS is a deep learning framework in three stages. In the first stage, one creates a digital representation of a voice from a few seconds of audio. In the second and third stages, this representation is used as a reference to generate speech given arbitrary text.
Video demonstration (linked in the original repository).
Papers implemented
URL | Designation | Title | Implementation source |
---|---|---|---|
1806.04558 | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | This repo |
1802.08435 | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | fatchord/WaveRNN |
1703.10135 | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | fatchord/WaveRNN |
1710.10467 | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | This repo |
Heads up
Like everything else in Deep Learning, this repo has quickly gotten old. Many SaaS apps (often paid) will give you better audio quality than this repository will. If you wish for an open-source solution with high voice quality:
- Check out paperswithcode for other repositories and recent research in the field of speech synthesis.
- Check out CoquiTTS for a repository with better voice cloning quality and more functionality.
- Check out MetaVoice-1B for a large voice model with high voice quality.
Setup
1. Install Requirements
- Both Windows and Linux are supported. A GPU is recommended for training and for inference speed, but is not mandatory.
- Python 3.7 is recommended. Python 3.5 or greater should work, but you'll probably have to tweak the dependencies' versions. I recommend setting up a virtual environment using venv, but this is optional.
- Install ffmpeg. This is necessary for reading audio files.
- Install PyTorch. Pick the latest stable version, your operating system, your package manager (pip by default) and finally pick any of the proposed CUDA versions if you have a GPU, otherwise pick CPU. Run the given command.
- Install the remaining requirements with pip install -r requirements.txt
2. (Optional) Download Pretrained Models
Pretrained models are now downloaded automatically. If this doesn't work for you, you can manually download them here.
3. (Optional) Test Configuration
Before you download any dataset, you can begin by testing your configuration with:
python demo_cli.py
If all tests pass, you're good to go.
4. (Optional) Download Datasets
For playing with the toolbox alone, I only recommend downloading LibriSpeech/train-clean-100. Extract the contents as <datasets_root>/LibriSpeech/train-clean-100, where <datasets_root> is a directory of your choosing. Other datasets are supported in the toolbox, see here. You're free not to download any dataset, but then you will need your own data as audio files or you will have to record it with the toolbox.
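A tiny sanity check before launching the toolbox, assuming the layout described above (the path below is a placeholder for your own <datasets_root>):

from pathlib import Path

# Placeholder: replace with your actual <datasets_root> directory
datasets_root = Path("path/to/datasets_root")
librispeech = datasets_root / "LibriSpeech" / "train-clean-100"
print("Dataset found at" if librispeech.is_dir() else "Dataset missing at", librispeech)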
5. Launch the Toolbox
You can then try the toolbox:
python demo_toolbox.py -d <datasets_root>
or
python demo_toolbox.py
depending on whether you downloaded any datasets. If you are running an X-server or if you have the error Aborted (core dumped), see this issue.