
Rudrabha/Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs.


Top Related Projects

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.


🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

Quick Overview

Wav2Lip is an open-source project for lip-syncing videos to match a given audio input. It uses deep learning to generate realistic lip movements for any talking-face video, synchronizing them with the provided audio track. This technology has applications in dubbing, video production, and virtual avatars.

Pros

  • High-quality lip-sync results with minimal artifacts
  • Works on a wide range of face types and orientations
  • Supports both image sequences and video inputs
  • Includes a pre-trained model for immediate use

Cons

  • Requires significant computational resources for optimal performance
  • May struggle with extreme facial expressions or unusual lighting conditions
  • Limited to lip movements only; doesn't modify other facial features
  • Potential ethical concerns regarding deepfake technology

Code Examples

The snippets below are illustrative pseudocode rather than the repository's actual API: Wav2Lip is driven by scripts such as inference.py, so helpers like load_model, load_video, and Wav2LipTrainer stand in for wrappers you would write around that code.

# Basic lip-syncing (load_model, load_video, load_audio, and save_video are
# illustrative helper functions, not part of the repository)
model = load_model('path/to/wav2lip_model.pth')

# Prepare the input video and audio
video = load_video('input_video.mp4')
audio = load_audio('input_audio.wav')

# Generate the lip-synced video and save the result
synced_video = model.predict(video, audio)
save_video(synced_video, 'output_video.mp4')

# Fine-tune the model on custom data (CustomDataset and Wav2LipTrainer are
# illustrative wrappers around the training scripts)
dataset = CustomDataset('path/to/dataset')
trainer = Wav2LipTrainer(model, dataset)
trainer.train(epochs=10, batch_size=32)

# Process a batch of videos (process_video is an illustrative helper)
video_paths = ['video1.mp4', 'video2.mp4', 'video3.mp4']
audio_paths = ['audio1.wav', 'audio2.wav', 'audio3.wav']

for video_path, audio_path in zip(video_paths, audio_paths):
    synced_video = process_video(model, video_path, audio_path)
    save_video(synced_video, f'synced_{video_path}')

Getting Started

  1. Clone the repository:

    git clone https://github.com/Rudrabha/Wav2Lip.git
    cd Wav2Lip
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download the pre-trained model:

    wget https://github.com/Rudrabha/Wav2Lip/releases/download/v1.0/wav2lip.pth
    
  4. Run inference on a video:

    python inference.py --checkpoint_path wav2lip.pth --face video.mp4 --audio input_audio.wav
    

Competitor Comparisons

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Pros of StyleTTS2

  • Generates more natural-sounding speech with improved prosody and intonation
  • Offers greater control over voice style and emotion in synthesized speech
  • Supports multi-speaker voice cloning with fewer reference samples

Cons of StyleTTS2

  • Requires more computational resources for training and inference
  • May have longer processing times compared to Wav2Lip
  • Less focused on lip-syncing capabilities, which is Wav2Lip's primary function

Code Comparison

StyleTTS2:

output = model.infer(text, speaker_embedding, style_embedding)

Wav2Lip:

mel = audio.melspectrogram(wav)
result = model(mel, face)

StyleTTS2 focuses on generating speech from text with style control, while Wav2Lip primarily handles lip-syncing existing audio to video. The code snippets reflect these different purposes, with StyleTTS2 taking text and embeddings as input, and Wav2Lip processing audio and facial data.

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pros of Real-Time-Voice-Cloning

  • Focuses on voice cloning and synthesis, allowing for the creation of new speech from text input
  • Provides real-time capabilities, enabling on-the-fly voice generation
  • Offers a more comprehensive solution for voice-related tasks, including speaker encoding and synthesis

Cons of Real-Time-Voice-Cloning

  • Limited to audio-only output, lacking visual components or lip-syncing capabilities
  • May require more computational resources due to its real-time processing nature
  • Potentially more complex to set up and use for beginners compared to Wav2Lip

Code Comparison

Real-Time-Voice-Cloning:

encoder = SpeakerEncoder("encoder/saved_models/pretrained.pt")
synthesizer = Synthesizer("synthesizer/saved_models/pretrained/pretrained.pt")
vocoder = WaveRNN("vocoder/saved_models/pretrained/pretrained.pt")

Wav2Lip:

model = Wav2Lip()
model.load_state_dict(torch.load(args.checkpoint_path))
model.eval()

The code snippets show that Real-Time-Voice-Cloning uses separate models for encoding, synthesis, and vocoding, while Wav2Lip employs a single model for lip-syncing. This reflects the different focus areas of each project, with Real-Time-Voice-Cloning offering a more comprehensive voice processing pipeline and Wav2Lip specializing in lip synchronization.

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.

Pros of AudioCraft

  • Broader scope: Generates music and audio, not just lip-syncing
  • More advanced AI techniques: Uses transformer-based models
  • Active development: Regularly updated by Facebook Research team

Cons of AudioCraft

  • Higher computational requirements: Needs more powerful hardware
  • Steeper learning curve: More complex to use and implement
  • Less focused: Not specialized for lip-syncing tasks

Code Comparison

Wav2Lip (Python):

from wav2lip import Wav2Lip
model = Wav2Lip()
result = model.predict(face='input_face.mp4', audio='input_audio.wav')

AudioCraft (Python):

from audiocraft.models import MusicGen
model = MusicGen.get_pretrained('medium')
wav = model.generate(
    descriptions=['happy rock'],
    duration=8,
)

Wav2Lip focuses on lip-syncing, with a straightforward API for processing video and audio inputs. AudioCraft, on the other hand, demonstrates its capability to generate music based on text descriptions, showcasing its broader audio generation capabilities.

While Wav2Lip is more suitable for specific lip-syncing tasks, AudioCraft offers a wider range of audio generation possibilities, making it more versatile but potentially more complex for users with specific needs.


🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

Pros of TTS

  • Offers a wide range of text-to-speech models and voices
  • Supports multiple languages and accents
  • Provides more flexibility in generating custom speech

Cons of TTS

  • Requires more setup and configuration
  • May have higher computational requirements
  • Limited to audio generation, doesn't handle lip-syncing

Code Comparison

TTS:

from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")

Wav2Lip:

from inference import Wav2Lip

model = Wav2Lip()
model.load_model("checkpoints/wav2lip.pth")
model.predict("input_video.mp4", "input_audio.wav", "output.mp4")

TTS focuses on generating speech from text, offering various models and voices. It's more versatile for pure audio generation but doesn't handle visual aspects. Wav2Lip, on the other hand, specializes in lip-syncing existing videos with audio input, making it more suitable for video-related tasks. The choice between the two depends on the specific requirements of the project, whether it's generating speech or creating lip-synced videos.


README

Wav2Lip: Accurately Lip-syncing Videos In The Wild

Wav2Lip is hosted for free at Sync Labs

Are you looking to integrate this into a product? We have a turn-key hosted API with new and improved lip-syncing models here: https://sync.so/

For any other commercial / enterprise requests, please contact us at pavan@synclabs.so and prady@sync.so

To reach the authors directly, you can contact us at prajwal@synclabs.so and rudrabha@sync.so.

This code is part of the paper: A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild published at ACM Multimedia 2020.


📑 Original Paper | 📰 Project Page | 🌀 Demo Video | ⚡ Interactive Demo | 📔 Colab Notebook / Updated Colab Notebook



Highlights

  • Weights of the visual quality discriminator have been updated in the README!
  • Lip-sync videos to any target speech with high accuracy :100:. Try our interactive demo.
  • :sparkles: Works for any identity, voice, and language. Also works for CGI faces and synthetic voices.
  • Complete training code, inference code, and pretrained models are available :boom:
  • Or, quick-start with the Google Colab Notebook: Link. Checkpoints and samples are available in a Google Drive folder as well. There is also a tutorial video on this, courtesy of What Make Art. Also, thanks to Eyal Gruss, there is a more accessible Google Colab notebook with more useful features. A tutorial Colab notebook is available at this link.
  • :fire: :fire: Several new, reliable evaluation benchmarks and metrics [evaluation/ folder of this repo] released. Instructions to calculate the metrics reported in the paper are also present.

Disclaimer

All results from this open-source code or our demo website should be used for research/academic/personal purposes only. As the models are trained on the LRS2 dataset, any form of commercial use is strictly prohibited. For commercial requests, please contact us directly!

Prerequisites

  • Python 3.6
  • ffmpeg: sudo apt-get install ffmpeg
  • Install the necessary packages using pip install -r requirements.txt. Alternatively, instructions for using a Docker image are provided here. Have a look at this comment and comment on the gist if you encounter any issues.
  • The face detection pre-trained model should be downloaded to face_detection/detection/sfd/s3fd.pth (alternative link if the above does not work); example commands are shown below.
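
For example, the face detection weights can be fetched and placed with a couple of shell commands; the URL below is a placeholder for the link in the bullet above:

mkdir -p face_detection/detection/sfd
# Substitute the primary (or alternative) s3fd link from the prerequisites list
wget -O face_detection/detection/sfd/s3fd.pth "<s3fd-model-link>"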

Getting the weights

Model | Description | Link to the model
Wav2Lip | Highly accurate lip-sync | Link
Wav2Lip + GAN | Slightly inferior lip-sync, but better visual quality | Link
Expert Discriminator | Weights of the expert discriminator | Link
Visual Quality Discriminator | Weights of the visual disc trained in a GAN setup | Link

Lip-syncing videos using the pre-trained models (Inference)

You can lip-sync any video to any audio:

python inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source> 

The result is saved (by default) in results/result_voice.mp4. You can specify the output path as an argument, along with several other available options. The audio source can be any file supported by FFmpeg that contains audio data: *.wav, *.mp3, or even a video file, from which the code will automatically extract the audio.

Tips for better results (a combined example command follows this list):
  • Experiment with the --pads argument to adjust the detected face bounding box. Often leads to improved results. You might need to increase the bottom padding to include the chin region. E.g. --pads 0 20 0 0.
  • If you see the mouth position dislocated or some weird artifacts such as two mouths, then it can be because of over-smoothing the face detections. Use the --nosmooth argument and give it another try.
  • Experiment with the --resize_factor argument to get a lower-resolution video. Why? The models are trained on faces at a lower resolution, so you might get better, more visually pleasing results for 720p videos than for 1080p videos (in many cases, the latter works well too).
  • The Wav2Lip model without GAN usually needs more experimentation with the above two options to get the best results, and sometimes it can give you a better result as well.
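
As a sketch, a command combining these options might look like the following; the file paths are placeholders, and --outfile is assumed to be the output-path argument of inference.py (check python inference.py --help for your version):

python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth \
    --face input_video.mp4 --audio input_audio.wav \
    --pads 0 20 0 0 --resize_factor 2 --nosmooth \
    --outfile results/my_result.mp4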

Preparing LRS2 for training

Our models are trained on LRS2. See here for a few suggestions regarding training on other datasets.

LRS2 dataset folder structure
data_root (mvlrs_v1)
├── main, pretrain (we use only the main folder in this work)
|   ├── list of folders
|   │   ├── five-digit numbered video IDs ending with (.mp4)

Place the LRS2 filelists (train, val, test) .txt files in the filelists/ folder.

Preprocess the dataset for fast training
python preprocess.py --data_root data_root/main --preprocessed_root lrs2_preprocessed/

Additional options such as batch_size and the number of GPUs to use in parallel can also be set.
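
A sketch of such a call, assuming the script exposes --batch_size and --ngpu flags (verify the exact names with python preprocess.py --help):

python preprocess.py --data_root data_root/main --preprocessed_root lrs2_preprocessed/ \
    --batch_size 32 --ngpu 2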

Preprocessed LRS2 folder structure
preprocessed_root (lrs2_preprocessed)
├── list of folders
|   ├── folders with five-digit numbered video IDs
|   │   ├── *.jpg
|   │   ├── audio.wav

Train!

There are two major steps: (i) Train the expert lip-sync discriminator, (ii) Train the Wav2Lip model(s).

Training the expert discriminator

You can download the pre-trained weights if you want to skip this step. To train it:

python color_syncnet_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints>
Training the Wav2Lip models

You can either train the model without the additional visual quality discriminator (< 1 day of training) or use the discriminator (~2 days). For the former, run:

python wav2lip_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints> --syncnet_checkpoint_path <path_to_expert_disc_checkpoint>

To train with the visual quality discriminator, you should run hq_wav2lip_train.py instead. The arguments for both files are similar. In both cases, you can resume training as well. Look at python wav2lip_train.py --help for more details. You can also set additional less commonly-used hyper-parameters at the bottom of the hparams.py file.
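
For instance, training with the visual quality discriminator and later resuming from saved checkpoints might look like the sketch below; the checkpoint file names are placeholders, and the resume flags (--checkpoint_path, --disc_checkpoint_path) are assumptions to confirm with python hq_wav2lip_train.py --help:

python hq_wav2lip_train.py --data_root lrs2_preprocessed/ \
    --checkpoint_dir checkpoints/hq_wav2lip \
    --syncnet_checkpoint_path checkpoints/lipsync_expert.pth \
    --checkpoint_path checkpoints/hq_wav2lip/checkpoint_step000050000.pth \
    --disc_checkpoint_path checkpoints/hq_wav2lip/disc_checkpoint_step000050000.pth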

Training on datasets other than LRS2

Training on other datasets might require modifications to the code. Please read the following before you raise an issue:

  • You might not get good results by training/fine-tuning on a few minutes of a single speaker. This is a separate research problem, to which we do not have a solution yet. Thus, we would most likely not be able to resolve your issue.
  • You must train the expert discriminator for your own dataset before training Wav2Lip.
  • If you are using your own dataset downloaded from the web, in most cases it needs to be sync-corrected.
  • Be mindful of the FPS of the videos of your dataset. Changes to FPS would need significant code changes.
  • The expert discriminator's eval loss should go down to ~0.25 and the Wav2Lip eval sync loss should go down to ~0.2 to get good results.

When raising an issue on this topic, please let us know that you are aware of all these points.

We have an HD model trained on a dataset allowing commercial usage. The size of the generated face will be 192 x 288 in our new model.

Evaluation

Please check the evaluation/ folder for the instructions.

License and Citation

This repository can only be used for personal/research/non-commercial purposes. However, for commercial requests, please contact us directly at rudrabha@synclabs.so or prajwal@synclabs.so. We have a turn-key hosted API with new and improved lip-syncing models here: https://synclabs.so/. The size of the generated face will be 192 x 288 in our new models. Please cite the following paper if you use this repository:

@inproceedings{10.1145/3394171.3413532,
author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
title = {A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild},
year = {2020},
isbn = {9781450379885},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3394171.3413532},
doi = {10.1145/3394171.3413532},
booktitle = {Proceedings of the 28th ACM International Conference on Multimedia},
pages = {484--492},
numpages = {9},
keywords = {lip sync, talking face generation, video generation},
location = {Seattle, WA, USA},
series = {MM '20}
}

Acknowledgments

Parts of the code structure are inspired by this TTS repository. We thank the author for this wonderful code. The code for face detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models. We thank zabique for the tutorial Colab notebook.
