first-order-model

This repository contains the source code for the paper First Order Motion Model for Image Animation


Top Related Projects

GFPGAN

36,861

GFPGAN aims at developing Practical Algorithms for Real-world Face Restoration.

Real-ESRGAN

31,984

Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration.

stable-diffusion

71,028

A latent text-to-image diffusion model

stylegan3

6,765

Official PyTorch implementation of StyleGAN3

VideoPose3D

3,872

Efficient 3D human pose estimation in video using 2D keypoint trajectories

Quick Overview

The first-order-model repository is an implementation of the paper "First Order Motion Model for Image Animation." It provides a framework for animating a source image using the motion from a driving video, allowing for the creation of realistic image animations without extensive training data or specific annotations.

Pros

Enables high-quality image animation with minimal input requirements
Supports a wide range of applications, including face animation, human pose transfer, and object animation
Provides pre-trained models for quick experimentation and results
Offers flexibility in adapting to different types of motions and images

Cons

May require significant computational resources for training and inference
Performance can vary depending on the similarity between source and driving images
Limited control over specific aspects of the generated animation
Potential for misuse in creating deepfakes or misleading content

Code Examples

Loading the model and performing inference:

from demo import load_checkpoints, make_animation
from skimage import io, img_as_float32
from skimage.transform import resize
import matplotlib.pyplot as plt
import matplotlib.animation as animation

# The vox-256 model expects 256x256 RGB frames with float values in [0, 1]
source_image = resize(io.imread('path/to/source.png'), (256, 256))[..., :3]
driving_video = [resize(img_as_float32(frame), (256, 256))[..., :3] for frame in io.imread_collection('path/to/driving_video/*.png')]

generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml', checkpoint_path='vox-cpk.pth.tar')

# relative=True transfers keypoint motion relative to the first driving frame
predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=True)

Visualizing the generated animation:

fig = plt.figure(figsize=(10, 10))
plt.axis('off')
im = plt.imshow(predictions[0])

def update_frame(frame):
    im.set_data(predictions[frame])
    return [im]

anim = animation.FuncAnimation(fig, update_frame, frames=len(predictions), interval=50, blit=True)
plt.show()

Saving the generated animation:

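from moviepy.editor import ImageSequenceClip
from skimage import img_as_ubyte

# predictions are float frames in [0, 1]; convert them to 8-bit before encoding
clip = ImageSequenceClip([img_as_ubyte(frame) for frame in predictions], fps=25)
clip.write_videofile("output_animation.mp4", codec='libx264')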

Getting Started

Clone the repository:

git clone https://github.com/AliaksandrSiarohin/first-order-model.git
cd first-order-model

Install dependencies:
```
pip install -r requirements.txt
```

Download pre-trained models:

wget https://github.com/AliaksandrSiarohin/first-order-model/releases/download/v1.0.0/vox-cpk.pth.tar -P checkpoints/

Run the demo:

python demo.py --config config/vox-256.yaml --checkpoint checkpoints/vox-cpk.pth.tar --driving_video path/to/driving.mp4 --source_image path/to/source.png --result_video path/to/result.mp4

Competitor Comparisons

GFPGAN

36,861

GFPGAN aims at developing Practical Algorithms for Real-world Face Restoration.

Pros of GFPGAN

Specializes in face restoration and enhancement
Offers pre-trained models for immediate use
Provides a user-friendly interface for non-technical users

Cons of GFPGAN

Limited to face-specific tasks, less versatile than first-order-model
May introduce artifacts in certain cases of face restoration
Requires more computational resources for high-resolution outputs

Code Comparison

GFPGAN:

from gfpgan import GFPGANer

restorer = GFPGANer(model_path='experiments/pretrained_models/GFPGANv1.3.pth', upscale=2)
# enhance returns the cropped faces, the restored faces, and the full restored image
_, _, restored_img = restorer.enhance(img, has_aligned=False, only_center_face=False, paste_back=True)

first-order-model:

import imageio
from demo import load_checkpoints, make_animation

source_image = imageio.imread(args.source_image)
driving_video = imageio.mimread(args.driving_video)

generator, kp_detector = load_checkpoints(config_path=args.config, checkpoint_path=args.checkpoint)
predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=args.relative)

The code snippets highlight the different focus areas of each project. GFPGAN is centered on face restoration, while first-order-model is designed for more general image animation tasks.

Real-ESRGAN

31,984

Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration.

Pros of Real-ESRGAN

Focuses on image super-resolution and enhancement
Provides pre-trained models for easy implementation
Supports both CPU and GPU inference

Cons of Real-ESRGAN

Restores and upscales existing images and video frames but does not generate motion or animation
May introduce artifacts in some cases, especially with text

Code Comparison

Real-ESRGAN:

from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32)
upsampler = RealESRGANer(scale=4, model_path='weights/RealESRGAN_x4plus.pth', model=model)

first-order-model:

from demo import load_checkpoints
generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml', 
                                          checkpoint_path='vox-cpk.pth.tar')

Key Differences

Real-ESRGAN is primarily for image enhancement, while first-order-model focuses on video animation and face reenactment
Real-ESRGAN operates on single images, whereas first-order-model works with video sequences
first-order-model requires a source image and a driving video, while Real-ESRGAN only needs a single input image
Real-ESRGAN's output is a higher resolution version of the input, while first-order-model generates animated videos based on driving sequences

stable-diffusion

71,028

A latent text-to-image diffusion model

Pros of stable-diffusion

Generates high-quality images from text descriptions, offering more versatile creative applications
Supports various image manipulation tasks like inpainting, outpainting, and image-to-image translation
Has a larger and more active community, with frequent updates and improvements

Cons of stable-diffusion

Requires more computational resources and longer processing times for image generation
May produce less consistent results when generating multiple images from the same prompt
Has a steeper learning curve for fine-tuning and customization

Code Comparison

first-order-model:

from demo import load_checkpoints
generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml', 
                                          checkpoint_path='vox-cpk.pth.tar')

stable-diffusion:

from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
image = pipe("A beautiful sunset over the ocean").images[0]

stylegan3

6,765

Official PyTorch implementation of StyleGAN3

Pros of StyleGAN3

Produces higher quality and more realistic images with fewer artifacts
Offers better control over image generation and style mixing
Implements advanced techniques like alias-free sampling for improved results

Cons of StyleGAN3

Requires more computational resources and longer training times
Has a steeper learning curve due to its complexity
Limited to generating static images, unlike first-order-model's video capabilities

Code Comparison

StyleGAN3:

import torch
import dnnlib
import legacy

network_pkl = 'https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-t-ffhq-1024x1024.pkl'
device = torch.device('cuda')
with dnnlib.util.open_url(network_pkl) as f:
    G = legacy.load_network_pkl(f)['G_ema'].to(device)

first-order-model:

from demo import load_checkpoints
generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml', 
                                          checkpoint_path='vox-cpk.pth.tar')

Summary

StyleGAN3 excels in generating high-quality static images with advanced control, while first-order-model focuses on video manipulation and animation. StyleGAN3 requires more resources but offers superior image quality, whereas first-order-model is more versatile for video-based tasks. The code snippets highlight the difference in setup and usage between the two projects.

VideoPose3D

3,872

Efficient 3D human pose estimation in video using 2D keypoint trajectories

Pros of VideoPose3D

Focuses on 3D human pose estimation from video, offering more detailed skeletal tracking
Provides pre-trained models and datasets for easier implementation
Supports both 2D-to-3D pose lifting and end-to-end 3D pose estimation

Cons of VideoPose3D

Limited to human pose estimation, less versatile for general video manipulation
Requires more computational resources for 3D pose estimation
May struggle with complex scenes or multiple subjects

Code Comparison

VideoPose3D:

from common.model import TemporalModel
model = TemporalModel(num_joints_in, in_features, num_joints_out, filter_widths, causal=args.causal)

first-order-model:

from modules.generator import OcclusionAwareGenerator
# config is the parsed YAML configuration, e.g. config/vox-256.yaml
generator = OcclusionAwareGenerator(**config['model_params']['generator_params'],
                                    **config['model_params']['common_params'])

The code snippets show that VideoPose3D focuses on temporal modeling for pose estimation, while first-order-model uses an occlusion-aware generator for image manipulation tasks.

faceswap

54,146

Deepfakes Software For All

Pros of faceswap

More comprehensive and feature-rich, offering a complete pipeline for face swapping
Extensive documentation and active community support
Includes a graphical user interface for easier use by non-technical users

Cons of faceswap

Requires more computational resources due to its complexity
Steeper learning curve for beginners
May produce less realistic results in some cases compared to first-order-model

Code Comparison

first-order-model:

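# Adapted from make_animation in demo.py: convert inputs to batched channel-first tensors
source = torch.tensor(source_image[np.newaxis].astype(np.float32)).permute(0, 3, 1, 2)
driving = torch.tensor(np.array(driving_video)[np.newaxis].astype(np.float32)).permute(0, 4, 1, 2, 3)
kp_source = kp_detector(source)  # keypoints of the source image
kp_driving_initial = kp_detector(driving[:, :, 0])  # keypoints of the first driving frame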

faceswap:

detected_faces = self.detect_faces(image)
for face in detected_faces:
    landmarks = self.get_landmarks(image, face)
    mask = self.get_mask(image, landmarks)
    warped_face = self.warp_face(image, landmarks, self.reference_landmarks)

The first-order-model code focuses on generating animations from a single image, while faceswap's code demonstrates face detection, landmark extraction, and warping for face swapping. first-order-model uses PyTorch tensors, while faceswap relies on more traditional image processing techniques.


README

!!! Check out our new paper and framework improved for articulated objects

First Order Motion Model for Image Animation

This repository contains the source code for the paper First Order Motion Model for Image Animation by Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci and Nicu Sebe.


Example animations

The videos on the left show the driving videos. The first row on the right for each dataset shows the source videos. The bottom row contains the animated sequences with motion transferred from the driving video and object taken from the source image. We trained a separate network for each task.

VoxCeleb Dataset

Screenshot

Fashion Dataset

Screenshot

MGIF Dataset

Screenshot

Installation

We support python3. To install the dependencies run:

pip install -r requirements.txt

YAML configs

There are several configuration files (config/dataset_name.yaml), one for each dataset. See config/taichi-256.yaml for a description of each parameter.

Pre-trained checkpoint

Checkpoints can be found at the following links: google-drive or yandex-disk.

Animation Demo

To run a demo, download a checkpoint and run the following command:

python demo.py  --config config/dataset_name.yaml --driving_video path/to/driving --source_image path/to/source --checkpoint path/to/checkpoint --relative --adapt_scale

The result will be stored in result.mp4.
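When the demo is launched from a script rather than typed by hand, it can help to assemble the argument list programmatically. The sketch below only builds the command shown above. The helper name `build_demo_command` and the example file paths are hypothetical; the flags are the ones listed in this README.

```python
import shlex


def build_demo_command(config, driving_video, source_image, checkpoint,
                       relative=True, adapt_scale=True):
    """Return the demo.py invocation as an argument list (hypothetical helper)."""
    cmd = [
        "python", "demo.py",
        "--config", config,
        "--driving_video", driving_video,
        "--source_image", source_image,
        "--checkpoint", checkpoint,
    ]
    # Both flags are plain switches, so they are appended only when enabled.
    if relative:
        cmd.append("--relative")
    if adapt_scale:
        cmd.append("--adapt_scale")
    return cmd


# Example paths are placeholders, not files shipped with the repository.
cmd = build_demo_command("config/vox-256.yaml", "driving.mp4",
                         "source.png", "vox-cpk.pth.tar")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.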

The driving videos and source images should be cropped before they can be used in our method. To obtain some semi-automatic crop suggestions you can use python crop-video.py --inp some_youtube_video.mp4. It will generate commands for crops using ffmpeg. In order to use the script, the face-alignment library is needed:

git clone https://github.com/1adrianb/face-alignment
cd face-alignment
pip install -r requirements.txt
python setup.py install

Animation demo with Docker

If you are having trouble getting the demo to work because of library compatibility issues, and you're running Linux, you might try running it inside a Docker container, which would give you better control over the execution environment.

Requirements: Docker 19.03+ and nvidia-docker installed and able to successfully run the nvidia-docker usage tests.

We'll first build the container.

docker build -t first-order-model .

And now that we have the container available locally, we can use it to run the demo.

docker run -it --rm --gpus all \
       -v $HOME/first-order-model:/app first-order-model \
       python3 demo.py --config config/vox-256.yaml \
           --driving_video driving.mp4 \
           --source_image source.png \
           --checkpoint vox-cpk.pth.tar \
           --result_video result.mp4 \
           --relative --adapt_scale

Colab Demo

@graphemecluster prepared a GUI demo for Google Colab. It also works in Kaggle. For the source code, see demo.ipynb.

For the old demo, see old_demo.ipynb.

Face-swap

It is possible to modify the method to perform face-swap using supervised segmentation masks. For both unsupervised and supervised video editing, such as face-swap, please refer to Motion Co-Segmentation.

Training

To train a model on a specific dataset run:

CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py --config config/dataset_name.yaml --device_ids 0,1,2,3

The code will create a folder in the log directory (each run creates a new time-stamped directory). Checkpoints will be saved to this folder. To check the loss values during training, see log.txt. You can also check training data reconstructions in the train-vis subfolder. By default, the batch size is tuned to run on 2 or 4 Titan-X GPUs (apart from speed, it does not make much difference). You can change the batch size in train_params in the corresponding .yaml file.
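The training command above sets both CUDA_VISIBLE_DEVICES and --device_ids. The sketch below shows one way to keep the two consistent when a different set of GPUs is used. It relies on the general CUDA behaviour that visible devices are renumbered from 0 inside the process. The helper name `build_train_command` is hypothetical.

```python
import os


def build_train_command(config, gpu_ids):
    """Return (env, argv) for a training run on the given physical GPUs (hypothetical helper)."""
    env = dict(os.environ)
    # Physical GPU indices that the process is allowed to see.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    # Inside the process the visible GPUs are renumbered 0..n-1,
    # so --device_ids uses the renumbered indices.
    device_ids = ",".join(str(i) for i in range(len(gpu_ids)))
    argv = ["python", "run.py", "--config", config, "--device_ids", device_ids]
    return env, argv


env, argv = build_train_command("config/dataset_name.yaml", [0, 1, 2, 3])
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

With `gpu_ids=[4, 5]`, for example, the environment variable becomes `4,5` while `--device_ids` becomes `0,1`.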

Evaluation on video reconstruction

To evaluate the reconstruction performance run:

CUDA_VISIBLE_DEVICES=0 python run.py --config config/dataset_name.yaml --mode reconstruction --checkpoint path/to/checkpoint

You will need to specify the path to the checkpoint. A reconstruction subfolder will be created in the checkpoint folder. The generated videos will be stored in this folder, and also in the png subfolder in lossless '.png' format for evaluation. Instructions for computing the metrics from the paper can be found at https://github.com/AliaksandrSiarohin/pose-evaluation.

Image animation

In order to animate videos run:

CUDA_VISIBLE_DEVICES=0 python run.py --config config/dataset_name.yaml --mode animate --checkpoint path/to/checkpoint

You will need to specify the path to the checkpoint. An animation subfolder will be created in the same folder as the checkpoint. You can find the generated video there and its lossless version in the png subfolder. By default, videos from the test set are randomly paired, but you can specify the "source,driving" pairs in the corresponding .csv files. The path to this file should be specified in the pairs_list setting of the corresponding .yaml file.
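As an illustration of the "source,driving" pairing, the following sketch writes a small pairs file with Python's csv module. The header names, the video file names, and the absence of any further columns are assumptions made for this example; the sample .csv files in the repository are the reference for the exact layout.

```python
import csv
import io


def write_pairs(pairs, fileobj):
    """Write (source, driving) video pairs as CSV rows (assumed layout)."""
    writer = csv.writer(fileobj)
    writer.writerow(["source", "driving"])  # assumed header names
    writer.writerows(pairs)


# Hypothetical video names from the test set.
pairs = [
    ("video_001.mp4", "video_002.mp4"),
    ("video_003.mp4", "video_004.mp4"),
]

buffer = io.StringIO()
write_pairs(pairs, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the target with `open(path, "w", newline="")` and pass the file object to `write_pairs`.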

There are 2 different ways of performing animation: by using absolute keypoint locations or by using relative keypoint locations.

Animation using absolute coordinates: the animation is performed using the absolute keypoint positions from the driving video and the appearance of the source image. This mode places no specific requirements on the driving video or the source appearance. However, it usually leads to poor performance, since irrelevant details such as shape are transferred. Check the animate parameters in taichi-256.yaml to enable this mode.

Animation using relative coordinates: from the driving video we first estimate the relative movement of each keypoint, then we add this movement to the absolute position of the keypoints in the source image. These keypoints, along with the source image, are used for animation. This usually leads to better performance; however, it requires that the object in the first frame of the driving video and in the source image have the same pose.
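The difference between the two modes can be sketched on keypoint positions alone. This is a simplified illustration, not the repository's implementation: it handles only 2D positions and omits everything else the actual code does. The function names and the `movement_scale` parameter are hypothetical.

```python
def absolute_keypoints(kp_driving):
    """Absolute mode: the driving keypoints are used as they are."""
    return list(kp_driving)


def relative_keypoints(kp_source, kp_driving, kp_driving_initial, movement_scale=1.0):
    """Relative mode: add the driving motion, measured from the first
    driving frame, to the source keypoints."""
    result = []
    for (sx, sy), (dx, dy), (ix, iy) in zip(kp_source, kp_driving, kp_driving_initial):
        result.append((sx + movement_scale * (dx - ix),
                       sy + movement_scale * (dy - iy)))
    return result


# Two keypoints given as (x, y) pairs; the values are made up for illustration.
kp_source = [(0.0, 0.0), (0.5, -0.5)]
kp_driving_initial = [(0.25, 0.25), (0.75, -0.25)]
kp_driving = [(0.5, 0.25), (0.75, 0.0)]

print(absolute_keypoints(kp_driving))
print(relative_keypoints(kp_source, kp_driving, kp_driving_initial))
```

In relative mode, a driving frame identical to the first driving frame leaves the source keypoints unchanged. This is why the source image and the first driving frame should show the same pose.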

Datasets

Bair. This dataset can be directly downloaded.
Mgif. This dataset can be directly downloaded.
Fashion. Follow the instruction on dataset downloading from.
Taichi. Follow the instructions in data/taichi-loading or instructions from https://github.com/AliaksandrSiarohin/video-preprocessing.
Nemo. Please follow the instructions on how to download the dataset. Then the dataset should be preprocessed using scripts from https://github.com/AliaksandrSiarohin/video-preprocessing.
VoxCeleb. Please follow the instructions from https://github.com/AliaksandrSiarohin/video-preprocessing.

Training on your own dataset

Resize all the videos to the same size, e.g. 256x256. The videos can be in '.gif' or '.mp4' format, or a folder with images. We recommend the latter: for each video, make a separate folder with all the frames in '.png' format. This format is lossless, and it has better I/O performance.
Create a folder data/dataset_name with 2 subfolders, train and test. Put the training videos in train and the testing videos in test.
Create a config config/dataset_name.yaml, and in dataset_params specify the root directory as root_dir: data/dataset_name. Also adjust the number of epochs in train_params.
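The folder layout described above can be prepared with a short script. The sketch below creates data/dataset_name with train and test subfolders and one frame folder per video, and builds an ffmpeg command that resizes a video and writes PNG frames. The helper names, the video names, and the `%07d.png` frame naming pattern are assumptions made for this example.

```python
import tempfile
from pathlib import Path


def make_dataset_skeleton(root, dataset_name, train_videos, test_videos):
    """Create data/<dataset_name>/{train,test}/<video>/ folders (hypothetical helper)."""
    dataset_dir = Path(root) / "data" / dataset_name
    for split, videos in (("train", train_videos), ("test", test_videos)):
        (dataset_dir / split).mkdir(parents=True, exist_ok=True)
        for video in videos:
            (dataset_dir / split / video).mkdir(exist_ok=True)
    return dataset_dir


def ffmpeg_frames_command(video_path, frames_dir, size=256):
    """Build an ffmpeg command that resizes a video and writes numbered PNG frames."""
    return ["ffmpeg", "-i", str(video_path),
            "-vf", f"scale={size}:{size}",
            str(Path(frames_dir) / "%07d.png")]  # assumed frame naming pattern


# Demonstrated in a temporary directory so that nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = make_dataset_skeleton(tmp, "dataset_name",
                                        train_videos=["clip_a"], test_videos=["clip_b"])
    created = sorted(p.relative_to(dataset_dir).as_posix()
                     for p in dataset_dir.rglob("*") if p.is_dir())
    command = ffmpeg_frames_command("clip_a.mp4", dataset_dir / "train" / "clip_a")

print(created)
print(command[:5])
```

The command list is only constructed here. Running it requires ffmpeg to be installed, for instance through `subprocess.run(command, check=True)`.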

Additional notes

Citation:

@InProceedings{Siarohin_2019_NeurIPS,
  author={Siarohin, Aliaksandr and Lathuilière, Stéphane and Tulyakov, Sergey and Ricci, Elisa and Sebe, Nicu},
  title={First Order Motion Model for Image Animation},
  booktitle = {Conference on Neural Information Processing Systems (NeurIPS)},
  month = {December},
  year = {2019}
}
