first-order-model
This repository contains the source code for the paper First Order Motion Model for Image Animation
Top Related Projects
GFPGAN aims at developing Practical Algorithms for Real-world Face Restoration.
Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration.
A latent text-to-image diffusion model
Official PyTorch implementation of StyleGAN3
Efficient 3D human pose estimation in video using 2D keypoint trajectories
Deepfakes Software For All
Quick Overview
The first-order-model repository is an implementation of the paper "First Order Motion Model for Image Animation." It provides a framework for animating a source image using the motion from a driving video, allowing for the creation of realistic image animations without extensive training data or specific annotations.
Pros
- Enables high-quality image animation with minimal input requirements
- Supports a wide range of applications, including face animation, human pose transfer, and object animation
- Provides pre-trained models for quick experimentation and results
- Offers flexibility in adapting to different types of motions and images
Cons
- May require significant computational resources for training and inference
- Performance can vary depending on the similarity between source and driving images
- Limited control over specific aspects of the generated animation
- Potential for misuse in creating deepfakes or misleading content
Code Examples
- Loading the model and performing inference:
from demo import load_checkpoints, make_animation
from skimage import io
from skimage.transform import resize
import matplotlib.pyplot as plt
import matplotlib.animation as animation
# The vox-256 model expects 256x256 RGB frames with float values in [0, 1];
# resize() converts to float and [..., :3] drops any alpha channel.
source_image = resize(io.imread('path/to/source.png'), (256, 256))[..., :3]
driving_video = [resize(frame, (256, 256))[..., :3] for frame in io.imread_collection('path/to/driving_video/*.png')]
generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml', checkpoint_path='vox-cpk.pth.tar')
predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=True)
- Visualizing the generated animation:
fig = plt.figure(figsize=(10, 10))
plt.axis('off')
im = plt.imshow(predictions[0])
def update_frame(frame):
    im.set_data(predictions[frame])
    return [im]
anim = animation.FuncAnimation(fig, update_frame, frames=len(predictions), interval=50, blit=True)
plt.show()
- Saving the generated animation:
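from moviepy.editor import ImageSequenceClip
from skimage import img_as_ubyte
# predictions are float frames in [0, 1]; convert them to 8-bit before encoding
clip = ImageSequenceClip([img_as_ubyte(frame) for frame in predictions], fps=25)
clip.write_videofile("output_animation.mp4", codec='libx264')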
Getting Started
- Clone the repository:
git clone https://github.com/AliaksandrSiarohin/first-order-model.git
cd first-order-model
- Install dependencies:
pip install -r requirements.txt
- Download pre-trained models:
wget https://github.com/AliaksandrSiarohin/first-order-model/releases/download/v1.0.0/vox-cpk.pth.tar -P checkpoints/
- Run the demo:
python demo.py --config config/vox-256.yaml --checkpoint checkpoints/vox-cpk.pth.tar --driving_video path/to/driving.mp4 --source_image path/to/source.png --result_video path/to/result.mp4
Competitor Comparisons
GFPGAN aims at developing Practical Algorithms for Real-world Face Restoration.
Pros of GFPGAN
- Specializes in face restoration and enhancement
- Offers pre-trained models for immediate use
- Provides a user-friendly interface for non-technical users
Cons of GFPGAN
- Limited to face-specific tasks, less versatile than first-order-model
- May introduce artifacts in certain cases of face restoration
- Requires more computational resources for high-resolution outputs
Code Comparison
GFPGAN:
from gfpgan import GFPGANer
restorer = GFPGANer(model_path='experiments/pretrained_models/GFPGANv1.3.pth', upscale=2)
restored_img, _ = restorer.enhance(img, has_aligned=False, only_center_face=False, paste_back=True)
first-order-model:
import imageio
from demo import load_checkpoints, make_animation
source_image = imageio.imread(args.source_image)
driving_video = imageio.mimread(args.driving_video)
generator, kp_detector = load_checkpoints(config_path=args.config, checkpoint_path=args.checkpoint)
predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=args.relative)
The code snippets highlight the different focus areas of each project. GFPGAN is centered on face restoration, while first-order-model is designed for more general image animation tasks.
Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration.
Pros of Real-ESRGAN
- Focuses on image super-resolution and enhancement
- Provides pre-trained models for easy implementation
- Supports both CPU and GPU inference
Cons of Real-ESRGAN
- Limited to restoring existing images and video frames; it does not generate motion or animation
- May introduce artifacts in some cases, especially with text
Code Comparison
Real-ESRGAN:
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer
model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32)
upsampler = RealESRGANer(scale=4, model_path='weights/RealESRGAN_x4plus.pth', model=model)
first-order-model:
from demo import load_checkpoints
generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml',
                                          checkpoint_path='vox-cpk.pth.tar')
Key Differences
- Real-ESRGAN is primarily for image enhancement, while first-order-model focuses on video animation and face reenactment
- Real-ESRGAN operates on single images, whereas first-order-model works with video sequences
- first-order-model requires a source image and a driving video, while Real-ESRGAN only needs a single input image
- Real-ESRGAN's output is a higher resolution version of the input, while first-order-model generates animated videos based on driving sequences
A latent text-to-image diffusion model
Pros of stable-diffusion
- Generates high-quality images from text descriptions, offering more versatile creative applications
- Supports various image manipulation tasks like inpainting, outpainting, and image-to-image translation
- Has a larger and more active community, with frequent updates and improvements
Cons of stable-diffusion
- Requires more computational resources and longer processing times for image generation
- May produce less consistent results when generating multiple images from the same prompt
- Has a steeper learning curve for fine-tuning and customization
Code Comparison
first-order-model:
from demo import load_checkpoints
generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml',
                                          checkpoint_path='vox-cpk.pth.tar')
stable-diffusion:
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
image = pipe("A beautiful sunset over the ocean").images[0]
Official PyTorch implementation of StyleGAN3
Pros of StyleGAN3
- Produces higher quality and more realistic images with fewer artifacts
- Offers better control over image generation and style mixing
- Implements advanced techniques like alias-free sampling for improved results
Cons of StyleGAN3
- Requires more computational resources and longer training times
- Has a steeper learning curve due to its complexity
- Limited to generating static images, unlike first-order-model's video capabilities
Code Comparison
StyleGAN3:
import torch
import dnnlib
import legacy
network_pkl = 'https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-t-ffhq-1024x1024.pkl'
device = torch.device('cuda')
with dnnlib.util.open_url(network_pkl) as f:
    G = legacy.load_network_pkl(f)['G_ema'].to(device)
first-order-model:
from demo import load_checkpoints
generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml',
                                          checkpoint_path='vox-cpk.pth.tar')
Summary
StyleGAN3 excels in generating high-quality static images with advanced control, while first-order-model focuses on video manipulation and animation. StyleGAN3 requires more resources but offers superior image quality, whereas first-order-model is more versatile for video-based tasks. The code snippets highlight the difference in setup and usage between the two projects.
Efficient 3D human pose estimation in video using 2D keypoint trajectories
Pros of VideoPose3D
- Focuses on 3D human pose estimation from video, offering more detailed skeletal tracking
- Provides pre-trained models and datasets for easier implementation
- Supports both 2D-to-3D pose lifting and end-to-end 3D pose estimation
Cons of VideoPose3D
- Limited to human pose estimation, less versatile for general video manipulation
- Requires more computational resources for 3D pose estimation
- May struggle with complex scenes or multiple subjects
Code Comparison
VideoPose3D:
from common.model import TemporalModel
model = TemporalModel(num_joints_in, in_features, num_joints_out, filter_widths, causal=args.causal)
first-order-model:
from modules.generator import OcclusionAwareGenerator
generator = OcclusionAwareGenerator(num_channels, num_kp, num_bottleneck_blocks, estimate_occlusion_map=True)
The code snippets show that VideoPose3D focuses on temporal modeling for pose estimation, while first-order-model uses an occlusion-aware generator for image manipulation tasks.
Deepfakes Software For All
Pros of faceswap
- More comprehensive and feature-rich, offering a complete pipeline for face swapping
- Extensive documentation and active community support
- Includes a graphical user interface for easier use by non-technical users
Cons of faceswap
- Requires more computational resources due to its complexity
- Steeper learning curve for beginners
- May produce less realistic results in some cases compared to first-order-model
Code Comparison
first-order-model:
source_image = torch.tensor(source_image[np.newaxis].astype(np.float32)).permute(0, 3, 1, 2)
driving_video = torch.tensor(np.array(driving_video)[np.newaxis].astype(np.float32)).permute(0, 4, 1, 2, 3)
predictions = model(source_image, driving_video)
faceswap:
detected_faces = self.detect_faces(image)
for face in detected_faces:
    landmarks = self.get_landmarks(image, face)
    mask = self.get_mask(image, landmarks)
    warped_face = self.warp_face(image, landmarks, self.reference_landmarks)
The first-order-model code prepares PyTorch tensors in order to animate a single source image, while the faceswap snippet outlines a per-face pipeline of detection, landmark extraction, masking, and warping for face swapping.
README
!!! Check out our new paper and framework improved for articulated objects
First Order Motion Model for Image Animation
This repository contains the source code for the paper First Order Motion Model for Image Animation by Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci and Nicu Sebe.
Example animations
The videos on the left show the driving videos. The first row on the right for each dataset shows the source images. The bottom row contains the animated sequences, with motion transferred from the driving video and the object taken from the source image. We trained a separate network for each task.
VoxCeleb Dataset
Fashion Dataset
MGIF Dataset
Installation
We support python3. To install the dependencies run:
pip install -r requirements.txt
YAML configs
There are several configuration files (config/dataset_name.yaml), one for each dataset. See config/taichi-256.yaml for a description of each parameter.
Pre-trained checkpoint
Checkpoints can be found under the following links: google-drive or yandex-disk.
Animation Demo
To run a demo, download a checkpoint and run the following command:
python demo.py --config config/dataset_name.yaml --driving_video path/to/driving --source_image path/to/source --checkpoint path/to/checkpoint --relative --adapt_scale
The result will be stored in result.mp4.
The driving videos and source images should be cropped before they can be used in our method. To obtain some semi-automatic crop suggestions, you can use python crop-video.py --inp some_youtube_video.mp4. It will generate commands for crops using ffmpeg. In order to use the script, the face-alignment library is needed:
git clone https://github.com/1adrianb/face-alignment
cd face-alignment
pip install -r requirements.txt
python setup.py install
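As a rough illustration of what a crop suggestion of this kind amounts to, the sketch below builds an ffmpeg command from a bounding box and a time range. The helper name build_crop_command, its parameters, and the output file name are hypothetical and not part of the repository, and the exact command printed by crop-video.py may differ; only the ffmpeg crop=w:h:x:y and scale filters are standard.

```python
import shlex


def build_crop_command(inp, left, top, right, bottom, start, duration,
                       size=256, out="crop.mp4"):
    """Return an ffmpeg command that cuts a time range and crops a box.

    Hypothetical helper for illustration; not part of the repository.
    """
    width, height = right - left, bottom - top
    # ffmpeg's crop filter takes width:height:x:y of the region to keep,
    # and scale resizes the cropped region to the model's input size.
    video_filter = f"crop={width}:{height}:{left}:{top},scale={size}:{size}"
    args = ["ffmpeg", "-i", inp, "-ss", str(start), "-t", str(duration),
            "-filter:v", video_filter, out]
    # shlex.join quotes arguments only where the shell would need it.
    return shlex.join(args)


print(build_crop_command("some_youtube_video.mp4", 100, 50, 612, 562,
                         start=3.0, duration=8.5))
```

The printed string can be pasted into a shell. Cropping to a square box before scaling avoids distorting the subject.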
Animation demo with Docker
If you are having trouble getting the demo to work because of library compatibility issues, and you're running Linux, you might try running it inside a Docker container, which would give you better control over the execution environment.
Requirements: Docker 19.03+ and nvidia-docker installed and able to successfully run the nvidia-docker usage tests.
We'll first build the container.
docker build -t first-order-model .
And now that we have the container available locally, we can use it to run the demo.
docker run -it --rm --gpus all \
-v $HOME/first-order-model:/app first-order-model \
python3 demo.py --config config/vox-256.yaml \
--driving_video driving.mp4 \
--source_image source.png \
--checkpoint vox-cpk.pth.tar \
--result_video result.mp4 \
--relative --adapt_scale
Colab Demo
@graphemecluster prepared a GUI demo for Google Colab. It also works in Kaggle. For the source code, see demo.ipynb.
For the old demo, see old_demo.ipynb.
Face-swap
It is possible to modify the method to perform face-swap using supervised segmentation masks. For both unsupervised and supervised video editing, such as face-swap, please refer to Motion Co-Segmentation.
Training
To train a model on a specific dataset run:
CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py --config config/dataset_name.yaml --device_ids 0,1,2,3
The code will create a folder in the log directory (each run creates a new time-stamped directory).
Checkpoints will be saved to this folder.
To check the loss values during training, see log.txt.
You can also check training data reconstructions in the train-vis subfolder.
By default, the batch size is tuned to run on 2 or 4 Titan-X GPUs (apart from speed, it does not make much difference). You can change the batch size in train_params in the corresponding .yaml file.
Evaluation on video reconstruction
To evaluate the reconstruction performance run:
CUDA_VISIBLE_DEVICES=0 python run.py --config config/dataset_name.yaml --mode reconstruction --checkpoint path/to/checkpoint
You will need to specify the path to the checkpoint; the reconstruction subfolder will be created in the checkpoint folder.
The generated video will be stored in this folder, and the generated videos will also be stored in the png subfolder in lossless '.png' format for evaluation.
Instructions for computing metrics from the paper can be found at https://github.com/AliaksandrSiarohin/pose-evaluation.
Image animation
In order to animate videos run:
CUDA_VISIBLE_DEVICES=0 python run.py --config config/dataset_name.yaml --mode animate --checkpoint path/to/checkpoint
You will need to specify the path to the checkpoint; the animation subfolder will be created in the same folder as the checkpoint.
You can find the generated video there and its lossless version in the png subfolder.
By default, videos from the test set will be randomly paired, but you can specify the "source,driving" pairs in the corresponding .csv files. The path to this file should be specified in the pairs_list setting of the corresponding .yaml file.
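A pairs file of this kind could be produced with a few lines of Python. The sketch below is only an illustration: the header names source and driving are an assumption drawn from the "source,driving" wording above, and the video names are hypothetical. Check the repository's data loader for the exact format it expects.

```python
import csv
import io


def write_pairs(pairs, handle):
    """Write (source, driving) video-name pairs as CSV rows.

    The header names "source" and "driving" are an assumption based on
    the "source,driving" wording in the README; verify them against the
    repository's data loader before relying on them.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["source", "driving"])
    writer.writerows(pairs)


# Hypothetical test-set video names, used only for illustration.
pairs = [("video_001.mp4", "video_042.mp4"),
         ("video_007.mp4", "video_013.mp4")]

buffer = io.StringIO()
write_pairs(pairs, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open(path, "w", newline="") instead of the in-memory buffer, then point pairs_list at that path.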
There are 2 different ways of performing animation: by using absolute keypoint locations or by using relative keypoint locations.
- Animation using absolute coordinates: the animation is performed using the absolute keypoint positions from the driving video and the appearance of the source image.
In this way there are no specific requirements for the driving video and the source appearance that are used.
However, this usually leads to poor performance, since irrelevant details such as shape are transferred.
Check the animate parameters in taichi-256.yaml to enable this mode.
- Animation using relative coordinates: from the driving video we first estimate the relative movement of each keypoint, then we add this movement to the absolute position of the keypoints in the source image. These keypoints, along with the source image, are used for animation. This usually leads to better performance; however, it requires that the object in the first frame of the driving video and in the source image have the same pose.
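The relative mode described above can be sketched in a few lines. This is a simplified illustration that handles keypoint positions only, stored as plain (x, y) tuples; the function name relative_keypoints and the scale parameter are hypothetical, and the actual model works on tensors and carries more information per keypoint than a position.

```python
def relative_keypoints(kp_source, kp_driving, kp_driving_initial, scale=1.0):
    """Move the source keypoints by the driving keypoints' displacement.

    Each argument is a list of (x, y) tuples, one per keypoint.
    `scale` is a hypothetical knob for rescaling the motion.
    """
    moved = []
    for (sx, sy), (dx, dy), (ix, iy) in zip(kp_source, kp_driving,
                                            kp_driving_initial):
        # Relative movement of this keypoint since the first driving frame.
        delta_x, delta_y = dx - ix, dy - iy
        # Add the movement to the keypoint's absolute position in the source.
        moved.append((sx + scale * delta_x, sy + scale * delta_y))
    return moved


# Hypothetical keypoints in normalized image coordinates.
kp_source = [(0.10, 0.20), (-0.30, 0.40)]
kp_driving_initial = [(0.00, 0.00), (-0.50, 0.50)]
kp_driving = [(0.05, -0.10), (-0.50, 0.25)]

print(relative_keypoints(kp_source, kp_driving, kp_driving_initial))
```

For the first driving frame the displacement is zero, so the output equals the source keypoints. This is why the first driving frame and the source image need to show the same pose: any initial mismatch is carried through every subsequent frame.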
Datasets
- Bair. This dataset can be directly downloaded.
- Mgif. This dataset can be directly downloaded.
- Fashion. Follow the instruction on dataset downloading from.
- Taichi. Follow the instructions in data/taichi-loading or instructions from https://github.com/AliaksandrSiarohin/video-preprocessing.
- Nemo. Please follow the instructions on how to download the dataset. Then the dataset should be preprocessed using scripts from https://github.com/AliaksandrSiarohin/video-preprocessing.
- VoxCeleb. Please follow the instructions from https://github.com/AliaksandrSiarohin/video-preprocessing.
Training on your own dataset
- Resize all the videos to the same size, e.g. 256x256. The videos can be in '.gif' or '.mp4' format, or a folder with images. We recommend the latter: for each video, make a separate folder with all the frames in '.png' format. This format is lossless and has better I/O performance.
- Create a folder data/dataset_name with 2 subfolders, train and test; put the training videos in train and the testing videos in test.
- Create a config config/dataset_name.yaml; in dataset_params, specify the root dir as root_dir: data/dataset_name. Also adjust the number of epochs in train_params.
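The folder layout from the steps above could be created as in the following sketch, which follows the recommended one-folder-per-video format. The helper name make_dataset_layout and the video names are hypothetical, and a temporary directory stands in for the real data/ folder; extracting and resizing the frames themselves is left out.

```python
from pathlib import Path
import tempfile


def make_dataset_layout(root, train_videos, test_videos):
    """Create root/train/<video>/ and root/test/<video>/ frame folders."""
    root = Path(root)
    for split, names in (("train", train_videos), ("test", test_videos)):
        for name in names:
            # One folder per video; its frames would be saved inside as .png
            (root / split / name).mkdir(parents=True, exist_ok=True)
    return root


# Hypothetical video names; a temporary directory stands in for data/.
with tempfile.TemporaryDirectory() as tmp:
    root = make_dataset_layout(Path(tmp) / "dataset_name",
                               train_videos=["clip_0001", "clip_0002"],
                               test_videos=["clip_0003"])
    for path in sorted(root.rglob("*")):
        print(path.relative_to(root))
```

With a real dataset, root would be data/dataset_name, matching the root_dir value set in the config.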
Additional notes
Citation:
@InProceedings{Siarohin_2019_NeurIPS,
author={Siarohin, Aliaksandr and Lathuilière, Stéphane and Tulyakov, Sergey and Ricci, Elisa and Sebe, Nicu},
title={First Order Motion Model for Image Animation},
booktitle = {Conference on Neural Information Processing Systems (NeurIPS)},
month = {December},
year = {2019}
}