AliaksandrSiarohin / first-order-model

This repository contains the source code for the paper First Order Motion Model for Image Animation

Quick Overview

The first-order-model repository is an implementation of the paper "First Order Motion Model for Image Animation." It provides a framework for animating a source image using the motion from a driving video, allowing for the creation of realistic image animations without extensive training data or specific annotations.

Pros

  • Enables high-quality image animation with minimal input requirements
  • Supports a wide range of applications, including face animation, human pose transfer, and object animation
  • Provides pre-trained models for quick experimentation and results
  • Offers flexibility in adapting to different types of motions and images

Cons

  • May require significant computational resources for training and inference
  • Performance can vary depending on the similarity between source and driving images
  • Limited control over specific aspects of the generated animation
  • Potential for misuse in creating deepfakes or misleading content

Code Examples

  1. Loading the model and performing inference:
from demo import load_checkpoints, make_animation
from skimage import io
from skimage.transform import resize
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation

# The vox-256 model expects 256x256 RGB frames; resize() also converts them to floats in [0, 1].
source_image = resize(io.imread('path/to/source.png'), (256, 256))[..., :3]
driving_video = [resize(frame, (256, 256))[..., :3] for frame in io.imread_collection('path/to/driving_video/*.png')]

generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml', checkpoint_path='vox-cpk.pth.tar')

predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=True)
  2. Visualizing the generated animation:
fig = plt.figure(figsize=(10, 10))
plt.axis('off')
im = plt.imshow(predictions[0])

def update_frame(frame):
    im.set_data(predictions[frame])
    return [im]

anim = animation.FuncAnimation(fig, update_frame, frames=len(predictions), interval=50, blit=True)
plt.show()
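  3. Saving the generated animation:
from moviepy.editor import ImageSequenceClip
from skimage import img_as_ubyte

# Predictions are float frames in [0, 1]; convert them to 8-bit before encoding.
clip = ImageSequenceClip([img_as_ubyte(frame) for frame in predictions], fps=25)
clip.write_videofile("output_animation.mp4", codec='libx264')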

Getting Started

  1. Clone the repository:

    git clone https://github.com/AliaksandrSiarohin/first-order-model.git
    cd first-order-model
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download pre-trained models:

    wget https://github.com/AliaksandrSiarohin/first-order-model/releases/download/v1.0.0/vox-cpk.pth.tar -P checkpoints/
    
  4. Run the demo:

    python demo.py --config config/vox-256.yaml --checkpoint checkpoints/vox-cpk.pth.tar --driving_video path/to/driving.mp4 --source_image path/to/source.png --result_video path/to/result.mp4
    

Competitor Comparisons

GFPGAN aims at developing Practical Algorithms for Real-world Face Restoration.

Pros of GFPGAN

  • Specializes in face restoration and enhancement
  • Offers pre-trained models for immediate use
  • Provides a user-friendly interface for non-technical users

Cons of GFPGAN

  • Limited to face-specific tasks, less versatile than first-order-model
  • May introduce artifacts in certain cases of face restoration
  • Requires more computational resources for high-resolution outputs

Code Comparison

GFPGAN:

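import cv2
from gfpgan import GFPGANer

img = cv2.imread('path/to/input.jpg', cv2.IMREAD_COLOR)
restorer = GFPGANer(model_path='experiments/pretrained_models/GFPGANv1.3.pth', upscale=2)
# enhance() returns the cropped faces, the restored faces, and the full restored image
_, _, restored_img = restorer.enhance(img, has_aligned=False, only_center_face=False, paste_back=True)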

first-order-model:

import imageio
from demo import load_checkpoints, make_animation

# args: parsed command-line options (source_image, driving_video, config, checkpoint, relative)
source_image = imageio.imread(args.source_image)
driving_video = imageio.mimread(args.driving_video)

generator, kp_detector = load_checkpoints(config_path=args.config, checkpoint_path=args.checkpoint)
predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=args.relative)

The code snippets highlight the different focus areas of each project. GFPGAN is centered on face restoration, while first-order-model is designed for more general image animation tasks.

Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration.

Pros of Real-ESRGAN

  • Focuses on image super-resolution and enhancement
  • Provides pre-trained models for easy implementation
  • Supports both CPU and GPU inference

Cons of Real-ESRGAN

  • Does not animate content; it only restores and upscales existing images and video frames
  • May introduce artifacts in some cases, especially with text

Code Comparison

Real-ESRGAN:

from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32)
upsampler = RealESRGANer(scale=4, model_path='weights/RealESRGAN_x4plus.pth', model=model)

first-order-model:

from demo import load_checkpoints
generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml', 
                                          checkpoint_path='vox-cpk.pth.tar')

Key Differences

  • Real-ESRGAN is primarily for image and video enhancement, while first-order-model focuses on image animation and face reenactment
  • Real-ESRGAN operates on individual images or frames, whereas first-order-model works with video sequences
  • first-order-model requires a source image and a driving video, while Real-ESRGAN only needs the input to be restored
  • Real-ESRGAN's output is a higher resolution version of the input, while first-order-model generates animated videos based on driving sequences

A latent text-to-image diffusion model

Pros of stable-diffusion

  • Generates high-quality images from text descriptions, offering more versatile creative applications
  • Supports various image manipulation tasks like inpainting, outpainting, and image-to-image translation
  • Has a larger and more active community, with frequent updates and improvements

Cons of stable-diffusion

  • Requires more computational resources and longer processing times for image generation
  • May produce less consistent results when generating multiple images from the same prompt
  • Has a steeper learning curve for fine-tuning and customization

Code Comparison

first-order-model:

from demo import load_checkpoints
generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml', 
                                          checkpoint_path='vox-cpk.pth.tar')

stable-diffusion:

from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
image = pipe("A beautiful sunset over the ocean").images[0]

Official PyTorch implementation of StyleGAN3

Pros of StyleGAN3

  • Produces higher quality and more realistic images with fewer artifacts
  • Offers better control over image generation and style mixing
  • Implements advanced techniques like alias-free sampling for improved results

Cons of StyleGAN3

  • Requires more computational resources and longer training times
  • Has a steeper learning curve due to its complexity
  • Limited to generating static images, unlike first-order-model's video capabilities

Code Comparison

StyleGAN3:

import torch
import dnnlib
import legacy

network_pkl = 'https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-t-ffhq-1024x1024.pkl'
device = torch.device('cuda')
with dnnlib.util.open_url(network_pkl) as f:
    G = legacy.load_network_pkl(f)['G_ema'].to(device)

first-order-model:

from demo import load_checkpoints
generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml', 
                                          checkpoint_path='vox-cpk.pth.tar')

Summary

StyleGAN3 excels in generating high-quality static images with advanced control, while first-order-model focuses on video manipulation and animation. StyleGAN3 requires more resources but offers superior image quality, whereas first-order-model is more versatile for video-based tasks. The code snippets highlight the difference in setup and usage between the two projects.

Efficient 3D human pose estimation in video using 2D keypoint trajectories

Pros of VideoPose3D

  • Focuses on 3D human pose estimation from video, offering more detailed skeletal tracking
  • Provides pre-trained models and datasets for easier implementation
  • Supports both 2D-to-3D pose lifting and end-to-end 3D pose estimation

Cons of VideoPose3D

  • Limited to human pose estimation, less versatile for general video manipulation
  • Requires more computational resources for 3D pose estimation
  • May struggle with complex scenes or multiple subjects

Code Comparison

VideoPose3D:

from common.model import TemporalModel
model = TemporalModel(num_joints_in, in_features, num_joints_out, filter_widths, causal=args.causal)

first-order-model:

from modules.generator import OcclusionAwareGenerator
# config: the parsed YAML file (e.g. config/vox-256.yaml), as loaded in demo.py
generator = OcclusionAwareGenerator(**config['model_params']['generator_params'],
                                    **config['model_params']['common_params'])

The code snippets show that VideoPose3D focuses on temporal modeling for pose estimation, while first-order-model uses an occlusion-aware generator for image manipulation tasks.

Deepfakes Software For All

Pros of faceswap

  • More comprehensive and feature-rich, offering a complete pipeline for face swapping
  • Extensive documentation and active community support
  • Includes a graphical user interface for easier use by non-technical users

Cons of faceswap

  • Requires more computational resources due to its complexity
  • Steeper learning curve for beginners
  • May produce less realistic results in some cases compared to first-order-model

Code Comparison

first-order-model:

source_image = torch.tensor(source_image[np.newaxis].astype(np.float32)).permute(0, 3, 1, 2)
driving_video = torch.tensor(np.array(driving_video)[np.newaxis].astype(np.float32)).permute(0, 4, 1, 2, 3)
predictions = model(source_image, driving_video)

faceswap:

detected_faces = self.detect_faces(image)
for face in detected_faces:
    landmarks = self.get_landmarks(image, face)
    mask = self.get_mask(image, landmarks)
    warped_face = self.warp_face(image, landmarks, self.reference_landmarks)

The first-order-model code focuses on generating animations from a single image, while faceswap's code demonstrates face detection, landmark extraction, and warping for face swapping. The first-order-model snippet operates on PyTorch tensors, while the faceswap snippet walks through the individual stages of a face-swapping pipeline.

README

!!! Check out our new paper and framework improved for articulated objects

First Order Motion Model for Image Animation

This repository contains the source code for the paper First Order Motion Model for Image Animation by Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci and Nicu Sebe.

Example animations

The videos on the left show the driving videos. The first row on the right for each dataset shows the source images. The bottom row contains the animated sequences, with motion transferred from the driving video and the object taken from the source image. We trained a separate network for each task.

VoxCeleb Dataset

Screenshot

Fashion Dataset

Screenshot

MGIF Dataset

Screenshot

Installation

We support python3. To install the dependencies run:

pip install -r requirements.txt

YAML configs

There are several configuration files (config/dataset_name.yaml), one for each dataset. See config/taichi-256.yaml for a description of each parameter.

Pre-trained checkpoint

Checkpoints can be found under the following links: google-drive or yandex-disk.

Animation Demo

To run a demo, download a checkpoint and run the following command:

python demo.py  --config config/dataset_name.yaml --driving_video path/to/driving --source_image path/to/source --checkpoint path/to/checkpoint --relative --adapt_scale

The result will be stored in result.mp4.

The driving videos and source images should be cropped before they can be used in our method. To obtain some semi-automatic crop suggestions you can use python crop-video.py --inp some_youtube_video.mp4. It will generate commands for crops using ffmpeg. In order to use the script, the face-alignment library is needed:

git clone https://github.com/1adrianb/face-alignment
cd face-alignment
pip install -r requirements.txt
python setup.py install
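
The crop suggestions mentioned above are ordinary ffmpeg invocations. As a rough sketch of what such a command might look like, a bounding box and a time range can be turned into an ffmpeg argument list as follows. The helper name `build_crop_command`, its parameters, and the 256x256 output size are illustrative assumptions, not the actual output format of crop-video.py.

```python
def build_crop_command(inp, out, left, top, width, height, start, duration, size=256):
    """Build an ffmpeg argument list that cuts a time range, crops a box, and rescales.

    Hypothetical helper for illustration; crop-video.py may format its
    suggestions differently.
    """
    # ffmpeg's crop filter takes width:height:x:y, scale takes width:height.
    video_filter = f"crop={width}:{height}:{left}:{top},scale={size}:{size}"
    return [
        "ffmpeg",
        "-i", inp,
        "-ss", str(start),      # start time in seconds
        "-t", str(duration),    # length of the clip in seconds
        "-filter:v", video_filter,
        out,
    ]


command = build_crop_command("some_youtube_video.mp4", "crop.mp4",
                             left=600, top=100, width=400, height=400,
                             start=3.5, duration=10.0)
print(" ".join(command))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting problems with unusual file names.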

Animation demo with Docker

If you are having trouble getting the demo to work because of library compatibility issues, and you're running Linux, you might try running it inside a Docker container, which would give you better control over the execution environment.

Requirements: Docker 19.03+ and nvidia-docker installed and able to successfully run the nvidia-docker usage tests.

We'll first build the container.

docker build -t first-order-model .

And now that we have the container available locally, we can use it to run the demo.

docker run -it --rm --gpus all \
       -v $HOME/first-order-model:/app first-order-model \
       python3 demo.py --config config/vox-256.yaml \
           --driving_video driving.mp4 \
           --source_image source.png \
           --checkpoint vox-cpk.pth.tar \
           --result_video result.mp4 \
           --relative --adapt_scale

Colab Demo

@graphemecluster prepared a GUI demo for Google Colab. It also works in Kaggle. For the source code, see demo.ipynb.

For the old demo, see old_demo.ipynb.

Face-swap

It is possible to modify the method to perform face-swap using supervised segmentation masks. For both unsupervised and supervised video editing, such as face-swap, please refer to Motion Co-Segmentation.

Training

To train a model on a specific dataset run:

CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py --config config/dataset_name.yaml --device_ids 0,1,2,3

The code will create a folder in the log directory (each run will create a new time-stamped directory). Checkpoints will be saved to this folder. To check the loss values during training, see log.txt. You can also check training data reconstructions in the train-vis subfolder. By default the batch size is tuned to run on 2 or 4 Titan-X GPUs (apart from speed it does not make much difference). You can change the batch size in train_params in the corresponding .yaml file.
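
One detail of the training command above is worth spelling out: CUDA renumbers the visible devices from zero, so the values passed to --device_ids index into CUDA_VISIBLE_DEVICES rather than naming physical GPUs. The sketch below builds a matching environment and argument list for a subset of GPUs. The helper name `training_command` is an assumption made for illustration and is not part of the repository.

```python
def training_command(config_path, physical_gpus):
    """Return (env, argv) for run.py using the given physical GPU indices.

    Hypothetical helper, not part of the repository.
    """
    visible = ",".join(str(gpu) for gpu in physical_gpus)
    # Inside the process the visible GPUs are renumbered 0..N-1.
    device_ids = ",".join(str(i) for i in range(len(physical_gpus)))
    env = {"CUDA_VISIBLE_DEVICES": visible}
    argv = ["python", "run.py", "--config", config_path, "--device_ids", device_ids]
    return env, argv


env, argv = training_command("config/dataset_name.yaml", [2, 3])
print(env, " ".join(argv))
```

With physical GPUs 2 and 3 selected, the process sees them as devices 0 and 1, which is why the generated --device_ids value is 0,1.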

Evaluation on video reconstruction

To evaluate the reconstruction performance run:

CUDA_VISIBLE_DEVICES=0 python run.py --config config/dataset_name.yaml --mode reconstruction --checkpoint path/to/checkpoint

You will need to specify the path to the checkpoint. A reconstruction subfolder will be created in the checkpoint folder. The generated videos will be stored in this folder, and also in its png subfolder in lossless '.png' format for evaluation. Instructions for computing the metrics from the paper can be found at https://github.com/AliaksandrSiarohin/pose-evaluation.
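
As a rough idea of what a pixel-level reconstruction score looks like, the sketch below computes a mean absolute (L1) difference between ground-truth and generated frames. It is an illustration only: the function name is an assumption, the frames are tiny grayscale images written as nested lists, and the linked pose-evaluation repository remains the reference for the metrics reported in the paper.

```python
def mean_absolute_error(frames_a, frames_b):
    """Average absolute pixel difference between two equally shaped frame lists."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(frames_a, frames_b):
        for row_a, row_b in zip(frame_a, frame_b):
            for pixel_a, pixel_b in zip(row_a, row_b):
                total += abs(pixel_a - pixel_b)
                count += 1
    return total / count


# One 2x2 grayscale frame per video, with pixel values in 0..255.
ground_truth = [[[0, 255], [128, 128]]]
generated = [[[0, 251], [124, 128]]]

l1_score = mean_absolute_error(ground_truth, generated)
print(l1_score)
```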

Image animation

In order to animate videos run:

CUDA_VISIBLE_DEVICES=0 python run.py --config config/dataset_name.yaml --mode animate --checkpoint path/to/checkpoint

You will need to specify the path to the checkpoint. An animation subfolder will be created in the same folder as the checkpoint. You can find the generated video there and its lossless version in the png subfolder. By default, videos from the test set are randomly paired, but you can specify the "source,driving" pairs in the corresponding .csv files. The path to this file should be specified in the pairs_list setting of the corresponding .yaml file.
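
A pairs file of this kind can be produced with Python's csv module. The sketch below assumes the file has a header row with `source` and `driving` columns, as the "source,driving" wording suggests. The video names are made up for illustration.

```python
import csv
import io

# Hypothetical (source, driving) video names from the test set.
pairs = [
    ("video_001.mp4", "video_042.mp4"),
    ("video_007.mp4", "video_013.mp4"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["source", "driving"])  # assumed column names
writer.writerows(pairs)

pairs_csv = buffer.getvalue()
print(pairs_csv)
```

Writing `pairs_csv` to a file and pointing the pairs_list setting at it would then fix the pairing instead of leaving it random.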

There are 2 different ways of performing animation: by using absolute keypoint locations or by using relative keypoint locations.

  1. Animation using absolute coordinates: the animation is performed using the absolute keypoint positions of the driving video and the appearance of the source image. In this way there are no specific requirements for the driving video and the source appearance that is used. However, this usually leads to poor performance, since irrelevant details such as shape are transferred. Check the animate parameters in taichi-256.yaml to enable this mode.
  2. Animation using relative coordinates: from the driving video we first estimate the relative movement of each keypoint, then we add this movement to the absolute position of the keypoints in the source image. These keypoints, along with the source image, are used for animation. This usually leads to better performance; however, it requires that the object in the first frame of the video and in the source image have the same pose.
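
The relative mode described above can be sketched in a few lines. This is a simplified illustration, not the repository's implementation: the function name `relative_keypoints` and the plain (x, y) tuples are assumptions, and details such as the scale adaptation enabled by --adapt_scale are left out.

```python
def relative_keypoints(kp_source, kp_driving, kp_driving_initial):
    """Shift source keypoints by the motion observed in the driving video.

    Each argument is a list of (x, y) tuples, one per keypoint.
    """
    result = []
    for (sx, sy), (dx, dy), (ix, iy) in zip(kp_source, kp_driving, kp_driving_initial):
        # Movement of this keypoint relative to the first driving frame.
        move_x, move_y = dx - ix, dy - iy
        result.append((sx + move_x, sy + move_y))
    return result


kp_source = [(0.0, 0.0), (0.5, 0.5)]
kp_driving_initial = [(0.25, 0.0), (0.75, 0.5)]
kp_driving = [(0.25, 0.25), (0.5, 0.5)]

animated = relative_keypoints(kp_source, kp_driving, kp_driving_initial)
print(animated)
```

Because only the displacement is transferred, the source object keeps its own shape, which is why the first driving frame and the source image should show similar poses.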

Datasets

  1. Bair. This dataset can be directly downloaded.

  2. Mgif. This dataset can be directly downloaded.

  3. Fashion. Follow the instruction on dataset downloading from.

  4. Taichi. Follow the instructions in data/taichi-loading or instructions from https://github.com/AliaksandrSiarohin/video-preprocessing.

  5. Nemo. Please follow the instructions on how to download the dataset. Then the dataset should be preprocessed using scripts from https://github.com/AliaksandrSiarohin/video-preprocessing.

  6. VoxCeleb. Please follow the instruction from https://github.com/AliaksandrSiarohin/video-preprocessing.

Training on your own dataset

  1. Resize all the videos to the same size, e.g. 256x256. The videos can be in '.gif' or '.mp4' format, or a folder with images. We recommend the latter: for each video, make a separate folder with all the frames in '.png' format. This format is lossless, and it has better I/O performance.

  2. Create a folder data/dataset_name with 2 subfolders train and test, put training videos in the train and testing in the test.

  3. Create a config file config/dataset_name.yaml; in dataset_params, specify the root directory as root_dir: data/dataset_name. Also adjust the number of epochs in train_params.
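
The folder layout from steps 1 and 2 can be prepared with a short script. This is a minimal sketch under stated assumptions: the helper name `make_dataset_layout`, the 80/20 split, and the video names are illustrative, and the script only creates the empty per-video frame folders (extracting and resizing frames is left out).

```python
import tempfile
from pathlib import Path


def make_dataset_layout(root, video_names, test_fraction=0.2):
    """Create root/train/<video>/ and root/test/<video>/ folders for PNG frames.

    Hypothetical helper; the split ratio is an arbitrary choice.
    """
    root = Path(root)
    n_test = max(1, int(len(video_names) * test_fraction))
    split = {"test": video_names[:n_test], "train": video_names[n_test:]}
    for subset, names in split.items():
        for name in names:
            (root / subset / name).mkdir(parents=True, exist_ok=True)
    return split


with tempfile.TemporaryDirectory() as tmp:
    dataset_root = Path(tmp) / "data" / "dataset_name"
    split = make_dataset_layout(dataset_root, ["clip_a", "clip_b", "clip_c", "clip_d", "clip_e"])
    # List the created folders relative to the dataset root.
    created = sorted(p.relative_to(dataset_root).as_posix() for p in dataset_root.glob("*/*"))
    print(created)
```

Each created folder would then receive the frames of one video as '.png' files, matching the recommendation in step 1.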

Additional notes

Citation:

@InProceedings{Siarohin_2019_NeurIPS,
  author={Siarohin, Aliaksandr and Lathuilière, Stéphane and Tulyakov, Sergey and Ricci, Elisa and Sebe, Nicu},
  title={First Order Motion Model for Image Animation},
  booktitle = {Conference on Neural Information Processing Systems (NeurIPS)},
  month = {December},
  year = {2019}
}