
facebookresearch / VideoPose3D

Efficient 3D human pose estimation in video using 2D keypoint trajectories


Top Related Projects

  • OpenPose: real-time multi-person keypoint detection library for body, face, hands, and foot estimation
  • human-pose-estimation.pytorch: official implementation of the ECCV 2018 paper "Simple Baselines for Human Pose Estimation and Tracking" (https://arxiv.org/abs/1804.06208)
  • VIBE: official implementation of the CVPR 2020 paper "VIBE: Video Inference for Human Body Pose and Shape Estimation"

Quick Overview

VideoPose3D is a state-of-the-art approach for 3D human pose estimation in video. Developed by Facebook Research, it leverages temporal information from video sequences to produce accurate 3D pose predictions. The project includes both the implementation of the method and pre-trained models for immediate use.
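The core of the model is a stack of temporal convolutions over 2D keypoint sequences, whose filter widths multiply into the receptive field (for example, five width-3 layers see 3^5 = 243 frames). The snippet below is a minimal, self-contained sketch of that idea, not the repository's actual TemporalModel (which lives in common/model.py):

import torch
import torch.nn as nn

# Toy dilated temporal-convolution stack: 17 joints x 2 coords in, 17 x 3 out.
# This is a simplified sketch of the 2D-to-3D lifting idea, not common/model.py itself.
num_joints, in_features = 17, 2
layers, dilation, in_ch = [], 1, num_joints * in_features
for width in [3, 3, 3, 3, 3]:              # receptive field = 3**5 = 243 frames
    layers += [nn.Conv1d(in_ch, 1024, width, dilation=dilation), nn.ReLU()]
    in_ch, dilation = 1024, dilation * width
layers.append(nn.Conv1d(1024, num_joints * 3, 1))   # per-frame 3D joint coordinates
lifter = nn.Sequential(*layers)

x = torch.randn(1, num_joints * in_features, 243)   # (batch, channels, frames)
print(lifter(x).shape)                              # torch.Size([1, 51, 1])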

Pros

  • High accuracy in 3D pose estimation compared to previous methods
  • Efficient implementation, allowing for real-time or near-real-time processing
  • Includes pre-trained models for quick deployment
  • Supports 2D-to-3D pose lifting from arbitrary 2D keypoint detectors, plus semi-supervised training on unlabeled video

Cons

  • Requires good 2D pose estimates as input for optimal performance
  • May struggle with complex or occluded poses
  • Limited to human pose estimation, not applicable to other objects or scenarios
  • Dependency on specific deep learning frameworks may limit portability

Code Examples

  1. Loading a pretrained model. The released checkpoints are dictionaries whose pose-model weights live under the 'model_pos' key (as read by run.py):

    import torch
    from common.model import TemporalModel

    # filter_widths [3, 3, 3, 3, 3] gives a receptive field of 3**5 = 243 frames
    model = TemporalModel(num_joints_in=17, in_features=2, num_joints_out=17,
                          filter_widths=[3, 3, 3, 3, 3], causal=False, dropout=0.25, channels=1024)
    checkpoint = torch.load('checkpoint/pretrained_h36m_cpn.bin', map_location='cpu')
    model.load_state_dict(checkpoint['model_pos'])

  2. Processing a video sequence. Here input_2d holds the normalized 2D keypoints, pad is (receptive_field - 1) // 2, causal_shift is 0 for non-causal models, and evaluate is an inference helper such as the one defined in run.py:

    from common.camera import camera_to_world
    from common.generators import UnchunkedGenerator

    generator = UnchunkedGenerator(None, None, [input_2d], pad=pad,
                                   causal_shift=causal_shift, augment=False)
    prediction = evaluate(generator, return_predictions=True)
    # Map the camera-space prediction to world coordinates
    prediction = camera_to_world(prediction, R=cam['orientation'], t=cam['translation'])

  3. Visualizing the 3D pose. render_animation (common/visualization.py) writes the output file itself (MP4 or GIF, depending on the extension) rather than returning an animation object; see the --render code path in run.py for a complete call:

    from common.visualization import render_animation

    render_animation(keypoints, keypoints_metadata, {'Reconstruction': prediction}, skeleton,
                     fps=30, bitrate=3000, azim=70, output='output_animation.mp4',
                     viewport=(cam['res_w'], cam['res_h']), limit=60)
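Before the keypoints reach the generator in example 2, they are normalized to the camera frame with the repository's helper from common/camera.py. A short sketch, assuming the code runs from the repository root and that the detector output has already been saved to a file (the file name and image resolution below are illustrative):

import numpy as np
from common.camera import normalize_screen_coordinates

# input_2d: (num_frames, 17, 2) pixel-space keypoints from a 2D detector
input_2d = np.load('keypoints_2d.npy')   # hypothetical file holding detector output
input_2d[..., :2] = normalize_screen_coordinates(input_2d[..., :2], w=1000, h=1002)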

Getting Started

  1. Clone the repository:

    git clone https://github.com/facebookresearch/VideoPose3D.git
    cd VideoPose3D
    
  2. Install the dependencies (Python 3+ and PyTorch >= 0.4.0; Matplotlib and ffmpeg are optional, for visualization):

    pip install torch matplotlib
    
  3. Download the pretrained models into a checkpoint/ directory:

    mkdir checkpoint
    cd checkpoint
    wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_cpn.bin
    # pretrained_h36m_detectron_coco.bin is the checkpoint used for in-the-wild inference in step 4 (see INFERENCE.md)
    wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_detectron_coco.bin
    cd ..
    
  4. Run inference on a custom video (this assumes you have already prepared 2D detections for your video as the custom dataset "myvideos", as described in INFERENCE.md):

    python run.py -d custom -k myvideos -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_detectron_coco.bin --render --viz-subject input_video.mp4 --viz-action custom --viz-camera 0 --viz-video input_video.mp4 --viz-output output.mp4
    

Competitor Comparisons


OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation

Pros of OpenPose

  • Real-time performance for 2D pose estimation on multiple people
  • Supports multi-person tracking and hand/face keypoint detection
  • Extensive documentation and community support

Cons of OpenPose

  • Limited to 2D pose estimation, lacking 3D capabilities
  • Higher computational requirements for real-time processing

Code Comparison

OpenPose (C++):

auto datum = opWrapper.emplaceAndPop(imageToProcess);
if (datum != nullptr)
{
    auto poseKeypoints = datum->at(0)->poseKeypoints;
    // Process pose keypoints
}

VideoPose3D (Python):

import torch

with torch.no_grad():
    model_pos.eval()
    predicted_3d_pos = model_pos(inputs_2d)              # (batch, frames, joints, 3)
    predicted_3d_pos = predicted_3d_pos.squeeze(0).cpu().numpy()
    # Process 3D pose predictions

Summary

OpenPose excels in real-time 2D pose estimation for multiple people, offering additional features like hand and face detection. It has strong community support but requires more computational power. VideoPose3D focuses on 3D pose estimation from video, providing a different approach to pose analysis. The choice between them depends on whether 2D real-time or 3D video-based pose estimation is needed for the specific application.

Official implementation of the ECCV 2018 paper "Simple Baselines for Human Pose Estimation and Tracking" (https://arxiv.org/abs/1804.06208)

Pros of human-pose-estimation.pytorch

  • Focuses on 2D pose estimation, which can be more suitable for real-time applications
  • Provides a simpler implementation, making it easier to understand and modify
  • Includes pre-trained models for quick deployment

Cons of human-pose-estimation.pytorch

  • Limited to 2D pose estimation, lacking depth information
  • May not be as accurate for complex poses or occlusions compared to 3D approaches

Code Comparison

VideoPose3D:

from common.model import TemporalModel
model = TemporalModel(num_joints_in, in_features, num_joints_out, filter_widths, causal=args.causal)

human-pose-estimation.pytorch:

from models.pose_resnet import get_pose_net
model = get_pose_net(cfg, is_train=True)

VideoPose3D focuses on temporal modeling for 3D pose estimation, while human-pose-estimation.pytorch uses a ResNet-based architecture for 2D pose estimation. The code snippets highlight the different approaches in model initialization and architecture.

Both repositories offer valuable tools for pose estimation, with VideoPose3D providing more advanced 3D capabilities and human-pose-estimation.pytorch offering a simpler 2D approach. The choice between them depends on the specific requirements of the project, such as real-time performance, accuracy needs, and the importance of depth information.


Official implementation of the CVPR 2020 paper "VIBE: Video Inference for Human Body Pose and Shape Estimation"

Pros of VIBE

  • Provides more robust and accurate 3D human pose and shape estimation
  • Offers end-to-end training and inference pipeline
  • Supports both single-image and video-based estimation

Cons of VIBE

  • May require more computational resources due to its complexity
  • Less flexible for customization compared to VideoPose3D
  • Potentially slower inference time for real-time applications

Code Comparison

VideoPose3D:

from common.model import TemporalModel
model = TemporalModel(num_joints_in, in_features, num_joints_out, filter_widths, causal=args.causal)

VIBE:

from lib.models.vibe import VIBE
model = VIBE(
    seqlen=cfg.DATASET.SEQLEN,
    n_layers=cfg.MODEL.TGRU.NUM_LAYERS,
    hidden_size=cfg.MODEL.TGRU.HIDDEN_SIZE
)

Both repositories focus on 3D human pose estimation, but VIBE offers a more comprehensive solution for both pose and shape estimation. VideoPose3D provides a simpler, more customizable approach, which may be preferable for specific use cases or when computational resources are limited. The code comparison shows that VIBE uses a more complex model structure, while VideoPose3D employs a temporal model with customizable filter widths.


README

3D human pose estimation in video with temporal convolutions and semi-supervised training

This is the implementation of the approach described in the paper:

Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

More demos are available at https://dariopavllo.github.io/VideoPose3D

Results on Human3.6M

Under Protocol 1 (mean per-joint position error) and Protocol 2 (mean per-joint position error after rigid alignment).

| 2D Detections | BBoxes | Blocks | Receptive Field | Error (P1) | Error (P2) |
|---------------|--------|--------|-----------------|------------|------------|
| CPN | Mask R-CNN | 4 | 243 frames | 46.8 mm | 36.5 mm |
| CPN | Ground truth | 4 | 243 frames | 47.1 mm | 36.8 mm |
| CPN | Ground truth | 3 | 81 frames | 47.7 mm | 37.2 mm |
| CPN | Ground truth | 2 | 27 frames | 48.8 mm | 38.0 mm |
| Mask R-CNN | Mask R-CNN | 4 | 243 frames | 51.6 mm | 40.3 mm |
| Ground truth | -- | 4 | 243 frames | 37.2 mm | 27.2 mm |
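For reference, Protocol 1 is the mean Euclidean distance between predicted and ground-truth joints, while Protocol 2 measures the same distance after rigidly aligning the prediction to the ground truth. A minimal sketch of the Protocol 1 metric (the repository's own implementations live in common/loss.py):

import numpy as np

def mpjpe(predicted, target):
    """Protocol 1: mean per-joint position error, in the units of the inputs (mm here).

    predicted, target: arrays of shape (frames, joints, 3).
    """
    return np.mean(np.linalg.norm(predicted - target, axis=-1))

# Illustrative call with random data
pred = np.random.randn(100, 17, 3) * 50
gt = pred + np.random.randn(100, 17, 3) * 10
print(mpjpe(pred, gt))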

Quick start

To get started as quickly as possible, follow the instructions in this section. This should allow you to train a model from scratch, test our pretrained models, and produce basic visualizations. For more detailed instructions, please refer to DOCUMENTATION.md.

Dependencies

Make sure you have the following dependencies installed before proceeding:

  • Python 3+ distribution
  • PyTorch >= 0.4.0

Optional:

  • Matplotlib, if you want to visualize predictions. Additionally, you need ffmpeg to export MP4 videos, and imagemagick to export GIFs.
  • MATLAB, if you want to experiment with HumanEva-I (you need this to convert the dataset).

Dataset setup

You can find the instructions for setting up the Human3.6M and HumanEva-I datasets in DATASETS.md. For this short guide, we focus on Human3.6M. You are not required to set up HumanEva, unless you want to experiment with it.

In order to proceed, you must also copy CPN detections (for Human3.6M) and/or Mask R-CNN detections (for HumanEva).

Evaluating our pretrained models

The pretrained models can be downloaded from AWS. Put pretrained_h36m_cpn.bin (for Human3.6M) and/or pretrained_humaneva15_detectron.bin (for HumanEva) in the checkpoint/ directory (create it if it does not exist).

mkdir checkpoint
cd checkpoint
wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_cpn.bin
wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_humaneva15_detectron.bin
cd ..
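If you want to inspect or load a checkpoint manually rather than through run.py, the files are standard PyTorch checkpoints. A quick sanity check (the 'model_pos' key is the one run.py reads for the pose model; print the keys if your checkpoint differs):

import torch

checkpoint = torch.load('checkpoint/pretrained_h36m_cpn.bin', map_location='cpu')
print(checkpoint.keys())                              # e.g. includes 'model_pos'
state_dict = checkpoint['model_pos']                  # assumed key, as used by run.py
print(sum(p.numel() for p in state_dict.values()))    # total parameter count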

These models allow you to reproduce our top-performing baselines, which are:

  • 46.8 mm for Human3.6M, using fine-tuned CPN detections, bounding boxes from Mask R-CNN, and an architecture with a receptive field of 243 frames.
  • 33.0 mm for HumanEva-I (on 3 actions), using pretrained Mask R-CNN detections, and an architecture with a receptive field of 27 frames. This is the multi-action model trained on 3 actions (Walk, Jog, Box).

To test on Human3.6M, run:

python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin

To test on HumanEva, run:

python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -a Walk,Jog,Box --by-subject -c checkpoint --evaluate pretrained_humaneva15_detectron.bin

DOCUMENTATION.md provides a precise description of all command-line arguments.

Inference in the wild

We have introduced an experimental feature to run our model on custom videos. See INFERENCE.md for more details.

Training from scratch

If you want to reproduce the results of our pretrained models, run the following commands.

For Human3.6M:

python run.py -e 80 -k cpn_ft_h36m_dbb -arc 3,3,3,3,3

By default the application runs in training mode. This will train a new model for 80 epochs, using fine-tuned CPN detections. Expect a training time of 24 hours on a high-end Pascal GPU. If you feel that this is too much, or your GPU is not powerful enough, you can train a model with a smaller receptive field, e.g.

  • -arc 3,3,3,3 (81 frames) should require 11 hours and achieve 47.7 mm.
  • -arc 3,3,3 (27 frames) should require 6 hours and achieve 48.8 mm.

You could also lower the number of epochs from 80 to 60 with a negligible impact on the result.
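The receptive field quoted for each -arc setting is simply the product of its filter widths, which gives a quick way to check how many frames a given architecture sees:

# Receptive field (in frames) for a given -arc value, e.g. "3,3,3,3,3" -> 243
def receptive_field(arc):
    frames = 1
    for width in map(int, arc.split(',')):
        frames *= width
    return frames

for arc in ['3,3,3', '3,3,3,3', '3,3,3,3,3']:
    print(arc, receptive_field(arc))   # 27, 81 and 243 frames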

For HumanEva:

python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -b 128 -e 1000 -lrd 0.996 -a Walk,Jog,Box --by-subject

This will train for 1000 epochs, using Mask R-CNN detections and evaluating each subject separately. Since HumanEva is much smaller than Human3.6M, training should require about 50 minutes.

Semi-supervised training

To perform semi-supervised training, you just need to add the --subjects-unlabeled argument. In the example below, we use ground-truth 2D poses as input, and train supervised on just 10% of Subject 1 (specified by --subset 0.1). The remaining subjects are treated as unlabeled data and are used for semi-supervision.

python run.py -k gt --subjects-train S1 --subset 0.1 --subjects-unlabeled S5,S6,S7,S8 -e 200 -lrd 0.98 -arc 3,3,3 --warmup 5 -b 64

This should give you an error around 65.2 mm. By contrast, if we only train supervised

python run.py -k gt --subjects-train S1 --subset 0.1 -e 200 -lrd 0.98 -arc 3,3,3 -b 64

we get around 80.7 mm, which is significantly higher.
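The gain comes from the unlabeled subjects: their predicted 3D poses are projected back into the camera and compared against the input 2D keypoints. The sketch below illustrates that reprojection term only, with a simplified pinhole camera; the actual semi-supervised objective in run.py also involves a trajectory model and further regularization, so treat this as a conceptual outline rather than the implementation:

import torch

def reprojection_loss(pred_3d_cam, keypoints_2d, focal, center):
    """Conceptual sketch of a 2D reprojection consistency term for unlabeled frames.

    pred_3d_cam:  (N, J, 3) predicted joints in camera space (positive depth assumed)
    keypoints_2d: (N, J, 2) detected 2D keypoints in image coordinates
    focal, center: simplified pinhole intrinsics, tensors of shape (2,)
    """
    projected = pred_3d_cam[..., :2] / pred_3d_cam[..., 2:3] * focal + center
    return torch.mean(torch.norm(projected - keypoints_2d, dim=-1))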

Visualization

If you have the original Human3.6M videos, you can generate nice visualizations of the model predictions. For instance:

python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin --render --viz-subject S11 --viz-action Walking --viz-camera 0 --viz-video "/path/to/videos/S11/Videos/Walking.54138969.mp4" --viz-output output.gif --viz-size 3 --viz-downsample 2 --viz-limit 60

The script can also export MP4 videos, and supports a variety of parameters (e.g. downsampling/FPS, size, bitrate). See DOCUMENTATION.md for more details.

License

This work is licensed under CC BY-NC. See LICENSE for details. Third-party datasets are subject to their respective licenses. If you use our code/models in your research, please cite our paper:

@inproceedings{pavllo:videopose3d:2019,
  title={3D human pose estimation in video with temporal convolutions and semi-supervised training},
  author={Pavllo, Dario and Feichtenhofer, Christoph and Grangier, David and Auli, Michael},
  booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2019}
}