pytorch-image-models
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
Top Related Projects
Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
Models and examples built with TensorFlow
Datasets, Transforms and Models specific to Computer Vision
Deep Learning for humans
Best Practices, code samples, and documentation for Computer Vision.
Quick Overview
PyTorch Image Models (timm) is a collection of image models, layers, utilities, optimizers, schedulers, data-loaders, augmentations, and training/validation scripts for PyTorch. It aims to pull together a wide variety of SOTA models with ability to reproduce ImageNet training results.
Pros
- Extensive collection of pre-trained models and implementations
- Consistent interface for different models, making it easy to switch between them
- Regular updates with new models and improvements
- Includes training scripts and utilities for fine-tuning and evaluation
Cons
- Large repository size due to the extensive collection of models
- Can be overwhelming for beginners due to the wide range of options
- Some models may have dependencies on specific PyTorch versions
- Documentation could be more comprehensive for some advanced features
Code Examples
- Loading a pre-trained model:
import timm
model = timm.create_model('resnet50', pretrained=True)
model.eval()
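- Performing inference on an image:
from PIL import Image
import torch
import timm
import timm.data

model = timm.create_model('resnet50', pretrained=True)
model.eval()
# Convert to RGB so grayscale or RGBA files match the 3-channel model input
img = Image.open('path/to/image.jpg').convert('RGB')
transform = timm.data.create_transform(
    input_size=224,
    is_training=False
)
img_tensor = transform(img).unsqueeze(0)  # add a batch dimension
with torch.no_grad():
    output = model(img_tensor)
probabilities = torch.nn.functional.softmax(output[0], dim=0)
print(probabilities.topk(5))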
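- Fine-tuning a model on a custom dataset:
import torch
import timm
model = timm.create_model('efficientnet_b0', pretrained=True, num_classes=10)
model.train()
# Loss and optimizer are illustrative choices; tune them for your task
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Assumes num_epochs and dataloader (yielding image/label batches) are defined for your custom dataset
for epoch in range(num_epochs):
    for images, labels in dataloader:
        outputs = model(images)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()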
Getting Started
To get started with PyTorch Image Models:
- Install the library:
pip install timm
- Import and use in your Python script:
import timm
# List available models
print(timm.list_models())
# Create a model
model = timm.create_model('resnet50', pretrained=True)
# Use the model for inference or training
# ...
Competitor Comparisons
Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
Pros of Detectron2
- Comprehensive suite of object detection and segmentation models
- Extensive documentation and tutorials for various use cases
- Built-in support for distributed training and deployment
Cons of Detectron2
- Steeper learning curve for beginners
- More focused on detection and segmentation tasks, less versatile for general image classification
- Heavier framework with more dependencies
Code Comparison
Detectron2:
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
cfg = get_cfg()
cfg.merge_from_file("path/to/config.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(image)
pytorch-image-models:
import timm
model = timm.create_model('resnet50', pretrained=True)
model.eval()
output = model(image)
pytorch-image-models offers a more straightforward API for quickly loading and using pre-trained models, while Detectron2 provides a more comprehensive configuration system for fine-tuning complex detection and segmentation models.
Models and examples built with TensorFlow
Pros of tensorflow/models
- Broader scope, covering various ML domains beyond just image models
- Official TensorFlow implementation, ensuring compatibility and optimization
- Extensive documentation and tutorials for each model
Cons of tensorflow/models
- Less focused on image models specifically, potentially lacking some specialized architectures
- May have a steeper learning curve due to its broader scope
- Updates might be less frequent for individual model categories
Code Comparison
tensorflow/models:
import tensorflow as tf
from official.vision.image_classification import resnet_model
model = resnet_model.resnet50(num_classes=1000)
pytorch-image-models:
import timm
model = timm.create_model('resnet50', pretrained=True, num_classes=1000)
Summary
tensorflow/models is a comprehensive repository for various machine learning tasks, while pytorch-image-models focuses specifically on image models. The TensorFlow repository offers a wider range of models and official implementations, but may be more complex to navigate. pytorch-image-models provides a more streamlined experience for image-related tasks, with a simpler API and frequent updates. The choice between the two depends on the specific project requirements and the preferred deep learning framework.
Datasets, Transforms and Models specific to Computer Vision
Pros of vision
- Official PyTorch repository, ensuring long-term support and compatibility
- Comprehensive set of computer vision tools beyond just models
- Tightly integrated with other PyTorch libraries and ecosystem
Cons of vision
- Fewer pre-trained models compared to pytorch-image-models
- Less frequent updates and new model implementations
- May have a steeper learning curve for beginners
Code Comparison
vision:
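import torchvision.models as models
# torchvision >= 0.13 selects weights via the weights= argument; pretrained=True is deprecated
resnet18 = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)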
pytorch-image-models:
import timm
model = timm.create_model('resnet18', pretrained=True)
Both repositories provide easy access to pre-trained models, but pytorch-image-models (timm) offers a wider variety of models and more flexibility in model creation. vision focuses on providing a comprehensive set of tools for computer vision tasks, including datasets, transforms, and utilities, while pytorch-image-models specializes in offering a large collection of image models with consistent API.
vision is ideal for users deeply integrated into the PyTorch ecosystem, while pytorch-image-models is excellent for those seeking a wide range of cutting-edge models with minimal setup. The choice between them depends on specific project requirements and personal preferences.
Deep Learning for humans
Pros of Keras
- Higher-level API, making it easier for beginners to get started
- Supports multiple backend engines (TensorFlow, JAX, and PyTorch in Keras 3; older releases supported Theano and CNTK)
- Extensive documentation and community support
Cons of Keras
- Less flexibility for advanced users compared to PyTorch
- Slower development cycle for cutting-edge features
- Limited support for dynamic computational graphs
Code Comparison
Keras:
from keras.models import Sequential
from keras.layers import Dense
model = Sequential([
Dense(64, activation='relu', input_shape=(784,)),
Dense(10, activation='softmax')
])
pytorch-image-models:
import timm
model = timm.create_model('resnet18', pretrained=True, num_classes=10)
The Keras example shows its simplicity in creating a basic neural network, while the pytorch-image-models snippet demonstrates the ease of using pre-trained models with a single line of code.
pytorch-image-models focuses specifically on computer vision tasks and provides a wide range of state-of-the-art image models. It offers more flexibility and customization options for researchers and advanced practitioners. Keras, on the other hand, is a more general-purpose deep learning library that supports various types of neural networks and is known for its user-friendly interface.
Best Practices, code samples, and documentation for Computer Vision.
Pros of computervision-recipes
- Comprehensive collection of computer vision recipes and notebooks
- Covers a wide range of CV tasks, including object detection, image classification, and segmentation
- Provides end-to-end examples and best practices for Azure integration
Cons of computervision-recipes
- Less focused on state-of-the-art model implementations
- May have a steeper learning curve for those not familiar with Azure ecosystem
- Fewer pre-trained models available compared to pytorch-image-models
Code Comparison
pytorch-image-models:
import timm
model = timm.create_model('resnet50', pretrained=True)
output = model(input_tensor)
computervision-recipes:
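import os
from azureml.core import Workspace
from azureml.core.model import Model
# Load the workspace from a local config.json (assumes one is present)
ws = Workspace.from_config()
model = Model(ws, 'my_model')
model.download(target_dir=os.getcwd(), exist_ok=True)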
pytorch-image-models focuses on providing a wide range of pre-trained models with a simple API, while computervision-recipes emphasizes Azure integration and end-to-end workflows for various computer vision tasks. The choice between the two depends on specific project requirements and the desired level of Azure integration.
README
PyTorch Image Models
- What's New
- Introduction
- Models
- Features
- Results
- Getting Started (Documentation)
- Train, Validation, Inference Scripts
- Awesome PyTorch Resources
- Licenses
- Citing
What's New
Aug 21, 2024
- Updated SBB ViT models trained on ImageNet-12k and fine-tuned on ImageNet-1k, challenging quite a number of much larger, slower models
model | top1 | top5 | param_count | img_size |
---|---|---|---|---|
vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k | 87.438 | 98.256 | 64.11 | 384 |
vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k | 86.608 | 97.934 | 64.11 | 256 |
vit_betwixt_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k | 86.594 | 98.02 | 60.4 | 384 |
vit_betwixt_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k | 85.734 | 97.61 | 60.4 | 256 |
- MobileNet-V1 1.25, EfficientNet-B1, & ResNet50-D weights w/ MNV4 baseline challenge recipe
model | top1 | top5 | param_count | img_size |
---|---|---|---|---|
resnet50d.ra4_e3600_r224_in1k | 81.838 | 95.922 | 25.58 | 288 |
efficientnet_b1.ra4_e3600_r240_in1k | 81.440 | 95.700 | 7.79 | 288 |
resnet50d.ra4_e3600_r224_in1k | 80.952 | 95.384 | 25.58 | 224 |
efficientnet_b1.ra4_e3600_r240_in1k | 80.406 | 95.152 | 7.79 | 240 |
mobilenetv1_125.ra4_e3600_r224_in1k | 77.600 | 93.804 | 6.27 | 256 |
mobilenetv1_125.ra4_e3600_r224_in1k | 76.924 | 93.234 | 6.27 | 224 |
- Add SAM2 (HieraDet) backbone arch & weight loading support
- Add Hiera Small weights trained w/ abswin pos embed on in12k & fine-tuned on 1k
model | top1 | top5 | param_count |
---|---|---|---|
hiera_small_abswin_256.sbb2_e200_in12k_ft_in1k | 84.912 | 97.260 | 35.01 |
hiera_small_abswin_256.sbb2_pd_e200_in12k_ft_in1k | 84.560 | 97.106 | 35.01 |
Aug 8, 2024
- Add RDNet ('DenseNets Reloaded', https://arxiv.org/abs/2403.19588), thanks Donghyun Kim
July 28, 2024
- Add `mobilenet_edgetpu_v2_m` weights w/ `ra4` mnv4-small based recipe. 80.1% top-1 @ 224 and 80.7 @ 256.
- Release 1.0.8
July 26, 2024
- More MobileNet-v4 weights, ImageNet-12k pretrain w/ fine-tunes, and anti-aliased ConvLarge models
model | top1 | top1_err | top5 | top5_err | param_count | img_size |
---|---|---|---|---|---|---|
mobilenetv4_conv_aa_large.e230_r448_in12k_ft_in1k | 84.99 | 15.01 | 97.294 | 2.706 | 32.59 | 544 |
mobilenetv4_conv_aa_large.e230_r384_in12k_ft_in1k | 84.772 | 15.228 | 97.344 | 2.656 | 32.59 | 480 |
mobilenetv4_conv_aa_large.e230_r448_in12k_ft_in1k | 84.64 | 15.36 | 97.114 | 2.886 | 32.59 | 448 |
mobilenetv4_conv_aa_large.e230_r384_in12k_ft_in1k | 84.314 | 15.686 | 97.102 | 2.898 | 32.59 | 384 |
mobilenetv4_conv_aa_large.e600_r384_in1k | 83.824 | 16.176 | 96.734 | 3.266 | 32.59 | 480 |
mobilenetv4_conv_aa_large.e600_r384_in1k | 83.244 | 16.756 | 96.392 | 3.608 | 32.59 | 384 |
mobilenetv4_hybrid_medium.e200_r256_in12k_ft_in1k | 82.99 | 17.01 | 96.67 | 3.33 | 11.07 | 320 |
mobilenetv4_hybrid_medium.e200_r256_in12k_ft_in1k | 82.364 | 17.636 | 96.256 | 3.744 | 11.07 | 256 |
- Impressive MobileNet-V1 and EfficientNet-B0 baseline challenges (https://huggingface.co/blog/rwightman/mobilenet-baselines)
model | top1 | top1_err | top5 | top5_err | param_count | img_size |
---|---|---|---|---|---|---|
efficientnet_b0.ra4_e3600_r224_in1k | 79.364 | 20.636 | 94.754 | 5.246 | 5.29 | 256 |
efficientnet_b0.ra4_e3600_r224_in1k | 78.584 | 21.416 | 94.338 | 5.662 | 5.29 | 224 |
mobilenetv1_100h.ra4_e3600_r224_in1k | 76.596 | 23.404 | 93.272 | 6.728 | 5.28 | 256 |
mobilenetv1_100.ra4_e3600_r224_in1k | 76.094 | 23.906 | 93.004 | 6.996 | 4.23 | 256 |
mobilenetv1_100h.ra4_e3600_r224_in1k | 75.662 | 24.338 | 92.504 | 7.496 | 5.28 | 224 |
mobilenetv1_100.ra4_e3600_r224_in1k | 75.382 | 24.618 | 92.312 | 7.688 | 4.23 | 224 |
- Prototype of `set_input_size()` added to vit and swin v1/v2 models to allow changing image size, patch size, window size after model creation.
- Improved support in swin for different size handling; in addition to `set_input_size`, `always_partition` and `strict_img_size` args have been added to `__init__` to allow more flexible input size constraints.
- Fix out of order indices info for intermediate 'Getter' feature wrapper, check out of range indices for same.
- Add several `tiny` < .5M param models for testing that are actually trained on ImageNet-1k
model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct |
---|---|---|---|---|---|---|---|
test_efficientnet.r160_in1k | 47.156 | 52.844 | 71.726 | 28.274 | 0.36 | 192 | 1.0 |
test_byobnet.r160_in1k | 46.698 | 53.302 | 71.674 | 28.326 | 0.46 | 192 | 1.0 |
test_efficientnet.r160_in1k | 46.426 | 53.574 | 70.928 | 29.072 | 0.36 | 160 | 0.875 |
test_byobnet.r160_in1k | 45.378 | 54.622 | 70.572 | 29.428 | 0.46 | 160 | 0.875 |
test_vit.r160_in1k | 42.0 | 58.0 | 68.664 | 31.336 | 0.37 | 192 | 1.0 |
test_vit.r160_in1k | 40.822 | 59.178 | 67.212 | 32.788 | 0.37 | 160 | 0.875 |
- Fix vit reg token init, thanks Promisery
- Other misc fixes
June 24, 2024
- 3 more MobileNetV4 hybrid weights with different MQA weight init scheme
model | top1 | top1_err | top5 | top5_err | param_count | img_size |
---|---|---|---|---|---|---|
mobilenetv4_hybrid_large.ix_e600_r384_in1k | 84.356 | 15.644 | 96.892 | 3.108 | 37.76 | 448 |
mobilenetv4_hybrid_large.ix_e600_r384_in1k | 83.990 | 16.010 | 96.702 | 3.298 | 37.76 | 384 |
mobilenetv4_hybrid_medium.ix_e550_r384_in1k | 83.394 | 16.606 | 96.760 | 3.240 | 11.07 | 448 |
mobilenetv4_hybrid_medium.ix_e550_r384_in1k | 82.968 | 17.032 | 96.474 | 3.526 | 11.07 | 384 |
mobilenetv4_hybrid_medium.ix_e550_r256_in1k | 82.492 | 17.508 | 96.278 | 3.722 | 11.07 | 320 |
mobilenetv4_hybrid_medium.ix_e550_r256_in1k | 81.446 | 18.554 | 95.704 | 4.296 | 11.07 | 256 |
- florence2 weight loading in DaViT model
June 12, 2024
- MobileNetV4 models and initial set of `timm` trained weights added.
- Apple MobileCLIP (https://arxiv.org/pdf/2311.17049, FastViT and ViT-B) image tower model support & weights added (part of OpenCLIP support).
- ViTamin (https://arxiv.org/abs/2404.02132) CLIP image tower model & weights added (part of OpenCLIP support).
- OpenAI CLIP Modified ResNet image tower modelling & weight support (via ByobNet). Refactor AttentionPool2d.
May 14, 2024
- Support loading PaliGemma jax weights into SigLIP ViT models with average pooling.
- Add Hiera models from Meta (https://github.com/facebookresearch/hiera).
- Add `normalize=` flag for transforms, return non-normalized torch.Tensor with original dtype (for `chug`)
- Version 1.0.3 release
May 11, 2024
- Searching for Better ViT Baselines (For the GPU Poor) weights and vit variants released. Exploring model shapes between Tiny and Base.
model | top1 | top5 | param_count | img_size |
---|---|---|---|---|
vit_mediumd_patch16_reg4_gap_256.sbb_in12k_ft_in1k | 86.202 | 97.874 | 64.11 | 256 |
vit_betwixt_patch16_reg4_gap_256.sbb_in12k_ft_in1k | 85.418 | 97.48 | 60.4 | 256 |
vit_mediumd_patch16_rope_reg1_gap_256.sbb_in1k | 84.322 | 96.812 | 63.95 | 256 |
vit_betwixt_patch16_rope_reg4_gap_256.sbb_in1k | 83.906 | 96.684 | 60.23 | 256 |
vit_base_patch16_rope_reg1_gap_256.sbb_in1k | 83.866 | 96.67 | 86.43 | 256 |
vit_medium_patch16_rope_reg1_gap_256.sbb_in1k | 83.81 | 96.824 | 38.74 | 256 |
vit_betwixt_patch16_reg4_gap_256.sbb_in1k | 83.706 | 96.616 | 60.4 | 256 |
vit_betwixt_patch16_reg1_gap_256.sbb_in1k | 83.628 | 96.544 | 60.4 | 256 |
vit_medium_patch16_reg4_gap_256.sbb_in1k | 83.47 | 96.622 | 38.88 | 256 |
vit_medium_patch16_reg1_gap_256.sbb_in1k | 83.462 | 96.548 | 38.88 | 256 |
vit_little_patch16_reg4_gap_256.sbb_in1k | 82.514 | 96.262 | 22.52 | 256 |
vit_wee_patch16_reg1_gap_256.sbb_in1k | 80.256 | 95.360 | 13.42 | 256 |
vit_pwee_patch16_reg1_gap_256.sbb_in1k | 80.072 | 95.136 | 15.25 | 256 |
vit_mediumd_patch16_reg4_gap_256.sbb_in12k | N/A | N/A | 64.11 | 256 |
vit_betwixt_patch16_reg4_gap_256.sbb_in12k | N/A | N/A | 60.4 | 256 |
- AttentionExtract helper added to extract attention maps from `timm` models. See example in https://github.com/huggingface/pytorch-image-models/discussions/1232#discussioncomment-9320949
- `forward_intermediates()` API refined and added to more models including some ConvNets that have other extraction methods.
- 1017 of 1047 model architectures support `features_only=True` feature extraction. Remaining 34 architectures can be supported but based on priority requests.
- Remove torch.jit.script annotated functions including old JIT activations. Conflict with dynamo and dynamo does a much better job when used.
April 11, 2024
- Prepping for a long overdue 1.0 release, things have been stable for a while now.
- Significant feature that's been missing for a while, `features_only=True` support for ViT models with flat hidden states or non-std module layouts (so far covering `'vit_*', 'twins_*', 'deit*', 'beit*', 'mvitv2*', 'eva*', 'samvit_*', 'flexivit*'`)
- Above feature support achieved through a new `forward_intermediates()` API that can be used with a feature wrapping module or directly.
import torch
import timm

model = timm.create_model('vit_base_patch16_224')
input = torch.randn(2, 3, 224, 224)  # example batch; any (N, 3, 224, 224) tensor works
final_feat, intermediates = model.forward_intermediates(input)
output = model.forward_head(final_feat)  # pooling + classifier head
print(final_feat.shape)
# torch.Size([2, 197, 768])
for f in intermediates:
    print(f.shape)
# torch.Size([2, 768, 14, 14])
# torch.Size([2, 768, 14, 14])
# torch.Size([2, 768, 14, 14])
# torch.Size([2, 768, 14, 14])
# torch.Size([2, 768, 14, 14])
# torch.Size([2, 768, 14, 14])
# torch.Size([2, 768, 14, 14])
# torch.Size([2, 768, 14, 14])
# torch.Size([2, 768, 14, 14])
# torch.Size([2, 768, 14, 14])
# torch.Size([2, 768, 14, 14])
# torch.Size([2, 768, 14, 14])
print(output.shape)
# torch.Size([2, 1000])

model = timm.create_model('eva02_base_patch16_clip_224', pretrained=True, img_size=512, features_only=True, out_indices=(-3, -2,))
output = model(torch.randn(2, 3, 512, 512))
for o in output:
    print(o.shape)
# torch.Size([2, 768, 32, 32])
# torch.Size([2, 768, 32, 32])
- TinyCLIP vision tower weights added, thx Thien Tran
Feb 19, 2024
- Next-ViT models added. Adapted from https://github.com/bytedance/Next-ViT
- HGNet and PP-HGNetV2 models added. Adapted from https://github.com/PaddlePaddle/PaddleClas by SeeFun
- Removed setup.py, moved to pyproject.toml based build supported by PDM
- Add updated model EMA impl using _for_each for less overhead
- Support device args in train script for non GPU devices
- Other misc fixes and small additions
- Min supported Python version increased to 3.8
- Release 0.9.16
Jan 8, 2024
Datasets & transform refactoring
- HuggingFace streaming (iterable) dataset support (`--dataset hfids:org/dataset`)
- Webdataset wrapper tweaks for improved split info fetching, can auto fetch splits from supported HF hub webdataset
- Tested HF `datasets` and webdataset wrapper streaming from HF hub with recent `timm` ImageNet uploads to https://huggingface.co/timm
- Make input & target column/field keys consistent across datasets and pass via args
- Full monochrome support when using e.g. `--input-size 1 224 224` or `--in-chans 1`, sets PIL image conversion appropriately in dataset
- Improved several alternate crop & resize transforms (ResizeKeepRatio, RandomCropOrPad, etc) for use in PixParse document AI project
- Add SimCLR style color jitter prob along with grayscale and gaussian blur options to augmentations and args
- Allow train without validation set (`--val-split ''`) in train script
- Add `--bce-sum` (sum over class dim) and `--bce-pos-weight` (positive weighting) args for training as they're common BCE loss tweaks I was often hard coding
Nov 23, 2023
- Added EfficientViT-Large models, thanks SeeFun
- Fix Python 3.7 compat, will be dropping support for it soon
- Other misc fixes
- Release 0.9.12
Nov 20, 2023
- Added significant flexibility for Hugging Face Hub based timm models via `model_args` config entry. `model_args` will be passed as kwargs through to models on creation.
- Updated imagenet eval and test set csv files with latest models
- `vision_transformer.py` typing and doc cleanup by Laureηt
- 0.9.11 release
Nov 3, 2023
- DFN (Data Filtering Networks) and MetaCLIP ViT weights added
- DINOv2 'register' ViT model weights added (https://huggingface.co/papers/2309.16588, https://huggingface.co/papers/2304.07193)
- Add `quickgelu` ViT variants for OpenAI, DFN, MetaCLIP weights that use it (less efficient)
- Improved typing added to ResNet, MobileNet-v3 thanks to Aryan
- ImageNet-12k fine-tuned (from LAION-2B CLIP) `convnext_xxlarge`
- 0.9.9 release
Oct 20, 2023
- SigLIP image tower weights supported in `vision_transformer.py`.
- Great potential for fine-tune and downstream feature use.
- Experimental 'register' support in vit models as per Vision Transformers Need Registers
- Updated RepViT with new weight release. Thanks wangao
- Add patch resizing support (on pretrained weight load) to Swin models
- 0.9.8 release pending
Sep 1, 2023
- TinyViT added by SeeFun
- Fix EfficientViT (MIT) to use torch.autocast so it works back to PT 1.10
- 0.9.7 release
Introduction
PyTorch Image Models (`timm`) is a collection of image models, layers, utilities, optimizers, schedulers, data-loaders / augmentations, and reference training / validation scripts that aim to pull together a wide variety of SOTA models with ability to reproduce ImageNet training results.
The work of many others is present here. I've tried to make sure all source material is acknowledged via links to github, arxiv papers, etc in the README, documentation, and code docstrings. Please let me know if I missed anything.
Features
Models
All model architecture families include variants with pretrained weights. Some specific model variants have no weights; this is NOT a bug. Help training new or better weights is always appreciated.
- Aggregating Nested Transformers - https://arxiv.org/abs/2105.12723
- BEiT - https://arxiv.org/abs/2106.08254
- Big Transfer ResNetV2 (BiT) - https://arxiv.org/abs/1912.11370
- Bottleneck Transformers - https://arxiv.org/abs/2101.11605
- CaiT (Class-Attention in Image Transformers) - https://arxiv.org/abs/2103.17239
- CoaT (Co-Scale Conv-Attentional Image Transformers) - https://arxiv.org/abs/2104.06399
- CoAtNet (Convolution and Attention) - https://arxiv.org/abs/2106.04803
- ConvNeXt - https://arxiv.org/abs/2201.03545
- ConvNeXt-V2 - http://arxiv.org/abs/2301.00808
- ConViT (Soft Convolutional Inductive Biases Vision Transformers) - https://arxiv.org/abs/2103.10697
- CspNet (Cross-Stage Partial Networks) - https://arxiv.org/abs/1911.11929
- DeiT - https://arxiv.org/abs/2012.12877
- DeiT-III - https://arxiv.org/pdf/2204.07118.pdf
- DenseNet - https://arxiv.org/abs/1608.06993
- DLA - https://arxiv.org/abs/1707.06484
- DPN (Dual-Path Network) - https://arxiv.org/abs/1707.01629
- EdgeNeXt - https://arxiv.org/abs/2206.10589
- EfficientFormer - https://arxiv.org/abs/2206.01191
- EfficientNet (MBConvNet Family)
- EfficientNet NoisyStudent (B0-B7, L2) - https://arxiv.org/abs/1911.04252
- EfficientNet AdvProp (B0-B8) - https://arxiv.org/abs/1911.09665
- EfficientNet (B0-B7) - https://arxiv.org/abs/1905.11946
- EfficientNet-EdgeTPU (S, M, L) - https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html
- EfficientNet V2 - https://arxiv.org/abs/2104.00298
- FBNet-C - https://arxiv.org/abs/1812.03443
- MixNet - https://arxiv.org/abs/1907.09595
- MNASNet B1, A1 (Squeeze-Excite), and Small - https://arxiv.org/abs/1807.11626
- MobileNet-V2 - https://arxiv.org/abs/1801.04381
- Single-Path NAS - https://arxiv.org/abs/1904.02877
- TinyNet - https://arxiv.org/abs/2010.14819
- EfficientViT (MIT) - https://arxiv.org/abs/2205.14756
- EfficientViT (MSRA) - https://arxiv.org/abs/2305.07027
- EVA - https://arxiv.org/abs/2211.07636
- EVA-02 - https://arxiv.org/abs/2303.11331
- FastViT - https://arxiv.org/abs/2303.14189
- FlexiViT - https://arxiv.org/abs/2212.08013
- FocalNet (Focal Modulation Networks) - https://arxiv.org/abs/2203.11926
- GCViT (Global Context Vision Transformer) - https://arxiv.org/abs/2206.09959
- GhostNet - https://arxiv.org/abs/1911.11907
- GhostNet-V2 - https://arxiv.org/abs/2211.12905
- gMLP - https://arxiv.org/abs/2105.08050
- GPU-Efficient Networks - https://arxiv.org/abs/2006.14090
- Halo Nets - https://arxiv.org/abs/2103.12731
- HGNet / HGNet-V2 - TBD
- HRNet - https://arxiv.org/abs/1908.07919
- InceptionNeXt - https://arxiv.org/abs/2303.16900
- Inception-V3 - https://arxiv.org/abs/1512.00567
- Inception-ResNet-V2 and Inception-V4 - https://arxiv.org/abs/1602.07261
- Lambda Networks - https://arxiv.org/abs/2102.08602
- LeViT (Vision Transformer in ConvNet's Clothing) - https://arxiv.org/abs/2104.01136
- MaxViT (Multi-Axis Vision Transformer) - https://arxiv.org/abs/2204.01697
- MetaFormer (PoolFormer-v2, ConvFormer, CAFormer) - https://arxiv.org/abs/2210.13452
- MLP-Mixer - https://arxiv.org/abs/2105.01601
- MobileCLIP - https://arxiv.org/abs/2311.17049
- MobileNet-V3 (MBConvNet w/ Efficient Head) - https://arxiv.org/abs/1905.02244
- FBNet-V3 - https://arxiv.org/abs/2006.02049
- HardCoRe-NAS - https://arxiv.org/abs/2102.11646
- LCNet - https://arxiv.org/abs/2109.15099
- MobileNetV4 - https://arxiv.org/abs/2404.10518
- MobileOne - https://arxiv.org/abs/2206.04040
- MobileViT - https://arxiv.org/abs/2110.02178
- MobileViT-V2 - https://arxiv.org/abs/2206.02680
- MViT-V2 (Improved Multiscale Vision Transformer) - https://arxiv.org/abs/2112.01526
- NASNet-A - https://arxiv.org/abs/1707.07012
- NesT - https://arxiv.org/abs/2105.12723
- Next-ViT - https://arxiv.org/abs/2207.05501
- NFNet-F - https://arxiv.org/abs/2102.06171
- NF-RegNet / NF-ResNet - https://arxiv.org/abs/2101.08692
- PNasNet - https://arxiv.org/abs/1712.00559
- PoolFormer (MetaFormer) - https://arxiv.org/abs/2111.11418
- Pooling-based Vision Transformer (PiT) - https://arxiv.org/abs/2103.16302
- PVT-V2 (Improved Pyramid Vision Transformer) - https://arxiv.org/abs/2106.13797
- RDNet (DenseNets Reloaded) - https://arxiv.org/abs/2403.19588
- RegNet - https://arxiv.org/abs/2003.13678
- RegNetZ - https://arxiv.org/abs/2103.06877
- RepVGG - https://arxiv.org/abs/2101.03697
- RepGhostNet - https://arxiv.org/abs/2211.06088
- RepViT - https://arxiv.org/abs/2307.09283
- ResMLP - https://arxiv.org/abs/2105.03404
- ResNet/ResNeXt
- ResNet (v1b/v1.5) - https://arxiv.org/abs/1512.03385
- ResNeXt - https://arxiv.org/abs/1611.05431
- 'Bag of Tricks' / Gluon C, D, E, S variations - https://arxiv.org/abs/1812.01187
- Weakly-supervised (WSL) Instagram pretrained / ImageNet tuned ResNeXt101 - https://arxiv.org/abs/1805.00932
- Semi-supervised (SSL) / Semi-weakly Supervised (SWSL) ResNet/ResNeXts - https://arxiv.org/abs/1905.00546
- ECA-Net (ECAResNet) - https://arxiv.org/abs/1910.03151v4
- Squeeze-and-Excitation Networks (SEResNet) - https://arxiv.org/abs/1709.01507
- ResNet-RS - https://arxiv.org/abs/2103.07579
- Res2Net - https://arxiv.org/abs/1904.01169
- ResNeSt - https://arxiv.org/abs/2004.08955
- ReXNet - https://arxiv.org/abs/2007.00992
- SelecSLS - https://arxiv.org/abs/1907.00837
- Selective Kernel Networks - https://arxiv.org/abs/1903.06586
- Sequencer2D - https://arxiv.org/abs/2205.01972
- Swin S3 (AutoFormerV2) - https://arxiv.org/abs/2111.14725
- Swin Transformer - https://arxiv.org/abs/2103.14030
- Swin Transformer V2 - https://arxiv.org/abs/2111.09883
- Transformer-iN-Transformer (TNT) - https://arxiv.org/abs/2103.00112
- TResNet - https://arxiv.org/abs/2003.13630
- Twins (Spatial Attention in Vision Transformers) - https://arxiv.org/pdf/2104.13840.pdf
- Visformer - https://arxiv.org/abs/2104.12533
- Vision Transformer - https://arxiv.org/abs/2010.11929
- ViTamin - https://arxiv.org/abs/2404.02132
- VOLO (Vision Outlooker) - https://arxiv.org/abs/2106.13112
- VovNet V2 and V1 - https://arxiv.org/abs/1911.06667
- Xception - https://arxiv.org/abs/1610.02357
- Xception (Modified Aligned, Gluon) - https://arxiv.org/abs/1802.02611
- Xception (Modified Aligned, TF) - https://arxiv.org/abs/1802.02611
- XCiT (Cross-Covariance Image Transformers) - https://arxiv.org/abs/2106.09681
Optimizers
Included optimizers available via `create_optimizer` / `create_optimizer_v2` factory methods:
- `adabelief` an implementation of AdaBelief adapted from https://github.com/juntang-zhuang/Adabelief-Optimizer - https://arxiv.org/abs/2010.07468
- `adafactor` adapted from FAIRSeq impl - https://arxiv.org/abs/1804.04235
- `adahessian` by David Samuel - https://arxiv.org/abs/2006.00719
- `adamp` and `sgdp` by Naver ClovAI - https://arxiv.org/abs/2006.08217
- `adan` an implementation of Adan adapted from https://github.com/sail-sg/Adan - https://arxiv.org/abs/2208.06677
- `lamb` an implementation of Lamb and LambC (w/ trust-clipping) cleaned up and modified to support use with XLA - https://arxiv.org/abs/1904.00962
- `lars` an implementation of LARS and LARC (w/ trust-clipping) - https://arxiv.org/abs/1708.03888
- `lion` an implementation of Lion adapted from https://github.com/google/automl/tree/master/lion - https://arxiv.org/abs/2302.06675
- `lookahead` adapted from impl by Liam - https://arxiv.org/abs/1907.08610
- `madgrad` an implementation of MADGRAD adapted from https://github.com/facebookresearch/madgrad - https://arxiv.org/abs/2101.11075
- `nadam` an implementation of Adam w/ Nesterov momentum
- `nadamw` an implementation of AdamW (Adam w/ decoupled weight-decay) w/ Nesterov momentum. A simplified impl based on https://github.com/mlcommons/algorithmic-efficiency
- `novograd` by Masashi Kimura - https://arxiv.org/abs/1905.11286
- `radam` by Liyuan Liu - https://arxiv.org/abs/1908.03265
- `rmsprop_tf` adapted from PyTorch RMSProp by myself. Reproduces much improved Tensorflow RMSProp behaviour
- `sgdw` an implementation of SGD w/ decoupled weight-decay
- `fused<name>` optimizers by name with NVIDIA Apex installed
- `bits<name>` optimizers by name with BitsAndBytes installed
Augmentations
- Random Erasing from Zhun Zhong - https://arxiv.org/abs/1708.04896
- Mixup - https://arxiv.org/abs/1710.09412
- CutMix - https://arxiv.org/abs/1905.04899
- AutoAugment (https://arxiv.org/abs/1805.09501) and RandAugment (https://arxiv.org/abs/1909.13719) ImageNet configurations modeled after impl for EfficientNet training (https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py)
- AugMix w/ JSD loss, JSD w/ clean + augmented mixing support works with AutoAugment and RandAugment as well - https://arxiv.org/abs/1912.02781
- SplitBatchNorm - allows splitting batch norm layers between clean and augmented (auxiliary batch norm) data
Regularization
- DropPath aka "Stochastic Depth" - https://arxiv.org/abs/1603.09382
- DropBlock - https://arxiv.org/abs/1810.12890
- Blur Pooling - https://arxiv.org/abs/1904.11486
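To make the DropPath idea concrete, here is a minimal pure-PyTorch sketch of stochastic depth applied per sample. It is an illustration of the technique under simple assumptions, not the code shipped in this repository, and the function name and arguments are chosen for the example.

```python
import torch

def drop_path(x: torch.Tensor, drop_prob: float = 0.0,
              training: bool = False) -> torch.Tensor:
    """Randomly zero whole samples of a residual branch (sketch only)."""
    if drop_prob == 0.0 or not training:
        return x  # identity at inference time or when disabled
    keep_prob = 1.0 - drop_prob
    # One Bernoulli draw per sample, broadcast over all remaining dims.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = x.new_empty(shape).bernoulli_(keep_prob)
    # Rescale kept samples so the expected value is unchanged.
    return x * mask / keep_prob

# Typical use inside a residual block: out = shortcut + drop_path(branch, p, training)
branch = torch.ones(16, 4, 2)
out = drop_path(branch, drop_prob=0.5, training=True)
```

Each sample in out is either all zeros (branch dropped) or the branch output scaled by 1 / keep_prob.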
Other
Several (less common) features that I often utilize in my projects are included. Many of their additions are the reason why I maintain my own set of models, instead of using others' via PIP:
- All models have a common default configuration interface and API for
  - accessing/changing the classifier - get_classifier and reset_classifier
  - doing a forward pass on just the features - forward_features (see documentation)
  - these make it easy to write consistent network wrappers that work with any of the models
- All models support multi-scale feature map extraction (feature pyramids) via create_model (see documentation)
  - create_model(name, features_only=True, out_indices=..., output_stride=...)
  - out_indices creation arg specifies which feature maps to return, these indices are 0 based and generally correspond to the C(i + 1) feature level.
  - output_stride creation arg controls output stride of the network by using dilated convolutions. Most networks are stride 32 by default. Not all networks support this.
  - feature map channel counts, reduction level (stride) can be queried AFTER model creation via the .feature_info member
- All models have a consistent pretrained weight loader that adapts last linear if necessary, and from 3 to 1 channel input if desired
- High performance reference training, validation, and inference scripts that work in several process/GPU modes:
  - NVIDIA DDP w/ a single GPU per process, multiple processes with APEX present (AMP mixed-precision optional)
  - PyTorch DistributedDataParallel w/ multi-gpu, single process (AMP disabled as it crashes when enabled)
  - PyTorch w/ single GPU single process (AMP optional)
- A dynamic global pool implementation that allows selecting from average pooling, max pooling, average + max, or concat([average, max]) at model creation. All global pooling is adaptive average by default and compatible with pretrained weights.
- A 'Test Time Pool' wrapper that can wrap any of the included models and usually provides improved performance doing inference with input images larger than the training size. Idea adapted from original DPN implementation when I ported (https://github.com/cypw/DPNs)
- Learning rate schedulers
  - Ideas adopted from
    - AllenNLP schedulers
    - FAIRseq lr_scheduler
    - SGDR: Stochastic Gradient Descent with Warm Restarts (https://arxiv.org/abs/1608.03983)
  - Schedulers include step, cosine w/ restarts, tanh w/ restarts, plateau
- Space-to-Depth by mrT23 (https://arxiv.org/abs/1801.04590) -- original paper?
- Adaptive Gradient Clipping (https://arxiv.org/abs/2102.06171, https://github.com/deepmind/deepmind-research/tree/master/nfnets)
- An extensive selection of channel and/or spatial attention modules:
  - Bottleneck Transformer - https://arxiv.org/abs/2101.11605
  - CBAM - https://arxiv.org/abs/1807.06521
  - Effective Squeeze-Excitation (ESE) - https://arxiv.org/abs/1911.06667
  - Efficient Channel Attention (ECA) - https://arxiv.org/abs/1910.03151
  - Gather-Excite (GE) - https://arxiv.org/abs/1810.12348
  - Global Context (GC) - https://arxiv.org/abs/1904.11492
  - Halo - https://arxiv.org/abs/2103.12731
  - Involution - https://arxiv.org/abs/2103.06255
  - Lambda Layer - https://arxiv.org/abs/2102.08602
  - Non-Local (NL) - https://arxiv.org/abs/1711.07971
  - Squeeze-and-Excitation (SE) - https://arxiv.org/abs/1709.01507
  - Selective Kernel (SK) - https://arxiv.org/abs/1903.06586
  - Split (SPLAT) - https://arxiv.org/abs/2004.08955
  - Shifted Window (SWIN) - https://arxiv.org/abs/2103.14030
Results
Model validation results can be found in the results tables
Getting Started (Documentation)
The official documentation can be found at https://huggingface.co/docs/hub/timm. Documentation contributions are welcome.
Getting Started with PyTorch Image Models (timm): A Practitioner's Guide by Chris Hughes is an extensive blog post covering many aspects of timm in detail.
timmdocs is an alternate set of documentation for timm. A big thanks to Aman Arora for his efforts creating timmdocs.
paperswithcode is a good resource for browsing the models within timm.
Train, Validation, Inference Scripts
The root folder of the repository contains reference train, validation, and inference scripts that work with the included models and other features of this repository. They are adaptable for other datasets and use cases with a little hacking. See documentation.
Awesome PyTorch Resources
One of the greatest assets of PyTorch is the community and their contributions. A few of my favourite resources that pair well with the models and components here are listed below.
Object Detection, Instance and Semantic Segmentation
- Detectron2 - https://github.com/facebookresearch/detectron2
- Segmentation Models (Semantic) - https://github.com/qubvel/segmentation_models.pytorch
- EfficientDet (Obj Det, Semantic soon) - https://github.com/rwightman/efficientdet-pytorch
Computer Vision / Image Augmentation
- Albumentations - https://github.com/albumentations-team/albumentations
- Kornia - https://github.com/kornia/kornia
Knowledge Distillation
- RepDistiller - https://github.com/HobbitLong/RepDistiller
- torchdistill - https://github.com/yoshitomo-matsubara/torchdistill
Metric Learning
- PyTorch Metric Learning - https://github.com/KevinMusgrave/pytorch-metric-learning
Training / Frameworks
- fastai - https://github.com/fastai/fastai
Licenses
Code
The code here is licensed Apache 2.0. I've taken care to make sure any third party code included or adapted has compatible (permissive) licenses such as MIT, BSD, etc. I've made an effort to avoid any GPL / LGPL conflicts. That said, it is your responsibility to ensure you comply with licenses here and conditions of any dependent licenses. Where applicable, I've linked the sources/references for various components in docstrings. If you think I've missed anything please create an issue.
Pretrained Weights
So far all of the pretrained weights available here are pretrained on ImageNet with a select few that have some additional pretraining (see extra note below). ImageNet was released for non-commercial research purposes only (https://image-net.org/download). It's not clear what the implications of that are for the use of pretrained weights from that dataset. Any models I have trained with ImageNet are done for research purposes and one should assume that the original dataset license applies to the weights. It's best to seek legal advice if you intend to use the pretrained weights in a commercial product.
Pretrained on more than ImageNet
Several weights included or referenced here were pretrained with proprietary datasets that I do not have access to. These include the Facebook WSL, SSL, SWSL ResNe(Xt) and the Google Noisy Student EfficientNet models. The Facebook models have an explicit non-commercial license (CC-BY-NC 4.0, https://github.com/facebookresearch/semi-supervised-ImageNet1K-models, https://github.com/facebookresearch/WSL-Images). The Google models do not appear to have any restriction beyond the Apache 2.0 license (and ImageNet concerns). In either case, you should contact Facebook or Google with any questions.
Citing
BibTeX
@misc{rw2019timm,
author = {Ross Wightman},
title = {PyTorch Image Models},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
doi = {10.5281/zenodo.4414861},
howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}