taming-transformers

Taming Transformers for High-Resolution Image Synthesis


Top Related Projects

DALL-E

10,876

PyTorch package for the discrete VAE used for DALL·E.

DALLE-pytorch

5,617

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

dalle-mini

14,810

DALL·E Mini - Generate images from a text prompt

vit-pytorch

23,231

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

transformers

146,142

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Quick Overview

Taming Transformers combines the efficiency of convolutional approaches with the expressivity of transformers for image generation. It introduces a convolutional VQGAN (Vector Quantized Generative Adversarial Network) that learns a codebook of context-rich visual parts; an autoregressive transformer then models how those parts compose, which enables high-resolution image synthesis.
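To make the codebook idea more concrete, here is a minimal sketch of nearest-neighbour vector quantization in plain Python. The toy codebook, the 2-D feature vectors, and the `quantize` helper are assumptions chosen for illustration; the actual VQGAN learns its codebook during training and quantizes convolutional feature maps.

```python
# Illustrative sketch only: nearest-neighbour lookup in a toy codebook.
# `quantize` and the codebook values are hypothetical, not part of the taming API.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vectors, codebook):
    """Replace each vector by its nearest codebook entry.

    Returns (indices, quantized): the chosen code index per vector and the
    corresponding codebook entries.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]
    quantized = [codebook[i] for i in indices]
    return indices, quantized

# Four codes in a 2-D latent space and three encoder outputs to quantize.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.7)]
indices, quantized = quantize(features, codebook)
```

The sequence of indices is what a transformer would model autoregressively; the decoder only ever sees the quantized vectors.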

Pros

Reports state-of-the-art FID scores among autoregressive approaches for class-conditional ImageNet synthesis
Combines the strengths of transformers and convolutional approaches
Supports various applications, including image synthesis, inpainting, and semantic image manipulation
Well-documented and provides pre-trained models for easy use

Cons

Requires significant computational resources for training and inference
May be complex to understand and implement for beginners
Limited to specific image-related tasks and may not be suitable for other domains
Dependency on specific versions of libraries may cause compatibility issues

Code Examples

Loading a pre-trained model:

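import torch
from omegaconf import OmegaConf
from taming.models.vqgan import VQModel

# Build the model from the parameters stored in the YAML config
config = OmegaConf.load("path/to/model.yaml")
model = VQModel(**config.model.params)

# Load the weights on the CPU first, then move the model to the GPU in eval mode
state_dict = torch.load("path/to/last.ckpt", map_location="cpu")["state_dict"]
model.load_state_dict(state_dict)
model.eval().cuda()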

Encoding an image:

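import torch
from PIL import Image
from torchvision import transforms

# Force three channels so the per-channel normalization below always applies
image = Image.open("path/to/image.jpg").convert("RGB")
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    # Map pixel values from [0, 1] to [-1, 1]
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
])
x = preprocess(image).unsqueeze(0).cuda()
# encode returns the quantized latents, the codebook loss, and an info tuple
# whose last entry holds the codebook indices
with torch.no_grad():
    z_q, _, (_, _, indices) = model.encode(x)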

Generating an image from latent representation:

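import torch

# Decode the quantized latents back to image space
x_rec = model.decode(z_q)
x_rec = torch.clamp(x_rec, -1., 1.)
# Map from [-1, 1] to [0, 1]
x_rec = (x_rec + 1.) / 2.
# detach() is required before numpy() because the decoder output tracks gradients
x_rec = x_rec.detach().squeeze(0).permute(1, 2, 0).cpu().numpy()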

Getting Started

Clone the repository:

git clone https://github.com/CompVis/taming-transformers.git
cd taming-transformers

Install dependencies (the README below sets them up through a conda environment):
```
conda env create -f environment.yaml
conda activate taming
```

Download pre-trained models:

mkdir -p models/vqgan_imagenet_f16_16384
wget -O models/vqgan_imagenet_f16_16384/last.ckpt https://heibox.uni-heidelberg.de/f/867b05fc8c4841768640/?dl=1
wget -O models/vqgan_imagenet_f16_16384/config.yaml https://heibox.uni-heidelberg.de/f/274fb24ed38341bfa753/?dl=1

Run the example script:

python scripts/sample_conditional.py -r models/vqgan_imagenet_f16_16384/last.ckpt -c models/vqgan_imagenet_f16_16384/config.yaml -s 10

Competitor Comparisons

DALL-E

10,876

PyTorch package for the discrete VAE used for DALL·E.

Pros of DALL-E

More advanced and capable of generating higher-quality images from text descriptions
Offers a wider range of image generation capabilities, including complex scenes and abstract concepts
Backed by OpenAI's extensive research and resources

Cons of DALL-E

Less accessible to the general public, with limited availability and usage restrictions
Requires more computational resources and training data

Code Comparison

While both repositories focus on image generation, their codebases differ significantly. Here's a brief comparison of their model initialization:

DALL-E:

model = DALLE(
    dim = 1024,
    vae = vae,
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 64,
    heads = 16
)

Taming Transformers:

model = Net2NetTransformer(
    transformer_config,
    first_stage_config,
    cond_stage_config,
    ckpt_path=None,
    ignore_keys=[],
    first_stage_key="image",
    cond_stage_key="depth"
)

Key Differences

DALL-E focuses on text-to-image generation, while Taming Transformers offers a broader range of image synthesis tasks
Taming Transformers is more accessible and open-source, allowing for easier experimentation and modification
DALL-E generally produces higher-quality results but requires more resources and has limited availability

DALLE-pytorch

5,617

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

Pros of DALLE-pytorch

More lightweight and easier to understand implementation
Focuses specifically on DALL-E architecture, making it simpler for those interested in that model
Active development with frequent updates and community contributions

Cons of DALLE-pytorch

Less comprehensive in terms of transformer architectures covered
May have fewer optimizations and advanced features compared to taming-transformers
Documentation could be more extensive for some components

Code Comparison

taming-transformers:

class VQModel(pl.LightningModule):
    def __init__(self,
                 ddconfig,
                 lossconfig,
                 n_embed,
                 embed_dim,
                 ckpt_path=None,
                 ignore_keys=[],
                 image_key="image",
                 colorize_nlabels=None,
                 monitor=None,
                 ):
        super().__init__()
        self.image_key = image_key

DALLE-pytorch:

class DALLE(nn.Module):
    def __init__(
        self,
        *,
        dim,
        vae,
        num_text_tokens = 10000,
        text_seq_len = 256,
        depth,
        heads = 8,
        dim_head = 64,
        reversible = False,
        attn_dropout = 0.,
        ff_dropout = 0,
        sparse_attn = False,
        attn_types = None,
        loss_img_weight = 7,
        stable = False
    ):
        super().__init__()
        assert isinstance(vae, (DiscreteVAE, OpenAIDiscreteVAE)), 'vae must be an instance of DiscreteVAE'

dalle-mini

14,810

DALL·E Mini - Generate images from a text prompt

Pros of DALL-E Mini

More user-friendly and accessible for non-experts
Faster inference time for generating images
Active community and frequent updates

Cons of DALL-E Mini

Lower image quality and resolution compared to Taming Transformers
Limited flexibility in terms of model architecture modifications
Less comprehensive documentation for advanced users

Code Comparison

DALL-E Mini:

import jax
import jax.numpy as jnp
from dalle_mini import DalleBart, DalleBartProcessor

model = DalleBart.from_pretrained("dalle-mini/dalle-mini")
processor = DalleBartProcessor.from_pretrained("dalle-mini/dalle-mini")

Taming Transformers:

import torch
from omegaconf import OmegaConf
from taming.models import cond_transformer

# The transformer config also describes the first-stage VQGAN that it builds internally
config = OmegaConf.load("path/to/project.yaml")
model = cond_transformer.Net2NetTransformer(**config.model.params)

Both repositories focus on image generation using transformer-based models. DALL-E Mini is more accessible and faster for general users, while Taming Transformers offers higher quality results and more flexibility for researchers and advanced practitioners. The code comparison shows that DALL-E Mini uses JAX and has a simpler setup, whereas Taming Transformers uses PyTorch and requires more configuration.

vit-pytorch

23,231

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

Pros of vit-pytorch

Lightweight and focused implementation of Vision Transformers
Easier to understand and modify for specific use cases
More recent updates and active maintenance

Cons of vit-pytorch

Less comprehensive feature set compared to taming-transformers
Primarily focused on Vision Transformers, while taming-transformers covers a broader range of models
May require additional implementations for advanced use cases

Code Comparison

vit-pytorch:

v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

taming-transformers:

model = VQModel(
    ddconfig,
    lossconfig,
    n_embed,
    embed_dim,
    ckpt_path=None,
    ignore_keys=[],
    image_key="image",
    colorize_nlabels=None,
    monitor=None,
    remap=None,
    sane_index_shape=False
)

The code snippets demonstrate the different focus areas of the two repositories. vit-pytorch provides a straightforward implementation of Vision Transformers, while taming-transformers offers a more complex model with additional parameters and features.

transformers

146,142

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Pros of transformers

Broader scope, covering text, vision, audio, and multimodal models
Extensive documentation and community support
Regular updates and maintenance

Cons of transformers

Larger codebase, potentially more complex for specific use cases
May include unnecessary components for users focused on image generation

Code comparison

taming-transformers:

model = VQModel(
    ddconfig=model_config.model.params.ddconfig,
    lossconfig=model_config.model.params.lossconfig,
    n_embed=model_config.model.params.n_embed,
    embed_dim=model_config.model.params.embed_dim,
    ckpt_path=None,
    ignore_keys=[],
    image_key="image",
    colorize_nlabels=None
)

transformers:

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

taming-transformers focuses specifically on image generation using a VQGAN and an autoregressive transformer, while transformers is a general-purpose library for text, vision, audio, and multimodal models. taming-transformers may be more suitable for specialized image generation projects, whereas transformers offers greater flexibility and broader applicability across tasks.


README

Taming Transformers for High-Resolution Image Synthesis

CVPR 2021 (Oral)

Patrick Esser*, Robin Rombach*, Björn Ommer
* equal contribution

tl;dr We combine the efficiency of convolutional approaches with the expressivity of transformers by introducing a convolutional VQGAN, which learns a codebook of context-rich visual parts, whose composition is modeled with an autoregressive transformer.

arXiv | BibTeX | Project Page

News

2022

More pretrained VQGANs (e.g. a f8-model with only 256 codebook entries) are available in our new work on Latent Diffusion Models.
Added scene synthesis models as proposed in the paper High-Resolution Complex Scene Synthesis with Transformers, see this section.

2021

Thanks to rom1504 it is now easy to train a VQGAN on your own datasets.
Included a bugfix for the quantizer. For backward compatibility it is disabled by default (which corresponds to always training with beta=1.0). Use legacy=False in the quantizer config to enable it. Thanks richcmwang and wcshin-git!
Our paper received an update: See https://arxiv.org/abs/2012.09841v3 and the corresponding changelog.
Added a pretrained, 1.4B transformer model trained for class-conditional ImageNet synthesis, which obtains state-of-the-art FID scores among autoregressive approaches and outperforms BigGAN.
Added pretrained, unconditional models on FFHQ and CelebA-HQ.
Added accelerated sampling via caching of keys/values in the self-attention operation, used in scripts/sample_fast.py.
Added a checkpoint of a VQGAN trained with f8 compression and Gumbel-Quantization. See also our updated reconstruction notebook.
We added a colab notebook which compares two VQGANs and OpenAI's DALL-E. See also this section.
We now include an overview of pretrained models in Tab.1. We added models for COCO and ADE20k.
The streamlit demo now supports image completions.
We now include a couple of examples from the D-RIN dataset so you can run the D-RIN demo without preparing the dataset first.
You can now jump right into sampling with our Colab quickstart notebook.

Requirements

A suitable conda environment named taming can be created and activated with:

conda env create -f environment.yaml
conda activate taming

Overview of pretrained models

The following table provides an overview of all models that are currently available. FID scores were evaluated using torch-fidelity. For reference, we also include a link to the recently released autoencoder of the DALL-E model. See the corresponding colab notebook for a comparison and discussion of reconstruction capabilities.

Dataset	FID vs train	FID vs val	Link	Samples (256x256)	Comments
FFHQ (f=16)	9.6	--	ffhq_transformer	ffhq_samples
CelebA-HQ (f=16)	10.2	--	celebahq_transformer	celebahq_samples
ADE20K (f=16)	--	35.5	ade20k_transformer	ade20k_samples.zip [2k]	evaluated on val split (2k images)
COCO-Stuff (f=16)	--	20.4	coco_transformer	coco_samples.zip [5k]	evaluated on val split (5k images)
ImageNet (cIN) (f=16)	15.98/15.78/6.59/5.88/5.20	--	cin_transformer	cin_samples	different decoding hyperparameters

FacesHQ (f=16)	--	--	faceshq_transformer
S-FLCKR (f=16)	--	--	sflckr
D-RIN (f=16)	--	--	drin_transformer

VQGAN ImageNet (f=16), 1024	10.54	7.94	vqgan_imagenet_f16_1024	reconstructions	Reconstruction-FIDs.
VQGAN ImageNet (f=16), 16384	7.41	4.98	vqgan_imagenet_f16_16384	reconstructions	Reconstruction-FIDs.
VQGAN OpenImages (f=8), 256	--	1.49	https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip	---	Reconstruction-FIDs. Available via latent diffusion.
VQGAN OpenImages (f=8), 16384	--	1.14	https://ommer-lab.com/files/latent-diffusion/vq-f8.zip	---	Reconstruction-FIDs. Available via latent diffusion
VQGAN OpenImages (f=8), 8192, GumbelQuantization	3.24	1.49	vqgan_gumbel_f8	---	Reconstruction-FIDs.

DALL-E dVAE (f=8), 8192, GumbelQuantization	33.88	32.01	https://github.com/openai/DALL-E	reconstructions	Reconstruction-FIDs.

Running pretrained models

The commands below will start a streamlit demo which supports sampling at different resolutions and image completions. To run a non-interactive version of the sampling process, replace streamlit run scripts/sample_conditional.py -- by python scripts/make_samples.py --outdir <path_to_write_samples_to> and keep the remaining command line arguments.

To sample from unconditional or class-conditional models, run python scripts/sample_fast.py -r <path/to/config_and_checkpoint>. We describe below how to use this script to sample from the ImageNet, FFHQ, and CelebA-HQ models, respectively.

S-FLCKR

You can also run this model in a Colab notebook, which includes all necessary steps to start sampling.

Download the 2020-11-09T13-31-51_sflckr folder and place it into logs. Then, run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-09T13-31-51_sflckr/

ImageNet

Download the 2021-04-03T19-39-50_cin_transformer folder and place it into logs. Sampling from the class-conditional ImageNet model does not require any data preparation. To produce 50 samples for each of the 1000 classes of ImageNet, with k=600 for top-k sampling, p=0.92 for nucleus sampling and temperature t=1.0, run

python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/ -n 50 -k 600 -t 1.0 -p 0.92 --batch_size 25

To restrict the model to certain classes, provide them via the --classes argument, separated by commas. For example, to sample 50 ostriches, border collies and whiskey jugs, run

python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/ -n 50 -k 600 -t 1.0 -p 0.92 --batch_size 25 --classes 9,232,901

We recommend experimenting with the autoregressive decoding parameters (top-k, top-p and temperature) for best results.
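To build intuition for how these three parameters interact, the following is a minimal, dependency-free sketch of one common way to combine temperature scaling, top-k truncation, and nucleus (top-p) truncation before sampling a token. The helper names `filter_probs` and `sample` and the toy logits are assumptions for illustration; this is not the implementation used by scripts/sample_fast.py.

```python
import math
import random

def filter_probs(logits, top_k=None, top_p=1.0, temperature=1.0):
    """Return {token_id: prob} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits before the softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least probable and keep at most top_k of them.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]

    # Nucleus step: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample(logits, rng, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_probs(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]
dist = filter_probs(logits, top_k=3, top_p=0.92, temperature=1.0)
token = sample(logits, random.Random(0), top_k=3, top_p=0.92)
```

Lowering top-k or top-p shrinks the set of candidate tokens, which tends to trade diversity for fidelity; raising the temperature flattens the distribution.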

FFHQ/CelebA-HQ

Download the 2021-04-23T18-19-01_ffhq_transformer and 2021-04-23T18-11-19_celebahq_transformer folders and place them into logs. Again, sampling from these unconditional models does not require any data preparation. To produce 50000 samples, with k=250 for top-k sampling, p=1.0 for nucleus sampling and temperature t=1.0, run

python scripts/sample_fast.py -r logs/2021-04-23T18-19-01_ffhq_transformer/

for FFHQ and

python scripts/sample_fast.py -r logs/2021-04-23T18-11-19_celebahq_transformer/

to sample from the CelebA-HQ model. For both models it can be advantageous to vary the top-k/top-p parameters for sampling.

FacesHQ

Download 2020-11-13T21-41-45_faceshq_transformer and place it into logs. Follow the data preparation steps for CelebA-HQ and FFHQ. Run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-13T21-41-45_faceshq_transformer/

D-RIN

Download 2020-11-20T12-54-32_drin_transformer and place it into logs. To run the demo on a couple of example depth maps included in the repository, run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T12-54-32_drin_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.imagenet.DRINExamples}}}"

To run the demo on the complete validation set, first follow the data preparation steps for ImageNet and then run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T12-54-32_drin_transformer/

COCO

Download 2021-01-20T16-04-20_coco_transformer and place it into logs. To run the demo on a couple of example segmentation maps included in the repository, run

streamlit run scripts/sample_conditional.py -- -r logs/2021-01-20T16-04-20_coco_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.coco.Examples}}}"

ADE20k

Download 2020-11-20T21-45-44_ade20k_transformer and place it into logs. To run the demo on a couple of example segmentation maps included in the repository, run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T21-45-44_ade20k_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.ade20k.Examples}}}"

Scene Image Synthesis

Scene image generation based on bounding box conditionals, as done in our CVPR 2021 AI4CC workshop paper High-Resolution Complex Scene Synthesis with Transformers (see the talk on the workshop page). The COCO and Open Images datasets are supported.

Training

Download first-stage models COCO-8k-VQGAN for COCO or COCO/Open-Images-8k-VQGAN for Open Images. Change ckpt_path in configs/coco_scene_images_transformer.yaml and configs/open_images_scene_images_transformer.yaml to point to the downloaded first-stage models. Download the full COCO/OI datasets and adapt data_path in the same files, unless working with the 100 files provided for training and validation already suits your needs.

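Training can be started with either of the following commands (the trailing comma after the GPU index is part of the argument):

python main.py --base configs/coco_scene_images_transformer.yaml -t True --gpus 0,

python main.py --base configs/open_images_scene_images_transformer.yaml -t True --gpus 0,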

Sampling

Train a model as described above or download a pre-trained model:

Open Images 1 billion parameter model, trained for 100 epochs. On 256x256 pixels: FID 41.48±0.21, SceneFID 14.60±0.15, Inception Score 18.47±0.27. The model was trained with 2D crops of images and is thus well prepared for generating high-resolution images, e.g. 512x512.
Open Images distilled version of the above model with 125 million parameters, which allows sampling on smaller GPUs (4 GB is enough for sampling 256x256 px images). The model was trained for 60 epochs with 10% soft loss and 90% hard loss. On 256x256 pixels: FID 43.07±0.40, SceneFID 15.93±0.19, Inception Score 17.23±0.11.
COCO 30 epochs
COCO 60 epochs (find model statistics for both COCO versions in assets/coco_scene_images_training.svg)

When downloading a pre-trained model, remember to change ckpt_path in configs/*project.yaml to point to your downloaded first-stage model (see Training above).

Scene image generation can be run with python scripts/make_scene_samples.py --outdir=/some/outdir -r /path/to/pretrained/model --resolution=512,512

Training on custom data

Training on your own dataset can be beneficial to get better tokens and hence better images for your domain. Follow these steps:

install the repo with conda env create -f environment.yaml, conda activate taming and pip install -e .
put your .jpg files in a folder your_folder
create two text files, xx_train.txt and xx_test.txt, that point to the files in your training and test sets respectively (for example, find $(pwd)/your_folder -name "*.jpg" > xx_train.txt)
adapt configs/custom_vqgan.yaml to point to these 2 files
run python main.py --base configs/custom_vqgan.yaml -t True --gpus 0,1 to train on two GPUs. Use --gpus 0, (with a trailing comma) to train on a single GPU.
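The file-list step above can also be scripted. The following is a minimal sketch that collects the .jpg files, shuffles them, and writes the two list files with one absolute path per line. The helper name `write_file_lists`, the 10% test fraction, and the fixed seed are assumptions for illustration, not something the repository prescribes.

```python
import random
from pathlib import Path

def write_file_lists(image_dir, out_dir, test_fraction=0.1, seed=0):
    """Write xx_train.txt and xx_test.txt listing absolute paths of .jpg files.

    test_fraction and seed are illustrative defaults (assumptions).
    Returns the (train, test) path lists.
    """
    paths = sorted(str(p.resolve()) for p in Path(image_dir).rglob("*.jpg"))
    random.Random(seed).shuffle(paths)

    # Hold out at least one image for testing when any images exist.
    n_test = max(1, int(len(paths) * test_fraction)) if paths else 0
    test, train = paths[:n_test], paths[n_test:]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "xx_train.txt").write_text("\n".join(train) + "\n")
    (out / "xx_test.txt").write_text("\n".join(test) + "\n")
    return train, test

# Example call (paths are placeholders):
# write_file_lists("your_folder", ".")
```

The resulting files are the two paths that configs/custom_vqgan.yaml needs to point to.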

Data Preparation

ImageNet

The code will try to download (through Academic Torrents) and prepare ImageNet the first time it is used. However, since ImageNet is quite large, this requires a lot of disk space and time. If you already have ImageNet on your disk, you can speed things up by putting the data into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ (which defaults to ~/.cache/autoencoders/data/ILSVRC2012_{split}/data/), where {split} is one of train/validation. It should have the following structure:

${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│   ├── n01440764_10026.JPEG
│   ├── n01440764_10027.JPEG
│   ├── ...
├── n01443537
│   ├── n01443537_10007.JPEG
│   ├── n01443537_10014.JPEG
│   ├── ...
├── ...

If you haven't extracted the data, you can also place ILSVRC2012_img_train.tar/ILSVRC2012_img_val.tar (or symlinks to them) into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/ / ${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/, which will then be extracted into the structure above without being downloaded again. Note that this only happens if neither a folder ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ nor a file ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready exists. Remove them if you want to force running the dataset preparation again.
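The rules in the paragraph above can be summarized as a small decision function. This is a sketch of the described behaviour, not the repository's code: `split_root` and `preparation_state` are hypothetical helpers, and the cache directory is passed in explicitly rather than read from the environment.

```python
from pathlib import Path

# Tar file names as given in the text above.
TAR_NAMES = {
    "train": "ILSVRC2012_img_train.tar",
    "validation": "ILSVRC2012_img_val.tar",
}

def split_root(cache_dir, split):
    """Folder that holds one ImageNet split inside the cache directory."""
    return Path(cache_dir) / "autoencoders" / "data" / f"ILSVRC2012_{split}"

def preparation_state(cache_dir, split):
    """Describe what the dataset preparation is expected to do for a split.

    "skip":     data/ or .ready already exists, so nothing is prepared again
    "extract":  the tar file (or a symlink to it) is present and gets extracted
    "download": nothing is present, so the data would be downloaded first
    """
    root = split_root(cache_dir, split)
    if (root / "data").is_dir() or (root / ".ready").is_file():
        return "skip"
    if (root / TAR_NAMES[split]).exists():
        return "extract"
    return "download"

# Example call (assuming the default cache location mentioned above):
# preparation_state(Path.home() / ".cache", "train")
```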

You will then need to prepare the depth data using MiDaS. Create a symlink data/imagenet_depth pointing to a folder with two subfolders train and val, each mirroring the structure of the corresponding ImageNet folder described above and containing a png file for each of ImageNet's JPEG files. The png encodes float32 depth values obtained from MiDaS as RGBA images. We provide the script scripts/extract_depth.py to generate this data. Please note that this script uses MiDaS via PyTorch Hub. When we prepared the data, the hub provided the MiDaS v2.0 version, but now it provides a v2.1 version. We haven't tested our models with depth maps obtained via v2.1 and if you want to make sure that things work as expected, you must adjust the script to make sure it explicitly uses v2.0!
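To illustrate how a float32 depth value can be stored in a four-channel 8-bit image, here is a minimal sketch that reinterprets the four bytes of a float32 as one RGBA pixel and back. The little-endian byte order and the helper names are assumptions; scripts/extract_depth.py is the authoritative reference for the exact encoding used for these models.

```python
import struct

def float_to_rgba(value):
    """Pack one float32 into four 8-bit channel values (R, G, B, A).

    Little-endian byte order is an assumption made for this sketch.
    """
    return tuple(struct.pack("<f", value))

def rgba_to_float(rgba):
    """Inverse of float_to_rgba: rebuild the float32 from four channel bytes."""
    return struct.unpack("<f", bytes(rgba))[0]

pixel = float_to_rgba(0.5)
restored = rgba_to_float(pixel)
```

Because the bytes are reinterpreted rather than rescaled, the round trip is lossless for any float32 value, which is why a PNG (a lossless format) is needed rather than a JPEG.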

CelebA-HQ

Create a symlink data/celebahq pointing to a folder containing the .npy files of CelebA-HQ (instructions to obtain them can be found in the PGGAN repository).

FFHQ

Create a symlink data/ffhq pointing to the images1024x1024 folder obtained from the FFHQ repository.

S-FLCKR

Unfortunately, we are not allowed to distribute the images we collected for the S-FLCKR dataset and can therefore only describe how it was produced. There are many resources on collecting images from the web to get started. We collected sufficiently large images from flickr (see data/flickr_tags.txt for a full list of tags used to find images) and various subreddits (see data/subreddits.txt for all subreddits that were used). Overall, we collected 107625 images and split them randomly into 96861 training images and 10764 validation images. We then obtained segmentation masks for each image using DeepLab v2 trained on COCO-Stuff. We used a PyTorch reimplementation and include an example script for this process in scripts/extract_segmentation.py.

COCO

Create a symlink data/coco containing the images from the 2017 split in train2017 and val2017, and their annotations in annotations. Files can be obtained from the COCO webpage. In addition, we use the Stuff+thing PNG-style annotations on COCO 2017 trainval annotations from COCO-Stuff, which should be placed under data/cocostuffthings.

ADE20k

Create a symlink data/ade20k_root containing the contents of ADEChallengeData2016.zip from the MIT Scene Parsing Benchmark.

Training models

FacesHQ

Train a VQGAN with

python main.py --base configs/faceshq_vqgan.yaml -t True --gpus 0,

Then, adjust the checkpoint path of the config key model.params.first_stage_config.params.ckpt_path in configs/faceshq_transformer.yaml (or download 2020-11-09T13-33-36_faceshq_vqgan and place into logs, which corresponds to the preconfigured checkpoint path), then run

python main.py --base configs/faceshq_transformer.yaml -t True --gpus 0,
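The dotted config key above addresses a nested entry in the YAML file. As a minimal sketch of what that path means, the following sets such an entry in a plain nested dictionary. The helper `set_by_dotted_key` and the checkpoint paths are hypothetical; in practice you would simply edit the YAML file or use your config library of choice.

```python
def set_by_dotted_key(config, dotted_key, value):
    """Set config[a][b]...[leaf] = value for a key written as "a.b.leaf"."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Create intermediate levels if they do not exist yet.
        node = node.setdefault(key, {})
    node[leaf] = value
    return config

# Nested structure mirroring the key named in the text; values are placeholders.
config = {
    "model": {
        "params": {
            "first_stage_config": {"params": {"ckpt_path": "old/path/last.ckpt"}}
        }
    }
}
set_by_dotted_key(
    config,
    "model.params.first_stage_config.params.ckpt_path",
    "logs/my_vqgan/checkpoints/last.ckpt",  # hypothetical checkpoint location
)
```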

D-RIN

Train a VQGAN on ImageNet with

python main.py --base configs/imagenet_vqgan.yaml -t True --gpus 0,

or download a pretrained one from 2020-09-23T17-56-33_imagenet_vqgan and place under logs. If you trained your own, adjust the path in the config key model.params.first_stage_config.params.ckpt_path of configs/drin_transformer.yaml.

Train a VQGAN on Depth Maps of ImageNet with

python main.py --base configs/imagenetdepth_vqgan.yaml -t True --gpus 0,

or download a pretrained one from 2020-11-03T15-34-24_imagenetdepth_vqgan and place under logs. If you trained your own, adjust the path in the config key model.params.cond_stage_config.params.ckpt_path of configs/drin_transformer.yaml.

To train the transformer, run

python main.py --base configs/drin_transformer.yaml -t True --gpus 0,

More Resources

Comparing Different First Stage Models

The reconstruction and compression capabilities of different first-stage models can be analyzed in this colab notebook. In particular, the notebook compares two VQGANs with a downsampling factor of f=16 and codebook sizes of 1024 and 16384, a VQGAN with f=8 and 8192 codebook entries, and the discrete autoencoder of OpenAI's DALL-E (which has f=8 and 8192 codebook entries).
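The trade-off between downsampling factor and codebook size can be worked out with a little arithmetic. The sketch below computes, for each configuration named above, the size of the latent grid and the number of bits needed to store the code indices. The 256x256 input resolution and the helper name `latent_stats` are assumptions for illustration.

```python
import math

def latent_stats(image_size, f, codebook_size):
    """Return (grid side, number of tokens, total bits) for a square image."""
    side = image_size // f                    # spatial size of the latent grid
    tokens = side * side                      # one codebook index per position
    bits_per_token = math.log2(codebook_size)
    return side, tokens, tokens * bits_per_token

# (label, downsampling factor f, codebook entries), taken from the text above
configs = [
    ("VQGAN f=16, 1024 entries", 16, 1024),
    ("VQGAN f=16, 16384 entries", 16, 16384),
    ("VQGAN f=8, 8192 entries", 8, 8192),
    ("DALL-E dVAE f=8, 8192 entries", 8, 8192),
]
stats = {name: latent_stats(256, f, n) for name, f, n in configs}
for name, (side, tokens, bits) in stats.items():
    print(f"{name}: {side}x{side} grid, {tokens} tokens, {bits:.0f} bits")
```

Under these assumptions an f=16 model represents an image with 256 tokens, while an f=8 model needs 1024, which is four times longer as a transformer sequence.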

Other

A video summary by Two Minute Papers.
A video summary by Gradient Dude.
A weights and biases report summarizing the paper by ayulockin.
A video summary by What's AI.
Take a look at ak9250's notebook if you want to run the streamlit demos on Colab.

Text-to-Image Optimization via CLIP

VQGAN has been successfully used as an image generator guided by the CLIP model, both for pure image generation from scratch and image-to-image translation. We recommend the following notebooks/videos/resources:

Advadnouns Patreon and corresponding LatentVision notebooks: https://www.patreon.com/patronizeme
The notebook of Rivers Have Wings.
A video explanation by Dot CSV (in Spanish, but English subtitles are available)

Text prompt: 'A bird drawn by a child'

Shout-outs

Thanks to everyone who makes their code and models available. In particular,

The architecture of our VQGAN is inspired by Denoising Diffusion Probabilistic Models
The very hackable transformer implementation minGPT
The good ol' PatchGAN and Learned Perceptual Similarity (LPIPS)

BibTeX

@misc{esser2020taming,
      title={Taming Transformers for High-Resolution Image Synthesis}, 
      author={Patrick Esser and Robin Rombach and Björn Ommer},
      year={2020},
      eprint={2012.09841},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
