Top Related Projects
Image-to-Image Translation in PyTorch
Semantic Image Synthesis with SPADE
Image-to-image translation with conditional adversarial nets
Learning Chinese Character style with conditional GAN
Photographic Image Synthesis with Cascaded Refinement Networks
Quick Overview
NVIDIA/pix2pixHD is a high-resolution image-to-image translation framework. It extends the original pix2pix model to generate high-resolution, photorealistic images. The project focuses on synthesizing and manipulating images at 2048x1024 resolution.
Pros
- Produces high-resolution, photorealistic images
- Supports multi-scale generator and discriminator architectures
- Includes instance-level feature embedding for improved results
- Provides pre-trained models for various datasets
Cons
- Requires significant computational resources for training
- Limited to specific image-to-image translation tasks
- May struggle with complex scenes or highly diverse datasets
- Requires paired training data, which can be difficult to obtain for some applications
Code Examples
- Loading a pre-trained model:
from options.test_options import TestOptions
from models.models import create_model
import torch
opt = TestOptions().parse()
model = create_model(opt)
model.eval()
- Performing inference on an image:
from data.data_loader import CreateDataLoader
from util.visualizer import Visualizer
data_loader = CreateDataLoader(opt)
dataset = data_loader.load_data()
visualizer = Visualizer(opt)
for i, data in enumerate(dataset):
    generated = model.inference(data['label'], data['inst'])
    visualizer.save_generated_image(generated, data['path'])
- Training a new model:
from options.train_options import TrainOptions
from models.models import create_model
from data.data_loader import CreateDataLoader
opt = TrainOptions().parse()
data_loader = CreateDataLoader(opt)
dataset = data_loader.load_data()
model = create_model(opt)
for epoch in range(opt.niter + opt.niter_decay):
    for i, data in enumerate(dataset):
        model.set_input(data)
        model.optimize_parameters()
Getting Started
- Clone the repository:
git clone https://github.com/NVIDIA/pix2pixHD.git
cd pix2pixHD
- Install dependencies:
pip install -r requirements.txt
- Download a pre-trained model:
bash ./scripts/download_model.sh
- Run inference on a sample image:
python test.py --name label2city_1024p --netG local --ngf 32 --resize_or_crop none
Competitor Comparisons
Image-to-Image Translation in PyTorch
Pros of pytorch-CycleGAN-and-pix2pix
- Supports multiple image-to-image translation models (CycleGAN, pix2pix, etc.)
- More flexible and easier to customize for various tasks
- Better documentation and examples for beginners
Cons of pytorch-CycleGAN-and-pix2pix
- Lower resolution output compared to pix2pixHD
- May require more manual tuning for optimal results
- Lacks some advanced features present in pix2pixHD
Code Comparison
pytorch-CycleGAN-and-pix2pix:
from models import create_model
model = create_model(opt)
model.setup(opt)
model.train()
pix2pixHD:
from models.models import create_model
model = create_model(opt)
model.train()
The code structure is similar, but pix2pixHD has a more streamlined setup process. pytorch-CycleGAN-and-pix2pix offers more flexibility with model creation and setup options.
Both repositories provide powerful image-to-image translation capabilities, but they cater to different use cases. pytorch-CycleGAN-and-pix2pix is more versatile and beginner-friendly, while pix2pixHD focuses on high-resolution results and advanced features for specific applications.
Semantic Image Synthesis with SPADE
Pros of SPADE
- Improved semantic layout preservation, especially for complex scenes
- More flexible architecture allowing for diverse style inputs
- Better handling of multi-modal distributions in generated images
Cons of SPADE
- Potentially higher computational requirements due to more complex architecture
- May require more training data for optimal performance
- Slightly more complex implementation compared to pix2pixHD
Code Comparison
SPADE introduces a new normalization layer:
class SPADE(nn.Module):
def __init__(self, norm_nc, label_nc):
super().__init__()
self.param_free_norm = nn.InstanceNorm2d(norm_nc, affine=False)
self.mlp_shared = nn.Sequential(
nn.Conv2d(label_nc, 128, kernel_size=3, padding=1),
nn.ReLU()
)
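The constructor above only shows the shared MLP; the modulation itself happens in the forward pass. Below is a minimal sketch of that step following the published SPADE formulation rather than the repository's exact code; the mlp_gamma/mlp_beta names, the nhidden size, and the usage shapes are illustrative:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpadeSketch(nn.Module):
    def __init__(self, norm_nc, label_nc, nhidden=128):
        super().__init__()
        self.param_free_norm = nn.InstanceNorm2d(norm_nc, affine=False)
        self.mlp_shared = nn.Sequential(
            nn.Conv2d(label_nc, nhidden, kernel_size=3, padding=1), nn.ReLU())
        self.mlp_gamma = nn.Conv2d(nhidden, norm_nc, kernel_size=3, padding=1)
        self.mlp_beta = nn.Conv2d(nhidden, norm_nc, kernel_size=3, padding=1)

    def forward(self, x, segmap):
        normalized = self.param_free_norm(x)
        # Resize the segmentation map to the feature resolution, then predict a
        # spatially varying scale (gamma) and shift (beta) from it.
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')
        actv = self.mlp_shared(segmap)
        gamma, beta = self.mlp_gamma(actv), self.mlp_beta(actv)
        return normalized * (1 + gamma) + beta

out = SpadeSketch(64, 35)(torch.randn(1, 64, 32, 64), torch.randn(1, 35, 256, 512))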
pix2pixHD uses a more traditional approach:
class ResnetBlock(nn.Module):
def __init__(self, dim, padding_type, norm_layer, activation=nn.ReLU(True), use_dropout=False):
super(ResnetBlock, self).__init__()
self.conv_block = self.build_conv_block(dim, padding_type, norm_layer, activation, use_dropout)
Both repositories focus on image-to-image translation, but SPADE offers more advanced semantic control and style manipulation capabilities compared to pix2pixHD.
Image-to-image translation with conditional adversarial nets
Pros of pix2pix
- Simpler implementation, making it easier to understand and modify
- Faster training time due to less complex architecture
- Wider compatibility with various datasets and tasks
Cons of pix2pix
- Lower output resolution compared to pix2pixHD
- Less detailed and realistic results in some cases
- Limited ability to handle high-resolution images
Code Comparison
pix2pix:
class UnetGenerator(nn.Module):
def __init__(self, input_nc, output_nc, num_downs, ngf=64):
super(UnetGenerator, self).__init__()
# Encoder and decoder implementation
pix2pixHD:
class GlobalGenerator(nn.Module):
def __init__(self, input_nc, output_nc, ngf=64, n_downsampling=3, n_blocks=9):
super(GlobalGenerator, self).__init__()
# More complex generator with global and local enhancer networks
The code snippets show that pix2pixHD uses a more sophisticated generator architecture, which contributes to its ability to produce higher-resolution and more detailed outputs. However, this comes at the cost of increased complexity and longer training times compared to the simpler pix2pix implementation.
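To make the coarse-to-fine idea concrete, here is a heavily simplified sketch, not the repository's actual GlobalGenerator/LocalEnhancer code: a global branch processes a downsampled input, and a local branch fuses its upsampled features with full-resolution features before decoding. All layer sizes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineSketch(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, ngf=32):
        super().__init__()
        # "Global" branch: processes a 2x-downsampled copy of the input.
        self.global_net = nn.Sequential(
            nn.Conv2d(in_ch, ngf, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(ngf, ngf, 3, padding=1), nn.ReLU(True))
        # "Local enhancer": extracts full-resolution features and fuses them
        # with the upsampled global features before decoding to an image.
        self.local_front = nn.Sequential(
            nn.Conv2d(in_ch, ngf, 3, padding=1), nn.ReLU(True))
        self.local_back = nn.Sequential(
            nn.Conv2d(ngf, ngf, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(ngf, out_ch, 3, padding=1), nn.Tanh())

    def forward(self, x):
        feat_global = self.global_net(F.avg_pool2d(x, 2))           # coarse pass
        feat_global = F.interpolate(feat_global, scale_factor=2)    # back to full size
        feat_local = self.local_front(x)                             # fine pass
        return self.local_back(feat_local + feat_global)             # fuse and decode

out = CoarseToFineSketch()(torch.randn(1, 3, 256, 512))              # (1, 3, 256, 512)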
Learning Chinese Character style with conditional GAN
Pros of zi2zi
- Specialized for Chinese character generation, offering better results for this specific use case
- Includes a lightweight web interface for easy testing and visualization
- Provides pre-trained models for immediate use
Cons of zi2zi
- Limited to character-based image generation, less versatile than pix2pixHD
- Smaller community and fewer updates compared to pix2pixHD
- May require more domain-specific knowledge for optimal results
Code Comparison
zi2zi:
def build_generator(self, img_width):
inputs = Input(shape=(img_width, img_width, 1))
# ... (generator architecture)
return Model(inputs=inputs, outputs=out)
pix2pixHD:
def define_G(input_nc, output_nc, ngf, n_downsample_global=3, n_blocks_global=9, n_local_enhancers=1, n_blocks_local=3, norm='instance', gpu_ids=[]):
netG = GlobalGenerator(input_nc, output_nc, ngf, n_downsample_global, n_blocks_global, norm)
# ... (additional generator components)
return netG
The code snippets show that zi2zi uses a simpler generator structure, while pix2pixHD employs a more complex architecture with global and local components, potentially offering better results for general image-to-image translation tasks.
Photographic Image Synthesis with Cascaded Refinement Networks
Pros of PhotographicImageSynthesis
- Focuses on generating photorealistic images from semantic layouts
- Utilizes a global-local adversarial loss for improved image quality
- Implements a novel perceptual loss function for enhanced realism
Cons of PhotographicImageSynthesis
- Limited to specific scene types and may not generalize well
- Requires more computational resources due to complex architecture
- Less versatile in terms of input types compared to pix2pixHD
Code Comparison
PhotographicImageSynthesis:
def build_generator(self):
inputs = Input(shape=self.input_shape)
x = Conv2D(64, 3, padding='same')(inputs)
# ... more layers
return Model(inputs, x)
pix2pixHD:
def define_G(input_nc, output_nc, ngf, n_downsample_global=3, n_blocks_global=9, n_local_enhancers=1, n_blocks_local=3, norm='instance', gpu_ids=[]):
netG = GlobalGenerator(input_nc, output_nc, ngf, n_downsample_global, n_blocks_global, norm)
# ... more code
return netG
The code snippets show different approaches to building generators. PhotographicImageSynthesis uses a simpler sequential approach, while pix2pixHD employs a more modular structure with separate global and local enhancers.
README
pix2pixHD
Project | Youtube | Paper
Pytorch implementation of our method for high-resolution (e.g. 2048x1024) photorealistic image-to-image translation. It can be used for turning semantic label maps into photo-realistic images or synthesizing portraits from face label maps.
High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
Ting-Chun Wang¹, Ming-Yu Liu¹, Jun-Yan Zhu², Andrew Tao¹, Jan Kautz¹, Bryan Catanzaro¹
¹NVIDIA Corporation, ²UC Berkeley
In CVPR 2018.
Image-to-image translation at 2k/1k resolution
- Our label-to-streetview results
- Interactive editing results
- Additional streetview results
- Label-to-face and interactive editing results
- Our editing interface
Prerequisites
- Linux or macOS
- Python 2 or 3
- NVIDIA GPU (11G memory or larger) + CUDA cuDNN
Getting Started
Installation
- Install PyTorch and dependencies from http://pytorch.org
- Install the Python library dominate:
pip install dominate
- Clone this repo:
git clone https://github.com/NVIDIA/pix2pixHD
cd pix2pixHD
Testing
- A few example Cityscapes test images are included in the datasets folder.
- Please download the pre-trained Cityscapes model from here (google drive link), and put it under ./checkpoints/label2city_1024p/
- Test the model (bash ./scripts/test_1024p.sh):
#!./scripts/test_1024p.sh
python test.py --name label2city_1024p --netG local --ngf 32 --resize_or_crop none
The test results will be saved to an html file here: ./results/label2city_1024p/test_latest/index.html.
More example scripts can be found in the scripts directory.
Dataset
- We use the Cityscapes dataset. To train a model on the full dataset, please download it from the official website (registration required). After downloading, please put it under the datasets folder in the same way the example images are provided (see the sanity-check sketch below).
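As a quick sanity check, the following sketch (not part of the repository) assumes the data has been arranged into the parallel folders used by the custom-dataset notes further down (train_label, train_inst, train_img) and reports how many files each folder holds; the paired folders are expected to contain the same number of files so entries line up when read in sorted order.
from pathlib import Path

root = Path('datasets/cityscapes')                 # adjust to where you placed the data
for sub in ('train_label', 'train_inst', 'train_img'):
    files = sorted((root / sub).glob('*.png'))
    first = files[0].name if files else 'EMPTY'
    print(f'{sub}: {len(files)} files, first: {first}')
# Each folder should hold the same number of files so that labels, instance maps,
# and images correspond when read in sorted order.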
Training
- Train a model at 1024 x 512 resolution (bash ./scripts/train_512p.sh):
#!./scripts/train_512p.sh
python train.py --name label2city_512p
- To view training results, please check out intermediate results in ./checkpoints/label2city_512p/web/index.html. If you have tensorflow installed, you can see tensorboard logs in ./checkpoints/label2city_512p/logs by adding --tf_log to the training scripts.
Multi-GPU training
- Train a model using multiple GPUs (bash ./scripts/train_512p_multigpu.sh):
#!./scripts/train_512p_multigpu.sh
python train.py --name label2city_512p --batchSize 8 --gpu_ids 0,1,2,3,4,5,6,7
Note: this is not tested and we trained our model using a single GPU only. Please use at your own discretion.
Training with Automatic Mixed Precision (AMP) for faster speed
- To train with mixed precision support, please first install apex from: https://github.com/NVIDIA/apex
- You can then train the model by adding --fp16. For example:
#!./scripts/train_512p_fp16.sh
python -m torch.distributed.launch train.py --name label2city_512p --fp16
In our test case, it trains about 80% faster with AMP on a Volta machine.
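For reference, this is roughly how apex AMP is typically wired into a PyTorch training loop; the repository's own --fp16 code path may differ in detail, and the tiny model, data, and loss below are placeholders rather than anything from pix2pixHD.
import torch
import torch.nn as nn
from apex import amp   # https://github.com/NVIDIA/apex

model = nn.Conv2d(3, 3, 3, padding=1).cuda()              # stand-in for the generator
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

for _ in range(10):                                        # stand-in training loop
    x = torch.randn(1, 3, 256, 256, device='cuda')
    optimizer.zero_grad()
    loss = model(x).abs().mean()                           # stand-in loss
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()                             # backward on the scaled loss
    optimizer.step()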
Training at full resolution
- Training at full resolution (2048 x 1024) requires a GPU with 24G memory (bash ./scripts/train_1024p_24G.sh), or 16G memory if using mixed precision (AMP).
- If only GPUs with 12G memory are available, please use the 12G script (bash ./scripts/train_1024p_12G.sh), which will crop the images during training. Performance is not guaranteed with this script.
Training with your own dataset
- If you want to train with your own dataset, please generate label maps which are one-channel images whose pixel values correspond to the object labels (i.e. 0, 1, ..., N-1, where N is the number of labels). This is because we need to generate one-hot vectors from the label maps (see the sketch after this list). Please also specify --label_nc N during both training and testing.
- If your input is not a label map, please just specify --label_nc 0, which will directly use the RGB colors as input. The folders should then be named train_A, train_B instead of train_label, train_img, where the goal is to translate images from A to B.
- If you don't have instance maps or don't want to use them, please specify --no_instance.
- The default setting for preprocessing is scale_width, which will scale the width of all training images to opt.loadSize (1024) while keeping the aspect ratio. If you want a different setting, please change it by using the --resize_or_crop option. For example, scale_width_and_crop first resizes the image to have width opt.loadSize and then does random cropping of size (opt.fineSize, opt.fineSize). crop skips the resizing step and only performs random cropping. If you don't want any preprocessing, please specify none, which will do nothing other than making sure the image is divisible by 32.
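As mentioned in the first item above, the single-channel label map is expanded into an N-channel one-hot tensor internally when --label_nc N is set. A minimal sketch of that expansion; the tensor shapes and the N = 35 value are illustrative, not taken from the repository:
import torch

N = 35                                               # e.g. --label_nc 35
label = torch.randint(0, N, (1, 1, 512, 1024))       # (batch, 1, H, W) integer label map

one_hot = torch.zeros(1, N, 512, 1024)
one_hot.scatter_(1, label, 1.0)                      # channel c is 1 where label == c
print(one_hot.shape)                                 # torch.Size([1, 35, 512, 1024])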
More Training/Test Details
- Flags: see options/train_options.py and options/base_options.py for all the training flags; see options/test_options.py and options/base_options.py for all the test flags.
- Instance map: we take in both label maps and instance maps as input. If you don't want to use instance maps, please specify the flag --no_instance.
Citation
If you find this useful for your research, please use the following.
@inproceedings{wang2018pix2pixHD,
title={High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs},
author={Ting-Chun Wang and Ming-Yu Liu and Jun-Yan Zhu and Andrew Tao and Jan Kautz and Bryan Catanzaro},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
year={2018}
}
Acknowledgments
This code borrows heavily from pytorch-CycleGAN-and-pix2pix.