Top Related Projects
- Image-to-Image Translation in PyTorch
- Synthesizing and manipulating 2048x1024 images with conditional GANs
- Semantic Image Synthesis with SPADE
- Tensorflow port of Image-to-Image Translation with Conditional Adversarial Nets https://phillipi.github.io/pix2pix/
Quick Overview
pix2pix is a deep learning-based image-to-image translation framework. It uses conditional adversarial networks to learn a mapping from input images to output images, enabling applications such as colorization, translating semantic label maps to photos, and generating photos from edges or sketches.
Pros
- Versatile: Can be applied to a wide range of image-to-image translation tasks
- High-quality results: Produces realistic and detailed output images
- Open-source: Freely available for research and development
- Well-documented: Includes comprehensive instructions and examples
Cons
- Requires significant computational resources for training
- Dependent on the quality and quantity of training data
- May struggle with complex scenes or highly diverse datasets
- Limited control over specific features in the output
Code Examples
- Loading and preprocessing data:
```python
import tensorflow as tf

def load_data(path):
    # `path` is a pathlib.Path; list every image file two levels deep.
    dataset = tf.data.Dataset.list_files(str(path / '*/*'))
    dataset = dataset.map(load_image_train, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

def load_image_train(image_file):
    # `load`, `random_jitter`, BUFFER_SIZE, and BATCH_SIZE are assumed from the
    # TensorFlow pix2pix tutorial that this example follows.
    input_image, real_image = load(image_file)
    input_image, real_image = random_jitter(input_image, real_image)
    return input_image, real_image
```
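The helpers above come from the TensorFlow pix2pix tutorial. As a sketch of what `random_jitter` does there (resize up, random-crop back, random mirror, keeping the pair aligned):

```python
def random_jitter(input_image, real_image):
    # Upscale both images to 286x286, then randomly crop back to 256x256.
    input_image = tf.image.resize(input_image, [286, 286],
                                  method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
    real_image = tf.image.resize(real_image, [286, 286],
                                 method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
    stacked = tf.stack([input_image, real_image], axis=0)
    cropped = tf.image.random_crop(stacked, size=[2, 256, 256, 3])
    input_image, real_image = cropped[0], cropped[1]
    # Mirror both images together so the pair stays aligned.
    if tf.random.uniform(()) > 0.5:
        input_image = tf.image.flip_left_right(input_image)
        real_image = tf.image.flip_left_right(real_image)
    return input_image, real_image
```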
- Defining the generator model:
```python
def Generator():
    inputs = tf.keras.layers.Input(shape=[256, 256, 3])

    down_stack = [
        downsample(64, 4, apply_batchnorm=False),  # (bs, 128, 128, 64)
        downsample(128, 4),
        downsample(256, 4),
        downsample(512, 4),
        downsample(512, 4),
        downsample(512, 4),
        downsample(512, 4),
        downsample(512, 4),  # (bs, 1, 1, 512)
    ]

    up_stack = [
        upsample(512, 4, apply_dropout=True),
        upsample(512, 4, apply_dropout=True),
        upsample(512, 4, apply_dropout=True),
        upsample(512, 4),
        upsample(256, 4),
        upsample(128, 4),
        upsample(64, 4),
    ]

    initializer = tf.random_normal_initializer(0., 0.02)
    # OUTPUT_CHANNELS is defined elsewhere (e.g., 3 for RGB output).
    last = tf.keras.layers.Conv2DTranspose(OUTPUT_CHANNELS, 4,
                                           strides=2,
                                           padding='same',
                                           kernel_initializer=initializer,
                                           activation='tanh')

    x = inputs

    # Downsampling through the model
    skips = []
    for down in down_stack:
        x = down(x)
        skips.append(x)

    skips = reversed(skips[:-1])

    # Upsampling and establishing the skip connections
    for up, skip in zip(up_stack, skips):
        x = up(x)
        x = tf.keras.layers.Concatenate()([x, skip])

    x = last(x)

    return tf.keras.Model(inputs=inputs, outputs=x)
```
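The `downsample` and `upsample` helpers are not defined above. In the TensorFlow tutorial this generator follows, they are small `Sequential` blocks roughly like this (a sketch, assuming the tutorial's conventions):

```python
def downsample(filters, size, apply_batchnorm=True):
    # Strided convolution halves the spatial resolution.
    initializer = tf.random_normal_initializer(0., 0.02)
    result = tf.keras.Sequential()
    result.add(tf.keras.layers.Conv2D(filters, size, strides=2, padding='same',
                                      kernel_initializer=initializer, use_bias=False))
    if apply_batchnorm:
        result.add(tf.keras.layers.BatchNormalization())
    result.add(tf.keras.layers.LeakyReLU())
    return result

def upsample(filters, size, apply_dropout=False):
    # Transposed convolution doubles the spatial resolution.
    initializer = tf.random_normal_initializer(0., 0.02)
    result = tf.keras.Sequential()
    result.add(tf.keras.layers.Conv2DTranspose(filters, size, strides=2, padding='same',
                                               kernel_initializer=initializer, use_bias=False))
    result.add(tf.keras.layers.BatchNormalization())
    if apply_dropout:
        result.add(tf.keras.layers.Dropout(0.5))
    result.add(tf.keras.layers.ReLU())
    return result
```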
- Training step function:
```python
@tf.function
def train_step(input_image, target, step):
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        gen_output = generator(input_image, training=True)
        disc_real_output = discriminator([input_image, target], training=True)
        disc_generated_output = discriminator([input_image, gen_output], training=True)
        gen_total_loss, gen_gan_loss, gen_l1_loss = generator_loss(disc_generated_output, gen_output, target)
        disc_loss = discriminator_loss(disc_real_output, disc_generated_output)

    # Assumed standard continuation: compute gradients and apply them
    # (`step` is used for logging in the tutorial version and is unused here).
    generator_gradients = gen_tape.gradient(gen_total_loss, generator.trainable_variables)
    discriminator_gradients = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
    generator_optimizer.apply_gradients(zip(generator_gradients, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(discriminator_gradients, discriminator.trainable_variables))
```
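The `generator_loss` and `discriminator_loss` functions referenced above are not shown; a sketch following the standard pix2pix objective (sigmoid cross-entropy GAN loss plus an L1 term weighted by λ = 100, as in the paper):

```python
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100  # weight on the L1 reconstruction term (value from the pix2pix paper)

def generator_loss(disc_generated_output, gen_output, target):
    # Fool the discriminator (all-ones labels) while staying close to the target in L1.
    gan_loss = loss_object(tf.ones_like(disc_generated_output), disc_generated_output)
    l1_loss = tf.reduce_mean(tf.abs(target - gen_output))
    return gan_loss + LAMBDA * l1_loss, gan_loss, l1_loss

def discriminator_loss(disc_real_output, disc_generated_output):
    # Real pairs should be classified as 1, generated pairs as 0.
    real_loss = loss_object(tf.ones_like(disc_real_output), disc_real_output)
    generated_loss = loss_object(tf.zeros_like(disc_generated_output), disc_generated_output)
    return real_loss + generated_loss
```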
Competitor Comparisons
Image-to-Image Translation in PyTorch
Pros of pytorch-CycleGAN-and-pix2pix
- Implements both CycleGAN and pix2pix in a single repository
- Uses PyTorch, which offers dynamic computational graphs and easier debugging
- Provides more extensive documentation and examples
Cons of pytorch-CycleGAN-and-pix2pix
- May have a steeper learning curve for those unfamiliar with PyTorch
- Potentially more complex codebase due to supporting multiple models
Code Comparison
pix2pix (TensorFlow port):

```python
def discriminator(image, options, reuse=False, name="discriminator"):
    with tf.variable_scope(name):
        # Layers defined here
        return out, end_points
```
pytorch-CycleGAN-and-pix2pix:
```python
class Discriminator(nn.Module):
    def __init__(self, input_nc, ndf=64, n_layers=3, norm_layer=nn.BatchNorm2d):
        super(Discriminator, self).__init__()
        # Layers defined here

    def forward(self, input):
        return self.model(input)
```
The pytorch-CycleGAN-and-pix2pix repository uses PyTorch's object-oriented approach, defining the discriminator as a class. In contrast, the TensorFlow-style pix2pix code uses a functional approach with a discriminator function. The PyTorch version may be more intuitive for those familiar with object-oriented programming.
Synthesizing and manipulating 2048x1024 images with conditional GANs
Pros of pix2pixHD
- Higher resolution output (up to 2048x1024)
- Improved visual quality and realism
- Multi-scale generator and discriminator architecture
Cons of pix2pixHD
- Requires more computational resources
- Longer training time
- More complex implementation
Code Comparison
pix2pix:
```python
class UnetGenerator(nn.Module):
    def __init__(self, input_nc, output_nc, num_downs, ngf=64):
        super(UnetGenerator, self).__init__()
        # Implementation details...
```
pix2pixHD:
```python
class GlobalGenerator(nn.Module):
    def __init__(self, input_nc, output_nc, ngf=64, n_downsampling=3, n_blocks=9):
        super(GlobalGenerator, self).__init__()
        # Implementation details...
```
The pix2pixHD implementation introduces a more complex generator architecture, including global and local enhancer networks, which contribute to its improved output quality and resolution capabilities. However, this increased complexity also results in higher computational requirements and longer training times compared to the original pix2pix implementation.
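To make the multi-scale idea concrete, here is a deliberately simplified, hypothetical two-scale generator (not pix2pixHD's actual `LocalEnhancer` code): a coarse branch processes a downsampled input, and its features are fused into a full-resolution branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleGenerator(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, ngf=32):
        super().__init__()
        # Coarse branch: stand-in for pix2pixHD's GlobalGenerator.
        self.global_net = nn.Sequential(
            nn.Conv2d(in_ch, ngf, 7, padding=3), nn.ReLU(),
            nn.Conv2d(ngf, ngf, 3, padding=1), nn.ReLU(),
        )
        # Full-resolution branch: stride-2 conv downsamples to the coarse scale.
        self.local_front = nn.Sequential(
            nn.Conv2d(in_ch, ngf, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Fuse the two branches and upsample back to full resolution.
        self.local_back = nn.Sequential(
            nn.ConvTranspose2d(ngf, ngf, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ngf, out_ch, 7, padding=3), nn.Tanh(),
        )

    def forward(self, x):
        coarse = self.global_net(F.avg_pool2d(x, 2))  # run coarse branch at half resolution
        fine = self.local_front(x)                    # half-resolution features from full-res input
        return self.local_back(fine + coarse)         # element-wise fusion, as in pix2pixHD

# Usage: out = TwoScaleGenerator()(torch.randn(1, 3, 256, 256))  # -> (1, 3, 256, 256)
```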
Semantic Image Synthesis with SPADE
Pros of SPADE
- Improved image quality and realism compared to pix2pix
- Better handling of complex scenes and diverse layouts
- More flexible input format, allowing for semantic segmentation masks
Cons of SPADE
- Higher computational requirements and longer training time
- More complex architecture, potentially harder to implement and fine-tune
- May struggle with certain types of fine details or textures
Code Comparison
SPADE introduces a spatially-adaptive normalization layer, which is a key difference in implementation:
```python
# SPADE
class SPADE(nn.Module):
    def __init__(self, norm_nc, label_nc):
        super().__init__()
        self.param_free_norm = nn.InstanceNorm2d(norm_nc, affine=False)
        self.mlp_shared = nn.Sequential(
            nn.Conv2d(label_nc, 128, kernel_size=3, padding=1),
            nn.ReLU()
        )
        self.mlp_gamma = nn.Conv2d(128, norm_nc, kernel_size=3, padding=1)
        self.mlp_beta = nn.Conv2d(128, norm_nc, kernel_size=3, padding=1)

    def forward(self, x, segmap):
        # Normalize without learned affine parameters, then modulate with
        # gamma/beta predicted from the (resized) segmentation map.
        normalized = self.param_free_norm(x)
        segmap = F.interpolate(segmap, size=x.size()[2:], mode='nearest')
        actv = self.mlp_shared(segmap)
        return normalized * (1 + self.mlp_gamma(actv)) + self.mlp_beta(actv)
```
```python
# pix2pix
class UnetGenerator(nn.Module):
    def __init__(self, input_nc, output_nc, num_downs, ngf=64):
        super(UnetGenerator, self).__init__()
        unet_block = UnetSkipConnectionBlock(ngf * 8, ngf * 8, input_nc=None,
                                             submodule=None, innermost=True)
        for i in range(num_downs - 5):
            unet_block = UnetSkipConnectionBlock(ngf * 8, ngf * 8, input_nc=None,
                                                 submodule=unet_block)
        # ... remaining blocks omitted
```
Tensorflow port of Image-to-Image Translation with Conditional Adversarial Nets https://phillipi.github.io/pix2pix/
Pros of pix2pix-tensorflow
- Implemented in TensorFlow, giving access to its ecosystem for graph export, deployment, and distributed training
- Includes a web-based interface for easy experimentation and visualization
- Provides pre-trained models for quick testing and deployment
Cons of pix2pix-tensorflow
- May have a steeper learning curve for those unfamiliar with TensorFlow
- Potentially less flexible for custom modifications compared to the original PyTorch implementation
- Could have compatibility issues with older TensorFlow versions
Code Comparison
pix2pix (PyTorch):
```python
class UnetGenerator(nn.Module):
    def __init__(self, input_nc, output_nc, num_downs, ngf=64):
        super(UnetGenerator, self).__init__()
        # Implementation details...
```
pix2pix-tensorflow:
```python
def create_generator(generator_inputs, generator_outputs_channels):
    layers = []
    # Implementation details...
    return layers[-1]
```
The main difference in the code is that the PyTorch implementation defines models as object-oriented `nn.Module` subclasses, while pix2pix-tensorflow builds the network with plain graph-constructing functions. This reflects the different frameworks and their respective approaches to building neural network architectures.
README
pix2pix
Torch implementation for learning a mapping from input images to output images.
Image-to-Image Translation with Conditional Adversarial Networks
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros
CVPR, 2017.
On some tasks, decent results can be obtained fairly quickly and on small datasets. For example, to learn to generate facades, we trained on just 400 images for about 2 hours (on a single Pascal Titan X GPU). However, for harder problems it may be important to train on far larger datasets, and for many hours or even days.
Note: Please check out our PyTorch implementation for pix2pix and CycleGAN. The PyTorch version is under active development and can produce results comparable to or better than this Torch version.
Setup
Prerequisites
- Linux or OSX
- NVIDIA GPU + CUDA CuDNN (CPU mode and CUDA without CuDNN may work with minimal modification, but untested)
Getting Started
- Install torch and dependencies from https://github.com/torch/distro
- Install torch packages `nngraph` and `display`:

```bash
luarocks install nngraph
luarocks install https://raw.githubusercontent.com/szym/display/master/display-scm-0.rockspec
```
- Clone this repo:

```bash
git clone git@github.com:phillipi/pix2pix.git
cd pix2pix
```
- Download the dataset (e.g., CMP Facades):

```bash
bash ./datasets/download_dataset.sh facades
```
- Train the model:

```bash
DATA_ROOT=./datasets/facades name=facades_generation which_direction=BtoA th train.lua
```
- (CPU only) The same training command without using a GPU or CUDNN. Setting the environment variables `gpu=0 cudnn=0` forces CPU-only mode:

```bash
DATA_ROOT=./datasets/facades name=facades_generation which_direction=BtoA gpu=0 cudnn=0 batchSize=10 save_epoch_freq=5 th train.lua
```
- (Optionally) start the display server to view results as the model trains (see Display UI for more details):

```bash
th -ldisplay.start 8000 0.0.0.0
```
- Finally, test the model:

```bash
DATA_ROOT=./datasets/facades name=facades_generation which_direction=BtoA phase=val th test.lua
```

The test results will be saved to an html file here: `./results/facades_generation/latest_net_G_val/index.html`.
Train
```bash
DATA_ROOT=/path/to/data/ name=expt_name which_direction=AtoB th train.lua
```

Switch `AtoB` to `BtoA` to train translation in the opposite direction.

Models are saved to `./checkpoints/expt_name` (can be changed by passing `checkpoint_dir=your_dir` in train.lua).

See `opt` in train.lua for additional training options.
Test
```bash
DATA_ROOT=/path/to/data/ name=expt_name which_direction=AtoB phase=val th test.lua
```

This will run the model named `expt_name` in direction `AtoB` on all images in `/path/to/data/val`.

Result images, and a webpage to view them, are saved to `./results/expt_name` (can be changed by passing `results_dir=your_dir` in test.lua).

See `opt` in test.lua for additional testing options.
Datasets
Download the datasets using the following script. Some of the datasets are collected by other researchers. Please cite their papers if you use the data.
```bash
bash ./datasets/download_dataset.sh dataset_name
```

- `facades`: 400 images from the CMP Facades dataset. [Citation]
- `cityscapes`: 2975 images from the Cityscapes training set. [Citation]
- `maps`: 1096 training images scraped from Google Maps
- `edges2shoes`: 50k training images from the UT Zappos50K dataset. Edges are computed by the HED edge detector + post-processing. [Citation]
- `edges2handbags`: 137K Amazon Handbag images from the iGAN project. Edges are computed by the HED edge detector + post-processing. [Citation]
- `night2day`: around 20K natural scene images from the Transient Attributes dataset [Citation]. To train a `day2night` pix2pix model, you need to add `which_direction=BtoA`.
Models
Download the pre-trained models with the following script. You need to rename the model (e.g., `facades_label2image` to `/checkpoints/facades/latest_net_G.t7`) after the download has finished.

```bash
bash ./models/download_model.sh model_name
```

- `facades_label2image` (label -> facade): trained on the CMP Facades dataset.
- `cityscapes_label2image` (label -> street scene): trained on the Cityscapes dataset.
- `cityscapes_image2label` (street scene -> label): trained on the Cityscapes dataset.
- `edges2shoes` (edge -> photo): trained on the UT Zappos50K dataset.
- `edges2handbags` (edge -> photo): trained on Amazon handbag images.
- `day2night` (daytime scene -> nighttime scene): trained on around 100 webcams.
Setup Training and Test data
Generating Pairs
We provide a python script to generate training data in the form of pairs of images {A,B}, where A and B are two different depictions of the same underlying scene. For example, these might be pairs {label map, photo} or {bw image, color image}. Then we can learn to translate A to B or B to A:
Create folder `/path/to/data` with subfolders `A` and `B`. `A` and `B` should each have their own subfolders `train`, `val`, `test`, etc. In `/path/to/data/A/train`, put training images in style A. In `/path/to/data/B/train`, put the corresponding images in style B. Repeat the same for other data splits (`val`, `test`, etc.).

Corresponding images in a pair {A,B} must be the same size and have the same filename, e.g., `/path/to/data/A/train/1.jpg` is considered to correspond to `/path/to/data/B/train/1.jpg`.
Once the data is formatted this way, call:
```bash
python scripts/combine_A_and_B.py --fold_A /path/to/data/A --fold_B /path/to/data/B --fold_AB /path/to/data
```
This will combine each pair of images (A,B) into a single image file, ready for training.
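A minimal sketch of what `combine_A_and_B.py` produces for a single pair (the real script also walks the `train`/`val`/`test` splits and matches filenames; this helper is hypothetical):

```python
import numpy as np
from PIL import Image

def combine_pair(path_a, path_b, path_ab):
    # Paste A and B side by side so the training loader can split them again.
    a = np.array(Image.open(path_a))
    b = np.array(Image.open(path_b))
    assert a.shape == b.shape, "A/B pairs must be the same size"
    Image.fromarray(np.concatenate([a, b], axis=1)).save(path_ab)

# e.g. combine_pair('/path/to/data/A/train/1.jpg', '/path/to/data/B/train/1.jpg',
#                   '/path/to/data/train/1.jpg')
```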
Notes on Colorization
No need to run `combine_A_and_B.py` for colorization. Instead, you need to prepare some natural images and set `preprocess=colorization` in the script. The program will automatically convert each RGB image into Lab color space and create an `L -> ab` image pair during training. Also set `input_nc=1` and `output_nc=2`.
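A sketch of the `L -> ab` pairing described above, done in Python with scikit-image rather than the repo's Lua pipeline (hypothetical, for illustration):

```python
import numpy as np
from skimage import color, io

rgb = io.imread('photo.jpg') / 255.0  # H x W x 3 floats in [0, 1]
lab = color.rgb2lab(rgb)              # L in [0, 100]; a, b roughly in [-110, 110]
L = lab[..., :1]                      # 1-channel input  (input_nc=1)
ab = lab[..., 1:]                     # 2-channel target (output_nc=2)
```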
Extracting Edges
We provide Python and Matlab scripts to extract coarse edges from photos. Run `scripts/edges/batch_hed.py` to compute HED edges. Run `scripts/edges/PostprocessHED.m` to simplify edges with additional post-processing steps. Check the code documentation for more details.
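As a rough, hypothetical Python analogue of "simplifying" a HED probability map (threshold, thin, despeckle); the repo's actual logic lives in `scripts/edges/PostprocessHED.m` and may differ:

```python
import numpy as np
from skimage import io, morphology

hed = io.imread('hed_output.png') / 255.0           # HED edge-probability map
edges = hed > 0.25                                  # binarize (threshold is a guess)
edges = morphology.thin(edges)                      # thin to 1-pixel-wide strokes
edges = morphology.remove_small_objects(edges, 10)  # drop tiny speckles
io.imsave('edges.png', ((~edges) * 255).astype(np.uint8))  # black edges on white
```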
Evaluating Labels2Photos on Cityscapes
We provide scripts for running the evaluation of the Labels2Photos task on the Cityscapes validation set. We assume that you have installed `caffe` (and `pycaffe`) on your system. If not, see the official website for installation instructions. Once `caffe` is successfully installed, download the pre-trained FCN-8s semantic segmentation model (512MB) by running

```bash
bash ./scripts/eval_cityscapes/download_fcn8s.sh
```

Then make sure `./scripts/eval_cityscapes/` is in your system's Python path. If not, run the following command to add it:

```bash
export PYTHONPATH=${PYTHONPATH}:./scripts/eval_cityscapes/
```

Now you can run the following command to evaluate your predictions:

```bash
python ./scripts/eval_cityscapes/evaluate.py --cityscapes_dir /path/to/original/cityscapes/dataset/ --result_dir /path/to/your/predictions/ --output_dir /path/to/output/directory/
```

Images stored under `--result_dir` should contain your model predictions on the Cityscapes validation split, and have the original Cityscapes naming convention (e.g., `frankfurt_000001_038418_leftImg8bit.png`). The script will output a text file under `--output_dir` containing the metric.
Further notes: Our pre-trained FCN model is not supposed to work on Cityscapes in the original resolution (1024x2048) as it was trained on 256x256 images that are then upsampled to 1024x2048 during training. The purpose of the resizing during training was to 1) keep the label maps in the original high resolution untouched and 2) avoid the need to change the standard FCN training code and the architecture for Cityscapes. During test time, you need to synthesize 256x256 results. Our test code will automatically upsample your results to 1024x2048 before feeding them to the pre-trained FCN model. The output is at 1024x2048 resolution and will be compared to 1024x2048 ground truth labels. You do not need to resize the ground truth labels. The best way to verify whether everything is correct is to reproduce the numbers for real images in the paper first. To achieve it, you need to resize the original/real Cityscapes images (not labels) to 256x256 and feed them to the evaluation code.
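For that last step, a hypothetical helper that downsamples the real Cityscapes validation images to 256x256 (paths and filename pattern are assumptions; label maps are left untouched):

```python
import os
from PIL import Image

def resize_split(src_dir, dst_dir, size=(256, 256)):
    # Resize every leftImg8bit frame, keeping the original filenames so
    # evaluate.py can match predictions to ground truth.
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        if name.endswith('_leftImg8bit.png'):
            img = Image.open(os.path.join(src_dir, name)).convert('RGB')
            img.resize(size, Image.BICUBIC).save(os.path.join(dst_dir, name))
```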
Display UI
Optionally, for displaying images during training and test, use the display package.

- Install it with:

```bash
luarocks install https://raw.githubusercontent.com/szym/display/master/display-scm-0.rockspec
```

- Then start the server with:

```bash
th -ldisplay.start
```

- Open this URL in your browser: http://localhost:8000

By default, the server listens on localhost. Pass `0.0.0.0` to allow external connections on any interface:

```bash
th -ldisplay.start 8000 0.0.0.0
```

Then open `http://(hostname):(port)/` in your browser to load the remote desktop.
The L1 error is plotted to the display by default. Set the environment variable `display_plot` to a comma-separated list of the values `errL1`, `errG`, and `errD` to visualize the L1, generator, and discriminator error respectively. For example, to plot only the generator and discriminator errors to the display instead of the default L1 error, set `display_plot="errG,errD"`.
Citation
If you use this code for your research, please cite our paper Image-to-Image Translation with Conditional Adversarial Networks:

```
@article{pix2pix2017,
  title={Image-to-Image Translation with Conditional Adversarial Networks},
  author={Isola, Phillip and Zhu, Jun-Yan and Zhou, Tinghui and Efros, Alexei A},
  journal={CVPR},
  year={2017}
}
```
Cat Paper Collection
If you love cats, and love reading cool graphics, vision, and learning papers, please check out the Cat Paper Collection:
[Github] [Webpage]
Acknowledgments
Code borrows heavily from DCGAN. The data loader is modified from DCGAN and Context-Encoder.