
NVlabs / MUNIT

Multimodal Unsupervised Image-to-Image Translation


Top Related Projects

  • BicycleGAN: Toward Multimodal Image-to-Image Translation
  • UNIT: Unsupervised Image-to-Image Translation
  • FUNIT: Translate images to unseen domains at test time with few example images
  • StarGAN v2 - Official PyTorch Implementation (CVPR 2020)
  • contrastive-unpaired-translation: Contrastive unpaired image-to-image translation, faster and lighter training than CycleGAN (ECCV 2020, in PyTorch)
  • StarGAN - Official PyTorch Implementation (CVPR 2018)

Quick Overview

MUNIT (Multimodal UNsupervised Image-to-image Translation) is a framework for translating images between visual domains without paired training data. It decomposes each image into a domain-invariant content code and a domain-specific style code, which lets it produce diverse outputs for a single input while keeping content and style under separate control.
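
Conceptually, translation works by recombining one image's content code with a style code drawn from the target domain. The following is a minimal, self-contained sketch of that decomposition; module shapes and names are toy placeholders, not the repository's actual architecture:

import torch
import torch.nn as nn

class TinyMUNITGenerator(nn.Module):
    """Toy illustration of MUNIT's content/style split (not the repository's architecture)."""
    def __init__(self, channels=3, content_dim=64, style_dim=8):
        super().__init__()
        # Content encoder: keeps spatial structure (the domain-invariant "what is where")
        self.content_encoder = nn.Sequential(
            nn.Conv2d(channels, content_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(content_dim, content_dim, 3, padding=1), nn.ReLU(),
        )
        # Style encoder: collapses the image to a small vector (the domain-specific appearance)
        self.style_encoder = nn.Sequential(
            nn.Conv2d(channels, style_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Decoder: rebuilds an image from the content, modulated by the style
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(content_dim + style_dim, channels, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def encode(self, x):
        content = self.content_encoder(x)         # (N, content_dim, H/2, W/2)
        style = self.style_encoder(x).flatten(1)  # (N, style_dim)
        return content, style

    def decode(self, content, style):
        # Broadcast the style vector over the content feature map and decode
        style_map = style[:, :, None, None].expand(-1, -1, *content.shape[2:])
        return self.decoder(torch.cat([content, style_map], dim=1))

gen = TinyMUNITGenerator()
x = torch.randn(1, 3, 64, 64)                # stand-in for an input image
content, _ = gen.encode(x)                   # keep the content of x
new_style = torch.randn(1, 8)                # sample a new style -> a differently styled output
translated = gen.decode(content, new_style)  # same structure, new appearance

In the actual MUNIT generators the decoder injects the style through AdaIN layers and residual blocks rather than simple concatenation, but the encode/decode split is the same.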

Pros

  • Enables unsupervised learning without paired training data
  • Supports diverse and multimodal outputs for a single input image
  • Provides fine-grained control over the generated images by manipulating content and style separately
  • Achieves high-quality results across various image translation tasks

Cons

  • Requires significant computational resources for training
  • May struggle with complex scenes or highly diverse image domains
  • Can sometimes produce unrealistic results or visual artifacts in the generated images
  • Limited by the quality and diversity of the training dataset

Code Examples

The snippets below are an illustrative sketch of the workflow rather than the repository's actual entry points (the official code is driven by train.py, test.py, and YAML configs); MUNIT.load_from_checkpoint, model.encode, and model.decode stand in for whatever checkpoint wrapper you use.

  1. Loading a pre-trained MUNIT model:

from MUNIT import MUNIT  # hypothetical wrapper module, not part of the official repository

model = MUNIT.load_from_checkpoint('path/to/checkpoint.ckpt')

  2. Performing image translation with a randomly sampled style:

import torch
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image

# Convert the input image to a batch tensor before encoding
to_tensor = transforms.Compose([transforms.Resize(256), transforms.ToTensor()])
input_image = to_tensor(Image.open('input.jpg')).unsqueeze(0)

content, _ = model.encode(input_image)         # domain-invariant content code
style = torch.randn(1, model.style_dim, 1, 1)  # randomly sampled style code
output_image = model.decode(content, style)
save_image(output_image, 'output.jpg')

  3. Extracting content from one image and style from another, then combining them:

content_image = to_tensor(Image.open('content.jpg')).unsqueeze(0)
style_image = to_tensor(Image.open('style.jpg')).unsqueeze(0)

content, _ = model.encode(content_image)  # keep the content code of the first image
_, style = model.encode(style_image)      # keep the style code of the second image

combined_image = model.decode(content, style)
save_image(combined_image, 'combined.jpg')

Getting Started

  1. Clone the repository:

    git clone https://github.com/NVlabs/MUNIT.git
    cd MUNIT
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download a pre-trained model or train your own:

    python train.py --config configs/edges2shoes_folder.yaml
    
  4. Use the model for image translation (using the hypothetical wrapper from the Code Examples section; a follow-up sketch after this list shows sampling several styles):

    import torch
    from PIL import Image
    from torchvision import transforms
    from torchvision.utils import save_image
    from MUNIT import MUNIT  # hypothetical wrapper, not an entry point of the official repository

    model = MUNIT.load_from_checkpoint('checkpoints/edges2shoes.ckpt')

    to_tensor = transforms.Compose([transforms.Resize(256), transforms.ToTensor()])
    input_image = to_tensor(Image.open('input.jpg')).unsqueeze(0)
    content, _ = model.encode(input_image)
    style = torch.randn(1, model.style_dim, 1, 1)
    save_image(model.decode(content, style), 'output.jpg')
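
Because MUNIT is multimodal, decoding the same content code with several randomly sampled style codes yields several distinct translations of one input. Continuing with the hypothetical model wrapper and input_image tensor from step 4:

import torch
from torchvision.utils import save_image

content, _ = model.encode(input_image)
for i in range(5):
    style = torch.randn(1, model.style_dim, 1, 1)  # a new random style for each output
    save_image(model.decode(content, style), f'output_{i}.jpg')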
    

Competitor Comparisons

Toward Multimodal Image-to-Image Translation

Pros of BicycleGAN

  • Supports diverse image-to-image translation with explicit style control
  • Combines conditional VAE-GAN and conditional Latent Regressor GAN
  • Provides better mode coverage and sample diversity

Cons of BicycleGAN

  • Limited to paired image-to-image translation tasks
  • May struggle with complex, high-resolution images
  • Requires paired training data, which can be challenging to obtain

Code Comparison

MUNIT:

def forward(self, x_a, x_b):
    c_a, s_a = self.gen_a.encode(x_a)   # content and style codes for domain A
    c_b, s_b = self.gen_b.encode(x_b)   # content and style codes for domain B
    x_ba = self.gen_a.decode(c_b, s_a)  # translate B -> A: B's content, A's style
    x_ab = self.gen_b.decode(c_a, s_b)  # translate A -> B: A's content, B's style
    return x_ab, x_ba

BicycleGAN:

def forward(self, input, z=None):
    # Encode the paired target image into a latent code, and also draw a random one
    z_encoded = self.netE(self.real_B)
    z_random = self.get_z_random(input.size(0), self.nz)
    # Two branches: cVAE-GAN uses the encoded code, cLR-GAN uses the random code
    fake_B = self.netG(input, z_encoded)
    fake_B_random = self.netG(input, z_random)
    return fake_B, fake_B_random

Unsupervised Image-to-Image Translation

Pros of UNIT

  • Pioneered the concept of unsupervised image-to-image translation
  • Simpler architecture, potentially easier to understand and implement
  • Effective for tasks with similar domain structures

Cons of UNIT

  • Limited flexibility in handling multi-modal translations
  • May struggle with more complex domain mappings
  • Less control over style transfer compared to MUNIT

Code Comparison

UNIT (VAE-GAN architecture):

def forward(self, x_a, x_b):
    # Encode each image to a shared latent code h plus VAE noise n
    h_a, n_a = self.gen_a.encode(x_a)
    h_b, n_b = self.gen_b.encode(x_b)
    # Decode the other domain's latent code to perform the translation
    x_ba = self.gen_a.decode(h_b + n_b)
    x_ab = self.gen_b.decode(h_a + n_a)
    return x_ab, x_ba

MUNIT (AdaIN-based architecture):

def forward(self, x_a, x_b):
    # Separate content and style encoders per domain
    c_a = self.enc_c_a(x_a)
    s_a = self.enc_s_a(x_a)
    c_b = self.enc_c_b(x_b)
    s_b = self.enc_s_b(x_b)
    # Cross-domain translation: keep the content, swap in the other domain's style
    x_ab = self.gen_b(c_a, s_b)
    x_ba = self.gen_a(c_b, s_a)
    return x_ab, x_ba

The key difference is MUNIT's separate content and style encoders, allowing for more flexible style transfer and multi-modal translations.
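
The "AdaIN-based" label above refers to Adaptive Instance Normalization, which is how the decoder injects the style code: each feature map is re-normalized with a mean and standard deviation derived from the style. A minimal, standalone version of the operation (standard formulation, not code from the repository):

import torch

def adain(content_feat, style_mean, style_std, eps=1e-5):
    """Adaptive Instance Normalization: re-normalize content features with style statistics."""
    # Per-channel statistics of the content feature map (N, C, H, W)
    mean = content_feat.mean(dim=(2, 3), keepdim=True)
    std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    normalized = (content_feat - mean) / std
    # Scale and shift with style-derived statistics (in MUNIT these are predicted by an MLP from the style code)
    return normalized * style_std + style_mean

feat = torch.randn(1, 64, 32, 32)
style_mean = torch.randn(1, 64, 1, 1)
style_std = torch.rand(1, 64, 1, 1) + 0.5
out = adain(feat, style_mean, style_std)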


Translate images to unseen domains at test time with few example images.

Pros of FUNIT

  • Supports few-shot unsupervised image-to-image translation
  • Can generalize to unseen target classes with just a few examples
  • Achieves higher quality results for novel classes compared to MUNIT

Cons of FUNIT

  • Requires class-labeled images for training, unlike MUNIT
  • May struggle with fine-grained details in some cases
  • More complex architecture, potentially harder to implement and train

Code Comparison

MUNIT:

def forward(self, x_a, x_b):
    c_a = self.enc_c_a(x_a)
    s_a = self.enc_s_a(x_a)
    c_b = self.enc_c_b(x_b)
    s_b = self.enc_s_b(x_b)
    return c_a, s_a, c_b, s_b

FUNIT:

def forward(self, x_s, x_c):
    c_s = self.content_encoder(x_s)  # content code from the source image
    s_c = self.class_encoder(x_c)    # class (appearance) code from the target-class image(s)
    x_f = self.decoder(c_s, s_c)     # translated image: source content, target-class appearance
    return x_f

Both MUNIT and FUNIT are image-to-image translation frameworks from NVIDIA research (NVlabs). MUNIT focuses on multimodal unsupervised translation between two domains, while FUNIT extends this concept to few-shot unsupervised translation across multiple classes. FUNIT's ability to generalize to unseen classes with limited examples makes it more versatile for certain applications, but it requires class-labeled data for training. MUNIT, on the other hand, offers a simpler approach for two-domain translation without the need for class labels.
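
FUNIT's few-shot behavior comes from its class encoder: at test time the class code is computed from just a handful of images of an unseen class, typically by averaging, and decoded together with the source image's content. A hedged sketch of that idea (module names mirror the snippet above and are illustrative, not FUNIT's exact API):

def few_shot_translate(model, content_image, class_examples):
    """content_image: (1, C, H, W) source; class_examples: (K, C, H, W) images of an unseen class."""
    c_s = model.content_encoder(content_image)        # structure of the source image
    class_codes = model.class_encoder(class_examples)  # one class code per example image, shape (K, D)
    s_c = class_codes.mean(dim=0, keepdim=True)         # average over the K examples
    return model.decoder(c_s, s_c)                       # render the content in the new class's appearance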

StarGAN v2 - Official PyTorch Implementation (CVPR 2020)

Pros of StarGAN v2

  • Supports multi-domain image-to-image translation with a single generator
  • Produces higher quality and more diverse outputs than MUNIT
  • Better preserves content details while changing style

Cons of StarGAN v2

  • More complex architecture, potentially harder to implement and train
  • May require more computational resources due to its larger model size
  • Less flexibility in controlling specific attributes independently

Code Comparison

StarGAN v2:

def compute_d_loss(self, x_real, y_org, y_trg, z_trg=None, x_ref=None):
    assert (z_trg is None) != (x_ref is None)
    # ... (implementation details)
    return d_loss, d_losses_latent

MUNIT:

def forward(self, x_a, x_b):
    c_a, s_a = self.gen_a.encode(x_a)
    c_b, s_b = self.gen_b.encode(x_b)
    x_ba = self.gen_a.decode(c_b, s_a)
    x_ab = self.gen_b.decode(c_a, s_b)
    return x_ab, x_ba

The code snippets show that StarGAN v2 focuses on discriminator loss computation, while MUNIT emphasizes encoder-decoder architecture for style transfer between two domains.
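
For context on the snippet above: StarGAN v2 obtains the target style either by feeding random noise and a target domain through a mapping network (latent-guided) or by encoding a reference image (reference-guided), and the single generator then consumes the input image together with that style. A hedged sketch using the paper's terminology rather than the repository's exact call signatures:

import torch

def translate(generator, mapping_network, style_encoder, x_real, y_trg, x_ref=None, latent_dim=16):
    if x_ref is not None:
        s_trg = style_encoder(x_ref, y_trg)          # reference-guided: style from an example image
    else:
        z = torch.randn(x_real.size(0), latent_dim)  # latent-guided: style from random noise
        s_trg = mapping_network(z, y_trg)
    return generator(x_real, s_trg)                  # one generator serves every target domain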

Contrastive unpaired image-to-image translation, faster and lighter training than cyclegan (ECCV 2020, in PyTorch)

Pros of contrastive-unpaired-translation

  • Improved image quality and diversity in translations
  • Better preservation of content and style during translation
  • More stable training process with contrastive learning

Cons of contrastive-unpaired-translation

  • Potentially higher computational requirements
  • May require more fine-tuning for specific datasets
  • Slightly more complex implementation

Code Comparison

MUNIT:

def forward(self, x_a, x_b):
    c_a, s_a = self.gen_a.encode(x_a)   # content and style codes for each domain
    c_b, s_b = self.gen_b.encode(x_b)
    x_ba = self.gen_a.decode(c_b, s_a)  # translate B -> A
    x_ab = self.gen_b.decode(c_a, s_b)  # translate A -> B
    return x_ab, x_ba

contrastive-unpaired-translation (simplified; CUT trains a single direction with a patch-wise contrastive loss instead of cycle reconstruction):

def forward(self):
    # One generator maps A -> B; an identity pass on B feeds the NCE identity loss
    self.fake = self.netG(self.real)               # self.real holds real_A (optionally concatenated with real_B)
    self.fake_B = self.fake[:self.real_A.size(0)]

Both repositories focus on unpaired image-to-image translation, but contrastive-unpaired-translation introduces contrastive learning to improve results. MUNIT uses a multi-modal approach, while contrastive-unpaired-translation emphasizes content preservation and style transfer. The code comparison shows differences in the forward pass implementation, reflecting their distinct approaches to image translation.
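
The contrastive part of contrastive-unpaired-translation is a patch-wise InfoNCE loss: a feature patch of the translated output should match the patch at the same spatial location of the input more closely than any other patch. A simplified version of such a loss (illustrative, not the repository's PatchNCELoss class):

import torch
import torch.nn.functional as F

def patch_nce_loss(feat_out, feat_in, temperature=0.07):
    """feat_out, feat_in: (num_patches, dim) features from output and input at matching locations."""
    feat_out = F.normalize(feat_out, dim=1)
    feat_in = F.normalize(feat_in, dim=1)
    logits = feat_out @ feat_in.t() / temperature  # similarity of every output patch to every input patch
    targets = torch.arange(feat_out.size(0))       # the positive is the patch at the same location
    return F.cross_entropy(logits, targets)

loss = patch_nce_loss(torch.randn(256, 64), torch.randn(256, 64))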


StarGAN - Official PyTorch Implementation (CVPR 2018)

Pros of StarGAN

  • Simpler architecture, making it easier to implement and train
  • Capable of performing multi-domain image-to-image translation with a single model
  • Generally faster inference time due to its unified structure

Cons of StarGAN

  • Limited flexibility in handling diverse and complex transformations
  • May struggle with preserving fine details in some translation tasks
  • Less control over specific style attributes compared to MUNIT

Code Comparison

MUNIT uses separate content and style encoders:

content = self.content_encoder(x_a)
style = self.style_encoder(x_b)
images = self.decoder(content, style)

StarGAN uses a single generator with domain labels:

c_trg = self.label2onehot(target_domain, self.c_dim)
x_fake = self.G(x_real, c_trg)
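
Since the generator is conditioned on a target-domain label, translating one image into every domain is just a loop over one-hot labels with the same network. A hedged, standalone sketch (G here is any generator with StarGAN's image-plus-label interface):

import torch

def translate_to_all_domains(G, x_real, c_dim):
    """Run a StarGAN-style generator once per target-domain label."""
    outputs = []
    for domain in range(c_dim):
        c_trg = torch.zeros(x_real.size(0), c_dim)
        c_trg[:, domain] = 1.0              # one-hot label for the target domain
        outputs.append(G(x_real, c_trg))
    return outputs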

Summary

StarGAN offers a more straightforward approach to multi-domain image translation, while MUNIT provides greater flexibility and control over style attributes. StarGAN is generally faster and easier to implement, but may struggle with complex transformations. MUNIT's separate content and style encoders allow for more nuanced manipulations but at the cost of increased complexity.


README

The code base is no longer maintained.

Please check here for an improved implementation of MUNIT: https://github.com/NVlabs/imaginaire/tree/master/projects/munit

License: CC BY-NC-SA 4.0 | Python 2.7 | Python 3.6

MUNIT: Multimodal UNsupervised Image-to-image Translation

License

Copyright (C) 2018 NVIDIA Corporation. All rights reserved. Licensed under the CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode).

For commercial use, please consult NVIDIA Research Inquiries.

Code usage

Please check out the user manual page.

Paper

Xun Huang, Ming-Yu Liu, Serge Belongie, Jan Kautz, "Multimodal Unsupervised Image-to-Image Translation", ECCV 2018

Results Video

Edges to Shoes/handbags Translation

Animal Image Translation

Street Scene Translation

Yosemite Summer to Winter Translation (HD)

Example-guided Image Translation

Other Implementations

MUNIT-Tensorflow by Junho Kim

MUNIT-keras by shaoanlu

Citation

If you find this code useful for your research, please cite our paper:

@inproceedings{huang2018munit,
  title={Multimodal Unsupervised Image-to-image Translation},
  author={Huang, Xun and Liu, Ming-Yu and Belongie, Serge and Kautz, Jan},
  booktitle={ECCV},
  year={2018}
}