
NVlabs / MUNIT

Multimodal Unsupervised Image-to-Image Translation


Top Related Projects

  • BicycleGAN: Toward Multimodal Image-to-Image Translation
  • UNIT: Unsupervised Image-to-Image Translation
  • FUNIT: Translate images to unseen domains at test time with few example images
  • StarGAN v2 - Official PyTorch Implementation (CVPR 2020)
  • contrastive-unpaired-translation: Contrastive unpaired image-to-image translation, faster and lighter training than CycleGAN (ECCV 2020, in PyTorch)
  • StarGAN - Official PyTorch Implementation (CVPR 2018)

Quick Overview

MUNIT (Multimodal UNsupervised Image-to-image Translation) is a framework for translating images between visual domains without paired training data. It decomposes each image into a domain-invariant content code and a domain-specific style code, which lets it produce diverse outputs for a single input while keeping content and style under separate control.
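
Conceptually, translation works by recombining one image's content code with a style code drawn from the target domain. The following is a minimal, self-contained sketch of that decomposition; module shapes and names are toy placeholders, not the repository's actual architecture:

import torch
import torch.nn as nn

class TinyMUNITGenerator(nn.Module):
    """Toy illustration of MUNIT's content/style split (not the repository's architecture)."""
    def __init__(self, channels=3, content_dim=64, style_dim=8):
        super().__init__()
        # Content encoder: keeps spatial structure (the domain-invariant "what is where")
        self.content_encoder = nn.Sequential(
            nn.Conv2d(channels, content_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(content_dim, content_dim, 3, padding=1), nn.ReLU(),
        )
        # Style encoder: collapses the image to a small vector (the domain-specific appearance)
        self.style_encoder = nn.Sequential(
            nn.Conv2d(channels, style_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Decoder: rebuilds an image from the content, modulated by the style
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(content_dim + style_dim, channels, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def encode(self, x):
        content = self.content_encoder(x)         # (N, content_dim, H/2, W/2)
        style = self.style_encoder(x).flatten(1)  # (N, style_dim)
        return content, style

    def decode(self, content, style):
        # Broadcast the style vector over the content feature map and decode
        style_map = style[:, :, None, None].expand(-1, -1, *content.shape[2:])
        return self.decoder(torch.cat([content, style_map], dim=1))

gen = TinyMUNITGenerator()
x = torch.randn(1, 3, 64, 64)                # stand-in for an input image
content, _ = gen.encode(x)                   # keep the content of x
new_style = torch.randn(1, 8)                # sample a new style -> a differently styled output
translated = gen.decode(content, new_style)  # same structure, new appearance

In the actual MUNIT generators the decoder injects the style through AdaIN layers and residual blocks rather than simple concatenation, but the encode/decode split is the same.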

Pros

  • Enables unsupervised learning without paired training data
  • Supports diverse and multimodal outputs for a single input image
  • Provides fine-grained control over the generated images by manipulating content and style separately
  • Achieves high-quality results across various image translation tasks

Cons

  • Requires significant computational resources for training
  • May struggle with complex scenes or highly diverse image domains
  • Can sometimes produce unrealistic results or visual artifacts in the generated images
  • Limited by the quality and diversity of the training dataset

Code Examples

The snippets below are an illustrative sketch of the workflow rather than the repository's actual entry points (the official code is driven by train.py, test.py, and YAML configs); MUNIT.load_from_checkpoint, model.encode, and model.decode stand in for whatever checkpoint wrapper you use.

  1. Loading a pre-trained MUNIT model:

from MUNIT import MUNIT  # hypothetical wrapper module, not part of the official repository

model = MUNIT.load_from_checkpoint('path/to/checkpoint.ckpt')

  2. Performing image translation with a randomly sampled style:

import torch
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image

# Convert the input image to a batch tensor before encoding
to_tensor = transforms.Compose([transforms.Resize(256), transforms.ToTensor()])
input_image = to_tensor(Image.open('input.jpg')).unsqueeze(0)

content, _ = model.encode(input_image)         # domain-invariant content code
style = torch.randn(1, model.style_dim, 1, 1)  # randomly sampled style code
output_image = model.decode(content, style)
save_image(output_image, 'output.jpg')

  3. Extracting content from one image and style from another, then combining them:

content_image = to_tensor(Image.open('content.jpg')).unsqueeze(0)
style_image = to_tensor(Image.open('style.jpg')).unsqueeze(0)

content, _ = model.encode(content_image)  # keep the content code of the first image
_, style = model.encode(style_image)      # keep the style code of the second image

combined_image = model.decode(content, style)
save_image(combined_image, 'combined.jpg')

Getting Started

  1. Clone the repository:

    git clone https://github.com/NVlabs/MUNIT.git
    cd MUNIT
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download a pre-trained model or train your own:

    python train.py --config configs/edges2shoes_folder.yaml
    
  4. Use the model for image translation (using the hypothetical wrapper from the Code Examples section; a follow-up sketch after this list shows sampling several styles):

    import torch
    from PIL import Image
    from torchvision import transforms
    from torchvision.utils import save_image
    from MUNIT import MUNIT  # hypothetical wrapper, not an entry point of the official repository

    model = MUNIT.load_from_checkpoint('checkpoints/edges2shoes.ckpt')

    to_tensor = transforms.Compose([transforms.Resize(256), transforms.ToTensor()])
    input_image = to_tensor(Image.open('input.jpg')).unsqueeze(0)
    content, _ = model.encode(input_image)
    style = torch.randn(1, model.style_dim, 1, 1)
    save_image(model.decode(content, style), 'output.jpg')
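
Because MUNIT is multimodal, decoding the same content code with several randomly sampled style codes yields several distinct translations of one input. Continuing with the hypothetical model wrapper and input_image tensor from step 4:

import torch
from torchvision.utils import save_image

content, _ = model.encode(input_image)
for i in range(5):
    style = torch.randn(1, model.style_dim, 1, 1)  # a new random style for each output
    save_image(model.decode(content, style), f'output_{i}.jpg')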
    

Competitor Comparisons

Toward Multimodal Image-to-Image Translation

Pros of BicycleGAN

  • Supports diverse image-to-image translation with explicit style control
  • Combines conditional VAE-GAN and conditional Latent Regressor GAN
  • Provides better mode coverage and sample diversity

Cons of BicycleGAN

  • Limited to paired image-to-image translation tasks
  • May struggle with complex, high-resolution images
  • Requires paired training data, which can be challenging to obtain

Code Comparison

MUNIT:

def forward(self, x_a, x_b):
    c_a, s_a = self.gen_a.encode(x_a)   # content and style codes for domain A
    c_b, s_b = self.gen_b.encode(x_b)   # content and style codes for domain B
    x_ba = self.gen_a.decode(c_b, s_a)  # translate B -> A: B's content, A's style
    x_ab = self.gen_b.decode(c_a, s_b)  # translate A -> B: A's content, B's style
    return x_ab, x_ba

BicycleGAN:

def forward(self, input, z=None):
    # Encode the paired target image into a latent code, and also draw a random one
    z_encoded = self.netE(self.real_B)
    z_random = self.get_z_random(input.size(0), self.nz)
    # Two branches: cVAE-GAN uses the encoded code, cLR-GAN uses the random code
    fake_B = self.netG(input, z_encoded)
    fake_B_random = self.netG(input, z_random)
    return fake_B, fake_B_random

Unsupervised Image-to-Image Translation

Pros of UNIT

  • Pioneered the concept of unsupervised image-to-image translation
  • Simpler architecture, potentially easier to understand and implement
  • Effective for tasks with similar domain structures

Cons of UNIT

  • Limited flexibility in handling multi-modal translations
  • May struggle with more complex domain mappings
  • Less control over style transfer compared to MUNIT

Code Comparison

UNIT (VAE-GAN architecture):

def forward(self, x_a, x_b):
    # Encode each image to a shared latent code h plus VAE noise n
    h_a, n_a = self.gen_a.encode(x_a)
    h_b, n_b = self.gen_b.encode(x_b)
    # Decode the other domain's latent code to perform the translation
    x_ba = self.gen_a.decode(h_b + n_b)
    x_ab = self.gen_b.decode(h_a + n_a)
    return x_ab, x_ba

MUNIT (AdaIN-based architecture):

def forward(self, x_a, x_b):
    # Separate content and style encoders per domain
    c_a = self.enc_c_a(x_a)
    s_a = self.enc_s_a(x_a)
    c_b = self.enc_c_b(x_b)
    s_b = self.enc_s_b(x_b)
    # Cross-domain translation: keep the content, swap in the other domain's style
    x_ab = self.gen_b(c_a, s_b)
    x_ba = self.gen_a(c_b, s_a)
    return x_ab, x_ba

The key difference is MUNIT's separate content and style encoders, allowing for more flexible style transfer and multi-modal translations.
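
The "AdaIN-based" label above refers to Adaptive Instance Normalization, which is how the decoder injects the style code: each feature map is re-normalized with a mean and standard deviation derived from the style. A minimal, standalone version of the operation (standard formulation, not code from the repository):

import torch

def adain(content_feat, style_mean, style_std, eps=1e-5):
    """Adaptive Instance Normalization: re-normalize content features with style statistics."""
    # Per-channel statistics of the content feature map (N, C, H, W)
    mean = content_feat.mean(dim=(2, 3), keepdim=True)
    std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    normalized = (content_feat - mean) / std
    # Scale and shift with style-derived statistics (in MUNIT these are predicted by an MLP from the style code)
    return normalized * style_std + style_mean

feat = torch.randn(1, 64, 32, 32)
style_mean = torch.randn(1, 64, 1, 1)
style_std = torch.rand(1, 64, 1, 1) + 0.5
out = adain(feat, style_mean, style_std)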


Translate images to unseen domains at test time with few example images.

Pros of FUNIT

  • Supports few-shot unsupervised image-to-image translation
  • Can generalize to unseen target classes with just a few examples
  • Achieves higher quality results for novel classes compared to MUNIT

Cons of FUNIT

  • Requires class-labeled images for training, unlike MUNIT
  • May struggle with fine-grained details in some cases
  • More complex architecture, potentially harder to implement and train

Code Comparison

MUNIT:

def forward(self, x_a, x_b):
    c_a = self.enc_c_a(x_a)
    s_a = self.enc_s_a(x_a)
    c_b = self.enc_c_b(x_b)
    s_b = self.enc_s_b(x_b)
    return c_a, s_a, c_b, s_b

FUNIT:

def forward(self, x_s, x_c):
    c_s = self.content_encoder(x_s)  # content code from the source image
    s_c = self.class_encoder(x_c)    # class (appearance) code from the target-class image(s)
    x_f = self.decoder(c_s, s_c)     # translated image: source content, target-class appearance
    return x_f

Both MUNIT and FUNIT are image-to-image translation frameworks from NVIDIA research (NVlabs). MUNIT focuses on multimodal unsupervised translation between two domains, while FUNIT extends this concept to few-shot unsupervised translation across multiple classes. FUNIT's ability to generalize to unseen classes with limited examples makes it more versatile for certain applications, but it requires class-labeled data for training. MUNIT, on the other hand, offers a simpler approach for two-domain translation without the need for class labels.
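
FUNIT's few-shot behavior comes from its class encoder: at test time the class code is computed from just a handful of images of an unseen class, typically by averaging, and decoded together with the source image's content. A hedged sketch of that idea (module names mirror the snippet above and are illustrative, not FUNIT's exact API):

def few_shot_translate(model, content_image, class_examples):
    """content_image: (1, C, H, W) source; class_examples: (K, C, H, W) images of an unseen class."""
    c_s = model.content_encoder(content_image)        # structure of the source image
    class_codes = model.class_encoder(class_examples)  # one class code per example image, shape (K, D)
    s_c = class_codes.mean(dim=0, keepdim=True)         # average over the K examples
    return model.decoder(c_s, s_c)                       # render the content in the new class's appearance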

StarGAN v2 - Official PyTorch Implementation (CVPR 2020)

Pros of StarGAN v2

  • Supports multi-domain image-to-image translation with a single generator
  • Produces higher quality and more diverse outputs than MUNIT
  • Better preserves content details while changing style

Cons of StarGAN v2

  • More complex architecture, potentially harder to implement and train
  • May require more computational resources due to its larger model size
  • Less flexibility in controlling specific attributes independently

Code Comparison

StarGAN v2:

def compute_d_loss(self, x_real, y_org, y_trg, z_trg=None, x_ref=None):
    assert (z_trg is None) != (x_ref is None)
    # ... (implementation details)
    return d_loss, d_losses_latent

MUNIT:

def forward(self, x_a, x_b):
    c_a, s_a = self.gen_a.encode(x_a)
    c_b, s_b = self.gen_b.encode(x_b)
    x_ba = self.gen_a.decode(c_b, s_a)
    x_ab = self.gen_b.decode(c_a, s_b)
    return x_ab, x_ba

The code snippets show that StarGAN v2 focuses on discriminator loss computation, while MUNIT emphasizes encoder-decoder architecture for style transfer between two domains.
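
For context on the snippet above: StarGAN v2 obtains the target style either by feeding random noise and a target domain through a mapping network (latent-guided) or by encoding a reference image (reference-guided), and the single generator then consumes the input image together with that style. A hedged sketch using the paper's terminology rather than the repository's exact call signatures:

import torch

def translate(generator, mapping_network, style_encoder, x_real, y_trg, x_ref=None, latent_dim=16):
    if x_ref is not None:
        s_trg = style_encoder(x_ref, y_trg)          # reference-guided: style from an example image
    else:
        z = torch.randn(x_real.size(0), latent_dim)  # latent-guided: style from random noise
        s_trg = mapping_network(z, y_trg)
    return generator(x_real, s_trg)                  # one generator serves every target domain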

Contrastive unpaired image-to-image translation, faster and lighter training than cyclegan (ECCV 2020, in PyTorch)

Pros of contrastive-unpaired-translation

  • Improved image quality and diversity in translations
  • Better preservation of content and style during translation
  • More stable training process with contrastive learning

Cons of contrastive-unpaired-translation

  • Potentially higher computational requirements
  • May require more fine-tuning for specific datasets
  • Slightly more complex implementation

Code Comparison

MUNIT:

def forward(self, x_a, x_b):
    c_a, s_a = self.gen_a.encode(x_a)   # content and style codes for each domain
    c_b, s_b = self.gen_b.encode(x_b)
    x_ba = self.gen_a.decode(c_b, s_a)  # translate B -> A
    x_ab = self.gen_b.decode(c_a, s_b)  # translate A -> B
    return x_ab, x_ba

contrastive-unpaired-translation (simplified; CUT trains a single direction with a patch-wise contrastive loss instead of cycle reconstruction):

def forward(self):
    # One generator maps A -> B; an identity pass on B feeds the NCE identity loss
    self.fake = self.netG(self.real)               # self.real holds real_A (optionally concatenated with real_B)
    self.fake_B = self.fake[:self.real_A.size(0)]

Both repositories focus on unpaired image-to-image translation, but contrastive-unpaired-translation introduces contrastive learning to improve results. MUNIT uses a multi-modal approach, while contrastive-unpaired-translation emphasizes content preservation and style transfer. The code comparison shows differences in the forward pass implementation, reflecting their distinct approaches to image translation.
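
The contrastive part of contrastive-unpaired-translation is a patch-wise InfoNCE loss: a feature patch of the translated output should match the patch at the same spatial location of the input more closely than any other patch. A simplified version of such a loss (illustrative, not the repository's PatchNCELoss class):

import torch
import torch.nn.functional as F

def patch_nce_loss(feat_out, feat_in, temperature=0.07):
    """feat_out, feat_in: (num_patches, dim) features from output and input at matching locations."""
    feat_out = F.normalize(feat_out, dim=1)
    feat_in = F.normalize(feat_in, dim=1)
    logits = feat_out @ feat_in.t() / temperature  # similarity of every output patch to every input patch
    targets = torch.arange(feat_out.size(0))       # the positive is the patch at the same location
    return F.cross_entropy(logits, targets)

loss = patch_nce_loss(torch.randn(256, 64), torch.randn(256, 64))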


StarGAN - Official PyTorch Implementation (CVPR 2018)

Pros of StarGAN

  • Simpler architecture, making it easier to implement and train
  • Capable of performing multi-domain image-to-image translation with a single model
  • Generally faster inference time due to its unified structure

Cons of StarGAN

  • Limited flexibility in handling diverse and complex transformations
  • May struggle with preserving fine details in some translation tasks
  • Less control over specific style attributes compared to MUNIT

Code Comparison

MUNIT uses separate content and style encoders:

content = self.content_encoder(x_a)
style = self.style_encoder(x_b)
images = self.decoder(content, style)

StarGAN uses a single generator with domain labels:

c_trg = self.label2onehot(target_domain, self.c_dim)
x_fake = self.G(x_real, c_trg)
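
Since the generator is conditioned on a target-domain label, translating one image into every domain is just a loop over one-hot labels with the same network. A hedged, standalone sketch (G here is any generator with StarGAN's image-plus-label interface):

import torch

def translate_to_all_domains(G, x_real, c_dim):
    """Run a StarGAN-style generator once per target-domain label."""
    outputs = []
    for domain in range(c_dim):
        c_trg = torch.zeros(x_real.size(0), c_dim)
        c_trg[:, domain] = 1.0              # one-hot label for the target domain
        outputs.append(G(x_real, c_trg))
    return outputs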

Summary

StarGAN offers a more straightforward approach to multi-domain image translation, while MUNIT provides greater flexibility and control over style attributes. StarGAN is generally faster and easier to implement, but may struggle with complex transformations. MUNIT's separate content and style encoders allow for more nuanced manipulations but at the cost of increased complexity.


README

The code base is no longer maintained.

Please check here for an improved implementation of MUNIT: https://github.com/NVlabs/imaginaire/tree/master/projects/munit

License: CC BY-NC-SA 4.0 | Python 2.7 | Python 3.6

MUNIT: Multimodal UNsupervised Image-to-image Translation

License

Copyright (C) 2018 NVIDIA Corporation. All rights reserved. Licensed under the CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode).

For commercial use, please consult NVIDIA Research Inquiries.

Code usage

Please check out the user manual page.

Paper

Xun Huang, Ming-Yu Liu, Serge Belongie, Jan Kautz, "Multimodal Unsupervised Image-to-Image Translation", ECCV 2018

Results Video

Edges to Shoes/handbags Translation

Animal Image Translation

Street Scene Translation

Yosemite Summer to Winter Translation (HD)

Example-guided Image Translation

Other Implementations

MUNIT-Tensorflow by Junho Kim

MUNIT-keras by shaoanlu

Citation

If you find this code useful for your research, please cite our paper:

@inproceedings{huang2018munit,
  title={Multimodal Unsupervised Image-to-image Translation},
  author={Huang, Xun and Liu, Ming-Yu and Belongie, Serge and Kautz, Jan},
  booktitle={ECCV},
  year={2018}
}