Top Related Projects
PyTorch package for the discrete VAE used for DALL·E.
A latent text-to-image diffusion model
Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
High-Resolution Image Synthesis with Latent Diffusion Models
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch and FLAX.
Quick Overview
DALL·E Mini is an open-source AI model that generates images from text prompts. It's a smaller, more accessible model inspired by OpenAI's DALL·E, designed to run on less powerful hardware while still producing creative and diverse images from textual descriptions.
Pros
- Open-source and freely available for use and modification
- Runs on less powerful hardware compared to full-scale DALL·E
- Produces creative and diverse images from text prompts
- Active community and ongoing development
Cons
- Lower image quality compared to more advanced models like DALL·E 2 or Midjourney
- Limited training data compared to larger models
- Can produce inconsistent or inaccurate results for complex prompts
- Slower generation times compared to more optimized models
Code Examples
# Install the dalle-mini library
!pip install git+https://github.com/borisdayma/dalle-mini.git

# Import necessary modules
from dalle_mini import DalleBart, DalleBartProcessor
from vqgan_jax.modeling_flax_vqgan import VQModel
import jax
import jax.numpy as jnp

# Load the model, the text processor, and the VQGAN image decoder
# (loading mirrors the project's inference notebook: _do_init=False returns (model, params))
model, params = DalleBart.from_pretrained("dalle-mini/dalle-mini", dtype=jnp.float16, _do_init=False)
processor = DalleBartProcessor.from_pretrained("dalle-mini/dalle-mini")
vqgan, vqgan_params = VQModel.from_pretrained("dalle-mini/vqgan_imagenet_f16_16384", _do_init=False)

# Tokenize the text prompt (the processor returns token ids, not pixel values);
# repeating the prompt yields one generated image per batch element
prompt = "a cute cat wearing a hat"
tokenized_prompt = processor([prompt] * 4)

# Sample image tokens, then decode them to pixels with the VQGAN (dropping the BOS token)
encoded_images = model.generate(**tokenized_prompt, prng_key=jax.random.PRNGKey(0), params=params)
decoded_images = vqgan.decode_code(encoded_images.sequences[..., 1:], params=vqgan_params)
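The decoded output is a batch of image arrays with values in [0, 1], as in the reference inference notebook. As a hedged follow-up sketch, assuming NumPy and Pillow are installed, the arrays can be converted to PIL images and saved:

import numpy as np
from PIL import Image

# Clip to [0, 1], convert each decoded array to an 8-bit image, and save it
for i, img in enumerate(np.asarray(decoded_images.clip(0.0, 1.0))):
    Image.fromarray((img * 255).astype(np.uint8)).save(f"generated_{i}.png")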
Getting Started
To get started with DALL·E Mini:
1. Clone the repository:
   git clone https://github.com/borisdayma/dalle-mini.git
   cd dalle-mini
2. Install dependencies:
   pip install -r requirements.txt
3. Run the inference script:
   python inference.py --prompt "Your text prompt here"
This will generate images based on your text prompt and save them in the output directory.
Competitor Comparisons
PyTorch package for the discrete VAE used for DALL·E.
Pros of DALL-E
- Higher quality image generation with more detailed and coherent results
- Larger model with more extensive training data, leading to better understanding of complex prompts
- Supports inpainting and outpainting features for image editing
Cons of DALL-E
- Closed-source and not publicly available for free use or modification
- Limited API access, requiring approval and potentially high costs for usage
- Less flexibility for customization or fine-tuning on specific datasets
Code Comparison
While a direct code comparison is not possible due to DALL-E's closed-source nature, we can contrast a simplified DALL-E Mini snippet with a hypothetical DALL-E API:
DALL-E Mini:
import dalle_mini
model = dalle_mini.load_model()
images = model.generate_images("A cat wearing a hat")
Hypothetical DALL-E API:
import openai
openai.api_key = "your_api_key"
response = openai.Image.create(prompt="A cat wearing a hat")
image_url = response['data'][0]['url']
DALL-E Mini offers a more accessible and open approach, allowing for local usage and modifications. DALL-E, while potentially more powerful, requires API access and may have usage restrictions.
A latent text-to-image diffusion model
Pros of Stable-Diffusion
- Higher image quality and more detailed outputs
- Faster inference time, especially on GPU
- More flexible architecture allowing for various applications (inpainting, image-to-image, etc.)
Cons of Stable-Diffusion
- Requires more computational resources and VRAM
- More complex setup and installation process
- Larger model size, making it less suitable for edge devices
Code Comparison
DALLE-mini (simplified):
from dalle_mini import DalleBart, DalleBartProcessor
model = DalleBart.from_pretrained("dalle-mini/dalle-mini")
processor = DalleBartProcessor.from_pretrained("dalle-mini/dalle-mini")
images = model.generate_images(processor("A cute cat"))
Stable-Diffusion:
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
image = pipe("A cute cat").images[0]
Both repositories offer text-to-image generation capabilities, but Stable-Diffusion generally produces higher quality results with faster inference times. However, it requires more computational resources and has a more complex setup process. DALLE-mini is easier to use and more lightweight, making it suitable for simpler applications or resource-constrained environments. The code comparison shows that Stable-Diffusion has a more streamlined API, while DALLE-mini's approach is slightly more verbose but potentially more flexible for advanced users.
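To make the last point concrete, here is a hedged sketch of DALLE-mini generation with explicit sampling controls. It assumes model, params, and processor are loaded as in the Code Examples section above; the parameter names follow the project's inference notebook and should be treated as assumptions rather than a stable API:

import jax

# Explicit control over sampling, in contrast to a single pipeline call
tokenized = processor(["A cute cat"])
encoded_images = model.generate(
    **tokenized,
    prng_key=jax.random.PRNGKey(0),
    params=params,
    top_k=50,             # sample only from the 50 most likely image tokens
    temperature=0.85,     # sharpen or soften the token distribution
    condition_scale=10.0, # strength of "super conditioning" (classifier-free guidance)
)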
Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
Pros of DALLE2-pytorch
- Implements the more advanced DALL-E 2 architecture
- Offers more flexibility and customization options
- Potentially higher quality image generation
Cons of DALLE2-pytorch
- More complex implementation, requiring deeper understanding
- Heavier computational requirements
- Less documentation and community support compared to dalle-mini
Code Comparison
DALLE2-pytorch:
model = DALLE2(
dim = 512,
image_size = 256,
text_encoder_depth = 6,
text_encoder_heads = 8,
text_encoder_dim_head = 64,
num_tokens = 16384,
# ... more parameters ...
)
dalle-mini:
model = DalleBart.from_pretrained(
"dalle-mini/dalle-mini",
revision=None,
dtype=jnp.float16,
abstract_init=True
)
DALLE2-pytorch offers more granular control over model architecture, while dalle-mini provides a simpler, pre-configured approach. DALLE2-pytorch requires manual setup of various parameters, whereas dalle-mini uses a pre-trained model with fewer configuration options. The DALLE2-pytorch implementation is more suitable for researchers and advanced users looking to experiment with the architecture, while dalle-mini is more accessible for quick deployment and testing.
High-Resolution Image Synthesis with Latent Diffusion Models
Pros of latent-diffusion
- More efficient training and inference due to compression in latent space
- Supports various conditioning types (text, image, class labels)
- Flexible architecture allowing for different applications (image generation, inpainting, super-resolution)
Cons of latent-diffusion
- More complex implementation and setup
- May require more computational resources for initial training
- Less straightforward for beginners to understand and modify
Code Comparison
latent-diffusion:
model = LatentDiffusion(
unet_config,
cond_stage_config,
first_stage_config,
num_timesteps_cond=num_timesteps_cond,
cond_stage_key="caption",
cond_stage_trainable=True,
concat_mode=True,
)
dalle-mini:
model = DalleBart.from_pretrained(
pretrained_model_name_or_path,
revision=revision,
dtype=dtype,
device_map="auto"
)
The latent-diffusion code shows a more complex model initialization with various configurations, while dalle-mini uses a simpler pretrained model loading approach. This reflects the difference in complexity and flexibility between the two projects.
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch and FLAX.
Pros of diffusers
- Broader scope, supporting multiple diffusion models and techniques
- More active development and larger community support
- Extensive documentation and integration with Hugging Face ecosystem
Cons of diffusers
- Higher complexity due to supporting multiple models
- Potentially steeper learning curve for beginners
- May require more computational resources for some models
Code Comparison
dalle-mini:
from dalle_mini import DalleBart, DalleBartProcessor
model = DalleBart.from_pretrained("dalle-mini/dalle-mini")
processor = DalleBartProcessor.from_pretrained("dalle-mini/dalle-mini")
diffusers:
from diffusers import StableDiffusionPipeline
model = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
Summary
diffusers offers a more comprehensive solution for working with various diffusion models, while dalle-mini focuses specifically on the DALL-E Mini model. diffusers benefits from broader community support and integration with the Hugging Face ecosystem, but may be more complex for users seeking a simpler, specific implementation. dalle-mini provides a more straightforward approach for those specifically interested in the DALL-E Mini model, but with potentially less flexibility and ongoing development compared to diffusers.
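As a concrete illustration of the diffusers workflow, here is a minimal sketch built on the StableDiffusionPipeline shown above; the device placement assumes a CUDA GPU is available:

from diffusers import StableDiffusionPipeline

# Load the pretrained pipeline (same model id as above) and move it to the GPU
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU

# Generate and save an image from a text prompt
image = pipe("a cute cat wearing a hat").images[0]
image.save("cat_with_hat.png")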
README
DALL·E Mini
How to use it?
You can use the model on 🖍️ craiyon
How does it work?
Refer to our reports:
- DALL·E mini - Generate Images from Any Text Prompt
- DALL·E mini - Explained
- DALL·E mega - Training Journal
Development
Dependencies Installation
For inference only, use pip install dalle-mini.
For development, clone the repo and use pip install -e ".[dev]".
Before making a PR, check style with make style.
You can experiment with the pipeline step by step through our inference pipeline notebook.
Training of DALL·E mini
Use tools/train/train.py.
You can also adjust the sweep configuration file if you need to perform a hyperparameter search.
FAQ
Where to find the latest models?
Trained models are on the 🤗 Model Hub (a loading sketch follows this list):
- VQGAN-f16-16384 for encoding/decoding images
- DALL·E mini or DALL·E mega for generating images from a text prompt
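As an illustration, the checkpoints can be loaded by name (a minimal sketch; the repository ids below, dalle-mini/dalle-mega and dalle-mini/vqgan_imagenet_f16_16384, are assumed from the Model Hub organization):

import jax.numpy as jnp
from dalle_mini import DalleBart
from vqgan_jax.modeling_flax_vqgan import VQModel

# DALL·E mega (use "dalle-mini/dalle-mini" for the smaller model)
model, params = DalleBart.from_pretrained("dalle-mini/dalle-mega", dtype=jnp.float16, _do_init=False)

# VQGAN checkpoint used to decode generated image tokens back into pixels
vqgan, vqgan_params = VQModel.from_pretrained("dalle-mini/vqgan_imagenet_f16_16384", _do_init=False)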
Where does the logo come from?
The "armchair in the shape of an avocado" was used by OpenAI when releasing DALL脗路E to illustrate the model's capabilities. Having successful predictions on this prompt represents a big milestone for us.
Contributing
Join the community on the LAION Discord. Any contribution is welcome, from reporting issues to proposing fixes/improvements or testing the model with cool prompts!
You can also use these great projects from the community:
- spin off your own app with DALL-E Playground repository (thanks Sahar)
- try DALL·E Flow project for generating, diffusion, and upscaling in a Human-in-the-Loop workflow (thanks Han Xiao)
- run on Replicate, in the browser or via API
Acknowledgements
- 🤗 Hugging Face for organizing the FLAX/JAX community week
- Google TPU Research Cloud (TRC) program for providing computing resources
- Weights & Biases for providing the infrastructure for experiment tracking and model management
Authors & Contributors
DALL·E mini was initially developed by:
- Boris Dayma
- Suraj Patil
- Pedro Cuenca
- Khalid Saifullah
- Tanishq Abraham
- Phúc Lê Khắc
- Luke Melas
- Ritobrata Ghosh
Many thanks to the people who helped make it better:
- the DALLE-Pytorch and EleutherAI communities for testing and exchanging cool ideas
- Rohan Anil for adding Distributed Shampoo optimizer and always giving great suggestions
- Phil Wang for providing many cool implementations of transformer variants and sharing interesting insights with x-transformers
- Katherine Crowson for super conditioning
- the Gradio team for making an amazing UI for our app
Citing DALL·E mini
If you find DALL·E mini useful in your research or wish to refer to it, please use the following BibTeX entry.
@misc{Dayma_DALL·E_Mini_2021,
author = {Dayma, Boris and Patil, Suraj and Cuenca, Pedro and Saifullah, Khalid and Abraham, Tanishq and Lê Khắc, Phúc and Melas, Luke and Ghosh, Ritobrata},
doi = {10.5281/zenodo.5146400},
month = {7},
title = {DALL脗路E Mini},
url = {https://github.com/borisdayma/dalle-mini},
year = {2021}
}
References
Original DALL·E from "Zero-Shot Text-to-Image Generation" with image quantization from "Learning Transferable Visual Models From Natural Language Supervision".
Image encoder from "Taming Transformers for High-Resolution Image Synthesis".
Sequence to sequence model based on "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension" with implementation of a few variants (one of them, RMSNorm, is sketched after the list below):
- "GLU Variants Improve Transformer"
- "Deepnet: Scaling Transformers to 1,000 Layers"
- "NormFormer: Improved Transformer Pretraining with Extra Normalization"
- "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"
- "CogView: Mastering Text-to-Image Generation via Transformers"
- "Root Mean Square Layer Normalization"
- "Sinkformers: Transformers with Doubly Stochastic Attention"
- "Foundation Transformers
Main optimizer (Distributed Shampoo) from "Scalable Second Order Optimization for Deep Learning".
Citations
@misc{ramesh2021zeroshot,
title={Zero-Shot Text-to-Image Generation},
author={Aditya Ramesh and Mikhail Pavlov and Gabriel Goh and Scott Gray and Chelsea Voss and Alec Radford and Mark Chen and Ilya Sutskever},
year={2021},
eprint={2102.12092},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{radford2021learning,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
year={2021},
eprint={2103.00020},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{esser2021taming,
title={Taming Transformers for High-Resolution Image Synthesis},
author={Patrick Esser and Robin Rombach and Björn Ommer},
year={2021},
eprint={2012.09841},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{lewis2019bart,
title={BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension},
author={Mike Lewis and Yinhan Liu and Naman Goyal and Marjan Ghazvininejad and Abdelrahman Mohamed and Omer Levy and Ves Stoyanov and Luke Zettlemoyer},
year={2019},
eprint={1910.13461},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{anil2021scalable,
title={Scalable Second Order Optimization for Deep Learning},
author={Rohan Anil and Vineet Gupta and Tomer Koren and Kevin Regan and Yoram Singer},
year={2021},
eprint={2002.09018},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{shazeer2020glu,
title={GLU Variants Improve Transformer},
author={Noam Shazeer},
year={2020},
url={https://arxiv.org/abs/2002.05202}
}
@misc{wang2022deepnet,
title={DeepNet: Scaling transformers to 1,000 layers},
author={Wang, Hongyu and Ma, Shuming and Dong, Li and Huang, Shaohan and Zhang, Dongdong and Wei, Furu},
year={2022},
eprint={2203.00555},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{shleifer2021normformer,
title={NormFormer: Improved Transformer Pretraining with Extra Normalization},
author={Sam Shleifer and Jason Weston and Myle Ott},
year={2021},
eprint={2110.09456},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{liu2022swinv2,
title={Swin Transformer V2: Scaling Up Capacity and Resolution},
author={Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
booktitle={International Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}
@misc{ding2021cogview,
title = {CogView: Mastering Text-to-Image Generation via Transformers},
author = {Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang},
year = {2021},
eprint = {2105.13290},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
@misc{zhang2019root,
title = {Root Mean Square Layer Normalization},
author = {Biao Zhang and Rico Sennrich},
year = {2019},
eprint = {1910.07467},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}
@misc{sander2021sinkformers,
title = {Sinkformers: Transformers with Doubly Stochastic Attention},
url = {https://arxiv.org/abs/2110.11773},
author = {Sander, Michael E. and Ablin, Pierre and Blondel, Mathieu and Peyré, Gabriel},
publisher = {arXiv},
year = {2021},
}
@misc{shamir2020smooth,
title = {Smooth activations and reproducibility in deep networks},
url = {https://arxiv.org/abs/2010.09931},
author = {Shamir, Gil I. and Lin, Dong and Coviello, Lorenzo},
publisher = {arXiv},
year = {2020},
}
@misc{wang2022foundation,
title = {Foundation Transformers},
url = {https://arxiv.org/abs/2210.06423},
author = {Wang, Hongyu and Ma, Shuming and Huang, Shaohan and Dong, Li and Wang, Wenhui and Peng, Zhiliang and Wu, Yu and Bajaj, Payal and Singhal, Saksham and Benhaim, Alon and Patra, Barun and Liu, Zhun and Chaudhary, Vishrav and Song, Xia and Wei, Furu},
publisher = {arXiv},
year = {2022},
}