Top Related Projects
PyTorch package for the discrete VAE used for DALL·E.
A latent text-to-image diffusion model
Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
High-Resolution Image Synthesis with Latent Diffusion Models
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch and FLAX.
Quick Overview
DALL·E Mini is an open-source AI model that generates images from text prompts. It's a smaller, more accessible model inspired by OpenAI's DALL·E, designed to run on less powerful hardware while still producing creative and diverse images from textual descriptions.
Pros
- Open-source and freely available for use and modification
- Runs on less powerful hardware compared to full-scale DALL·E
- Produces creative and diverse images from text prompts
- Active community and ongoing development
Cons
- Lower image quality compared to more advanced models like DALL·E 2 or Midjourney
- Limited training data compared to larger models
- Can produce inconsistent or inaccurate results for complex prompts
- Slower generation times compared to more optimized models
Code Examples
# Install the dalle-mini library
!pip install git+https://github.com/borisdayma/dalle-mini.git

# Import necessary modules
from dalle_mini import DalleBart, DalleBartProcessor
from vqgan_jax.modeling_flax_vqgan import VQModel
import jax
import jax.numpy as jnp

# Load the model, the text processor, and the VQGAN image decoder
# (loading mirrors the project's inference notebook: _do_init=False returns (model, params))
model, params = DalleBart.from_pretrained("dalle-mini/dalle-mini", dtype=jnp.float16, _do_init=False)
processor = DalleBartProcessor.from_pretrained("dalle-mini/dalle-mini")
vqgan, vqgan_params = VQModel.from_pretrained("dalle-mini/vqgan_imagenet_f16_16384", _do_init=False)

# Tokenize the text prompt (the processor returns token ids, not pixel values);
# repeating the prompt yields one generated image per batch element
prompt = "a cute cat wearing a hat"
tokenized_prompt = processor([prompt] * 4)

# Sample image tokens, then decode them to pixels with the VQGAN (dropping the BOS token)
encoded_images = model.generate(**tokenized_prompt, prng_key=jax.random.PRNGKey(0), params=params)
decoded_images = vqgan.decode_code(encoded_images.sequences[..., 1:], params=vqgan_params)
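The decoded output is a batch of image arrays with values in [0, 1], as in the reference inference notebook. As a hedged follow-up sketch, assuming NumPy and Pillow are installed, the arrays can be converted to PIL images and saved:

import numpy as np
from PIL import Image

# Clip to [0, 1], convert each decoded array to an 8-bit image, and save it
for i, img in enumerate(np.asarray(decoded_images.clip(0.0, 1.0))):
    Image.fromarray((img * 255).astype(np.uint8)).save(f"generated_{i}.png")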
Getting Started
To get started with DALL·E Mini:
1. Clone the repository:
   git clone https://github.com/borisdayma/dalle-mini.git
   cd dalle-mini
2. Install dependencies:
   pip install -r requirements.txt
3. Run the inference script:
   python inference.py --prompt "Your text prompt here"
This will generate images based on your text prompt and save them in the output directory.
Competitor Comparisons
PyTorch package for the discrete VAE used for DALL·E.
Pros of DALL-E
- Higher quality image generation with more detailed and coherent results
- Larger model with more extensive training data, leading to better understanding of complex prompts
- Supports inpainting and outpainting features for image editing
Cons of DALL-E
- Closed-source and not publicly available for free use or modification
- Limited API access, requiring approval and potentially high costs for usage
- Less flexibility for customization or fine-tuning on specific datasets
Code Comparison
While a direct code comparison is not possible due to DALL-E's closed-source nature, we can contrast a simplified DALL-E Mini snippet with a hypothetical DALL-E API:
DALL-E Mini:
import dalle_mini
model = dalle_mini.load_model()
images = model.generate_images("A cat wearing a hat")
Hypothetical DALL-E API:
import openai
openai.api_key = "your_api_key"
response = openai.Image.create(prompt="A cat wearing a hat")
image_url = response['data'][0]['url']
DALL-E Mini offers a more accessible and open approach, allowing for local usage and modifications. DALL-E, while potentially more powerful, requires API access and may have usage restrictions.
A latent text-to-image diffusion model
Pros of Stable-Diffusion
- Higher image quality and more detailed outputs
- Faster inference time, especially on GPU
- More flexible architecture allowing for various applications (inpainting, image-to-image, etc.)
Cons of Stable-Diffusion
- Requires more computational resources and VRAM
- More complex setup and installation process
- Larger model size, making it less suitable for edge devices
Code Comparison
DALLE-mini (simplified):
from dalle_mini import DalleBart, DalleBartProcessor
model = DalleBart.from_pretrained("dalle-mini/dalle-mini")
processor = DalleBartProcessor.from_pretrained("dalle-mini/dalle-mini")
images = model.generate_images(processor("A cute cat"))
Stable-Diffusion:
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
image = pipe("A cute cat").images[0]
Both repositories offer text-to-image generation capabilities, but Stable-Diffusion generally produces higher quality results with faster inference times. However, it requires more computational resources and has a more complex setup process. DALLE-mini is easier to use and more lightweight, making it suitable for simpler applications or resource-constrained environments. The code comparison shows that Stable-Diffusion has a more streamlined API, while DALLE-mini's approach is slightly more verbose but potentially more flexible for advanced users.
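To make the last point concrete, here is a hedged sketch of DALLE-mini generation with explicit sampling controls. It assumes model, params, and processor are loaded as in the Code Examples section above; the parameter names follow the project's inference notebook and should be treated as assumptions rather than a stable API:

import jax

# Explicit control over sampling, in contrast to a single pipeline call
tokenized = processor(["A cute cat"])
encoded_images = model.generate(
    **tokenized,
    prng_key=jax.random.PRNGKey(0),
    params=params,
    top_k=50,             # sample only from the 50 most likely image tokens
    temperature=0.85,     # sharpen or soften the token distribution
    condition_scale=10.0, # strength of "super conditioning" (classifier-free guidance)
)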
Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
Pros of DALLE2-pytorch
- Implements the more advanced DALL-E 2 architecture
- Offers more flexibility and customization options
- Potentially higher quality image generation
Cons of DALLE2-pytorch
- More complex implementation, requiring deeper understanding
- Heavier computational requirements
- Less documentation and community support compared to dalle-mini
Code Comparison
DALLE2-pytorch:
model = DALLE2(
dim = 512,
image_size = 256,
text_encoder_depth = 6,
text_encoder_heads = 8,
text_encoder_dim_head = 64,
num_tokens = 16384,
# ... more parameters ...
)
dalle-mini:
model = DalleBart.from_pretrained(
"dalle-mini/dalle-mini",
revision=None,
dtype=jnp.float16,
abstract_init=True
)
DALLE2-pytorch offers more granular control over model architecture, while dalle-mini provides a simpler, pre-configured approach. DALLE2-pytorch requires manual setup of various parameters, whereas dalle-mini uses a pre-trained model with fewer configuration options. The DALLE2-pytorch implementation is more suitable for researchers and advanced users looking to experiment with the architecture, while dalle-mini is more accessible for quick deployment and testing.
High-Resolution Image Synthesis with Latent Diffusion Models
Pros of latent-diffusion
- More efficient training and inference due to compression in latent space
- Supports various conditioning types (text, image, class labels)
- Flexible architecture allowing for different applications (image generation, inpainting, super-resolution)
Cons of latent-diffusion
- More complex implementation and setup
- May require more computational resources for initial training
- Less straightforward for beginners to understand and modify
Code Comparison
latent-diffusion:
model = LatentDiffusion(
unet_config,
cond_stage_config,
first_stage_config,
num_timesteps_cond=num_timesteps_cond,
cond_stage_key="caption",
cond_stage_trainable=True,
concat_mode=True,
)
dalle-mini:
model = DalleBart.from_pretrained(
pretrained_model_name_or_path,
revision=revision,
dtype=dtype,
device_map="auto"
)
The latent-diffusion code shows a more complex model initialization with various configurations, while dalle-mini uses a simpler pretrained model loading approach. This reflects the difference in complexity and flexibility between the two projects.
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch and FLAX.
Pros of diffusers
- Broader scope, supporting multiple diffusion models and techniques
- More active development and larger community support
- Extensive documentation and integration with Hugging Face ecosystem
Cons of diffusers
- Higher complexity due to supporting multiple models
- Potentially steeper learning curve for beginners
- May require more computational resources for some models
Code Comparison
dalle-mini:
from dalle_mini import DalleBart, DalleBartProcessor
model = DalleBart.from_pretrained("dalle-mini/dalle-mini")
processor = DalleBartProcessor.from_pretrained("dalle-mini/dalle-mini")
diffusers:
from diffusers import StableDiffusionPipeline
model = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
Summary
diffusers offers a more comprehensive solution for working with various diffusion models, while dalle-mini focuses specifically on the DALL-E Mini model. diffusers benefits from broader community support and integration with the Hugging Face ecosystem, but may be more complex for users seeking a simpler, specific implementation. dalle-mini provides a more straightforward approach for those specifically interested in the DALL-E Mini model, but with potentially less flexibility and ongoing development compared to diffusers.
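As a concrete illustration of the diffusers workflow, here is a minimal sketch built on the StableDiffusionPipeline shown above; the device placement assumes a CUDA GPU is available:

from diffusers import StableDiffusionPipeline

# Load the pretrained pipeline (same model id as above) and move it to the GPU
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU

# Generate and save an image from a text prompt
image = pipe("a cute cat wearing a hat").images[0]
image.save("cat_with_hat.png")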
README
DALL·E Mini
How to use it?
You can use the model on 🖍️ craiyon
How does it work?
Refer to our reports:
- DALL·E mini - Generate Images from Any Text Prompt
- DALL·E mini - Explained
- DALL·E mega - Training Journal
Development
Dependencies Installation
For inference only, use pip install dalle-mini.
For development, clone the repo and use pip install -e ".[dev]".
Before making a PR, check style with make style.
You can experiment with the pipeline step by step through our inference pipeline notebook.
Training of DALL·E mini
Use tools/train/train.py.
You can also adjust the sweep configuration file if you need to perform a hyperparameter search.
FAQ
Where to find the latest models?
Trained models are on the 🤗 Model Hub (a loading sketch follows this list):
- VQGAN-f16-16384 for encoding/decoding images
- DALL·E mini or DALL·E mega for generating images from a text prompt
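As an illustration, the checkpoints can be loaded by name (a minimal sketch; the repository ids below, dalle-mini/dalle-mega and dalle-mini/vqgan_imagenet_f16_16384, are assumed from the Model Hub organization):

import jax.numpy as jnp
from dalle_mini import DalleBart
from vqgan_jax.modeling_flax_vqgan import VQModel

# DALL·E mega (use "dalle-mini/dalle-mini" for the smaller model)
model, params = DalleBart.from_pretrained("dalle-mini/dalle-mega", dtype=jnp.float16, _do_init=False)

# VQGAN checkpoint used to decode generated image tokens back into pixels
vqgan, vqgan_params = VQModel.from_pretrained("dalle-mini/vqgan_imagenet_f16_16384", _do_init=False)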
Where does the logo come from?
The "armchair in the shape of an avocado" was used by OpenAI when releasing DALL脗路E to illustrate the model's capabilities. Having successful predictions on this prompt represents a big milestone for us.
Contributing
Join the community on the LAION Discord. Any contribution is welcome, from reporting issues to proposing fixes/improvements or testing the model with cool prompts!
You can also use these great projects from the community:
- spin off your own app with DALL-E Playground repository (thanks Sahar)
- try DALL·E Flow project for generating, diffusion, and upscaling in a Human-in-the-Loop workflow (thanks Han Xiao)
- run on Replicate, in the browser or via API
Acknowledgements
- 🤗 Hugging Face for organizing the FLAX/JAX community week
- Google TPU Research Cloud (TRC) program for providing computing resources
- Weights & Biases for providing the infrastructure for experiment tracking and model management
Authors & Contributors
DALL·E mini was initially developed by:
- Boris Dayma
- Suraj Patil
- Pedro Cuenca
- Khalid Saifullah
- Tanishq Abraham
- Phúc Lê Khắc
- Luke Melas
- Ritobrata Ghosh
Many thanks to the people who helped make it better:
- the DALLE-Pytorch and EleutherAI communities for testing and exchanging cool ideas
- Rohan Anil for adding Distributed Shampoo optimizer and always giving great suggestions
- Phil Wang for providing many cool implementations of transformer variants and sharing interesting insights with x-transformers
- Katherine Crowson for super conditioning
- the Gradio team for making an amazing UI for our app
Citing DALL·E mini
If you find DALL·E mini useful in your research or wish to refer to it, please use the following BibTeX entry.
@misc{Dayma_DALL·E_Mini_2021,
author = {Dayma, Boris and Patil, Suraj and Cuenca, Pedro and Saifullah, Khalid and Abraham, Tanishq and Lê Khắc, Phúc and Melas, Luke and Ghosh, Ritobrata},
doi = {10.5281/zenodo.5146400},
month = {7},
title = {DALL脗路E Mini},
url = {https://github.com/borisdayma/dalle-mini},
year = {2021}
}
References
Original DALL·E from "Zero-Shot Text-to-Image Generation" with image quantization from "Learning Transferable Visual Models From Natural Language Supervision".
Image encoder from "Taming Transformers for High-Resolution Image Synthesis".
Sequence to sequence model based on "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension" with implementation of a few variants (one of them, RMSNorm, is sketched after the list below):
- "GLU Variants Improve Transformer"
- "Deepnet: Scaling Transformers to 1,000 Layers"
- "NormFormer: Improved Transformer Pretraining with Extra Normalization"
- "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"
- "CogView: Mastering Text-to-Image Generation via Transformers"
- "Root Mean Square Layer Normalization"
- "Sinkformers: Transformers with Doubly Stochastic Attention"
- "Foundation Transformers
Main optimizer (Distributed Shampoo) from "Scalable Second Order Optimization for Deep Learning".
Citations
@misc{ramesh2021zeroshot,
title={Zero-Shot Text-to-Image Generation},
author={Aditya Ramesh and Mikhail Pavlov and Gabriel Goh and Scott Gray and Chelsea Voss and Alec Radford and Mark Chen and Ilya Sutskever},
year={2021},
eprint={2102.12092},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{radford2021learning,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
year={2021},
eprint={2103.00020},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{esser2021taming,
title={Taming Transformers for High-Resolution Image Synthesis},
author={Patrick Esser and Robin Rombach and Björn Ommer},
year={2021},
eprint={2012.09841},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{lewis2019bart,
title={BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension},
author={Mike Lewis and Yinhan Liu and Naman Goyal and Marjan Ghazvininejad and Abdelrahman Mohamed and Omer Levy and Ves Stoyanov and Luke Zettlemoyer},
year={2019},
eprint={1910.13461},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{anil2021scalable,
title={Scalable Second Order Optimization for Deep Learning},
author={Rohan Anil and Vineet Gupta and Tomer Koren and Kevin Regan and Yoram Singer},
year={2021},
eprint={2002.09018},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{shazeer2020glu,
title={GLU Variants Improve Transformer},
author={Noam Shazeer},
year={2020},
url={https://arxiv.org/abs/2002.05202}
}
@misc{wang2022deepnet,
title={DeepNet: Scaling transformers to 1,000 layers},
author={Wang, Hongyu and Ma, Shuming and Dong, Li and Huang, Shaohan and Zhang, Dongdong and Wei, Furu},
year={2022},
eprint={2203.00555},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{shleifer2021normformer,
title={NormFormer: Improved Transformer Pretraining with Extra Normalization},
author={Sam Shleifer and Jason Weston and Myle Ott},
year={2021},
eprint={2110.09456},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{liu2022swinv2,
title={Swin Transformer V2: Scaling Up Capacity and Resolution},
author={Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
booktitle={International Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}
@misc{ding2021cogview,
title = {CogView: Mastering Text-to-Image Generation via Transformers},
author = {Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang},
year = {2021},
eprint = {2105.13290},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
@misc{zhang2019root,
title = {Root Mean Square Layer Normalization},
author = {Biao Zhang and Rico Sennrich},
year = {2019},
eprint = {1910.07467},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}
@misc{sander2021sinkformers,
title = {Sinkformers: Transformers with Doubly Stochastic Attention},
url = {https://arxiv.org/abs/2110.11773},
author = {Sander, Michael E. and Ablin, Pierre and Blondel, Mathieu and Peyré, Gabriel},
publisher = {arXiv},
year = {2021},
}
@misc{shamir2020smooth,
title = {Smooth activations and reproducibility in deep networks},
url = {https://arxiv.org/abs/2010.09931},
author = {Shamir, Gil I. and Lin, Dong and Coviello, Lorenzo},
publisher = {arXiv},
year = {2020},
}
@misc{wang2022foundation,
title = {Foundation Transformers},
url = {https://arxiv.org/abs/2210.06423},
author = {Wang, Hongyu and Ma, Shuming and Huang, Shaohan and Dong, Li and Wang, Wenhui and Peng, Zhiliang and Wu, Yu and Bajaj, Payal and Singhal, Saksham and Benhaim, Alon and Patra, Barun and Liu, Zhun and Chaudhary, Vishrav and Song, Xia and Wei, Furu},
publisher = {arXiv},
year = {2022},
}