Top Related Projects
High-Resolution Image Synthesis with Latent Diffusion Models
Stable Diffusion web UI
🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, and serves as the foundation for multiple commercial products.
Quick Overview
The pesser/stable-diffusion repository is the development repository behind Stable Diffusion, a latent diffusion text-to-image model (the official release lives at CompVis/stable-diffusion). It provides a framework for generating high-quality images from textual descriptions and aims to make the model accessible to researchers and developers.
Pros
- High-quality image generation from text descriptions
- Flexible and customizable implementation
- Active community and regular updates
- Supports various sampling methods and model configurations
Cons
- Requires significant computational resources (GPU recommended)
- Learning curve for users unfamiliar with deep learning concepts
- Limited documentation for advanced features
- Potential legal and ethical concerns regarding generated content
Code Examples
- Basic image generation: load the model and encode a text prompt (paths assume the layout used in Getting Started below; sampling follows in the next example):
import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

# Build the model from its config and load the checkpoint weights
config = OmegaConf.load("configs/stable-diffusion/v1-inference.yaml")
model = instantiate_from_config(config.model)
model.load_state_dict(torch.load("models/ldm/stable-diffusion-v1/model.ckpt")["state_dict"], strict=False)
model = model.cuda().eval()

# Encode the text prompt into conditioning for the diffusion model
prompt = "A beautiful sunset over a calm ocean"
cond = model.get_learned_conditioning([prompt])
- Customizing sampling parameters with the DDIM sampler (continues from the example above; the decoded output can be saved as shown after these examples):
from ldm.models.diffusion.ddim import DDIMSampler

sampler = DDIMSampler(model)
# S = number of DDIM steps; shape = latent shape ([4, 64, 64] corresponds to 512x512 output with the v1 autoencoder)
latents, _ = sampler.sample(S=50, conditioning=cond, batch_size=1, shape=[4, 64, 64], verbose=False)
images = model.decode_first_stage(latents)  # decode latents back to pixel space
- Using a different model configuration, e.g. the inpainting model described later in this README (the config path is an assumption; adjust it to wherever the checkpoint's config lives):
# Checkpoint path follows the inpainting section below; a config.yaml alongside it is assumed
config = OmegaConf.load("models/ldm/inpainting_big/config.yaml")
inpainting_model = instantiate_from_config(config.model)
inpainting_model.load_state_dict(torch.load("models/ldm/inpainting_big/last.ckpt")["state_dict"], strict=False)
# For end-to-end inpainting, the repository provides a script:
#   python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results
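Following the DDIM sampling example above, the decoded tensors lie in [-1, 1] and can be converted to a saved image much like the bundled scripts do (a minimal sketch; assumes the variables from that example and that Pillow and NumPy are installed):
import numpy as np
from PIL import Image

images = torch.clamp((images + 1.0) / 2.0, min=0.0, max=1.0)  # rescale from [-1, 1] to [0, 1]
array = (255.0 * images[0].permute(1, 2, 0).cpu().numpy()).astype(np.uint8)  # CHW -> HWC, 8-bit
Image.fromarray(array).save("sample.png")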
Getting Started
- Clone the repository:
git clone https://github.com/pesser/stable-diffusion.git
cd stable-diffusion
- Create and activate the conda environment (see Requirements below):
conda env create -f environment.yaml
conda activate ldm
- Download the pre-trained model and place it where the inference config expects it:
mkdir -p models/ldm/stable-diffusion-v1/
wget https://github.com/CompVis/stable-diffusion/releases/download/v1-4/sd-v1-4.ckpt -O models/ldm/stable-diffusion-v1/model.ckpt
- Run the example script:
python scripts/txt2img.py --prompt "A beautiful sunset over a calm ocean" --plms
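Since a GPU is strongly recommended, an optional quick check that PyTorch can see one:
python -c "import torch; print(torch.cuda.is_available())"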
Competitor Comparisons
High-Resolution Image Synthesis with Latent Diffusion Models
Pros of stablediffusion
- More actively maintained with frequent updates
- Broader community support and contributions
- Enhanced features and optimizations for improved performance
Cons of stablediffusion
- Potentially less stable due to frequent changes
- May require more setup and configuration
- Larger codebase, which could be more complex to navigate
Code Comparison
stable-diffusion:
import torch
from ldm.util import instantiate_from_config
from omegaconf import OmegaConf

config = OmegaConf.load("configs/stable-diffusion/v1-inference.yaml")
model = instantiate_from_config(config.model)
model.load_state_dict(torch.load("models/ldm/stable-diffusion-v1/model.ckpt")["state_dict"], strict=False)
prompt = "a photo of an astronaut riding a horse on mars"
cond = model.get_learned_conditioning([prompt])
# sampling then runs through a DDIM/PLMS sampler, e.g. via scripts/txt2img.py
stablediffusion (checkpoints can also be driven through the diffusers pipeline):
from torch import autocast
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt).images[0]
Both repositories provide implementations of Stable Diffusion, but stablediffusion offers a more comprehensive and actively developed codebase with additional features and optimizations. However, it may require more setup and could be less stable due to frequent updates. The snippets contrast the configuration-based loading and explicit samplers used in this codebase with the higher-level Hugging Face diffusers pipeline through which the released checkpoints can also be used.
Stable Diffusion web UI
Pros of stable-diffusion-webui
- User-friendly web interface for easier interaction
- Extensive features including inpainting, outpainting, and image-to-image generation
- Active community development with frequent updates and new features
Cons of stable-diffusion-webui
- Higher resource requirements due to additional features
- Potentially more complex setup process for some users
- May have a steeper learning curve for beginners
Code Comparison
stable-diffusion:
from ldm.util import instantiate_from_config
model = instantiate_from_config(config.model)
model.load_state_dict(torch.load(f"{opt.model_path}/model.ckpt")["state_dict"])
stable-diffusion-webui:
import modules.sd_models
sd_model = modules.sd_models.load_model(checkpoint_info)
shared.sd_model = sd_model
The code snippets show different approaches to loading the Stable Diffusion model. stable-diffusion uses a more generic configuration-based approach, while stable-diffusion-webui employs a custom module for model loading, potentially offering more flexibility and integration with the web interface.
🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
Pros of Diffusers
- More comprehensive library with support for multiple diffusion models
- Better documentation and easier integration with other Hugging Face tools
- Active development and frequent updates
Cons of Diffusers
- Potentially slower inference time for some models
- May require more setup and configuration for specific use cases
Code Comparison
Stable-diffusion:
import torch
from ldm.util import instantiate_from_config
from omegaconf import OmegaConf
config = OmegaConf.load("configs/stable-diffusion/v1-inference.yaml")
model = instantiate_from_config(config.model)
Diffusers:
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe(prompt="a photo of an astronaut riding a horse on mars").images[0]
The Diffusers library provides a more streamlined API for using Stable Diffusion, while the original Stable-diffusion repository offers more low-level control. Diffusers is generally easier to use for beginners and integrates well with other Hugging Face tools, but may sacrifice some flexibility compared to the original implementation.
Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, and serves as the foundation for multiple commercial products.
Pros of InvokeAI
- More user-friendly interface with a web-based GUI
- Extensive features including inpainting, outpainting, and image-to-image generation
- Active community with frequent updates and improvements
Cons of InvokeAI
- Larger resource footprint due to additional features
- Steeper learning curve for advanced features
- May be overkill for users seeking a simple Stable Diffusion implementation
Code Comparison
InvokeAI:
from invokeai.app.invocations.baseinvocation import BaseInvocation

class CustomInvocation(BaseInvocation):
    def invoke(self, context):
        # Custom logic here
        return result
Stable-Diffusion:
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

config = OmegaConf.load("configs/stable-diffusion/v1-inference.yaml")
model = instantiate_from_config(config.model)
# images are then generated with a DDIM/PLMS sampler, e.g. via scripts/txt2img.py
InvokeAI provides a structured approach for creating custom invocations, while Stable-Diffusion exposes generation directly through its configuration files and sampling scripts. InvokeAI's code structure is designed to support its extensive feature set, whereas Stable-Diffusion's implementation stays closer to the underlying latent diffusion code for basic image generation tasks.
README
Development repository. Please see CompVis/stable-diffusion for the Stable Diffusion release.
Latent Diffusion Models
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach*,
Andreas Blattmann*,
Dominik Lorenz,
Patrick Esser,
Björn Ommer
* equal contribution
News
April 2022
- Thanks to Katherine Crowson, classifier-free guidance received a ~2x speedup and the PLMS sampler is available. See also this PR.
- Our 1.45B latent diffusion LAION model was integrated into Hugging Face Spaces 🤗 using Gradio. Try out the Web Demo.
- More pre-trained LDMs are available:
  - A 1.45B model trained on the LAION-400M database.
  - A class-conditional model on ImageNet, achieving a FID of 3.6 when using classifier-free guidance. Available via a Colab notebook.
Requirements
A suitable conda environment named ldm can be created and activated with:
conda env create -f environment.yaml
conda activate ldm
Pretrained Models
A general list of all available checkpoints is available via our model zoo below. If you use any of these models in your work, we are always happy to receive a citation.
Text-to-Image
Download the pre-trained weights (5.7GB)
mkdir -p models/ldm/text2img-large/
wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt
and sample with
python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0 --ddim_steps 50
This will save each sample individually as well as a grid of size n_iter x n_samples at the specified output location (default: outputs/txt2img-samples).
Quality, sampling speed and diversity are best controlled via the scale, ddim_steps and ddim_eta arguments. As a rule of thumb, higher values of scale produce better samples at the cost of reduced output diversity. Furthermore, increasing ddim_steps generally also gives higher-quality samples, but returns are diminishing for values > 250. Fast sampling (i.e. low values of ddim_steps) while retaining good quality can be achieved by using --ddim_eta 0.0. Even faster sampling (i.e. even lower values of ddim_steps) while retaining good quality can be achieved by using --ddim_eta 0.0 together with --plms (see Pseudo Numerical Methods for Diffusion Models on Manifolds).
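For instance, a faster run combining --plms with a reduced step count (the exact values here are illustrative, not taken from the original README):
python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --plms --ddim_eta 0.0 --ddim_steps 25 --n_samples 4 --n_iter 4 --scale 5.0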
Beyond 256²
For certain inputs, simply running the model in a convolutional fashion on larger features than it was trained on can sometimes produce interesting results. To try it out, tune the H and W arguments (which will be integer-divided by 8 in order to calculate the corresponding latent size), e.g. run
python scripts/txt2img.py --prompt "a sunset behind a mountain range, vector image" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0
to create a sample of size 384x1024. Note, however, that controllability is reduced compared to the 256x256 setting.
The example below was generated using the above command.
Inpainting
Download the pre-trained weights
wget -O models/ldm/inpainting_big/last.ckpt https://heibox.uni-heidelberg.de/f/4d9ac7ea40c64582b7c9/?dl=1
and sample with
python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results
indir should contain images *.png and masks <image_fname>_mask.png like the examples provided in data/inpainting_examples.
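A minimal sketch of how such an image/mask pair could be prepared (assumes Pillow and NumPy; the convention that white marks the region to be inpainted should be verified against the bundled examples):
import numpy as np
from PIL import Image

# Copy the source image into the input directory
img = Image.open("my_photo.png").convert("RGB")
img.save("data/inpainting_examples/my_photo.png")

# Build a binary mask of the same size; white pixels mark the region to inpaint (assumed convention)
mask = np.zeros((img.height, img.width), dtype=np.uint8)
mask[100:200, 150:300] = 255
Image.fromarray(mask).save("data/inpainting_examples/my_photo_mask.png")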
Class-Conditional ImageNet
Available via a notebook.
Unconditional Models
We also provide a script for sampling from unconditional LDMs (e.g. LSUN, FFHQ, ...). Start it via
CUDA_VISIBLE_DEVICES=<GPU_ID> python scripts/sample_diffusion.py -r models/ldm/<model_spec>/model.ckpt -l <logdir> -n <#samples> --batch_size <batch_size> -c <#ddim steps> -e <#eta>
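As a concrete illustration matching the CelebA-HQ setting from the model zoo below (200 DDIM steps, eta=0); the sample count, batch size and log directory are placeholders:
CUDA_VISIBLE_DEVICES=0 python scripts/sample_diffusion.py -r models/ldm/<model_spec>/model.ckpt -l outputs/celebahq_samples -n 50 --batch_size 10 -c 200 -e 0.0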
Train your own LDMs
Data preparation
Faces
For downloading the CelebA-HQ and FFHQ datasets, proceed as described in the taming-transformers repository.
LSUN
The LSUN datasets can be conveniently downloaded via the script available here.
We performed a custom split into training and validation images, and provide the corresponding filenames at https://ommer-lab.com/files/lsun.zip. After downloading, extract them to ./data/lsun. The beds/cats/churches subsets should also be placed/symlinked at ./data/lsun/bedrooms, ./data/lsun/cats and ./data/lsun/churches, respectively.
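For example (with illustrative source paths), the extracted subsets can be symlinked into place like this:
mkdir -p data/lsun
ln -s /path/to/lsun/bedrooms data/lsun/bedrooms
ln -s /path/to/lsun/cats data/lsun/cats
ln -s /path/to/lsun/churches data/lsun/churches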
ImageNet
The code will try to download (through Academic Torrents) and prepare ImageNet the first time it is used. However, since ImageNet is quite large, this requires a lot of disk space and time. If you already have ImageNet on your disk, you can speed things up by putting the data into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ (which defaults to ~/.cache/autoencoders/data/ILSVRC2012_{split}/data/), where {split} is one of train/validation. It should have the following structure:
${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│   ├── n01440764_10026.JPEG
│   ├── n01440764_10027.JPEG
│   └── ...
├── n01443537
│   ├── n01443537_10007.JPEG
│   ├── n01443537_10014.JPEG
│   └── ...
└── ...
If you haven't extracted the data, you can also place ILSVRC2012_img_train.tar / ILSVRC2012_img_val.tar (or symlinks to them) into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/ / ${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/, which will then be extracted into the above structure without downloading it again. Note that this will only happen if neither a folder ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ nor a file ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready exist. Remove them if you want to force running the dataset preparation again.
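An illustrative setup using symlinks (the tar file locations are placeholders):
mkdir -p ~/.cache/autoencoders/data/ILSVRC2012_train ~/.cache/autoencoders/data/ILSVRC2012_validation
ln -s /path/to/ILSVRC2012_img_train.tar ~/.cache/autoencoders/data/ILSVRC2012_train/
ln -s /path/to/ILSVRC2012_img_val.tar ~/.cache/autoencoders/data/ILSVRC2012_validation/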
Model Training
Logs and checkpoints for trained models are saved to logs/<START_DATE_AND_TIME>_<config_spec>.
Training autoencoder models
Configs for training a KL-regularized autoencoder on ImageNet are provided at configs/autoencoder.
Training can be started by running
CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/autoencoder/<config_spec>.yaml -t --gpus 0,
where <config_spec> is one of {autoencoder_kl_8x8x64 (f=32, d=64), autoencoder_kl_16x16x16 (f=16, d=16), autoencoder_kl_32x32x4 (f=8, d=4), autoencoder_kl_64x64x3 (f=4, d=3)}.
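For instance, training the f=4, d=3 autoencoder on GPU 0 (the GPU id is illustrative):
CUDA_VISIBLE_DEVICES=0 python main.py --base configs/autoencoder/autoencoder_kl_64x64x3.yaml -t --gpus 0,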
For training VQ-regularized models, see the taming-transformers repository.
Training LDMs
In configs/latent-diffusion/ we provide configs for training LDMs on the LSUN, CelebA-HQ, FFHQ and ImageNet datasets.
Training can be started by running
CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/latent-diffusion/<config_spec>.yaml -t --gpus 0,
where <config_spec> is one of {celebahq-ldm-vq-4 (f=4, VQ-reg. autoencoder, spatial size 64x64x3), ffhq-ldm-vq-4 (f=4, VQ-reg. autoencoder, spatial size 64x64x3), lsun_bedrooms-ldm-vq-4 (f=4, VQ-reg. autoencoder, spatial size 64x64x3), lsun_churches-ldm-kl-8 (f=8, KL-reg. autoencoder, spatial size 32x32x4), cin-ldm-vq-8 (f=8, VQ-reg. autoencoder, spatial size 32x32x4)}.
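For instance, training the CelebA-HQ LDM on GPU 0 (the GPU id is illustrative):
CUDA_VISIBLE_DEVICES=0 python main.py --base configs/latent-diffusion/celebahq-ldm-vq-4.yaml -t --gpus 0,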
Model Zoo
Pretrained Autoencoding Models
All models were trained until convergence (no further substantial improvement in rFID).
Model | rFID vs val | train steps | PSNR | PSIM | Link | Comments |
---|---|---|---|---|---|---|
f=4, VQ (Z=8192, d=3) | 0.58 | 533066 | 27.43 +/- 4.26 | 0.53 +/- 0.21 | https://ommer-lab.com/files/latent-diffusion/vq-f4.zip | |
f=4, VQ (Z=8192, d=3) | 1.06 | 658131 | 25.21 +/- 4.17 | 0.72 +/- 0.26 | https://heibox.uni-heidelberg.de/f/9c6681f64bb94338a069/?dl=1 | no attention |
f=8, VQ (Z=16384, d=4) | 1.14 | 971043 | 23.07 +/- 3.99 | 1.17 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/vq-f8.zip | |
f=8, VQ (Z=256, d=4) | 1.49 | 1608649 | 22.35 +/- 3.81 | 1.26 +/- 0.37 | https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip | |
f=16, VQ (Z=16384, d=8) | 5.15 | 1101166 | 20.83 +/- 3.61 | 1.73 +/- 0.43 | https://heibox.uni-heidelberg.de/f/0e42b04e2e904890a9b6/?dl=1 | |
f=4, KL | 0.27 | 176991 | 27.53 +/- 4.54 | 0.55 +/- 0.24 | https://ommer-lab.com/files/latent-diffusion/kl-f4.zip | |
f=8, KL | 0.90 | 246803 | 24.19 +/- 4.19 | 1.02 +/- 0.35 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | |
f=16, KL (d=16) | 0.87 | 442998 | 24.08 +/- 4.22 | 1.07 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/kl-f16.zip | |
f=32, KL (d=64) | 2.04 | 406763 | 22.27 +/- 3.93 | 1.41 +/- 0.40 | https://ommer-lab.com/files/latent-diffusion/kl-f32.zip |
Get the models
Running the following script downloads and extracts all available pretrained autoencoding models.
bash scripts/download_first_stages.sh
The first stage models can then be found in models/first_stage_models/<model_spec>.
Pretrained LDMs
Dataset | Task | Model | FID | IS | Prec | Recall | Link | Comments |
---|---|---|---|---|---|---|---|---|
CelebA-HQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 5.11 (5.11) | 3.29 | 0.72 | 0.49 | https://ommer-lab.com/files/latent-diffusion/celeba.zip | |
FFHQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1) | 4.98 (4.98) | 4.50 (4.50) | 0.73 | 0.50 | https://ommer-lab.com/files/latent-diffusion/ffhq.zip | |
LSUN-Churches | Unconditional Image Synthesis | LDM-KL-8 (400 DDIM steps, eta=0) | 4.02 (4.02) | 2.72 | 0.64 | 0.52 | https://ommer-lab.com/files/latent-diffusion/lsun_churches.zip | |
LSUN-Bedrooms | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1) | 2.95 (3.0) | 2.22 (2.23) | 0.66 | 0.48 | https://ommer-lab.com/files/latent-diffusion/lsun_bedrooms.zip | |
ImageNet | Class-conditional Image Synthesis | LDM-VQ-8 (200 DDIM steps, eta=1) | 7.77(7.76)* /15.82** | 201.56(209.52)* /78.82** | 0.84* / 0.65** | 0.35* / 0.63** | https://ommer-lab.com/files/latent-diffusion/cin.zip | *: w/ guiding, classifier_scale 10; **: w/o guiding; scores in brackets calculated with script provided by ADM |
Conceptual Captions | Text-conditional Image Synthesis | LDM-VQ-f4 (100 DDIM steps, eta=0) | 16.79 | 13.89 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/text2img.zip | finetuned from LAION |
OpenImages | Super-resolution | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/sr_bsr.zip | BSR image degradation |
OpenImages | Layout-to-Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 32.02 | 15.92 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/layout2img_model.zip | |
Landscapes | Semantic Image Synthesis | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis256.zip | |
Landscapes | Semantic Image Synthesis | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis.zip | finetuned on resolution 512x512 |
Get the models
The LDMs listed above can jointly be downloaded and extracted via
bash scripts/download_models.sh
The models can then be found in models/ldm/<model_spec>.
Coming Soon...
- More inference scripts for conditional LDMs.
- In the meantime, you can play with our colab notebook https://colab.research.google.com/drive/1xqzUi2iXQXDqXBHQGP9Mqt2YrYW6cx-J?usp=sharing
Comments
- Our codebase for the diffusion models builds heavily on OpenAI's ADM codebase and https://github.com/lucidrains/denoising-diffusion-pytorch. Thanks for open-sourcing!
- The implementation of the transformer encoder is from x-transformers by lucidrains.
BibTeX
@misc{rombach2021highresolution,
title={High-Resolution Image Synthesis with Latent Diffusion Models},
author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
year={2021},
eprint={2112.10752},
archivePrefix={arXiv},
primaryClass={cs.CV}
}