Top Related Projects
High-Resolution Image Synthesis with Latent Diffusion Models
Stable Diffusion web UI
🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, and serves as the foundation for multiple commercial products.
Quick Overview
The pesser/stable-diffusion repository is the development repository behind Stable Diffusion, a latent diffusion text-to-image model (the official release lives at CompVis/stable-diffusion). It provides a framework for generating high-quality images from textual descriptions and aims to make the model accessible to researchers and developers.
Pros
- High-quality image generation from text descriptions
- Flexible and customizable implementation
- Active community and regular updates
- Supports various sampling methods and model configurations
Cons
- Requires significant computational resources (GPU recommended)
- Learning curve for users unfamiliar with deep learning concepts
- Limited documentation for advanced features
- Potential legal and ethical concerns regarding generated content
Code Examples
- Basic image generation: load the model and encode a text prompt (paths assume the layout used in Getting Started below; sampling follows in the next example):
import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

# Build the model from its config and load the checkpoint weights
config = OmegaConf.load("configs/stable-diffusion/v1-inference.yaml")
model = instantiate_from_config(config.model)
model.load_state_dict(torch.load("models/ldm/stable-diffusion-v1/model.ckpt")["state_dict"], strict=False)
model = model.cuda().eval()

# Encode the text prompt into conditioning for the diffusion model
prompt = "A beautiful sunset over a calm ocean"
cond = model.get_learned_conditioning([prompt])
- Customizing sampling parameters with the DDIM sampler (continues from the example above; the decoded output can be saved as shown after these examples):
from ldm.models.diffusion.ddim import DDIMSampler

sampler = DDIMSampler(model)
# S = number of DDIM steps; shape = latent shape ([4, 64, 64] corresponds to 512x512 output with the v1 autoencoder)
latents, _ = sampler.sample(S=50, conditioning=cond, batch_size=1, shape=[4, 64, 64], verbose=False)
images = model.decode_first_stage(latents)  # decode latents back to pixel space
- Using a different model configuration, e.g. the inpainting model described later in this README (the config path is an assumption; adjust it to wherever the checkpoint's config lives):
# Checkpoint path follows the inpainting section below; a config.yaml alongside it is assumed
config = OmegaConf.load("models/ldm/inpainting_big/config.yaml")
inpainting_model = instantiate_from_config(config.model)
inpainting_model.load_state_dict(torch.load("models/ldm/inpainting_big/last.ckpt")["state_dict"], strict=False)
# For end-to-end inpainting, the repository provides a script:
#   python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results
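Following the DDIM sampling example above, the decoded tensors lie in [-1, 1] and can be converted to a saved image much like the bundled scripts do (a minimal sketch; assumes the variables from that example and that Pillow and NumPy are installed):
import numpy as np
from PIL import Image

images = torch.clamp((images + 1.0) / 2.0, min=0.0, max=1.0)  # rescale from [-1, 1] to [0, 1]
array = (255.0 * images[0].permute(1, 2, 0).cpu().numpy()).astype(np.uint8)  # CHW -> HWC, 8-bit
Image.fromarray(array).save("sample.png")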
Getting Started
- Clone the repository:
git clone https://github.com/pesser/stable-diffusion.git
cd stable-diffusion
- Create and activate the conda environment (see Requirements below):
conda env create -f environment.yaml
conda activate ldm
- Download the pre-trained model and place it where the inference config expects it:
mkdir -p models/ldm/stable-diffusion-v1/
wget https://github.com/CompVis/stable-diffusion/releases/download/v1-4/sd-v1-4.ckpt -O models/ldm/stable-diffusion-v1/model.ckpt
- Run the example script:
python scripts/txt2img.py --prompt "A beautiful sunset over a calm ocean" --plms
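Since a GPU is strongly recommended, an optional quick check that PyTorch can see one:
python -c "import torch; print(torch.cuda.is_available())"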
Competitor Comparisons
High-Resolution Image Synthesis with Latent Diffusion Models
Pros of stablediffusion
- More actively maintained with frequent updates
- Broader community support and contributions
- Enhanced features and optimizations for improved performance
Cons of stablediffusion
- Potentially less stable due to frequent changes
- May require more setup and configuration
- Larger codebase, which could be more complex to navigate
Code Comparison
stable-diffusion:
import torch
from ldm.util import instantiate_from_config
from omegaconf import OmegaConf

config = OmegaConf.load("configs/stable-diffusion/v1-inference.yaml")
model = instantiate_from_config(config.model)
model.load_state_dict(torch.load("models/ldm/stable-diffusion-v1/model.ckpt")["state_dict"], strict=False)
prompt = "a photo of an astronaut riding a horse on mars"
cond = model.get_learned_conditioning([prompt])
# sampling then runs through a DDIM/PLMS sampler, e.g. via scripts/txt2img.py
stablediffusion (checkpoints can also be driven through the diffusers pipeline):
from torch import autocast
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt).images[0]
Both repositories provide implementations of Stable Diffusion, but stablediffusion offers a more comprehensive and actively developed codebase with additional features and optimizations. However, it may require more setup and could be less stable due to frequent updates. The snippets contrast the configuration-based loading and explicit samplers used in this codebase with the higher-level Hugging Face diffusers pipeline through which the released checkpoints can also be used.
Stable Diffusion web UI
Pros of stable-diffusion-webui
- User-friendly web interface for easier interaction
- Extensive features including inpainting, outpainting, and image-to-image generation
- Active community development with frequent updates and new features
Cons of stable-diffusion-webui
- Higher resource requirements due to additional features
- Potentially more complex setup process for some users
- May have a steeper learning curve for beginners
Code Comparison
stable-diffusion:
from ldm.util import instantiate_from_config
model = instantiate_from_config(config.model)
model.load_state_dict(torch.load(f"{opt.model_path}/model.ckpt")["state_dict"])
stable-diffusion-webui:
import modules.sd_models
sd_model = modules.sd_models.load_model(checkpoint_info)
shared.sd_model = sd_model
The code snippets show different approaches to loading the Stable Diffusion model. stable-diffusion uses a more generic configuration-based approach, while stable-diffusion-webui employs a custom module for model loading, potentially offering more flexibility and integration with the web interface.
🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
Pros of Diffusers
- More comprehensive library with support for multiple diffusion models
- Better documentation and easier integration with other Hugging Face tools
- Active development and frequent updates
Cons of Diffusers
- Potentially slower inference time for some models
- May require more setup and configuration for specific use cases
Code Comparison
Stable-diffusion:
import torch
from ldm.util import instantiate_from_config
from omegaconf import OmegaConf
config = OmegaConf.load("configs/stable-diffusion/v1-inference.yaml")
model = instantiate_from_config(config.model)
Diffusers:
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe(prompt="a photo of an astronaut riding a horse on mars").images[0]
The Diffusers library provides a more streamlined API for using Stable Diffusion, while the original Stable-diffusion repository offers more low-level control. Diffusers is generally easier to use for beginners and integrates well with other Hugging Face tools, but may sacrifice some flexibility compared to the original implementation.
Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, and serves as the foundation for multiple commercial products.
Pros of InvokeAI
- More user-friendly interface with a web-based GUI
- Extensive features including inpainting, outpainting, and image-to-image generation
- Active community with frequent updates and improvements
Cons of InvokeAI
- Larger resource footprint due to additional features
- Steeper learning curve for advanced features
- May be overkill for users seeking a simple Stable Diffusion implementation
Code Comparison
InvokeAI:
from invokeai.app.invocations.baseinvocation import BaseInvocation

class CustomInvocation(BaseInvocation):
    def invoke(self, context):
        # Custom logic here
        return result
Stable-Diffusion:
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

config = OmegaConf.load("configs/stable-diffusion/v1-inference.yaml")
model = instantiate_from_config(config.model)
# images are then generated with a DDIM/PLMS sampler, e.g. via scripts/txt2img.py
InvokeAI provides a structured approach for creating custom invocations, while Stable-Diffusion exposes generation directly through its configuration files and sampling scripts. InvokeAI's code structure is designed to support its extensive feature set, whereas Stable-Diffusion's implementation stays closer to the underlying latent diffusion code for basic image generation tasks.
README
Development repository. Please see CompVis/stable-diffusion for the Stable Diffusion release.
Latent Diffusion Models
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach*,
Andreas Blattmann*,
Dominik Lorenz,
Patrick Esser,
Björn Ommer
* equal contribution
News
April 2022
- Thanks to Katherine Crowson, classifier-free guidance received a ~2x speedup and the PLMS sampler is available. See also this PR.
- Our 1.45B latent diffusion LAION model was integrated into Hugging Face Spaces 🤗 using Gradio. Try out the Web Demo.
- More pre-trained LDMs are available:
  - A 1.45B model trained on the LAION-400M database.
  - A class-conditional model on ImageNet, achieving a FID of 3.6 when using classifier-free guidance. Available via a Colab notebook.
Requirements
A suitable conda environment named ldm can be created and activated with:
conda env create -f environment.yaml
conda activate ldm
Pretrained Models
A general list of all available checkpoints is available via our model zoo below. If you use any of these models in your work, we are always happy to receive a citation.
Text-to-Image
Download the pre-trained weights (5.7GB)
mkdir -p models/ldm/text2img-large/
wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt
and sample with
python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0 --ddim_steps 50
This will save each sample individually as well as a grid of size n_iter x n_samples at the specified output location (default: outputs/txt2img-samples).
Quality, sampling speed and diversity are best controlled via the scale, ddim_steps and ddim_eta arguments. As a rule of thumb, higher values of scale produce better samples at the cost of reduced output diversity. Furthermore, increasing ddim_steps generally also gives higher-quality samples, but returns are diminishing for values > 250. Fast sampling (i.e. low values of ddim_steps) while retaining good quality can be achieved by using --ddim_eta 0.0. Even faster sampling (i.e. even lower values of ddim_steps) while retaining good quality can be achieved by using --ddim_eta 0.0 together with --plms (see Pseudo Numerical Methods for Diffusion Models on Manifolds).
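For instance, a faster run combining --plms with a reduced step count (the exact values here are illustrative, not taken from the original README):
python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --plms --ddim_eta 0.0 --ddim_steps 25 --n_samples 4 --n_iter 4 --scale 5.0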
Beyond 256²
For certain inputs, simply running the model in a convolutional fashion on larger features than it was trained on can sometimes produce interesting results. To try it out, tune the H and W arguments (which will be integer-divided by 8 in order to calculate the corresponding latent size), e.g. run
python scripts/txt2img.py --prompt "a sunset behind a mountain range, vector image" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0
to create a sample of size 384x1024. Note, however, that controllability is reduced compared to the 256x256 setting.
The example below was generated using the above command.
Inpainting
Download the pre-trained weights
wget -O models/ldm/inpainting_big/last.ckpt https://heibox.uni-heidelberg.de/f/4d9ac7ea40c64582b7c9/?dl=1
and sample with
python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results
indir should contain images *.png and masks <image_fname>_mask.png like the examples provided in data/inpainting_examples.
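A minimal sketch of how such an image/mask pair could be prepared (assumes Pillow and NumPy; the convention that white marks the region to be inpainted should be verified against the bundled examples):
import numpy as np
from PIL import Image

# Copy the source image into the input directory
img = Image.open("my_photo.png").convert("RGB")
img.save("data/inpainting_examples/my_photo.png")

# Build a binary mask of the same size; white pixels mark the region to inpaint (assumed convention)
mask = np.zeros((img.height, img.width), dtype=np.uint8)
mask[100:200, 150:300] = 255
Image.fromarray(mask).save("data/inpainting_examples/my_photo_mask.png")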
Class-Conditional ImageNet
Available via a notebook.
Unconditional Models
We also provide a script for sampling from unconditional LDMs (e.g. LSUN, FFHQ, ...). Start it via
CUDA_VISIBLE_DEVICES=<GPU_ID> python scripts/sample_diffusion.py -r models/ldm/<model_spec>/model.ckpt -l <logdir> -n <#samples> --batch_size <batch_size> -c <#ddim steps> -e <#eta>
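As a concrete illustration matching the CelebA-HQ setting from the model zoo below (200 DDIM steps, eta=0); the sample count, batch size and log directory are placeholders:
CUDA_VISIBLE_DEVICES=0 python scripts/sample_diffusion.py -r models/ldm/<model_spec>/model.ckpt -l outputs/celebahq_samples -n 50 --batch_size 10 -c 200 -e 0.0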
Train your own LDMs
Data preparation
Faces
For downloading the CelebA-HQ and FFHQ datasets, proceed as described in the taming-transformers repository.
LSUN
The LSUN datasets can be conveniently downloaded via the script available here.
We performed a custom split into training and validation images, and provide the corresponding filenames at https://ommer-lab.com/files/lsun.zip. After downloading, extract them to ./data/lsun. The beds/cats/churches subsets should also be placed/symlinked at ./data/lsun/bedrooms, ./data/lsun/cats and ./data/lsun/churches, respectively.
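For example (with illustrative source paths), the extracted subsets can be symlinked into place like this:
mkdir -p data/lsun
ln -s /path/to/lsun/bedrooms data/lsun/bedrooms
ln -s /path/to/lsun/cats data/lsun/cats
ln -s /path/to/lsun/churches data/lsun/churches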
ImageNet
The code will try to download (through Academic Torrents) and prepare ImageNet the first time it is used. However, since ImageNet is quite large, this requires a lot of disk space and time. If you already have ImageNet on your disk, you can speed things up by putting the data into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ (which defaults to ~/.cache/autoencoders/data/ILSVRC2012_{split}/data/), where {split} is one of train/validation. It should have the following structure:
${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│   ├── n01440764_10026.JPEG
│   ├── n01440764_10027.JPEG
│   └── ...
├── n01443537
│   ├── n01443537_10007.JPEG
│   ├── n01443537_10014.JPEG
│   └── ...
└── ...
If you haven't extracted the data, you can also place ILSVRC2012_img_train.tar / ILSVRC2012_img_val.tar (or symlinks to them) into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/ / ${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/, which will then be extracted into the above structure without downloading it again. Note that this will only happen if neither a folder ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ nor a file ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready exist. Remove them if you want to force running the dataset preparation again.
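An illustrative setup using symlinks (the tar file locations are placeholders):
mkdir -p ~/.cache/autoencoders/data/ILSVRC2012_train ~/.cache/autoencoders/data/ILSVRC2012_validation
ln -s /path/to/ILSVRC2012_img_train.tar ~/.cache/autoencoders/data/ILSVRC2012_train/
ln -s /path/to/ILSVRC2012_img_val.tar ~/.cache/autoencoders/data/ILSVRC2012_validation/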
Model Training
Logs and checkpoints for trained models are saved to logs/<START_DATE_AND_TIME>_<config_spec>.
Training autoencoder models
Configs for training a KL-regularized autoencoder on ImageNet are provided at configs/autoencoder.
Training can be started by running
CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/autoencoder/<config_spec>.yaml -t --gpus 0,
where <config_spec> is one of {autoencoder_kl_8x8x64 (f=32, d=64), autoencoder_kl_16x16x16 (f=16, d=16), autoencoder_kl_32x32x4 (f=8, d=4), autoencoder_kl_64x64x3 (f=4, d=3)}.
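For instance, training the f=4, d=3 autoencoder on GPU 0 (the GPU id is illustrative):
CUDA_VISIBLE_DEVICES=0 python main.py --base configs/autoencoder/autoencoder_kl_64x64x3.yaml -t --gpus 0,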
For training VQ-regularized models, see the taming-transformers repository.
Training LDMs
In configs/latent-diffusion/ we provide configs for training LDMs on the LSUN, CelebA-HQ, FFHQ and ImageNet datasets.
Training can be started by running
CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/latent-diffusion/<config_spec>.yaml -t --gpus 0,
where <config_spec> is one of {celebahq-ldm-vq-4 (f=4, VQ-reg. autoencoder, spatial size 64x64x3), ffhq-ldm-vq-4 (f=4, VQ-reg. autoencoder, spatial size 64x64x3), lsun_bedrooms-ldm-vq-4 (f=4, VQ-reg. autoencoder, spatial size 64x64x3), lsun_churches-ldm-kl-8 (f=8, KL-reg. autoencoder, spatial size 32x32x4), cin-ldm-vq-8 (f=8, VQ-reg. autoencoder, spatial size 32x32x4)}.
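For instance, training the CelebA-HQ LDM on GPU 0 (the GPU id is illustrative):
CUDA_VISIBLE_DEVICES=0 python main.py --base configs/latent-diffusion/celebahq-ldm-vq-4.yaml -t --gpus 0,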
Model Zoo
Pretrained Autoencoding Models
All models were trained until convergence (no further substantial improvement in rFID).
Model | rFID vs val | train steps | PSNR | PSIM | Link | Comments |
---|---|---|---|---|---|---|
f=4, VQ (Z=8192, d=3) | 0.58 | 533066 | 27.43 +/- 4.26 | 0.53 +/- 0.21 | https://ommer-lab.com/files/latent-diffusion/vq-f4.zip | |
f=4, VQ (Z=8192, d=3) | 1.06 | 658131 | 25.21 +/- 4.17 | 0.72 +/- 0.26 | https://heibox.uni-heidelberg.de/f/9c6681f64bb94338a069/?dl=1 | no attention |
f=8, VQ (Z=16384, d=4) | 1.14 | 971043 | 23.07 +/- 3.99 | 1.17 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/vq-f8.zip | |
f=8, VQ (Z=256, d=4) | 1.49 | 1608649 | 22.35 +/- 3.81 | 1.26 +/- 0.37 | https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip | |
f=16, VQ (Z=16384, d=8) | 5.15 | 1101166 | 20.83 +/- 3.61 | 1.73 +/- 0.43 | https://heibox.uni-heidelberg.de/f/0e42b04e2e904890a9b6/?dl=1 | |
f=4, KL | 0.27 | 176991 | 27.53 +/- 4.54 | 0.55 +/- 0.24 | https://ommer-lab.com/files/latent-diffusion/kl-f4.zip | |
f=8, KL | 0.90 | 246803 | 24.19 +/- 4.19 | 1.02 +/- 0.35 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | |
f=16, KL (d=16) | 0.87 | 442998 | 24.08 +/- 4.22 | 1.07 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/kl-f16.zip | |
f=32, KL (d=64) | 2.04 | 406763 | 22.27 +/- 3.93 | 1.41 +/- 0.40 | https://ommer-lab.com/files/latent-diffusion/kl-f32.zip |
Get the models
Running the following script downloads and extracts all available pretrained autoencoding models.
bash scripts/download_first_stages.sh
The first stage models can then be found in models/first_stage_models/<model_spec>.
Pretrained LDMs
Dataset | Task | Model | FID | IS | Prec | Recall | Link | Comments |
---|---|---|---|---|---|---|---|---|
CelebA-HQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 5.11 (5.11) | 3.29 | 0.72 | 0.49 | https://ommer-lab.com/files/latent-diffusion/celeba.zip | |
FFHQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1) | 4.98 (4.98) | 4.50 (4.50) | 0.73 | 0.50 | https://ommer-lab.com/files/latent-diffusion/ffhq.zip | |
LSUN-Churches | Unconditional Image Synthesis | LDM-KL-8 (400 DDIM steps, eta=0) | 4.02 (4.02) | 2.72 | 0.64 | 0.52 | https://ommer-lab.com/files/latent-diffusion/lsun_churches.zip | |
LSUN-Bedrooms | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1) | 2.95 (3.0) | 2.22 (2.23) | 0.66 | 0.48 | https://ommer-lab.com/files/latent-diffusion/lsun_bedrooms.zip | |
ImageNet | Class-conditional Image Synthesis | LDM-VQ-8 (200 DDIM steps, eta=1) | 7.77(7.76)* /15.82** | 201.56(209.52)* /78.82** | 0.84* / 0.65** | 0.35* / 0.63** | https://ommer-lab.com/files/latent-diffusion/cin.zip | *: w/ guiding, classifier_scale 10; **: w/o guiding; scores in brackets calculated with script provided by ADM |
Conceptual Captions | Text-conditional Image Synthesis | LDM-VQ-f4 (100 DDIM steps, eta=0) | 16.79 | 13.89 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/text2img.zip | finetuned from LAION |
OpenImages | Super-resolution | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/sr_bsr.zip | BSR image degradation |
OpenImages | Layout-to-Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 32.02 | 15.92 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/layout2img_model.zip | |
Landscapes | Semantic Image Synthesis | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis256.zip | |
Landscapes | Semantic Image Synthesis | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis.zip | finetuned on resolution 512x512 |
Get the models
The LDMs listed above can jointly be downloaded and extracted via
bash scripts/download_models.sh
The models can then be found in models/ldm/<model_spec>.
Coming Soon...
- More inference scripts for conditional LDMs.
- In the meantime, you can play with our colab notebook https://colab.research.google.com/drive/1xqzUi2iXQXDqXBHQGP9Mqt2YrYW6cx-J?usp=sharing
Comments
- Our codebase for the diffusion models builds heavily on OpenAI's ADM codebase and https://github.com/lucidrains/denoising-diffusion-pytorch. Thanks for open-sourcing!
- The implementation of the transformer encoder is from x-transformers by lucidrains.
BibTeX
@misc{rombach2021highresolution,
title={High-Resolution Image Synthesis with Latent Diffusion Models},
author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
year={2021},
eprint={2112.10752},
archivePrefix={arXiv},
primaryClass={cs.CV}
}