
Stability-AI/stablediffusion

High-Resolution Image Synthesis with Latent Diffusion Models


Top Related Projects


InvokeAI is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, supports terminal use through a CLI, and serves as the foundation for multiple commercial products.

Stable Diffusion web UI


🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.

High-Resolution Image Synthesis with Latent Diffusion Models

Quick Overview

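Stable Diffusion is an open-source latent text-to-image diffusion model. This repository, from Stability AI, hosts the Stable Diffusion 2.x models, which build on the original model created in collaboration with CompVis and RunwayML. It lets users create high-quality images from text descriptions, making it a powerful tool for artists, designers, and researchers working with AI-generated imagery.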

Pros

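  • Open source: the code is MIT-licensed, and the weights are released under the CreativeML Open RAIL++-M License, which permits research and commercial use subject to its use restrictions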
  • Capable of generating high-quality, diverse images from text prompts
  • Supports various fine-tuning and customization options
  • Active community and ongoing development

Cons

  • Requires significant computational resources for optimal performance
  • May produce biased or inappropriate content if not properly filtered
  • Learning curve for achieving desired results can be steep
  • Potential copyright and ethical concerns surrounding AI-generated imagery

Code Examples

# Load the Stable Diffusion model
from diffusers import StableDiffusionPipeline
import torch

model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
# Generate an image from a text prompt
prompt = "A serene landscape with mountains and a lake at sunset"
image = pipe(prompt).images[0]
image.save("generated_landscape.png")
# Use a negative prompt to refine the generation
prompt = "A futuristic cityscape with flying cars"
negative_prompt = "trees, nature, old buildings"
image = pipe(prompt, negative_prompt=negative_prompt).images[0]
image.save("generated_cityscape.png")

Getting Started

To get started with Stable Diffusion:

  1. Install the required dependencies:

    pip install torch diffusers transformers accelerate scipy
    
  2. Load the model and generate an image:

    from diffusers import StableDiffusionPipeline
    import torch
    
    model_id = "CompVis/stable-diffusion-v1-4"
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    
    prompt = "A beautiful landscape with mountains and a lake"
    image = pipe(prompt).images[0]
    image.save("generated_image.png")
    
  3. Experiment with different prompts and parameters to achieve desired results.

Competitor Comparisons

InvokeAI

Pros of InvokeAI

  • More user-friendly interface with a web-based UI
  • Extensive customization options and fine-tuning capabilities
  • Active community development and frequent updates

Cons of InvokeAI

  • Potentially slower image generation compared to StableDiffusion
  • May require more system resources due to additional features
  • Steeper learning curve for advanced features

Code Comparison

StableDiffusion (via 🤗 Diffusers):

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("astronaut_rides_horse.png")

InvokeAI:

from invokeai.app.invocations.image_generate import ImageGenerateInvocation

generator = ImageGenerateInvocation(prompt="a photo of an astronaut riding a horse on mars")
result = generator.invoke()
result.image.save("astronaut_rides_horse.png")

Both repositories offer powerful image generation capabilities, but InvokeAI provides a more comprehensive toolkit for customization and experimentation. StableDiffusion, on the other hand, offers a simpler implementation and potentially faster generation times. The choice between the two depends on the user's specific needs and level of expertise.

Stable Diffusion web UI

Pros of stable-diffusion-webui

  • User-friendly web interface for easier interaction with Stable Diffusion models
  • Extensive features including image-to-image, inpainting, and outpainting
  • Active community development with frequent updates and new features

Cons of stable-diffusion-webui

  • May require more system resources due to additional features
  • Potential for increased complexity in setup and configuration
  • Dependency on external models and resources

Code Comparison

stablediffusion:

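import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config
config = OmegaConf.load("configs/stable-diffusion/v2-inference-v.yaml")
model = instantiate_from_config(config.model)
model.load_state_dict(torch.load("path/to/768model.ckpt", map_location="cpu")["state_dict"], strict=False)  # placeholder path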

stable-diffusion-webui:

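from modules import sd_models, shared
checkpoint_info = sd_models.select_checkpoint()  # checkpoint currently selected in the UI settings
sd_model = sd_models.load_model(checkpoint_info)
shared.sd_model = sd_model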

The stablediffusion repo focuses on the core model implementation, while stable-diffusion-webui provides a more abstracted approach for loading models within its web interface framework.

diffusers

Pros of diffusers

  • More comprehensive library supporting various diffusion models
  • Easier integration with other Hugging Face tools and ecosystems
  • Better documentation and community support

Cons of diffusers

  • May have slightly slower inference time for some models
  • Less focused on a single specific model implementation

Code Comparison

stablediffusion:

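from omegaconf import OmegaConf
from ldm.util import instantiate_from_config
from ldm.models.diffusion.ddim import DDIMSampler
config = OmegaConf.load("configs/stable-diffusion/v2-inference.yaml")
model = instantiate_from_config(config.model)
sampler = DDIMSampler(model)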

diffusers:

from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe(prompt="a photo of an astronaut riding a horse on mars").images[0]

The diffusers library provides a more streamlined API for using pre-trained models, while stablediffusion offers more low-level control over model initialization and sampling. diffusers is generally easier for beginners and integrates well with other Hugging Face tools, while stablediffusion may be preferred by those who need fine-grained control over the model configuration and the sampling process.

High-Resolution Image Synthesis with Latent Diffusion Models

Pros of latent-diffusion

  • More focused on the underlying latent diffusion model
  • Provides a more detailed explanation of the technical aspects
  • Offers a broader range of applications beyond image generation

Cons of latent-diffusion

  • Less user-friendly for non-technical users
  • Fewer pre-trained models and examples available
  • Limited community support compared to stablediffusion

Code Comparison

latent-diffusion:

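from omegaconf import OmegaConf
from ldm.util import instantiate_from_config
# Hyperparameters such as linear_start, linear_end, timesteps, channels and image_size
# are set under model.params in the YAML config rather than passed in Python
config = OmegaConf.load("configs/latent-diffusion/txt2img-1p4B-eval.yaml")
model = instantiate_from_config(config.model)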

stablediffusion:

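from omegaconf import OmegaConf
from ldm.util import instantiate_from_config
# SD2-v config: v-prediction model producing 768x768 outputs
config = OmegaConf.load("configs/stable-diffusion/v2-inference-v.yaml")
model = instantiate_from_config(config.model)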

Both repositories implement diffusion models, but stablediffusion focuses more on image generation and provides a more user-friendly interface. latent-diffusion offers a deeper dive into the technical aspects of the model and provides more flexibility for various applications. stablediffusion has a larger community and more pre-trained models, making it easier for beginners to get started with image generation tasks.


README

Stable Diffusion Version 2


This repository contains Stable Diffusion models trained from scratch and will be continuously updated with new checkpoints. The following list provides an overview of all currently available models. More coming soon.

News

March 24, 2023

Stable UnCLIP 2.1

December 7, 2022

Version 2.1

  • New Stable Diffusion models: Stable Diffusion 2.1-v (Hugging Face) at 768x768 resolution and Stable Diffusion 2.1-base (Hugging Face) at 512x512 resolution. Both have the same number of parameters and the same architecture as 2.0 and were fine-tuned from 2.0 on the LAION-5B dataset with less restrictive NSFW filtering. By default, the model's attention operation is evaluated at full precision when xformers is not installed. To enable fp16 (which can cause numerical instabilities with the vanilla attention module on the v2.1 model), run your script with ATTN_PRECISION=fp16 python <thescript.py>.
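Why fp16 attention can be unstable is easy to see with a toy example. The following is a minimal, hypothetical sketch in plain Python, not the repository's attention module: it assumes ATTN_PRECISION defaults to fp32 when unset, and it emulates fp16 by rounding every value to IEEE half precision with the standard library's `struct` "e" format. Large attention logits overflow the fp16 range (maximum 65504) but not the full-precision path.

```python
import math
import os
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision; values too large for fp16 become +/-inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def attention_scores(q, keys, precision=None):
    """Scaled dot products q.k / sqrt(d) of one query against each key.

    precision falls back to the ATTN_PRECISION environment variable ("fp32" if unset).
    """
    precision = precision or os.environ.get("ATTN_PRECISION", "fp32")
    scale = 1.0 / math.sqrt(len(q))
    scores = []
    for k in keys:
        if precision == "fp16":
            # emulate half precision by rounding every product and partial sum
            acc = 0.0
            for a, b in zip(q, k):
                acc = to_fp16(acc + to_fp16(to_fp16(a) * to_fp16(b)))
            scores.append(to_fp16(acc * scale))
        else:
            # full-precision path (Python floats are float64, standing in for fp32)
            scores.append(sum(a * b for a, b in zip(q, k)) * scale)
    return scores


q, keys = [300.0] * 4, [[300.0] * 4]
print(attention_scores(q, keys, "fp32"))  # [180000.0]
print(attention_scores(q, keys, "fp16"))  # [inf]: 300 * 300 exceeds the fp16 maximum
```

Real GPU kernels accumulate differently, so treat this only as an illustration of the overflow mechanism.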

November 24, 2022

Version 2.0

  • New stable diffusion model (Stable Diffusion 2.0-v) at 768x768 resolution. Same number of parameters in the U-Net as 1.5, but uses OpenCLIP-ViT/H as the text encoder and is trained from scratch. SD 2.0-v is a so-called v-prediction model.

  • The above model is finetuned from SD 2.0-base, which was trained as a standard noise-prediction model on 512x512 images and is also made available.

  • Added an x4 upscaling latent text-guided diffusion model.

  • New depth-guided stable diffusion model, finetuned from SD 2.0-base. The model is conditioned on monocular depth estimates inferred via MiDaS and can be used for structure-preserving img2img and shape-conditional synthesis.


  • A text-guided inpainting model, finetuned from SD 2.0-base.
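The difference between a standard noise-prediction model (SD 2.0-base) and a v-prediction model (SD 2.0-v) lies in the regression target. A minimal scalar sketch, assuming the variance-preserving convention x_t = alpha_t * x0 + sigma_t * eps with alpha_t^2 + sigma_t^2 = 1 and the usual v-parameterization v = alpha_t * eps - sigma_t * x0; the helper names are hypothetical. An epsilon model predicts eps directly, while a v model predicts v, from which both x0 and eps can be recovered.

```python
import math


def make_noisy(x0, eps, alpha_bar):
    """Forward diffusion: x_t = alpha_t * x0 + sigma_t * eps (variance preserving)."""
    alpha, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return alpha * x0 + sigma * eps


def v_target(x0, eps, alpha_bar):
    """Regression target of a v-prediction model: v = alpha_t * eps - sigma_t * x0."""
    alpha, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return alpha * eps - sigma * x0


def v_to_x0_eps(x_t, v, alpha_bar):
    """Recover (x0, eps) from a v prediction, as a sampler would."""
    alpha, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x0 = alpha * x_t - sigma * v
    eps = sigma * x_t + alpha * v
    return x0, eps


x0, eps, alpha_bar = 0.7, -1.2, 0.3
x_t = make_noisy(x0, eps, alpha_bar)
v = v_target(x0, eps, alpha_bar)
print(v_to_x0_eps(x_t, v, alpha_bar))  # approximately (0.7, -1.2)
```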

We follow the original repository and provide basic inference scripts to sample from the models.


The original Stable Diffusion model was created in a collaboration with CompVis and RunwayML and builds upon the work:

High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach*, Andreas Blattmann*, Dominik Lorenz, Patrick Esser, Björn Ommer
CVPR '22 Oral | GitHub | arXiv | Project page

and many others.

Stable Diffusion is a latent text-to-image diffusion model.


Requirements

You can update an existing latent diffusion environment by running

conda install pytorch==1.12.1 torchvision==0.13.1 -c pytorch
pip install transformers==4.19.2 diffusers invisible-watermark
pip install -e .
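Because these commands pin exact versions, it can help to confirm that an existing environment actually matches them. A minimal sketch using the standard library's `importlib.metadata`; the PINS table restates the versions above, and the `check_pins` helper is hypothetical. Local build tags such as "+cu113" are ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the install commands above.
PINS = {"torch": "1.12.1", "torchvision": "0.13.1", "transformers": "4.19.2"}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for every pin the environment does not satisfy."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # not installed at all
        # ignore local build tags such as "+cu113" when comparing
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = check_pins(PINS)
    for name, (expected, found) in problems.items():
        print(f"{name}: expected {expected}, found {found}")
    print("environment matches the pins" if not problems else "fix the versions above")
```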

xformers efficient attention

For more efficiency and speed on GPUs, we highly recommend installing the xformers library.

Tested on an A100 with CUDA 11.4. Installation needs a reasonably recent version of nvcc and gcc/g++; you can obtain these, e.g., via

export CUDA_HOME=/usr/local/cuda-11.4
conda install -c nvidia/label/cuda-11.4.0 cuda-nvcc
conda install -c conda-forge gcc
conda install -c conda-forge gxx_linux-64==9.5.0

Then, run the following (compiling takes up to 30 min).

cd ..
git clone https://github.com/facebookresearch/xformers.git
cd xformers
git submodule update --init --recursive
pip install -r requirements.txt
pip install -e .
cd ../stablediffusion

Upon successful installation, the code will automatically default to memory efficient attention for the self- and cross-attention layers in the U-Net and autoencoder.
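The automatic default boils down to checking whether xformers can be imported. A minimal sketch of that pattern; the function names and backend labels are illustrative and not the repository's code, whose own check may differ in detail.

```python
import importlib.util


def xformers_available() -> bool:
    """True if the xformers package can be found in this environment."""
    return importlib.util.find_spec("xformers") is not None


def pick_attention_backend(available=None) -> str:
    """Choose between memory-efficient and vanilla attention (illustrative labels)."""
    if available is None:
        available = xformers_available()
    return "memory-efficient (xformers)" if available else "vanilla"


print(pick_attention_backend())
```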

General Disclaimer

Stable Diffusion models are general text-to-image diffusion models and therefore mirror biases and (mis-)conceptions that are present in their training data. Although efforts were made to reduce the inclusion of explicit pornographic material, we do not recommend using the provided weights for services or products without additional safety mechanisms and considerations. The weights are research artifacts and should be treated as such. Details on the training procedure and data, as well as the intended use of the model can be found in the corresponding model card. The weights are available via the StabilityAI organization at Hugging Face under the CreativeML Open RAIL++-M License.

Stable Diffusion v2

Stable Diffusion v2 refers to a specific configuration of the model architecture that uses a downsampling-factor 8 autoencoder with an 865M UNet and OpenCLIP ViT-H/14 text encoder for the diffusion model. The SD 2-v model produces 768x768 px outputs.
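With a downsampling-factor 8 autoencoder, the U-Net denoises a latent that is 8 times smaller in each spatial dimension. A small worked sketch of that arithmetic, assuming 4 latent channels (as in the Stable Diffusion autoencoders); `latent_shape` is a hypothetical helper.

```python
def latent_shape(height, width, factor=8, channels=4):
    """Shape (C, H/f, W/f) of the latent the U-Net denoises for an HxW image."""
    if height % factor or width % factor:
        raise ValueError(f"image size must be a multiple of {factor}")
    return channels, height // factor, width // factor


print(latent_shape(768, 768))  # (4, 96, 96) for SD 2-v
print(latent_shape(512, 512))  # (4, 64, 64) for SD 2-base

c, h, w = latent_shape(768, 768)
print((768 * 768 * 3) / (c * h * w))  # 48.0: the latent holds 48x fewer values than the RGB image
```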

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints:

sd evaluation results
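The guidance scales listed above are the weight s in classifier-free guidance, which extrapolates from the unconditional noise prediction toward the text-conditional one: eps_hat = eps_uncond + s * (eps_cond - eps_uncond). A minimal elementwise sketch on plain lists; real implementations apply the same formula to tensors, and `cfg_combine` is a hypothetical name.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


uncond, cond = [0.0, 1.0], [1.0, 1.5]
for s in (1.5, 3.0, 8.0):
    print(s, cfg_combine(uncond, cond, s))
```

A scale of 1.0 reproduces the conditional prediction; larger scales push the sample further in the direction the prompt suggests.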

Text-to-Image


Stable Diffusion 2 is a latent diffusion model conditioned on the penultimate text embeddings of a CLIP ViT-H/14 text encoder. We provide a reference script for sampling.

Reference Sampling Script

This script adds an invisible watermark to the outputs to help viewers identify the images as machine-generated. We provide configs for the SD2-v (768px) and SD2-base (512px) models.

First, download the weights for SD2.1-v and SD2.1-base.

To sample from the SD2.1-v model, run the following:

python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768  

or try out the Web Demo: Hugging Face Spaces.

To sample from the base model, use

python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/model.ckpt/> --config <path/to/config.yaml/>  

By default, this uses the DDIM sampler, and renders images of size 768x768 (which it was trained on) in 50 steps. Empirically, the v-models can be sampled with higher guidance scales.
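When sweeping prompts or switching between the v and base models, the txt2img.py invocation can be assembled programmatically. A minimal sketch that uses only the flags shown above (--prompt, --ckpt, --config, --H, --W); the `txt2img_args` helper is hypothetical and the checkpoint path is a placeholder. Passing a list to `subprocess.run` avoids shell-quoting problems with prompts that contain commas or quotes.

```python
def txt2img_args(prompt, ckpt, config, size=None):
    """Argument list for scripts/txt2img.py using only the flags documented above."""
    args = ["python", "scripts/txt2img.py", "--prompt", prompt,
            "--ckpt", ckpt, "--config", config]
    if size is not None:  # e.g. 768 for the SD2.1-v model
        args += ["--H", str(size), "--W", str(size)]
    return args


cmd = txt2img_args(
    "a professional photograph of an astronaut riding a horse",
    ckpt="path/to/768model.ckpt",  # placeholder: point this at the downloaded weights
    config="configs/stable-diffusion/v2-inference-v.yaml",
    size=768,
)
print(cmd)
# From the repository root: subprocess.run(cmd, check=True)
```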

Note: The inference config for all model versions is designed to be used with EMA-only checkpoints. For this reason, use_ema=False is set in the configuration; otherwise, the code would try to switch from non-EMA to EMA weights.
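For context, EMA weights are an exponential moving average of the model parameters maintained during training; since the released checkpoints already contain only those averaged weights, inference can load them directly. A minimal scalar sketch of the update rule; the default decay of 0.9999 is an illustrative assumption, not necessarily the value used for these checkpoints.

```python
def ema_update(ema_params, params, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * param, per parameter."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


ema = [0.0, 0.0]
for step in range(3):
    ema = ema_update(ema, [1.0, -1.0], decay=0.5)
print(ema)  # [0.875, -0.875]
```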

Enable Intel® Extension for PyTorch* optimizations in Text-to-Image script

If you plan to run Text-to-Image on an Intel® CPU, try sampling an image with TorchScript and Intel® Extension for PyTorch* optimizations. Intel® Extension for PyTorch* extends PyTorch with up-to-date feature optimizations for an extra performance boost on Intel® hardware. For example, it can convert the memory layout of operators to the channels-last format, which is generally beneficial on Intel CPUs, take advantage of the most advanced instruction set available on a machine, and optimize operators.

Prerequisites

Before running the script, make sure all required libraries are installed: jemalloc, numactl, Intel® OpenMP and Intel® Extension for PyTorch* (the optimization was tested on Ubuntu 20.04).

apt-get install numactl libjemalloc-dev
pip install intel-openmp
pip install intel_extension_for_pytorch -f https://software.intel.com/ipex-whl-stable

To sample from the SD2.1-v model with TorchScript+IPEX optimizations, run the following. Remember to set --ninstance to the number of instances you want to run the program on.

MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000 python -m intel_extension_for_pytorch.cpu.launch --ninstance <number of instances> --enable_jemalloc scripts/txt2img.py --prompt "a corgi is playing guitar, oil on canvas" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/intel/v2-inference-v-fp32.yaml --H 768 --W 768 --precision full --device cpu --torchscript --ipex

To sample from the base model with IPEX optimizations, use

MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000 python -m intel_extension_for_pytorch.cpu.launch --ninstance <number of instances> --enable_jemalloc scripts/txt2img.py --prompt "a corgi is playing guitar, oil on canvas" --ckpt <path/to/model.ckpt/> --config configs/stable-diffusion/intel/v2-inference-fp32.yaml --n_samples 1 --n_iter 4 --precision full --device cpu --torchscript --ipex

If your CPU supports bfloat16, consider sampling from the model with bfloat16 enabled for a performance boost, like so:

# SD2.1-v
MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000 python -m intel_extension_for_pytorch.cpu.launch --ninstance <number of instances> --enable_jemalloc scripts/txt2img.py --prompt "a corgi is playing guitar, oil on canvas" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/intel/v2-inference-v-bf16.yaml --H 768 --W 768 --precision full --device cpu --torchscript --ipex --bf16
# SD2.1-base
MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000 python -m intel_extension_for_pytorch.cpu.launch --ninstance <number of instances> --enable_jemalloc scripts/txt2img.py --prompt "a corgi is playing guitar, oil on canvas" --ckpt <path/to/model.ckpt/> --config configs/stable-diffusion/intel/v2-inference-bf16.yaml --precision full --device cpu --torchscript --ipex --bf16
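These launch commands differ only in checkpoint, config and a few flags, so they can also be assembled from Python. A minimal sketch that reuses the MALLOC_CONF string and the flags verbatim from the commands above; the `ipex_launch` helper is hypothetical, the checkpoint path is a placeholder, and extra flags such as --n_samples or --n_iter can be passed through `extra`.

```python
import os

# jemalloc tuning copied verbatim from the commands above
MALLOC_CONF = ("oversize_threshold:1,background_thread:true,metadata_thp:auto,"
               "dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000")


def ipex_launch(prompt, ckpt, config, ninstances, size=None, bf16=False, extra=()):
    """Return (env, argv) for running txt2img.py under the IPEX CPU launcher."""
    env = dict(os.environ, MALLOC_CONF=MALLOC_CONF)
    argv = ["python", "-m", "intel_extension_for_pytorch.cpu.launch",
            "--ninstance", str(ninstances), "--enable_jemalloc",
            "scripts/txt2img.py", "--prompt", prompt,
            "--ckpt", ckpt, "--config", config, *extra]
    if size is not None:  # 768 for the v model
        argv += ["--H", str(size), "--W", str(size)]
    argv += ["--precision", "full", "--device", "cpu", "--torchscript", "--ipex"]
    if bf16:
        argv.append("--bf16")
    return env, argv


env, argv = ipex_launch(
    "a corgi is playing guitar, oil on canvas",
    ckpt="path/to/768model.ckpt",  # placeholder
    config="configs/stable-diffusion/intel/v2-inference-v-bf16.yaml",
    ninstances=1, size=768, bf16=True,
)
# From the repository root: subprocess.run(argv, env=env, check=True)
```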

Image Modification with Stable Diffusion


Depth-Conditional Stable Diffusion

To augment the well-established img2img functionality of Stable Diffusion, we provide a shape-preserving stable diffusion model.

Note that the original method for image modification introduces significant semantic changes w.r.t. the initial image. If that is not desired, download our depth-conditional stable diffusion model and the dpt_hybrid MiDaS model weights, place the latter in a folder named midas_models, and sample via

python scripts/gradio/depth2img.py configs/stable-diffusion/v2-midas-inference.yaml <path-to-ckpt>

or

streamlit run scripts/streamlit/depth2img.py configs/stable-diffusion/v2-midas-inference.yaml <path-to-ckpt>

This method can be used on samples from the base model itself. For example, take this sample generated by an anonymous Discord user. Using the gradio or streamlit script depth2img.py, the MiDaS model first infers a monocular depth estimate from this input, and the diffusion model is then conditioned on the (relative) depth output.
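Because MiDaS predicts relative rather than metric depth, the estimate has to be rescaled before it can serve as conditioning. A minimal sketch of one common choice, per-image min-max normalization to [-1, 1]; treat this as an assumption about the preprocessing rather than a description of the repository's exact code, and `normalize_depth` as a hypothetical helper.

```python
def normalize_depth(depth):
    """Min-max rescale a 2-D relative depth map (list of rows) to [-1, 1]."""
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant map: no structure to preserve
        return [[0.0 for _ in row] for row in depth]
    return [[2.0 * (d - lo) / (hi - lo) - 1.0 for d in row] for row in depth]


print(normalize_depth([[2.0, 4.0], [6.0, 10.0]]))  # [[-1.0, -0.5], [0.0, 1.0]]
```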


This model is particularly useful for a photorealistic style; see the examples. For a maximum strength of 1.0, the model removes all pixel-based information and only relies on the text prompt and the inferred monocular depth estimate.


Classic Img2Img

To run the "classic" img2img, use

python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8 --ckpt <path/to/model.ckpt>

and adapt the checkpoint and config paths accordingly.
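The --strength flag controls how much of the diffusion trajectory is re-run: the input is noised partway along the schedule and then denoised from there. A minimal sketch of that bookkeeping, assuming 50 DDIM steps and that the number of re-run steps is int(strength * steps); `img2img_steps` is a hypothetical helper.

```python
def img2img_steps(strength, ddim_steps=50):
    """Number of denoising steps re-run for a given img2img strength in [0, 1]."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return int(strength * ddim_steps)


print(img2img_steps(0.8))  # 40: the input is noised to step 40 of 50, then denoised
print(img2img_steps(1.0))  # 50: starts from pure noise, so only the prompt guides the result
```

Lower strengths keep more of the input image; a strength of 0.8, as in the command above, leaves room for substantial changes.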

Image Upscaling with Stable Diffusion

After downloading the weights, run

python scripts/gradio/superresolution.py configs/stable-diffusion/x4-upscaling.yaml <path-to-checkpoint>

or

streamlit run scripts/streamlit/superresolution.py -- configs/stable-diffusion/x4-upscaling.yaml <path-to-checkpoint>

for a Gradio or Streamlit demo of the text-guided x4 superresolution model.
This model can be used both on real inputs and on synthesized examples. For the latter, we recommend setting a higher noise_level, e.g. noise_level=100.

Image Inpainting with Stable Diffusion


Download the SD 2.0-inpainting checkpoint and run

python scripts/gradio/inpainting.py configs/stable-diffusion/v2-inpainting-inference.yaml <path-to-checkpoint>

or

streamlit run scripts/streamlit/inpainting.py -- configs/stable-diffusion/v2-inpainting-inference.yaml <path-to-checkpoint>

for a Gradio or Streamlit demo of the inpainting model. This script adds invisible watermarking to the demo from the RunwayML repository; both should work interchangeably with the checkpoints/configs.
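The inpainting model is conditioned on the masked image as well as on the mask itself. A minimal sketch of preparing those two inputs on plain nested lists, assuming a grayscale image, a soft mask where values at or above 0.5 mark the region to repaint, and that masked-out pixels are zeroed; the real scripts operate on tensors and may differ in detail, and `prepare_inpainting_inputs` is a hypothetical helper.

```python
def prepare_inpainting_inputs(image, mask, threshold=0.5):
    """Return (binary_mask, masked_image) for a grayscale image and a soft mask."""
    # 1.0 marks pixels the model should repaint, 0.0 marks pixels to keep
    binary = [[1.0 if m >= threshold else 0.0 for m in row] for row in mask]
    # zero out the pixels that will be repainted
    masked = [[px * (1.0 - b) for px, b in zip(img_row, m_row)]
              for img_row, m_row in zip(image, binary)]
    return binary, masked


binary, masked = prepare_inpainting_inputs([[0.2, 0.4], [0.6, 0.8]],
                                           [[0.0, 0.9], [0.3, 1.0]])
print(binary)  # [[0.0, 1.0], [0.0, 1.0]]
print(masked)  # [[0.2, 0.0], [0.6, 0.0]]
```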

Shout-Outs

License

The code in this repository is released under the MIT License.

The weights are available via the StabilityAI organization at Hugging Face and are released under the CreativeML Open RAIL++-M License.

BibTeX

@misc{rombach2021highresolution,
      title={High-Resolution Image Synthesis with Latent Diffusion Models}, 
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
      year={2021},
      eprint={2112.10752},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}