Top Related Projects
Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, and serves as the foundation for multiple commercial products.
High-Resolution Image Synthesis with Latent Diffusion Models
🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
Stable Diffusion web UI
Quick Overview
The k-diffusion repository by crowsonkb is a PyTorch library for training and sampling from diffusion models, a type of generative model used here mainly for image generation. It implements the framework of Karras et al. (2022) and provides a modular set of model wrappers, samplers, and training utilities.
Pros
- Flexible and Modular: The library is designed to be highly flexible and modular, allowing users to easily experiment with different diffusion model architectures and training techniques.
- Documented Training Workflow: The README covers installation, training, and the model configuration format, although its inference section is still marked as a TODO.
- Active Development: The project is actively maintained and developed, with regular updates and improvements.
- Supports a Variety of Diffusion Models: Besides its native models, the library provides wrappers for v-diffusion-pytorch, OpenAI diffusion, and CompVis diffusion models so that they can be used with its samplers.
Cons
- Steep Learning Curve: Diffusion models can be complex and challenging to understand, and the library may have a steep learning curve for users who are new to the field.
- Limited Support for Non-Image Domains: While the library supports a variety of diffusion model architectures, its primary focus is on image generation, and it may have limited support for other domains like text or audio.
- Computational Complexity: Training and using diffusion models can be computationally intensive, which may limit the library's usability on resource-constrained systems.
- Potential Bias and Ethical Concerns: Like other generative models, diffusion models can potentially exhibit biases and raise ethical concerns, which users should be aware of and address.
Code Examples
Here are a few code examples from the k-diffusion library:
- Loading a Pre-Trained Diffusion Model:
import torch
from k_diffusion.external import CompVisDenoiser
# Assumes the file holds a whole pickled CompVis-style model object, not a bare state dict
inner_model = torch.load("path/to/model.pt")
# Wrap the model so that k-diffusion's samplers can call it
model = CompVisDenoiser(inner_model)
This code wraps a pre-trained CompVis diffusion model in the CompVisDenoiser class from the k_diffusion.external module, which adapts it for use with k-diffusion's samplers.
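- Sampling from a Diffusion Model:
import torch
import k_diffusion as K
# Build a 50-step noise schedule from the wrapped model
sigmas = model.get_sigmas(50)
# Start from Gaussian noise scaled to the largest noise level; the shape must match the model's input
x = torch.randn(1, 3, 64, 64) * sigmas[0]
sample = K.sampling.sample_lms(model, x, sigmas)
This code uses the sample_lms function from the k_diffusion.sampling module to generate a sample from the loaded diffusion model. Conditional models also need their conditioning passed through extra_args.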
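The sampler functions shown on this page take a `sigmas` argument: a decreasing sequence of noise levels that ends at zero. As a rough illustration, the sketch below computes the schedule proposed in Karras et al. (2022) in plain Python. The function name `karras_sigmas` and the example range (0.1 to 10) are assumptions made for this sketch, not part of the k-diffusion API.

```python
def karras_sigmas(n, sigma_min, sigma_max, rho=7.0):
    """Return n decreasing noise levels from sigma_max to sigma_min, plus a final 0."""
    if n < 2:
        raise ValueError("need at least two noise levels")
    max_inv = sigma_max ** (1 / rho)
    min_inv = sigma_min ** (1 / rho)
    # Interpolate linearly in sigma^(1/rho) space, then map back with the power rho
    sigmas = [
        (max_inv + i / (n - 1) * (min_inv - max_inv)) ** rho
        for i in range(n)
    ]
    # Samplers expect the schedule to end at exactly zero noise
    return sigmas + [0.0]


# Illustrative values only
schedule = karras_sigmas(10, sigma_min=0.1, sigma_max=10.0)
```

Larger values of `rho` place more of the steps near `sigma_min`, where fine image detail is resolved.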
- Training a Diffusion Model:
import torch
from k_diffusion.external import CompVisDenoiser
# Define the model and optimizer (constructor arguments are elided)
model = CompVisDenoiser(...)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
num_epochs = 10  # illustrative value
# Schematic loop: training_step(...) is a placeholder for the loss computation,
# not a confirmed k-diffusion method
for epoch in range(num_epochs):
    optimizer.zero_grad()
    loss = model.training_step(...)
    loss.backward()
    optimizer.step()
This code sketches the general shape of a training loop around a wrapped model. For actual training, the README describes running train.py with a configuration file.
Getting Started
To get started with the k-diffusion library, you can follow these steps:
- Install the library using pip:
pip install k-diffusion
- Import the necessary modules and classes:
import torch
import k_diffusion as K
from k_diffusion.external import CompVisDenoiser
- Load a pre-trained diffusion model:
model = CompVisDenoiser(torch.load("path/to/model.pt"))
- Generate a sample from the loaded model:
sigmas = model.get_sigmas(50)
sample = K.sampling.sample_lms(model, torch.randn(1, 3, 64, 64) * sigmas[0], sigmas)
- Optionally, train a new diffusion model:
optimizer = torch.optim.Adam(model.
Competitor Comparisons
Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, and serves as the foundation for multiple commercial products.
Pros of InvokeAI
- More comprehensive UI with a web interface for easier use
- Supports multiple AI models and pipelines
- Active development with frequent updates and new features
Cons of InvokeAI
- Larger codebase, potentially more complex to set up and customize
- May have higher system requirements due to additional features
Code Comparison
k-diffusion:
from tqdm.auto import trange
from k_diffusion.sampling import to_d

def sample(model, x, sigmas, extra_args=None, callback=None, disable=None):
    """Implements Algorithm 2 (Heun steps) from Karras et al. (2022)."""
    extra_args = {} if extra_args is None else extra_args
    s_in = x.new_ones([x.shape[0]])
    for i in trange(len(sigmas) - 1, disable=disable):
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        d = to_d(x, sigmas[i], denoised)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
        dt = sigmas[i + 1] - sigmas[i]
        if sigmas[i + 1] == 0:
            # Euler method (the final step cannot evaluate the model at sigma = 0)
            x = x + d * dt
        else:
            # Heun's method: average the slopes at both ends of the step
            x_2 = x + d * dt
            denoised_2 = model(x_2, sigmas[i + 1] * s_in, **extra_args)
            d_2 = to_d(x_2, sigmas[i + 1], denoised_2)
            d_prime = (d + d_2) / 2
            x = x + d_prime * dt
    return x
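To make the difference between the Euler and Heun updates concrete, here is a minimal sketch that runs both on a one-dimensional toy problem in plain Python. It assumes the data follows N(0, sigma_data^2), for which the ideal denoiser and the exact ODE solution are known in closed form. The names `gaussian_denoiser` and `sample_toy` and the schedule values are made up for this illustration; they are not part of k-diffusion.

```python
import math


def gaussian_denoiser(x, sigma, sigma_data=1.0):
    """Ideal denoiser for data drawn from N(0, sigma_data^2): the posterior mean."""
    return x * sigma_data ** 2 / (sigma_data ** 2 + sigma ** 2)


def to_d(x, sigma, denoised):
    """Convert a denoiser output to the ODE derivative dx/dsigma."""
    return (x - denoised) / sigma


def sample_toy(denoiser, x, sigmas, use_heun):
    """Integrate from sigmas[0] down to sigmas[-1] with Euler or Heun steps."""
    for i in range(len(sigmas) - 1):
        d = to_d(x, sigmas[i], denoiser(x, sigmas[i]))
        dt = sigmas[i + 1] - sigmas[i]
        if not use_heun or sigmas[i + 1] == 0:
            # Euler step (also used for the last step, where sigma reaches 0)
            x = x + d * dt
        else:
            # Heun step: average the slope at the start and at the Euler estimate
            x_2 = x + d * dt
            d_2 = to_d(x_2, sigmas[i + 1], denoiser(x_2, sigmas[i + 1]))
            x = x + (d + d_2) / 2 * dt
    return x


sigmas = [10.0, 5.0, 2.0, 1.0, 0.5, 0.2, 0.0]  # illustrative schedule
x_start = 10.0
# Closed-form solution of the ODE at sigma = 0 for sigma_data = 1
exact = x_start / math.sqrt(1.0 + sigmas[0] ** 2)
euler = sample_toy(gaussian_denoiser, x_start, sigmas, use_heun=False)
heun = sample_toy(gaussian_denoiser, x_start, sigmas, use_heun=True)
```

On this coarse schedule the Heun result lands closer to the exact value than the Euler result, at the cost of a second denoiser evaluation on every step except the last.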
InvokeAI:
def sample(self, noise, prompt, strength=1.0, steps=50, seed=None):
    """Generate an image from a text prompt."""
    if seed is not None:
        torch.manual_seed(seed)
    # Prepare latent variables
    latents = self.prepare_latents(noise, strength)
    # Text encoding
    text_embeddings = self.get_text_embeddings(prompt)
    # Sampling loop
    for i in range(steps):
        latents = self.diffusion_step(latents, text_embeddings, i, steps)
    # Decode latents to image
    image = self.vae.decode(latents)
    return image
High-Resolution Image Synthesis with Latent Diffusion Models
Pros of stablediffusion
- More comprehensive and feature-rich, offering a complete text-to-image generation pipeline
- Backed by a larger organization (Stability AI), potentially leading to better long-term support and updates
- Includes pre-trained models and a user-friendly interface for easier adoption
Cons of stablediffusion
- Larger codebase and more dependencies, which may increase complexity and setup time
- Potentially slower inference speed due to the full pipeline implementation
- May require more computational resources for training and inference
Code Comparison
k-diffusion:
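import torch
import k_diffusion as K
# `model` is a k-diffusion denoiser wrapper; the sigma range here is illustrative
sigmas = K.sampling.get_sigmas_karras(20, sigma_min=0.1, sigma_max=10.0)
x = torch.randn(...) * sigmas[0]
samples = K.sampling.sample_lms(model, x, sigmas)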
stablediffusion:
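from diffusers import StableDiffusionPipeline
# The Stable Diffusion 2.1 weights are loaded here through the diffusers library
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]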
Summary
While k-diffusion focuses on providing a flexible and efficient implementation of diffusion models, stablediffusion offers a more complete solution for text-to-image generation. k-diffusion may be preferred for research and customization, while stablediffusion is better suited for production-ready applications and ease of use.
🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
Pros of diffusers
- Extensive library with support for multiple diffusion models and pipelines
- Well-documented and actively maintained by the Hugging Face team
- Seamless integration with other Hugging Face libraries and ecosystem
Cons of diffusers
- Larger codebase and potentially steeper learning curve
- May have more dependencies and overhead for simple use cases
Code Comparison
k-diffusion:
import k_diffusion as K
sampler = K.sampling.sample_lms
x = sampler(model, x, sigmas, extra_args={'cond': cond})
diffusers:
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe(prompt="a photo of an astronaut riding a horse on mars").images[0]
Summary
k-diffusion is a lightweight and flexible library focused on diffusion models, while diffusers offers a more comprehensive suite of tools and models within the broader Hugging Face ecosystem. k-diffusion may be preferred for custom implementations and research, while diffusers excels in providing ready-to-use pipelines and integration with other Hugging Face libraries.
Stable Diffusion web UI
Pros of stable-diffusion-webui
- User-friendly web interface for easy interaction
- Extensive features including inpainting, outpainting, and various sampling methods
- Active community with frequent updates and extensions
Cons of stable-diffusion-webui
- Larger codebase, potentially more complex to modify or extend
- May have higher system requirements due to additional features
Code Comparison
k-diffusion:
from tqdm.auto import trange

def sample(model, x, sigmas, extra_args=None, callback=None):
    """Implements Algorithm 2 (Euler steps) from Karras et al. (2022)."""
    extra_args = {} if extra_args is None else extra_args
    s_in = x.new_ones([x.shape[0]])
    for i in trange(len(sigmas) - 1):
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        # Derivative of the probability flow ODE at the current noise level
        d = (x - denoised) / sigmas[i]
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
        dt = sigmas[i + 1] - sigmas[i]
        # Euler step to the next noise level
        x = x + d * dt
    return x
stable-diffusion-webui:
def sample_euler(model, x, sigmas, extra_args=None, callback=None, disable=None, s_churn=0., s_tmin=0., s_tmax=float('inf'), s_noise=1.):
    """Implements Algorithm 2 (Euler steps) from Karras et al. (2022)."""
    extra_args = {} if extra_args is None else extra_args
    s_in = x.new_ones([x.shape[0]])
    for i in trange(len(sigmas) - 1, disable=disable):
        gamma = min(s_churn / (len(sigmas) - 1), 2 ** 0.5 - 1) if s_tmin <= sigmas[i] <= s_tmax else 0.
        eps = torch.randn_like(x) * s_noise
        sigma_hat = sigmas[i] * (gamma + 1)
        if gamma > 0:
            x = x + eps * (sigma_hat ** 2 - sigmas[i] ** 2) ** 0.5
        denoised = model(x, sigma_hat * s_in, **extra_args)
        d = (x - denoised) / sigma_hat
        dt = sigmas[i + 1] - sigma_hat
        x = x + d * dt
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigma_hat, 'denoised': denoised})
    return x
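The "churn" parameters in the function above temporarily raise the noise level before each step. The following sketch isolates that arithmetic in plain Python so the effect of `s_churn`, `s_tmin`, and `s_tmax` can be inspected without a model. The helper name `churn_step` is made up for this illustration, and `n_steps` stands in for `len(sigmas) - 1`.

```python
import math


def churn_step(sigma, n_steps, s_churn=0.0, s_tmin=0.0, s_tmax=float("inf")):
    """Return (gamma, sigma_hat, extra_noise_std) for one sampler step."""
    # Churn is only applied while sigma lies inside [s_tmin, s_tmax]
    in_range = s_tmin <= sigma <= s_tmax
    # gamma is capped at sqrt(2) - 1, so sigma_hat is at most sqrt(2) * sigma
    gamma = min(s_churn / n_steps, math.sqrt(2) - 1) if in_range else 0.0
    sigma_hat = sigma * (gamma + 1)
    # Standard deviation of the noise that lifts the sample from sigma to sigma_hat
    # (the sampler additionally multiplies the noise by s_noise)
    extra_noise_std = math.sqrt(sigma_hat ** 2 - sigma ** 2)
    return gamma, sigma_hat, extra_noise_std


no_churn = churn_step(2.0, n_steps=10)
max_churn = churn_step(2.0, n_steps=10, s_churn=100.0)
out_of_range = churn_step(2.0, n_steps=10, s_churn=100.0, s_tmin=3.0)
```

With the default `s_churn=0`, gamma is zero and the loop reduces to the deterministic Euler sampler. At the cap, the added noise has the same standard deviation as the current noise level.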
README
k-diffusion
An implementation of Elucidating the Design Space of Diffusion-Based Generative Models (Karras et al., 2022) for PyTorch, with enhancements and additional features, such as improved sampling algorithms and transformer-based diffusion models.
Hourglass diffusion transformer
k-diffusion contains a new model type, image_transformer_v2, that uses ideas from Hourglass Transformer and DiT.
Requirements
To use the new model type you will need to install custom CUDA kernels:
- NATTEN for the sparse (neighborhood) attention used at low levels of the hierarchy. There is a shifted window attention version of the model type which does not require a custom CUDA kernel, but it does not perform as well and is slower for both training and inference.
- FlashAttention-2 for global attention. It will fall back to plain PyTorch if it is not installed.
Also, you should make sure your PyTorch installation is capable of using torch.compile(). It will fall back to eager mode if torch.compile() is not available, but it will be slower and use more memory in training.
Usage
Demo
To train a 256x256 RGB model on Oxford Flowers without installing custom CUDA kernels, install Hugging Face Datasets:
pip install datasets
and run:
python train.py --config configs/config_oxford_flowers_shifted_window.json --name flowers_demo_001 --evaluate-n 0 --batch-size 32 --sample-n 36 --mixed-precision bf16
If you run out of memory, try adding --checkpointing or reducing the batch size. If you are using an older GPU (pre-Ampere), omit --mixed-precision bf16 to train in FP32. It is not recommended to train in FP16.
If you have NATTEN installed and working (preferred), you can train with neighborhood attention instead of shifted window attention by specifying --config configs/config_oxford_flowers.json.
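The flag choices described above can be collected in a small helper when training runs are launched from a script. This is only a sketch: the function `build_train_command` and its parameters are hypothetical, and it uses only flags that appear in the demo command.

```python
def build_train_command(config, name, batch_size=32, bf16=True, checkpointing=False):
    """Assemble the argument list for train.py from the options discussed above."""
    cmd = [
        "python", "train.py",
        "--config", config,
        "--name", name,
        "--batch-size", str(batch_size),
    ]
    if bf16:
        # Omit on pre-Ampere GPUs so that training runs in FP32
        cmd += ["--mixed-precision", "bf16"]
    if checkpointing:
        # Reduces memory use when the run would otherwise go out of memory
        cmd.append("--checkpointing")
    return cmd


# Example: an older GPU with limited memory
cmd = build_train_command(
    "configs/config_oxford_flowers_shifted_window.json",
    "flowers_demo_001",
    batch_size=16,
    bf16=False,
    checkpointing=True,
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.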
Config file
In the "model" key of the config file:
- Set the "type" key to "image_transformer_v2".
- The base patch size is set by the "patch_size" key, like "patch_size": [4, 4].
- Model depth for each level of the hierarchy is specified by the "depths" config key, like "depths": [2, 2, 4]. This constructs a model with two transformer layers at the first level (4x4 patches), followed by two at the second level (8x8 patches), followed by four at the highest level (16x16 patches), followed by two more at the second level, followed by two more at the first level.
- Model width for each level of the hierarchy is specified by the "widths" config key, like "widths": [192, 384, 768]. The widths must be multiples of the attention head dimension.
- The self-attention mechanism for each level of the hierarchy is specified by the "self_attns" config key, like:
  "self_attns": [
      {"type": "neighborhood", "d_head": 64, "kernel_size": 7},
      {"type": "neighborhood", "d_head": 64, "kernel_size": 7},
      {"type": "global", "d_head": 64},
  ]
  If not specified, all levels of the hierarchy except for the highest use neighborhood attention with 64 dim heads and a 7x7 kernel. The highest level uses global attention with 64 dim heads. So the token count at every level but the highest can be very large.
- As a fallback if you or your users cannot use NATTEN, you can also train a model with shifted window attention at the low levels of the hierarchy. Shifted window attention does not perform as well as neighborhood attention and it is slower to train and inference, but it does not require custom CUDA kernels. Specify it like:
  "self_attns": [
      {"type": "shifted-window", "d_head": 64, "window_size": 8},
      {"type": "shifted-window", "d_head": 64, "window_size": 8},
      {"type": "global", "d_head": 64},
  ]
  The window size at each level must evenly divide the image size at that level. Models trained with one attention type must be fine-tuned to be used with a different type.
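The constraints in this list can be checked before a training run is started. The sketch below is a hypothetical validator in plain Python, not part of k-diffusion. It assumes that the patch size doubles at each level (as in the 4x4, 8x8, 16x16 example above) and that "image size at that level" means the token grid, that is, the pixel size divided by the patch size at that level.

```python
def layer_order(depths):
    """Layers per stage: down through the levels, the highest level once, then back up."""
    return depths + depths[-2::-1]


def describe_levels(image_size, patch_size, depths, widths, self_attns):
    """Check the documented constraints and return a per-level summary."""
    levels = []
    for i, (depth, width, attn) in enumerate(zip(depths, widths, self_attns)):
        # Assumption: each level merges 2x2 tokens, so the patch size doubles
        patch = (patch_size[0] * 2 ** i, patch_size[1] * 2 ** i)
        grid = (image_size[0] // patch[0], image_size[1] // patch[1])
        if width % attn["d_head"] != 0:
            raise ValueError(f"level {i}: width {width} is not a multiple of d_head")
        if attn["type"] == "shifted-window":
            window = attn["window_size"]
            if grid[0] % window or grid[1] % window:
                raise ValueError(f"level {i}: window {window} does not divide grid {grid}")
        levels.append({"patch": patch, "grid": grid, "depth": depth,
                       "width": width, "attn": attn["type"]})
    return levels


self_attns = [
    {"type": "shifted-window", "d_head": 64, "window_size": 8},
    {"type": "shifted-window", "d_head": 64, "window_size": 8},
    {"type": "global", "d_head": 64},
]
levels = describe_levels((256, 256), [4, 4], [2, 2, 4], [192, 384, 768], self_attns)
order = layer_order([2, 2, 4])
```

For a 256x256 image this gives token grids of 64x64, 32x32, and 16x16, so a window size of 8 is valid at both shifted-window levels.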
Inference
TODO: write this section
Installation
k-diffusion can be installed via PyPI (pip install k-diffusion) but it will not include training and inference scripts, only library code that others can depend on. To run the training and inference scripts, clone this repository and run pip install -e <path to repository>.
Training
To train models:
$ ./train.py --config CONFIG_FILE --name RUN_NAME
For instance, to train a model on MNIST:
$ ./train.py --config configs/config_mnist_transformer.json --name RUN_NAME
The configuration file allows you to specify the dataset type. Currently supported types are "imagefolder" (finds all images in that folder and its subfolders, recursively), "cifar10" (CIFAR-10), and "mnist" (MNIST). The "huggingface" type (Hugging Face Datasets) is also supported.
Multi-GPU and multi-node training is supported with Hugging Face Accelerate. You can configure Accelerate by running:
$ accelerate config
then running:
$ accelerate launch train.py --config CONFIG_FILE --name RUN_NAME
Enhancements/additional features
- k-diffusion supports a highly efficient hierarchical transformer model type.
- k-diffusion supports a soft version of Min-SNR loss weighting for improved training at high resolutions with fewer hyperparameters than the loss weighting used in Karras et al. (2022).
- k-diffusion has wrappers for v-diffusion-pytorch, OpenAI diffusion, and CompVis diffusion models allowing them to be used with its samplers and ODE/SDE.
- k-diffusion implements DPM-Solver, which produces higher quality samples at the same number of function evaluations as Karras Algorithm 2, as well as supporting adaptive step size control. DPM-Solver++(2S) and (2M) are implemented now too for improved quality with low numbers of steps.
- k-diffusion supports CLIP guided sampling from unconditional diffusion models (see sample_clip_guided.py).
- k-diffusion supports log likelihood calculation (not a variational lower bound) for native models and all wrapped models.
- k-diffusion can calculate, during training, the FID and KID vs the training set.
- k-diffusion can calculate, during training, the gradient noise scale (1 / SNR), from An Empirical Model of Large-Batch Training (https://arxiv.org/abs/1812.06162).
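As a rough illustration of the last item, the "simple" gradient noise scale from An Empirical Model of Large-Batch Training can be estimated from squared gradient norms measured at two batch sizes. The sketch below follows the unbiased estimators given in that paper's appendix. It is not taken from k-diffusion's code, and the function name and example numbers are assumptions.

```python
def simple_noise_scale(sq_norm_small, sq_norm_big, b_small, b_big):
    """Estimate tr(Sigma) / |G|^2 from squared gradient norms at two batch sizes."""
    # Unbiased estimate of the squared norm of the true gradient
    g_sq = (b_big * sq_norm_big - b_small * sq_norm_small) / (b_big - b_small)
    # Unbiased estimate of the trace of the per-example gradient covariance
    trace = (sq_norm_small - sq_norm_big) / (1 / b_small - 1 / b_big)
    return trace / g_sq


# Synthetic check: with |G|^2 = 4 and tr(Sigma) = 8, the expected squared norm
# of a batch gradient is 4 + 8 / B, which gives 8 at B = 2 and 5 at B = 8
scale = simple_noise_scale(8.0, 5.0, b_small=2, b_big=8)
```

In practice both squared norms are noisy, so they are usually averaged over many steps before the ratio is taken.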
To do
- Latent diffusion