Hotshot-XL
✨ Hotshot-XL: State-of-the-art AI text-to-GIF model trained to work alongside Stable Diffusion XL
Top Related Projects
High-Resolution Image Synthesis with Latent Diffusion Models
Generative Models by Stability AI
Let us control diffusion models!
Stable Diffusion web UI
Quick Overview
Hotshot-XL is an open-source AI model for generating GIFs from text prompts. It is trained to work alongside Stable Diffusion XL (SDXL), adding temporal layers on top of the SDXL architecture so that any fine-tuned SDXL model can be used for high-quality text-to-GIF generation.
Pros
- Open-source and freely available for use and modification
- Efficient performance, potentially requiring less computational resources than some alternatives
- Capable of generating high-quality images from text descriptions
- Actively maintained and developed by the community
Cons
- May require significant computational resources for optimal performance
- Documentation and usage instructions could be more comprehensive
- Limited pre-trained models available compared to some commercial alternatives
- Potential learning curve for users new to text-to-image generation models
Code Examples
```python
# Illustrative diffusers-style sketch. Hotshot-XL is a text-to-GIF model;
# loading it through the generic DiffusionPipeline as shown here may require
# the custom pipeline code from the Hotshot-XL repository. The officially
# supported entry point is the repo's inference.py (see Getting Started).
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("hotshotco/Hotshot-XL", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A serene landscape with mountains and a lake at sunset"
image = pipe(prompt).images[0]
image.save("generated_landscape.png")
```
This sketch loads the Hotshot-XL checkpoint, generates output from a text prompt, and saves the result.
```python
# Same illustrative setup as above, adding a negative prompt and a custom
# number of inference steps for more control over the result.
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("hotshotco/Hotshot-XL", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A futuristic cityscape with flying cars"
negative_prompt = "old, rundown, dirty"
image = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=50).images[0]
image.save("futuristic_city.png")
```
This sketch demonstrates using a negative prompt and adjusting the number of inference steps for more control over the output.
```python
# Same illustrative setup, specifying custom dimensions and a guidance scale.
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("hotshotco/Hotshot-XL", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A portrait of a cyberpunk character"
image = pipe(prompt, height=768, width=512, guidance_scale=7.5).images[0]
image.save("cyberpunk_portrait.png")
```
This sketch shows how to specify custom dimensions and adjust the guidance scale for the generated output.
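Because Hotshot-XL is a text-to-GIF model, a typical result is a sequence of frames rather than a single image. As a hedged illustration, here is a minimal Pillow sketch for stitching a list of PIL frames (the `frames` variable is a hypothetical stand-in for pipeline output) into an animated GIF at the model's native 8 FPS:

```python
# Minimal Pillow sketch (assumptions noted above): save a list of PIL frames
# as an animated GIF. `frames` is hypothetical pipeline output.
from PIL import Image

def save_gif(frames: list, path: str, fps: int = 8) -> None:
    frames[0].save(
        path,
        save_all=True,
        append_images=frames[1:],
        duration=int(1000 / fps),  # per-frame display time in milliseconds
        loop=0,                    # 0 = loop forever
    )

# save_gif(frames, "output.gif")
```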
Getting Started
To get started with Hotshot-XL:
1. Install the required dependencies:

```bash
pip install diffusers transformers accelerate
```

2. Generate from a prompt using the same illustrative diffusers-style sketch as above (the repo's `inference.py` is the officially supported entry point):

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("hotshotco/Hotshot-XL", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Your text prompt here"
image = pipe(prompt).images[0]
image.save("output_image.png")
```

3. Experiment with different prompts and parameters to achieve the desired results.
Competitor Comparisons
High-Resolution Image Synthesis with Latent Diffusion Models
Pros of Stable Diffusion
- More established and widely adopted in the AI image generation community
- Extensive documentation and community support
- Broader range of pre-trained models and fine-tuning options
Cons of Stable Diffusion
- Larger model size, requiring more computational resources
- Potentially slower inference time for image generation
- More complex setup and configuration process
Code Comparison
Stable Diffusion:
```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe("A beautiful sunset over the ocean").images[0]
image.save("sunset.png")
```
Hotshot-XL:
```python
# Illustrative pseudocode: there is no `hotshotco` Python package; actual
# usage goes through the repo's inference.py script.
from hotshotco import HotshotXL

model = HotshotXL.from_pretrained("hotshotco/Hotshot-XL")
image = model.generate("A beautiful sunset over the ocean")
image.save("sunset.png")
```
Both repositories offer powerful image generation capabilities, but Stable Diffusion has a more established ecosystem and wider range of options. Hotshot-XL, being newer, may offer improved performance in certain scenarios but with less extensive documentation and community support.
Generative Models by Stability AI
Pros of generative-models
- More comprehensive, covering multiple generative AI models
- Better documentation and examples for various use cases
- Active development with frequent updates and contributions
Cons of generative-models
- Larger codebase, potentially more complex to navigate
- May require more computational resources due to multiple models
Code Comparison
Hotshot-XL:
```python
# Illustrative pseudocode: there is no `hotshotco` Python package; actual
# usage goes through the repo's inference.py script.
from hotshotco import HotshotXL

model = HotshotXL.from_pretrained("hotshotco/Hotshot-XL")
output = model.generate("A futuristic cityscape")
```
generative-models:
```python
# Approximate sketch of the sgm inference API; see sgm/inference/api.py in
# the generative-models repo for the exact constructor and method signatures.
from sgm.inference.api import SamplingPipeline
from sgm.inference.helpers import embed_watermark

model = SamplingPipeline.from_pretrained("stability-ai/sdxl")
images = model.text_to_image("A futuristic cityscape")
watermarked_images = embed_watermark(images)
```
Both repositories offer powerful generative AI capabilities, but generative-models provides a broader range of models and more extensive documentation. Hotshot-XL focuses specifically on its namesake model, potentially offering a more streamlined experience for users interested in that particular implementation. The code examples demonstrate the different approaches, with generative-models offering additional features like watermarking.
Let us control diffusion models!
Pros of ControlNet
- More versatile, supporting various conditioning types (edges, depth, pose, etc.)
- Extensive documentation and examples for different use cases
- Larger community and more frequent updates
Cons of ControlNet
- Requires more computational resources due to its complexity
- Steeper learning curve for beginners
- Less focused on video-specific applications
Code Comparison
ControlNet:
```python
# Condensed from ControlNet's gradio demos (e.g. gradio_canny2image.py);
# the imports for create_model and CannyDetector are filled in from that
# repo, and load_image stands in for whatever image-loading helper you use.
from share import *
import config
from cldm.model import create_model
from annotator.canny import CannyDetector

model = create_model('./models/control_sd15_canny.pth')
processor = CannyDetector()
input_image = load_image(image_path)  # placeholder helper
detected_map = processor(input_image)
result = model(detected_map, prompt)
```
Hotshot-XL:
```python
# Illustrative pseudocode: there is no `hotshotco` Python package; actual
# GIF generation goes through the repo's inference.py script.
from hotshotco import HotshotXL

model = HotshotXL.from_pretrained("hotshotco/Hotshot-XL")
video = model.generate(prompt="A cat playing piano", num_frames=16)
video.save("cat_piano.mp4")
```
The code snippets highlight the different focus areas of each project. ControlNet emphasizes image processing and conditioning, while Hotshot-XL is tailored for video generation with a simpler API.
Stable Diffusion web UI
Pros of stable-diffusion-webui
- More extensive feature set and customization options
- Larger community and ecosystem of extensions
- Supports multiple models and architectures
Cons of stable-diffusion-webui
- Steeper learning curve for beginners
- Requires more computational resources
- Setup process can be more complex
Code Comparison
Hotshot-XL (Python):
```python
# Illustrative pseudocode: there is no `hotshotco` Python package; actual
# usage goes through the repo's inference.py script.
from hotshotco import HotshotXL

model = HotshotXL()
video = model.generate("A cat playing piano")
```
stable-diffusion-webui (Python):
```python
import modules.scripts as scripts
import gradio as gr

class Script(scripts.Script):
    def title(self):
        return "Custom Script"

    def ui(self, is_img2img):
        return [gr.Textbox(label="Prompt")]
```
The code snippets demonstrate the simplicity of Hotshot-XL's API for video generation, while stable-diffusion-webui shows a more complex structure for creating custom scripts and user interfaces.
stable-diffusion-webui offers greater flexibility and customization but requires more code to implement features. Hotshot-XL provides a more straightforward approach for specific video generation tasks.
Hotshot-XL
Try it | Model card | Discord
Hotshot-XL is an AI text-to-GIF model trained to work alongside Stable Diffusion XL.
Hotshot-XL can generate GIFs with any fine-tuned SDXL model. This means two things:
- You'll be able to make GIFs with any existing or newly fine-tuned SDXL model you may want to use.
- If you'd like to make GIFs of personalized subjects, you can load your own SDXL-based LORAs and not have to worry about fine-tuning Hotshot-XL. This is awesome because it's usually much easier to find suitable images for training data than it is to find videos. It also hopefully fits into everyone's existing LORA usage/workflows :) See more here.
Hotshot-XL is compatible with SDXL ControlNet to make GIFs in the composition/layout you'd like. See the ControlNet section below.
Hotshot-XL was trained to generate 1 second GIFs at 8 FPS.
Hotshot-XL was trained on various aspect ratios. For best results with the base Hotshot-XL model, we recommend using it with an SDXL model that has been fine-tuned with 512x512 images. You can find an SDXL model we fine-tuned for 512x512 resolutions here.
Try It
Try Hotshot-XL yourself here: https://www.hotshot.co
Or, if you'd like to run Hotshot-XL yourself locally, continue on to the sections below.
If you're running Hotshot-XL yourself, you will have a lot more flexibility and control over the model. As a very simple example, you'll be able to change the sampler. We've seen the best results with Euler-A so far, but you may find interesting results with other samplers.
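If you are experimenting outside of `inference.py`, here is a minimal, hypothetical sketch of what a sampler swap can look like in a diffusers-style workflow. It assumes the checkpoint loads through the generic `DiffusionPipeline` and exposes a standard `scheduler` attribute, which may not hold for the actual Hotshot-XL pipeline:

```python
# Hypothetical sketch: swapping the sampler on a diffusers-style pipeline.
# Assumes the Hotshot-XL checkpoint loads via DiffusionPipeline and exposes
# a standard `scheduler` attribute; the supported path is inference.py.
import torch
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = DiffusionPipeline.from_pretrained(
    "hotshotco/Hotshot-XL", torch_dtype=torch.float16
).to("cuda")

# Replace the default scheduler with Euler Ancestral ("Euler-A"),
# reusing the existing scheduler configuration.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
```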
Setup
Environment Setup
```bash
pip install virtualenv --upgrade
virtualenv -p $(which python3) venv
source venv/bin/activate
pip install -r requirements.txt
```
Download the Hotshot-XL Weights
```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/hotshotco/Hotshot-XL
```
or visit https://huggingface.co/hotshotco/Hotshot-XL
Download our fine-tuned SDXL model (or BYOSDXL)
- Note: To maximize data and training efficiency, Hotshot-XL was trained at various aspect ratios around 512x512 resolution. For best results with the base Hotshot-XL model, we recommend using it with an SDXL model that has been fine-tuned with images around the 512x512 resolution. You can download an SDXL model we trained with images at 512x512 resolution below, or bring your own SDXL base model.
```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/hotshotco/SDXL-512
```
or visit https://huggingface.co/hotshotco/SDXL-512
Inference
Text-to-GIF
```bash
python inference.py \
  --prompt="a bulldog in the captains chair of a spaceship, hd, high quality" \
  --output="output.gif"
```
What to Expect:
(The original README embeds example GIF outputs here for the prompts "Sasquatch scuba diving", "a camel smoking a cigarette", "Ronald McDonald sitting at a vanity mirror putting on lipstick", and "drake licking his lips and staring through a window at a cupcake".)
Text-to-GIF with personalized LORAs
```bash
python inference.py \
  --prompt="a bulldog in the captains chair of a spaceship, hd, high quality" \
  --output="output.gif" \
  --spatial_unet_base="path/to/stabilityai/stable-diffusion-xl-base-1.0/unet" \
  --lora="path/to/lora"
```
What to Expect:
Note: The outputs below use the DDIMScheduler.
(The original README embeds example GIF outputs here for the prompts "sks person screaming at a capri sun", "sks person kissing kermit the frog", and "sks person wearing a tuxedo holding up a glass of champagne, fireworks in background, hd, high quality, 4K".)
Text-to-GIF with ControlNet
```bash
python inference.py \
  --prompt="a girl jumping up and down and pumping her fist, hd, high quality" \
  --output="output.gif" \
  --control_type="depth" \
  --gif="https://media1.giphy.com/media/v1.Y2lkPTc5MGI3NjExbXNneXJicG1mOHJ2dzQ2Y2JteDY1ZWlrdjNjMjl3ZWxyeWFxY2EzdyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/YOTAoXBgMCmFeQQzuZ/giphy.gif"
```
By default, Hotshot-XL will create key frames from your source GIF by sampling 8 equally spaced frames and cropping them to the default aspect ratio. For finer-grained control, learn how to vary aspect ratios and vary frame rates/lengths.
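To make that keyframing behavior concrete, here is a minimal Pillow-based sketch of sampling 8 equally spaced frames from a source GIF. It illustrates the idea only and is not the repo's actual implementation:

```python
# Illustrative sketch (not the repo's implementation): sample num_frames
# equally spaced key frames from a source GIF, as described above.
from PIL import Image, ImageSequence

def extract_keyframes(gif_path: str, num_frames: int = 8):
    gif = Image.open(gif_path)
    frames = [frame.convert("RGB") for frame in ImageSequence.Iterator(gif)]
    if len(frames) <= num_frames:
        return frames
    # Evenly spaced indices across the clip, including first and last frame.
    indices = [round(i * (len(frames) - 1) / (num_frames - 1))
               for i in range(num_frames)]
    return [frames[i] for i in indices]

keyframes = extract_keyframes("source.gif")
```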
Hotshot-XL currently supports the use of one ControlNet model at a time; supporting Multi-ControlNet would be exciting.
What to Expect:
(The original README embeds example outputs and the corresponding control GIFs here for the prompts "pixar style girl putting two thumbs up, happy, high quality, 8k, 3d, animated disney render", "keanu reaves holding a sign that says \"HELP\", hd, high quality", "a woman laughing, hd, high quality", and "barack obama making a rainbow with their hands, the word \"MAGIC\" in front of them, wearing a blue and white striped hoodie, hd, high quality".)
Varying Aspect Ratios
- Note: The base SDXL model is trained to best create images around 1024x1024 resolution. To maximize data and training efficiency, Hotshot-XL was trained at aspect ratios around 512x512 resolution. Please see Additional Notes for a list of aspect ratios the base Hotshot-XL model was trained with.
Like SDXL, Hotshot-XL was trained at various aspect ratios with aspect ratio bucketing, and includes support for SDXL parameters like target-size and original-size. This means you can create GIFs at several different aspect ratios and resolutions, just with the base Hotshot-XL model.
```bash
python inference.py \
  --prompt="a bulldog in the captains chair of a spaceship, hd, high quality" \
  --output="output.gif" \
  --width=<WIDTH> \
  --height=<HEIGHT>
```
What to Expect:
(The original README shows the prompt "a monkey playing guitar, nature footage, hd, high quality" rendered at 512x512, 672x384, and 384x672.)
Varying frame rates & lengths (Experimental)
By default, Hotshot-XL is trained to generate GIFs that are 1 second long at 8 FPS. If you'd like to play with generating GIFs at other frame rates and lengths, you can try out the `video_length` and `video_duration` parameters.

`video_length` sets the number of frames; the default value is 8. `video_duration` sets the runtime of the output GIF in milliseconds; the default value is 1000. Together they determine the effective frame rate, `video_length / (video_duration / 1000)` FPS, so the example below (16 frames over 2000 ms) still plays at 8 FPS.
Please note that you should expect unstable/"jittery" results when modifying these parameters as the model was only trained with 1s videos @ 8fps. You'll be able to improve the stability of results for different time lengths and frame rates by fine-tuning Hotshot-XL. Please let us know if you do!
```bash
python inference.py \
  --prompt="a bulldog in the captains chair of a spaceship, hd, high quality" \
  --output="output.gif" \
  --video_length=16 \
  --video_duration=2000
```
Spatial Layers Only
Hotshot-XL is trained to generate GIFs alongside SDXL. If you'd like to generate just an image, you can simply set `video_length=1` in your inference call and the Hotshot-XL temporal layers will be ignored, as you'd expect.
```bash
python inference.py \
  --prompt="a bulldog in the captains chair of a spaceship, hd, high quality" \
  --output="output.jpg" \
  --video_length=1
```
Additional Notes
Supported Aspect Ratios
Hotshot-XL was trained at the following aspect ratios; to reliably generate GIFs outside the range of these aspect ratios, you will want to fine-tune Hotshot-XL with videos at the resolution of your desired aspect ratio.
Aspect Ratio | Size |
---|---|
0.42 | 320 x 768 |
0.57 | 384 x 672 |
0.68 | 416 x 608 |
1.00 | 512 x 512 |
1.46 | 608 x 416 |
1.75 | 672 x 384 |
2.40 | 768 x 320 |
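If you want to stay inside these trained buckets, a small hypothetical helper like the following (not part of the repo) can snap a requested size to the nearest one:

```python
# Hypothetical helper (not part of the repo): snap a requested size to the
# nearest aspect-ratio bucket Hotshot-XL was trained with (table above).
TRAINED_SIZES = [
    (320, 768), (384, 672), (416, 608), (512, 512),
    (608, 416), (672, 384), (768, 320),
]  # (width, height)

def nearest_bucket(width: int, height: int) -> tuple:
    target = width / height
    return min(TRAINED_SIZES, key=lambda wh: abs(wh[0] / wh[1] - target))

print(nearest_bucket(1280, 720))  # -> (672, 384), the closest bucket to 16:9
```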
Fine-Tuning
The following section relates to fine-tuning the Hotshot-XL temporal model with additional text/video pairs. If you're trying to generate GIFs of personalized concepts/subjects, we'd recommend not fine-tuning Hotshot-XL, but instead training your own SDXL-based LORAs and just loading those.
Fine-Tuning Hotshot-XL
Dataset Preparation
The `fine_tune.py` script expects your samples to be structured like this:
```
fine_tune_dataset
├── sample_001
│   ├── 0.jpg
│   ├── 1.jpg
│   ├── 2.jpg
│   ...
│   ├── n.jpg
│   └── prompt.txt
```
Each sample directory should contain your n key frames and a `prompt.txt` file containing the prompt.
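Before launching a run, a quick sanity check of the dataset layout can save a failed job. This is a hypothetical sketch, not part of the repo:

```python
# Hypothetical sanity check (not part of the repo): verify each sample
# directory contains numbered key frames plus a prompt.txt file.
from pathlib import Path

def check_dataset(root: str = "fine_tune_dataset") -> None:
    for sample in sorted(Path(root).iterdir()):
        if not sample.is_dir():
            continue
        frames = sorted(sample.glob("*.jpg"))
        prompt_file = sample / "prompt.txt"
        assert frames, f"{sample}: no key frames found"
        assert prompt_file.exists(), f"{sample}: missing prompt.txt"
        print(f"{sample.name}: {len(frames)} frames, prompt: "
              f"{prompt_file.read_text().strip()[:60]!r}")

check_dataset()
```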
The final checkpoint will be saved to `output_dir`.
We've found it useful to send validation GIFs to Weights & Biases every so often. If you choose to use validation with Weights & Biases, you can set how often it runs with the `validate_every_steps` parameter.
```bash
accelerate launch fine_tune.py \
  --output_dir="<OUTPUT_DIR>" \
  --data_dir="fine_tune_dataset" \
  --report_to="wandb" \
  --run_validation_at_start \
  --resolution=512 \
  --mixed_precision=fp16 \
  --train_batch_size=4 \
  --learning_rate=1.25e-05 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1000 \
  --save_n_steps=20 \
  --validate_every_steps=50 \
  --vae_b16 \
  --gradient_checkpointing \
  --noise_offset=0.05 \
  --snr_gamma \
  --test_prompts="man sits at a table in a cafe, he greets another man with a smile and a handshake"
```
Further work
There are lots of ways we are excited about improving Hotshot-XL. For example:
- Fine-Tuning Hotshot-XL at larger frame rates to create longer/higher frame-rate GIFs
- Fine-Tuning Hotshot-XL at larger resolutions to create higher resolution GIFs
- Training temporal layers for a latent upscaler to produce higher resolution GIFs
- Training an image conditioned "frame prediction" model for more coherent, longer GIFs
- Training temporal layers for a VAE to mitigate flickering/dithering in outputs
- Supporting Multi-ControlNet for greater control over GIF generation
- Training & integrating different ControlNet models for further control over GIF generation (finer facial expression control would be very cool)
- Moving Hotshot-XL into AITemplate for faster inference times
We welcome contributions from the open-source community! Please let us know in the issues or PRs if you're interested in working on these improvements or anything else!
BibTeX
```bibtex
@software{Mullan_Hotshot-XL_2023,
  author  = {Mullan, John and Crawbuck, Duncan and Sastry, Aakash},
  license = {Apache-2.0},
  month   = oct,
  title   = {{Hotshot-XL}},
  url     = {https://github.com/hotshotco/hotshot-xl},
  version = {1.0.0},
  year    = {2023}
}
```
Acknowledgements
Text-to-Video models are improving quickly, and the development of Hotshot-XL has been greatly inspired by a number of amazing prior works and teams.
We hope that releasing this model/codebase helps the community to continue pushing these creative tools forward in an open and responsible way.