Hotshot-XL
✨ Hotshot-XL: State-of-the-art AI text-to-GIF model trained to work alongside Stable Diffusion XL
Top Related Projects
High-Resolution Image Synthesis with Latent Diffusion Models
Generative Models by Stability AI
Let us control diffusion models!
Stable Diffusion web UI
Quick Overview
Hotshot-XL is an open-source AI model for generating GIFs from text prompts. It is trained to work alongside Stable Diffusion XL (SDXL), adding temporal layers on top of the SDXL architecture so that any fine-tuned SDXL model can be used for high-quality text-to-GIF generation.
Pros
- Open-source and freely available for use and modification
- Efficient performance, potentially requiring less computational resources than some alternatives
- Capable of generating high-quality images from text descriptions
- Actively maintained and developed by the community
Cons
- May require significant computational resources for optimal performance
- Documentation and usage instructions could be more comprehensive
- Limited pre-trained models available compared to some commercial alternatives
- Potential learning curve for users new to text-to-image generation models
Code Examples
```python
# Illustrative diffusers-style sketch. Hotshot-XL is a text-to-GIF model;
# loading it through the generic DiffusionPipeline as shown here may require
# the custom pipeline code from the Hotshot-XL repository. The officially
# supported entry point is the repo's inference.py (see Getting Started).
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("hotshotco/Hotshot-XL", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A serene landscape with mountains and a lake at sunset"
image = pipe(prompt).images[0]
image.save("generated_landscape.png")
```
This sketch loads the Hotshot-XL checkpoint, generates output from a text prompt, and saves the result.
```python
# Same illustrative setup as above, adding a negative prompt and a custom
# number of inference steps for more control over the result.
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("hotshotco/Hotshot-XL", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A futuristic cityscape with flying cars"
negative_prompt = "old, rundown, dirty"
image = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=50).images[0]
image.save("futuristic_city.png")
```
This sketch demonstrates using a negative prompt and adjusting the number of inference steps for more control over the output.
```python
# Same illustrative setup, specifying custom dimensions and a guidance scale.
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("hotshotco/Hotshot-XL", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A portrait of a cyberpunk character"
image = pipe(prompt, height=768, width=512, guidance_scale=7.5).images[0]
image.save("cyberpunk_portrait.png")
```
This sketch shows how to specify custom dimensions and adjust the guidance scale for the generated output.
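Because Hotshot-XL is a text-to-GIF model, a typical result is a sequence of frames rather than a single image. As a hedged illustration, here is a minimal Pillow sketch for stitching a list of PIL frames (the `frames` variable is a hypothetical stand-in for pipeline output) into an animated GIF at the model's native 8 FPS:

```python
# Minimal Pillow sketch (assumptions noted above): save a list of PIL frames
# as an animated GIF. `frames` is hypothetical pipeline output.
from PIL import Image

def save_gif(frames: list, path: str, fps: int = 8) -> None:
    frames[0].save(
        path,
        save_all=True,
        append_images=frames[1:],
        duration=int(1000 / fps),  # per-frame display time in milliseconds
        loop=0,                    # 0 = loop forever
    )

# save_gif(frames, "output.gif")
```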
Getting Started
To get started with Hotshot-XL:
1. Install the required dependencies:

```bash
pip install diffusers transformers accelerate
```

2. Generate from a prompt using the same illustrative diffusers-style sketch as above (the repo's `inference.py` is the officially supported entry point):

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("hotshotco/Hotshot-XL", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Your text prompt here"
image = pipe(prompt).images[0]
image.save("output_image.png")
```

3. Experiment with different prompts and parameters to achieve the desired results.
Competitor Comparisons
High-Resolution Image Synthesis with Latent Diffusion Models
Pros of Stable Diffusion
- More established and widely adopted in the AI image generation community
- Extensive documentation and community support
- Broader range of pre-trained models and fine-tuning options
Cons of Stable Diffusion
- Larger model size, requiring more computational resources
- Potentially slower inference time for image generation
- More complex setup and configuration process
Code Comparison
Stable Diffusion:
```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe("A beautiful sunset over the ocean").images[0]
image.save("sunset.png")
```
Hotshot-XL:
```python
# Illustrative pseudocode: there is no `hotshotco` Python package; actual
# usage goes through the repo's inference.py script.
from hotshotco import HotshotXL

model = HotshotXL.from_pretrained("hotshotco/Hotshot-XL")
image = model.generate("A beautiful sunset over the ocean")
image.save("sunset.png")
```
Both repositories offer powerful image generation capabilities, but Stable Diffusion has a more established ecosystem and wider range of options. Hotshot-XL, being newer, may offer improved performance in certain scenarios but with less extensive documentation and community support.
Generative Models by Stability AI
Pros of generative-models
- More comprehensive, covering multiple generative AI models
- Better documentation and examples for various use cases
- Active development with frequent updates and contributions
Cons of generative-models
- Larger codebase, potentially more complex to navigate
- May require more computational resources due to multiple models
Code Comparison
Hotshot-XL:
```python
# Illustrative pseudocode: there is no `hotshotco` Python package; actual
# usage goes through the repo's inference.py script.
from hotshotco import HotshotXL

model = HotshotXL.from_pretrained("hotshotco/Hotshot-XL")
output = model.generate("A futuristic cityscape")
```
generative-models:
```python
# Approximate sketch of the sgm inference API; see sgm/inference/api.py in
# the generative-models repo for the exact constructor and method signatures.
from sgm.inference.api import SamplingPipeline
from sgm.inference.helpers import embed_watermark

model = SamplingPipeline.from_pretrained("stability-ai/sdxl")
images = model.text_to_image("A futuristic cityscape")
watermarked_images = embed_watermark(images)
```
Both repositories offer powerful generative AI capabilities, but generative-models provides a broader range of models and more extensive documentation. Hotshot-XL focuses specifically on its namesake model, potentially offering a more streamlined experience for users interested in that particular implementation. The code examples demonstrate the different approaches, with generative-models offering additional features like watermarking.
Let us control diffusion models!
Pros of ControlNet
- More versatile, supporting various conditioning types (edges, depth, pose, etc.)
- Extensive documentation and examples for different use cases
- Larger community and more frequent updates
Cons of ControlNet
- Requires more computational resources due to its complexity
- Steeper learning curve for beginners
- Less focused on video-specific applications
Code Comparison
ControlNet:
```python
# Condensed from ControlNet's gradio demos (e.g. gradio_canny2image.py);
# the imports for create_model and CannyDetector are filled in from that
# repo, and load_image stands in for whatever image-loading helper you use.
from share import *
import config
from cldm.model import create_model
from annotator.canny import CannyDetector

model = create_model('./models/control_sd15_canny.pth')
processor = CannyDetector()
input_image = load_image(image_path)  # placeholder helper
detected_map = processor(input_image)
result = model(detected_map, prompt)
```
Hotshot-XL:
```python
# Illustrative pseudocode: there is no `hotshotco` Python package; actual
# GIF generation goes through the repo's inference.py script.
from hotshotco import HotshotXL

model = HotshotXL.from_pretrained("hotshotco/Hotshot-XL")
video = model.generate(prompt="A cat playing piano", num_frames=16)
video.save("cat_piano.mp4")
```
The code snippets highlight the different focus areas of each project. ControlNet emphasizes image processing and conditioning, while Hotshot-XL is tailored for video generation with a simpler API.
Stable Diffusion web UI
Pros of stable-diffusion-webui
- More extensive feature set and customization options
- Larger community and ecosystem of extensions
- Supports multiple models and architectures
Cons of stable-diffusion-webui
- Steeper learning curve for beginners
- Requires more computational resources
- Setup process can be more complex
Code Comparison
Hotshot-XL (Python):
```python
# Illustrative pseudocode: there is no `hotshotco` Python package; actual
# usage goes through the repo's inference.py script.
from hotshotco import HotshotXL

model = HotshotXL()
video = model.generate("A cat playing piano")
```
stable-diffusion-webui (Python):
```python
import modules.scripts as scripts
import gradio as gr

class Script(scripts.Script):
    def title(self):
        return "Custom Script"

    def ui(self, is_img2img):
        return [gr.Textbox(label="Prompt")]
```
The code snippets demonstrate the simplicity of Hotshot-XL's API for video generation, while stable-diffusion-webui shows a more complex structure for creating custom scripts and user interfaces.
stable-diffusion-webui offers greater flexibility and customization but requires more code to implement features. Hotshot-XL provides a more straightforward approach for specific video generation tasks.
Hotshot-XL
Try it | Model card | Discord
Hotshot-XL is an AI text-to-GIF model trained to work alongside Stable Diffusion XL.
Hotshot-XL can generate GIFs with any fine-tuned SDXL model. This means two things:
- You'll be able to make GIFs with any existing or newly fine-tuned SDXL model you may want to use.
- If you'd like to make GIFs of personalized subjects, you can load your own SDXL-based LORAs and not have to worry about fine-tuning Hotshot-XL. This is awesome because it's usually much easier to find suitable images for training data than it is to find videos. It also hopefully fits into everyone's existing LORA usage/workflows :) See more here.
Hotshot-XL is compatible with SDXL ControlNet to make GIFs in the composition/layout you'd like. See the ControlNet section below.
Hotshot-XL was trained to generate 1 second GIFs at 8 FPS.
Hotshot-XL was trained on various aspect ratios. For best results with the base Hotshot-XL model, we recommend using it with an SDXL model that has been fine-tuned with 512x512 images. You can find an SDXL model we fine-tuned for 512x512 resolutions here.
Try It
Try Hotshot-XL yourself here: https://www.hotshot.co
Or, if you'd like to run Hotshot-XL yourself locally, continue on to the sections below.
If you're running Hotshot-XL yourself, you will have a lot more flexibility and control over the model. As a very simple example, you'll be able to change the sampler. We've seen the best results with Euler-A so far, but you may find interesting results with other samplers.
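If you are experimenting outside of `inference.py`, here is a minimal, hypothetical sketch of what a sampler swap can look like in a diffusers-style workflow. It assumes the checkpoint loads through the generic `DiffusionPipeline` and exposes a standard `scheduler` attribute, which may not hold for the actual Hotshot-XL pipeline:

```python
# Hypothetical sketch: swapping the sampler on a diffusers-style pipeline.
# Assumes the Hotshot-XL checkpoint loads via DiffusionPipeline and exposes
# a standard `scheduler` attribute; the supported path is inference.py.
import torch
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = DiffusionPipeline.from_pretrained(
    "hotshotco/Hotshot-XL", torch_dtype=torch.float16
).to("cuda")

# Replace the default scheduler with Euler Ancestral ("Euler-A"),
# reusing the existing scheduler configuration.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
```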
Setup
Environment Setup
```bash
pip install virtualenv --upgrade
virtualenv -p $(which python3) venv
source venv/bin/activate
pip install -r requirements.txt
```
Download the Hotshot-XL Weights
```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/hotshotco/Hotshot-XL
```
or visit https://huggingface.co/hotshotco/Hotshot-XL
Download our fine-tuned SDXL model (or BYOSDXL)
- Note: To maximize data and training efficiency, Hotshot-XL was trained at various aspect ratios around 512x512 resolution. For best results with the base Hotshot-XL model, we recommend using it with an SDXL model that has been fine-tuned with images around the 512x512 resolution. You can download an SDXL model we trained with images at 512x512 resolution below, or bring your own SDXL base model.
```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/hotshotco/SDXL-512
```
or visit https://huggingface.co/hotshotco/SDXL-512
Inference
Text-to-GIF
```bash
python inference.py \
  --prompt="a bulldog in the captains chair of a spaceship, hd, high quality" \
  --output="output.gif"
```
What to Expect:
(The original README embeds example GIF outputs here for the prompts "Sasquatch scuba diving", "a camel smoking a cigarette", "Ronald McDonald sitting at a vanity mirror putting on lipstick", and "drake licking his lips and staring through a window at a cupcake".)
Text-to-GIF with personalized LORAs
```bash
python inference.py \
  --prompt="a bulldog in the captains chair of a spaceship, hd, high quality" \
  --output="output.gif" \
  --spatial_unet_base="path/to/stabilityai/stable-diffusion-xl-base-1.0/unet" \
  --lora="path/to/lora"
```
What to Expect:
Note: The outputs below use the DDIMScheduler.
(The original README embeds example GIF outputs here for the prompts "sks person screaming at a capri sun", "sks person kissing kermit the frog", and "sks person wearing a tuxedo holding up a glass of champagne, fireworks in background, hd, high quality, 4K".)
Text-to-GIF with ControlNet
```bash
python inference.py \
  --prompt="a girl jumping up and down and pumping her fist, hd, high quality" \
  --output="output.gif" \
  --control_type="depth" \
  --gif="https://media1.giphy.com/media/v1.Y2lkPTc5MGI3NjExbXNneXJicG1mOHJ2dzQ2Y2JteDY1ZWlrdjNjMjl3ZWxyeWFxY2EzdyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/YOTAoXBgMCmFeQQzuZ/giphy.gif"
```
By default, Hotshot-XL will create key frames from your source GIF by sampling 8 equally spaced frames and cropping them to the default aspect ratio. For finer-grained control, learn how to vary aspect ratios and vary frame rates/lengths.
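To make that keyframing behavior concrete, here is a minimal Pillow-based sketch of sampling 8 equally spaced frames from a source GIF. It illustrates the idea only and is not the repo's actual implementation:

```python
# Illustrative sketch (not the repo's implementation): sample num_frames
# equally spaced key frames from a source GIF, as described above.
from PIL import Image, ImageSequence

def extract_keyframes(gif_path: str, num_frames: int = 8):
    gif = Image.open(gif_path)
    frames = [frame.convert("RGB") for frame in ImageSequence.Iterator(gif)]
    if len(frames) <= num_frames:
        return frames
    # Evenly spaced indices across the clip, including first and last frame.
    indices = [round(i * (len(frames) - 1) / (num_frames - 1))
               for i in range(num_frames)]
    return [frames[i] for i in indices]

keyframes = extract_keyframes("source.gif")
```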
Hotshot-XL currently supports the use of one ControlNet model at a time; supporting Multi-ControlNet would be exciting.
What to Expect:
(The original README embeds example outputs and the corresponding control GIFs here for the prompts "pixar style girl putting two thumbs up, happy, high quality, 8k, 3d, animated disney render", "keanu reaves holding a sign that says \"HELP\", hd, high quality", "a woman laughing, hd, high quality", and "barack obama making a rainbow with their hands, the word \"MAGIC\" in front of them, wearing a blue and white striped hoodie, hd, high quality".)
Varying Aspect Ratios
- Note: The base SDXL model is trained to best create images around 1024x1024 resolution. To maximize data and training efficiency, Hotshot-XL was trained at aspect ratios around 512x512 resolution. Please see Additional Notes for a list of aspect ratios the base Hotshot-XL model was trained with.
Like SDXL, Hotshot-XL was trained at various aspect ratios with aspect ratio bucketing, and includes support for SDXL parameters like target-size and original-size. This means you can create GIFs at several different aspect ratios and resolutions, just with the base Hotshot-XL model.
```bash
python inference.py \
  --prompt="a bulldog in the captains chair of a spaceship, hd, high quality" \
  --output="output.gif" \
  --width=<WIDTH> \
  --height=<HEIGHT>
```
What to Expect:
(The original README shows the prompt "a monkey playing guitar, nature footage, hd, high quality" rendered at 512x512, 672x384, and 384x672.)
Varying frame rates & lengths (Experimental)
By default, Hotshot-XL is trained to generate GIFs that are 1 second long at 8 FPS. If you'd like to play with generating GIFs at other frame rates and lengths, you can try out the `video_length` and `video_duration` parameters.

`video_length` sets the number of frames; the default value is 8. `video_duration` sets the runtime of the output GIF in milliseconds; the default value is 1000. Together they determine the effective frame rate, `video_length / (video_duration / 1000)` FPS, so the example below (16 frames over 2000 ms) still plays at 8 FPS.
Please note that you should expect unstable/"jittery" results when modifying these parameters as the model was only trained with 1s videos @ 8fps. You'll be able to improve the stability of results for different time lengths and frame rates by fine-tuning Hotshot-XL. Please let us know if you do!
```bash
python inference.py \
  --prompt="a bulldog in the captains chair of a spaceship, hd, high quality" \
  --output="output.gif" \
  --video_length=16 \
  --video_duration=2000
```
Spatial Layers Only
Hotshot-XL is trained to generate GIFs alongside SDXL. If you'd like to generate just an image, you can simply set `video_length=1` in your inference call and the Hotshot-XL temporal layers will be ignored, as you'd expect.
```bash
python inference.py \
  --prompt="a bulldog in the captains chair of a spaceship, hd, high quality" \
  --output="output.jpg" \
  --video_length=1
```
Additional Notes
Supported Aspect Ratios
Hotshot-XL was trained at the following aspect ratios; to reliably generate GIFs outside the range of these aspect ratios, you will want to fine-tune Hotshot-XL with videos at the resolution of your desired aspect ratio.
Aspect Ratio | Size |
---|---|
0.42 | 320 x 768 |
0.57 | 384 x 672 |
0.68 | 416 x 608 |
1.00 | 512 x 512 |
1.46 | 608 x 416 |
1.75 | 672 x 384 |
2.40 | 768 x 320 |
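If you want to stay inside these trained buckets, a small hypothetical helper like the following (not part of the repo) can snap a requested size to the nearest one:

```python
# Hypothetical helper (not part of the repo): snap a requested size to the
# nearest aspect-ratio bucket Hotshot-XL was trained with (table above).
TRAINED_SIZES = [
    (320, 768), (384, 672), (416, 608), (512, 512),
    (608, 416), (672, 384), (768, 320),
]  # (width, height)

def nearest_bucket(width: int, height: int) -> tuple:
    target = width / height
    return min(TRAINED_SIZES, key=lambda wh: abs(wh[0] / wh[1] - target))

print(nearest_bucket(1280, 720))  # -> (672, 384), the closest bucket to 16:9
```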
Fine-Tuning
The following section relates to fine-tuning the Hotshot-XL temporal model with additional text/video pairs. If you're trying to generate GIFs of personalized concepts/subjects, we'd recommend not fine-tuning Hotshot-XL, but instead training your own SDXL-based LORAs and just loading those.
Fine-Tuning Hotshot-XL
Dataset Preparation
The `fine_tune.py` script expects your samples to be structured like this:
```
fine_tune_dataset
├── sample_001
│   ├── 0.jpg
│   ├── 1.jpg
│   ├── 2.jpg
│   ...
│   ├── n.jpg
│   └── prompt.txt
```
Each sample directory should contain your n key frames and a `prompt.txt` file containing the prompt.
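Before launching a run, a quick sanity check of the dataset layout can save a failed job. This is a hypothetical sketch, not part of the repo:

```python
# Hypothetical sanity check (not part of the repo): verify each sample
# directory contains numbered key frames plus a prompt.txt file.
from pathlib import Path

def check_dataset(root: str = "fine_tune_dataset") -> None:
    for sample in sorted(Path(root).iterdir()):
        if not sample.is_dir():
            continue
        frames = sorted(sample.glob("*.jpg"))
        prompt_file = sample / "prompt.txt"
        assert frames, f"{sample}: no key frames found"
        assert prompt_file.exists(), f"{sample}: missing prompt.txt"
        print(f"{sample.name}: {len(frames)} frames, prompt: "
              f"{prompt_file.read_text().strip()[:60]!r}")

check_dataset()
```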
The final checkpoint will be saved to `output_dir`.
We've found it useful to send validation GIFs to Weights & Biases every so often. If you choose to use validation with Weights & Biases, you can set how often it runs with the `validate_every_steps` parameter.
```bash
accelerate launch fine_tune.py \
  --output_dir="<OUTPUT_DIR>" \
  --data_dir="fine_tune_dataset" \
  --report_to="wandb" \
  --run_validation_at_start \
  --resolution=512 \
  --mixed_precision=fp16 \
  --train_batch_size=4 \
  --learning_rate=1.25e-05 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1000 \
  --save_n_steps=20 \
  --validate_every_steps=50 \
  --vae_b16 \
  --gradient_checkpointing \
  --noise_offset=0.05 \
  --snr_gamma \
  --test_prompts="man sits at a table in a cafe, he greets another man with a smile and a handshake"
```
Further work
There are lots of ways we are excited about improving Hotshot-XL. For example:
- Fine-Tuning Hotshot-XL at larger frame rates to create longer/higher frame-rate GIFs
- Fine-Tuning Hotshot-XL at larger resolutions to create higher resolution GIFs
- Training temporal layers for a latent upscaler to produce higher resolution GIFs
- Training an image conditioned "frame prediction" model for more coherent, longer GIFs
- Training temporal layers for a VAE to mitigate flickering/dithering in outputs
- Supporting Multi-ControlNet for greater control over GIF generation
- Training & integrating different ControlNet models for further control over GIF generation (finer facial expression control would be very cool)
- Moving Hotshot-XL into AITemplate for faster inference times
We welcome contributions from the open-source community! Please let us know in the issues or PRs if you're interested in working on these improvements or anything else!
BibTeX
```bibtex
@software{Mullan_Hotshot-XL_2023,
  author  = {Mullan, John and Crawbuck, Duncan and Sastry, Aakash},
  license = {Apache-2.0},
  month   = oct,
  title   = {{Hotshot-XL}},
  url     = {https://github.com/hotshotco/hotshot-xl},
  version = {1.0.0},
  year    = {2023}
}
```
Acknowledgements
Text-to-Video models are improving quickly, and the development of Hotshot-XL has been greatly inspired by a number of amazing prior works and teams.
We hope that releasing this model/codebase helps the community to continue pushing these creative tools forward in an open and responsible way.