Top Related Projects
Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
PyTorch package for the discrete VAE used for DALL·E.
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Quick Overview
The jcjohnson/densecap repository is the reference Torch (Lua) implementation of the DenseCap model, released by the paper's authors. DenseCap is a deep learning approach to dense captioning: it detects salient regions in an image and describes each of them in natural language.
Pros
- Dense, Region-Level Descriptions: The model detects many regions per image and generates a natural-language caption for each, giving a detailed picture of the image content.
- Complete Research Release: The repository ships a pretrained model, code to run it on new images (CPU or GPU), a live webcam demo, evaluation code, and training instructions.
- Fast at Inference: On a powerful GPU the model runs in real time, at up to about 10 frames per second in the webcam demo.
- Trainable on Custom Data: Preprocessing and training scripts are provided, so the model can be retrained or fine-tuned on other region-caption datasets.
Cons
- Computational Complexity: The DenseCap model is computationally complex, requiring significant resources for training and inference, which may limit its deployment in resource-constrained environments.
- Limited Datasets: The model is primarily trained on the Visual Genome dataset, which may not cover all possible real-world scenarios, potentially limiting its performance on diverse datasets.
- Dependency on Lua Torch: The project is built on the Torch (Lua) framework, which is no longer actively developed; users who prefer PyTorch or TensorFlow will need to port or re-implement the model.
- Lack of Comprehensive Documentation: While the project has some documentation, it may not be as comprehensive as some users would prefer, making it more challenging for new users to get started.
Code Examples
DenseCap is implemented in Torch (Lua) and is driven from the command line rather than through a Python package, so the examples below use the scripts that ship with the repository.
Download the pretrained model and run it on a single image:
# Download the pretrained model (about 1.1 GB zipped)
sh scripts/download_pretrained_model.sh
# Run the model on the bundled example image (add -gpu -1 for CPU-only mode)
th run_model.lua -input_image imgs/elephant.jpg
This example downloads the pretrained VGG-16 based model and runs it on the provided elephant.jpg image; results are written into the vis/data folder.
Run the model on an entire directory of images and render output images with the detection boxes and captions baked in:
th run_model.lua -input_dir /path/to/my/image/folder -output_dir /path/to/output/folder/
This example processes every image in the input folder (skipping files whose names start with a dot) and writes annotated copies to the output folder; omit -output_dir to browse the results in the bundled web visualizer instead.
View results in the web-based visualizer by serving the vis directory over HTTP, or run the real-time webcam demo if you have a webcam and a powerful GPU:
cd vis
python -m SimpleHTTPServer 8181
# then open http://localhost:8181/view_results.html
For the webcam demo:
luarocks install camera
luarocks install qtlua
qlua webcam/single_machine_demo.lua
This example shows the two ways to view results: the web visualizer for images processed with run_model.lua, and the single-machine webcam demo, which runs at up to 10 frames per second on a powerful GPU.
Competitor Comparisons
Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
Pros of Detectron2
- Detectron2 is a state-of-the-art object detection and segmentation library, offering advanced features and high-performance models.
- The library is actively maintained and supported by the Facebook AI Research team, ensuring regular updates and improvements.
- Detectron2 provides a wide range of pre-trained models, allowing users to quickly adapt and fine-tune the models for their specific use cases.
Cons of Detectron2
- Detectron2 has a steeper learning curve than DenseCap, as it requires a deeper understanding of object detection and segmentation concepts.
- The library focuses on object detection and segmentation, whereas DenseCap is specialized for dense captioning, which may be more suitable for certain applications.
Code Comparison
DenseCap (driven from the command line via Torch/Lua scripts; the repository does not expose a Python API):
sh scripts/download_pretrained_model.sh
th run_model.lua -input_image imgs/elephant.jpg
Detectron2:
import cv2
import detectron2
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
cfg = get_cfg()
cfg.merge_from_file("path/to/config.yaml")
predictor = DefaultPredictor(cfg)
image = cv2.imread("path/to/image.jpg")
outputs = predictor(image)
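The outputs dictionary holds an Instances object with the model's predictions; assuming a standard detection config, a typical follow-up (not part of the snippet above) reads them like this:
instances = outputs["instances"].to("cpu")
boxes = instances.pred_boxes.tensor    # (N, 4) tensor of XYXY box coordinates
scores = instances.scores              # (N,) detection confidence scores
classes = instances.pred_classes       # (N,) predicted class indices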
PyTorch package for the discrete VAE used for DALL·E.
Pros of DALL-E
- DALL-E is a state-of-the-art text-to-image generation model, capable of producing highly realistic and diverse images from natural language descriptions.
- The model has been trained on a vast dataset, allowing it to generate images across a wide range of topics and styles.
- DALL-E's capabilities have been demonstrated through impressive examples and use cases, showcasing its potential for various applications.
Cons of DALL-E
- The full DALL-E system is proprietary; the openai/DALL-E repository releases only the discrete VAE component, so the complete text-to-image model and its training data are not publicly available.
- The model's performance and limitations are not as well-documented as open-source alternatives, making it more difficult to understand and extend.
- DALL-E's use may be subject to licensing and ethical considerations, as the technology raises concerns about the potential for misuse or unintended consequences.
Code Comparison
DALL-E:
import openai  # legacy openai-python (<1.0) Images API; this calls OpenAI's hosted image service, not the open-source dVAE in the repository
openai.api_key = "your_api_key"
prompt = "A photo of a happy dog wearing a top hat"
response = openai.Image.create(
    prompt=prompt,
    n=1,
    size="1024x1024"
)
image_url = response['data'][0]['url']
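The API returns a URL rather than raw image bytes; to save the image locally you would fetch that URL yourself, for example with the requests library (this step is not part of the original snippet):
import requests
image_bytes = requests.get(image_url, timeout=30).content  # download the generated image
with open("generated.png", "wb") as f:
    f.write(image_bytes)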
DenseCap (command-line usage; the repository is written in Lua Torch and has no Python package):
th run_model.lua -input_image imgs/elephant.jpg -gpu -1  # -gpu -1 selects CPU-only mode
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
Pros of CLIP
- CLIP is a state-of-the-art multimodal model that can perform a wide range of vision and language tasks, including image classification, image-text retrieval, and zero-shot learning.
- CLIP is highly scalable and can be fine-tuned on a wide range of downstream tasks, making it a versatile tool for various applications.
- CLIP has been shown to be robust to distribution shift and can generalize well to unseen data, making it a reliable choice for real-world deployments.
Cons of CLIP
- CLIP is a large and computationally expensive model, which may limit its use in resource-constrained environments.
- CLIP's performance on certain specialized tasks, such as dense captioning, may be inferior to models that are specifically designed for those tasks.
- CLIP's training process and model architecture are not as well-documented as some other open-source models, which may make it more challenging to understand and extend.
Code Comparison
Both snippets below are schematic PyTorch sketches rather than code taken from either repository (DenseCap itself is implemented in Lua Torch):
DenseCap:
import torch
import torch.nn as nn
import torch.nn.functional as F
class DenseCaptionModule(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(DenseCaptionModule, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
CLIP:
import torch
import torch.nn as nn
import torch.nn.functional as F
class CLIP(nn.Module):
    def __init__(self, image_encoder, text_encoder, logit_scale=100.0):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.logit_scale = nn.Parameter(torch.ones([]) * logit_scale)
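A forward pass for such a wrapper would typically L2-normalize both embeddings and return scaled cosine-similarity logits; the method below is a hedged sketch that continues the class above and is not code from the CLIP repository:
    def forward(self, images, texts):
        # Encode each modality and L2-normalize the embeddings
        image_features = F.normalize(self.image_encoder(images), dim=-1)
        text_features = F.normalize(self.text_encoder(texts), dim=-1)
        # Scaled cosine similarities: one row per image, one column per text
        return self.logit_scale * image_features @ text_features.t()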
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Pros of BLIP
- BLIP is a state-of-the-art image-text retrieval model that can generate high-quality image captions.
- BLIP has been trained on a large and diverse dataset, allowing it to handle a wide range of image types and topics.
- BLIP combines a ViT image encoder with a BERT-style text encoder/decoder and bootstraps its training data with a captioner-and-filter (CapFilt) scheme, which has yielded strong results on a range of vision-language tasks.
Cons of BLIP
- BLIP is a relatively new model, and its performance may not be as well-established as some older image captioning models.
- The BLIP repository does not provide as much detailed documentation and examples as the DenseCap repository.
- BLIP may have higher computational requirements than some simpler image captioning models, which could limit its usability on certain hardware.
Code Comparison
DenseCap (command-line usage; the repository is Lua Torch and has no Python data-loading API):
th run_model.lua -input_dir /path/to/my/image/folder -output_dir /path/to/output/folder/
BLIP (schematic sketch; not the repository's actual module layout):
import torch
from blip.model import BLIPModel
from blip.data import BLIPDataset
model = BLIPModel()
dataset = BLIPDataset('path/to/data')
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
    output = model(batch)
    # Process the output
README
DenseCap
This is the code for the paper
DenseCap: Fully Convolutional Localization Networks for Dense Captioning,
Justin Johnson*,
Andrej Karpathy*,
Li Fei-Fei,
(* equal contribution)
Presented at CVPR 2016 (oral)
The paper addresses the problem of dense captioning, where a computer detects objects in images and describes them in natural language.
The model is a deep convolutional neural network trained in an end-to-end fashion on the Visual Genome dataset.
We provide:
- A pretrained model
- Code to run the model on new images, on either CPU or GPU
- Code to run a live demo with a webcam
- Evaluation code for dense captioning
- Instructions for training the model
If you find this code useful in your research, please cite:
@inproceedings{densecap,
  title={DenseCap: Fully Convolutional Localization Networks for Dense Captioning},
  author={Johnson, Justin and Karpathy, Andrej and Fei-Fei, Li},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2016}
}
Installation
DenseCap is implemented in Torch, and depends on the following packages: torch/torch7, torch/nn, torch/nngraph, torch/image, lua-cjson, qassemoquab/stnbhwd, jcjohnson/torch-rnn
After installing torch, you can install / update these dependencies by running the following:
luarocks install torch
luarocks install nn
luarocks install image
luarocks install lua-cjson
luarocks install https://raw.githubusercontent.com/qassemoquab/stnbhwd/master/stnbhwd-scm-1.rockspec
luarocks install https://raw.githubusercontent.com/jcjohnson/torch-rnn/master/torch-rnn-scm-1.rockspec
(Optional) GPU acceleration
If you have an NVIDIA GPU and want to accelerate the model with CUDA, you'll also need to install torch/cutorch and torch/cunn; you can install / update these by running:
luarocks install cutorch
luarocks install cunn
(Optional) cuDNN
If you want to use NVIDIA's cuDNN library, you'll need to register for the CUDA Developer Program (it's free) and download the library from NVIDIA's website; you'll also need to install the cuDNN bindings for Torch by running
luarocks install cudnn
Pretrained model
You can download a pretrained DenseCap model by running the following script:
sh scripts/download_pretrained_model.sh
This will download a zipped version of the model (about 1.1 GB) to data/models/densecap/densecap-pretrained-vgg16.t7.zip, unpack it to data/models/densecap/densecap-pretrained-vgg16.t7 (about 1.2 GB), and then delete the zipped version.
This is not the exact model that was used in the paper, but it has comparable performance; using 1000 region proposals per image, it achieves a mAP of 5.70 on the test set, which is slightly better than the mAP of 5.39 that we report in the paper.
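If you want to sanity-check the download, the file can be loaded from the Torch REPL with torch.load; note that the internal layout of the saved checkpoint (which fields it contains) is an assumption here rather than something documented in this README:
th
th> checkpoint = torch.load('data/models/densecap/densecap-pretrained-vgg16.t7')
th> print(checkpoint)  -- inspect the loaded object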
Running on new images
To run the model on new images, use the script run_model.lua. To run the pretrained model on the provided elephant.jpg image, use the following command:
th run_model.lua -input_image imgs/elephant.jpg
By default this will run in GPU mode; to run in CPU-only mode, simply add the flag -gpu -1.
This command will write results into the folder vis/data. We have provided a web-based visualizer to view these results; to use it, change to the vis directory and start a local HTTP server:
cd vis
python -m SimpleHTTPServer 8181
Then point your web browser to http://localhost:8181/view_results.html.
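The SimpleHTTPServer module only exists in Python 2; on a Python 3 system the equivalent built-in server is:
python3 -m http.server 8181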
If you have an entire directory of images on which you want to run the model, use the -input_dir flag instead:
th run_model.lua -input_dir /path/to/my/image/folder
This runs the model on all files in the folder /path/to/my/image/folder/ whose filenames do not start with a dot.
The web-based visualizer is the preferred way to view results, but if you don't want to use it you can instead render images with the detection boxes and captions "baked in"; add the flag -output_dir to specify a directory where output images should be written:
th run_model.lua -input_dir /path/to/my/image/folder -output_dir /path/to/output/folder/
The run_model.lua script has several other flags; you can find details here.
Training
To train a new DenseCap model, follow these steps (a bare-bones version of the command sequence is sketched below):
- Download the raw images and region descriptions from the Visual Genome website
- Use the script preprocess.py to generate a single HDF5 file containing the entire dataset (details here)
- Use the script train.lua to train the model (details here)
- Use the script evaluate_model.lua to evaluate a trained model on the validation or test data (details here)
For more instructions on training see INSTALL.md in the doc folder.
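Each of these scripts takes flags (dataset paths, checkpoint locations, and so on) that are documented in the doc folder; they are deliberately omitted below, so treat this as a sketch of the order of operations rather than copy-paste commands:
# 1. Build a single HDF5 file from the raw Visual Genome images and region descriptions
python preprocess.py
# 2. Train the model on the preprocessed data
th train.lua
# 3. Evaluate a trained checkpoint on the validation or test split
th evaluate_model.lua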
Evaluation
In the paper we propose a metric for automatically evaluating dense captioning results. Our metric depends on METEOR, and our evaluation code requires both Java and Python 2.7. The following script will download and unpack the METEOR jarfile:
sh scripts/setup_eval.sh
The evaluation code is not required to simply run a trained model on images; you can find more details about the evaluation code here.
Webcam demos
If you have a powerful GPU, then the DenseCap model is fast enough to run in real-time. We provide two demos to allow you to run DenseCap on frames from a webcam.
Single-machine demo
If you have a single machine with both a webcam and a powerful GPU, then you can use this demo to run DenseCap in real time at up to 10 frames per second. This demo depends on a few extra Lua packages, which you can install / update by running the following:
luarocks install camera
luarocks install qtlua
You can start the demo by running the following:
qlua webcam/single_machine_demo.lua
Client / server demo
If you have a machine with a powerful GPU and another machine with a webcam, then this demo allows you to use the GPU machine as a server and the webcam machine as a client; frames will be streamed from the client to the server, the model will run on the server, and predictions will be shipped back to the client for viewing. This allows you to run DenseCap on a laptop, but with network and filesystem overhead you will typically only achieve 1 to 2 frames per second.
The server is written in Flask; on the server machine run the following to install dependencies:
cd webcam
virtualenv .env
source .env/bin/activate
pip install -r requirements.txt
cd ..
For technical reasons, the server needs to serve content over SSL; it expects to find an SSL key file and certificate file in webcam/ssl/server.key and webcam/ssl/server.crt respectively.
You can generate a self-signed SSL certificate by running the following:
mkdir webcam/ssl
# Step 1: Generate a private key
openssl genrsa -des3 -out webcam/ssl/server.key 1024
# Enter a password
# Step 2: Generate a certificate signing request
openssl req -new -key webcam/ssl/server.key -out webcam/ssl/server.csr
# Enter the password from above and leave all other fields blank
# Step 3: Strip the password from the keyfile
cp webcam/ssl/server.key webcam/ssl/server.key.org
openssl rsa -in webcam/ssl/server.key.org -out webcam/ssl/server.key
# Step 4: Generate self-signed certificate
openssl x509 -req -days 365 -in webcam/ssl/server.csr -signkey webcam/ssl/server.key -out webcam/ssl/server.crt
# Enter the password from above
You can now run the following two commands to start the server (in separate terminals, since both will run forever):
th webcam/daemon.lua
python webcam/server.py
On the client, point a web browser at the following page, replacing SERVER_URL with the actual URL of the server:
https://cs.stanford.edu/people/jcjohns/densecap/demo/web-client.html?server_url=SERVER_URL
Note: If the server is using a self-signed SSL certificate, you may need to manually tell your browser that the certificate is safe by pointing your client's web browser directly at the server URL; you will get a message that the site is unsafe, which you will have to click through.
Afterward you should see a message telling you that the DenseCap server is running, and the web client should work after refreshing.