JARVIS
JARVIS, a system to connect LLMs with the ML community. Paper: https://arxiv.org/pdf/2303.17580.pdf
Top Related Projects
Robust Speech Recognition via Large-Scale Weak Supervision
Port of OpenAI's Whisper model in C/C++
Stable Diffusion web UI
A latent text-to-image diffusion model
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Quick Overview
JARVIS (Just A Rather Very Intelligent System) is an open-source framework for building multimodal AI agents. It aims to create AI assistants capable of understanding and generating text, images, and other modalities, as well as interacting with various tools and APIs to accomplish complex tasks.
Pros
- Flexible and extensible architecture for creating custom AI agents
- Supports multiple modalities including text, image, and potentially audio
- Integrates with various external tools and APIs
- Active development and community support from Microsoft
Cons
- Still in early development stages, may have stability issues
- Limited documentation and examples compared to more mature frameworks
- Potential learning curve for developers new to multimodal AI systems
- May require significant computational resources for advanced use cases
Code Examples
# Initialize JARVIS agent
from jarvis import JarvisAgent
agent = JarvisAgent()
# Process multimodal input
text_input = "Describe this image:"
image_input = load_image("example.jpg")  # load_image is a placeholder for your own image-loading helper
response = agent.process(text=text_input, image=image_input)
print(response)
# Use JARVIS with external tools
from jarvis import JarvisAgent
from jarvis.tools import WebSearchTool, ImageGenerationTool
agent = JarvisAgent()
agent.add_tool(WebSearchTool())
agent.add_tool(ImageGenerationTool())
result = agent.execute_task("Find information about climate change and create an infographic")
print(result)
# Fine-tune JARVIS on custom data
from jarvis import JarvisAgent
from jarvis.training import FineTuner
agent = JarvisAgent()
fine_tuner = FineTuner(agent)
custom_data = load_custom_dataset()  # placeholder for your own dataset-loading helper
fine_tuner.train(custom_data, epochs=5)
agent.save("custom_jarvis_model")
Getting Started
To get started with JARVIS, follow these steps:
- Install JARVIS:
  pip install jarvis-ai
- Import and initialize the agent:
  from jarvis import JarvisAgent
  agent = JarvisAgent()
- Process input and get a response:
  response = agent.process(text="Hello, JARVIS!")
  print(response)
For more advanced usage and configuration options, refer to the official documentation on the GitHub repository.
Competitor Comparisons
Robust Speech Recognition via Large-Scale Weak Supervision
Pros of Whisper
- Highly accurate speech recognition across multiple languages
- Open-source and well-documented, allowing for easy integration and customization
- Supports transcription, translation, and language identification tasks
Cons of Whisper
- Focused solely on speech recognition, lacking broader AI capabilities
- Requires significant computational resources for optimal performance
- Limited real-time processing capabilities
Code Comparison
Whisper:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
JARVIS:
from jarvis import Jarvis
jarvis = Jarvis()
response = jarvis.process("What's the weather like today?")
print(response)
While Whisper excels in speech recognition tasks, JARVIS offers a more comprehensive AI assistant framework with broader capabilities. Whisper's code focuses on transcription, while JARVIS processes various types of user inputs and generates responses. JARVIS provides a more versatile platform for building AI applications, but may not match Whisper's specialized speech recognition accuracy.
Port of OpenAI's Whisper model in C/C++
Pros of whisper.cpp
- Lightweight and efficient C++ implementation of OpenAI's Whisper model
- Runs on CPU, making it accessible for devices without GPUs
- Focuses specifically on speech recognition and transcription
Cons of whisper.cpp
- Limited to speech-to-text functionality, lacking broader AI capabilities
- Requires manual integration for more complex applications
- Less extensive documentation and community support
Code Comparison
whisper.cpp:
#include "whisper.h"
int main(int argc, char** argv) {
struct whisper_context * ctx = whisper_init_from_file("ggml-base.en.bin");
whisper_full_default(ctx, wparams, pcmf32.data(), pcmf32.size());
whisper_print_timings(ctx);
whisper_free(ctx);
}
JARVIS:
from jarvis import Jarvis
jarvis = Jarvis()
response = jarvis.chat("What's the weather like today?")
print(response)
Summary
While whisper.cpp excels in efficient speech recognition, JARVIS offers a more comprehensive AI assistant framework. whisper.cpp is ideal for lightweight, CPU-based transcription tasks, whereas JARVIS provides a broader range of AI capabilities and easier integration for complex applications. The choice between them depends on the specific requirements of the project and the desired balance between efficiency and functionality.
Stable Diffusion web UI
Pros of stable-diffusion-webui
- User-friendly web interface for generating and manipulating images
- Extensive customization options and a wide range of built-in features
- Active community with frequent updates and extensions
Cons of stable-diffusion-webui
- Focused primarily on image generation, lacking broader AI capabilities
- May require more computational resources for optimal performance
- Steeper learning curve for advanced features and customizations
Code Comparison
JARVIS (Python):
from jarvis.core import Jarvis
jarvis = Jarvis()
response = jarvis.chat("What's the weather like today?")
print(response)
stable-diffusion-webui (Python):
import modules.scripts as scripts
import gradio as gr
class Script(scripts.Script):
    def title(self):
        return "Custom Script"

    def ui(self, is_img2img):
        return []

    def run(self, p, *args):
        # Custom image generation logic
        return
The code snippets highlight the different focus areas of each project. JARVIS provides a simple interface for general AI interactions, while stable-diffusion-webui offers more specialized image generation capabilities with customizable scripts.
A latent text-to-image diffusion model
Pros of Stable-diffusion
- Focused on image generation and manipulation
- More mature and widely adopted in the AI art community
- Extensive documentation and community support
Cons of Stable-diffusion
- Limited to image-related tasks
- Requires significant computational resources for optimal performance
Code comparison
Stable-diffusion:
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
JARVIS:
from jarvis import JARVIS
jarvis = JARVIS()
response = jarvis.execute("Describe the process of photosynthesis")
print(response)
Key differences
- Stable-diffusion is specialized in image generation and manipulation, while JARVIS is a more general-purpose AI assistant.
- Stable-diffusion has a larger community and more extensive documentation, whereas JARVIS is newer and still evolving.
- JARVIS aims to provide a broader range of AI capabilities, including natural language processing and task execution, while Stable-diffusion focuses solely on image-related tasks.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Broader scope, supporting a wide range of NLP tasks and models
- Larger community and more frequent updates
- Extensive documentation and examples
Cons of Transformers
- Can be more complex to use for specific applications
- May require more setup and configuration for specialized tasks
Code Comparison
JARVIS (Python):
from jarvis.tasks import ImageCaptioning
model = ImageCaptioning()
result = model.generate(image_path="image.jpg")
print(result)
Transformers (Python):
from transformers import pipeline
captioner = pipeline("image-to-text")
result = captioner("image.jpg")
print(result[0]["generated_text"])
Summary
Transformers offers a more comprehensive toolkit for various NLP tasks, while JARVIS focuses on specific AI applications. Transformers has a larger community and more frequent updates, but may require more setup for specialized tasks. JARVIS provides a simpler interface for its supported tasks but has a narrower scope. The code comparison shows that both libraries offer straightforward ways to perform tasks like image captioning, with Transformers requiring slightly more setup but providing more flexibility.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- More comprehensive and established toolkit for sequence modeling
- Supports a wider range of tasks and architectures
- Larger community and more extensive documentation
Cons of fairseq
- Steeper learning curve due to its extensive features
- May be overkill for simpler projects or specific use cases
Code Comparison
fairseq:
from fairseq.models.transformer import TransformerModel

model = TransformerModel.from_pretrained('/path/to/model')
translated = model.translate('Hello world!')  # encodes, generates, and decodes in one call
print(translated)
JARVIS:
from jarvis import JARVIS
jarvis = JARVIS()
response = jarvis.chat("Translate 'Hello world!' to French")
print(response)
Key Differences
- fairseq is focused on sequence-to-sequence models and offers more flexibility
- JARVIS is designed as an AI assistant with a simpler interface
- fairseq requires more setup and configuration
- JARVIS provides a more straightforward approach for general AI tasks
Use Cases
- fairseq: Ideal for research and advanced NLP projects
- JARVIS: Better suited for quick prototyping and general-purpose AI applications
Community and Support
- fairseq: Larger community, more third-party contributions
- JARVIS: Backed by Microsoft, potentially more enterprise-focused support
README
JARVIS
The mission of JARVIS is to explore artificial general intelligence (AGI) and deliver cutting-edge research to the whole community.
What's New
- [2024.01.15] We release EasyTool for easier tool usage.
  - The code and datasets are available at EasyTool.
  - The paper is available at EasyTool: Enhancing LLM-based Agents with Concise Tool Instruction.
- [2023.11.30] We release TaskBench for evaluating the task automation capability of LLMs.
  - The code and datasets are available at TaskBench.
  - The paper is available at TaskBench: Benchmarking Large Language Models for Task Automation.
- [2023.07.28] We are now in the process of planning evaluation and project rebuilding. We will release a new version of Jarvis in the near future.
- [2023.07.24] We released a light langchain version of Jarvis. See here.
- [2023.04.16] Jarvis now supports the OpenAI service on the Azure platform and the GPT-4 model.
- [2023.04.06] We added the Gradio demo and built the web API for /tasks and /results in server mode.
  - The Gradio demo is now hosted on Hugging Face Space. (Built with inference_mode=hybrid and local_deployment=standard.)
  - The Web API /tasks and /results access intermediate results for Stage #1: task planning and Stages #1-3: model selection with execution results. See here.
- [2023.04.03] We added the CLI mode and provided parameters for configuring the scale of local endpoints.
  - You can enjoy a lightweight experience with Jarvis without deploying the models locally. See here.
  - Just run python awesome_chat.py --config configs/config.lite.yaml to experience it.
- [2023.04.01] We updated a version of code for building.
Overview
Language serves as an interface for LLMs to connect numerous AI models for solving complicated AI tasks!
See our paper: HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace, Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu and Yueting Zhuang (the first two authors contribute equally)
We introduce a collaborative system that consists of an LLM as the controller and numerous expert models as collaborative executors (from HuggingFace Hub). The workflow of our system consists of four stages:
- Task Planning: ChatGPT analyzes the user's request to understand their intention and disassembles it into solvable tasks.
- Model Selection: To solve the planned tasks, ChatGPT selects expert models hosted on Hugging Face based on their descriptions.
- Task Execution: Each selected model is invoked and executed, and its results are returned to ChatGPT.
- Response Generation: Finally, ChatGPT integrates the predictions of all models and generates the response.
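The loop below is a minimal, illustrative Python sketch of these four stages. It is not the repository's actual implementation; the llm and run_expert_model callables are hypothetical stand-ins for the ChatGPT call and the Hugging Face expert-model invocation.

# A minimal sketch of the four-stage workflow described above. `llm` and
# `run_expert_model` are hypothetical stand-ins supplied by the caller.
from typing import Callable

def hugginggpt_pipeline(user_request: str,
                        llm: Callable[[str], str],
                        run_expert_model: Callable[[str, str], str]) -> str:
    # Stage #1: Task Planning - decompose the request into solvable tasks.
    plan = llm(f"Decompose this request into a list of tasks: {user_request}")

    # Stage #2: Model Selection - pick an expert model per task from its description.
    selection = llm(f"Select one Hugging Face model for each task:\n{plan}")

    # Stage #3: Task Execution - invoke each selected model and collect results.
    results = [run_expert_model(task_line, user_request)
               for task_line in selection.splitlines() if task_line.strip()]

    # Stage #4: Response Generation - integrate all predictions into one answer.
    return llm(f"Integrate these model results into a response: {results}")

In the real system the planning and selection prompts are far more structured, producing a task list with dependencies and arguments like the /tasks response shown later in this README.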
System Requirements
Default (Recommended)
For configs/config.default.yaml:
- Ubuntu 16.04 LTS
- VRAM >= 24GB
- RAM > 12GB (minimal), 16GB (standard), 80GB (full)
- Disk > 284GB
  - 42GB for damo-vilab/text-to-video-ms-1.7b
  - 126GB for ControlNet
  - 66GB for stable-diffusion-v1-5
  - 50GB for others
Minimum (Lite)
For configs/config.lite.yaml:
- Ubuntu 16.04 LTS
- Nothing else
The configuration configs/config.lite.yaml does not require any expert models to be downloaded and deployed locally. However, it means that Jarvis is restricted to models running stably on Hugging Face Inference Endpoints.
Quick Start
First replace openai.key and huggingface.token in server/configs/config.default.yaml with your personal OpenAI Key and Hugging Face Token, or put them in the environment variables OPENAI_API_KEY and HUGGINGFACE_ACCESS_TOKEN respectively. Then run the following commands:
For Server:
# setup env
cd server
conda create -n jarvis python=3.8
conda activate jarvis
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
# download models. Make sure that `git-lfs` is installed.
cd models
bash download.sh # required when `inference_mode` is `local` or `hybrid`.
# run server
cd ..
python models_server.py --config configs/config.default.yaml # required when `inference_mode` is `local` or `hybrid`
python awesome_chat.py --config configs/config.default.yaml --mode server # for text-davinci-003
Now you can access Jarvis' services by the Web API.
- /hugginggpt --method POST, access the full service.
- /tasks --method POST, access intermediate results for Stage #1.
- /results --method POST, access intermediate results for Stages #1-3.
For example:
# request
curl --location 'http://localhost:8004/tasks' \
--header 'Content-Type: application/json' \
--data '{
"messages": [
{
"role": "user",
"content": "based on pose of /examples/d.jpg and content of /examples/e.jpg, please show me a new image"
}
]
}'
# response
[{"args":{"image":"/examples/d.jpg"},"dep":[-1],"id":0,"task":"openpose-control"},{"args":{"image":"/examples/e.jpg"},"dep":[-1],"id":1,"task":"image-to-text"},{"args":{"image":"<GENERATED>-0","text":"<GENERATED>-1"},"dep":[1,0],"id":2,"task":"openpose-text-to-image"}]
For Web:
We provide a user-friendly web page. After starting awesome_chat.py in server mode, you can run the following commands to communicate with Jarvis in your browser:
- You need to install nodejs and npm first.
- [ IMPORTANT ] If you are running the web client on another machine, you need to set http://{LAN_IP_of_the_server}:{port}/ as the HUGGINGGPT_BASE_URL in web/src/config/index.ts.
- If you want to use the video generation feature, you need to compile ffmpeg manually with H.264 support.
- You can switch to ChatGPT by double-clicking the setting icon!
cd web
npm install
npm run dev
# Optional: Install ffmpeg
# This command needs to run without errors.
LD_LIBRARY_PATH=/usr/local/lib /usr/local/bin/ffmpeg -i input.mp4 -vcodec libx264 output.mp4
For Gradio:
The Gradio demo is now hosted on Hugging Face Space. You can also run the following commands to start the demo locally:
python models_server.py --config configs/config.gradio.yaml
python run_gradio_demo.py --config configs/config.gradio.yaml
# or run the HF Space as a Docker image (built with `inference_mode=hybrid` and `local_deployment=standard`)
docker run -it -p 7860:7860 --platform=linux/amd64 registry.hf.space/microsoft-hugginggpt:latest python app.py
For CLI:
You can also run Jarvis more easily in CLI mode:
cd server
python awesome_chat.py --config configs/config.default.yaml --mode cli
Examples of CLI mode:
Welcome to Jarvis! A collaborative system that consists of an LLM as the controller and numerous expert models as collaborative executors. Jarvis can plan tasks, schedule Hugging Face models, generate friendly responses based on your requests, and help you with many things. Please enter your request (`exit` to exit).
[ User ]: Given a collection of image A: /examples/a.jpg, B: /examples/b.jpg, C: /examples/c.jpg, please tell me how many zebras in these picture?
[ Jarvis ]: Based on your request, I have used several models to process the images you provided. For the image A: /examples/a.jpg, I used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text "a cat sitting on a window sill looking out". Then I used the object-detection model facebook/detr-resnet-50 to generate the image with predicted box '/images/a3dc.jpg', which contains two objects with labels 'potted plant' and 'cat'. For the image B: /examples/b.jpg, I used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text "two zebras standing in a field of dry grass". Then I used the object-detection model facebook/detr-resnet-50 to generate the image with predicted box '/images/f5df.jpg', which contains three objects with labels 'zebra'. For the image C: /examples/c.jpg, I used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text "a zebra standing in a field of tall grass". Then I used the object-detection model facebook/detr-resnet-50 to generate the image with predicted box '/images/69c7.jpg', which contains one object with label 'zebra'. Finally, I used the canny-control model canny-control to generate the image '/images/7e18.png' based on the image /examples/f.jpg. Therefore, there are four zebras in these pictures. Is there anything else I can help you with?
[ User ]: Please answer all the named entities in the sentence: Iron Man is a superhero appearing in American comic books published by Marvel Comics. The character was co-created by writer and editor Stan Lee, developed by scripter Larry Lieber, and designed by artists Don Heck and Jack Kirby.
[ Jarvis ]: Sure, I understand your request. Based on the inference results of the models, I will answer your request step by step. Firstly, I used the token-classification model dslim/bert-base-NER to recognize the named entities in the sentence. The inference result is that the named entities are Iron Man, Stan Lee, Larry Lieber, Don Heck and Jack Kirby. Then, I used the question-answering model deepset/roberta-base-squad2 to answer your request. The inference result is that there is no output since the context cannot be empty. Therefore, I cannot make it. I hope this answer your request. Is there anything else I can help you with?
[ User ]: Please generate a canny image based on /examples/f.jpg
[ Jarvis ]: Sure. I have generated a canny image based on /examples/f.jpg. To do this, I first used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text description of the image, which is "a herd of giraffes and zebras grazing in a field". Then I used the canny-control model to generate the canny image of the original image, which is stored in the path "/images/1ea2.png". Finally, I used the canny-text-to-image model lllyasviel/sd-controlnet-canny to generate the canny image based on the text description and the canny image, which is stored in the path "/images/ba66.png". Do you need anything else?
Configuration
The server-side configuration file is server/configs/config.default.yaml, and some parameters are presented as follows:
- model: LLM, currently supports text-davinci-003. We are working on integrating more open-source LLMs.
- inference_mode: mode of inference endpoints
  - local: only use the local inference endpoints
  - huggingface: only use the Hugging Face Inference Endpoints (free of local inference endpoints)
  - hybrid: both of local and huggingface
- local_deployment: scale of locally deployed models, works under local or hybrid inference mode:
  - minimal (RAM > 12GB, ControlNet only)
  - standard (RAM > 16GB, ControlNet + Standard Pipelines)
  - full (RAM > 42GB, all registered models)
On a personal laptop, we recommend the configuration of inference_mode: hybrid and local_deployment: minimal. But the available models under this setting may be limited due to the instability of remote Hugging Face Inference Endpoints.
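As a convenience, you can generate such a laptop-friendly configuration programmatically. The sketch below is not part of the repository; it assumes the keys sit at the top level of config.default.yaml under the names listed above and uses PyYAML.

# A convenience sketch (not part of the repo): load the default config with
# PyYAML and write a laptop-friendly variant. The flat key layout below is an
# assumption based on the parameter names listed in this section.
import yaml

with open("server/configs/config.default.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["inference_mode"] = "hybrid"     # fall back to Hugging Face Inference Endpoints
cfg["local_deployment"] = "minimal"  # ControlNet only, RAM > 12GB

with open("server/configs/config.laptop.yaml", "w") as f:
    yaml.safe_dump(cfg, f)

# Then run, e.g.: python awesome_chat.py --config configs/config.laptop.yaml --mode cli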
NVIDIA Jetson Embedded Device Support
A Dockerfile is included that provides experimental support for NVIDIA Jetson embedded devices. This image provides accelerated ffmpeg, pytorch, torchaudio, and torchvision dependencies. To build the docker image, ensure that the default docker runtime is set to 'nvidia'. A pre-built image is provided at https://hub.docker.com/r/toolboc/nv-jarvis.
#Build the docker image
docker build --pull --rm -f "Dockerfile.jetson" -t toolboc/nv-jarvis:r35.2.1 .
Due to memory requirements, JARVIS is required to run on Jetson AGX Orin family devices (a 64GB on-board RAM device is preferred), with config options set to:
inference_mode: local
local_deployment: standard
Models and configs are recommended to be provided through a volume mount from the host to the container, as shown in the docker run step below. It is possible to uncomment the # Download local models section of the Dockerfile to build a container with the models included.
Start the model server, awesome_chat, and web app on Jetson AGX Orin
# run the container which will automatically start the model server
docker run --name jarvis --net=host --gpus all -v ~/jarvis/configs:/app/server/configs -v ~/src/JARVIS/server/models:/app/server/models toolboc/nv-jarvis:r35.2.1
# (wait for model server to complete initialization)
# start awesome_chat.py
docker exec jarvis python3 awesome_chat.py --config configs/config.default.yaml --mode server
# start the web application (accessible at http://localhost:9999)
docker exec jarvis npm run dev --prefix=/app/web
Screenshots
Citation
If you find this work useful in your method, you can cite the paper as below:
@inproceedings{shen2023hugginggpt,
author = {Shen, Yongliang and Song, Kaitao and Tan, Xu and Li, Dongsheng and Lu, Weiming and Zhuang, Yueting},
booktitle = {Advances in Neural Information Processing Systems},
title = {HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace},
year = {2023}
}
@article{shen2023taskbench,
title = {TaskBench: Benchmarking Large Language Models for Task Automation},
author = {Shen, Yongliang and Song, Kaitao and Tan, Xu and Zhang, Wenqi and Ren, Kan and Yuan, Siyu and Lu, Weiming and Li, Dongsheng and Zhuang, Yueting},
journal = {arXiv preprint arXiv:2311.18760},
year = {2023}
}
@article{yuan2024easytool,
title = {EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction},
author = {Siyu Yuan and Kaitao Song and Jiangjie Chen and Xu Tan and Yongliang Shen and Ren Kan and Dongsheng Li and Deqing Yang},
journal = {arXiv preprint arXiv:2401.06201},
year = {2024}
}