llama-gpt
A self-hosted, offline, ChatGPT-like chatbot. Powered by Llama 2. 100% private, with no data leaving your device. New: Code Llama support!
Top Related Projects
- llama.cpp: LLM inference in C/C++
- llama-cpp-python: Python bindings for llama.cpp
- llama: Inference code for Llama models
- alpaca.cpp: Locally run an Instruction-Tuned Chat-Style LLM
- alpaca-lora: Instruct-tune LLaMA on consumer hardware
- text-generation-webui: A Gradio web UI for Large Language Models
Quick Overview
LlamaGPT is an open-source, self-hosted ChatGPT-like chatbot powered by Llama 2. It provides a user-friendly web interface for interacting with the Llama 2 language model, allowing users to run their own AI assistant locally or on a server without relying on third-party services.
Pros
- Self-hosted solution, ensuring privacy and data control
- Uses the powerful Llama 2 language model
- User-friendly web interface for easy interaction
- Customizable and extendable open-source project
Cons
- Requires significant computational resources to run efficiently
- Limited to Llama 2's capabilities and knowledge cutoff
- May require technical expertise for setup and maintenance
- Potential legal and ethical considerations when using AI models
Getting Started
To set up LlamaGPT:
1. Clone the repository:
   git clone https://github.com/getumbrel/llama-gpt.git
2. Install dependencies:
   cd llama-gpt
   npm install
3. Download the Llama 2 model (7B or 13B) and place it in the models directory.
4. Start the application:
   npm run dev
5. Access the web interface at http://localhost:3000 in your browser.
Note: Ensure you have sufficient hardware resources (CPU, RAM, and storage) to run the Llama 2 model effectively.
Competitor Comparisons
LLM inference in C/C++
Pros of llama.cpp
- Highly optimized C++ implementation for efficient inference
- Supports quantization for reduced memory usage and faster processing
- Offers more fine-grained control over model parameters and execution
Cons of llama.cpp
- Requires more technical expertise to set up and use
- Less user-friendly interface compared to LlamaGPT
- May require additional steps for model conversion and preparation
Code Comparison
llama.cpp:
int main(int argc, char ** argv) {
gpt_params params;
if (gpt_params_parse(argc, argv, params) == false) {
return 1;
}
llama_init_backend();
// ... (additional initialization code)
}
LlamaGPT:
app.get('/api/chat', async (req, res) => {
const { message } = req.query;
const response = await llm.chat(message);
res.json({ response });
});
The code snippets highlight the difference in implementation languages and approaches. llama.cpp focuses on low-level control and initialization, while LlamaGPT provides a higher-level API for chat functionality.
Python bindings for llama.cpp
Pros of llama-cpp-python
- More flexible and customizable, allowing integration into various Python projects
- Provides a Python API for the llama.cpp library, enabling easier use in Python environments
- Supports multiple model formats and quantization options
Cons of llama-cpp-python
- Requires more technical knowledge to set up and use effectively
- Less user-friendly for those seeking a simple, out-of-the-box chat interface
Code Comparison
llama-cpp-python:
from llama_cpp import Llama
llm = Llama(model_path="./models/7B/ggml-model.bin")
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)
llama-gpt:
import { ChatOpenAI } from "langchain/chat_models/openai";
import { HumanChatMessage, SystemChatMessage } from "langchain/schema";
const chat = new ChatOpenAI({ temperature: 0 });
const response = await chat.call([
new HumanChatMessage("Name the planets in the solar system"),
]);
console.log(response);
The llama-cpp-python example demonstrates direct interaction with the Llama model, while llama-gpt uses a higher-level abstraction through the LangChain library, simplifying the process for users but potentially limiting customization options.
Inference code for Llama models
Pros of Llama
- Official implementation from Meta, ensuring authenticity and direct updates
- More comprehensive documentation and support from the original developers
- Broader scope, covering the entire Llama model family
Cons of Llama
- Requires more setup and configuration for deployment
- Less user-friendly for those seeking a quick, out-of-the-box solution
- May have higher hardware requirements for running the full model
Code Comparison
Llama (Python):
from llama import Llama
model = Llama(model_path="path/to/model.pth")
output = model.generate("Hello, how are you?", max_length=50)
print(output)
Llama-GPT (JavaScript):
import { LlamaGPT } from 'llama-gpt';
const llama = new LlamaGPT();
llama.load('path/to/model.bin');
const response = await llama.generate('Hello, how are you?');
console.log(response);
Summary
Llama offers an official, comprehensive implementation with better documentation, while Llama-GPT provides a more accessible, user-friendly approach for quick deployment. The choice between them depends on the user's specific needs, technical expertise, and available resources.
Locally run an Instruction-Tuned Chat-Style LLM
Pros of alpaca.cpp
- Lightweight and efficient C++ implementation
- Supports quantization for reduced memory usage
- Easy to build and run on various platforms
Cons of alpaca.cpp
- Limited features compared to llama-gpt
- Less user-friendly interface
- Fewer customization options
Code Comparison
alpaca.cpp:
int main(int argc, char ** argv) {
gpt_params params;
if (gpt_params_parse(argc, argv, params) == false) {
return 1;
}
llama_init_backend();
// ... (additional code)
}
llama-gpt:
app.get('/api/chat', async (req, res) => {
const { message } = req.query;
try {
const response = await llama.chat(message);
res.json({ response });
} catch (error) {
res.status(500).json({ error: error.message });
}
});
The code snippets highlight the different approaches: alpaca.cpp focuses on low-level C++ implementation, while llama-gpt provides a higher-level API for easier integration into web applications. alpaca.cpp offers more control over the underlying model, but llama-gpt simplifies the process of building chat-based applications with its RESTful API.
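As a client-side illustration, here is a minimal Python sketch of calling the /api/chat route shown above with the requests library. The GET-with-query shape, host/port, and response field are taken from the snippet in this comparison and are assumptions, not confirmed details of the llama-gpt API:
import requests

# Hypothetical client for the /api/chat route sketched above; route shape,
# query parameter, and "response" field are assumptions based on the snippet.
resp = requests.get(
    "http://localhost:3000/api/chat",        # assumed host, port, and route
    params={"message": "Hello, who are you?"},
    timeout=120,                              # local generation can be slow
)
resp.raise_for_status()
print(resp.json()["response"])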
Instruct-tune LLaMA on consumer hardware
Pros of Alpaca-LoRA
- Focuses on fine-tuning LLaMA models using LoRA technique, allowing for efficient adaptation with limited resources
- Provides detailed instructions for training and inference, making it accessible to researchers and developers
- Supports multiple model sizes and configurations, offering flexibility in deployment
Cons of Alpaca-LoRA
- Requires more technical knowledge to set up and use compared to LLaMA-GPT
- May have higher computational requirements for training and fine-tuning
Code Comparison
Alpaca-LoRA:
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

# base_model and lora_weights are the base checkpoint and LoRA adapter
# paths/IDs supplied by the caller.
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, lora_weights)
LLaMA-GPT:
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ message: inputMessage }),
});
The code snippets highlight the different focus areas of the projects. Alpaca-LoRA emphasizes model loading and fine-tuning, while LLaMA-GPT provides a simpler API for chat interactions.
A Gradio web UI for Large Language Models.
Pros of text-generation-webui
- Supports a wider range of models and architectures
- Offers more advanced features like model merging and fine-tuning
- Provides a more customizable user interface
Cons of text-generation-webui
- More complex setup process
- Requires more technical knowledge to utilize all features
- May have higher system requirements for advanced functionalities
Code Comparison
text-generation-webui:
def generate_reply(
question, state, stopping_strings=None, is_chat=False, escape_html=False
):
# Complex generation logic
# ...
llama-gpt:
async function generateResponse(prompt) {
const response = await fetch('/api/generate', {
method: 'POST',
body: JSON.stringify({ prompt }),
});
return response.json();
}
text-generation-webui offers more advanced generation options and customization, while llama-gpt provides a simpler, more straightforward API for generating responses. text-generation-webui is better suited for users who need extensive control over the generation process, while llama-gpt is more appropriate for those seeking a quick and easy setup for basic LLM interactions.
README
LlamaGPT
A self-hosted, offline, ChatGPT-like chatbot, powered by Llama 2. 100% private, with no data leaving your device.
New: Support for Code Llama models and Nvidia GPUs.
umbrel.com (we're hiring) »
Contents
- Demo
- Supported Models
- How to install
- OpenAI-compatible API
- Benchmarks
- Roadmap and contributing
- Acknowledgements
Demo
https://github.com/getumbrel/llama-gpt/assets/10330103/5d1a76b8-ed03-4a51-90bd-12ebfaf1e6cd
Supported models
Currently, LlamaGPT supports the following models. Support for running custom models is on the roadmap.
Model name | Model size | Model download size | Memory required |
---|---|---|---|
Nous Hermes Llama 2 7B Chat (GGML q4_0) | 7B | 3.79GB | 6.29GB |
Nous Hermes Llama 2 13B Chat (GGML q4_0) | 13B | 7.32GB | 9.82GB |
Nous Hermes Llama 2 70B Chat (GGML q4_0) | 70B | 38.87GB | 41.37GB |
Code Llama 7B Chat (GGUF Q4_K_M) | 7B | 4.24GB | 6.74GB |
Code Llama 13B Chat (GGUF Q4_K_M) | 13B | 8.06GB | 10.56GB |
Phind Code Llama 34B Chat (GGUF Q4_K_M) | 34B | 20.22GB | 22.72GB |
How to install
Install LlamaGPT on your umbrelOS home server
Running LlamaGPT on an umbrelOS home server is one click. Simply install it from the Umbrel App Store.
Install LlamaGPT on M1/M2 Mac
Make sure you have Docker and Xcode installed.
Then, clone this repo and cd into it:
git clone https://github.com/getumbrel/llama-gpt.git
cd llama-gpt
Run LlamaGPT with the following command:
./run-mac.sh --model 7b
You can access LlamaGPT at http://localhost:3000.
To run 13B or 70B chat models, replace 7b with 13b or 70b respectively. To run 7B, 13B or 34B Code Llama models, replace 7b with code-7b, code-13b or code-34b respectively.
To stop LlamaGPT, do Ctrl + C in Terminal.
Install LlamaGPT anywhere else with Docker
You can run LlamaGPT on any x86 or arm64 system. Make sure you have Docker installed.
Then, clone this repo and cd into it:
git clone https://github.com/getumbrel/llama-gpt.git
cd llama-gpt
Run LlamaGPT with the following command:
./run.sh --model 7b
Or if you have an Nvidia GPU, you can run LlamaGPT with CUDA support using the --with-cuda flag, like:
./run.sh --model 7b --with-cuda
You can access LlamaGPT at http://localhost:3000.
To run 13B or 70B chat models, replace 7b with 13b or 70b respectively. To run Code Llama 7B, 13B or 34B models, replace 7b with code-7b, code-13b or code-34b respectively.
To stop LlamaGPT, do Ctrl + C in Terminal.
Note: On the first run, it may take a while for the model to be downloaded to the /models directory. You may also see lots of output like this for a few minutes, which is normal:
llama-gpt-llama-gpt-ui-1 | [INFO wait] Host [llama-gpt-api-13b:8000] not yet available...
After the model has been automatically downloaded and loaded, and the API server is running, you'll see an output like:
llama-gpt-ui_1 | ready - started server on 0.0.0.0:3000, url: http://localhost:3000
You can then access LlamaGPT at http://localhost:3000.
Install LlamaGPT with Kubernetes
First, make sure you have a running Kubernetes cluster and kubectl is configured to interact with it.
Then, clone this repo and cd into it.
To deploy to Kubernetes, first create a namespace:
kubectl create ns llama
Then apply the manifests under the /deploy/kubernetes directory with:
kubectl apply -k deploy/kubernetes/. -n llama
Expose your service however you would normally do that.
OpenAI-compatible API
Thanks to llama-cpp-python, a drop-in replacement for the OpenAI API is available at http://localhost:3001. Open http://localhost:3001/docs to see the API documentation.
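For example, here is a minimal sketch of talking to that endpoint with the official openai Python client. The model name is a placeholder and the unused API key is an assumption about the local server, so adjust as needed:
from openai import OpenAI

# Point the client at the local OpenAI-compatible server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:3001/v1", api_key="not-needed")  # key is ignored locally (assumption)

completion = client.chat.completions.create(
    model="llama-2-7b-chat",  # placeholder; the server answers with whichever model it has loaded
    messages=[{"role": "user", "content": "How does the universe expand?"}],
)
print(completion.choices[0].message.content)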
Benchmarks
We've tested LlamaGPT models on the following hardware, using the default system prompt and the user prompt "How does the universe expand?" at temperature 0 to guarantee deterministic results. Generation speed is averaged over the first 10 generations.
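As a rough illustration (not the project's actual benchmark script), a tokens/sec figure like the ones below could be measured against the local OpenAI-compatible API as follows; the endpoint, the placeholder model name, and the reliance on the usage field reported by the server are assumptions:
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3001/v1", api_key="not-needed")

start = time.monotonic()
completion = client.chat.completions.create(
    model="llama-2-7b-chat",  # placeholder; use whichever model the server loaded
    messages=[{"role": "user", "content": "How does the universe expand?"}],
    temperature=0,  # deterministic output, matching the benchmark setup
)
elapsed = time.monotonic() - start

# completion_tokens comes from the usage field the server returns (assumption).
tokens = completion.usage.completion_tokens
print(f"{tokens / elapsed:.2f} tokens/sec")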
Feel free to add your own benchmarks to this table by opening a pull request.
Nous Hermes Llama 2 7B Chat (GGML q4_0)
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 54 tokens/sec |
GCP c2-standard-16 vCPU (64 GB RAM) | 16.7 tokens/sec |
Ryzen 5700G 4.4GHz 4c (16 GB RAM) | 11.50 tokens/sec |
GCP c2-standard-4 vCPU (16 GB RAM) | 4.3 tokens/sec |
Umbrel Home (16GB RAM) | 2.7 tokens/sec |
Raspberry Pi 4 (8GB RAM) | 0.9 tokens/sec |
Nous Hermes Llama 2 13B Chat (GGML q4_0)
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 20 tokens/sec |
GCP c2-standard-16 vCPU (64 GB RAM) | 8.6 tokens/sec |
GCP c2-standard-4 vCPU (16 GB RAM) | 2.2 tokens/sec |
Umbrel Home (16GB RAM) | 1.5 tokens/sec |
Nous Hermes Llama 2 70B Chat (GGML q4_0)
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 4.8 tokens/sec |
GCP e2-standard-16 vCPU (64 GB RAM) | 1.75 tokens/sec |
GCP c2-standard-16 vCPU (64 GB RAM) | 1.62 tokens/sec |
Code Llama 7B Chat (GGUF Q4_K_M)
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 41 tokens/sec |
Code Llama 13B Chat (GGUF Q4_K_M)
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 25 tokens/sec |
Phind Code Llama 34B Chat (GGUF Q4_K_M)
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 10.26 tokens/sec |
Roadmap and contributing
We're looking to add more features to LlamaGPT. You can see the roadmap here. The highest priorities are:
- Move the model out of the Docker image and into a separate volume.
- Add Metal support for M1/M2 Macs.
- Add support for Code Llama models.
- Add CUDA support for NVIDIA GPUs.
- Add ability to load custom models.
- Allow users to switch between models.
If you're a developer who'd like to help with any of these, please open an issue to discuss the best way to tackle the challenge. If you're looking to help but not sure where to begin, check out these issues that have specifically been marked as being friendly to new contributors.
Acknowledgements
A massive thank you to the following developers and teams for making LlamaGPT possible:
- Mckay Wrigley for building Chatbot UI.
- Georgi Gerganov for implementing llama.cpp.
- Andrei for building the Python bindings for llama.cpp.
- NousResearch for fine-tuning the Llama 2 7B and 13B models.
- Phind for fine-tuning the Code Llama 34B model.
- Tom Jobbins for quantizing the Llama 2 models.
- Meta for releasing Llama 2 and Code Llama under a permissive license.