llama-gpt
A self-hosted, offline, ChatGPT-like chatbot. Powered by Llama 2. 100% private, with no data leaving your device. New: Code Llama support!
Top Related Projects
- llama.cpp: LLM inference in C/C++
- llama-cpp-python: Python bindings for llama.cpp
- llama: Inference code for Llama models
- alpaca.cpp: Locally run an Instruction-Tuned Chat-Style LLM
- alpaca-lora: Instruct-tune LLaMA on consumer hardware
- text-generation-webui: A Gradio web UI for Large Language Models
Quick Overview
LlamaGPT is an open-source, self-hosted ChatGPT-like chatbot powered by Llama 2. It provides a user-friendly web interface for interacting with the Llama 2 language model, allowing users to run their own AI assistant locally or on a server without relying on third-party services.
Pros
- Self-hosted solution, ensuring privacy and data control
- Uses the powerful Llama 2 language model
- User-friendly web interface for easy interaction
- Customizable and extendable open-source project
Cons
- Requires significant computational resources to run efficiently
- Limited to Llama 2's capabilities and knowledge cutoff
- May require technical expertise for setup and maintenance
- Potential legal and ethical considerations when using AI models
Getting Started
To set up LlamaGPT:
1. Clone the repository:
   git clone https://github.com/getumbrel/llama-gpt.git
2. Install dependencies:
   cd llama-gpt
   npm install
3. Download the Llama 2 model (7B or 13B) and place it in the models directory.
4. Start the application:
   npm run dev
5. Access the web interface at http://localhost:3000 in your browser.
Note: Ensure you have sufficient hardware resources (CPU, RAM, and storage) to run the Llama 2 model effectively.
Competitor Comparisons
LLM inference in C/C++
Pros of llama.cpp
- Highly optimized C++ implementation for efficient inference
- Supports quantization for reduced memory usage and faster processing
- Offers more fine-grained control over model parameters and execution
Cons of llama.cpp
- Requires more technical expertise to set up and use
- Less user-friendly interface compared to LlamaGPT
- May require additional steps for model conversion and preparation
Code Comparison
llama.cpp:
int main(int argc, char ** argv) {
gpt_params params;
if (gpt_params_parse(argc, argv, params) == false) {
return 1;
}
llama_init_backend();
// ... (additional initialization code)
}
LlamaGPT:
app.get('/api/chat', async (req, res) => {
const { message } = req.query;
const response = await llm.chat(message);
res.json({ response });
});
The code snippets highlight the difference in implementation languages and approaches. llama.cpp focuses on low-level control and initialization, while LlamaGPT provides a higher-level API for chat functionality.
Python bindings for llama.cpp
Pros of llama-cpp-python
- More flexible and customizable, allowing integration into various Python projects
- Provides a Python API for the llama.cpp library, enabling easier use in Python environments
- Supports multiple model formats and quantization options
Cons of llama-cpp-python
- Requires more technical knowledge to set up and use effectively
- Less user-friendly for those seeking a simple, out-of-the-box chat interface
Code Comparison
llama-cpp-python:
from llama_cpp import Llama
llm = Llama(model_path="./models/7B/ggml-model.bin")
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)
llama-gpt:
import { ChatOpenAI } from "langchain/chat_models/openai";
import { HumanChatMessage, SystemChatMessage } from "langchain/schema";
const chat = new ChatOpenAI({ temperature: 0 });
const response = await chat.call([
new HumanChatMessage("Name the planets in the solar system"),
]);
console.log(response);
The llama-cpp-python example demonstrates direct interaction with the Llama model, while llama-gpt uses a higher-level abstraction through the LangChain library, simplifying the process for users but potentially limiting customization options.
Inference code for Llama models
Pros of Llama
- Official implementation from Meta, ensuring authenticity and direct updates
- More comprehensive documentation and support from the original developers
- Broader scope, covering the entire Llama model family
Cons of Llama
- Requires more setup and configuration for deployment
- Less user-friendly for those seeking a quick, out-of-the-box solution
- May have higher hardware requirements for running the full model
Code Comparison
Llama (Python):
from llama import Llama
model = Llama(model_path="path/to/model.pth")
output = model.generate("Hello, how are you?", max_length=50)
print(output)
Llama-GPT (JavaScript):
import { LlamaGPT } from 'llama-gpt';
const llama = new LlamaGPT();
llama.load('path/to/model.bin');
const response = await llama.generate('Hello, how are you?');
console.log(response);
Summary
Llama offers an official, comprehensive implementation with better documentation, while Llama-GPT provides a more accessible, user-friendly approach for quick deployment. The choice between them depends on the user's specific needs, technical expertise, and available resources.
Locally run an Instruction-Tuned Chat-Style LLM
Pros of alpaca.cpp
- Lightweight and efficient C++ implementation
- Supports quantization for reduced memory usage
- Easy to build and run on various platforms
Cons of alpaca.cpp
- Limited features compared to llama-gpt
- Less user-friendly interface
- Fewer customization options
Code Comparison
alpaca.cpp:
int main(int argc, char ** argv) {
gpt_params params;
if (gpt_params_parse(argc, argv, params) == false) {
return 1;
}
llama_init_backend();
// ... (additional code)
}
llama-gpt:
app.get('/api/chat', async (req, res) => {
const { message } = req.query;
try {
const response = await llama.chat(message);
res.json({ response });
} catch (error) {
res.status(500).json({ error: error.message });
}
});
The code snippets highlight the different approaches: alpaca.cpp focuses on low-level C++ implementation, while llama-gpt provides a higher-level API for easier integration into web applications. alpaca.cpp offers more control over the underlying model, but llama-gpt simplifies the process of building chat-based applications with its RESTful API.
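As a client-side illustration, here is a minimal Python sketch of calling the /api/chat route shown above with the requests library. The GET-with-query shape, host/port, and response field are taken from the snippet in this comparison and are assumptions, not confirmed details of the llama-gpt API:
import requests

# Hypothetical client for the /api/chat route sketched above; route shape,
# query parameter, and "response" field are assumptions based on the snippet.
resp = requests.get(
    "http://localhost:3000/api/chat",        # assumed host, port, and route
    params={"message": "Hello, who are you?"},
    timeout=120,                              # local generation can be slow
)
resp.raise_for_status()
print(resp.json()["response"])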
Instruct-tune LLaMA on consumer hardware
Pros of Alpaca-LoRA
- Focuses on fine-tuning LLaMA models using LoRA technique, allowing for efficient adaptation with limited resources
- Provides detailed instructions for training and inference, making it accessible to researchers and developers
- Supports multiple model sizes and configurations, offering flexibility in deployment
Cons of Alpaca-LoRA
- Requires more technical knowledge to set up and use compared to LLaMA-GPT
- May have higher computational requirements for training and fine-tuning
Code Comparison
Alpaca-LoRA:
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

# base_model and lora_weights are the base checkpoint and LoRA adapter
# paths/IDs supplied by the caller.
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, lora_weights)
LLaMA-GPT:
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ message: inputMessage }),
});
The code snippets highlight the different focus areas of the projects. Alpaca-LoRA emphasizes model loading and fine-tuning, while LLaMA-GPT provides a simpler API for chat interactions.
A Gradio web UI for Large Language Models.
Pros of text-generation-webui
- Supports a wider range of models and architectures
- Offers more advanced features like model merging and fine-tuning
- Provides a more customizable user interface
Cons of text-generation-webui
- More complex setup process
- Requires more technical knowledge to utilize all features
- May have higher system requirements for advanced functionalities
Code Comparison
text-generation-webui:
def generate_reply(
question, state, stopping_strings=None, is_chat=False, escape_html=False
):
# Complex generation logic
# ...
llama-gpt:
async function generateResponse(prompt) {
const response = await fetch('/api/generate', {
method: 'POST',
body: JSON.stringify({ prompt }),
});
return response.json();
}
text-generation-webui offers more advanced generation options and customization, while llama-gpt provides a simpler, more straightforward API for generating responses. text-generation-webui is better suited for users who need extensive control over the generation process, while llama-gpt is more appropriate for those seeking a quick and easy setup for basic LLM interactions.
README
LlamaGPT
A self-hosted, offline, ChatGPT-like chatbot, powered by Llama 2. 100% private, with no data leaving your device.
New: Support for Code Llama models and Nvidia GPUs.
umbrel.com (we're hiring) »
Contents
- Demo
- Supported Models
- How to install
- OpenAI-compatible API
- Benchmarks
- Roadmap and contributing
- Acknowledgements
Demo
https://github.com/getumbrel/llama-gpt/assets/10330103/5d1a76b8-ed03-4a51-90bd-12ebfaf1e6cd
Supported models
Currently, LlamaGPT supports the following models. Support for running custom models is on the roadmap.
Model name | Model size | Model download size | Memory required |
---|---|---|---|
Nous Hermes Llama 2 7B Chat (GGML q4_0) | 7B | 3.79GB | 6.29GB |
Nous Hermes Llama 2 13B Chat (GGML q4_0) | 13B | 7.32GB | 9.82GB |
Nous Hermes Llama 2 70B Chat (GGML q4_0) | 70B | 38.87GB | 41.37GB |
Code Llama 7B Chat (GGUF Q4_K_M) | 7B | 4.24GB | 6.74GB |
Code Llama 13B Chat (GGUF Q4_K_M) | 13B | 8.06GB | 10.56GB |
Phind Code Llama 34B Chat (GGUF Q4_K_M) | 34B | 20.22GB | 22.72GB |
How to install
Install LlamaGPT on your umbrelOS home server
Running LlamaGPT on an umbrelOS home server is one click. Simply install it from the Umbrel App Store.
Install LlamaGPT on M1/M2 Mac
Make sure you have Docker and Xcode installed.
Then, clone this repo and cd into it:
git clone https://github.com/getumbrel/llama-gpt.git
cd llama-gpt
Run LlamaGPT with the following command:
./run-mac.sh --model 7b
You can access LlamaGPT at http://localhost:3000.
To run 13B or 70B chat models, replace 7b with 13b or 70b respectively. To run 7B, 13B or 34B Code Llama models, replace 7b with code-7b, code-13b or code-34b respectively.
To stop LlamaGPT, do Ctrl + C in Terminal.
Install LlamaGPT anywhere else with Docker
You can run LlamaGPT on any x86 or arm64 system. Make sure you have Docker installed.
Then, clone this repo and cd into it:
git clone https://github.com/getumbrel/llama-gpt.git
cd llama-gpt
Run LlamaGPT with the following command:
./run.sh --model 7b
Or if you have an Nvidia GPU, you can run LlamaGPT with CUDA support using the --with-cuda flag, like:
./run.sh --model 7b --with-cuda
You can access LlamaGPT at http://localhost:3000.
To run 13B or 70B chat models, replace 7b with 13b or 70b respectively. To run Code Llama 7B, 13B or 34B models, replace 7b with code-7b, code-13b or code-34b respectively.
To stop LlamaGPT, do Ctrl + C in Terminal.
Note: On the first run, it may take a while for the model to be downloaded to the /models directory. You may also see lots of output like this for a few minutes, which is normal:
llama-gpt-llama-gpt-ui-1 | [INFO wait] Host [llama-gpt-api-13b:8000] not yet available...
After the model has been automatically downloaded and loaded, and the API server is running, you'll see an output like:
llama-gpt-ui_1 | ready - started server on 0.0.0.0:3000, url: http://localhost:3000
You can then access LlamaGPT at http://localhost:3000.
Install LlamaGPT with Kubernetes
First, make sure you have a running Kubernetes cluster and kubectl is configured to interact with it.
Then, clone this repo and cd into it.
To deploy to Kubernetes, first create a namespace:
kubectl create ns llama
Then apply the manifests under the /deploy/kubernetes directory with:
kubectl apply -k deploy/kubernetes/. -n llama
Expose your service however you would normally do that.
OpenAI-compatible API
Thanks to llama-cpp-python, a drop-in replacement for the OpenAI API is available at http://localhost:3001. Open http://localhost:3001/docs to see the API documentation.
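For example, here is a minimal sketch of talking to that endpoint with the official openai Python client. The model name is a placeholder and the unused API key is an assumption about the local server, so adjust as needed:
from openai import OpenAI

# Point the client at the local OpenAI-compatible server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:3001/v1", api_key="not-needed")  # key is ignored locally (assumption)

completion = client.chat.completions.create(
    model="llama-2-7b-chat",  # placeholder; the server answers with whichever model it has loaded
    messages=[{"role": "user", "content": "How does the universe expand?"}],
)
print(completion.choices[0].message.content)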
Benchmarks
We've tested LlamaGPT models on the following hardware, using the default system prompt and the user prompt "How does the universe expand?" at temperature 0 to guarantee deterministic results. Generation speed is averaged over the first 10 generations.
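As a rough illustration (not the project's actual benchmark script), a tokens/sec figure like the ones below could be measured against the local OpenAI-compatible API as follows; the endpoint, the placeholder model name, and the reliance on the usage field reported by the server are assumptions:
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3001/v1", api_key="not-needed")

start = time.monotonic()
completion = client.chat.completions.create(
    model="llama-2-7b-chat",  # placeholder; use whichever model the server loaded
    messages=[{"role": "user", "content": "How does the universe expand?"}],
    temperature=0,  # deterministic output, matching the benchmark setup
)
elapsed = time.monotonic() - start

# completion_tokens comes from the usage field the server returns (assumption).
tokens = completion.usage.completion_tokens
print(f"{tokens / elapsed:.2f} tokens/sec")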
Feel free to add your own benchmarks to this table by opening a pull request.
Nous Hermes Llama 2 7B Chat (GGML q4_0)
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 54 tokens/sec |
GCP c2-standard-16 vCPU (64 GB RAM) | 16.7 tokens/sec |
Ryzen 5700G 4.4GHz 4c (16 GB RAM) | 11.50 tokens/sec |
GCP c2-standard-4 vCPU (16 GB RAM) | 4.3 tokens/sec |
Umbrel Home (16GB RAM) | 2.7 tokens/sec |
Raspberry Pi 4 (8GB RAM) | 0.9 tokens/sec |
Nous Hermes Llama 2 13B Chat (GGML q4_0)
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 20 tokens/sec |
GCP c2-standard-16 vCPU (64 GB RAM) | 8.6 tokens/sec |
GCP c2-standard-4 vCPU (16 GB RAM) | 2.2 tokens/sec |
Umbrel Home (16GB RAM) | 1.5 tokens/sec |
Nous Hermes Llama 2 70B Chat (GGML q4_0)
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 4.8 tokens/sec |
GCP e2-standard-16 vCPU (64 GB RAM) | 1.75 tokens/sec |
GCP c2-standard-16 vCPU (64 GB RAM) | 1.62 tokens/sec |
Code Llama 7B Chat (GGUF Q4_K_M)
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 41 tokens/sec |
Code Llama 13B Chat (GGUF Q4_K_M)
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 25 tokens/sec |
Phind Code Llama 34B Chat (GGUF Q4_K_M)
Device | Generation speed |
---|---|
M1 Max MacBook Pro (64GB RAM) | 10.26 tokens/sec |
Roadmap and contributing
We're looking to add more features to LlamaGPT. You can see the roadmap here. The highest priorities are:
- Move the model out of the Docker image and into a separate volume.
- Add Metal support for M1/M2 Macs.
- Add support for Code Llama models.
- Add CUDA support for NVIDIA GPUs.
- Add ability to load custom models.
- Allow users to switch between models.
If you're a developer who'd like to help with any of these, please open an issue to discuss the best way to tackle the challenge. If you're looking to help but not sure where to begin, check out these issues that have specifically been marked as being friendly to new contributors.
Acknowledgements
A massive thank you to the following developers and teams for making LlamaGPT possible:
- Mckay Wrigley for building Chatbot UI.
- Georgi Gerganov for implementing llama.cpp.
- Andrei for building the Python bindings for llama.cpp.
- NousResearch for fine-tuning the Llama 2 7B and 13B models.
- Phind for fine-tuning the Code Llama 34B model.
- Tom Jobbins for quantizing the Llama 2 models.
- Meta for releasing Llama 2 and Code Llama under a permissive license.