Top Related Projects
- llama - Inference code for Llama models
- llama2.c - Inference Llama 2 in one file of pure C
- llama-cpp-python - Python bindings for llama.cpp
- text-generation-webui - A Gradio web UI for Large Language Models
- mlc-llm - Universal LLM Deployment Engine with ML Compilation
- transformers - 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX
Quick Overview
llama.cpp is a C/C++ port of Meta's (formerly Facebook's) LLaMA model. It enables inference of LLaMA-family models on CPUs, and increasingly on GPUs, with optimizations for different architectures. The project aims to make large language models more accessible and efficient on consumer hardware.
Pros
- Enables running LLaMA models on consumer-grade hardware
- Optimized for various CPU architectures, including ARM and x86
- Supports quantization for reduced memory usage and faster inference
- Active development and community support
Cons
- Requires obtaining LLaMA model weights separately (not included due to licensing)
- Performance may vary depending on hardware capabilities
- Primarily focused on LLaMA-style architectures, though support for other model families has grown
- May require technical expertise for optimal setup and usage
Code Examples
- Basic model inference (note: these examples use the legacy, pre-GGUF C API; see the README below for the current workflow):
#include "llama.h"

int main() {
    llama_context_params params = llama_context_default_params();
    llama_context * ctx = llama_init_from_file("path/to/model.bin", params);

    llama_token tokens[] = {0, 1, 2, 3};
    llama_eval(ctx, tokens, 4, /*n_past=*/0, /*n_threads=*/1);

    float * logits = llama_get_logits(ctx); // logits for the last evaluated token
    (void) logits;

    llama_free(ctx);
    return 0;
}
- Text generation:
#include "llama.h"
#include <string>
#include <vector>

// Legacy-API sketch: tokenize the prompt, then sample token by token
std::string generate_text(llama_context * ctx, const std::string & prompt, int max_tokens) {
    std::vector<llama_token> tokens(prompt.size() + 1);
    const int n = llama_tokenize(ctx, prompt.c_str(), tokens.data(), (int) tokens.size(), /*add_bos=*/true);
    tokens.resize(n);

    std::string result;
    int n_past = 0;
    for (int i = 0; i < max_tokens; ++i) {
        // feed only the tokens that have not been evaluated yet
        llama_eval(ctx, tokens.data() + n_past, (int) tokens.size() - n_past, n_past, /*n_threads=*/4);
        n_past = (int) tokens.size();

        llama_token new_token = llama_sample_top_p_top_k(ctx, tokens.data(), (int) tokens.size(),
                                                         /*top_k=*/40, /*top_p=*/0.9f, /*temp=*/0.8f, /*repeat_penalty=*/1.1f);
        if (new_token == llama_token_eos()) break;
        tokens.push_back(new_token);
        result += llama_token_to_str(ctx, new_token); // converts a single token to text
    }
    return result;
}
- Model quantization:
#include "llama.h"

int main() {
    // Legacy entry point; newer releases take a llama_model_quantize_params struct
    llama_model_quantize("path/to/model.bin", "path/to/quantized_model.bin", LLAMA_FTYPE_MOSTLY_Q4_0);
    return 0;
}
Getting Started
- Clone the repository:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
- Build the project:
make
- Download LLaMA model weights (not provided in the repository due to licensing)
- Convert the model to ggml format (note: in current checkouts this script has moved to examples/convert_legacy_llama.py; see the README notes below):
python3 convert.py path/to/llama/model
- Run the example:
./main -m path/to/ggml/model.bin -p "Once upon a time"
Competitor Comparisons
Inference code for Llama models
Pros of llama
- Official implementation from Meta, ensuring alignment with the original model architecture
- Potentially more comprehensive documentation and support from the Meta team
- May include advanced features or optimizations not present in community implementations
Cons of llama
- Likely requires more computational resources to run, as it's not optimized for efficiency
- May have stricter licensing terms or usage restrictions
- Possibly less flexible for customization or integration into other projects
Code Comparison
llama:
# Sketch of the repo's entry point (Llama.build + text_completion); paths are placeholders
from llama import Llama

generator = Llama.build(ckpt_dir="path/to/model", tokenizer_path="path/to/tokenizer.model",
                        max_seq_len=128, max_batch_size=1)
output = generator.text_completion(["Hello, how are you?"])
print(output)
llama.cpp:
#include "llama.h"

// Legacy-API sketch; a real program would tokenize the prompt before eval
llama_context_params params = llama_context_default_params();
llama_context * ctx = llama_init_from_file("path/to/model", params);
// ... tokenize "Hello, how are you?" and call llama_eval(...) here
llama_print_timings(ctx);
llama_free(ctx);
The llama repository provides a Python interface, while llama.cpp offers a C++ implementation, potentially allowing for lower-level optimizations and better performance on resource-constrained devices. llama.cpp is designed for efficiency and portability, making it suitable for a wider range of applications and hardware configurations.
Inference Llama 2 in one file of pure C
Pros of llama2.c
- Extremely lightweight and minimalistic implementation (single C file)
- Designed for educational purposes, making it easier to understand the core concepts
- Highly portable due to minimal dependencies
Cons of llama2.c
- Limited features and optimizations compared to llama.cpp
- Not designed for production use or high-performance applications
- Lacks support for advanced quantization techniques
Code Comparison
llama2.c:
// abridged: run-state buffers are allocated individually with calloc
void malloc_run_state(RunState* s, Config* p) {
    int kv_dim = (p->dim * p->n_kv_heads) / p->n_heads;
    s->x = calloc(p->dim, sizeof(float));
    s->logits = calloc(p->vocab_size, sizeof(float));
    s->key_cache = calloc(p->n_layers * p->seq_len * kv_dim, sizeof(float));
    s->value_cache = calloc(p->n_layers * p->seq_len * kv_dim, sizeof(float));
    // ... (remaining buffers)
}
llama.cpp:
static void llama_model_load(
const std::string & fname,
llama_context & lctx,
int n_ctx,
int n_batch,
int n_gpu_layers,
ggml_type memory_type,
bool use_mmap,
bool use_mlock,
bool vocab_only,
llama_progress_callback progress_callback,
void * progress_callback_user_data) {
// ... (implementation details)
}
The code snippets highlight the difference in complexity and feature set between the two projects. llama2.c focuses on simplicity, while llama.cpp offers more advanced functionality and optimizations.
Python bindings for llama.cpp
Pros of llama-cpp-python
- Python bindings for easier integration with Python projects
- Simplified API for loading and running LLaMA models
- Includes additional features like token sampling and perplexity calculation
Cons of llama-cpp-python
- May have slightly lower performance due to Python overhead
- Fewer low-level customization options compared to the C++ implementation
- Potentially slower development cycle for new features
Code Comparison
llama-cpp-python:
from llama_cpp import Llama
llm = Llama(model_path="./models/7B/ggml-model.bin")
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)
llama.cpp:
#include "llama.h"
#include "common.h" // legacy examples helper providing gpt_params and llama_tokenize

gpt_params params;
params.model = "./models/7B/ggml-model.bin";
llama_context_params lparams = llama_context_default_params();
lparams.n_ctx = params.n_ctx;
llama_context * ctx = llama_init_from_file(params.model.c_str(), lparams);
std::vector<llama_token> tokens = llama_tokenize(ctx, "Q: Name the planets in the solar system? A: ", true);
llama_eval(ctx, tokens.data(), (int) tokens.size(), 0, params.n_threads);
llama_print_timings(ctx);
llama_free(ctx);
The llama-cpp-python code is more concise and easier to use for Python developers, while the llama.cpp code offers more direct control over the underlying C++ implementation.
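The bindings also expose a higher-level chat API. Below is a minimal sketch; the model path is a placeholder, and it assumes a recent llama-cpp-python release with create_chat_completion:
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model.gguf", n_ctx=2048)  # placeholder path
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name the planets in the solar system."},
    ],
    max_tokens=64,
)
# The response mirrors the OpenAI chat-completions schema
print(response["choices"][0]["message"]["content"])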
A Gradio web UI for Large Language Models.
Pros of text-generation-webui
- User-friendly web interface for easy interaction with language models
- Supports multiple models and frameworks (e.g., GPTQ, llama.cpp, Transformers)
- Offers various features like chat, instruct mode, and notebook mode
Cons of text-generation-webui
- Higher resource requirements due to additional features and UI
- Potentially slower inference speed compared to llama.cpp's optimized C++ implementation
- More complex setup process with additional dependencies
Code Comparison
text-generation-webui (Python):
def generate_reply(
question, state, stopping_strings=None, is_chat=False, escape_html=False
):
# ... (generation logic)
llama.cpp (C++):
llama_token llama_sample_top_p_top_k(
llama_context * ctx,
const llama_token * last_n_tokens_data,
int last_n_tokens_size,
int top_k,
float top_p,
float temp,
float repeat_penalty
) {
// ... (sampling logic)
}
The code snippets highlight the different approaches: text-generation-webui focuses on high-level Python functions for generation, while llama.cpp implements low-level C++ functions for efficient sampling and inference.
Universal LLM Deployment Engine with ML Compilation
Pros of mlc-llm
- Supports a wider range of hardware platforms, including mobile devices and web browsers
- Offers integration with TVM (Tensor Virtual Machine) for optimized performance
- Provides a more comprehensive ecosystem for deploying LLMs in various environments
Cons of mlc-llm
- Generally more complex to set up and use compared to llama.cpp
- May have a steeper learning curve for beginners
- Less focused on pure C++ implementation, which could impact portability in some cases
Code Comparison
llama.cpp:
#include "llama.h"
int main() {
llama_context * ctx = llama_init_from_file("model.bin", params);
// ... model usage
}
mlc-llm:
# Illustrative sketch; mlc-llm's Python API has changed across releases
# (current releases expose an OpenAI-style engine)
from mlc_llm import MLCEngine

engine = MLCEngine("llama-7b")  # model id/path is a placeholder
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(response)
The code snippets highlight the difference in approach: llama.cpp focuses on a C++ implementation, while mlc-llm leverages Python and TVM for a more flexible, but potentially more complex, setup.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Extensive model support: Covers a wide range of transformer-based models
- High-level API: Easier to use for beginners and rapid prototyping
- Integration with PyTorch and TensorFlow: Flexible for different deep learning frameworks
Cons of Transformers
- Resource-intensive: Requires more computational power and memory
- Slower inference: Generally slower execution compared to optimized C++ implementations
- Complexity: Can be overwhelming for users who need simple, specific functionality
Code Comparison
Transformers (Python):
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
llama.cpp (C++):
#include "llama.h"

// Legacy-API sketch; tokens, n_past and n_threads are assumed set up elsewhere
llama_context_params params = llama_context_default_params();
llama_context * ctx = llama_init_from_file("model.bin", params);
llama_eval(ctx, tokens.data(), (int) tokens.size(), n_past, n_threads);
llama_free(ctx);
README
llama.cpp
Roadmap / Project status / Manifesto / ggml
Inference of Meta's LLaMA model (and others) in pure C/C++
Recent API changes
Hot topics
- Huggingface GGUF editor: discussion | tool
Description
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2 and AVX512 support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use (see the rough memory estimate after this list)
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
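As a rough back-of-the-envelope estimate (ignoring the small per-block scale factors that make real quantized files slightly larger), the memory footprint scales linearly with the bits per weight $b$:

$$\text{mem} \approx n_{\text{params}} \cdot \frac{b}{8}\ \text{bytes}, \qquad 7{\times}10^{9} \cdot \tfrac{16}{8} \approx 14\ \text{GB (FP16)}, \quad 7{\times}10^{9} \cdot \tfrac{4}{8} \approx 3.5\ \text{GB (4-bit)}$$

This is why a 7B model that would not fit in 8 GB of RAM at FP16 becomes usable on consumer hardware once quantized.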
Since its inception, the project has improved significantly thanks to many contributions. It is the main playground for developing new features for the ggml library.
Supported models:
Typically finetunes of the base models below are supported as well.
- LLaMA 🦙
- LLaMA 2 🦙🦙
- LLaMA 3 🦙🦙🦙
- Mistral 7B
- Mixtral MoE
- DBRX
- Falcon
- Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
- Vigogne (French)
- BERT
- Koala
- Baichuan 1 & 2 + derivations
- Aquila 1 & 2
- Starcoder models
- Refact
- MPT
- Bloom
- Yi models
- StableLM models
- Deepseek models
- Qwen models
- PLaMo-13B
- Phi models
- GPT-2
- Orion 14B
- InternLM2
- CodeShell
- Gemma
- Mamba
- Grok-1
- Xverse
- Command-R models
- SEA-LION
- GritLM-7B + GritLM-8x7B
- OLMo
- Granite models
- GPT-NeoX + Pythia
- Snowflake-Arctic MoE
- Smaug
- Poro 34B
- Bitnet b1.58 models
- Flan T5
- Open Elm models
- ChatGLM3-6b + ChatGLM4-9b
- SmolLM
- EXAONE-3.0-7.8B-Instruct
- FalconMamba Models
(instructions for supporting more models: HOWTO-add-model.md)
Multimodal models:
- LLaVA 1.5 models, LLaVA 1.6 models
- BakLLaVA
- Obsidian
- ShareGPT4V
- MobileVLM 1.7B/3B models
- Yi-VL
- Mini CPM
- Moondream
- Bunny
Bindings:
- Python: abetlen/llama-cpp-python
- Go: go-skynet/go-llama.cpp
- Node.js: withcatai/node-llama-cpp
- JS/TS (llama.cpp server client): lgrammel/modelfusion
- JavaScript/Wasm (works in browser): tangledgroup/llama-cpp-wasm
- Typescript/Wasm (nicer API, available on npm): ngxson/wllama
- Ruby: yoshoku/llama_cpp.rb
- Rust (more features): edgenai/llama_cpp-rs
- Rust (nicer API): mdrokz/rust-llama.cpp
- Rust (more direct bindings): utilityai/llama-cpp-rs
- C#/.NET: SciSharp/LLamaSharp
- Scala 3: donderom/llm4s
- Clojure: phronmophobic/llama.clj
- React Native: mybigday/llama.rn
- Java: kherud/java-llama.cpp
- Zig: deins/llama.cpp.zig
- Flutter/Dart: netdur/llama_cpp_dart
- PHP (API bindings and features built on top of llama.cpp): distantmagic/resonance (more info)
- Guile Scheme: guile_llama_cpp
UI:
Unless otherwise noted these projects are open-source with permissive licensing:
- MindWorkAI/AI-Studio (FSL-1.1-MIT)
- iohub/collama
- janhq/jan (AGPL)
- nat/openplayground
- Faraday (proprietary)
- LMStudio (proprietary)
- Layla (proprietary)
- ramalama (MIT)
- LocalAI (MIT)
- LostRuins/koboldcpp (AGPL)
- Mozilla-Ocho/llamafile
- nomic-ai/gpt4all
- ollama/ollama
- oobabooga/text-generation-webui (AGPL)
- psugihara/FreeChat
- cztomsik/ava (MIT)
- ptsochantaris/emeltal
- pythops/tenere (AGPL)
- RAGNA Desktop (proprietary)
- RecurseChat (proprietary)
- semperai/amica
- withcatai/catai
- Mobile-Artificial-Intelligence/maid (MIT)
- Msty (proprietary)
- LLMFarm (MIT)
- KanTV (Apache-2.0 or later)
- Dot (GPL)
- MindMac (proprietary)
- KodiBot (GPL)
- eva (MIT)
- AI Sublime Text plugin (MIT)
- AIKit (MIT)
- LARS - The LLM & Advanced Referencing Solution (AGPL)
- LLMUnity (MIT)
(To have a project listed here, it should clearly state that it depends on llama.cpp.)
Tools:
- akx/ggify - download PyTorch models from HuggingFace Hub and convert them to GGML
- crashr/gppm - launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- gpustack/gguf-parser - review/check the GGUF file and estimate the memory usage
Infrastructure:
- Paddler - Stateful load balancer custom-tailored for llama.cpp
- GPUStack - Manage GPU clusters for running LLMs
Games:
- Lucy's Labyrinth - A simple maze game where agents controlled by an AI model will try to trick you.
Demo
Typical run using LLaMA v2 13B on M2 Ultra
$ make -j && ./llama-cli -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
make: Nothing to be done for `default'.
main: build = 1041 (cf658ad)
main: seed = 1692823051
llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from models/llama-13b-v2/ggml-model-q4_0.gguf (version GGUF V1 (latest))
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_print_meta: format = GGUF V1 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 512
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model size = 13.02 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: mem required = 7024.01 MB (+ 400.00 MB per state)
...................................................................................................
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 75.41 MB
system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
Building a website can be done in 10 simple steps:
Step 1: Find the right website platform.
Step 2: Choose your domain name and hosting plan.
Step 3: Design your website layout.
Step 4: Write your website content and add images.
Step 5: Install security features to protect your site from hackers or spammers
Step 6: Test your website on multiple browsers, mobile devices, operating systems etc…
Step 7: Test it again with people who are not related to you personally - friends or family members will work just fine!
Step 8: Start marketing and promoting the website via social media channels or paid ads
Step 9: Analyze how many visitors have come to your site so far, what type of people visit more often than others (e.g., men vs women) etc…
Step 10: Continue to improve upon all aspects mentioned above by following trends in web design and staying up-to-date on new technologies that can enhance user experience even further!
How does a Website Work?
A website works by having pages, which are made of HTML code. This code tells your computer how to display the content on each page you visit - whether it's an image or text file (like PDFs). In order for someone else's browser not only be able but also want those same results when accessing any given URL; some additional steps need taken by way of programming scripts that will add functionality such as making links clickable!
The most common type is called static HTML pages because they remain unchanged over time unless modified manually (either through editing files directly or using an interface such as WordPress). They are usually served up via HTTP protocols - this means anyone can access them without having any special privileges like being part of a group who is allowed into restricted areas online; however, there may still exist some limitations depending upon where one lives geographically speaking.
How to
llama_print_timings: load time = 576.45 ms
llama_print_timings: sample time = 283.10 ms / 400 runs ( 0.71 ms per token, 1412.91 tokens per second)
llama_print_timings: prompt eval time = 599.83 ms / 19 tokens ( 31.57 ms per token, 31.68 tokens per second)
llama_print_timings: eval time = 24513.59 ms / 399 runs ( 61.44 ms per token, 16.28 tokens per second)
llama_print_timings: total time = 25431.49 ms
And here is another demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook:
https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4
Usage
Here are the end-to-end binary build and model conversion steps for most supported models.
Basic usage
Firstly, you need to get the binary. There are different methods that you can follow:
- Method 1: Clone this repository and build locally, see how to build
- Method 2: If you are using macOS or Linux, you can install llama.cpp via brew, flox or nix
- Method 3: Use a Docker image, see documentation for Docker
- Method 4: Download pre-built binary from releases
You can run a basic completion using this command:
llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128
# Output:
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga â it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
See this page for a full list of parameters.
Conversation mode
If you want a more ChatGPT-like experience, you can run in conversation mode by passing -cnv as a parameter:
llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
# Output:
# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!
By default, the chat template will be taken from the input model. If you want to use another chat template, pass --chat-template NAME as a parameter. See the list of supported templates.
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
You can also use your own template via in-prefix, in-suffix and reverse-prompt parameters:
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
Web server
llama.cpp web server is a lightweight OpenAI API compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
Example usage:
./llama-server -m your_model.gguf --port 8080
# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
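Because the server speaks the OpenAI chat-completions protocol, any OpenAI-compatible client can talk to it. As a minimal illustration, here is a standard-library-only Python sketch, assuming the server from the example above is listening on port 8080:
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
# The response follows the OpenAI chat-completions schema
print(body["choices"][0]["message"]["content"])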
Interactive mode
[!NOTE] If you prefer basic usage, please consider using conversation mode instead of interactive mode
In this mode, you can always interrupt generation by pressing Ctrl+C and entering one or more lines of text, which will be converted into tokens and appended to the current context. You can also specify a reverse prompt with the parameter -r "reverse prompt string". This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is a prompt that makes LLaMA emulate a chat between multiple users, say Alice and Bob, combined with -r "Alice:".
Here is an example of a few-shot interaction, invoked with the command
# default arguments using a 7B model
./examples/chat.sh
# advanced chat with a 13B model
./examples/chat-13B.sh
# custom arguments using a 13B model
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
Note the use of --color to distinguish between user input and generated text. Other parameters are explained in more detail in the README for the llama-cli example program.
Persistent Interaction
The prompt, user inputs, and model generations can be saved and resumed across calls to ./llama-cli by leveraging --prompt-cache and --prompt-cache-all. The ./examples/chat-persistent.sh script demonstrates this with support for long-running, resumable chat sessions. To use this example, you must provide a file to cache the initial chat prompt and a directory to save the chat session, and may optionally provide the same variables as chat-13B.sh. The same prompt cache can be reused for new chat sessions. Note that both the prompt cache and the chat directory are tied to the initial prompt (PROMPT_TEMPLATE) and the model file.
# Start a new chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh
# Resume that chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh
# Start a different chat with the same prompt/model
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/another ./examples/chat-persistent.sh
# Different prompt cache for different prompt/model
PROMPT_TEMPLATE=./prompts/chat-with-bob.txt PROMPT_CACHE_FILE=bob.prompt.bin \
CHAT_SAVE_DIR=./chat/bob ./examples/chat-persistent.sh
Constrained output with grammars
llama.cpp supports grammars to constrain model output. For example, you can force the model to output JSON only:
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
The grammars/ folder contains a handful of sample grammars. To write your own, check out the GBNF Guide.
For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on its repo and not this one.
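Grammars are also exposed through the Python bindings described earlier. A hedged sketch, assuming a recent llama-cpp-python release (the model path is a placeholder):
from llama_cpp import Llama, LlamaGrammar

grammar = LlamaGrammar.from_file("grammars/json.gbnf")
llm = Llama(model_path="./models/13B/ggml-model-q4_0.gguf")  # placeholder path
output = llm(
    "Request: schedule a call at 8pm; Command:",
    grammar=grammar,  # sampling is constrained to tokens the grammar accepts
    max_tokens=64,
)
print(output["choices"][0]["text"])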
Build
Please refer to Build llama.cpp locally
Supported backends
Backend | Target devices
---|---
Metal | Apple Silicon
BLAS | All
BLIS | All
SYCL | Intel and Nvidia GPU
MUSA | Moore Threads GPU
CUDA | Nvidia GPU
hipBLAS | AMD GPU
Vulkan | GPU
CANN | Ascend NPU
Tools
Prepare and Quantize
[!NOTE] You can use the GGUF-my-repo space on Hugging Face to quantize your model weights without any setup. It is synced from llama.cpp main every 6 hours.
To obtain the official LLaMA 2 weights, please see the Obtaining and using the Facebook LLaMA 2 model section. There is also a large selection of pre-quantized gguf models available on Hugging Face.
Note: convert.py has been moved to examples/convert_legacy_llama.py and shouldn't be used for anything other than Llama/Llama2/Mistral models and their derivatives. It does not support LLaMA 3; for LLaMA 3 models downloaded from Hugging Face, use convert_hf_to_gguf.py instead.
To learn more about quantizing models, read this documentation.
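To sanity-check a converted or quantized file, the gguf Python package (maintained in the llama.cpp repository under gguf-py) can read GGUF metadata. A sketch, with a placeholder path; attribute names reflect the package at the time of writing:
from gguf import GGUFReader

reader = GGUFReader("./models/7B/ggml-model-q4_0.gguf")  # placeholder path
print(f"{len(reader.fields)} metadata keys, {len(reader.tensors)} tensors")
for tensor in reader.tensors[:5]:
    # name, shape and quantization type of each tensor
    print(tensor.name, tensor.shape, tensor.tensor_type)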
Perplexity (measuring model quality)
You can use the perplexity example to measure perplexity over a given prompt (lower perplexity is better).
For more information, see https://huggingface.co/docs/transformers/perplexity.
To learn more about how to measure perplexity using llama.cpp, read this documentation.
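For reference, perplexity is the exponentiated average negative log-likelihood of the tokens under the model:

$$\mathrm{PPL}(x_{1:N}) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$$

A model that guessed uniformly over a 32000-token vocabulary would score a perplexity of 32000; lower values mean the model assigns higher probability to the observed text.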
Contributing
- Contributors can open PRs
- Collaborators can push to branches in the llama.cpp repo and merge PRs into the master branch
- Collaborators will be invited based on contributions
- Any help with managing issues and PRs is very appreciated!
- See good first issues for tasks suitable for first contributions
- Read CONTRIBUTING.md for more information
- Make sure to read this: Inference at the edge
- A bit of backstory for those who are interested: Changelog podcast
Other documentations
Development documentations
Seminal papers and background on the models
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
- LLaMA
- GPT-3
- GPT-3.5 / InstructGPT / ChatGPT