Top Related Projects
LLM inference in C/C++
Inference code for Llama models
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
Quick Overview
llama-cpp-python is a Python binding for the llama.cpp library, a C/C++ implementation of inference for LLaMA and other models in GGUF format. It lets Python developers integrate these models into their applications while keeping the speed and efficiency of the C/C++ implementation.
Pros
- High performance due to C++ backend
- Easy integration with Python projects
- Supports various LLaMA model sizes and configurations
- Includes both CPU and GPU acceleration options
Cons
- Requires compilation of C++ code, which may be challenging for some users
- Limited to models that llama.cpp supports, in GGUF format
- May have compatibility issues with certain Python environments or operating systems
- Documentation could be more comprehensive for advanced use cases
Code Examples
- Loading a model and generating text:
from llama_cpp import Llama
llm = Llama(model_path="./models/7B/ggml-model.bin")
output = llm("Q: What is the capital of France? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output["choices"][0]["text"])
- Using the model in a chat context:
from llama_cpp import Llama
llm = Llama(model_path="./models/7B/ggml-model.bin")
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the weather like today?"}
]
response = llm.create_chat_completion(messages)
print(response["choices"][0]["message"]["content"])
- Streaming output:
from llama_cpp import Llama
llm = Llama(model_path="./models/7B/ggml-model.bin")
for chunk in llm("Q: What are the benefits of exercise? A: ", max_tokens=64, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
Getting Started
- Install the library:
pip install llama-cpp-python
- Download a model in GGUF format, or convert one to GGUF.
- Use the library in your Python code:
from llama_cpp import Llama
llm = Llama(model_path="path/to/your/model.gguf")
output = llm("Your prompt here", max_tokens=50)
print(output["choices"][0]["text"])
Competitor Comparisons
LLM inference in C/C++
Pros of llama.cpp
- Written in C++, offering potentially better performance and lower-level control
- Provides a command-line interface for direct interaction with the model
- Supports various quantization methods for model optimization
Cons of llama.cpp
- Requires more setup and configuration for use in Python projects
- Less integrated with Python ecosystem and libraries
- May have a steeper learning curve for Python developers
Code Comparison
llama.cpp (C++):
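#include "llama.h"
int main() {
    // Older llama.cpp C API; these function names have changed in later releases.
    llama_context_params params = llama_context_default_params();
    llama_context * ctx = llama_init_from_file("model.bin", params);
    // tokens, n_tokens, n_past and n_threads are prepared by the caller (omitted here).
    llama_eval(ctx, tokens, n_tokens, n_past, n_threads);
    llama_free(ctx);
}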
llama-cpp-python (Python):
from llama_cpp import Llama
llm = Llama(model_path="model.bin")
output = llm("Q: What is the capital of France? A: ", max_tokens=32)
print(output)
The llama.cpp repository provides a low-level C++ implementation, offering fine-grained control and potential performance benefits. However, llama-cpp-python wraps this functionality in a more Python-friendly interface, making it easier to integrate into Python projects and workflows. While llama.cpp may be preferred for performance-critical applications or those requiring deep customization, llama-cpp-python offers a more accessible option for Python developers looking to leverage LLaMA models in their projects.
Inference code for Llama models
Pros of llama
- Official implementation from Meta, ensuring authenticity and alignment with the original model design
- Potentially more comprehensive documentation and support from the Meta team
- May receive earlier updates and improvements directly from the model creators
Cons of llama
- Reference implementation in Python on PyTorch, which generally needs a GPU and more memory than quantized llama.cpp inference
- Potentially more complex setup, since it loads the original PyTorch checkpoints and is typically launched with torchrun
- May lack some of the user-friendly features, such as quantized CPU inference, found in community-driven projects
Code Comparison
llama (Python):
from llama import Llama
generator = Llama.build(ckpt_dir="llama-2-7b/", tokenizer_path="tokenizer.model", max_seq_len=128, max_batch_size=4)
results = generator.text_completion(["Q: What is the capital of France? A:"], max_gen_len=32)
print(results[0]["generation"])
llama-cpp-python (Python):
from llama_cpp import Llama
llm = Llama(model_path="model.bin")
output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output)
The code comparison shows the difference in setup between the two implementations: the reference code loads a checkpoint directory and a separate tokenizer file, while llama-cpp-python loads a single model file.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Extensive model support: Covers a wide range of NLP tasks and architectures
- Rich ecosystem: Integrates well with other Hugging Face libraries and tools
- Active community: Regular updates, extensive documentation, and community support
Cons of transformers
- Resource-intensive: Can be computationally demanding for large models
- Complexity: Steeper learning curve for beginners due to its comprehensive nature
- Python-centric: Limited support for other programming languages
Code Comparison
transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_text = "Hello, how are you?"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_ids, max_length=50)
llama-cpp-python:
from llama_cpp import Llama
llm = Llama(model_path="path/to/model.bin")
output = llm("Hello, how are you?", max_tokens=50)
The transformers library offers more flexibility and supports a wider range of models, while llama-cpp-python provides a simpler interface specifically for Llama models with potentially better performance for certain use cases.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- Offers advanced optimization techniques for large-scale model training
- Supports distributed training across multiple GPUs and nodes
- Provides a comprehensive ecosystem for efficient deep learning
Cons of DeepSpeed
- Steeper learning curve due to its complexity and advanced features
- Primarily focused on training, less emphasis on inference optimization
Code Comparison
DeepSpeed:
import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(args=args,
model=model,
model_parameters=params)
llama-cpp-python:
from llama_cpp import Llama
llm = Llama(model_path="path/to/model.bin")
output = llm("Q: What is the capital of France? A: ", max_tokens=32)
Key Differences
- DeepSpeed is a comprehensive training optimization library, while llama-cpp-python focuses on inference for LLaMA models
- DeepSpeed offers more advanced features for large-scale training, while llama-cpp-python provides a simpler API for running LLaMA models
- DeepSpeed is better suited for researchers and organizations working on training large models, while llama-cpp-python is ideal for developers looking to integrate LLaMA models into their applications
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Pros of gpt-neox
- Designed for training large language models from scratch
- Supports distributed training across multiple GPUs and nodes
- Includes tools for dataset preparation and tokenization
Cons of gpt-neox
- More complex setup and configuration required
- Higher computational resources needed for training
- Less suitable for inference on consumer-grade hardware
Code Comparison
gpt-neox:
from megatron.neox_arguments import NeoXArgs
from megatron.global_vars import set_global_variables, get_tokenizer
from megatron.training import pretrain
args = NeoXArgs.from_ymls("configs/your_config.yml")
set_global_variables(args)
llama-cpp-python:
from llama_cpp import Llama
llm = Llama(model_path="path/to/model.bin")
output = llm("Q: What is the capital of France? A:", max_tokens=32)
gpt-neox is focused on training large language models, offering distributed training capabilities and dataset preparation tools. However, it requires more setup and computational resources. llama-cpp-python, on the other hand, is designed for inference using pre-trained LLaMA models, making it easier to use on consumer hardware but lacking training capabilities.
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
Pros of JAX
- Designed for high-performance numerical computing and machine learning
- Supports automatic differentiation and GPU/TPU acceleration
- Offers a rich ecosystem of libraries and tools for scientific computing
Cons of JAX
- Steeper learning curve, especially for those not familiar with NumPy
- Limited support for dynamic graphs compared to some other ML frameworks
- May require more setup and configuration for certain use cases
Code Comparison
JAX example:
import jax.numpy as jnp
from jax import grad, jit
def f(x):
    return jnp.sum(jnp.sin(x))
grad_f = jit(grad(f))
llama-cpp-python example:
from llama_cpp import Llama
llm = Llama(model_path="path/to/model.bin")
output = llm("Q: What is the capital of France? A:", max_tokens=32)
While JAX focuses on numerical computing and gradient-based optimization, llama-cpp-python provides a simple interface for running LLaMA models. JAX is more versatile for general machine learning tasks, while llama-cpp-python is specifically designed for working with LLaMA models in Python.
README
Python Bindings for llama.cpp
Simple Python bindings for @ggerganov's llama.cpp library.
This package provides:
- Low-level access to C API via ctypes interface.
- High-level Python API for text completion
- OpenAI-like API
- LangChain compatibility
- LlamaIndex compatibility
- OpenAI compatible web server
Documentation is available at https://llama-cpp-python.readthedocs.io/en/latest.
Installation
Requirements:
- Python 3.8+
- C compiler
- Linux: gcc or clang
- Windows: Visual Studio or MinGW
- MacOS: Xcode
To install the package, run:
pip install llama-cpp-python
This will also build llama.cpp from source and install it alongside this python package.
If this fails, add --verbose to the pip install command to see the full cmake build log.
Pre-built Wheel (New)
It is also possible to install a pre-built wheel with basic CPU support.
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
Installation Configuration
llama.cpp supports a number of hardware acceleration backends to speed up inference as well as backend specific options. See the llama.cpp README for a full list.
All llama.cpp cmake build options can be set via the CMAKE_ARGS environment variable or via the --config-settings / -C cli flag during installation.
Environment Variables
# Linux and Mac
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" \
pip install llama-cpp-python
# Windows
$env:CMAKE_ARGS = "-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
pip install llama-cpp-python
CLI / requirements.txt
They can also be set via the pip install -C / --config-settings flag and saved to a requirements.txt file:
pip install --upgrade pip # ensure pip is up to date
pip install llama-cpp-python \
-C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
# requirements.txt
llama-cpp-python -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
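The two forms above carry the same CMake options but use different separators: spaces in the CMAKE_ARGS environment variable and semicolons in cmake.args. The following is a minimal sketch of that relationship; the helper names (cmake_defines, as_env_value, as_config_setting) are hypothetical and not part of llama-cpp-python.

```python
def cmake_defines(options):
    """Turn {"GGML_BLAS": "ON"} into ["-DGGML_BLAS=ON"] (hypothetical helper)."""
    return [f"-D{name}={value}" for name, value in options.items()]


def as_env_value(options):
    # Value for the CMAKE_ARGS environment variable: flags separated by spaces.
    return " ".join(cmake_defines(options))


def as_config_setting(options):
    # Value for `pip install -C ...`: flags separated by semicolons.
    return "cmake.args=" + ";".join(cmake_defines(options))


blas = {"GGML_BLAS": "ON", "GGML_BLAS_VENDOR": "OpenBLAS"}
print(as_env_value(blas))
print(as_config_setting(blas))
```

Keeping the options in one dictionary makes it harder for the two installation paths to drift apart.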
Supported Backends
Below are some common backends, their build commands and any additional environment variables required.
OpenBLAS (CPU)
To install with OpenBLAS, set the GGML_BLAS and GGML_BLAS_VENDOR options in the CMAKE_ARGS environment variable before installing:
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
CUDA
To install with CUDA support, set the GGML_CUDA=on option in the CMAKE_ARGS environment variable before installing:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
Pre-built Wheel (New)
It is also possible to install a pre-built wheel with CUDA support, as long as your system meets the following requirements:
- CUDA Version is 12.1, 12.2, 12.3, or 12.4
- Python Version is 3.10, 3.11 or 3.12
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>
Where <cuda-version> is one of the following:
- cu121: CUDA 12.1
- cu122: CUDA 12.2
- cu123: CUDA 12.3
- cu124: CUDA 12.4
For example, to install the CUDA 12.1 wheel:
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
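The mapping from a CUDA version to the wheel index URL can be sketched as follows. This is only an illustration of the naming scheme listed above; cuda_wheel_index is a hypothetical helper, and the supported versions are copied from the requirements in this section.

```python
SUPPORTED_CUDA = ("12.1", "12.2", "12.3", "12.4")
WHEEL_INDEX = "https://abetlen.github.io/llama-cpp-python/whl"


def cuda_wheel_index(cuda_version):
    """Return the --extra-index-url value for a CUDA version such as "12.1"."""
    if cuda_version not in SUPPORTED_CUDA:
        raise ValueError(f"no pre-built wheel listed for CUDA {cuda_version}")
    tag = "cu" + cuda_version.replace(".", "")  # "12.1" -> "cu121"
    return f"{WHEEL_INDEX}/{tag}"


print(cuda_wheel_index("12.1"))
```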
Metal
To install with Metal (MPS), set the GGML_METAL=on option in the CMAKE_ARGS environment variable before installing:
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
Pre-built Wheel (New)
It is also possible to install a pre-built wheel with Metal support, as long as your system meets the following requirements:
- MacOS Version is 11.0 or later
- Python Version is 3.10, 3.11 or 3.12
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
hipBLAS (ROCm)
To install with hipBLAS / ROCm support for AMD cards, set the GGML_HIPBLAS=on option in the CMAKE_ARGS environment variable before installing:
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python
Vulkan
To install with Vulkan support, set the GGML_VULKAN=on option in the CMAKE_ARGS environment variable before installing:
CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
SYCL
To install with SYCL support, set the GGML_SYCL=on option in the CMAKE_ARGS environment variable before installing:
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
RPC
To install with RPC support, set the GGML_RPC=on option in the CMAKE_ARGS environment variable before installing:
CMAKE_ARGS="-DGGML_RPC=on" pip install llama-cpp-python
Windows Notes
Error: Can't find 'nmake' or 'CMAKE_C_COMPILER'
If you run into issues where it complains it can't find 'nmake' '?' or CMAKE_C_COMPILER, you can extract w64devkit as mentioned in llama.cpp repo and add those manually to CMAKE_ARGS before running pip install:
$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"
See the above instructions and set CMAKE_ARGS to the BLAS backend you want to use.
MacOS Notes
Detailed MacOS Metal GPU install documentation is available at docs/install/macos.md
M1 Mac Performance Issue
Note: If you are using an Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports the arm64 architecture. For example:
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
Otherwise, the installation will build the x86 version of llama.cpp, which will be 10x slower on an Apple Silicon (M1) Mac.
M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`
Try installing with
CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DGGML_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
Upgrading and Reinstalling
To upgrade and rebuild llama-cpp-python, add the --upgrade --force-reinstall --no-cache-dir flags to the pip install command to ensure the package is rebuilt from source.
High-level API
The high-level API provides a simple managed interface through the Llama class.
Below is a short example demonstrating how to use the high-level API for basic text completion:
from llama_cpp import Llama
llm = Llama(
model_path="./models/7B/llama-model.gguf",
# n_gpu_layers=-1, # Uncomment to use GPU acceleration
# seed=1337, # Uncomment to set a specific seed
# n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
"Q: Name the planets in the solar system? A: ", # Prompt
max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
By default llama-cpp-python generates completions in an OpenAI compatible format:
{
"id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"object": "text_completion",
"created": 1679561337,
"model": "./models/7B/llama-model.gguf",
"choices": [
{
"text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
"index": 0,
"logprobs": None,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 28,
"total_tokens": 42
}
}
Text completion is available through the __call__ and create_completion methods of the Llama class.
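Since the completion is returned as a plain dictionary in the format shown above, the generated text and token counts can be read with ordinary indexing. The sketch below works on an abbreviated copy of that sample output, so it runs without a model; completion_text and stopped_on_stop_sequence are hypothetical helper names.

```python
# Abbreviated copy of the sample response shown above.
sample_output = {
    "object": "text_completion",
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 28, "total_tokens": 42},
}


def completion_text(response):
    """Return the text of the first choice."""
    return response["choices"][0]["text"]


def stopped_on_stop_sequence(response):
    """True when generation ended with finish_reason "stop"."""
    return response["choices"][0]["finish_reason"] == "stop"


usage = sample_output["usage"]
print(completion_text(sample_output))
print(usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"])
```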
Pulling models from Hugging Face Hub
You can download Llama models in gguf format directly from Hugging Face using the from_pretrained method.
You'll need to install the huggingface-hub package to use this feature (pip install huggingface-hub).
llm = Llama.from_pretrained(
repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",
filename="*q8_0.gguf",
verbose=False
)
By default from_pretrained will download the model to the huggingface cache directory. You can then manage installed model files with the huggingface-cli tool.
Chat Completion
The high-level API also provides a simple interface for chat completion.
Chat completion requires that the model knows how to format the messages into a single prompt.
The Llama class does this using pre-registered chat formats (e.g. chatml, llama-2, gemma, etc.) or by providing a custom chat handler object.
The messages are formatted into a single prompt using the following order of precedence:
- Use the chat_handler if provided
- Use the chat_format if provided
- Use the tokenizer.chat_template from the gguf model's metadata (should work for most new models, older models may not have this)
- Otherwise, fall back to the llama-2 chat format
Set verbose=True to see the selected chat format.
from llama_cpp import Llama
llm = Llama(
model_path="path/to/llama-2/llama-model.gguf",
chat_format="llama-2"
)
llm.create_chat_completion(
messages = [
{"role": "system", "content": "You are an assistant who perfectly describes images."},
{
"role": "user",
"content": "Describe this image in detail please."
}
]
)
Chat completion is available through the create_chat_completion method of the Llama class.
For OpenAI API v1 compatibility, you can use the create_chat_completion_openai_v1 method which will return pydantic models instead of dicts.
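The order of precedence described in this section can be sketched as a small selection function. This is an illustration only, not the library's internal code: select_chat_format and its return labels are hypothetical, and gguf_metadata stands in for the model metadata as a plain dictionary.

```python
def select_chat_format(chat_handler=None, chat_format=None, gguf_metadata=None):
    """Pick the prompt formatter following the documented precedence."""
    if chat_handler is not None:
        return ("chat_handler", chat_handler)
    if chat_format is not None:
        return ("chat_format", chat_format)
    template = (gguf_metadata or {}).get("tokenizer.chat_template")
    if template is not None:
        return ("gguf_template", template)
    # Nothing was provided and the model carries no template.
    return ("chat_format", "llama-2")


print(select_chat_format(chat_format="chatml"))
print(select_chat_format())
```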
JSON and JSON Schema Mode
To constrain chat responses to only valid JSON or a specific JSON Schema use the response_format argument in create_chat_completion.
JSON Mode
The following example will constrain the response to valid JSON strings only.
from llama_cpp import Llama
llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
messages=[
{
"role": "system",
"content": "You are a helpful assistant that outputs in JSON.",
},
{"role": "user", "content": "Who won the world series in 2020"},
],
response_format={
"type": "json_object",
},
temperature=0.7,
)
JSON Schema Mode
To constrain the response further to a specific JSON Schema add the schema to the schema property of the response_format argument.
from llama_cpp import Llama
llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
messages=[
{
"role": "system",
"content": "You are a helpful assistant that outputs in JSON.",
},
{"role": "user", "content": "Who won the world series in 2020"},
],
response_format={
"type": "json_object",
"schema": {
"type": "object",
"properties": {"team_name": {"type": "string"}},
"required": ["team_name"],
},
},
temperature=0.7,
)
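With JSON Schema mode, the message content of the response is a JSON string, so it can be decoded and checked before use. The sketch below assumes the content has already been read from response["choices"][0]["message"]["content"]; parse_constrained_reply is a hypothetical helper and the sample string is made up for illustration, not real model output.

```python
import json


def parse_constrained_reply(content, required=("team_name",)):
    """Decode a JSON reply and check that the required keys are present."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Made-up content standing in for the model's reply.
sample_content = '{"team_name": "Example Team"}'
reply = parse_constrained_reply(sample_content)
print(reply["team_name"])
```

A defensive check like this is cheap and also covers replies that were cut off by a token limit before the JSON was complete.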
Function Calling
The high-level API supports OpenAI compatible function and tool calling. This is possible through the functionary pre-trained models chat format or through the generic chatml-function-calling chat format.
from llama_cpp import Llama
llm = Llama(model_path="path/to/chatml/llama-model.gguf", chat_format="chatml-function-calling")
llm.create_chat_completion(
messages = [
{
"role": "system",
"content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary"
},
{
"role": "user",
"content": "Extract Jason is 25 years old"
}
],
tools=[{
"type": "function",
"function": {
"name": "UserDetail",
"parameters": {
"type": "object",
"title": "UserDetail",
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"age": {
"title": "Age",
"type": "integer"
}
},
"required": [ "name", "age" ]
}
}
}],
tool_choice={
"type": "function",
"function": {
"name": "UserDetail"
}
}
)
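In the OpenAI format, a tool call arrives as a function name plus a JSON-encoded argument string inside the assistant message. The sketch below assumes the response follows that shape; the sample dictionary is hand-written for illustration (not captured model output) and extract_tool_call is a hypothetical helper.

```python
import json

# Hand-written response in the OpenAI tool-call shape (an assumption).
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "UserDetail",
                            "arguments": json.dumps({"name": "Jason", "age": 25}),
                        },
                    }
                ],
            }
        }
    ]
}


def extract_tool_call(response):
    """Return (function_name, arguments_dict) for the first tool call, or None."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    if not calls:
        return None
    function = calls[0]["function"]
    return function["name"], json.loads(function["arguments"])


name, arguments = extract_tool_call(sample_response)
print(name, arguments)
```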
Functionary v2
The various gguf-converted files for this set of models can be found here. Functionary is able to intelligently call functions and also analyze any provided function outputs to generate coherent responses. All v2 models of functionary support parallel function calling. You can provide either functionary-v1 or functionary-v2 for the chat_format when initializing the Llama class.
Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is required to provide an HF tokenizer for functionary. The LlamaHFTokenizer class can be initialized and passed into the Llama class. This will override the default llama.cpp tokenizer used in the Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files.
from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer
llm = Llama.from_pretrained(
repo_id="meetkai/functionary-small-v2.2-GGUF",
filename="functionary-small-v2.2.q4_0.gguf",
chat_format="functionary-v2",
tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-small-v2.2-GGUF")
)
NOTE: There is no need to provide the default system messages used in Functionary as they are added automatically in the Functionary chat handler. Thus, the messages should contain just the chat messages and/or system messages that provide additional context for the model (e.g.: datetime, etc.).
Multi-modal Models
llama-cpp-python supports multi-modal models such as llava1.5, which allow the language model to read information from both text and images.
Below are the supported multi-modal models and their respective chat handlers (Python API) and chat formats (Server API).
Model | LlamaChatHandler | chat_format |
---|---|---|
llava-v1.5-7b | Llava15ChatHandler | llava-1-5 |
llava-v1.5-13b | Llava15ChatHandler | llava-1-5 |
llava-v1.6-34b | Llava16ChatHandler | llava-1-6 |
moondream2 | MoondreamChatHandler | moondream2 |
nanollava | NanollavaChatHandler | nanollava |
llama-3-vision-alpha | Llama3VisionAlphaChatHandler | llama-3-vision-alpha |
minicpm-v-2.6 | MiniCPMv26ChatHandler | minicpm-v-2.6 |
To use one of these models, you'll need a custom chat handler to load the clip model and process the chat messages and images.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler
chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
llm = Llama(
model_path="./path/to/llava/llama-model.gguf",
chat_handler=chat_handler,
n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)
llm.create_chat_completion(
messages = [
{"role": "system", "content": "You are an assistant who perfectly describes images."},
{
"role": "user",
"content": [
{"type" : "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }
]
}
]
)
You can also pull the model from the Hugging Face Hub using the from_pretrained method.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler
chat_handler = MoondreamChatHandler.from_pretrained(
repo_id="vikhyatk/moondream2",
filename="*mmproj*",
)
llm = Llama.from_pretrained(
repo_id="vikhyatk/moondream2",
filename="*text-model*",
chat_handler=chat_handler,
n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)
response = llm.create_chat_completion(
messages = [
{
"role": "user",
"content": [
{"type" : "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }
]
}
]
)
print(response["choices"][0]["message"]["content"])
Note: Multi-modal models also support tool calling and JSON mode.
Loading a Local Image
Images can be passed as base64 encoded data URIs. The following example demonstrates how to do this.
import base64
def image_to_base64_data_uri(file_path):
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode('utf-8')
        return f"data:image/png;base64,{base64_data}"
# Replace 'file_path.png' with the actual path to your PNG file
file_path = 'file_path.png'
data_uri = image_to_base64_data_uri(file_path)
messages = [
{"role": "system", "content": "You are an assistant who perfectly describes images."},
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": data_uri }},
{"type" : "text", "text": "Describe this image in detail please."}
]
}
]
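The example above hard-codes the image/png MIME type. A possible variation, sketched here with only the standard library, guesses the type from the file extension so that JPEG and other formats get a matching data URI. The function name image_to_data_uri is hypothetical, and the demo writes placeholder bytes rather than a real image.

```python
import base64
import mimetypes
import os
import tempfile


def image_to_data_uri(file_path):
    """Encode a local image as a data URI, guessing the MIME type from its name."""
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognised image file: {file_path}")
    with open(file_path, "rb") as img_file:
        encoded = base64.b64encode(img_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Demo with placeholder bytes (not a real image) in a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jpg")
    with open(demo_path, "wb") as demo_file:
        demo_file.write(b"abc")
    uri = image_to_data_uri(demo_path)
print(uri)
```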
Speculative Decoding
llama-cpp-python supports speculative decoding which allows the model to generate completions based on a draft model.
The fastest way to use speculative decoding is through the LlamaPromptLookupDecoding class.
Just pass this as a draft model to the Llama class during initialization.
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding
llama = Llama(
model_path="path/to/model.gguf",
draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10) # num_pred_tokens is the number of tokens to predict; 10 is the default and generally good for GPU, 2 performs better for CPU-only machines.
)
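The idea behind prompt lookup can be illustrated without a model: the draft tokens are copied from an earlier place in the token history where the most recent tokens already appeared. The function below is a simplified sketch of that idea under this assumption, not the code used by LlamaPromptLookupDecoding; prompt_lookup_draft and max_ngram_size are names chosen for illustration.

```python
def prompt_lookup_draft(tokens, max_ngram_size=2, num_pred_tokens=10):
    """Propose draft tokens by finding an earlier copy of the latest n-gram."""
    for n in range(max_ngram_size, 0, -1):
        if len(tokens) <= n:
            continue
        tail = tokens[-n:]
        # Search backwards for an earlier position where the same n-gram occurs.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == tail:
                # Propose whatever followed the earlier occurrence.
                return tokens[start + n:start + n + num_pred_tokens]
    return []


history = [1, 2, 3, 4, 1, 2]
print(prompt_lookup_draft(history, num_pred_tokens=2))
```

The main model then verifies the proposed tokens, so a wrong guess costs little, while a correct one lets several tokens be accepted at once.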
Embeddings
To generate text embeddings use create_embedding or embed. Note that you must pass embedding=True to the constructor upon model creation for these to work properly.
import llama_cpp
llm = llama_cpp.Llama(model_path="path/to/model.gguf", embedding=True)
embeddings = llm.create_embedding("Hello, world!")
# or create multiple embeddings at once
embeddings = llm.create_embedding(["Hello, world!", "Goodbye, world!"])
There are two primary notions of embeddings in a Transformer-style model: token level and sequence level. Sequence level embeddings are produced by "pooling" token level embeddings together, usually by averaging them or using the first token.
Models that are explicitly geared towards embeddings will usually return sequence level embeddings by default, one for each input string. Non-embedding models such as those designed for text generation will typically return only token level embeddings, one for each token in each sequence. Thus the dimensionality of the return type will be one higher for token level embeddings.
It is possible to control pooling behavior in some cases using the pooling_type flag on model creation. You can ensure token level embeddings from any model using LLAMA_POOLING_TYPE_NONE. The reverse, getting a generation oriented model to yield sequence level embeddings is currently not possible, but you can always do the pooling manually.
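Manual pooling can be as simple as averaging the token level vectors. The sketch below assumes the token level embeddings for one sequence are available as a list of equal-length lists of floats; mean_pool and l2_normalize are hypothetical helpers written in plain Python.

```python
import math


def mean_pool(token_embeddings):
    """Average token level vectors into one sequence level vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in token_embeddings) / count for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


tokens = [[1.0, 2.0], [3.0, 4.0]]  # two tokens, two dimensions
pooled = mean_pool(tokens)
print(pooled)
print(l2_normalize([3.0, 4.0]))
```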
Adjusting the Context Window
The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but can be adjusted based on your requirements.
For instance, if you want to work with larger contexts, you can expand the context window by setting the n_ctx parameter when initializing the Llama object:
llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
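A quick way to reason about n_ctx is to budget tokens before generating. The sketch below assumes that the prompt tokens and the generated tokens share the same window; fits_context is a hypothetical helper, and the token counts are taken from the sample completion earlier in this README.

```python
def fits_context(n_prompt_tokens, max_tokens, n_ctx=512):
    """Check whether the prompt plus the requested completion fits in n_ctx."""
    return n_prompt_tokens + max_tokens <= n_ctx


print(fits_context(14, 32))                # short prompt, default window
print(fits_context(500, 32))               # too long for the default window
print(fits_context(500, 32, n_ctx=2048))   # fits after raising n_ctx
```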
OpenAI Compatible Web Server
llama-cpp-python offers a web server which aims to act as a drop-in replacement for the OpenAI API.
This allows you to use llama.cpp compatible models with any OpenAI compatible client (language libraries, services, etc).
To install the server package and get started:
pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf
Similar to the Supported Backends section above, you can also install with GPU (CUDA) support like this:
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35
Navigate to http://localhost:8000/docs to see the OpenAPI documentation.
To bind to 0.0.0.0 to enable remote connections, use python3 -m llama_cpp.server --host 0.0.0.0.
Similarly, to change the port (default is 8000), use --port.
You probably also want to set the prompt format. For chatml, use
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --chat_format chatml
That will format the prompt according to how the model expects it. You can find the prompt format in the model card. For possible options, see llama_cpp/llama_chat_format.py and look for lines starting with "@register_chat_format".
If you have huggingface-hub installed, you can also use the --hf_model_repo_id flag to load a model from the Hugging Face Hub.
python3 -m llama_cpp.server --hf_model_repo_id Qwen/Qwen2-0.5B-Instruct-GGUF --model '*q8_0.gguf'
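Once the server is running, any HTTP client can talk to it. The following is a minimal sketch using only the standard library, assuming the server listens on the default http://localhost:8000 and exposes the OpenAI-style /v1/chat/completions route; build_chat_request and send are hypothetical helper names. Building the request is separated from sending it so the first part can be tried without a running server.

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://localhost:8000", **params):
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {"messages": messages, **params}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_chat_request(
    [{"role": "user", "content": "Hello"}],
    max_tokens=32,
)
print(request.full_url)
# reply = send(request)  # uncomment once the server is running
# print(reply["choices"][0]["message"]["content"])
```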
Web Server Features
Docker image
A Docker image is available on GHCR. To run the server:
docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest
Docker on termux (requires root) is currently the only known way to run this on phones, see termux support issue
Low-level API
The low-level API is a direct ctypes
binding to the C API provided by llama.cpp
.
The entire low-level API can be found in llama_cpp/llama_cpp.py and directly mirrors the C API in llama.h.
Below is a short example demonstrating how to use the low-level API to tokenize a prompt:
import llama_cpp
import ctypes
llama_cpp.llama_backend_init(False) # Must be called once at the start of each program
params = llama_cpp.llama_context_default_params()
# use bytes for char * params
model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
ctx = llama_cpp.llama_new_context_with_model(model, params)
max_tokens = params.n_ctx
# use ctypes arrays for array params
tokens = (llama_cpp.llama_token * int(max_tokens))()
n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, llama_cpp.c_bool(True))
llama_cpp.llama_free(ctx)
Check out the examples folder for more examples of using the low-level API.
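The two ctypes conventions used in the example above, bytes for char * parameters and ctypes arrays for array parameters, can be tried with the standard library alone. In the sketch below, token_type is an assumption standing in for the binding's token type (a 32-bit integer), the helper names are hypothetical, and the negative-count check reflects the assumption that the tokenizer reports a too-small buffer with a negative return value.

```python
import ctypes

token_type = ctypes.c_int32  # assumption: stands in for the binding's token type


def make_token_buffer(capacity):
    """Allocate a zero-initialised C array that a C function can fill."""
    return (token_type * capacity)()


def as_c_string(text):
    """char * parameters take bytes, so encode Python strings first."""
    return text.encode("utf-8")


def used_tokens(buffer, n_tokens):
    """Copy the filled part of the buffer into a Python list."""
    if n_tokens < 0:
        raise ValueError(f"buffer too small, {-n_tokens} tokens needed")
    return list(buffer[:n_tokens])


buffer = make_token_buffer(8)
buffer[0], buffer[1] = 1, 42  # pretend a tokenizer wrote two tokens
print(as_c_string("Q: Name the planets in the solar system? A: "))
print(used_tokens(buffer, 2))
```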
Documentation
Documentation is available via https://llama-cpp-python.readthedocs.io/. If you find any issues with the documentation, please open an issue or submit a PR.
Development
This package is under active development and I welcome any contributions.
To get started, clone the repository and install the package in editable / development mode:
git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
# Upgrade pip (required for editable mode)
pip install --upgrade pip
# Install with pip
pip install -e .
# if you want to use the fastapi / openapi server
pip install -e .[server]
# to install all optional dependencies
pip install -e .[all]
# to clear the local build cache
make clean
You can also test out specific commits of llama.cpp by checking out the desired commit in the vendor/llama.cpp submodule and then running make clean and pip install -e . again. Any changes in the llama.h API will require changes to the llama_cpp/llama_cpp.py file to match the new API (additional changes may be required elsewhere).
FAQ
Are there pre-built binaries / binary wheels available?
The recommended installation method is to install from source as described above.
The reason for this is that llama.cpp is built with compiler optimizations that are specific to your system.
Using pre-built binaries would require disabling these optimizations or supporting a large number of pre-built binaries for each platform.
That being said there are some pre-built binaries available through the Releases as well as some community provided wheels.
In the future, I would like to provide pre-built binaries and wheels for common platforms and I'm happy to accept any useful contributions in this area. This is currently being tracked in #741
How does this compare to other Python bindings of llama.cpp?
I originally wrote this package for my own use with two goals in mind:
- Provide a simple process to install llama.cpp and access the full C API in llama.h from Python
- Provide a high-level Python API that can be used as a drop-in replacement for the OpenAI API so existing apps can be easily ported to use llama.cpp
Any contributions and changes to this package will be made with these goals in mind.
License
This project is licensed under the terms of the MIT license.