Top Related Projects
Inference code for Llama models
LLM inference in C/C++
Code and documentation to train Stanford's Alpaca models, and generate the data.
Instruct-tune LLaMA on consumer hardware
Inference Llama 2 in one file of pure C
A Gradio web UI for Large Language Models.
Quick Overview
meta-llama/llama is the official repository for LLaMA (Large Language Model Meta AI), a collection of foundation language models developed by Meta AI. These models range in size from 7B to 65B parameters and are designed to be more efficient and performant than many existing large language models.
Pros
- High performance and efficiency compared to other large language models
- Available in multiple sizes to suit different computational requirements
- Open-source, allowing for research and development by the community
- Trained on a diverse range of high-quality data sources
Cons
- Requires significant computational resources for larger model sizes
- Access is restricted and requires approval from Meta AI
- Limited documentation and examples compared to some other language models
- Potential for misuse or generation of harmful content if not properly managed
Getting Started
To get started with LLaMA, you need to request access from Meta AI. Once approved:
- Clone the repository:
  git clone https://github.com/facebookresearch/llama.git
- Install the required dependencies:
  pip install -r requirements.txt
- Download the model weights and tokenizer (after receiving access) by running the download script and entering the signed URL from your approval email:
  ./download.sh
- Use the model in your Python code:
  from llama import Llama

  model = Llama.build(
      ckpt_dir="llama-2-7b/",
      tokenizer_path="tokenizer.model",
      max_seq_len=512,
      max_batch_size=4,
  )
  prompt = "Tell me a short story about a robot learning to paint."
  result = model.generate(prompt, max_gen_len=100)
  print(result)
Note: Actual usage may vary depending on the specific version and implementation details provided by Meta AI upon access approval.
Competitor Comparisons
Inference code for Llama models
Pros of Llama
- More comprehensive documentation and setup instructions
- Broader community support and active development
- Includes pre-trained models and fine-tuning scripts
Cons of Llama
- Larger repository size, potentially slower to clone and work with
- May have more complex dependencies and setup requirements
- Could be overwhelming for beginners due to extensive features
Code Comparison
Llama:
from llama import Llama
model = Llama(model_path="path/to/model.pth")
output = model.generate("Hello, how are you?", max_length=50)
print(output)
llama-cpp-python:
from llama_cpp import Llama
llm = Llama(model_path="path/to/model.bin")
output = llm("Hello, how are you?", max_tokens=50)
print(output['choices'][0]['text'])
Note: The code comparison is illustrative, and actual implementation details may vary. The llama-cpp-python snippet offers a more streamlined, simpler API for basic usage, while the Llama repository provides more advanced features and customization options for researchers and developers working on large language models.
LLM inference in C/C++
Pros of llama.cpp
- Optimized for CPU inference, allowing for efficient execution on consumer hardware
- Supports quantization, reducing model size and memory requirements
- Provides a portable C/C++ implementation, enhancing cross-platform compatibility
Cons of llama.cpp
- Limited to inference only, lacking training capabilities
- May not support all features and model variants available in the original LLaMA implementation
Code Comparison
LLaMA (Python):
from transformers import LlamaForCausalLM, LlamaTokenizer
model = LlamaForCausalLM.from_pretrained("path/to/model")
tokenizer = LlamaTokenizer.from_pretrained("path/to/tokenizer")
input_text = "Hello, how are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=50)
llama.cpp (C++):
#include "llama.h"
llama_context * ctx = llama_init_from_file("path/to/model.bin", params);
llama_tokenize(ctx, "Hello, how are you?", tokens, max_tokens, add_bos);
llama_eval(ctx, tokens, n_tokens, n_past, n_threads);
llama_print_timings(ctx);
llama_free(ctx);
Code and documentation to train Stanford's Alpaca models, and generate the data.
Pros of Stanford Alpaca
- Open-source and freely available for research and non-commercial use
- Provides instruction-tuning dataset and methodology for fine-tuning LLaMA
- Includes scripts for data generation and model training
Cons of Stanford Alpaca
- Limited to non-commercial use due to LLaMA's license restrictions
- Smaller model size and potentially lower performance compared to full LLaMA
- Less extensive documentation and community support
Code Comparison
Stanford Alpaca:
def generate_prompt(instruction, input=None):
    if input:
        return f"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    else:
        return f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"
LLaMA:
struct LLaMATokenizer {
    std::unordered_map<std::string, int32_t> token_to_id;
    std::vector<std::string> id_to_token;
};
Note: LLaMA's repository primarily contains C++ code for model inference, while Stanford Alpaca focuses on Python scripts for fine-tuning and data generation.
Instruct-tune LLaMA on consumer hardware
Pros of Alpaca-LoRA
- Easier to fine-tune and adapt for specific tasks
- Requires less computational resources
- More accessible for researchers and developers with limited hardware
Cons of Alpaca-LoRA
- Potentially lower performance compared to full LLaMA model
- Limited to the capabilities of the base LLaMA model
- May require additional training data for optimal results
Code Comparison
LLaMA:
from transformers import LlamaForCausalLM, LlamaTokenizer
model = LlamaForCausalLM.from_pretrained("meta-llama/llama-7b")
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/llama-7b")
Alpaca-LoRA:
from peft import PeftModel, PeftConfig
from transformers import LlamaForCausalLM, LlamaTokenizer
model = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
The main difference in the code is that Alpaca-LoRA requires additional steps to load the LoRA weights and apply them to the base LLaMA model. This allows for more efficient fine-tuning and adaptation of the model for specific tasks.
Inference Llama 2 in one file of pure C
Pros of llama2.c
- Lightweight and portable implementation in C
- Designed for educational purposes and easy understanding
- Can run on various platforms, including microcontrollers
Cons of llama2.c
- Limited functionality compared to the original implementation
- May not support all features and optimizations of the full LLaMA model
- Potentially lower performance for large-scale applications
Code Comparison
llama2.c:
float* llama_token_to_embedding(struct llama_model* model, int token) {
    return model->token_embedding_table + token * model->dim;
}
llama:
def forward(self, tokens: torch.Tensor, start_pos: int):
    _bsz, seqlen = tokens.shape
    h = self.tok_embeddings(tokens)
    self.cache_k = self.cache_k.to(h.device)
    self.cache_v = self.cache_v.to(h.device)
The llama2.c implementation focuses on simplicity and readability, while the original llama repository uses PyTorch and includes more advanced features for model training and inference.
A Gradio web UI for Large Language Models.
Pros of text-generation-webui
- User-friendly web interface for easy interaction with language models
- Supports multiple models and architectures, offering flexibility
- Includes features like chat, instruct mode, and notebook interface
Cons of text-generation-webui
- May have higher resource requirements due to the web interface
- Potentially slower inference compared to direct model usage
- Limited fine-tuning capabilities compared to the original model repository
Code Comparison
text-generation-webui:
def generate_reply(
    question, settings, stopping_strings=None, is_chat=False
):
    # Generation logic here
    return response
llama:
def generate(
    prompt, max_gen_len, temperature=0.8, top_p=0.95
):
    # Generation logic here
    return response
The code snippets show that text-generation-webui focuses on a more user-friendly interface with additional parameters like stopping_strings and is_chat, while llama provides a more straightforward generation function with core parameters like temperature and top_p.
text-generation-webui is designed for ease of use and experimentation, offering a range of features through its web interface. llama, on the other hand, provides direct access to the model, which may be more suitable for advanced users or integration into other applications. The choice between the two depends on the user's needs, technical expertise, and desired level of control over the language model.
README
Note of deprecation
Thank you for developing with Llama models. As part of the Llama 3.1 release, we've consolidated GitHub repos and added some additional repos as we've expanded Llama's functionality into being an e2e Llama Stack. Please use the following repos going forward:
- llama-models - Central repo for the foundation models including basic utilities, model cards, license and use policies
- PurpleLlama - Key component of Llama Stack focusing on safety risks and inference time mitigations
- llama-toolchain - Model development (inference/fine-tuning/safety shields/synthetic data generation) interfaces and canonical implementations
- llama-agentic-system - E2E standalone Llama Stack system, along with opinionated underlying interface, that enables creation of agentic applications
- llama-recipes - Community driven scripts and integrations
If you have any questions, please feel free to file an issue on any of the above repos and we will do our best to respond in a timely manner.
Thank you!
(Deprecated) Llama 2
We are unlocking the power of large language models. Llama 2 is now accessible to individuals, creators, researchers, and businesses of all sizes so that they can experiment, innovate, and scale their ideas responsibly.
This release includes model weights and starting code for pre-trained and fine-tuned Llama language models, ranging from 7B to 70B parameters.
This repository is intended as a minimal example to load Llama 2 models and run inference. For more detailed examples leveraging Hugging Face, see llama-recipes.
Updates post-launch
See UPDATES.md. Also for a running list of frequently asked questions, see here.
Download
In order to download the model weights and tokenizer, please visit the Meta website and accept our License.
Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.
Pre-requisites: Make sure you have wget and md5sum installed. Then run the script: ./download.sh
Keep in mind that the links expire after 24 hours and a certain number of downloads. If you start seeing errors such as 403: Forbidden, you can always re-request a link.
Access to Hugging Face
We are also providing downloads on Hugging Face. You can request access to the models by acknowledging the license and filling the form in the model card of a repo. After doing so, you should get access to all the Llama models of a version (Code Llama, Llama 2, or Llama Guard) within 1 hour.
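Once access is granted, the checkpoints hosted on Hugging Face can be loaded with the transformers library. A minimal sketch, assuming the 7B base checkpoint id meta-llama/Llama-2-7b-hf and an authenticated Hugging Face account with approved access:
from transformers import LlamaForCausalLM, LlamaTokenizer

# Assumed model id for the 7B base checkpoint; swap in the variant you were granted.
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))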
Quick Start
You can follow the steps below to quickly get up and running with Llama 2 models. These steps will let you run quick inference locally. For more examples, see the Llama 2 recipes repository.
- In a conda env with PyTorch / CUDA available, clone and download this repository.
- In the top-level directory run:
  pip install -e .
- Visit the Meta website and register to download the model/s.
- Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.
- Once you get the email, navigate to your downloaded llama repository and run the download.sh script.
  - Make sure to grant execution permissions to the download.sh script.
  - During this process, you will be prompted to enter the URL from the email.
  - Do not use the "Copy Link" option; make sure to manually copy the link from the email.
- Once the model/s you want have been downloaded, you can run the model locally using the command below:
torchrun --nproc_per_node 1 example_chat_completion.py \
--ckpt_dir llama-2-7b-chat/ \
--tokenizer_path tokenizer.model \
--max_seq_len 512 --max_batch_size 6
Note
- Replace llama-2-7b-chat/ with the path to your checkpoint directory and tokenizer.model with the path to your tokenizer model.
- The --nproc_per_node should be set to the MP value for the model you are using.
- Adjust the max_seq_len and max_batch_size parameters as needed.
- This example runs the example_chat_completion.py found in this repository, but you can change that to a different .py file.
Inference
Different models require different model-parallel (MP) values:
Model | MP |
---|---|
7B | 1 |
13B | 2 |
70B | 8 |
All models support sequence length up to 4096 tokens, but we pre-allocate the cache according to the max_seq_len and max_batch_size values. So set those according to your hardware.
Pretrained Models
These models are not finetuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt.
See example_text_completion.py for some examples. To illustrate, see the command below to run it with the llama-2-7b model (nproc_per_node needs to be set to the MP value):
torchrun --nproc_per_node 1 example_text_completion.py \
--ckpt_dir llama-2-7b/ \
--tokenizer_path tokenizer.model \
--max_seq_len 128 --max_batch_size 4
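For reference, here is a minimal Python sketch of a text-completion script along these lines, modeled on the Llama.build and text_completion interfaces used by the repository's examples. Treat the paths and sampling values as placeholders; the script still needs to be launched with torchrun as shown above.
from llama import Llama

# Sketch of a text-completion script; run it via torchrun as shown above.
generator = Llama.build(
    ckpt_dir="llama-2-7b/",            # path to the downloaded checkpoint
    tokenizer_path="tokenizer.model",  # path to the tokenizer model
    max_seq_len=128,
    max_batch_size=4,
)

prompts = ["I believe the meaning of life is"]
results = generator.text_completion(
    prompts,
    max_gen_len=64,
    temperature=0.6,
    top_p=0.9,
)
for prompt, result in zip(prompts, results):
    print(prompt + result["generation"])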
Fine-tuned Chat Models
The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in chat_completion needs to be followed, including the INST and <<SYS>> tags, BOS and EOS tokens, and the whitespace and line breaks in between (we recommend calling strip() on inputs to avoid double-spaces).
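As an illustration only (the authoritative template is the one implemented in chat_completion, which also handles the BOS and EOS tokens), a single-turn prompt with a system message is assembled roughly like this:
# Rough sketch of the single-turn chat template; the exact formatting lives in
# chat_completion in this repository and should be preferred in real code.
system_prompt = "You are a helpful assistant."
user_message = "Tell me a short story about a robot learning to paint."

prompt = (
    f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    f"{user_message} [/INST]"
)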
You can also deploy additional classifiers for filtering out inputs and outputs that are deemed unsafe. See the llama-recipes repo for an example of how to add a safety checker to the inputs and outputs of your inference code.
Examples using llama-2-7b-chat:
torchrun --nproc_per_node 1 example_chat_completion.py \
--ckpt_dir llama-2-7b-chat/ \
--tokenizer_path tokenizer.model \
--max_seq_len 512 --max_batch_size 6
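Below is a rough Python sketch of what such a chat script does, using the Llama.build and chat_completion interfaces from this repository. The paths, dialog contents, and sampling values are placeholders; launch it via torchrun as above.
from llama import Llama

# Sketch of a chat-completion script; chat_completion applies the INST/<<SYS>>
# formatting described above to the role/content messages in each dialog.
generator = Llama.build(
    ckpt_dir="llama-2-7b-chat/",
    tokenizer_path="tokenizer.model",
    max_seq_len=512,
    max_batch_size=6,
)

dialogs = [
    [
        {"role": "system", "content": "Always answer concisely."},
        {"role": "user", "content": "What is the capital of France?"},
    ]
]
results = generator.chat_completion(
    dialogs,
    max_gen_len=None,
    temperature=0.6,
    top_p=0.9,
)
for result in results:
    print(result["generation"]["role"], ":", result["generation"]["content"])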
Llama 2 is a new technology that carries potential risks with use. Testing conducted to date has not, and could not, cover all scenarios. In order to help developers address these risks, we have created the Responsible Use Guide. More details can be found in our research paper as well.
Issues
Please report any software "bug", or other problems with the models, through one of the following means:
- Reporting issues with the model: github.com/facebookresearch/llama
- Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
- Reporting bugs and security concerns: facebook.com/whitehat/info
Model Card
See MODEL_CARD.md.
License
Our model and weights are licensed for both researchers and commercial entities, upholding the principles of openness. Our mission is to empower individuals and industry through this opportunity, while fostering an environment of discovery and ethical AI advancements.
See the LICENSE file, as well as our accompanying Acceptable Use Policy.
References
For common questions, the FAQ can be found here; it will be kept up to date as new questions arise.
Original Llama
The repo for the original llama release is in the llama_v1 branch.