Top Related Projects
Inference code for Llama models
LLM inference in C/C++
Code and documentation to train Stanford's Alpaca models, and generate the data.
Instruct-tune LLaMA on consumer hardware
Inference Llama 2 in one file of pure C
A Gradio web UI for Large Language Models.
Quick Overview
meta-llama/llama is the official repository for LLaMA (Large Language Model Meta AI), a collection of foundation language models developed by Meta AI. These models range in size from 7B to 65B parameters and are designed to be more efficient and performant than many existing large language models.
Pros
- High performance and efficiency compared to other large language models
- Available in multiple sizes to suit different computational requirements
- Open-source, allowing for research and development by the community
- Trained on a diverse range of high-quality data sources
Cons
- Requires significant computational resources for larger model sizes
- Access is restricted and requires approval from Meta AI
- Limited documentation and examples compared to some other language models
- Potential for misuse or generation of harmful content if not properly managed
Getting Started
To get started with LLaMA, you need to request access from Meta AI. Once approved:
- Clone the repository:
  git clone https://github.com/facebookresearch/llama.git
- Install the required dependencies:
  pip install -r requirements.txt
- Download the model weights and tokenizer (after receiving access) by running the download script and entering the signed URL from your approval email:
  ./download.sh
- Use the model in your Python code:
  from llama import Llama

  model = Llama.build(
      ckpt_dir="llama-2-7b/",
      tokenizer_path="tokenizer.model",
      max_seq_len=512,
      max_batch_size=4,
  )
  prompt = "Tell me a short story about a robot learning to paint."
  result = model.generate(prompt, max_gen_len=100)
  print(result)
Note: Actual usage may vary depending on the specific version and implementation details provided by Meta AI upon access approval.
Competitor Comparisons
Inference code for Llama models
Pros of Llama
- More comprehensive documentation and setup instructions
- Broader community support and active development
- Includes pre-trained models and fine-tuning scripts
Cons of Llama
- Larger repository size, potentially slower to clone and work with
- May have more complex dependencies and setup requirements
- Could be overwhelming for beginners due to extensive features
Code Comparison
Llama:
from llama import Llama
model = Llama(model_path="path/to/model.pth")
output = model.generate("Hello, how are you?", max_length=50)
print(output)
llama-cpp-python:
from llama_cpp import Llama
llm = Llama(model_path="path/to/model.bin")
output = llm("Hello, how are you?", max_tokens=50)
print(output['choices'][0]['text'])
Note: The code comparison is illustrative, and actual implementation details may vary. The llama-cpp-python snippet offers a more streamlined, simpler API for basic usage, while the Llama repository provides more advanced features and customization options for researchers and developers working on large language models.
LLM inference in C/C++
Pros of llama.cpp
- Optimized for CPU inference, allowing for efficient execution on consumer hardware
- Supports quantization, reducing model size and memory requirements
- Provides a portable C/C++ implementation, enhancing cross-platform compatibility
Cons of llama.cpp
- Limited to inference only, lacking training capabilities
- May not support all features and model variants available in the original LLaMA implementation
Code Comparison
LLaMA (Python):
from transformers import LlamaForCausalLM, LlamaTokenizer
model = LlamaForCausalLM.from_pretrained("path/to/model")
tokenizer = LlamaTokenizer.from_pretrained("path/to/tokenizer")
input_text = "Hello, how are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=50)
llama.cpp (C++):
#include "llama.h"
llama_context * ctx = llama_init_from_file("path/to/model.bin", params);
llama_tokenize(ctx, "Hello, how are you?", tokens, max_tokens, add_bos);
llama_eval(ctx, tokens, n_tokens, n_past, n_threads);
llama_print_timings(ctx);
llama_free(ctx);
Code and documentation to train Stanford's Alpaca models, and generate the data.
Pros of Stanford Alpaca
- Open-source and freely available for research and non-commercial use
- Provides instruction-tuning dataset and methodology for fine-tuning LLaMA
- Includes scripts for data generation and model training
Cons of Stanford Alpaca
- Limited to non-commercial use due to LLaMA's license restrictions
- Smaller model size and potentially lower performance compared to full LLaMA
- Less extensive documentation and community support
Code Comparison
Stanford Alpaca:
def generate_prompt(instruction, input=None):
    if input:
        return f"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    else:
        return f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"
LLaMA:
struct LLaMATokenizer {
    std::unordered_map<std::string, int32_t> token_to_id;
    std::vector<std::string> id_to_token;
};
Note: LLaMA's repository primarily contains C++ code for model inference, while Stanford Alpaca focuses on Python scripts for fine-tuning and data generation.
Instruct-tune LLaMA on consumer hardware
Pros of Alpaca-LoRA
- Easier to fine-tune and adapt for specific tasks
- Requires less computational resources
- More accessible for researchers and developers with limited hardware
Cons of Alpaca-LoRA
- Potentially lower performance compared to full LLaMA model
- Limited to the capabilities of the base LLaMA model
- May require additional training data for optimal results
Code Comparison
LLaMA:
from transformers import LlamaForCausalLM, LlamaTokenizer
model = LlamaForCausalLM.from_pretrained("meta-llama/llama-7b")
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/llama-7b")
Alpaca-LoRA:
from peft import PeftModel, PeftConfig
from transformers import LlamaForCausalLM, LlamaTokenizer
model = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
The main difference in the code is that Alpaca-LoRA requires additional steps to load the LoRA weights and apply them to the base LLaMA model. This allows for more efficient fine-tuning and adaptation of the model for specific tasks.
Inference Llama 2 in one file of pure C
Pros of llama2.c
- Lightweight and portable implementation in C
- Designed for educational purposes and easy understanding
- Can run on various platforms, including microcontrollers
Cons of llama2.c
- Limited functionality compared to the original implementation
- May not support all features and optimizations of the full LLaMA model
- Potentially lower performance for large-scale applications
Code Comparison
llama2.c:
float* llama_token_to_embedding(struct llama_model* model, int token) {
    return model->token_embedding_table + token * model->dim;
}
llama:
def forward(self, tokens: torch.Tensor, start_pos: int):
    _bsz, seqlen = tokens.shape
    h = self.tok_embeddings(tokens)
    self.cache_k = self.cache_k.to(h.device)
    self.cache_v = self.cache_v.to(h.device)
The llama2.c implementation focuses on simplicity and readability, while the original llama repository uses PyTorch and includes more advanced features for model training and inference.
A Gradio web UI for Large Language Models.
Pros of text-generation-webui
- User-friendly web interface for easy interaction with language models
- Supports multiple models and architectures, offering flexibility
- Includes features like chat, instruct mode, and notebook interface
Cons of text-generation-webui
- May have higher resource requirements due to the web interface
- Potentially slower inference compared to direct model usage
- Limited fine-tuning capabilities compared to the original model repository
Code Comparison
text-generation-webui:
def generate_reply(
    question, settings, stopping_strings=None, is_chat=False
):
    # Generation logic here
    return response
llama:
def generate(
    prompt, max_gen_len, temperature=0.8, top_p=0.95
):
    # Generation logic here
    return response
The code snippets show that text-generation-webui focuses on a more user-friendly interface with additional parameters like stopping_strings and is_chat, while llama provides a more straightforward generation function with core parameters like temperature and top_p.
text-generation-webui is designed for ease of use and experimentation, offering a range of features through its web interface. llama, on the other hand, provides direct access to the model, which may be more suitable for advanced users or integration into other applications. The choice between the two depends on the user's needs, technical expertise, and desired level of control over the language model.
README
Note of deprecation
Thank you for developing with Llama models. As part of the Llama 3.1 release, we've consolidated GitHub repos and added some additional repos as we've expanded Llama's functionality into being an e2e Llama Stack. Please use the following repos going forward:
- llama-models - Central repo for the foundation models including basic utilities, model cards, license and use policies
- PurpleLlama - Key component of Llama Stack focusing on safety risks and inference time mitigations
- llama-toolchain - Model development (inference/fine-tuning/safety shields/synthetic data generation) interfaces and canonical implementations
- llama-agentic-system - E2E standalone Llama Stack system, along with opinionated underlying interface, that enables creation of agentic applications
- llama-recipes - Community driven scripts and integrations
If you have any questions, please feel free to file an issue on any of the above repos and we will do our best to respond in a timely manner.
Thank you!
(Deprecated) Llama 2
We are unlocking the power of large language models. Llama 2 is now accessible to individuals, creators, researchers, and businesses of all sizes so that they can experiment, innovate, and scale their ideas responsibly.
This release includes model weights and starting code for pre-trained and fine-tuned Llama language models, ranging from 7B to 70B parameters.
This repository is intended as a minimal example to load Llama 2 models and run inference. For more detailed examples leveraging Hugging Face, see llama-recipes.
Updates post-launch
See UPDATES.md. Also for a running list of frequently asked questions, see here.
Download
In order to download the model weights and tokenizer, please visit the Meta website and accept our License.
Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.
Pre-requisites: Make sure you have wget and md5sum installed. Then run the script: ./download.sh
Keep in mind that the links expire after 24 hours and a certain number of downloads. If you start seeing errors such as 403: Forbidden, you can always re-request a link.
Access to Hugging Face
We are also providing downloads on Hugging Face. You can request access to the models by acknowledging the license and filling the form in the model card of a repo. After doing so, you should get access to all the Llama models of a version (Code Llama, Llama 2, or Llama Guard) within 1 hour.
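Once access is granted, the checkpoints hosted on Hugging Face can be loaded with the transformers library. A minimal sketch, assuming the 7B base checkpoint id meta-llama/Llama-2-7b-hf and an authenticated Hugging Face account with approved access:
from transformers import LlamaForCausalLM, LlamaTokenizer

# Assumed model id for the 7B base checkpoint; swap in the variant you were granted.
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))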
Quick Start
You can follow the steps below to quickly get up and running with Llama 2 models. These steps will let you run quick inference locally. For more examples, see the Llama 2 recipes repository.
- In a conda env with PyTorch / CUDA available, clone and download this repository.
- In the top-level directory run:
  pip install -e .
- Visit the Meta website and register to download the model/s.
- Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.
- Once you get the email, navigate to your downloaded llama repository and run the download.sh script.
  - Make sure to grant execution permissions to the download.sh script.
  - During this process, you will be prompted to enter the URL from the email.
  - Do not use the "Copy Link" option; make sure to manually copy the link from the email.
- Once the model/s you want have been downloaded, you can run the model locally using the command below:
torchrun --nproc_per_node 1 example_chat_completion.py \
--ckpt_dir llama-2-7b-chat/ \
--tokenizer_path tokenizer.model \
--max_seq_len 512 --max_batch_size 6
Note
- Replace llama-2-7b-chat/ with the path to your checkpoint directory and tokenizer.model with the path to your tokenizer model.
- The --nproc_per_node should be set to the MP value for the model you are using.
- Adjust the max_seq_len and max_batch_size parameters as needed.
- This example runs the example_chat_completion.py found in this repository, but you can change that to a different .py file.
Inference
Different models require different model-parallel (MP) values:
Model | MP |
---|---|
7B | 1 |
13B | 2 |
70B | 8 |
All models support sequence length up to 4096 tokens, but we pre-allocate the cache according to the max_seq_len and max_batch_size values. So set those according to your hardware.
Pretrained Models
These models are not finetuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt.
See example_text_completion.py for some examples. To illustrate, see the command below to run it with the llama-2-7b model (nproc_per_node needs to be set to the MP value):
torchrun --nproc_per_node 1 example_text_completion.py \
--ckpt_dir llama-2-7b/ \
--tokenizer_path tokenizer.model \
--max_seq_len 128 --max_batch_size 4
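For reference, here is a minimal Python sketch of a text-completion script along these lines, modeled on the Llama.build and text_completion interfaces used by the repository's examples. Treat the paths and sampling values as placeholders; the script still needs to be launched with torchrun as shown above.
from llama import Llama

# Sketch of a text-completion script; run it via torchrun as shown above.
generator = Llama.build(
    ckpt_dir="llama-2-7b/",            # path to the downloaded checkpoint
    tokenizer_path="tokenizer.model",  # path to the tokenizer model
    max_seq_len=128,
    max_batch_size=4,
)

prompts = ["I believe the meaning of life is"]
results = generator.text_completion(
    prompts,
    max_gen_len=64,
    temperature=0.6,
    top_p=0.9,
)
for prompt, result in zip(prompts, results):
    print(prompt + result["generation"])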
Fine-tuned Chat Models
The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in chat_completion needs to be followed, including the INST and <<SYS>> tags, BOS and EOS tokens, and the whitespace and line breaks in between (we recommend calling strip() on inputs to avoid double-spaces).
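As an illustration only (the authoritative template is the one implemented in chat_completion, which also handles the BOS and EOS tokens), a single-turn prompt with a system message is assembled roughly like this:
# Rough sketch of the single-turn chat template; the exact formatting lives in
# chat_completion in this repository and should be preferred in real code.
system_prompt = "You are a helpful assistant."
user_message = "Tell me a short story about a robot learning to paint."

prompt = (
    f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    f"{user_message} [/INST]"
)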
You can also deploy additional classifiers for filtering out inputs and outputs that are deemed unsafe. See the llama-recipes repo for an example of how to add a safety checker to the inputs and outputs of your inference code.
Examples using llama-2-7b-chat:
torchrun --nproc_per_node 1 example_chat_completion.py \
--ckpt_dir llama-2-7b-chat/ \
--tokenizer_path tokenizer.model \
--max_seq_len 512 --max_batch_size 6
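Below is a rough Python sketch of what such a chat script does, using the Llama.build and chat_completion interfaces from this repository. The paths, dialog contents, and sampling values are placeholders; launch it via torchrun as above.
from llama import Llama

# Sketch of a chat-completion script; chat_completion applies the INST/<<SYS>>
# formatting described above to the role/content messages in each dialog.
generator = Llama.build(
    ckpt_dir="llama-2-7b-chat/",
    tokenizer_path="tokenizer.model",
    max_seq_len=512,
    max_batch_size=6,
)

dialogs = [
    [
        {"role": "system", "content": "Always answer concisely."},
        {"role": "user", "content": "What is the capital of France?"},
    ]
]
results = generator.chat_completion(
    dialogs,
    max_gen_len=None,
    temperature=0.6,
    top_p=0.9,
)
for result in results:
    print(result["generation"]["role"], ":", result["generation"]["content"])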
Llama 2 is a new technology that carries potential risks with use. Testing conducted to date has not, and could not, cover all scenarios. In order to help developers address these risks, we have created the Responsible Use Guide. More details can be found in our research paper as well.
Issues
Please report any software "bug", or other problems with the models, through one of the following means:
- Reporting issues with the model: github.com/facebookresearch/llama
- Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
- Reporting bugs and security concerns: facebook.com/whitehat/info
Model Card
See MODEL_CARD.md.
License
Our model and weights are licensed for both researchers and commercial entities, upholding the principles of openness. Our mission is to empower individuals and industry through this opportunity, while fostering an environment of discovery and ethical AI advancements.
See the LICENSE file, as well as our accompanying Acceptable Use Policy.
References
For common questions, the FAQ can be found here; it will be kept up to date as new questions arise.
Original Llama
The repo for the original llama release is in the llama_v1 branch.