
meta-llama/llama

Inference code for Llama models



Quick Overview

meta-llama/llama is Meta's official repository of inference code for the Llama family of foundation language models (originally LLaMA, Large Language Model Meta AI). The Llama 2 release documented here ranges from 7B to 70B parameters; the original LLaMA release, which ranged from 7B to 65B parameters, is kept in the llama_v1 branch. The repository is now deprecated in favor of the repos listed in the README below.

Pros

  • High performance and efficiency compared to other large language models
  • Available in multiple sizes to suit different computational requirements
  • Open-source, allowing for research and development by the community
  • Trained on a diverse range of high-quality data sources

Cons

  • Requires significant computational resources for larger model sizes
  • Access is restricted and requires approval from Meta AI
  • Limited documentation and examples compared to some other language models
  • Potential for misuse or generation of harmful content if not properly managed

Getting Started

To get started with Llama 2, request access on the Meta website and accept the license. Once approved:

  1. Clone the repository:

    git clone https://github.com/meta-llama/llama.git

  2. Install the package and its dependencies from the top-level directory:

    pip install -e .

  3. Download the model weights and tokenizer, pasting the signed URL from the approval email when prompted:

    ./download.sh

  4. Use the model in your Python code (launch the script with torchrun, as in the README's Quick Start):

    from llama import Llama

    # Build a generator from a downloaded checkpoint directory and tokenizer.
    generator = Llama.build(
        ckpt_dir="llama-2-7b/",
        tokenizer_path="tokenizer.model",
        max_seq_len=512,
        max_batch_size=4,
    )

    prompt = "Tell me a short story about a robot learning to paint."
    # text_completion takes a list of prompts and returns one result per prompt.
    results = generator.text_completion([prompt], max_gen_len=100)
    print(results[0]["generation"])

Note: Actual usage may vary with the version of the repository and the checkpoint you download.

Competitor Comparisons

56,019

Inference code for Llama models

Pros of Llama

  • More comprehensive documentation and setup instructions
  • Broader community support and active development
  • Includes pre-trained models and fine-tuning scripts

Cons of Llama

  • Larger repository size, potentially slower to clone and work with
  • May have more complex dependencies and setup requirements
  • Could be overwhelming for beginners due to extensive features

Code Comparison

meta-llama/llama:

from llama import Llama

# Llama.build takes a checkpoint directory and a tokenizer path; launch with torchrun.
model = Llama.build(ckpt_dir="path/to/ckpt_dir/", tokenizer_path="path/to/tokenizer.model",
                    max_seq_len=512, max_batch_size=1)
output = model.text_completion(["Hello, how are you?"], max_gen_len=50)
print(output[0]["generation"])

llama_cpp (Python bindings):

from llama_cpp import Llama

llm = Llama(model_path="path/to/model.bin")
output = llm("Hello, how are you?", max_tokens=50)
print(output['choices'][0]['text'])

Note: This comparison is illustrative, and details vary by version. The second snippet loads a single model file and returns a completion dictionary with a choices list, while meta-llama/llama builds a generator from a checkpoint directory and expects to be launched with torchrun.

66,315

LLM inference in C/C++

Pros of llama.cpp

  • Optimized for CPU inference, allowing for efficient execution on consumer hardware
  • Supports quantization, reducing model size and memory requirements
  • Provides a portable C/C++ implementation, enhancing cross-platform compatibility

Cons of llama.cpp

  • Limited to inference only, lacking training capabilities
  • May not support all features and model variants available in the original LLaMA implementation
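
To make the quantization point concrete, here is a rough sketch of how weight memory scales with bits per weight. The 7-billion parameter count is a round illustrative figure, and the estimate covers weights only: it ignores per-block scale factors, activations, and the key/value cache, so real quantized files will be somewhat larger.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Compare a hypothetical 7-billion-parameter model at three precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gib(7e9, bits):.1f} GiB")
```

Halving the bits per weight halves the estimate, which is why 4-bit variants fit on hardware that cannot hold the 16-bit weights.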

Code Comparison

LLaMA via Hugging Face Transformers (Python):

from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("path/to/model")
tokenizer = LlamaTokenizer.from_pretrained("path/to/tokenizer")

input_text = "Hello, how are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=50)

llama.cpp (C++):

#include "llama.h"

llama_context * ctx = llama_init_from_file("path/to/model.bin", params);
llama_tokenize(ctx, "Hello, how are you?", tokens, max_tokens, add_bos);
llama_eval(ctx, tokens, n_tokens, n_past, n_threads);
llama_print_timings(ctx);
llama_free(ctx);

Code and documentation to train Stanford's Alpaca models, and generate the data.

Pros of Stanford Alpaca

  • Open-source and freely available for research and non-commercial use
  • Provides instruction-tuning dataset and methodology for fine-tuning LLaMA
  • Includes scripts for data generation and model training

Cons of Stanford Alpaca

  • Limited to non-commercial use due to LLaMA's license restrictions
  • Smaller model size and potentially lower performance compared to full LLaMA
  • Less extensive documentation and community support

Code Comparison

Stanford Alpaca:

def generate_prompt(instruction, input=None):
    if input:
        return f"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    else:
        return f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"

LLaMA:

struct LLaMATokenizer {
    std::unordered_map<std::string, int32_t> token_to_id;
    std::vector<std::string> id_to_token;
};

Note: The C++ struct above is illustrative only. The meta-llama/llama repository itself is written in Python on top of PyTorch and covers inference, while Stanford Alpaca provides Python scripts for fine-tuning and data generation.

Instruct-tune LLaMA on consumer hardware

Pros of Alpaca-LoRA

  • Easier to fine-tune and adapt for specific tasks
  • Requires less computational resources
  • More accessible for researchers and developers with limited hardware

Cons of Alpaca-LoRA

  • Potentially lower performance compared to full LLaMA model
  • Limited to the capabilities of the base LLaMA model
  • May require additional training data for optimal results

Code Comparison

LLaMA via Hugging Face Transformers:

from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("meta-llama/llama-7b")
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/llama-7b")

Alpaca-LoRA:

from peft import PeftModel, PeftConfig
from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

The main difference in the code is that Alpaca-LoRA requires additional steps to load the LoRA weights and apply them to the base LLaMA model. This allows for more efficient fine-tuning and adaptation of the model for specific tasks.
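
The efficiency claim can be illustrated with a parameter count. LoRA leaves a weight matrix of shape d_out x d_in frozen and trains two low-rank factors of shapes d_out x r and r x d_in instead. The sketch below compares the two counts; the 4096 x 4096 shape and rank 8 are illustrative assumptions, not values read from the Alpaca-LoRA configuration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Assumed, illustrative layer shape and rank.
d_out, d_in, rank = 4096, 4096, 8
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x fewer")
```

Because only the small factors receive gradients and optimizer state, the adapter can be trained and shipped separately from the base weights.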

17,318

Inference Llama 2 in one file of pure C

Pros of llama2.c

  • Lightweight and portable implementation in C
  • Designed for educational purposes and easy understanding
  • Can run on various platforms, including microcontrollers

Cons of llama2.c

  • Limited functionality compared to the original implementation
  • May not support all features and optimizations of the full LLaMA model
  • Potentially lower performance for large-scale applications

Code Comparison

llama2.c:

float* llama_token_to_embedding(struct llama_model* model, int token) {
    return model->token_embedding_table + token * model->dim;
}

llama:

def forward(self, tokens: torch.Tensor, start_pos: int):
    _bsz, seqlen = tokens.shape
    h = self.tok_embeddings(tokens)
    self.cache_k = self.cache_k.to(h.device)
    self.cache_v = self.cache_v.to(h.device)

The llama2.c implementation focuses on simplicity and readability, while the original llama repository uses PyTorch and is intended as a minimal example for loading the models and running inference with a key/value cache.

A Gradio web UI for Large Language Models.

Pros of text-generation-webui

  • User-friendly web interface for easy interaction with language models
  • Supports multiple models and architectures, offering flexibility
  • Includes features like chat, instruct mode, and notebook interface

Cons of text-generation-webui

  • May have higher resource requirements due to the web interface
  • Potentially slower inference compared to direct model usage
  • Limited fine-tuning capabilities compared to the original model repository

Code Comparison

text-generation-webui:

def generate_reply(
    question, settings, stopping_strings=None, is_chat=False
):
    # Generation logic here
    return response

llama:

def generate(
    prompt, max_gen_len, temperature=0.8, top_p=0.95
):
    # Generation logic here
    return response

The code snippets show that text-generation-webui focuses on a more user-friendly interface with additional parameters like stopping_strings and is_chat, while llama provides a more straightforward generation function with core parameters like temperature and top_p.

text-generation-webui is designed for ease of use and experimentation, offering a range of features through its web interface. llama, on the other hand, provides direct access to the model, which may be more suitable for advanced users or integration into other applications. The choice between the two depends on the user's needs, technical expertise, and desired level of control over the language model.


README

Note of deprecation

Thank you for developing with Llama models. As part of the Llama 3.1 release, we’ve consolidated GitHub repos and added some additional repos as we’ve expanded Llama’s functionality into being an e2e Llama Stack. Please use the following repos going forward:

  • llama-models - Central repo for the foundation models including basic utilities, model cards, license and use policies
  • PurpleLlama - Key component of Llama Stack focusing on safety risks and inference time mitigations
  • llama-toolchain - Model development (inference/fine-tuning/safety shields/synthetic data generation) interfaces and canonical implementations
  • llama-agentic-system - E2E standalone Llama Stack system, along with opinionated underlying interface, that enables creation of agentic applications
  • llama-recipes - Community driven scripts and integrations

If you have any questions, please feel free to file an issue on any of the above repos and we will do our best to respond in a timely manner.

Thank you!

(Deprecated) Llama 2

We are unlocking the power of large language models. Llama 2 is now accessible to individuals, creators, researchers, and businesses of all sizes so that they can experiment, innovate, and scale their ideas responsibly.

This release includes model weights and starting code for pre-trained and fine-tuned Llama language models — ranging from 7B to 70B parameters.

This repository is intended as a minimal example to load Llama 2 models and run inference. For more detailed examples leveraging Hugging Face, see llama-recipes.

Updates post-launch

See UPDATES.md. Also for a running list of frequently asked questions, see here.

Download

In order to download the model weights and tokenizer, please visit the Meta website and accept our License.

Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.

Pre-requisites: Make sure you have wget and md5sum installed. Then run the script: ./download.sh.

Keep in mind that the links expire after 24 hours and a certain amount of downloads. If you start seeing errors such as 403: Forbidden, you can always re-request a link.
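
If a download is interrupted by an expired link, it helps to know which files are intact before re-requesting. The following is a minimal sketch of the kind of check md5sum performs, assuming a checklist file in md5sum format (one `<hex digest>  <filename>` pair per line) sitting next to the downloaded files. The checklist name used in the usage line is an assumption; use whatever file your download actually contains.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weights do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checklist(checklist: Path) -> dict[str, bool]:
    """Return {filename: is_intact} for every entry in an md5sum-style checklist."""
    results = {}
    for line in checklist.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum prefixes binary-mode names with '*'
        target = checklist.parent / name
        results[name] = target.is_file() and md5_of(target) == expected.lower()
    return results
```

Usage, with an assumed path: `verify_checklist(Path("llama-2-7b/checklist.chk"))` returns a dictionary whose `False` entries are the files worth downloading again.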

Access to Hugging Face

We are also providing downloads on Hugging Face. You can request access to the models by acknowledging the license and filling the form in the model card of a repo. After doing so, you should get access to all the Llama models of a version (Code Llama, Llama 2, or Llama Guard) within 1 hour.

Quick Start

You can follow the steps below to quickly get up and running with Llama 2 models. These steps will let you run quick inference locally. For more examples, see the Llama 2 recipes repository.

  1. In a conda env with PyTorch / CUDA available, clone and download this repository.

  2. In the top-level directory run:

    pip install -e .
    
  3. Visit the Meta website and register to download the model/s.

  4. Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.

  5. Once you get the email, navigate to your downloaded llama repository and run the download.sh script.

    • Make sure to grant execution permissions to the download.sh script
    • During this process, you will be prompted to enter the URL from the email.
    • Do not use the “Copy Link” option but rather make sure to manually copy the link from the email.
  6. Once the model/s you want have been downloaded, you can run the model locally using the command below:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

Note

  • Replace llama-2-7b-chat/ with the path to your checkpoint directory and tokenizer.model with the path to your tokenizer model.
  • The --nproc_per_node value should be set to the MP value for the model you are using.
  • Adjust the max_seq_len and max_batch_size parameters as needed.
  • This example runs the example_chat_completion.py found in this repository but you can change that to a different .py file.

Inference

Different models require different model-parallel (MP) values:

Model  MP
7B     1
13B    2
70B    8
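
Since --nproc_per_node must match the MP value, the lookup can be scripted. This is a small sketch that assembles the torchrun command from the table above; the helper name and the 13B checkpoint directory in the usage line are hypothetical, while the flags are the ones used in this README's commands.

```python
import shlex

# Model-parallel values from the table above.
MP_BY_MODEL = {"7B": 1, "13B": 2, "70B": 8}


def torchrun_command(
    model_size: str,
    script: str,
    ckpt_dir: str,
    tokenizer_path: str = "tokenizer.model",
    max_seq_len: int = 512,
    max_batch_size: int = 6,
) -> list[str]:
    """Build the argument list, setting --nproc_per_node to the model's MP value."""
    nproc = MP_BY_MODEL[model_size]  # raises KeyError for unknown sizes
    return [
        "torchrun", "--nproc_per_node", str(nproc), script,
        "--ckpt_dir", ckpt_dir,
        "--tokenizer_path", tokenizer_path,
        "--max_seq_len", str(max_seq_len),
        "--max_batch_size", str(max_batch_size),
    ]


# Hypothetical checkpoint directory name for a 13B chat model.
cmd = torchrun_command("13B", "example_chat_completion.py", "llama-2-13b-chat/")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` without shell quoting concerns.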

All models support sequence length up to 4096 tokens, but we pre-allocate the cache according to max_seq_len and max_batch_size values. So set those according to your hardware.
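
To see why these two values matter, here is a back-of-the-envelope estimate of the pre-allocated key/value cache. It assumes one key tensor and one value tensor per layer, each of shape (max_batch_size, max_seq_len, n_kv_heads, head_dim), stored in 16-bit floats. The 7B hyperparameters used below (32 layers, 32 key/value heads, head dimension 128) are assumptions for illustration; check the params.json shipped with your checkpoint.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    max_batch_size: int,
    bytes_per_value: int = 2,  # 16-bit floats
) -> int:
    """Total bytes for the key and value caches across all layers."""
    per_layer_values = 2 * max_batch_size * max_seq_len * n_kv_heads * head_dim
    return n_layers * per_layer_values * bytes_per_value


# Assumed 7B-style hyperparameters; see the lead-in.
for seq_len in (512, 4096):
    size = kv_cache_bytes(32, 32, 128, max_seq_len=seq_len, max_batch_size=6)
    print(f"max_seq_len={seq_len}: {size / 2**30:.1f} GiB")
```

The cache grows linearly in both max_seq_len and max_batch_size, so raising the sequence length from 512 to 4096 at the same batch size multiplies the allocation by eight. With model parallelism, each process holds only its share of the heads.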

Pretrained Models

These models are not finetuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt.

See example_text_completion.py for some examples. To illustrate, see the command below to run it with the llama-2-7b model (nproc_per_node needs to be set to the MP value):

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
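
Because the pretrained models continue text rather than follow instructions, a common approach is a few-shot prompt that ends exactly where the answer should begin. The helper below is a hypothetical sketch of that pattern; its name and the "=>" separator are illustrative choices, not part of the repository's API.

```python
def few_shot_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Build a prompt whose natural continuation is the answer to `query`."""
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    # Leave the final line open so the model completes it.
    lines.append(f"{query} =>")
    return "\n".join(lines)


prompt = few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(prompt)
```

The resulting string can be passed as one element of the prompt list given to a text-completion call, with a small generation length so the model stops after the answer.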

Fine-tuned Chat Models

The fine-tuned models were trained for dialogue applications. To get the expected features and performance from them, you must follow the specific formatting defined in chat_completion, including the INST and <<SYS>> tags, the BOS and EOS tokens, and the whitespace and line breaks in between (we recommend calling strip() on inputs to avoid double spaces).
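
As an illustration of that format, the sketch below renders a dialog as a single string. It is a simplified stand-in, not the repository's implementation: chat_completion works on token IDs and adds BOS and EOS through the tokenizer, whereas here the strings "<s>" and "</s>" are textual placeholders for those tokens. The function name is hypothetical.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
BOS, EOS = "<s>", "</s>"  # textual placeholders for the special tokens


def format_dialog(messages: list[dict[str, str]]) -> str:
    """Render system/user/assistant messages in the Llama 2 chat layout."""
    msgs = [{"role": m["role"], "content": m["content"].strip()} for m in messages]
    # An optional system message is folded into the first user message.
    if msgs and msgs[0]["role"] == "system":
        if len(msgs) < 2:
            raise ValueError("a system message must be followed by a user message")
        merged = B_SYS + msgs[0]["content"] + E_SYS + msgs[1]["content"]
        msgs = [{"role": msgs[1]["role"], "content": merged}] + msgs[2:]
    roles_ok = all(
        m["role"] == ("user" if i % 2 == 0 else "assistant") for i, m in enumerate(msgs)
    )
    if not roles_ok or not msgs or msgs[-1]["role"] != "user":
        raise ValueError("roles must alternate user/assistant and end with user")
    parts = []
    # Completed turns are wrapped in BOS ... EOS; the last user turn is left open.
    for user, assistant in zip(msgs[0::2], msgs[1::2]):
        parts.append(f"{BOS}{B_INST} {user['content']} {E_INST} {assistant['content']} {EOS}")
    parts.append(f"{BOS}{B_INST} {msgs[-1]['content']} {E_INST}")
    return "".join(parts)


print(format_dialog([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": " Hi "},
]))
```

Stripping each message before inserting it is what prevents the double spaces mentioned above.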

You can also deploy additional classifiers for filtering out inputs and outputs that are deemed unsafe. See the llama-recipes repo for an example of how to add a safety checker to the inputs and outputs of your inference code.
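
One way to picture such a deployment is a wrapper that screens both the prompt and the generated text. Everything in this sketch is hypothetical: `generate` stands for any function that maps a prompt to text, and `is_safe` stands for whatever classifier you deploy. The keyword check at the bottom is a toy placeholder, not a real safety model.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
    refusal: str = REFUSAL,
) -> str:
    """Run generation only for safe prompts and release only safe outputs."""
    if not is_safe(prompt):
        return refusal
    output = generate(prompt)
    return output if is_safe(output) else refusal


# Toy stand-ins so the sketch runs without a model or a classifier.
def toy_is_safe(text: str) -> bool:
    return "forbidden" not in text.lower()


def toy_generate(prompt: str) -> str:
    return f"Echo: {prompt}"


print(guarded_generate("Tell me a joke", toy_generate, toy_is_safe))
```

Checking the output as well as the input matters because a harmless prompt can still produce text the classifier rejects.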

Examples using llama-2-7b-chat:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

Llama 2 is a new technology that carries potential risks with use. Testing conducted to date has not — and could not — cover all scenarios. In order to help developers address these risks, we have created the Responsible Use Guide. More details can be found in our research paper as well.

Issues

Please report any software “bug”, or other problems with the models through one of the following means:

Model Card

See MODEL_CARD.md.

License

Our model and weights are licensed for both researchers and commercial entities, upholding the principles of openness. Our mission is to empower individuals and industry through this opportunity, while fostering an environment of discovery and ethical AI advancements.

See the LICENSE file, as well as our accompanying Acceptable Use Policy.

References

  1. Research Paper
  2. Llama 2 technical overview
  3. Open Innovation AI Research Community

For common questions, the FAQ can be found here which will be kept up to date over time as new questions arise.

Original Llama

The repo for the original llama release is in the llama_v1 branch.