
hiyouga/LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)


Top Related Projects

  • DeepSpeed: DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
  • transformers: 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
  • gpt-neox: An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
  • llama: Inference code for Llama models
  • Stanford Alpaca: Code and documentation to train Stanford's Alpaca models, and generate the data.

Quick Overview

LLaMA-Factory is an open-source project designed for fine-tuning large language models (LLMs) like LLaMA, BLOOM, Falcon, and others. It provides a unified interface for various training paradigms, including full-parameter, LoRA, and QLoRA fine-tuning, making it easier for researchers and developers to experiment with and customize LLMs.

Pros

  • Supports a wide range of LLMs and training paradigms
  • Offers a user-friendly web UI for easy model fine-tuning
  • Includes comprehensive documentation and examples
  • Regularly updated with new features and model support

Cons

  • Requires significant computational resources for large models
  • May have a steep learning curve for beginners in LLM fine-tuning
  • Limited to supported models and architectures
  • Potential for overfitting if not used carefully

Code Examples

  1. Loading a pre-trained model (AutoModelForCausalLM comes from the transformers library that LLaMA-Factory builds on):

from transformers import AutoModelForCausalLM

model_path = "path/to/your/model"
model = AutoModelForCausalLM.from_pretrained(model_path)

  2. Preparing a dataset for fine-tuning (using the Hugging Face datasets library):

from datasets import load_dataset

dataset = load_dataset("json", data_files="path/to/your/data.json")
  3. Fine-tuning a model with LoRA through the llmtuner Python API (run_exp accepts a dict of training arguments; the option names below follow LLaMA-Factory's training arguments and the values are illustrative):

from llmtuner import run_exp

run_exp(dict(
    stage="sft",
    do_train=True,
    model_name_or_path="path/to/your/model",
    dataset="your_dataset",  # a dataset name registered in data/dataset_info.json
    template="default",
    finetuning_type="lora",
    lora_rank=8,
    output_dir="path/to/save/results",
    num_train_epochs=3.0,
    learning_rate=2e-5,
))

Getting Started

To get started with LLaMA-Factory:

  1. Clone the repository:

    git clone https://github.com/hiyouga/LLaMA-Factory.git
    cd LLaMA-Factory
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Run the web UI:

    python src/train_web.py
    
  4. Access the web interface at http://localhost:7860 to start fine-tuning your models.

Competitor Comparisons


DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • Highly optimized for distributed training and large-scale models
  • Supports a wide range of AI frameworks (PyTorch, TensorFlow, etc.)
  • Offers advanced features like ZeRO optimizer and pipeline parallelism

Cons of DeepSpeed

  • Steeper learning curve due to its complexity and advanced features
  • Less focused on specific LLM fine-tuning tasks compared to LLaMA-Factory
  • May require more setup and configuration for simpler use cases

Code Comparison

LLaMA-Factory example:

from llmtuner import ChatModel

model = ChatModel(model_name_or_path="facebook/opt-350m")
response = model.chat("Hello, how are you?")
print(response)

DeepSpeed example:

import torch
import deepspeed

# minimal config; real setups add ZeRO stages, mixed precision, etc.
ds_config = {"train_batch_size": 8}

model = torch.nn.Linear(10, 10)
# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler)
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
output = engine(torch.randn(10, 10).to(engine.device))

LLaMA-Factory is more focused on easy LLM fine-tuning and inference, while DeepSpeed provides a broader set of optimization tools for large-scale deep learning training across various AI frameworks.

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Broader scope, supporting a wide range of transformer-based models beyond LLaMA
  • Extensive documentation and community support
  • Regular updates and maintenance from a large team at Hugging Face

Cons of transformers

  • More complex to use for specific LLaMA-related tasks
  • Larger codebase, which may be overwhelming for beginners
  • Less focused on fine-tuning and optimization specifically for LLaMA models

Code Comparison

transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

LLaMA-Factory:

from llmtuner import ChatModel

model = ChatModel("meta-llama/Llama-2-7b-hf")
response, history = model.chat("Hello", history=[])

The transformers code demonstrates the general approach for loading models, while LLaMA-Factory provides a more streamlined interface specifically for LLaMA models, with built-in chat functionality.

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries

Pros of gpt-neox

  • Designed for training large language models from scratch
  • Supports distributed training across multiple GPUs and nodes
  • Includes tools for dataset preparation and tokenization

Cons of gpt-neox

  • More complex setup and configuration required
  • Less focused on fine-tuning existing models
  • Steeper learning curve for beginners

Code Comparison

LLaMA-Factory example:

from llmtuner import ChatModel

model = ChatModel(model_name_or_path="facebook/opt-350m")
response = model.chat("Hello, how are you?")
print(response)

gpt-neox example:

from megatron.neox_arguments import NeoXArgs
from megatron.global_vars import set_global_variables
from megatron.neox_model import GPTNeoX

args = NeoXArgs.from_pretrained("EleutherAI/gpt-neox-20b")
model = GPTNeoX(args)

LLaMA-Factory is more user-friendly for fine-tuning and using pre-trained models, while gpt-neox offers more control and flexibility for training large language models from scratch. LLaMA-Factory provides a simpler API for quick integration, whereas gpt-neox requires more setup but allows for greater customization of the training process.


Inference code for Llama models

Pros of llama

  • Official implementation from Meta, ensuring authenticity and alignment with original research
  • Comprehensive documentation and detailed model architecture explanations
  • Extensive community support and regular updates from Meta's research team

Cons of llama

  • Limited fine-tuning and customization options out-of-the-box
  • Requires significant computational resources to run and train
  • Less user-friendly for beginners compared to LLaMA-Factory

Code Comparison

LLaMA-Factory:

from llmtuner import run_exp

# LoRA fine-tuning through LLaMA-Factory's training entry point (illustrative arguments;
# dataset names must be registered in data/dataset_info.json)
run_exp(dict(stage="sft", do_train=True, model_name_or_path="meta-llama/Llama-2-7b-hf",
             dataset="alpaca_en", template="llama2", finetuning_type="lora",
             output_dir="saves/llama2-7b-lora", num_train_epochs=3.0))

llama:

from llama import Llama

# requires the Meta checkpoint and tokenizer; typically launched with torchrun
generator = Llama.build(
    ckpt_dir="llama-2-7b/",
    tokenizer_path="tokenizer.model",
    max_seq_len=512,
    max_batch_size=8,
)
# text_completion takes a list of prompts and returns a list of completions
results = generator.text_completion(["Hello, how are you?"], max_gen_len=64)

LLaMA-Factory provides a more streamlined API for fine-tuning and training, while llama offers lower-level access to the model architecture and generation process. LLaMA-Factory is designed for easier customization and experimentation, whereas llama provides a more direct implementation of the original LLaMA model.

Code and documentation to train Stanford's Alpaca models, and generate the data.

Pros of Stanford Alpaca

  • Focuses specifically on instruction-following capabilities
  • Provides a curated dataset of 52K instruction-following examples
  • Simpler and more straightforward implementation

Cons of Stanford Alpaca

  • Limited to fine-tuning LLaMA models only
  • Less flexible in terms of training options and customization
  • Fewer features for data processing and model evaluation

Code Comparison

Stanford Alpaca:

def train(
    model,
    tokenizer,
    train_dataset,
    val_dataset,
    output_dir,
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=0,
):
    # Training code here

LLaMA-Factory:

def train(
    model,
    tokenizer,
    dataset,
    args,
    callbacks=None,
    compute_metrics=None,
    preprocess_logits_for_metrics=None,
    use_deepspeed=False
):
    # More comprehensive training code with additional options

LLaMA-Factory offers a more versatile training function with additional parameters and options, reflecting its broader scope and flexibility compared to Stanford Alpaca's more focused approach.


README

# LLaMA Factory


Easily fine-tune 100+ large language models with zero-code CLI and Web UI


👋 Join our WeChat or NPU user group.

[ English | 中文 ]

Fine-tuning a large language model can be as easy as...

https://github.com/user-attachments/assets/3991a3a8-4276-4d30-9cab-4cb0c4b9b99e

Choose your path:

[!NOTE] Except for the above links, all other websites are unauthorized third-party websites. Please use them with caution.

Table of Contents

Features

  • Various models: LLaMA, LLaVA, Mistral, Mixtral-MoE, Qwen, Qwen2-VL, DeepSeek, Yi, Gemma, ChatGLM, Phi, etc.
  • Integrated methods: (Continuous) pre-training, (multimodal) supervised fine-tuning, reward modeling, PPO, DPO, KTO, ORPO, etc.
  • Scalable resources: 16-bit full-tuning, freeze-tuning, LoRA and 2/3/4/5/6/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8/HQQ/EETQ.
  • Advanced algorithms: GaLore, BAdam, APOLLO, Adam-mini, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ and PiSSA.
  • Practical tricks: FlashAttention-2, Unsloth, Liger Kernel, RoPE scaling, NEFTune and rsLoRA (see the configuration sketch after this list).
  • Wide tasks: Multi-turn dialogue, tool using, image understanding, visual grounding, video recognition, audio understanding, etc.
  • Experiment monitors: LlamaBoard, TensorBoard, Wandb, MLflow, SwanLab, etc.
  • Faster inference: OpenAI-style API, Gradio UI and CLI with vLLM worker or SGLang worker.
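
Most of the methods and tricks above are switched on through plain configuration options rather than code changes. As a rough illustration, the flags below (all mentioned in the changelog later in this README) can be added to an ordinary training config; treat the combination as a sketch, not a recommended recipe:

flash_attn: fa2              # FlashAttention-2
enable_liger_kernel: true    # Liger Kernel
use_dora: true               # weight-decomposed LoRA (DoRA); requires finetuning_type: lora
neftune_noise_alpha: 5       # NEFTune
rope_scaling: linear         # RoPE scaling for longer contexts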

Day-N Support for Fine-Tuning Cutting-Edge Models

| Support Date | Model Name                                                  |
| ------------ | ----------------------------------------------------------- |
| Day 0        | Qwen2.5 / Qwen2.5-VL / Gemma 3 / InternLM 3 / MiniCPM-o-2.6 |
| Day 1        | Llama 3 / GLM-4 / Mistral Small / PaliGemma2                |

Benchmark

Compared to ChatGLM's P-Tuning, LLaMA Factory's LoRA tuning offers up to 3.7 times faster training speed and a better Rouge score on the advertising text generation task. By leveraging the 4-bit quantization technique, LLaMA Factory's QLoRA further improves efficiency in terms of GPU memory.

[Benchmark figure: training speed, Rouge score and GPU memory comparison]

Definitions
  • Training Speed: the number of training samples processed per second during the training. (bs=4, cutoff_len=1024)
  • Rouge Score: Rouge-2 score on the development set of the advertising text generation task. (bs=4, cutoff_len=1024)
  • GPU Memory: Peak GPU memory usage in 4-bit quantized training. (bs=1, cutoff_len=1024)
  • We adopt pre_seq_len=128 for ChatGLM's P-Tuning and lora_rank=32 for LLaMA Factory's LoRA tuning.

Changelog

[25/03/15] We supported SGLang as inference backend. Try infer_backend: sglang to accelerate inference.

[25/03/12] We supported fine-tuning the Gemma-3 model.

[25/02/24] Announcing EasyR1, an efficient, scalable and multi-modality RL training framework for efficient GRPO training.

[25/02/11] We supported saving the Ollama modelfile when exporting the model checkpoints. See examples for usage.

[25/02/05] We supported fine-tuning the Qwen2-Audio and MiniCPM-o-2.6 on audio understanding tasks.

[25/01/31] We supported fine-tuning the DeepSeek-R1 and Qwen2.5-VL models.

Full Changelog

[25/01/15] We supported APOLLO optimizer. See examples for usage.

[25/01/14] We supported fine-tuning the MiniCPM-o-2.6 and MiniCPM-V-2.6 models. Thank @BUAADreamer's PR.

[25/01/14] We supported fine-tuning the InternLM 3 models. Thank @hhaAndroid's PR.

[25/01/10] We supported fine-tuning the Phi-4 model.

[24/12/21] We supported using SwanLab for experiment tracking and visualization. See this section for details.

[24/11/27] We supported fine-tuning the Skywork-o1 model and the OpenO1 dataset.

[24/10/09] We supported downloading pre-trained models and datasets from the Modelers Hub. See this tutorial for usage.

[24/09/19] We supported fine-tuning the Qwen2.5 models.

[24/08/30] We supported fine-tuning the Qwen2-VL models. Thank @simonJJJ's PR.

[24/08/27] We supported Liger Kernel. Try enable_liger_kernel: true for efficient training.

[24/08/09] We supported Adam-mini optimizer. See examples for usage. Thank @relic-yuexi's PR.

[24/07/04] We supported contamination-free packed training. Use neat_packing: true to activate it. Thank @chuan298's PR.

[24/06/16] We supported PiSSA algorithm. See examples for usage.

[24/06/07] We supported fine-tuning the Qwen2 and GLM-4 models.

[24/05/26] We supported SimPO algorithm for preference learning. See examples for usage.

[24/05/20] We supported fine-tuning the PaliGemma series models. Note that the PaliGemma models are pre-trained models; you need to fine-tune them with the paligemma template for chat completion.

[24/05/18] We supported KTO algorithm for preference learning. See examples for usage.

[24/05/14] We supported training and inference on the Ascend NPU devices. Check installation section for details.

[24/04/26] We supported fine-tuning the LLaVA-1.5 multimodal LLMs. See examples for usage.

[24/04/22] We provided a Colab notebook for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available at Hugging Face, check Llama3-8B-Chinese-Chat and Llama3-Chinese for details.

[24/04/21] We supported Mixture-of-Depths according to AstraMindAI's implementation. See examples for usage.

[24/04/16] We supported BAdam optimizer. See examples for usage.

[24/04/16] We supported unsloth's long-sequence training (Llama-2-7B-56k within 24GB). It achieves 117% speed and 50% memory compared with FlashAttention-2, more benchmarks can be found in this page.

[24/03/31] We supported ORPO. See examples for usage.

[24/03/21] Our paper "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models" is available at arXiv!

[24/03/20] We supported FSDP+QLoRA that fine-tunes a 70B model on 2x24GB GPUs. See examples for usage.

[24/03/13] We supported LoRA+. See examples for usage.

[24/03/07] We supported GaLore optimizer. See examples for usage.

[24/03/07] We integrated vLLM for faster and concurrent inference. Try infer_backend: vllm to enjoy 270% inference speed.

[24/02/28] We supported weight-decomposed LoRA (DoRA). Try use_dora: true to activate DoRA training.

[24/02/15] We supported block expansion proposed by LLaMA Pro. See examples for usage.

[24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this blog post for details.

[24/01/18] We supported agent tuning for most models, equipping the model with tool-using abilities by fine-tuning with dataset: glaive_toolcall_en.

[23/12/23] We supported unsloth's implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try use_unsloth: true argument to activate unsloth patch. It achieves 170% speed in our benchmark, check this page for details.

[23/12/12] We supported fine-tuning the latest MoE model Mixtral 8x7B in our framework. See hardware requirement here.

[23/12/01] We supported downloading pre-trained models and datasets from the ModelScope Hub. See this tutorial for usage.

[23/10/21] We supported NEFTune trick for fine-tuning. Try neftune_noise_alpha: 5 argument to activate NEFTune.

[23/09/27] We supported $S^2$-Attn proposed by LongLoRA for the LLaMA models. Try shift_attn: true argument to enable shift short attention.

[23/09/23] We integrated MMLU, C-Eval and CMMLU benchmarks in this repo. See examples for usage.

[23/09/10] We supported FlashAttention-2. Try flash_attn: fa2 argument to enable FlashAttention-2 if you are using RTX4090, A100 or H100 GPUs.

[23/08/12] We supported RoPE scaling to extend the context length of the LLaMA models. Try rope_scaling: linear argument in training and rope_scaling: dynamic argument at inference to extrapolate the position embeddings.

[23/08/11] We supported DPO training for instruction-tuned models. See examples for usage.

[23/07/31] We supported dataset streaming. Try streaming: true and max_steps: 10000 arguments to load your dataset in streaming mode.

[23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos (LLaMA-2 / Baichuan) for details.

[23/07/18] We developed an all-in-one Web UI for training, evaluation and inference. Try train_web.py to fine-tune models in your Web browser. Thank @KanadeSiina and @codemayq for their efforts in the development.

[23/07/09] We released FastEdit ⚡🩹, an easy-to-use package for editing the factual knowledge of large language models efficiently. Please follow FastEdit if you are interested.

[23/06/29] We provided a reproducible example of training a chat model using instruction-following datasets, see Baichuan-7B-sft for details.

[23/06/22] We aligned the demo API with the OpenAI's format where you can insert the fine-tuned model in arbitrary ChatGPT-based applications.

[23/06/03] We supported quantized training and inference (aka QLoRA). See examples for usage.

Supported Models

| Model                            | Model size                       | Template            |
| -------------------------------- | -------------------------------- | ------------------- |
| Baichuan 2                       | 7B/13B                           | baichuan2           |
| BLOOM/BLOOMZ                     | 560M/1.1B/1.7B/3B/7.1B/176B      | -                   |
| ChatGLM3                         | 6B                               | chatglm3            |
| Command R                        | 35B/104B                         | cohere              |
| DeepSeek (Code/MoE)              | 7B/16B/67B/236B                  | deepseek            |
| DeepSeek 2.5/3                   | 236B/671B                        | deepseek3           |
| DeepSeek R1 (Distill)            | 1.5B/7B/8B/14B/32B/70B/671B      | deepseek3           |
| Falcon                           | 7B/11B/40B/180B                  | falcon              |
| Gemma/Gemma 2/CodeGemma          | 2B/7B/9B/27B                     | gemma               |
| Gemma 3                          | 1B/4B/12B/27B                    | gemma3/gemma (1B)   |
| GLM-4                            | 9B                               | glm4                |
| GPT-2                            | 0.1B/0.4B/0.8B/1.5B              | -                   |
| Granite 3.0-3.1                  | 1B/2B/3B/8B                      | granite3            |
| Index                            | 1.9B                             | index               |
| Hunyuan                          | 7B                               | hunyuan             |
| InternLM 2-3                     | 7B/8B/20B                        | intern2             |
| Llama                            | 7B/13B/33B/65B                   | -                   |
| Llama 2                          | 7B/13B/70B                       | llama2              |
| Llama 3-3.3                      | 1B/3B/8B/70B                     | llama3              |
| Llama 3.2 Vision                 | 11B/90B                          | mllama              |
| LLaVA-1.5                        | 7B/13B                           | llava               |
| LLaVA-NeXT                       | 7B/8B/13B/34B/72B/110B           | llava_next          |
| LLaVA-NeXT-Video                 | 7B/34B                           | llava_next_video    |
| MiniCPM                          | 1B/2B/4B                         | cpm/cpm3            |
| MiniCPM-o-2.6/MiniCPM-V-2.6      | 8B                               | minicpm_o/minicpm_v |
| Ministral/Mistral-Nemo           | 8B/12B                           | ministral           |
| Mistral/Mixtral                  | 7B/8x7B/8x22B                    | mistral             |
| Mistral Small                    | 24B                              | mistral_small       |
| OLMo                             | 1B/7B                            | -                   |
| PaliGemma/PaliGemma2             | 3B/10B/28B                       | paligemma           |
| Phi-1.5/Phi-2                    | 1.3B/2.7B                        | -                   |
| Phi-3/Phi-3.5                    | 4B/14B                           | phi                 |
| Phi-3-small                      | 7B                               | phi_small           |
| Phi-4                            | 14B                              | phi4                |
| Pixtral                          | 12B                              | pixtral             |
| Qwen/QwQ (1-2.5) (Code/Math/MoE) | 0.5B/1.5B/3B/7B/14B/32B/72B/110B | qwen                |
| Qwen2-Audio                      | 7B                               | qwen2_audio         |
| Qwen2-VL/Qwen2.5-VL/QVQ          | 2B/3B/7B/32B/72B                 | qwen2_vl            |
| Skywork o1                       | 8B                               | skywork_o1          |
| StarCoder 2                      | 3B/7B/15B                        | -                   |
| TeleChat2                        | 3B/7B/35B/115B                   | telechat2           |
| XVERSE                           | 7B/13B/65B                       | xverse              |
| Yi/Yi-1.5 (Code)                 | 1.5B/6B/9B/34B                   | yi                  |
| Yi-VL                            | 6B/34B                           | yi_vl               |
| Yuan 2                           | 2B/51B/102B                      | yuan                |

[!NOTE] For the "base" models, the template argument can be chosen from default, alpaca, vicuna etc. But make sure to use the corresponding template for the "instruct/chat" models.

Remember to use the SAME template in training and inference.

Please refer to constants.py for a full list of models we support.

You can also add a custom chat template to template.py.

Supported Training Approaches

| Approach               | Full-tuning        | Freeze-tuning      | LoRA               | QLoRA              |
| ---------------------- | ------------------ | ------------------ | ------------------ | ------------------ |
| Pre-Training           | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Supervised Fine-Tuning | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Reward Modeling        | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| PPO Training           | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| DPO Training           | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| KTO Training           | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| ORPO Training          | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| SimPO Training         | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |

[!TIP] The implementation details of PPO can be found in this blog.

Provided Datasets

Pre-training datasets
Supervised fine-tuning datasets
Preference datasets

Some datasets require confirmation before using them, so we recommend logging in with your Hugging Face account using these commands.

pip install --upgrade huggingface_hub
huggingface-cli login

Requirement

| Mandatory    | Minimum | Recommend |
| ------------ | ------- | --------- |
| python       | 3.9     | 3.10      |
| torch        | 1.13.1  | 2.6.0     |
| transformers | 4.41.2  | 4.50.0    |
| datasets     | 2.16.0  | 3.2.0     |
| accelerate   | 0.34.0  | 1.2.1     |
| peft         | 0.14.0  | 0.15.0    |
| trl          | 0.8.6   | 0.9.6     |

| Optional     | Minimum | Recommend |
| ------------ | ------- | --------- |
| CUDA         | 11.6    | 12.2      |
| deepspeed    | 0.10.0  | 0.16.4    |
| bitsandbytes | 0.39.0  | 0.43.1    |
| vllm         | 0.4.3   | 0.7.3     |
| flash-attn   | 2.3.0   | 2.7.2     |

Hardware Requirement

* estimated

| Method                          | Bits |  7B   |  14B  |  30B  |  70B   |  xB   |
| ------------------------------- | ---- | ----- | ----- | ----- | ------ | ----- |
| Full (bf16 or fp16)             |  32  | 120GB | 240GB | 600GB | 1200GB | 18xGB |
| Full (pure_bf16)                |  16  |  60GB | 120GB | 300GB |  600GB |  8xGB |
| Freeze/LoRA/GaLore/APOLLO/BAdam |  16  |  16GB |  32GB |  64GB |  160GB |  2xGB |
| QLoRA                           |   8  |  10GB |  20GB |  40GB |   80GB |   xGB |
| QLoRA                           |   4  |   6GB |  12GB |  24GB |   48GB | x/2GB |
| QLoRA                           |   2  |   4GB |   8GB |  16GB |   24GB | x/4GB |

Getting Started

Installation

[!IMPORTANT] Installation is mandatory.

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

Extra dependencies available: torch, torch-npu, metrics, deepspeed, liger-kernel, bitsandbytes, hqq, eetq, gptq, awq, aqlm, vllm, sglang, galore, apollo, badam, adam-mini, qwen, minicpm_v, modelscope, openmind, swanlab, quality

[!TIP] Use pip install --no-deps -e . to resolve package conflicts.

Setting up a virtual environment with uv

Create an isolated Python environment with uv:

uv sync --extra torch --extra metrics --prerelease=allow

Run LLaMA-Factory in the isolated environment:

uv run --prerelease=allow llamafactory-cli train examples/train_lora/llama3_lora_pretrain.yaml

For Windows users

Install BitsAndBytes

If you want to enable quantized LoRA (QLoRA) on the Windows platform, you need to install a pre-built version of the bitsandbytes library that supports CUDA 11.1 to 12.2. Please select the appropriate release based on your CUDA version.

pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl

Install Flash Attention-2

To enable FlashAttention-2 on the Windows platform, please use the script from flash-attention-windows-wheel to compile and install it by yourself.

For Ascend NPU users

To install LLaMA Factory on Ascend NPU devices, please upgrade Python to version 3.10 or higher and specify extra dependencies: pip install -e ".[torch-npu,metrics]". Additionally, you need to install the Ascend CANN Toolkit and Kernels. Please follow the installation tutorial or use the following commands:

# replace the url according to your CANN version and devices
# install CANN Toolkit
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C20SPC702/Ascend-cann-toolkit_8.0.0.alpha002_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.0.alpha002_linux-"$(uname -i)".run --install

# install CANN Kernels
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C20SPC702/Ascend-cann-kernels-910b_8.0.0.alpha002_linux-"$(uname -i)".run
bash Ascend-cann-kernels-910b_8.0.0.alpha002_linux-"$(uname -i)".run --install

# set env variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh

| Requirement | Minimum | Recommend      |
| ----------- | ------- | -------------- |
| CANN        | 8.0.RC1 | 8.0.0.alpha002 |
| torch       | 2.1.0   | 2.4.0          |
| torch-npu   | 2.1.0   | 2.4.0.post2    |
| deepspeed   | 0.13.2  | 0.13.2         |

Remember to use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify the device to use.

If you cannot run inference on NPU devices, try setting do_sample: false in the configurations.

Download the pre-built Docker images: 32GB | 64GB

Install BitsAndBytes

To use QLoRA based on bitsandbytes on Ascend NPU, please follow these 3 steps:

  1. Manually compile bitsandbytes: Refer to the installation documentation for the NPU version of bitsandbytes to complete the compilation and installation. The compilation requires a cmake version of at least 3.22.1 and a g++ version of at least 12.x.
# Install bitsandbytes from source
# Clone bitsandbytes repo, Ascend NPU backend is currently enabled on multi-backend-refactor branch
git clone -b multi-backend-refactor https://github.com/bitsandbytes-foundation/bitsandbytes.git
cd bitsandbytes/

# Install dependencies
pip install -r requirements-dev.txt

# Install the dependencies for the compilation tools. Note that the commands for this step may vary depending on the operating system. The following are provided for reference
apt-get install -y build-essential cmake

# Compile & install  
cmake -DCOMPUTE_BACKEND=npu -S .
make
pip install .

  2. Install transformers from the main branch.
git clone -b main https://github.com/huggingface/transformers.git
cd transformers
pip install .

  3. Set double_quantization: false in the configuration, as in the sketch below. You can refer to the example.
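
For reference, a minimal sketch of the quantization-related lines in such a QLoRA config (option names as used by LLaMA-Factory; the surrounding training options are omitted):

finetuning_type: lora
quantization_bit: 4          # 4-bit QLoRA via bitsandbytes
double_quantization: false   # required for the Ascend NPU backend (see step 3 above)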

Data Preparation

Please refer to data/README.md for details about the format of the dataset files. You can either use datasets from the HuggingFace / ModelScope / Modelers hub or load datasets from local disk.

[!NOTE] Please update data/dataset_info.json to use your custom dataset.

Quickstart

Use the following 3 commands to run LoRA fine-tuning, inference and merging of the Llama3-8B-Instruct model, respectively.

llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
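
Each of these YAML files bundles the model, dataset, and training or inference arguments. The sketch below shows the kind of options a LoRA SFT config contains; the option names follow LLaMA-Factory's training arguments, while the values are illustrative and the file shipped as examples/train_lora/llama3_lora_sft.yaml remains the reference:

model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
dataset: identity,alpaca_en_demo   # dataset names registered in data/dataset_info.json
template: llama3
cutoff_len: 1024
output_dir: saves/llama3-8b/lora/sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true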

See examples/README.md for advanced usage (including distributed training).

[!TIP] Use llamafactory-cli help to show help information.

Read FAQs first if you encounter any problems.

Fine-Tuning with LLaMA Board GUI (powered by Gradio)

llamafactory-cli webui

Build Docker

For CUDA users:

cd docker/docker-cuda/
docker compose up -d
docker compose exec llamafactory bash

For Ascend NPU users:

cd docker/docker-npu/
docker compose up -d
docker compose exec llamafactory bash

For AMD ROCm users:

cd docker/docker-rocm/
docker compose up -d
docker compose exec llamafactory bash

Build without Docker Compose

For CUDA users:

docker build -f ./docker/docker-cuda/Dockerfile \
    --build-arg INSTALL_BNB=false \
    --build-arg INSTALL_VLLM=false \
    --build-arg INSTALL_DEEPSPEED=false \
    --build-arg INSTALL_FLASHATTN=false \
    --build-arg PIP_INDEX=https://pypi.org/simple \
    -t llamafactory:latest .

docker run -dit --gpus=all \
    -v ./hf_cache:/root/.cache/huggingface \
    -v ./ms_cache:/root/.cache/modelscope \
    -v ./om_cache:/root/.cache/openmind \
    -v ./data:/app/data \
    -v ./output:/app/output \
    -p 7860:7860 \
    -p 8000:8000 \
    --shm-size 16G \
    --name llamafactory \
    llamafactory:latest

docker exec -it llamafactory bash

For Ascend NPU users:

# Choose the docker image according to your environment
docker build -f ./docker/docker-npu/Dockerfile \
    --build-arg INSTALL_DEEPSPEED=false \
    --build-arg PIP_INDEX=https://pypi.org/simple \
    -t llamafactory:latest .

# Change `device` according to your resources
docker run -dit \
    -v ./hf_cache:/root/.cache/huggingface \
    -v ./ms_cache:/root/.cache/modelscope \
    -v ./om_cache:/root/.cache/openmind \
    -v ./data:/app/data \
    -v ./output:/app/output \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -p 7860:7860 \
    -p 8000:8000 \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    --shm-size 16G \
    --name llamafactory \
    llamafactory:latest

docker exec -it llamafactory bash

For AMD ROCm users:

docker build -f ./docker/docker-rocm/Dockerfile \
    --build-arg INSTALL_BNB=false \
    --build-arg INSTALL_VLLM=false \
    --build-arg INSTALL_DEEPSPEED=false \
    --build-arg INSTALL_FLASHATTN=false \
    --build-arg PIP_INDEX=https://pypi.org/simple \
    -t llamafactory:latest .

docker run -dit \
    -v ./hf_cache:/root/.cache/huggingface \
    -v ./ms_cache:/root/.cache/modelscope \
    -v ./om_cache:/root/.cache/openmind \
    -v ./data:/app/data \
    -v ./output:/app/output \
    -v ./saves:/app/saves \
    -p 7860:7860 \
    -p 8000:8000 \
    --device /dev/kfd \
    --device /dev/dri \
    --shm-size 16G \
    --name llamafactory \
    llamafactory:latest

docker exec -it llamafactory bash

Details about volume
  • hf_cache: Utilize Hugging Face cache on the host machine. Reassignable if a cache already exists in a different directory.
  • ms_cache: Similar to Hugging Face cache but for ModelScope users.
  • om_cache: Similar to Hugging Face cache but for Modelers users.
  • data: Place datasets on this dir of the host machine so that they can be selected on LLaMA Board GUI.
  • output: Set export dir to this location so that the merged result can be accessed directly on the host machine.

Deploy with OpenAI-style API and vLLM

API_PORT=8000 llamafactory-cli api examples/inference/llama3_vllm.yaml

[!TIP] Visit this page for the API documentation.

Examples: Image understanding | Function calling
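
The YAML passed to llamafactory-cli api is an ordinary inference config. A minimal sketch (option names follow LLaMA-Factory's inference arguments; infer_backend: vllm is the flag mentioned in the changelog above, and the model/template values are illustrative):

model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
template: llama3
infer_backend: vllm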

Download from ModelScope Hub

If you have trouble with downloading models and datasets from Hugging Face, you can use ModelScope.

export USE_MODELSCOPE_HUB=1 # `set USE_MODELSCOPE_HUB=1` for Windows

Train the model by specifying a model ID of the ModelScope Hub as the model_name_or_path. You can find a full list of model IDs at ModelScope Hub, e.g., LLM-Research/Meta-Llama-3-8B-Instruct.
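
In a training or inference YAML, this amounts to replacing the Hugging Face model ID with the ModelScope one, for example:

# with USE_MODELSCOPE_HUB=1 set in the environment
model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct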

Download from Modelers Hub

You can also use Modelers Hub to download models and datasets.

export USE_OPENMIND_HUB=1 # `set USE_OPENMIND_HUB=1` for Windows

Train the model by specifying a model ID of the Modelers Hub as the model_name_or_path. You can find a full list of model IDs at Modelers Hub, e.g., TeleAI/TeleChat-7B-pt.

Use W&B Logger

To use Weights & Biases for logging experimental results, you need to add the following arguments to yaml files.

report_to: wandb
run_name: test_run # optional

Set WANDB_API_KEY to your key when launching training tasks to log in with your W&B account.

Use SwanLab Logger

To use SwanLab for logging experimental results, you need to add the following arguments to yaml files.

use_swanlab: true
swanlab_run_name: test_run # optional

When launching training tasks, you can log in to SwanLab in three ways:

  1. Add swanlab_api_key=<your_api_key> to the yaml file, and set it to your API key.
  2. Set the environment variable SWANLAB_API_KEY to your API key.
  3. Use the swanlab login command to complete the login.

Projects using LLaMA Factory

If you have a project that should be incorporated, please contact via email or create a pull request.

Click to show
  1. Wang et al. ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation. 2023. [arxiv]
  2. Yu et al. Open, Closed, or Small Language Models for Text Classification? 2023. [arxiv]
  3. Wang et al. UbiPhysio: Support Daily Functioning, Fitness, and Rehabilitation with Action Understanding and Feedback in Natural Language. 2023. [arxiv]
  4. Luceri et al. Leveraging Large Language Models to Detect Influence Campaigns in Social Media. 2023. [arxiv]
  5. Zhang et al. Alleviating Hallucinations of Large Language Models through Induced Hallucinations. 2023. [arxiv]
  6. Wang et al. Know Your Needs Better: Towards Structured Understanding of Marketer Demands with Analogical Reasoning Augmented LLMs. KDD 2024. [arxiv]
  7. Wang et al. CANDLE: Iterative Conceptualization and Instantiation Distillation from Large Language Models for Commonsense Reasoning. ACL 2024. [arxiv]
  8. Choi et al. FACT-GPT: Fact-Checking Augmentation via Claim Matching with LLMs. 2024. [arxiv]
  9. Zhang et al. AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts. 2024. [arxiv]
  10. Lyu et al. KnowTuning: Knowledge-aware Fine-tuning for Large Language Models. 2024. [arxiv]
  11. Yang et al. LaCo: Large Language Model Pruning via Layer Collapse. 2024. [arxiv]
  12. Bhardwaj et al. Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic. 2024. [arxiv]
  13. Yang et al. Enhancing Empathetic Response Generation by Augmenting LLMs with Small-scale Empathetic Models. 2024. [arxiv]
  14. Yi et al. Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding. ACL 2024 Findings. [arxiv]
  15. Cao et al. Head-wise Shareable Attention for Large Language Models. 2024. [arxiv]
  16. Zhang et al. Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages. 2024. [arxiv]
  17. Kim et al. Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models. 2024. [arxiv]
  18. Yu et al. KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models. ACL 2024. [arxiv]
  19. Huang et al. Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning. 2024. [arxiv]
  20. Duan et al. Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization. 2024. [arxiv]
  21. Xie and Schwertfeger. Empowering Robotics with Large Language Models: osmAG Map Comprehension with LLMs. 2024. [arxiv]
  22. Wu et al. Large Language Models are Parallel Multilingual Learners. 2024. [arxiv]
  23. Zhang et al. EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling. 2024. [arxiv]
  24. Weller et al. FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. 2024. [arxiv]
  25. Hongbin Na. CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering. COLING 2024. [arxiv]
  26. Zan et al. CodeS: Natural Language to Code Repository via Multi-Layer Sketch. 2024. [arxiv]
  27. Liu et al. Extensive Self-Contrast Enables Feedback-Free Language Model Alignment. 2024. [arxiv]
  28. Luo et al. BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models. 2024. [arxiv]
  29. Du et al. Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model. 2024. [arxiv]
  30. Ma et al. Parameter Efficient Quasi-Orthogonal Fine-Tuning via Givens Rotation. ICML 2024. [arxiv]
  31. Liu et al. Dynamic Generation of Personalities with Large Language Models. 2024. [arxiv]
  32. Shang et al. How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models. 2024. [arxiv]
  33. Huang et al. LLMTune: Accelerate Database Knob Tuning with Large Language Models. 2024. [arxiv]
  34. Deng et al. Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction. 2024. [arxiv]
  35. Acikgoz et al. Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare. 2024. [arxiv]
  36. Zhang et al. Small Language Models Need Strong Verifiers to Self-Correct Reasoning. ACL 2024 Findings. [arxiv]
  37. Zhou et al. FREB-TQA: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering. NAACL 2024. [arxiv]
  38. Xu et al. Large Language Models for Cyber Security: A Systematic Literature Review. 2024. [arxiv]
  39. Dammu et al. "They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations. 2024. [arxiv]
  40. Yi et al. A safety realignment framework via subspace-oriented model fusion for large language models. 2024. [arxiv]
  41. Lou et al. SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling. 2024. [arxiv]
  42. Zhang et al. Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners. 2024. [arxiv]
  43. Zhang et al. TS-Align: A Teacher-Student Collaborative Framework for Scalable Iterative Finetuning of Large Language Models. 2024. [arxiv]
  44. Zihong Chen. Sentence Segmentation and Sentence Punctuation Based on XunziALLM. 2024. [paper]
  45. Gao et al. The Best of Both Worlds: Toward an Honest and Helpful Large Language Model. 2024. [arxiv]
  46. Wang and Song. MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset. 2024. [arxiv]
  47. Hu et al. Computational Limits of Low-Rank Adaptation (LoRA) for Transformer-Based Models. 2024. [arxiv]
  48. Ge et al. Time Sensitive Knowledge Editing through Efficient Finetuning. ACL 2024. [arxiv]
  49. Tan et al. Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions. 2024. [arxiv]
  50. Song et al. Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters. 2024. [arxiv]
  51. Gu et al. RWKV-CLIP: A Robust Vision-Language Representation Learner. 2024. [arxiv]
  52. Chen et al. Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees. 2024. [arxiv]
  53. Zhu et al. Are Large Language Models Good Statisticians?. 2024. [arxiv]
  54. Li et al. Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning. 2024. [arxiv]
  55. Ding et al. IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce. 2024. [arxiv]
  56. He et al. COMMUNITY-CROSS-INSTRUCT: Unsupervised Instruction Generation for Aligning Large Language Models to Online Communities. 2024. [arxiv]
  57. Lin et al. FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving. 2024. [arxiv]
  58. Treutlein et al. Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data. 2024. [arxiv]
  59. Feng et al. SS-Bench: A Benchmark for Social Story Generation and Evaluation. 2024. [arxiv]
  60. Feng et al. Self-Constructed Context Decompilation with Fined-grained Alignment Enhancement. 2024. [arxiv]
  61. Liu et al. Large Language Models for Cuffless Blood Pressure Measurement From Wearable Biosignals. 2024. [arxiv]
  62. Iyer et al. Exploring Very Low-Resource Translation with LLMs: The University of Edinburgh's Submission to AmericasNLP 2024 Translation Task. AmericasNLP 2024. [paper]
  63. Li et al. Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring. 2024. [arxiv]
  64. Yang et al. Financial Knowledge Large Language Model. 2024. [arxiv]
  65. Lin et al. DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging. 2024. [arxiv]
  66. Bako et al. Evaluating the Semantic Profiling Abilities of LLMs for Natural Language Utterances in Data Visualization. 2024. [arxiv]
  67. Huang et al. RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization. 2024. [arxiv]
  68. Jiang et al. LLM-Collaboration on Automatic Science Journalism for the General Audience. 2024. [arxiv]
  69. Inouye et al. Applied Auto-tuning on LoRA Hyperparameters. 2024. [paper]
  70. Qi et al. Research on Tibetan Tourism Viewpoints information generation system based on LLM. 2024. [arxiv]
  71. Xu et al. Course-Correction: Safety Alignment Using Synthetic Preferences. 2024. [arxiv]
  72. Sun et al. LAMBDA: A Large Model Based Data Agent. 2024. [arxiv]
  73. Zhu et al. CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare. 2024. [arxiv]
  74. Yu et al. Correcting Negative Bias in Large Language Models through Negative Attention Score Alignment. 2024. [arxiv]
  75. Xie et al. The Power of Personalized Datasets: Advancing Chinese Composition Writing for Elementary School through Targeted Model Fine-Tuning. IALP 2024. [paper]
  76. Liu et al. Instruct-Code-Llama: Improving Capabilities of Language Model in Competition Level Code Generation by Online Judge Feedback. ICIC 2024. [paper]
  77. Wang et al. Cybernetic Sentinels: Unveiling the Impact of Safety Data Selection on Model Security in Supervised Fine-Tuning. ICIC 2024. [paper]
  78. Xia et al. Understanding the Performance and Estimating the Cost of LLM Fine-Tuning. 2024. [arxiv]
  79. Zeng et al. Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions. 2024. [arxiv]
  80. Xia et al. Using Pre-trained Language Model for Accurate ESG Prediction. FinNLP 2024. [paper]
  81. Liang et al. I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm. 2024. [arxiv]
  82. Bai et al. Aligning Large Language Model with Direct Multi-Preference Optimization for Recommendation. CIKM 2024. [paper]
  83. StarWhisper: A large language model for Astronomy, based on ChatGLM2-6B and Qwen-14B.
  84. DISC-LawLLM: A large language model specialized in Chinese legal domain, based on Baichuan-13B, is capable of retrieving and reasoning on legal knowledge.
  85. Sunsimiao: A large language model specialized in Chinese medical domain, based on Baichuan-7B and ChatGLM-6B.
  86. CareGPT: A series of large language models for Chinese medical domain, based on LLaMA2-7B and Baichuan-13B.
  87. MachineMindset: A series of MBTI Personality large language models, capable of giving any LLM 16 different personality types based on different datasets and training methods.
  88. Luminia-13B-v3: A large language model specialized in generating metadata for Stable Diffusion. [demo]
  89. Chinese-LLaVA-Med: A multimodal large language model specialized in Chinese medical domain, based on LLaVA-1.5-7B.
  90. AutoRE: A document-level relation extraction system based on large language models.
  91. NVIDIA RTX AI Toolkit: SDKs for fine-tuning LLMs on Windows PC for NVIDIA RTX.
  92. LazyLLM: An easy and lazy way for building multi-agent LLMs applications and supports model fine-tuning via LLaMA Factory.
  93. RAG-Retrieval: A full pipeline for RAG retrieval model fine-tuning, inference, and distillation. [blog]
  94. 360-LLaMA-Factory: A modified library that supports long sequence SFT & DPO using ring attention.
  95. Sky-T1: An o1-like model fine-tuned by NovaSky AI at very low cost.

License

This repository is licensed under the Apache-2.0 License.

Please follow the model licenses to use the corresponding model weights: Baichuan 2 / BLOOM / ChatGLM3 / Command R / DeepSeek / Falcon / Gemma / GLM-4 / GPT-2 / Granite / Index / InternLM / Llama / Llama 2 (LLaVA-1.5) / Llama 3 / MiniCPM / Mistral/Mixtral/Pixtral / OLMo / Phi-1.5/Phi-2 / Phi-3/Phi-4 / Qwen / Skywork / StarCoder 2 / TeleChat2 / XVERSE / Yi / Yi-1.5 / Yuan 2

Citation

If this work is helpful, please kindly cite as:

@inproceedings{zheng2024llamafactory,
  title={LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models},
  author={Yaowei Zheng and Richong Zhang and Junhao Zhang and Yanhan Ye and Zheyan Luo and Zhangchi Feng and Yongqiang Ma},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)},
  address={Bangkok, Thailand},
  publisher={Association for Computational Linguistics},
  year={2024},
  url={http://arxiv.org/abs/2403.13372}
}

Acknowledgement

This repo benefits from PEFT, TRL, QLoRA and FastChat. Thanks for their wonderful works.

Star History

Star History Chart