Top Related Projects
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
LLM inference in C/C++
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Inference code for Llama models
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Development repository for the Triton language and compiler
Quick Overview
MLC-LLM is an open-source project aimed at running large language models (LLMs) efficiently on various hardware devices. It focuses on optimizing LLM inference for different platforms, including mobile devices, web browsers, and various operating systems, using machine learning compilation techniques.
Pros
- Enables efficient LLM inference on a wide range of devices and platforms
- Supports multiple popular LLM architectures (e.g., GPT, LLaMA, BLOOM)
- Utilizes machine learning compilation for optimized performance
- Open-source and community-driven development
Cons
- Requires some technical expertise to set up and use effectively
- May have limitations in supporting the latest LLM architectures immediately
- Performance can vary depending on the specific hardware and model used
- Documentation might be challenging for beginners
Code Examples
- Loading and running a model:

```python
# Illustrative sketch: the exact Python entry points vary across MLC-LLM
# releases (recent versions expose ChatModule / MLCEngine); see the docs.
import mlc_llm
import mlc_chat

model = mlc_llm.load_model("vicuna-7b-v1.3")
chat = mlc_chat.ChatModule(model)
response = chat.generate("What is the capital of France?")
print(response)
```
- Configuring model parameters:

```python
# Illustrative sketch: quantization mode and context length mirror real
# MLC-LLM options, but ModelConfig here is a stand-in, not a literal class.
import mlc_llm

model_config = mlc_llm.ModelConfig(
    model_name="llama-7b",
    quantization="int4",
    max_seq_len=2048,
)
model = mlc_llm.load_model(model_config)
```
- Running inference on a specific device:

```python
# Illustrative sketch: per-device targeting (e.g. "cuda", "metal", "vulkan")
# is a core MLC-LLM feature, though the call signature may differ by version.
import mlc_llm
import mlc_chat

model = mlc_llm.load_model("gpt-2", device="cuda")
chat = mlc_chat.ChatModule(model)
response = chat.generate("Explain quantum computing in simple terms.")
print(response)
```
Getting Started
To get started with MLC-LLM, follow these steps:
- Install the required dependencies (package names and wheel index may differ by platform; check the official install guide):

```bash
pip install mlc-llm mlc-ai-nightly torch
```

- Download a pre-trained model (illustrative command; recent releases can fetch prebuilt weights directly from Hugging Face):

```bash
python -m mlc_llm.model_download --model-name vicuna-7b-v1.3
```
- Run a simple inference:

```python
# As above, an illustrative sketch; consult the MLC-LLM docs for the exact
# Python entry points of your installed version.
import mlc_llm
import mlc_chat

model = mlc_llm.load_model("vicuna-7b-v1.3")
chat = mlc_chat.ChatModule(model)
response = chat.generate("Hello, how are you?")
print(response)
```
For more detailed instructions and advanced usage, refer to the project's documentation on GitHub.
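Beyond the in-process API, MLC-LLM can also expose an OpenAI-compatible REST endpoint. The sketch below assumes a local server has already been started as described in the docs and listens on the default address; the model name is illustrative.

```python
import requests

payload = {
    "model": "vicuna-7b-v1.3",  # illustrative model name
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
}
# Standard OpenAI-style chat completions route exposed by the MLC-LLM server.
resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```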
Competitor Comparisons
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Extensive library with support for a wide range of models and tasks
- Well-documented and actively maintained by a large community
- Easy integration with popular deep learning frameworks
Cons of transformers
- Can be resource-intensive for large models
- May require more setup and configuration for specific use cases
- Less optimized for edge devices and mobile platforms
Code Comparison
transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

input_text = "Hello, how are you?"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
mlc-llm:

```python
# Illustrative sketch: MLC-LLM's real Python surface centers on ChatModule /
# MLCEngine rather than a load_model/Tokenizer pair; this only mirrors the flow.
import mlc_llm
import torch

model = mlc_llm.load_model("gpt2")
tokenizer = mlc_llm.Tokenizer("gpt2")

input_text = "Hello, how are you?"
input_ids = tokenizer.encode(input_text)
output = model.generate(torch.tensor([input_ids]))
```
Both repositories provide powerful tools for working with language models, but they have different focuses. transformers offers a comprehensive suite of pre-trained models and tools, while mlc-llm is more specialized for efficient deployment on various hardware platforms.
LLM inference in C/C++
Pros of llama.cpp
- Lightweight and efficient C++ implementation, optimized for CPU usage
- Supports quantization techniques for reduced memory footprint
- Easier to integrate into existing C++ projects
Cons of llama.cpp
- Primarily CPU-oriented by design; GPU offloading exists but is less central than in mlc-llm's compiler-driven backends
- Fewer supported model architectures compared to mlc-llm
- Less focus on cross-platform deployment and mobile devices
Code Comparison
llama.cpp:

```cpp
// Sketch of the (older) llama.cpp C API; newer releases have reworked these calls.
// Load the model
llama_context* ctx = llama_init_from_file("model.bin", params);
// Run the prompt tokens through the model
llama_eval(ctx, tokens.data(), tokens.size(), n_past, n_threads);
// Read back the logits for the next-token distribution
float* logits = llama_get_logits(ctx);
```
mlc-llm:

```python
# Illustrative sketch of driving a compiled MLC-LLM module from Python;
# the concrete runtime class names differ between releases.
# Load a compiled model library
model = mlc_llm.LLMRuntime("model.so")
# Generate text
output = model.generate(prompt, max_length=100)
# Read back the logits
logits = model.get_logits()
```
Both projects aim to run large language models efficiently, but llama.cpp focuses on CPU-first optimization and C++ integration, while mlc-llm emphasizes cross-platform deployment and supports a wider range of model architectures. mlc-llm also targets a broader set of GPU and accelerator backends (CUDA, Vulkan, Metal, WebGPU), making it well suited to GPU-accelerated, high-performance applications.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- More mature and widely adopted in industry and research
- Offers a broader range of optimization techniques for various deep learning tasks
- Extensive documentation and community support
Cons of DeepSpeed
- Steeper learning curve for beginners
- Primarily focused on distributed training, which may be overkill for smaller projects
Code Comparison
MLC-LLM:

```python
# Illustrative sketch; an MLC-LLM model is not a plain PyTorch callable, so
# treat this as shorthand for "load a compiled model and run a forward pass".
import mlc_llm
import torch

model = mlc_llm.load_model("gpt2")
input_ids = torch.tensor([[1, 2, 3, 4, 5]])
output = model(input_ids)
```
DeepSpeed:

```python
import deepspeed
import torch

model = MyModel()  # any torch.nn.Module
# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler)
engine, optimizer, _, _ = deepspeed.initialize(model=model, config_params=ds_config)

input_ids = torch.tensor([[1, 2, 3, 4, 5]])
output = engine(input_ids)
```
MLC-LLM focuses on efficient inference of large language models, particularly on edge devices. It provides a simpler API for loading and running pre-trained models. DeepSpeed, on the other hand, offers a more comprehensive suite of optimization techniques for training and inference, with a focus on distributed computing and memory efficiency.
While MLC-LLM is tailored for LLM inference, DeepSpeed provides broader support for various deep learning tasks and model architectures. DeepSpeed's extensive features make it more suitable for large-scale projects and research, while MLC-LLM's simplicity may be preferable for quick LLM deployments and edge computing scenarios.
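To make DeepSpeed's inference side concrete, here is a minimal sketch of wrapping a Hugging Face model with its inference engine. It assumes a CUDA GPU is available and uses default settings; extras such as kernel injection and tensor parallelism are optional, version-dependent knobs.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Wrap the model with DeepSpeed's inference engine (places it on the local GPU).
engine = deepspeed.init_inference(model, dtype=torch.float16)

input_ids = tokenizer("Hello, how are you?", return_tensors="pt").input_ids.to("cuda")
output = engine.module.generate(input_ids, max_length=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```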
Inference code for Llama models
Pros of Llama
- Developed by Meta, leveraging extensive resources and expertise
- Focuses on large language models, offering state-of-the-art performance
- Provides pre-trained models for various applications
Cons of Llama
- Limited to specific hardware and environments
- Requires significant computational resources for deployment
- Less emphasis on cross-platform compatibility
Code Comparison
MLC-LLM:

```python
# Illustrative sketch; the TVM-backed runtime class name is a placeholder for
# the compiled-model runtime that MLC-LLM loads on the target device.
import tvm
from mlc_llm import LLMRuntime

runtime = LLMRuntime("path/to/model")
response = runtime.generate("Hello, how are you?")
```
Llama:

```python
# Simplified sketch of the Meta Llama reference code; the actual entry point is
# Llama.build(ckpt_dir, tokenizer_path, ...) followed by text/chat completion calls.
from llama import Llama

model = Llama.load("path/to/model")
response = model.generate("Hello, how are you?")
```
Key Differences
- MLC-LLM focuses on deploying LLMs across various devices and platforms
- Llama emphasizes high-performance language modeling on powerful hardware
- MLC-LLM offers more flexibility in terms of deployment options
- Llama provides pre-trained models with cutting-edge performance
- MLC-LLM aims to optimize LLMs for resource-constrained environments
Both projects contribute significantly to the field of large language models, with MLC-LLM prioritizing accessibility and Llama focusing on pushing the boundaries of model capabilities.
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Pros of nanoGPT
- Simpler implementation, making it easier to understand and modify
- Focused on educational purposes, with clear explanations and comments
- Lightweight and requires fewer computational resources
Cons of nanoGPT
- Limited scalability for large language models
- Fewer optimizations and features compared to MLC-LLM
- Less suitable for production-level deployments
Code Comparison
nanoGPT:

```python
# GPTConfig as defined in nanoGPT's model.py: a plain dataclass of model hyperparameters.
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304  # GPT-2's 50257, padded up for efficiency
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True
```
MLC-LLM:

```python
# Illustrative sketch: MLC-LLM defines per-architecture config dataclasses
# (hidden_size, num_hidden_layers, num_attention_heads, ...); this generic
# LLMConfig is a stand-in rather than a literal class in the package.
class LLMConfig:
    def __init__(self, model_name, vocab_size, hidden_size, num_hidden_layers, num_attention_heads):
        self.model_name = model_name
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
```
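For context, nanoGPT's config object is consumed directly by its GPT module. The sketch below assumes nanoGPT's model.py is importable and uses illustrative sizes; with no training targets supplied, the model returns logits only for the last position.

```python
import torch
from model import GPT, GPTConfig  # nanoGPT's model.py

config = GPTConfig(n_layer=12, n_head=12, n_embd=768, block_size=1024)
model = GPT(config)

idx = torch.randint(0, config.vocab_size, (1, 8))  # dummy token ids
logits, loss = model(idx)                          # loss is None without targets
print(logits.shape)                                # (1, 1, config.vocab_size)
```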
Development repository for the Triton language and compiler
Pros of Triton
- Focuses on GPU programming and optimization, offering more specialized tools for GPU-specific tasks
- Provides a Python-like syntax for writing GPU kernels, making it more accessible to Python developers
- Offers automatic code generation and optimization for different GPU architectures
Cons of Triton
- Has a narrower scope compared to MLC-LLM, focusing primarily on GPU programming
- May have a steeper learning curve for developers not familiar with GPU programming concepts
- Less emphasis on end-to-end machine learning model deployment and inference
Code Comparison
Triton example (GPU kernel):

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(output_ptr + offsets, x + y, mask=mask)
```
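A kernel like this is launched from Python over a grid of program instances. The launch below follows the standard vector-add pattern from the Triton tutorials; tensor sizes and BLOCK_SIZE are illustrative.

```python
import torch
import triton

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    output = torch.empty_like(x)
    n_elements = x.numel()
    # One program instance per BLOCK_SIZE-wide chunk of the input.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

x = torch.rand(98432, device="cuda")
y = torch.rand(98432, device="cuda")
print(add(x, y))
```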
MLC-LLM example (model deployment):

```python
# Illustrative sketch of the compile-then-deploy flow; the real workflow uses
# MLC-LLM's convert/compile tooling rather than an LLM class with .save().
import mlc_llm
import tvm

model = mlc_llm.LLM("gpt2")
model.compile()
model.save("gpt2_compiled")
```
While Triton focuses on low-level GPU programming, MLC-LLM provides a higher-level interface for deploying and optimizing large language models across various hardware platforms.
README
About
MLC LLM is a machine learning compiler and high-performance deployment engine for large language models. The mission of this project is to enable everyone to develop, optimize, and deploy AI models natively on everyone's platforms.
|  | AMD GPU | NVIDIA GPU | Apple GPU | Intel GPU |
|---|---|---|---|---|
| Linux / Win | ✅ Vulkan, ROCm | ✅ Vulkan, CUDA | N/A | ✅ Vulkan |
| macOS | ✅ Metal (dGPU) | N/A | ✅ Metal | ✅ Metal (iGPU) |
| Web Browser | ✅ WebGPU and WASM | | | |
| iOS / iPadOS | ✅ Metal on Apple A-series GPU | | | |
| Android | ✅ OpenCL on Adreno GPU | ✅ OpenCL on Mali GPU | | |
MLC LLM compiles and runs code on MLCEngine, a unified high-performance LLM inference engine across the platforms above. MLCEngine provides an OpenAI-compatible API available through a REST server, Python, JavaScript, iOS, and Android, all backed by the same engine and compiler that we keep improving with the community.
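As a quick illustration of that Python surface, the snippet below follows the engine pattern from the project documentation; the model string points at an MLC-prebuilt, quantized weight repo on Hugging Face and is an assumed example.

```python
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"  # illustrative model choice
engine = MLCEngine(model)

# OpenAI-style chat completion, streamed chunk by chunk.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print()

engine.terminate()
```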
Get Started
Please visit our documentation to get started with MLC LLM.
Citation
Please consider citing our project if you find it useful:
```bibtex
@software{mlc-llm,
author = {MLC team},
title = {{MLC-LLM}},
url = {https://github.com/mlc-ai/mlc-llm},
year = {2023}
}
```
The underlying techniques of MLC LLM include:
References
```bibtex
@inproceedings{tensorir,
author = {Feng, Siyuan and Hou, Bohan and Jin, Hongyi and Lin, Wuwei and Shao, Junru and Lai, Ruihang and Ye, Zihao and Zheng, Lianmin and Yu, Cody Hao and Yu, Yong and Chen, Tianqi},
title = {TensorIR: An Abstraction for Automatic Tensorized Program Optimization},
year = {2023},
isbn = {9781450399166},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3575693.3576933},
doi = {10.1145/3575693.3576933},
booktitle = {Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},
pages = {804--817},
numpages = {14},
keywords = {Tensor Computation, Machine Learning Compiler, Deep Neural Network},
location = {Vancouver, BC, Canada},
series = {ASPLOS 2023}
}
@inproceedings{metaschedule,
author = {Shao, Junru and Zhou, Xiyou and Feng, Siyuan and Hou, Bohan and Lai, Ruihang and Jin, Hongyi and Lin, Wuwei and Masuda, Masahiro and Yu, Cody Hao and Chen, Tianqi},
booktitle = {Advances in Neural Information Processing Systems},
editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
pages = {35783--35796},
publisher = {Curran Associates, Inc.},
title = {Tensor Program Optimization with Probabilistic Programs},
url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/e894eafae43e68b4c8dfdacf742bcbf3-Paper-Conference.pdf},
volume = {35},
year = {2022}
}
@inproceedings{tvm,
author = {Tianqi Chen and Thierry Moreau and Ziheng Jiang and Lianmin Zheng and Eddie Yan and Haichen Shen and Meghan Cowan and Leyuan Wang and Yuwei Hu and Luis Ceze and Carlos Guestrin and Arvind Krishnamurthy},
title = {{TVM}: An Automated {End-to-End} Optimizing Compiler for Deep Learning},
booktitle = {13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)},
year = {2018},
isbn = {978-1-939133-08-3},
address = {Carlsbad, CA},
pages = {578--594},
url = {https://www.usenix.org/conference/osdi18/presentation/chen},
publisher = {USENIX Association},
month = oct,
}
```