
mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation

18,588 stars · 1,501 forks

Top Related Projects

  • transformers — 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
  • llama.cpp — LLM inference in C/C++ (64,646 stars)
  • DeepSpeed — a deep learning optimization library that makes distributed training and inference easy, efficient, and effective (34,658 stars)
  • Llama — Inference code for Llama models (55,392 stars)
  • nanoGPT — The simplest, fastest repository for training/finetuning medium-sized GPTs (36,124 stars)
  • Triton — Development repository for the Triton language and compiler (12,546 stars)

Quick Overview

MLC-LLM is an open-source project aimed at running large language models (LLMs) efficiently on various hardware devices. It focuses on optimizing LLM inference for different platforms, including mobile devices, web browsers, and various operating systems, using machine learning compilation techniques.

Pros

  • Enables efficient LLM inference on a wide range of devices and platforms
  • Supports multiple popular LLM architectures (e.g., GPT, LLaMA, BLOOM)
  • Utilizes machine learning compilation for optimized performance
  • Open-source and community-driven development

Cons

  • Requires some technical expertise to set up and use effectively
  • May have limitations in supporting the latest LLM architectures immediately
  • Performance can vary depending on the specific hardware and model used
  • Documentation might be challenging for beginners

Code Examples

  1. Loading and running a model:
import mlc_llm
import mlc_chat

model = mlc_llm.load_model("vicuna-7b-v1.3")
chat = mlc_chat.ChatModule(model)

response = chat.generate("What is the capital of France?")
print(response)
  2. Configuring model parameters:
import mlc_llm

model_config = mlc_llm.ModelConfig(
    model_name="llama-7b",
    quantization="int4",
    max_seq_len=2048
)
model = mlc_llm.load_model(model_config)
  3. Running inference on a specific device:
import mlc_llm
import mlc_chat

model = mlc_llm.load_model("gpt-2", device="cuda")
chat = mlc_chat.ChatModule(model)

response = chat.generate("Explain quantum computing in simple terms.")
print(response)

Getting Started

To get started with MLC-LLM, follow these steps:

  1. Install the required dependencies:
pip install mlc-llm mlc-ai-nightly torch
  2. Download a pre-trained model:
python -m mlc_llm.model_download --model-name vicuna-7b-v1.3
  3. Run a simple inference:
import mlc_llm
import mlc_chat

model = mlc_llm.load_model("vicuna-7b-v1.3")
chat = mlc_chat.ChatModule(model)

response = chat.generate("Hello, how are you?")
print(response)

For more detailed instructions and advanced usage, refer to the project's documentation on GitHub.

Competitor Comparisons

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Extensive library with support for a wide range of models and tasks
  • Well-documented and actively maintained by a large community
  • Easy integration with popular deep learning frameworks

Cons of transformers

  • Can be resource-intensive for large models
  • May require more setup and configuration for specific use cases
  • Less optimized for edge devices and mobile platforms

Code Comparison

transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_text = "Hello, how are you?"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_ids, max_length=50)
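# Decode the generated ids back to text (completes the snippet above;
# standard transformers usage)
print(tokenizer.decode(output[0], skip_special_tokens=True))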

mlc-llm:

import mlc_llm
import torch

model = mlc_llm.load_model("gpt2")
tokenizer = mlc_llm.Tokenizer("gpt2")
input_text = "Hello, how are you?"
input_ids = tokenizer.encode(input_text)
output = model.generate(torch.tensor([input_ids]))

Both repositories provide powerful tools for working with language models, but they have different focuses. transformers offers a comprehensive suite of pre-trained models and tools, while mlc-llm is more specialized for efficient deployment on various hardware platforms.


llama.cpp: LLM inference in C/C++

Pros of llama.cpp

  • Lightweight and efficient C++ implementation, optimized for CPU usage
  • Supports quantization techniques for reduced memory footprint
  • Easier to integrate into existing C++ projects

Cons of llama.cpp

  • CPU-first design; GPU acceleration is available but is not the primary focus
  • Fewer supported model architectures compared to mlc-llm
  • Less focus on cross-platform deployment and mobile devices

Code Comparison

llama.cpp:

// Load model
llama_context* ctx = llama_init_from_file("model.bin", params);

// Generate text
llama_eval(ctx, tokens.data(), tokens.size(), n_past, n_threads);

// Get probabilities
float* logits = llama_get_logits(ctx);

mlc-llm:

# Load model
model = mlc_llm.LLMRuntime("model.so")

# Generate text
output = model.generate(prompt, max_length=100)

# Get probabilities
logits = model.get_logits()

Both projects aim to run large language models efficiently, but llama.cpp focuses on CPU-first optimization and C++ integration, while mlc-llm emphasizes cross-platform deployment, supports a wider range of model architectures, and puts GPU acceleration through ML compilation at its core, making it more suitable for high-performance, GPU-backed applications.


DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • More mature and widely adopted in industry and research
  • Offers a broader range of optimization techniques for various deep learning tasks
  • Extensive documentation and community support

Cons of DeepSpeed

  • Steeper learning curve for beginners
  • Primarily focused on distributed training, which may be overkill for smaller projects

Code Comparison

MLC-LLM:

import mlc_llm
import torch

model = mlc_llm.load_model("gpt2")
input_ids = torch.tensor([[1, 2, 3, 4, 5]])
output = model(input_ids)

DeepSpeed:

import deepspeed
import torch

model = MyModel()  # placeholder for a user-defined torch.nn.Module
# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler);
# ds_config is a DeepSpeed configuration dict (not shown)
engine, _, _, _ = deepspeed.initialize(model=model, config_params=ds_config)
input_ids = torch.tensor([[1, 2, 3, 4, 5]])
output = engine(input_ids)

MLC-LLM focuses on efficient inference of large language models, particularly on edge devices. It provides a simpler API for loading and running pre-trained models. DeepSpeed, on the other hand, offers a more comprehensive suite of optimization techniques for training and inference, with a focus on distributed computing and memory efficiency.

While MLC-LLM is tailored for LLM inference, DeepSpeed provides broader support for various deep learning tasks and model architectures. DeepSpeed's extensive features make it more suitable for large-scale projects and research, while MLC-LLM's simplicity may be preferable for quick LLM deployments and edge computing scenarios.
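Since the DeepSpeed snippet above goes through the training-oriented deepspeed.initialize entry point, a closer counterpart to MLC-LLM's inference focus is deepspeed.init_inference. The following is a minimal sketch, assuming a Hugging Face GPT-2 model and a CUDA GPU:

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a standard Hugging Face model (GPT-2 is only an example)
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Wrap the model with DeepSpeed's inference engine; kernel injection swaps in
# fused inference kernels where available (requires a CUDA GPU)
ds_engine = deepspeed.init_inference(model, dtype=torch.half, replace_with_kernel_inject=True)

input_ids = tokenizer.encode("Hello, how are you?", return_tensors="pt").to("cuda")
output = ds_engine.module.generate(input_ids, max_length=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))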


Inference code for Llama models

Pros of Llama

  • Developed by Meta, leveraging extensive resources and expertise
  • Focuses on large language models, offering state-of-the-art performance
  • Provides pre-trained models for various applications

Cons of Llama

  • Limited to specific hardware and environments
  • Requires significant computational resources for deployment
  • Less emphasis on cross-platform compatibility

Code Comparison

MLC-LLM:

import tvm
from mlc_llm import LLMRuntime

runtime = LLMRuntime("path/to/model")
response = runtime.generate("Hello, how are you?")

Llama:

from llama import Llama

model = Llama.load("path/to/model")
response = model.generate("Hello, how are you?")

Key Differences

  • MLC-LLM focuses on deploying LLMs across various devices and platforms
  • Llama emphasizes high-performance language modeling on powerful hardware
  • MLC-LLM offers more flexibility in terms of deployment options
  • Llama provides pre-trained models with cutting-edge performance
  • MLC-LLM aims to optimize LLMs for resource-constrained environments

Both projects contribute significantly to the field of large language models, with MLC-LLM prioritizing accessibility and Llama focusing on pushing the boundaries of model capabilities.


nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs.

Pros of nanoGPT

  • Simpler implementation, making it easier to understand and modify
  • Focused on educational purposes, with clear explanations and comments
  • Lightweight and requires fewer computational resources

Cons of nanoGPT

  • Limited scalability for large language models
  • Fewer optimizations and features compared to MLC-LLM
  • Less suitable for production-level deployments

Code Comparison

nanoGPT:

class GPTConfig:
    def __init__(self, vocab_size, n_layer, n_head, n_embd, block_size, bias, dropout):
        self.vocab_size = vocab_size
        self.n_layer = n_layer
        self.n_head = n_head
        self.n_embd = n_embd
        self.block_size = block_size
        self.bias = bias
        self.dropout = dropout
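
# Example instantiation with GPT-2-small-like hyperparameters (illustrative values)
config = GPTConfig(vocab_size=50257, n_layer=12, n_head=12, n_embd=768,
                   block_size=1024, bias=True, dropout=0.0)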

MLC-LLM:

class LLMConfig:
    def __init__(self, model_name, vocab_size, hidden_size, num_hidden_layers, num_attention_heads):
        self.model_name = model_name
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

Development repository for the Triton language and compiler

Pros of Triton

  • Focuses on GPU programming and optimization, offering more specialized tools for GPU-specific tasks
  • Provides a Python-like syntax for writing GPU kernels, making it more accessible to Python developers
  • Offers automatic code generation and optimization for different GPU architectures

Cons of Triton

  • Has a narrower scope compared to MLC-LLM, focusing primarily on GPU programming
  • May have a steeper learning curve for developers not familiar with GPU programming concepts
  • Less emphasis on end-to-end machine learning model deployment and inference

Code Comparison

Triton example (GPU kernel):

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    # Each program instance handles one BLOCK_SIZE-wide chunk of the vectors
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against reading past the end
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(output_ptr + offsets, x + y, mask=mask)
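
A host-side launch for the kernel above might look like the following; this is a sketch that assumes PyTorch is available for allocating CUDA tensors:

import torch

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
output = torch.empty_like(x)

# Launch one program instance per 1024-element block of the input
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, output, x.numel(), BLOCK_SIZE=1024)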

MLC-LLM example (model deployment):

import mlc_llm
import tvm

model = mlc_llm.LLM("gpt2")
model.compile()
model.save("gpt2_compiled")

While Triton focuses on low-level GPU programming, MLC-LLM provides a higher-level interface for deploying and optimizing large language models across various hardware platforms.


README

MLC LLM

Installation | License | Join Discord | Related Repository: WebLLM

Universal LLM Deployment Engine with ML Compilation

Get Started | Documentation | Blog

About

MLC LLM is a machine learning compiler and high-performance deployment engine for large language models. The mission of this project is to enable everyone to develop, optimize, and deploy AI models natively on everyone's platforms. 

              AMD GPU            NVIDIA GPU        Apple GPU    Intel GPU
Linux / Win   ✅ Vulkan, ROCm    ✅ Vulkan, CUDA   N/A          ✅ Vulkan
macOS         ✅ Metal (dGPU)    N/A               ✅ Metal     ✅ Metal (iGPU)
Web Browser   ✅ WebGPU and WASM (all GPU vendors)
iOS / iPadOS  ✅ Metal on Apple A-series GPU
Android       ✅ OpenCL on Adreno GPU · ✅ OpenCL on Mali GPU

MLC LLM compiles and runs code on MLCEngine -- a unified high-performance LLM inference engine across the above platforms. MLCEngine provides an OpenAI-compatible API, available through a REST server, Python, JavaScript, iOS, and Android, all backed by the same engine and compiler that we continuously improve with the community.
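
For illustration, a minimal sketch of that OpenAI-style interface through the Python MLCEngine is shown below; the model identifier is an assumption and should be replaced with a model string from the documentation:

from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"  # illustrative model id
engine = MLCEngine(model)

# OpenAI-style chat completion, streamed chunk by chunk
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)
print()

engine.terminate()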

Get Started

Please visit our documentation to get started with MLC LLM.

Citation

Please consider citing our project if you find it useful:

@software{mlc-llm,
    author = {MLC team},
    title = {{MLC-LLM}},
    url = {https://github.com/mlc-ai/mlc-llm},
    year = {2023}
}

The underlying techniques of MLC LLM include:

References

@inproceedings{tensorir,
    author = {Feng, Siyuan and Hou, Bohan and Jin, Hongyi and Lin, Wuwei and Shao, Junru and Lai, Ruihang and Ye, Zihao and Zheng, Lianmin and Yu, Cody Hao and Yu, Yong and Chen, Tianqi},
    title = {TensorIR: An Abstraction for Automatic Tensorized Program Optimization},
    year = {2023},
    isbn = {9781450399166},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3575693.3576933},
    doi = {10.1145/3575693.3576933},
    booktitle = {Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},
    pages = {804–817},
    numpages = {14},
    keywords = {Tensor Computation, Machine Learning Compiler, Deep Neural Network},
    location = {Vancouver, BC, Canada},
    series = {ASPLOS 2023}
}

@inproceedings{metaschedule,
    author = {Shao, Junru and Zhou, Xiyou and Feng, Siyuan and Hou, Bohan and Lai, Ruihang and Jin, Hongyi and Lin, Wuwei and Masuda, Masahiro and Yu, Cody Hao and Chen, Tianqi},
    booktitle = {Advances in Neural Information Processing Systems},
    editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
    pages = {35783--35796},
    publisher = {Curran Associates, Inc.},
    title = {Tensor Program Optimization with Probabilistic Programs},
    url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/e894eafae43e68b4c8dfdacf742bcbf3-Paper-Conference.pdf},
    volume = {35},
    year = {2022}
}

@inproceedings{tvm,
    author = {Tianqi Chen and Thierry Moreau and Ziheng Jiang and Lianmin Zheng and Eddie Yan and Haichen Shen and Meghan Cowan and Leyuan Wang and Yuwei Hu and Luis Ceze and Carlos Guestrin and Arvind Krishnamurthy},
    title = {{TVM}: An Automated {End-to-End} Optimizing Compiler for Deep Learning},
    booktitle = {13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)},
    year = {2018},
    isbn = {978-1-939133-08-3},
    address = {Carlsbad, CA},
    pages = {578--594},
    url = {https://www.usenix.org/conference/osdi18/presentation/chen},
    publisher = {USENIX Association},
    month = oct,
}