Top Related Projects
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
LLM inference in C/C++
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Inference code for Llama models
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Development repository for the Triton language and compiler
Quick Overview
MLC-LLM is an open-source project aimed at running large language models (LLMs) efficiently on various hardware devices. It focuses on optimizing LLM inference for different platforms, including mobile devices, web browsers, and various operating systems, using machine learning compilation techniques.
Pros
- Enables efficient LLM inference on a wide range of devices and platforms
- Supports multiple popular LLM architectures (e.g., GPT, LLaMA, BLOOM)
- Utilizes machine learning compilation for optimized performance
- Open-source and community-driven development
Cons
- Requires some technical expertise to set up and use effectively
- May have limitations in supporting the latest LLM architectures immediately
- Performance can vary depending on the specific hardware and model used
- Documentation might be challenging for beginners
Code Examples
- Loading and running a model:

```python
# Illustrative sketch: the exact Python entry points vary across MLC-LLM
# releases (recent versions expose ChatModule / MLCEngine); see the docs.
import mlc_llm
import mlc_chat

model = mlc_llm.load_model("vicuna-7b-v1.3")
chat = mlc_chat.ChatModule(model)
response = chat.generate("What is the capital of France?")
print(response)
```
- Configuring model parameters:

```python
# Illustrative sketch: quantization mode and context length mirror real
# MLC-LLM options, but ModelConfig here is a stand-in, not a literal class.
import mlc_llm

model_config = mlc_llm.ModelConfig(
    model_name="llama-7b",
    quantization="int4",
    max_seq_len=2048,
)
model = mlc_llm.load_model(model_config)
```
- Running inference on a specific device:

```python
# Illustrative sketch: per-device targeting (e.g. "cuda", "metal", "vulkan")
# is a core MLC-LLM feature, though the call signature may differ by version.
import mlc_llm
import mlc_chat

model = mlc_llm.load_model("gpt-2", device="cuda")
chat = mlc_chat.ChatModule(model)
response = chat.generate("Explain quantum computing in simple terms.")
print(response)
```
Getting Started
To get started with MLC-LLM, follow these steps:
- Install the required dependencies (package names and wheel index may differ by platform; check the official install guide):

```bash
pip install mlc-llm mlc-ai-nightly torch
```

- Download a pre-trained model (illustrative command; recent releases can fetch prebuilt weights directly from Hugging Face):

```bash
python -m mlc_llm.model_download --model-name vicuna-7b-v1.3
```
- Run a simple inference:

```python
# As above, an illustrative sketch; consult the MLC-LLM docs for the exact
# Python entry points of your installed version.
import mlc_llm
import mlc_chat

model = mlc_llm.load_model("vicuna-7b-v1.3")
chat = mlc_chat.ChatModule(model)
response = chat.generate("Hello, how are you?")
print(response)
```
For more detailed instructions and advanced usage, refer to the project's documentation on GitHub.
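Beyond the in-process API, MLC-LLM can also expose an OpenAI-compatible REST endpoint. The sketch below assumes a local server has already been started as described in the docs and listens on the default address; the model name is illustrative.

```python
import requests

payload = {
    "model": "vicuna-7b-v1.3",  # illustrative model name
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
}
# Standard OpenAI-style chat completions route exposed by the MLC-LLM server.
resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```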
Competitor Comparisons
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Extensive library with support for a wide range of models and tasks
- Well-documented and actively maintained by a large community
- Easy integration with popular deep learning frameworks
Cons of transformers
- Can be resource-intensive for large models
- May require more setup and configuration for specific use cases
- Less optimized for edge devices and mobile platforms
Code Comparison
transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

input_text = "Hello, how are you?"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
mlc-llm:

```python
# Illustrative sketch: MLC-LLM's real Python surface centers on ChatModule /
# MLCEngine rather than a load_model/Tokenizer pair; this only mirrors the flow.
import mlc_llm
import torch

model = mlc_llm.load_model("gpt2")
tokenizer = mlc_llm.Tokenizer("gpt2")

input_text = "Hello, how are you?"
input_ids = tokenizer.encode(input_text)
output = model.generate(torch.tensor([input_ids]))
```
Both repositories provide powerful tools for working with language models, but they have different focuses. transformers offers a comprehensive suite of pre-trained models and tools, while mlc-llm is more specialized for efficient deployment on various hardware platforms.
LLM inference in C/C++
Pros of llama.cpp
- Lightweight and efficient C++ implementation, optimized for CPU usage
- Supports quantization techniques for reduced memory footprint
- Easier to integrate into existing C++ projects
Cons of llama.cpp
- Primarily CPU-oriented by design; GPU offloading exists but is less central than in mlc-llm's compiler-driven backends
- Fewer supported model architectures compared to mlc-llm
- Less focus on cross-platform deployment and mobile devices
Code Comparison
llama.cpp:

```cpp
// Sketch of the (older) llama.cpp C API; newer releases have reworked these calls.
// Load the model
llama_context* ctx = llama_init_from_file("model.bin", params);
// Run the prompt tokens through the model
llama_eval(ctx, tokens.data(), tokens.size(), n_past, n_threads);
// Read back the logits for the next-token distribution
float* logits = llama_get_logits(ctx);
```
mlc-llm:

```python
# Illustrative sketch of driving a compiled MLC-LLM module from Python;
# the concrete runtime class names differ between releases.
# Load a compiled model library
model = mlc_llm.LLMRuntime("model.so")
# Generate text
output = model.generate(prompt, max_length=100)
# Read back the logits
logits = model.get_logits()
```
Both projects aim to run large language models efficiently, but llama.cpp focuses on CPU-first optimization and C++ integration, while mlc-llm emphasizes cross-platform deployment and supports a wider range of model architectures. mlc-llm also targets a broader set of GPU and accelerator backends (CUDA, Vulkan, Metal, WebGPU), making it well suited to GPU-accelerated, high-performance applications.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- More mature and widely adopted in industry and research
- Offers a broader range of optimization techniques for various deep learning tasks
- Extensive documentation and community support
Cons of DeepSpeed
- Steeper learning curve for beginners
- Primarily focused on distributed training, which may be overkill for smaller projects
Code Comparison
MLC-LLM:

```python
# Illustrative sketch; an MLC-LLM model is not a plain PyTorch callable, so
# treat this as shorthand for "load a compiled model and run a forward pass".
import mlc_llm
import torch

model = mlc_llm.load_model("gpt2")
input_ids = torch.tensor([[1, 2, 3, 4, 5]])
output = model(input_ids)
```
DeepSpeed:

```python
import deepspeed
import torch

model = MyModel()  # any torch.nn.Module
# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler)
engine, optimizer, _, _ = deepspeed.initialize(model=model, config_params=ds_config)

input_ids = torch.tensor([[1, 2, 3, 4, 5]])
output = engine(input_ids)
```
MLC-LLM focuses on efficient inference of large language models, particularly on edge devices. It provides a simpler API for loading and running pre-trained models. DeepSpeed, on the other hand, offers a more comprehensive suite of optimization techniques for training and inference, with a focus on distributed computing and memory efficiency.
While MLC-LLM is tailored for LLM inference, DeepSpeed provides broader support for various deep learning tasks and model architectures. DeepSpeed's extensive features make it more suitable for large-scale projects and research, while MLC-LLM's simplicity may be preferable for quick LLM deployments and edge computing scenarios.
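To make DeepSpeed's inference side concrete, here is a minimal sketch of wrapping a Hugging Face model with its inference engine. It assumes a CUDA GPU is available and uses default settings; extras such as kernel injection and tensor parallelism are optional, version-dependent knobs.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Wrap the model with DeepSpeed's inference engine (places it on the local GPU).
engine = deepspeed.init_inference(model, dtype=torch.float16)

input_ids = tokenizer("Hello, how are you?", return_tensors="pt").input_ids.to("cuda")
output = engine.module.generate(input_ids, max_length=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```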
Inference code for Llama models
Pros of Llama
- Developed by Meta, leveraging extensive resources and expertise
- Focuses on large language models, offering state-of-the-art performance
- Provides pre-trained models for various applications
Cons of Llama
- Limited to specific hardware and environments
- Requires significant computational resources for deployment
- Less emphasis on cross-platform compatibility
Code Comparison
MLC-LLM:

```python
# Illustrative sketch; the TVM-backed runtime class name is a placeholder for
# the compiled-model runtime that MLC-LLM loads on the target device.
import tvm
from mlc_llm import LLMRuntime

runtime = LLMRuntime("path/to/model")
response = runtime.generate("Hello, how are you?")
```
Llama:

```python
# Simplified sketch of the Meta Llama reference code; the actual entry point is
# Llama.build(ckpt_dir, tokenizer_path, ...) followed by text/chat completion calls.
from llama import Llama

model = Llama.load("path/to/model")
response = model.generate("Hello, how are you?")
```
Key Differences
- MLC-LLM focuses on deploying LLMs across various devices and platforms
- Llama emphasizes high-performance language modeling on powerful hardware
- MLC-LLM offers more flexibility in terms of deployment options
- Llama provides pre-trained models with cutting-edge performance
- MLC-LLM aims to optimize LLMs for resource-constrained environments
Both projects contribute significantly to the field of large language models, with MLC-LLM prioritizing accessibility and Llama focusing on pushing the boundaries of model capabilities.
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Pros of nanoGPT
- Simpler implementation, making it easier to understand and modify
- Focused on educational purposes, with clear explanations and comments
- Lightweight and requires fewer computational resources
Cons of nanoGPT
- Limited scalability for large language models
- Fewer optimizations and features compared to MLC-LLM
- Less suitable for production-level deployments
Code Comparison
nanoGPT:

```python
# GPTConfig as defined in nanoGPT's model.py: a plain dataclass of model hyperparameters.
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304  # GPT-2's 50257, padded up for efficiency
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True
```
MLC-LLM:

```python
# Illustrative sketch: MLC-LLM defines per-architecture config dataclasses
# (hidden_size, num_hidden_layers, num_attention_heads, ...); this generic
# LLMConfig is a stand-in rather than a literal class in the package.
class LLMConfig:
    def __init__(self, model_name, vocab_size, hidden_size, num_hidden_layers, num_attention_heads):
        self.model_name = model_name
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
```
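For context, nanoGPT's config object is consumed directly by its GPT module. The sketch below assumes nanoGPT's model.py is importable and uses illustrative sizes; with no training targets supplied, the model returns logits only for the last position.

```python
import torch
from model import GPT, GPTConfig  # nanoGPT's model.py

config = GPTConfig(n_layer=12, n_head=12, n_embd=768, block_size=1024)
model = GPT(config)

idx = torch.randint(0, config.vocab_size, (1, 8))  # dummy token ids
logits, loss = model(idx)                          # loss is None without targets
print(logits.shape)                                # (1, 1, config.vocab_size)
```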
Development repository for the Triton language and compiler
Pros of Triton
- Focuses on GPU programming and optimization, offering more specialized tools for GPU-specific tasks
- Provides a Python-like syntax for writing GPU kernels, making it more accessible to Python developers
- Offers automatic code generation and optimization for different GPU architectures
Cons of Triton
- Has a narrower scope compared to MLC-LLM, focusing primarily on GPU programming
- May have a steeper learning curve for developers not familiar with GPU programming concepts
- Less emphasis on end-to-end machine learning model deployment and inference
Code Comparison
Triton example (GPU kernel):

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(output_ptr + offsets, x + y, mask=mask)
```
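A kernel like this is launched from Python over a grid of program instances. The launch below follows the standard vector-add pattern from the Triton tutorials; tensor sizes and BLOCK_SIZE are illustrative.

```python
import torch
import triton

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    output = torch.empty_like(x)
    n_elements = x.numel()
    # One program instance per BLOCK_SIZE-wide chunk of the input.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

x = torch.rand(98432, device="cuda")
y = torch.rand(98432, device="cuda")
print(add(x, y))
```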
MLC-LLM example (model deployment):

```python
# Illustrative sketch of the compile-then-deploy flow; the real workflow uses
# MLC-LLM's convert/compile tooling rather than an LLM class with .save().
import mlc_llm
import tvm

model = mlc_llm.LLM("gpt2")
model.compile()
model.save("gpt2_compiled")
```
While Triton focuses on low-level GPU programming, MLC-LLM provides a higher-level interface for deploying and optimizing large language models across various hardware platforms.
README
About
MLC LLM is a machine learning compiler and high-performance deployment engine for large language models. The mission of this project is to enable everyone to develop, optimize, and deploy AI models natively on everyone's platforms.
|  | AMD GPU | NVIDIA GPU | Apple GPU | Intel GPU |
|---|---|---|---|---|
| Linux / Win | ✅ Vulkan, ROCm | ✅ Vulkan, CUDA | N/A | ✅ Vulkan |
| macOS | ✅ Metal (dGPU) | N/A | ✅ Metal | ✅ Metal (iGPU) |
| Web Browser | ✅ WebGPU and WASM | | | |
| iOS / iPadOS | ✅ Metal on Apple A-series GPU | | | |
| Android | ✅ OpenCL on Adreno GPU | ✅ OpenCL on Mali GPU | | |
MLC LLM compiles and runs code on MLCEngine, a unified high-performance LLM inference engine across the platforms above. MLCEngine provides an OpenAI-compatible API available through a REST server, Python, JavaScript, iOS, and Android, all backed by the same engine and compiler that we keep improving with the community.
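As a quick illustration of that Python surface, the snippet below follows the engine pattern from the project documentation; the model string points at an MLC-prebuilt, quantized weight repo on Hugging Face and is an assumed example.

```python
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"  # illustrative model choice
engine = MLCEngine(model)

# OpenAI-style chat completion, streamed chunk by chunk.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print()

engine.terminate()
```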
Get Started
Please visit our documentation to get started with MLC LLM.
Citation
Please consider citing our project if you find it useful:
```bibtex
@software{mlc-llm,
author = {MLC team},
title = {{MLC-LLM}},
url = {https://github.com/mlc-ai/mlc-llm},
year = {2023}
}
```
The underlying techniques of MLC LLM include:
References
```bibtex
@inproceedings{tensorir,
author = {Feng, Siyuan and Hou, Bohan and Jin, Hongyi and Lin, Wuwei and Shao, Junru and Lai, Ruihang and Ye, Zihao and Zheng, Lianmin and Yu, Cody Hao and Yu, Yong and Chen, Tianqi},
title = {TensorIR: An Abstraction for Automatic Tensorized Program Optimization},
year = {2023},
isbn = {9781450399166},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3575693.3576933},
doi = {10.1145/3575693.3576933},
booktitle = {Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},
pages = {804--817},
numpages = {14},
keywords = {Tensor Computation, Machine Learning Compiler, Deep Neural Network},
location = {Vancouver, BC, Canada},
series = {ASPLOS 2023}
}
@inproceedings{metaschedule,
author = {Shao, Junru and Zhou, Xiyou and Feng, Siyuan and Hou, Bohan and Lai, Ruihang and Jin, Hongyi and Lin, Wuwei and Masuda, Masahiro and Yu, Cody Hao and Chen, Tianqi},
booktitle = {Advances in Neural Information Processing Systems},
editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
pages = {35783--35796},
publisher = {Curran Associates, Inc.},
title = {Tensor Program Optimization with Probabilistic Programs},
url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/e894eafae43e68b4c8dfdacf742bcbf3-Paper-Conference.pdf},
volume = {35},
year = {2022}
}
@inproceedings{tvm,
author = {Tianqi Chen and Thierry Moreau and Ziheng Jiang and Lianmin Zheng and Eddie Yan and Haichen Shen and Meghan Cowan and Leyuan Wang and Yuwei Hu and Luis Ceze and Carlos Guestrin and Arvind Krishnamurthy},
title = {{TVM}: An Automated {End-to-End} Optimizing Compiler for Deep Learning},
booktitle = {13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)},
year = {2018},
isbn = {978-1-939133-08-3},
address = {Carlsbad, CA},
pages = {578--594},
url = {https://www.usenix.org/conference/osdi18/presentation/chen},
publisher = {USENIX Association},
month = oct,
}
```