LightGBM
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Top Related Projects
XGBoost: Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
CatBoost: A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
scikit-learn: machine learning in Python
h2o-3: H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
TensorFlow: An Open Source Machine Learning Framework for Everyone
PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Quick Overview
LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks. It is developed by Microsoft and is designed to be efficient, scalable, and accurate, particularly for large datasets.
Pros
- Faster training speed and higher efficiency compared to other boosting frameworks
- Lower memory usage due to its histogram-based algorithm
- Supports parallel, distributed, and GPU learning
- Handles large-scale data with ease
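The memory and speed claims in the list above rest on histogram binning. The following is a conceptual NumPy sketch of that idea only, not LightGBM's actual implementation: the data, the quantile-based bin edges, and all variable names are illustrative assumptions. The bin count of 255 matches the default of LightGBM's `max_bin` parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000
x = rng.normal(size=n_rows)      # one continuous feature (synthetic)
grad = rng.normal(size=n_rows)   # per-row gradients (synthetic stand-ins)

max_bin = 255
# 254 interior quantile edges split the feature into 255 buckets.
edges = np.quantile(x, np.linspace(0, 1, max_bin + 1)[1:-1])
bins = np.digitize(x, edges)     # integer bin index per row, in 0..254

# Bin indices fit in one byte, versus eight bytes per float64 value.
binned = bins.astype(np.uint8)
print(x.nbytes, "bytes raw ->", binned.nbytes, "bytes binned")

# Accumulate gradient sums and row counts per bin in one pass.
grad_hist = np.bincount(bins, weights=grad, minlength=max_bin)
count_hist = np.bincount(bins, minlength=max_bin)

# Candidate splits are scanned over bin boundaries (254 of them)
# instead of over every distinct feature value.
left_grad = np.cumsum(grad_hist)[:-1]
print("candidate thresholds:", left_grad.shape[0])
```

The split search then works on the small per-bin arrays rather than on the raw rows, which is where the reduced memory footprint and faster training come from.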
Cons
- Can be prone to overfitting if not properly tuned
- May require more careful parameter tuning compared to some other frameworks
- Less interpretable than simpler models like decision trees
- Documentation can be challenging for beginners
Code Examples
- Basic binary classification:
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2)
# Create dataset for LightGBM
train_data = lgb.Dataset(X_train, label=y_train)
# Set parameters
params = {
'objective': 'binary',
'metric': 'binary_logloss',
'num_leaves': 31,
'learning_rate': 0.05,
'feature_fraction': 0.9
}
# Train model
model = lgb.train(params, train_data, num_boost_round=100)
# Make predictions (with the 'binary' objective these are probabilities of the positive class)
y_pred = model.predict(X_test)
- Feature importance visualization:
import matplotlib.pyplot as plt
# Get feature importance
importance = model.feature_importance()
feature_names = data.feature_names
# Plot feature importance
plt.figure(figsize=(10, 6))
plt.bar(range(len(importance)), importance)
plt.xticks(range(len(importance)), feature_names, rotation=90)
plt.title('Feature Importance')
plt.tight_layout()
plt.show()
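- Cross-validation:
# Prepare LightGBM dataset
lgb_dataset = lgb.Dataset(data.data, label=data.target)
# Perform 5-fold cross-validation; request AUC explicitly, since params only tracks binary_logloss
cv_results = lgb.cv(params, lgb_dataset, num_boost_round=100, nfold=5, stratified=True, shuffle=True, metrics='auc')
# Result keys are 'auc-mean'/'auc-stdv' in LightGBM 3.x and 'valid auc-mean'/'valid auc-stdv' in 4.x
mean_key = next(k for k in cv_results if k.endswith('auc-mean'))
stdv_key = next(k for k in cv_results if k.endswith('auc-stdv'))
# Print mean and standard deviation of AUC
print(f"AUC: {cv_results[mean_key][-1]:.4f} (+/- {cv_results[stdv_key][-1]:.4f})")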
Getting Started
To get started with LightGBM:
- Install LightGBM:
pip install lightgbm
- Import the library:
import lightgbm as lgb
- Prepare your data and create a LightGBM dataset:
train_data = lgb.Dataset(X_train, label=y_train)
- Set parameters and train the model:
params = {'objective': 'binary'}
model = lgb.train(params, train_data, num_boost_round=100)
- Make predictions:
y_pred = model.predict(X_test)
Competitor Comparisons
XGBoost: Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
Pros of XGBoost
- More mature and widely adopted in industry and competitions
- Mature, well-documented handling of missing values (LightGBM also supports missing values natively)
- Stronger support for distributed and GPU computing
Cons of XGBoost
- Generally slower training speed, especially for large datasets
- Higher memory usage
- More complex hyperparameter tuning
Code Comparison
XGBoost:
import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
LightGBM:
import lightgbm as lgb
model = lgb.LGBMClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Both libraries offer similar scikit-learn-style APIs, making it easy to switch between them. The main differences lie in their underlying algorithms and performance characteristics. Both handle missing values natively; XGBoost is often preferred for smaller datasets, while LightGBM shines with larger datasets and faster training times. The choice between the two often depends on the specific use case and dataset characteristics.
CatBoost: A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
Pros of CatBoost
- Better handling of categorical features without manual preprocessing
- Improved performance on datasets with high cardinality categorical features
- Built-in GPU support for faster training
Cons of CatBoost
- Generally slower training time compared to LightGBM
- Less extensive documentation and community support
- Fewer advanced features and customization options
Code Comparison
CatBoost:
from catboost import CatBoostRegressor
model = CatBoostRegressor(iterations=1000, learning_rate=0.1)
model.fit(X_train, y_train, cat_features=cat_features)
predictions = model.predict(X_test)
LightGBM:
import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)
params = {'num_leaves': 31, 'learning_rate': 0.1}
model = lgb.train(params, train_data, num_boost_round=1000)
predictions = model.predict(X_test)
Both CatBoost and LightGBM are powerful gradient boosting libraries, each with its own strengths. CatBoost excels in handling categorical features and provides built-in GPU support, while LightGBM offers faster training times and more advanced customization options. The choice between the two depends on the specific requirements of your project and the nature of your dataset.
scikit-learn: machine learning in Python
Pros of scikit-learn
- Comprehensive library with a wide range of machine learning algorithms
- Excellent documentation and community support
- Consistent API across different algorithms, making it easy to use and switch between models
Cons of scikit-learn
- Generally slower performance compared to specialized libraries like LightGBM
- Less efficient for large-scale datasets and high-dimensional problems
- Limited support for GPU acceleration
Code Comparison
scikit-learn:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
LightGBM:
import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)
params = {'objective': 'binary'}
model = lgb.train(params, train_data)
predictions = model.predict(X_test)
Both libraries offer easy-to-use APIs, but LightGBM is more focused on gradient boosting and provides faster training times, especially for large datasets. scikit-learn offers a broader range of algorithms and is more suitable for general-purpose machine learning tasks, while LightGBM excels in gradient boosting applications and handling high-dimensional data efficiently.
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Pros of h2o-3
- Supports a wider range of algorithms and models, including deep learning
- Offers a user-friendly web interface for non-programmers
- Provides built-in distributed computing capabilities
Cons of h2o-3
- Generally slower performance compared to LightGBM
- More complex setup and configuration process
- Larger memory footprint, especially for big datasets
Code Comparison
h2o-3:
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
h2o.init()
data = h2o.import_file("data.csv")
model = H2OGradientBoostingEstimator()
model.train(x=["feature1", "feature2"], y="target", training_frame=data)
LightGBM:
import lightgbm as lgb
from sklearn.datasets import load_iris
data = load_iris()
train_data = lgb.Dataset(data.data, label=data.target)
params = {'objective': 'multiclass', 'num_class': 3}
model = lgb.train(params, train_data)
Both libraries offer gradient boosting implementations, but LightGBM focuses on efficiency and speed, while h2o-3 provides a broader range of algorithms and features. LightGBM's code is more concise and straightforward, while h2o-3 requires additional setup steps but offers more flexibility in terms of data handling and model configuration.
TensorFlow: An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- Comprehensive ecosystem for deep learning and neural networks
- Supports distributed computing and GPU acceleration
- Extensive community support and resources
Cons of TensorFlow
- Steeper learning curve compared to LightGBM
- Higher computational requirements for simple tasks
- More complex setup and configuration
Code Comparison
LightGBM:
import lightgbm as lgb
model = lgb.LGBMClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
TensorFlow:
import tensorflow as tf
model = tf.keras.Sequential([
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X_train, y_train, epochs=10)
predictions = model.predict(X_test)
LightGBM is more concise and straightforward for gradient boosting tasks, while TensorFlow offers more flexibility for complex neural network architectures but requires more code for simple models.
PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- More flexible and dynamic computational graph, allowing for easier debugging and experimentation
- Broader ecosystem and community support, with extensive libraries for various deep learning tasks
- Better support for GPU acceleration and distributed computing
Cons of PyTorch
- Steeper learning curve for beginners compared to LightGBM's simpler API
- Generally slower training speed for traditional machine learning tasks
Code Comparison
PyTorch example (neural network):
import torch.nn as nn
model = nn.Sequential(
nn.Linear(10, 5),
nn.ReLU(),
nn.Linear(5, 1)
)
LightGBM example (gradient boosting):
import lightgbm as lgb
model = lgb.LGBMRegressor(
n_estimators=100,
learning_rate=0.1
)
Summary
PyTorch is a deep learning framework offering flexibility and a rich ecosystem, ideal for complex neural network architectures and research. LightGBM, on the other hand, is a gradient boosting framework optimized for efficiency and speed in traditional machine learning tasks. While PyTorch excels in deep learning and GPU utilization, LightGBM is often preferred for its simplicity and faster training on structured data.
README
Light Gradient Boosting Machine
LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:
- Faster training speed and higher efficiency.
- Lower memory usage.
- Better accuracy.
- Support of parallel, distributed, and GPU learning.
- Capable of handling large-scale data.
For further details, please refer to Features.
Thanks to these advantages, LightGBM is widely used in many winning solutions of machine learning competitions.
Comparison experiments on public datasets show that LightGBM can outperform existing boosting frameworks on both efficiency and accuracy, with significantly lower memory consumption. What's more, distributed learning experiments show that LightGBM can achieve a linear speed-up by using multiple machines for training in specific settings.
Get Started and Documentation
Our primary documentation is at https://lightgbm.readthedocs.io/ and is generated from this repository. If you are new to LightGBM, follow the installation instructions on that site.
Next you may want to read:
- Examples showing command line usage of common tasks.
- Features and algorithms supported by LightGBM.
- Parameters: an exhaustive list of the customizations you can make.
- Distributed Learning and GPU Learning can speed up computation.
- FLAML provides automated tuning for LightGBM (code examples).
- Optuna Hyperparameter Tuner provides automated tuning for LightGBM hyperparameters (code examples).
- Understanding LightGBM Parameters (and How to Tune Them using Neptune).
Documentation for contributors:
- How we update readthedocs.io.
- Check out the Development Guide.
News
Please refer to the changelogs on the GitHub releases page.
External (Unofficial) Repositories
Projects listed here offer alternative ways to use LightGBM.
They are not maintained or officially endorsed by the LightGBM
development team.
LightGBMLSS (An extension of LightGBM to probabilistic modelling from which prediction intervals and quantiles can be derived): https://github.com/StatMixedML/LightGBMLSS
FLAML (AutoML library for hyperparameter optimization): https://github.com/microsoft/FLAML
supertree (interactive visualization of decision trees): https://github.com/mljar/supertree
Optuna (hyperparameter optimization framework): https://github.com/optuna/optuna
Julia-package: https://github.com/IQVIA-ML/LightGBM.jl
JPMML (Java PMML converter): https://github.com/jpmml/jpmml-lightgbm
Nyoka (Python PMML converter): https://github.com/SoftwareAG/nyoka
Treelite (model compiler for efficient deployment): https://github.com/dmlc/treelite
lleaves (LLVM-based model compiler for efficient inference): https://github.com/siboehm/lleaves
Hummingbird (model compiler into tensor computations): https://github.com/microsoft/hummingbird
cuML Forest Inference Library (GPU-accelerated inference): https://github.com/rapidsai/cuml
daal4py (Intel CPU-accelerated inference): https://github.com/intel/scikit-learn-intelex/tree/master/daal4py
m2cgen (model appliers for various languages): https://github.com/BayesWitnesses/m2cgen
leaves (Go model applier): https://github.com/dmitryikh/leaves
ONNXMLTools (ONNX converter): https://github.com/onnx/onnxmltools
SHAP (model output explainer): https://github.com/slundberg/shap
Shapash (model visualization and interpretation): https://github.com/MAIF/shapash
dtreeviz (decision tree visualization and model interpretation): https://github.com/parrt/dtreeviz
SynapseML (LightGBM on Spark): https://github.com/microsoft/SynapseML
Kubeflow Fairing (LightGBM on Kubernetes): https://github.com/kubeflow/fairing
Kubeflow Operator (LightGBM on Kubernetes): https://github.com/kubeflow/xgboost-operator
lightgbm_ray (LightGBM on Ray): https://github.com/ray-project/lightgbm_ray
Mars (LightGBM on Mars): https://github.com/mars-project/mars
ML.NET (.NET/C#-package): https://github.com/dotnet/machinelearning
LightGBM.NET (.NET/C#-package): https://github.com/rca22/LightGBM.Net
Ruby gem: https://github.com/ankane/lightgbm-ruby
LightGBM4j (Java high-level binding): https://github.com/metarank/lightgbm4j
lightgbm3 (Rust binding): https://github.com/Mottl/lightgbm3-rs
MLflow (experiment tracking, model monitoring framework): https://github.com/mlflow/mlflow
{bonsai} (R {parsnip}-compliant interface): https://github.com/tidymodels/bonsai
{mlr3extralearners} (R {mlr3}-compliant interface): https://github.com/mlr-org/mlr3extralearners
lightgbm-transform (feature transformation binding): https://github.com/microsoft/lightgbm-transform
postgresml (LightGBM training and prediction in SQL, via a Postgres extension): https://github.com/postgresml/postgresml
vaex-ml (Python DataFrame library with its own interface to LightGBM): https://github.com/vaexio/vaex
Support
- Ask a question on Stack Overflow with the lightgbm tag; we monitor this for new questions.
- Open bug reports and feature requests on GitHub issues.
How to Contribute
Check the CONTRIBUTING page.
Microsoft Open Source Code of Conduct
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Reference Papers
Yu Shi, Guolin Ke, Zhuoming Chen, Shuxin Zheng, Tie-Yan Liu. "Quantized Training of Gradient Boosting Decision Trees". Advances in Neural Information Processing Systems 35 (NeurIPS 2022), pp. 18822-18833.
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 3149-3157.
Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Tie-Yan Liu. "A Communication-Efficient Parallel Algorithm for Decision Tree". Advances in Neural Information Processing Systems 29 (NIPS 2016), pp. 1279-1287.
Huan Zhang, Si Si and Cho-Jui Hsieh. "GPU Acceleration for Large-scale Tree Boosting". SysML Conference, 2018.
License
This project is licensed under the terms of the MIT license. See LICENSE for additional details.