
dmlc/xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow


Top Related Projects

  • LightGBM: A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

  • CatBoost: A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

  • scikit-learn: machine learning in Python.

  • interpret: Fit interpretable models. Explain blackbox machine learning.

  • h2o-3: H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Quick Overview

XGBoost (eXtreme Gradient Boosting) is a highly efficient and scalable implementation of gradient boosting machines. It is designed to be flexible, portable, and highly performant, making it one of the most popular machine learning libraries for structured/tabular data. XGBoost is widely used in data science competitions and real-world applications.

Pros

  • Excellent performance and speed, often outperforming other gradient boosting implementations
  • Built-in support for handling missing values
  • Provides regularization to prevent overfitting
  • Supports various objective functions, including regression, classification, and ranking

Cons

  • Can be complex to tune due to many hyperparameters
  • May overfit on small datasets if not properly regularized (a short regularization sketch follows this list)
  • Less interpretable compared to simpler models like decision trees
  • Requires more memory compared to some other algorithms
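
Two of these trade-offs, missing-value handling and regularization, are easiest to see in code. Below is a minimal sketch rather than a tuned recommendation; the NaN pattern and parameter values are purely illustrative:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic data with some missing values; XGBoost learns a default direction
# for missing entries at each split, so no imputation is needed
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X[::7, 3] = np.nan

# Regularization-related parameters that help against overfitting on small data
model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=3,            # shallower trees reduce variance
    learning_rate=0.05,
    reg_alpha=0.1,          # L1 penalty on leaf weights
    reg_lambda=1.0,         # L2 penalty on leaf weights
    subsample=0.8,          # row subsampling per tree
    colsample_bytree=0.8,   # column subsampling per tree
    random_state=42,
)
model.fit(X, y)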

Code Examples

  1. Basic Classification Example:
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data and split into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)
  2. Regression with Early Stopping:
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Generate synthetic regression data
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set parameters and train with early stopping
params = {'max_depth': 6, 'eta': 0.1, 'objective': 'reg:squarederror'}
model = xgb.train(params, dtrain, num_boost_round=1000, 
                  early_stopping_rounds=10, evals=[(dtest, 'test')])

# Make predictions
predictions = model.predict(dtest)
  3. Feature Importance Visualization:
import xgboost as xgb
import matplotlib.pyplot as plt

# Train a regressor (reusing X_train and y_train from the regression example above)
model = xgb.XGBRegressor()
model.fit(X_train, y_train)

# Plot feature importance
xgb.plot_importance(model)
plt.show()
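
As a further sketch (not part of the original examples), the native API also offers built-in cross-validation via xgb.cv. Reusing dtrain and params from the regression example above:

# 5-fold cross-validation with the native API; returns a pandas DataFrame
# with per-round train/test metrics
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=500,
    nfold=5,
    metrics='rmse',
    early_stopping_rounds=10,
    seed=42,
)
print(cv_results.tail())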

Getting Started

To get started with XGBoost, first install it using pip:

pip install xgboost

Then, you can use XGBoost in your Python code:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

# Print accuracy
accuracy = (predictions == y_test).mean()
print(f"Accuracy: {accuracy:.2f}")

This example trains a binary classifier on a synthetic dataset and prints its accuracy on the held-out test set.
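
If you want to reuse the trained model later, a minimal persistence sketch (the file name is arbitrary) looks like this:

# Save the fitted classifier to JSON and load it back into a fresh estimator
model.save_model("xgb_model.json")

loaded = xgb.XGBClassifier()
loaded.load_model("xgb_model.json")
print((loaded.predict(X_test) == predictions).all())  # identical predictions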

Competitor Comparisons

LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Pros of LightGBM

  • Faster training speed and lower memory usage due to histogram-based algorithm
  • Better handling of categorical features without preprocessing
  • Supports distributed and GPU learning out of the box

Cons of LightGBM

  • Less robust to overfitting on small datasets
  • Fewer built-in cross-validation and hyperparameter tuning tools
  • Slightly steeper learning curve for beginners

Code Comparison

XGBoost:

import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

LightGBM:

import lightgbm as lgb
model = lgb.LGBMClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Both libraries offer similar high-level APIs, making it easy to switch between them. The main differences lie in the underlying algorithms and default parameters. LightGBM generally requires less feature engineering and preprocessing, especially for categorical variables. XGBoost, on the other hand, provides more robust performance on smaller datasets and has a wider range of built-in tools for model tuning and evaluation.
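
For instance, LightGBM can consume pandas "category" columns directly; a minimal sketch with made-up data:

import pandas as pd
import lightgbm as lgb

# Illustrative DataFrame with one categorical and one numeric column
df = pd.DataFrame({
    "city": pd.Categorical(["nyc", "sf", "la"] * 100),
    "income": range(300),
})
target = [0, 1, 0] * 100

model = lgb.LGBMClassifier()
model.fit(df, target)  # 'category' dtype columns are treated as categorical automatically
predictions = model.predict(df)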

CatBoost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Pros of CatBoost

  • Better handling of categorical features without manual preprocessing
  • Improved performance on datasets with high cardinality categorical features
  • Built-in support for GPU acceleration, potentially faster training times

Cons of CatBoost

  • Less mature ecosystem and community support compared to XGBoost
  • Fewer integration options with other machine learning frameworks
  • Can be slower on datasets with primarily numerical features

Code Comparison

XGBoost:

import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

CatBoost:

from catboost import CatBoostClassifier

# cat_features is a list of column indices (or names) of the categorical features
model = CatBoostClassifier(cat_features=cat_features)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Both libraries offer similar ease of use, but CatBoost requires specifying categorical features explicitly. XGBoost typically needs preprocessing for categorical variables, while CatBoost handles them natively. CatBoost's API is designed to be more user-friendly for beginners, with fewer hyperparameters to tune out of the box. XGBoost provides more flexibility and control over the model's behavior, which can be advantageous for experienced users working on complex problems.
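
To make the cat_features argument concrete, here is a small, self-contained sketch; the column names and values are purely illustrative:

import pandas as pd
from catboost import CatBoostClassifier

# The "city" column holds raw strings; CatBoost encodes them internally
df = pd.DataFrame({"city": ["nyc", "sf", "la", "nyc", "sf", "la"],
                   "age": [24, 31, 28, 40, 35, 22]})
y_cat = [0, 1, 0, 1, 1, 0]

model = CatBoostClassifier(iterations=50, verbose=False, cat_features=["city"])
model.fit(df, y_cat)
print(model.predict(df))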

scikit-learn: machine learning in Python

Pros of scikit-learn

  • Comprehensive library with a wide range of machine learning algorithms
  • Consistent and user-friendly API across different models
  • Excellent documentation and community support

Cons of scikit-learn

  • Limited support for GPU acceleration
  • Less optimized for large-scale datasets compared to XGBoost
  • Gradient boosting implementation not as advanced as XGBoost

Code Comparison

XGBoost:

import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

scikit-learn:

from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Both XGBoost and scikit-learn are popular machine learning libraries, but they serve different purposes. XGBoost specializes in gradient boosting and is highly optimized for performance, especially with large datasets. scikit-learn, on the other hand, offers a broader range of algorithms and is known for its ease of use and consistency across different models. While XGBoost excels in gradient boosting tasks, scikit-learn provides a more comprehensive toolkit for various machine learning tasks, making it a great choice for general-purpose machine learning projects.
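
Because XGBoost ships a scikit-learn-compatible estimator, the two libraries also compose well; for example, XGBClassifier can be tuned with scikit-learn's GridSearchCV. A minimal sketch (the parameter grid is illustrative):

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Small illustrative grid; real searches usually cover more parameters
param_grid = {"max_depth": [3, 5], "n_estimators": [100, 200]}
search = GridSearchCV(xgb.XGBClassifier(), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)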

Interpret

Fit interpretable models. Explain blackbox machine learning.

Pros of Interpret

  • Focuses on model interpretability and explainability
  • Provides a unified interface for various interpretable ML techniques
  • Offers interactive visualizations for better understanding of model decisions

Cons of Interpret

  • Less optimized for high-performance computing compared to XGBoost
  • Smaller community and ecosystem than XGBoost
  • May have a steeper learning curve for users new to interpretable ML

Code Comparison

XGBoost:

import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Interpret:

from interpret.glassbox import ExplainableBoostingClassifier
model = ExplainableBoostingClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
ebm_global = model.explain_global()

XGBoost is primarily focused on gradient boosting and offers high-performance implementations for various tasks. It's widely used in competitions and production environments due to its speed and accuracy.

Interpret, on the other hand, emphasizes model interpretability and provides tools for explaining model decisions. It includes various interpretable ML techniques and visualization tools, making it suitable for applications where understanding model behavior is crucial.

While both libraries can be used for machine learning tasks, they serve different primary purposes. XGBoost is ideal for achieving high performance in predictive modeling, while Interpret is better suited for scenarios where model transparency and explainability are paramount.
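
Continuing the Interpret snippet above, the explanations can be rendered interactively with interpret's show helper (a sketch; it opens a dashboard in notebook or browser contexts):

from interpret import show

# Per-prediction explanations for the held-out data, plus the global view
ebm_local = model.explain_local(X_test, y_test)
show(ebm_global)
show(ebm_local)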

H2O-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Pros of H2O-3

  • Offers a broader range of machine learning algorithms beyond gradient boosting
  • Provides a user-friendly web interface for model building and visualization
  • Supports distributed computing out of the box for large-scale data processing

Cons of H2O-3

  • Generally slower performance for gradient boosting tasks
  • Less flexibility in fine-tuning model parameters
  • Steeper learning curve for users familiar with Python-centric workflows

Code Comparison

XGBoost:

import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

H2O-3:

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
# `features` is a list of predictor column names, `target` is the response column
# name, and `train`/`test` are H2OFrames prepared beforehand
model = H2OGradientBoostingEstimator()
model.train(x=features, y=target, training_frame=train)
predictions = model.predict(test)

Both libraries offer gradient boosting capabilities, but XGBoost is more specialized and typically faster for this specific task. H2O-3 provides a wider range of algorithms and a more comprehensive machine learning platform, making it suitable for users who need a variety of models and prefer a GUI interface. XGBoost is generally favored by data scientists who prioritize performance and fine-grained control over gradient boosting models.
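
H2O-3 operates on its own H2OFrame rather than NumPy arrays, so moving data in usually involves a conversion step. A minimal sketch with made-up column names and data:

import h2o
import pandas as pd
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Illustrative pandas DataFrame converted into an H2OFrame
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0] * 25,
                   "x2": [0.5, 1.5, 2.5, 3.5] * 25,
                   "label": ["a", "b", "a", "b"] * 25})
frame = h2o.H2OFrame(df)
frame["label"] = frame["label"].asfactor()  # mark the response as categorical

model = H2OGradientBoostingEstimator(ntrees=50)
model.train(x=["x1", "x2"], y="label", training_frame=frame)
print(model.predict(frame).head())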

README

eXtreme Gradient Boosting


Community | Documentation | Resources | Contributors | Release Notes

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Kubernetes, Hadoop, SGE, Dask, Spark, PySpark) and can solve problems beyond billions of examples.
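
The Dask integration gives a feel for that portability; below is a minimal sketch, assuming a local dask.distributed cluster and synthetic Dask arrays (the data and cluster settings are illustrative):

from dask.distributed import Client, LocalCluster
import dask.array as da
from xgboost import dask as dxgb

# Start a small local cluster; in production this would be a real Dask deployment
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

# Synthetic data partitioned into chunks that Dask distributes across workers
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = (da.random.random(100_000, chunks=(10_000,)) > 0.5).astype(int)

model = dxgb.DaskXGBClassifier(n_estimators=100, tree_method="hist")
model.client = client
model.fit(X, y)
predictions = model.predict(X)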

License

© Contributors, 2021. Licensed under an Apache-2 license.

Contribute to XGBoost

XGBoost has been developed and used by a group of active community members. Your help is very valuable in making the package better for everyone. Check out the Community Page.

Reference

  • Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016
  • XGBoost originated from a research project at the University of Washington.

Sponsors

Become a sponsor and get a logo here. See details at Sponsoring the XGBoost Project. The funds are used to defray the cost of continuous integration and testing infrastructure (https://xgboost-ci.net).

Open Source Collective sponsors

Sponsors

  • NVIDIA

Become a sponsor or backer on the project's Open Collective page.