catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.


Top Related Projects

LightGBM

17,161

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

xgboost

26,866

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

scikit-learn

62,466

scikit-learn: machine learning in Python

FLAML

4,115

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.

h2o-3

7,129

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

interpret

6,467

Fit interpretable models. Explain blackbox machine learning.

Quick Overview

CatBoost is a high-performance, open-source gradient boosting library developed by Yandex. It is built around efficient handling of categorical features, which it encodes internally rather than requiring manual preprocessing. Its ordered boosting scheme and ordered target statistics are designed to reduce prediction shift and overfitting, which makes it a strong choice for both regression and classification problems.
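
To make the idea of encoding categorical features without leaking the target more concrete, here is a minimal pure-Python sketch of an ordered target statistic. It is a simplified illustration, not CatBoost's implementation: the function name, the `prior` and `a` parameters, and the use of the rows' natural order in place of a random permutation are all assumptions made for this example.

```python
from collections import defaultdict


def ordered_target_stats(categories, targets, prior, a=1.0):
    """Encode each categorical value using only the targets of earlier rows.

    Row i is encoded as
        (sum of earlier targets with the same category + a * prior)
        / (number of earlier rows with the same category + a)
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for cat, target in zip(categories, targets):
        encoded.append((sums[cat] + a * prior) / (counts[cat] + a))
        # Update the running statistics only after encoding the current row,
        # so a row never sees its own target value.
        sums[cat] += target
        counts[cat] += 1
    return encoded


cats = ["a", "b", "a", "a"]
ys = [1, 0, 0, 1]
encoded = ordered_target_stats(cats, ys, prior=0.5)
print(encoded)  # [0.5, 0.5, 0.75, 0.5]
```

Because each row is encoded from rows that precede it, the encoding of a training example does not depend on that example's own label, which is the property that helps against prediction shift.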

Pros

Excellent handling of categorical features without extensive preprocessing
Built-in mechanisms to prevent overfitting
Fast performance on both CPU and GPU
Supports various loss functions and evaluation metrics

Cons

Can be slower in training compared to some other gradient boosting libraries
Limited built-in feature importance methods
Steeper learning curve for advanced customization
Less extensive documentation compared to some more established libraries

Code Examples

Basic classification example:

from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = CatBoostClassifier(iterations=100, learning_rate=0.1)
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

Handling categorical features:

import pandas as pd
from catboost import CatBoostRegressor

# Load data with categorical features
# ('data.csv' and the column names below are placeholders for your own data)
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Specify categorical features by column name
cat_features = ['category1', 'category2']

# Train model
model = CatBoostRegressor(iterations=300, cat_features=cat_features)
model.fit(X, y)

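Cross-validation with an additional metric:

from catboost import Pool, cv
from sklearn.datasets import load_iris

# cv() expects a catboost.Pool rather than raw arrays
X, y = load_iris(return_X_y=True)
train_data = Pool(X, label=y)

# Extra metrics are requested by name through params['custom_metric'];
# a plain Python function (such as a wrapper around sklearn's f1_score)
# cannot be passed here. TotalF1 is a built-in multiclass F1 metric.
params = {
    'loss_function': 'MultiClass',
    'custom_metric': ['TotalF1'],
    'iterations': 500,
    'learning_rate': 0.05,
}

# Perform 5-fold cross-validation
cv_results = cv(pool=train_data, params=params, fold_count=5)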
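
For reference, macro-averaged F1 (one common way to aggregate F1 across classes) can be written out in plain Python. This is only a sketch of the metric's definition for checking results by hand; the function name and the toy labels are made up for illustration, and it is not the code CatBoost or scikit-learn runs internally.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, giving every class equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case the per-class scores are 0.5, 0.8 and 2/3, so the macro average weights the weakest class as heavily as the strongest one.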

Getting Started

To get started with CatBoost, first install it using pip:

pip install catboost

Then, you can import and use CatBoost in your Python code:

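import pandas as pd
from catboost import CatBoostClassifier, Pool

# Prepare your data (a tiny illustrative frame; replace it with your own)
X = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green'],
    'size': ['S', 'M', 'L', 'M', 'L', 'S'],
    'weight': [1.0, 2.5, 3.1, 1.2, 2.9, 0.8],
})
y = [0, 1, 1, 0, 1, 0]
train_pool = Pool(X, y, cat_features=['color', 'size'])

# Initialize and train the model
model = CatBoostClassifier(iterations=300, learning_rate=0.1, verbose=False)
model.fit(train_pool)

# Make predictions (on the training rows here, purely for illustration)
predictions = model.predict(X)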

This basic example demonstrates how to train a CatBoost classifier and make predictions. Adjust parameters and data preparation according to your specific use case.

Competitor Comparisons

LightGBM

Pros of LightGBM

Generally faster training speed, especially on large datasets
Lower memory usage due to its histogram-based algorithm
Native support for categorical features, although columns must first be integer-coded or cast to the pandas category dtype

Cons of LightGBM

Can be more prone to overfitting on small datasets
Less robust to noisy data compared to CatBoost
Requires more careful parameter tuning for optimal performance

Code Comparison

LightGBM:

import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)
params = {'num_leaves': 31, 'objective': 'binary'}
model = lgb.train(params, train_data, num_boost_round=100)

CatBoost:

from catboost import CatBoostClassifier
model = CatBoostClassifier(iterations=100, depth=5, learning_rate=0.1)
model.fit(X_train, y_train)

Both LightGBM and CatBoost are powerful gradient boosting frameworks, each with its strengths. LightGBM excels in speed and efficiency, making it suitable for large-scale applications. CatBoost, on the other hand, offers better out-of-the-box performance and handles categorical features more elegantly. The choice between the two often depends on the specific requirements of the project and the characteristics of the dataset.
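
The speed and memory profile attributed to LightGBM above comes from its histogram-based algorithm: feature values are bucketed into a small number of bins, and split search works on per-bin statistics instead of every distinct value. The sketch below shows only that bucketing step, with equal-width bins and hypothetical names; LightGBM's real binning and split search are considerably more elaborate.

```python
def build_histogram(values, gradients, num_bins=4):
    """Accumulate gradient sums and row counts per equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        width = 1.0  # all values identical: everything falls into bin 0
    grad_sums = [0.0] * num_bins
    counts = [0] * num_bins
    for value, grad in zip(values, gradients):
        # The maximum value would land in bin `num_bins`, so clamp it
        idx = min(int((value - lo) / width), num_bins - 1)
        grad_sums[idx] += grad
        counts[idx] += 1
    return grad_sums, counts


values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0]
gradients = [0.5, -0.5, 1.0, 1.0, -2.0, 0.0, 0.25, 0.25]
grad_sums, counts = build_histogram(values, gradients)
print(grad_sums, counts)  # [0.0, 2.0, -2.0, 0.5] [2, 2, 2, 2]
```

A split finder then only needs to scan `num_bins` candidate boundaries per feature, regardless of how many rows the dataset has.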

xgboost

Pros of XGBoost

Longer history and more widespread adoption in industry and competitions
Extensive documentation and large community support
Efficient handling of sparse data

Cons of XGBoost

Can be slower to train on some large datasets, depending on hardware and configuration
Less effective handling of categorical features without preprocessing
More hyperparameters to tune, potentially requiring more expertise

Code Comparison

XGBoost:

import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

CatBoost:

from catboost import CatBoostClassifier
model = CatBoostClassifier(cat_features=cat_features)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

The main difference in usage is that CatBoost allows direct specification of categorical features, while XGBoost typically requires preprocessing of categorical variables (e.g., one-hot encoding) before training. CatBoost's syntax is generally simpler and requires less data preparation, especially when dealing with categorical features.
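
The one-hot preprocessing mentioned above can be sketched in a few lines of plain Python. This is an illustration of the transformation only, using a hypothetical helper and toy rows; in practice one would normally use `pandas.get_dummies` or scikit-learn's `OneHotEncoder`.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row dict with 0/1 indicator columns."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one
        new_row = {key: value for key, value in row.items() if key != column}
        for cat in categories:
            new_row[f"{column}={cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"color": "red", "weight": 1.0},
    {"color": "blue", "weight": 2.5},
    {"color": "red", "weight": 0.7},
]
encoded_rows = one_hot_encode(rows, "color")
print(encoded_rows[0])  # {'weight': 1.0, 'color=blue': 0, 'color=red': 1}
```

The number of columns grows with the number of distinct category values, which is one reason libraries with native categorical handling can be more convenient for high-cardinality features.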

Both libraries offer powerful gradient boosting implementations, but they have different strengths. XGBoost excels in handling sparse data and has a longer track record, while CatBoost offers better out-of-the-box handling of categorical features. Relative training speed depends on the dataset, hardware and parameter settings, so it is worth benchmarking both on your own data.

scikit-learn

Pros of scikit-learn

Comprehensive library with a wide range of machine learning algorithms
Excellent documentation and large community support
Seamless integration with other scientific Python libraries

Cons of scikit-learn

Generally slower performance compared to CatBoost
Less effective handling of categorical features
Limited support for GPU acceleration

Code Comparison

scikit-learn:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

CatBoost:

from catboost import CatBoostClassifier
clf = CatBoostClassifier(iterations=500, learning_rate=0.1)
clf.fit(X_train, y_train, cat_features=cat_features)
y_pred = clf.predict(X_test)

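The code comparison shows that CatBoost accepts categorical columns directly through cat_features, while scikit-learn's RandomForestClassifier expects them to be encoded numerically beforehand (for example with OneHotEncoder or OrdinalEncoder). The iterations and learning_rate parameters control the boosting process and have no direct counterpart in a random forest, where n_estimators sets the number of trees.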

Both libraries provide similar ease of use for basic model training and prediction. However, CatBoost's specialized handling of categorical features and built-in performance optimizations can lead to improved results in many scenarios, especially with datasets containing categorical variables.

FLAML

Pros of FLAML

Automated machine learning (AutoML) framework supporting multiple algorithms
Efficient hyperparameter tuning with budget-aware optimization
Lightweight and easy to integrate into existing ML pipelines

Cons of FLAML

Less specialized for gradient boosting compared to CatBoost
May require more setup and configuration for specific use cases
Potentially slower for large datasets due to its multi-algorithm approach

Code Comparison

FLAML:

from flaml import AutoML
automl = AutoML()
automl.fit(X_train, y_train, task="classification")
predictions = automl.predict(X_test)

CatBoost:

from catboost import CatBoostClassifier
model = CatBoostClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Key Differences

FLAML offers a more general-purpose AutoML solution, while CatBoost specializes in gradient boosting
CatBoost provides built-in support for categorical features, whereas FLAML relies on preprocessing
FLAML's AutoML approach may be more suitable for users who want to explore multiple algorithms, while CatBoost is ideal for those focused on gradient boosting performance

h2o-3

Pros of H2O-3

Supports a wider range of algorithms and models, including deep learning
Offers a user-friendly web interface for non-programmers
Provides distributed computing capabilities for handling large datasets

Cons of H2O-3

Generally slower performance compared to CatBoost, especially for gradient boosting tasks
More complex setup and configuration process
Steeper learning curve for advanced users

Code Comparison

H2O-3 (Python):

import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()
data = h2o.import_file("dataset.csv")
model = H2ORandomForestEstimator()
model.train(x=["feature1", "feature2"], y="target", training_frame=data)

CatBoost (Python):

from catboost import CatBoostRegressor

model = CatBoostRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

H2O-3 offers a more comprehensive suite of machine learning algorithms and a user-friendly interface, making it suitable for a wider range of users and applications. However, CatBoost excels in performance and ease of use for gradient boosting tasks, with simpler setup and faster execution times. The code comparison demonstrates that H2O-3 requires more setup steps, while CatBoost offers a more straightforward implementation for specific tasks.

interpret

Pros of Interpret

Focuses on model interpretability and explainability
Supports multiple machine learning frameworks (scikit-learn, LightGBM, etc.)
Provides a unified API for various interpretability techniques

Cons of Interpret

Less optimized for performance compared to CatBoost
Smaller community and fewer contributions
Limited to interpretability, not a full-featured ML library

Code Comparison

Interpret:

from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
ebm_global = ebm.explain_global()
show(ebm_global)

CatBoost:

from catboost import CatBoostClassifier

model = CatBoostClassifier()
model.fit(X_train, y_train)
feature_importances = model.get_feature_importance()

Interpret focuses on providing interpretability tools for various models, while CatBoost is a high-performance gradient boosting library with some built-in interpretability features. Interpret offers a more comprehensive set of explanation methods, but CatBoost excels in performance and efficiency for gradient boosting tasks.
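
As a point of comparison for the interpretability features discussed above, here is a minimal sketch of permutation importance, a model-agnostic technique that works with any fitted model. It is not a description of what `get_feature_importance` or Interpret compute internally; the toy model, the data, and the use of a cyclic shift in place of a random shuffle (to keep the result deterministic) are all assumptions for illustration.

```python
def mean_squared_error(model, rows, targets):
    return sum((model(row) - t) ** 2 for row, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, feature_index):
    """Increase in error after scrambling one feature column.

    A cyclic shift stands in for a random shuffle so the result is
    deterministic; real implementations shuffle and average over repeats.
    """
    baseline = mean_squared_error(model, rows, targets)
    column = [row[feature_index] for row in rows]
    shifted = column[-1:] + column[:-1]
    scrambled = []
    for row, value in zip(rows, shifted):
        new_row = list(row)
        new_row[feature_index] = value
        scrambled.append(new_row)
    return mean_squared_error(model, scrambled, targets) - baseline


def toy_model(row):
    # Hypothetical fitted model: uses feature 0 and ignores feature 1
    return 2.0 * row[0]


rows = [[1, 9], [2, 7], [3, 8], [4, 1], [5, 3], [6, 2]]
targets = [2.0 * row[0] for row in rows]
importances = [permutation_importance(toy_model, rows, targets, i) for i in range(2)]
print(importances)  # [20.0, 0.0]
```

The feature the model ignores gets an importance of zero, while scrambling the feature it relies on raises the error sharply.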

README

Website | Documentation | Tutorials | Installation | Release Notes

CatBoost is a machine learning method based on gradient boosting over decision trees.
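
To illustrate what "gradient boosting over decision trees" means in general, the following pure-Python sketch fits one-split trees (stumps) to the residuals of a squared-error loss on a single feature. It is a teaching example under simplifying assumptions, with hypothetical function names and toy data; CatBoost's actual trees and training procedure are considerably more elaborate.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs, ys, iterations, learning_rate=0.3):
    """Fit an additive model of stumps, each trained on current residuals."""
    base = sum(ys) / len(ys)  # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(iterations):
        # For squared loss the negative gradient is the residual y - prediction
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
baseline = fit_boosting(xs, ys, iterations=0)   # constant prediction only
boosted = fit_boosting(xs, ys, iterations=20)
print(mse(baseline, xs, ys), mse(boosted, xs, ys))
```

Each iteration adds a small correction in the direction that reduces the loss, which is why the `iterations` and `learning_rate` parameters appear throughout the CatBoost examples.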

Main advantages of CatBoost:

Superior quality when compared with other GBDT libraries on many datasets.
Best-in-class prediction speed.
Support for both numerical and categorical features.
Fast GPU and multi-GPU support for training out of the box.
Visualization tools included.
Fast and reproducible distributed training with Apache Spark and CLI.

Get Started and Documentation

All CatBoost documentation is available here.

Install CatBoost by following the guide for the

Next you may want to investigate:

Tutorials
Training modes and metrics
Cross-validation
Parameters tuning
Feature importance calculation
Regular and staged predictions
CatBoost for Apache Spark videos: Introduction and Architecture

If you cannot open the documentation in your browser, try adding yastatic.net and yastat.net to the list of allowed domains in Privacy Badger.

CatBoost models in production

If you want to evaluate a CatBoost model in your application, read the model API documentation.

Questions and bug reports

To report bugs, please use the catboost/bugreport page.
Ask a question on Stack Overflow with the catboost tag; we monitor it for new questions.
Seek prompt advice in the Telegram group or the Russian-speaking Telegram chat.

Help to Make CatBoost Better

Check out open problems and help wanted issues to see what can be improved, or open an issue if you want something.
Add your stories and experience to Awesome CatBoost.
To contribute to CatBoost, first read the CLA text and state in your pull request that you agree to the terms of the CLA. More information can be found in CONTRIBUTING.md.
Instructions for contributors can be found here.

News

Reference Paper

Anna Veronika Dorogush, Andrey Gulin, Gleb Gusev, Nikita Kazeev, Liudmila Ostroumova Prokhorenkova, Aleksandr Vorobev "Fighting biases with dynamic boosting". arXiv:1706.09516, 2017.

Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin "CatBoost: gradient boosting with categorical features support". Workshop on ML Systems at NIPS 2017.

License
