
catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

8,368 stars · 1,216 forks

Top Related Projects

LightGBM (17,161 stars)

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

XGBoost (26,866 stars)

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow.

scikit-learn

scikit-learn: machine learning in Python.

FLAML (4,115 stars)

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.

H2O-3 (7,129 stars)

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Interpret

Fit interpretable models. Explain blackbox machine learning.

Quick Overview

CatBoost is a high-performance, open-source gradient boosting library developed by Yandex. It is designed for machine learning tasks, particularly for handling categorical features efficiently. CatBoost implements novel techniques to combat prediction shift and overfitting, making it a powerful tool for both regression and classification problems.
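
One of those techniques is ordered boosting, which can be requested explicitly through the boosting_type parameter. A minimal sketch (the parameter values are illustrative):

from catboost import CatBoostClassifier

# 'Ordered' enables ordered boosting, CatBoost's remedy for prediction shift;
# 'Plain' is the classic GBDT scheme and is often faster on large datasets
model = CatBoostClassifier(boosting_type='Ordered', iterations=200)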

Pros

  • Excellent handling of categorical features without extensive preprocessing
  • Built-in mechanisms to prevent overfitting
  • Fast performance on both CPU and GPU (see the sketch after this list)
  • Supports various loss functions and evaluation metrics
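
The CPU/GPU point is easy to exercise. A minimal, hypothetical sketch (X_train, y_train, X_val, y_val are assumed to exist, and task_type='GPU' requires a CUDA-capable device):

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=1000,
    task_type='GPU',  # drop this line to train on CPU
    devices='0',      # which GPU(s) to use
)
# early_stopping_rounds is one of the built-in overfitting guards
model.fit(X_train, y_train, eval_set=(X_val, y_val),
          early_stopping_rounds=50, verbose=False)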

Cons

  • Can be slower in training compared to some other gradient boosting libraries
  • Interpretability tooling less extensive than that of dedicated explanation libraries
  • Steeper learning curve for advanced customization
  • Less extensive documentation compared to some more established libraries

Code Examples

  1. Basic classification example:
from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = CatBoostClassifier(iterations=100, learning_rate=0.1)
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)
  2. Handling categorical features:
import pandas as pd
from catboost import CatBoostRegressor

# Load data with categorical features
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Specify categorical features
cat_features = ['category1', 'category2']

# Train model
model = CatBoostRegressor(iterations=300, cat_features=cat_features)
model.fit(X, y)
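
For inference, new rows must carry the same columns. A hypothetical follow-up (new_data.csv is assumed to share the schema of data.csv):

# Predict on fresh data that keeps the same categorical columns
new_df = pd.read_csv('new_data.csv')
predictions = model.predict(new_df[X.columns])
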
  3. Cross-validation with a custom metric:
from catboost import Pool, cv

# A binary classification dataset (X, y) is assumed here
train_pool = Pool(X, y)

# catboost's cv expects metric names (e.g. 'F1'), not Python callables,
# so the extra metric is requested via the custom_metric parameter
cv_results = cv(
    pool=train_pool,
    params={
        'loss_function': 'Logloss',
        'custom_metric': ['F1'],
        'iterations': 500,
        'learning_rate': 0.05,
    },
    fold_count=5,
)
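
With the default as_pandas=True, cv returns a pandas DataFrame holding one row per iteration and mean/std columns for each tracked metric (for example test-Logloss-mean and test-F1-mean here), so a good iteration count can be read off directly.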

Getting Started

To get started with CatBoost, first install it using pip:

pip install catboost
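
The package is also published on conda-forge, so conda users can install it with:

conda install -c conda-forge catboost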

Then, you can import and use CatBoost in your Python code:

from catboost import CatBoostClassifier, Pool

# Prepare your data; load_data() is a placeholder for your own loading routine
X, y = load_data()
train_pool = Pool(X, y, cat_features=[0, 1, 2])  # columns 0-2 are categorical

# Initialize and train the model
model = CatBoostClassifier(iterations=300, learning_rate=0.1)
model.fit(train_pool)

# Make predictions (test_data must have the same columns as X)
predictions = model.predict(test_data)

This basic example demonstrates how to train a CatBoost classifier and make predictions. Adjust parameters and data preparation according to your specific use case.
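
Trained models can also be persisted and reloaded. A minimal sketch (the file name is arbitrary):

# Save to CatBoost's native format and load it back later
model.save_model('model.cbm')

restored = CatBoostClassifier()
restored.load_model('model.cbm')
predictions = restored.predict(test_data)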

Competitor Comparisons

LightGBM (17,161 stars)

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Pros of LightGBM

  • Generally faster training speed, especially on large datasets
  • Lower memory usage due to its histogram-based algorithm
  • Native handling of categorical features without one-hot encoding

Cons of LightGBM

  • Can be more prone to overfitting on small datasets
  • Less robust to noisy data compared to CatBoost
  • Requires more careful parameter tuning for optimal performance

Code Comparison

LightGBM:

import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)
params = {'num_leaves': 31, 'objective': 'binary'}
model = lgb.train(params, train_data, num_boost_round=100)

CatBoost:

from catboost import CatBoostClassifier
model = CatBoostClassifier(iterations=100, depth=5, learning_rate=0.1)
model.fit(X_train, y_train)

Both LightGBM and CatBoost are powerful gradient boosting frameworks, each with its strengths. LightGBM excels in speed and efficiency, making it suitable for large-scale applications. CatBoost, on the other hand, offers better out-of-the-box performance and handles categorical features more elegantly. The choice between the two often depends on the specific requirements of the project and the characteristics of the dataset.
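
On the categorical-features point, the practical difference is input format. A hedged sketch (the column name is hypothetical): LightGBM wants the pandas category dtype or integer codes, while CatBoost accepts raw strings.

import lightgbm as lgb

# LightGBM expects categoricals as the pandas 'category' dtype (or int codes)
X_train['category1'] = X_train['category1'].astype('category')
train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=['category1'])
model = lgb.train({'objective': 'binary', 'num_leaves': 31},
                  train_data, num_boost_round=100)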

XGBoost (26,866 stars)

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

Pros of XGBoost

  • Longer history and more widespread adoption in industry and competitions
  • Extensive documentation and large community support
  • Efficient handling of sparse data

Cons of XGBoost

  • Generally slower training times, especially on large datasets
  • Less effective handling of categorical features without preprocessing
  • More hyperparameters to tune, potentially requiring more expertise

Code Comparison

XGBoost:

import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

CatBoost:

from catboost import CatBoostClassifier
model = CatBoostClassifier(cat_features=cat_features)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

The main difference in usage is that CatBoost allows direct specification of categorical features, while XGBoost typically requires preprocessing of categorical variables (e.g., one-hot encoding) before training. CatBoost's syntax is generally simpler and requires less data preparation, especially when dealing with categorical features.
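
A hedged sketch of that preprocessing step (df and the column names are hypothetical); note that recent XGBoost releases also offer experimental native categorical support via enable_categorical:

import pandas as pd
import xgboost as xgb

# One-hot encode the string columns before handing the matrix to XGBoost
X_encoded = pd.get_dummies(df.drop('target', axis=1),
                           columns=['category1', 'category2'])
model = xgb.XGBClassifier()
model.fit(X_encoded, df['target'])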

Both libraries offer powerful gradient boosting implementations, but they have different strengths. XGBoost excels in handling sparse data and has a longer track record, while CatBoost offers faster training on large datasets and better out-of-the-box handling of categorical features.

scikit-learn: machine learning in Python

Pros of scikit-learn

  • Comprehensive library with a wide range of machine learning algorithms
  • Excellent documentation and large community support
  • Seamless integration with other scientific Python libraries

Cons of scikit-learn

  • Generally slower performance compared to CatBoost
  • Less effective handling of categorical features
  • Limited support for GPU acceleration

Code Comparison

scikit-learn:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

CatBoost:

from catboost import CatBoostClassifier
clf = CatBoostClassifier(iterations=500, learning_rate=0.1)
clf.fit(X_train, y_train, cat_features=cat_features)
y_pred = clf.predict(X_test)

The code comparison shows that CatBoost accepts categorical features directly via cat_features, while scikit-learn estimators expect purely numeric input, so categorical variables must be encoded before training. CatBoost also offers more fine-tuned control over the training process with parameters like iterations and learning rate. A sketch of the encoding step follows.
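
A minimal sketch using scikit-learn's own tools (cat_cols, naming the categorical columns, is an assumption):

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns; numeric columns pass through
preprocess = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)],
    remainder='passthrough',
)
clf = make_pipeline(preprocess, RandomForestClassifier())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)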

Both libraries provide similar ease of use for basic model training and prediction. However, CatBoost's specialized handling of categorical features and built-in performance optimizations can lead to improved results in many scenarios, especially with datasets containing categorical variables.

FLAML (4,115 stars)

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.

Pros of FLAML

  • Automated machine learning (AutoML) framework supporting multiple algorithms
  • Efficient hyperparameter tuning with budget-aware optimization
  • Lightweight and easy to integrate into existing ML pipelines

Cons of FLAML

  • Less specialized for gradient boosting compared to CatBoost
  • May require more setup and configuration for specific use cases
  • Potentially slower for large datasets due to its multi-algorithm approach

Code Comparison

FLAML:

from flaml import AutoML
automl = AutoML()
automl.fit(X_train, y_train, task="classification")
predictions = automl.predict(X_test)

CatBoost:

from catboost import CatBoostClassifier
model = CatBoostClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Key Differences

  • FLAML offers a more general-purpose AutoML solution, while CatBoost specializes in gradient boosting
  • CatBoost provides built-in support for categorical features, whereas FLAML relies on preprocessing
  • FLAML's AutoML approach may be more suitable for users who want to explore multiple algorithms, while CatBoost is ideal for those focused on gradient boosting performance; the two can even be combined, as sketched below
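
A hedged sketch of that combination (it assumes FLAML is installed with its catboost extra):

from flaml import AutoML

automl = AutoML()
automl.fit(X_train, y_train, task='classification',
           estimator_list=['catboost'],  # tune only CatBoost models
           time_budget=60)               # search budget in seconds
predictions = automl.predict(X_test)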

H2O-3 (7,129 stars)

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Pros of H2O-3

  • Supports a wider range of algorithms and models, including deep learning
  • Offers a user-friendly web interface for non-programmers
  • Provides distributed computing capabilities for handling large datasets

Cons of H2O-3

  • Generally slower performance compared to CatBoost, especially for gradient boosting tasks
  • More complex setup and configuration process
  • Steeper learning curve for advanced use cases

Code Comparison

H2O-3 (Python):

import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()
data = h2o.import_file("dataset.csv")
model = H2ORandomForestEstimator()
model.train(x=["feature1", "feature2"], y="target", training_frame=data)

CatBoost (Python):

from catboost import CatBoostRegressor

model = CatBoostRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

H2O-3 offers a more comprehensive suite of machine learning algorithms and a user-friendly interface, making it suitable for a wider range of users and applications. However, CatBoost excels in performance and ease of use for gradient boosting tasks, with simpler setup and faster execution times. The code comparison demonstrates that H2O-3 requires more setup steps, while CatBoost offers a more straightforward implementation for specific tasks.

Interpret

Fit interpretable models. Explain blackbox machine learning.

Pros of Interpret

  • Focuses on model interpretability and explainability
  • Supports multiple machine learning frameworks (scikit-learn, LightGBM, etc.)
  • Provides a unified API for various interpretability techniques

Cons of Interpret

  • Less optimized for performance compared to CatBoost
  • Smaller community and fewer contributions
  • Limited to interpretability, not a full-featured ML library

Code Comparison

Interpret:

from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
ebm_global = ebm.explain_global()
show(ebm_global)

CatBoost:

from catboost import CatBoostClassifier

model = CatBoostClassifier()
model.fit(X_train, y_train)
feature_importances = model.get_feature_importance()

Interpret focuses on providing interpretability tools for various models, while CatBoost is a high-performance gradient boosting library with some built-in interpretability features. Interpret offers a more comprehensive set of explanation methods, but CatBoost excels in performance and efficiency for gradient boosting tasks.
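
For instance, CatBoost can produce SHAP values directly. A minimal sketch (train_pool is assumed to be the Pool the model was fitted on):

from catboost import Pool

# Per-sample SHAP values; the extra last column is the expected (base) value
shap_values = model.get_feature_importance(train_pool, type='ShapValues')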


README

![CatBoost logo](http://storage.mds.yandex.net/get-devtools-opensource/250854/catboost-logo.png)

Website | Documentation | Tutorials | Installation | Release Notes


CatBoost is a machine learning method based on gradient boosting over decision trees.

Main advantages of CatBoost:

  • Native support for categorical features without extensive preprocessing
  • Built-in mechanisms to reduce overfitting
  • Fast training and inference on both CPU and GPU
  • A wide range of supported loss functions and evaluation metrics

Get Started and Documentation

All CatBoost documentation is available here.

Install CatBoost by following the installation guide for the Python package, the R package, or the command-line version.

Next, you may want to investigate the tutorials and the parameter documentation.

If you cannot open the documentation in your browser, try adding yastatic.net and yastat.net to the list of allowed domains in Privacy Badger.

CatBoost models in production

If you want to evaluate a CatBoost model in your application, read the model API documentation.

Questions and bug reports

Help to Make CatBoost Better

  • Check out open problems and help wanted issues to see what can be improved, or open an issue if you want something.
  • Add your stories and experience to Awesome CatBoost.
  • To contribute to CatBoost you first need to read the CLA text and state in your pull request that you agree to the terms of the CLA. More information can be found in CONTRIBUTING.md.
  • Instructions for contributors can be found here.

News

The latest news is published on Twitter.

Reference Paper

Anna Veronika Dorogush, Andrey Gulin, Gleb Gusev, Nikita Kazeev, Liudmila Ostroumova Prokhorenkova, Aleksandr Vorobev "Fighting biases with dynamic boosting". arXiv:1706.09516, 2017.

Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin "CatBoost: gradient boosting with categorical features support". Workshop on ML Systems at NIPS 2017.

License

© YANDEX LLC, 2017-2024. Licensed under the Apache License, Version 2.0. See LICENSE file for more details.