catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
Top Related Projects
LightGBM: A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
XGBoost: Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
scikit-learn: machine learning in Python
FLAML: A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
H2O-3: H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Interpret: Fit interpretable models. Explain blackbox machine learning.
Quick Overview
CatBoost is a high-performance, open-source gradient boosting library developed by Yandex. It is designed for machine learning tasks on tabular data and handles categorical features efficiently without manual encoding. CatBoost uses ordered boosting and ordered target statistics to reduce prediction shift and overfitting, which makes it a strong choice for both regression and classification problems.
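The idea behind encoding a categorical value without leaking the current row's target can be sketched in a few lines. This is a simplified illustration, not CatBoost's implementation: the function name, the prior and the prior weight are assumptions made for the example, and the rows are processed in their given order, whereas CatBoost works with random permutations of the data.

```python
from collections import defaultdict


def ordered_target_statistic(categories, targets, prior, prior_weight=1.0):
    """Encode each row's category using only the targets of earlier rows."""
    sums = defaultdict(float)   # running sum of targets seen so far, per category
    counts = defaultdict(int)   # running number of rows seen so far, per category
    encoded = []
    for category, target in zip(categories, targets):
        # The current row's own target is not part of the statistic.
        value = (sums[category] + prior_weight * prior) / (counts[category] + prior_weight)
        encoded.append(value)
        sums[category] += target
        counts[category] += 1
    return encoded


categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 0, 1, 1]
encoded = ordered_target_statistic(categories, targets, prior=0.5)
print(encoded)  # [0.5, 0.5, 0.75, 0.5, 0.25]
```

The first occurrence of each category falls back to the prior, and later rows see progressively more history.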
Pros
- Excellent handling of categorical features without extensive preprocessing
- Built-in mechanisms to prevent overfitting
- Fast performance on both CPU and GPU
- Supports various loss functions and evaluation metrics
Cons
- Can be slower in training compared to some other gradient boosting libraries
- Narrower interpretability tooling than dedicated explanation libraries
- Steeper learning curve for advanced customization
- Less extensive documentation compared to some more established libraries
Code Examples
- Basic classification example:
from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = CatBoostClassifier(iterations=100, learning_rate=0.1)
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
- Handling categorical features:
import pandas as pd
from catboost import CatBoostRegressor
# Load data with categorical features
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
# Specify categorical features
cat_features = ['category1', 'category2']
# Train model
model = CatBoostRegressor(iterations=300, cat_features=cat_features)
model.fit(X, y)
- Cross-validation with an additional built-in metric:
from catboost import Pool, cv
from sklearn.datasets import load_breast_cancer
# Build a Pool from a binary classification dataset
X, y = load_breast_cancer(return_X_y=True)
train_data = Pool(X, label=y)
# cv() has no custom_metric argument; extra metrics are passed by name inside params
params = {'iterations': 500, 'learning_rate': 0.05,
          'loss_function': 'Logloss', 'custom_metric': 'F1'}
# Perform cross-validation
cv_results = cv(
    pool=train_data,
    params=params,
    fold_count=5
)
Getting Started
To get started with CatBoost, first install it using pip:
pip install catboost
Then, you can import and use CatBoost in your Python code:
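from catboost import CatBoostClassifier, Pool
# Prepare your data; load_data() is a placeholder for your own loading code
X, y = load_data()
# cat_features lists the indices of the categorical columns in X
train_pool = Pool(X, y, cat_features=[0, 1, 2])
# Initialize and train the model
model = CatBoostClassifier(iterations=300, learning_rate=0.1)
model.fit(train_pool)
# Make predictions; test_data is a placeholder with the same columns as X
predictions = model.predict(test_data)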
This basic example demonstrates how to train a CatBoost classifier and make predictions. Adjust parameters and data preparation according to your specific use case.
Competitor Comparisons
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Pros of LightGBM
- Generally faster training speed, especially on large datasets
- Lower memory usage due to its histogram-based algorithm
- Native categorical feature support, although columns must be integer-coded or of pandas category dtype
Cons of LightGBM
- Can be more prone to overfitting on small datasets
- Less robust to noisy data compared to CatBoost
- Requires more careful parameter tuning for optimal performance
Code Comparison
LightGBM:
import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)
params = {'num_leaves': 31, 'objective': 'binary'}
model = lgb.train(params, train_data, num_boost_round=100)
CatBoost:
from catboost import CatBoostClassifier
model = CatBoostClassifier(iterations=100, depth=5, learning_rate=0.1)
model.fit(X_train, y_train)
Both LightGBM and CatBoost are powerful gradient boosting frameworks, each with its strengths. LightGBM excels in training speed and memory efficiency, which makes it suitable for large-scale applications. CatBoost, on the other hand, often performs well with default parameters and accepts raw categorical columns directly. The choice between the two usually depends on the requirements of the project and the characteristics of the dataset.
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
Pros of XGBoost
- Longer history and more widespread adoption in industry and competitions
- Extensive documentation and large community support
- Efficient handling of sparse data
Cons of XGBoost
- Generally slower training times, especially on large datasets
- Less effective handling of categorical features without preprocessing
- More hyperparameters to tune, potentially requiring more expertise
Code Comparison
XGBoost:
import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
CatBoost:
from catboost import CatBoostClassifier
model = CatBoostClassifier(cat_features=cat_features)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
The main difference in usage is that CatBoost allows direct specification of categorical features, while XGBoost has traditionally required preprocessing of categorical variables (e.g., one-hot encoding) before training. CatBoost therefore tends to need less data preparation when a dataset contains categorical features.
Both libraries offer powerful gradient boosting implementations, but they have different strengths. XGBoost excels in handling sparse data and has a longer track record, while CatBoost handles categorical features without manual encoding and can train faster on some large datasets.
scikit-learn: machine learning in Python
Pros of scikit-learn
- Comprehensive library with a wide range of machine learning algorithms
- Excellent documentation and large community support
- Seamless integration with other scientific Python libraries
Cons of scikit-learn
- Generally slower performance compared to CatBoost
- Less effective handling of categorical features
- Limited support for GPU acceleration
Code Comparison
scikit-learn:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
CatBoost:
from catboost import CatBoostClassifier
clf = CatBoostClassifier(iterations=500, learning_rate=0.1)
clf.fit(X_train, y_train, cat_features=cat_features)
y_pred = clf.predict(X_test)
The code comparison shows that CatBoost accepts categorical columns directly through the cat_features argument, while scikit-learn's RandomForestClassifier expects them to be encoded numerically beforehand. Both estimators expose training parameters; the CatBoost example sets iterations and learning rate explicitly.
Both libraries provide similar ease of use for basic model training and prediction. However, CatBoost's specialized handling of categorical features and built-in performance optimizations can lead to improved results in many scenarios, especially with datasets containing categorical variables.
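As a sketch of how categorical columns are usually prepared on the scikit-learn side, the pipeline below one-hot encodes the string columns before they reach the random forest. The toy data and column names are assumptions made for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with two string-valued categorical columns.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue", "red", "green"],
    "size": ["S", "M", "L", "M", "S", "L", "L", "S"],
    "weight": [1.0, 2.5, 3.1, 2.4, 0.9, 3.3, 2.9, 1.1],
})
y = [0, 1, 1, 1, 0, 1, 1, 0]

# Encode the categorical columns and pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["color", "size"])],
    remainder="passthrough",
)
clf = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
clf.fit(df, y)
y_pred = clf.predict(df)
print(y_pred)
```

Keeping the encoder inside the pipeline means the same transformation is applied at prediction time.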
A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
Pros of FLAML
- Automated machine learning (AutoML) framework supporting multiple algorithms
- Efficient hyperparameter tuning with budget-aware optimization
- Lightweight and easy to integrate into existing ML pipelines
Cons of FLAML
- Less specialized for gradient boosting compared to CatBoost
- May require more setup and configuration for specific use cases
- Potentially slower for large datasets due to its multi-algorithm approach
Code Comparison
FLAML:
from flaml import AutoML
automl = AutoML()
automl.fit(X_train, y_train, task="classification")
predictions = automl.predict(X_test)
CatBoost:
from catboost import CatBoostClassifier
model = CatBoostClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Key Differences
- FLAML offers a more general-purpose AutoML solution, while CatBoost specializes in gradient boosting
- CatBoost provides built-in support for categorical features, whereas FLAML relies on preprocessing
- FLAML's AutoML approach may be more suitable for users who want to explore multiple algorithms, while CatBoost is ideal for those focused on gradient boosting performance
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Pros of H2O-3
- Supports a wider range of algorithms and models, including deep learning
- Offers a user-friendly web interface for non-programmers
- Provides distributed computing capabilities for handling large datasets
Cons of H2O-3
- Generally slower performance compared to CatBoost, especially for gradient boosting tasks
- More complex setup and configuration process
- Steeper learning curve for advanced usage
Code Comparison
H2O-3 (Python):
import h2o
from h2o.estimators import H2ORandomForestEstimator
h2o.init()
data = h2o.import_file("dataset.csv")
model = H2ORandomForestEstimator()
model.train(x=["feature1", "feature2"], y="target", training_frame=data)
CatBoost (Python):
from catboost import CatBoostRegressor
model = CatBoostRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
H2O-3 offers a more comprehensive suite of machine learning algorithms and a user-friendly interface, making it suitable for a wider range of users and applications. However, CatBoost excels in performance and ease of use for gradient boosting tasks, with simpler setup and faster execution times. The code comparison demonstrates that H2O-3 requires more setup steps, while CatBoost offers a more straightforward implementation for specific tasks.
Fit interpretable models. Explain blackbox machine learning.
Pros of Interpret
- Focuses on model interpretability and explainability
- Supports multiple machine learning frameworks (scikit-learn, LightGBM, etc.)
- Provides a unified API for various interpretability techniques
Cons of Interpret
- Less optimized for performance compared to CatBoost
- Smaller community and fewer contributions
- Limited to interpretability, not a full-featured ML library
Code Comparison
Interpret:
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
ebm_global = ebm.explain_global()
show(ebm_global)
CatBoost:
from catboost import CatBoostClassifier
model = CatBoostClassifier()
model.fit(X_train, y_train)
feature_importances = model.get_feature_importance()
Interpret focuses on providing interpretability tools for various models, while CatBoost is a high-performance gradient boosting library with some built-in interpretability features. Interpret offers a more comprehensive set of explanation methods, but CatBoost excels in performance and efficiency for gradient boosting tasks.
README
Website | Documentation | Tutorials | Installation | Release Notes
CatBoost is a machine learning method based on gradient boosting over decision trees.
Main advantages of CatBoost:
- Superior quality when compared with other GBDT libraries on many datasets.
- Best in class prediction speed.
- Support for both numerical and categorical features.
- Fast GPU and multi-GPU support for training out of the box.
- Visualization tools included.
- Fast and reproducible distributed training with Apache Spark and CLI.
Get Started and Documentation
All CatBoost documentation is available here.
Install CatBoost by following the guide for the
Next you may want to investigate:
- Tutorials
- Training modes and metrics
- Cross-validation
- Parameters tuning
- Feature importance calculation
- Regular and staged predictions
- CatBoost for Apache Spark videos: Introduction and Architecture
If you cannot open the documentation in your browser, try adding yastatic.net and yastat.net to the list of allowed domains in Privacy Badger.
CatBoost models in production
If you want to evaluate a CatBoost model in your application, read the model API documentation.
Questions and bug reports
- For reporting bugs please use the catboost/bugreport page.
- Ask a question on Stack Overflow with the catboost tag; we monitor it for new questions.
- Seek prompt advice in the Telegram group or the Russian-speaking Telegram chat.
Help to Make CatBoost Better
- Check out open problems and help wanted issues to see what can be improved, or open an issue if you want something.
- Add your stories and experience to Awesome CatBoost.
- To contribute to CatBoost, first read the CLA text and state in your pull request that you agree to the terms of the CLA. More information can be found in CONTRIBUTING.md.
- Instructions for contributors can be found here.
News
The latest news is published on Twitter.
Reference Paper
Anna Veronika Dorogush, Andrey Gulin, Gleb Gusev, Nikita Kazeev, Liudmila Ostroumova Prokhorenkova, Aleksandr Vorobev "Fighting biases with dynamic boosting". arXiv:1706.09516, 2017.
Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin "CatBoost: gradient boosting with categorical features support". Workshop on ML Systems at NIPS 2017.
License
© YANDEX LLC, 2017-2024. Licensed under the Apache License, Version 2.0. See LICENSE file for more details.