scikit-learn

scikit-learn: machine learning in Python


Top Related Projects

tensorflow

190,523

An Open Source Machine Learning Framework for Everyone

pytorch

91,080

Tensors and Dynamic neural networks in Python with strong GPU acceleration

LightGBM

17,161

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

xgboost

26,866

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

catboost

8,368

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

keras

63,156

Deep Learning for humans

Quick Overview

Scikit-learn is a popular open-source machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and more. Scikit-learn is designed to be efficient, scalable, and easy to use, making it a go-to choice for both beginners and experienced data scientists.

Pros

Comprehensive Algorithms: Scikit-learn provides a wide range of state-of-the-art machine learning algorithms, covering a diverse set of tasks and use cases.
Ease of Use: The library has a user-friendly API and excellent documentation, making it accessible for both novice and experienced users.
Performance: Scikit-learn builds on NumPy and SciPy and implements performance-critical routines in Cython, so most estimators run efficiently on a single machine for small to medium-sized datasets.
Active Community: The project has a large and active community of contributors, ensuring regular updates, bug fixes, and new feature additions.
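
As a rough illustration of the uniform API mentioned above, the sketch below trains three unrelated model families with the same fit and score calls. The dataset sizes, seeds, and the particular estimators are arbitrary choices made for this example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; sample count, feature count, and seed are illustrative
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Three different model families, all driven by the same two method calls
estimators = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(random_state=0),
]

scores = {}
for est in estimators:
    est.fit(X_train, y_train)                               # same training call
    scores[type(est).__name__] = est.score(X_test, y_test)  # mean accuracy

for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

Because every estimator exposes the same methods, swapping one model for another usually means changing a single line.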

Cons

Limited Deep Learning Support: While Scikit-learn is excellent for traditional machine learning tasks, it has limited support for deep learning compared to specialized libraries like TensorFlow or PyTorch.
Breadth Can Overwhelm Beginners: Individual estimators are simple to use, but the number of algorithms, preprocessing tools, and hyperparameters means newcomers need time to learn which ones fit a given problem.
Lack of Interpretability: Some of the more complex models in Scikit-learn, such as random forests and gradient boosting, can be difficult to interpret, which can be a drawback in certain applications.
Dependency on Other Libraries: Scikit-learn relies on other scientific computing libraries like NumPy and SciPy, which can add complexity for users who are not familiar with the Python data science ecosystem.
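
On the interpretability point, scikit-learn does ship model-inspection helpers that can partly offset it. The following is a minimal sketch using permutation importance on a random forest. The synthetic dataset, in which only the first two features carry signal because shuffle=False is passed, is an assumption made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False the informative features are the first two columns
X, y = make_regression(
    n_samples=400, n_features=6, n_informative=2, shuffle=False, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature in turn on held-out data and measure the score drop
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)

# Feature indices sorted from most to least important
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f}")
```

Permutation importance describes how much the fitted model relies on each feature. It does not by itself explain individual predictions.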

Code Examples

Here are a few code examples demonstrating the usage of Scikit-learn:

Classification with Support Vector Machines (SVM):

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Generate sample data
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an SVM classifier
clf = SVC(kernel='rbf', C=1.0)
clf.fit(X_train, y_train)

# Evaluate the model on the test set
accuracy = clf.score(X_test, y_test)
print(f'Accuracy: {accuracy:.2f}')

Clustering with K-Means:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate sample data
X, y = make_blobs(n_samples=500, centers=4, n_features=2, random_state=42)

# Perform K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)

# Visualize the clustering results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', s=200, c='red')
plt.title('K-Means Clustering')
plt.show()

Regression with Random Forest:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Generate sample data
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# Split the data into training and testing sets
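X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest regressor (hyperparameters here are illustrative)
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)

# Evaluate the model on the test set (score returns R^2 for regressors)
r2 = reg.score(X_test, y_test)
print(f'R^2: {r2:.2f}')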

Competitor Comparisons

tensorflow

190,523

An Open Source Machine Learning Framework for Everyone

Pros of TensorFlow

More powerful for deep learning and neural networks
Better support for distributed computing and GPU acceleration
Flexible ecosystem with tools like TensorBoard for visualization

Cons of TensorFlow

Steeper learning curve and more complex API
More overhead than scikit-learn for small, classical machine learning tasks on tabular data
Larger library size and longer setup time

Code Comparison

TensorFlow example (neural network):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy')

scikit-learn example (random forest):

from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, and X_test are assumed to come from an earlier train_test_split
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

TensorFlow is better suited for complex deep learning tasks, while scikit-learn excels in traditional machine learning algorithms. TensorFlow offers more flexibility and power but requires more expertise, whereas scikit-learn provides a simpler, more intuitive API for quick prototyping and smaller-scale projects. The choice between the two depends on the specific requirements of your machine learning task and your level of expertise.

pytorch

91,080

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Pros of PyTorch

More flexible and dynamic computational graph
Better support for GPU acceleration and distributed computing
Easier to debug and understand due to its pythonic nature

Cons of PyTorch

Steeper learning curve for beginners
Fewer ready-made estimators for classical machine learning; most models have to be assembled from layers and a hand-written training loop
Less suitable for traditional machine learning tasks

Code Comparison

scikit-learn:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

PyTorch:

import torch.nn as nn

class Net(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        # A single fully connected layer; sizes are supplied by the caller
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, x):
        return self.fc(x)

scikit-learn is more concise for traditional machine learning tasks, while PyTorch offers more flexibility for deep learning and custom model architectures. scikit-learn provides a higher-level API, making it easier for beginners and quick prototyping. PyTorch's lower-level API allows for more control over the model's internals and computation, which is beneficial for research and complex deep learning projects.

LightGBM

17,161

Pros of LightGBM

Faster training speed and higher efficiency, especially for large datasets
Lower memory usage due to its histogram-based algorithm
Better accuracy in many scenarios, particularly for categorical features

Cons of LightGBM

Less extensive documentation and community support compared to scikit-learn
Steeper learning curve for beginners due to more hyperparameters
Not as versatile for general machine learning tasks beyond gradient boosting

Code Comparison

LightGBM:

import lightgbm as lgb
model = lgb.LGBMClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

scikit-learn:

from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Both libraries offer similar ease of use for basic implementation. However, LightGBM provides more advanced options for fine-tuning performance, while scikit-learn offers a wider range of algorithms and preprocessing tools within a single package.

xgboost

26,866

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

Pros of XGBoost

Faster training and prediction times for large datasets
Better handling of missing values and categorical features
Generally achieves higher accuracy on a wide range of problems

Cons of XGBoost

Less intuitive for beginners compared to scikit-learn's API
Requires more hyperparameter tuning to achieve optimal performance
Focused on gradient boosting (tree and linear boosters), while scikit-learn offers a broader range of algorithms

Code Comparison

XGBoost:

import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

scikit-learn:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Both libraries offer similar high-level APIs for model training and prediction. However, XGBoost provides more advanced features and parameters for fine-tuning gradient boosting models, while scikit-learn offers a wider variety of algorithms and a more consistent API across different model types.

XGBoost is generally preferred for competitions and when maximum performance is required, while scikit-learn is often chosen for its ease of use, extensive documentation, and broader range of algorithms for various machine learning tasks.
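
Hyperparameter tuning for boosted models can be handled with scikit-learn's own search utilities. The sketch below runs a small grid search over a GradientBoostingClassifier. The grid values, fold count, and dataset are assumptions chosen to keep the example fast, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search finishes quickly
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values are illustrative, not recommended defaults
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")
```

XGBoost's XGBClassifier follows the scikit-learn estimator interface, so it can typically be passed to GridSearchCV in the same way.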

catboost

8,368

Pros of CatBoost

Handles categorical features automatically without preprocessing
Generally faster training and prediction times, especially on GPU
Often achieves better performance out-of-the-box on datasets with categorical features

Cons of CatBoost

Less flexibility and customization options compared to scikit-learn
Smaller community and ecosystem of extensions/plugins
Limited to gradient boosting algorithms, while scikit-learn offers a wide range of ML algorithms

Code Comparison

scikit-learn:

from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

CatBoost:

from catboost import CatBoostClassifier
# cat_features is assumed to be a list of categorical column names or indices
model = CatBoostClassifier(cat_features=cat_features)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

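The main difference in usage is that CatBoost accepts categorical features directly through its cat_features parameter, while scikit-learn's GradientBoostingClassifier requires categorical variables to be encoded (e.g., with one-hot encoding) before training. scikit-learn's HistGradientBoostingClassifier can handle categorical features natively through its categorical_features parameter. CatBoost's API is designed to be similar to scikit-learn's for ease of use, with some additional parameters specific to its implementation.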
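
The encoding step on the scikit-learn side can be sketched with a ColumnTransformer inside a Pipeline. The tiny DataFrame, its column names, and the labels are hypothetical and exist only to make the example self-contained. pandas is assumed to be installed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column and one numeric column
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green", "red", "blue"],
    "size": [1.0, 2.5, 3.1, 2.2, 0.9, 3.3, 1.2, 2.4],
})
y = [0, 1, 1, 1, 0, 1, 0, 1]

# One-hot encode "color"; pass the numeric column through unchanged
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ],
    remainder="passthrough",
)

# Bundling preprocessing with the model applies the same encoding at predict time
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", GradientBoostingClassifier(random_state=0)),
])
model.fit(df, y)
predictions = model.predict(df)
print(predictions)
```

Keeping the encoder inside the pipeline avoids fitting it separately and reduces the risk of train/test leakage during cross-validation.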

keras

63,156

Deep Learning for humans

Pros of Keras

Higher-level API, making it easier to build and experiment with neural networks
Better suited for deep learning tasks and complex neural network architectures
Supports multiple backend engines (TensorFlow, JAX, and PyTorch as of Keras 3; earlier releases supported TensorFlow, Theano, and CNTK)

Cons of Keras

Less flexible for non-neural network machine learning tasks
Slower execution compared to lower-level libraries
Requires understanding neural network concepts (layers, loss functions, optimizers) that scikit-learn estimators hide behind fit and predict

Code Comparison

Keras:

from keras import Input
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Input(shape=(100,)))  # 100 input features
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

Scikit-learn:

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(hidden_layer_sizes=(64,), activation='relu')
model.fit(X_train, y_train)

Summary

Keras is better suited for deep learning and complex neural network architectures, while Scikit-learn offers a broader range of machine learning algorithms and is more flexible for general-purpose tasks. Keras provides a high-level API for building and experimenting with neural networks, but using it well still requires understanding concepts such as layers, loss functions, and optimizers. Scikit-learn's MLPClassifier wraps those details behind the usual fit/predict interface, which is more approachable for beginners and adequate for small models, although it does not use GPUs.
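
To make the scikit-learn side of this comparison concrete, here is a minimal end-to-end sketch with MLPClassifier. The synthetic dataset, the scaling step, and the max_iter value are assumptions for illustration. Scaling is included because multilayer perceptrons are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data; sizes and seed are arbitrary
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize features, then train a one-hidden-layer network
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(64,),
        activation="relu",
        max_iter=500,
        random_state=0,
    ),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
```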


README

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the `About us <https://scikit-learn.org/dev/about.html#authors>`__ page for a list of core contributors.

It is currently maintained by a team of volunteers.

Website: https://scikit-learn.org

Installation

Dependencies


scikit-learn requires:

- Python (>= |PythonMinVersion|)
- NumPy (>= |NumPyMinVersion|)
- SciPy (>= |SciPyMinVersion|)
- joblib (>= |JoblibMinVersion|)
- threadpoolctl (>= |ThreadpoolctlMinVersion|)

Scikit-learn plotting capabilities (i.e., functions whose names start with
``plot_`` and classes whose names end with ``Display``) require Matplotlib
(>= |MatplotlibMinVersion|). Running the examples also requires Matplotlib
>= |MatplotlibMinVersion|. A few examples require scikit-image >=
|Scikit-ImageMinVersion|, a few require pandas >= |PandasMinVersion|, and
some require seaborn >= |SeabornMinVersion| and plotly >= |PlotlyMinVersion|.

User installation

If you already have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip::

    pip install -U scikit-learn

or conda::

    conda install -c conda-forge scikit-learn

The documentation includes more detailed `installation instructions <https://scikit-learn.org/stable/install.html>`_.
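
After installing, a quick way to confirm that the package imports correctly is to print its version from Python. This is only a sanity-check sketch. show_versions() additionally lists the versions of the dependencies named above.

```python
import sklearn

# Version of the installed scikit-learn package
print(sklearn.__version__)

# Prints system, Python, and dependency version information
sklearn.show_versions()
```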

Changelog

See the `changelog <https://scikit-learn.org/dev/whats_new.html>`__ for a history of notable changes to scikit-learn.

Development

We welcome new contributors of all experience levels. The scikit-learn community goals are to be helpful, welcoming, and effective. The `Development Guide <https://scikit-learn.org/stable/developers/index.html>`_ has detailed information about contributing code, documentation, tests, and more. We've included some basic information in this README.

Important links


- Official source code repo: https://github.com/scikit-learn/scikit-learn
- Download releases: https://pypi.org/project/scikit-learn/
- Issue tracker: https://github.com/scikit-learn/scikit-learn/issues

Source code
~~~~~~~~~~~

You can check the latest sources with the command::

    git clone https://github.com/scikit-learn/scikit-learn.git

Contributing
~~~~~~~~~~~~

To learn more about making a contribution to scikit-learn, please see our
`Contributing guide
<https://scikit-learn.org/dev/developers/contributing.html>`_.

Testing
~~~~~~~

After installation, you can launch the test suite from outside the source
directory (you will need to have ``pytest`` >= |PyTestMinVersion| installed)::

    pytest sklearn

See the web page https://scikit-learn.org/dev/developers/contributing.html#testing-and-improving-test-coverage
for more information.

    Random number generation can be controlled during testing by setting
    the ``SKLEARN_SEED`` environment variable.

Submitting a Pull Request

Before opening a Pull Request, have a look at the full Contributing page to make sure your code complies with our guidelines: https://scikit-learn.org/stable/developers/index.html

Project History

The project is currently maintained by a team of volunteers.

Note: scikit-learn was previously referred to as scikits.learn.

Help and Support

Documentation


- HTML documentation (stable release): https://scikit-learn.org
- HTML documentation (development version): https://scikit-learn.org/dev/
- FAQ: https://scikit-learn.org/stable/faq.html

Communication

Main Channels
^^^^^^^^^^^^^

- Website: https://scikit-learn.org
- Blog: https://blog.scikit-learn.org
- Mailing list: https://mail.python.org/mailman/listinfo/scikit-learn

Developer & Support
^^^^^^^^^^^^^^^^^^^

- GitHub Discussions: https://github.com/scikit-learn/scikit-learn/discussions
- Stack Overflow: https://stackoverflow.com/questions/tagged/scikit-learn
- Discord: https://discord.gg/h9qyrK8Jc8

Social Media Platforms
^^^^^^^^^^^^^^^^^^^^^^

- LinkedIn: https://www.linkedin.com/company/scikit-learn
- YouTube: https://www.youtube.com/channel/UCJosFjYm0ZYVUARxuOZqnnw/playlists
- Facebook: https://www.facebook.com/scikitlearnofficial/
- Instagram: https://www.instagram.com/scikitlearnofficial/
- TikTok: https://www.tiktok.com/@scikit.learn
- Bluesky: https://bsky.app/profile/scikit-learn.org
- Mastodon: https://mastodon.social/@sklearn@fosstodon.org

Resources
^^^^^^^^^

- Calendar: https://blog.scikit-learn.org/calendar/
- Logos & Branding: https://github.com/scikit-learn/scikit-learn/tree/main/doc/logos

Citation


If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn
