Top Related Projects
- TensorFlow: An Open Source Machine Learning Framework for Everyone
- PyTorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
- LightGBM: A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
- XGBoost: Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
- CatBoost: A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
- Keras: Deep Learning for humans
Quick Overview
Scikit-learn is a popular open-source machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and more. Scikit-learn is designed to be efficient, scalable, and easy to use, making it a go-to choice for both beginners and experienced data scientists.
Pros
- Comprehensive Algorithms: Scikit-learn provides a wide range of state-of-the-art machine learning algorithms, covering a diverse set of tasks and use cases.
- Ease of Use: The library has a user-friendly API and excellent documentation, making it accessible for both novice and experienced users.
- Performance: Scikit-learn is built on top of efficient numerical libraries like NumPy and SciPy, which keeps its core routines fast for data that fits in memory on a single machine.
- Active Community: The project has a large and active community of contributors, ensuring regular updates, bug fixes, and new feature additions.
Cons
- Limited Deep Learning Support: While Scikit-learn is excellent for traditional machine learning tasks, it has limited support for deep learning compared to specialized libraries like TensorFlow or PyTorch.
- Overwhelming Breadth for Beginners: Although the API itself is simple, the number of algorithms and options in Scikit-learn can be overwhelming at first, and learning to choose among them takes time.
- Lack of Interpretability: Some of the more complex models in Scikit-learn, such as random forests and gradient boosting, can be difficult to interpret, which can be a drawback in certain applications.
- Dependency on Other Libraries: Scikit-learn relies on other scientific computing libraries like NumPy and SciPy, which can add complexity for users who are not familiar with the Python data science ecosystem.
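On the interpretability point above, scikit-learn does provide some tools for inspecting tree ensembles. The following is a minimal sketch, not a full interpretability workflow: it compares the impurity-based `feature_importances_` of a random forest with `permutation_importance` from `sklearn.inspection`. The synthetic dataset and all parameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 8 features, 3 of them informative (illustrative values)
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importances, derived from the fitted trees (training data)
impurity_importances = forest.feature_importances_

# Permutation importances: mean drop in test score when one feature is shuffled
result = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=0)
perm_importances = result.importances_mean

for i, (imp, perm) in enumerate(zip(impurity_importances, perm_importances)):
    print(f"feature {i}: impurity={imp:.3f}, permutation={perm:.3f}")
```

Permutation importance is computed on held-out data here, so it tends to be less biased toward high-cardinality features than the impurity-based values, at the cost of extra prediction passes.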
Code Examples
Here are a few code examples demonstrating the usage of Scikit-learn:
- Classification with Support Vector Machines (SVM):
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# Generate sample data
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=42)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train an SVM classifier
clf = SVC(kernel='rbf', C=1.0)
clf.fit(X_train, y_train)
# Evaluate the model on the test set
accuracy = clf.score(X_test, y_test)
print(f'Accuracy: {accuracy:.2f}')
- Clustering with K-Means:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Generate sample data
X, y = make_blobs(n_samples=500, centers=4, n_features=2, random_state=42)
# Perform K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
# Visualize the clustering results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', s=200, c='red')
plt.title('K-Means Clustering')
plt.show()
- Regression with Random Forest:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
# Generate sample data
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=42)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
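The regression example above stops before a model is trained. A minimal sketch of one way to carry it through, assuming the same synthetic data and the `random_state=42` seed used in the earlier examples; `n_estimators=100` is an illustrative choice, not something the source specifies:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Generate sample data (same shape as in the example above)
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# Hold out 20% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest regressor (assumed settings)
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)

# For regressors, score() returns the coefficient of determination (R^2)
r2 = reg.score(X_test, y_test)
print(f"R^2: {r2:.2f}")
```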
Competitor Comparisons
An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- More powerful for deep learning and neural networks
- Better support for distributed computing and GPU acceleration
- Flexible ecosystem with tools like TensorBoard for visualization
Cons of TensorFlow
- Steeper learning curve and more complex API
- Slower for simple machine learning tasks
- Larger library size and longer setup time
Code Comparison
TensorFlow example (neural network):
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
scikit-learn example (random forest):
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
TensorFlow is better suited for complex deep learning tasks, while scikit-learn excels in traditional machine learning algorithms. TensorFlow offers more flexibility and power but requires more expertise, whereas scikit-learn provides a simpler, more intuitive API for quick prototyping and smaller-scale projects. The choice between the two depends on the specific requirements of your machine learning task and your level of expertise.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- More flexible and dynamic computational graph
- Better support for GPU acceleration and distributed computing
- Easier to debug and understand due to its pythonic nature
Cons of PyTorch
- Steeper learning curve for beginners
- Smaller ecosystem of pre-built models and tools
- Less suitable for traditional machine learning tasks
Code Comparison
scikit-learn:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
PyTorch:
import torch.nn as nn
class Net(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        # Single fully connected layer mapping inputs to outputs
        self.fc = nn.Linear(input_size, output_size)
    def forward(self, x):
        return self.fc(x)
scikit-learn is more concise for traditional machine learning tasks, while PyTorch offers more flexibility for deep learning and custom model architectures. scikit-learn provides a higher-level API, making it easier for beginners and quick prototyping. PyTorch's lower-level API allows for more control over the model's internals and computation, which is beneficial for research and complex deep learning projects.
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Pros of LightGBM
- Faster training speed and higher efficiency, especially for large datasets
- Lower memory usage due to its histogram-based algorithm
- Better accuracy in many scenarios, particularly for categorical features
Cons of LightGBM
- Less extensive documentation and community support compared to scikit-learn
- Steeper learning curve for beginners due to more hyperparameters
- Not as versatile for general machine learning tasks beyond gradient boosting
Code Comparison
LightGBM:
import lightgbm as lgb
model = lgb.LGBMClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
scikit-learn:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Both libraries offer similar ease of use for basic implementation. However, LightGBM provides more advanced options for fine-tuning performance, while scikit-learn offers a wider range of algorithms and preprocessing tools within a single package.
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
Pros of XGBoost
- Faster training and prediction times for large datasets
- Better handling of missing values and categorical features
- Generally achieves higher accuracy on a wide range of problems
Cons of XGBoost
- Less intuitive for beginners compared to scikit-learn's API
- Requires more hyperparameter tuning to achieve optimal performance
- Limited to tree-based models, while scikit-learn offers a broader range of algorithms
Code Comparison
XGBoost:
import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
scikit-learn:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Both libraries offer similar high-level APIs for model training and prediction. However, XGBoost provides more advanced features and parameters for fine-tuning gradient boosting models, while scikit-learn offers a wider variety of algorithms and a more consistent API across different model types.
XGBoost is generally preferred for competitions and when maximum performance is required, while scikit-learn is often chosen for its ease of use, extensive documentation, and broader range of algorithms for various machine learning tasks.
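The consistent API mentioned above means that different estimator types can be swapped without changing the surrounding code. A small sketch of that idea, using synthetic data and an arbitrary selection of four classifiers; the model choices and parameters are assumptions for illustration, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Four unrelated model families behind the same fit/score interface
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # same call for every estimator
    scores[name] = model.score(X_test, y_test)   # mean accuracy on the test set
    print(f"{name}: {scores[name]:.2f}")
```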
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
Pros of CatBoost
- Handles categorical features automatically without preprocessing
- Generally faster training and prediction times, especially on GPU
- Often achieves better performance out-of-the-box on datasets with categorical features
Cons of CatBoost
- Less flexibility and customization options compared to scikit-learn
- Smaller community and ecosystem of extensions/plugins
- Limited to gradient boosting algorithms, while scikit-learn offers a wide range of ML algorithms
Code Comparison
scikit-learn:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
CatBoost:
from catboost import CatBoostClassifier
# cat_features: names or indices of the categorical columns (defined elsewhere)
model = CatBoostClassifier(cat_features=cat_features)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
The main difference in usage is that CatBoost allows direct specification of categorical features, while scikit-learn requires preprocessing of categorical variables (e.g., one-hot encoding) before training. CatBoost's API is designed to be similar to scikit-learn for ease of use, but with some additional parameters specific to its implementation.
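The preprocessing step on the scikit-learn side can be sketched as a pipeline that one-hot encodes string categories before they reach the model. This is only an illustration under stated assumptions: the `colors` and `sizes` columns, the toy labelling rule, and the step names `encode` and `model` are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical dataset with two categorical (string) columns
rng = np.random.default_rng(0)
colors = rng.choice(["red", "green", "blue"], size=300)
sizes = rng.choice(["small", "large"], size=300)
X = np.column_stack([colors, sizes])
y = ((colors == "red") | (sizes == "large")).astype(int)  # toy labelling rule

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# One-hot encode the categories, then fit the gradient boosting model
pipe = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("model", GradientBoostingClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)

accuracy = pipe.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
```

Wrapping the encoder and the model in one `Pipeline` keeps the encoding fitted on the training split only and applies it automatically at prediction time.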
Deep Learning for humans
Pros of Keras
- Higher-level API, making it easier to build and experiment with neural networks
- Better suited for deep learning tasks and complex neural network architectures
- Supports multiple backend engines (TensorFlow, JAX, and PyTorch in Keras 3; earlier releases supported TensorFlow, Theano, and CNTK)
Cons of Keras
- Less flexible for non-neural network machine learning tasks
- Slower execution compared to lower-level libraries
- Steeper learning curve for understanding underlying concepts
Code Comparison
Keras:
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))
Scikit-learn:
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(64,), activation='relu')
model.fit(X_train, y_train)
Summary
Keras is better suited for deep learning and complex neural network architectures, while Scikit-learn offers a broader range of machine learning algorithms and is more flexible for general-purpose tasks. Keras provides a higher-level API, making it easier to build and experiment with neural networks, but may have a steeper learning curve for understanding underlying concepts. Scikit-learn, on the other hand, is more intuitive for beginners and offers faster execution for simpler models.
README
.. -*- mode: rst -*-
|Azure| |CirrusCI| |Codecov| |CircleCI| |Nightly wheels| |Black| |PythonVersion| |PyPi| |DOI| |Benchmark|

.. |Azure| image:: https://dev.azure.com/scikit-learn/scikit-learn/_apis/build/status/scikit-learn.scikit-learn?branchName=main
   :target: https://dev.azure.com/scikit-learn/scikit-learn/_build/latest?definitionId=1&branchName=main

.. |CircleCI| image:: https://circleci.com/gh/scikit-learn/scikit-learn/tree/main.svg?style=shield
   :target: https://circleci.com/gh/scikit-learn/scikit-learn

.. |CirrusCI| image:: https://img.shields.io/cirrus/github/scikit-learn/scikit-learn/main?label=Cirrus%20CI
   :target: https://cirrus-ci.com/github/scikit-learn/scikit-learn/main

.. |Codecov| image:: https://codecov.io/gh/scikit-learn/scikit-learn/branch/main/graph/badge.svg?token=Pk8G9gg3y9
   :target: https://codecov.io/gh/scikit-learn/scikit-learn

.. |Nightly wheels| image:: https://github.com/scikit-learn/scikit-learn/workflows/Wheel%20builder/badge.svg?event=schedule
   :target: https://github.com/scikit-learn/scikit-learn/actions?query=workflow%3A%22Wheel+builder%22+event%3Aschedule

.. |PythonVersion| image:: https://img.shields.io/pypi/pyversions/scikit-learn.svg
   :target: https://pypi.org/project/scikit-learn/

.. |PyPi| image:: https://img.shields.io/pypi/v/scikit-learn
   :target: https://pypi.org/project/scikit-learn

.. |Black| image:: https://img.shields.io/badge/code%20style-black-000000.svg
   :target: https://github.com/psf/black

.. |DOI| image:: https://zenodo.org/badge/21369/scikit-learn/scikit-learn.svg
   :target: https://zenodo.org/badge/latestdoi/21369/scikit-learn/scikit-learn

.. |Benchmark| image:: https://img.shields.io/badge/Benchmarked%20by-asv-blue
   :target: https://scikit-learn.org/scikit-learn-benchmarks

.. |PythonMinVersion| replace:: 3.9
.. |NumPyMinVersion| replace:: 1.19.5
.. |SciPyMinVersion| replace:: 1.6.0
.. |JoblibMinVersion| replace:: 1.2.0
.. |ThreadpoolctlMinVersion| replace:: 3.1.0
.. |MatplotlibMinVersion| replace:: 3.3.4
.. |Scikit-ImageMinVersion| replace:: 0.17.2
.. |PandasMinVersion| replace:: 1.1.5
.. |SeabornMinVersion| replace:: 0.9.0
.. |PytestMinVersion| replace:: 7.1.2
.. |PlotlyMinVersion| replace:: 5.14.0

.. image:: https://raw.githubusercontent.com/scikit-learn/scikit-learn/main/doc/logos/scikit-learn-logo.png
   :target: https://scikit-learn.org/

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.
The project was started in 2007 by David Cournapeau as a Google Summer
of Code project, and since then many volunteers have contributed. See
the `About us <https://scikit-learn.org/dev/about.html#authors>`__ page
for a list of core contributors.
It is currently maintained by a team of volunteers.
Website: https://scikit-learn.org
Installation
------------

Dependencies
~~~~~~~~~~~~
scikit-learn requires:
- Python (>= |PythonMinVersion|)
- NumPy (>= |NumPyMinVersion|)
- SciPy (>= |SciPyMinVersion|)
- joblib (>= |JoblibMinVersion|)
- threadpoolctl (>= |ThreadpoolctlMinVersion|)
=======
**Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4.**
scikit-learn 1.0 and later require Python 3.7 or newer.
scikit-learn 1.1 and later require Python 3.8 or newer.
Scikit-learn plotting capabilities (i.e., functions whose names start with ``plot_``
and classes whose names end with ``Display``) require Matplotlib (>= |MatplotlibMinVersion|).
For running the examples Matplotlib >= |MatplotlibMinVersion| is required.
A few examples require scikit-image >= |Scikit-ImageMinVersion|, a few examples
require pandas >= |PandasMinVersion|, some examples require seaborn >=
|SeabornMinVersion| and plotly >= |PlotlyMinVersion|.
User installation
~~~~~~~~~~~~~~~~~

If you already have a working installation of NumPy and SciPy,
the easiest way to install scikit-learn is using ``pip``::

    pip install -U scikit-learn

or ``conda``::

    conda install -c conda-forge scikit-learn

The documentation includes more detailed `installation instructions <https://scikit-learn.org/stable/install.html>`_.
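After installing, a quick check from Python can confirm which versions were picked up. This is a small sketch rather than an official verification procedure; it relies only on ``sklearn.__version__`` and ``sklearn.show_versions()``, and the printed output will differ from one environment to another.

```python
import sklearn

# Version string of the installed scikit-learn package
print(sklearn.__version__)

# Prints Python, dependency, and build information; handy for bug reports
sklearn.show_versions()
```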
Changelog
---------

See the `changelog <https://scikit-learn.org/dev/whats_new.html>`__
for a history of notable changes to scikit-learn.
Development
-----------

We welcome new contributors of all experience levels. The scikit-learn
community goals are to be helpful, welcoming, and effective. The
`Development Guide <https://scikit-learn.org/stable/developers/index.html>`_
has detailed information about contributing code, documentation, tests, and
more. We've included some basic information in this README.
Important links
~~~~~~~~~~~~~~~
- Official source code repo: https://github.com/scikit-learn/scikit-learn
- Download releases: https://pypi.org/project/scikit-learn/
- Issue tracker: https://github.com/scikit-learn/scikit-learn/issues
Source code
~~~~~~~~~~~
You can check the latest sources with the command::

    git clone https://github.com/scikit-learn/scikit-learn.git
Contributing
~~~~~~~~~~~~
To learn more about making a contribution to scikit-learn, please see our
`Contributing guide
<https://scikit-learn.org/dev/developers/contributing.html>`_.
Testing
~~~~~~~
After installation, you can launch the test suite from outside the source
directory (you will need to have ``pytest`` >= |PytestMinVersion| installed)::

    pytest sklearn

See the web page https://scikit-learn.org/dev/developers/contributing.html#testing-and-improving-test-coverage
for more information.
Random number generation can be controlled during testing by setting
the ``SKLEARN_SEED`` environment variable.
Submitting a Pull Request
Before opening a Pull Request, have a look at the full Contributing page to make sure your code complies with our guidelines: https://scikit-learn.org/stable/developers/index.html
Project History
---------------

The project was started in 2007 by David Cournapeau as a Google Summer
of Code project, and since then many volunteers have contributed. See
the `About us <https://scikit-learn.org/dev/about.html#authors>`__ page
for a list of core contributors.

The project is currently maintained by a team of volunteers.

**Note**: `scikit-learn` was previously referred to as `scikits.learn`.
Help and Support
----------------

Documentation
~~~~~~~~~~~~~
- HTML documentation (stable release): https://scikit-learn.org
- HTML documentation (development version): https://scikit-learn.org/dev/
- FAQ: https://scikit-learn.org/stable/faq.html
Communication
~~~~~~~~~~~~~
- Mailing list: https://mail.python.org/mailman/listinfo/scikit-learn
- Logos & Branding: https://github.com/scikit-learn/scikit-learn/tree/main/doc/logos
- Blog: https://blog.scikit-learn.org
- Calendar: https://blog.scikit-learn.org/calendar/
- Twitter: https://twitter.com/scikit_learn
- Stack Overflow: https://stackoverflow.com/questions/tagged/scikit-learn
- GitHub Discussions: https://github.com/scikit-learn/scikit-learn/discussions
- Website: https://scikit-learn.org
- LinkedIn: https://www.linkedin.com/company/scikit-learn
- Bluesky: https://bsky.app/profile/scikit-learn.org
- YouTube: https://www.youtube.com/channel/UCJosFjYm0ZYVUARxuOZqnnw/playlists
- Facebook: https://www.facebook.com/scikitlearnofficial/
- Instagram: https://www.instagram.com/scikitlearnofficial/
- TikTok: https://www.tiktok.com/@scikit.learn
- Mastodon: https://mastodon.social/@sklearn@fosstodon.org
- Discord: https://discord.gg/h9qyrK8Jc8
Citation
~~~~~~~~
If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn