tpot
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
Top Related Projects
scikit-learn: machine learning in Python
NNI: An open source AutoML toolkit that automates the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
auto-sklearn: Automated Machine Learning with scikit-learn
AutoKeras: AutoML library for deep learning
mljar-supervised: Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
PyCaret: An open-source, low-code machine learning library in Python
Quick Overview
TPOT (Tree-based Pipeline Optimization Tool) is an automated machine learning tool that optimizes machine learning pipelines using genetic programming. It automates the most tedious parts of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
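To make the idea of an evolutionary pipeline search concrete, here is a toy sketch in plain Python. It is not TPOT's actual algorithm: the step names, the made-up fitness numbers, and the keep-the-better-half selection rule are all assumptions chosen for illustration. In a real search the fitness would be a cross-validation score.

```python
import random

# Hypothetical building blocks; a real AutoML tool has a far larger operator set.
PREPROCESSORS = ["none", "scale", "poly"]
MODELS = ["tree", "logistic", "forest"]


def random_pipeline(rng):
    """Sample a two-step pipeline: (preprocessor, model)."""
    return (rng.choice(PREPROCESSORS), rng.choice(MODELS))


def mutate(pipeline, rng):
    """Swap exactly one step of the pipeline for a different option."""
    prep, model = pipeline
    if rng.random() < 0.5:
        prep = rng.choice([p for p in PREPROCESSORS if p != prep])
    else:
        model = rng.choice([m for m in MODELS if m != model])
    return (prep, model)


def toy_fitness(pipeline):
    """Stand-in for a cross-validation score (the numbers are invented)."""
    prep_bonus = {"none": 0.00, "scale": 0.03, "poly": 0.05}
    model_score = {"tree": 0.80, "logistic": 0.85, "forest": 0.90}
    return model_score[pipeline[1]] + prep_bonus[pipeline[0]]


def evolve(generations=5, population_size=6, seed=42):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Keep the better half, then refill with mutated copies of survivors.
        population.sort(key=toy_fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    return max(population, key=toy_fitness)


best = evolve()
print(best, round(toy_fitness(best), 2))
```

Because the best candidates always survive into the next generation, the best score in this sketch can never get worse as generations increase.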
Pros
- Automates the entire machine learning pipeline, from preprocessing to model selection and hyperparameter tuning
- Produces Python code for the best pipeline, allowing for easy reproduction and further customization
- Supports both classification and regression tasks
- Integrates well with scikit-learn, making it familiar for many data scientists
Cons
- Can be computationally expensive and time-consuming for large datasets or complex problems
- May not find the best possible pipeline, because genetic programming is a stochastic search with no guarantee of optimality
- Limited flexibility in terms of custom algorithms or preprocessing steps
- Requires some understanding of machine learning concepts to interpret and use results effectively
Code Examples
- Basic classification example:
from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_iris_pipeline.py')
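As a rough guide to runtime, the number of pipelines such a run evaluates can be estimated from `generations` and `population_size`. The sketch below assumes the formula given in the TPOT documentation (population_size + generations × offspring_size, with offspring_size defaulting to population_size) and assumes 5-fold cross-validation; `pipelines_evaluated` is a helper name introduced here, not part of TPOT.

```python
def pipelines_evaluated(generations, population_size, offspring_size=None):
    """Estimate how many pipelines one search run will evaluate."""
    # Assumption: offspring_size falls back to population_size when unset.
    if offspring_size is None:
        offspring_size = population_size
    return population_size + generations * offspring_size


total = pipelines_evaluated(generations=5, population_size=50)
model_fits = total * 5  # assuming 5-fold cross-validation per pipeline
print(total, model_fits)  # 300 1500
```

Even this small configuration therefore implies on the order of a thousand model fits, which is why larger settings can run for hours.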
- Regression example with custom config:
import numpy as np  # needed for np.arange in the config below
from tpot import TPOTRegressor
# NOTE: load_boston was removed in scikit-learn 1.2; this example needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Restrict the search to two estimators and the listed hyperparameter values
tpot_config = {
    'sklearn.linear_model.ElasticNetCV': {
        'l1_ratio': np.arange(0.0, 1.0, 0.1),
        'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
    },
    'sklearn.ensemble.RandomForestRegressor': {
        'n_estimators': [100, 200, 500],
        # 1.0 = use all features; replaces 'auto', which scikit-learn 1.3 removed
        'max_features': [1.0, 'sqrt', 'log2'],
        'max_depth': [None, 5, 10, 20]
    }
}
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, config_dict=tpot_config, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')
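To get a feel for how large a search space such a custom configuration defines, the hyperparameter combinations per estimator can be counted with the standard library. This is only a sketch: it mirrors the dictionary above with plain lists instead of `np.arange`, writes the "all features" option as 1.0, and `count_settings` is a helper name introduced here. It counts single-estimator settings only, not the multi-step pipelines a search can assemble from them.

```python
from itertools import product

# Plain-list mirror of the custom configuration (no numpy needed here).
config = {
    "sklearn.linear_model.ElasticNetCV": {
        "l1_ratio": [round(0.1 * i, 1) for i in range(10)],  # 0.0 .. 0.9
        "tol": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    },
    "sklearn.ensemble.RandomForestRegressor": {
        "n_estimators": [100, 200, 500],
        "max_features": [1.0, "sqrt", "log2"],  # 1.0 = all features
        "max_depth": [None, 5, 10, 20],
    },
}


def count_settings(grid):
    """Count every combination of the listed hyperparameter values."""
    return sum(1 for _ in product(*grid.values()))


counts = {name: count_settings(grid) for name, grid in config.items()}
for name, n in counts.items():
    print(name, n)
```

Here ElasticNetCV contributes 50 settings and RandomForestRegressor 36, so a population of 50 can already cover a meaningful share of the single-estimator options.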
- Using TPOT with custom scoring metric:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer, f1_score
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
f1_scorer = make_scorer(f1_score, average='weighted')
tpot = TPOTClassifier(generations=5, population_size=50, scoring=f1_scorer, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')
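For readers unsure what `average='weighted'` means in the scorer above, the following from-scratch sketch computes a weighted F1 score: the per-class F1 is averaged with each class weighted by its number of true samples. It is written for illustration with the standard library only and is not scikit-learn's implementation; `weighted_f1` is a name introduced here.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n_true
    return total / len(y_true)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))  # 0.8222
```

In this example the per-class F1 values are 1.0, 2/3 and 0.8, and since every class has two true samples the weighted average equals their plain mean.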
Getting Started
To get started with TPOT, first install it; the installation instructions are maintained in the project documentation.
Competitor Comparisons
scikit-learn: machine learning in Python
Pros of scikit-learn
- Comprehensive library with a wide range of machine learning algorithms and tools
- Well-established, mature project with extensive documentation and community support
- Highly optimized and efficient implementations of algorithms
Cons of scikit-learn
- Requires manual hyperparameter tuning and model selection
- Less automated approach to machine learning pipeline creation
- May require more domain expertise to effectively use all features
Code Comparison
scikit-learn:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
TPOT:
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
tpot.export('tpot_pipeline.py')
scikit-learn offers more control over individual algorithms and requires manual pipeline construction. TPOT automates the process of creating and optimizing machine learning pipelines, potentially saving time and reducing the need for expert knowledge. However, scikit-learn provides a broader range of tools and may be more suitable for users who need fine-grained control over their models.
NNI: An open source AutoML toolkit that automates the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
Pros of NNI
- Broader scope: Supports neural architecture search and hyperparameter optimization for deep learning models
- More flexible: Offers multiple optimization algorithms and supports various frameworks (TensorFlow, PyTorch, etc.)
- Distributed training: Enables running experiments across multiple machines
Cons of NNI
- Steeper learning curve: More complex setup and configuration compared to TPOT
- Less focus on traditional machine learning: TPOT specializes in automating scikit-learn pipelines
Code Comparison
TPOT example:
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
NNI example:
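import nni
# Trial script: NNI runs this once per trial with a fresh set of hyperparameters
params = nni.get_next_parameter()  # hyperparameters proposed by the tuner
model = create_model(params)  # create_model is a placeholder for your own model factory
model.fit(X_train, y_train)
nni.report_final_result(model.score(X_test, y_test))  # metric the tuner optimizes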
Both tools aim to automate machine learning workflows, but NNI offers more extensive features for deep learning and distributed training, while TPOT focuses on traditional machine learning pipelines using scikit-learn.
auto-sklearn: Automated Machine Learning with scikit-learn
Pros of auto-sklearn
- Utilizes meta-learning for faster model selection and hyperparameter optimization
- Implements ensemble selection to combine multiple models for improved performance
- Supports warm-starting from previous runs, allowing for incremental improvements
Cons of auto-sklearn
- Limited to scikit-learn compatible algorithms and preprocessors
- Requires more computational resources due to its complex optimization process
- Less flexibility in customizing the search space compared to TPOT
Code Comparison
TPOT:
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, cv=5, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
auto-sklearn:
import autosklearn.classification
automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=120, per_run_time_limit=30)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))
Both TPOT and auto-sklearn offer automated machine learning solutions, but they differ in their approaches and features. TPOT uses genetic programming to optimize machine learning pipelines, while auto-sklearn employs Bayesian optimization and meta-learning. The choice between the two depends on specific project requirements, available computational resources, and desired level of customization.
AutoKeras: AutoML library for deep learning
Pros of AutoKeras
- Built on top of Keras, leveraging its extensive ecosystem and compatibility with TensorFlow
- Supports both image classification and regression tasks out-of-the-box
- Offers a simple, high-level API for quick implementation of AutoML workflows
Cons of AutoKeras
- Limited flexibility in customizing the search space compared to TPOT
- Primarily focused on neural network architectures, while TPOT explores a broader range of ML algorithms
- Less mature project with potentially fewer community contributions and resources
Code Comparison
AutoKeras:
import autokeras as ak
clf = ak.ImageClassifier(max_trials=10)
clf.fit(x_train, y_train)
results = clf.predict(x_test)
TPOT:
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
results = tpot.predict(X_test)
Both libraries offer simple APIs for AutoML, but AutoKeras focuses on neural networks for specific tasks like image classification, while TPOT provides a more general-purpose solution for various machine learning algorithms and pipelines.
mljar-supervised: Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
Pros of mljar-supervised
- Offers a wider range of ML algorithms, including neural networks and ensemble methods
- Provides more detailed explanations and visualizations of model performance
- Supports both binary and multi-class classification tasks
Cons of mljar-supervised
- Less mature project with fewer contributors and stars on GitHub
- May be slower in processing large datasets compared to TPOT
- Documentation is not as comprehensive as TPOT's
Code Comparison
TPOT example:
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
mljar-supervised example:
from supervised import AutoML
automl = AutoML(results_path="automl_results")
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)
Both libraries aim to automate the machine learning pipeline, but mljar-supervised offers more flexibility in terms of algorithms and explanations. TPOT, on the other hand, has a larger community and may be more suitable for handling big data. The choice between the two depends on the specific requirements of your project and the level of control you need over the ML process.
PyCaret: An open-source, low-code machine learning library in Python
Pros of PyCaret
- More comprehensive, covering a wider range of ML tasks including classification, regression, clustering, and anomaly detection
- Easier to use with a low-code interface and automated preprocessing steps
- Better integration with popular visualization libraries like Plotly
Cons of PyCaret
- Less focus on automated feature engineering compared to TPOT
- May be slower for large datasets due to its comprehensive approach
- Less emphasis on evolutionary algorithms for hyperparameter optimization
Code Comparison
TPOT:
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_pipeline.py')
PyCaret:
from pycaret.classification import *
setup(data=df, target='target')
best_model = compare_models()
predict_model(best_model)
save_model(best_model, 'best_model')
Both libraries aim to automate the machine learning pipeline, but PyCaret offers a more user-friendly interface with broader functionality, while TPOT focuses on evolutionary algorithms for pipeline optimization. PyCaret is generally easier for beginners, while TPOT may provide more fine-tuned results for specific use cases.
README
To try TPOT2 (alpha), please go here!
TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
An example Machine Learning pipeline
Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.
TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.
TPOT is still under active development and we encourage you to check back on this repository regularly for updates.
For further information about TPOT, please see the project documentation.
License
Please see the repository license for the licensing and usage information for TPOT.
Generally, we have licensed TPOT to make it as widely usable as possible.
Installation
We maintain the TPOT installation instructions in the documentation. TPOT requires a working installation of Python.
Usage
TPOT can be used on the command line or with Python code.
Click on the corresponding links to find more information on TPOT usage in the documentation.
Examples
Classification
Below is a minimal working example with the optical recognition of handwritten digits dataset.
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')
Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the tpot_digits_pipeline.py file and look similar to the following:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive
# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=42)
# Average CV score on the training set was: 0.9799428471757372
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
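The StackingEstimator step in the exported pipeline may be unfamiliar. As a rough mental model, and not TPOT's actual code, a stacking step fits an inner estimator and then passes its predictions downstream as an extra feature alongside the original columns. The toy classes below are hypothetical stand-ins written in plain Python, and the position of the added column is an illustrative choice.

```python
class ThresholdClassifier:
    """Hypothetical inner model: predicts 1 when the first feature exceeds a threshold."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y):
        return self  # nothing to learn in this toy model

    def predict(self, X):
        return [1 if row[0] > self.threshold else 0 for row in X]


class ToyStackingStep:
    """Adds the inner estimator's prediction as an extra feature column."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        predictions = self.estimator.predict(X)
        # Put the prediction in front of the original features.
        return [[p] + list(row) for p, row in zip(predictions, X)]


X = [[0.2, 1.0], [0.9, 3.0]]
y = [0, 1]
stacked = ToyStackingStep(ThresholdClassifier()).fit(X, y).transform(X)
print(stacked)  # [[0, 0.2, 1.0], [1, 0.9, 3.0]]
```

The final estimator in the pipeline then trains on these widened rows, so it can use the inner model's guess as one more input.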
Regression
Similarly, TPOT can optimize pipelines for regression problems. Below is a minimal working example with the practice Boston housing prices data set.
from tpot import TPOTRegressor
# NOTE: load_boston was removed in scikit-learn 1.2; this example needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')
which should result in a pipeline that achieves about 12.77 mean squared error (MSE), and the Python code in tpot_boston_pipeline.py should look similar to:
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive
# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=42)
# Average CV score on the training set was: -10.812040755234403
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=False, max_features=0.5, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
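The negative CV score in the exported code can look odd next to a mean squared error, which is never negative. The likely explanation, assuming TPOTRegressor's default scoring is negative MSE as in scikit-learn's "higher is better" scorer convention, is that the error is simply negated so that larger scores are better. A minimal standard-library sketch of that convention, with helper names introduced here:

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; always >= 0, lower is better."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def neg_mean_squared_error(y_true, y_pred):
    """Same quantity with the sign flipped, so higher is better."""
    return -mean_squared_error(y_true, y_pred)


y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
mse = mean_squared_error(y_true, y_pred)
score = neg_mean_squared_error(y_true, y_pred)
print(round(mse, 4), round(score, 4))  # 1.6667 -1.6667
```

Under that assumption, a reported score of about -10.81 corresponds to a cross-validated MSE of about 10.81.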
Check the documentation for more examples and tutorials.
Contributing to TPOT
We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.
Before submitting any contributions, please review our contribution guidelines.
Having problems or have questions about TPOT?
Please check the existing open and closed issues to see if your issue has already been attended to. If it hasn't, file a new issue on this repository so we can review your issue.
Citing TPOT
If you use TPOT in a scientific publication, please consider citing at least one of the following papers:
Trang T. Le, Weixuan Fu and Jason H. Moore (2020). Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics, 36(1): 250-256.
BibTeX entry:
@article{le2020scaling,
title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},
author={Le, Trang T and Fu, Weixuan and Moore, Jason H},
journal={Bioinformatics},
volume={36},
number={1},
pages={250--256},
year={2020},
publisher={Oxford University Press}
}
Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). Automating biomedical data science through tree-based pipeline optimization. Applications of Evolutionary Computation, pages 123-137.
BibTeX entry:
@inbook{Olson2016EvoBio,
author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},
editor={Squillero, Giovanni and Burelli, Paolo},
chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},
title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},
year={2016},
publisher={Springer International Publishing},
pages={123--137},
isbn={978-3-319-31204-0},
doi={10.1007/978-3-319-31204-0_9},
url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}
}
Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.
BibTeX entry:
@inproceedings{OlsonGECCO2016,
author = {Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. and Moore, Jason H.},
title = {Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},
booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference 2016},
series = {GECCO '16},
year = {2016},
isbn = {978-1-4503-4206-3},
location = {Denver, Colorado, USA},
pages = {485--492},
numpages = {8},
url = {http://doi.acm.org/10.1145/2908812.2908918},
doi = {10.1145/2908812.2908918},
acmid = {2908918},
publisher = {ACM},
address = {New York, NY, USA},
}
Alternatively, you can cite the repository directly with the following DOI:
Support for TPOT
TPOT was developed in the Computational Genetics Lab at the University of Pennsylvania with funding from the NIH under grant R01 AI117694. We are incredibly grateful for the support of the NIH and the University of Pennsylvania during the development of this project.
The TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.