tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Top Related Projects

scikit-learn

62,466

scikit-learn: machine learning in Python

nni

14,238

An open source AutoML toolkit for automate machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.

auto-sklearn

7,915

Automated Machine Learning with scikit-learn

mljar-supervised

3,183

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation

pycaret

9,463

An open-source, low-code machine learning library in Python

Quick Overview

TPOT (Tree-based Pipeline Optimization Tool) is an automated machine learning tool that optimizes machine learning pipelines using genetic programming. It automates the most tedious parts of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
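The idea of evolving pipelines can be illustrated with a toy loop. The sketch below is not TPOT's implementation: the search space, `toy_fitness`, and the keep-the-best-half selection scheme are hypothetical stand-ins, and a real run would score each candidate by cross-validation rather than by a lookup table.

```python
import random

# Hypothetical search space: each candidate "pipeline" picks one value per slot.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "model": ["tree", "forest", "linear"],
    "depth": [2, 4, 8, 16],
}


def random_pipeline(rng):
    """Draw one random candidate from the search space."""
    return {slot: rng.choice(options) for slot, options in SEARCH_SPACE.items()}


def mutate(pipeline, rng):
    """Copy a candidate and re-draw a single randomly chosen slot."""
    child = dict(pipeline)
    slot = rng.choice(list(SEARCH_SPACE))
    child[slot] = rng.choice(SEARCH_SPACE[slot])
    return child


def toy_fitness(pipeline):
    """Made-up score standing in for a cross-validated metric."""
    score = {"none": 0.0, "standard": 0.2, "minmax": 0.1}[pipeline["scaler"]]
    score += {"tree": 0.3, "forest": 0.5, "linear": 0.2}[pipeline["model"]]
    score += 0.3 - abs(pipeline["depth"] - 8) / 40
    return score


def evolve(generations=5, population_size=20, seed=42):
    """Keep the better half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=toy_fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=toy_fitness)


best = evolve()
```

Because the survivors are carried over unchanged, the best score in the population never decreases from one generation to the next.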

Pros

Automates the entire machine learning pipeline, from preprocessing to model selection and hyperparameter tuning
Produces Python code for the best pipeline, allowing for easy reproduction and further customization
Supports both classification and regression tasks
Integrates well with scikit-learn, making it familiar for many data scientists

Cons

Can be computationally expensive and time-consuming for large datasets or complex problems
May not always find the absolute best solution, as it relies on genetic algorithms
Limited flexibility in terms of custom algorithms or preprocessing steps
Requires some understanding of machine learning concepts to interpret and use results effectively

Code Examples

Basic classification example:

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))
tpot.export('tpot_iris_pipeline.py')

Regression example with custom config:

import numpy as np
from tpot import TPOTRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# load_boston was removed in scikit-learn 1.2; load_diabetes is a bundled regression dataset
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Restrict the search to two estimators and the hyperparameter values listed here
tpot_config = {
    'sklearn.linear_model.ElasticNetCV': {
        'l1_ratio': np.arange(0.0, 1.0, 0.1),
        'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
    },
    'sklearn.ensemble.RandomForestRegressor': {
        'n_estimators': [100, 200, 500],
        # 'auto' is no longer accepted by recent scikit-learn; 1.0 is its equivalent for regressors
        'max_features': [1.0, 'sqrt', 'log2'],
        'max_depth': [None, 5, 10, 20]
    }
}

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, config_dict=tpot_config, random_state=42)
tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))
tpot.export('tpot_diabetes_pipeline.py')
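To get a feel for how much a custom configuration narrows the search, it can help to count the hyperparameter combinations each operator contributes. The sketch below uses a dictionary of the same shape as the one above, written with plain lists so it needs only the standard library; `count_settings` is a hypothetical helper, not part of TPOT.

```python
from math import prod

# Same shape as a TPOT config dictionary: operator path -> {hyperparameter: candidate values}
tpot_config = {
    "sklearn.linear_model.ElasticNetCV": {
        "l1_ratio": [i / 10 for i in range(10)],
        "tol": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    },
    "sklearn.ensemble.RandomForestRegressor": {
        "n_estimators": [100, 200, 500],
        "max_features": [1.0, "sqrt", "log2"],
        "max_depth": [None, 5, 10, 20],
    },
}


def count_settings(config):
    """Return the number of distinct hyperparameter combinations per operator."""
    return {
        operator: prod(len(values) for values in params.values())
        for operator, params in config.items()
    }


settings_per_operator = count_settings(tpot_config)
total_settings = sum(settings_per_operator.values())
```

Here the two operators contribute 50 and 36 combinations respectively, so a single-estimator pipeline has 86 possible settings before any chaining of steps is considered.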

Using TPOT with custom scoring metric:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer, f1_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

f1_scorer = make_scorer(f1_score, average='weighted')

tpot = TPOTClassifier(generations=5, population_size=50, scoring=f1_scorer, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')
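For readers unsure what `average='weighted'` means in the scorer above, the following standard-library sketch computes a weighted F1 score by hand: a per-class F1, averaged with each class weighted by its share of the true labels. The function name `weighted_f1` is an illustrative choice, and the sketch makes no attempt to reproduce scikit-learn's warnings or edge-case handling.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        score += f1 * count / total
    return score


# Class 0 has F1 = 2/3 and class 1 has F1 = 4/5; both have equal support.
example_score = weighted_f1([0, 0, 1, 1], [0, 1, 1, 1])
```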

Getting Started

To get started with TPOT, first install it using

Competitor Comparisons

scikit-learn

62,466

scikit-learn: machine learning in Python

Pros of scikit-learn

Comprehensive library with a wide range of machine learning algorithms and tools
Well-established, mature project with extensive documentation and community support
Highly optimized and efficient implementations of algorithms

Cons of scikit-learn

Requires manual hyperparameter tuning and model selection
Less automated approach to machine learning pipeline creation
May require more domain expertise to effectively use all features

Code Comparison

scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

TPOT:

from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
tpot.export('tpot_pipeline.py')

scikit-learn offers more control over individual algorithms and requires manual pipeline construction. TPOT automates the process of creating and optimizing machine learning pipelines, potentially saving time and reducing the need for expert knowledge. However, scikit-learn provides a broader range of tools and may be more suitable for users who need fine-grained control over their models.

nni

14,238

An open source AutoML toolkit for automate machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.

Pros of NNI

Broader scope: Supports neural architecture search and hyperparameter optimization for deep learning models
More flexible: Offers multiple optimization algorithms and supports various frameworks (TensorFlow, PyTorch, etc.)
Distributed training: Enables running experiments across multiple machines

Cons of NNI

Steeper learning curve: More complex setup and configuration compared to TPOT
Less focus on traditional machine learning: TPOT specializes in automating scikit-learn pipelines

Code Comparison

TPOT example:

from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

NNI example:

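import nni

def run_trial(params):
    # create_model is a user-defined factory that builds a model from the sampled hyperparameters
    model = create_model(params)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

if __name__ == '__main__':
    # Each NNI trial fetches one hyperparameter set and reports its final metric back to the tuner
    params = nni.get_next_parameter()
    nni.report_final_result(run_trial(params))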

Both tools aim to automate machine learning workflows, but NNI offers more extensive features for deep learning and distributed training, while TPOT focuses on traditional machine learning pipelines using scikit-learn.

auto-sklearn

7,915

Automated Machine Learning with scikit-learn

Pros of auto-sklearn

Utilizes meta-learning for faster model selection and hyperparameter optimization
Implements ensemble selection to combine multiple models for improved performance
Supports warm-starting from previous runs, allowing for incremental improvements

Cons of auto-sklearn

Limited to scikit-learn compatible algorithms and preprocessors
Requires more computational resources due to its complex optimization process
Less flexibility in customizing the search space compared to TPOT

Code Comparison

TPOT:

from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, cv=5, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

auto-sklearn:

import autosklearn.classification
automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=120, per_run_time_limit=30)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))

Both TPOT and auto-sklearn offer automated machine learning solutions, but they differ in their approaches and features. TPOT uses genetic programming to optimize machine learning pipelines, while auto-sklearn employs Bayesian optimization and meta-learning. The choice between the two depends on specific project requirements, available computational resources, and desired level of customization.

autokeras

9,258

AutoML library for deep learning

Pros of AutoKeras

Built on top of Keras, leveraging its extensive ecosystem and compatibility with TensorFlow
Supports both image classification and regression tasks out-of-the-box
Offers a simple, high-level API for quick implementation of AutoML workflows

Cons of AutoKeras

Limited flexibility in customizing the search space compared to TPOT
Primarily focused on neural network architectures, while TPOT explores a broader range of ML algorithms
Less mature project with potentially fewer community contributions and resources

Code Comparison

AutoKeras:

import autokeras as ak

clf = ak.ImageClassifier(max_trials=10)
clf.fit(x_train, y_train)
results = clf.predict(x_test)

TPOT:

from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
results = tpot.predict(X_test)

Both libraries offer simple APIs for AutoML, but AutoKeras focuses on neural networks for specific tasks like image classification, while TPOT provides a more general-purpose solution for various machine learning algorithms and pipelines.

mljar-supervised

3,183

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation

Pros of mljar-supervised

Offers a wider range of ML algorithms, including neural networks and ensemble methods
Provides more detailed explanations and visualizations of model performance
Supports both binary and multi-class classification tasks

Cons of mljar-supervised

Less mature project with fewer contributors and stars on GitHub
May be slower in processing large datasets compared to TPOT
Documentation is not as comprehensive as TPOT's

Code Comparison

TPOT example:

from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

mljar-supervised example:

from supervised import AutoML
automl = AutoML(results_path="automl_results")
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)

Both libraries aim to automate the machine learning pipeline, but mljar-supervised offers more flexibility in terms of algorithms and explanations. TPOT, on the other hand, has a larger community and may be more suitable for handling big data. The choice between the two depends on the specific requirements of your project and the level of control you need over the ML process.

pycaret

9,463

An open-source, low-code machine learning library in Python

Pros of PyCaret

More comprehensive, covering a wider range of ML tasks including classification, regression, clustering, and anomaly detection
Easier to use with a low-code interface and automated preprocessing steps
Better integration with popular visualization libraries like Plotly

Cons of PyCaret

Less focus on automated feature engineering compared to TPOT
May be slower for large datasets due to its comprehensive approach
Less emphasis on evolutionary algorithms for hyperparameter optimization

Code Comparison

TPOT:

from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_pipeline.py')

PyCaret:

from pycaret.classification import *
setup(data=df, target='target')
best_model = compare_models()
predict_model(best_model)
save_model(best_model, 'best_model')

Both libraries aim to automate the machine learning pipeline, but PyCaret offers a more user-friendly interface with broader functionality, while TPOT focuses on evolutionary algorithms for pipeline optimization. PyCaret is generally easier for beginners, while TPOT may provide more fine-tuned results for specific use cases.

README

TPOT

TPOT stands for Tree-based Pipeline Optimization Tool. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. Consider TPOT your Data Science Assistant.

Contributors

TPOT recently went through a major refactoring. The package was rewritten from scratch to improve efficiency and performance, support new features, and fix numerous bugs. New features include genetic feature selection, a significantly expanded and more flexible method of defining search spaces, multi-objective optimization, a more modular framework allowing for easier customization of the evolutionary algorithm, and more. While in development, this new version was referred to as "TPOT2" but we have now merged what was once TPOT2 into the main TPOT package. You can learn more about this new version of TPOT in our GPTP paper titled "TPOT2: A New Graph-Based Implementation of the Tree-Based Pipeline Optimization Tool for Automated Machine Learning."

Ribeiro, P. et al. (2024). TPOT2: A New Graph-Based Implementation of the Tree-Based Pipeline Optimization Tool for Automated Machine Learning. In: Winkler, S., Trujillo, L., Ofria, C., Hu, T. (eds) Genetic Programming Theory and Practice XX. Genetic and Evolutionary Computation. Springer, Singapore. https://doi.org/10.1007/978-981-99-8413-8_1

The current version of TPOT was developed at Cedars-Sinai by:
- Pedro Henrique Ribeiro (Lead developer - https://github.com/perib, https://www.linkedin.com/in/pedro-ribeiro/)
- Anil Saini (anil.saini@cshs.org)
- Jose Hernandez (jgh9094@gmail.com)
- Jay Moran (jay.moran@cshs.org)
- Nicholas Matsumoto (nicholas.matsumoto@cshs.org)
- Hyunjun Choi (hyunjun.choi@cshs.org)
- Gabriel Ketron (gabriel.ketron@cshs.org)
- Miguel E. Hernandez (miguel.e.hernandez@cshs.org)
- Jason Moore (moorejh28@gmail.com)

The original version of TPOT was primarily developed at the University of Pennsylvania by:
- Randal S. Olson (rso@randalolson.com)
- Weixuan Fu (weixuanf@upenn.edu)
- Daniel Angell (dpa34@drexel.edu)
- Jason Moore (moorejh28@gmail.com)
- and many more generous open-source contributors

License

Please see the repository license for the licensing and usage information for TPOT. Generally, we have licensed TPOT to make it as widely usable as possible.

TPOT is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

TPOT is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with TPOT. If not, see http://www.gnu.org/licenses/.

Documentation

The documentation webpage can be found here.

We also recommend looking at the Tutorials folder for jupyter notebooks with examples and guides.

Installation

TPOT requires a working installation of Python.

Creating a conda environment (optional)

We recommend installing TPOT in a conda environment, though it works equally well when installed without one.

More information on creating conda environments can be found here.

conda create --name tpotenv python=3.10
conda activate tpotenv

Packages Used

Python version >=3.10, <3.14

numpy, scipy, scikit-learn, update_checker, tqdm, stopit, pandas, joblib, xgboost, matplotlib, traitlets, lightgbm, optuna, jupyter, networkx, dask, distributed, dask-ml, dask-jobqueue, func_timeout, configspace
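The Python version bound above can be checked before installing. This is a minimal sketch using only the standard library; `python_supported` is a hypothetical helper name, and the bounds are copied from the requirement stated above.

```python
import sys


def python_supported(version=sys.version_info):
    """Return True when the interpreter satisfies >=3.10, <3.14."""
    return (3, 10) <= tuple(version[:2]) < (3, 14)


if not python_supported():
    print("This Python version is outside the range TPOT lists as supported.")
```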

Many of the hyperparameter ranges used in our configspaces were adapted from either the original TPOT package or the AutoSklearn package.

Note for M1 Mac or other Arm-based CPU users

You need to install the lightgbm package directly from conda using the following command before installing TPOT.

This is to ensure that you get the version that is compatible with your system.

conda install --yes -c conda-forge 'lightgbm>=3.3.3'

Installing Extra Features with pip

If you want to utilize the additional features provided by TPOT along with scikit-learn extensions, you can install them using pip. The command to install TPOT with these extra features is as follows:

pip install tpot[sklearnex]

Please note that while these extensions can speed up scikit-learn packages, there are some important considerations:

These extensions may not be fully developed and tested on Arm-based CPUs, such as M1 Macs. You might encounter compatibility issues or reduced performance on such systems.

We recommend using Python 3.9 when installing these extra features, as it provides better compatibility and stability.

Developer/Latest Branch Installation

pip install -e /path/to/tpotrepo

If you cloned the repository with git, the repository folder will be named TPOT. (Note: this is the folder that contains setup.py, not the folder of the same name inside it.) If you downloaded the repository as a zip, the folder may be called tpot-main.

Usage

See the Tutorials Folder for more instructions and examples.

Best Practices

1. TPOT uses dask for parallel processing. When Python code is parallelized, each module is imported within each process. It is therefore important to protect all code within an if __name__ == "__main__" block when running TPOT from a script. This is not required when running TPOT from a notebook.

For example:

#my_analysis.py

import tpot
if __name__ == "__main__":
    X, y = load_my_data()
    est = tpot.TPOTClassifier()
    est.fit(X,y)
    #rest of analysis

2. When designing custom objective functions, avoid the use of global variables.

Don't Do:

global_X = [[1,2],[4,5]]
global_y = [0,1]
def foo(est):
    return my_scorer(est, X=global_X, y=global_y)

Instead, use a partial:

from functools import partial

def foo_scorer(est, X, y):
    return my_scorer(est, X, y)

if __name__=='__main__':
    X = [[1,2],[4,5]]
    y = [0,1]
    final_scorer = partial(foo_scorer, X=X, y=y)
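A fully runnable variant of the partial pattern is sketched below. The scorer and the estimator are hypothetical stand-ins (a plain callable plays the role of a fitted estimator) so that the example needs nothing beyond the standard library; the point is only that the data travels inside the partial object instead of being looked up as a global.

```python
from functools import partial


def accuracy_scorer(est, X, y):
    """Fraction of samples for which est(row) equals the true label."""
    predictions = [est(row) for row in X]
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


def threshold_estimator(row):
    """Hypothetical stand-in for a fitted estimator: predict 1 when the row sum exceeds 5."""
    return int(sum(row) > 5)


X = [[1, 2], [4, 5]]
y = [0, 1]

# X and y are bound into the callable, so no global lookup happens at call time
final_scorer = partial(accuracy_scorer, X=X, y=y)
score = final_scorer(threshold_estimator)
```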

Similarly, when using lambda functions, bind outside values as default arguments.

Don't Do:

def new_objective(est, a, b):
    ...  # definition

a = 100
b = 20
bad_function = lambda est: new_objective(est=est, a=a, b=b)

Do:

def new_objective(est, a, b):
    ...  # definition

a = 100
b = 20
good_function = lambda est, a=a, b=b: new_objective(est=est, a=a, b=b)
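The difference between the two lambda forms comes down to when `a` and `b` are looked up. The sketch below uses a placeholder `new_objective` that simply adds its arguments (an assumption made so the effect is visible): the first lambda reads the variables when it is called, the second captures their values when it is defined.

```python
def new_objective(est, a, b):
    # Placeholder body for illustration: a real objective would evaluate est
    return est + a + b


a = 100
b = 20

# Looks up a and b at call time
late_bound = lambda est: new_objective(est=est, a=a, b=b)
# Captures the current values of a and b as default arguments
early_bound = lambda est, a=a, b=b: new_objective(est=est, a=a, b=b)

# Rebinding the names afterwards only affects the late-bound version
a = 0
b = 0

late_result = late_bound(1)
early_result = early_bound(1)
```

The late-bound call sees the rebound values, while the early-bound call still uses 100 and 20.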

Tips

TPOT will not check whether your data is correctly formatted. It assumes that you have passed in operators that can handle the type of data provided. For instance, if you pass in a pandas dataframe with categorical features and missing data, you should also include operators in your configuration that can handle those features of the data. Alternatively, if you pass in preprocessing=True, TPOT will impute missing values, one-hot encode categorical features, and then standardize the data. (Note that this preprocessing is currently fitted and applied on the entire training set before splitting for CV. Later there will be an option to apply it per fold and to have the parameters be learnable.)

Setting verbose to 5 can be helpful during debugging as it will print out the error generated by failing pipelines.
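The note about preprocessing being fitted on the entire training set before the CV split matters because the fitted statistics then include the validation rows. The standard-library sketch below illustrates the effect on a single made-up feature; the helper names and the data are hypothetical, and this is not how TPOT implements its preprocessing.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Learn the mean and population standard deviation of a feature."""
    return mean(values), pstdev(values)


def transform(values, params):
    """Standardize values with previously fitted parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Made-up feature with one extreme value that ends up in the validation fold
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
train_fold, validation_fold = feature[:4], feature[4:]

# Fitted on all rows before splitting, as described in the note above
global_params = fit_standardizer(feature)
# Fitted on the training fold only, the per-fold alternative
fold_params = fit_standardizer(train_fold)

global_scaled = transform(validation_fold, global_params)
fold_scaled = transform(validation_fold, fold_params)
```

With global fitting, the extreme validation value shifts the learned mean and spread, so it looks far less unusual after scaling than it does under per-fold fitting.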

Contributing to TPOT

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.

Citing TPOT

If you use TPOT in a scientific publication, please consider citing at least one of the following papers:

Trang T. Le, Weixuan Fu and Jason H. Moore (2020). Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics, 36(1): 250-256.

BibTeX entry:

@article{le2020scaling,
  title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},
  author={Le, Trang T and Fu, Weixuan and Moore, Jason H},
  journal={Bioinformatics},
  volume={36},
  number={1},
  pages={250--256},
  year={2020},
  publisher={Oxford University Press}
}

Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). Automating biomedical data science through tree-based pipeline optimization. Applications of Evolutionary Computation, pages 123-137.

BibTeX entry:

@inbook{Olson2016EvoBio,
    author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},
    editor={Squillero, Giovanni and Burelli, Paolo},
    chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},
    title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},
    year={2016},
    publisher={Springer International Publishing},
    pages={123--137},
    isbn={978-3-319-31204-0},
    doi={10.1007/978-3-319-31204-0_9},
    url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}
}

Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.

BibTeX entry:

@inproceedings{OlsonGECCO2016,
    author = {Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. and Moore, Jason H.},
    title = {Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},
    booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference 2016},
    series = {GECCO '16},
    year = {2016},
    isbn = {978-1-4503-4206-3},
    location = {Denver, Colorado, USA},
    pages = {485--492},
    numpages = {8},
    url = {http://doi.acm.org/10.1145/2908812.2908918},
    doi = {10.1145/2908812.2908918},
    acmid = {2908918},
    publisher = {ACM},
    address = {New York, NY, USA},
}

Support for TPOT

TPOT was developed in the Artificial Intelligence Innovation (A2I) Lab at Cedars-Sinai with funding from the NIH under grants U01 AG066833 and R01 LM010098. We are incredibly grateful for the support of the NIH and Cedars-Sinai during the development of this project.

The TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.
