featuretools

An open source Python library for automated feature engineering

7,474

902

7,474

158

Top Related Projects

feature_engine

2,102

Feature engineering package with sklearn like functionality

scikit-learn

62,466

scikit-learn: machine learning in Python

mljar-supervised

3,183

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation

tpot

9,958

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

alibi

2,540

Algorithms for explaining machine learning models

dowhy

7,619

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.

Quick Overview

Featuretools is an open-source Python library for automated feature engineering. It enables data scientists and machine learning practitioners to automatically create meaningful features from temporal and relational datasets, significantly reducing the time and effort required for feature engineering in machine learning projects.
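
To make "features from relational datasets" concrete, here is a minimal plain-Python sketch of the underlying idea: child rows are grouped by their parent key and summarized into one feature row per parent. This is an illustration only, not Featuretools' implementation; the toy tables and the helper name aggregate_child_rows are assumptions made up for this example, and the feature names merely imitate the library's naming style.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: one parent table (customers) and one child table (transactions).
customers = [{"customer_id": 1}, {"customer_id": 2}]
transactions = [
    {"transaction_id": 10, "customer_id": 1, "amount": 20.0},
    {"transaction_id": 11, "customer_id": 1, "amount": 30.0},
    {"transaction_id": 12, "customer_id": 2, "amount": 5.0},
]


def aggregate_child_rows(parents, children, key, value):
    """Build one feature row per parent by aggregating its child rows."""
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row[value])

    features = {}
    for parent in parents:
        values = grouped.get(parent[key], [])
        features[parent[key]] = {
            "COUNT(transactions)": len(values),
            f"SUM(transactions.{value})": sum(values),
            # A parent without child rows gets no mean rather than an error.
            f"MEAN(transactions.{value})": mean(values) if values else None,
        }
    return features


feature_rows = aggregate_child_rows(customers, transactions, "customer_id", "amount")
print(feature_rows)
```

Featuretools applies this kind of aggregation automatically, for many primitives at once and across every relationship in the EntitySet.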

Pros

Automates the feature engineering process, saving time and effort
Handles complex relational and temporal data structures
Provides a high-level API for easy integration into existing workflows
Supports parallel processing for improved performance on large datasets

Cons

May generate a large number of features, requiring additional feature selection steps
Learning curve for understanding and effectively using Deep Feature Synthesis
Can be computationally intensive for very large datasets
Generated features may sometimes lack interpretability
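
The first drawback is usually handled by pruning columns that carry no signal. As a rough sketch of what a low-information filter does (keep only columns with at least two distinct non-null values), assuming the feature matrix is represented as a list of dicts; the function name drop_low_information_columns and the sample rows are hypothetical and not part of Featuretools:

```python
def drop_low_information_columns(rows):
    """Keep only columns that have at least two distinct non-null values."""
    columns = list(rows[0]) if rows else []
    keep = [
        col
        for col in columns
        if len({row[col] for row in rows if row[col] is not None}) > 1
    ]
    return [{col: row[col] for col in keep} for row in rows]


# Hypothetical feature rows: one constant column and one all-null column.
feature_rows = [
    {"COUNT(transactions)": 3, "MODE(sessions.device)": "desktop",
     "YEAR(join_date)": 2008, "empty": None},
    {"COUNT(transactions)": 5, "MODE(sessions.device)": "mobile",
     "YEAR(join_date)": 2008, "empty": None},
]

filtered = drop_low_information_columns(feature_rows)
print(list(filtered[0]))
```

Only the two columns that actually vary between rows survive; the constant and all-null columns are dropped.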

Code Examples

Creating a feature matrix from relational data:

import featuretools as ft

# Create an EntitySet from your data
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(dataframe=customers_df, dataframe_name="customers", index="customer_id")
es = es.add_dataframe(dataframe=transactions_df, dataframe_name="transactions", index="transaction_id")

# Add a relationship between the dataframes
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")

# Run Deep Feature Synthesis
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                                      trans_primitives=["cum_sum", "time_since_previous"],
                                      agg_primitives=["sum", "mean", "count"])

Using custom primitives:

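import numpy as np
import featuretools as ft
from featuretools.primitives import AggregationPrimitive
from woodwork.column_schema import ColumnSchema

# In Featuretools 1.0+, custom primitives subclass AggregationPrimitive
# and describe their inputs and outputs with Woodwork column schemas
class CustomMean(AggregationPrimitive):
    name = "custom_mean"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def custom_mean(values):
            # Mean of the non-missing values
            return np.nanmean(values)
        return custom_mean

feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                                      agg_primitives=[CustomMean])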

Feature selection using Featuretools:

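from featuretools.selection import (
    remove_highly_correlated_features,
    remove_low_information_features,
)

# Remove features with low information content (all null or a single unique value)
filtered_feature_matrix, filtered_feature_defs = remove_low_information_features(
    feature_matrix, features=feature_defs
)

# featuretools.selection provides no mutual-information top-k selector;
# drop highly correlated features instead (scikit-learn's SelectKBest can rank by mutual information)
selected_feature_matrix, selected_feature_defs = remove_highly_correlated_features(
    filtered_feature_matrix, features=filtered_feature_defs
)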

Getting Started

To get started with Featuretools:

Install the library:

pip install featuretools

Import and use in your Python script:

import featuretools as ft

# Create an EntitySet
es = ft.EntitySet(id="my_dataset")

# Add your data and relationships
es = es.add_dataframe(dataframe=your_df, dataframe_name="main", index="id")

# Run Deep Feature Synthesis
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="main",
                                      trans_primitives=["cum_sum", "diff"],
                                      agg_primitives=["sum", "mean", "max"])

# Use the generated features in your machine learning pipeline
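
The call above requests the transform primitives cum_sum and diff alongside aggregation primitives. Unlike aggregations, transform primitives return one value per input row. The following plain-Python sketch shows roughly what those two compute on a single column; it is an illustration under assumed toy data, not the library's code, and it uses None for the missing first difference where a DataFrame would hold NaN.

```python
from itertools import accumulate

# Hypothetical column of transaction amounts, in row order.
amounts = [20.0, 30.0, 5.0]

# cum_sum: running total, one output value per input row.
cum_sum = list(accumulate(amounts))

# diff: difference from the previous row; the first row has no predecessor.
diff = [None] + [curr - prev for prev, curr in zip(amounts, amounts[1:])]

print(cum_sum)
print(diff)
```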

Competitor Comparisons

feature_engine

2,102

Feature engineering package with sklearn like functionality

Pros of Feature Engine

More focused on traditional feature engineering techniques
Easier integration with scikit-learn pipelines
Simpler API for common feature engineering tasks

Cons of Feature Engine

Less automated feature generation capabilities
Smaller community and fewer contributors
Limited support for time series data

Code Comparison

Feature Engine example:

from feature_engine.encoding import OrdinalEncoder

encoder = OrdinalEncoder(encoding_method='ordered')
X_encoded = encoder.fit_transform(X)

Featuretools example:

import featuretools as ft

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      trans_primitives=["cum_sum", "diff"])

Feature Engine focuses on explicit feature engineering steps, while Featuretools offers more automated feature generation. Feature Engine integrates seamlessly with scikit-learn, making it easier to incorporate into existing ML pipelines. Featuretools excels in automated feature discovery, especially for complex datasets with multiple related tables.

Both libraries have their strengths, and the choice between them depends on the specific requirements of your project and the complexity of your data.

scikit-learn

62,466

scikit-learn: machine learning in Python

Pros of scikit-learn

Comprehensive library with a wide range of machine learning algorithms and tools
Well-established, mature project with extensive documentation and community support
Seamless integration with other popular data science libraries in the Python ecosystem

Cons of scikit-learn

Lacks automated feature engineering capabilities
May require more manual preprocessing and feature selection compared to Featuretools
Less focused on time-series data and relational datasets

Code Comparison

Featuretools (automated feature engineering):

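import featuretools as ft
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(dataframe=customers_df, dataframe_name="customers", index="customer_id")
es = es.add_dataframe(dataframe=transactions_df, dataframe_name="transactions", index="transaction_id", time_index="transaction_time")
# The target dataframe must exist in the EntitySet and be related to its child rows
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers", max_depth=2)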

scikit-learn (manual feature engineering):

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
X = StandardScaler().fit_transform(X)
X_selected = SelectKBest(k=10).fit_transform(X, y)

Both libraries are powerful tools for machine learning tasks, but they serve different purposes. Featuretools excels in automated feature engineering, especially for relational and time-series data, while scikit-learn offers a broader range of machine learning algorithms and tools for various tasks.

mljar-supervised

3,183

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation

Pros of mljar-supervised

Focuses on automated machine learning (AutoML) for classification and regression tasks
Provides a simple API for quick model training and deployment
Includes built-in model explanations and feature importance analysis

Cons of mljar-supervised

Limited to tabular data and doesn't offer advanced feature engineering capabilities
May not be as flexible for custom preprocessing steps or complex data transformations
Smaller community and fewer integrations compared to Featuretools

Code Comparison

mljar-supervised:

from supervised import AutoML

automl = AutoML(results_path="automl_results")
automl.fit(X, y)
predictions = automl.predict(X_test)

Featuretools:

import featuretools as ft

es = ft.EntitySet(id="example")
es = es.add_dataframe(dataframe=df, dataframe_name="data")
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="data")

Summary

mljar-supervised is an AutoML tool focused on simplifying the machine learning pipeline for tabular data, while Featuretools specializes in automated feature engineering across various data types. mljar-supervised offers quick model training and deployment with built-in explanations, but may be less flexible for complex data transformations. Featuretools provides more advanced feature engineering capabilities but requires additional steps for model training and evaluation.

tpot

9,958

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Pros of TPOT

Automates the entire machine learning pipeline, including feature selection and model selection
Uses genetic programming to optimize the pipeline, potentially finding better solutions than manual tuning
Supports both classification and regression tasks out of the box

Cons of TPOT

Can be computationally expensive and time-consuming for large datasets or complex problems
Less flexibility in customizing individual steps of the pipeline compared to manual feature engineering
May produce complex pipelines that are difficult to interpret or explain

Code Comparison

TPOT:

from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_pipeline.py')

Featuretools:

import featuretools as ft
es = ft.EntitySet(id="example")
es = es.add_dataframe(dataframe=df, dataframe_name="data", index="id")
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="data")

Both libraries aim to automate aspects of the machine learning process, but TPOT focuses on end-to-end pipeline optimization, while Featuretools specializes in automated feature engineering. TPOT may be more suitable for users looking for a hands-off approach to model building, while Featuretools offers more control over the feature creation process.

alibi

2,540

Algorithms for explaining machine learning models

Pros of Alibi

Focuses on model interpretability and fairness, offering a wide range of explainable AI techniques
Pairs with the companion alibi-detect library, which provides drift and outlier detection for model monitoring
Supports both TensorFlow and PyTorch frameworks

Cons of Alibi

More specialized in scope, primarily dealing with model explanations and fairness
May have a steeper learning curve for users not familiar with explainable AI concepts
Less emphasis on automated feature engineering compared to Featuretools

Code Comparison

Alibi (Anchor explainer):

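from alibi.explainers import AnchorTabular

explainer = AnchorTabular(predict_fn, feature_names)
explainer.fit(X)  # the explainer must be fitted on training data first
explanation = explainer.explain(X[0])  # explains a single instance
print(explanation.anchor)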

Featuretools (Automated feature engineering):

import featuretools as ft

es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(dataframe=df, dataframe_name="customers")
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers")

Both libraries serve different purposes in the machine learning pipeline. Featuretools excels in automated feature engineering, while Alibi focuses on model interpretability and fairness. The choice between them depends on the specific needs of your project.

dowhy

7,619

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.

Pros of DoWhy

Focuses specifically on causal inference and causal effect estimation
Provides a unified framework for causal inference across various methods
Supports multiple estimation methods and sensitivity analyses

Cons of DoWhy

More specialized and narrower in scope compared to Featuretools
May have a steeper learning curve for those new to causal inference
Less extensive documentation and community support

Code Comparison

DoWhy:

import dowhy
from dowhy import CausalModel

model = CausalModel(
    data=data,
    treatment='treatment',
    outcome='outcome',
    graph=graph
)
identified_estimand = model.identify_effect()
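# An estimation method must be named; linear regression on the backdoor set is one option
estimate = model.estimate_effect(identified_estimand, method_name="backdoor.linear_regression")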

Featuretools:

import featuretools as ft

es = ft.EntitySet(id="example")
es = es.add_dataframe(dataframe=data, dataframe_name="data")
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="data")

DoWhy is tailored for causal inference tasks, while Featuretools is designed for automated feature engineering. The code examples highlight their different focuses, with DoWhy centered on causal modeling and Featuretools on generating feature matrices.

README

"One of the holy grails of machine learning is to automate more and more of the feature engineering process." - Pedro Domingos, A Few Useful Things to Know about Machine Learning

Featuretools is a Python library for automated feature engineering. See the documentation for more information.

Installation

Install with pip:

python -m pip install featuretools

or from the Conda-forge channel on conda:

conda install -c conda-forge featuretools

Add-ons

You can install add-ons individually or all at once by running:

python -m pip install "featuretools[complete]"

Premium Primitives - Use Premium Primitives from the premium-primitives repo

python -m pip install "featuretools[premium]"

NLP Primitives - Use Natural Language Primitives from the nlp-primitives repo

python -m pip install "featuretools[nlp]"

Dask Support - Use Dask to run DFS with n_jobs > 1

python -m pip install "featuretools[dask]"

Example

Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.

>>> import featuretools as ft
>>> es = ft.demo.load_mock_customer(return_entityset=True)
>>> es.plot()

Featuretools can automatically create a single table of features for any "target dataframe":

>>> feature_matrix, features_defs = ft.dfs(entityset=es, target_dataframe_name="customers")
>>> feature_matrix.head(5)

            zip_code  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount) MODE(sessions.device)  MIN(transactions.amount)  MAX(transactions.amount)  YEAR(join_date)  SKEW(transactions.amount)  DAY(join_date)                   ...                     SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id                                                                                                                                                                                                                                  ...
1              60091                  131               10                  10236.77               desktop                      5.60                    149.95             2008                   0.070041               1                   ...                                                     169.77                                 0.610052                                   41.95                               791.976505                              175.939423                                 9.299023                                 -0.377150                                5.857976                                        1                                -0.395358
2              02139                  122                8                   9118.81                mobile                      5.81                    149.15             2008                   0.028647              20                   ...                                                     114.85                                 0.492531                                   42.96                               596.243506                              230.333502                                10.925037                                  0.962350                                7.420480                                        1                                -0.470007
3              02139                   78                5                   5758.24               desktop                      6.78                    147.73             2008                   0.070814              10                   ...                                                      64.98                                 0.645728                                   21.77                               369.770121                              471.048551                                 9.819148                                 -0.244976                               12.537259                                        1                                -0.630425
4              60091                  111                8                   8205.28               desktop                      5.73                    149.56             2008                   0.087986              30                   ...                                                      83.53                                 0.516262                                   17.27                               584.673126                              322.883448                                13.065436                                 -0.548969                               12.738488                                        1                                -0.497169
5              02139                   58                4                   4571.37                tablet                      5.91                    148.17             2008                   0.085883              19                   ...                                                      73.09                                 0.830112                                   27.46                               313.448942                              198.522508                                 8.950528                                  0.098885                                5.599228                                        1                                -0.396571

[5 rows x 69 columns]

We now have a feature vector for each customer that can be used for machine learning. See the documentation on Deep Feature Synthesis for more examples.
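
Column names in the output above such as SUM(sessions.MEAN(transactions.amount)) are stacked features: a primitive applied to the result of another primitive one relationship further down. A minimal plain-Python sketch of that stacking follows, assuming toy sessions and transactions rows invented for this illustration (the values are not taken from the mock customer dataset, and this is not how Featuretools computes them internally).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows: (session_id, customer_id) and (session_id, amount).
sessions = [(1, "A"), (2, "A"), (3, "B")]
transactions = [(1, 10.0), (1, 20.0), (2, 40.0), (3, 5.0), (3, 7.0)]

# Depth 1: MEAN(transactions.amount) for each session.
amounts_by_session = defaultdict(list)
for session_id, amount in transactions:
    amounts_by_session[session_id].append(amount)
mean_amount_per_session = {
    session_id: mean(amounts) for session_id, amounts in amounts_by_session.items()
}

# Depth 2: SUM(sessions.MEAN(transactions.amount)) for each customer.
stacked = defaultdict(float)
for session_id, customer_id in sessions:
    stacked[customer_id] += mean_amount_per_session[session_id]

print(dict(stacked))
```

The max_depth argument of ft.dfs limits how many such layers are stacked.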

Featuretools contains many different types of built-in primitives for creating features. If the primitive you need is not included, Featuretools also allows you to define your own custom primitives.

Demos

Predict Next Purchase

Repository | Notebook

In this demonstration, we use a multi-table dataset of 3 million online grocery orders from Instacart to predict what a customer will buy next. We show how to generate features with automated feature engineering and build an accurate machine learning pipeline using Featuretools, which can be reused for multiple prediction problems. For more advanced users, we show how to scale that pipeline to a large dataset using Dask.

For more examples of how to use Featuretools, check out our demos page.

Testing & Development

The Featuretools community welcomes pull requests. Instructions for testing and development are available here.

Support

The Featuretools community is happy to provide support to users of Featuretools. Project support can be found in four places depending on the type of question:

For usage questions, use Stack Overflow with the featuretools tag.
For bugs, issues, or feature requests, start a GitHub issue.
For discussion regarding development on the core library, use Slack.
For everything else, the core developers can be reached by email at open_source_support@alteryx.com

Citing Featuretools

If you use Featuretools, please consider citing the following paper:

James Max Kanter, Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. IEEE DSAA 2015.

BibTeX entry:

@inproceedings{kanter2015deep,
  author    = {James Max Kanter and Kalyan Veeramachaneni},
  title     = {Deep feature synthesis: Towards automating data science endeavors},
  booktitle = {2015 {IEEE} International Conference on Data Science and Advanced Analytics, DSAA 2015, Paris, France, October 19-21, 2015},
  pages     = {1--10},
  year      = {2015},
  organization={IEEE}
}

Built at Alteryx

Featuretools is an open source project maintained by Alteryx. To see the other open source projects we're working on, visit Alteryx Open Source. If building impactful data science pipelines is important to you or your business, please get in touch.
