Top Related Projects
- Feature Engine: Feature engineering package with sklearn-like functionality
- scikit-learn: machine learning in Python
- mljar-supervised: Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
- TPOT: A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
- Alibi: Algorithms for explaining machine learning models
- DoWhy: a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
Quick Overview
Featuretools is an open-source Python library for automated feature engineering. It enables data scientists and machine learning practitioners to automatically create meaningful features from temporal and relational datasets, significantly reducing the time and effort required for feature engineering in machine learning projects.
Pros
- Automates the feature engineering process, saving time and effort
- Handles complex relational and temporal data structures
- Provides a high-level API for easy integration into existing workflows
- Supports parallel processing for improved performance on large datasets
Cons
- May generate a large number of features, requiring additional feature selection steps
- Learning curve for understanding and effectively using Deep Feature Synthesis
- Can be computationally intensive for very large datasets
- Generated features may sometimes lack interpretability
Code Examples
- Creating a feature matrix from relational data:
import featuretools as ft
# Create an EntitySet from your data
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(dataframe=customers_df, dataframe_name="customers", index="customer_id")
# A time index (assuming transactions_df has a transaction_time column) lets time-based primitives apply
es = es.add_dataframe(dataframe=transactions_df, dataframe_name="transactions", index="transaction_id",
                      time_index="transaction_time")
# Add a relationship between the dataframes (parent first, then child)
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")
# Run Deep Feature Synthesis
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                                      trans_primitives=["cum_sum", "time_since_previous"],
                                      agg_primitives=["sum", "mean", "count"])
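To make the primitive names concrete, the sketch below reproduces by hand, in plain pandas, roughly what the `sum`, `mean`, and `count` aggregation primitives and the `cum_sum` transform primitive compute. The tiny `transactions_df` and its column names are made up for illustration, and this is not how Featuretools is implemented internally.

```python
import pandas as pd

# Hypothetical child table: one row per transaction.
transactions_df = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4, 5],
        "customer_id": [10, 10, 11, 11, 11],
        "amount": [5.0, 15.0, 2.0, 4.0, 6.0],
    }
)

# Aggregation primitives collapse child rows into one value per parent row.
manual_features = transactions_df.groupby("customer_id")["amount"].agg(
    ["sum", "mean", "count"]
)
# Rename to mirror the naming style of generated features.
manual_features.columns = [
    "SUM(transactions.amount)",
    "MEAN(transactions.amount)",
    "COUNT(transactions)",
]

# A transform primitive keeps one value per input row; here, a running total
# over the column in row order.
transactions_df["CUM_SUM(amount)"] = transactions_df["amount"].cumsum()

print(manual_features)
```

The difference in output shape is the main point: aggregations produce one row per customer, while transforms produce one value per transaction.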
- Using custom primitives:
from featuretools.primitives import AggregationPrimitive
from woodwork.column_schema import ColumnSchema

# Featuretools 1.0+ defines custom primitives as classes
class CustomMean(AggregationPrimitive):
    name = "custom_mean"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def custom_mean(values):
            # Mean of the non-missing values
            return values.dropna().mean()
        return custom_mean

feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                                      agg_primitives=[CustomMean])
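The function behind a custom aggregation receives the child values for one parent row and returns a single number. One way to sanity-check such a function before wiring it into DFS is to apply it with a plain pandas `groupby`, as in this sketch. The data is made up. Note that pandas stores a missing float as NaN, so a check such as `v is not None` would not filter it out, which is why `pd.notna` is used here.

```python
import numpy as np
import pandas as pd


def custom_mean(values):
    """Mean that skips missing entries (None or NaN)."""
    return np.mean([v for v in values if pd.notna(v)])


# Hypothetical amounts with one missing value for customer 1.
amounts = pd.DataFrame(
    {"customer_id": [1, 1, 1, 2, 2], "amount": [10.0, None, 20.0, 4.0, 6.0]}
)

# Apply the aggregation once per customer, as an aggregation primitive would.
per_customer = amounts.groupby("customer_id")["amount"].agg(custom_mean)
print(per_customer)
```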
- Feature selection using Featuretools:
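from featuretools.selection import (
    remove_highly_correlated_features,
    remove_highly_null_features,
    remove_low_information_features,
)
# Remove features that are all null or have only one unique value
filtered_matrix, filtered_defs = remove_low_information_features(feature_matrix, feature_defs)
# Remove features that are mostly null, then features highly correlated with another feature
filtered_matrix, filtered_defs = remove_highly_null_features(filtered_matrix, features=filtered_defs)
filtered_matrix, filtered_defs = remove_highly_correlated_features(filtered_matrix, features=filtered_defs)
# For top-k selection by mutual information, use a separate library such as scikit-learn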
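As a rough illustration of what "low information" means in this context, the following sketch drops columns that are entirely null or hold a single value. The helper name `drop_low_information_columns` and the toy matrix are hypothetical, and the code is an approximation of the idea rather than the library's implementation.

```python
import pandas as pd


def drop_low_information_columns(feature_matrix: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without all-null columns and single-valued columns."""
    keep = [
        column
        for column in feature_matrix.columns
        # nunique() ignores missing values, so an all-null column has 0 unique values.
        if feature_matrix[column].nunique(dropna=True) > 1
    ]
    return feature_matrix[keep].copy()


# Hypothetical feature matrix with one constant and one all-null column.
toy_matrix = pd.DataFrame(
    {
        "COUNT(transactions)": [3, 5, 2],
        "YEAR(join_date)": [2008, 2008, 2008],
        "MODE(sessions.device)": [None, None, None],
    }
)
pruned = drop_low_information_columns(toy_matrix)
print(list(pruned.columns))
```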
Getting Started
To get started with Featuretools:
- Install the library:
pip install featuretools
- Import and use in your Python script:
import featuretools as ft
# Create an EntitySet
es = ft.EntitySet(id="my_dataset")
# Add your data and relationships
es = es.add_dataframe(dataframe=your_df, dataframe_name="main", index="id")
# Run Deep Feature Synthesis
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="main",
                                      trans_primitives=["cum_sum", "diff"],
                                      agg_primitives=["sum", "mean", "max"])
# Use the generated features in your machine learning pipeline
Competitor Comparisons
Feature Engine: Feature engineering package with sklearn-like functionality
Pros of Feature Engine
- More focused on traditional feature engineering techniques
- Easier integration with scikit-learn pipelines
- Simpler API for common feature engineering tasks
Cons of Feature Engine
- Less automated feature generation capabilities
- Smaller community and fewer contributors
- Limited support for time series data
Code Comparison
Feature Engine example:
from feature_engine.encoding import OrdinalEncoder
encoder = OrdinalEncoder(encoding_method='ordered')
# The 'ordered' method ranks categories by target mean, so y is required
X_encoded = encoder.fit_transform(X, y)
Featuretools example:
import featuretools as ft
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      trans_primitives=["cum_sum", "diff"])
Feature Engine focuses on explicit feature engineering steps, while Featuretools offers more automated feature generation. Feature Engine integrates seamlessly with scikit-learn, making it easier to incorporate into existing ML pipelines. Featuretools excels in automated feature discovery, especially for complex datasets with multiple related tables.
Both libraries have their strengths, and the choice between them depends on the specific requirements of your project and the complexity of your data.
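For readers unfamiliar with the "ordered" encoding used in the Feature Engine snippet, here is a minimal pandas sketch of the idea: categories are numbered in ascending order of their mean target value. The column name `colour` and the data are invented, and this is a hand-rolled approximation, not Feature Engine's code.

```python
import pandas as pd

# Hypothetical training data: one categorical feature and a numeric target.
X = pd.DataFrame({"colour": ["red", "blue", "green", "blue", "red", "green"]})
y = pd.Series([10.0, 1.0, 5.0, 3.0, 12.0, 7.0])

# Rank categories by their mean target value, lowest first.
target_means = y.groupby(X["colour"]).mean().sort_values()
mapping = {category: rank for rank, category in enumerate(target_means.index)}

# Replace each category with its rank.
X_encoded = X.assign(colour=X["colour"].map(mapping))
print(mapping)
```

Because the mapping depends on the target, it should be learned on training data only and then reused on validation and test data.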
scikit-learn: machine learning in Python
Pros of scikit-learn
- Comprehensive library with a wide range of machine learning algorithms and tools
- Well-established, mature project with extensive documentation and community support
- Seamless integration with other popular data science libraries in the Python ecosystem
Cons of scikit-learn
- Lacks automated feature engineering capabilities
- May require more manual preprocessing and feature selection compared to Featuretools
- Less focused on time-series data and relational datasets
Code Comparison
Featuretools (automated feature engineering):
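import featuretools as ft
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(dataframe=customers_df, dataframe_name="customers", index="customer_id")
es = es.add_dataframe(dataframe=transactions_df, dataframe_name="transactions", index="transaction_id", time_index="transaction_time")
# The target dataframe must exist in the EntitySet and be linked to its child
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers", max_depth=2)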
scikit-learn (manual feature engineering):
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
X = StandardScaler().fit_transform(X)
X_selected = SelectKBest(k=10).fit_transform(X, y)
Both libraries are powerful tools for machine learning tasks, but they serve different purposes. Featuretools excels in automated feature engineering, especially for relational and time-series data, while scikit-learn offers a broader range of machine learning algorithms and tools for various tasks.
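The scaling step in the scikit-learn snippet is an example of a transformation that has to be chosen and applied by hand. As a sketch of what standardization computes, the function below rescales a list of numbers to zero mean and unit variance using the population standard deviation. The function name `standardize` is made up for this example.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance (population std, ddof=0)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Constant column: every centred value is zero, so skip the division.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


scaled = standardize([2.0, 4.0, 6.0, 8.0])
print(scaled)
```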
mljar-supervised: Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
Pros of mljar-supervised
- Focuses on automated machine learning (AutoML) for classification and regression tasks
- Provides a simple API for quick model training and deployment
- Includes built-in model explanations and feature importance analysis
Cons of mljar-supervised
- Limited to tabular data and doesn't offer advanced feature engineering capabilities
- May not be as flexible for custom preprocessing steps or complex data transformations
- Smaller community and fewer integrations compared to Featuretools
Code Comparison
mljar-supervised:
from supervised import AutoML
automl = AutoML(results_path="automl_results")
automl.fit(X, y)
predictions = automl.predict(X_test)
Featuretools:
import featuretools as ft
es = ft.EntitySet(id="example")
es = es.add_dataframe(dataframe=df, dataframe_name="data")
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="data")
Summary
mljar-supervised is an AutoML tool focused on simplifying the machine learning pipeline for tabular data, while Featuretools specializes in automated feature engineering across various data types. mljar-supervised offers quick model training and deployment with built-in explanations, but may be less flexible for complex data transformations. Featuretools provides more advanced feature engineering capabilities but requires additional steps for model training and evaluation.
TPOT: A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
Pros of TPOT
- Automates the entire machine learning pipeline, including feature selection and model selection
- Uses genetic programming to optimize the pipeline, potentially finding better solutions than manual tuning
- Supports both classification and regression tasks out of the box
Cons of TPOT
- Can be computationally expensive and time-consuming for large datasets or complex problems
- Less flexibility in customizing individual steps of the pipeline compared to manual feature engineering
- May produce complex pipelines that are difficult to interpret or explain
Code Comparison
TPOT:
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_pipeline.py')
Featuretools:
import featuretools as ft
es = ft.EntitySet(id="example")
es = es.add_dataframe(dataframe=df, dataframe_name="data", index="id")
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="data")
Both libraries aim to automate aspects of the machine learning process, but TPOT focuses on end-to-end pipeline optimization, while Featuretools specializes in automated feature engineering. TPOT may be more suitable for users looking for a hands-off approach to model building, while Featuretools offers more control over the feature creation process.
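The `generations` and `population_size` arguments in the TPOT snippet come from evolutionary search. The toy below is a deliberately simplified genetic algorithm over bit lists, not TPOT's genetic programming over pipeline trees. It only illustrates the loop of ranking, keeping survivors, and breeding mutated children. The function `evolve` and the stand-in fitness are assumptions made for this sketch.

```python
import random


def evolve(fitness, genome_length, generations=5, population_size=20, seed=0):
    """Toy genetic search over bit lists; returns the best genome per generation."""
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(genome_length)]
        for _ in range(population_size)
    ]
    best_per_generation = []
    for _ in range(generations):
        # Rank by fitness and keep the top half unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(population[0])
        survivors = population[: population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]  # one-point crossover
            flip = rng.randrange(genome_length)  # single-bit mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        population = survivors + children
    population.sort(key=fitness, reverse=True)
    best_per_generation.append(population[0])
    return best_per_generation


# Stand-in fitness: the number of ones. In an AutoML setting this would be a
# cross-validated score of the pipeline that the genome encodes.
history = evolve(fitness=sum, genome_length=12)
scores = [sum(genome) for genome in history]
print(scores)
```

Because the best individuals are carried over unchanged, the best score can never decrease from one generation to the next. Each fitness evaluation in a real AutoML run means training a model, which is where the computational cost comes from.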
Alibi: Algorithms for explaining machine learning models
Pros of Alibi
- Focuses on model interpretability and fairness, offering a wide range of explainable AI techniques
- Complemented by the sister project Alibi Detect for outlier and drift detection
- Supports both TensorFlow and PyTorch frameworks
Cons of Alibi
- More specialized in scope, primarily dealing with model explanations and fairness
- May have a steeper learning curve for users not familiar with explainable AI concepts
- Less emphasis on automated feature engineering compared to Featuretools
Code Comparison
Alibi (Anchor explainer):
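from alibi.explainers import AnchorTabular
explainer = AnchorTabular(predict_fn, feature_names)
explainer.fit(X_train)  # learns feature statistics used for sampling
explanation = explainer.explain(X_test[0])  # explains a single instance
print(explanation.anchor)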
Featuretools (Automated feature engineering):
import featuretools as ft
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(dataframe=df, dataframe_name="customers")
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers")
Both libraries serve different purposes in the machine learning pipeline. Featuretools excels in automated feature engineering, while Alibi focuses on model interpretability and fairness. The choice between them depends on the specific needs of your project.
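An anchor explanation is a rule (a conjunction of feature conditions) under which the model's prediction stays the same with high probability. The sketch below scores one candidate rule on a small fixed sample. It is a simplification: real anchor search estimates precision on perturbed samples rather than a fixed dataset. The function name, the rows, and the predictions are all hypothetical.

```python
def anchor_precision_and_coverage(rows, predictions, anchor, target_prediction):
    """Score a candidate anchor rule against a sample of model predictions.

    anchor maps feature names to required values (a conjunction of conditions).
    """
    matching = [
        pred
        for row, pred in zip(rows, predictions)
        if all(row[name] == value for name, value in anchor.items())
    ]
    # Coverage: share of the sample to which the rule applies.
    coverage = len(matching) / len(rows)
    if not matching:
        return 0.0, coverage
    # Precision: share of covered rows that receive the target prediction.
    precision = sum(pred == target_prediction for pred in matching) / len(matching)
    return precision, coverage


# Hypothetical sample and model outputs.
rows = [
    {"plan": "premium", "country": "US"},
    {"plan": "premium", "country": "DE"},
    {"plan": "basic", "country": "US"},
    {"plan": "premium", "country": "US"},
]
predictions = ["stay", "stay", "churn", "stay"]

precision, coverage = anchor_precision_and_coverage(
    rows, predictions, anchor={"plan": "premium"}, target_prediction="stay"
)
print(precision, coverage)
```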
DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
Pros of DoWhy
- Focuses specifically on causal inference and causal effect estimation
- Provides a unified framework for causal inference across various methods
- Supports multiple estimation methods and sensitivity analyses
Cons of DoWhy
- More specialized and narrower in scope compared to Featuretools
- May have a steeper learning curve for those new to causal inference
- Less extensive documentation and community support
Code Comparison
DoWhy:
from dowhy import CausalModel
model = CausalModel(
    data=data,
    treatment='treatment',
    outcome='outcome',
    graph=graph
)
identified_estimand = model.identify_effect()
# An estimation method must be chosen; linear regression is one built-in option
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.linear_regression")
Featuretools:
import featuretools as ft
es = ft.EntitySet(id="example")
es = es.add_dataframe(dataframe=data, dataframe_name="data")
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="data")
DoWhy is tailored for causal inference tasks, while Featuretools is designed for automated feature engineering. The code examples highlight their different focuses, with DoWhy centered on causal modeling and Featuretools on generating feature matrices.
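To show why causal effect estimation differs from feature generation, here is a small NumPy simulation of one common strategy: adjusting for a confounder with linear regression. It sketches the general idea and is not DoWhy's implementation. The variable names, effect sizes, and data-generating process are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_effect = 2.0

# Simulated data (all made up): w influences both treatment and outcome.
w = rng.normal(size=n)
treatment = (w + rng.normal(size=n) > 0).astype(float)
outcome = true_effect * treatment + 3.0 * w + rng.normal(size=n)


def ols_coefficients(features, target):
    """Least-squares fit with an intercept; returns [intercept, coef_1, ...]."""
    design = np.column_stack([np.ones(len(target)), *features])
    coefficients, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefficients


# Ignoring the confounder mixes its influence into the treatment coefficient.
naive_estimate = ols_coefficients([treatment], outcome)[1]
# Including the confounder as a regressor removes that bias in this linear setup.
adjusted_estimate = ols_coefficients([treatment, w], outcome)[1]
print(f"naive={naive_estimate:.2f} adjusted={adjusted_estimate:.2f}")
```

The naive estimate overstates the effect because treated units also tend to have higher `w`. The adjusted estimate lands close to the simulated value of 2.0.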
README
"One of the holy grails of machine learning is to automate more and more of the feature engineering process." - Pedro Domingos, A Few Useful Things to Know about Machine Learning
Featuretools is a Python library for automated feature engineering. See the documentation for more information.
Installation
Install with pip
python -m pip install featuretools
or from the Conda-forge channel on conda:
conda install -c conda-forge featuretools
Add-ons
You can install add-ons individually or all at once by running:
python -m pip install "featuretools[complete]"
Premium Primitives - Use Premium Primitives from the premium-primitives repo
python -m pip install "featuretools[premium]"
NLP Primitives - Use Natural Language Primitives from the nlp-primitives repo
python -m pip install "featuretools[nlp]"
Dask Support - Use Dask to run DFS with n_jobs > 1
python -m pip install "featuretools[dask]"
Example
Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.
>> import featuretools as ft
>> es = ft.demo.load_mock_customer(return_entityset=True)
>> es.plot()
Featuretools can automatically create a single table of features for any "target dataframe"
>> feature_matrix, features_defs = ft.dfs(entityset=es, target_dataframe_name="customers")
>> feature_matrix.head(5)
zip_code COUNT(transactions) COUNT(sessions) SUM(transactions.amount) MODE(sessions.device) MIN(transactions.amount) MAX(transactions.amount) YEAR(join_date) SKEW(transactions.amount) DAY(join_date) ... SUM(sessions.MIN(transactions.amount)) MAX(sessions.SKEW(transactions.amount)) MAX(sessions.MIN(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) STD(sessions.SUM(transactions.amount)) STD(sessions.MEAN(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) STD(sessions.MAX(transactions.amount)) NUM_UNIQUE(sessions.DAY(session_start)) MIN(sessions.SKEW(transactions.amount))
customer_id ...
1 60091 131 10 10236.77 desktop 5.60 149.95 2008 0.070041 1 ... 169.77 0.610052 41.95 791.976505 175.939423 9.299023 -0.377150 5.857976 1 -0.395358
2 02139 122 8 9118.81 mobile 5.81 149.15 2008 0.028647 20 ... 114.85 0.492531 42.96 596.243506 230.333502 10.925037 0.962350 7.420480 1 -0.470007
3 02139 78 5 5758.24 desktop 6.78 147.73 2008 0.070814 10 ... 64.98 0.645728 21.77 369.770121 471.048551 9.819148 -0.244976 12.537259 1 -0.630425
4 60091 111 8 8205.28 desktop 5.73 149.56 2008 0.087986 30 ... 83.53 0.516262 17.27 584.673126 322.883448 13.065436 -0.548969 12.738488 1 -0.497169
5 02139 58 4 4571.37 tablet 5.91 148.17 2008 0.085883 19 ... 73.09 0.830112 27.46 313.448942 198.522508 8.950528 0.098885 5.599228 1 -0.396571
[5 rows x 69 columns]
We now have a feature vector for each customer that can be used for machine learning. See the documentation on Deep Feature Synthesis for more examples.
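Column names such as `SUM(sessions.MEAN(transactions.amount))` describe stacked aggregations: a value is first computed per session and then aggregated again per customer. The following pandas sketch rebuilds one such feature by hand on a made-up miniature dataset. It illustrates how to read the name and does not use the mock customer data or Featuretools internals.

```python
import pandas as pd

# Hypothetical miniature of the customers -> sessions -> transactions layout.
sessions = pd.DataFrame({"session_id": [1, 2, 3], "customer_id": [1, 1, 2]})
transactions = pd.DataFrame(
    {
        "session_id": [1, 1, 2, 3, 3],
        "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
    }
)

# Depth 1: aggregate transactions up to their session.
session_mean = (
    transactions.groupby("session_id")["amount"]
    .mean()
    .rename("MEAN(transactions.amount)")
)
sessions = sessions.join(session_mean, on="session_id")

# Depth 2: aggregate the session-level feature up to the customer.
stacked = (
    sessions.groupby("customer_id")["MEAN(transactions.amount)"]
    .sum()
    .rename("SUM(sessions.MEAN(transactions.amount))")
)
print(stacked)
```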
Featuretools contains many different types of built-in primitives for creating features. If the primitive you need is not included, Featuretools also allows you to define your own custom primitives.
Demos
Predict Next Purchase
In this demonstration, we use a multi-table dataset of 3 million online grocery orders from Instacart to predict what a customer will buy next. We show how to generate features with automated feature engineering and build an accurate machine learning pipeline using Featuretools, which can be reused for multiple prediction problems. For more advanced users, we show how to scale that pipeline to a large dataset using Dask.
For more examples of how to use Featuretools, check out our demos page.
Testing & Development
The Featuretools community welcomes pull requests. Instructions for testing and development are available here.
Support
The Featuretools community is happy to provide support to users of Featuretools. Project support can be found in four places depending on the type of question:
- For usage questions, use Stack Overflow with the featuretools tag.
- For bugs, issues, or feature requests, start a GitHub issue.
- For discussion regarding development on the core library, use Slack.
- For everything else, the core developers can be reached by email at open_source_support@alteryx.com
Citing Featuretools
If you use Featuretools, please consider citing the following paper:
James Max Kanter, Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. IEEE DSAA 2015.
BibTeX entry:
@inproceedings{kanter2015deep,
author = {James Max Kanter and Kalyan Veeramachaneni},
title = {Deep feature synthesis: Towards automating data science endeavors},
booktitle = {2015 {IEEE} International Conference on Data Science and Advanced Analytics, DSAA 2015, Paris, France, October 19-21, 2015},
pages = {1--10},
year = {2015},
organization={IEEE}
}
Built at Alteryx
Featuretools is an open source project maintained by Alteryx. To see the other open source projects we're working on, visit Alteryx Open Source. If building impactful data science pipelines is important to you or your business, please get in touch.