feature_engine

Feature engineering package with sklearn-like functionality

Top Related Projects

scikit-learn: machine learning in Python

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

Fit interpretable models. Explain blackbox machine learning.

An open source python library for automated feature engineering

Algorithms for explaining machine learning models

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Quick Overview

Feature-engine is a Python library for feature engineering and selection. It provides a collection of transformers for encoding categorical variables, handling missing data, scaling numerical features, and creating new features. The library is designed to work seamlessly with scikit-learn pipelines and follows a similar API structure.

Pros

  • Comprehensive set of feature engineering techniques in one library
  • Consistent API design, compatible with scikit-learn pipelines
  • Extensive documentation and examples
  • Supports both numerical and categorical data transformations

Cons

  • May have a steeper learning curve for beginners compared to simpler libraries
  • Some advanced techniques might require additional understanding of the underlying concepts
  • Limited to Python ecosystem, not available in other programming languages

Code Examples

  1. Encoding categorical variables:
from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(top_categories=5, variables=['category'])
X_encoded = encoder.fit_transform(X)
  2. Handling missing data:
from feature_engine.imputation import MeanMedianImputer

imputer = MeanMedianImputer(imputation_method='median', variables=['age', 'income'])
X_imputed = imputer.fit_transform(X)
  3. Creating new features:
from feature_engine.creation import RelativeFeatures

# RelativeFeatures replaced the deprecated CombineWithReferenceFeature
combiner = RelativeFeatures(
    variables=['height', 'weight'],
    reference=['age'],
    func=['div', 'sub']
)
X_new_features = combiner.fit_transform(X)

Getting Started

To get started with Feature-engine, follow these steps:

  1. Install the library:
pip install feature_engine
  2. Import the necessary modules and create a transformer:
from feature_engine.encoding import OrdinalEncoder

# Create an ordinal encoder for categorical variables
encoder = OrdinalEncoder(encoding_method='ordered', variables=['category1', 'category2'])

# The 'ordered' method ranks categories by the target mean, so pass y when fitting
X_encoded = encoder.fit_transform(X, y)
  3. Use Feature-engine in a scikit-learn pipeline:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from feature_engine.selection import DropConstantFeatures

# Create a pipeline with feature selection and model
pipeline = Pipeline([
    ('drop_constant', DropConstantFeatures()),
    ('ordinal_encoder', OrdinalEncoder(encoding_method='ordered', variables=['category1', 'category2'])),
    ('logistic_regression', LogisticRegression())
])

# Fit the pipeline to your data
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

Competitor Comparisons

scikit-learn: machine learning in Python

Pros of scikit-learn

  • Comprehensive machine learning library with a wide range of algorithms and tools
  • Well-established, large community support, and extensive documentation
  • Seamless integration with other scientific Python libraries (NumPy, SciPy, Pandas)

Cons of scikit-learn

  • Can be overwhelming for beginners due to its extensive functionality
  • Limited focus on feature engineering compared to specialized libraries
  • May require additional preprocessing steps for certain feature transformations

Code Comparison

scikit-learn:

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

scaler = StandardScaler()
imputer = SimpleImputer(strategy='mean')

X_scaled = scaler.fit_transform(X)
X_imputed = imputer.fit_transform(X)

feature-engine:

from sklearn.preprocessing import StandardScaler
from feature_engine.imputation import MeanMedianImputer
from feature_engine.wrappers import SklearnTransformerWrapper

imputer = MeanMedianImputer(imputation_method='mean')
scaler = SklearnTransformerWrapper(StandardScaler())

X_transformed = imputer.fit_transform(X)
X_transformed = scaler.fit_transform(X_transformed)

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

Pros of ydata-profiling

  • Provides comprehensive data profiling and reporting capabilities
  • Generates interactive HTML reports for easy data exploration
  • Supports various data types including categorical, numerical, and time series

Cons of ydata-profiling

  • Primarily focused on data analysis rather than feature engineering
  • May be slower for large datasets due to extensive profiling
  • Less flexibility in customizing specific feature transformations

Code Comparison

ydata-profiling:

from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Profiling Report")
profile.to_file("report.html")

feature_engine:

from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(variables=['category_column'])
X_encoded = encoder.fit_transform(X)

Key Differences

  • ydata-profiling excels in data analysis and visualization
  • feature_engine focuses on feature engineering and transformation
  • ydata-profiling generates reports, while feature_engine modifies datasets
  • feature_engine offers more granular control over feature manipulation
  • ydata-profiling is better suited for initial data exploration and understanding

Fit interpretable models. Explain blackbox machine learning.

Pros of Interpret

  • Offers a wide range of interpretability techniques for machine learning models
  • Provides interactive visualizations for model explanations
  • Supports both global and local interpretability methods

Cons of Interpret

  • Steeper learning curve due to its comprehensive nature
  • May have higher computational requirements for complex models
  • Less focused on feature engineering compared to Feature Engine

Code Comparison

Interpret:

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
from interpret.glassbox import ExplainableBoostingClassifier

set_visualize_provider(InlineProvider())
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
ebm_global = ebm.explain_global()

Feature Engine:

from feature_engine.encoding import OneHotEncoder
from feature_engine.imputation import MeanMedianImputer

ohe = OneHotEncoder(variables=['category'])
imputer = MeanMedianImputer(imputation_method='median', variables=['age'])

X_transformed = ohe.fit_transform(X)
X_transformed = imputer.fit_transform(X_transformed)

Both libraries serve different primary purposes: Interpret focuses on model interpretability, while Feature Engine specializes in feature engineering. Interpret is better suited for explaining complex models, while Feature Engine excels at preparing and transforming data for model input.
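
Because EBMs follow the scikit-learn API, the two libraries can also be combined, with Feature-engine preparing the inputs inside a single pipeline. A minimal sketch (a DataFrame X_train with a 'category' column and a target y_train are assumed):

from sklearn.pipeline import Pipeline
from interpret.glassbox import ExplainableBoostingClassifier
from feature_engine.encoding import OneHotEncoder

# Feature-engine encodes the categoricals; the EBM stays interpretable
pipe = Pipeline([
    ('encode', OneHotEncoder(variables=['category'])),
    ('ebm', ExplainableBoostingClassifier()),
])
pipe.fit(X_train, y_train)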

An open source python library for automated feature engineering

Pros of Featuretools

  • Automated feature engineering with Deep Feature Synthesis
  • Handles time-dependent data and relational datasets
  • Extensive documentation and tutorials

Cons of Featuretools

  • Steeper learning curve for complex datasets
  • May generate a large number of features, requiring additional feature selection
  • Performance can be slower for very large datasets

Code Comparison

Feature_engine example:

from feature_engine.encoding import OrdinalEncoder

encoder = OrdinalEncoder(encoding_method='ordered')
X_encoded = encoder.fit_transform(X, y)  # 'ordered' ranks categories by target mean

Featuretools example:

import featuretools as ft

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity='customers',
                                      trans_primitives=['cum_sum', 'diff'])

Feature_engine focuses on individual feature engineering tasks, while Featuretools offers automated feature generation across multiple related datasets. Feature_engine provides more control over specific transformations, whereas Featuretools excels at discovering complex relationships and creating features automatically. Both libraries have their strengths and can be used complementarily depending on the project requirements and dataset complexity.
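
Used together, Featuretools can generate a wide candidate matrix and Feature_engine can prune it. A hedged sketch (reusing the entityset es and the dfs call from the example above, and assuming the resulting matrix is numeric):

import featuretools as ft
from feature_engine.selection import DropConstantFeatures, DropCorrelatedFeatures

# Generate candidate features automatically, then drop redundant ones
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='customers')
feature_matrix = DropConstantFeatures().fit_transform(feature_matrix)
feature_matrix = DropCorrelatedFeatures(threshold=0.9).fit_transform(feature_matrix)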

Algorithms for explaining machine learning models

Pros of Alibi

  • Focuses on machine learning model interpretability and fairness
  • Provides advanced algorithms for explaining black-box models
  • Offers tools for detecting concept drift in production models

Cons of Alibi

  • Steeper learning curve due to its focus on complex ML concepts
  • Less emphasis on feature engineering and preprocessing

Code Comparison

Feature Engine example:

from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(variables=['category'])
X_encoded = encoder.fit_transform(X)

Alibi example:

from alibi.explainers import AnchorTabular

explainer = AnchorTabular(predict_fn, feature_names)
explainer.fit(X_train)  # fit on training data before explaining instances
explanation = explainer.explain(X[0])

Summary

Feature Engine specializes in feature engineering and preprocessing, offering a wide range of transformers for data preparation. It's user-friendly and integrates well with scikit-learn pipelines.

Alibi, on the other hand, focuses on model interpretability, fairness, and monitoring. It provides advanced algorithms for explaining black-box models and detecting concept drift in production environments.

While Feature Engine is ideal for data scientists looking to streamline their feature engineering process, Alibi caters to those working with complex models who need to understand and explain model decisions, especially in sensitive or regulated domains.

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Pros of TPOT

  • Automates the entire machine learning pipeline, including feature selection and model selection
  • Uses genetic programming to optimize the pipeline, potentially finding better solutions than manual tuning
  • Supports both classification and regression tasks out of the box

Cons of TPOT

  • Can be computationally expensive and time-consuming for large datasets
  • Less control over specific feature engineering steps compared to Feature Engine
  • May produce complex pipelines that are difficult to interpret or explain

Code Comparison

TPOT:

from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
tpot.score(X_test, y_test)

Feature Engine:

from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=['category_column'])
X_encoded = encoder.fit_transform(X)

TPOT focuses on automating the entire machine learning pipeline, while Feature Engine provides more granular control over specific feature engineering tasks. TPOT is better suited for quick prototyping and exploring various model architectures, while Feature Engine allows for more precise feature manipulation and transformation.
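
The two can also be chained: Feature Engine handles the categorical encoding explicitly, and TPOT searches over the rest of the pipeline. A rough sketch (X_train with a 'category_column' and targets y_train/y_test are assumed):

from tpot import TPOTClassifier
from feature_engine.encoding import OneHotEncoder

# Encode categoricals up front, then let TPOT optimize the model pipeline
encoder = OneHotEncoder(variables=['category_column'])
X_train_enc = encoder.fit_transform(X_train)
X_test_enc = encoder.transform(X_test)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train_enc, y_train)
print(tpot.score(X_test_enc, y_test))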

README

Feature-engine

Feature-engine is a Python library with multiple transformers to engineer and select features for use in machine learning models. Feature-engine's transformers follow Scikit-learn's API, with fit() and transform() methods that learn the transformation parameters from the data and then transform it.
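
For example, a transformer learns its parameters from the training data in fit() and applies them in transform(). A minimal sketch on a toy DataFrame:

import pandas as pd
from feature_engine.imputation import MeanMedianImputer

X = pd.DataFrame({'age': [20, 30, None, 50]})
imputer = MeanMedianImputer(imputation_method='median', variables=['age'])
imputer.fit(X)               # learns the median of 'age'
print(imputer.transform(X))  # fills the missing value with it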

Feature-engine features in the following resources

Blogs about Feature-engine

Documentation

Pst! How did you find us?

We want to share Feature-engine with more people. It'd help us loads if you tell us how you discovered us.

Then we'd know what we are doing right and which channels to use to share the love.

Please share your story by answering 1 quick question at this link. 😃

Feature-engine's transformers currently include functionality for:

  • Missing Data Imputation
  • Categorical Encoding
  • Discretisation
  • Outlier Capping or Removal
  • Variable Transformation
  • Variable Creation
  • Variable Selection
  • Datetime Features
  • Time Series
  • Preprocessing
  • Scikit-learn Wrappers

Imputation Methods

  • MeanMedianImputer
  • ArbitraryNumberImputer
  • RandomSampleImputer
  • EndTailImputer
  • CategoricalImputer
  • AddMissingIndicator
  • DropMissingData
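
For instance, AddMissingIndicator and MeanMedianImputer are often combined, so the model keeps a record of where values were missing. A small sketch on a toy DataFrame:

import pandas as pd
from feature_engine.imputation import AddMissingIndicator, MeanMedianImputer

X = pd.DataFrame({'income': [2500.0, None, 3200.0]})
X = AddMissingIndicator(variables=['income']).fit_transform(X)  # adds an income_na flag
X = MeanMedianImputer(imputation_method='mean', variables=['income']).fit_transform(X)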

Encoding Methods

  • OneHotEncoder
  • OrdinalEncoder
  • CountFrequencyEncoder
  • MeanEncoder
  • WoEEncoder
  • RareLabelEncoder
  • DecisionTreeEncoder
  • StringSimilarityEncoder
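
For example, CountFrequencyEncoder replaces each category with how often it appears in the training data. A quick sketch:

import pandas as pd
from feature_engine.encoding import CountFrequencyEncoder

X = pd.DataFrame({'city': ['Rome', 'Rome', 'Lima', 'Oslo']})
encoder = CountFrequencyEncoder(encoding_method='frequency', variables=['city'])
print(encoder.fit_transform(X))  # Rome -> 0.5, Lima -> 0.25, Oslo -> 0.25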

Discretisation Methods

  • EqualFrequencyDiscretiser
  • EqualWidthDiscretiser
  • GeometricWidthDiscretiser
  • DecisionTreeDiscretiser
  • ArbitraryDiscretiser

Outlier Handling Methods

  • Winsorizer
  • ArbitraryOutlierCapper
  • OutlierTrimmer
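
The Winsorizer, for instance, caps extreme values at boundaries derived from the data. A short sketch using the IQR rule:

import pandas as pd
from feature_engine.outliers import Winsorizer

X = pd.DataFrame({'fare': [7.3, 8.1, 9.0, 10.5, 512.0]})
capper = Winsorizer(capping_method='iqr', tail='right', fold=1.5, variables=['fare'])
print(capper.fit_transform(X))  # 512.0 is capped at the IQR upper bound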

Variable Transformation Methods

  • LogTransformer
  • LogCpTransformer
  • ReciprocalTransformer
  • ArcsinTransformer
  • PowerTransformer
  • BoxCoxTransformer
  • YeoJohnsonTransformer
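
For example, YeoJohnsonTransformer applies a power transform that, unlike Box-Cox, also accepts zero and negative values. A minimal sketch:

import pandas as pd
from feature_engine.transformation import YeoJohnsonTransformer

X = pd.DataFrame({'balance': [-120.0, 0.0, 35.0, 5000.0]})
yjt = YeoJohnsonTransformer(variables=['balance'])
X_t = yjt.fit_transform(X)  # the lambda parameter is learned during fit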

Variable Creation

  • MathFeatures
  • RelativeFeatures
  • CyclicalFeatures
  • DecisionTreeFeatures
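
CyclicalFeatures, for example, maps periodic variables such as months or hours onto sine and cosine components, so December sits next to January. A sketch:

import pandas as pd
from feature_engine.creation import CyclicalFeatures

X = pd.DataFrame({'month': [1, 4, 7, 12]})
cyc = CyclicalFeatures(variables=['month'], drop_original=True)
print(cyc.fit_transform(X))  # adds month_sin and month_cos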

Feature Selection

  • DropFeatures
  • DropConstantFeatures
  • DropDuplicateFeatures
  • DropCorrelatedFeatures
  • SmartCorrelatedSelection
  • SelectByShuffling
  • SelectBySingleFeaturePerformance
  • SelectByTargetMeanPerformance
  • RecursiveFeatureElimination
  • RecursiveFeatureAddition
  • DropHighPSIFeatures
  • SelectByInformationValue
  • ProbeFeatureSelection
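
As an example, DropCorrelatedFeatures keeps one variable from each group of correlated features and drops the rest. A sketch:

import pandas as pd
from feature_engine.selection import DropCorrelatedFeatures

X = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 3, 8, 1]})
selector = DropCorrelatedFeatures(method='pearson', threshold=0.8)
print(selector.fit_transform(X))  # 'b' is dropped: perfectly correlated with 'a'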

Datetime

  • DatetimeFeatures
  • DatetimeSubtraction
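
DatetimeFeatures, for instance, expands a datetime column into numeric features. A sketch:

import pandas as pd
from feature_engine.datetime import DatetimeFeatures

X = pd.DataFrame({'signup': pd.to_datetime(['2023-01-15', '2023-06-30'])})
dtf = DatetimeFeatures(variables=['signup'], features_to_extract=['month', 'day_of_week'])
print(dtf.fit_transform(X))  # adds signup_month and signup_day_of_week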

Time Series

  • LagFeatures
  • WindowFeatures
  • ExpandingWindowFeatures
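
LagFeatures, for example, adds shifted copies of a series so past values can be used to predict future ones. A sketch on a datetime-indexed frame:

import pandas as pd
from feature_engine.timeseries.forecasting import LagFeatures

X = pd.DataFrame({'sales': [10, 12, 15, 11]},
                 index=pd.date_range('2023-01-01', periods=4, freq='D'))
lagger = LagFeatures(variables=['sales'], periods=[1, 2])
print(lagger.fit_transform(X))  # adds sales_lag_1 and sales_lag_2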

Pipelines

  • Pipeline
  • make_pipeline

Preprocessing

  • MatchCategories
  • MatchVariables
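
MatchVariables, for instance, ensures the data seen at predict time has exactly the columns seen during fit, dropping extras and re-adding missing ones. A hedged sketch:

import pandas as pd
from feature_engine.preprocessing import MatchVariables

train = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
test = pd.DataFrame({'a': [5, 6], 'c': [7, 8]})  # 'b' missing, 'c' is new
matcher = MatchVariables(missing_values='ignore')
matcher.fit(train)
print(matcher.transform(test))  # drops 'c' and re-adds 'b' filled with NaN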

Wrappers

  • SklearnTransformerWrapper
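
SklearnTransformerWrapper lets a scikit-learn transformer run on a subset of DataFrame columns while keeping the DataFrame output. A sketch:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper

X = pd.DataFrame({'age': [20, 30, 40], 'city': ['A', 'B', 'A']})
scaler = SklearnTransformerWrapper(transformer=StandardScaler(), variables=['age'])
print(scaler.fit_transform(X))  # 'age' scaled, 'city' untouched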

Installation

From PyPI using pip:

pip install feature_engine

From Anaconda:

conda install -c conda-forge feature_engine

Or simply clone it:

git clone https://github.com/feature-engine/feature_engine.git

Example Usage

>>> import pandas as pd
>>> from feature_engine.encoding import RareLabelEncoder

>>> data = {'var_A': ['A'] * 10 + ['B'] * 10 + ['C'] * 2 + ['D'] * 1}
>>> data = pd.DataFrame(data)
>>> data['var_A'].value_counts()
Out[1]:
A    10
B    10
C     2
D     1
Name: var_A, dtype: int64
>>> rare_encoder = RareLabelEncoder(tol=0.10, n_categories=3)
>>> data_encoded = rare_encoder.fit_transform(data)
>>> data_encoded['var_A'].value_counts()
Out[2]:
A       10
B       10
Rare     3
Name: var_A, dtype: int64

Find more examples in our Jupyter Notebook Gallery or in the documentation.

Contribute

Details about how to contribute can be found in the Contribute Page.

Briefly:

  • Fork the repo
  • Clone your fork onto your local computer:
git clone https://github.com/<YOURUSERNAME>/feature_engine.git
  • Navigate into the repo folder:
cd feature_engine
  • Install Feature-engine as a developer:
pip install -e .
  • Optional: Create and activate a virtual environment with any tool of choice
  • Install Feature-engine dependencies:
pip install -r requirements.txt

and

pip install -r test_requirements.txt
  • Create a feature branch with a meaningful name for your feature:
git checkout -b myfeaturebranch
  • Develop your feature, tests and documentation
  • Make sure the tests pass
  • Make a PR

Thank you!!

Documentation

Feature-engine documentation is built using Sphinx and is hosted on Read the Docs.

To build the documentation, first install the dependencies. From the root directory:

pip install -r docs/requirements.txt

Now you can build the docs using:

sphinx-build -b html docs build

License

The content of this repository is licensed under a BSD 3-Clause license.

Sponsor us

Sponsor us and further support our mission to democratize machine learning and programming tools through open-source software.