feature_engine

Feature engineering package with sklearn-like functionality

Top Related Projects

scikit-learn: machine learning in Python

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

Fit interpretable models. Explain blackbox machine learning.

An open source python library for automated feature engineering

Algorithms for explaining machine learning models

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Quick Overview

Feature-engine is a Python library for feature engineering and selection. It provides a collection of transformers for encoding categorical variables, handling missing data, scaling numerical features, and creating new features. The library is designed to work seamlessly with scikit-learn pipelines and follows a similar API structure.

Pros

  • Comprehensive set of feature engineering techniques in one library
  • Consistent API design, compatible with scikit-learn pipelines
  • Extensive documentation and examples
  • Supports both numerical and categorical data transformations

Cons

  • May have a steeper learning curve for beginners compared to simpler libraries
  • Some advanced techniques might require additional understanding of the underlying concepts
  • Limited to Python ecosystem, not available in other programming languages

Code Examples

  1. Encoding categorical variables:
from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(top_categories=5, variables=['category'])
X_encoded = encoder.fit_transform(X)
  2. Handling missing data:
from feature_engine.imputation import MeanMedianImputer

imputer = MeanMedianImputer(imputation_method='median', variables=['age', 'income'])
X_imputed = imputer.fit_transform(X)
  3. Creating new features:
from feature_engine.creation import RelativeFeatures

# RelativeFeatures replaced the deprecated CombineWithReferenceFeature
combiner = RelativeFeatures(
    variables=['height', 'weight'],
    reference=['age'],
    func=['div', 'sub']
)
X_new_features = combiner.fit_transform(X)

Getting Started

To get started with Feature-engine, follow these steps:

  1. Install the library:
pip install feature_engine
  2. Import the necessary modules and create a transformer:
from feature_engine.encoding import OrdinalEncoder

# Create an ordinal encoder for categorical variables
encoder = OrdinalEncoder(encoding_method='ordered', variables=['category1', 'category2'])

# The 'ordered' method ranks categories by the target mean, so pass y when fitting
X_encoded = encoder.fit_transform(X, y)
  3. Use Feature-engine in a scikit-learn pipeline:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from feature_engine.selection import DropConstantFeatures

# Create a pipeline with feature selection and model
pipeline = Pipeline([
    ('drop_constant', DropConstantFeatures()),
    ('ordinal_encoder', OrdinalEncoder(encoding_method='ordered', variables=['category1', 'category2'])),
    ('logistic_regression', LogisticRegression())
])

# Fit the pipeline to your data
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

Competitor Comparisons

scikit-learn: machine learning in Python

Pros of scikit-learn

  • Comprehensive machine learning library with a wide range of algorithms and tools
  • Well-established, large community support, and extensive documentation
  • Seamless integration with other scientific Python libraries (NumPy, SciPy, Pandas)

Cons of scikit-learn

  • Can be overwhelming for beginners due to its extensive functionality
  • Limited focus on feature engineering compared to specialized libraries
  • May require additional preprocessing steps for certain feature transformations

Code Comparison

scikit-learn:

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

scaler = StandardScaler()
imputer = SimpleImputer(strategy='mean')

X_scaled = scaler.fit_transform(X)
X_imputed = imputer.fit_transform(X)

feature-engine:

from sklearn.preprocessing import StandardScaler
from feature_engine.imputation import MeanMedianImputer
from feature_engine.wrappers import SklearnTransformerWrapper

imputer = MeanMedianImputer(imputation_method='mean')
scaler = SklearnTransformerWrapper(StandardScaler())

X_transformed = imputer.fit_transform(X)
X_transformed = scaler.fit_transform(X_transformed)

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

Pros of ydata-profiling

  • Provides comprehensive data profiling and reporting capabilities
  • Generates interactive HTML reports for easy data exploration
  • Supports various data types including categorical, numerical, and time series

Cons of ydata-profiling

  • Primarily focused on data analysis rather than feature engineering
  • May be slower for large datasets due to extensive profiling
  • Less flexibility in customizing specific feature transformations

Code Comparison

ydata-profiling:

from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Profiling Report")
profile.to_file("report.html")

feature_engine:

from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(variables=['category_column'])
X_encoded = encoder.fit_transform(X)

Key Differences

  • ydata-profiling excels in data analysis and visualization
  • feature_engine focuses on feature engineering and transformation
  • ydata-profiling generates reports, while feature_engine modifies datasets
  • feature_engine offers more granular control over feature manipulation
  • ydata-profiling is better suited for initial data exploration and understanding

Fit interpretable models. Explain blackbox machine learning.

Pros of Interpret

  • Offers a wide range of interpretability techniques for machine learning models
  • Provides interactive visualizations for model explanations
  • Supports both global and local interpretability methods

Cons of Interpret

  • Steeper learning curve due to its comprehensive nature
  • May have higher computational requirements for complex models
  • Less focused on feature engineering compared to Feature Engine

Code Comparison

Interpret:

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
from interpret.glassbox import ExplainableBoostingClassifier

set_visualize_provider(InlineProvider())
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
ebm_global = ebm.explain_global()

Feature Engine:

from feature_engine.encoding import OneHotEncoder
from feature_engine.imputation import MeanMedianImputer

ohe = OneHotEncoder(variables=['category'])
imputer = MeanMedianImputer(imputation_method='median', variables=['age'])

X_transformed = ohe.fit_transform(X)
X_transformed = imputer.fit_transform(X_transformed)

Both libraries serve different primary purposes: Interpret focuses on model interpretability, while Feature Engine specializes in feature engineering. Interpret is better suited for explaining complex models, while Feature Engine excels at preparing and transforming data for model input.
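
Because EBMs follow the scikit-learn API, the two libraries can also be combined, with Feature-engine preparing the inputs inside a single pipeline. A minimal sketch (a DataFrame X_train with a 'category' column and a target y_train are assumed):

from sklearn.pipeline import Pipeline
from interpret.glassbox import ExplainableBoostingClassifier
from feature_engine.encoding import OneHotEncoder

# Feature-engine encodes the categoricals; the EBM stays interpretable
pipe = Pipeline([
    ('encode', OneHotEncoder(variables=['category'])),
    ('ebm', ExplainableBoostingClassifier()),
])
pipe.fit(X_train, y_train)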

An open source python library for automated feature engineering

Pros of Featuretools

  • Automated feature engineering with Deep Feature Synthesis
  • Handles time-dependent data and relational datasets
  • Extensive documentation and tutorials

Cons of Featuretools

  • Steeper learning curve for complex datasets
  • May generate a large number of features, requiring additional feature selection
  • Performance can be slower for very large datasets

Code Comparison

Feature_engine example:

from feature_engine.encoding import OrdinalEncoder

encoder = OrdinalEncoder(encoding_method='ordered')
X_encoded = encoder.fit_transform(X, y)  # 'ordered' ranks categories by target mean

Featuretools example:

import featuretools as ft

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity='customers',
                                      trans_primitives=['cum_sum', 'diff'])

Feature_engine focuses on individual feature engineering tasks, while Featuretools offers automated feature generation across multiple related datasets. Feature_engine provides more control over specific transformations, whereas Featuretools excels at discovering complex relationships and creating features automatically. Both libraries have their strengths and can be used complementarily depending on the project requirements and dataset complexity.
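
Used together, Featuretools can generate a wide candidate matrix and Feature_engine can prune it. A hedged sketch (reusing the entityset es and the dfs call from the example above, and assuming the resulting matrix is numeric):

import featuretools as ft
from feature_engine.selection import DropConstantFeatures, DropCorrelatedFeatures

# Generate candidate features automatically, then drop redundant ones
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='customers')
feature_matrix = DropConstantFeatures().fit_transform(feature_matrix)
feature_matrix = DropCorrelatedFeatures(threshold=0.9).fit_transform(feature_matrix)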

Algorithms for explaining machine learning models

Pros of Alibi

  • Focuses on machine learning model interpretability and fairness
  • Provides advanced algorithms for explaining black-box models
  • Offers tools for detecting concept drift in production models

Cons of Alibi

  • Steeper learning curve due to its focus on complex ML concepts
  • Less emphasis on feature engineering and preprocessing

Code Comparison

Feature Engine example:

from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(variables=['category'])
X_encoded = encoder.fit_transform(X)

Alibi example:

from alibi.explainers import AnchorTabular

explainer = AnchorTabular(predict_fn, feature_names)
explainer.fit(X_train)  # fit on training data before explaining instances
explanation = explainer.explain(X[0])

Summary

Feature Engine specializes in feature engineering and preprocessing, offering a wide range of transformers for data preparation. It's user-friendly and integrates well with scikit-learn pipelines.

Alibi, on the other hand, focuses on model interpretability, fairness, and monitoring. It provides advanced algorithms for explaining black-box models and detecting concept drift in production environments.

While Feature Engine is ideal for data scientists looking to streamline their feature engineering process, Alibi caters to those working with complex models who need to understand and explain model decisions, especially in sensitive or regulated domains.

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Pros of TPOT

  • Automates the entire machine learning pipeline, including feature selection and model selection
  • Uses genetic programming to optimize the pipeline, potentially finding better solutions than manual tuning
  • Supports both classification and regression tasks out of the box

Cons of TPOT

  • Can be computationally expensive and time-consuming for large datasets
  • Less control over specific feature engineering steps compared to Feature Engine
  • May produce complex pipelines that are difficult to interpret or explain

Code Comparison

TPOT:

from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
tpot.score(X_test, y_test)

Feature Engine:

from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=['category_column'])
X_encoded = encoder.fit_transform(X)

TPOT focuses on automating the entire machine learning pipeline, while Feature Engine provides more granular control over specific feature engineering tasks. TPOT is better suited for quick prototyping and exploring various model architectures, while Feature Engine allows for more precise feature manipulation and transformation.
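
The two can also be chained: Feature Engine handles the categorical encoding explicitly, and TPOT searches over the rest of the pipeline. A rough sketch (X_train with a 'category_column' and targets y_train/y_test are assumed):

from tpot import TPOTClassifier
from feature_engine.encoding import OneHotEncoder

# Encode categoricals up front, then let TPOT optimize the model pipeline
encoder = OneHotEncoder(variables=['category_column'])
X_train_enc = encoder.fit_transform(X_train)
X_test_enc = encoder.transform(X_test)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train_enc, y_train)
print(tpot.score(X_test_enc, y_test))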

README

Feature-engine

Feature-engine is a Python library with multiple transformers to engineer and select features for use in machine learning models. Feature-engine's transformers follow Scikit-learn's API, with fit() and transform() methods that learn the transformation parameters from the data and then transform it.
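
For example, a transformer learns its parameters from the training data in fit() and applies them in transform(). A minimal sketch on a toy DataFrame:

import pandas as pd
from feature_engine.imputation import MeanMedianImputer

X = pd.DataFrame({'age': [20, 30, None, 50]})
imputer = MeanMedianImputer(imputation_method='median', variables=['age'])
imputer.fit(X)               # learns the median of 'age'
print(imputer.transform(X))  # fills the missing value with it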

Feature-engine features in the following resources

Blogs about Feature-engine

Documentation

Pst! How did you find us?

We want to share Feature-engine with more people. It'd help us loads if you tell us how you discovered us.

Then we'd know what we are doing right and which channels to use to share the love.

Please share your story by answering 1 quick question at this link. 😃

Feature-engine's transformers currently include functionality for:

  • Missing Data Imputation
  • Categorical Encoding
  • Discretisation
  • Outlier Capping or Removal
  • Variable Transformation
  • Variable Creation
  • Variable Selection
  • Datetime Features
  • Time Series
  • Preprocessing
  • Scikit-learn Wrappers

Imputation Methods

  • MeanMedianImputer
  • ArbitraryNumberImputer
  • RandomSampleImputer
  • EndTailImputer
  • CategoricalImputer
  • AddMissingIndicator
  • DropMissingData
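
For instance, AddMissingIndicator and MeanMedianImputer are often combined, so the model keeps a record of where values were missing. A small sketch on a toy DataFrame:

import pandas as pd
from feature_engine.imputation import AddMissingIndicator, MeanMedianImputer

X = pd.DataFrame({'income': [2500.0, None, 3200.0]})
X = AddMissingIndicator(variables=['income']).fit_transform(X)  # adds an income_na flag
X = MeanMedianImputer(imputation_method='mean', variables=['income']).fit_transform(X)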

Encoding Methods

  • OneHotEncoder
  • OrdinalEncoder
  • CountFrequencyEncoder
  • MeanEncoder
  • WoEEncoder
  • RareLabelEncoder
  • DecisionTreeEncoder
  • StringSimilarityEncoder
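
For example, CountFrequencyEncoder replaces each category with how often it appears in the training data. A quick sketch:

import pandas as pd
from feature_engine.encoding import CountFrequencyEncoder

X = pd.DataFrame({'city': ['Rome', 'Rome', 'Lima', 'Oslo']})
encoder = CountFrequencyEncoder(encoding_method='frequency', variables=['city'])
print(encoder.fit_transform(X))  # Rome -> 0.5, Lima -> 0.25, Oslo -> 0.25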

Discretisation Methods

  • EqualFrequencyDiscretiser
  • EqualWidthDiscretiser
  • GeometricWidthDiscretiser
  • DecisionTreeDiscretiser
  • ArbitraryDiscretiser

Outlier Handling Methods

  • Winsorizer
  • ArbitraryOutlierCapper
  • OutlierTrimmer
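
The Winsorizer, for instance, caps extreme values at boundaries derived from the data. A short sketch using the IQR rule:

import pandas as pd
from feature_engine.outliers import Winsorizer

X = pd.DataFrame({'fare': [7.3, 8.1, 9.0, 10.5, 512.0]})
capper = Winsorizer(capping_method='iqr', tail='right', fold=1.5, variables=['fare'])
print(capper.fit_transform(X))  # 512.0 is capped at the IQR upper bound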

Variable Transformation Methods

  • LogTransformer
  • LogCpTransformer
  • ReciprocalTransformer
  • ArcsinTransformer
  • PowerTransformer
  • BoxCoxTransformer
  • YeoJohnsonTransformer
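
For example, YeoJohnsonTransformer applies a power transform that, unlike Box-Cox, also accepts zero and negative values. A minimal sketch:

import pandas as pd
from feature_engine.transformation import YeoJohnsonTransformer

X = pd.DataFrame({'balance': [-120.0, 0.0, 35.0, 5000.0]})
yjt = YeoJohnsonTransformer(variables=['balance'])
X_t = yjt.fit_transform(X)  # the lambda parameter is learned during fit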

Variable Creation

  • MathFeatures
  • RelativeFeatures
  • CyclicalFeatures
  • DecisionTreeFeatures
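
CyclicalFeatures, for example, maps periodic variables such as months or hours onto sine and cosine components, so December sits next to January. A sketch:

import pandas as pd
from feature_engine.creation import CyclicalFeatures

X = pd.DataFrame({'month': [1, 4, 7, 12]})
cyc = CyclicalFeatures(variables=['month'], drop_original=True)
print(cyc.fit_transform(X))  # adds month_sin and month_cos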

Feature Selection

  • DropFeatures
  • DropConstantFeatures
  • DropDuplicateFeatures
  • DropCorrelatedFeatures
  • SmartCorrelatedSelection
  • SelectByShuffling
  • SelectBySingleFeaturePerformance
  • SelectByTargetMeanPerformance
  • RecursiveFeatureElimination
  • RecursiveFeatureAddition
  • DropHighPSIFeatures
  • SelectByInformationValue
  • ProbeFeatureSelection
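
As an example, DropCorrelatedFeatures keeps one variable from each group of correlated features and drops the rest. A sketch:

import pandas as pd
from feature_engine.selection import DropCorrelatedFeatures

X = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 3, 8, 1]})
selector = DropCorrelatedFeatures(method='pearson', threshold=0.8)
print(selector.fit_transform(X))  # 'b' is dropped: perfectly correlated with 'a'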

Datetime

  • DatetimeFeatures
  • DatetimeSubtraction
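
DatetimeFeatures, for instance, expands a datetime column into numeric features. A sketch:

import pandas as pd
from feature_engine.datetime import DatetimeFeatures

X = pd.DataFrame({'signup': pd.to_datetime(['2023-01-15', '2023-06-30'])})
dtf = DatetimeFeatures(variables=['signup'], features_to_extract=['month', 'day_of_week'])
print(dtf.fit_transform(X))  # adds signup_month and signup_day_of_week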

Time Series

  • LagFeatures
  • WindowFeatures
  • ExpandingWindowFeatures
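
LagFeatures, for example, adds shifted copies of a series so past values can be used to predict future ones. A sketch on a datetime-indexed frame:

import pandas as pd
from feature_engine.timeseries.forecasting import LagFeatures

X = pd.DataFrame({'sales': [10, 12, 15, 11]},
                 index=pd.date_range('2023-01-01', periods=4, freq='D'))
lagger = LagFeatures(variables=['sales'], periods=[1, 2])
print(lagger.fit_transform(X))  # adds sales_lag_1 and sales_lag_2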

Pipelines

  • Pipeline
  • make_pipeline

Preprocessing

  • MatchCategories
  • MatchVariables
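
MatchVariables, for instance, ensures the data seen at predict time has exactly the columns seen during fit, dropping extras and re-adding missing ones. A hedged sketch:

import pandas as pd
from feature_engine.preprocessing import MatchVariables

train = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
test = pd.DataFrame({'a': [5, 6], 'c': [7, 8]})  # 'b' missing, 'c' is new
matcher = MatchVariables(missing_values='ignore')
matcher.fit(train)
print(matcher.transform(test))  # drops 'c' and re-adds 'b' filled with NaN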

Wrappers

  • SklearnTransformerWrapper
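
SklearnTransformerWrapper lets a scikit-learn transformer run on a subset of DataFrame columns while keeping the DataFrame output. A sketch:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper

X = pd.DataFrame({'age': [20, 30, 40], 'city': ['A', 'B', 'A']})
scaler = SklearnTransformerWrapper(transformer=StandardScaler(), variables=['age'])
print(scaler.fit_transform(X))  # 'age' scaled, 'city' untouched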

Installation

From PyPI using pip:

pip install feature_engine

From Anaconda:

conda install -c conda-forge feature_engine

Or simply clone it:

git clone https://github.com/feature-engine/feature_engine.git

Example Usage

>>> import pandas as pd
>>> from feature_engine.encoding import RareLabelEncoder

>>> data = {'var_A': ['A'] * 10 + ['B'] * 10 + ['C'] * 2 + ['D'] * 1}
>>> data = pd.DataFrame(data)
>>> data['var_A'].value_counts()
Out[1]:
A    10
B    10
C     2
D     1
Name: var_A, dtype: int64
>>> rare_encoder = RareLabelEncoder(tol=0.10, n_categories=3)
>>> data_encoded = rare_encoder.fit_transform(data)
>>> data_encoded['var_A'].value_counts()
Out[2]:
A       10
B       10
Rare     3
Name: var_A, dtype: int64

Find more examples in our Jupyter Notebook Gallery or in the documentation.

Contribute

Details about how to contribute can be found in the Contribute Page.

Briefly:

  • Fork the repo
  • Clone your fork onto your local computer:
git clone https://github.com/<YOURUSERNAME>/feature_engine.git
  • Navigate into the repo folder:
cd feature_engine
  • Install Feature-engine as a developer:
pip install -e .
  • Optional: Create and activate a virtual environment with any tool of choice
  • Install Feature-engine dependencies:
pip install -r requirements.txt

and

pip install -r test_requirements.txt
  • Create a feature branch with a meaningful name for your feature:
git checkout -b myfeaturebranch
  • Develop your feature, tests and documentation
  • Make sure the tests pass
  • Make a PR

Thank you!!

Documentation

Feature-engine documentation is built using Sphinx and is hosted on Read the Docs.

To build the documentation, first install the dependencies. From the root directory:

pip install -r docs/requirements.txt

Now you can build the docs using:

sphinx-build -b html docs build

License

The content of this repository is licensed under a BSD 3-Clause license.

Sponsor us

Sponsor us and further support our mission to democratize machine learning and programming tools through open-source software.