Feature-engine#

A user-friendly feature engineering alternative to scikit-learn#

Feature-engine rocks!#

Feature-engine is a Python library with multiple transformers to engineer and select features for machine learning models. It supports imputation, encoding, transformation, discretisation, feature extraction from dates, text, time series, and much more.

Feature-engine, like scikit-learn, uses the methods fit() and transform() to learn parameters from the data and then transform it. As a result, you can use any feature-engine transformer within a scikit-learn pipeline, cross-validation or hyperparameter tuning framework.

Working with dataframes? 👉 Feature-engine is a no-brainer#

Unlike scikit-learn, feature-engine is designed to work with dataframes. No column order or name changes. A dataframe comes in, and the same dataframe comes out, with the transformed variables.

We normally apply different feature engineering processes to different feature subsets. With sklearn, we restrict the feature engineering techniques to a certain group of variables by using an auxiliary class: the ColumnTransformer. This class also results in a change in the name of the variables after the transformation.

Feature-engine, instead, allows you to select the variables you want to transform within each transformer. This way, different engineering procedures can be easily applied to different feature subsets without the need for additional transformers or changes in the feature names.

Sitting at the interface of pandas and scikit-learn#

Pandas is great for data analysis and transformation. We ❤️ it too. But it doesn’t automatically learn and store parameters from the data. And that is key for machine learning. That’s why we created feature-engine.

Feature-engine wraps pandas functionality in scikit-learn-like transformers, capturing much of the pandas logic needed to learn and store parameters, while making these transformations compatible with scikit-learn estimators, selectors, cross-validation functions and hyperparameter search methods.

If your work is primarily data analysis and transformation for machine learning, and pandas and scikit-learn are your main tools, then feature-engine is your friend.

Feature-engine transformers#

Feature-engine includes transformers for:

Missing data imputation
Encoding of categorical features
Discretisation
Outlier capping or removal
Feature transformation
Feature combinations
Feature scaling
Feature extraction from datetime
Feature extraction from text
Feature extraction from time series
Preprocessing
Feature selection

Feature-engine transformers are fully compatible with scikit-learn. That means that you can use feature-engine transformers within a scikit-learn pipeline, or in a grid or random search for hyperparameters. Check **Quick Start** for an example.

What is feature engineering?#

Feature engineering is the process of using domain knowledge and statistical tools to create features for machine learning algorithms. The raw data that we normally gather as part of our business activities is rarely fit to train machine learning models. Instead, data scientists spend a large part of their time on data analysis, preprocessing, and feature engineering.

Pandas is a common library for data preprocessing and feature engineering. It supports pretty much every method that is commonly used to transform raw data. However, pandas is not compatible with sklearn out of the box and is also not able to learn and store the feature engineering parameters.

Feature-engine’s transformers wrap pandas functionality in an API that exposes fit and transform methods to learn and store parameters from data and then use these parameters to transform the variables. In this way, feature-engine makes the awesome functionality available in pandas fully compatible with scikit-learn.

What is unique about feature-engine?#

The following characteristics make feature-engine unique:

Feature-engine contains the most exhaustive collection of feature engineering transformations.
Feature-engine can transform a specific group of variables in the dataframe, right out-of-the-box.
Feature-engine returns dataframes, hence suitable for data analysis and model deployment.
Feature-engine is compatible with the scikit-learn pipeline, grid and random search and cross validation.
Feature-engine automatically recognises numerical, categorical and datetime variables.
Feature-engine alerts you if a transformation is not possible, e.g., if applying logarithm to negative variables or divisions by 0.
Feature-engine supports the widest range of feature selection techniques in the Python ecosystem.

Installation#

Feature-engine is a Python 3 package and works well with 3.9 or later.

The simplest way to install feature-engine is from PyPI with pip:

$ pip install feature-engine

Note that you can also install it using an underscore in the package name:

$ pip install feature_engine

Feature-engine is an active project and routinely publishes new releases. To upgrade feature-engine to the latest version, use pip like this:

$ pip install -U feature-engine

If you’re using Anaconda, you can install the Anaconda feature-engine package:

$ conda install -c conda-forge feature_engine

Feature-engine features in the following tutorials#

Feature Engineering for Machine Learning, online course.
Feature Selection for Machine Learning, online course.
Feature Engineering for Time Series Forecasting, online course.
Python Feature Engineering Cookbook, book.
Feature Selection in Machine Learning with Python, book.

More learning resources in **Learning Resources**.

Feature-engine’s Transformers#

Feature-engine hosts the following groups of transformers:

Missing Data Imputation: Imputers#

Missing data imputation consists of replacing missing values (NaN) in categorical and numerical variables with estimates of those values or with arbitrary data points. Feature-engine supports the following missing data imputation methods:

MeanMedianImputer: replaces missing data in numerical variables with the mean or median
ArbitraryNumberImputer: replaces missing data in numerical variables with an arbitrary number
EndTailImputer: replaces missing data in numerical variables with numbers at the distribution tails
CategoricalImputer: replaces missing data with an arbitrary string or with the most frequent category
RandomSampleImputer: replaces missing data by random sampling observations from the variable
AddMissingIndicator: adds a binary indicator to flag observations with missing data
DropMissingData: removes observations (rows) containing missing values from the dataframe

Categorical Encoders: Encoders#

Categorical encoding is the process of replacing categorical values with numerical values. Most machine learning models, and in particular those supported by scikit-learn, don’t accept strings as inputs. Hence, we need to convert these strings into numbers that can be interpreted by these models.

There are various categorical encoding techniques, including one hot encoding, ordinal encoding and target encoding. Feature-engine supports the following methods:

OneHotEncoder: performs one hot encoding, optionally of the most popular categories only
CountFrequencyEncoder: replaces categories with the observation count or percentage
OrdinalEncoder: replaces categories with integers, either arbitrarily or informed by target
MeanEncoder: replaces categories with the target mean
WoEEncoder: replaces categories with the weight of evidence
DecisionTreeEncoder: replaces categories with predictions of a decision tree
RareLabelEncoder: groups infrequent categories
StringSimilarityEncoder: encodes categories based on string similarity

Variable Discretisation: Discretisers#

Discretisation, or binning, consists in sorting numerical features into discrete intervals. The most commonly used methods are equal-width and equal-frequency discretisation. Feature-engine supports these and more advanced methods, like discretisation with decision trees:

ArbitraryDiscretiser: sorts variable into intervals defined by the user
EqualFrequencyDiscretiser: sorts variable into equal frequency intervals
EqualWidthDiscretiser: sorts variable into equal width intervals
DecisionTreeDiscretiser: uses decision trees to create finite variables
GeometricWidthDiscretiser: sorts variable into geometrical intervals

Outlier Capping or Removal#

Outliers are values that are very different with respect to the distribution observed for the variable. Some machine learning models and statistical tests are sensitive to outliers. In some cases, we may want to remove outliers or replace them with permitted values.

ArbitraryOutlierCapper: caps maximum and minimum values at user defined values
Winsorizer: caps maximum or minimum values using statistical parameters
OutlierTrimmer: removes outliers from the dataset

Numerical Transformation: Transformers#

We normally use variance stabilising transformations to make the data meet the assumptions of certain statistical tests, like ANOVA, and machine learning models, like linear regression. Feature-engine supports the following transformations:

LogTransformer: performs logarithmic transformation of numerical variables
LogCpTransformer: performs logarithmic transformation after adding a constant value
ReciprocalTransformer: performs reciprocal transformation of numerical variables
PowerTransformer: performs power transformation of numerical variables
BoxCoxTransformer: performs Box-Cox transformation of numerical variables
YeoJohnsonTransformer: performs Yeo-Johnson transformation of numerical variables
ArcsinTransformer: performs arcsin transformation of numerical variables
ArcSinhTransformer: applies arcsinh (pseudo-logarithm) transformation of numerical variables

Feature Creation:#

Feature-engine allows you to create new features by combining existing variables mathematically or transforming them with mathematical functions:

MathFeatures: creates new features by applying operations like sum, mean, or product across existing variables
RelativeFeatures: creates features by subtracting or dividing variables by a reference variable
CyclicalFeatures: encodes cyclical variables using sine and cosine transformations
DecisionTreeFeatures: generates features from predictions of decision trees trained on one or more variables.
GeoDistanceFeatures: computes distance-based features from latitude and longitude coordinates

Datetime:#

Data scientists rarely use datetime features in their original representation with machine learning models. Instead, we extract many new features from the date and time parts of the datetime variable:

DatetimeFeatures: extracts features like month, hour, minute, etc., from datetime variables
DatetimeSubtraction: computes elapsed time between datetime variables
DatetimeOrdinal: converts datetime variables into ordinal numbers based on the Gregorian calendar

Text:#

Some datasets contain pieces of text among their variables, like customer feedback, or email content, for example. The most common way to derive numerical variables from texts is through bag of words. In addition, we can extract a lot of metadata to capture text complexity, like number of characters, words and sentences, or lexical diversity.

TextFeatures: extracts numerical features from text/string variables

Feature Selection:#

Simpler models are easier to interpret, deploy, and maintain. Feature-engine expands the feature selection functionality existing in other libraries like scikit-learn and MLXtend, with additional methods:

DropFeatures: drops an arbitrary subset of variables from a dataframe
DropConstantFeatures: drops constant and quasi-constant variables from a dataframe
DropDuplicateFeatures: drops duplicated variables from a dataframe
DropCorrelatedFeatures: drops correlated variables from a dataframe
SmartCorrelatedSelection: selects best features from correlated groups
DropHighPSIFeatures: drops features with a high Population Stability Index (PSI)
SelectByInformationValue: selects features based on their information value
SelectByShuffling: selects features by evaluating model performance after feature shuffling
SelectBySingleFeaturePerformance: selects features based on single feature model performance
SelectByTargetMeanPerformance: selects features based on target mean encoding
RecursiveFeatureElimination: removes features recursively, by evaluating model performance
RecursiveFeatureAddition: adds features recursively, by evaluating model performance
ProbeFeatureSelection: selects features with greater importance than those of random variables
MRMR: selects features based on the Maximum Relevance Minimum Redundancy framework

Forecasting:#

To address forecasting as a regression problem using traditional machine learning algorithms, we first need to transform the time series into a table of static features. We can do this through lags and windows combined with aggregations over past data:

LagFeatures: creates features using lags
WindowFeatures: creates features with mathematical operations over time windows
ExpandingWindowFeatures: creates features from expanding time windows

Preprocessing:#

When transforming variables and doing data cleaning, we usually change the variables’ data types (dtype in pandas). This can cause problems further down the pipeline. To tackle this head on, feature-engine has transformers to ensure the data types and variable names match.

MatchCategories: ensures categorical variables are of type ‘category’
MatchVariables: ensures that columns in test set match those in train set

Feature scaling:#

Scaling the data can help to balance the impact of all variables on the model, and can improve its performance. Scikit-learn offers a comprehensive array of tools to apply data normalisation, standardisation, and min-max scaling, among other methods. If you want to apply these procedures to a subset of the variables only, check out the SklearnTransformerWrapper.

Feature-engine extends scikit-learn’s scaling functionality with the following transformer:

MeanNormalizationScaler: scales variables using mean normalisation

Scikit-learn Wrapper:#

An alternative to scikit-learn’s ColumnTransformer:

SklearnTransformerWrapper: applies scikit-learn transformations to a selected subset of features

Getting Help#

Can’t get something to work? Here are places where you can find help.

The **User Guide** in the docs.
Stack Overflow. If you ask a question, please mention “feature_engine” in it.
If you are enrolled in the Feature Engineering for Machine Learning course, post a question in a relevant section.
If you are enrolled in the Feature Selection for Machine Learning course, post a question in a relevant section.
Ask a question in the repo by filing an issue (check first whether a similar issue already exists).

Contributing#

Interested in contributing to feature-engine? That is great news!

Feature-engine is a welcoming and inclusive project and we would be delighted to have you on board. We follow the Python Software Foundation Code of Conduct.

Regardless of your skill level you can help us. We appreciate bug reports, user testing, feature requests, bug fixes, expansion of our test suite, product enhancements, and documentation improvements. We also appreciate blogs about feature-engine. If you happen to have one, let us know!

For more details on how to contribute, check the **Contribute** guide.

Sponsors#

Feature-engine is supported by Train in Data and JetBrains.

Train in Data is your go-to online school for mastering machine learning. We offer intermediate and advanced courses in Python programming, data science and machine learning, taught by industry experts with extensive experience in developing, optimizing, and deploying machine learning models in enterprise production environments.

We focus on building a solid, intuitive grasp of machine learning concepts, backed by hands-on Python coding to make sure you can actually apply what you learn.

Our approach? Simple: learn the theory, understand the why behind it, then get coding.

Open Source#

Feature-engine is open source, released under the BSD 3-Clause license.

Feature-engine is hosted on GitHub. The issues and pull requests are tracked there.
