Feature Creation#

Feature creation is a common step in data preprocessing that consists of constructing new variables from the dataset’s original features. By combining two or more variables, we can develop new features that improve the performance of a machine learning model, capture additional information or relationships among variables, or simply make more sense within the domain we are working in.

One of the most common feature creation methods in data science is one-hot encoding, a feature engineering technique that transforms a categorical feature into multiple binary variables, one per category.
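For instance, here is a minimal sketch of one-hot encoding with pandas; the color column and its values are made up for illustration:

import pandas as pd

# A toy categorical feature.
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One-hot encode: one binary column per category.
dummies = pd.get_dummies(df["color"], prefix="color")
print(dummies)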

Another common procedure consists of creating new features from past values of time series data, for example through lags and windows.
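As a quick illustration, lag and window features can be computed directly in pandas; the daily sales series below is hypothetical:

import pandas as pd

# Hypothetical daily sales series.
ts = pd.DataFrame(
    {"sales": [10, 12, 9, 15, 14]},
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)

# Lag feature: the previous day's sales.
ts["sales_lag_1"] = ts["sales"].shift(1)

# Window feature: mean sales over the 3 days before each observation.
ts["sales_win_3_mean"] = ts["sales"].shift(1).rolling(window=3).mean()

print(ts)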

In general, creating features requires a good dose of domain knowledge and significant time spent analyzing the raw data, including evaluating the relationship between the independent (predictor) variables and the dependent (target) variable in the dataset.

Feature creation can be one of the more creative aspects of feature engineering, and the new features can help improve a predictive model’s performance.

Lastly, a data scientist should be mindful that creating new features may increase the dimensionality of the dataset quite dramatically. For example, one-hot encoding of highly cardinal categorical features results in many binary variables, and so do high-degree polynomial combinations. This may have downstream effects depending on the machine learning algorithm being used; for example, decision trees are known for not coping well with a huge number of features.
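To see how quickly dimensionality can grow, here is a minimal sketch using scikit-learn’s PolynomialFeatures on made-up data:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# 100 rows with 10 numerical features (random data for illustration).
X = np.random.rand(100, 10)

# Degree-3 polynomial combinations expand 10 features into 286 columns.
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
print(X_poly.shape)  # (100, 286)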

Creating New Features with Feature-engine#

Feature-engine has several transformers that create and add new features to the dataset. One of the most popular is the OneHotEncoder, which creates dummy variables from categorical features.
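A minimal sketch of the OneHotEncoder; the dataframe and its city column are hypothetical:

import pandas as pd
from feature_engine.encoding import OneHotEncoder

# Hypothetical categorical data.
df = pd.DataFrame({"city": ["London", "Paris", "London", "Madrid"]})

# One binary variable per category; drop_last removes a redundant column.
encoder = OneHotEncoder(variables=["city"], drop_last=True)
df_tr = encoder.fit_transform(df)
print(df_tr.head())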

With Feature-engine, we can also create new features from time series data through lags and windows, using LagFeatures or WindowFeatures.
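Here is a minimal sketch, assuming a dataframe with a datetime index; the daily sales values are made up:

import pandas as pd
from feature_engine.timeseries.forecasting import LagFeatures, WindowFeatures

# Hypothetical daily sales with a datetime index.
ts = pd.DataFrame(
    {"sales": [10, 12, 9, 15, 14, 11, 13]},
    index=pd.date_range("2023-01-01", periods=7, freq="D"),
)

# Add the values observed 1 and 2 days earlier.
lagger = LagFeatures(variables=["sales"], periods=[1, 2])
ts_tr = lagger.fit_transform(ts)

# Add the 3-day rolling mean, shifted so the current value is not leaked.
windows = WindowFeatures(variables=["sales"], window=3, functions=["mean"])
ts_tr = windows.fit_transform(ts_tr)

print(ts_tr.head())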

Feature-engine’s creation module supports transformers that create and add new features to a pandas dataframe, either by combining existing features through different mathematical or statistical operations, or through feature transformations. These transformers operate on numerical variables, that is, those with integer and float data types.

Summary of Feature-engine’s feature-creation transformers:

  • CyclicalFeatures - Creates two new features per variable by applying the trigonometric operations sine and cosine to the original feature (see the sketch after this list).

  • MathFeatures - Combines a set of features into new variables by applying basic mathematical functions like the sum, mean, maximum or standard deviation.

  • RelativeFeatures - Utilizes basic mathematical functions between a group of variables and one or more reference features, appending the new features to the pandas dataframe.
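CyclicalFeatures is the only transformer of the three not shown in the demo below, so here is a minimal sketch; the hour-of-day column is hypothetical:

import pandas as pd
from feature_engine.creation import CyclicalFeatures

# Hypothetical hour-of-day feature.
df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})

# Map each value onto a circle, so that hour 23 ends up close to hour 0.
cyclical = CyclicalFeatures(variables=["hour"], drop_original=False)
df_tr = cyclical.fit_transform(df)
print(df_tr.head())  # adds hour_sin and hour_cos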

Feature-engine in Practice#

Here, you’ll get a taste of the transformers in Feature-engine’s creation module. We’ll use the wine dataset from scikit-learn, which comprises 13 features, including alcohol, ash, and flavanoids, and has the wine class (cultivar) as its target variable.

Through exploratory data analysis and domain knowledge, which includes real-world experimentation (i.e., drinking various brands and types of wine), we believe that we can create better features to train our algorithm by combining the original features through various mathematical operations.

Let’s load the dataset from Scikit-learn.

import pandas as pd
from sklearn.datasets import load_wine
from feature_engine.creation import RelativeFeatures, MathFeatures

X, y = load_wine(return_X_y=True, as_frame=True)
print(X.head())

Below we see the first rows of the wine dataset:

   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80
1    13.20        1.78  2.14               11.2      100.0           2.65
2    13.16        2.36  2.67               18.6      101.0           2.80
3    14.37        1.95  2.50               16.8      113.0           3.85
4    13.24        2.59  2.87               21.0      118.0           2.80

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04
1        2.76                  0.26             1.28             4.38  1.05
2        3.24                  0.30             2.81             5.68  1.03
3        3.49                  0.24             2.18             7.80  0.86
4        2.69                  0.39             1.82             4.32  1.04

   od280/od315_of_diluted_wines  proline
0                          3.92   1065.0
1                          3.40   1050.0
2                          3.17   1185.0
3                          3.45   1480.0
4                          2.93    735.0

Now, we create a new feature by subtracting the non-flavonoid phenols from the total phenols, to obtain a measure of the phenols that are flavonoid.

rf = RelativeFeatures(
    variables=["total_phenols"],
    reference=["nonflavanoid_phenols"],
    func=["sub"],
)

rf.fit(X)
X_tr = rf.transform(X)

print(X_tr.head())

We see the new feature and its data points at the right of the pandas dataframe:

   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80
1    13.20        1.78  2.14               11.2      100.0           2.65
2    13.16        2.36  2.67               18.6      101.0           2.80
3    14.37        1.95  2.50               16.8      113.0           3.85
4    13.24        2.59  2.87               21.0      118.0           2.80

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04
1        2.76                  0.26             1.28             4.38  1.05
2        3.24                  0.30             2.81             5.68  1.03
3        3.49                  0.24             2.18             7.80  0.86
4        2.69                  0.39             1.82             4.32  1.04

   od280/od315_of_diluted_wines  proline  \
0                          3.92   1065.0
1                          3.40   1050.0
2                          3.17   1185.0
3                          3.45   1480.0
4                          2.93    735.0

   total_phenols_sub_nonflavanoid_phenols
0                                    2.52
1                                    2.39
2                                    2.50
3                                    3.61
4                                    2.41

Let’s now create new features by combining a subset of 3 existing variables:

mf = MathFeatures(
    variables=["flavanoids", "proanthocyanins", "proline"],
    func=["sum", "mean"],
)

mf.fit(X_tr)
X_tr = mf.transform(X_tr)

print(X_tr.head())

We see the new features at the right of the resulting pandas dataframe:

   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80
1    13.20        1.78  2.14               11.2      100.0           2.65
2    13.16        2.36  2.67               18.6      101.0           2.80
3    14.37        1.95  2.50               16.8      113.0           3.85
4    13.24        2.59  2.87               21.0      118.0           2.80

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04
1        2.76                  0.26             1.28             4.38  1.05
2        3.24                  0.30             2.81             5.68  1.03
3        3.49                  0.24             2.18             7.80  0.86
4        2.69                  0.39             1.82             4.32  1.04

   od280/od315_of_diluted_wines  proline  \
0                          3.92   1065.0
1                          3.40   1050.0
2                          3.17   1185.0
3                          3.45   1480.0
4                          2.93    735.0

   total_phenols_sub_nonflavanoid_phenols  \
0                                    2.52
1                                    2.39
2                                    2.50
3                                    3.61
4                                    2.41

   sum_flavanoids_proanthocyanins_proline  \
0                                 1070.35
1                                 1054.04
2                                 1191.05
3                                 1485.67
4                                  739.51

   mean_flavanoids_proanthocyanins_proline
0                               356.783333
1                               351.346667
2                               397.016667
3                               495.223333
4                               246.503333

In the examples above, we used RelativeFeatures and MathFeatures to perform automated feature engineering on the input data, applying the transformations defined in the func parameter to the features indicated in the variables and reference parameters.

The original and new features can now be used to train a multiclass classification algorithm to predict the wine class.
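For instance, here is a minimal sketch of fitting a classifier on the augmented dataframe; the choice of model and split is just an example:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the augmented data and fit a multiclass classifier.
X_train, X_test, y_train, y_test = train_test_split(
    X_tr, y, test_size=0.3, random_state=0
)
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))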

Summary#

Through feature engineering and feature creation, we can optimize the machine learning algorithm’s learning process and improve its performance metrics.

We’d strongly recommend creating features based on domain knowledge, exploratory data analysis, and thorough data mining. We also understand that this is not always possible, particularly with big datasets and the limited time allocated to each project. In such situations, we can combine feature creation with feature selection procedures to let the machine learning algorithms select what works best for them.
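As a sketch of that workflow, feature creation can be chained with a selection step in a scikit-learn Pipeline; the particular selector and model here are just examples:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from feature_engine.creation import MathFeatures

# Create candidate features, then let the selector keep the useful ones.
pipe = Pipeline([
    ("create", MathFeatures(
        variables=["flavanoids", "proanthocyanins", "proline"],
        func=["sum", "mean"],
    )),
    ("select", SelectFromModel(RandomForestClassifier(random_state=0))),
    ("model", RandomForestClassifier(random_state=0)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())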

Good luck with your models!

Tutorials, books and courses#

For tutorials about this and other feature engineering methods for machine learning, check out our online course:

Feature Engineering for Machine Learning#
Or read our book:

Python Feature Engineering Cookbook#
Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.

Transformers in other Libraries#

Also check the following transformer from Scikit-learn: