YeoJohnsonTransformer#

The Yeo-Johnson transformation is an extension of the Box-Cox transformation, enabling power transformations on variables with zero and negative values, in addition to positive values.

The Box-Cox transformation, on the other hand, is suitable for numeric variables that are strictly positive. When variables include negative values, we have two options:

shift the distribution toward positive values by adding a constant
use the Yeo-Johnson transformation

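The Yeo-Johnson transformation is defined as:

$$
\psi(Y, \lambda) =
\begin{cases}
\dfrac{(Y + 1)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0,\ Y \geq 0 \\
\ln(Y + 1) & \text{if } \lambda = 0,\ Y \geq 0 \\
-\dfrac{(-Y + 1)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \neq 2,\ Y < 0 \\
-\ln(-Y + 1) & \text{if } \lambda = 2,\ Y < 0
\end{cases}
$$

where Y is the variable to transform and λ is the transformation parameter.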
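The piecewise definition of the Yeo-Johnson transformation can be sketched in plain Python. This is only an illustration for a single value; the function name `yeo_johnson` is hypothetical and is not part of Feature-engine or SciPy.

```python
import math


def yeo_johnson(y: float, lmbda: float) -> float:
    """Yeo-Johnson transform of a single value (illustrative helper)."""
    if y >= 0:
        # Non-negative branch: a Box-Cox style power transform of (y + 1)
        if lmbda == 0:
            return math.log1p(y)
        return ((y + 1) ** lmbda - 1) / lmbda
    # Negative branch: power (2 - lambda) applied to (-y + 1), sign flipped
    if lmbda == 2:
        return -math.log1p(-y)
    return -(((-y + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)


values = [-3.0, -0.5, 0.0, 0.5, 3.0]
for lam in (0.0, 0.5, 1.0, 2.0):
    print(lam, [round(yeo_johnson(v, lam), 4) for v in values])
```

Note that with λ = 1 both branches reduce to the identity, so the variable is left unchanged.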

Uses of the Yeo-Johnson and Box-Cox transformations#

Both the Yeo-Johnson and Box-Cox transformations automate the process of identifying the optimal power transformation to approximate a Gaussian distribution. They evaluate various power transformations, including the logarithmic and reciprocal functions, estimating the transformation parameter through maximum likelihood.
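As a rough illustration of the maximum likelihood step, the sketch below uses SciPy on synthetic, right-skewed data (the data is an assumption made for the example, not part of the original guide). When `lmbda` is not given, `scipy.stats.yeojohnson` returns both the transformed values and the estimated λ.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed variable (assumption for illustration only)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# With lmbda=None, SciPy estimates lambda by maximum likelihood
x_t, lam = stats.yeojohnson(x)

# The same estimate can be obtained directly
lam_check = stats.yeojohnson_normmax(x)

# Log-likelihood at the estimated lambda versus no transformation (lambda = 1)
ll_opt = stats.yeojohnson_llf(lam, x)
ll_identity = stats.yeojohnson_llf(1.0, x)

print("estimated lambda:", lam)
print("log-likelihood at estimate:", ll_opt, "at lambda=1:", ll_identity)
print("skewness before:", stats.skew(x), "after:", stats.skew(x_t))
```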

These transformations are commonly applied during data preprocessing, particularly when parametric statistical tests or linear models for regression are employed. Such tests and models often have underlying assumptions about the data that may not be inherently satisfied, making these transformations essential for meeting those assumptions.

Yeo-Johnson vs Box-Cox transformation#

How does the Yeo-Johnson transformation relate to the Box-Cox transformation?

The Yeo-Johnson transformation extends the Box-Cox transformation to handle variables with zero, negative, and positive values.

For strictly positive values: The Yeo-Johnson transformation is equivalent to the Box-Cox transformation applied to (X + 1).
For strictly negative values: The Yeo-Johnson transformation corresponds to the negative of the Box-Cox transformation applied to (-X + 1) with a power of (2 - λ), where λ is the transformation parameter.
For variables with both positive and negative values: The Yeo-Johnson transformation combines the two approaches, using different powers for the positive and negative segments of the variable.
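The relationship between the two transformations can be checked numerically with SciPy. The sketch below uses an arbitrary λ and a few hand-picked values (both are assumptions for the example) and includes the sign flip that applies on the negative branch.

```python
import numpy as np
from scipy import stats

lam = 0.5  # arbitrary lambda chosen for the illustration
pos = np.array([0.5, 1.0, 2.0, 10.0])
neg = np.array([-0.5, -1.0, -2.0, -10.0])

# Positive values: Yeo-Johnson(X) equals Box-Cox(X + 1) with the same lambda
yj_pos = stats.yeojohnson(pos, lmbda=lam)
bc_pos = stats.boxcox(pos + 1, lmbda=lam)

# Negative values: Yeo-Johnson(X) equals -Box-Cox(-X + 1) with power 2 - lambda
yj_neg = stats.yeojohnson(neg, lmbda=lam)
bc_neg = -stats.boxcox(-neg + 1, lmbda=2 - lam)

print(np.allclose(yj_pos, bc_pos))
print(np.allclose(yj_neg, bc_neg))
```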

To apply the Yeo-Johnson transformation in Python, you can use scipy.stats.yeojohnson, which transforms one variable at a time. For transforming multiple variables simultaneously, libraries like scikit-learn and Feature-engine are more suitable.
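To make the difference concrete, here is a minimal sketch that transforms one variable with SciPy and then two variables at once with scikit-learn's `PowerTransformer`. The dataframe and its column names are made up for the example, and `standardize=False` is set so that only the power transformation is applied.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import PowerTransformer

# Toy dataframe with two skewed columns (assumption for illustration only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.lognormal(size=500),
    "b": rng.exponential(size=500) - 0.5,  # includes negative values
})

# SciPy: one variable at a time
a_t, lam_a = stats.yeojohnson(df["a"].to_numpy())

# scikit-learn: all columns in one call, one lambda per column
pt = PowerTransformer(method="yeo-johnson", standardize=False)
df_t = pd.DataFrame(pt.fit_transform(df), columns=df.columns)

print("SciPy lambda for 'a':", lam_a)
print("scikit-learn lambdas:", pt.lambdas_)
```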

The YeoJohnsonTransformer#

Feature-engine’s YeoJohnsonTransformer() applies the Yeo-Johnson transformation to numeric variables.

Under the hood, YeoJohnsonTransformer() uses scipy.stats.yeojohnson to apply the transformations to each variable.

Python Implementation#

In this section, we will apply the Yeo-Johnson transformation to several variables from the Ames house prices dataset. After performing the transformations, we will carry out data analysis to understand the impact on the variable distributions.

Let’s begin by importing the necessary libraries and transformers, loading the dataset, and splitting it into training and testing sets.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from feature_engine.transformation import YeoJohnsonTransformer

# Load dataset
data = fetch_openml(name='house_prices', as_frame=True)
data = data.frame

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)

X_train.head()

In the following output we see the predictor variables of the house prices dataset:

      MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
         20       RL         70.0     8400   Pave   NaN      Reg
        60       RL         59.0     7837   Pave   NaN      IR1
         30       RL         67.0     8777   Pave   NaN      Reg
         50       RL         60.0     7200   Pave   NaN      Reg
         50       RL         50.0     5000   Pave  Pave      Reg

     LandContour Utilities LotConfig  ... ScreenPorch PoolArea PoolQC  Fence  \
        Lvl    AllPub    Inside  ...           0        0    NaN    NaN
       Lvl    AllPub    Inside  ...           0        0    NaN    NaN
        Lvl    AllPub    Inside  ...           0        0    NaN  MnPrv
        Lvl    AllPub    Corner  ...           0        0    NaN  MnPrv
        Lvl    AllPub    Inside  ...           0        0    NaN    NaN

     MiscFeature MiscVal  MoSold  YrSold  SaleType  SaleCondition
        NaN       0       6    2010        WD         Normal
       NaN       0       5    2009        WD         Normal
        NaN       0       5    2008        WD         Normal
        NaN       0       6    2007        WD         Normal
        NaN       0       5    2010        WD         Normal

[5 rows x 79 columns]

Let’s now set up the transformer to apply the Yeo-Johnson transformation to two variables, LotArea and GrLivArea:

tf = YeoJohnsonTransformer(variables=['LotArea', 'GrLivArea'])

tf.fit(X_train)

With fit(), YeoJohnsonTransformer() learns the optimal lambda for the Yeo-Johnson power transformation of each variable. We can inspect these values as follows:

tf.lambda_dict_

We see the optimal lambda values below:

{'LotArea': 0.02258978732751055, 'GrLivArea': 0.06781061353154169}

We can now go ahead and apply the data transformation to get closer to normal distributions.

train_t = tf.transform(X_train)
test_t = tf.transform(X_test)
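The key point in the step above is that λ is learned from the training set only and then reused on the test set. A minimal sketch of the same idea with SciPy alone, on synthetic data (an assumption for the example, not the house prices dataset), could look like this:

```python
import numpy as np
from scipy import stats

# Synthetic train and test samples from the same skewed distribution
rng = np.random.default_rng(42)
train = rng.lognormal(mean=2.0, sigma=0.8, size=1000)
test = rng.lognormal(mean=2.0, sigma=0.8, size=300)

# "fit": estimate lambda on the training data only
_, lam = stats.yeojohnson(train)

# "transform": apply the learned lambda to both sets
train_t = stats.yeojohnson(train, lmbda=lam)
test_t = stats.yeojohnson(test, lmbda=lam)

print("lambda learned on train:", lam)
print("test skewness before:", stats.skew(test), "after:", stats.skew(test_t))
```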

We’ll check out the effect of the transformation in the next section.

Effect of the transformation on the variable distribution#

Let’s analyze the effect of the transformation. We’ll explore the variable distributions before and after applying the transformation described by Yeo and Johnson.

Let’s make histograms of the original data to inspect the original variable distributions:

X_train[['LotArea', 'GrLivArea']].hist(bins=50, figsize=(10,4))
plt.show()

In the following image, we can observe the skewness in the distribution of ‘LotArea’ and ‘GrLivArea’ in the original data:

Now, let’s plot histograms of the transformed variables:

train_t[['LotArea', 'GrLivArea']].hist(bins=50, figsize=(10,4))
plt.show()

We see that in the transformed data, both variables have a more symmetric, Gaussian-like distribution.
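Histograms give a visual impression; the change can also be summarized numerically with the skewness coefficient. The sketch below does this on a synthetic skewed column (the data and the column name `area` are assumptions for the example).

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed variable (assumption for illustration only)
rng = np.random.default_rng(1)
raw = pd.DataFrame({"area": rng.lognormal(mean=2.0, sigma=0.7, size=1000)})

# Apply the Yeo-Johnson transformation with the MLE lambda
values_t, lam = stats.yeojohnson(raw["area"].to_numpy())
transformed = pd.DataFrame({"area": values_t})

# Skewness close to 0 indicates a more symmetric distribution
summary = pd.DataFrame({"before": raw.skew(), "after": transformed.skew()})
print(summary)
```

With the objects from this guide, the analogous calls would be `X_train[['LotArea', 'GrLivArea']].skew()` and `train_t[['LotArea', 'GrLivArea']].skew()`.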

Recovering the original data#

After applying the Yeo-Johnson transformation, we can restore the original data representation, that is, the original variable values, using the inverse_transform method.

train_unt = tf.inverse_transform(train_t)
test_unt = tf.inverse_transform(test_t)
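To see why the transformation is reversible, the inverse can be written out by solving each branch of the Yeo-Johnson definition for the original value. The helper below, `yeo_johnson_inverse`, is a hypothetical sketch for illustration; it is not necessarily how Feature-engine implements inverse_transform.

```python
import numpy as np
from scipy import stats


def yeo_johnson_inverse(t, lmbda):
    """Invert the Yeo-Johnson transform for a known lambda (illustrative)."""
    t = np.asarray(t, dtype=float)
    out = np.empty_like(t)
    pos = t >= 0  # the transform preserves sign, so t >= 0 means x >= 0

    if lmbda == 0:
        out[pos] = np.expm1(t[pos])
    else:
        out[pos] = np.power(lmbda * t[pos] + 1, 1 / lmbda) - 1

    if lmbda == 2:
        out[~pos] = 1 - np.exp(-t[~pos])
    else:
        out[~pos] = 1 - np.power(-(2 - lmbda) * t[~pos] + 1, 1 / (2 - lmbda))
    return out


x = np.array([-5.0, -1.0, 0.0, 0.5, 3.0, 100.0])
for lam in (0.0, 0.5, 2.0):
    x_t = stats.yeojohnson(x, lmbda=lam)
    x_back = yeo_johnson_inverse(x_t, lam)
    print(lam, np.allclose(x_back, x))
```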

Additional resources#

You can find more details about the YeoJohnsonTransformer() here:

Jupyter notebook

For more details about this and other feature engineering methods check out these resources:

Feature Engineering for Machine Learning#

Or read our book:

Python Feature Engineering Cookbook#

Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
