BoxCoxTransformer

API Reference

class feature_engine.transformation.BoxCoxTransformer(variables=None)[source]

The BoxCoxTransformer() applies the BoxCox transformation to numerical variables.

The Box-Cox transformation is defined as:

  • T(Y) = (Y^λ − 1) / λ if λ ≠ 0

  • T(Y) = log(Y) if λ = 0

where Y is the variable being transformed and λ is the transformation parameter. λ typically ranges from -5 to 5. When the transformer is fitted, the optimal λ for each variable is estimated from the training data.
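As an illustration of the definition above, here is a minimal pure-Python sketch of the formula for a single value. The function name box_cox is hypothetical and is not part of feature_engine or SciPy; the transformer itself delegates the computation to SciPy.

```python
import math


def box_cox(y: float, lam: float) -> float:
    """Box-Cox transform of one strictly positive value (illustrative helper)."""
    if y <= 0:
        raise ValueError("Box-Cox is defined only for y > 0")
    if lam == 0:
        # Limiting case of (y**lam - 1) / lam as lam approaches 0
        return math.log(y)
    return (y ** lam - 1.0) / lam


print(box_cox(4.0, 0.5))  # (sqrt(4) - 1) / 0.5 = 2.0
print(box_cox(4.0, 1.0))  # 4 - 1 = 3.0
print(box_cox(1.0, 0.0))  # log(1) = 0.0
```

With λ = 1 the transform only shifts the data by 1, and as λ approaches 0 it converges to the logarithm, which is why the log branch is used for λ = 0.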

The Box-Cox transformation implemented by this transformer is that of scipy.stats.boxcox: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html
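To give an intuition for how an optimal λ can be chosen, the sketch below scans a grid of λ values between -5 and 5 and keeps the one that maximizes the Box-Cox log-likelihood. This is only an illustration under stated assumptions: the helper names are hypothetical, and SciPy relies on a numerical optimizer rather than a fixed grid.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of one strictly positive value."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam


def log_likelihood(data, lam):
    """Box-Cox log-likelihood for a given lambda, up to an additive constant."""
    n = len(data)
    transformed = [box_cox(y, lam) for y in data]
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n
    log_sum = sum(math.log(y) for y in data)
    return (lam - 1.0) * log_sum - 0.5 * n * math.log(variance)


def best_lambda(data, low=-5.0, high=5.0, steps=1001):
    """Return the grid value of lambda with the highest log-likelihood."""
    grid = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: log_likelihood(data, lam))


# Sample whose logarithm is evenly spaced, so a lambda near 0 is expected
data = [math.exp(0.1 * i) for i in range(1, 31)]
print(best_lambda(data))
```

A grid keeps the example short and dependency-free; its resolution (0.01 here) limits how precisely λ is located.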

The BoxCoxTransformer() works only with strictly positive numerical variables (> 0).

A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and transform all numerical variables.
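Because automatic selection picks up every numerical column, it can help to check beforehand which columns are actually suitable. The following sketch uses plain pandas; the helper strictly_positive_columns and the sample data are hypothetical and not part of feature_engine.

```python
import pandas as pd


def strictly_positive_columns(df: pd.DataFrame) -> list:
    """Return the numerical columns whose values are all non-null and > 0."""
    numeric = df.select_dtypes(include="number")
    # NaN compares as False, so columns with missing values are excluded too
    ok = (numeric > 0).all()
    return ok[ok].index.tolist()


df = pd.DataFrame({
    "area": [120.0, 85.5, 300.2],
    "balance": [10.0, -5.0, 0.0],
    "rooms": [3, 2, 5],
    "city": ["a", "b", "c"],
})

candidates = strictly_positive_columns(df)
print(candidates)  # ['area', 'rooms']
```

The resulting list could then be passed as the variables argument of the transformer.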

Parameters
variables: list, default=None

The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.

Attributes

lambda_dict_:

Dictionary with the best BoxCox exponent per variable.

variables_:

The group of variables that will be transformed.

n_features_in_:

The number of features in the train set used in fit.

References

1

Box, G. E. P. and Cox, D. R. “An Analysis of Transformations”. Journal of the Royal Statistical Society, Series B, 26(2), 211-252, 1964. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1964.tb00553.x

Methods

fit:

Learn the optimal lambda for the BoxCox transformation.

transform:

Apply the BoxCox transformation.

fit_transform:

Fit to data, then transform it.

fit(X, y=None)[source]

Learn the optimal lambda for the BoxCox transformation.

Parameters
X: pandas DataFrame of shape = [n_samples, n_features]

The training input samples. Can be the entire dataframe, not just the variables to transform.

y: pandas Series, default=None

It is not needed in this transformer. You can pass y or None.

Returns
self
Raises
TypeError
  • If the input is not a pandas DataFrame

  • If any of the user-provided variables are not numerical

ValueError
  • If there are no numerical variables in the DataFrame or the DataFrame is empty

  • If the variable(s) contain null values

  • If some variables contain zero values

transform(X)[source]

Apply the BoxCox transformation.

Parameters
X: pandas DataFrame of shape = [n_samples, n_features]

The data to be transformed.

Returns
X: pandas DataFrame

The dataframe with the transformed variables.

Raises
TypeError

If the input is not a pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If the DataFrame has a different number of features than the one used in fit()

  • If some variables contain negative values

Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine import transformation as vt

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test =  train_test_split(
            data.drop(['Id', 'SalePrice'], axis=1),
            data['SalePrice'], test_size=0.3, random_state=0)

# set up the variable transformer
tf = vt.BoxCoxTransformer(variables = ['LotArea', 'GrLivArea'])

# fit the transformer
tf.fit(X_train)

# transform the data
train_t= tf.transform(X_train)
test_t= tf.transform(X_test)

# un-transformed variable
X_train['LotArea'].hist(bins=50)
[Figure: histogram of LotArea before the transformation]
# transformed variable
train_t['LotArea'].hist(bins=50)
[Figure: histogram of LotArea after the Box-Cox transformation]