LogTransformer() applies the logarithm to the indicated variables. Note
that the logarithm can only be applied to positive values. Thus, if a variable contains
0 or negative values, this transformer will raise an error.
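To make the positivity requirement concrete, here is a minimal sketch using only the Python standard library. The helper `log_transform` is hypothetical and written for illustration; it is not how LogTransformer() is implemented, it only mimics the idea of rejecting non-positive input before taking the logarithm.

```python
import math


def log_transform(values):
    """Hypothetical helper: natural log of each value, rejecting non-positive input."""
    # The logarithm is undefined for 0 and for negative numbers,
    # so collect any offending values before transforming.
    bad = [v for v in values if v <= 0]
    if bad:
        raise ValueError(f"logarithm undefined for non-positive values: {bad}")
    return [math.log(v) for v in values]


# Strictly positive input is transformed without problems.
positive = log_transform([1.0, math.e, 100.0])

# Input containing 0 or a negative value is rejected with an error.
try:
    log_transform([5.0, 0.0, -3.0])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Checking the whole input up front, rather than failing partway through, means the data is either transformed completely or not at all.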
Let’s load the house prices dataset and separate it into train and test sets (more details about the dataset here).
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split

    from feature_engine import transformation as vt

    # Load dataset
    data = pd.read_csv('houseprice.csv')

    # Separate into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(['Id', 'SalePrice'], axis=1),
        data['SalePrice'],
        test_size=0.3,
        random_state=0)
Now we want to apply the logarithm to 2 of the variables in the dataset using the LogTransformer().
    # set up the variable transformer
    tf = vt.LogTransformer(variables=['LotArea', 'GrLivArea'])

    # fit the transformer
    tf.fit(X_train)
With fit(), this transformer does not learn any parameters. We can now go ahead and
transform the variables.
    # transform the data
    train_t = tf.transform(X_train)
    test_t = tf.transform(X_test)
Next, we make a histogram of the original variable distribution:
    # un-transformed variable
    X_train['LotArea'].hist(bins=50)
And now, we can explore the distribution of the variable after the logarithm transformation:
    # transformed variable
    train_t['LotArea'].hist(bins=50)
Note that the transformed variable has a more Gaussian-looking distribution.
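A histogram gives a visual impression only. One way to put a number on "more Gaussian-looking" is to compare the skewness before and after the transformation, since a symmetric distribution such as the Gaussian has a skewness of 0. The sketch below is an illustration under stated assumptions: it uses synthetic right-skewed data drawn from a log-normal distribution, not the house prices dataset, and the `skewness` helper is written here for the example.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed standardised values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Synthetic, strictly positive, right-skewed data (an assumption for this sketch).
rng = random.Random(0)
raw = [rng.lognormvariate(9.0, 0.5) for _ in range(5000)]

# Apply the logarithm, as the transformer does to each selected variable.
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)        # clearly positive: long right tail
skew_logged = skewness(logged)  # close to 0: roughly symmetric
```

On real data the same comparison can be made with the `skew()` method of a pandas Series, for example on `X_train['LotArea']` and `train_t['LotArea']`.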