CategoricalImputer() replaces missing data in categorical variables with an
arbitrary value, such as the string ‘Missing’, or with the most frequent category.
You can indicate which variables to impute by passing their names in a list, or let the imputer automatically find and select all variables of type object and categorical.
Originally, we designed this imputer to work only with categorical variables. In version
1.1.0 we introduced the parameter ignore_format to allow the imputer to also impute
numerical variables. This is because, in some cases, variables
that are categorical by nature are encoded with numerical values.
Below is a code example using the House Prices Dataset (more details about the dataset here).
In this example, we impute 2 variables from the dataset with the string ‘Missing’, which is the default behaviour of the transformer:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import CategoricalImputer

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)

# Set up the imputer
imputer = CategoricalImputer(variables=['Alley', 'MasVnrType'])

# Fit the imputer to the train set
imputer.fit(X_train)

# Transform the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)

test_t['MasVnrType'].value_counts().plot.bar()
Note the category “Missing” in the plot, which was added by the imputation:
In the following Jupyter notebook you will find more details on the functionality of the
EndTailImputer(), including how to select numerical variables automatically.
You will also find demos on how to impute using the maximum value or the interquartile
range proximity rule.
You can also check this Jupyter notebook.
All notebooks can be found in a dedicated repository.