AddMissingIndicator() adds a binary variable indicating whether observations are
missing (a missing indicator). It adds missing indicators to both categorical and
numerical variables.
You can select the variables for which missing indicators should be created by passing
a list of variables to the variables parameter. Alternatively, the imputer will
automatically select all variables.
The imputer has the option to add missing indicators to all variables or only to those
that have missing data in the train set. You can change the behaviour using the
missing_only parameter. If missing_only=True, missing indicators will be added only to
those variables with missing data in the train set. This means that if you passed a
variable list to variables and some of those variables did not have missing data, no
missing indicators will be added to them. If it is paramount that all variables in your
list get their missing indicators, make sure to set missing_only=False.
It is recommended to use missing_only=True when not passing a list of variables to the
variables parameter.
Below is a code example using the House Prices Dataset (more details about the dataset here).
First, let’s load the data and separate it into train and test:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split

    from feature_engine.imputation import AddMissingIndicator

    # Load dataset
    data = pd.read_csv('houseprice.csv')

    # Separate into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(['Id', 'SalePrice'], axis=1),
        data['SalePrice'],
        test_size=0.3,
        random_state=0,
    )
Now we set up the imputer to add missing indicators to the 4 indicated variables:
    # set up the imputer
    addBinary_imputer = AddMissingIndicator(
        variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'],
    )

    # fit the imputer
    addBinary_imputer.fit(X_train)
Because we left missing_only at its default value (True), AddMissingIndicator()
will check whether the variables indicated above have missing data in X_train. If they
all do, missing indicators will be added for all 4 variables from then on. If one of
them had no missing data in X_train, missing indicators would be added to the
remaining 3 variables only.
We can find out which variables will have missing indicators by looking at the variable list
stored in the variables_ attribute of the fitted imputer (addBinary_imputer.variables_).
Now, we can go ahead and add the missing indicators:
    # transform the data
    train_t = addBinary_imputer.transform(X_train)
    test_t = addBinary_imputer.transform(X_test)

    train_t[['Alley_na', 'MasVnrType_na', 'LotFrontage_na', 'MasVnrArea_na']].head()
Note that after adding missing indicators, we still need to replace NA in the original variables if we plan to use them to train machine learning models.
Missing indicators are commonly used together with random sampling, mean or median imputation, or frequent category imputation.
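To illustrate why the indicator and the imputation go together, here is a pandas-only sketch that adds an indicator and then fills the gaps with the median learned from the train set. It is not how the library implements either step. The helper add_indicator_and_impute is hypothetical, and the LotFrontage values are made up.

```python
import numpy as np
import pandas as pd

# Made-up values for a single numerical variable.
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 65.0]})


def add_indicator_and_impute(df, column, fill_value):
    """Hypothetical helper: flag missing values, then replace them."""
    out = df.copy()
    # The indicator must be created before the NA values are overwritten.
    out[column + "_na"] = out[column].isna().astype(int)
    out[column] = out[column].fillna(fill_value)
    return out


# Learn the median on the train set only, then reuse it for the test set.
train_median = train["LotFrontage"].median()

train_t = add_indicator_and_impute(train, "LotFrontage", train_median)
test_t = add_indicator_and_impute(test, "LotFrontage", train_median)

print(train_t)
print(test_t)
```

The order matters: once the original variable has been imputed, the information about which rows were missing survives only in the LotFrontage_na column.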
In the following Jupyter notebook you will find more details on the functionality of
AddMissingIndicator(), including how to use the missing_only parameter and
how to select the variables automatically.
All notebooks can be found in a dedicated repository.