MatchVariables() ensures that the columns in the test set are identical to those
in the train set.
If the test set contains additional columns, they are dropped. Alternatively, if the
test set lacks columns that were present in the train set, they will be added with a
value determined by the user, for example np.nan.
MatchVariables() will also
return the variables in the order seen in the train set.
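To make the behaviour described above concrete, here is a minimal pandas-only sketch of the same idea. It is not the transformer's implementation; it assumes that `DataFrame.reindex` is an acceptable stand-in, and the toy dataframes and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the train and test sets (illustrative data only).
train = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
test = pd.DataFrame({"c": [7, 8], "extra": [0, 0], "a": [9, 10]})

# Columns seen in the train set, in their original order.
train_columns = list(train.columns)

# reindex keeps only the listed columns, in the listed order,
# and creates any absent column filled with fill_value.
aligned = test.reindex(columns=train_columns, fill_value=np.nan)

print(aligned)
```

In this sketch, the column "extra" is dropped, the column "b" is added with missing values, and the columns come back in the order of the train set.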
Let’s explore this with an example. First we load the Titanic dataset and split it into a train and a test set:
    import numpy as np
    import pandas as pd

    from feature_engine.preprocessing import MatchVariables


    # Load dataset
    def load_titanic():
        data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
        data = data.replace('?', np.nan)
        # keep only the first character of the cabin string ('n' for missing values)
        data['cabin'] = data['cabin'].astype(str).str[0]
        data['pclass'] = data['pclass'].astype('O')
        data['age'] = data['age'].astype('float')
        data['fare'] = data['fare'].astype('float')
        # assign the result back instead of filling in place on a column selection
        data['embarked'] = data['embarked'].fillna('C')
        data.drop(
            labels=['name', 'ticket', 'boat', 'body', 'home.dest'],
            axis=1,
            inplace=True,
        )
        return data


    # load data as pandas dataframe
    data = load_titanic()

    # Split test and train
    train = data.iloc[0:1000, :]
    test = data.iloc[1000:, :]
Now, we set up
MatchVariables() and fit it to the train set.
    # set up the transformer
    match_cols = MatchVariables(missing_values="ignore")

    # learn the variables in the train set
    match_cols.fit(train)
MatchVariables() stores the variables from the train set in its attribute:
    # the transformer stores the input variables
    match_cols.input_features_

    ['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'fare', 'cabin', 'embarked']
Now, we drop some columns in the test set.
    # Let's drop some columns in the test set for the demo
    test_t = test.drop(["sex", "age"], axis=1)

    test_t.head()

          pclass  survived  sibsp  parch     fare  cabin  embarked
    1000       3         1      0      0   7.7500      n         Q
    1001       3         1      2      0  23.2500      n         Q
    1002       3         1      2      0  23.2500      n         Q
    1003       3         1      2      0  23.2500      n         Q
    1004       3         1      0      0   7.7875      n         Q
If we transform the dataframe with the dropped columns using MatchVariables(),
we see that the new dataframe contains all the variables, and those that were missing
are now back in the data, with np.nan values as default.
    # the transformer adds the columns back
    test_tt = match_cols.transform(test_t)

    test_tt.head()

    The following variables are added to the DataFrame: ['sex', 'age']

          pclass  survived  sex  age  sibsp  parch     fare  cabin  embarked
    1000       3         1  NaN  NaN      0      0   7.7500      n         Q
    1001       3         1  NaN  NaN      2      0  23.2500      n         Q
    1002       3         1  NaN  NaN      2      0  23.2500      n         Q
    1003       3         1  NaN  NaN      2      0  23.2500      n         Q
    1004       3         1  NaN  NaN      0      0   7.7875      n         Q
Note how the missing columns were added back to the transformed test set, with missing values, in the position (i.e., order) in which they were in the train set.
Similarly, if the test set contained additional columns, those would be removed. To test that, let’s add some extra columns to the test set:
    # let's add some columns for the demo
    test_t[['var_a', 'var_b']] = 0

    test_t.head()

          pclass  survived  sibsp  parch     fare  cabin  embarked  var_a  var_b
    1000       3         1      0      0   7.7500      n         Q      0      0
    1001       3         1      2      0  23.2500      n         Q      0      0
    1002       3         1      2      0  23.2500      n         Q      0      0
    1003       3         1      2      0  23.2500      n         Q      0      0
    1004       3         1      0      0   7.7875      n         Q      0      0
And now, we transform the data with MatchVariables():
    test_tt = match_cols.transform(test_t)

    test_tt.head()

    The following variables are added to the DataFrame: ['age', 'sex']
    The following variables are dropped from the DataFrame: ['var_a', 'var_b']

          pclass  survived  sex  age  sibsp  parch     fare  cabin  embarked
    1000       3         1  NaN  NaN      0      0   7.7500      n         Q
    1001       3         1  NaN  NaN      2      0  23.2500      n         Q
    1002       3         1  NaN  NaN      2      0  23.2500      n         Q
    1003       3         1  NaN  NaN      2      0  23.2500      n         Q
    1004       3         1  NaN  NaN      0      0   7.7875      n         Q
Now, the transformer simultaneously added the missing columns with NA as values and removed the additional columns from the resulting dataset.
MatchVariables() will print out messages indicating which variables
were added or removed. We can switch off the messages through the parameter verbose.
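The reporting behaviour can be sketched in plain pandas as follows. The helper `align_columns` and its `verbose` flag are hypothetical names introduced for this illustration; they mimic the idea of optional messages and are not part of the library.

```python
import numpy as np
import pandas as pd


def align_columns(df, reference_columns, fill_value=np.nan, verbose=True):
    """Hypothetical helper: match the columns of df to reference_columns."""
    # Work out which columns have to be added and which have to be dropped.
    added = [col for col in reference_columns if col not in df.columns]
    dropped = [col for col in df.columns if col not in reference_columns]

    # Only report the changes when verbose is switched on.
    if verbose:
        if added:
            print(f"Added columns: {added}")
        if dropped:
            print(f"Dropped columns: {dropped}")

    aligned = df.reindex(columns=reference_columns, fill_value=fill_value)
    return aligned, added, dropped


# Made-up reference columns and a frame that lacks two of them.
reference = ["pclass", "sex", "age", "fare"]
frame = pd.DataFrame({"pclass": [3, 1], "fare": [7.75, 23.25], "var_a": [0, 0]})

# verbose=False suppresses the printed report; the lists are still returned.
result, added, dropped = align_columns(frame, reference, verbose=False)
```

Returning the added and dropped columns alongside the dataframe keeps the information available for logging even when the printed messages are switched off.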
When to use the transformer
This transformer is useful in “predict then optimize” problems. In such cases, a machine learning model is trained on a dataset with a given set of input features. Then, test sets are “post-processed” according to the scenarios we want to model. For example, “what would have happened if the customer had received an email campaign?”, where the variable “receive_campaign” would be changed from 0 to 1.
While creating these modelling datasets, a lot of metadata, e.g., “scenario number”, “time scenario was generated”, etc., could be added to the data. Then we need to pass these data to the model to obtain the modelled prediction.
MatchVariables() provides an easy and elegant way to remove the additional metadata,
while returning datasets with the input features in the correct order, allowing the
different scenarios to be modelled directly inside a machine learning pipeline.
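As a rough illustration of this workflow, the sketch below builds one scenario, attaches metadata, and strips it again before scoring. Everything in it is assumed for the example: the feature names, the metadata values, and the fixed linear scorer that stands in for a trained model. The column matching is done with pandas `reindex` rather than with the transformer itself.

```python
import numpy as np
import pandas as pd

# Features the (hypothetical) model was trained on, in training order.
model_features = ["age", "income", "receive_campaign"]

# Observed customers: nobody received the campaign.
customers = pd.DataFrame(
    {"age": [34, 51], "income": [42000.0, 58000.0], "receive_campaign": [0, 0]}
)

# Build a "what if" scenario: flip the campaign flag and attach metadata.
scenario = customers.assign(
    receive_campaign=1,
    scenario_number=1,
    generated_at=pd.Timestamp("2024-01-01"),
)

# Strip the metadata and restore the training column order before scoring.
model_input = scenario.reindex(columns=model_features)

# Stand-in for a fitted model: a fixed linear score with made-up weights.
weights = np.array([0.01, 0.00001, 0.5])
scenario["prediction"] = model_input.to_numpy(dtype=float) @ weights
```

The metadata stays on the scenario dataframe for bookkeeping, while the scorer only ever sees the training features in the training order.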
You can also find a similar implementation of the example shown on this page in the following Jupyter notebook:
All notebooks can be found in a dedicated repository.