MatchCategories

MatchCategories() ensures that categorical variables are encoded with the pandas ‘categorical’ dtype instead of the generic Python ‘object’ dtype or other dtypes.

Under the hood, ‘categorical’ dtype is a representation that maps each category to an integer, thus providing a more memory-efficient object structure than, for example, ‘str’, and allowing faster grouping, mapping, and similar operations on the resulting object.
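To make the integer mapping concrete, here is a minimal sketch in plain pandas. The `colors` column is made-up data used only for illustration, and the snippet does not use MatchCategories itself.

```python
import pandas as pd

# Made-up column with three distinct labels repeated many times,
# stored explicitly as the generic 'object' dtype
colors = pd.Series(["red", "green", "blue", "green", "red"] * 10_000, dtype=object)

# Cast to the pandas categorical dtype
as_category = colors.astype("category")

# Each distinct label is stored once, in sorted order
print(as_category.cat.categories.tolist())    # ['blue', 'green', 'red']

# Every row holds only a small integer code that points at its label
print(as_category.cat.codes.head().tolist())  # [2, 1, 0, 1, 2]

# Compare the memory footprint of the two representations
object_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(category_bytes < object_bytes)          # True
```

The saving depends on how often labels repeat: a column with mostly unique values gains little from this representation.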

MatchCategories() remembers the encodings or levels that represent each category. It can thus be used to ensure that the correct encoding is applied when passing categorical data to modeling packages that support this dtype, or to prevent unseen categories from reaching a downstream transformer or estimator in a pipeline.

Let’s explore this with an example. First, we load the Titanic dataset and split it into a train set and a test set:

from feature_engine.preprocessing import MatchCategories
from feature_engine.datasets import load_titanic

# Load dataset
data = load_titanic(
    predictors_only=True,
    handle_missing=True,
    cabin="letter_only",
)

# Cast pclass to object so that it is treated as a categorical variable
data['pclass'] = data['pclass'].astype('O')

# Split test and train
train = data.iloc[0:1000, :]
test = data.iloc[1000:, :]

Now, we set up MatchCategories() and fit it to the train set.

# set up the transformer
match_categories = MatchCategories(missing_values="ignore")

# learn the mapping of categories to integers in the train set
match_categories.fit(train)

MatchCategories() stores the mappings learned from the train set in its category_dict_ attribute:

# the transformer stores the mappings for categorical variables
match_categories.category_dict_
{'pclass': Int64Index([1, 2, 3], dtype='int64'),
 'sex': Index(['female', 'male'], dtype='object'),
 'cabin': Index(['A', 'B', 'C', 'D', 'E', 'F', 'M', 'T'], dtype='object'),
 'embarked': Index(['C', 'Missing', 'Q', 'S'], dtype='object')}

If we transform the test dataframe using the same match_categories object, the categorical variables will be converted to the ‘category’ dtype with the same encoding (mapping from categories to integers) that was applied to the train dataset:

# categories that would be inferred from the train set alone
train.embarked.unique()
array(['S', 'C', 'Missing', 'Q'], dtype=object)
# categories that would be inferred from the test set alone
test.embarked.unique()
array(['Q', 'S', 'C'], dtype=object)
# with 'match_categories', the encoding remains the same
match_categories.transform(train).embarked.cat.categories
Index(['C', 'Missing', 'Q', 'S'], dtype='object')
# this will have the same encoding as the train set
match_categories.transform(test).embarked.cat.categories
Index(['C', 'Missing', 'Q', 'S'], dtype='object')
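The reason this matters can be sketched with plain pandas. The snippet below is an illustration of the idea, not the internal implementation of MatchCategories; the two small series are stand-ins for the `embarked` column, and the variable names are hypothetical.

```python
import pandas as pd

# Stand-ins for the 'embarked' column in the train and test sets
train_col = pd.Series(["S", "C", "Missing", "Q"], dtype=object)
test_col = pd.Series(["Q", "S", "C"], dtype=object)

# Casting the test set on its own infers categories from the test values only,
# so 'Missing' is absent and the integer codes shift
naive_test = test_col.astype("category")
print(naive_test.cat.categories.tolist())    # ['C', 'Q', 'S']
print(naive_test.cat.codes.tolist())         # [1, 2, 0]

# Reusing the categories learned from the train set keeps the codes aligned
learned = train_col.astype("category").cat.categories
matched_test = pd.Series(pd.Categorical(test_col, categories=learned))
print(matched_test.cat.categories.tolist())  # ['C', 'Missing', 'Q', 'S']
print(matched_test.cat.codes.tolist())       # [2, 3, 0]
```

With independent casting, 'Q' is code 1 in the test set but code 2 in the train set, which would silently mislead any model that consumes the integer codes.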

If a category was not present in the training data, it will not be mapped to any integer and will thus not get encoded. This behavior can be modified through the parameter errors:

# categories present in the train data
train.cabin.unique()
array(['B', 'C', 'E', 'D', 'A', 'M', 'T', 'F'], dtype=object)
# categories present in the test data - 'G' is new
test.cabin.unique()
array(['M', 'F', 'E', 'G'], dtype=object)
match_categories.transform(train).cabin.unique()
['B', 'C', 'E', 'D', 'A', 'M', 'T', 'F']
Categories (8, object): ['A', 'B', 'C', 'D', 'E', 'F', 'M', 'T']
# unseen category 'G' will not get mapped to any integer
match_categories.transform(test).cabin.unique()
['M', 'F', 'E', NaN]
Categories (8, object): ['A', 'B', 'C', 'D', 'E', 'F', 'M', 'T']
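The same effect can be reproduced in plain pandas, which may help when checking for unseen categories before transforming. This is a sketch under the assumption that the learned categories are available as an index; the variable names are hypothetical and the snippet does not call MatchCategories.

```python
import pandas as pd

# Categories learned from the train set (copied from the output above)
learned = pd.Index(["A", "B", "C", "D", "E", "F", "M", "T"])

# Stand-in for the 'cabin' column in the test set; 'G' was never seen in train
test_cabin = pd.Series(["M", "F", "E", "G"], dtype=object)

# Values outside the learned categories become NaN, with integer code -1
encoded = pd.Series(pd.Categorical(test_cabin, categories=learned))
print(encoded.isna().tolist())     # [False, False, False, True]
print(encoded.cat.codes.tolist())  # [6, 5, 4, -1]

# Listing the unseen labels up front makes the data loss explicit
unseen = sorted(set(test_cabin.unique()) - set(learned))
print(unseen)                      # ['G']
```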

When to use the transformer

This transformer is useful when creating custom transformers for categorical columns, and when passing categorical columns to modeling packages that support them natively but leave the variable casting to the user, such as lightgbm or glum.