.. _match_categories:

.. currentmodule:: feature_engine.preprocessing

MatchCategories
===============

:class:`MatchCategories()` ensures that categorical variables are encoded with the pandas
'category' dtype instead of the generic 'object' dtype or other dtypes.

Under the hood, the 'category' dtype is a representation that maps each
category to an integer. When a variable has few distinct values relative to its
number of rows, this is more memory-efficient than storing strings, for example,
and it allows faster grouping, mapping, and similar operations on the resulting object.
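
The integer representation can be inspected directly in pandas. The snippet below is a
minimal sketch that uses plain pandas on a made-up ``colors`` series (not part of the
Titanic example) to show the integer codes and to compare memory use:

.. code:: python

    import pandas as pd

    # toy data (hypothetical): 4000 rows, only 3 distinct values
    colors = pd.Series(["red", "green", "red", "blue"] * 1000, dtype=object)
    as_category = colors.astype("category")

    # each distinct value is stored once, in sorted order
    print(as_category.cat.categories.tolist())      # ['blue', 'green', 'red']

    # every row holds a small integer code that points to its category
    print(as_category.cat.codes.head(4).tolist())   # [2, 1, 2, 0]

    # compare the memory footprint of both representations
    object_bytes = colors.memory_usage(deep=True)
    category_bytes = as_category.memory_usage(deep=True)
    print(category_bytes < object_bytes)            # True

The saving depends on the ratio of distinct values to rows, so it shrinks for
variables with very high cardinality.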

:class:`MatchCategories()` remembers the encodings or levels that represent each
category. It can therefore be used to ensure that the correct encoding is
applied when passing categorical data to modeling packages that support this
dtype, or to prevent unseen categories from reaching a later transformer
or estimator in a pipeline, for example.

Let's explore this with an example. First, we load the Titanic dataset and split it into
a train and a test set:

.. code:: python

    from feature_engine.preprocessing import MatchCategories
    from feature_engine.datasets import load_titanic

    # Load dataset
    data = load_titanic(
        predictors_only=True,
        handle_missing=True,
        cabin="letter_only",
    )

    data['pclass'] = data['pclass'].astype('O')

    # Split test and train
    train = data.iloc[0:1000, :]
    test = data.iloc[1000:, :]

Now, we set up :class:`MatchCategories()` and fit it to the train set.

.. code:: python

    # set up the transformer
    match_categories = MatchCategories(missing_values="ignore")

    # learn the mapping of categories to integers in the train set
    match_categories.fit(train)

:class:`MatchCategories()` stores the mappings learned from the train set in its
``category_dict_`` attribute:

.. code:: python

    # the transformer stores the mappings for categorical variables
    match_categories.category_dict_

.. code:: python

    {'pclass': Int64Index([1, 2, 3], dtype='int64'),
     'sex': Index(['female', 'male'], dtype='object'),
     'cabin': Index(['A', 'B', 'C', 'D', 'E', 'F', 'M', 'T'], dtype='object'),
     'embarked': Index(['C', 'Missing', 'Q', 'S'], dtype='object')}

If we transform the test dataframe using the same `match_categories` object,
categorical variables will be converted to the 'category' dtype with the same
encoding (mapping from categories to integers) that was learned from the train
dataset:

.. code:: python

    # categories that appear in the train set
    train.embarked.unique()

.. code:: python

    array(['S', 'C', 'Missing', 'Q'], dtype=object)

.. code:: python
    
    # categories that appear in the test set ('Missing' is absent)
    test.embarked.unique()

.. code:: python

    array(['Q', 'S', 'C'], dtype=object)

.. code:: python
    
    # with 'match_categories', the encoding remains the same
    match_categories.transform(train).embarked.cat.categories

.. code:: python

    Index(['C', 'Missing', 'Q', 'S'], dtype='object')

.. code:: python

    # this will have the same encoding as the train set
    match_categories.transform(test).embarked.cat.categories

.. code:: python

    Index(['C', 'Missing', 'Q', 'S'], dtype='object')
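
To see why a shared set of levels matters, the following minimal sketch uses plain pandas
on hand-written copies of the ``embarked`` values shown above. It is an illustration of
the idea, not the transformer's internal code, and the variable names are hypothetical:

.. code:: python

    import pandas as pd

    # values copied from the outputs above
    train_embarked = pd.Series(["S", "C", "Missing", "Q"], dtype=object)
    test_embarked = pd.Series(["Q", "S", "C"], dtype=object)

    # casting the test set on its own derives the levels from the test values only
    independent = test_embarked.astype("category")
    print(independent.cat.categories.tolist())   # ['C', 'Q', 'S']
    print(independent.cat.codes.tolist())        # [1, 2, 0]

    # reusing the levels seen in the train set keeps the integer codes aligned
    train_levels = train_embarked.astype("category").cat.categories
    aligned = pd.Series(pd.Categorical(test_embarked, categories=train_levels))
    print(aligned.cat.categories.tolist())       # ['C', 'Missing', 'Q', 'S']
    print(aligned.cat.codes.tolist())            # [2, 3, 0]

With independent casting, 'Q' receives code 1 in the test set but code 2 in the train set.
A model that reads the integer codes would then confuse the two categories.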

If a category was not present in the training data, it will not be mapped
to any integer and will therefore not be encoded. This behavior can be modified through the
parameter `errors`:

.. code:: python

    # categories present in the train data
    train.cabin.unique()

.. code:: python

    array(['B', 'C', 'E', 'D', 'A', 'M', 'T', 'F'], dtype=object)

.. code:: python
    
    # categories present in the test data - 'G' is new
    test.cabin.unique()

.. code:: python

    array(['M', 'F', 'E', 'G'], dtype=object)

.. code:: python

    match_categories.transform(train).cabin.unique()

.. code:: python

    ['B', 'C', 'E', 'D', 'A', 'M', 'T', 'F']
    Categories (8, object): ['A', 'B', 'C', 'D', 'E', 'F', 'M', 'T']

.. code:: python
    
    # unseen category 'G' will not get mapped to any integer
    match_categories.transform(test).cabin.unique()

.. code:: python

    ['M', 'F', 'E', NaN]
    Categories (8, object): ['A', 'B', 'C', 'D', 'E', 'F', 'M', 'T']
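
Because unseen categories end up as missing values, it can be useful to check which values
were lost in the conversion. The snippet below is a minimal sketch in plain pandas, using
hand-written copies of the ``cabin`` values shown above and hypothetical variable names:

.. code:: python

    import pandas as pd

    # levels learned from the train set and raw values from the test set
    train_levels = pd.Index(["A", "B", "C", "D", "E", "F", "M", "T"], dtype=object)
    test_cabin = pd.Series(["M", "F", "E", "G"], dtype=object)

    # cast the test values using the train levels
    encoded = pd.Series(pd.Categorical(test_cabin, categories=train_levels))

    # pandas assigns the code -1 to values outside the known levels
    codes = encoded.cat.codes.tolist()
    print(codes)            # [6, 5, 4, -1]

    # rows that held a value before the cast but are missing afterwards
    unseen_mask = encoded.isna() & test_cabin.notna()
    unseen_values = sorted(test_cabin[unseen_mask].unique())
    print(unseen_values)    # ['G']

A check like this makes it possible to decide whether to impute, drop, or report the
affected rows before they reach an estimator.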


When to use the transformer
^^^^^^^^^^^^^^^^^^^^^^^^^^^

This transformer is useful when creating custom transformers for categorical columns,
as well as when passing categorical columns to modeling packages that support them
natively but leave the casting of variables to the user, such as ``lightgbm`` or ``glum``.