CountFrequencyEncoder#
Count encoding and frequency encoding are two categorical encoding techniques that are commonly used during data preprocessing in Kaggle’s data science competitions, even though their predictive value is not always immediately obvious.
Count encoding consists of replacing the categories of categorical features by their counts, which are estimated from the training set. For example, in the variable color, if 10 observations are blue and 5 observations are red, blue will be replaced by 10 and red by 5.
Frequency encoding consists of replacing the labels of categorical data with their frequency, which is also estimated from the training set. For example, in the variable City, if London appears in 10% of the observations and Bristol in 1%, London will be replaced by 0.1 and Bristol by 0.01.
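As a quick illustration of both mappings, here is a minimal sketch with pandas; the toy City column and the helper column names are made up for this example:

import pandas as pd

# Toy data: a single categorical column.
df = pd.DataFrame({"City": ["London", "London", "London", "Bristol", "Paris"]})

# Count encoding: map each category to the number of times it appears.
counts = df["City"].value_counts()
df["City_count"] = df["City"].map(counts)

# Frequency encoding: map each category to its relative frequency.
freqs = df["City"].value_counts(normalize=True)
df["City_freq"] = df["City"].map(freqs)

print(df)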
Count and frequency encoding in machine learning#
We’d use count encoding or frequency encoding when we think that the representation of the categories in the dataset has some sort of predictive value. To be honest, the only example that I can think of where count encoding could be useful is in sales forecasting or sales data analysis scenarios, where the count of a product or an item represents its popularity. In other words, we may be more likely to sell a product with a high count.
Count encoding and frequency encoding can be suitable for categorical variables with high cardinality because these encodings cause what are called collisions: categories that appear in a similar number of observations are replaced with similar, if not identical, values, which reduces the variability. For example, if Paris and Rome each appear five times in the training set, both are replaced by 5 and become indistinguishable.
This can, of course, result in a loss of information by placing two otherwise different categories in the same pot. On the other hand, if we are using count encoding or frequency encoding, we have reasons to believe that the count or the frequency is a good indicator of predictive power, or somehow captures insight from the data, so that categories with similar counts show similar patterns or behaviors.
Count and Frequency encoding with Feature-engine#
The CountFrequencyEncoder() replaces the categories of categorical features with either the count or the percentage of observations each category shows in the training set.

With CountFrequencyEncoder() we can automatically encode all categorical features in the dataset, or only a subset of them, by passing the variable names in a list to the variables parameter when we set up the encoder.

By default, CountFrequencyEncoder() will encode only categorical data. If we want to encode numerical variables, we need to say so explicitly by setting the parameter ignore_format to True.
Count and frequency encoding with unseen categories#
When we learn mappings from strings to numbers, either with count encoding or other encoding techniques like ordinal encoding or target encoding, we do so by observing the categories in the training set. Hence, we won’t have mappings for categories that appear only in the test set. These are the so-called “unseen categories.”
When encountering unseen categories, CountFrequencyEncoder() will ignore them by default, which means that unseen categories will be replaced with missing values. We can instruct the encoder to raise an error when a new category is encountered, or alternatively, to encode unseen categories with zero.
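As a rough sketch, and assuming a recent Feature-engine version in which this behavior is controlled through an unseen parameter with the values "ignore", "raise", and "encode", the setup could look like this:

from feature_engine.encoding import CountFrequencyEncoder

# Assumed parameter: unseen="encode" replaces unseen categories with 0
# instead of NaN; check the Feature-engine version you are using.
encoder = CountFrequencyEncoder(
    encoding_method="count",
    unseen="encode",
)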
Count encoding vs other encoding methods#
Count and frequency encoding, similar to ordinal encoding and contrary to one-hot encoding, feature hashing, or binary encoding, do not increase the dataset dimensionality. From one categorical variable, we obtain one numerical feature.
Python example#
Let’s examine the functionality of CountFrequencyEncoder() by using the Titanic dataset. We’ll start by loading the libraries and functions, loading the dataset, and then splitting it into a training and a testing set.
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.encoding import CountFrequencyEncoder
X, y = load_titanic(
return_X_y_frame=True,
handle_missing=True,
predictors_only=True,
cabin="letter_only",
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0,
)
print(X_train.head())
We see the resulting dataframe with the predictor variables below:
pclass sex age sibsp parch fare cabin embarked
501 2 female 13.000000 0 1 19.5000 M S
588 2 female 4.000000 1 1 23.0000 M S
402 2 female 30.000000 1 0 13.8583 M C
1193 3 male 29.881135 0 0 7.7250 M Q
686 3 female 22.000000 0 0 7.7250 M Q
This dataset has three obvious categorical features: cabin, embarked, and sex. In addition, pclass could also be handled as a categorical variable.
Count encoding#
We’ll start by encoding the three categorical variables using their counts, that is, replacing the strings with the number of times each category is present in the training dataset.
encoder = CountFrequencyEncoder(
encoding_method='count',
variables=['cabin', 'sex', 'embarked'],
)
encoder.fit(X_train)
With fit(), the count encoder learns the counts of each category. We can inspect the counts as follows:
encoder.encoder_dict_
We see the counts of each category for each of the three variables in the following output:
{'cabin': {'M': 702,
'C': 71,
'B': 42,
'E': 32,
'D': 32,
'A': 17,
'F': 15,
'G': 4,
'T': 1},
'sex': {'male': 581, 'female': 335},
'embarked': {'S': 652, 'C': 179, 'Q': 83, 'Missing': 2}}
Now, we can go ahead and encode the variables:
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
print(train_t.head())
We see the resulting dataframe where the categorical features are now replaced with integer values corresponding to the category counts:
pclass sex age sibsp parch fare cabin embarked
501 2 335 13.000000 0 1 19.5000 702 652
588 2 335 4.000000 1 1 23.0000 702 652
402 2 335 30.000000 1 0 13.8583 702 179
1193 3 581 29.881135 0 0 7.7250 702 83
686 3 335 22.000000 0 0 7.7250 702 83
We can now use the encoded dataframes to train machine learning models.
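For instance, here is a minimal sketch of fitting a model on the encoded data; the choice of a random forest is purely illustrative:

from sklearn.ensemble import RandomForestClassifier

# Train on the count-encoded features and evaluate on the encoded test set.
model = RandomForestClassifier(random_state=0)
model.fit(train_t, y_train)
print(model.score(test_t, y_test))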
Frequency encoding#
Let’s now perform frequency encoding. We’ll encode two categorical variables and one numerical variable; hence, we need to set the encoder to ignore the variables’ format:
encoder = CountFrequencyEncoder(
encoding_method='frequency',
variables=['cabin', 'pclass', 'embarked'],
ignore_format=True,
)
Now, we fit the frequency encoder to the train set and transform it straightaway, and then we transform the test set:
t_train = encoder.fit_transform(X_train)
t_test = encoder.transform(X_test)
print(t_test.head())
In the following output we see the transformed dataframe, where the categorical features are now encoded into their frequencies:
pclass sex age sibsp parch fare cabin embarked
1139 0.543668 male 38.000000 0 0 7.8958 0.766376 0.71179
533 0.205240 female 21.000000 0 1 21.0000 0.766376 0.71179
459 0.205240 male 42.000000 1 0 27.0000 0.766376 0.71179
1150 0.543668 male 29.881135 0 0 14.5000 0.766376 0.71179
393 0.205240 male 25.000000 0 0 31.5000 0.766376 0.71179
With fit(), the encoder learns the frequencies of each category, which are stored in its encoder_dict_ attribute. We can inspect them like this:
encoder.encoder_dict_
In the encoder_dict_ we find the frequencies for each one of the unique categories of each variable to encode. This way, we can map the original value to the new value.
{'cabin': {'M': 0.7663755458515283,
'C': 0.07751091703056769,
'B': 0.04585152838427948,
'E': 0.034934497816593885,
'D': 0.034934497816593885,
'A': 0.018558951965065504,
'F': 0.016375545851528384,
'G': 0.004366812227074236,
'T': 0.001091703056768559},
'pclass': {3: 0.5436681222707423,
1: 0.25109170305676853,
2: 0.2052401746724891},
'embarked': {'S': 0.7117903930131004,
'C': 0.19541484716157206,
'Q': 0.0906113537117904,
'Missing': 0.002183406113537118}}
We can now use these dataframes to train machine learning algorithms.
With the method inverse_transform, we can transform the encoded dataframes back to their original representation, that is, we can replace the encodings with the original categorical values.
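For example, applying it to the frequency-encoded dataframes obtained above:

# Recover the original categories from the encoded train set.
train_orig = encoder.inverse_transform(t_train)
print(train_orig.head())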
Additional resources#
In the following notebook, you can find more details on the CountFrequencyEncoder() functionality and example plots with the encoded variables:

For more details about this and other feature engineering methods, check out these resources:
Our book:
Both our book and courses are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.