.. _count_freq_encoder:

.. currentmodule:: feature_engine.encoding

CountFrequencyEncoder
=====================

Count encoding and frequency encoding are two categorical encoding techniques that were
commonly used during data preprocessing in Kaggle's data science competitions, even when
their predictive value is not immediately obvious.

Count encoding consists of replacing the categories of categorical features by their
counts, which are estimated from the training set. For example, in the variable color,
if 10 observations are blue and 5 observations are red, blue will be replaced by 10 and
red by 5.

Frequency encoding consists of replacing the labels of categorical data with their
frequency, which is also estimated from the training set. For example, in the variable
City, if London appears in 10% of the observations and Bristol in 1%, London will be
replaced by 0.1 and Bristol by 0.01.

Count and frequency encoding in machine learning
------------------------------------------------

We'd use count encoding or frequency encoding when we think that the representation of
the categories in the dataset has some sort of predictive value.

To be honest, the only example that I can think of where count encoding could be useful
is in sales forecasting or sales data analysis scenarios, where the count of a product
or an item represents its popularity. In other words, we may be more likely to sell a
product with a high count.

Count encoding and frequency encoding can be suitable for categorical variables with
high cardinality because these types of categorical encoding will cause what is called
collisions: categories that are present in a similar number of observations will be
replaced with similar, if not the same, values, which reduces the variability.

This, of course, can result in the loss of information by placing two categories that
are otherwise different in the same pot. But on the other hand, if we are using count
encoding or frequency encoding, we have reasons to believe that the count or the
frequency is a good indicator of predictive performance, or somehow captures data
insight, so that categories with similar counts would show similar patterns or
behaviors.

Count and Frequency encoding with Feature-engine
------------------------------------------------

The :class:`CountFrequencyEncoder()` replaces categories of categorical features by
either the count or the percentage of observations each category shows in the training
set.

With :class:`CountFrequencyEncoder()` we can automatically encode all categorical
features in the dataset, or only a subset of them, by passing the variable names in a
list to the `variables` parameter when we set up the encoder.

By default, :class:`CountFrequencyEncoder()` will encode only categorical data. If we
want to encode numerical variables, we need to explicitly say so by setting the
parameter `ignore_format` to True.

Count and frequency encoding with unseen categories
----------------------------------------------------

When we learn mappings from strings to numbers, either with count encoding or other
encoding techniques like ordinal encoding or target encoding, we do so by observing the
categories in the training set. Hence, we won't have mappings for categories that appear
only in the test set. These are the so-called "unseen categories."

When encountering unseen categories, :class:`CountFrequencyEncoder()` will ignore them
by default, which means that unseen categories will be replaced with missing values. We
can instruct the encoder to raise an error when a new category is encountered, or,
alternatively, to encode unseen categories with zero.
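
As a minimal sketch of how this behaviour could be selected, assuming the `unseen`
parameter available in recent Feature-engine releases (check the API reference of your
installed version for the exact parameter and option names):

.. code:: python

    from feature_engine.encoding import CountFrequencyEncoder

    # default behaviour: unseen categories are replaced with missing values (NaN)
    encoder = CountFrequencyEncoder(encoding_method='count', unseen='ignore')

    # raise an error if the data to transform contains unseen categories
    encoder = CountFrequencyEncoder(encoding_method='count', unseen='raise')

    # replace unseen categories with 0 instead of NaN
    encoder = CountFrequencyEncoder(encoding_method='count', unseen='encode')
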

Count encoding vs other encoding methods
----------------------------------------

Count and frequency encoding, similar to ordinal encoding and contrary to one-hot
encoding, feature hashing, or binary encoding, do not increase the dataset
dimensionality. From one categorical variable, we obtain one numerical feature.

Python example
--------------

Let's examine the functionality of :class:`CountFrequencyEncoder()` by using the Titanic
dataset. We'll start by loading the libraries and functions, loading the dataset, and
then splitting it into a training and a testing set.

.. code:: python

    from sklearn.model_selection import train_test_split

    from feature_engine.datasets import load_titanic
    from feature_engine.encoding import CountFrequencyEncoder

    X, y = load_titanic(
        return_X_y_frame=True,
        handle_missing=True,
        predictors_only=True,
        cabin="letter_only",
    )

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0,
    )

    print(X_train.head())

We see the resulting dataframe with the predictor variables below:

.. code:: python

          pclass     sex        age  sibsp  parch     fare cabin embarked
    501        2  female  13.000000      0      1  19.5000     M        S
    588        2  female   4.000000      1      1  23.0000     M        S
    402        2  female  30.000000      1      0  13.8583     M        C
    1193       3    male  29.881135      0      0   7.7250     M        Q
    686        3  female  22.000000      0      0   7.7250     M        Q

This dataset has three obvious categorical features: cabin, embarked, and sex. In
addition, pclass could also be handled as a categorical variable.

Count encoding
~~~~~~~~~~~~~~

We'll start by encoding the three categorical variables using their counts, that is,
replacing the strings with the number of times each category is present in the training
dataset.

.. code:: python

    encoder = CountFrequencyEncoder(
        encoding_method='count',
        variables=['cabin', 'sex', 'embarked'],
    )

    encoder.fit(X_train)

With `fit()`, the count encoder learns the counts of each category. We can inspect the
counts as follows:

.. code:: python

    encoder.encoder_dict_

We see the counts of each category for each of the three variables in the following
output:

.. code:: python

    {'cabin': {'M': 702,
      'C': 71,
      'B': 42,
      'E': 32,
      'D': 32,
      'A': 17,
      'F': 15,
      'G': 4,
      'T': 1},
     'sex': {'male': 581, 'female': 335},
     'embarked': {'S': 652, 'C': 179, 'Q': 83, 'Missing': 2}}

Now, we can go ahead and encode the variables:

.. code:: python

    train_t = encoder.transform(X_train)
    test_t = encoder.transform(X_test)

    print(train_t.head())

We see the resulting dataframe, where the categorical features are now replaced with
integer values corresponding to the category counts:

.. code:: python

          pclass  sex        age  sibsp  parch     fare  cabin  embarked
    501        2  335  13.000000      0      1  19.5000    702       652
    588        2  335   4.000000      1      1  23.0000    702       652
    402        2  335  30.000000      1      0  13.8583    702       179
    1193       3  581  29.881135      0      0   7.7250    702        83
    686        3  335  22.000000      0      0   7.7250    702        83

We can now use the encoded dataframes to train machine learning models.

Frequency encoding
~~~~~~~~~~~~~~~~~~

Let's now perform frequency encoding. We'll encode two categorical and one numerical
variable, hence, we need to instruct the encoder to ignore the variable format:

.. code:: python

    encoder = CountFrequencyEncoder(
        encoding_method='frequency',
        variables=['cabin', 'pclass', 'embarked'],
        ignore_format=True,
    )

Now, we fit the frequency encoder to the train set and transform it straightaway, and
then we transform the test set:

.. code:: python

    t_train = encoder.fit_transform(X_train)
    t_test = encoder.transform(X_test)

    print(t_test.head())

In the following output we see the transformed dataframe, where the categorical features
are now encoded into their frequencies:

.. code:: python

            pclass     sex        age  sibsp  parch     fare     cabin  embarked
    1139  0.543668    male  38.000000      0      0   7.8958  0.766376   0.71179
    533   0.205240  female  21.000000      0      1  21.0000  0.766376   0.71179
    459   0.205240    male  42.000000      1      0  27.0000  0.766376   0.71179
    1150  0.543668    male  29.881135      0      0  14.5000  0.766376   0.71179
    393   0.205240    male  25.000000      0      0  31.5000  0.766376   0.71179

With `fit()` the encoder learns the frequencies of each category, which are stored in
its `encoder_dict_` parameter. We can inspect them like this:

.. code:: python

    encoder.encoder_dict_

In the `encoder_dict_` we find the frequencies for each one of the unique categories of
each variable to encode. This way, we can map the original value to the new value.

.. code:: python

    {'cabin': {'M': 0.7663755458515283,
      'C': 0.07751091703056769,
      'B': 0.04585152838427948,
      'E': 0.034934497816593885,
      'D': 0.034934497816593885,
      'A': 0.018558951965065504,
      'F': 0.016375545851528384,
      'G': 0.004366812227074236,
      'T': 0.001091703056768559},
     'pclass': {3: 0.5436681222707423,
      1: 0.25109170305676853,
      2: 0.2052401746724891},
     'embarked': {'S': 0.7117903930131004,
      'C': 0.19541484716157206,
      'Q': 0.0906113537117904,
      'Missing': 0.002183406113537118}}

We can now use these dataframes to train machine learning algorithms.

With the method `inverse_transform`, we can transform the encoded dataframes back to
their original representation, that is, we can replace the encoding with the original
categorical values.
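
As a minimal sketch, assuming the frequency encoder and the `t_train` dataframe from the
previous steps:

.. code:: python

    # map the frequency values back to the original category labels
    train_orig = encoder.inverse_transform(t_train)

    print(train_orig['embarked'].unique())

Keep in mind that if two categories were encoded to exactly the same value (a
collision), the inverse transformation cannot tell them apart.
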

Additional resources
--------------------

In the following notebook, you can find more details about the
:class:`CountFrequencyEncoder()` functionality and example plots with the encoded
variables:

- `Jupyter notebook `_

For more details about this and other feature engineering methods check out these
resources:

.. figure:: ../../images/feml.png
   :width: 300
   :figclass: align-center
   :align: left
   :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning

   Feature Engineering for Machine Learning

.. figure:: ../../images/fetsf.png
   :width: 300
   :figclass: align-center
   :align: right
   :target: https://www.trainindata.com/p/feature-engineering-for-forecasting

   Feature Engineering for Time Series Forecasting

|
|
|
|
|
|
|
|
|
|

Our book:

.. figure:: ../../images/cookbook.png
   :width: 200
   :figclass: align-center
   :align: left
   :target: https://packt.link/0ewSo

   Python Feature Engineering Cookbook

|
|
|
|
|
|
|
|
|
|
|
|
|

Both our book and courses are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.