CountFrequencyEncoder#
The CountFrequencyEncoder()
replaces categories by either the count or the
percentage of observations per category. For example in the variable colour, if 10
observations are blue, blue will be replaced by 10. Alternatively, if 10% of the
observations are blue, blue will be replaced by 0.1.
Let’s look at an example using the Titanic Dataset.
First, let’s load the data and separate it into train and test:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import CountFrequencyEncoder
# Load dataset
def load_titanic():
data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
data = data.replace('?', np.nan)
data['cabin'] = data['cabin'].astype(str).str[0]
data['pclass'] = data['pclass'].astype('O')
data['embarked'].fillna('C', inplace=True)
return data
data = load_titanic()
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
data.drop(['survived', 'name', 'ticket'], axis=1),
data['survived'], test_size=0.3, random_state=0)
Now, we set up the CountFrequencyEncoder()
to replace the categories by their
frequencies, only in the 3 indicated variables:
# set up the encoder
encoder = CountFrequencyEncoder(encoding_method='frequency',
variables=['cabin', 'pclass', 'embarked'])
# fit the encoder
encoder.fit(X_train)
With fit()
the encoder learns the frequencies of each category, which are stored in
its encoder_dict_
parameter:
encoder.encoder_dict_
In the encoder_dict_
we find the frequencies for each one of the categories of each
variable that we want to encode. This way, we can map the original value to the new
value.
{'cabin': {'n': 0.7663755458515283,
'C': 0.07751091703056769,
'B': 0.04585152838427948,
'E': 0.034934497816593885,
'D': 0.034934497816593885,
'A': 0.018558951965065504,
'F': 0.016375545851528384,
'G': 0.004366812227074236,
'T': 0.001091703056768559},
'pclass': {3: 0.5436681222707423,
1: 0.25109170305676853,
2: 0.2052401746724891},
'embarked': {'S': 0.7117903930131004,
'C': 0.19759825327510916,
'Q': 0.0906113537117904}}
We can now go ahead and replace the original strings with the numbers:
# transform the data
train_t= encoder.transform(X_train)
test_t= encoder.transform(X_test)
More details#
In the following notebook, you can find more details into the CountFrequencyEncoder()
functionality and example plots with the encoded variables:
All notebooks can be found in a dedicated repository.