RareLabelEncoder#
The RareLabelEncoder()
groups infrequent categories into one new category
called ‘Rare’ or a different string indicated by the user. We need to specify the
minimum percentage of observations a category should have to be preserved and the
minimum number of unique categories a variable should have to be re-grouped.
tol
In the parameter tol
we indicate the minimum proportion of observations a category
should have, not to be grouped. In other words, categories which frequency, or proportion
of observations is <= tol
will be grouped into a unique term.
n_categories
In the parameter n_categories
we indicate the minimum cardinality of the categorical
variable in order to group infrequent categories. For example, if n_categories=5
,
categories will be grouped only in those categorical variables with more than 5 unique
categories. The rest of the variables will be ignored.
This parameter is useful when we have big datasets and do not have time to examine all categorical variables individually. This way, we ensure that variables with low cardinality are not reduced any further.
max_n_categories
In the parameter max_n_categories
we indicate the maximum number of unique categories
that we want in the encoded variable. If max_n_categories=5
, then the most popular 5
categories will remain in the variable after the encoding, all other will be grouped into
a single category.
This parameter is useful if we are going to perform one hot encoding at the back of it, to control the expansion of the feature space.
Example
Let’s look at an example using the Titanic Dataset.
First, let’s load the data and separate it into train and test:
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.encoding import RareLabelEncoder
X, y = load_titanic(
return_X_y_frame=True,
handle_missing=True,
predictors_only=True,
cabin="letter_only",
)
X["pclass"] = X["pclass"].astype("O")
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0,
)
print(X_train.head())
We see the resulting data below:
pclass sex age sibsp parch fare cabin embarked
501 2 female 13.000000 0 1 19.5000 M S
588 2 female 4.000000 1 1 23.0000 M S
402 2 female 30.000000 1 0 13.8583 M C
1193 3 male 29.881135 0 0 7.7250 M Q
686 3 female 22.000000 0 0 7.7250 M Q
Let’s explore the number of uniue categories in the variable "cabin"
.
X_train["cabin"].unique()
We see the number of unique categories in the output below:
array(['M', 'E', 'C', 'D', 'B', 'A', 'F', 'T', 'G'], dtype=object)
Now, we set up the RareLabelEncoder()
to group categories shown by less than 3%
of the observations into a new group or category called ‘Rare’. We will group the
categories in the indicated variables if they have more than 2 unique categories each.
encoder = RareLabelEncoder(
tol=0.03,
n_categories=2,
variables=['cabin', 'pclass', 'embarked'],
replace_with='Rare',
)
# fit the encoder
encoder.fit(X_train)
With fit()
, the RareLabelEncoder()
finds the categories present in more than
3% of the observations, that is, those that will not be grouped. These categories can
be found in the encoder_dict_
attribute.
encoder.encoder_dict_
In the encoder_dict_
we find the most frequent categories per variable to encode.
Any category that is not in this dictionary, will be grouped.
{'cabin': ['M', 'C', 'B', 'E', 'D'],
'pclass': [3, 1, 2],
'embarked': ['S', 'C', 'Q']}
Now we can go ahead and transform the variables:
# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
Let’s now inspect the number of unique categories in the variable "cabin"
after the
transformation:
train_t["cabin"].unique()
In the output below, we see that the infrequent categories have been replaced by
"Rare"
.
array(['M', 'E', 'C', 'D', 'B', 'Rare'], dtype=object)
We can also specify the maximum number of categories that can be considered frequent
using the max_n_categories
parameter.
Let’s begin by creating a toy dataframe and count the values of observations per category:
from feature_engine.encoding import RareLabelEncoder
import pandas as pd
data = {'var_A': ['A'] * 10 + ['B'] * 10 + ['C'] * 2 + ['D'] * 1}
data = pd.DataFrame(data)
data['var_A'].value_counts()
A 10
B 10
C 2
D 1
Name: var_A, dtype: int64
In this block of code, we group the categories only for variables with more than 3 unique categories and then we plot the result:
rare_encoder = RareLabelEncoder(tol=0.05, n_categories=3)
rare_encoder.fit_transform(data)['var_A'].value_counts()
A 10
B 10
C 2
Rare 1
Name: var_A, dtype: int64
Now, we retain the 2 most frequent categories of the variable and group the rest into the ‘Rare’ group:
rare_encoder = RareLabelEncoder(tol=0.05, n_categories=3, max_n_categories=2)
Xt = rare_encoder.fit_transform(data)
Xt['var_A'].value_counts()
A 10
B 10
Rare 3
Name: var_A, dtype: int64
Tips#
The RareLabelEncoder()
can be used to group infrequent categories and like this
control the expansion of the feature space if using one hot encoding.
Some categorical encodings will also return NAN if a category is present in the test set, but was not seen in the train set. This inconvenient can usually be avoided if we group rare labels before training the encoders.
Some categorical encoders will also return NAN if there is not enough observations for
a certain category. For example the WoEEncoder()
and the PRatioEncoder()
.
This behaviour can be also prevented by grouping infrequent labels before the encoding
with the RareLabelEncoder()
.
Additional resources#
In the following notebook, you can find more details into the RareLabelEncoder()
functionality and example plots with the encoded variables:
For more details about this and other feature engineering methods check out these resources:

Feature Engineering for Machine Learning#
Or read our book:

Python Feature Engineering Cookbook#
Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.