OneHotEncoder#
One-hot encoding is a method to represent categorical data using binary variables: each category gets its own binary variable, which takes the value 1 if the category is present and 0 otherwise. The binary variables are also known as dummy variables.
To represent the categorical feature “is-smoker” with categories “Smoker” and “Non-smoker”, we can generate the dummy variable “Smoker”, which takes 1 if the person smokes and 0 otherwise. We can also generate the variable “Non-smoker”, which takes 1 if the person does not smoke and 0 otherwise.
The following table shows a possible one hot encoded representation of the variable “is smoker”:
is smoker | smoker | non-smoker
---|---|---
smoker | 1 | 0
non-smoker | 0 | 1
non-smoker | 0 | 1
smoker | 1 | 0
non-smoker | 0 | 1
For the categorical variable Country with values England, Argentina, and Germany, we can create three variables called England, Argentina, and Germany. These variables will take the value of 1 if the observation is England, Argentina, or Germany, respectively, and 0 otherwise.
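To make the idea concrete, here is a minimal sketch using plain pandas (the data is made up; Feature-engine's encoder is shown later in this page):

import pandas as pd

# toy data with the two categorical variables discussed above
df = pd.DataFrame({
    "is_smoker": ["smoker", "non-smoker", "non-smoker", "smoker", "non-smoker"],
    "country": ["England", "Argentina", "Germany", "England", "Argentina"],
})

# one-hot encode both variables into k dummies each (one column per category)
print(pd.get_dummies(df, columns=["is_smoker", "country"], dtype=int))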
Encoding into k vs k-1 variables#
A categorical feature with k unique categories can be encoded using k-1 binary variables.
For Smoker, k is 2 as it contains two labels (Smoker and Non-Smoker), so we only need one binary variable (k - 1 = 1) to capture all of the information. In the following table we see that the dummy variable Smoker fully represents the original categorical values:
is smoker | smoker
---|---
smoker | 1
non-smoker | 0
non-smoker | 0
smoker | 1
non-smoker | 0
For the Country variable, which has three categories (k=3; England, Argentina, and Germany), we need two (k - 1 = 2) binary variables to capture all the information. The variable will be fully represented like this:
Country | England | Argentina
---|---|---
England | 1 | 0
Argentina | 0 | 1
Germany | 0 | 0
As we see in the previous table, if the observation is England, it will show the value 1 in the England variable; if the observation is Argentina, it will show the value 1 in the Argentina variable; and if the observation is Germany, it will show zeroes in both dummy variables. Thus, by looking at the values of the k-1 dummies, we can infer the original categorical value of each observation.
Encoding into k-1 binary variables is well suited for linear regression models. Linear models evaluate all features during fit; thus, with k-1 dummies they have all the information about the original categorical variable.
There are a few occasions in which we may prefer to encode the categorical variables with k binary variables.
Encode into k dummy variables if training decision tree-based models or performing feature selection. Decision tree-based models and many feature selection algorithms evaluate variables or groups of variables separately. Thus, when encoding into k-1 dummies, the last category is not examined; in other words, we lose the information contained in that category. The following sketch contrasts both encodings.
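Here is a minimal pandas sketch (again with made-up data) contrasting k and k-1 dummies via the drop_first parameter of pd.get_dummies:

import pandas as pd

country = pd.Series(["England", "Argentina", "Germany", "England"], name="Country")

# k dummies: one binary column per category
print(pd.get_dummies(country, dtype=int))

# k-1 dummies: the first category (alphabetically, Argentina) is dropped;
# an observation with zeroes in all remaining columns belongs to the dropped category
print(pd.get_dummies(country, drop_first=True, dtype=int))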
Binary variables#
When a categorical variable has only 2 categories, like “Smoker” in our previous example, then encoding into k-1 suits all purposes, because the second dummy variable created by one hot encoding is completely redundant.
Encoding popular categories#
One hot encoding can increase the feature space dramatically, particularly if we have many categorical features, or the features have high cardinality. To control the feature space, it is common practice to encode only the most frequent categories in each categorical variable.
When we encode the most frequent categories, we will create binary variables for each of these frequent categories, and when the observation has a different, less popular category, it will have a 0 in all binary variables. See the following example:
var | popular1 | popular2
---|---|---
popular1 | 1 | 0
popular2 | 0 | 1
popular1 | 1 | 0
non-popular | 0 | 0
popular2 | 0 | 1
less popular | 0 | 0
unpopular | 0 | 0
lonely | 0 | 0
As we see in the previous table, less popular categories are represented as a group by showing zeroes in all binary variables.
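As an illustration of the mechanism only (a hand-rolled pandas sketch with made-up data, not Feature-engine's implementation), dummies for the 2 most frequent categories could be built like this:

import pandas as pd

var = pd.Series(
    ["popular1", "popular2", "popular1", "non-popular", "popular2", "less popular"],
    name="var",
)

# find the 2 most frequent categories
top = var.value_counts().head(2).index

# one dummy per frequent category; all other categories show zeroes everywhere
dummies = pd.DataFrame({f"var_{cat}": (var == cat).astype(int) for cat in top})
print(dummies)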
OneHotEncoder#
Feature-engine’s OneHotEncoder() encodes categorical data as a one-hot numeric dataframe.
OneHotEncoder() can encode into k or k-1 dummy variables. The behaviour is specified through the drop_last parameter, which can be set to False for k, or to True for k-1 dummy variables.
OneHotEncoder() can specifically encode binary variables into k-1 variables (that is, 1 dummy) while encoding categorical features of higher cardinality into k dummies. This behaviour is specified by setting the parameter drop_last_binary=True.
This will ensure that for every binary variable in the dataset, that is, for every
categorical variable with ONLY 2 categories, only 1 dummy is created. This is recommended,
unless you suspect that the variable could, in principle, take more than 2 values.
OneHotEncoder() can also create binary variables for the n most popular categories, where n is determined by the user. For example, if we encode only the 6 most popular categories by setting the parameter top_categories=6, the transformer will add binary variables only for the 6 most frequent categories. The most frequent categories are those with the greatest number of observations. The remaining categories will show zeroes in each of the derived dummies. This behaviour is useful to control the expansion of the feature space when the categorical variables have high cardinality.
Note
The parameter drop_last is ignored when encoding the most popular categories.
Python implementation#
Let’s look at an example of one hot encoding with Feature-engine’s OneHotEncoder(), using the Titanic dataset.
We’ll start by importing the libraries, functions and classes, loading the data into a pandas dataframe, and dividing it into a training and a testing set:
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.encoding import OneHotEncoder
X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
    predictors_only=True,
    cabin="letter_only",
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())
We see the first 5 rows of the training data below:
pclass sex age sibsp parch fare cabin embarked
501 2 female 13.000000 0 1 19.5000 M S
588 2 female 4.000000 1 1 23.0000 M S
402 2 female 30.000000 1 0 13.8583 M C
1193 3 male 29.881135 0 0 7.7250 M Q
686 3 female 22.000000 0 0 7.7250 M Q
Let’s explore the cardinality of 4 of the categorical features:
X_train[['sex', 'pclass', 'cabin', 'embarked']].nunique()
sex 2
pclass 3
cabin 9
embarked 4
dtype: int64
We see that the variable sex has 2 categories, pclass has 3 categories, the variable cabin has 9 categories, and the variable embarked has 4 categories.
Let’s now set up the OneHotEncoder to encode 2 of the categorical variables into k-1 dummy variables:
encoder = OneHotEncoder(
    variables=['cabin', 'embarked'],
    drop_last=True,
)
encoder.fit(X_train)
With fit() the encoder learns the categories of the variables, which are stored in the attribute encoder_dict_.
encoder.encoder_dict_
{'cabin': ['M', 'E', 'C', 'D', 'B', 'A', 'F', 'T'],
'embarked': ['S', 'C', 'Q']}
The encoder_dict_ contains the categories that will be represented by dummy variables for each categorical variable.
With transform(), we go ahead and encode the variables. Note that, by default, OneHotEncoder() drops the original categorical variables, which are now represented by the dummy variables.
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
print(train_t.head())
Below we see the dummy variables added to the dataset; the original categorical variables are no longer in the dataframe:
pclass sex age sibsp parch fare cabin_M cabin_E \
501 2 female 13.000000 0 1 19.5000 1 0
588 2 female 4.000000 1 1 23.0000 1 0
402 2 female 30.000000 1 0 13.8583 1 0
1193 3 male 29.881135 0 0 7.7250 1 0
686 3 female 22.000000 0 0 7.7250 1 0
cabin_C cabin_D cabin_B cabin_A cabin_F cabin_T embarked_S \
501 0 0 0 0 0 0 1
588 0 0 0 0 0 0 1
402 0 0 0 0 0 0 0
1193 0 0 0 0 0 0 0
686 0 0 0 0 0 0 0
embarked_C embarked_Q
501 0 0
588 0 0
402 1 0
1193 0 1
686 0 1
Finding categorical variables automatically#
Feature-engine’s OneHotEncoder() can automatically find and encode all categorical features in the pandas dataframe. Let’s show that with an example.
Let’s set up the OneHotEncoder to find and encode all categorical features:
encoder = OneHotEncoder(
    variables=None,
    drop_last=True,
)
encoder.fit(X_train)
With fit(), the encoder finds the categorical features and identifies their unique categories. We can find the categorical variables like this:
encoder.variables_
['sex', 'cabin', 'embarked']
And we can identify the unique categories for each variable like this:
encoder.encoder_dict_
{'sex': ['female'],
'cabin': ['M', 'E', 'C', 'D', 'B', 'A', 'F', 'T'],
'embarked': ['S', 'C', 'Q']}
We can now encode the categorical variables:
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
print(train_t.head())
And here we see the resulting dataframe:
pclass age sibsp parch fare sex_female cabin_M cabin_E \
501 2 13.000000 0 1 19.5000 1 1 0
588 2 4.000000 1 1 23.0000 1 1 0
402 2 30.000000 1 0 13.8583 1 1 0
1193 3 29.881135 0 0 7.7250 0 1 0
686 3 22.000000 0 0 7.7250 1 1 0
cabin_C cabin_D cabin_B cabin_A cabin_F cabin_T embarked_S \
501 0 0 0 0 0 0 1
588 0 0 0 0 0 0 1
402 0 0 0 0 0 0 0
1193 0 0 0 0 0 0 0
686 0 0 0 0 0 0 0
embarked_C embarked_Q
501 0 0
588 0 0
402 1 0
1193 0 1
686 0 1
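Because Feature-engine transformers follow the scikit-learn API, the encoder can also be placed inside a scikit-learn Pipeline. The following is a minimal sketch reusing the train and test sets from above; the choice of model and its parameters are illustrative, and scikit-learn is assumed to be installed:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from feature_engine.encoding import OneHotEncoder

# chain the encoder with a linear model; k-1 dummies suit linear models
pipe = Pipeline([
    ("ohe", OneHotEncoder(drop_last=True)),
    ("logit", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))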
Encoding variables of type numeric#
By default, Feature-engine’s OneHotEncoder() will only encode categorical features. If you attempt to encode a variable of numeric dtype, it will raise an error. To avoid this error, you can instruct the encoder to ignore the data type format as follows:
enc = OneHotEncoder(
    variables=['pclass'],
    drop_last=True,
    ignore_format=True,
)
enc.fit(X_train)
train_t = enc.transform(X_train)
test_t = enc.transform(X_test)
print(train_t.head())
Note that pclass contained numeric values instead of strings, and the transformer one-hot encoded it into 2 dummies (k-1, because drop_last=True):
sex age sibsp parch fare cabin embarked pclass_2 \
501 female 13.000000 0 1 19.5000 M S 1
588 female 4.000000 1 1 23.0000 M S 1
402 female 30.000000 1 0 13.8583 M C 1
1193 male 29.881135 0 0 7.7250 M Q 0
686 female 22.000000 0 0 7.7250 M Q 0
pclass_3
501 0
588 0
402 0
1193 1
686 1
Encoding binary variables into 1 dummy#
With Feature-engine’s OneHotEncoder() we can encode all categorical variables into k dummies and the binary variables into 1 dummy (k-1) by setting the encoder as follows:
ohe = OneHotEncoder(
    variables=['sex', 'cabin', 'embarked'],
    drop_last=False,
    drop_last_binary=True,
)
train_t = ohe.fit_transform(X_train)
test_t = ohe.transform(X_test)
print(train_t.head())
As we see in the following output, for the variable sex we have only 1 dummy, while all the remaining variables have k dummies:
pclass age sibsp parch fare sex_female cabin_M cabin_E \
501 2 13.000000 0 1 19.5000 1 1 0
588 2 4.000000 1 1 23.0000 1 1 0
402 2 30.000000 1 0 13.8583 1 1 0
1193 3 29.881135 0 0 7.7250 0 1 0
686 3 22.000000 0 0 7.7250 1 1 0
cabin_C cabin_D cabin_B cabin_A cabin_F cabin_T cabin_G \
501 0 0 0 0 0 0 0
588 0 0 0 0 0 0 0
402 0 0 0 0 0 0 0
1193 0 0 0 0 0 0 0
686 0 0 0 0 0 0 0
embarked_S embarked_C embarked_Q embarked_Missing
501 1 0 0 0
588 1 0 0 0
402 0 1 0 0
1193 0 0 1 0
686 0 0 1 0
Encoding frequent categories#
If the categorical variables have high cardinality, we may end up with very big datasets after one hot encoding. In addition, if some of these variables are fairly constant or fairly similar, we may end up with one hot encoded features that are highly correlated, if not identical. To avoid this, we can encode only the most frequent categories.
To encode the 2 most frequent categories of each categorical column, we set up the transformer as follows:
ohe = OneHotEncoder(
    top_categories=2,
    variables=['pclass', 'cabin', 'embarked'],
    ignore_format=True,
)
train_t = ohe.fit_transform(X_train)
test_t = ohe.transform(X_test)
print(train_t.head())
As we see in the resulting dataframe, we created only 2 dummies per variable:
sex age sibsp parch fare pclass_3 pclass_1 cabin_M \
501 female 13.000000 0 1 19.5000 0 0 1
588 female 4.000000 1 1 23.0000 0 0 1
402 female 30.000000 1 0 13.8583 0 0 1
1193 male 29.881135 0 0 7.7250 1 0 1
686 female 22.000000 0 0 7.7250 1 0 1
cabin_C embarked_S embarked_C
501 0 1 0
588 0 1 0
402 0 0 1
1193 0 0 0
686 0 0 0
Finally, if we want to obtain the names of the columns in the resulting dataframe, we can do the following:
ohe.get_feature_names_out()
We see the names of the columns below:
['sex',
'age',
'sibsp',
'parch',
'fare',
'pclass_3',
'pclass_1',
'cabin_M',
'cabin_C',
'embarked_S',
'embarked_C']
Considerations#
Encoding categorical variables into k dummies will handle unknown categories automatically: categories not seen during training will show zeroes in all dummies.
Encoding categorical features into k-1 dummies will cause unseen categories to be treated as the dropped category.
Encoding only the top categories will make unseen categories part of the group of less popular categories.
If you add a big number of dummy variables to your data, many may be identical or highly correlated. Consider dropping identical and correlated features with the transformers from the selection module (see the sketch at the end of this section).
For alternative encoding methods used in data science, check the OrdinalEncoder() and other encoders included in the encoding module.
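The following is a minimal sketch of the last consideration, chaining the encoder with DropDuplicateFeatures() and DropCorrelatedFeatures() from the selection module; the correlation threshold is illustrative:

from sklearn.pipeline import Pipeline

from feature_engine.encoding import OneHotEncoder
from feature_engine.selection import DropCorrelatedFeatures, DropDuplicateFeatures

# encode into dummies, then drop dummies that turned out identical or highly correlated
pipe = Pipeline([
    ("ohe", OneHotEncoder(drop_last=True)),
    ("drop_duplicates", DropDuplicateFeatures()),
    ("drop_correlated", DropCorrelatedFeatures(threshold=0.8)),
])

train_t = pipe.fit_transform(X_train)
test_t = pipe.transform(X_test)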
Tutorials, books and courses#
For more details about OneHotEncoder()’s functionality visit:
For tutorials about this and other data preprocessing methods check out our online course:
Or read our book:
Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.