OneHotEncoder#

One-hot encoding is a method used to represent categorical data, where each category is represented by a binary variable. The binary variable takes the value 1 if the category is present and 0 otherwise. The binary variables are also known as dummy variables.

To represent the categorical feature “is-smoker” with categories “Smoker” and “Non-smoker”, we can generate the dummy variable “Smoker”, which takes 1 if the person smokes and 0 otherwise. We can also generate the variable “Non-smoker”, which takes 1 if the person does not smoke and 0 otherwise.

The following table shows a possible one hot encoded representation of the variable “is smoker”:

is smoker      smoker   non-smoker
smoker              1            0
non-smoker          0            1
non-smoker          0            1
smoker              1            0
non-smoker          0            1

For the categorical variable Country with values England, Argentina, and Germany, we can create three variables called England, Argentina, and Germany. These variables will take the value of 1 if the observation is England, Argentina, or Germany, respectively, and 0 otherwise.
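
To make this concrete, here is a minimal sketch using pandas’ get_dummies() (not Feature-engine’s encoder, which is introduced below) on a made-up dataframe that mirrors the examples above:

import pandas as pd

# toy dataframe for illustration only
df = pd.DataFrame({
    "is_smoker": ["smoker", "non-smoker", "non-smoker", "smoker", "non-smoker"],
    "country": ["England", "Argentina", "Germany", "England", "Argentina"],
})

# one dummy per category (k dummies per variable); dtype=int returns 0/1 instead of booleans
print(pd.get_dummies(df, columns=["is_smoker", "country"], dtype=int))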

Encoding into k vs k-1 variables#

A categorical feature with k unique categories can be encoded using k-1 binary variables. For Smoker, k is 2 as it contains two labels (Smoker and Non-Smoker), so we only need one binary variable (k - 1 = 1) to capture all of the information.

In the following table we see that the dummy variable Smoker fully represents the original categorical values:

is smoker      smoker
smoker              1
non-smoker          0
non-smoker          0
smoker              1
non-smoker          0

For the Country variable, which has three categories (k=3; England, Argentina, and Germany), we need two (k - 1 = 2) binary variables to capture all the information. The variable will be fully represented like this:

Country      England   Argentina
England            1           0
Argentina          0           1
Germany            0           0

As we see in the previous table, if the observation is England, it will show the value 1 in the England variable; if the observation is Argentina, it will show the value 1 in the Argentina variable; and if the observation is Germany, it will show zeroes in both dummy variables.

In this way, by looking at the values of the k-1 dummies, we can infer the original categorical value of each observation.
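
As a quick sketch, we can obtain the k-1 representation with Feature-engine’s OneHotEncoder() (covered in detail below) by setting drop_last=True; the toy dataframe is made up for illustration:

import pandas as pd
from feature_engine.encoding import OneHotEncoder

df = pd.DataFrame({"country": ["England", "Argentina", "Germany", "England"]})

# drop_last=True creates k-1 dummies; the dropped category is the one
# showing zeroes in all the remaining dummy columns
encoder = OneHotEncoder(variables=["country"], drop_last=True)
print(encoder.fit_transform(df))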

Encoding into k-1 binary variables is well-suited for linear regression models. Linear models evaluate all features during fit; thus, with k-1 dummies they have all the information about the original categorical variable.

There are a few occasions in which we may prefer to encode the categorical variables with k binary variables.

Encode into k dummy variables if training decision tree based models or performing feature selection. Decision tree based models and many feature selection algorithms evaluate variables or groups of variables separately. Thus, if encoding into k-1 dummies, the last category will not be examined. In other words, we lose the information contained in that category.

Binary variables#

When a categorical variable has only 2 categories, like “Smoker” in our previous example, then encoding into k-1 suits all purposes, because the second dummy variable created by one hot encoding is completely redundant.

OneHotEncoder#

Feature-engine’s OneHotEncoder() encodes categorical data as a one-hot numeric dataframe.

OneHotEncoder() can encode into k or k-1 dummy variables. The behaviour is specified through the drop_last parameter, which can be set to False for k, or to True for k-1 dummy variables.

OneHotEncoder() can specifically encode binary variables into k-1 variables (that is, 1 dummy) while encoding categorical features of higher cardinality into k dummies. This behaviour is specified by setting the parameter drop_last_binary=True. This will ensure that for every binary variable in the dataset, that is, for every categorical variable with ONLY 2 categories, only 1 dummy is created. This is recommended, unless you suspect that the variable could, in principle, take more than 2 values.

OneHotEncoder() can also create binary variables for the n most popular categories, n being determined by the user. For example, if we encode only the 6 most popular categories, by setting the parameter top_categories=6, the transformer will add binary variables only for the 6 most frequent categories. The most frequent categories are those with the greatest number of observations. The remaining categories will show zeroes in each one of the derived dummies. This behaviour is useful when the categorical variables are highly cardinal, to control the expansion of the feature space.

Note

The parameter drop_last is ignored when encoding the most popular categories.
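
As a quick reference, the following sketch shows how these parameters are combined when setting up the transformer; it only instantiates the encoder, without fitting it to data:

from feature_engine.encoding import OneHotEncoder

# k-1 dummies for every categorical variable
encoder = OneHotEncoder(drop_last=True)

# k dummies in general, but only 1 dummy for binary variables
encoder = OneHotEncoder(drop_last=False, drop_last_binary=True)

# dummies only for the 6 most frequent categories (drop_last is ignored here)
encoder = OneHotEncoder(top_categories=6)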

Python implementation#

Let’s look at an example of one hot encoding with Feature-engine’s OneHotEncoder(), using the Titanic Dataset.

We’ll start by importing the libraries, functions and classes, loading the data into a pandas dataframe, and dividing it into a training and a testing set:

from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.encoding import OneHotEncoder

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
    predictors_only=True,
    cabin="letter_only",
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())

We see the first 5 rows of the training data below:

      pclass     sex        age  sibsp  parch     fare cabin embarked
501        2  female  13.000000      0      1  19.5000     M        S
588        2  female   4.000000      1      1  23.0000     M        S
402        2  female  30.000000      1      0  13.8583     M        C
1193       3    male  29.881135      0      0   7.7250     M        Q
686        3  female  22.000000      0      0   7.7250     M        Q

Let’s explore the cardinality of 4 of the categorical features:

X_train[['sex', 'pclass', 'cabin', 'embarked']].nunique()
sex         2
pclass      3
cabin       9
embarked    4
dtype: int64

We see that the variable sex has 2 categories, pclass has 3 categories, the variable cabin has 9 categories, and the variable embarked has 4 categories.

Let’s now set up the OneHotEncoder to encode 2 of the categorical variables into k-1 dummy variables:

encoder = OneHotEncoder(
    variables=['cabin', 'embarked'],
    drop_last=True,
    )

encoder.fit(X_train)

With fit() the encoder learns the categories of the variables, which are stored in the attribute encoder_dict_.

encoder.encoder_dict_
{'cabin': ['M', 'E', 'C', 'D', 'B', 'A', 'F', 'T'],
 'embarked': ['S', 'C', 'Q']}

The encoder_dict_ contains the categories that will be represented by dummy variables for each categorical variable.

With transform, we go ahead and encode the variables. Note that by default, the OneHotEncoder() drops the original categorical variables, which are now represented by the one-hot array.

train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

print(train_t.head())

Below we see the one hot dummy variables added to the dataset; the original variables are no longer in the dataframe:

      pclass     sex        age  sibsp  parch     fare  cabin_M  cabin_E  \
501        2  female  13.000000      0      1  19.5000        1        0
588        2  female   4.000000      1      1  23.0000        1        0
402        2  female  30.000000      1      0  13.8583        1        0
1193       3    male  29.881135      0      0   7.7250        1        0
686        3  female  22.000000      0      0   7.7250        1        0

      cabin_C  cabin_D  cabin_B  cabin_A  cabin_F  cabin_T  embarked_S  \
501         0        0        0        0        0        0           1
588         0        0        0        0        0        0           1
402         0        0        0        0        0        0           0
1193        0        0        0        0        0        0           0
686         0        0        0        0        0        0           0

      embarked_C  embarked_Q
501            0           0
588            0           0
402            1           0
1193           0           1
686            0           1

Finding categorical variables automatically#

Feature-engine’s OneHotEncoder() can automatically find and encode all categorical features in the pandas dataframe. Let’s show that with an example.

Let’s set up the OneHotEncoder to find and encode all categorical features:

encoder = OneHotEncoder(
    variables=None,
    drop_last=True,
    )

encoder.fit(X_train)

With fit, the encoder finds the categorical features and identifies their unique categories. We can find the categorical variables like this:

encoder.variables_
['sex', 'cabin', 'embarked']

And we can identify the unique categories for each variable like this:

encoder.encoder_dict_
{'sex': ['female'],
 'cabin': ['M', 'E', 'C', 'D', 'B', 'A', 'F', 'T'],
 'embarked': ['S', 'C', 'Q']}

We can now encode the categorical variables:

train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

print(train_t.head())

And here we see the resulting dataframe:

      pclass        age  sibsp  parch     fare  sex_female  cabin_M  cabin_E  \
501        2  13.000000      0      1  19.5000           1        1        0
588        2   4.000000      1      1  23.0000           1        1        0
402        2  30.000000      1      0  13.8583           1        1        0
1193       3  29.881135      0      0   7.7250           0        1        0
686        3  22.000000      0      0   7.7250           1        1        0

      cabin_C  cabin_D  cabin_B  cabin_A  cabin_F  cabin_T  embarked_S  \
501         0        0        0        0        0        0           1
588         0        0        0        0        0        0           1
402         0        0        0        0        0        0           0
1193        0        0        0        0        0        0           0
686         0        0        0        0        0        0           0

      embarked_C  embarked_Q
501            0           0
588            0           0
402            1           0
1193           0           1
686            0           1

Encoding variables of type numeric#

By default, Feature-engine’s OneHotEncoder() will only encode categorical features. If you attempt to encode a variable of numeric dtype, it will raise an error. To avoid this error, you can instruct the encoder to ignore the data type format as follows:

enc = OneHotEncoder(
    variables=['pclass'],
    drop_last=True,
    ignore_format=True,
    )

enc.fit(X_train)

train_t = enc.transform(X_train)
test_t = enc.transform(X_test)

print(train_t.head())

Note that pclass had numeric values instead of strings, and it was one hot encoded by the transformer into 2 dummies:

         sex        age  sibsp  parch     fare cabin embarked  pclass_2  \
501   female  13.000000      0      1  19.5000     M        S         1
588   female   4.000000      1      1  23.0000     M        S         1
402   female  30.000000      1      0  13.8583     M        C         1
1193    male  29.881135      0      0   7.7250     M        Q         0
686   female  22.000000      0      0   7.7250     M        Q         0

      pclass_3
501          0
588          0
402          0
1193         1
686          1

Encoding binary variables into 1 dummy#

With Feature-engine’s OneHotEncoder() we can encode all categorical variables into k dummies and the binary variables into k-1 by setting the encoder as follows:

ohe = OneHotEncoder(
    variables=['sex', 'cabin','embarked'],
    drop_last=False,
    drop_last_binary=True,
    )

train_t = ohe.fit_transform(X_train)
test_t = ohe.transform(X_test)

print(train_t.head())

As we see in the following output, for the variable sex we have only 1 dummy, and for all the rest we have k dummies:

      pclass        age  sibsp  parch     fare  sex_female  cabin_M  cabin_E  \
501        2  13.000000      0      1  19.5000           1        1        0
588        2   4.000000      1      1  23.0000           1        1        0
402        2  30.000000      1      0  13.8583           1        1        0
1193       3  29.881135      0      0   7.7250           0        1        0
686        3  22.000000      0      0   7.7250           1        1        0

      cabin_C  cabin_D  cabin_B  cabin_A  cabin_F  cabin_T  cabin_G  \
501         0        0        0        0        0        0        0
588         0        0        0        0        0        0        0
402         0        0        0        0        0        0        0
1193        0        0        0        0        0        0        0
686         0        0        0        0        0        0        0

      embarked_S  embarked_C  embarked_Q  embarked_Missing
501            1           0           0                 0
588            1           0           0                 0
402            0           1           0                 0
1193           0           0           1                 0
686            0           0           1                 0

Encoding frequent categories#

If the categorical variables are highly cardinal, we may end up with very big datasets after one hot encoding. In addition, if some of these variables are fairly constant or fairly similar, we may end up with one hot encoded features that are highly correlated, if not identical. To avoid this behaviour, we can encode only the most frequent categories.

To encode the 2 most frequent categories of each categorical column, we set up the transformer as follows:

ohe = OneHotEncoder(
    top_categories=2,
    variables=['pclass', 'cabin', 'embarked'],
    ignore_format=True,
    )

train_t = ohe.fit_transform(X_train)
test_t = ohe.transform(X_test)

print(train_t.head())

As we see in the resulting dataframe, we created only 2 dummies per variable:

         sex        age  sibsp  parch     fare  pclass_3  pclass_1  cabin_M  \
501   female  13.000000      0      1  19.5000         0         0        1
588   female   4.000000      1      1  23.0000         0         0        1
402   female  30.000000      1      0  13.8583         0         0        1
1193    male  29.881135      0      0   7.7250         1         0        1
686   female  22.000000      0      0   7.7250         1         0        1

      cabin_C  embarked_S  embarked_C
501         0           1           0
588         0           1           0
402         0           0           1
1193        0           0           0
686         0           0           0

Finally, if we want to obtain the column names in the resulting dataframe we can do the following:

ohe.get_feature_names_out()

We see the names of the columns below:

['sex',
 'age',
 'sibsp',
 'parch',
 'fare',
 'pclass_3',
 'pclass_1',
 'cabin_M',
 'cabin_C',
 'embarked_S',
 'embarked_C']

Considerations#

Encoding categorical variables into k dummies will handle unknown categories automatically. Categories not seen during training will show zeroes in all dummies.

Encoding categorical features into k-1 dummies will cause unseen categories to be treated as the category that is dropped.

Encoding the top categories will make unseen categories part of the group of less popular categories.
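
The following is a minimal sketch of the first consideration above, using a made-up dataframe: a category not seen during fit() shows zeroes in all k dummies after transform():

import pandas as pd
from feature_engine.encoding import OneHotEncoder

train = pd.DataFrame({"colour": ["blue", "green", "blue"]})
test = pd.DataFrame({"colour": ["blue", "red"]})  # "red" was not seen during training

encoder = OneHotEncoder(variables=["colour"], drop_last=False)
encoder.fit(train)

# the unseen category "red" shows 0 in both colour_blue and colour_green
print(encoder.transform(test))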

If you add a large number of dummy variables to your data, many may be identical or highly correlated. Consider dropping identical and correlated features with the transformers from the selection module.

For alternative encoding methods used in data science check the OrdinalEncoder() and other encoders included in the encoding module.

Tutorials, books and courses#

For more details about OneHotEncoder()’s functionality visit:

For tutorials about this and other data preprocessing methods check out our online course:

Feature Engineering for Machine Learning#

Or read our book:

Python Feature Engineering Cookbook#

Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.