OneHotEncoder#
One-hot encoding is a method to represent categorical data using binary variables: each category gets its own binary variable, which takes the value 1 if the category is present and 0 otherwise. The binary variables are also known as dummy variables.
To represent the categorical feature “is-smoker” with categories “Smoker” and “Non-smoker”, we can generate the dummy variable “Smoker”, which takes 1 if the person smokes and 0 otherwise. We can also generate the variable “Non-smoker”, which takes 1 if the person does not smoke and 0 otherwise.
The following table shows a possible one hot encoded representation of the variable “is smoker”:
is smoker | smoker | non-smoker
---|---|---
smoker | 1 | 0
non-smoker | 0 | 1
non-smoker | 0 | 1
smoker | 1 | 0
non-smoker | 0 | 1
For the categorical variable Country with values England, Argentina, and Germany, we can create three variables called England, Argentina, and Germany. These variables will take the value of 1 if the observation is England, Argentina, or Germany, respectively, and 0 otherwise.
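To make the idea concrete, here is a minimal sketch using plain pandas (the data is made up; Feature-engine's encoder is shown later in this page):

import pandas as pd

# toy data with the two categorical variables discussed above
df = pd.DataFrame({
    "is_smoker": ["smoker", "non-smoker", "non-smoker", "smoker", "non-smoker"],
    "country": ["England", "Argentina", "Germany", "England", "Argentina"],
})

# one-hot encode both variables into k dummies each (one column per category)
print(pd.get_dummies(df, columns=["is_smoker", "country"], dtype=int))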
Encoding into k vs k-1 variables#
A categorical feature with k unique categories can be encoded using k-1 binary variables.
For Smoker, k is 2 as it contains two labels (Smoker and Non-Smoker), so we only need one binary variable (k - 1 = 1) to capture all of the information. In the following table we see that the dummy variable Smoker fully represents the original categorical values:
is smoker | smoker
---|---
smoker | 1
non-smoker | 0
non-smoker | 0
smoker | 1
non-smoker | 0
For the Country variable, which has three categories (k=3; England, Argentina, and Germany), we need two (k - 1 = 2) binary variables to capture all the information. The variable will be fully represented like this:
Country | England | Argentina
---|---|---
England | 1 | 0
Argentina | 0 | 1
Germany | 0 | 0
As we see in the previous table, if the observation is England, it will show the value 1 in the England variable; if the observation is Argentina, it will show the value 1 in the Argentina variable; and if the observation is Germany, it will show zeroes in both dummy variables. Thus, by looking at the values of the k-1 dummies, we can infer the original categorical value of each observation.
Encoding into k-1 binary variables is well suited for linear regression models. Linear models evaluate all features during fit; thus, with k-1 dummies they have all the information about the original categorical variable.
There are a few occasions in which we may prefer to encode the categorical variables with k binary variables.
Encode into k dummy variables if training decision tree-based models or performing feature selection. Decision tree-based models and many feature selection algorithms evaluate variables or groups of variables separately. Thus, when encoding into k-1 dummies, the last category is not examined; in other words, we lose the information contained in that category. The following sketch contrasts both encodings.
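Here is a minimal pandas sketch (again with made-up data) contrasting k and k-1 dummies via the drop_first parameter of pd.get_dummies:

import pandas as pd

country = pd.Series(["England", "Argentina", "Germany", "England"], name="Country")

# k dummies: one binary column per category
print(pd.get_dummies(country, dtype=int))

# k-1 dummies: the first category (alphabetically, Argentina) is dropped;
# an observation with zeroes in all remaining columns belongs to the dropped category
print(pd.get_dummies(country, drop_first=True, dtype=int))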
Binary variables#
When a categorical variable has only 2 categories, like “Smoker” in our previous example, then encoding into k-1 suits all purposes, because the second dummy variable created by one hot encoding is completely redundant.
Encoding popular categories#
One hot encoding can increase the feature space dramatically, particularly if we have many categorical features, or the features have high cardinality. To control the feature space, it is common practice to encode only the most frequent categories in each categorical variable.
When we encode the most frequent categories, we will create binary variables for each of these frequent categories, and when the observation has a different, less popular category, it will have a 0 in all binary variables. See the following example:
var | popular1 | popular2
---|---|---
popular1 | 1 | 0
popular2 | 0 | 1
popular1 | 1 | 0
non-popular | 0 | 0
popular2 | 0 | 1
less popular | 0 | 0
unpopular | 0 | 0
lonely | 0 | 0
As we see in the previous table, less popular categories are represented as a group by showing zeroes in all binary variables.
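As an illustration of the mechanism only (a hand-rolled pandas sketch with made-up data, not Feature-engine's implementation), dummies for the 2 most frequent categories could be built like this:

import pandas as pd

var = pd.Series(
    ["popular1", "popular2", "popular1", "non-popular", "popular2", "less popular"],
    name="var",
)

# find the 2 most frequent categories
top = var.value_counts().head(2).index

# one dummy per frequent category; all other categories show zeroes everywhere
dummies = pd.DataFrame({f"var_{cat}": (var == cat).astype(int) for cat in top})
print(dummies)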
OneHotEncoder#
Feature-engine’s OneHotEncoder() encodes categorical data as a one-hot numeric dataframe.
OneHotEncoder() can encode into k or k-1 dummy variables. The behaviour is specified through the drop_last parameter, which can be set to False for k, or to True for k-1 dummy variables.
OneHotEncoder() can specifically encode binary variables into k-1 variables (that is, 1 dummy) while encoding categorical features of higher cardinality into k dummies. This behaviour is specified by setting the parameter drop_last_binary=True.
This will ensure that for every binary variable in the dataset, that is, for every
categorical variable with ONLY 2 categories, only 1 dummy is created. This is recommended,
unless you suspect that the variable could, in principle, take more than 2 values.
OneHotEncoder() can also create binary variables for the n most popular categories, where n is determined by the user. For example, if we encode only the 6 most popular categories by setting the parameter top_categories=6, the transformer will add binary variables only for the 6 most frequent categories. The most frequent categories are those with the greatest number of observations. The remaining categories will show zeroes in each of the derived dummies. This behaviour is useful to control the expansion of the feature space when the categorical variables have high cardinality.
Note
The parameter drop_last is ignored when encoding the most popular categories.
Python implementation#
Let’s look at an example of one hot encoding with Feature-engine’s OneHotEncoder(), using the Titanic dataset.
We’ll start by importing the libraries, functions and classes, loading the data into a pandas dataframe, and dividing it into a training and a testing set:
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.encoding import OneHotEncoder
X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
    predictors_only=True,
    cabin="letter_only",
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())
We see the first 5 rows of the training data below:
pclass sex age sibsp parch fare cabin embarked
501 2 female 13.000000 0 1 19.5000 M S
588 2 female 4.000000 1 1 23.0000 M S
402 2 female 30.000000 1 0 13.8583 M C
1193 3 male 29.881135 0 0 7.7250 M Q
686 3 female 22.000000 0 0 7.7250 M Q
Let’s explore the cardinality of 4 of the categorical features:
X_train[['sex', 'pclass', 'cabin', 'embarked']].nunique()
sex 2
pclass 3
cabin 9
embarked 4
dtype: int64
We see that the variable sex has 2 categories, pclass has 3 categories, the variable cabin has 9 categories, and the variable embarked has 4 categories.
Let’s now set up the OneHotEncoder to encode 2 of the categorical variables into k-1 dummy variables:
encoder = OneHotEncoder(
    variables=['cabin', 'embarked'],
    drop_last=True,
)
encoder.fit(X_train)
With fit() the encoder learns the categories of the variables, which are stored in the attribute encoder_dict_.
encoder.encoder_dict_
{'cabin': ['M', 'E', 'C', 'D', 'B', 'A', 'F', 'T'],
'embarked': ['S', 'C', 'Q']}
The encoder_dict_ contains the categories that will be represented by dummy variables for each categorical variable.
With transform(), we go ahead and encode the variables. Note that, by default, OneHotEncoder() drops the original categorical variables, which are now represented by the dummy variables.
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
print(train_t.head())
Below we see the dummy variables added to the dataset; the original categorical variables are no longer in the dataframe:
pclass sex age sibsp parch fare cabin_M cabin_E \
501 2 female 13.000000 0 1 19.5000 1 0
588 2 female 4.000000 1 1 23.0000 1 0
402 2 female 30.000000 1 0 13.8583 1 0
1193 3 male 29.881135 0 0 7.7250 1 0
686 3 female 22.000000 0 0 7.7250 1 0
cabin_C cabin_D cabin_B cabin_A cabin_F cabin_T embarked_S \
501 0 0 0 0 0 0 1
588 0 0 0 0 0 0 1
402 0 0 0 0 0 0 0
1193 0 0 0 0 0 0 0
686 0 0 0 0 0 0 0
embarked_C embarked_Q
501 0 0
588 0 0
402 1 0
1193 0 1
686 0 1
Finding categorical variables automatically#
Feature-engine’s OneHotEncoder() can automatically find and encode all categorical features in the pandas dataframe. Let’s show that with an example.
Let’s set up the OneHotEncoder to find and encode all categorical features:
encoder = OneHotEncoder(
    variables=None,
    drop_last=True,
)
encoder.fit(X_train)
With fit(), the encoder finds the categorical features and identifies their unique categories. We can find the categorical variables like this:
encoder.variables_
['sex', 'cabin', 'embarked']
And we can identify the unique categories for each variable like this:
encoder.encoder_dict_
{'sex': ['female'],
'cabin': ['M', 'E', 'C', 'D', 'B', 'A', 'F', 'T'],
'embarked': ['S', 'C', 'Q']}
We can now encode the categorical variables:
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
print(train_t.head())
And here we see the resulting dataframe:
pclass age sibsp parch fare sex_female cabin_M cabin_E \
501 2 13.000000 0 1 19.5000 1 1 0
588 2 4.000000 1 1 23.0000 1 1 0
402 2 30.000000 1 0 13.8583 1 1 0
1193 3 29.881135 0 0 7.7250 0 1 0
686 3 22.000000 0 0 7.7250 1 1 0
cabin_C cabin_D cabin_B cabin_A cabin_F cabin_T embarked_S \
501 0 0 0 0 0 0 1
588 0 0 0 0 0 0 1
402 0 0 0 0 0 0 0
1193 0 0 0 0 0 0 0
686 0 0 0 0 0 0 0
embarked_C embarked_Q
501 0 0
588 0 0
402 1 0
1193 0 1
686 0 1
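Because Feature-engine transformers follow the scikit-learn API, the encoder can also be placed inside a scikit-learn Pipeline. The following is a minimal sketch reusing the train and test sets from above; the choice of model and its parameters are illustrative, and scikit-learn is assumed to be installed:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from feature_engine.encoding import OneHotEncoder

# chain the encoder with a linear model; k-1 dummies suit linear models
pipe = Pipeline([
    ("ohe", OneHotEncoder(drop_last=True)),
    ("logit", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))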
Encoding variables of type numeric#
By default, Feature-engine’s OneHotEncoder() will only encode categorical features. If you attempt to encode a variable of numeric dtype, it will raise an error. To avoid this error, you can instruct the encoder to ignore the data type format as follows:
enc = OneHotEncoder(
    variables=['pclass'],
    drop_last=True,
    ignore_format=True,
)
enc.fit(X_train)
train_t = enc.transform(X_train)
test_t = enc.transform(X_test)
print(train_t.head())
Note that pclass contained numeric values instead of strings, and the transformer one-hot encoded it into 2 dummies (k-1, because drop_last=True):
sex age sibsp parch fare cabin embarked pclass_2 \
501 female 13.000000 0 1 19.5000 M S 1
588 female 4.000000 1 1 23.0000 M S 1
402 female 30.000000 1 0 13.8583 M C 1
1193 male 29.881135 0 0 7.7250 M Q 0
686 female 22.000000 0 0 7.7250 M Q 0
pclass_3
501 0
588 0
402 0
1193 1
686 1
Encoding binary variables into 1 dummy#
With Feature-engine’s OneHotEncoder() we can encode all categorical variables into k dummies and the binary variables into 1 dummy (k-1) by setting the encoder as follows:
ohe = OneHotEncoder(
    variables=['sex', 'cabin', 'embarked'],
    drop_last=False,
    drop_last_binary=True,
)
train_t = ohe.fit_transform(X_train)
test_t = ohe.transform(X_test)
print(train_t.head())
As we see in the following output, for the variable sex we have only 1 dummy, while all the remaining variables have k dummies:
pclass age sibsp parch fare sex_female cabin_M cabin_E \
501 2 13.000000 0 1 19.5000 1 1 0
588 2 4.000000 1 1 23.0000 1 1 0
402 2 30.000000 1 0 13.8583 1 1 0
1193 3 29.881135 0 0 7.7250 0 1 0
686 3 22.000000 0 0 7.7250 1 1 0
cabin_C cabin_D cabin_B cabin_A cabin_F cabin_T cabin_G \
501 0 0 0 0 0 0 0
588 0 0 0 0 0 0 0
402 0 0 0 0 0 0 0
1193 0 0 0 0 0 0 0
686 0 0 0 0 0 0 0
embarked_S embarked_C embarked_Q embarked_Missing
501 1 0 0 0
588 1 0 0 0
402 0 1 0 0
1193 0 0 1 0
686 0 0 1 0
Encoding frequent categories#
If the categorical variables have high cardinality, we may end up with very big datasets after one hot encoding. In addition, if some of these variables are fairly constant or fairly similar, we may end up with one hot encoded features that are highly correlated, if not identical. To avoid this, we can encode only the most frequent categories.
To encode the 2 most frequent categories of each categorical column, we set up the transformer as follows:
ohe = OneHotEncoder(
    top_categories=2,
    variables=['pclass', 'cabin', 'embarked'],
    ignore_format=True,
)
train_t = ohe.fit_transform(X_train)
test_t = ohe.transform(X_test)
print(train_t.head())
As we see in the resulting dataframe, we created only 2 dummies per variable:
sex age sibsp parch fare pclass_3 pclass_1 cabin_M \
501 female 13.000000 0 1 19.5000 0 0 1
588 female 4.000000 1 1 23.0000 0 0 1
402 female 30.000000 1 0 13.8583 0 0 1
1193 male 29.881135 0 0 7.7250 1 0 1
686 female 22.000000 0 0 7.7250 1 0 1
cabin_C embarked_S embarked_C
501 0 1 0
588 0 1 0
402 0 0 1
1193 0 0 0
686 0 0 0
Finally, if we want to obtain the names of the columns in the resulting dataframe, we can do the following:
ohe.get_feature_names_out()
We see the names of the columns below:
['sex',
'age',
'sibsp',
'parch',
'fare',
'pclass_3',
'pclass_1',
'cabin_M',
'cabin_C',
'embarked_S',
'embarked_C']
Considerations#
Encoding categorical variables into k dummies will handle unknown categories automatically: categories not seen during training will show zeroes in all dummies.
Encoding categorical features into k-1 dummies will cause unseen categories to be treated as the dropped category.
Encoding only the top categories will make unseen categories part of the group of less popular categories.
If you add a big number of dummy variables to your data, many may be identical or highly correlated. Consider dropping identical and correlated features with the transformers from the selection module (see the sketch at the end of this section).
For alternative encoding methods used in data science, check the OrdinalEncoder() and other encoders included in the encoding module.
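The following is a minimal sketch of the last consideration, chaining the encoder with DropDuplicateFeatures() and DropCorrelatedFeatures() from the selection module; the correlation threshold is illustrative:

from sklearn.pipeline import Pipeline

from feature_engine.encoding import OneHotEncoder
from feature_engine.selection import DropCorrelatedFeatures, DropDuplicateFeatures

# encode into dummies, then drop dummies that turned out identical or highly correlated
pipe = Pipeline([
    ("ohe", OneHotEncoder(drop_last=True)),
    ("drop_duplicates", DropDuplicateFeatures()),
    ("drop_correlated", DropCorrelatedFeatures(threshold=0.8)),
])

train_t = pipe.fit_transform(X_train)
test_t = pipe.transform(X_test)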
Tutorials, books and courses#
For more details about OneHotEncoder()’s functionality visit:
For tutorials about this and other data preprocessing methods check out our online course:
Or read our book:
Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.