DropDuplicateFeatures

Duplicate features are columns in a dataset that are identical, or, in other words, that contain exactly the same values. Duplicate features can be introduced accidentally, either through poor data management processes or during data manipulation.

For example, duplicated features can be created by one-hot encoding a categorical variable or by adding missing data indicators. We can also accidentally generate duplicate features when we merge different data sources that show some variable overlap.
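As a small illustration of the missing-indicator case, the sketch below uses a made-up toy dataframe (not the Titanic data) in which two variables are missing in exactly the same rows. Under that assumption, their binary missing indicators end up identical:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'age' and 'fare' are missing in the same rows.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, np.nan, 53.1, np.nan],
})

# Add one binary missing-data indicator per variable.
indicators = df.isna().astype(int).add_suffix("_na")
df_ind = pd.concat([df, indicators], axis=1)

# Both indicators hold the same values in every row,
# so one of them is a duplicate feature.
same = bool((df_ind["age_na"] == df_ind["fare_na"]).all())
print(same)  # True
```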

Checking for and removing duplicate features is a standard step in any data analysis workflow: it quickly reduces the dimensionality of the dataset and helps ensure data quality. In Python, pandas makes it easy to find duplicated values in a dataframe. Dropping duplicated features, however, requires a few more lines of code.
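To give an idea of what those few lines might look like, here is one common pandas idiom, shown on a made-up toy dataframe. It transposes the data so that columns become rows and then flags repeated rows. This is only a sketch of a manual approach; transposing can be slow and memory-hungry on large datasets.

```python
import pandas as pd

# Toy data (hypothetical) with two duplicated columns.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "a_copy": [1, 2, 3],
    "b_copy": ["x", "y", "z"],
})

# Transpose so each column becomes a row, then flag rows
# that repeat an earlier row (the first occurrence is kept).
is_duplicate = df.T.duplicated(keep="first")

# Names of the columns that duplicate an earlier column.
duplicated_columns = is_duplicate[is_duplicate].index.tolist()

# Keep only the columns that were not flagged.
df_clean = df.loc[:, ~is_duplicate.to_numpy()]

print(duplicated_columns)         # ['a_copy', 'b_copy']
print(df_clean.columns.tolist())  # ['a', 'b']
```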

Feature-engine aims to accelerate the process of data validation by finding and removing duplicate features with the DropDuplicateFeatures() class, which is part of the selection API.

DropDuplicateFeatures() finds and removes duplicated variables from a dataframe. By default, it evaluates all variables; alternatively, you can pass a list with the variables you wish to have examined. It works with numerical and categorical features alike.

So let’s see how to set up DropDuplicateFeatures().

Example

In this demo, we will use the Titanic dataset and introduce a few duplicated features manually:

import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.selection import DropDuplicateFeatures

data = load_titanic(
    handle_missing=True,
    predictors_only=True,
)

# Let's duplicate some columns
data = pd.concat([data, data[['sex', 'age', 'sibsp']]], axis=1)
data.columns = ['pclass', 'survived', 'sex', 'age',
                'sibsp', 'parch', 'fare','cabin', 'embarked',
                'sex_dup', 'age_dup', 'sibsp_dup']

We then split the data into a training and a testing set:

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0,
)

print(X_train.head())

Below we see the resulting data:

      pclass     sex        age  sibsp  parch     fare    cabin embarked  \
501        2  female  13.000000      0      1  19.5000  Missing        S
588        2  female   4.000000      1      1  23.0000  Missing        S
402        2  female  30.000000      1      0  13.8583  Missing        C
1193       3    male  29.881135      0      0   7.7250  Missing        Q
686        3  female  22.000000      0      0   7.7250  Missing        Q

     sex_dup    age_dup  sibsp_dup
501   female  13.000000          0
588   female   4.000000          1
402   female  30.000000          1
1193    male  29.881135          0
686   female  22.000000          0

As expected, the variables sex and sex_dup hold identical values in every row. The same is true for age and age_dup, and for sibsp and sibsp_dup.

Now, we set up DropDuplicateFeatures() to find the duplicate features:

transformer = DropDuplicateFeatures()

With fit() the transformer finds the duplicated features:

transformer.fit(X_train)

The features that are duplicated and will be removed are stored in the features_to_drop_ attribute:

transformer.features_to_drop_
{'age_dup', 'sex_dup', 'sibsp_dup'}

With transform() we remove the duplicated variables:

train_t = transformer.transform(X_train)
test_t = transformer.transform(X_test)

We can now check the variables in the transformed dataset and confirm that the duplicated features are gone:

train_t.columns
Index(['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'cabin', 'embarked'], dtype='object')

The transformer also stores the groups of duplicated features in the duplicated_feature_sets_ attribute, which is useful for data analysis and validation:

transformer.duplicated_feature_sets_
[{'sex', 'sex_dup'}, {'age', 'age_dup'}, {'sibsp', 'sibsp_dup'}]
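To make the idea of these groups concrete, the sketch below builds similar sets by hand with plain pandas on a made-up toy dataframe. It is not Feature-engine's implementation, only an illustration of the concept, and it assumes there are no missing values in the data:

```python
import pandas as pd

# Toy data (hypothetical), mimicking the duplicated Titanic columns.
df = pd.DataFrame({
    "sex": ["female", "male", "female"],
    "age": [13.0, 29.0, 22.0],
    "sex_dup": ["female", "male", "female"],
    "age_dup": [13.0, 29.0, 22.0],
})

# Group column names by their values: identical columns share a key.
groups = {}
for col in df.columns:
    key = tuple(df[col].tolist())
    groups.setdefault(key, []).append(col)

# Each group with more than one column is a set of duplicated features.
duplicated_sets = [set(cols) for cols in groups.values() if len(cols) > 1]

# Keep the first column of each group and mark the rest for removal.
features_to_drop = {col for cols in groups.values() for col in cols[1:]}

print(duplicated_sets)
print(features_to_drop)
```

Keeping the first column of each group mirrors the example above, where the original variables stay and the _dup copies are dropped.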

More details

In this Kaggle kernel we use DropDuplicateFeatures() in a pipeline with other feature selection algorithms:

For more details about this and other feature selection methods check out these resources: