DropDuplicateFeatures#
The DropDuplicateFeatures()
finds and removes duplicated variables from a dataframe.
Duplicated features are identical features, regardless of the variable or column name. If
they show the same values for every observation, then they are considered duplicated.
The transformer will automatically evaluate all variables, or alternatively, you can pass a list with the variables you wish to have examined. And it works with numerical and categorical features.
Example
Let’s see how to use DropDuplicateFeatures()
in an example with the Titanic dataset.
These dataset does not have duplicated features, so we will add a few manually:
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.selection import DropDuplicateFeatures
data = load_titanic(
handle_missing=True,
predictors_only=True,
)
# Lets duplicate some columns
data = pd.concat([data, data[['sex', 'age', 'sibsp']]], axis=1)
data.columns = ['pclass', 'survived', 'sex', 'age',
'sibsp', 'parch', 'fare','cabin', 'embarked',
'sex_dup', 'age_dup', 'sibsp_dup']
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
data.drop(['survived'], axis=1),
data['survived'],
test_size=0.3,
random_state=0,
)
print(X_train.head())
Below we see the resulting data:
pclass sex age sibsp parch fare cabin embarked \
501 2 female 13.000000 0 1 19.5000 Missing S
588 2 female 4.000000 1 1 23.0000 Missing S
402 2 female 30.000000 1 0 13.8583 Missing C
1193 3 male 29.881135 0 0 7.7250 Missing Q
686 3 female 22.000000 0 0 7.7250 Missing Q
sex_dup age_dup sibsp_dup
501 female 13.000000 0
588 female 4.000000 1
402 female 30.000000 1
1193 male 29.881135 0
686 female 22.000000 0
Now, we set up DropDuplicateFeatures()
to find the duplications:
transformer = DropDuplicateFeatures()
With fit()
the transformer finds the duplicated features:
transformer.fit(X_train)
The features that are duplicated and will be removed are stored by the transformer:
transformer.features_to_drop_
{'age_dup', 'sex_dup', 'sibsp_dup'}
With transform()
we remove the duplicated variables:
train_t = transformer.transform(X_train)
test_t = transformer.transform(X_test)
If we examine the variable names of the transformed dataset, we see that the duplicated features are not present:
train_t.columns
Index(['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'cabin', 'embarked'], dtype='object')
And the transformer also stores the groups of duplicated features, which could be useful if we have groups where more than 2 features are identical.
transformer.duplicated_feature_sets_
[{'sex', 'sex_dup'}, {'age', 'age_dup'}, {'sibsp', 'sibsp_dup'}]
More details#
In this Kaggle kernel we use DropDuplicateFeatures()
together with other feature selection algorithms:
All notebooks can be found in a dedicated repository.