DropDuplicateFeatures

API Reference

class feature_engine.selection.DropDuplicateFeatures(variables=None, missing_values='ignore')

DropDuplicateFeatures() finds and removes duplicated features in a dataframe.

Two features are considered duplicated if they show the same values for every observation, regardless of their variable or column names.

The transformer will first identify and store the duplicated variables. Next, the transformer will drop these variables from a dataframe.
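The two steps described above (identify and store, then drop) can be sketched with plain pandas. This is only an illustration of the idea, not the library's implementation: the helper `find_duplicated_columns` and the toy dataframe are hypothetical, and the pairwise comparison with `Series.equals` is an assumption chosen for readability.

```python
import pandas as pd


def find_duplicated_columns(df):
    """Return (groups, to_drop) for columns that hold identical values."""
    groups = []      # each entry is a set of column names with identical values
    to_drop = set()  # every column except the first one seen in each group
    columns = list(df.columns)
    for i, first in enumerate(columns):
        if first in to_drop:
            continue  # already assigned to an earlier group
        group = {first}
        for other in columns[i + 1:]:
            # Series.equals compares the values, not the column names
            if other not in to_drop and df[first].equals(df[other]):
                group.add(other)
                to_drop.add(other)
        if len(group) > 1:
            groups.append(group)
    return groups, to_drop


# Hypothetical toy data: "age_copy" repeats "age" under a different name
df = pd.DataFrame({
    "age": [22, 35, 58],
    "fare": [7.25, 53.1, 8.05],
    "age_copy": [22, 35, 58],
})

groups, to_drop = find_duplicated_columns(df)  # step 1: identify and store
reduced = df.drop(columns=list(to_drop))       # step 2: drop the duplicates
print(groups, to_drop, list(reduced.columns))
```

The first column of each group is kept and the later ones are dropped, which mirrors the `features_to_drop_` and `duplicated_feature_sets_` output shown in the example at the end of this page.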

Parameters
variables: list, default=None

The list of variables to evaluate. If None, the transformer will evaluate all variables in the dataset.

missing_values: str, default='ignore'

Takes the values 'raise' and 'ignore'. Whether to raise an error if the variables contain missing values, or to ignore the missing values when finding duplicated features.

Attributes

features_to_drop_:

Set with the duplicated features that will be dropped.

duplicated_feature_sets_:

Groups of duplicated features. Each set in the list is a group of duplicated features.

variables_:

The variables to consider for the feature selection.

n_features_in_:

The number of features in the train set used in fit.

Methods

fit:

Find duplicated features.

transform:

Remove duplicated features.

fit_transform:

Fit to data. Then transform it.

fit(X, y=None)

Find duplicated features.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input dataframe.

y: None

y is not needed for this transformer. You can pass y or None.

Returns
self

transform(X)

Return dataframe with selected features.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input dataframe.

Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]

Pandas dataframe with the selected features.


Example

DropDuplicateFeatures() finds and removes duplicated variables from a dataframe. The user can pass a list of variables to examine; otherwise, the selector examines all variables in the data set.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.selection import DropDuplicateFeatures

def load_titanic():
    # load the Titanic data set and mark '?' entries as missing
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # keep only the first character of the cabin value
    data['cabin'] = data['cabin'].astype(str).str[0]
    data = data[['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'cabin', 'embarked']]
    # append copies of three columns to create duplicated features
    data = pd.concat([data, data[['sex', 'age', 'sibsp']]], axis=1)
    data.columns = ['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'cabin', 'embarked',
                    'sex_dup', 'age_dup', 'sibsp_dup']
    return data

# load data as pandas dataframe
data = load_titanic()
data.head()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0,
)

# set up the transformer
transformer = DropDuplicateFeatures()

# fit the transformer
transformer.fit(X_train)

# transform the train and test sets
train_t = transformer.transform(X_train)
test_t = transformer.transform(X_test)

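# the duplicated features are no longer in the transformed data
train_t.columns

Index(['pclass', 'sex', 'age', 'sibsp', 'parch', 'cabin', 'embarked'], dtype='object')

# the features that were dropped
transformer.features_to_drop_

{'age_dup', 'sex_dup', 'sibsp_dup'}

# the groups of duplicated features that were found
transformer.duplicated_feature_sets_

[{'sex', 'sex_dup'}, {'age', 'age_dup'}, {'sibsp', 'sibsp_dup'}]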