DropDuplicateFeatures
API Reference
- class feature_engine.selection.DropDuplicateFeatures(variables=None, missing_values='ignore')
DropDuplicateFeatures() finds and removes duplicated features in a dataframe.
Two features are duplicated when they show the same value for every observation, regardless of their variable or column names.
During fit, the transformer identifies and stores the duplicated variables. During transform, it drops these variables from the dataframe.
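To make the definition concrete, here is a minimal pandas-only sketch of how columns with identical values could be grouped. The helper `find_duplicated_columns` is hypothetical and written for illustration; it is not the library's implementation, which may use a different comparison strategy.

```python
import pandas as pd


def find_duplicated_columns(df):
    """Group columns whose values are identical in every row (illustrative helper)."""
    groups = []        # each entry is a list of column names holding identical values
    assigned = set()   # columns already placed in a group
    columns = list(df.columns)
    for i, first in enumerate(columns):
        if first in assigned:
            continue
        group = [first]
        for other in columns[i + 1:]:
            # Series.equals compares values and dtype element by element and
            # treats NaNs in the same position as equal; column names are ignored.
            if other not in assigned and df[first].equals(df[other]):
                group.append(other)
                assigned.add(other)
        if len(group) > 1:
            groups.append(group)
    return groups


df = pd.DataFrame({
    "age": [22, 35, 58],
    "years": [22, 35, 58],        # same values as "age", different name
    "fare": [7.25, 71.28, 8.05],
})
print(find_duplicated_columns(df))  # [['age', 'years']]
```

In this sketch the first column of each group is the one that would be kept, and the remaining members are the candidates for removal.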
- Parameters
- variables: list, default=None
The list of variables to evaluate. If None, the transformer will evaluate all variables in the dataset.
- missing_values: str, default='ignore'
Takes the values 'raise' or 'ignore'. Determines whether the transformer raises an error when it encounters missing values or ignores them when finding duplicated features.
Attributes
features_to_drop_:
Set with the duplicated features that will be dropped.
duplicated_feature_sets_:
Groups of duplicated features. Each list is a group of duplicated features.
variables_:
The variables to consider for the feature selection.
n_features_in_:
The number of features in the train set used in fit.
Methods
fit:
Find duplicated features.
transform:
Remove duplicated features.
fit_transform:
Fit to data. Then transform it.
Example
DropDuplicateFeatures() finds and removes duplicated variables from a dataframe. The user can pass a list of variables to examine; otherwise, the selector examines all variables in the dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.selection import DropDuplicateFeatures
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # keep only the first character of the cabin (the deck letter)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data = data[['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'cabin', 'embarked']]
    # append copies of three columns to create duplicated features
    data = pd.concat([data, data[['sex', 'age', 'sibsp']]], axis=1)
    data.columns = ['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'cabin', 'embarked',
                    'sex_dup', 'age_dup', 'sibsp_dup']
    return data
# load data as pandas dataframe
data = load_titanic()
data.head()
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1),
    data['survived'], test_size=0.3, random_state=0)
# set up the transformer
transformer = DropDuplicateFeatures()
# fit the transformer
transformer.fit(X_train)
# transform the data
train_t = transformer.transform(X_train)
train_t.columns
Index(['pclass', 'sex', 'age', 'sibsp', 'parch', 'cabin', 'embarked'], dtype='object')
transformer.features_to_drop_
{'age_dup', 'sex_dup', 'sibsp_dup'}
transformer.duplicated_feature_sets_
[{'sex', 'sex_dup'}, {'age', 'age_dup'}, {'sibsp', 'sibsp_dup'}]