DropConstantFeatures#
The DropConstantFeatures()
drops constant and quasi-constant variables from a dataframe.
By default, it drops only constant variables. Constant variables have a single
value. Quasi-constant variables have a single value in most of its observations.
This transformer works with numerical and categorical variables, and it offers a pretty straightforward way of reducing the feature space. Be mindful though, that depending on the context, quasi-constant variables could be useful.
Example
Let’s see how to use DropConstantFeatures()
in an example with the Titanic dataset. We
first load the data and separate it into train and test:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.selection import DropConstantFeatures
# Load dataset
def load_titanic():
data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
data = data.replace('?', np.nan)
data['cabin'] = data['cabin'].astype(str).str[0]
data['pclass'] = data['pclass'].astype('O')
data['embarked'].fillna('C', inplace=True)
return data
# load data as pandas dataframe
data = load_titanic()
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
data.drop(['survived', 'name', 'ticket'], axis=1),
data['survived'], test_size=0.3, random_state=0)
Now, we set up the DropConstantFeatures()
to remove features that show the same
value in more than 70% of the observations:
# set up the transformer
transformer = DropConstantFeatures(tol=0.7, missing_values='ignore')
With fit()
the transformer finds the variables to drop:
# fit the transformer
transformer.fit(X_train)
The variables to drop are stored in the attribute features_to_drop_
:
transformer.features_to_drop_
['parch', 'cabin', 'embarked']
We see in the following code snippets that for the variables parch and embarked, more than 70% of the observations displayed the same value:
X_train['embarked'].value_counts() / len(X_train)
S 0.711790
C 0.197598
Q 0.090611
Name: embarked, dtype: float64
71% of the passengers embarked in S.
X_train['parch'].value_counts() / len(X_train)
0 0.771834
1 0.125546
2 0.086245
3 0.005459
4 0.004367
5 0.003275
6 0.002183
9 0.001092
Name: parch, dtype: float64
77% of the passengers had 0 parent or child. Because of this, these features were deemed constant and removed.
With transform()
, we can go ahead and drop the variables from the data:
# transform the data
train_t = transformer.transform(X_train)
More details#
In this Kaggle kernel we use DropConstantFeatures()
together with other feature selection algorithms:
All notebooks can be found in a dedicated repository.