DropConstantFeatures() drops constant and quasi-constant variables from a dataframe.
By default, it drops only constant variables. Constant variables have a single
value. Quasi-constant variables have a single value in most of its observations.
This transformer works with numerical and categorical variables, and it offers a pretty straightforward way of reducing the feature space. Be mindful though, that depending on the context, quasi-constant variables could be useful.
Let’s see how to use
DropConstantFeatures() in an example with the Titanic dataset. We
first load the data and separate it into train and test:
import numpy as np import pandas as pd from sklearn.model_selection import train_test_split from feature_engine.selection import DropConstantFeatures # Load dataset def load_titanic(): data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl') data = data.replace('?', np.nan) data['cabin'] = data['cabin'].astype(str).str data['pclass'] = data['pclass'].astype('O') data['embarked'].fillna('C', inplace=True) return data # load data as pandas dataframe data = load_titanic() # Separate into train and test sets X_train, X_test, y_train, y_test = train_test_split( data.drop(['survived', 'name', 'ticket'], axis=1), data['survived'], test_size=0.3, random_state=0)
Now, we set up the
DropConstantFeatures() to remove features that show the same
value in more than 70% of the observations:
# set up the transformer transformer = DropConstantFeatures(tol=0.7, missing_values='ignore')
fit() the transformer finds the variables to drop:
# fit the transformer transformer.fit(X_train)
The variables to drop are stored in the attribute
['parch', 'cabin', 'embarked']
We see in the following code snippets that for the variables parch and embarked, more than 70% of the observations displayed the same value:
X_train['embarked'].value_counts() / len(X_train)
S 0.711790 C 0.197598 Q 0.090611 Name: embarked, dtype: float64
71% of the passengers embarked in S.
X_train['parch'].value_counts() / len(X_train)
0 0.771834 1 0.125546 2 0.086245 3 0.005459 4 0.004367 5 0.003275 6 0.002183 9 0.001092 Name: parch, dtype: float64
77% of the passengers had 0 parent or child. Because of this, these features were deemed constant and removed.
transform(), we can go ahead and drop the variables from the data:
# transform the data train_t = transformer.transform(X_train)