DropConstantFeatures() drops constant and quasi-constant variables from a dataframe.
By default, it drops only constant variables. Constant variables have a single
value. Quasi-constant variables show the same value in most of their observations.
This transformer works with numerical and categorical variables, and it offers a straightforward way of reducing the feature space. Be mindful, though, that depending on the context, quasi-constant variables could still be useful.
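To make the distinction concrete, the sketch below computes, for each column of a small invented dataframe, the fraction of observations taken by its most frequent value. The toy data and the helper name `dominant_share` are assumptions made for illustration; this is not how feature-engine is implemented internally, only the idea behind it.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame(
    {
        "constant": ["a"] * 10,               # one single value
        "quasi_constant": ["a"] * 9 + ["b"],  # one value dominates
        "varied": list("abcdeabcde"),         # no dominant value
    }
)


def dominant_share(series: pd.Series) -> float:
    """Return the fraction of observations taken by the most frequent value.

    Hypothetical helper, not part of feature-engine.
    """
    return float(series.value_counts(normalize=True).max())


# Share of the most frequent value in each column.
shares = {col: dominant_share(df[col]) for col in df.columns}
print(shares)
```

Here the share is 1.0 for the constant column, 0.9 for the quasi-constant one, and 0.2 for the varied one. A threshold on this share is what separates quasi-constant variables from the rest.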
Let’s see how to use
DropConstantFeatures() in an example with the Titanic dataset. We
first load the data and separate it into train and test:
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.selection import DropConstantFeatures

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)
Now, we set up the
DropConstantFeatures() to remove features that show the same
value in more than 70% of the observations:
# set up the transformer
transformer = DropConstantFeatures(tol=0.7)
With fit(), the transformer finds the variables to drop:
# fit the transformer
transformer.fit(X_train)
The variables to drop are stored in the attribute features_to_drop_:
['parch', 'cabin', 'embarked', 'body']
We see in the following code snippets that for the variables parch and embarked, more than 70% of the observations displayed the same value:
X_train['embarked'].value_counts(normalize = True)
S          0.711790
C          0.195415
Q          0.090611
Missing    0.002183
Name: embarked, dtype: float64
71% of the passengers embarked in S.
X_train['parch'].value_counts(normalize = True)
0    0.771834
1    0.125546
2    0.086245
3    0.005459
4    0.004367
5    0.003275
6    0.002183
9    0.001092
Name: parch, dtype: float64
77% of the passengers travelled with no parents or children. Because one value exceeds the 70% threshold in each case, these features were deemed quasi-constant and marked for removal.
With transform(), we can go ahead and drop the variables from the data:
train_t = transformer.transform(X_train)
test_t = transformer.transform(X_test)

print(train_t.head())
We see the resulting dataframe below:
      pclass                               name     sex        age  sibsp  \
501        2  Mellinger, Miss. Madeleine Violet  female  13.000000      0
588        2                  Wells, Miss. Joan  female   4.000000      1
402        2     Duran y More, Miss. Florentina  female  30.000000      1
1193       3                 Scanlan, Mr. James    male  29.881135      0
686        3       Bradley, Miss. Bridget Delia  female  22.000000      0

             ticket     fare     boat  \
501          250644  19.5000       14
588           29103  23.0000       14
402   SC/PARIS 2148  13.8583       12
1193          36209   7.7250  Missing
686          334914   7.7250       13

                                              home.dest
501                            England / Bennington, VT
588                                Cornwall / Akron, OH
402                     Barcelona, Spain / Havana, Cuba
1193                                            Missing
686   Kingwilliamstown, Co Cork, Ireland Glens Falls...
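The fit-then-transform pattern used above can be mimicked in plain pandas to see why the columns to drop are learned from the train set only. The class `SimpleQuasiConstantDropper`, its `to_drop_` attribute, and the toy data are hypothetical and written for this illustration. The inclusive comparison (`>=`) against the threshold is also an assumption; this is a rough sketch of the idea, not feature-engine's implementation.

```python
import pandas as pd


class SimpleQuasiConstantDropper:
    """Hypothetical stand-in for illustration, not part of feature-engine."""

    def __init__(self, tol: float = 1.0):
        self.tol = tol

    def fit(self, X: pd.DataFrame) -> "SimpleQuasiConstantDropper":
        # Learn which columns have a dominant value covering at least
        # `tol` of the rows (inclusive comparison is an assumption).
        # value_counts() ignores missing values by default.
        self.to_drop_ = [
            col
            for col in X.columns
            if X[col].value_counts(normalize=True).max() >= self.tol
        ]
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Drop the columns learned during fit, whatever X looks like.
        return X.drop(columns=self.to_drop_)


# Toy data, invented for illustration only.
train = pd.DataFrame({"flag": [0, 0, 0, 0, 1], "size": [1, 2, 3, 4, 5]})
test = pd.DataFrame({"flag": [0, 1, 1], "size": [7, 8, 9]})

dropper = SimpleQuasiConstantDropper(tol=0.7).fit(train)
train_t = dropper.transform(train)
test_t = dropper.transform(test)

print(dropper.to_drop_)
print(list(test_t.columns))
```

In the toy train set, `flag` takes the value 0 in 80% of the rows, so it is dropped. It is also dropped from the toy test set, even though no value reaches 70% there, because the decision was made during fit. This keeps train and test with the same columns.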
In this Kaggle kernel we use
DropConstantFeatures() together with other feature selection algorithms.
All notebooks can be found in a dedicated repository.