DropConstantFeatures#
The DropConstantFeatures()
drops constant and quasi-constant variables from a dataframe.
By default, it drops only constant variables. Constant variables have a single
value. Quasi-constant variables have a single value in most of its observations.
This transformer works with numerical and categorical variables, and it offers a pretty straightforward way of reducing the feature space. Be mindful though, that depending on the context, quasi-constant variables could be useful.
Example
Let’s see how to use DropConstantFeatures()
in an example with the Titanic dataset. We
first load the data and separate it into train and test:
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.selection import DropConstantFeatures
X, y = load_titanic(
return_X_y_frame=True,
handle_missing=True,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0,
)
Now, we set up the DropConstantFeatures()
to remove features that show the same
value in more than 70% of the observations:
# set up the transformer
transformer = DropConstantFeatures(tol=0.7)
With fit()
the transformer finds the variables to drop:
# fit the transformer
transformer.fit(X_train)
The variables to drop are stored in the attribute features_to_drop_
:
transformer.features_to_drop_
['parch', 'cabin', 'embarked', 'body']
We see in the following code snippets that for the variables parch and embarked, more than 70% of the observations displayed the same value:
X_train['embarked'].value_counts(normalize = True)
S 0.711790
C 0.195415
Q 0.090611
Missing 0.002183
Name: embarked, dtype: float64
71% of the passengers embarked in S.
X_train['parch'].value_counts(normalize = True)
0 0.771834
1 0.125546
2 0.086245
3 0.005459
4 0.004367
5 0.003275
6 0.002183
9 0.001092
Name: parch, dtype: float64
77% of the passengers had 0 parent or child. Because of this, these features were deemed constant and removed.
With transform()
, we can go ahead and drop the variables from the data:
train_t = transformer.transform(X_train)
test_t = transformer.transform(X_test)
print(train_t.head())
We see the resulting dataframe below:
pclass name sex age sibsp \
501 2 Mellinger, Miss. Madeleine Violet female 13.000000 0
588 2 Wells, Miss. Joan female 4.000000 1
402 2 Duran y More, Miss. Florentina female 30.000000 1
1193 3 Scanlan, Mr. James male 29.881135 0
686 3 Bradley, Miss. Bridget Delia female 22.000000 0
ticket fare boat \
501 250644 19.5000 14
588 29103 23.0000 14
402 SC/PARIS 2148 13.8583 12
1193 36209 7.7250 Missing
686 334914 7.7250 13
home.dest
501 England / Bennington, VT
588 Cornwall / Akron, OH
402 Barcelona, Spain / Havana, Cuba
1193 Missing
686 Kingwilliamstown, Co Cork, Ireland Glens Falls...
More details#
In this Kaggle kernel we use DropConstantFeatures()
together with other feature selection algorithms:
All notebooks can be found in a dedicated repository.