DropFeatures#
DropFeatures() drops a list of variables, indicated by the user, from the original dataframe. The user can pass a single variable as a string or a list of variables to be dropped. DropFeatures() offers similar functionality to pandas.DataFrame.drop, the difference being that DropFeatures() can be integrated into a Scikit-learn pipeline.
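For instance, because DropFeatures() follows the Scikit-learn API, it can be placed as a step in a Pipeline, directly before an estimator. The following is a minimal sketch, not part of the worked example below; the choice of classifier is illustrative, and the remaining preprocessing steps a real pipeline would need are omitted:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from feature_engine.selection import DropFeatures

# illustrative pipeline: first drop unwanted columns, then fit a model
pipe = Pipeline([
    ("drop_features", DropFeatures(features_to_drop=["ticket", "name"])),
    ("classifier", LogisticRegression()),
])

# pipe.fit(X_train, y_train)  # DropFeatures removes the columns before the model sees the data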
When is this transformer useful?
Sometimes, we create new variables by combining other variables in the dataset. For example, we can obtain the variable age by subtracting date_of_birth from date_of_application. Once we have obtained the new variable, we no longer need the date variables in the dataset. Thus, we can add DropFeatures() to the Pipeline to have them removed.
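A minimal sketch of that workflow, using pandas to derive the new variable and DropFeatures() to remove the date columns afterwards; the dataframe and its values are hypothetical:
import pandas as pd
from feature_engine.selection import DropFeatures

# hypothetical dataframe with two date columns
df = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1980-05-01", "1992-11-23"]),
    "date_of_application": pd.to_datetime(["2020-01-15", "2021-06-30"]),
})

# derive the new variable: approximate age in years at application time
df["age"] = (df["date_of_application"] - df["date_of_birth"]).dt.days / 365.25

# the date columns are no longer needed, so drop them
df = DropFeatures(
    features_to_drop=["date_of_birth", "date_of_application"],
).fit_transform(df)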
Example
Let’s see how to use DropFeatures() in an example with the Titanic dataset. We first load the data and separate it into train and test sets:
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.selection import DropFeatures

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())
Now, we go ahead and print the dataset column names:
X_train.columns
Index(['pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare',
       'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')
Now, with DropFeatures() we can very easily drop a group of variables. Below, we set up the transformer to drop a list of 6 variables:
# set up the transformer
transformer = DropFeatures(
    features_to_drop=['sibsp', 'parch', 'ticket', 'fare', 'body', 'home.dest']
)

# fit the transformer
transformer.fit(X_train)
With fit(), this transformer does not learn any parameters. We can go ahead and remove the variables as follows:
train_t = transformer.transform(X_train)
test_t = transformer.transform(X_test)
If we now print the variable names of the transformed dataset, we see that the set of variables has been reduced:
train_t.columns
Index(['pclass', 'name', 'sex', 'age', 'cabin', 'embarked', 'boat'], dtype='object')
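As mentioned at the beginning, features_to_drop also accepts a single variable passed as a string. A minimal sketch, continuing with the transformed data from the previous step:
# drop one more column by passing its name as a string instead of a list
drop_boat = DropFeatures(features_to_drop="boat")
train_t = drop_boat.fit_transform(train_t)
test_t = drop_boat.transform(test_t)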
Additional resources#
In this Kaggle kernel we feature 3 different end-to-end machine learning pipelines that use DropFeatures().
All notebooks can be found in a dedicated repository.
For more details about this and other feature selection methods, check out these resources:

Feature Selection for Machine Learning
Or read our book:

Feature Selection in Machine Learning
Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.