retain_variables_if_in_df() returns the subset of variables in a list that is present in the dataset.

Let’s create a toy dataset with numerical, categorical and datetime variables:

import pandas as pd
df = pd.DataFrame({
    "Name": ["tom", "nick", "krish", "jack"],
    "City": ["London", "Manchester", "Liverpool", "Bristol"],
    "Age": [20, 21, 19, 18],
    "Marks": [0.9, 0.8, 0.7, 0.6],
    "dob": pd.date_range("2020-02-24", periods=4, freq="T"),


We see the resulting dataframe below:

    Name        City  Age  Marks                 dob
0    tom      London   20    0.9 2020-02-24 00:00:00
1   nick  Manchester   21    0.8 2020-02-24 00:01:00
2  krish   Liverpool   19    0.7 2020-02-24 00:02:00
3   jack     Bristol   18    0.6 2020-02-24 00:03:00

With retain_variables_if_in_df() we capture in a list, the names of the variables that are present in the dataset. So let’s do that and then display the resulting list:

from feature_engine.variable_handling import retain_variables_if_in_df

vars_in_df = retain_variables_if_in_df(df, variables = ["Name", "City", "Dogs"])


We see the names of the subset of variables that are in the dataframe below:

['Name', 'City']

If none of variables in the list are in the dataset, retain_variables_if_in_df() will raise an error.


This function was originally developed for internal use.

When we run various feature selection transformers one after the other, for example, DropConstantFeatures, then DropDuplicateFeatures, and finally RecursiveFeatureElimination, we can’t anticipate which variables will be dropped by each transformer. Hence, these transformers use retain_variables_if_in_df() under the hood, to select those variables that were entered by the user and that still remain in the dataset, before applying the selection algorithm.

We’ve now decided to expose this function as part of the variable_handling module. It might be useful, for example, if you are creating Feature-engine compatible selection transformers.