find_all_variables

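find_all_variables() returns a list with the names of all the variables in a dataframe. Optionally, it can leave out the variables of type datetime.

Let’s create a toy dataset with numerical, categorical and datetime variables. Note that date3 holds dates stored as strings, not as datetime values: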

import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=4,
    n_redundant=1,
    n_clusters_per_class=1,
    weights=[0.50],
    class_sep=2,
    random_state=1,
)

# transform the feature array into a pandas dataframe
colnames = [f"num_var_{i+1}" for i in range(4)]
X = pd.DataFrame(X, columns=colnames)

X["cat_var1"] = ["Hello"] * 1000
X["cat_var2"] = ["Bye"] * 1000

X["date1"] = pd.date_range("2020-02-24", periods=1000, freq="min")
X["date2"] = pd.date_range("2021-09-29", periods=1000, freq="h")
X["date3"] = ["2020-02-24"] * 1000

print(X.head())

We see the resulting dataframe below:

   num_var_1  num_var_2  num_var_3  num_var_4 cat_var1 cat_var2  \
0  -1.558594   1.634123   1.556932   2.869318    Hello      Bye
1   1.499925   1.651008   1.159977   2.510196    Hello      Bye
2   0.277127  -0.263527   0.532159   0.274491    Hello      Bye
3  -1.139190  -1.131193   2.296540   1.189781    Hello      Bye
4  -0.530061  -2.280109   2.469580   0.365617    Hello      Bye

                date1               date2       date3
0 2020-02-24 00:00:00 2021-09-29 00:00:00  2020-02-24
1 2020-02-24 00:01:00 2021-09-29 01:00:00  2020-02-24
2 2020-02-24 00:02:00 2021-09-29 02:00:00  2020-02-24
3 2020-02-24 00:03:00 2021-09-29 03:00:00  2020-02-24
4 2020-02-24 00:04:00 2021-09-29 04:00:00  2020-02-24

We can now use find_all_variables() to capture all the variable names in a list and then display the list:

from feature_engine.variable_handling import find_all_variables

vars_all = find_all_variables(X)

vars_all

We see the variable names in the list below:

['num_var_1',
 'num_var_2',
 'num_var_3',
 'num_var_4',
 'cat_var1',
 'cat_var2',
 'date1',
 'date2',
 'date3']
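For intuition, the list above matches the column order of the dataframe. A minimal pandas-only sketch of that result is shown below. It uses a small stand-in dataframe (the name `df` and its values are assumptions for illustration), and it is not meant to describe how feature-engine implements the function:

```python
import pandas as pd

# Small stand-in for the toy dataframe: same kinds of columns, fewer rows.
df = pd.DataFrame(
    {
        "num_var_1": [0.1, 0.2, 0.3],
        "cat_var1": ["Hello"] * 3,
        "date1": pd.date_range("2020-02-24", periods=3, freq="D"),
        "date3": ["2020-02-24"] * 3,
    }
)

# Column names in their original order, as a plain Python list.
all_columns = df.columns.to_list()
print(all_columns)
```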

We have the option to exclude datetime variables as follows:

vars_all = find_all_variables(X, exclude_datetime=True)

vars_all

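In the list below, we can see that the variables of type datetime, date1 and date2, were excluded. date3 remains in the list because it stores the dates as strings and therefore does not have a datetime type: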

['num_var_1',
 'num_var_2',
 'num_var_3',
 'num_var_4',
 'cat_var1',
 'cat_var2',
 'date3']
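To see why a column like date3 is kept, it can help to inspect the column types directly. The sketch below uses only pandas on a small stand-in dataframe (`df` is an assumption for illustration). It selects non-datetime columns with `select_dtypes`, which is one plausible way to reproduce the result above, not necessarily what feature-engine does internally. It also shows that parsing the strings with `pd.to_datetime` changes the outcome:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "num_var_1": [0.1, 0.2, 0.3],
        "date1": pd.date_range("2020-02-24", periods=3, freq="D"),
        "date3": ["2020-02-24"] * 3,  # dates stored as strings
    }
)

# Check which columns actually have a datetime type.
is_datetime = {
    col: pd.api.types.is_datetime64_any_dtype(df[col]) for col in df.columns
}
print(is_datetime)

# Dropping datetime-typed columns keeps the string column date3.
kept = df.select_dtypes(exclude="datetime").columns.to_list()
print(kept)

# Once date3 is parsed into datetime values, it is dropped as well.
df["date3"] = pd.to_datetime(df["date3"])
kept_after_parsing = df.select_dtypes(exclude="datetime").columns.to_list()
print(kept_after_parsing)
```

If string-encoded dates should also be left out, converting them to datetime before selecting the variables is one option.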

If find_all_variables() does not find suitable variables, it will raise an error. To return an empty list instead, set return_empty to True.

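For example, this command raises an error, because both selected variables are of type datetime and are therefore excluded. We leave date3 out of the selection because, as a string variable, it would not be excluded:

find_all_variables(
    X[["date1", "date2"]],
    exclude_datetime=True,
)

However, this command returns an empty list:

find_all_variables(
    X[["date1", "date2"]],
    exclude_datetime=True,
    return_empty=True,
)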
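The difference between raising an error and returning an empty list can be illustrated with a small pandas-only helper. The function `non_datetime_columns` below is hypothetical and not part of feature-engine. The choice of `ValueError` is also an assumption, since the error type is not stated above:

```python
import pandas as pd


def non_datetime_columns(df, return_empty=False):
    """Hypothetical helper: list the columns that are not of type datetime."""
    columns = df.select_dtypes(exclude="datetime").columns.to_list()
    if not columns and not return_empty:
        # Assumed error type, chosen for illustration only.
        raise ValueError("No non-datetime variables found in the dataframe.")
    return columns


# A dataframe that contains only datetime-typed columns.
dates_only = pd.DataFrame(
    {
        "date1": pd.date_range("2020-02-24", periods=3, freq="D"),
        "date2": pd.date_range("2021-09-29", periods=3, freq="D"),
    }
)

# Default behaviour: an error signals that nothing was found.
try:
    non_datetime_columns(dates_only)
    raised = False
except ValueError:
    raised = True
print(raised)

# With return_empty=True the caller receives an empty list instead.
empty = non_datetime_columns(dates_only, return_empty=True)
print(empty)
```

Returning an empty list is convenient when the result feeds into code that already handles the "no variables" case, while the error is a safer default when an empty selection would indicate a mistake.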