DecisionTreeFeatures#

The winners of the KDD 2009 competition observed that many features had high mutual information with the target but low correlation, which led them to conclude that the relationships were non-linear. While non-linear relationships can be captured by non-linear models, to leverage the information from these features with linear models, we first need to transform them so that they show a linear, or at least monotonic, relationship with the target.

The output of a decision tree, that is, its predictions, should be monotonic with the target, provided the tree fits the data well.

In addition, decision trees trained on 2 or more features could capture feature interactions that simpler models would miss.

By enriching the dataset with features resulting from the predictions of decision trees, we can create better performing models. On the downside, features derived from decision trees are not easy to interpret or explain.

DecisionTreeFeatures() creates and adds features resulting from the predictions of decision trees trained on 1 or more features.

Values of the tree-based features#

If we create features for regression, DecisionTreeFeatures() will train scikit-learn’s DecisionTreeRegressor under the hood, and the features are derived from the predict method of these regressors. Hence, the features will be on the scale of the target. Remember, however, that the output of a decision tree regressor is not continuous; it is a piecewise constant function.
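To make the idea concrete, here is a minimal sketch with plain scikit-learn (this is only an illustration of the concept, not the transformer’s internal code):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)

# a shallow regression tree trained on a single feature
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X[["MedInc"]], y)

# the new feature is the tree's prediction: it is on the scale of the
# target and takes at most 4 distinct values (one per leaf)
new_feature = tree.predict(X[["MedInc"]])
print(np.unique(new_feature))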

If we create features for classification, DecisionTreeFeatures() will train scikit-learn’s DecisionTreeClassifier under the hood. If the target is binary, the resulting features are the output of the predict_proba method of the model, corresponding to the predictions of class 1. If the target is multiclass, on the other hand, the features are derived from the predict method and hence contain the predicted class.
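Again as a rough sketch of the concept (not the transformer’s internal code), for a binary target the new feature would hold the probability of class 1:

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# a shallow classification tree trained on a single feature
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X[["mean radius"]], y)

# the new feature holds the predicted probability of class 1
new_feature = clf.predict_proba(X[["mean radius"]])[:, 1]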

Examples#

In the rest of the document, we’ll show the versatility of DecisionTreeFeatures() for creating multiple features using decision trees.

Let’s start by loading and displaying the California housing dataset from scikit-learn:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from feature_engine.creation import DecisionTreeFeatures

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X.drop(labels=["Latitude", "Longitude"], axis=1, inplace=True)
print(X.head())

In the following output we see the dataframe:

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467

Let’s split the dataset into a training and a testing set:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

Combining features - integers#

We’ll set up DecisionTreeFeatures() to create features from all possible combinations of up to 2 variables. To create all possible combinations, we pass an integer to the features_to_combine parameter:

dtf = DecisionTreeFeatures(features_to_combine=2)
dtf.fit(X_train, y_train)

If we leave the parameter variables set to None, DecisionTreeFeatures() will combine all the numerical variables in the training set in the way indicated by features_to_combine. Since we set features_to_combine=2, the transformer will create all possible combinations of 1 or 2 variables.
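As a rough illustration of the groupings this produces (a sketch only, not the transformer’s internal code; the result mirrors the input_features_ attribute shown next):

from itertools import combinations

variables = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
             "Population", "AveOccup"]

# every group of 1 or 2 variables, mirroring features_to_combine=2
combos = [list(c) if len(c) > 1 else c[0]
          for size in (1, 2)
          for c in combinations(variables, size)]
print(combos)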

We can find the feature combinations that will be used to train the trees as follows:

dtf.input_features_

In the following output, we see the combinations of 1 and 2 features that will be used to train decision trees, based on all the numerical variables in the training set:

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 ['MedInc', 'HouseAge'],
 ['MedInc', 'AveRooms'],
 ['MedInc', 'AveBedrms'],
 ['MedInc', 'Population'],
 ['MedInc', 'AveOccup'],
 ['HouseAge', 'AveRooms'],
 ['HouseAge', 'AveBedrms'],
 ['HouseAge', 'Population'],
 ['HouseAge', 'AveOccup'],
 ['AveRooms', 'AveBedrms'],
 ['AveRooms', 'Population'],
 ['AveRooms', 'AveOccup'],
 ['AveBedrms', 'Population'],
 ['AveBedrms', 'AveOccup'],
 ['Population', 'AveOccup']]

Let’s now add the new features to the data:

train_t = dtf.transform(X_train)
test_t = dtf.transform(X_test)

print(test_t.head())

In the following output we see the resulting data with the features derived from decision trees:

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  \
14740  4.1518      22.0  5.663073   1.075472      1551.0  4.180593
10101  5.7796      32.0  6.107226   0.927739      1296.0  3.020979
20566  4.3487      29.0  5.930712   1.026217      1554.0  2.910112
2670   2.4511      37.0  4.992958   1.316901       390.0  2.746479
15709  5.0049      25.0  4.319261   1.039578       649.0  1.712401

       tree(MedInc)  tree(HouseAge)  tree(AveRooms)  tree(AveBedrms)  ...  \
14740      2.204822        2.130618        2.001950         2.080254  ...
10101      2.975513        2.051980        2.001950         2.165554  ...
20566      2.204822        2.051980        2.001950         2.165554  ...
2670       1.416771        2.051980        1.802158         1.882763  ...
15709      2.420124        2.130618        1.802158         2.165554  ...

       tree(['HouseAge', 'AveRooms'])  tree(['HouseAge', 'AveBedrms'])  \
14740                        1.885406                         2.124812
10101                        1.885406                         2.124812
20566                        1.885406                         2.124812
2670                         1.797902                         1.836498
15709                        1.797902                         2.124812

       tree(['HouseAge', 'Population'])  tree(['HouseAge', 'AveOccup'])  \
14740                          2.004703                        1.437440
10101                          2.004703                        2.257968
20566                          2.004703                        2.257968
2670                           2.123579                        2.257968
15709                          2.123579                        2.603372

       tree(['AveRooms', 'AveBedrms'])  tree(['AveRooms', 'Population'])  \
14740                         2.099977                          1.878989
10101                         2.438937                          2.077321
20566                         2.099977                          1.878989
2670                          1.728401                          1.843904
15709                         1.821467                          1.843904

       tree(['AveRooms', 'AveOccup'])  tree(['AveBedrms', 'Population'])  \
14740                        1.719582                           2.056003
10101                        2.156884                           2.056003
20566                        2.156884                           2.056003
2670                         1.747990                           1.882763
15709                        2.783690                           2.221092

       tree(['AveBedrms', 'AveOccup'])  tree(['Population', 'AveOccup'])
14740                         1.400491                          1.484939
10101                         2.153210                          2.059187
20566                         2.153210                          2.059187
2670                          1.861020                          2.235743
15709                         2.727460                          2.747390

[5 rows x 27 columns]

Combining features - lists#

Let’s say that we want to create features based on trees trained on 2 or more variables. Instead of passing an integer to features_to_combine, we pass a list of integers, telling DecisionTreeFeatures() to create all possible combinations of the sizes mentioned in the list.

We’ll set up the transformer to create features based on all possible combinations of 2 and 3 features, using just 3 of the numerical variables this time:

dtf = DecisionTreeFeatures(
    variables=["AveRooms", "AveBedrms", "Population"],
    features_to_combine=[2,3])

dtf.fit(X_train, y_train)

If we now examine the feature combinations:

dtf.input_features_

We see that they are based on combinations of 2 or 3 of the variables that we set in the variables parameter:

[['AveRooms', 'AveBedrms'],
 ['AveRooms', 'Population'],
 ['AveBedrms', 'Population'],
 ['AveRooms', 'AveBedrms', 'Population']]

We can now add the features to the data and inspect the result:

train_t = dtf.transform(X_train)
test_t = dtf.transform(X_test)

print(test_t.head())

In the following output we see the dataframe with the new features:

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  \
14740  4.1518      22.0  5.663073   1.075472      1551.0  4.180593
10101  5.7796      32.0  6.107226   0.927739      1296.0  3.020979
20566  4.3487      29.0  5.930712   1.026217      1554.0  2.910112
2670   2.4511      37.0  4.992958   1.316901       390.0  2.746479
15709  5.0049      25.0  4.319261   1.039578       649.0  1.712401

       tree(['AveRooms', 'AveBedrms'])  tree(['AveRooms', 'Population'])  \
14740                         2.099977                          1.878989
10101                         2.438937                          2.077321
20566                         2.099977                          1.878989
2670                          1.728401                          1.843904
15709                         1.821467                          1.843904

       tree(['AveBedrms', 'Population'])  \
14740                           2.056003
10101                           2.056003
20566                           2.056003
2670                            1.882763
15709                           2.221092

       tree(['AveRooms', 'AveBedrms', 'Population'])
14740                                       2.099977
10101                                       2.438937
20566                                       2.099977
2670                                        1.843904
15709                                       1.843904

Specifying the feature combinations - tuples#

We can indicate precisely which features we want to use as inputs to the decision trees. Let’s make a tuple containing the feature combinations. We want a tree trained on Population, a tree trained on Population and AveOccup, and a tree trained on those 2 variables plus HouseAge:

# note that ('Population') is simply the string 'Population', indicating
# a tree trained on a single feature
features = (('Population'), ('Population', 'AveOccup'),
            ('Population', 'AveOccup', 'HouseAge'))

Now, we pass this tuple to DecisionTreeFeatures(). Note that we can leave the parameter variables at its default, because when tuples are passed, that parameter is ignored:

dtf = DecisionTreeFeatures(
    variables=None,
    features_to_combine=features,
    cv=5,
    scoring="neg_mean_squared_error"
)

dtf.fit(X_train, y_train)

If we inspect the input features, they will coincide with the tuple we passed to features_to_combine:

dtf.input_features_

We see that the input features are those from the tuple:

['Population',
 ['Population', 'AveOccup'],
 ['Population', 'AveOccup', 'HouseAge']]

And now we can go ahead and add the features to the data:

train_t = dtf.transform(X_train)
test_t = dtf.transform(X_test)

Examining the new features#

DecisionTreeFeatures() adds the word tree to the names of the new features, so if we want to display only the new features, we can do so as follows:

tree_features = [var for var in test_t.columns if "tree" in var]
print(test_t[tree_features].head())

       tree(Population)  tree(['Population', 'AveOccup'])  \
14740          2.008283                          1.484939
10101          2.008283                          2.059187
20566          2.008283                          2.059187
2670           2.128961                          2.235743
15709          2.128961                          2.747390

       tree(['Population', 'AveOccup', 'HouseAge'])
14740                                      1.443097
10101                                      2.257968
20566                                      2.257968
2670                                       2.257968
15709                                      3.111251

Evaluating individual trees#

We can evaluate the performance of each of the trees used to create the features, if we so wish. Let’s set up DecisionTreeFeatures():

dtf = DecisionTreeFeatures(features_to_combine=2)
dtf.fit(X_train, y_train)

DecisionTreeFeatures() trains each tree with cross-validation. If we do not pass a grid with hyperparameters, it will optimize the depth by default. We can find the trained estimators like this:

dtf.estimators_

Because the estimators are trained with sklearn’s GridSearchCV, what is stored is the result of the search:

[GridSearchCV(cv=3, estimator=DecisionTreeRegressor(random_state=0),
              param_grid={'max_depth': [1, 2, 3, 4]},
              scoring='neg_mean_squared_error'),
 GridSearchCV(cv=3, estimator=DecisionTreeRegressor(random_state=0),
              param_grid={'max_depth': [1, 2, 3, 4]},
              scoring='neg_mean_squared_error'),
 ...

 GridSearchCV(cv=3, estimator=DecisionTreeRegressor(random_state=0),
              param_grid={'max_depth': [1, 2, 3, 4]},
              scoring='neg_mean_squared_error'),
 GridSearchCV(cv=3, estimator=DecisionTreeRegressor(random_state=0),
              param_grid={'max_depth': [1, 2, 3, 4]},
              scoring='neg_mean_squared_error')]

If you want to inspect an individual tree and its performance, you can do so like this:

tree = dtf.estimators_[4]
tree.best_params_

In the following output, we see the best parameters found for the tree trained on the feature Population to predict house price:

{'max_depth': 2}

If we want to check the performance of the best tree found during the grid search, we can do so like this:

tree.score(X_test[['Population']], y_test)

The following value corresponds to the negative of the mean squared error, which is the metric optimized during the search (you can select the metric to optimize through the scoring parameter of DecisionTreeFeatures()).

-1.3308515769033213

Note that you can also isolate the tree, and then obtain a performance metric:

tree.best_estimator_.score(X_test[['Population']], y_test)

In this case, the following performance metric corresponds to the R2 score, which is the default metric returned by the score method of scikit-learn’s DecisionTreeRegressor.

0.0017890442253447603
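Based on the GridSearchCV objects shown above, we could also approximate one of these cross-validated trees manually, using the training data from earlier in this example; the following is only a sketch of what happens under the hood, not the transformer’s exact internals:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# approximate the search stored in dtf.estimators_[4]: a regression tree
# trained on Population, with its depth selected by cross-validation,
# optimising the negative mean squared error
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_train[["Population"]], y_train)

print(search.best_params_)
print(search.score(X_test[["Population"]], y_test))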

Dropping the original variables#

With DecisionTreeFeatures(), we can automatically remove from the resulting dataframe the features used as inputs to the decision trees. We just need to set drop_original to True:

dtf = DecisionTreeFeatures(
    variables=["AveRooms", "AveBedrms", "Population"],
    features_to_combine=[2,3],
    drop_original=True
)

dtf.fit(X_train, y_train)

train_t = dtf.transform(X_train)
test_t = dtf.transform(X_test)

print(test_t.head())

We see in the resulting dataframe that the variables [“AveRooms”, “AveBedrms”, “Population”] are not there:

       MedInc  HouseAge  AveOccup  tree(['AveRooms', 'AveBedrms'])  \
14740  4.1518      22.0  4.180593                         2.099977
10101  5.7796      32.0  3.020979                         2.438937
20566  4.3487      29.0  2.910112                         2.099977
2670   2.4511      37.0  2.746479                         1.728401
15709  5.0049      25.0  1.712401                         1.821467

       tree(['AveRooms', 'Population'])  tree(['AveBedrms', 'Population'])  \
14740                          1.878989                           2.056003
10101                          2.077321                           2.056003
20566                          1.878989                           2.056003
2670                           1.843904                           1.882763
15709                          1.843904                           2.221092

       tree(['AveRooms', 'AveBedrms', 'Population'])
14740                                       2.099977
10101                                       2.438937
20566                                       2.099977
2670                                        1.843904
15709                                       1.843904

Creating features for classification#

If we are creating features for a classifier instead of a regressor, the procedure is identical. We just need to set the parameter regression to False.

Note that if you are creating features for binary classification, the added features will contain the probabilities of class 1. If you are creating features for multi-class classification, on the other hand, the features will contain the predicted class.
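For illustration, here is a minimal sketch using the breast cancer dataset from scikit-learn; the dataset, variable selection and scoring metric are just example choices:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from feature_engine.creation import DecisionTreeFeatures

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# regression=False tells the transformer to train DecisionTreeClassifier
# under the hood; with a binary target, the new features contain the
# probability of class 1
dtf = DecisionTreeFeatures(
    variables=["mean radius", "mean texture"],
    features_to_combine=2,
    regression=False,
    scoring="roc_auc",
)

dtf.fit(X_train, y_train)
test_t = dtf.transform(X_test)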

Additional resources#

For more details about this and other feature engineering methods, check out these resources:

Feature Engineering for Machine Learning (online course)

Or read our book:

Python Feature Engineering Cookbook
Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.