DecisionTreeFeatures

class feature_engine.creation.DecisionTreeFeatures(variables=None, features_to_combine=None, precision=None, cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=0, missing_values='raise', drop_original=False)

DecisionTreeFeatures() adds new variables to the data that result from the output of decision trees trained on one or more features.

Features that result from the predictions of decision trees are likely monotonic with the target and therefore could improve the performance of linear models. Features that result from decision trees trained on various features can help capture feature interactions that could otherwise be missed by simpler models.

DecisionTreeFeatures() works only with numerical variables. You can pass a list of variables to use as inputs of the decision trees. Alternatively, the transformer will automatically select and combine all numerical variables.

Missing data should be imputed before using this transformer.

More details in the User Guide.

Parameters

variables: list, default=None

The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.

features_to_combine: integer, list or tuple, default=None

Used to determine how the variables indicated in variables will be combined to obtain the new features by using decision trees. If integer, then the value corresponds to the largest size of combinations allowed between features. For example, if you want to combine three variables, [“var_A”, “var_B”, “var_C”], and:

features_to_combine = 1, the transformer returns new features based on the predictions of a decision tree trained on each individual variable, generating 3 new features.

features_to_combine = 2, the transformer returns the features from features_to_combine=1, plus features based on the predictions of decision trees trained on all possible combinations of 2 variables, i.e., (“var_A”, “var_B”), (“var_A”, “var_C”), and (“var_B”, “var_C”), resulting in a total of 6 new features.

features_to_combine = 3, the transformer returns the features from features_to_combine=2, plus one additional feature based on the predictions of a decision tree trained on the 3 variables, [“var_A”, “var_B”, “var_C”], resulting in a total of 7 new features.

If list, the list must contain integers indicating the number of features that should be used as input of a decision tree. For example, if the data has 4 variables, [“var_A”, “var_B”, “var_C”, “var_D”], and features_to_combine = [2,3], then all possible combinations of 2 and 3 variables will be returned. That results in the following combinations: (“var_A”, “var_B”), (“var_A”, “var_C”), (“var_A”, “var_D”), (“var_B”, “var_C”), (“var_B”, “var_D”), (“var_C”, “var_D”), (“var_A”, “var_B”, “var_C”), (“var_A”, “var_B”, “var_D”), (“var_A”, “var_C”, “var_D”), and (“var_B”, “var_C”, “var_D”).

If tuple, the tuple must contain strings and/or tuples that indicate how to combine the variables to create the new features. For example, if features_to_combine=("var_C", ("var_A", "var_C"), "var_C", ("var_B", "var_D")), then the transformer will train a decision tree based on each value within the tuple, resulting in 4 new features.

If None, then the transformer will create all possible combinations of 1 or more features, and use those as inputs to decision trees. This is equivalent to passing an integer that is equal to the number of variables to combine.
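The counting rules above can be checked with a short standard-library sketch. The helper expand_feature_combinations is hypothetical and written only for illustration; it is not the transformer's internal code. It lists the variable groups that would each feed one decision tree, under the rules described for integer, list, tuple and None inputs.

```python
from itertools import combinations


def expand_feature_combinations(variables, features_to_combine=None):
    """Hypothetical helper: return one variable group per decision tree."""
    if isinstance(features_to_combine, tuple):
        # Each element is used as given: a string is a single variable,
        # a nested tuple is a group of variables.
        return [
            (item,) if isinstance(item, str) else tuple(item)
            for item in features_to_combine
        ]
    if features_to_combine is None:
        # All combinations of 1 up to len(variables) variables.
        sizes = range(1, len(variables) + 1)
    elif isinstance(features_to_combine, int):
        # All combinations of 1 up to the given size.
        sizes = range(1, features_to_combine + 1)
    else:
        # A list of integers: only the listed combination sizes.
        sizes = features_to_combine
    return [combo for size in sizes for combo in combinations(variables, size)]


three = ["var_A", "var_B", "var_C"]
four = three + ["var_D"]

print(len(expand_feature_combinations(three, 1)))       # 3
print(len(expand_feature_combinations(three, 2)))       # 6
print(len(expand_feature_combinations(three, 3)))       # 7
print(len(expand_feature_combinations(three, None)))    # 7
print(len(expand_feature_combinations(four, [2, 3])))   # 10
print(len(expand_feature_combinations(
    four, ("var_C", ("var_A", "var_C"), "var_C", ("var_B", "var_D"))
)))                                                     # 4
```

The number of groups grows quickly with the number of variables when features_to_combine is None (2^n - 1 groups for n variables), so restricting the combination size is usually worthwhile on wide datasets.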

precision: int, default=None

The precision of the predictions. In other words, the number of decimal places to which the new feature values are rounded.

cv: int, cross-validation generator or an iterable, default=3

Determines the cross-validation splitting strategy. Possible inputs for cv are:

None, to use cross_validate’s default 5-fold cross validation

int, to specify the number of folds in a (Stratified)KFold

CV splitter (https://scikit-learn.org/stable/glossary.html#term-CV-splitter)

An iterable yielding (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls. For more details check Scikit-learn’s cross_validate’s documentation.
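To make the "iterable yielding (train, test) splits" option concrete, here is a minimal standard-library sketch of unshuffled, contiguous folds. The generator name contiguous_folds is hypothetical, and the fold sizes follow the usual convention in which the first n_samples % n_splits folds receive one extra sample. It uses plain lists for readability; in practice the indices would normally be passed as arrays.

```python
def contiguous_folds(n_samples, n_splits=3):
    """Hypothetical helper: yield (train, test) index lists without shuffling."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


splits = list(contiguous_folds(10, n_splits=3))
for train, test in splits:
    print(test)
# [0, 1, 2, 3]
# [4, 5, 6]
# [7, 8, 9]
```

Because nothing is shuffled, repeated calls produce identical splits, which is the reproducibility property described above.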

scoring: str, default='neg_mean_squared_error'

Desired metric to optimise the performance of the tree. Comes from sklearn.metrics. See the DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html

param_grid: dictionary, default=None

The hyperparameters for the decision tree to test with a grid search. The param_grid can contain any of the permitted hyperparameters for Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). If None, then param_grid will optimise the ‘max_depth’ over [1, 2, 3, 4].
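As a rough illustration of what a grid implies, the sketch below expands a param_grid dictionary into its candidate settings and counts the cross-validation fits per tree, assuming an exhaustive grid search that evaluates every candidate on every fold (a final refit, if any, is not counted). The custom grid is a hypothetical example; max_depth and min_samples_leaf are hyperparameters of Scikit-learn's decision trees.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every hyperparameter combination in the grid as a dict."""
    names = list(param_grid)
    return [
        dict(zip(names, values))
        for values in product(*(param_grid[name] for name in names))
    ]


default_grid = {"max_depth": [1, 2, 3, 4]}  # used when param_grid=None
custom_grid = {"max_depth": [1, 2, 3, 4], "min_samples_leaf": [1, 5]}  # hypothetical

cv_folds = 3  # the default value of cv

default_candidates = expand_grid(default_grid)
custom_candidates = expand_grid(custom_grid)

print(len(default_candidates) * cv_folds)  # 12 fits per tree
print(len(custom_candidates) * cv_folds)   # 24 fits per tree
```

This cost is incurred once per feature combination, so with 3 variables and features_to_combine=2 (6 trees) the default settings would imply 72 cross-validation fits under the same assumption.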

regression: boolean, default=True

Indicates whether the transformer should train a regression or a classification decision tree.

random_state: int, default=0

The random_state to initialise the training of the decision tree. It is one of the parameters of the Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.

missing_values: string, default='raise'

Indicates if missing values should be ignored or raised. If 'raise', the transformer will raise an error if the datasets to fit or transform contain missing values. If 'ignore', missing data will be ignored when learning parameters or performing the transformation.

drop_original: bool, default=False

If True, the original variables to transform will be dropped from the dataframe.

Attributes

variables_: The group of variables that will be transformed.
input_features_: List: List containing all the feature combinations that are used to create new features.
estimators_: List: The decision trees trained on the feature combinations.
feature_names_in_: List with the names of features seen during fit.
n_features_in_: The number of features in the train set used in fit.

See also

feature_engine.discretisation.DecisionTreeDiscretiser
feature_engine.encoding.DecisionTreeEncoder
sklearn.tree.DecisionTreeClassifier
sklearn.tree.DecisionTreeRegressor

References

1: Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf

Examples

>>> import pandas as pd
>>> from feature_engine.creation import DecisionTreeFeatures
>>> X = dict()
>>> X["Name"] = ["tom", "nick", "krish", "megan", "peter",
...              "jordan", "fred", "sam", "alexa", "brittany"]
>>> X["Age"] = [20, 44, 19, 33, 51, 40, 41, 37, 30, 54]
>>> X["Height"] = [164, 150, 178, 158, 188, 190, 168, 174, 176, 171]
>>> X["Marks"] = [1.0, 0.8, 0.6, 0.1, 0.3, 0.4, 0.8, 0.6, 0.5, 0.2]
>>> X = pd.DataFrame(X)
>>> y = pd.Series([4.1, 5.8, 3.9, 6.2, 4.3, 4.5, 7.2, 4.4, 4.1, 6.7])
>>> dtf = DecisionTreeFeatures(features_to_combine=2)
>>> dtf.fit(X, y)
>>> dtf.transform(X)
           Name  Age  Height  ...  tree(['Age', 'Height'])  tree(['Age', 'Marks'])
0       tom   20     164  ...                    4.100                     4.2
1      nick   44     150  ...                    6.475                     5.6
2     krish   19     178  ...                    4.000                     4.2
3     megan   33     158  ...                    6.475                     6.2
4     peter   51     188  ...                    4.400                     5.6
5    jordan   40     190  ...                    4.400                     4.2
6      fred   41     168  ...                    6.475                     7.2
7       sam   37     174  ...                    4.400                     4.2
8     alexa   30     176  ...                    4.000                     4.2
9  brittany   54     171  ...                    6.475                     5.6
   tree(['Height', 'Marks'])
0             6.00
1             6.00
2             4.24
3             6.00
4             4.24
5             4.24
6             6.00
7             4.24
8             4.24
9             6.00

Methods

fit:	Trains the decision trees.
fit_transform:	Fit to data, then transform it.
get_feature_names_out:	Get output feature names for transformation.
get_params:	Get parameters for this estimator.
set_params:	Set the parameters of this estimator.
transform:	Create new features.

fit(X, y)

Fits decision trees based on the input variable combinations with cross-validation and grid-search for hyperparameters.

Parameters

X: pandas dataframe of shape = [n_samples, n_features]: The training input samples. Can be the entire dataframe, not just the variables to transform.
y: pandas Series or np.array of shape = [n_samples,]: The target variable that is used to train the decision tree.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X: array-like of shape (n_samples, n_features): Input samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None: Target values (None for unsupervised transformations).
**fit_params: dict: Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns

X_new: ndarray of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)

Get output feature names for transformation. In other words, returns the variable names of the transformed dataframe.

Parameters

input_features: array or list, default=None

This parameter exists only for compatibility with the Scikit-learn pipeline.

If None, then feature_names_in_ is used as feature names in.
If an array or list, then input_features must match feature_names_in_.

Returns

feature_names_out: list: Transformed feature names.

rtype: List[Union[str, int]]

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns

routing: MetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.

transform(X)

Create and add new variables.

Parameters

X: Pandas DataFrame of shape = [n_samples, n_features]: The data to be transformed.

Returns

X_new: Pandas dataframe: Either the original dataframe plus the new features or a dataframe of only the new features.

rtype: DataFrame
