DecisionTreeFeatures
- class feature_engine.creation.DecisionTreeFeatures(variables=None, features_to_combine=None, precision=None, cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=0, missing_values='raise', drop_original=False)
DecisionTreeFeatures() adds new variables to the data that result from the output of decision trees trained on one or more features.
Features that result from the predictions of decision trees are likely monotonic with the target and could therefore improve the performance of linear models. Features that result from decision trees trained on several features can help capture feature interactions that simpler models could otherwise miss.
DecisionTreeFeatures() works only with numerical variables. You can pass a list of variables to use as inputs of the decision trees. Alternatively, the transformer will automatically select and combine all numerical variables.
Missing data should be imputed before using this transformer.
More details in the User Guide.
- Parameters
- variables: list, default=None
The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.
- features_to_combine: integer, list or tuple, default=None
Used to determine how the variables indicated in variables will be combined to obtain the new features by using decision trees.
If integer, then the value corresponds to the largest size of combinations allowed between features. For example, if you want to combine three variables, ["var_A", "var_B", "var_C"], and:
- features_to_combine = 1, the transformer returns new features based on the predictions of a decision tree trained on each individual variable, generating 3 new features.
- features_to_combine = 2, the transformer returns the features from features_to_combine=1, plus features based on the predictions of decision trees trained on all possible combinations of 2 variables, i.e., ("var_A", "var_B"), ("var_A", "var_C"), and ("var_B", "var_C"), resulting in a total of 6 new features.
- features_to_combine = 3, the transformer returns the features from features_to_combine=2, plus one additional feature based on the predictions of a decision tree trained on the 3 variables, ["var_A", "var_B", "var_C"], resulting in a total of 7 new features.
If list, the list must contain integers indicating the number of features that should be used as input of a decision tree. For example, if the data has 4 variables, ["var_A", "var_B", "var_C", "var_D"], and features_to_combine = [2, 3], then all possible combinations of 2 and 3 variables will be returned. That results in the following combinations: ("var_A", "var_B"), ("var_A", "var_C"), ("var_A", "var_D"), ("var_B", "var_C"), ("var_B", "var_D"), ("var_C", "var_D"), ("var_A", "var_B", "var_C"), ("var_A", "var_B", "var_D"), ("var_A", "var_C", "var_D"), and ("var_B", "var_C", "var_D").
If tuple, the tuple must contain strings and/or tuples that indicate how to combine the variables to create the new features. For example, if features_to_combine=("var_C", ("var_A", "var_C"), "var_C", ("var_B", "var_D")), then the transformer will train a decision tree based on each value within the tuple, resulting in 4 new features.
If None, then the transformer will create all possible combinations of 1 or more features, and use those as inputs to decision trees. This is equivalent to passing an integer that is equal to the number of variables to combine.
- precision: int, default=None
The precision of the predictions. In other words, the number of decimal places kept in the new feature values.
- cv: int, cross-validation generator or an iterable, default=3
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use cross_validate's default 5-fold cross validation.
- int, to specify the number of folds in a (Stratified)KFold.
- An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls. For more details check Scikit-learn's cross_validate documentation.
- scoring: str, default='neg_mean_squared_error'
Desired metric to optimise the performance of the tree. Comes from sklearn.metrics. See the DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html
- param_grid: dictionary, default=None
The hyperparameters for the decision tree to test with a grid search. The param_grid can contain any of the permitted hyperparameters for Scikit-learn's DecisionTreeRegressor() or DecisionTreeClassifier(). If None, then param_grid will optimise the 'max_depth' over [1, 2, 3, 4].
- regression: boolean, default=True
Indicates whether the transformer should train a regression or a classification decision tree.
- random_state: int, default=0
The random_state to initialise the training of the decision tree. It is one of the parameters of Scikit-learn's DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.
- missing_values: string, default=’raise’
Indicates if missing values should be ignored or raised. If 'raise', the transformer will return an error if the datasets passed to fit or transform contain missing values. If 'ignore', missing data will be ignored when learning parameters or performing the transformation.
- drop_original: bool, default=False
If True, the original variables to transform will be dropped from the dataframe.
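The combination rules described for features_to_combine can be sketched with the standard library alone. The helper below, expand_combinations, is a hypothetical name used only for illustration; it is not part of feature_engine, and it only enumerates the variable groups, it does not train any tree.

```python
from itertools import combinations


def expand_combinations(variables, features_to_combine=None):
    """Hypothetical helper: list the variable groups, one per decision tree."""
    if isinstance(features_to_combine, tuple):
        # Tuple: each entry is a single variable name or a tuple of names.
        return [
            [item] if isinstance(item, str) else list(item)
            for item in features_to_combine
        ]
    if features_to_combine is None:
        # None: every combination size from 1 up to the number of variables.
        sizes = range(1, len(variables) + 1)
    elif isinstance(features_to_combine, int):
        # Integer: every combination size from 1 up to that value.
        sizes = range(1, features_to_combine + 1)
    else:
        # List: only the combination sizes listed.
        sizes = features_to_combine
    return [list(c) for size in sizes for c in combinations(variables, size)]


three = ["var_A", "var_B", "var_C"]
four = ["var_A", "var_B", "var_C", "var_D"]

groups_int = expand_combinations(three, 2)        # singles plus pairs
groups_list = expand_combinations(four, [2, 3])   # pairs and triplets only
groups_tuple = expand_combinations(four, ("var_A", ("var_B", "var_D")))
```

With three variables, an integer of 2 gives 3 single-variable groups plus 3 pairs, which matches the 6 new features described above.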
- Attributes
- variables_:
The group of variables that will be transformed.
- input_features_: list
List containing all the feature combinations that are used to create new features.
- estimators_: list
The decision trees trained on the feature combinations.
- feature_names_in_:
List with the names of features seen during fit.
- n_features_in_:
The number of features in the train set used in fit.
References
- 1
Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
Examples
>>> import pandas as pd
>>> from feature_engine.creation import DecisionTreeFeatures
>>> X = dict()
>>> X["Name"] = ["tom", "nick", "krish", "megan", "peter",
...              "jordan", "fred", "sam", "alexa", "brittany"]
>>> X["Age"] = [20, 44, 19, 33, 51, 40, 41, 37, 30, 54]
>>> X["Height"] = [164, 150, 178, 158, 188, 190, 168, 174, 176, 171]
>>> X["Marks"] = [1.0, 0.8, 0.6, 0.1, 0.3, 0.4, 0.8, 0.6, 0.5, 0.2]
>>> X = pd.DataFrame(X)
>>> y = pd.Series([4.1, 5.8, 3.9, 6.2, 4.3, 4.5, 7.2, 4.4, 4.1, 6.7])
>>> dtf = DecisionTreeFeatures(features_to_combine=2)
>>> dtf.fit(X, y)
>>> dtf.transform(X)
       Name  Age  Height  ...  tree(['Age', 'Height'])  tree(['Age', 'Marks'])
0       tom   20     164  ...                    4.100                     4.2
1      nick   44     150  ...                    6.475                     5.6
2     krish   19     178  ...                    4.000                     4.2
3     megan   33     158  ...                    6.475                     6.2
4     peter   51     188  ...                    4.400                     5.6
5    jordan   40     190  ...                    4.400                     4.2
6      fred   41     168  ...                    6.475                     7.2
7       sam   37     174  ...                    4.400                     4.2
8     alexa   30     176  ...                    4.000                     4.2
9  brittany   54     171  ...                    6.475                     5.6

   tree(['Height', 'Marks'])
0                       6.00
1                       6.00
2                       4.24
3                       6.00
4                       4.24
5                       4.24
6                       6.00
7                       4.24
8                       4.24
9                       6.00
Methods
fit:
Trains the decision trees.
fit_transform:
Fit to data, then transform it.
get_feature_names_out:
Get output feature names for transformation.
get_params:
Get parameters for this estimator.
set_params:
Set the parameters of this estimator.
transform:
Create new features.
- fit(X, y)
Fits decision trees based on the input variable combinations with cross-validation and grid-search for hyperparameters.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The training input samples. Can be the entire dataframe, not just the variables to transform.
- y: pandas Series or np.array of shape = [n_samples,]
The target variable that is used to train the decision tree.
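The fitting step described here (one tree per variable combination, tuned with cross-validation and a grid search) can be approximated with plain Scikit-learn. This is a minimal sketch for a single combination of two variables on synthetic data, assuming the default settings listed in the parameters above (cv=3, scoring='neg_mean_squared_error', max_depth searched over [1, 2, 3, 4]). It is not the library's actual implementation.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): the target depends on
# the interaction between two numerical variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=60)

# Grid search over the tree depth with 3-fold cross-validation.
search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

# The predictions of the tuned tree play the role of the new feature.
new_feature = search.predict(X)
```

Because a tree of depth 4 has at most 16 leaves, the resulting feature takes at most 16 distinct values, which is why tree-derived features look like step functions of their inputs.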
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns
- X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)
Get output feature names for transformation. In other words, returns the variable names of the transformed dataframe.
- Parameters
- input_features: array or list, default=None
This parameter exists only for compatibility with the Scikit-learn pipeline.
If None, then feature_names_in_ is used as feature names in.
If an array or list, then input_features must match feature_names_in_.
- Returns
- feature_names_out: list
Transformed feature names.
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns
- routing: MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
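The <component>__<parameter> convention can be illustrated with a small Scikit-learn Pipeline. The step names "scaler" and "tree" are arbitrary labels chosen for this sketch, and the pipeline uses generic Scikit-learn estimators rather than DecisionTreeFeatures itself.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# A nested object: each step is addressed by its name.
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("tree", DecisionTreeRegressor(max_depth=2)),
    ]
)

# Update parameters of individual steps with <component>__<parameter>.
pipe.set_params(tree__max_depth=3, scaler__with_mean=False)

# get_params(deep=True) exposes the nested parameters under the same keys.
params = pipe.get_params(deep=True)
```

The same keys can be used in the param_grid of a grid search that wraps the pipeline, so each step can be tuned without taking the pipeline apart.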