DecisionTreeDiscretiser
- class feature_engine.discretisation.DecisionTreeDiscretiser(variables=None, cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None)
The DecisionTreeDiscretiser() replaces numerical variables with discrete, i.e., finite, variables, whose values are the predictions of a decision tree.
The method is inspired by the following article from the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf
The DecisionTreeDiscretiser() trains a decision tree per variable. Then, it transforms each variable by replacing its values with the predictions of the tree.
The DecisionTreeDiscretiser() works only with numerical variables. A list of variables to transform can be indicated. Alternatively, the discretiser will automatically select all numerical variables.
More details in the User Guide.
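Conceptually, the discretisation of each variable is similar to the sketch below, which fits a decision tree on a single column with a small grid search and then replaces the column's values with the tree's predictions. This is an illustrative approximation only, not feature_engine's internal code:

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.tree import DecisionTreeRegressor
>>> np.random.seed(0)
>>> X = pd.DataFrame({"x": np.random.randint(1, 100, 100)})
>>> y = pd.Series(np.random.randn(100))
>>> # grid search over the tree depth, as the default param_grid does
>>> tree = GridSearchCV(
...     DecisionTreeRegressor(random_state=0),
...     param_grid={"max_depth": [1, 2, 3, 4]},
...     cv=3,
...     scoring="neg_mean_squared_error",
... )
>>> tree.fit(X[["x"]], y)
>>> X["x"] = tree.predict(X[["x"]])  # a small, finite set of values

Because a shallow tree can only output a handful of distinct predictions, the transformed variable takes a small number of discrete values.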
- Parameters
- variables: list, default=None
The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.
- cv: int, cross-validation generator or an iterable, default=3
Determines the cross-validation splitting strategy. Possible inputs for cv are:
None, to use cross_validate’s default 5-fold cross validation
int, to specify the number of folds in a (Stratified)KFold,
An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False, so the splits will be the same across calls. For more details check Scikit-learn's cross_validate documentation.
- scoring: str, default='neg_mean_squared_error'
Desired metric to optimise the performance of the tree. Comes from sklearn.metrics. See the DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html
- param_grid: dictionary, default=None
The hyperparameters for the decision tree to test with a grid search. The param_grid can contain any of the permitted hyperparameters for Scikit-learn's DecisionTreeRegressor() or DecisionTreeClassifier(). If None, then param_grid will optimise the 'max_depth' over [1, 2, 3, 4].
- regression: boolean, default=True
Indicates whether the discretiser should train a regression or a classification decision tree.
- random_state: int, default=None
The random_state to initialise the training of the decision tree. It is one of the parameters of Scikit-learn's DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer. An example of adjusting these parameters at instantiation is shown after this list.
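For example, the cross-validation scheme, metric and search grid can all be adjusted when instantiating the transformer; the values below are purely illustrative:

>>> from sklearn.model_selection import KFold
>>> from feature_engine.discretisation import DecisionTreeDiscretiser
>>> dtd = DecisionTreeDiscretiser(
...     variables=["x"],
...     cv=KFold(n_splits=5, shuffle=True, random_state=0),
...     scoring="neg_mean_absolute_error",
...     param_grid={"max_depth": [2, 3], "min_samples_leaf": [5, 10]},
...     regression=True,
...     random_state=0,
... )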
- Attributes
- binner_dict_:
Dictionary containing the fitted tree per variable.
- scores_dict_:
Dictionary with the score of the best decision tree per variable.
- variables_:
The group of variables that will be transformed.
- feature_names_in_:
List with the names of features seen during fit.
- n_features_in_:
The number of features in the train set used in fit.
See also
sklearn.tree.DecisionTreeClassifier
sklearn.tree.DecisionTreeRegressor
References
1. Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009. http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from feature_engine.discretisation import DecisionTreeDiscretiser
>>> np.random.seed(42)
>>> X = pd.DataFrame(dict(x= np.random.randint(1,100, 100)))
>>> y_reg = pd.Series(np.random.randn(100))
>>> dtd = DecisionTreeDiscretiser(random_state=42)
>>> dtd.fit(X, y_reg)
>>> dtd.transform(X)["x"].value_counts()
-0.090091    90
 0.479454    10
Name: x, dtype: int64
You can also apply this for classification problems adjusting the scoring metric.
>>> y_clf = pd.Series(np.random.randint(0,2,100))
>>> dtd = DecisionTreeDiscretiser(regression=False, scoring="f1", random_state=42)
>>> dtd.fit(X, y_clf)
>>> dtd.transform(X)["x"].value_counts()
0.480769    52
0.687500    48
Name: x, dtype: int64
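The trees fitted for each variable and their cross-validated performance can then be inspected through the binner_dict_ and scores_dict_ attributes, for instance:

>>> sorted(dtd.binner_dict_.keys())
['x']
>>> scores = dtd.scores_dict_  # best cross-validated score per variable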
Methods
fit:
Fit a decision tree per variable.
fit_transform:
Fit to data, then transform it.
get_feature_names_out:
Get output feature names for transformation.
get_params:
Get parameters for this estimator.
set_params:
Set the parameters of this estimator.
transform:
Replace continuous variable values by the predictions of the decision tree.
- fit(X, y)
Fit one decision tree per variable to discretize with cross-validation and grid-search for hyperparameters.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The training dataset. Can be the entire dataframe, not just the variables to be transformed.
- y: pandas series.
Target variable. Required to train the decision tree.
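The dataframe passed to fit may therefore contain additional columns; only the variables selected for discretisation are used to train trees. A small sketch, reusing X and y_reg from the examples above:

>>> X_train = X.copy()
>>> X_train["z"] = np.random.randn(100)  # extra column, not discretised
>>> dtd = DecisionTreeDiscretiser(variables=["x"], random_state=42)
>>> dtd.fit(X_train, y_reg)
>>> dtd.variables_
['x']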
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters.
- Returns
- X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.
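In practice, fit_transform is simply a shortcut for calling fit followed by transform; continuing with the data from the examples above:

>>> dtd = DecisionTreeDiscretiser(random_state=42)
>>> X_new = dtd.fit_transform(X, y_reg)
>>> X_new.shape
(100, 1)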
- get_feature_names_out(input_features=None)
Get output feature names for transformation. In other words, returns the variable names of transformed dataframe.
- Parameters
- input_features: array or list, default=None
This parameter exists only for compatibility with the Scikit-learn pipeline.
If None, then feature_names_in_ is used as the feature names.
If an array or list, then input_features must match feature_names_in_.
- Returns
- feature_names_out: list
Transformed feature names.
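For instance, continuing with the transformer fitted above on the single-column dataframe:

>>> dtd.get_feature_names_out()
['x']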
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
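For example (standard scikit-learn estimator behaviour, shown here for convenience):

>>> dtd = DecisionTreeDiscretiser()
>>> dtd.set_params(regression=False, scoring="roc_auc")
>>> dtd.get_params()["scoring"]
'roc_auc'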
- transform(X)
Replaces the original variable values with the predictions of the tree. The decision tree predictions are finite, i.e., discrete.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The input samples.
- Returns
- X_new: pandas dataframe of shape = [n_samples, n_features]
The dataframe with transformed variables.
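Because the DecisionTreeDiscretiser() follows the scikit-learn API, it can also be used as a step inside a scikit-learn Pipeline. A minimal sketch with the data from the examples above; the choice of final estimator is purely illustrative:

>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.pipeline import Pipeline
>>> pipe = Pipeline([
...     ("discretiser", DecisionTreeDiscretiser(random_state=42)),
...     ("regressor", LinearRegression()),
... ])
>>> pipe.fit(X, y_reg)
>>> preds = pipe.predict(X)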