DecisionTreeDiscretiser#
- class feature_engine.discretisation.DecisionTreeDiscretiser(variables=None, cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None)[source]#
The DecisionTreeDiscretiser() replaces numerical variables with discrete (i.e., finite) variables whose values are the predictions of a decision tree.
The method is inspired by the following article from the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf
The DecisionTreeDiscretiser() trains one decision tree per variable. It then replaces the values of each variable with the predictions of its tree.
The DecisionTreeDiscretiser() works only with numerical variables. A list of variables to transform can be indicated. Alternatively, the discretiser will automatically select all numerical variables.
More details in the User Guide.
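To make the idea concrete, the following is a minimal pure-Python sketch of tree-based discretisation for a single variable. It is a simplification and not the library's implementation: it fits only one split (a depth-1 regression tree), and the helper names `fit_stump` and `transform_stump` are hypothetical. Each value is replaced by the mean target of the leaf it falls into, so the transformed variable takes only two distinct values.

```python
from statistics import mean


def fit_stump(x, y):
    """Find the single threshold on x that minimises the squared error of y."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split is possible between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for v, t in pairs if v <= threshold]
        right = [t for v, t in pairs if v > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("x needs at least two distinct values")
    return best[1:]  # (threshold, left leaf prediction, right leaf prediction)


def transform_stump(x, stump):
    """Replace each value by the prediction of the leaf it falls into."""
    threshold, left_mean, right_mean = stump
    return [left_mean if v <= threshold else right_mean for v in x]


x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

stump = fit_stump(x, y)
x_discrete = transform_stump(x, stump)
print(stump)       # (6.5, 6.0, 21.0)
print(x_discrete)  # [6.0, 6.0, 6.0, 21.0, 21.0, 21.0]
```

A deeper tree produces more leaves and therefore more distinct values, which is why the depth of the tree controls the number of bins.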
- Parameters
- variables: list, default=None
The list of numerical variables to transform. If None, the discretiser will automatically select all numerical variables.
- cv: int, default=3
Number of cross-validation folds used to fit the decision tree.
- scoring: str, default='neg_mean_squared_error'
Metric used to select the best tree. Comes from sklearn.metrics. See the Scikit-learn model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html
- param_grid: dictionary, default=None
The hyperparameters for the decision tree to test with a grid search. The param_grid can contain any of the permitted hyperparameters for Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). If None, then param_grid = {'max_depth': [1, 2, 3, 4]}.
- regression: boolean, default=True
Indicates whether the discretiser should train a regression or a classification decision tree.
- random_state: int, default=None
The random_state to initialise the training of the decision tree. It is one of the parameters of Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility, it is recommended to set the random_state to an integer.
- Attributes
- binner_dict_:
Dictionary containing the fitted tree per variable.
- scores_dict_:
Dictionary with the score of the best decision tree per variable.
- variables_:
The variables that will be discretised.
- n_features_in_:
The number of features in the train set used in fit.
See also
sklearn.tree.DecisionTreeClassifier
sklearn.tree.DecisionTreeRegressor
References
- 1
Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
Methods
fit:
Fit a decision tree per variable.
transform:
Replace continuous variable values by the predictions of the decision tree.
fit_transform:
Fit to the data, then transform it.
- fit(X, y)[source]#
Fit one decision tree per variable to discretise, using cross-validation and a grid search over the hyperparameters.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The training dataset. Can be the entire dataframe, not just the variables to be transformed.
- y: pandas series
Target variable. Required to train the decision tree.
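The per-variable fit described above can be approximated directly with Scikit-learn. The sketch below assumes the search behaves like a `GridSearchCV` over a `DecisionTreeRegressor` with the default `param_grid`, `cv` and `scoring` values documented here. It illustrates the technique on synthetic data and is not a copy of the library's code.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic single variable with a step-shaped relationship to the target.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=150).reshape(-1, 1)  # one column = one variable
y = np.where(x.ravel() > 5, 3.0, 1.0) + rng.normal(0, 0.1, size=150)

# Grid search with cross-validation over the tree depth.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(x, y)

best_depth = search.best_params_["max_depth"]
predictions = search.predict(x)          # discrete replacement values
n_bins = len(np.unique(predictions))     # one value per leaf that is reached

print(best_depth, n_bins, search.best_score_)
```

Repeating this search once per selected variable, each time using only that variable as the predictor, gives one fitted tree and one best score per variable.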
- fit_transform(X, y=None, **fit_params)[source]#
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters.
- Returns
- X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.
- get_params(deep=True)[source]#
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
- set_params(**params)[source]#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
- transform(X)[source]#
Replaces original variable values with the predictions of the tree. The decision tree predictions take a finite set of values, so the transformed variables are discrete.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The input samples.
- Returns
- X_new: pandas dataframe of shape = [n_samples, n_features]
The dataframe with transformed variables.
- Return type
pandas.core.frame.DataFrame