Feature Selection#
Feature-engine’s feature selection transformers are used to drop subsets of variables with low predictive value. Feature-engine hosts selection algorithms that are, in general, not available in other libraries; they were gathered from data science competitions or are commonly used in industry.
Feature-engine’s transformers select features based on different strategies: some remove constant or quasi-constant features, some remove duplicated or correlated variables, some select features based on the performance of a machine learning model, and others implement selection procedures commonly used in finance or developed in industry and data science competitions.
The following tables show the algorithms that belong to each category.
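All of these selectors follow the familiar scikit-learn fit/transform convention. As a minimal sketch (the column names and data below are made up for illustration), dropping user-defined features looks like this:

```python
import pandas as pd
from feature_engine.selection import DropFeatures

# toy dataframe with made-up column names
X = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "income": [1500, 2300, 3100, 4200],
    "customer_id": [1, 2, 3, 4],
})

# drop a column we have decided carries no predictive value
selector = DropFeatures(features_to_drop=["customer_id"])
X_t = selector.fit_transform(X)

print(X_t.columns.tolist())  # ['age', 'income']
```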
Selection based on feature characteristics#
| Transformer | Categorical variables | Allows NA | Description |
|---|---|---|---|
| DropFeatures | √ | √ | Drops arbitrary features determined by the user |
| DropConstantFeatures | √ | √ | Drops constant and quasi-constant features |
| DropDuplicateFeatures | √ | √ | Drops features that are duplicated |
| DropCorrelatedFeatures | × | √ | Drops features that are correlated |
| SmartCorrelatedSelection | × | √ | From each correlated feature group, drops the less useful features |
| MRMR | √ | × | Selects features based on the MRMR framework |
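For instance, the constant and correlation filters from this table can be chained; the sketch below uses toy data and illustrative parameter values:

```python
import pandas as pd
from feature_engine.selection import DropConstantFeatures, DropCorrelatedFeatures

# toy data: "const" is constant, "x2" is highly correlated with "x1"
X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.1, 2.1, 3.1, 4.1, 5.1],
    "const": [1, 1, 1, 1, 1],
})

# drop constant / quasi-constant features (tol is the share of identical values)
X_t = DropConstantFeatures(tol=0.98).fit_transform(X)

# then drop one feature from each highly correlated pair
X_t = DropCorrelatedFeatures(threshold=0.8).fit_transform(X_t)

print(X_t.columns.tolist())  # only 'x1' should remain
```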
Selection based on a machine learning model#
| Transformer | Categorical variables | Allows NA | Description |
|---|---|---|---|
| SelectBySingleFeaturePerformance | × | × | Selects features based on the performance of single-feature models |
| RecursiveFeatureElimination | × | × | Removes features recursively by evaluating model performance |
| RecursiveFeatureAddition | × | × | Adds features recursively by evaluating model performance |
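For example, a minimal sketch of recursive feature elimination with a scikit-learn estimator (the dataset, estimator, and parameter values are illustrative choices, not requirements):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from feature_engine.selection import RecursiveFeatureElimination

# load a toy classification dataset as a DataFrame
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# recursively remove features whose removal does not degrade cross-validated performance
selector = RecursiveFeatureElimination(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    scoring="roc_auc",
    cv=3,
    threshold=0.001,  # maximum tolerated performance drop when removing a feature
)
X_t = selector.fit_transform(X, y)

print(f"kept {X_t.shape[1]} of {X.shape[1]} features")
```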
Selection methods commonly used in finance#
| Transformer | Categorical variables | Allows NA | Description |
|---|---|---|---|
| DropHighPSIFeatures | × | √ | Drops features with a high Population Stability Index |
| SelectByInformationValue | √ | × | Drops features with low information value |
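A minimal sketch of PSI-based selection, using synthetic data in which one feature drifts between the first and second half of the dataframe (the split fraction and threshold values are illustrative):

```python
import numpy as np
import pandas as pd
from feature_engine.selection import DropHighPSIFeatures

rng = np.random.default_rng(0)
n = 1000

# "drifted" changes distribution between the first and second half of the rows
X = pd.DataFrame({
    "stable": rng.normal(0, 1, n),
    "drifted": np.r_[rng.normal(0, 1, n // 2), rng.normal(3, 1, n // 2)],
})

# compare the first 50% of the rows against the rest; drop features with PSI above 0.25
selector = DropHighPSIFeatures(split_frac=0.5, threshold=0.25)
X_t = selector.fit_transform(X)

print(selector.features_to_drop_)  # expect ['drifted']
```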
Alternative feature selection methods#
| Transformer | Categorical variables | Allows NA | Description |
|---|---|---|---|
| SelectByShuffling | × | × | Selects features if shuffling their values causes a drop in model performance |
| SelectByTargetMeanPerformance | √ | × | Selects high-performing features using the target mean as a performance proxy |
| ProbeFeatureSelection | × | × | Selects features whose importance is greater than that of random variables |
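As an illustration, a sketch of selection by feature shuffling; the estimator, scoring metric, and dataset are arbitrary choices for the example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from feature_engine.selection import SelectByShuffling

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# keep only the features whose shuffling degrades cross-validated model performance
selector = SelectByShuffling(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
X_t = selector.fit_transform(X, y)

print(len(selector.features_to_drop_), "features dropped")
```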
Other Feature Selection Libraries#
For additional feature selection algorithms, visit the following open-source libraries:
- Scikit-learn hosts multiple filter and embedded methods that select features based on statistical tests or on importance derived from machine learning models.
- MLXtend hosts greedy (wrapper) feature selection methods.