MRMR - Minimum Redundancy Maximum Relevance#

MRMR() selects features based on the Maximum Relevance Minimum Redundancy framework. In this framework, features with a strong relationship with the target (high relevance) and a weak relationship with the other predictor features (low redundancy) are favored and hence selected.

The MRMR algorithm obtains a measure of relevance and a measure of redundancy, and then it assigns an importance score to each feature based on the difference or ratio between relevance and redundancy. After that, it selects the features with the highest scores.

MRMR was first described in bioinformatics as a method to select features for microarray gene expression data, and then expanded and popularized by Uber in the context of marketing models.

MRMR - Mechanism#

The MRMR feature selection algorithm works as follows:

  1. Determine the relevance of all predictor variables and select the feature with highest relevance.

  2. Determine the redundancy between the remaining features and the feature selected in step 1.

  3. Calculate the importance score as the ratio or difference between relevance and redundancy and select the feature with the maximum value: add it to the selected features.

  4. Determine the mean redundancy between the remaining features and the features selected so far.

  5. Calculate the importance score and select the feature with the maximum value: add it to the selected features.

  6. Repeat 4 and 5 until the desired number of selected features is reached.

MRMR is an iterative algorithm. At each iteration, it determines the mean redundancy between the remaining features and the features that were selected in previous rounds. With the redundancy, it obtains a new measure of importance, and then it selects the feature with highest importance.

Note that you need to define the number of features to select.
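To make the iteration concrete, here is a minimal sketch of the loop. The relevance() and redundancy() callables are placeholders for any of the measures described below; this illustrates the logic only, not Feature-engine's internal implementation.

import numpy as np


def mrmr_sketch(X, y, relevance, redundancy, n_features):
    # relevance(x, y) and redundancy(x1, x2) are placeholder callables
    # returning a single score each; X is a pandas DataFrame.
    remaining = list(X.columns)
    rel = {f: relevance(X[f], y) for f in remaining}

    # step 1: start with the most relevant feature
    selected = [max(remaining, key=rel.get)]
    remaining.remove(selected[0])

    # steps 2-6: iteratively add the feature with the best relevance / redundancy ratio
    while remaining and len(selected) < n_features:
        scores = {
            f: rel[f] / np.mean([redundancy(X[f], X[s]) for s in selected])
            for f in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)

    return selected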

Relevance#

MRMR() has 3 strategies to determine feature relevance. To determine the relationship of each feature with the target variable, MRMR() obtains:

  1. The F-statistic, which is derived from ANOVA if the target is discrete, or from correlation if the target is continuous.

  2. The mutual information.

  3. The importance derived from random forests.

The relevance is used to select the first feature and then to calculate the MRMR values at each of the subsequent iterations.

F-statistic#

The F-statistic determines the degree of linear association between the features and the target. If the target is categorical, the F-statistic is calculated using Scikit-learn’s f_classif function. If the target is continuous, the F-statistic is determined using f_regression.

Note that in both cases, this statistic is useful when the features are continuous. For discrete features, other tests should be used, like chi-square, which is not implemented at the moment. So if your dataset contains both numerical and categorical variables, try the mutual information or the importance derived from random forests instead.
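For illustration, this is how the F-statistic relevance could be obtained directly with scikit-learn, outside of MRMR():

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# F-statistic of each feature against the categorical target;
# use f_regression instead when the target is continuous
f_stat, p_values = f_classif(X, y)
print(dict(zip(X.columns, f_stat.round(1))))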

Mutual information#

The mutual information quantifies how much we learn about one variable by examining the values of a second variable. It captures both linear and non-linear associations between variables. Higher values indicate stronger associations.

MRMR() uses scikit-learn’s mutual_info_classif to determine the association between features and a discrete target, or mutual_info_regression, to calculate the association between features and a continuous target.

The mutual information is calculated differently for continuous and discrete variables, so it is important to flag categorical and discrete variables through the discrete_features parameter.
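As a quick illustration, the same relevance could be computed directly with scikit-learn; in the breast cancer dataset all features are continuous, so discrete_features is set to False:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# mutual information between each feature and the discrete target;
# pass a boolean mask to discrete_features when some columns are discrete
mi = mutual_info_classif(X, y, discrete_features=False, random_state=0)
print(dict(zip(X.columns, mi.round(3))))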

Random Forests#

Random forests automatically assign a measure of relevance by computing the impurity reduction contributed by each feature at each split of the trees, and then averaging it across the forest. Hence, the importance derived from random forests offers a good approximation of the relationship between features and target.

Note however that if the features are highly correlated, the importance derived from random forests will be diluted.
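To see what this kind of relevance looks like outside of MRMR(), you could inspect scikit-learn's impurity-based importances directly:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# impurity-based feature importances, averaged across the trees of the forest
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(dict(zip(X.columns, rf.feature_importances_.round(3))))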

Redundancy#

Redundant features are those that are highly correlated, or show high dependency with other features in the dataset.

MRMR() has 2 strategies to determine the relationship of the variables to other variables in the dataset: Pearson’s correlation coefficient or mutual information.

Correlation#

To determine each feature’s redundancy, MRMR() obtains Pearson’s correlation coefficient between each feature and the features selected in previous rounds. Next, it takes the average of the absolute value of the coefficients.

Note that correlation assumes that all features are continuous, so this metric may return biased results for categorical and discrete variables.
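As an illustration, assuming two hypothetical, previously selected features, the redundancy of a candidate feature could be computed with pandas like this:

from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

selected = ["worst concave points", "worst perimeter"]  # hypothetical, previously selected
candidate = "mean radius"

# mean absolute Pearson correlation between the candidate and the selected features
redundancy = X[selected].corrwith(X[candidate]).abs().mean()
print(round(redundancy, 3))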

Mutual information#

To determine each feature’s redundancy, MRMR() calculates the mutual information between each feature and the features selected in former iterations, and then takes the average.
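A comparable sketch using mutual information, with mutual_info_regression because the hypothetical previously selected features are continuous:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_regression

X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

selected = ["worst concave points", "worst perimeter"]  # hypothetical, previously selected
candidate = "mean radius"

# mean mutual information between the candidate and the selected features
mi = mutual_info_regression(X[selected], X[candidate], random_state=0)
print(round(mi.mean(), 3))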

MRMR#

The MRMR method obtains a measure of feature importance by comparing its relevance to the target and its redundancy with other, previously selected features.

High feature importance is obtained when the relevance is high and the redundancy is low.

A value of MRMR can be obtained through the difference or the ratio between relevance and redundancy.

The following schemes are supported by MRMR():

Method    Relevance             Redundancy            Scheme
'MID'     Mutual information    Mutual information    Difference
'MIQ'     Mutual information    Mutual information    Ratio
'FCD'     F-statistic           Correlation           Difference
'FCQ'     F-statistic           Correlation           Ratio
'RFCQ'    Random forests        Correlation           Ratio
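As a tiny numeric illustration of the two schemes, with hypothetical relevance and redundancy values:

# hypothetical relevance and mean redundancy of a candidate feature
relevance = 0.90
mean_redundancy = 0.30

score_difference = relevance - mean_redundancy  # schemes ending in 'D' (MID, FCD)
score_ratio = relevance / mean_redundancy       # schemes ending in 'Q' (MIQ, FCQ, RFCQ)
print(round(score_difference, 2), round(score_ratio, 2))  # 0.6 3.0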

Feature selection#

MRMR() selects as many features as indicated in the parameter 'max_features'. If left to None, MRMR() will select 20% of the features seen in the training dataset used during fit.

Note that the number of features to select is arbitrary.

MRMR is fast to compute when using the F-statistic and correlation. However, these statistics only capture linear relationships.

To detect non-linear relationships, the variant that determines relevance through the feature importance derived from random forests might be better. Mutual information captures both linear and non-linear associations, but for continuous features it takes longer to compute, which slows down the selection process of MRMR.
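For instance, to select 10 features using mutual information for both relevance and redundancy, combined through the difference scheme, the set-up could look like this:

from feature_engine.selection import MRMR

# 'MID': mutual information for relevance and redundancy, combined as a difference
sel = MRMR(method="MID", max_features=10, regression=False)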

Python examples#

Let’s see how to implement MRMR(). We’ll start by using Scikit-learn’s breast cancer dataset. The target variable is binary, representing malignant or benign tumors. All predictor variables are continuous.

Let’s import the required libraries and classes:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from feature_engine.selection import MRMR

Let’s now load the cancer diagnostic data and display its top rows:

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
y = y.map({0:1, 1:0})
print(X.head())

In the following output, we see the top 5 rows of the dataset:

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840
1        20.57         17.77          132.90     1326.0          0.08474
2        19.69         21.25          130.00     1203.0          0.10960
3        11.42         20.38           77.58      386.1          0.14250
4        20.29         14.34          135.10     1297.0          0.10030

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419
1           0.07864          0.0869              0.07017         0.1812
2           0.15990          0.1974              0.12790         0.2069
3           0.28390          0.2414              0.10520         0.2597
4           0.13280          0.1980              0.10430         0.1809

   mean fractal dimension  ...  worst radius  worst texture  worst perimeter  \
0                 0.07871  ...         25.38          17.33           184.60
1                 0.05667  ...         24.99          23.41           158.80
2                 0.05999  ...         23.57          25.53           152.50
3                 0.09744  ...         14.91          26.50            98.87
4                 0.05883  ...         22.54          16.67           152.20

   worst area  worst smoothness  worst compactness  worst concavity  \
0      2019.0            0.1622             0.6656           0.7119
1      1956.0            0.1238             0.1866           0.2416
2      1709.0            0.1444             0.4245           0.4504
3       567.7            0.2098             0.8663           0.6869
4      1575.0            0.1374             0.2050           0.4000

   worst concave points  worst symmetry  worst fractal dimension
0                0.2654          0.4601                  0.11890
1                0.1860          0.2750                  0.08902
2                0.2430          0.3613                  0.08758
3                0.2575          0.6638                  0.17300
4                0.1625          0.2364                  0.07678

[5 rows x 30 columns]

Let’s now split the data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

The F-Statistic framework: linear associations#

In this example, we want MRMR() to determine feature relevance by using the F-statistic obtained with ANOVA, and the redundancy by examining the Pearson’s correlation coefficient among features. The importance is obtained as the ratio between relevance and redundancy.

Let’s set up MRMR():

sel = MRMR(method="FCQ", regression=False)
sel.fit(X, y)

With fit(), MRMR() computed the relevance, the redundancy, and the MRMR importance score, and determined which features should be selected. Note that this is an iterative process.

We can print out the relevance as follows:

sel.relevance_

In the following output, we see an array with the F-statistic obtained from ANOVA:

array([6.46981021e+02, 1.18096059e+02, 6.97235272e+02, 5.73060747e+02,
       8.36511234e+01, 3.13233079e+02, 5.33793126e+02, 8.61676020e+02,
       6.95274435e+01, 9.34592949e-02, 2.68840327e+02, 3.90947023e-02,
       2.53897392e+02, 2.43651586e+02, 2.55796780e+00, 5.32473391e+01,
       3.90144816e+01, 1.13262760e+02, 2.41174067e-02, 3.46827476e+00,
       8.60781707e+02, 1.49596905e+02, 8.97944219e+02, 6.61600206e+02,
       1.22472880e+02, 3.04341063e+02, 4.36691939e+02, 9.64385393e+02,
       1.18860232e+02, 6.64439606e+01])

Alternatively, we can create a bar plot of the relevance:

pd.Series(sel.relevance_, index=sel.variables_).sort_values(
    ascending=False).plot.bar(figsize=(15, 4))
plt.title("Relevance")
plt.show()

In the following image, we see the F-statistic per feature:

../../_images/f_statistic.png

We can see the subset of features that will be removed as follows:

sel.features_to_drop_

In the following output we see the features that were not selected:

['mean radius',
 'mean texture',
 'mean area',
 'mean smoothness',
 'mean compactness',
 'mean concavity',
 'mean symmetry',
 'mean fractal dimension',
 'radius error',
 'texture error',
 'perimeter error',
 'area error',
 'smoothness error',
 'compactness error',
 'concavity error',
 'concave points error',
 'symmetry error',
 'fractal dimension error',
 'worst texture',
 'worst smoothness',
 'worst compactness',
 'worst concavity',
 'worst symmetry',
 'worst fractal dimension']

Finally, we can go ahead and retain the selected features like this:

Xtr = sel.transform(X_test)
print(Xtr.head())

In the following output we see the test set with a reduced number of features:

     mean perimeter  mean concave points  worst radius  worst perimeter  \
512           88.64              0.08172         16.41           113.30
457           84.10              0.02068         14.35            91.29
439           89.59              0.02652         14.91            96.53
298           91.22              0.01374         16.22           105.80
37            82.61              0.02923         13.30            84.46

     worst area  worst concave points
512       844.4               0.20510
457       632.9               0.06005
439       688.9               0.08216
298       819.7               0.07530
37        545.9               0.05013

In the final dataset we only have the “relevant features”. And by relevant, we mean those with high association with the target, and low association with other features.

Since we left the parameter 'max_features' as None, MRMR() selected 20% of the features in the training set. The training set contained 30 features, so 6 features remain after applying MRMR.

Using random forests#

When we have categorical or discrete variables, or want to examine non-linear associations, we can determine the relevance using the feature importance derived from random forests.

MRMR() will train a random forest using grid search over a hyperparameter grid that can be specified by the user. The redundancy is determined using Pearson’s correlation coefficient.

In a similar way, the MRMR feature selection algorithm will compute the feature importance as the ratio between the random forest importance and Pearson’s correlation coefficient.

Let’s set up MRMR() to use a random forest classifier for the relevance. Note that we need to specify a cross-validation scheme and a performance metric, and we have the option to pass a grid of hyperparameters to optimize:

sel = MRMR(
    method="RFCQ",
    scoring="roc_auc",
    param_grid={"n_estimators": [5, 50, 500], "max_depth": [1, 2, 3]},
    cv=3,
    regression=False,
    random_state=42,
)

sel.fit(X, y)

We can now go ahead and plot the relevance:

pd.Series(sel.relevance_, index=sel.variables_).sort_values(
    ascending=False).plot.bar(figsize=(15, 4))
plt.title("Relevance")
plt.show()

In the following image we see the relationship between features and the target derived from random forests:

../../_images/rfimportancemrmr.png

We can now go ahead and retain the selected features using the transform() method:

Xtr = sel.transform(X_test)
print(Xtr.head())

In the following output we see the test set with a reduced number of features:

     mean concave points  fractal dimension error  worst radius  \
512              0.08172                 0.004005         16.41
457              0.02068                 0.001828         14.35
439              0.02652                 0.002104         14.91
298              0.01374                 0.001957         16.22
37               0.02923                 0.001777         13.30

     worst perimeter  worst area  worst concave points
512           113.30       844.4               0.20510
457            91.29       632.9               0.06005
439            96.53       688.9               0.08216
298           105.80       819.7               0.07530
37             84.46       545.9               0.05013

In the final dataset we only have the relevant features. Again, we selected 20% of the features in the training set.

Mutual information#

If we have non-linear associations and / or categorical or discrete variables, a better option is to obtain the relevance and redundancy utilizing mutual information.

The mutual information is calculated differently for numerical and categorical variables, so it is best to flag discrete features with a boolean array.

For this demo, we’ll use the California housing dataset to predict house prices, and we’ll treat 3 variables as discrete. Let’s load the data:

from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X[['AveRooms', 'AveBedrms', 'AveOccup']] = X[['AveRooms', 'AveBedrms', 'AveOccup']].astype(int)

print(X.head())

In the following output, we see the first 5 rows of the dataset:

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0         6          1       322.0         2     37.88
1  8.3014      21.0         6          0      2401.0         2     37.86
2  7.2574      52.0         8          1       496.0         2     37.85
3  5.6431      52.0         5          1       558.0         2     37.85
4  3.8462      52.0         6          1       565.0         2     37.85

   Longitude
0    -122.23
1    -122.22
2    -122.24
3    -122.25
4    -122.25

Now, we’ll set up MRMR() to use mutual information to determine both redundancy and relevance, and the importance score as the ratio (or quotient, hence the Q in MIQ) between the two.

Note the boolean vector with True in the position of the discrete variables ‘AveRooms’, ‘AveBedrms’ and ‘AveOccup’:

sel = MRMR(
    variables = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'],
    method="MIQ",
    max_features=4,
    discrete_features=[False, False, True, True, False, True],
    regression=True,
    random_state=42,
)

sel.fit(X,y)

To select features we use transform():

Xtr = sel.transform(X)
print(Xtr.head())

Note that the variables Latitude and Longitude, which were not part of the set of features examined by MRMR, are retained in the transformed dataset:

   MedInc  HouseAge  AveBedrms  AveOccup  Latitude  Longitude
0  8.3252      41.0          1         2     37.88    -122.23
1  8.3014      21.0          0         2     37.86    -122.22
2  7.2574      52.0          1         2     37.85    -122.24
3  5.6431      52.0          1         2     37.85    -122.25
4  3.8462      52.0          1         2     37.85    -122.25

For compatibility with Scikit-learn, MRMR() also supports the method get_support():

sel.get_support()

which returns the following output:

[True, True, False, True, False, True, True, True]
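To map this boolean mask back to the column names, one option, assuming the selector and dataframe fitted above, is:

# column names retained after the selection, derived from the support mask
retained = X.columns[sel.get_support()]
print(list(retained))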

Considerations#

The maximum relevance minimum redundancy feature selection method is fast, particularly when using the F-statistic and correlation, and therefore allows scrutinizing fairly big datasets. This speed is one of the reasons behind its popularity.

Additional resources#

For more details about this and other feature selection methods check out these resources:

../../_images/fsml.png

Feature Selection for Machine Learning#

Or read our book:

../../_images/fsmlbook.png

Feature Selection in Machine Learning#

Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.