SelectByInformationValue#

SelectByInformationValue() selects features based on whether the feature’s information value score is greater than the threshold passed by the user.

The IV is calculated as:

\[IV = \sum (\text{fraction of positive cases} - \text{fraction of negative cases}) \times WoE\]

where:

the fraction of positive cases is the proportion of all class 1 observations that fall in a given category or bin.
the fraction of negative cases is the proportion of all class 0 observations that fall in that same category or bin.
WoE is the weight of evidence of that category or bin.

The WoE is calculated as:

\[WoE = \ln\left(\frac{\text{fraction of positive cases}}{\text{fraction of negative cases}}\right)\]

Information value (IV) is used to assess how well a feature predicts a binary target. To derive a feature’s IV, the weight of evidence (WoE) must first be calculated for each unique category or bin of the feature. If a category or bin contains a large percentage of positive labels compared to the percentage of negative labels, that category or bin will have a high WoE value.

Once the WoE is derived, SelectByInformationValue() calculates the IV for each variable. A variable’s IV is essentially the weighted sum of the WoE values of its categories or bins, where each weight is the difference between the fraction of positive cases and the fraction of negative cases in that category or bin. Because this difference and the WoE always share the same sign, every term in the sum is non-negative. The IV summarises the feature’s power to predict the binary target.
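The two formulas above can be sketched with pandas on a toy feature. This is a minimal illustration, not the transformer's internal code; the data frame, the column name colour and the helper information_value are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one categorical feature and a binary target.
df = pd.DataFrame({
    "colour": ["red"] * 4 + ["blue"] * 4 + ["green"] * 2,
    "target": [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})


def information_value(feature, target):
    """Return the per-category WoE and the total IV for a binary target."""
    # Rows are categories; columns are the counts of class 0 and class 1.
    counts = pd.crosstab(feature, target)
    # Share of each class that falls in each category.
    pos = counts[1] / counts[1].sum()
    neg = counts[0] / counts[0].sum()
    woe = np.log(pos / neg)
    iv = ((pos - neg) * woe).sum()
    return woe, iv


woe, iv = information_value(df["colour"], df["target"])
print(woe)
print(iv)
```

Here red holds 3 of the 5 positives and 1 of the 5 negatives, so its WoE is ln(3); blue mirrors it with ln(1/3); green is balanced and has a WoE of 0. The IV is therefore 0.8 × ln(3), roughly 0.88.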

The table below presents a general framework for using IV to determine a variable’s predictive power:

Information Value    Predictive Power
< 0.02               Useless
0.02 to 0.1          Weak
0.1 to 0.3           Medium
0.3 to 0.5           Strong
> 0.5                Suspicious, too good to be true

Table taken from listendata.
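As a small convenience, the table can be turned into a lookup function. The helper iv_predictive_power is hypothetical and not part of Feature-engine; how values that fall exactly on a boundary are assigned is an assumption, since the table does not specify it.

```python
def iv_predictive_power(iv):
    """Map an IV score to the rule-of-thumb label from the table.

    Boundary handling (e.g. exactly 0.1) is an assumption.
    """
    if iv < 0.02:
        return "Useless"
    if iv < 0.1:
        return "Weak"
    if iv < 0.3:
        return "Medium"
    if iv <= 0.5:
        return "Strong"
    return "Suspicious, too good to be true"


for score in (0.01, 0.05, 0.2, 0.4, 0.9):
    print(score, iv_predictive_power(score))
```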

Example#

Let’s see how to use this transformer to select variables from UC Irvine’s credit approval data set. This dataset concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality.

The dataset contains both numerical and categorical variables.

Let’s import the required libraries and classes:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from feature_engine.selection import SelectByInformationValue

Let’s now load and prepare the credit approval data:

# load data
data = pd.read_csv('crx.data', header=None)

# name variables
var_names = ['A' + str(s) for s in range(1,17)]
data.columns = var_names
data.rename(columns={'A16': 'target'}, inplace=True)

# preprocess data
data = data.replace('?', np.nan)
data['A2'] = data['A2'].astype('float')
data['A14'] = data['A14'].astype('float')
data['target'] = data['target'].map({'+':1, '-':0})

# drop rows with missing data
data.dropna(axis=0, inplace=True)

data.head()

Let’s now review the first 5 rows of the dataset:

  A1     A2     A3 A4 A5 A6 A7    A8 A9 A10  A11 A12 A13    A14  A15  target
b  30.83  0.000  u  g  w  v  1.25  t   t    1   f   g  202.0    0       1
a  58.67  4.460  u  g  q  h  3.04  t   t    6   f   g   43.0  560       1
a  24.50  0.500  u  g  q  h  1.50  t   f    0   f   g  280.0  824       1
b  27.83  1.540  u  g  w  v  3.75  t   t    5   t   g  100.0    3       1
b  20.17  5.625  u  g  w  v  1.71  t   f    0   f   s  120.0    0       1

Let’s now split the data into train and test sets:

# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['target'], axis=1),
    data['target'],
    test_size=0.2,
    random_state=0)

X_train.shape, X_test.shape

We see the size of the datasets below.

((522, 15), (131, 15))

Now, we set up SelectByInformationValue(). We will pass six categorical variables to the parameter variables and set the parameter threshold to 0.2. According to the table above, an IV score of 0.2 falls in the range of medium predictive power.

sel = SelectByInformationValue(
    variables=['A1', 'A6', 'A9', 'A10', 'A12', 'A13'],
    threshold=0.2,
)

sel.fit(X_train, y_train)

With fit(), the transformer:

calculates the WoE for each variable

calculates the IV for each variable

identifies the variables that have an IV score below the threshold

In the attribute variables_, we find the variables that were evaluated:

['A1', 'A6', 'A9', 'A10', 'A12', 'A13']

In the attribute features_to_drop_, we find the variables that were not selected:

sel.features_to_drop_

['A1', 'A12', 'A13']

The attribute information_values_ shows the IV scores for each variable.

{'A1': 0.0009535686492270659,
 'A6': 0.6006252129425703,
 'A9': 2.9184484098456807,
 'A10': 0.8606638171665587,
 'A12': 0.012251943759377052,
 'A13': 0.04383964979386022}

We see that the transformer correctly selected the features that have an IV score greater than the threshold which was set to 0.2.
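The selection rule can be checked by hand against the scores printed above. This sketch only reproduces the comparison with the threshold in plain Python; whether a score exactly equal to the threshold is kept or dropped is an assumption, and it does not matter for these values.

```python
# IV scores copied from information_values_ above.
information_values = {
    "A1": 0.0009535686492270659,
    "A6": 0.6006252129425703,
    "A9": 2.9184484098456807,
    "A10": 0.8606638171665587,
    "A12": 0.012251943759377052,
    "A13": 0.04383964979386022,
}
threshold = 0.2

# Assumption: a variable is dropped when its IV is below the threshold.
to_drop = [name for name, iv in information_values.items() if iv < threshold]
selected = [name for name in information_values if name not in to_drop]

print(to_drop)
print(selected)
```

The first list matches features_to_drop_, and the second contains A6, A9 and A10.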

The transformer also has the method get_support(), which works like the method of the same name in Scikit-learn’s selectors: it returns one boolean per input feature, which is True if the feature is retained. If you execute sel.get_support(), you obtain:

[False, True, True, True, True, True, True,
 True, True, True, True, False, False, True,
 True]
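The mask above lines up with the 15 input columns, A1 to A15. A minimal sketch, assuming only the column names and the dropped features shown earlier, rebuilds it in plain Python without calling the transformer:

```python
# Input columns of X_train, in order.
feature_names = ["A" + str(i) for i in range(1, 16)]
# Variables reported in features_to_drop_ above.
features_to_drop = ["A1", "A12", "A13"]

# True where the column is kept, False where it is dropped.
support = [name not in features_to_drop for name in feature_names]
kept = [name for name, keep in zip(feature_names, support) if keep]

print(support)
print(kept)
```

The False entries sit at positions 1, 12 and 13, which correspond to A1, A12 and A13.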

With transform(), we can go ahead and drop the features that do not meet the threshold:

Xtr = sel.transform(X_test)

Xtr.head()

        A2     A3 A4 A5  A6 A7      A8 A9 A10  A11    A14  A15
42.17   5.04  u  g   q  h  12.750  t   f    0   92.0    0
39.17   1.71  u  g   x  v   0.125  t   t    5  480.0    0
 45.83  10.50  u  g   q  v   5.000  t   t    7    0.0    0
20.00   0.00  u  g   d  v   0.500  f   f    0  144.0    0
 34.00   4.50  u  g  aa  v   1.000  t   f    0  240.0    0

Note that Xtr includes all the numerical features (A2, A3, A8, A11, A14 and A15) because we only evaluated some of the categorical features.

And, finally, we can also obtain the names of the features in the final transformed dataset:

sel.get_feature_names_out()

['A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A14', 'A15']

If we want to select from both categorical and numerical variables, we can do so by first sorting the numerical variables into bins. Let’s sort them into 5 equal-frequency bins:

sel = SelectByInformationValue(
    bins=5,
    strategy="equal_frequency",
    threshold=0.2,
)

sel.fit(X_train.drop(["A4", "A5", "A7"], axis=1), y_train)

If we now inspect the information values:

sel.information_values_

We see the following:

{'A1': 0.0009535686492270659,
 'A2': 0.10319123021570434,
 'A3': 0.2596258749173557,
 'A6': 0.6006252129425703,
 'A8': 0.7291628533346297,
 'A9': 2.9184484098456807,
 'A10': 0.8606638171665587,
 'A11': 1.0634602064399297,
 'A12': 0.012251943759377052,
 'A13': 0.04383964979386022,
 'A14': 0.3316668794040285,
 'A15': 0.6228678069374612}

And if we inspect the features to drop:

sel.features_to_drop_

We see the following:

['A1', 'A2', 'A12', 'A13']
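To make the binning step concrete, here is a sketch that discretises a numerical variable into 5 equal-frequency bins with pandas.qcut and then computes its IV from the bins. It uses synthetic, deterministic data and is a stand-in under stated assumptions: the names x and y are hypothetical, and the transformer's own binning may differ in detail.

```python
import numpy as np
import pandas as pd

# Synthetic numerical feature (hypothetical): the integers 0..99.
x = pd.Series(np.arange(100), name="x")
# Synthetic binary target whose positive rate grows with x.
y = pd.Series(((x % 20) < 2 + 4 * (x // 20)).astype(int), name="target")

# Equal-frequency binning: 5 bins with the same number of observations.
binned = pd.qcut(x, q=5, labels=False)

# Per-bin class counts, then the WoE and IV formulas from the top of the page.
counts = pd.crosstab(binned, y)
pos = counts[1] / counts[1].sum()
neg = counts[0] / counts[0].sum()
woe = np.log(pos / neg)
iv = ((pos - neg) * woe).sum()

print(binned.value_counts().sort_index())
print(woe)
print(iv)
```

Every bin here contains both classes, so the WoE is defined everywhere. With real data that is not guaranteed, which is the situation the note below addresses.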

Note#

The WoE is the logarithm of a ratio. Thus, if for any category or bin the fraction of observations of either class is 0, the WoE is not defined, and the transformer will raise an error.

If you encounter this problem, try grouping numerical variables into fewer bins, or grouping rare categories with the RareLabelEncoder if the variables are categorical.
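The following sketch illustrates both points in plain Python: why a zero fraction breaks the WoE, and how merging infrequent categories reduces the chance of it happening. The helpers woe_or_none and group_rare are hypothetical; group_rare only imitates the idea behind RareLabelEncoder and is not its implementation.

```python
import math
from collections import Counter


def woe_or_none(pos_fraction, neg_fraction):
    """Return the WoE, or None when either fraction is 0 and it is undefined."""
    if pos_fraction == 0 or neg_fraction == 0:
        return None
    return math.log(pos_fraction / neg_fraction)


def group_rare(values, min_fraction=0.05, label="Rare"):
    """Replace categories seen in less than min_fraction of rows with label."""
    counts = Counter(values)
    n = len(values)
    return [v if counts[v] / n >= min_fraction else label for v in values]


print(woe_or_none(0.2, 0.0))  # undefined: no negative cases in the category
print(woe_or_none(0.5, 0.5))  # balanced category

categories = ["a"] * 18 + ["b", "c"]
print(Counter(group_rare(categories, min_fraction=0.1)))
```

After grouping, the single-observation categories b and c are pooled into one larger category, which is more likely to contain observations of both classes.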

Additional resources#

For more details about this and other feature selection methods check out these resources:

Feature Selection for Machine Learning#

Or read our book:

Feature Selection in Machine Learning#

Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
