DropCorrelatedFeatures
DropCorrelatedFeatures() finds and removes correlated variables from a dataframe. Correlation is calculated with pandas.corr(). All correlation methods supported by pandas.corr() can be used in the selection, including Pearson, Spearman, or Kendall. You can also pass a bespoke correlation function, provided it returns a value between -1 and 1.
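As an illustration of a bespoke correlation function, the sketch below defines a quadrant correlation (the mean sign agreement of two variables around their medians), which always falls between -1 and 1. The function name `quadrant_corr` and the toy data are assumptions made for this example. It is shown with pandas' `DataFrame.corr()`, which accepts a callable that takes two 1-D arrays and returns a float.

```python
import numpy as np
import pandas as pd


def quadrant_corr(a: np.ndarray, b: np.ndarray) -> float:
    """Hypothetical bespoke metric: mean sign agreement around the medians.

    Each product of signs is -1, 0 or 1, so the mean lies in [-1, 1].
    """
    a_sign = np.sign(a - np.median(a))
    b_sign = np.sign(b - np.median(b))
    return float(np.mean(a_sign * b_sign))


# toy data (an assumption for this sketch): "b" is a rescaled copy of "a",
# while "c" is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"a": x, "b": 2 * x, "c": rng.normal(size=200)})

# pandas calls quadrant_corr on every pair of columns
corr_matrix = df.corr(method=quadrant_corr)
print(corr_matrix.round(2))
```

Going by the description above, a function of this shape could then be passed as the `method` argument of DropCorrelatedFeatures() in place of a method name.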
Features are removed in order of appearance, without any further insight. That is, the first feature found is retained, and all subsequent features that are correlated with it are removed.
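The order-based selection described above can be sketched in plain pandas. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `first_found_drop` and the toy columns are hypothetical, and Pearson correlation is used.

```python
import numpy as np
import pandas as pd


def first_found_drop(X: pd.DataFrame, threshold: float = 0.8):
    """Hypothetical helper: keep the first feature of each correlated group
    and mark every later feature correlated with it for removal."""
    corr = X.corr(method="pearson").abs()
    columns = list(corr.columns)
    to_drop = set()
    groups = []
    for i, col in enumerate(columns):
        if col in to_drop:
            # already removed because of an earlier feature
            continue
        # later features, not yet dropped, correlated with the retained one
        correlated = [
            other
            for other in columns[i + 1:]
            if other not in to_drop and corr.loc[col, other] > threshold
        ]
        if correlated:
            groups.append({col, *correlated})
            to_drop.update(correlated)
    return groups, to_drop


# toy data (an assumption): "b" is a noisy copy of "a", "d" of "c"
rng = np.random.default_rng(0)
a = rng.normal(size=500)
c = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.05 * rng.normal(size=500),
    "c": c,
    "d": c + 0.05 * rng.normal(size=500),
    "e": rng.normal(size=500),
})

groups, to_drop = first_found_drop(toy, threshold=0.8)
print(groups)
print(to_drop)
```

Because the outcome depends on column order, reordering the dataframe changes which member of each correlated group is kept.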
The transformer examines all numerical variables automatically. You can pass a dataframe that also contains categorical and datetime variables; these are ignored. Alternatively, you can pass a list with the variables you wish to evaluate.
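To see which columns of a mixed dataframe count as numerical, a selection of this kind can be sketched with pandas' `select_dtypes()`. The dataframe and column names below are made up for illustration, and this is only an assumption about how such a selection might be done, not a description of the transformer's internals.

```python
import pandas as pd

# toy dataframe (an assumption) mixing numerical, categorical and datetime columns
df = pd.DataFrame({
    "age": [20, 30, 40],
    "city": ["London", "Paris", "Rome"],
    "signup": pd.to_datetime(["2021-01-01", "2021-06-15", "2022-03-30"]),
    "income": [1000.0, 2500.0, 1800.0],
})

# keep only integer and float columns; object and datetime columns are left out
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

A list like `numeric_cols`, or any subset of it, is the kind of value that would be passed through the `variables` parameter when restricting the evaluation to specific columns.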
Example
Let’s create a toy dataframe where 4 of the features are correlated:
import pandas as pd
from sklearn.datasets import make_classification
from feature_engine.selection import DropCorrelatedFeatures

# make dataframe with some correlated variables
def make_data():
    X, y = make_classification(n_samples=1000,
                               n_features=12,
                               n_redundant=4,
                               n_clusters_per_class=1,
                               weights=[0.50],
                               class_sep=2,
                               random_state=1)
    # transform arrays into pandas df and series
    colnames = ['var_'+str(i) for i in range(12)]
    X = pd.DataFrame(X, columns=colnames)
    return X

X = make_data()
Now, we set up DropCorrelatedFeatures() to find and remove variables whose (absolute) correlation coefficient is greater than 0.8:
tr = DropCorrelatedFeatures(variables=None, method='pearson', threshold=0.8)
With fit(), the transformer finds the correlated variables, and with transform() it drops them from the dataset:
Xt = tr.fit_transform(X)
The correlated feature groups are stored in the transformer’s attributes:
tr.correlated_feature_sets_
[{'var_0', 'var_8'}, {'var_4', 'var_6', 'var_7', 'var_9'}]
The features that will be removed from the dataset are stored as well:
tr.features_to_drop_
{'var_6', 'var_7', 'var_8', 'var_9'}
If we now go ahead and print the transformed data, we see that the correlated features have been removed.
print(Xt.head())
var_0 var_1 var_2 var_3 var_4 var_5 var_10 \
0 1.471061 -2.376400 -0.247208 1.210290 -3.247521 0.091527 2.070526
1 1.819196 1.969326 -0.126894 0.034598 -2.910112 -0.186802 1.184820
2 1.625024 1.499174 0.334123 -2.233844 -3.399345 -0.313881 -0.066448
3 1.939212 0.075341 1.627132 0.943132 -4.783124 -0.468041 0.713558
4 1.579307 0.372213 0.338141 0.951526 -3.199285 0.729005 0.398790
var_11
0 -1.989335
1 -1.309524
2 -0.852703
3 0.484649
4 -0.186530
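One way to double-check a result like this is to confirm that no pair of retained features still exceeds the threshold. The sketch below uses a hypothetical helper, `max_abs_offdiag_corr`, and its own small toy dataframe so that it runs on its own; both are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def max_abs_offdiag_corr(df: pd.DataFrame, method: str = "pearson") -> float:
    """Hypothetical helper: largest absolute correlation between two different columns."""
    corr = df.corr(method=method).abs().to_numpy(copy=True)
    # ignore each column's correlation with itself
    np.fill_diagonal(corr, 0.0)
    return float(corr.max())


# toy data (an assumption): "b" is a noisy copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.05 * rng.normal(size=500),
    "c": rng.normal(size=500),
})

before = max_abs_offdiag_corr(toy)
after = max_abs_offdiag_corr(toy.drop(columns=["b"]))
print(f"before: {before:.3f}, after: {after:.3f}")
```

Applied to the transformed data from the example, `max_abs_offdiag_corr(Xt)` would be expected to return a value no greater than the 0.8 threshold.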
More details
In this notebook, we show how to use DropCorrelatedFeatures() with a different relation metric:
All notebooks can be found in a dedicated repository.