.. _psi_selection:

.. currentmodule:: feature_engine.selection

DropHighPSIFeatures
===================

The :class:`DropHighPSIFeatures()` finds and removes features with changes in their
distribution, i.e. "unstable values", from a pandas dataframe. The stability of the
distribution is computed using the **Population Stability Index (PSI)** and all
features having a PSI value above a given threshold are removed.

Unstable features may introduce an additional bias in a model if the training
population significantly differs from the population in production. Removing
features for which a shift in the distribution is suspected leads to more robust
models and therefore to better performance. In the field of Credit Risk modelling,
eliminating features with high PSI is common practice and usually required by the
Regulator.

Population Stability Index - PSI
--------------------------------

The PSI is a measure of how much a population has changed in time or how different
the distributions are between two different population samples.

To determine the PSI, continuous features are sorted into discrete intervals, the
fraction of observations per interval is then determined, and finally those values
are compared between the 2 groups, or as we call them in Feature-engine, between the
basis and test sets, to obtain the PSI.

In other words, the PSI is computed as follows:

- Define the intervals into which the observations will be sorted.
- Sort the feature values into those intervals.
- Determine the fraction of observations within each interval.
- Compute the PSI.

The PSI is determined as:

.. math::

    PSI = \sum_{i=1}^{n} (test_i - basis_i) \cdot \ln\left(\frac{test_i}{basis_i}\right)

where `basis` and `test` are the "reference" and "evaluation" datasets, respectively,
and `i` refers to the interval.

In other words, the PSI determines the difference in the proportion of observations
in each interval, between the reference (aka, original) and test datasets. In the PSI
equation, `n` is the total number of intervals.

Important
~~~~~~~~~

When working with the PSI it is worth highlighting the following:

- The PSI is not symmetric; switching the order of the basis and test dataframes in the PSI calculation will lead to different values.
- The number of bins used to define the distributions has an impact on the PSI values.
- The PSI is a suitable metric for numerical features (i.e., either continuous or with high cardinality).
- For categorical or discrete features, the change in distributions is better assessed with Chi-squared.

Threshold
~~~~~~~~~

Different thresholds can be used to assess the magnitude of the distribution shift
according to the PSI value. The most commonly used thresholds are:

- Below 10%, the variable has not experienced a significant shift.
- Above 25%, the variable has experienced a major shift.
- Between those two values, the shift is intermediate.
- 'auto': the threshold will be calculated based on the size of the basis and test datasets and the number of bins.

When 'auto', the threshold is calculated using the chi2 approximation, proposed by
B. Yurdakul:

.. math::

    threshold = \chi^2_{(q, B-1)} \cdot \left(\frac{1}{N} + \frac{1}{M}\right)

where `q` is the percentile, `B` is the number of bins, `N` is the size of the basis
dataset and `M` is the size of the test dataset. In our implementation, we are using
the 99.9th percentile.

As mentioned above, the number of bins has an impact on the PSI value: with a higher
number of bins it is easier to find divergence in the data, and vice versa. The same
could be said about the dataset size: the more data we have, the harder it is to find
a difference (if the shift is not drastic). This formula tries to capture these
relationships and adjusts the threshold so that feature drift is correctly detected.
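For illustration, here is a minimal sketch, not the library's internal code, of how
this 'auto' threshold can be computed with SciPy, assuming `q=0.999` as stated above
(the function name and signature are ours):

.. code:: python

    # Hedged sketch of the chi2-based 'auto' threshold.
    from scipy.stats import chi2

    def auto_psi_threshold(n_basis, n_test, n_bins, q=0.999):
        # chi2.ppf returns the q-th percentile of a chi2 with B-1 dof
        return chi2.ppf(q, n_bins - 1) * (1 / n_basis + 1 / n_test)

    # e.g., basis and test sets of 500 observations each, 10 bins
    auto_psi_threshold(500, 500, 10)  # approximately 0.11

In practice, you would simply pass `threshold='auto'` to
:class:`DropHighPSIFeatures()` and let the transformer derive this value itself.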
Procedure
---------

To compute the PSI, the :class:`DropHighPSIFeatures()` splits the input dataset in
two: a basis data set (aka the reference data) and a test set. The basis data set is
assumed to contain the expected or original feature distributions. The test set will
be assessed against the basis data set.

In the next step, the interval boundaries are determined based on the features in the
basis or reference data. These intervals can be of equal width, or contain an equal
number of observations.

Next, :class:`DropHighPSIFeatures()` sorts each of the variable values into those
intervals, both in the basis and test datasets, and then determines the proportion
(percentage) of observations within each interval.

Finally, the PSI is determined as indicated in the previous paragraph for each
feature. With the PSI value per feature, :class:`DropHighPSIFeatures()` can now
select the features that are unstable and drop them, based on a threshold.

Splitting the data
------------------

:class:`DropHighPSIFeatures()` allows us to determine how much a feature distribution
has changed in time, or how much it differs between 2 groups.

If we want to evaluate the distribution change in time, we can use a datetime
variable as splitting reference and provide a datetime cut-off as split point.

If we want to compare the distribution change between 2 groups,
:class:`DropHighPSIFeatures()` offers 3 different approaches to split the input
dataframe:

- Based on proportion of observations.
- Based on proportions of unique observations.
- Using a cut-off value.

Proportion of observations
~~~~~~~~~~~~~~~~~~~~~~~~~~

Splitting by proportion of observations will result in a certain proportion of
observations allocated to either the reference or the test dataset. For example, if
we set `split_frac=0.75`, then 75% and 25% of the observations will be put into the
reference and test data, respectively.

If we select this method, we can pass a variable in the parameter `split_col` or
leave it to None.

Note that the data split is not done at random, but instead guided by the values in
the reference variable indicated in `split_col`. Under the hood, the reference
variable indicated in `split_col` is ordered, and the percentage of observations is
determined with NumPy quantile. This means that the observations with smaller values
in `split_col` will land in the reference dataset, and those with bigger values will
go to the test set.

If the rows in your dataset are sorted in time, this could be a good default option
to split the dataframe in 2 and compute the PSI. This will for example be the case if
your data set contains daily (or any other frequency) sales information on a
company's products.
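As an illustration of this mechanism, the split point for a 75/25 split over a sorted
reference column could be derived like this (a sketch of the idea, not the library's
code):

.. code:: python

    # Toy illustration: smaller values go to the basis set, larger
    # values to the test set, with the split point from a quantile.
    import numpy as np

    reference = np.arange(500)  # e.g., a sorted dataframe index
    cut_off = np.quantile(reference, 0.75)

    basis = reference[reference <= cut_off]
    test = reference[reference > cut_off]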
Proportions of unique observations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If we split based on proportions of unique observations, it is important that we
indicate which column we want to use as reference in the `split_col` parameter, to
make a meaningful split. If we leave this to None, :class:`DropHighPSIFeatures()`
will use the dataframe index as reference. This makes sense only if the index in the
dataframe has meaningful values.

:class:`DropHighPSIFeatures()` will first identify the unique values of the variable
in `split_col`. Then it will put a certain proportion of those values into the
reference dataset and the remaining ones into the test dataset. The proportion is
indicated in the parameter `split_frac`.

Under the hood, :class:`DropHighPSIFeatures()` will sort the unique values of the
reference variable, and then use NumPy quantiles to determine the fraction that
should be allocated to the reference and test sets. Thus, it is important to keep in
mind that the order of the unique values matters for the split.

This split makes sense when we have, for example, unique customer identifiers and
multiple rows per customer in the dataset. We want to make sure that all rows
belonging to the same customer are allocated to either the reference or the test
data, but the same customer cannot be in both data sets. This way of splitting the
data will also ensure that we have a certain percentage of customers, indicated in
`split_frac`, in either data set after the split. Thus, if `split_frac=0.6` and
`split_distinct=True`, :class:`DropHighPSIFeatures()` will send the first 60% of
customers to the reference data set, and the remaining 40% to the test set. And it
will ensure that rows belonging to the same customer are in only one of the 2 data
sets.

Using a cut-off value
~~~~~~~~~~~~~~~~~~~~~

We have the option to pass a reference variable to split the dataframe using
`split_col`, and also a cut-off value in the `cut_off` parameter. The cut-off value
can be a number (integer or float), a date, or a list of values; the three flavours
are sketched in a short code example further below.

If we pass a datetime column in `split_col` and a datetime value in `cut_off`, we can
split the data in a temporal manner. Observations collected before the time indicated
will be sent to the reference dataframe, and the remaining ones to the test set.

If we pass a list of values in `cut_off`, all observations whose values are included
in the list will go into the reference data set, and the remaining ones to the test
set. This split is useful if we have a categorical variable indicating a portfolio
from which the observations have been collected. For example, if we set
`split_col='portfolio'` and `cut_off=['port_1', 'port_2']`, all observations that
belong to the first and second portfolio will be sent to the reference data set, and
the observations from other portfolios to the test set.

Finally, if we pass a number to `cut_off`, all observations whose value in the
variable indicated in `split_col` is <= cut-off will be sent to the reference data
set, and the rest to the test set. This can be useful for example when dates are
defined as integers (for example 20200411), or when using an ordinal customer
segmentation to split the dataframe (1: retail customers, 2: private banking
customers, 3: SME and 4: Wholesale).

split_col
~~~~~~~~~

To split the data set, we recommend that you indicate which column you want to use as
reference in the `split_col` parameter. If you don't, the split will be done based on
the values of the dataframe index. This might be a good option if the index contains
meaningful values, or if splitting just based on `split_frac`.
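To summarise the splitting options above, here is a sketch of the three `cut_off`
flavours; the column names (`date`, `portfolio`, `segment`) are illustrative, not
from a real dataset:

.. code:: python

    # Illustrative constructor calls only.
    import datetime
    from feature_engine.selection import DropHighPSIFeatures

    # temporal split: dates up to the cut-off go to the reference set
    t1 = DropHighPSIFeatures(split_col='date', cut_off=datetime.date(2020, 4, 11))

    # categorical split: listed portfolios go to the reference set
    t2 = DropHighPSIFeatures(split_col='portfolio', cut_off=['port_1', 'port_2'])

    # numerical split: values <= 2 go to the reference set
    t3 = DropHighPSIFeatures(split_col='segment', cut_off=2)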
Examples
--------

The versatility of the class lies in the different options to split the input
dataframe in a reference or basis data set with the "expected" distributions, and a
test set which will be evaluated against the reference.

After splitting the data, :class:`DropHighPSIFeatures()` goes ahead and compares the
feature distributions in both data sets by computing the PSI.

To illustrate how to best use :class:`DropHighPSIFeatures()` depending on your data,
we provide various examples illustrating the different possibilities.

Case 1: split data based on proportions (split_frac)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this case, :class:`DropHighPSIFeatures()` will split the dataset in 2, based on
the indicated proportion. The proportion is indicated in the `split_frac` parameter.
You have the option to select a variable in `split_col` or leave it to None. In the
latter case, the dataframe index will be used to split.

Let's first create a toy dataframe containing 5 random variables and 1 variable with
a shift in its distribution (*var_3* in this case).

.. code:: python

    import pandas as pd
    import seaborn as sns

    from sklearn.datasets import make_classification
    from feature_engine.selection import DropHighPSIFeatures

    # Create a dataframe with 500 observations and 6 random variables
    X, y = make_classification(
        n_samples=500,
        n_features=6,
        random_state=0
    )

    colnames = ["var_" + str(i) for i in range(6)]
    X = pd.DataFrame(X, columns=colnames)

    # Add a shift to the second half of var_3 (using .loc avoids
    # pandas' chained-assignment warning).
    X.loc[250:, 'var_3'] = X.loc[250:, 'var_3'] + 1

The default approach in :class:`DropHighPSIFeatures()` is to split the input
dataframe `X` in two equally sized data sets. You can adjust the proportions by
changing the value of the `split_frac` parameter.

For example, let's split the input dataframe into a reference data set containing
60% of the observations and a test set containing 40% of the observations.

.. code:: python

    # Remove the features with high PSI values using a 60-40 split.
    transformer = DropHighPSIFeatures(split_frac=0.6)
    transformer.fit(X)

The value of `split_frac` tells :class:`DropHighPSIFeatures()` to split X according
to a 60% - 40% ratio. The `fit()` method performs the split of the dataframe and the
calculation of the PSI.

Because we created random variables, these features will have low PSI values (i.e.,
no distribution change). However, we manually added a distribution shift in the
variable *var_3* and therefore expect the PSI for this particular feature to be above
the 0.25 PSI threshold.

The PSI values are accessible through the `psi_values_` attribute:

.. code:: python

    transformer.psi_values_

The analysis of the PSI values below shows that only feature 3 (called `var_3`) has a
PSI above the 0.25 threshold (default value) and will be removed by the `transform`
method.

.. code:: python

    {'var_0': 0.07405459925568803,
     'var_1': 0.09124093185820083,
     'var_2': 0.16985790067687764,
     'var_3': 1.342485289730313,
     'var_4': 0.0743442762545251,
     'var_5': 0.06809060587241555}

From the output, we see that the PSI value for *var_0* is around 7%. This means that,
when comparing the first 300 and the last 200 observations of the dataframe, there is
only a small difference in the distribution of the *var_0* feature. A similar
conclusion applies to *var_1, var_2, var_4* and *var_5*. Looking at the PSI value for
*var_3*, we see that it exceeds by far the 0.25 threshold. We can then conclude that
the population of this feature has shifted, and it is wise not to include it in the
feature set for modelling.

The cut-off value used to split the dataframe is stored in the `cut_off_` attribute:

.. code:: python

    transformer.cut_off_

This yields the following answer:

.. code:: python

    299.4

The value of 299.4 means that observations with index from 0 to 299 are used to
define the basis data set. This corresponds to 60% (300 / 500) of the original
dataframe `X`. The value of 299.4 may seem strange, because it is not one of the
values present in the index of the dataframe. Intuitively, we would expect the
cut-off to be an integer in the present case. However, the cut-off is computed with
quantiles, and the quantiles are computed using interpolation.
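We can reproduce this value ourselves; a quick check, assuming the split point is a
NumPy quantile over the index, as described in the section on splitting the data:

.. code:: python

    import numpy as np

    # 60th percentile of a 0..499 index, with linear interpolation
    np.quantile(np.arange(500), 0.6)  # returns 299.4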
Splitting with proportions will order the index or the reference column first, and
then determine the data that will go into each dataframe. In other words, the order
of the index or the variable indicated in `split_col` matters. Observations with the
lowest values will be sent to the basis dataframe, and the ones with the highest
values to the test set.

The `features_to_drop_` attribute provides the list with the features to be dropped
when executing the `transform` method. The command

.. code:: python

    transformer.features_to_drop_

yields the following result:

.. code:: python

    ['var_3']

That the *var_3* feature is dropped during the procedure is illustrated when looking
at the columns of the `X_transformed` dataframe.

.. code:: python

    X_transformed = transformer.transform(X)

    X_transformed.columns

    Index(['var_0', 'var_1', 'var_2', 'var_4', 'var_5'], dtype='object')

:class:`DropHighPSIFeatures()` also contains a `fit_transform` method that combines
the `fit` and the `transform` methods.

The difference in distribution between a non-shifted and a shifted distribution is
clearly visible when plotting the cumulative density function.

For the shifted variable:

.. code:: python

    X['above_cut_off'] = X.index > transformer.cut_off_
    sns.ecdfplot(data=X, x='var_3', hue='above_cut_off')

and for a non-shifted variable (for example *var_1*):

.. code:: python

    sns.ecdfplot(data=X, x='var_1', hue='above_cut_off')

.. image:: ../../images/PSI_distribution_case1.png
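Before moving on, note that instead of the default 0.25 threshold, we can let the
transformer derive the chi-squared based threshold discussed in the Threshold
section. A minimal sketch on the toy dataframe from this example:

.. code:: python

    # 'auto' derives the threshold from the dataset sizes and the
    # number of bins; we drop the plotting helper column first.
    transformer = DropHighPSIFeatures(split_frac=0.6, threshold='auto')
    X_transformed = transformer.fit_transform(X.drop(columns=['above_cut_off']))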
Case 2: split data based on variable (numerical cut_off)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the previous example, we wanted to split the input dataframe in 2 datasets, with
the reference dataset containing 60% of the observations. We let
:class:`DropHighPSIFeatures()` find the cut-off to achieve this.

We can instead provide ourselves the numerical cut-off that determines which
observations will go to the reference or basis data set, and which to the test set.
Using the `cut_off` parameter, we can define the specific threshold for the split.

A real life example for this case is the use of the customer ID or contract ID to
split the dataframe. These IDs often increase in value over time, which justifies
their use to assess distribution shifts in the features.

Let's create a toy dataframe representing the customers' characteristics of a
company. This dataset contains six random variables (in real life these are variables
like age or postal code), the seniority of the customer (i.e. the number of months
since the start of the relationship between the customer and the company) and the
customer ID (i.e. the number (integer) used to identify the customer). Generally, the
customer ID grows over time, which means that early customers have a lower customer
ID than late customers.

From the definition of the variables, we expect the *seniority* to decrease with the
customer ID and therefore to have a high PSI value when comparing early and late
customers.

.. code:: python

    import pandas as pd
    from sklearn.datasets import make_classification
    from feature_engine.selection import DropHighPSIFeatures

    X, y = make_classification(
        n_samples=500,
        n_features=6,
        random_state=0
    )

    colnames = ["var_" + str(i) for i in range(6)]
    X = pd.DataFrame(X, columns=colnames)

    # Let's add a variable for the customer ID
    X['customer_id'] = [customer_id for customer_id in range(1, 501)]

    # Add a column with the seniority... that is related to the customer ID
    X['seniority'] = 100 - X['customer_id'] // 10

    transformer = DropHighPSIFeatures(split_col='customer_id', cut_off=250)
    transformer.fit(X)

In this case, :class:`DropHighPSIFeatures()` will allocate to the basis or reference
data set all observations whose values in `customer_id` are <= 250. The test
dataframe contains the remaining observations.

The method `fit()` will determine the PSI values, which are stored in the class:

.. code:: python

    transformer.psi_values_

We see that :class:`DropHighPSIFeatures()` does not provide any PSI value for the
`customer_id` feature, because this variable was used as a reference to split the
data.

.. code:: python

    {'var_0': 0.07385590683974477,
     'var_1': 0.061155637727757485,
     'var_2': 0.1736694458621651,
     'var_3': 0.044965387331530465,
     'var_4': 0.0904519893659045,
     'var_5': 0.027545195437270797,
     'seniority': 7.8688986006052035}

.. code:: python

    transformer.features_to_drop_

gives

.. code:: python

    ['seniority']

Executing the dataframe transformation leads to the exclusion of the *seniority*
feature, but not to the exclusion of the *customer_id*.

.. code:: python

    X_transformed = transformer.transform(X)

    X_transformed.columns

    Index(['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'customer_id'],
          dtype='object')

Case 3: split data based on time (date as cut_off)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:class:`DropHighPSIFeatures()` can handle different types of `split_col` variables.
The following case illustrates how it works with a date variable. In fact, we often
want to determine whether the distribution of a feature changes in time, for example
after a certain event like the start of the Covid-19 pandemic.

This is how to do it. Let's create a toy dataframe with 6 random numerical variables
and two date variables. One will be used to specify the split of the dataframe, while
the second one is expected to have a high PSI value.

.. code:: python

    import pandas as pd
    import seaborn as sns

    from datetime import date
    from sklearn.datasets import make_classification
    from feature_engine.selection import DropHighPSIFeatures

    X, y = make_classification(
        n_samples=1000,
        n_features=6,
        random_state=0
    )

    colnames = ["var_" + str(i) for i in range(6)]
    X = pd.DataFrame(X, columns=colnames)

    # Add two time variables to the dataframe
    X['time'] = [date(year, 1, 1) for year in range(1000, 2000)]
    X['century'] = X['time'].apply(lambda x: ((x.year - 1) // 100) + 1)

    # Let's shuffle the dataframe and reset the index to remove the correlation
    # between the index and the time variables.
    X = X.sample(frac=1).reset_index(drop=True)

Dropping features with high PSI values when comparing two periods of time is done
simply by providing the name of the column with the date and a cut-off date. In the
example below, the PSI calculations will be done comparing the periods up to the
French revolution and after it.
.. code:: python

    transformer = DropHighPSIFeatures(split_col='time', cut_off=date(1789, 7, 14))
    transformer.fit(X)

**Important**: if the date variable is in pandas or NumPy datetime format, you may
need to pass the cut_off value as `pd.to_datetime('1789-07-14')`.

The PSI values show that the *century* variable is unstable, as its value is above
the 0.25 threshold.

.. code:: python

    transformer.psi_values_

    {'var_0': 0.0181623637463045,
     'var_1': 0.10595496570984747,
     'var_2': 0.05425659114295842,
     'var_3': 0.09720689210928271,
     'var_4': 0.07917647542638032,
     'var_5': 0.10122468631060424,
     'century': 8.272395772368412}

The class has correctly identified the feature to be dropped.

.. code:: python

    transformer.features_to_drop_

    ['century']

And the transform method correctly removes the feature.

.. code:: python

    X_transformed = transformer.transform(X)

    X_transformed.columns

    Index(['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'time'],
          dtype='object')

The difference in distribution between a non-shifted and a shifted distribution is
clearly visible when plotting the cumulative density function for each of the groups.

We can plot the cumulative distribution of the shifted variable like this:

.. code:: python

    X['above_cut_off'] = X.time > pd.to_datetime(transformer.cut_off_)
    sns.ecdfplot(data=X, x='century', hue='above_cut_off')

and the distribution of a non-shifted variable, for example *var_2*, like this:

.. code:: python

    sns.ecdfplot(data=X, x='var_2', hue='above_cut_off')

And below we can compare both plots:

.. image:: ../../images/PSI_distribution_case3.png
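As a closing note on the important box above: if the *time* column held pandas
datetime values rather than `datetime.date` objects (an assumption; our toy column
holds dates), the cut-off would be passed as a timestamp, for example:

.. code:: python

    # Sketch: cut-off as a pandas Timestamp for a datetime64 column.
    transformer = DropHighPSIFeatures(
        split_col='time',
        cut_off=pd.to_datetime('1789-07-14'),
    )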
Case 4: split data based on a categorical variable (category or list as cut_off)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:class:`DropHighPSIFeatures()` can also split the original dataframe based on a
categorical variable. The cut-off can then be defined in two ways:

- Using a single string.
- Using a list of values.

In the first case, the column with the categorical variable is sorted alphabetically,
and the split is determined by the cut-off. We recommend being very careful when
using a single category as cut-off, because alphabetical sorting in combination with
a cut-off does not always provide obvious results. In other words, for this way of
splitting the data to be meaningful, the alphabetical order of the categories in the
reference variable should have an intrinsic meaning.

A better approach for splitting the data based on a categorical variable is to pass a
list with the values of the variable that we want in the reference dataframe. A real
life example for this case is the computation of the PSI between different customer
segments like 'Retail', 'SME' or 'Wholesale'. In this case, if we indicate ['Retail']
as cut-off, observations for Retail will be sent to the basis data set, and those for
'SME' and 'Wholesale' will be added to the test set.

Split passing a category value
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let's show how to set up the transformer in this case. The example data set contains
6 random variables, a categorical variable with the labels of the different
categories, and 2 category related features.

.. code:: python

    import pandas as pd
    import seaborn as sns

    from sklearn.datasets import make_classification
    from feature_engine.selection import DropHighPSIFeatures

    X, y = make_classification(
        n_samples=1000,
        n_features=6,
        random_state=0
    )

    colnames = ["var_" + str(i) for i in range(6)]
    X = pd.DataFrame(X, columns=colnames)

    # Add a categorical column
    X['group'] = ["A", "B", "C", "D", "E"] * 200

    # And two category related features
    X['group_means'] = X.group.map({"A": 1, "B": 2, "C": 0, "D": 1.5, "E": 2.5})
    X['shifted_feature'] = X['group_means'] + X['var_2']

We can define a simple cut-off value (for example the letter C). In this case,
observations with values up to and including C in alphabetical order (i.e., the
groups A, B and C) will be allocated to the reference data set.

.. code:: python

    transformer = DropHighPSIFeatures(split_col='group', cut_off='C')
    X_transformed = transformer.fit_transform(X)

The PSI values are provided in the `psi_values_` attribute:

.. code:: python

    transformer.psi_values_

    {'var_0': 0.06485778974895254,
     'var_1': 0.03605540598761757,
     'var_2': 0.040632784917352296,
     'var_3': 0.023845405645510645,
     'var_4': 0.028007185972248064,
     'var_5': 0.07009152672971862,
     'group_means': 6.601444547497699,
     'shifted_feature': 0.48428009522119164}

From these values, we see that the last 2 features should be removed. We can
corroborate that in the `features_to_drop_` attribute:

.. code:: python

    transformer.features_to_drop_

    ['group_means', 'shifted_feature']

And these columns are removed from the original dataframe by the transform method
that, in the present case, was applied through the `fit_transform` method a couple of
code blocks above.

.. code:: python

    X_transformed.columns

    Index(['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'group'],
          dtype='object')

Split passing a list of categories
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Instead of passing a single category value, we can pass a list of values to
`cut_off`. Using the same data set, let's set up :class:`DropHighPSIFeatures()` to
split the dataframe according to the list ['A', 'C', 'E'] of the categorical variable
*group*.

In this case, the PSI's will be computed by comparing two dataframes: the first one
containing only the values A, C and E for the *group* variable, and the second one
containing only the values B and D.

.. code:: python

    trans = DropHighPSIFeatures(split_col='group', cut_off=['A', 'C', 'E'])
    X_no_drift = trans.fit_transform(X)

.. code:: python

    trans.psi_values_

    {'var_0': 0.04322345673014104,
     'var_1': 0.03534439253617049,
     'var_2': 0.05220272785661243,
     'var_3': 0.04550964862452317,
     'var_4': 0.04492720670343145,
     'var_5': 0.044886435640028144,
     'group_means': 6.601444547497699,
     'shifted_feature': 0.3683642099948127}

Here again, the transformer will remove the *group_means* and the *shifted_feature*
columns from the dataframe.

.. code:: python

    trans.features_to_drop_

    ['group_means', 'shifted_feature']

And these columns are removed from the original dataframe by the transform method,
which was applied through the `fit_transform` method:

.. code:: python

    X_no_drift.columns

    Index(['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'group'],
          dtype='object')
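Recall from the Important box at the top that the PSI is not symmetric. We can see
this by swapping which categories define the basis set; a quick sketch on the same
data:

.. code:: python

    # ['B', 'D'] now defines the basis set, so basis and test are
    # swapped relative to the previous split; the PSI values differ.
    trans_swapped = DropHighPSIFeatures(split_col='group', cut_off=['B', 'D'])
    trans_swapped.fit(X)
    trans_swapped.psi_values_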
In the following plots, we can compare the distribution of a feature with high PSI
and one with low PSI, in the different categories of the categorical variable.

With this code we plot the cumulative distribution of a feature whose distribution
differs among the categories of the variable:

.. code:: python

    sns.ecdfplot(data=X, x='shifted_feature', hue='group')

With this code we plot the cumulative distribution of a feature whose distribution is
the same across the categories of the variable:

.. code:: python

    sns.ecdfplot(data=X, x='var_0', hue='group')

And below we can compare the plots of both features:

.. image:: ../../images/PSI_distribution_case4.png

Case 5: split data based on unique values (split_distinct)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A variant of the previous example is the use of the `split_distinct` functionality.
In this case, the split is not done based on the number of observations in
`split_col`, but on the number of distinct values of the reference variable indicated
in `split_col`.

A real life example for this case is when dealing with groups of different sizes,
like customer income classes ('1000', '2000', '3000', '4000', ...). `split_distinct`
allows to control the number of classes in the basis and test dataframes, regardless
of the number of observations in each class.

This case is illustrated in the toy data below. The data set contains 6 random
variables and 1 income variable that is larger for one of the 6 groups defined (the F
group).

.. code:: python

    import numpy as np
    import pandas as pd
    import seaborn as sns

    from sklearn.datasets import make_classification
    from feature_engine.selection import DropHighPSIFeatures

    X, y = make_classification(
        n_samples=1000,
        n_features=6,
        random_state=0
    )

    colnames = ["var_" + str(i) for i in range(6)]
    X = pd.DataFrame(X, columns=colnames)

    # Add a categorical column
    X['group'] = ["A", "B", "C", "D", "E"] * 100 + ["F"] * 500

    # And an income variable that is category dependent.
    np.random.seed(0)
    X['income'] = np.random.uniform(1000, 2000, 500).tolist() + np.random.uniform(1250, 2250, 500).tolist()

    # Shuffle the dataframe to make the dataset more like a real life case.
    X = X.sample(frac=1).reset_index(drop=True)

The `group` column contains 500 observations in the (A, B, C, D, E) groups and 500 in
the (F) group. When we pass `split_distinct=True` when initializing the
`DropHighPSIFeatures` object, the two dataframes used to compute the PSI will contain
the same number of **unique** values in the `group` column (i.e., one dataframe will
contain the 300 rows associated with groups A, B and C, while the other will contain
the 700 rows associated with groups D, E and F).

.. code:: python

    transformer = DropHighPSIFeatures(split_col='group', split_distinct=True)
    transformer.fit(X)

    transformer.psi_values_

This yields the following PSI values:

.. code:: python

    {'var_0': 0.014825303242393804,
     'var_1': 0.03818316821350485,
     'var_2': 0.029635981271458896,
     'var_3': 0.021700399485890084,
     'var_4': 0.061194837255216114,
     'var_5': 0.04119583769297253,
     'income': 0.46191580731264914}

And we can find the feature that will be dropped, *income*, here:

.. code:: python

    transformer.features_to_drop_

    ['income']

This feature will be removed from the dataset when calling the `transform()` method.

.. code:: python

    X_transformed = transformer.transform(X)

    X_transformed.columns

    Index(['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'group'],
          dtype='object')

The difference in distribution between a non-shifted and a shifted distribution is
clearly visible when plotting the cumulative density function for each of the groups.

For the shifted variable (income):

.. code:: python

    sns.ecdfplot(data=X, x='income', hue='group')

and for a non-shifted variable (for example *var_4*):
.. code:: python

    sns.ecdfplot(data=X, x="var_4", hue="group")

.. image:: ../../images/PSI_distribution_case5.png

Additional resources
--------------------

In this notebook, we show how to use :class:`DropHighPSIFeatures` on a real dataset
and give more detail about the underlying basis and test sub-dataframes used to
determine the PSI.

- `Jupyter notebook `_

All notebooks can be found in a `dedicated repository `_.

For more details about this and other feature selection methods check out these
resources:

.. figure:: ../../images/fsml.png
   :width: 300
   :figclass: align-center
   :align: left
   :target: https://www.trainindata.com/p/feature-selection-for-machine-learning

   Feature Selection for Machine Learning

|
|
|
|
|
|
|
|
|
|

Or read our book:

.. figure:: ../../images/fsmlbook.png
   :width: 200
   :figclass: align-center
   :align: left
   :target: https://leanpub.com/feature-selection-in-machine-learning

   Feature Selection in Machine Learning

|
|
|
|
|
|
|
|
|
|
|
|
|
|

Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of
Feature-engine.