.. _outlier_trimmer:

.. currentmodule:: feature_engine.outliers

OutlierTrimmer
==============

Outliers are data points that significantly deviate from the rest of the dataset, potentially
indicating errors or rare occurrences. Outliers can distort the learning process of machine
learning models by skewing parameter estimates and reducing predictive accuracy. To prevent
this, if you suspect that the outliers are errors or rare occurrences, you can remove them
from the training data.

In this guide, we show how to remove outliers in Python using the :class:`OutlierTrimmer()`.

The first step in removing outliers consists of identifying them. Outliers can be identified
through various statistical methods, such as box plots, z-scores, the interquartile range
(IQR), or the median absolute deviation. Additionally, visual inspection of the data using
scatter plots or histograms is common practice in data science and can help detect
observations that significantly deviate from the overall pattern of the dataset.

The :class:`OutlierTrimmer()` can identify outliers by using all of these methods and then
remove them automatically. Hence, we'll begin this guide with data analysis, showing how we
can identify outliers through these statistical methods and boxplots, and then we will remove
outliers by using the :class:`OutlierTrimmer()`.

Identifying outliers
--------------------

Outliers are data points that are far greater, or far smaller, than the values where most of
the observations in the distribution lie. The minimum and maximum values that delimit the data
distribution can be calculated in 4 ways: by using the z-score if the variable is normally
distributed, by using the interquartile range proximity rule or the median absolute deviation
if the variables are skewed, or by using percentiles.

Gaussian limits or z-score
~~~~~~~~~~~~~~~~~~~~~~~~~~

If the variable shows a normal distribution, most of its values lie between the mean minus 3
times the standard deviation and the mean plus 3 times the standard deviation. Hence, we can
determine the limits of the distribution as follows:

- right tail (upper_bound): mean + 3 * std
- left tail (lower_bound): mean - 3 * std

We can consider outliers those data points that lie beyond these limits.

Interquartile range proximity rule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The interquartile range proximity rule can be used to detect outliers both in variables that
show a normal distribution and in variables with a skew. When using the IQR, we flag as
outliers those values that lie below the 25th percentile minus a factor of the IQR, or above
the 75th percentile plus a factor of the IQR. This factor is normally 1.5, or 3 if we want to
be more stringent. With the IQR method, the limits are calculated as follows:

IQR limits:

- right tail (upper_limit): 75th quantile + 3 * IQR
- left tail (lower_limit): 25th quantile - 3 * IQR

where IQR is the inter-quartile range:

- IQR = 75th quantile - 25th quantile = third quartile - first quartile.

Observations found beyond those limits can be considered extreme values.
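To see what these two rules look like in code, here is a minimal sketch that computes the
Gaussian and IQR limits by hand with pandas. The Series `x` and its synthetic values are
illustrative assumptions made for this example only:

.. code:: python

    import numpy as np
    import pandas as pd

    # illustrative, right-skewed variable (an assumption for this sketch)
    rng = np.random.default_rng(0)
    x = pd.Series(rng.lognormal(mean=3, sigma=0.7, size=1000), name="x")

    # Gaussian (z-score) limits: mean +/- 3 standard deviations
    lower_gauss = x.mean() - 3 * x.std()
    upper_gauss = x.mean() + 3 * x.std()

    # IQR proximity rule limits, using a factor of 1.5
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    lower_iqr = q1 - 1.5 * iqr
    upper_iqr = q3 + 1.5 * iqr

    print(f"Gaussian limits: {lower_gauss:.2f} to {upper_gauss:.2f}")
    print(f"IQR limits: {lower_iqr:.2f} to {upper_iqr:.2f}")

Data points that fall outside these intervals would be flagged as outliers by the
corresponding rule.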
Median absolute deviation
~~~~~~~~~~~~~~~~~~~~~~~~~

Parameters like the mean and the standard deviation are strongly affected by the presence of
outliers. Therefore, it might be a better solution to use a metric that is robust against
outliers, like the median absolute deviation from the median, commonly shortened to the median
absolute deviation (MAD), to delimit the normal data distribution.

When we use MAD, we determine the limits of the distribution as follows:

MAD limits:

- right tail (upper_limit): median + 3 * MAD
- left tail (lower_limit): median - 3 * MAD

MAD is the median absolute deviation from the median. In other words, MAD is the median value
of the absolute difference between each observation and the variable's median:

- MAD = median(abs(X - median(X)))

Percentiles
~~~~~~~~~~~

A simpler way to determine the values that delimit the data distribution is by using
percentiles. In this case, outlier values are those that lie below or above a given percentile
or quantile, for example:

- right tail: 95th percentile
- left tail: 5th percentile
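As with the previous sketch, the MAD and percentile limits can be computed by hand; the
Series `x` below is again an illustrative assumption:

.. code:: python

    import numpy as np
    import pandas as pd

    # illustrative variable, as in the previous sketch
    rng = np.random.default_rng(0)
    x = pd.Series(rng.lognormal(mean=3, sigma=0.7, size=1000), name="x")

    # MAD limits: median +/- 3 * median absolute deviation from the median
    median = x.median()
    mad = (x - median).abs().median()
    lower_mad = median - 3 * mad
    upper_mad = median + 3 * mad

    # percentile limits: 5th and 95th percentiles
    lower_pct = x.quantile(0.05)
    upper_pct = x.quantile(0.95)

    print(f"MAD limits: {lower_mad:.2f} to {upper_mad:.2f}")
    print(f"Percentile limits: {lower_pct:.2f} to {upper_pct:.2f}")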
The number of outliers identified by each of these methods will vary. These methods detect
outliers, but they cannot decide whether they are true outliers or faithful data points; that
requires further examination and domain knowledge.

Let's move on to removing outliers in Python.

Remove outliers in Python
-------------------------

In this demo, we'll identify and remove outliers from the Titanic Dataset. First, let's load
the data and separate it into train and test sets:

.. code:: python

    from sklearn.model_selection import train_test_split

    from feature_engine.datasets import load_titanic
    from feature_engine.outliers import OutlierTrimmer

    X, y = load_titanic(
        return_X_y_frame=True,
        predictors_only=True,
        handle_missing=True,
    )

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.3,
        random_state=0,
    )

    print(X_train.head())

We see the resulting pandas dataframe below:

.. code:: python

          pclass     sex        age  sibsp  parch     fare    cabin embarked
    501        2  female  13.000000      0      1  19.5000  Missing        S
    588        2  female   4.000000      1      1  23.0000  Missing        S
    402        2  female  30.000000      1      0  13.8583  Missing        C
    1193       3    male  29.881135      0      0   7.7250  Missing        Q
    686        3  female  22.000000      0      0   7.7250  Missing        Q

Identifying outliers
~~~~~~~~~~~~~~~~~~~~

Let's now identify potential extreme values in the training set by using boxplots:

.. code:: python

    import matplotlib.pyplot as plt

    X_train.boxplot(column=['age', 'fare', 'sibsp'])
    plt.title("Box plot - outliers")
    plt.ylabel("variable values")
    plt.show()

In the following boxplots, we see that all three variables have data points that are
significantly greater than the majority of the data distribution. The variable age also shows
outlier values towards the lower end of its range.

.. figure:: ../../images/boxplot-titanic.png
   :align: center

The variables have different scales, so let's plot them individually for better visualization.
Let's start by making a boxplot of the variable fare:

.. code:: python

    X_train.boxplot(column=['fare'])
    plt.title("Box plot - outliers")
    plt.ylabel("variable values")
    plt.show()

We see the boxplot in the following image:

.. figure:: ../../images/boxplot-fare.png
   :align: center

Next, we plot the variable age:

.. code:: python

    X_train.boxplot(column=['age'])
    plt.title("Box plot - outliers")
    plt.ylabel("variable values")
    plt.show()

We see the boxplot in the following image:

.. figure:: ../../images/boxplot-age.png
   :align: center

And finally, we make a boxplot of the variable sibsp:

.. code:: python

    X_train.boxplot(column=['sibsp'])
    plt.title("Box plot - outliers")
    plt.ylabel("variable values")
    plt.show()

We see the boxplot and the outlier values in the following image:

.. figure:: ../../images/boxplot-sibsp.png
   :align: center

Outlier removal
~~~~~~~~~~~~~~~

Now, we will use the :class:`OutlierTrimmer()` to remove outliers. We'll start by using the
IQR as the outlier detection method.

IQR
^^^

We want to remove outliers at the right side of the distribution only (param `tail`). We want
the maximum values to be determined using the 75th quantile of the variable (param
`capping_method`) plus 1.5 times the IQR (param `fold`). And we only want to remove outliers
from 2 variables, which we indicate in a list:

.. code:: python

    ot = OutlierTrimmer(capping_method='iqr',
                        tail='right',
                        fold=1.5,
                        variables=['sibsp', 'fare'],
                        )

    ot.fit(X_train)

With `fit()`, the :class:`OutlierTrimmer()` finds the values beyond which observations are
considered outliers and will be removed. These values are stored in one of its attributes:

.. code:: python

    ot.right_tail_caps_

.. code:: python

    {'sibsp': 2.5, 'fare': 66.34379999999999}

We can now go ahead and remove the outliers:

.. code:: python

    train_t = ot.transform(X_train)
    test_t = ot.transform(X_test)

We can compare the sizes of the original and transformed datasets to check that the outliers
were removed:

.. code:: python

    X_train.shape, train_t.shape

We see that the transformed dataset contains fewer rows:

.. code:: python

    ((916, 8), (764, 8))

If we now evaluate the maximum of the variables in the transformed datasets, they should be
<= the values stored in the attribute `right_tail_caps_`:

.. code:: python

    train_t[['fare', 'age']].max()

.. code:: python

    fare    65.0
    age     53.0
    dtype: float64

Finally, we can check the boxplots of the transformed variables to corroborate the effect on
their distribution:

.. code:: python

    train_t.boxplot(column=['sibsp', "fare"])
    plt.title("Box plot - outliers")
    plt.ylabel("variable values")
    plt.show()

In the boxplot we see that `sibsp` no longer has outliers. However, as `fare` was very skewed,
removing outliers changes the parameters of the IQR, and we continue to see outliers:

.. figure:: ../../images/boxplot-sibsp-fare-iqr.png
   :align: center

We'll come back to this later, but for now let's continue exploring the functionality of the
:class:`OutlierTrimmer()`.

When we remove outliers from the datasets, we then need to re-align the target variable. We
can do this with pandas `loc`.
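For example, a minimal `loc`-based sketch, assuming the fitted `ot` and the `X_train` and
`y_train` objects from the previous steps, could look like this:

.. code:: python

    # remove the outliers from the predictors, then keep only the target rows
    # whose index survived the trimming
    train_t = ot.transform(X_train)
    y_train_t = y_train.loc[train_t.index]

    print(train_t.shape, y_train_t.shape)

This works because `transform()` preserves the original index of the remaining rows.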
Alternatively, the :class:`OutlierTrimmer()` can do this automatically as follows:

.. code:: python

    train_t, y_train_t = ot.transform_x_y(X_train, y_train)
    test_t, y_test_t = ot.transform_x_y(X_test, y_test)

The method `transform_x_y` will remove outliers from the predictor datasets and then align the
target variable. That means, it will remove from the target those rows that correspond to the
outlier values. We can corroborate the size adjustment in the target as follows:

.. code:: python

    y_train.shape, y_train_t.shape

The previous command returns the following output:

.. code:: python

    ((916,), (764,))

We can obtain the names of the features in the transformed dataset as follows:

.. code:: python

    ot.get_feature_names_out()

That returns the following variable names:

.. code:: python

    ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'cabin', 'embarked']

MAD
^^^

We saw that the IQR did not work well for the variable fare, because its skew is too big. So
let's remove outliers by using the MAD instead:

.. code:: python

    ot = OutlierTrimmer(capping_method='mad',
                        tail='right',
                        fold=3,
                        variables=['fare'],
                        )

    ot.fit(X_train)

    train_t, y_train_t = ot.transform_x_y(X_train, y_train)
    test_t, y_test_t = ot.transform_x_y(X_test, y_test)

    train_t.boxplot(column=["fare"])
    plt.title("Box plot - outliers")
    plt.ylabel("variable values")
    plt.show()

In the following image, we see that after this transformation, the variable fare no longer
shows outlier values:

.. figure:: ../../images/boxplot-fare-mad.png
   :align: center

Z-score
^^^^^^^

The variable age is more homogeneously distributed across its value range, so let's use the
z-score or gaussian approximation to detect outliers. We saw in the boxplot that it has
outliers at both ends, so we'll trim both ends of the distribution:

.. code:: python

    ot_age = OutlierTrimmer(capping_method='gaussian',
                            tail="both",
                            fold=3,
                            variables=['age'],
                            )

    ot_age.fit(X_train)

Let's inspect the maximum values beyond which data points will be considered outliers:

.. code:: python

    ot_age.right_tail_caps_

.. code:: python

    {'age': 67.73951212364803}

And the lower values beyond which data points will be considered outliers:

.. code:: python

    ot_age.left_tail_caps_

.. code:: python

    {'age': -7.410476010820627}

The minimum value does not make sense, because age can't be negative. So, we'll try trimming
this variable with percentiles instead.

Percentiles
^^^^^^^^^^^

We'll remove values of age below the 5th percentile and above the 95th percentile:

.. code:: python

    ot_age = OutlierTrimmer(capping_method='quantiles',
                            tail='both',
                            fold=0.05,
                            variables=['age'],
                            )

    ot_age.fit(X_train)

Let's inspect the maximum values beyond which data points will be considered outliers:

.. code:: python

    ot_age.right_tail_caps_

.. code:: python

    {'age': 54.0}

And the lower values beyond which data points will be considered outliers:

.. code:: python

    ot_age.left_tail_caps_

.. code:: python

    {'age': 9.0}

Let's transform the dataset and target:

.. code:: python

    train_t, y_train_t = ot_age.transform_x_y(X_train, y_train)
    test_t, y_test_t = ot_age.transform_x_y(X_test, y_test)

And plot the resulting variable:

.. code:: python

    train_t.boxplot(column=['age'])
    plt.title("Box plot - outliers")
    plt.ylabel("variable values")
    plt.show()

In the following image, we see that after this transformation, the variable age still shows
some outlier values towards its higher end, so we should either be more stringent with the
percentiles or use the MAD:

.. figure:: ../../images/boxplot-age-percentiles.png
   :align: center

Pipeline
--------

The :class:`OutlierTrimmer()` removes observations from the predictor datasets. If we want to
use this transformer within a Pipeline, we can't use Scikit-learn's pipeline, because it can't
readjust the target. But we can use Feature-engine's pipeline instead.

Let's start by creating a pipeline that removes outliers and then encodes categorical
variables:

.. code:: python

    from feature_engine.encoding import OneHotEncoder
    from feature_engine.pipeline import Pipeline

    pipe = Pipeline(
        [
            ("outliers", ot),
            ("enc", OneHotEncoder()),
        ]
    )

    pipe.fit(X_train, y_train)

The `transform` method will transform only the dataset with the predictors, just like
scikit-learn's pipeline:

.. code:: python

    train_t = pipe.transform(X_train)

    X_train.shape, train_t.shape

We see the adjusted data size compared to the original size here:

.. code:: python

    ((916, 8), (736, 76))

Feature-engine's pipeline can also adjust the target:

.. code:: python

    train_t, y_train_t = pipe.transform_x_y(X_train, y_train)

    y_train.shape, y_train_t.shape

We see the adjusted data size compared to the original size here:

.. code:: python

    ((916,), (736,))

To wrap up, let's add a machine learning algorithm to the pipeline. We'll use logistic
regression to predict survival:

.. code:: python

    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline(
        [
            ("outliers", ot),
            ("enc", OneHotEncoder()),
            ("logit", LogisticRegression(random_state=10)),
        ]
    )

    pipe.fit(X_train, y_train)

Now, we can predict survival:

.. code:: python

    preds = pipe.predict(X_train)

    preds[0:10]

We see the following output:

.. code:: python

    array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1], dtype=int64)

We can obtain the probability of survival:

.. code:: python

    preds = pipe.predict_proba(X_train)

    preds[0:10]

We see the following output:

.. code:: python

    array([[0.13027536, 0.86972464],
           [0.14982143, 0.85017857],
           [0.2783799 , 0.7216201 ],
           [0.86907159, 0.13092841],
           [0.31794531, 0.68205469],
           [0.86905145, 0.13094855],
           [0.1396715 , 0.8603285 ],
           [0.48403632, 0.51596368],
           [0.6299007 , 0.3700993 ],
           [0.49712853, 0.50287147]])

We can obtain the accuracy of the predictions over the test set:

.. code:: python

    pipe.score(X_test, y_test)

That returns the following accuracy:

.. code:: python

    0.7823343848580442
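As a sanity check, the same accuracy should be reproducible with scikit-learn's
`accuracy_score`. The following sketch assumes the fitted `pipe` from the previous steps and
uses `transform_x_y` to re-align the target with the rows that survive the outlier removal:

.. code:: python

    from sklearn.metrics import accuracy_score

    # re-align the target with the rows that remain after trimming outliers
    X_test_t, y_test_t = pipe[:-1].transform_x_y(X_test, y_test)

    # predictions are returned only for the remaining rows, in the same order
    preds_test = pipe.predict(X_test)

    print(accuracy_score(y_test_t, preds_test))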
We can obtain the names of the features after the transformation:

.. code:: python

    pipe[:-1].get_feature_names_out()

That returns the following names:

.. code:: python

    ['pclass',
     'age',
     'sibsp',
     'parch',
     'fare',
     'sex_female',
     'sex_male',
     'cabin_Missing',
     ...

And finally, we can obtain the transformed dataset and target as follows:

.. code:: python

    X_test_t, y_test_t = pipe[:-1].transform_x_y(X_test, y_test)

    X_test.shape, X_test_t.shape

We see the resulting sizes here:

.. code:: python

    ((393, 8), (317, 76))

Tutorials, books and courses
----------------------------

In the following Jupyter notebook, in our accompanying Github repository, you will find more
examples using :class:`OutlierTrimmer()`.

- `Jupyter notebook `_

For tutorials about this and other feature engineering methods check out our online course:

.. figure:: ../../images/feml.png
   :width: 300
   :figclass: align-center
   :align: left
   :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning

   Feature Engineering for Machine Learning

|
|
|
|
|
|
|
|
|
|

Or read our book:

.. figure:: ../../images/cookbook.png
   :width: 200
   :figclass: align-center
   :align: left
   :target: https://packt.link/0ewSo

   Python Feature Engineering Cookbook

|
|
|
|
|
|
|
|
|
|
|
|
|

Both our book and course are suitable for beginners and more advanced data scientists alike.
By purchasing them you are supporting Sole, the main developer of Feature-engine.