.. _outlier_trimmer:
.. currentmodule:: feature_engine.outliers
OutlierTrimmer
===============
Outliers are data points that significantly deviate from the rest of the dataset, potentially indicating errors or rare
occurrences. Outliers can distort the learning process of machine learning models by skewing parameter estimates and
reducing predictive accuracy. To prevent this, if you suspect that the outliers are errors or rare occurrences, you can
remove them from the training data.
In this guide, we show how to remove outliers in Python using the :class:`OutlierTrimmer()`.
The first step to removing outliers consists of identifying those outliers. Outliers can be identified through various
statistical methods, such as box plots, zscores, the interquartile range (IQR), or the median absolute deviation.
Additionally, visual inspection of the data using scatter plots or histograms is common practice in data science, and
can help detect observations that significantly deviate from the overall pattern of the dataset.
The :class:`OutlierTrimmer()` can identify outliers by using all of these methods and then remove them automatically.
Hence, we’ll begin this guide with data analysis, showing how we can identify outliers through these statistical methods
and boxplots, and then we will remove outliers by using the :class:`OutlierTrimmer()`.
Identifying outliers

Outliers are data points that are usually far greater, or far smaller than some value that determines where most of the
values in the distribution lie. These minimum and maximum values, that delimit the data distribution, can be calculated
in 4 ways: by using the zscore if the variable is normally distributed, by using the interquartile range proximity rule
or the median absolute deviation if the variables are skewed, or by using percentiles.
Gaussian limits or zscore
~~~~~~~~~~~~~~~~~~~~~~~~~~
If the variable shows a normal distribution, most of its values lie between the mean minus 3 times the standard deviation
and the mean plus 3 times the standard deviation. Hence, we can determine the limits of the distribution as follows:
 right tail (upper_bound): mean + 3* std
 left tail (lower_bound): mean  3* std
We can consider outliers those data points that lie beyond these limits.
Interquartile range proximity rule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The interquartile range proximity rule can be used to detect outliers both in variables that show a normal distribution
and in variables with a skew. When using the IQR, we detect outliers as those values that lie before the 25th percentile
times a factor of the IQR, or after the 75th percentile times a factor of the IQR. This factor is normally 1.5, or 3 if
we want to be more stringent. With the IQR method, the limits are calculated as follows:
IQR limits:
 right tail (upper_limit): 75th quantile + 3* IQR
 left tail (lower_limit): 25th quantile  3* IQR
where IQR is the interquartile range:
 IQR = 75th quantile  25th quantile = third quartile  first quartile.
Observations found beyond those limits can be considered extreme values.
Maximum absolute deviation
~~~~~~~~~~~~~~~~~~~~~~~~~~
Parameters like the mean and the standard deviation are strongly affected by the presence of outliers. Therefore, it
might be a better solution to use a metric that is robust against outliers, like the median absolute deviation from the
median, commonly shortened to the median absolute deviation (MAD), to delimit the normal data distribution.
When we use MAD, we determine the limits of the distribution as follows:
MAD limits:
 right tail (upper_limit): median + 3* MAD
 left tail (lower_limit): median  3* MAD
MAD is the median absolute deviation from the median. In other words, MAD is the median value of the absolute difference
between each observation and its median.
 MAD = median(abs(Xmedian(X))
Percentiles
~~~~~~~~~~~
A simpler way to determine the values that delimit the data distribution is by using percentiles. Like this, outlier
values would be those that lie before or after a certain percentile or quantiles:
 right tail: 95th percentile
 left tail: 5th percentile
The number of outliers identified by any of these methods will vary. These methods detect outliers, but they can’t
decide if they are true outliers or faithful data points. That required further examination and domain knowledge.
Let’s move on to removing outliers in Python.
Remove outliers in Python

In this demo, we'll identify and remove outliers from the Titanic Dataset. First, let's load the data and separate it
into train and test:
.. code:: python
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.outliers import OutlierTrimmer
X, y = load_titanic(
return_X_y_frame=True,
predictors_only=True,
handle_missing=True,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0,
)
print(X_train.head())
We see the resulting pandas dataframe below:
.. code:: python
pclass sex age sibsp parch fare cabin embarked
501 2 female 13.000000 0 1 19.5000 Missing S
588 2 female 4.000000 1 1 23.0000 Missing S
402 2 female 30.000000 1 0 13.8583 Missing C
1193 3 male 29.881135 0 0 7.7250 Missing Q
686 3 female 22.000000 0 0 7.7250 Missing Q
Identifying outliers
~~~~~~~~~~~~~~~~~~~~
Let's now identify potential extreme values in the training set by using boxplots.
.. code:: python
X_train.boxplot(column=['age', 'fare', 'sibsp'])
plt.title("Box plot  outliers")
plt.ylabel("variable values")
plt.show()
In the following boxplots, we see that all three variables have data points that are significantly greater than the
majority of the data distribution. The variable age also shows outlier values towards the lower values.
.. figure:: ../../images/boxplottitanic.png
:align: center
The variables have different scales, so let's plot them individually for better visualization. Let's start by making a
boxplot of the variable fare:
.. code:: python
X_train.boxplot(column=['fare'])
plt.title("Box plot  outliers")
plt.ylabel("variable values")
plt.show()
We see the boxplot in the following image:
.. figure:: ../../images/boxplotfare.png
:align: center
Next, we plot the variable age:
.. code:: python
X_train.boxplot(column=['age'])
plt.title("Box plot  outliers")
plt.ylabel("variable values")
plt.show()
We see the boxplot in the following image:
.. figure:: ../../images/boxplotage.png
:align: center
And finally, we make a boxplot of the variable sibsp:
.. code:: python
X_train.boxplot(column=['sibsp'])
plt.title("Box plot  outliers")
plt.ylabel("variable values")
plt.show()
We see the boxplot and the outlier values in the following image:
.. figure:: ../../images/boxplotsibsp.png
:align: center
Outlier removal
~~~~~~~~~~~~~~~
Now, we will use the :class:`OutlierTrimmer()` to remove outliers. We'll start by using the IQR as outlier detection
method.
IQR
^^^
We want to remove outliers at the right side of the distribution only (param `tail`). We want the maximum values to be
determined using the 75th quantile of the variable (param `capping_method`) plus 1.5 times the IQR (param `fold`). And
we only want to cap outliers in 2 variables, which we indicate in a list.
.. code:: python
ot = OutlierTrimmer(capping_method='iqr',
tail='right',
fold=1.5,
variables=['sibsp', 'fare'],
)
ot.fit(X_train)
With `fit()`, the :class:`OutlierTrimmer()` finds the values at which it should cap the variables. These values are
stored in one of its attributes:
.. code:: python
ot.right_tail_caps_
.. code:: python
{'sibsp': 2.5, 'fare': 66.34379999999999}
We can now go ahead and remove the outliers:
.. code:: python
train_t = ot.transform(X_train)
test_t = ot.transform(X_test)
We can compare the sizes of the original and transformed datasets to check that the outliers were removed:
.. code:: python
X_train.shape, train_t.shape
We see that the transformed dataset contains less rows:
.. code:: python
((916, 8), (764, 8))
If we evaluate now the maximum of the variables in the transformed datasets, they should be <= the values observed in
the attribute `right_tail_caps_`:
.. code:: python
train_t[['fare', 'age']].max()
.. code:: python
fare 65.0
age 53.0
dtype: float64
Finally, we can check the boxplots of the transformed variables to corroborate the effect on their distribution.
.. code:: python
train_t.boxplot(column=['sibsp', "fare"])
plt.title("Box plot  outliers")
plt.ylabel("variable values")
plt.show()
We see the boxplot and the `sibsp` does no longer have outliers, but as `fare` was very skewed, when removing outliers,
the parameters of the IQR change, and we continue to see outliers:
.. figure:: ../../images/boxplotsibspfareiqr.png
:align: center
We'll come back to this later, but now let's continue showing the functionality of the :class:`OutlierTrimmer()`.
When we remove outliers from the datasets, we then need to realign the target variables. We can do this with pandas
loc. But the :class:`OutlierTrimmer()` can do that automatically as follows:
.. code:: python
train_t, y_train_t = ot.transform_x_y(X_train, y_train)
test_t, y_test_t = ot.transform_x_y(X_test, y_test)
The method `transform_x_y` will remove outliers from the predictor datasets and then align the target variable. That
means, it will remove from the target those rows corresponding to the outlier values.
We can corroborate the size adjustment in the target as follows:
.. code:: python
y_train.shape, y_train_t.shape,
The previous command returns the following output:
.. code:: python
((916,), (764,))
We can obtain the names of the fetaures in the transformed dataset as follows:
.. code:: python
ot.get_feature_names_out()
That returns the following variable namesL
.. code:: python
['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'cabin', 'embarked']
MAD
^^^
We saw that the IQR did not work amazingly for the variable fare, because its skew is too big. So let's remove outliers
by using the MAD instead:
.. code:: python
ot = OutlierTrimmer(capping_method='mad',
tail='right',
fold=3,
variables=['fare'],
)
ot.fit(X_train)
train_t, y_train_t = ot.transform_x_y(X_train, y_train)
test_t, y_test_t = ot.transform_x_y(X_test, y_test)
train_t.boxplot(column=["fare"])
plt.title("Box plot  outliers")
plt.ylabel("variable values")
plt.show()
In the following image, we see that after this transformation, the variable fare no longer shows outlier values:
.. figure:: ../../images/boxplotfaremad.png
:align: center
Zscore
^^^^^^^
The variable age is more homogeneously distributed across its value range, so let's use the zscore or gaussian
approximation to detect outliers. We saw in the boxplot that it has outliers at both ends, so we'll cap both ends of the
distribution:
.. code:: python
ot_age = OutlierTrimmer(capping_method='gaussian',
tail="both",
fold=3,
variables=['age'],
)
ot_age.fit(X_train)
Let's inspect the maximum values beyond which data points will be considered outliers:
.. code:: python
ot_age.right_tail_caps_
.. code:: python
{'age': 67.73951212364803}
And the lower values beyond which data points will be considered outliers:
.. code:: python
ot_age.left_tail_caps_
.. code:: python
{'age': 7.410476010820627}
The minimum value does not make sense, because age can't be negative. So, we'll try capping this variable with
percentiles instead.
Percentiles
^^^^^^^^^^^
We'll cap age at the bottom 5 and top 95 percentile:
.. code:: python
ot = OutlierTrimmer(capping_method='mad',
tail='right',
fold=3,
variables=['fare'],
)
ot.fit(X_train)
Let's inspect the maximum values beyond which data points will be considered outliers:
.. code:: python
ot_age.right_tail_caps_
.. code:: python
{'age': 54.0}
And the lower values beyond which data points will be considered outliers:
.. code:: python
ot_age.left_tail_caps_
.. code:: python
{'age': 9.0}
Let's tranform the dataset and target:
.. code:: python
train_t, y_train_t = ot_age.transform_x_y(X_train, y_train)
test_t, y_test_t = ot_age.transform_x_y(X_test, y_test)
And plot the resulting variable:
.. code:: python
train_t.boxplot(column=['age'])
plt.title("Box plot  outliers")
plt.ylabel("variable values")
plt.show()
In the following image, we see that after this transformation, the variable age still shows some outlier values towards
its higher values, so we should be more stringent with the percentiles or use MAD:
.. figure:: ../../images/boxplotagepercentiles.png
:align: center
Pipeline

The :class:`OutlierTrimmer()` removes observations from the predictor data sets. If we want to use this transformer
within a Pipeline, we can't use Scikitlearn's pipeline because it can't readjust the target. But we can use
Featureengine's pipeline instead.
Let's start by creating a pipeline that removes outliers and then encodes categorical variables:
.. code:: python
from feature_engine.encoding import OneHotEncoder
from feature_engine.pipeline import Pipeline
pipe = Pipeline(
[
("outliers", ot),
("enc", OneHotEncoder()),
]
)
pipe.fit(X_train, y_train)
The `transform` method will transform only the dataset with the predictors, just like scikitlearn's pipeline:
.. code:: python
train_t = pipe.transform(X_train)
X_train.shape, train_t.shape
We see the adjusted data size compared to the original size here:
.. code:: python
((916, 8), (736, 76))
Featureengine's pipeline can also adjust the target:
.. code:: python
train_t, y_train_t = pipe.transform_x_y(X_train, y_train)
y_train.shape, y_train_t.shape
We see the adjusted data size compared to the original size here:
.. code:: python
((916,), (736,))
To wrap up, let's add a machine learning algorithm to the pipeline. We'll use logistic regression to predict survival:
.. code:: python
from sklearn.linear_model import LogisticRegression
pipe = Pipeline(
[
("outliers", ot),
("enc", OneHotEncoder()),
("logit", LogisticRegression(random_state=10)),
]
)
pipe.fit(X_train, y_train)
Now, we can predict survival:
.. code:: python
preds = pipe.predict(X_train)
preds[0:10]
We see the following output:
.. code:: python
array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1], dtype=int64)
We can obtain the probability of survival:
.. code:: python
preds = pipe.predict_proba(X_train)
preds[0:10]
We see the following output:
.. code:: python
array([[0.13027536, 0.86972464],
[0.14982143, 0.85017857],
[0.2783799 , 0.7216201 ],
[0.86907159, 0.13092841],
[0.31794531, 0.68205469],
[0.86905145, 0.13094855],
[0.1396715 , 0.8603285 ],
[0.48403632, 0.51596368],
[0.6299007 , 0.3700993 ],
[0.49712853, 0.50287147]])
We can obtain the accuracy of the predictions over the test set:
.. code:: python
pipe.score(X_test, y_test)
That returns the following accuracy:
.. code:: python
0.7823343848580442
We can obtain the names of the features after the trasnformation:
.. code:: python
pipe[:1].get_feature_names_out()
That returns the following names:
.. code:: python
['pclass',
'age',
'sibsp',
'parch',
'fare',
'sex_female',
'sex_male',
'cabin_Missing',
...
And finally, we can obtain the transformed dataset and target as follows:
.. code:: python
X_test_t, y_test_t = pipe[:1].transform_x_y(X_test, y_test)
X_test.shape, X_test_t.shape
We see the resulting sizes here:
.. code:: python
((393, 8), (317, 76))
Tutorials, books and courses

In the following Jupyter notebook, in our accompanying Github repository, you will find more examples using
:class:`OutlierTrimmer()`.
 `Jupyter notebook `_
For tutorials about this and other feature engineering methods check out our online course:
.. figure:: ../../images/feml.png
:width: 300
:figclass: aligncenter
:align: left
:target: https://www.trainindata.com/p/featureengineeringformachinelearning
Feature Engineering for Machine Learning










Or read our book:
.. figure:: ../../images/cookbook.png
:width: 200
:figclass: aligncenter
:align: left
:target: https://packt.link/0ewSo
Python Feature Engineering Cookbook













Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of Featureengine.