.. _winsorizer:

.. currentmodule:: feature_engine.outliers

Winsorizer
==========

The :class:`Winsorizer()` caps the maximum and/or minimum values of a variable at
automatically determined values. The minimum and maximum values can be calculated in
one of four ways:

Gaussian limits:
    - right tail: mean + 3 * std
    - left tail: mean - 3 * std

IQR limits:
    - right tail: 75th quantile + 3 * IQR
    - left tail: 25th quantile - 3 * IQR

where IQR is the interquartile range: 75th quantile - 25th quantile.

MAD limits:
    - right tail: median + 3 * MAD
    - left tail: median - 3 * MAD

where MAD is the median absolute deviation from the median.

Percentiles or quantiles:
    - right tail: 95th percentile
    - left tail: 5th percentile
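The four sets of limits above can be sketched with plain NumPy on a synthetic
variable. This is an illustration of the formulas, not Feature-engine's internals;
the array ``x`` and the fold of 3 are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=1000)  # synthetic variable

# Gaussian limits: mean +/- 3 * std
gauss_right = x.mean() + 3 * x.std()
gauss_left = x.mean() - 3 * x.std()

# IQR limits: quartiles +/- 3 * IQR
q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25
iqr_right = q75 + 3 * iqr
iqr_left = q25 - 3 * iqr

# MAD limits: median +/- 3 * MAD (plain MAD here; the library may scale it)
median = np.median(x)
mad = np.median(np.abs(x - median))
mad_right = median + 3 * mad
mad_left = median - 3 * mad

# Percentile limits: 5th and 95th percentiles
pct_right = np.percentile(x, 95)
pct_left = np.percentile(x, 5)
```

Values outside the chosen pair of limits are replaced by the limits themselves.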
**Example**
Let's cap some outliers in the Titanic Dataset. First, let's load the data and separate
it into train and test:
.. code:: python

    from sklearn.model_selection import train_test_split

    from feature_engine.datasets import load_titanic
    from feature_engine.outliers import Winsorizer

    X, y = load_titanic(
        return_X_y_frame=True,
        predictors_only=True,
        handle_missing=True,
    )

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0,
    )

    print(X_train.head())

We see the resulting data below:
.. code:: python

          pclass     sex        age  sibsp  parch     fare    cabin embarked
    501        2  female  13.000000      0      1  19.5000  Missing        S
    588        2  female   4.000000      1      1  23.0000  Missing        S
    402        2  female  30.000000      1      0  13.8583  Missing        C
    1193       3    male  29.881135      0      0   7.7250  Missing        Q
    686        3  female  22.000000      0      0   7.7250  Missing        Q

Now, we will set up the :class:`Winsorizer()` to cap outliers at the right tail of the
distribution only (param `tail`). We want the maximum values to be determined as the
mean of the variable (param `capping_method`) plus 3 times the standard deviation
(param `fold`). We will cap outliers in only 2 variables, which we indicate in a
list.
.. code:: python

    capper = Winsorizer(capping_method='gaussian',
                        tail='right',
                        fold=3,
                        variables=['age', 'fare'])

    capper.fit(X_train)

With `fit()`, the :class:`Winsorizer()` finds the values at which it should cap the variables.
These values are stored in its attribute:
.. code:: python

    capper.right_tail_caps_

.. code:: python

    {'age': 67.73951212364803, 'fare': 174.70395336846678}

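With `capping_method='gaussian'` and `fold=3`, the stored cap is the documented
formula, mean + 3 * std, computed on the training data. A minimal pandas sketch on
synthetic data (the column name and values are assumptions for illustration, and we
assume pandas' default sample standard deviation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"age": rng.normal(30, 12, size=500)})  # synthetic ages

# right-tail Gaussian cap: mean + fold * std
fold = 3
cap = df["age"].mean() + fold * df["age"].std()
```

This is the value the transformer would use to cap the right tail of ``age``.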
We can now go ahead and censor the outliers:
.. code:: python

    # transform the data
    train_t = capper.transform(X_train)
    test_t = capper.transform(X_test)

If we now evaluate the maximum values of the variables in the transformed datasets,
they should coincide with the values stored in the attribute `right_tail_caps_`:
.. code:: python

    train_t[['fare', 'age']].max()

.. code:: python

    fare    174.703953
    age      67.739512
    dtype: float64

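The transformation itself amounts to clipping each variable at its stored cap; a
sketch using `pandas.Series.clip` on a synthetic right-skewed variable (the name
``fare`` and the data are illustrative assumptions, not the Titanic values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
fare = pd.Series(rng.exponential(scale=30, size=300))  # synthetic, right-skewed

cap = fare.mean() + 3 * fare.std()   # Gaussian right-tail cap
fare_capped = fare.clip(upper=cap)   # censor values above the cap

# after capping, no value exceeds the cap
assert fare_capped.max() <= cap
```

Unlike dropping outliers, capping keeps every row: only the extreme values change.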
Additional resources
--------------------
You can find more details about the :class:`Winsorizer()` functionality in the following
notebook:

- `Jupyter notebook `_
For more details about this and other feature engineering methods check out these resources:
.. figure:: ../../images/feml.png
   :width: 300
   :figclass: align-center
   :align: left
   :target: https://www.trainindata.com/p/featureengineeringformachinelearning

   Feature Engineering for Machine Learning

Or read our book:
.. figure:: ../../images/cookbook.png
   :width: 200
   :figclass: align-center
   :align: left
   :target: https://packt.link/0ewSo

   Python Feature Engineering Cookbook

Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.