.. _winsorizer:

.. currentmodule:: feature_engine.outliers

Winsorizer
==========

The :class:`Winsorizer()` caps the maximum and/or minimum values of a variable at
automatically determined values. The minimum and maximum values can be calculated in
one of four ways:

Gaussian limits:
    - right tail: mean + 3 * std
    - left tail: mean - 3 * std

IQR limits:
    - right tail: 75th quantile + 3 * IQR
    - left tail: 25th quantile - 3 * IQR

where IQR is the interquartile range: 75th quantile - 25th quantile.

MAD limits:
    - right tail: median + 3 * MAD
    - left tail: median - 3 * MAD

where MAD is the median absolute deviation from the median.

Percentiles or quantiles:
    - right tail: 95th percentile
    - left tail: 5th percentile
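The four sets of limits above can be sketched with plain NumPy on a synthetic
variable. This is an illustration of the formulas, not Feature-engine's internals;
the array ``x`` and the fold of 3 are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=1000)  # synthetic variable

# Gaussian limits: mean +/- 3 * std
gauss_right = x.mean() + 3 * x.std()
gauss_left = x.mean() - 3 * x.std()

# IQR limits: quartiles +/- 3 * IQR
q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25
iqr_right = q75 + 3 * iqr
iqr_left = q25 - 3 * iqr

# MAD limits: median +/- 3 * MAD (plain MAD here; the library may scale it)
median = np.median(x)
mad = np.median(np.abs(x - median))
mad_right = median + 3 * mad
mad_left = median - 3 * mad

# Percentile limits: 5th and 95th percentiles
pct_right = np.percentile(x, 95)
pct_left = np.percentile(x, 5)
```

Values outside the chosen pair of limits are replaced by the limits themselves.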
**Example**
Let's cap some outliers in the Titanic Dataset. First, let's load the data and separate
it into train and test:
.. code:: python

    from sklearn.model_selection import train_test_split

    from feature_engine.datasets import load_titanic
    from feature_engine.outliers import Winsorizer

    X, y = load_titanic(
        return_X_y_frame=True,
        predictors_only=True,
        handle_missing=True,
    )

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0,
    )

    print(X_train.head())

We see the resulting data below:
.. code:: python

          pclass     sex        age  sibsp  parch     fare    cabin embarked
    501        2  female  13.000000      0      1  19.5000  Missing        S
    588        2  female   4.000000      1      1  23.0000  Missing        S
    402        2  female  30.000000      1      0  13.8583  Missing        C
    1193       3    male  29.881135      0      0   7.7250  Missing        Q
    686        3  female  22.000000      0      0   7.7250  Missing        Q

Now, we will set up the :class:`Winsorizer()` to cap outliers at the right tail of the
distribution only (param `tail`). We want the maximum values to be determined as the
mean of the variable (param `capping_method`) plus 3 times the standard deviation
(param `fold`). We will cap outliers in only 2 variables, which we indicate in a
list.
.. code:: python

    capper = Winsorizer(capping_method='gaussian',
                        tail='right',
                        fold=3,
                        variables=['age', 'fare'])

    capper.fit(X_train)

With `fit()`, the :class:`Winsorizer()` finds the values at which it should cap the variables.
These values are stored in its attribute:
.. code:: python

    capper.right_tail_caps_

.. code:: python

    {'age': 67.73951212364803, 'fare': 174.70395336846678}

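With `capping_method='gaussian'` and `fold=3`, the stored cap is the documented
formula, mean + 3 * std, computed on the training data. A minimal pandas sketch on
synthetic data (the column name and values are assumptions for illustration, and we
assume pandas' default sample standard deviation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"age": rng.normal(30, 12, size=500)})  # synthetic ages

# right-tail Gaussian cap: mean + fold * std
fold = 3
cap = df["age"].mean() + fold * df["age"].std()
```

This is the value the transformer would use to cap the right tail of ``age``.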
We can now go ahead and censor the outliers:
.. code:: python

    # transform the data
    train_t = capper.transform(X_train)
    test_t = capper.transform(X_test)

If we now evaluate the maximum values of the variables in the transformed datasets,
they should coincide with the values stored in the attribute `right_tail_caps_`:
.. code:: python

    train_t[['fare', 'age']].max()

.. code:: python

    fare    174.703953
    age      67.739512
    dtype: float64

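The transformation itself amounts to clipping each variable at its stored cap; a
sketch using `pandas.Series.clip` on a synthetic right-skewed variable (the name
``fare`` and the data are illustrative assumptions, not the Titanic values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
fare = pd.Series(rng.exponential(scale=30, size=300))  # synthetic, right-skewed

cap = fare.mean() + 3 * fare.std()   # Gaussian right-tail cap
fare_capped = fare.clip(upper=cap)   # censor values above the cap

# after capping, no value exceeds the cap
assert fare_capped.max() <= cap
```

Unlike dropping outliers, capping keeps every row: only the extreme values change.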
Additional resources
--------------------
You can find more details about the :class:`Winsorizer()` functionality in the following
notebook:

- `Jupyter notebook `_
For more details about this and other feature engineering methods check out these resources:
.. figure:: ../../images/feml.png
   :width: 300
   :figclass: align-center
   :align: left
   :target: https://www.trainindata.com/p/featureengineeringformachinelearning

   Feature Engineering for Machine Learning

Or read our book:
.. figure:: ../../images/cookbook.png
   :width: 200
   :figclass: align-center
   :align: left
   :target: https://packt.link/0ewSo

   Python Feature Engineering Cookbook

Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.