ArbitraryOutlierCapper¶
API Reference¶
- class feature_engine.outliers.ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict=None, missing_values='raise')[source]¶
The ArbitraryOutlierCapper() caps the maximum or minimum values of a variable at an arbitrary value indicated by the user.
You must provide the maximum or minimum values that will be used to cap each variable in a dictionary of the form {feature: capping value}, as illustrated in the sketch after the parameter list.
- Parameters
- max_capping_dict: dictionary, default=None
Dictionary containing the user-specified capping values for the right tail of the distribution of each variable (maximum values).
- min_capping_dict: dictionary, default=None
Dictionary containing the user-specified capping values for the left tail of the distribution of each variable (minimum values).
- missing_values: string, default='raise'
Indicates whether missing values should be ignored or raised. If missing_values='raise', the transformer will return an error if the training data or the datasets to transform contain missing values.
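For illustration only, a minimal sketch of how these parameters might be combined; the variable names and capping values below are invented for the example:
from feature_engine.outliers import ArbitraryOutlierCapper
# cap 'age' above 50 and 'fare' above 200, cap 'age' below 10,
# and ignore missing values instead of raising an error
capper = ArbitraryOutlierCapper(
    max_capping_dict={'age': 50, 'fare': 200},
    min_capping_dict={'age': 10},
    missing_values='ignore',
)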
Attributes
right_tail_caps_:
Dictionary with the maximum values at which variables will be capped.
left_tail_caps_:
Dictionary with the minimum values at which variables will be capped.
variables_:
The group of variables that will be transformed.
n_features_in_:
The number of features in the training set used in fit. A short sketch illustrating these attributes follows this list.
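A short sketch of how these attributes could be inspected after fitting on a small, made-up dataframe; the values in the comments are the expected results, not output copied from the library:
import pandas as pd
from feature_engine.outliers import ArbitraryOutlierCapper
toy = pd.DataFrame({'age': [20, 35, 70], 'fare': [10.0, 50.0, 500.0]})
capper = ArbitraryOutlierCapper(max_capping_dict={'age': 50, 'fare': 200})
capper.fit(toy)
capper.right_tail_caps_   # expected: {'age': 50, 'fare': 200}
capper.left_tail_caps_    # expected: {} since no min_capping_dict was given
capper.variables_         # expected: ['age', 'fare']
capper.n_features_in_     # expected: 2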
Methods
fit:
This transformer does not learn any parameter.
transform:
Cap the variables.
fit_transform:
Fit to the data. Then transform it.
- fit(X, y=None)[source]¶
This transformer does not learn any parameter.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The training input samples.
- y: pandas Series, default=None
y is not needed in this transformer. You can pass y or None.
- Returns
- self
- Raises
- TypeError
If the input is not a Pandas DataFrame (see the sketch below).
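As documented above, passing anything other than a pandas DataFrame to fit() is expected to raise a TypeError. A hypothetical sketch:
from feature_engine.outliers import ArbitraryOutlierCapper
capper = ArbitraryOutlierCapper(max_capping_dict={'age': 50})
try:
    capper.fit([[20], [35], [70]])   # a plain list is not a DataFrame
except TypeError as err:
    print('fit rejected the input:', err)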
- transform(X)[source]¶
Cap the variable values; that is, censor the outliers.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The data to be transformed.
- Returns
- X: pandas dataframe of shape = [n_samples, n_features]
The dataframe with the capped variables.
- Return type
DataFrame
- Raises
- TypeError
If the input is not a Pandas DataFrame
- ValueError
If the dataframe does not have the same number of columns as the dataframe used in fit() (see the sketch below).
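A brief sketch of the behaviour documented above: transform() clips values at the given caps, and a dataframe with a different number of columns than the one seen during fit() is expected to raise a ValueError. The toy data are invented for illustration:
import pandas as pd
from feature_engine.outliers import ArbitraryOutlierCapper
train = pd.DataFrame({'age': [20, 35, 70], 'fare': [10.0, 50.0, 500.0]})
capper = ArbitraryOutlierCapper(max_capping_dict={'age': 50, 'fare': 200})
capper.fit(train)
capped = capper.transform(train)
print(capped['age'].max(), capped['fare'].max())   # expected: 50 200.0
try:
    capper.transform(train[['age']])   # the 'fare' column is missing
except ValueError as err:
    print('transform rejected the input:', err)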
Example¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.outliers import ArbitraryOutlierCapper
# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'] = data['embarked'].fillna('C')
    data['fare'] = data['fare'].astype('float')
    data['fare'] = data['fare'].fillna(data['fare'].median())
    data['age'] = data['age'].astype('float')
    data['age'] = data['age'].fillna(data['age'].median())
    return data
data = load_titanic()
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)
# set up the capper
capper = ArbitraryOutlierCapper(max_capping_dict={'age': 50, 'fare': 200}, min_capping_dict=None)
# fit the capper
capper.fit(X_train)
# transform the data
train_t = capper.transform(X_train)
test_t = capper.transform(X_test)
capper.right_tail_caps_
{'age': 50, 'fare': 200}
train_t[['fare', 'age']].max()
fare 200
age 50
dtype: float64
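The example above caps only the right tail. As a variation not taken from the original documentation, the left tail can be capped as well through min_capping_dict, reusing the same train set:
capper = ArbitraryOutlierCapper(
    max_capping_dict={'age': 50, 'fare': 200},
    min_capping_dict={'age': 10},
)
capper.fit(X_train)
train_t = capper.transform(X_train)
capper.left_tail_caps_    # expected: {'age': 10}
train_t['age'].min()      # expected to be no lower than 10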