.. _arbitrary_number_imputer:

.. currentmodule:: feature_engine.imputation

ArbitraryNumberImputer
======================
The :class:`ArbitraryNumberImputer()` replaces missing data with an arbitrary numerical
value determined by the user. It works only with numerical variables.

The :class:`ArbitraryNumberImputer()` can find and impute all numerical variables
automatically. Alternatively, you can pass a list of the variables you want to impute
to the `variables` parameter.
You can impute all variables with the same number, in which case you need to define
the variables to impute in the `variables` parameter and the imputation number in
the `arbitrary_number` parameter. For example, you can impute varA and varB with 99
like this:

.. code-block:: python

    transformer = ArbitraryNumberImputer(
        variables=['varA', 'varB'],
        arbitrary_number=99,
    )

    Xt = transformer.fit_transform(X)
You can also impute different variables with different numbers. To do this, you need to
pass a dictionary with the variable names and the numbers to use for their imputation
to the `imputer_dict` parameter. For example, you can impute varA with 1 and varB
with 99 like this:

.. code-block:: python

    transformer = ArbitraryNumberImputer(
        imputer_dict={'varA': 1, 'varB': 99},
    )

    Xt = transformer.fit_transform(X)
Below is a code example using the House Prices Dataset (more details about the dataset
:ref:`here `).

First, let's load the data and separate it into train and test sets:

.. code:: python

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    from sklearn.model_selection import train_test_split

    from feature_engine.imputation import ArbitraryNumberImputer

    # Load dataset
    data = pd.read_csv('houseprice.csv')

    # Separate into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(['Id', 'SalePrice'], axis=1),
        data['SalePrice'],
        test_size=0.3,
        random_state=0,
    )
Now we set up the :class:`ArbitraryNumberImputer()` to impute 2 variables from the
dataset with the number -999:

.. code:: python

    # set up the imputer
    arbitrary_imputer = ArbitraryNumberImputer(
        arbitrary_number=-999,
        variables=['LotFrontage', 'MasVnrArea'],
    )

    # fit the imputer
    arbitrary_imputer.fit(X_train)
With `fit()`, the transformer does not learn any parameters. It simply assigns the
imputation value to each variable, and the resulting mapping is stored in the
attribute `imputer_dict_`.
With `transform()`, we replace the missing data with the arbitrary values in both the
train and test sets:

.. code:: python

    # transform the data
    train_t = arbitrary_imputer.transform(X_train)
    test_t = arbitrary_imputer.transform(X_test)
Note that after the imputation, if the percentage of missing values is relatively
large, the variable distribution will differ from the original one (in red, the
imputed variable):

.. code:: python

    fig = plt.figure()
    ax = fig.add_subplot(111)
    X_train['LotFrontage'].plot(kind='kde', ax=ax)
    train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
    lines, labels = ax.get_legend_handles_labels()
    ax.legend(lines, labels, loc='best')

.. image:: ../../images/arbitraryvalueimputation.png
More details
^^^^^^^^^^^^

In the following Jupyter notebook you will find more details on the functionality of the
:class:`ArbitraryNumberImputer()`, including how to select numerical variables automatically.
You will also see how to navigate the different attributes of the transformer.

- `Jupyter notebook `_

For more details about this and other feature engineering methods check out these resources:

- `Feature engineering for machine learning `_, online course.
- `Python Feature Engineering Cookbook `_, book.