MathematicalCombination

MathematicalCombination() applies basic mathematical operations to a group of features and returns the results as one or more new features. The supported operations are the sum, product, mean, maximum, minimum and standard deviation of the selected variables.

For example, if we have the variables:

  • number_payments_first_quarter,

  • number_payments_second_quarter,

  • number_payments_third_quarter and

  • number_payments_fourth_quarter,

we can use MathematicalCombination() to calculate the total number of payments and mean number of payments as follows:

transformer = MathematicalCombination(
    variables_to_combine=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter'
    ],
    math_operations=[
        'sum',
        'mean'
    ],
    new_variables_names=[
        'total_number_payments',
        'mean_number_payments'
    ]
)

Xt = transformer.fit_transform(X)

The transformed dataset, Xt, will contain the additional features total_number_payments and mean_number_payments, plus the original set of variables.

The variable total_number_payments is obtained by adding up the features indicated in variables_to_combine, whereas the variable mean_number_payments is the mean of those 4 features.
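To make the row-wise arithmetic concrete, the sketch below reproduces the two operations in plain Python for two made-up customers. It does not call feature_engine: the payment counts and the helper `combine_row` are hypothetical and only illustrate what the sum and the mean of the four features look like for each row.

```python
from statistics import mean

variables_to_combine = [
    "number_payments_first_quarter",
    "number_payments_second_quarter",
    "number_payments_third_quarter",
    "number_payments_fourth_quarter",
]

# Hypothetical data: one dictionary per customer (illustrative values only).
rows = [
    dict(zip(variables_to_combine, [2, 4, 0, 6])),
    dict(zip(variables_to_combine, [1, 1, 1, 5])),
]


def combine_row(row):
    """Return a copy of the row with the two new features appended."""
    values = [row[name] for name in variables_to_combine]
    return {
        **row,  # the original variables are kept
        "total_number_payments": sum(values),
        "mean_number_payments": mean(values),
    }


combined = [combine_row(row) for row in rows]

for row in combined:
    print(row["total_number_payments"], row["mean_number_payments"])
```

Each output row keeps the four original variables and gains the two new ones, which mirrors the shape of Xt described above.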

Below we show another example using the House Prices Dataset. In this example, we sum 2 variables, ‘LotFrontage’ and ‘LotArea’, to obtain ‘LotTotal’.

import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.creation import MathematicalCombination

data = pd.read_csv('houseprice.csv').fillna(0)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

math_combinator = MathematicalCombination(
    variables_to_combine=['LotFrontage', 'LotArea'],
    math_operations=['sum'],
    new_variables_names=['LotTotal']
)

math_combinator.fit(X_train, y_train)

X_train_ = math_combinator.transform(X_train)

In the attribute combination_dict_, the transformer stores the name of each new variable and the operation used to obtain it. This way, we can easily identify which variable is the result of which operation.

print(math_combinator.combination_dict_)
{'LotTotal': 'sum'}

We can see that the transformed dataset contains the additional variable:

print(X_train_.loc[:,['LotFrontage', 'LotArea', 'LotTotal']].head())
      LotFrontage  LotArea  LotTotal
64            0.0     9375    9375.0
682           0.0     2887    2887.0
960          50.0     7207    7257.0
1384         60.0     9060    9120.0
1100         60.0     8400    8460.0

new_variables_names

Even though the transformer can combine variables automatically, it was originally designed to combine variables based on domain knowledge. In that case, we normally want to give the new variables meaningful names. We can do so through the parameter new_variables_names.

new_variables_names takes a list of strings with the names of the newly created features. Entering these names is recommended. You must enter one name for each mathematical operation indicated in the math_operations parameter. That is, if you want to compute the mean and the sum of the features, enter 2 new variable names. If you compute only the mean, enter 1 variable name. If you choose to perform all the mathematical operations, enter 6 new variable names.

The order of the new variable names should match the order in which the mathematical operations are listed in the transformer. That is, if you set math_operations = ['mean', 'prod'], the first new variable name will be assigned to the mean of the variables and the second to their product.
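The pairing rule can be sketched in plain Python as follows. This is not the transformer's code: the helper `pair_names_with_operations`, the `OPERATIONS` lookup and the variable names are assumptions made for illustration, and the lookup covers only a subset of the operations. The sketch checks that there is one name per operation and then matches them by position.

```python
import math
from statistics import mean

# Hypothetical lookup from operation names to plain-Python functions
# (a subset of the operations, chosen for illustration).
OPERATIONS = {
    "sum": sum,
    "prod": math.prod,
    "mean": mean,
    "max": max,
    "min": min,
}


def pair_names_with_operations(math_operations, new_variables_names):
    """Match each new variable name to the operation in the same position."""
    if len(math_operations) != len(new_variables_names):
        raise ValueError(
            "Enter exactly one new variable name per mathematical operation."
        )
    return dict(zip(new_variables_names, math_operations))


# The first name goes with 'mean', the second with 'prod'.
mapping = pair_names_with_operations(
    ["mean", "prod"], ["avg_value", "product_value"]
)

# Apply each operation to one hypothetical row of values.
values = [2, 3, 4]
results = {name: OPERATIONS[op](values) for name, op in mapping.items()}

print(mapping)
print(results)
```

The resulting `mapping` has the same name-to-operation layout as the combination_dict_ attribute shown earlier, and swapping the order of either list changes which name is attached to which result.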

More details

You can find creative ways to use MathematicalCombination() in the following Jupyter notebooks and Kaggle kernels.