MathematicalCombination
MathematicalCombination() applies basic mathematical operations to multiple features and returns one or more additional features as a result. That is, it computes the sum, product, mean, maximum, minimum or standard deviation of a group of variables and stores each result in a new variable.
For example, if we have the variables number_payments_first_quarter, number_payments_second_quarter, number_payments_third_quarter and number_payments_fourth_quarter, we can use MathematicalCombination() to calculate the total number of payments and the mean number of payments as follows:
transformer = MathematicalCombination(
    variables_to_combine=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter'
    ],
    math_operations=[
        'sum',
        'mean'
    ],
    new_variables_names=[
        'total_number_payments',
        'mean_number_payments'
    ]
)
Xt = transformer.fit_transform(X)
The transformed dataset, Xt, will contain the additional features total_number_payments and mean_number_payments, plus the original set of variables. The variable total_number_payments is obtained by adding up the features indicated in variables_to_combine, whereas the variable mean_number_payments is the mean of those four features.
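To make the two operations concrete, here is a minimal sketch that computes the same two columns with plain pandas. The toy values are invented for illustration, and the row-wise `sum` and `mean` calls are an assumption about what the transformer computes, not its actual implementation.

```python
import pandas as pd

# Toy data invented for illustration; not part of the original example.
X = pd.DataFrame({
    "number_payments_first_quarter": [1, 4, 0],
    "number_payments_second_quarter": [2, 4, 3],
    "number_payments_third_quarter": [3, 4, 0],
    "number_payments_fourth_quarter": [2, 0, 1],
})

# The features we want to combine (here, every column of X).
cols = list(X.columns)

# Work on a copy so the original variables are kept untouched.
Xt = X.copy()
Xt["total_number_payments"] = X[cols].sum(axis=1)   # row-wise sum
Xt["mean_number_payments"] = X[cols].mean(axis=1)   # row-wise mean

print(Xt[["total_number_payments", "mean_number_payments"]])
```

Each new column is computed across the four quarterly columns of a single row, so the result has one value per observation and the original columns remain in place.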
Below we show another example using the House Prices Dataset (more details about the dataset here). In this example, we sum two variables, LotFrontage and LotArea, to obtain LotTotal.
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.creation import MathematicalCombination

# Load the data and replace missing values with 0
data = pd.read_csv('houseprice.csv').fillna(0)

# Separate the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# Set up the transformer to sum the two variables
math_combinator = MathematicalCombination(
    variables_to_combine=['LotFrontage', 'LotArea'],
    math_operations=['sum'],
    new_variables_names=['LotTotal']
)

# Fit the transformer and add the new variable to the train set
math_combinator.fit(X_train, y_train)
X_train_ = math_combinator.transform(X_train)
In the attribute combination_dict_, the transformer stores the name of each new variable and the operation used to obtain it. This way, we can easily identify which variable is the result of which transformation.
print(math_combinator.combination_dict_)
{'LotTotal': 'sum'}
We can see that the transformed dataset contains the additional variable:
print(X_train_.loc[:,['LotFrontage', 'LotArea', 'LotTotal']].head())
      LotFrontage  LotArea  LotTotal
64            0.0     9375    9375.0
682           0.0     2887    2887.0
960          50.0     7207    7257.0
1384         60.0     9060    9120.0
1100         60.0     8400    8460.0
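As a quick sanity check, the five rows printed above can be reproduced with plain pandas. This is only a sketch: the values are re-typed by hand from the output, and the row-wise sum is an assumption about what the 'sum' operation computes. The `combination` dict below is an ordinary Python dict that mirrors the printed mapping, not the transformer's attribute.

```python
import pandas as pd

# The five rows shown in the output above, re-typed by hand.
sample = pd.DataFrame(
    {
        "LotFrontage": [0.0, 0.0, 50.0, 60.0, 60.0],
        "LotArea": [9375, 2887, 7207, 9060, 8400],
    },
    index=[64, 682, 960, 1384, 1100],
)

# Row-wise sum of the two columns (assumed behaviour of 'sum').
sample["LotTotal"] = sample[["LotFrontage", "LotArea"]].sum(axis=1)

# Plain dict mirroring the {new variable: operation} mapping printed earlier.
combination = {"LotTotal": "sum"}

print(sample)
```

Because the data was loaded with fillna(0), a LotFrontage of 0.0 contributes nothing to the sum, so LotTotal equals LotArea in those rows.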
new_variables_names
Even though the transformer can combine variables automatically, it was originally designed to combine variables based on domain knowledge. In that case, we normally want to give the new variables meaningful names. We can do so through the parameter new_variables_names.
new_variables_names takes a list of strings with the new variable names. In this parameter, you enter a name or a list of names for the newly created features (recommended). You must enter one name for each mathematical operation indicated in the math_operations parameter. That is, if you want to compute the mean and the sum of the features, you should enter 2 new variable names. If you compute only the mean of the features, enter 1 variable name. Alternatively, if you choose to perform all the mathematical operations, enter 6 new variable names.
The order of the new variable names should match the order in which the mathematical operations are listed in the transformer. That is, if you set math_operations=['mean', 'prod'], the first new variable name will be assigned to the mean of the variables and the second variable name to the product of the variables.
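The pairing between operations and names can be sketched with a small helper in plain pandas. The function `combine` and the `SUPPORTED` list are hypothetical names introduced here for illustration; the strings 'std', 'max' and 'min' are assumed by analogy with 'sum', 'mean' and 'prod', and none of this is the library's own code.

```python
import pandas as pd

# Assumed operation names; only 'sum', 'mean' and 'prod' appear in the text above.
SUPPORTED = ["sum", "prod", "mean", "std", "max", "min"]


def combine(X, variables, operations, names):
    """Hypothetical helper: add one new column per (operation, name) pair."""
    if len(operations) != len(names):
        raise ValueError("Enter exactly one new name per mathematical operation.")
    Xt = X.copy()
    # zip pairs the i-th operation with the i-th name, so order matters.
    for op, name in zip(operations, names):
        if op not in SUPPORTED:
            raise ValueError(f"Unsupported operation: {op}")
        # Call the pandas method with the same name, row by row.
        Xt[name] = getattr(X[variables], op)(axis=1)
    return Xt


X = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
Xt = combine(X, ["a", "b"], ["mean", "prod"], ["ab_mean", "ab_prod"])
print(Xt)
```

Swapping the two names while leaving the operations in place would label the mean as the product and vice versa, which is why the two lists need to be kept in the same order.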
More details
You can find creative ways to use MathematicalCombination() in the following Jupyter notebooks and Kaggle kernels.