MathFeatures#
MathFeatures()
applies basic functions to groups of features, returning one or
more additional variables as a result. It uses pandas.agg()
to create the features,
so in essence, you can pass any function that is accepted by this method. One exception
is that MathFeatures()
does not accept dictionaries for the parameter func
.
The functions can be passed as strings, numpy methods, i.e., np.mean, or any function that you create, as long as, it returns a scalar from a vector.
For supported aggregation functions, see pandas documentation.
As an example, if we have the variables:
number_payments_first_quarter
number_payments_second_quarter
number_payments_third_quarter
number_payments_fourth_quarter
we can use MathFeatures()
to calculate the total number of payments
and mean number of payments as follows:
transformer = MathFeatures(
variables=[
'number_payments_first_quarter',
'number_payments_second_quarter',
'number_payments_third_quarter',
'number_payments_fourth_quarter'
],
func=['sum','mean'],
new_variables_name=[
'total_number_payments',
'mean_number_payments'
]
)
Xt = transformer.fit_transform(X)
The transformed dataset, Xt, will contain the additional features total_number_payments and mean_number_payments, plus the original set of variables.
The variable total_number_payments is obtained by adding up the features
indicated in variables
, whereas the variable mean_number_payments is
the mean of those 4 features.
Examples#
Let’s dive into how we can use MathFeatures()
in more details. Let’s first
create a toy dataset:
import numpy as np
import pandas as pd
from feature_engine.creation import MathFeatures
df = pd.DataFrame.from_dict(
{
"Name": ["tom", "nick", "krish", "jack"],
"City": ["London", "Manchester", "Liverpool", "Bristol"],
"Age": [20, 21, 19, 18],
"Marks": [0.9, 0.8, 0.7, 0.6],
"dob": pd.date_range("2020-02-24", periods=4, freq="T"),
})
print(df)
The dataset looks like this:
Name City Age Marks dob
0 tom London 20 0.9 2020-02-24 00:00:00
1 nick Manchester 21 0.8 2020-02-24 00:01:00
2 krish Liverpool 19 0.7 2020-02-24 00:02:00
3 jack Bristol 18 0.6 2020-02-24 00:03:00
We can now apply several functions over the numerical variables Age and Marks using strings to indicate the functions:
transformer = MathFeatures(
variables=["Age", "Marks"],
func = ["sum", "prod", "min", "max", "std"],
)
df_t = transformer.fit_transform(df)
print(df_t)
And we obtain the following dataset, where the new variables are named after the function used to obtain them, plus the group of variables that were used in the computation:
Name City Age Marks dob sum_Age_Marks \
0 tom London 20 0.9 2020-02-24 00:00:00 20.9
1 nick Manchester 21 0.8 2020-02-24 00:01:00 21.8
2 krish Liverpool 19 0.7 2020-02-24 00:02:00 19.7
3 jack Bristol 18 0.6 2020-02-24 00:03:00 18.6
prod_Age_Marks min_Age_Marks max_Age_Marks std_Age_Marks
0 18.0 0.9 20.0 13.505740
1 16.8 0.8 21.0 14.283557
2 13.3 0.7 19.0 12.940054
3 10.8 0.6 18.0 12.303658
For more flexibility, we can pass existing functions to the func
argument as follows:
transformer = MathFeatures(
variables=["Age", "Marks"],
func = [np.sum, np.prod, np.min, np.max, np.std],
)
df_t = transformer.fit_transform(df)
print(df_t)
And we obtain the following dataframe:
Name City Age Marks dob sum_Age_Marks \
0 tom London 20 0.9 2020-02-24 00:00:00 20.9
1 nick Manchester 21 0.8 2020-02-24 00:01:00 21.8
2 krish Liverpool 19 0.7 2020-02-24 00:02:00 19.7
3 jack Bristol 18 0.6 2020-02-24 00:03:00 18.6
prod_Age_Marks amin_Age_Marks amax_Age_Marks std_Age_Marks
0 18.0 0.9 20.0 13.505740
1 16.8 0.8 21.0 14.283557
2 13.3 0.7 19.0 12.940054
3 10.8 0.6 18.0 12.303658
We have the option to set the parameter drop_original
to True to drop the variables
after performing the calculations.
We can obtain the names of all the features in the transformed data as follows:
transformer.get_feature_names_out(input_features=None)
Which will return the names of all the variables in the transformed data:
['Name',
'City',
'Age',
'Marks',
'dob',
'sum_Age_Marks',
'prod_Age_Marks',
'amin_Age_Marks',
'amax_Age_Marks',
'std_Age_Marks']
New variables names#
Even though the transfomer allows to combine variables automatically, its use is intended
to combine variables with domain knowledge. In this case, we normally want to
give meaningful names to the variables. We can do so through the parameter
new_variables_names
.
new_variables_names
takes a list of strings, with the new variable names. In this
parameter, you need to enter a list of names for the newly created features. You must
enter one name for each function indicated in the func
parameter.
That is, if you want to perform mean and sum of features, you should enter 2 new
variable names. If you compute only the mean of features, enter 1 variable name.
The name of the variables should coincide with the order of the functions in func
.
That is, if you set func = ['mean', 'prod']
, the first new variable name will be
assigned to the mean of the variables and the second variable name to the product of the
variables.
Let’s look at an example. In the following code snippet, we add up, and find the maximum and minimum value of 2 variables, which results in 3 new features. We add the names for the new features in a list:
transformer = MathFeatures(
variables=["Age", "Marks"],
func = ["sum", "min", "max"],
new_variables_names = ["sum_vars", "min_vars", "max_vars"]
)
df_t = transformer.fit_transform(df)
print(df_t)
The resulting dataframe contains the new features under the variable names that we provided:
Name City Age Marks dob sum_vars min_vars \
0 tom London 20 0.9 2020-02-24 00:00:00 20.9 0.9
1 nick Manchester 21 0.8 2020-02-24 00:01:00 21.8 0.8
2 krish Liverpool 19 0.7 2020-02-24 00:02:00 19.7 0.7
3 jack Bristol 18 0.6 2020-02-24 00:03:00 18.6 0.6
max_vars
0 20.0
1 21.0
2 19.0
3 18.0
Additional resources#
For more details about this and other feature engineering methods check out these resources:
Or read our book:
Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.