DatetimeSubtraction#
Very often, we have datetime variables in our datasets, and we want to determine the time
difference between them. For example, if we work with financial data, we may have the
variable date of loan application, with the date and time when the customer applied for
a loan, and also the variable date of birth, with the customer’s date of birth. With those
two variables, we want to infer the age of the customer at the time of application. In order
to do this, we can compute the difference in years between date_of_loan_application and date_of_birth and capture it in a new variable.
In a different example, if we are trying to predict the price of a house and we know the year in which it was built, we can infer the age of the house at the point of sale. Generally, older houses cost less. To calculate the age of the house, we simply compute the difference in years between the sale date and the date on which it was built.
Python offers many options for operating on datetime objects, for example, the datetime module. Since you will most likely be working with pandas dataframes, this guide focuses first on pandas and then on how to automate the procedure with Feature-engine.
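As a minimal sketch of the datetime-module route, the snippet below computes a customer's age at the time of a loan application. The dates and the helper name age_in_years are made up for illustration and do not come from any library.

```python
from datetime import datetime


def age_in_years(date_of_birth: datetime, reference_date: datetime) -> int:
    """Return the number of full years between two datetimes (hypothetical helper)."""
    years = reference_date.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# Illustrative values, not taken from a real dataset.
date_of_birth = datetime(1990, 6, 15)
date_of_loan_application = datetime(2023, 3, 1, 14, 30)

# Subtracting two datetime objects returns a timedelta.
elapsed = date_of_loan_application - date_of_birth
print(elapsed.days)

# Age in full years at the time of application.
print(age_in_years(date_of_birth, date_of_loan_application))
```

A timedelta only exposes days and seconds, which is why the age in full years is derived from the calendar fields instead.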
Subtracting datetime features with pandas#
In Python, we can subtract datetime objects with pandas. To work with datetime variables in pandas, we need to make sure that the timestamp, which can be stored in various formats, like strings (str), objects ("O"), or datetime, is cast as a datetime. If it is not, we can convert strings to datetime objects by executing pd.to_datetime(df[variable_of_interest]).
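The check-and-cast step described above could look like the following sketch. The dataframe and the column name application_date are invented for the example.

```python
import pandas as pd

# Toy dataframe with dates stored as strings (illustrative column name).
df = pd.DataFrame({"application_date": ["2022-09-01", "2022-10-01", "2022-12-01"]})

# Check whether the column is already a datetime; here it is not.
print(pd.api.types.is_datetime64_any_dtype(df["application_date"]))

# Cast the strings to datetime so that they can be subtracted later.
df["application_date"] = pd.to_datetime(df["application_date"])
print(df["application_date"].dtype)
```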
Let’s create a toy dataframe with 2 datetime variables for a short demo:
import numpy as np
import pandas as pd
data = pd.DataFrame({
    "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
    "date2": pd.date_range("2018-03-05", periods=5, freq="W"),
})
print(data)
This is the data that we created, containing two datetime variables:
date1 date2
0 2019-03-05 2018-03-11
1 2019-03-06 2018-03-18
2 2019-03-07 2018-03-25
3 2019-03-08 2018-04-01
4 2019-03-09 2018-04-08
Now, we can subtract date2 from date1 and capture the difference in a new variable by using the pandas sub() method:
data["diff"] = data["date1"].sub(data["date2"])
print(data)
The new variable, which expresses the difference in number of days, is at the right of the dataframe:
date1 date2 diff
0 2019-03-05 2018-03-11 359 days
1 2019-03-06 2018-03-18 353 days
2 2019-03-07 2018-03-25 347 days
3 2019-03-08 2018-04-01 341 days
4 2019-03-09 2018-04-08 335 days
If we want the units in something other than days, we can divide the difference by a NumPy timedelta64. The following example shows how to express the difference in years:
data["diff"] = data["date1"].sub(data["date2"], axis=0).div(
    np.timedelta64(1, "Y").astype("timedelta64[ns]"))
print(data)
We see the new variable now expressing the difference in years, at the right of the dataframe:
date1 date2 diff
0 2019-03-05 2018-03-11 0.982909
1 2019-03-06 2018-03-18 0.966481
2 2019-03-07 2018-03-25 0.950054
3 2019-03-08 2018-04-01 0.933626
4 2019-03-09 2018-04-08 0.917199
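The values in years are consistent with NumPy treating one year as an average of 365.2425 days (359 / 365.2425 ≈ 0.982909). For fixed-length units such as days, weeks, or hours, a sketch using pandas alone could look like this; the new column names are chosen only for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
    "date2": pd.date_range("2018-03-05", periods=5, freq="W"),
})

diff = data["date1"] - data["date2"]

# Dividing by a Timedelta returns plain floats expressed in that unit.
data["diff_days"] = diff / pd.Timedelta(days=1)
data["diff_weeks"] = diff / pd.Timedelta(weeks=1)

# total_seconds() returns seconds, which we scale to hours.
data["diff_hours"] = diff.dt.total_seconds() / 3600

print(data[["diff_days", "diff_weeks", "diff_hours"]])
```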
If you wanted to subtract several datetime variables, you would have to write a line of code for every subtraction. Fortunately, we can automate this procedure with DatetimeSubtraction().
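To make the manual effort concrete, here is a rough pandas-only sketch that loops over every pair of columns. The toy data and the `<variable>_sub_<reference>` naming pattern are choices made for this example, not a description of how any library is implemented.

```python
import pandas as pd

data = pd.DataFrame({
    "date1": pd.to_datetime(["2022-09-01", "2022-10-01", "2022-12-01"]),
    "date2": pd.to_datetime(["2022-09-15", "2022-10-15", "2022-12-15"]),
    "date3": pd.to_datetime(["2022-08-01", "2022-09-01", "2022-11-01"]),
    "date4": pd.to_datetime(["2022-08-15", "2022-09-15", "2022-11-15"]),
})

variables = ["date1", "date2"]  # left-hand side of each subtraction
reference = ["date3", "date4"]  # right-hand side of each subtraction

# One subtraction per (variable, reference) pair, expressed in days.
for ref in reference:
    for var in variables:
        data[f"{var}_sub_{ref}"] = (data[var] - data[ref]) / pd.Timedelta(days=1)

print(data.filter(like="_sub_"))
```

With two variables on each side, the loop already creates four new columns, and the bookkeeping grows quickly as more variables are added.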
Datetime subtraction with Feature-engine#
DatetimeSubtraction() automatically subtracts several date and time features from each other. You just need to indicate the features on the left of the subtraction operation in the variables parameter and those on the right in the reference parameter. In other words, each new feature is computed as variable minus reference. You can also change the output unit through the output_unit parameter.
DatetimeSubtraction() works with variables whose dtype is datetime, as well as with object-like and categorical variables, provided that they can be parsed into datetime format. The transformer does this parsing under the hood.
Following up with the former example, here is how we obtain the difference in years using DatetimeSubtraction():
import pandas as pd
from feature_engine.datetime import DatetimeSubtraction
data = pd.DataFrame({
    "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
    "date2": pd.date_range("2018-03-05", periods=5, freq="W"),
})
dtf = DatetimeSubtraction(
    variables="date1",
    reference="date2",
    output_unit="Y",
)
data = dtf.fit_transform(data)
print(data)
With transform(), DatetimeSubtraction() returns a new dataframe containing the original variables and also the new variables with the time difference:
date1 date2 date1_sub_date2
0 2019-03-05 2018-03-11 0.982909
1 2019-03-06 2018-03-18 0.966481
2 2019-03-07 2018-03-25 0.950054
3 2019-03-08 2018-04-01 0.933626
4 2019-03-09 2018-04-08 0.917199
Drop original variables after computation#
We have the option to drop the original datetime variables after the computation:
import pandas as pd
from feature_engine.datetime import DatetimeSubtraction
data = pd.DataFrame({
    "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
    "date2": pd.date_range("2018-03-05", periods=5, freq="W"),
})
dtf = DatetimeSubtraction(
    variables="date1",
    reference="date2",
    output_unit="M",
    drop_original=True,
)
data = dtf.fit_transform(data)
print(data)
In this case, the resulting dataframe contains only the time difference between the two original variables:
date1_sub_date2
0 11.794903
1 11.597774
2 11.400645
3 11.203515
4 11.006386
Subtract multiple variables simultaneously#
We can perform multiple subtractions at the same time. In this example, we will add new datetime variables to the toy dataframe as strings. The idea is to show that DatetimeSubtraction() will convert those strings to datetime under the hood to carry out the subtraction operation.
import pandas as pd
from feature_engine.datetime import DatetimeSubtraction
data = pd.DataFrame({
    "date1": ["2022-09-01", "2022-10-01", "2022-12-01"],
    "date2": ["2022-09-15", "2022-10-15", "2022-12-15"],
    "date3": ["2022-08-01", "2022-09-01", "2022-11-01"],
    "date4": ["2022-08-15", "2022-09-15", "2022-11-15"],
})
dtf = DatetimeSubtraction(variables=["date1", "date2"], reference=["date3", "date4"])
data = dtf.fit_transform(data)
print(data)
The resulting dataframe contains the original variables plus the new variables expressing the time difference between the date objects.
date1 date2 date3 date4 date1_sub_date3 \
0 2022-09-01 2022-09-15 2022-08-01 2022-08-15 31.0
1 2022-10-01 2022-10-15 2022-09-01 2022-09-15 30.0
2 2022-12-01 2022-12-15 2022-11-01 2022-11-15 30.0
date2_sub_date3 date1_sub_date4 date2_sub_date4
0 45.0 17.0 31.0
1 44.0 16.0 30.0
2 44.0 16.0 30.0
Working with missing values#
By default, DatetimeSubtraction() will raise an error if the dataframe passed to the fit() or transform() methods contains NA in the variables to subtract. We can override this behaviour and allow computations between variables with NaN by setting the parameter missing_values to "ignore". Here is a code example:
import numpy as np
import pandas as pd
from feature_engine.datetime import DatetimeSubtraction
data = pd.DataFrame({
    "date1": ["2022-09-01", "2022-10-01", "2022-12-01"],
    "date2": ["2022-09-15", np.nan, "2022-12-15"],
    "date3": ["2022-08-01", "2022-09-01", "2022-11-01"],
    "date4": ["2022-08-15", "2022-09-15", np.nan],
})
dtf = DatetimeSubtraction(
    variables=["date1", "date2"],
    reference=["date3", "date4"],
    missing_values="ignore",
)
data = dtf.fit_transform(data)
print(data)
When any of the variables contains NaN, the new features with the time difference will also display NaN:
date1 date2 date3 date4 date1_sub_date3 \
0 2022-09-01 2022-09-15 2022-08-01 2022-08-15 31.0
1 2022-10-01 NaN 2022-09-01 2022-09-15 30.0
2 2022-12-01 2022-12-15 2022-11-01 NaN 30.0
date2_sub_date3 date1_sub_date4 date2_sub_date4
0 45.0 17.0 31.0
1 NaN 16.0 NaN
2 44.0 NaN NaN
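The same propagation of missing values can be observed in plain pandas. The sketch below reuses two of the columns from the example and is meant only to illustrate the arithmetic, not the transformer's internals.

```python
import numpy as np
import pandas as pd

date2 = pd.to_datetime(pd.Series(["2022-09-15", np.nan, "2022-12-15"]))
date3 = pd.to_datetime(pd.Series(["2022-08-01", "2022-09-01", "2022-11-01"]))

# The missing entry becomes NaT (not-a-time) after parsing.
print(date2)

# NaT propagates through the subtraction and shows up as NaN after unit conversion.
diff_days = (date2 - date3) / pd.Timedelta(days=1)
print(diff_days)
```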
Working with different timezones#
If we have timestamps in different timezones or variables in different timezones, we can still perform subtraction operations with DatetimeSubtraction() by first converting all timestamps to Coordinated Universal Time (UTC). Here is a code example, where we return the time difference in milliseconds:
import pandas as pd
from feature_engine.datetime import DatetimeSubtraction
data = pd.DataFrame({
    "date1": ['12:34:45+3', '23:01:02-6', '11:59:21-8', '08:44:23Z'],
    "date2": ['09:34:45+1', '23:01:02-6+1', '11:59:21-8-2', '08:44:23+3'],
})
dfts = DatetimeSubtraction(
    variables="date1",
    reference="date2",
    utc=True,
    output_unit="ms",
    format="mixed",
)
new = dfts.fit_transform(data)
print(new)
We see the resulting dataframe with the time difference in milliseconds:
date1 date2 date1_sub_date2
0 12:34:45+3 09:34:45+1 3600000.0
1 23:01:02-6 23:01:02-6+1 25200000.0
2 11:59:21-8 11:59:21-8-2 21600000.0
3 08:44:23Z 08:44:23+3 10800000.0
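To see why converting to UTC makes timestamps with different offsets comparable, here is a small pandas-only sketch. It uses full ISO 8601 timestamps with an arbitrary date, chosen for illustration, rather than the time-only strings above.

```python
import pandas as pd

# Timestamps with explicit, differing UTC offsets (illustrative values).
date1 = pd.to_datetime(
    pd.Series(["2022-01-01 12:34:45+03:00", "2022-01-01 08:44:23+00:00"]), utc=True
)
date2 = pd.to_datetime(
    pd.Series(["2022-01-01 09:34:45+01:00", "2022-01-01 08:44:23+03:00"]), utc=True
)

# With utc=True, both series are expressed in UTC, so they can be subtracted.
diff_ms = (date1 - date2) / pd.Timedelta(milliseconds=1)
print(diff_ms.tolist())
```

The first pair differs by one hour once both are in UTC, and the second pair by three hours, which corresponds to 3,600,000 and 10,800,000 milliseconds.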
Adding arbitrary names to the new variables#
Often, we want to compute just a few time differences. In this case, we may also want to assign specific names to the new variables. In this code example, we do so:
import pandas as pd
from feature_engine.datetime import DatetimeSubtraction
data = pd.DataFrame({
    "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
    "date2": pd.date_range("2018-03-05", periods=5, freq="W"),
})
dtf = DatetimeSubtraction(
    variables="date1",
    reference="date2",
    new_variables_names=["my_new_var"],
)
data = dtf.fit_transform(data)
print(data)
In the resulting dataframe, we see that the time difference was captured in a variable called my_new_var:
date1 date2 my_new_var
0 2019-03-05 2018-03-11 359.0
1 2019-03-06 2018-03-18 353.0
2 2019-03-07 2018-03-25 347.0
3 2019-03-08 2018-04-01 341.0
4 2019-03-09 2018-04-08 335.0
We should be mindful to pass a list containing as many names as new variables. The number of variables that will be created is obtained by multiplying the number of variables in the parameter variables by the number of variables in the parameter reference.
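Before passing custom names, it can help to check the expected count. The following is only a sketch: the custom names are hypothetical, and pairing them with the default-style names assumes that custom names are applied in the same order in which the default names appear in this guide's outputs, which is worth verifying on your own data.

```python
variables = ["date1", "date2"]
reference = ["date3", "date4"]

# Expected number of new features: one per (variable, reference) pair.
n_new = len(variables) * len(reference)

# Default-style names, listed in the order shown in this guide's outputs.
default_names = [f"{var}_sub_{ref}" for ref in reference for var in variables]

# Hypothetical custom names; the list must contain n_new entries.
new_variables_names = ["d1_d3", "d2_d3", "d1_d4", "d2_d4"]
if len(new_variables_names) != n_new:
    raise ValueError(f"Expected {n_new} names, got {len(new_variables_names)}.")

# Assumed pairing between default-style and custom names.
print(dict(zip(default_names, new_variables_names)))
```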
get_feature_names_out()#
Finally, we can extract the names of the features in the transformed dataframe for compatibility with the Scikit-learn pipeline:
import pandas as pd
from feature_engine.datetime import DatetimeSubtraction
data = pd.DataFrame({
    "date1": ["2022-09-01", "2022-10-01", "2022-12-01"],
    "date2": ["2022-09-15", "2022-10-15", "2022-12-15"],
    "date3": ["2022-08-01", "2022-09-01", "2022-11-01"],
    "date4": ["2022-08-15", "2022-09-15", "2022-11-15"],
})
dtf = DatetimeSubtraction(variables=["date1", "date2"], reference=["date3", "date4"])
dtf.fit(data)
dtf.get_feature_names_out()
Below are the names of the variables that will appear in any dataframe resulting from applying the transform() method:
['date1',
'date2',
'date3',
'date4',
'date1_sub_date3',
'date2_sub_date3',
'date1_sub_date4',
'date2_sub_date4']
Combining extraction and subtraction of datetime features#
We can also combine the creation of numerical variables from datetime features with the creation of new features by subtraction of datetime variables:
import pandas as pd
from sklearn.pipeline import Pipeline
from feature_engine.datetime import DatetimeFeatures, DatetimeSubtraction
data = pd.DataFrame({
    "date1": ["2022-09-01", "2022-10-01", "2022-12-01"],
    "date2": ["2022-09-15", "2022-10-15", "2022-12-15"],
    "date3": ["2022-08-01", "2022-09-01", "2022-11-01"],
    "date4": ["2022-08-15", "2022-09-15", "2022-11-15"],
})
dtf = DatetimeFeatures(variables=["date1", "date2"], drop_original=False)
dts = DatetimeSubtraction(
    variables=["date1", "date2"],
    reference=["date3", "date4"],
    drop_original=True,
)
pipe = Pipeline([
    ("features", dtf),
    ("subtraction", dts),
])
data = pipe.fit_transform(data)
print(data)
In the following output, we see the new dataframe containing the features that were extracted from the different datetime variables, followed by those created by capturing the time difference:
date1_month date1_year date1_day_of_week date1_day_of_month date1_hour \
0 9 2022 3 1 0
1 10 2022 5 1 0
2 12 2022 3 1 0
date1_minute date1_second date2_month date2_year date2_day_of_week \
0 0 0 9 2022 3
1 0 0 10 2022 5
2 0 0 12 2022 3
date2_day_of_month date2_hour date2_minute date2_second \
0 15 0 0 0
1 15 0 0 0
2 15 0 0 0
date1_sub_date3 date2_sub_date3 date1_sub_date4 date2_sub_date4
0 31.0 45.0 17.0 31.0
1 30.0 44.0 16.0 30.0
2 30.0 44.0 16.0 30.0