DatetimeSubtraction#

Very often, we have datetime variables in our datasets, and we want to determine the time difference between them. For example, if we work with financial data, we may have the variable date of loan application, with the date and time when the customer applied for a loan, and also the variable date of birth, with the customer’s date of birth. With those two variables, we want to infer the age of the customer at the time of application. In order to do this, we can compute the difference in years between date_of_loan_application and date_of_birth and capture it in a new variable.

In a different example, if we are trying to predict the price of the house and we have information about the year in which the house was built, we can infer the age of the house at the point of sale. Generally, older houses cost less. To calculate the age of the house, we’d simply compute the difference in years between the sale date and the date at which it was built.

The Python program offers many options for making operations between datetime objects, like, for example, the datetime module. Since most likely you will be working with Pandas dataframes, we will focus this guide on pandas and then how we can automate the procedure with Feature-engine.

Subtracting datetime features with pandas#

In Python, we can subtract datetime objects with pandas. To work with datetime variables in pandas, we need to make sure that the timestamp, which can be represented in various formats, like strings (str), objects ("O"), or datetime, is cast as a datetime. If not, we can convert strings to datetime objects by executing pd.to_datetime(df[variable_of_interest]).

Let’s create a toy dataframe with 2 datetime variables for a short demo:

import numpy as np
import pandas as pd

data = pd.DataFrame({
    "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
    "date2": pd.date_range("2018-03-05", periods=5, freq="W")})

print(data)

This is the data that we created, containing two datetime variables:

       date1      date2
0 2019-03-05 2018-03-11
1 2019-03-06 2018-03-18
2 2019-03-07 2018-03-25
3 2019-03-08 2018-04-01
4 2019-03-09 2018-04-08

Now, we can subtract date2 from date1 and capture the difference in a new variable by utilizing the pandas subtraction operator:

data["diff"] = data["date1"].sub(data["date2"])

print(data)

The new variable, which expresses the difference in number of days, is at the right of the dataframe:

       date1      date2     diff
0 2019-03-05 2018-03-11 359 days
1 2019-03-06 2018-03-18 353 days
2 2019-03-07 2018-03-25 347 days
3 2019-03-08 2018-04-01 341 days
4 2019-03-09 2018-04-08 335 days

If we want the units in something other than days, we can use numpy’s timedelta. The following example shows how to use this syntax:

data["diff"] = data["date1"].sub(data["date2"], axis=0).div(
    np.timedelta64(1, "Y").astype("timedelta64[ns]"))

print(data)

We see the new variable now expressing the difference in years, at the right of the dataframe:

       date1      date2      diff
0 2019-03-05 2018-03-11  0.982909
1 2019-03-06 2018-03-18  0.966481
2 2019-03-07 2018-03-25  0.950054
3 2019-03-08 2018-04-01  0.933626
4 2019-03-09 2018-04-08  0.917199

If you wanted to subtract various datetime variables, you would have to write lines of code for every subtraction. Fortunately, we can automate this procedure with DatetimeSubstraction().

Datetime subtraction with Feature-engine#

DatetimeSubstraction() automatically subtracts several date and time features from each other. You just need to indicate the features at the right of the subtraction operation in the variables parameters and those on the left in the reference parameter. You can also change the output unit through the output_unit parameter.

DatetimeSubstraction() works with variables whose dtype is datetime, as well as with object-like and categorical variables, provided that they can be parsed into datetime format. This will be done under the hood by the transformer.

Following up with the former example, here is how we obtain the difference in number of days using DatetimeSubstraction():

import pandas as pd
from feature_engine.datetime import DatetimeSubtraction

data = pd.DataFrame({
    "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
    "date2": pd.date_range("2018-03-05", periods=5, freq="W")})

dtf = DatetimeSubtraction(
    variables="date1",
    reference="date2",
    output_unit="Y")

data = dtf.fit_transform(data)

print(data)

With transform(), DatetimeSubstraction() returns a new dataframe containing the original variables and also the new variables with the time difference:

       date1      date2  date1_sub_date2
0 2019-03-05 2018-03-11         0.982909
1 2019-03-06 2018-03-18         0.966481
2 2019-03-07 2018-03-25         0.950054
3 2019-03-08 2018-04-01         0.933626
4 2019-03-09 2018-04-08         0.917199

Drop original variables after computation#

We have the option to drop the original datetime variables after the computation:

import pandas as pd
from feature_engine.datetime import DatetimeSubtraction

data = pd.DataFrame({
    "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
    "date2": pd.date_range("2018-03-05", periods=5, freq="W")})

dtf = DatetimeSubtraction(
    variables="date1",
    reference="date2",
    output_unit="M",
    drop_original=True
)

data = dtf.fit_transform(data)

print(data)

In this case, the resulting dataframe contains only the time difference between the two original variables:

   date1_sub_date2
0        11.794903
1        11.597774
2        11.400645
3        11.203515
4        11.006386

Subtract multiple variables simultaneously#

We can perform multiple subtractions at the same time. In this example, we will add new datetime variables to the toy dataframe as strings. The idea is to show that DatetimeSubstraction() will convert those strings to datetime under the hood to carry out the subtraction operation.

import pandas as pd
from feature_engine.datetime import DatetimeSubtraction

data = pd.DataFrame({
    "date1" : ["2022-09-01", "2022-10-01", "2022-12-01"],
    "date2" : ["2022-09-15", "2022-10-15", "2022-12-15"],
    "date3" : ["2022-08-01", "2022-09-01", "2022-11-01"],
    "date4" : ["2022-08-15", "2022-09-15", "2022-11-15"],
})

dtf = DatetimeSubtraction(variables=["date1", "date2"], reference=["date3", "date4"])

data = dtf.fit_transform(data)

print(data)

The resulting dataframe contains the original variables plus the new variables expressing the time difference between the date objects.

        date1       date2       date3       date4  date1_sub_date3  \
0  2022-09-01  2022-09-15  2022-08-01  2022-08-15             31.0
1  2022-10-01  2022-10-15  2022-09-01  2022-09-15             30.0
2  2022-12-01  2022-12-15  2022-11-01  2022-11-15             30.0

   date2_sub_date3  date1_sub_date4  date2_sub_date4
0             45.0             17.0             31.0
1             44.0             16.0             30.0
2             44.0             16.0             30.0

Working with missing values#

By default, DatetimeSubstraction() will raise an error if the dataframe passed to the fit() or transform() methods contains NA in the variables to subtract. We can override this behaviour and allow computations between variables with nan by setting the parameter missing_values to "ignore". Here is a code example:

import numpy as np
import pandas as pd
from feature_engine.datetime import DatetimeSubtraction

data = pd.DataFrame({
    "date1" : ["2022-09-01", "2022-10-01", "2022-12-01"],
    "date2" : ["2022-09-15", np.nan, "2022-12-15"],
    "date3" : ["2022-08-01", "2022-09-01", "2022-11-01"],
    "date4" : ["2022-08-15", "2022-09-15", np.nan],
})

dtf = DatetimeSubtraction(
    variables=["date1", "date2"],
    reference=["date3", "date4"],
    missing_values="ignore")

data = dtf.fit_transform(data)

print(data)

When any of the variables contains NAN, the new features with the time difference will also display NANs:

        date1       date2       date3       date4  date1_sub_date3  \
0  2022-09-01  2022-09-15  2022-08-01  2022-08-15             31.0
1  2022-10-01         NaN  2022-09-01  2022-09-15             30.0
2  2022-12-01  2022-12-15  2022-11-01         NaN             30.0

   date2_sub_date3  date1_sub_date4  date2_sub_date4
0             45.0             17.0             31.0
1              NaN             16.0              NaN
2             44.0              NaN              NaN

Working with different timezones#

If we have timestamps in different timezones or variables in different timezones, we can still perform subtraction operations with DatetimeSubstraction() by first setting all timestamps to the universal central time zone. Here is a code example, were we return the time difference in microseconds:

import pandas as pd
from feature_engine.datetime import DatetimeSubtraction

data = pd.DataFrame({
    "date1": ['12:34:45+3', '23:01:02-6', '11:59:21-8', '08:44:23Z'],
    "date2": ['09:34:45+1', '23:01:02-6+1', '11:59:21-8-2', '08:44:23+3']
})

dfts = DatetimeSubtraction(
    variables="date1",
    reference="date2",
    utc=True,
    output_unit="ms",
    format="mixed"
)

new = dfts.fit_transform(data)

print(new)

We see the resulting dataframe with the time difference in microseconds:

        date1         date2  date1_sub_date2
0  12:34:45+3    09:34:45+1        3600000.0
1  23:01:02-6  23:01:02-6+1       25200000.0
2  11:59:21-8  11:59:21-8-2       21600000.0
3   08:44:23Z    08:44:23+3       10800000.0

Adding arbitrary names to the new variables#

Often, we want to compute just a few time differences. In this case, we may want as well to assign the new variables specific names. In this code example, we do so:

import pandas as pd
from feature_engine.datetime import DatetimeSubtraction

data = pd.DataFrame({
    "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
    "date2": pd.date_range("2018-03-05", periods=5, freq="W")})

dtf = DatetimeSubtraction(
    variables="date1",
    reference="date2",
    new_variables_names=["my_new_var"]
    )

data = dtf.fit_transform(data)

print(data)

In the resulting dataframe, we see that the time difference was captured in a variable called my_new_var:

       date1      date2  my_new_var
0 2019-03-05 2018-03-11       359.0
1 2019-03-06 2018-03-18       353.0
2 2019-03-07 2018-03-25       347.0
3 2019-03-08 2018-04-01       341.0
4 2019-03-09 2018-04-08       335.0

We should be mindful to pass a list of variales containing as many names as new variables. The number of variables that will be created is obtained by multiplying the number of variables in the parameter variables by the number of variables in the parameter reference.

get_feature_names_out()#

Finally, we can extract the names of the transformed dataframe for compatibility with the Scikit-learn pipeline:

import pandas as pd
from feature_engine.datetime import DatetimeSubtraction

data = pd.DataFrame({
    "date1" : ["2022-09-01", "2022-10-01", "2022-12-01"],
    "date2" : ["2022-09-15", "2022-10-15", "2022-12-15"],
    "date3" : ["2022-08-01", "2022-09-01", "2022-11-01"],
    "date4" : ["2022-08-15", "2022-09-15", "2022-11-15"],
})

dtf = DatetimeSubtraction(variables=["date1", "date2"], reference=["date3", "date4"])
dtf.fit(data)

dtf.get_feature_names_out()

Below the name of the variables that will appear in any dataframe resulting from applying the transform() method:

['date1',
 'date2',
 'date3',
 'date4',
 'date1_sub_date3',
 'date2_sub_date3',
 'date1_sub_date4',
 'date2_sub_date4']

Combining extraction and subtraction of datetime features#

We can also combine the creation of numerical variables from datetime features with the creation of new features by subtraction of datetime variables:

import pandas as pd
from sklearn.pipeline import Pipeline
from feature_engine.datetime import DatetimeFeatures, DatetimeSubtraction

data = pd.DataFrame({
    "date1" : ["2022-09-01", "2022-10-01", "2022-12-01"],
    "date2" : ["2022-09-15", "2022-10-15", "2022-12-15"],
    "date3" : ["2022-08-01", "2022-09-01", "2022-11-01"],
    "date4" : ["2022-08-15", "2022-09-15", "2022-11-15"],
})

dtf = DatetimeFeatures(variables=["date1", "date2"], drop_original=False)
dts = DatetimeSubtraction(
    variables=["date1", "date2"],
    reference=["date3", "date4"],
    drop_original=True,
)

pipe = Pipeline([
    ("features", dtf),("subtraction", dts)
])

data = pipe.fit_transform(data)

print(data)

In the following output we see the new dataframe contaning the features that were extracted from the different datetime variables followed by those created by capturing the time difference:

   date1_month  date1_year  date1_day_of_week  date1_day_of_month  date1_hour  \
0            9        2022                  3                   1           0
1           10        2022                  5                   1           0
2           12        2022                  3                   1           0

   date1_minute  date1_second  date2_month  date2_year  date2_day_of_week  \
0             0             0            9        2022                  3
1             0             0           10        2022                  5
2             0             0           12        2022                  3

   date2_day_of_month  date2_hour  date2_minute  date2_second  \
0                  15           0             0             0
1                  15           0             0             0
2                  15           0             0             0

   date1_sub_date3  date2_sub_date3  date1_sub_date4  date2_sub_date4
0             31.0             45.0             17.0             31.0
1             30.0             44.0             16.0             30.0
2             30.0             44.0             16.0             30.0

More details#

For tutorials on how to create and use features from datetime columns, check the following courses:

And the following book: