.. _datetime_subtraction:

.. currentmodule:: feature_engine.datetime

DatetimeSubtraction
===================

Very often, we have datetime variables in our datasets, and we want to determine the
time difference between them. For example, if we work with financial data, we may have
the variable **date of loan application**, with the date and time when the customer
applied for a loan, and also the variable **date of birth**, with the customer's date
of birth. With those two variables, we want to infer the **age of the customer** at the
time of application. To do this, we can compute the difference in years between
`date_of_loan_application` and `date_of_birth` and capture it in a new variable.

In a different example, if we are trying to predict the price of a house and we know
the year in which the house was built, we can infer the age of the house at the point
of sale. Generally, older houses cost less. To calculate the age of the house, we’d
simply compute the difference in years between the sale date and the date on which it
was built.

Python offers several options for operating on datetime objects, for example, the
`datetime` module. Since you will most likely be working with pandas dataframes, this
guide first shows how to subtract datetime variables with pandas, and then how to
automate the procedure with Feature-engine.

Subtracting datetime features with pandas
-----------------------------------------

In Python, we can subtract datetime objects with pandas. To work with datetime
variables in pandas, we need to make sure that the timestamp, which can be stored in
various formats, like strings (str), objects (`"O"`), or datetime, is cast as a
datetime. If it is not, we can convert strings to datetime objects by executing
`pd.to_datetime(df[variable_of_interest])`.

Let’s create a toy dataframe with 2 datetime variables for a short demo:

.. code:: python

    import numpy as np
    import pandas as pd

    data = pd.DataFrame({
        "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
        "date2": pd.date_range("2018-03-05", periods=5, freq="W")})

    print(data)

This is the data that we created, containing two datetime variables:

.. code:: python

           date1      date2
    0 2019-03-05 2018-03-11
    1 2019-03-06 2018-03-18
    2 2019-03-07 2018-03-25
    3 2019-03-08 2018-04-01
    4 2019-03-09 2018-04-08

Now, we can subtract `date2` from `date1` and capture the difference in a new variable
by using the pandas `sub()` method:

.. code:: python

    data["diff"] = data["date1"].sub(data["date2"])

    print(data)

The new variable, which expresses the difference in number of days, is at the right of
the dataframe:

.. code:: python

           date1      date2     diff
    0 2019-03-05 2018-03-11 359 days
    1 2019-03-06 2018-03-18 353 days
    2 2019-03-07 2018-03-25 347 days
    3 2019-03-08 2018-04-01 341 days
    4 2019-03-09 2018-04-08 335 days

If we want the units in something other than days, we can divide the result by a numpy
timedelta of the desired unit. The following example shows how to use this syntax:

.. code:: python

    # Divide the time difference by the length of one year, expressed in nanoseconds.
    data["diff"] = data["date1"].sub(data["date2"], axis=0).div(
        np.timedelta64(1, "Y").astype("timedelta64[ns]"))

    print(data)

We see the new variable now expressing the difference in years, at the right of the
dataframe:

.. code:: python

           date1      date2      diff
    0 2019-03-05 2018-03-11  0.982909
    1 2019-03-06 2018-03-18  0.966481
    2 2019-03-07 2018-03-25  0.950054
    3 2019-03-08 2018-04-01  0.933626
    4 2019-03-09 2018-04-08  0.917199

If we wanted to subtract several datetime variables, we would have to write a line of
code for every subtraction. Fortunately, we can automate this procedure with
:class:`DatetimeSubtraction()`.

Datetime subtraction with Feature-engine
----------------------------------------

:class:`DatetimeSubtraction()` automatically subtracts several date and time features
from each other. You just need to indicate the features on the left of the subtraction
operation in the `variables` parameter and those on the right in the `reference`
parameter. That is, each variable in `reference` is subtracted from each variable in
`variables`. You can also change the output unit through the `output_unit` parameter.

:class:`DatetimeSubtraction()` works with variables whose `dtype` is datetime, as well
as with object-like and categorical variables, provided that they can be parsed into
datetime format. The transformer does this parsing under the hood.

Following up on the previous example, here is how we obtain the difference in years
using :class:`DatetimeSubtraction()`:

.. code:: python

    import pandas as pd
    from feature_engine.datetime import DatetimeSubtraction

    data = pd.DataFrame({
        "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
        "date2": pd.date_range("2018-03-05", periods=5, freq="W")})

    dtf = DatetimeSubtraction(
        variables="date1",
        reference="date2",
        output_unit="Y")

    data = dtf.fit_transform(data)

    print(data)

With `transform()`, :class:`DatetimeSubtraction()` returns a new dataframe containing
the original variables and also the new variables with the time difference:

.. code:: python

           date1      date2  date1_sub_date2
    0 2019-03-05 2018-03-11         0.982909
    1 2019-03-06 2018-03-18         0.966481
    2 2019-03-07 2018-03-25         0.950054
    3 2019-03-08 2018-04-01         0.933626
    4 2019-03-09 2018-04-08         0.917199

Drop original variables after computation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We have the option to drop the original datetime variables after the computation. In
this example, we also return the time difference in months:

.. code:: python

    import pandas as pd
    from feature_engine.datetime import DatetimeSubtraction

    data = pd.DataFrame({
        "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
        "date2": pd.date_range("2018-03-05", periods=5, freq="W")})

    dtf = DatetimeSubtraction(
        variables="date1",
        reference="date2",
        output_unit="M",
        drop_original=True
    )

    data = dtf.fit_transform(data)

    print(data)

In this case, the resulting dataframe contains only the time difference between the two
original variables:

.. code:: python

       date1_sub_date2
    0        11.794903
    1        11.597774
    2        11.400645
    3        11.203515
    4        11.006386

Subtract multiple variables simultaneously
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can perform multiple subtractions at the same time. In this example, we will add new
datetime variables to the toy dataframe as strings. The idea is to show that
:class:`DatetimeSubtraction()` will convert those strings to datetime under the hood to
carry out the subtraction operation.

.. code:: python

    import pandas as pd
    from feature_engine.datetime import DatetimeSubtraction

    data = pd.DataFrame({
        "date1" : ["2022-09-01", "2022-10-01", "2022-12-01"],
        "date2" : ["2022-09-15", "2022-10-15", "2022-12-15"],
        "date3" : ["2022-08-01", "2022-09-01", "2022-11-01"],
        "date4" : ["2022-08-15", "2022-09-15", "2022-11-15"],
    })

    dtf = DatetimeSubtraction(variables=["date1", "date2"], reference=["date3", "date4"])

    data = dtf.fit_transform(data)

    print(data)

The resulting dataframe contains the original variables plus the new variables
expressing the time difference between each pair of datetime variables:

.. code:: python

            date1       date2       date3       date4  date1_sub_date3  \
    0  2022-09-01  2022-09-15  2022-08-01  2022-08-15             31.0
    1  2022-10-01  2022-10-15  2022-09-01  2022-09-15             30.0
    2  2022-12-01  2022-12-15  2022-11-01  2022-11-15             30.0

       date2_sub_date3  date1_sub_date4  date2_sub_date4
    0             45.0             17.0             31.0
    1             44.0             16.0             30.0
    2             44.0             16.0             30.0

Working with missing values
~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, :class:`DatetimeSubtraction()` will raise an error if the dataframe passed
to the `fit()` or `transform()` methods contains NA in the variables to subtract. We can
override this behaviour and allow computations between variables with NaN by setting the
parameter `missing_values` to `"ignore"`. Here is a code example:

.. code:: python

    import numpy as np
    import pandas as pd
    from feature_engine.datetime import DatetimeSubtraction

    data = pd.DataFrame({
        "date1" : ["2022-09-01", "2022-10-01", "2022-12-01"],
        "date2" : ["2022-09-15", np.nan, "2022-12-15"],
        "date3" : ["2022-08-01", "2022-09-01", "2022-11-01"],
        "date4" : ["2022-08-15", "2022-09-15", np.nan],
    })

    dtf = DatetimeSubtraction(
        variables=["date1", "date2"],
        reference=["date3", "date4"],
        missing_values="ignore")

    data = dtf.fit_transform(data)

    print(data)

When either of the two variables in a subtraction contains NaN, the new feature with
the time difference will also display NaN:

.. code:: python

            date1       date2       date3       date4  date1_sub_date3  \
    0  2022-09-01  2022-09-15  2022-08-01  2022-08-15             31.0
    1  2022-10-01         NaN  2022-09-01  2022-09-15             30.0
    2  2022-12-01  2022-12-15  2022-11-01         NaN             30.0

       date2_sub_date3  date1_sub_date4  date2_sub_date4
    0             45.0             17.0             31.0
    1              NaN             16.0              NaN
    2             44.0              NaN              NaN

Working with different timezones
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If we have timestamps in different timezones or variables in different timezones, we can
still perform subtraction operations with :class:`DatetimeSubtraction()` by first
converting all timestamps to Coordinated Universal Time (UTC). Here is a code example,
where we return the time difference in milliseconds:

.. code:: python

    import pandas as pd
    from feature_engine.datetime import DatetimeSubtraction

    data = pd.DataFrame({
        "date1": ['12:34:45+3', '23:01:02-6', '11:59:21-8', '08:44:23Z'],
        "date2": ['09:34:45+1', '23:01:02-6+1', '11:59:21-8-2', '08:44:23+3']
    })

    dfts = DatetimeSubtraction(
        variables="date1",
        reference="date2",
        utc=True,
        output_unit="ms",
        format="mixed"
    )

    new = dfts.fit_transform(data)

    print(new)

We see the resulting dataframe with the time difference in milliseconds:

.. code:: python

            date1         date2  date1_sub_date2
    0  12:34:45+3    09:34:45+1        3600000.0
    1  23:01:02-6  23:01:02-6+1       25200000.0
    2  11:59:21-8  11:59:21-8-2       21600000.0
    3   08:44:23Z    08:44:23+3       10800000.0

Adding arbitrary names to the new variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Often, we want to compute just a few time differences. In this case, we may also want
to assign specific names to the new variables. In this code example, we do so:

.. code:: python

    import pandas as pd
    from feature_engine.datetime import DatetimeSubtraction

    data = pd.DataFrame({
        "date1": pd.date_range("2019-03-05", periods=5, freq="D"),
        "date2": pd.date_range("2018-03-05", periods=5, freq="W")})

    dtf = DatetimeSubtraction(
        variables="date1",
        reference="date2",
        new_variables_names=["my_new_var"]
    )

    data = dtf.fit_transform(data)

    print(data)

In the resulting dataframe, we see that the time difference was captured in a variable
called `my_new_var`:

.. code:: python

           date1      date2  my_new_var
    0 2019-03-05 2018-03-11       359.0
    1 2019-03-06 2018-03-18       353.0
    2 2019-03-07 2018-03-25       347.0
    3 2019-03-08 2018-04-01       341.0
    4 2019-03-09 2018-04-08       335.0

We should be mindful to pass a list containing as many names as new variables. The
number of variables that will be created is obtained by multiplying the number of
variables in the parameter `variables` by the number of variables in the parameter
`reference`.

get_feature_names_out()
~~~~~~~~~~~~~~~~~~~~~~~

Finally, we can extract the names of the variables in the transformed dataframe for
compatibility with the Scikit-learn pipeline:

.. code:: python

    import pandas as pd
    from feature_engine.datetime import DatetimeSubtraction

    data = pd.DataFrame({
        "date1" : ["2022-09-01", "2022-10-01", "2022-12-01"],
        "date2" : ["2022-09-15", "2022-10-15", "2022-12-15"],
        "date3" : ["2022-08-01", "2022-09-01", "2022-11-01"],
        "date4" : ["2022-08-15", "2022-09-15", "2022-11-15"],
    })

    dtf = DatetimeSubtraction(variables=["date1", "date2"], reference=["date3", "date4"])
    dtf.fit(data)

    dtf.get_feature_names_out()

Below are the names of the variables that will appear in any dataframe resulting from
applying the `transform()` method:

.. code:: python

    ['date1',
     'date2',
     'date3',
     'date4',
     'date1_sub_date3',
     'date2_sub_date3',
     'date1_sub_date4',
     'date2_sub_date4']

Combining extraction and subtraction of datetime features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can also combine the creation of numerical variables from datetime features with the
creation of new features by subtraction of datetime variables:

.. code:: python

    import pandas as pd
    from sklearn.pipeline import Pipeline
    from feature_engine.datetime import DatetimeFeatures, DatetimeSubtraction

    data = pd.DataFrame({
        "date1" : ["2022-09-01", "2022-10-01", "2022-12-01"],
        "date2" : ["2022-09-15", "2022-10-15", "2022-12-15"],
        "date3" : ["2022-08-01", "2022-09-01", "2022-11-01"],
        "date4" : ["2022-08-15", "2022-09-15", "2022-11-15"],
    })

    # Keep the original variables in the first step so that the second step
    # can still subtract them.
    dtf = DatetimeFeatures(variables=["date1", "date2"], drop_original=False)
    dts = DatetimeSubtraction(
        variables=["date1", "date2"],
        reference=["date3", "date4"],
        drop_original=True,
    )

    pipe = Pipeline([
        ("features", dtf),
        ("subtraction", dts),
    ])

    data = pipe.fit_transform(data)

    print(data)

In the following output, we see the new dataframe containing the features that were
extracted from the different datetime variables, followed by those created by capturing
the time difference:

.. code:: python

       date1_month  date1_year  date1_day_of_week  date1_day_of_month  date1_hour  \
    0            9        2022                  3                   1           0
    1           10        2022                  5                   1           0
    2           12        2022                  3                   1           0

       date1_minute  date1_second  date2_month  date2_year  date2_day_of_week  \
    0             0             0            9        2022                  3
    1             0             0           10        2022                  5
    2             0             0           12        2022                  3

       date2_day_of_month  date2_hour  date2_minute  date2_second  \
    0                  15           0             0             0
    1                  15           0             0             0
    2                  15           0             0             0

       date1_sub_date3  date2_sub_date3  date1_sub_date4  date2_sub_date4
    0             31.0             45.0             17.0             31.0
    1             30.0             44.0             16.0             30.0
    2             30.0             44.0             16.0             30.0

More details
------------

For tutorials on how to create and use features from datetime columns, check the
following courses:

- Feature Engineering for Machine Learning.
- Feature Engineering for Time Series Forecasting.

And the following book:

- Python Feature Engineering Cookbook.
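
To see the arithmetic behind the pairwise subtractions shown in this guide, here is a
minimal pandas-only sketch. It is an illustration, not Feature-engine's implementation:
the helper `subtract_datetimes` and its `unit` argument are hypothetical names
introduced for this example. The `<variable>_sub_<reference>` naming and the column
order only mirror the outputs displayed in this guide.

.. code:: python

    import pandas as pd


    def subtract_datetimes(df, variables, reference, unit="D"):
        """Hypothetical helper: subtract every reference column from every variable."""
        out = df.copy()
        # Parse each column once, so that strings are handled like datetimes.
        parsed = {col: pd.to_datetime(df[col]) for col in [*variables, *reference]}
        for ref in reference:
            for var in variables:
                delta = parsed[var] - parsed[ref]
                # Dividing a timedelta by one unit of time returns a float.
                out[f"{var}_sub_{ref}"] = delta / pd.Timedelta(1, unit=unit)
        return out


    toy = pd.DataFrame({
        "date1": ["2022-09-01", "2022-10-01", "2022-12-01"],
        "date2": ["2022-09-15", "2022-10-15", "2022-12-15"],
        "date3": ["2022-08-01", "2022-09-01", "2022-11-01"],
        "date4": ["2022-08-15", "2022-09-15", "2022-11-15"],
    })

    result = subtract_datetimes(toy, ["date1", "date2"], ["date3", "date4"])

    print(result.filter(like="_sub_"))

Two variables and two reference columns produce four new columns, one per pair, which
is why the number of names passed when renaming must equal the product of the two list
lengths.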