make_pipeline#
make_pipeline is a shorthand for Pipeline. To set up Pipeline, we create tuples with step names and transformers or estimators; with make_pipeline, we simply pass a sequence of transformers and estimators, and the step names are assigned automatically.
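For instance, the two constructions below build equivalent pipelines (a minimal sketch; the explicit step names mirror the ones make_pipeline generates):

from feature_engine.imputation import DropMissingData
from feature_engine.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import Lasso

# Explicit construction: we name each step ourselves
pipe_explicit = Pipeline(steps=[
    ("dropmissingdata", DropMissingData()),
    ("lasso", Lasso()),
])

# Shorthand: make_pipeline derives the step names from the class names
pipe_shorthand = make_pipeline(DropMissingData(), Lasso())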
Setting up a Pipeline with make_pipeline#
In the following example, we set up a Pipeline that drops missing data, then replaces categories with ordinal numbers, and finally fits a Lasso regression model.
import numpy as np
import pandas as pd
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OrdinalEncoder
from feature_engine.pipeline import make_pipeline
from sklearn.linear_model import Lasso
X = pd.DataFrame(
dict(
x1=[2, 1, 1, 0, np.nan],
x2=["a", np.nan, "b", np.nan, "a"],
)
)
y = pd.Series([1, 2, 3, 4, 5])
pipe = make_pipeline(
DropMissingData(),
OrdinalEncoder(encoding_method="arbitrary"),
Lasso(random_state=10),
)
# fit the pipeline and make predictions
pipe.fit(X, y)
preds_pipe = pipe.predict(X)
preds_pipe
In the output we see the predictions made by the pipeline:
array([2., 2.])
The step names of the pipeline were assigned automatically:
print(pipe)
Pipeline(steps=[('dropmissingdata', DropMissingData()),
('ordinalencoder', OrdinalEncoder(encoding_method='arbitrary')),
('lasso', Lasso(random_state=10))])
The pipeline returned by make_pipeline has exactly the same characteristics as Pipeline. Hence, for additional guidelines, check out the Pipeline documentation.
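Because the steps are named, the resulting pipeline can be inspected and tuned like any other Pipeline. For example, using the pipeline fitted above (a quick illustration):

# Retrieve a fitted step by its auto-generated name
encoder = pipe.named_steps["ordinalencoder"]
print(encoder.encoder_dict_)

# The names also prefix nested parameters, e.g. for a grid search
print(pipe.get_params()["lasso__alpha"])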
Forecasting#
Let’s set up another pipeline to do direct forecasting:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import root_mean_squared_error
from sklearn.multioutput import MultiOutputRegressor
from feature_engine.timeseries.forecasting import (
LagFeatures,
WindowFeatures,
)
from feature_engine.pipeline import make_pipeline
We’ll use the Australian electricity demand dataset described here:
Godahewa, Rakshitha, Bergmeir, Christoph, Webb, Geoff, Hyndman, Rob, & Montero-Manso, Pablo. (2021). Australian Electricity Demand Dataset (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4659727
url = "https://raw.githubusercontent.com/tidyverts/tsibbledata/master/data-raw/vic_elec/VIC2015/demand.csv"
df = pd.read_csv(url)
df.drop(columns=["Industrial"], inplace=True)
# Convert the integer Date to an actual date with datetime type
df["date"] = df["Date"].apply(
lambda x: pd.Timestamp("1899-12-30") + pd.Timedelta(x, unit="days")
)
# Create a timestamp from the integer Period representing 30 minute intervals
df["date_time"] = df["date"] + \
pd.to_timedelta((df["Period"] - 1) * 30, unit="m")
df.dropna(inplace=True)
# Rename columns
df = df[["date_time", "OperationalLessIndustrial"]]
df.columns = ["date_time", "demand"]
# Resample to hourly
df = (
df.set_index("date_time")
.resample("h")
.agg({"demand": "sum"})
)
print(df.head())
Here, we see the first rows of data:
demand
date_time
2002-01-01 00:00:00 6919.366092
2002-01-01 01:00:00 7165.974188
2002-01-01 02:00:00 6406.542994
2002-01-01 03:00:00 5815.537828
2002-01-01 04:00:00 5497.732922
We’ll predict the next 3 hours of energy demand using direct forecasting. Let’s create the target variable:
horizon = 3
y = pd.DataFrame(index=df.index)
for h in range(horizon):
y[f"h_{h}"] = df.shift(periods=-h, freq="h")
y.dropna(inplace=True)
df = df.loc[y.index]
print(y.head())
This is our target variable:
h_0 h_1 h_2
date_time
2002-01-01 00:00:00 6919.366092 7165.974188 6406.542994
2002-01-01 01:00:00 7165.974188 6406.542994 5815.537828
2002-01-01 02:00:00 6406.542994 5815.537828 5497.732922
2002-01-01 03:00:00 5815.537828 5497.732922 5385.851060
2002-01-01 04:00:00 5497.732922 5385.851060 5574.731890
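As a sanity check (a quick verification, assuming the frames created above), each target column h_k holds the demand k hours ahead of the row's timestamp:

# h_2 at time t should equal the demand observed at t + 2 hours
t = pd.Timestamp("2002-01-01 00:00:00")
print(y.loc[t, "h_2"])                              # 6406.542994
print(df.loc[t + pd.Timedelta(hours=2), "demand"])  # 6406.542994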
Next, we split the data into a training set and a test set:
end_train = '2014-12-31 23:59:59'
X_train = df.loc[:end_train]
y_train = y.loc[:end_train]
begin_test = '2014-12-31 17:59:59'
X_test = df.loc[begin_test:]
y_test = y.loc[begin_test:]
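Note that the test set starts a few hours before the training data ends: the first test timestamps need past demand values to compute the lag and window features we create next. A quick check of the overlap (a sketch, assuming the split above):

# The overlap supplies the history required by the largest lag (6 hours)
overlap = X_train.index.intersection(X_test.index)
print(len(overlap))  # 6 hourly rows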
Next, we set up LagFeatures and WindowFeatures to create features from lags and windows:
lagf = LagFeatures(
variables=["demand"],
periods=[1, 3, 6],
missing_values="ignore",
drop_na=True,
)
winf = WindowFeatures(
variables=["demand"],
window=["3h"],
freq="1h",
functions=["mean"],
missing_values="ignore",
drop_original=True,
drop_na=True,
)
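To preview what these transformers produce before assembling the pipeline, we can chain them manually (a quick illustration; the column names, such as demand_lag_1 and demand_window_3h_mean, follow feature_engine's defaults):

# Apply the lag and window transformers outside the pipeline
features = winf.fit_transform(lagf.fit_transform(X_train))
print(features.head())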
We wrap the lasso regression within the multioutput regressor to predict multiple targets:
lasso = MultiOutputRegressor(Lasso(random_state=0, max_iter=10))
Now, we assemble the Pipeline:
pipe = make_pipeline(lagf, winf, lasso)
print(pipe)
The steps’ names were assigned automatically:
Pipeline(steps=[('lagfeatures',
LagFeatures(drop_na=True, missing_values='ignore',
periods=[1, 3, 6], variables=['demand'])),
('windowfeatures',
WindowFeatures(drop_na=True, drop_original=True, freq='1h',
functions=['mean'], missing_values='ignore',
variables=['demand'], window=['3h'])),
('multioutputregressor',
MultiOutputRegressor(estimator=Lasso(max_iter=10,
random_state=0)))])
Let’s fit the Pipeline:
pipe.fit(X_train, y_train)
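MultiOutputRegressor fits one Lasso per target column, so the fitted pipeline holds three regressors, one per forecasting step (a quick check):

# One fitted Lasso per forecasting step (h_0, h_1, h_2)
multi = pipe.named_steps["multioutputregressor"]
print(len(multi.estimators_))  # 3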
Now, we can make forecasts for the test set:
forecasts = pd.DataFrame(
    pipe.predict(X_test),
    columns=[f"step_{i+1}" for i in range(horizon)],
)
print(forecasts.head())
We see the predicted energy demand for the next 3 hours ahead of each timestamp, one column per forecasting step:
step_1 step_2 step_3
0 8031.043352 8262.804811 8484.551733
1 7017.158081 7160.568853 7496.282999
2 6587.938171 6806.903940 7212.741943
3 6503.807479 6789.946587 7195.796841
4 6646.981390 6970.501840 7308.359237
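Since the test targets are available, we can score the forecasts with the RMSE function imported earlier. A minimal sketch, assuming that the rows the pipeline drops (those lacking enough history for the lag features) are the leading ones, so the predictions align with the tail of y_test:

# Align the targets with the rows the pipeline could predict
y_aligned = y_test.tail(len(forecasts))
print(root_mean_squared_error(y_aligned, forecasts))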
To learn more about direct forecasting and how to create features, check out our courses. Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them, you are supporting Sole, the main developer of Feature-engine.