GeoDistanceFeatures#

GeoDistanceFeatures() calculates the distance between two geographical coordinate pairs (latitude/longitude) and adds the result as a new feature.

GeoDistanceFeatures() is useful for location-based machine learning problems such as real estate pricing, delivery route optimization, ride-sharing applications, and any domain where geographic proximity is relevant.

Distance Methods#

The transformer supports different distance calculation methods:

  • haversine: Great-circle distance using the Haversine formula (default). Most accurate for typical distances on Earth’s surface.

  • euclidean: Simple Euclidean distance in the coordinate space. Fast but less accurate for long distances.

  • manhattan: Manhattan (taxicab) distance in coordinate space. Useful as a rough approximation for grid-based city layouts.

Output Units#

The distance can be returned in various units:

  • km: Kilometers (default)

  • miles: Miles

  • meters: Meters

  • feet: Feet

Python Demo#

Let’s create a dataframe with origin and destination coordinates:

import pandas as pd
from feature_engine.creation import GeoDistanceFeatures

# Sample data: trips between US cities
X = pd.DataFrame({
    'origin_lat': [40.7128, 34.0522, 41.8781, 29.7604],
    'origin_lon': [-74.0060, -118.2437, -87.6298, -95.3698],
    'dest_lat': [34.0522, 41.8781, 40.7128, 33.4484],
    'dest_lon': [-118.2437, -87.6298, -74.0060, -112.0740],
    'trip_id': [1, 2, 3, 4]
})

Now let’s calculate the distances using the haversine formula and returning the values in km:

# Set up the transformer
gdt = GeoDistanceFeatures(
    lat1='origin_lat',
    lon1='origin_lon',
    lat2='dest_lat',
    lon2='dest_lon',
    method='haversine',
    output_unit='km',
    output_col='distance_km'
)

# Fit and transform
gdt.fit(X)
X_transformed = gdt.transform(X)

print(X_transformed[['trip_id', 'distance_km']])

In the following output we see the trip ID followed by the distance traveled in each trip:

   trip_id   distance_km
0        1   3935.746254
1        2   2808.517344
2        3   1144.286561
3        4   1634.724892

Using different distance methods#

We can use the Euclidean distance method, which provides a faster but less accurate calculation suitable for short distances:

gdt_euclidean = GeoDistanceFeatures(
    lat1='origin_lat', lon1='origin_lon',
    lat2='dest_lat', lon2='dest_lon',
    method='euclidean',
    output_col='distance_euclidean'
)

gdt_euclidean.fit(X)
X_euclidean = gdt_euclidean.transform(X)
print(X_euclidean[['trip_id', 'distance_euclidean']])

The Euclidean distances differ from the Haversine values because they don’t account for Earth’s curvature:

   trip_id  distance_euclidean
0        1         4940.252715
1        2         3493.298968
2        3         1519.295694
3        4         1720.178310

Alternatively, we can use the Manhattan distance, which is useful for grid-based city layouts:

gdt_manhattan = GeoDistanceFeatures(
    lat1='origin_lat', lon1='origin_lon',
    lat2='dest_lat', lon2='dest_lon',
    method='manhattan',
    output_col='distance_manhattan'
)

gdt_manhattan.fit(X)
X_manhattan = gdt_manhattan.transform(X)
print(X_manhattan[['trip_id', 'distance_manhattan']])

The Manhattan distance sums the absolute differences in latitude and longitude:

   trip_id  distance_manhattan
0        1          5628.24000
1        2          4684.15800
2        3          1637.36700
3        4          2279.96460

Using different output units#

The transformer supports returning distances in km (default), miles, meters, or feet. Here we calculate distances in miles:

gdt = GeoDistanceFeatures(
    lat1='origin_lat', lon1='origin_lon',
    lat2='dest_lat', lon2='dest_lon',
    output_unit='miles',
    output_col='distance_miles'
)

gdt.fit(X)
X_transformed = gdt.transform(X)
print(X_transformed[['trip_id', 'distance_miles']])

The distances are now expressed in miles instead of kilometers:

   trip_id  distance_miles
0        1     2445.258392
1        2     1745.046817
2        3      711.000629
3        4     1015.643614

Dropping original coordinate columns#

To reduce the dimensionality of the output dataset, we can remove the original coordinate columns after calculating the distance:

gdt = GeoDistanceFeatures(
    lat1='origin_lat', lon1='origin_lon',
    lat2='dest_lat', lon2='dest_lon',
    drop_original=True
)

gdt.fit(X)
X_transformed = gdt.transform(X)

# Coordinate columns are removed
print(X_transformed.columns.tolist())

After transformation, only the non-coordinate columns and the new distance column remain:

['trip_id', 'geo_distance']

Calculating distance within a Pipeline#

GeoDistanceFeatures() works seamlessly with scikit-learn pipelines. In the following example, we create a pipeline that first calculates the geographic distance, then scales the features, and finally trains a regression model:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Create sample target variable
y = pd.Series([100, 150, 80, 200])

# Create a pipeline for price prediction
pipe = Pipeline([
    ('geo_distance', GeoDistanceFeatures(
        lat1='origin_lat', lon1='origin_lon',
        lat2='dest_lat', lon2='dest_lon',
        output_unit='km',
        drop_original=True
    )),
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit the pipeline
pipe.fit(X, y)

# Make predictions
predictions = pipe.predict(X)
print(f"Predictions: {predictions}")

The pipeline successfully trains and returns predictions:

Predictions: [100. 150.  80. 200.]