Principles of map visualization (cartography):

  • Map design: The process of creating maps that effectively communicate spatial information.

  • Map types: Different types of maps, such as density, choropleth, and dot density maps.

  • Map interactivity: The ability to interact with maps through zooming, panning, and clicking on features.

  • Map elements: The components of a map, including title, legend, scale, and north arrow.

  • Map projections: The methods used to represent the curved surface of the Earth on a flat map.

# Import the required libraries
import pandas as pd
import numpy as np
import geopandas as gpd
import seaborn as sns
sns.set_theme(style="ticks")
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

A dot map can be used to visualize the distribution of point data. For a large number of points, a dot map directly conveys the density of locations.

# Read the TfL Bus Stop Locations data
df_bus_stop = pd.read_csv("data/tfl_bus-stops.csv")
# Convert the DataFrame to a GeoDataFrame (the eastings/northings are in British National Grid, EPSG:27700)
gdf_bus_stop = gpd.GeoDataFrame(df_bus_stop,
                                geometry=gpd.points_from_xy(df_bus_stop['Location_Easting'], df_bus_stop['Location_Northing']),
                                crs="EPSG:27700")
# We also need the London boundary (created with the dissolve function last week) as the background of the map.
gdf_london_boundary = gpd.read_file("data/whole_london_boundary.geojson")
# convert the coordinate reference system of the London boundary to OSGB36 (EPSG:27700)
gdf_london_boundary = gdf_london_boundary.to_crs("EPSG:27700")
# Plot the bus stops and London boundary on a map
# Create a figure and axis
fig, ax = plt.subplots(figsize=(6, 6))
# Plot the London boundary
gdf_london_boundary.plot(ax=ax, # ax is the axis to plot on
                          color='lightgrey', # color of the polygon
                          edgecolor='black', # color of the boundary
                          alpha=0.3, # transparency of all colors
                          linewidth=0.5) # linewidth of the polygon boundary
# The bus stop points plot
gdf_bus_stop.plot(ax=ax, # another layer of points but put in the same axis
                  color='red', # color of the points
                  edgecolor='grey',  # color of the point boundary
                  linewidth=0.05, # linewidth of the point boundary
                  markersize=1, # size of the points
                  alpha=0.4, # transparency of all colors
                  label='TfL Bus Stops', # label of the points
                   )
# We need the matplotlib_map_utils to add the scale bar and north arrow on the map
from matplotlib_map_utils.core.north_arrow import NorthArrow, north_arrow
from matplotlib_map_utils.core.scale_bar import ScaleBar, scale_bar
NorthArrow.set_size("small")
ScaleBar.set_size("small")
# Add the North Arrow and scale bar to the map
north_arrow(ax, location="upper right", rotation={"crs": gdf_bus_stop.crs, "reference": "center"})
scale_bar(ax, location="upper left", style="boxes", bar={"projection": gdf_bus_stop.crs})
# Set the title
ax.set_title('TfL Bus Stops in London')
# Set the legend
ax.legend(labels=['Bus Stop Locations'], loc='lower left', fontsize=8)
ax.set_xlabel('Easting (m)')
ax.set_ylabel('Northing (m)')
# We can save the figure as a jpg file
# plt.savefig("tfl_bus_stops.jpg", dpi=300, bbox_inches='tight')
plt.show()
_images/c5497ebd9d8af8989c81ad5c5866a90459f323788ee84384155b985d9a800656.png

Tailoring the dot map

We can tailor a dot map by changing the size, color, and transparency of the dots according to attributes of the data (e.g., categorical or numerical variables).

  • Different colors represent different categories of the data (e.g., different types of transport).

  • Different sizes represent different values of the data (e.g., population size).

# Read the TfL Underground station geo-data
gdf_underground = gpd.read_file("data/Underground_Stations.geojson")
# convert the coordinate reference system of the Underground stations to OSGB36 (EPSG:27700)
gdf_underground = gdf_underground.to_crs("EPSG:27700")
gdf_underground
OBJECTID NAME LINES ATCOCODE MODES ACCESSIBILITY NIGHT_TUBE NETWORK DATASET_LAST_UPDATED FULL_NAME geometry
0 111 St. Paul's Central 940GZZLUSPU bus, tube Not Accessible Yes London Underground 2021-11-29 00:00:00+00:00 St. Paul's station POINT Z (532110.145 181274.419 0)
1 112 Mile End District, Hammersmith & City, Central 940GZZLUMED bus, tube Partially Accessible - Interchange Only Yes London Underground 2021-11-29 00:00:00+00:00 Mile End station POINT Z (536502.678 182534.693 0)
2 113 Bethnal Green Central 940GZZLUBLG bus, tube Not Accessible Yes London Underground 2021-11-29 00:00:00+00:00 Bethnal Green station POINT Z (535045.414 182718.596 0)
3 114 Leyton Central 940GZZLULYN bus, tube Not Accessible Yes London Underground 2021-11-29 00:00:00+00:00 Leyton station POINT Z (538369.354 186076.113 0)
4 115 Snaresbrook Central 940GZZLUSNB bus, tube Not Accessible Yes London Underground 2021-11-29 00:00:00+00:00 Snaresbrook station POINT Z (540158.622 188804.247 0)
... ... ... ... ... ... ... ... ... ... ... ...
268 483 Seven Sisters Victoria 940GZZLUSVS tube Not Accessible Yes London Underground 2021-11-29 00:00:00+00:00 Seven Sisters station POINT Z (533644.648 188931.178 0)
269 484 Theydon Bois Central 940GZZLUTHB tube Partially Accessible No London Underground 2021-11-29 00:00:00+00:00 Theydon Bois station POINT Z (545543.329 199101.251 0)
270 485 Kenton Bakerloo 940GZZLUKEN tube Not Accessible No London Underground 2021-11-29 00:00:00+00:00 Kenton station POINT Z (516729.561 188314.716 0)
271 486 Woodside Park Northern 940GZZLUWOP tube, bus Fully Accessible No London Underground 2021-11-29 00:00:00+00:00 Woodside Park station POINT Z (525720.193 192588.5 0)
272 487 Preston Road Metropolitan 940GZZLUPRD tube, bus Not Accessible No London Underground 2021-11-29 00:00:00+00:00 Preston Road station POINT Z (518250.918 187282.096 0)

273 rows × 11 columns

We will add usage information (counted by entries/exits) for London Underground stations to build a dot map. The usage data is published by TfL on the London Underground Station Crowding page in CSV or XLSX format, and we can read it with the pandas library.
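The exact file name and header layout below are assumptions (adjust them to the file you downloaded from TfL); a minimal sketch of reading it with pandas:

# Read the station usage data published by TfL (file name and header row are assumptions)
df_usage = pd.read_excel("data/station_usage.xlsx", header=0)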

# we only need the London Underground stations and annualised usage, so we need to filter the data.
df_usage_lu = df_usage[df_usage['Unnamed: 0'] == 'LU'][['Unnamed: 3', 'Annualised']].rename(columns={'Unnamed: 3': 'name', 'Annualised': 'usage'})
df_usage_lu.head()
name usage
1 Acton Town 4823835
2 Aldgate 6897314
3 Aldgate East 10947896
4 Alperton 2598605
5 Amersham 1729521
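The merge that produces gdf_underground_usage below is not shown; a minimal sketch, assuming station names can be matched directly between the two tables:

# Join the annualised usage onto the station GeoDataFrame by station name (assumed matching logic)
gdf_underground_usage = gdf_underground.merge(df_usage_lu, left_on='NAME', right_on='name', how='left')
# A cleaned subset, gdf_underground_usage_s (e.g. dropping missing/'---' usage values), is assumed
# to be created before the plotting cells further below.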
# Note that the usage column has some missing values, and some entries are the string '---', meaning no data.
# Additional NaN values arise because station names do not match exactly between the two datasets, even though
# some of those mismatches refer to the same station; you could write extra code to reconcile them.
print(gdf_underground_usage.usage.values)
[9189088 11145822 nan 8592536 1849887 6955728 1948236 4051381 3371643
 1167037 885823 4492296 267679 332284 nan 4383503 8491273 10787392 806675
 1503305 3937018 1212120 2153159 1860607 6314434 3063800 1597432 6030683
 6258859 1631696 1555062 1663555 1654538 2598605 4159381 13568222 27379635
 12573815 7822836 5106522 2203557 3293586 3947509 1450247 2307306 5962575
 7049636 nan 4377539 nan 14965619 6897314 nan 23330204 5185508 40070382
 72124262 11239346 21207073 nan nan 3644937 9413060 11353518 26088906
 2074444 1841679 5785361 3763523 5568858 7890670 nan 2624208 2899358 nan
 nan 7087668 nan 14076462 2645552 2542830 1623305 838918 3216718 1804676
 1952034 5748997 2138577 15779710 4373423 12110869 3172252 8590075
 37424555 1722368 3269238 5652447 5008989 nan 51106043 20240977 7803444
 9576731 6802053 nan 23172392 12328817 5573360 nan 5236551 7045026 5339416
 7137484 12397420 6265211 6323510 8040059 5210928 5113576 58726155 3889479
 2967011 4532637 '---' 4875982 5615724 4633070 3130908 1604937 1455300
 4909025 14142708 8975412 nan 6151983 1786001 2752421 2861021 397280
 1841035 3276268 2005688 3043893 3466311 3061427 820575 3540050 2385943
 17051267 2042489 1817060 2466317 4598994 1296040 nan 3937317 4045767
 26739879 nan nan 2689160 nan 2681574 2617813 nan 2297814 13156281 2717311
 nan 5022937 9088053 4266560 19172572 '---' nan nan 4774045 5338957
 13349683 15429088 4823835 1954910 28247401 20479785 7381051 8263631 nan
 2852360 2776023 3000487 3370726 1675267 2782422 5240900 6823150 5181211
 2134593 4663306 10947896 17452196 4157626 15510977 2565708 1989399 nan
 1057631 54376225 5419509 1373730 853803 1463521 1396078 1408690 1003523
 3340365 1398638 2223811 8435460 3353641 3002979 2073359 5870213 8508705
 nan 11223888 1296777 8399835 4108613 4040066 605628 4354076 nan 5347065
 4731235 4827763 nan 3529781 2547324 4780369 3727656 8875310 3019402
 9534074 7948900 15137045 3609039 1435853 5133701 3259745 2158123 1729521
 5426495 nan 18805872 nan 4321012 5198096 4408040 966889 1851241 808774
 2162088 837496 1653584 4067823 15846905 12258814 5179915 12165434 734300
 1515511 3657219 2423452]
# Plot the Underground stations with usage data
# Create a figure and axis
fig, ax = plt.subplots(figsize=(8, 8))
# Plot the London boundary
gdf_london_boundary.plot(ax=ax, # ax is the axis to plot on
                          color='lightgrey', # color of the polygon
                          edgecolor='black', # color of the boundary
                          alpha=0.2, # transparency of all colors
                          linewidth=0.5) # linewidth of the polygon boundary
# The Underground stations plot
gdf_underground_usage_s.plot(ax=ax,
                     column='usage', # column name to tailor the dot map, here is the numerical data
                     cmap='viridis', # color map
                      markersize=40, # size of the points
                      alpha=0.7, # transparency of all colors
                      edgecolor='black', # color of the point boundary
                      linewidth=0.5, # linewidth of the point boundary
                      legend=True,
                      legend_kwds={ "label": "Annualised Usage",
                                    "orientation": "horizontal"
                                    },
                      )
# Add the North Arrow and scale bar to the map
north_arrow(ax, location="upper right", rotation={"crs": gdf_underground.crs, "reference": "center"})
scale_bar(ax, location="lower left", style="boxes", bar={"projection": gdf_underground.crs})
# Set the title
ax.set_title('Annualised Usage of TfL Underground Stations in London')
ax.set_xlabel('Easting (m)')
ax.set_ylabel('Northing (m)')
plt.show()
_images/4cbfc58391dff10cb43c6e3895f05a38a8962a1840c5cfe6e33afb1d6381f2f0.png

A proportional symbol map is a variation of a dot map, where the size of the dots is proportional to the value of the variable being represented.

# Plot the Underground stations with usage data
# Create a figure and axis
fig, ax = plt.subplots(figsize=(10, 10))
# Plot the London boundary
gdf_london_boundary.plot(ax=ax, # ax is the axis to plot on
                          color='lightgrey', # color of the polygon
                          edgecolor='black', # color of the boundary
                          alpha=0.2, # transparency of all colors
                          linewidth=0.5) # linewidth of the polygon boundary
# We use the seaborn to plot the Underground stations
sns.scatterplot(ax=ax,
                data=gdf_underground_usage_s,
                x=gdf_underground_usage_s.geometry.x, # x coordinate
                y=gdf_underground_usage_s.geometry.y, # y coordinate
                hue='NIGHT_TUBE', # column name to tailor the dot map, here is the categorical data
                size=gdf_underground_usage_s['usage'], # size of the points
                sizes=(0, 700), # range of marker sizes
                alpha=0.8, # transparency of all colors
                palette='Set3', # color map
                linewidth=0.5, # linewidth of the point boundary
                edgecolor='black', # color of the point boundary
                legend=True,
                 )
# Add the North Arrow and scale bar to the map
north_arrow(ax, location="upper right", rotation={"crs": gdf_london_boundary.crs, "reference": "center"})
scale_bar(ax, location="lower right", style="boxes", bar={"projection": gdf_london_boundary.crs})
# Set the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles,
          labels=['Night Tube', 'Yes', 'No', 'Annualised Usage', '15M', '30M', '45M', '60M'],
          loc='lower left', fontsize=10, title_fontsize=12, frameon=False)
# Set the title
ax.set_title('Annualised Usage of TfL Underground Stations in London')
ax.set_xlabel('Easting (m)')
ax.set_ylabel('Northing (m)')
plt.show()
_images/1ea3d8652afc9794def644c19515dd7174f457d549f22a98f0a76a73b9291942.png

A choropleth map is a type of thematic map where areas are shaded or patterned in proportion to the value of a variable being represented. It is used to visualize the spatial distribution of a variable across different regions (polygons).

# Read the UK LAs data
gdf_las = gpd.read_file("data/Local_Authority_Districts_December_2024_Boundaries_UK_BSC.geojson")
# Convert the coordinate reference system (CRS) to OSGB
gdf_las = gdf_las.to_crs("EPSG:27700")
gdf_las['area'] = gdf_las.geometry.area / 1e6 # convert the area to km2
# Read the 2022 population data
df_pop = pd.read_csv("data/populaiton_2022_all_ages_uk.csv")
# merge the population data with the gdf_las data
gdf_las_pop = pd.merge(gdf_las, df_pop, left_on='LAD24CD', right_on='Code', how='left')
gdf_las_pop = gdf_las_pop.rename(columns={'All ages': 'population'})
# calculate the population density
gdf_las_pop['density'] = gdf_las_pop['population'] / gdf_las_pop['area'] # calculate the population density

We have used a sequential color map to plot the population distribution across the UK LAs: the darker the color, the higher the population. Alternatively, we can use a classified legend, i.e. classify the population density into several levels and plot the choropleth map by setting the scheme and k arguments in plot(), as sketched below.
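A minimal sketch of such a classified choropleth (the scheme, number of classes, and styling are illustrative choices; the scheme argument requires the mapclassify package):

# Classified choropleth of population density with 5 quantile classes
fig, ax = plt.subplots(figsize=(6, 8))
gdf_las_pop.plot(column='density', scheme='quantiles', k=5, cmap='Reds',
                 legend=True, edgecolor='black', linewidth=0.2, ax=ax,
                 legend_kwds={'title': 'Population density (per km2)'})
ax.set_title('Population density of UK LAs, 2022')
ax.set_axis_off()
plt.show()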

import contextily as cx
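contextily (cx) can add a web-map basemap beneath the layers drawn on an axis; a minimal sketch, assuming an axis ax with data already plotted (e.g. the choropleth above):

# Add a light CartoDB basemap under the existing layers, reprojecting the tiles to the data CRS
cx.add_basemap(ax, crs=gdf_las_pop.crs, source=cx.providers.CartoDB.Positron)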

Interactive maps allow users to interact with the map by zooming, panning, and clicking on features, and are used to visualize geospatial data in a more dynamic way. Several Python libraries can be used to create interactive maps, such as Folium, Plotly, and Bokeh. You can find an interactive map of UK LA population density generated with Folium here (click).
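A minimal sketch of building such an interactive map with GeoPandas' Folium-backed explore() (reusing gdf_las_pop from above; the tile and colour choices are illustrative, and this may differ from how the linked map was built):

# Interactive choropleth of population density; returns a folium.Map you can pan, zoom, and click
m = gdf_las_pop.explore(column='density', cmap='Reds', tiles='CartoDB positron', legend=True)
# m.save('uk_la_density_map.html')  # optionally save to an HTML file
m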

import libpysal
import spreg
# ensure the spreg package is at least version 1.8.2
print(spreg.__version__)
1.8.3

A spatial weight matrix (spatial adjacency matrix) \(W\) is a square matrix that quantifies the spatial relationship between each pair of locations \(i\) and \(j\) across \(n\) spatial units:

\[ W = \begin{bmatrix} w_{11} & w_{12} & \dots & w_{1n} \\ w_{21} & w_{22} & \dots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \dots & w_{nn} \end{bmatrix} \]

  • \(w_{ij}\) is the spatial weight between locations \(i\) and \(j\);

  • Typically, \(w_{ii} = 0\) (no self-weight).

For example, if we have 4 areas/locations (A, B, C, D) with indices (1, 2, 3, 4), the spatial weight matrix \(W\) is:

\[ W = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \\ w_{31} & w_{32} & w_{33} & w_{34} \\ w_{41} & w_{42} & w_{43} & w_{44} \end{bmatrix} \]
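To make the entries concrete, suppose (purely as a hypothetical adjacency) that A borders B, B borders C, and C borders D. The binary contiguity weight matrix would then be:

\[ W = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix} \]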

gdf_nyc.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
import libpysal.weights as weights
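# Build Queen-contiguity weights for the NYC polygons (this construction step is not shown in the
# original cell; it is assumed to mirror the Columbus example later in this section)
w_q = weights.Queen.from_dataframe(gdf_nyc, use_index=True)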
df_wq = w_q.to_adjlist(drop_islands=True)
df_wq
focal neighbor weight
0 0 17 1.0
1 0 31 1.0
2 0 77 1.0
3 0 155 1.0
4 1 17 1.0
... ... ... ...
1085 194 22 1.0
1086 194 38 1.0
1087 194 79 1.0
1088 194 94 1.0
1089 194 139 1.0

1090 rows × 3 columns

More usage examples can be found on the libpysal functions page.

Global spatial autocorrelation:

Global spatial autocorrelation answers the question: Is there overall clustering across the whole map?

Moran’s I (Moran, 1950) is a measure of global spatial autocorrelation. It is a single value that summarizes the degree of spatial autocorrelation in the entire dataset:

\[ I = \frac{n}{W} \cdot \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2} \]

Where:

  • \( n \): Number of spatial units

  • \( x_i \): Value at location \( i \)

  • \( \bar{x} \): Mean of the variable

  • \( w_{ij} \): Spatial weight between location \( i \) and \( j \)

  • \( W = \sum_i \sum_j w_{ij} \): Sum of all spatial weights

\(I > 0\): Positive spatial autocorrelation.

\(I < 0\): Negative spatial autocorrelation.

\(I = 0\): No spatial autocorrelation.

# compute the global Moran's I for total population
import esda
gdf_nyc['poptot'] = gdf_nyc['poptot'].astype(int)
mi = esda.Moran(gdf_nyc['poptot'], w_q, two_tailed=True)
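The fitted esda.Moran object exposes the statistic and a permutation-based pseudo p-value, which we can print directly:

# Inspect the global Moran's I and its pseudo p-value from permutations
print("Moran's I:", mi.I)
print("pseudo p-value:", mi.p_sim)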

Local spatial autocorrelation:

Local spatial autocorrelation answers the questions: Is there clustering in a specific area? Which specific areas are clustered or are spatial outliers?

Local indicators of spatial association (LISA) is a statistical technique that decomposes global spatial autocorrelation measures into location-specific contributions. It identifies significant spatial clusters or spatial outliers by measuring the correlation between a location and its neighbors.

  • Local Moran’s I:

Local Moran’s I is a measure of local spatial autocorrelation. It is a value computed for each spatial unit that summarizes the degree of spatial autocorrelation in the local neighborhood of that unit:

\[ I_i = \frac{(x_i - \bar{x})}{S^2} \cdot \sum_j w_{ij} (x_j - \bar{x}) \]

Where:

  • \( I_i \): Local Moran’s I for location \( i \)

  • \( x_i \): Value at location \( i \)

  • \( \bar{x} \): Mean of the variable

  • \( S\): Standard deviation of the variable

  • \( S^2 \): Variance of the variable

  • \( w_{ij} \): Spatial weight between location \( i \) and \( j \)

  • \( j \): Neighbors of location \( i \)

  • \( \sum_j w_{ij} \): Sum of the spatial weights for location \( i \)

\( I_i > 0 \): Positive spatial autocorrelation (high values surrounded by high values or low values surrounded by low values)

\(I_i < 0 \): Negative spatial autocorrelation (high values surrounded by low values or low values surrounded by high values)

\(I_i = 0 \): No spatial autocorrelation.

You can find details about Local Moran’s I in the libpysal Local Moran I documentation.
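A minimal sketch of computing Local Moran’s I with esda, reusing gdf_nyc and w_q from above (the new column names and the 5% significance threshold are illustrative choices):

# Local Moran's I for total population
lisa = esda.Moran_Local(gdf_nyc['poptot'], w_q)
gdf_nyc['Ii'] = lisa.Is               # local statistic for each area
gdf_nyc['quadrant'] = lisa.q          # 1 = HH, 2 = LH, 3 = LL, 4 = HL
gdf_nyc['p_sim_local'] = lisa.p_sim   # pseudo p-value from permutations
gdf_nyc['significant'] = gdf_nyc['p_sim_local'] < 0.05  # assumed 5% threshold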

  • Getis-Ord Gi*:

Gi* is a measure of local spatial autocorrelation that identifies clusters of high or low values. It is similar to Local Moran’s I, but rather than flagging spatial outliers it is used to identify hot spots (clusters of high values) and cold spots (clusters of low values) in the data:

\[ G_i^* = \frac{\sum_j w_{ij} (x_j - \bar{x})}{\sqrt{\frac{S^2}{n}}} \]

Where:

  • \( G_i^* \): Getis-Ord Gi* for location \( i \);

  • \( x_i \): Value at location \( i \);

  • \( \bar{x} \): Mean of the variable;

  • \( S^2 \): Variance of the variable;

  • \( n \): Number of spatial units;

  • \( w_{ij} \): Spatial weight between location \( i \) and \( j \);

  • \( j \): Neighbors of location \( i \);

  • \( \sum_j w_{ij} \): Sum of the spatial weights for location \( i \);

  • \( \sqrt{\frac{S^2}{n}} \): Standard deviation of the variable divided by \( \sqrt{n} \).

Z-scores: standardized G statistics that follow a normal distribution under spatial randomness.

P values: P-values from randomization/permutation.

You can use any strategy to plot the Gi* results; here we map the results in three classes (a sketch follows the list below):

  • Hotspots: High positive Z-score (>0) with p-value < 0.05.

  • Coldspots: High negative Z-score (<0) with p-value < 0.05.

  • Not significant: areas with p-value \(\geq\) 0.05, usually drawn in a blank/neutral colour.
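A minimal sketch of this three-class map, reusing gdf_nyc and w_q from above (esda's G_Local with star=True computes Gi*; the colours and 0.05 threshold follow the convention described above):

# Compute Getis-Ord Gi* and classify each area as hotspot / coldspot / not significant
gi_star = esda.G_Local(gdf_nyc['poptot'], w_q, star=True)

def classify_gi(z, p, alpha=0.05):
    if p < alpha and z > 0:
        return 'Hotspot'
    if p < alpha and z < 0:
        return 'Coldspot'
    return 'Not significant'

gdf_nyc['gi_class'] = [classify_gi(z, p) for z, p in zip(gi_star.Zs, gi_star.p_sim)]
fig, ax = plt.subplots(figsize=(6, 6))
gdf_nyc.plot(column='gi_class', categorical=True, legend=True, cmap='coolwarm',
             edgecolor='black', linewidth=0.1, ax=ax)
ax.set_title('Getis-Ord Gi* clusters of total population')
ax.set_axis_off()
plt.show()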

2.2.3 Spatial lags

The spatial lag of a variable is the weighted average (with row-standardized weights) of that variable over each location's neighbors: \( (\mathbf{W}\mathbf{y})_i = \sum_j w_{ij} y_j \).

# we can use the libpysal to compute the spatial lag
import libpysal as lps
ylag = lps.weights.lag_spatial(w_q, gdf_nyc['poptot'])
print(ylag)
[49362.25       28475.         47006.375      51427.75
 33904.33333333 29394.         16276.33333333 30615.83333333
 24442.6        26460.66666667 36855.4        35909.16666667
 16313.5        25859.66666667 32423.         24938.33333333
 35500.         51096.18518519 38548.04166667 27779.4
 42008.25       48425.66666667 40841.52631579 32608.25
 29787.         47603.33333333 29392.6        89058.
 39918.5            0.         43581.375      56055.6
 21846.75       32732.5        48332.8        27345.
 26674.83333333 33253.75       41449.         29378.5
 41864.14285714 33103.         22970.6        46351.
 49333.8        30433.11111111 31618.75       40978.16666667
 49566.2        17969.         21665.         46341.33333333
 57367.76470588 27424.44444444 26972.33333333 37355.22222222
 47931.6        39146.5        53198.         70978.375
 32600.4        38980.25       28301.42857143 30198.
 36467.5        56681.25       32198.         32485.4
 73155.2        66668.66666667 29724.57142857 46159.
 25791.5        34756.875      57186.6        28039.28571429
 33982.42857143 45772.66666667 57685.         35781.75
 45147.6        33185.33333333 60449.         26227.25
 58117.6        52577.33333333 56311.6        56218.
 23641.33333333 31126.4        42913.33333333 39061.6
 23819.         39723.33333333 54064.66666667 35346.4
 56398.8        24372.         39252.8        58900.2
 44591.875      22039.25       46541.         35084.4
 37563.5        35568.57142857 31351.5        24651.75
 55706.         61634.33333333 35926.         33531.28571429
 38671.33333333 43120.66666667 32351.16666667 42640.125
 38910.375      52090.         34730.         46812.
 37278.6        48268.         29281.5        29394.
 42948.16666667 46641.         41365.         27143.
 51776.4        56350.66666667 24468.16666667 24628.
 26826.         49325.         32260.5        42973.6
 68325.75       44167.28571429 32440.         53287.6
 26728.5        36147.57142857 44762.5        37229.8
 32338.         29262.2        28446.         30908.
 54011.25       55331.5        55481.4        53901.
 58571.         47130.66666667 31085.5        62864.85714286
 32366.5        26322.66666667 75716.         32316.5
 38722.4        15612.5        37183.4        37630.
 39275.85714286 43798.66666667 62967.16666667 41816.66666667
 36007.8        40267.         45930.3        52549.6
 48399.4        49492.         49030.4        46985.85714286
 44372.         26605.75       28592.14285714 26234.83333333
 22638.         29548.42857143 23352.         29178.
 65779.42857143 62009.6        47940.4        24141.5
 27331.4        32973.875      48749.         53538.2
 44769.71428571 52693.2        33380.33333333]
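The lag values then need to be attached to the GeoDataFrame for mapping; this assignment is not shown in the original, but it is implied by the sl_poptop column used in the plotting cell below:

# Store the spatial lag as a new column on the NYC GeoDataFrame
gdf_nyc['sl_poptop'] = ylag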
# Plot the original data and spatial lag of the population data for comparison
fig, ax = plt.subplots(1, 2, figsize=(12, 8))
gdf_nyc.plot(column='poptot', ax=ax[0], legend=True, cmap='Reds',
                       edgecolor='black', linewidth=0.05, scheme = 'naturalbreaks',
                       legend_kwds = {'bbox_to_anchor': (1.2, 0.35),'title': "Population"})
ax[0].set_title('Total population in NYC neighbourhoods')

gdf_nyc.plot(column='sl_poptop', ax=ax[1], legend=True, cmap='Reds',
                       edgecolor='black', linewidth=0.05, scheme = 'naturalbreaks',
                       legend_kwds = {'bbox_to_anchor': (1.2, 0.35),'title': "SL values of population"})
ax[1].set_title('Spatial Lag of Total population in NYC neighbourhoods')
ax[0].set_axis_off()
ax[1].set_axis_off()
plt.show()
_images/36ff8308341735886a1cf1c4e092fe8a21df064182e5a02b53fbf8a2a6f33ff2.png

The Spatial Lag Model (SLM) is a type of regression model that explicitly includes the effect of neighboring dependent variable values (i.e., spatial lag) on the value at each location:

\[ \mathbf{y} = \rho \mathbf{W y} + \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]
  • \(\mathbf{y}\) is the vector of the dependent variable

  • \(\mathbf{W}\) is the spatial weights matrix, so \(\mathbf{W y}\) is the spatial lag of \(\mathbf{y}\)

  • \(\rho\) is the spatial autoregressive coefficient

  • \(\mathbf{X}\) is the matrix of explanatory variables

  • \(\boldsymbol{\beta}\) is the vector of coefficients

  • \(\boldsymbol{\epsilon}\) is the error term
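The Columbus crime dataset used below is not loaded in the cells shown here; one possible source is the example data bundled with libpysal (a sketch, the original may read a local file instead):

# Load the classic Columbus, OH dataset shipped with libpysal (assumed source)
from libpysal import examples
gdf_columbus = gpd.read_file(examples.get_path('columbus.shp'))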

# we need to set the CRS to EPSG:26917 (NAD83 / UTM zone 17N) because the polygon coordinates in the data are in this CRS, not EPSG:4326.
gdf_columbus = gdf_columbus.set_crs(epsg=26917, allow_override=True)
print(gdf_columbus.crs)
EPSG:26917
# Import the maximum-likelihood spatial lag estimator from spreg
from spreg import ML_Lag

# Spatial weights
w = weights.Queen.from_dataframe(gdf_columbus, use_index=True)
w.transform = 'R'  # Row-standardized

# Variables
y = gdf_columbus[['CRIME']].values          # Dependent variable
X = gdf_columbus[['INC', 'HOVAL']].values   # Independent variables are Income and House Value.

# Fit spatial lag model
model_sl = ML_Lag(y, X, w=w, name_y='CRIME', name_x=['INC', 'HOVAL'], name_w='Queen', name_ds='Columbus')

# Output summary
print(model_sl.summary)
REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL)
-----------------------------------------------------------------
Data set            :    Columbus
Weights matrix      :       Queen
Dependent Variable  :       CRIME                Number of Observations:          49
Mean dependent var  :     35.1288                Number of Variables   :           4
S.D. dependent var  :     16.7321                Degrees of Freedom    :          45
Pseudo R-squared    :      0.6473
Spatial Pseudo R-squared:  0.5806
Log likelihood      :   -182.6740
Sigma-square ML     :     96.8572                Akaike info criterion :     373.348
S.E of regression   :      9.8416                Schwarz criterion     :     380.915

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        45.60325         7.25740         6.28369         0.00000
                 INC        -1.04873         0.30741        -3.41154         0.00065
               HOVAL        -0.26633         0.08910        -2.98929         0.00280
             W_CRIME         0.42333         0.11951         3.54216         0.00040
------------------------------------------------------------------------------------

SPATIAL LAG MODEL IMPACTS
Impacts computed using the 'simple' method.
            Variable         Direct        Indirect          Total
                 INC        -1.0487         -0.7699         -1.8186
               HOVAL        -0.2663         -0.1955         -0.4618
================================ END OF REPORT =====================================

In summary, the results reflect the spatial lag model equation: the spatial autoregressive coefficient \(\rho\) corresponds to the coefficient of W_CRIME (0.42), and the vector of regression coefficients \(\boldsymbol{\beta}\) contains the coefficients for INC and HOVAL, which are -1.05 and -0.27 respectively. The variance of the error term \(\boldsymbol{\epsilon}\) is estimated by the maximum-likelihood Sigma-square (96.86), whose square root is the standard error of regression (9.84).

residuals = model_sl.u  # residuals from the SLM
moran_resid = esda.Moran(residuals, w)
print(moran_resid.I, moran_resid.p_sim)
0.033068305761995494 0.268
# get the in-sample (fitted) crime rates and the residuals
gdf_columbus["fitted"] = model_sl.predy.flatten()
gdf_columbus["residuals"] = model_sl.u.flatten()

2.2.5 Spatial Durbin Model (SDM)

The Spatial Durbin Model (SDM) is specified as:

\[ \mathbf{y} = \rho \mathbf{W} \mathbf{y} + \mathbf{X} \boldsymbol{\beta} + \mathbf{W} \mathbf{X} \boldsymbol{\theta} + \boldsymbol{\varepsilon} \]

where:

  • \( \mathbf{y} \) is the dependent variable vector,

  • \( \rho \mathbf{W} \mathbf{y} \) represents the spatial lag of the dependent variable,

  • \( \mathbf{X} \boldsymbol{\beta} \) are the usual covariates,

  • \( \mathbf{W} \mathbf{X} \boldsymbol{\theta} \) are the spatially lagged covariates,

  • \( \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I}) \) is the error term.

# Fit the Spatial Durbin Model (ML_Lag with slx_lags=1 adds the spatially lagged X terms)
model_dur = ML_Lag(y, X, w=w, name_y='CRIME', name_x=['INC', 'HOVAL'], name_w='Queen', name_ds='Columbus',
                   slx_lags=1,
                   slx_vars= [True, True])

# Output summary
print(model_dur.summary)
REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG WITH SLX - SPATIAL DURBIN MODEL (METHOD = FULL)
-------------------------------------------------------------------------------------------------
Data set            :    Columbus
Weights matrix      :       Queen
Dependent Variable  :       CRIME                Number of Observations:          49
Mean dependent var  :     35.1288                Number of Variables   :           6
S.D. dependent var  :     16.7321                Degrees of Freedom    :          43
Pseudo R-squared    :      0.6603
Spatial Pseudo R-squared:  0.6120
Log likelihood      :   -181.6393
Sigma-square ML     :     93.2722                Akaike info criterion :     375.279
S.E of regression   :      9.6578                Schwarz criterion     :     386.629

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        44.32001        13.04547         3.39735         0.00068
                 INC        -0.91991         0.33474        -2.74811         0.00599
               HOVAL        -0.29713         0.09042        -3.28625         0.00102
               W_INC        -0.58391         0.57422        -1.01687         0.30921
             W_HOVAL         0.25768         0.18723         1.37626         0.16874
             W_CRIME         0.40346         0.16133         2.50079         0.01239
------------------------------------------------------------------------------------

DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                              DF         VALUE           PROB

SPATIAL DURBIN MODEL IMPACTS
Impacts computed using the 'simple' method.
            Variable         Direct        Indirect          Total
                 INC        -0.9199         -1.6010         -2.5209
               HOVAL        -0.2971          0.2310         -0.0661
================================ END OF REPORT =====================================

Geographically Weighted Regression (GWR) and Multiscale Geographically Weighted Regression (MGWR) models can be found here (click).

Machine learning (ML) workflow

ML tasks in space and time

2.3.1 SVM regression model and hyperparameter tuning using GridSearch + Random CV

Step 1 Problem definition:

We will use POI data (X features) to predict crime levels (y targets) across London grids (1 km × 1 km) using machine learning models.

# Load the grid data (GEOSTAT December 2011 1 km grids for the UK)
gdf_grid = gpd.read_file('data/GEOSTAT_Dec_2011_GEC_in_the_United_Kingdom_2022.geojson')
gdf_grid.head()
OBJECTID grd_fixid grd_floaid grd_newid pcd_count tot_p tot_m tot_f tot_hh GlobalID geometry
0 1 4.0.31510000.31050000 3151.3105.4 1kmN3105E3151 0 0 0 0 0 c142c42b-5160-4760-a836-956a3920caf8 POLYGON ((-6.41829 49.87519, -6.41852 49.87527...
1 2 4.0.31520000.31040000 3152.3104.4 1kmN3104E3152 0 0 0 0 0 26709ad2-2ae8-4f76-95f4-850c60e9c8e6 POLYGON ((-6.41461 49.87373, -6.40491 49.87515...
2 3 4.0.31520000.31050000 3152.3105.4 1kmN3105E3152 0 0 0 0 0 e477a513-48a2-4012-9cd7-3063ac2bc6c6 POLYGON ((-6.40698 49.8838, -6.40414 49.87568,...
3 4 4.0.31520000.31060000 3152.3106.4 1kmN3106E3152 0 0 0 0 0 8d1562d5-3252-40ef-858e-c80391bebefe POLYGON ((-6.40726 49.88398, -6.42047 49.88204...
4 5 4.0.31530000.31050000 3153.3105.4 1kmN3105E3153 0 0 0 0 0 2952357c-da71-4e17-93a0-ace5e73e86d4 POLYGON ((-6.40414 49.87568, -6.40698 49.8838,...
# read the London boundary
gdf_london_bound = gpd.read_file('data/whole_london_boundary.geojson')
# to select the grids in London, we do not use overlay to clip the shapes; instead we use a spatial join to select the grids that intersect London
gdf_london_grid = gpd.sjoin(gdf_grid, gdf_london_bound, how='inner', predicate='intersects')
# drop the index_right column
gdf_london_grid = gdf_london_grid.drop(columns=['index_right'])
# here we only use the gis_osm_pois_free_1.shp file
gdf_poi = gpd.read_file('data/osm_england-latest-free-shp/gis_osm_pois_free_1.shp')
gdf_poi.columns
Index(['osm_id', 'code', 'fclass', 'name', 'geometry'], dtype='object')
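The spatial join that assigns each POI to a grid cell is not shown; a minimal sketch of that step (assuming the two layers are brought into the same CRS first):

# Spatially join each POI to the grid cell that contains it (assumed step, not shown in the original)
gdf_poi = gdf_poi.to_crs(gdf_london_grid.crs)
gdf_poi_grid_s = gpd.sjoin(gdf_poi, gdf_london_grid[['grd_newid', 'geometry']], how='inner', predicate='within')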
# we calculate the number of POIs and the number of distinct POI types for each grid
df_poi_counts = gdf_poi_grid_s.groupby(['grd_newid']).size().reset_index(name='poi_counts')
df_poi_types = gdf_poi_grid_s.groupby(['grd_newid']).nunique()['fclass'].reset_index(name='poi_types')
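# (Assumed step, not shown in the original: merge the per-grid POI counts and types back onto the
#  grid polygons, producing the gdf_poi_grid_s_poi_type used in the crime merge below)
gdf_poi_grid_s_poi_type = (gdf_london_grid
                           .merge(df_poi_counts, on='grd_newid', how='left')
                           .merge(df_poi_types, on='grd_newid', how='left'))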
# read the crime data
df_crime = pd.read_csv('data/crime_2024_call_police_met_city.csv')
# convert the DataFrame to a GeoDataFrame
gdf_crime_s = gpd.GeoDataFrame(df_crime, geometry=gpd.points_from_xy(df_crime['Longitude'], df_crime['Latitude']), crs='EPSG:4326')
# merge the crime data with the grid data
gdf_london_grid_poi_crime = gdf_poi_grid_s_poi_type.merge(df_crime, on='grd_newid', how='left')
# gdf_london_grid_poi_crime.to_file('data/london_grid_poi_crime.geojson', driver='GeoJSON')

Step 3 Data Splitting:
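No code for this step is shown; a minimal sketch under the assumption that poi_counts and poi_types are the features, crime_counts is the target, and an 80/20 split is used (the random_state is an arbitrary choice):

# Split the merged grid data into training (80%) and testing (20%) sets
from sklearn.model_selection import train_test_split

feature_cols = ['poi_counts', 'poi_types']   # assumed feature columns
target_col = 'crime_counts'                  # assumed target column
gdf_london_grid_poi_crime_train, gdf_london_grid_poi_crime_test = train_test_split(
    gdf_london_grid_poi_crime, test_size=0.2, random_state=42)
X_train = gdf_london_grid_poi_crime_train[feature_cols].values
y_train = gdf_london_grid_poi_crime_train[[target_col]].values
X_test = gdf_london_grid_poi_crime_test[feature_cols].values
y_test = gdf_london_grid_poi_crime_test[[target_col]].values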

Step 4 Feature Scaling:
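Again no code is shown; a minimal sketch that scales both the features and the target (scaling the target is assumed from the y_train_scaled variable name used in the later cells):

# Standardise features and target so the SVR hyperparameters work on comparable scales
from sklearn.preprocessing import StandardScaler

scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)
y_train_scaled = scaler_y.fit_transform(y_train)
y_test_scaled = scaler_y.transform(y_test)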

Step 5 Model Training and Hyperparameter Tuning (Cross-Validation):

Random CV and GridSearch
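A minimal sketch of the search set-up assumed by the results table below (the parameter grid is inferred from the param_* columns; the scoring choice and 5-fold CV are assumptions consistent with the negative test scores shown):

# SVR with an exhaustive grid search over 144 hyperparameter combinations, scored by negative MSE
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

svm_model = SVR()
param_grid = {
    'C': [0.1, 1, 10, 100],
    'epsilon': [0.1, 0.2, 0.3],
    'gamma': ['scale', 'auto', 0.1, 0.01],
    'kernel': ['linear', 'rbf', 'poly'],
}
grid_search = GridSearchCV(svm_model, param_grid, cv=5,
                           scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train_scaled.ravel())
best_model = grid_search.best_estimator_
print(grid_search.best_params_)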

gscv_results_df = pd.DataFrame(grid_search.cv_results_)
gscv_results_df
mean_fit_time std_fit_time mean_score_time std_score_time param_C param_epsilon param_gamma param_kernel params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.011516 0.001450 0.002091 0.000071 0.1 0.1 scale linear {'C': 0.1, 'epsilon': 0.1, 'gamma': 'scale', '... -0.161602 -0.309483 -0.269594 -0.213100 -1.038503 -0.398456 0.323925 39
1 0.012661 0.000412 0.004879 0.000386 0.1 0.1 scale rbf {'C': 0.1, 'epsilon': 0.1, 'gamma': 'scale', '... -0.178756 -0.425421 -0.337690 -0.235625 -1.408892 -0.517277 0.453782 119
2 0.046547 0.011293 0.001759 0.000071 0.1 0.1 scale poly {'C': 0.1, 'epsilon': 0.1, 'gamma': 'scale', '... -0.235991 -0.381201 -0.431818 -0.278688 -0.515603 -0.368660 0.101424 6
3 0.010931 0.000872 0.001876 0.000066 0.1 0.1 auto linear {'C': 0.1, 'epsilon': 0.1, 'gamma': 'auto', 'k... -0.161602 -0.309483 -0.269594 -0.213100 -1.038503 -0.398456 0.323925 39
4 0.012968 0.002267 0.004626 0.000337 0.1 0.1 auto rbf {'C': 0.1, 'epsilon': 0.1, 'gamma': 'auto', 'k... -0.179099 -0.424231 -0.337557 -0.236367 -1.408735 -0.517198 0.453636 118
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
139 0.022242 0.002893 0.002663 0.000720 100.0 0.3 0.1 rbf {'C': 100, 'epsilon': 0.3, 'gamma': 0.1, 'kern... -0.189447 -0.327770 -0.273917 -0.211199 -1.403278 -0.481122 0.463628 97
140 0.115625 0.023100 0.001242 0.000179 100.0 0.3 0.1 poly {'C': 100, 'epsilon': 0.3, 'gamma': 0.1, 'kern... -0.228806 -0.361929 -0.396099 -0.261518 -1.036504 -0.456971 0.296263 74
141 0.099460 0.013776 0.001299 0.000078 100.0 0.3 0.01 linear {'C': 100, 'epsilon': 0.3, 'gamma': 0.01, 'ker... -0.169583 -0.301983 -0.259901 -0.212266 -1.041523 -0.397051 0.325294 23
142 0.016670 0.004624 0.003011 0.001341 100.0 0.3 0.01 rbf {'C': 100, 'epsilon': 0.3, 'gamma': 0.01, 'ker... -0.157879 -0.318991 -0.271800 -0.206021 -1.139578 -0.418854 0.364540 64
143 0.014776 0.002152 0.001799 0.000968 100.0 0.3 0.01 poly {'C': 100, 'epsilon': 0.3, 'gamma': 0.01, 'ker... -0.300499 -0.815115 -0.488829 -0.297477 -2.221372 -0.824658 0.723373 137

144 rows × 17 columns

# We can compute the MSE and other performance metrics on the training set manually
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Make predictions on the training set
y_train_pred = best_model.predict(X_train_scaled)
# Calculate performance metrics
mse_train = mean_squared_error(y_train_scaled, y_train_pred)
mae_train = mean_absolute_error(y_train_scaled, y_train_pred)
r2_train = r2_score(y_train_scaled, y_train_pred)
print("Training set performance metrics:")
print("Mean Squared Error (MSE):", mse_train)
print("Mean Absolute Error (MAE):", mae_train)
print("R-squared (R2):", r2_train)
Training set performance metrics:
Mean Squared Error (MSE): 0.3026418555774837
Mean Absolute Error (MAE): 0.3689697950763964
R-squared (R2): 0.6973581444225163

Step 6 Model evaluation and testing:

2.3.2 Evaluating the trained SVR via Random CV and Spatial CV

We have previously discussed spatial CV and its importance. Random CV may cause data-leakage issues when splitting the original training data into a new training set (80%) and a new validation set (20%). We will analyze and compare the performance of the trained SVR between random CV and spatial CV.

# we integrate the training and validation indices into the gdf for plotting
for i in range(0, 5):
    gdf_london_grid_poi_crime['index_gr'] = gdf_london_grid_poi_crime.index
    gdf_london_grid_poi_crime[f'Fold {i + 1}'] = gdf_london_grid_poi_crime.index_gr.apply(
        lambda x: 'new traning set' if x in tr_idx_list[i] else (
            'validation set' if x in va_idx_list[i] else 'testing set'))

    # we plot the train and test via the index on the grids
    fig, ax = plt.subplots(1, 1, figsize=(6, 6))
    gdf_london_grid_poi_crime.plot(ax=ax, column=f'Fold {i + 1}', cmap='tab20c', legend=True, figsize=(8, 6),
                                   legend_kwds={'frameon': False})
    plt.title(f'Fold {i + 1} of Random CV')
    ax.axis('off')
    plt.show()
_images/b6723ee7d4df7e46cb2797b052df95ba8edbe862a3a5b84ce6635f31daa16bf2.png _images/cad50e556eb88ce10b171746a0e0bba5a57e4612730885bad465f21f69f5c99a.png _images/1d0ea5342be40433449c5ec535b7ec1bca2be2f317cc766fbd54a76b4dcb6245.png _images/4e430cfb6d426c3c2fb749cbf869e1d0787412067a24a4facea01bc622a00616.png _images/8b7a7acb00e91fc22ab1a75272419482afb57c3834f18a5d4286ac851263eaf1.png
gdf_london_grid_poi_crime_train
OBJECTID grd_fixid grd_floaid grd_newid pcd_count tot_p tot_m tot_f tot_hh GlobalID ... crime_counts index_gr Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 centroid_x centroid_y cluster
0 214289 4.0.35930000.32030000 3593.3203.4 1kmN3203E3593 3 9 5 4 5 9f0c1e1f-e8fa-40ca-a969-d1aedbefa750 ... 5.0 0 new traning set new traning set validation set new traning set new traning set 503482.732613 175573.691920 2
1 214290 4.0.35930000.32040000 3593.3204.4 1kmN3204E3593 54 1957 1029 928 833 e2530627-9630-4200-ae13-788341456d76 ... 13.0 1 new traning set new traning set new traning set new traning set validation set 503319.399333 176558.376435 2
2 214865 4.0.35940000.32020000 3594.3202.4 1kmN3202E3594 13 445 223 222 169 bf801f98-5376-4c37-8bcb-d80ae9af1f0b ... 0.0 2 new traning set new traning set validation set new traning set new traning set 504633.969678 174752.667917 2
3 214866 4.0.35940000.32030000 3594.3203.4 1kmN3203E3594 3 34 18 16 11 472a8640-40e0-4c94-8eb5-37647847dc99 ... 1519.0 3 new traning set validation set new traning set new traning set new traning set 504470.650327 175737.362096 2
4 214867 4.0.35940000.32040000 3594.3204.4 1kmN3204E3594 1 15 9 6 8 c21982c0-0d13-43df-8fa5-f77c28fba3d6 ... 0.0 4 new traning set new traning set new traning set validation set new traning set 504307.315366 176722.053398 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1381 233221 4.0.36330000.32050000 3633.3205.4 1kmN3205E3633 149 17774 9897 7877 4810 81df1f73-eabc-432a-96cf-354e3b7e1e58 ... 3716.0 1381 new traning set new traning set validation set new traning set new traning set 542671.312583 184090.897282 1
1382 233222 4.0.36330000.32060000 3633.3206.4 1kmN3206E3633 138 17822 9494 8328 4784 c5616e02-895d-4669-8768-8357a13dd7f2 ... 1345.0 1382 new traning set validation set new traning set new traning set new traning set 542507.875127 185075.860321 1
1383 233223 4.0.36330000.32070000 3633.3207.4 1kmN3207E3633 55 6018 3233 2785 1854 8f730656-d48d-4f99-928b-30cb304f3058 ... 861.0 1383 new traning set new traning set new traning set new traning set validation set 542344.421880 186060.820314 1
1384 233224 4.0.36330000.32080000 3633.3208.4 1kmN3208E3633 19 1544 734 810 529 40f56c08-6faf-46f0-b86f-43a4633d560e ... 152.0 1384 new traning set new traning set new traning set validation set new traning set 542180.952846 187045.777269 1
1385 233225 4.0.36330000.32090000 3633.3209.4 1kmN3209E3633 68 4602 2308 2294 1387 65b9480d-a4f7-4b82-9731-65080d9b5f31 ... 419.0 1385 new traning set new traning set validation set new traning set new traning set 542017.468026 188030.731192 1

1386 rows × 24 columns

# use groupcv in sklearn to select the data
from sklearn.model_selection import GroupKFold

# Define the number of splits
n_splits = 5
# Create a GroupKFold object
group_kfold = GroupKFold(n_splits=n_splits)
# Initialize an empty list to store the mean squared errors
g_mse_list = []
# Initialize an empty list to store the training and validation indices
g_tr_idx_list = []
g_va_idx_list = []
# Loop through the splits
for i, (g_train_idx, g_va_idx) in enumerate(
        group_kfold.split(X_train_scaled, y_train_scaled, groups=gdf_london_grid_poi_crime_train['cluster'])):
    X_tr, X_va = X_train_scaled[g_train_idx], X_train_scaled[g_va_idx]
    y_tr, y_va = y_train_scaled[g_train_idx], y_train_scaled[g_va_idx]
    model = svm_model.set_params(**{'C': 10, 'epsilon': 0.3, 'gamma': 0.1, 'kernel': 'poly'})
    model.fit(X_tr, y_tr.ravel())
    preds = model.predict(X_va)
    g_mse = mean_squared_error(y_va, preds)
    print(f"Fold {i + 1}: MSE = {g_mse:.2f}",
          f"Training samples length = {len(train_idx.tolist())},",
          f"Validation samples length = {len(va_idx.tolist())}")
    g_mse_list.append(g_mse)
    print('Mean of mses:', f'{np.mean(g_mse_list):.2f}')
    g_tr_idx_list.append(g_train_idx.tolist())
    g_va_idx_list.append(g_va_idx.tolist())
Fold 1: MSE = 0.31 Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.31
Fold 2: MSE = 0.36 Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.34
Fold 3: MSE = 0.36 Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.35
Fold 4: MSE = 0.76 Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.45
Fold 5: MSE = 0.24 Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.41
# We integrate the training and validation indices into the gdf for plotting
for i in range(0, 5):
    gdf_london_grid_poi_crime['index_gr'] = gdf_london_grid_poi_crime.index
    gdf_london_grid_poi_crime[f'Group Fold {i + 1}'] = gdf_london_grid_poi_crime.index_gr.apply(
        lambda x: 'new traning set' if x in g_tr_idx_list[i] else (
            'validation set' if x in g_va_idx_list[i] else 'testing set'))
    # we plot the train and test via the index on the grids
    fig, ax = plt.subplots(1, 1, figsize=(6, 6))
    gdf_london_grid_poi_crime.plot(ax=ax, column=f'Group Fold {i + 1}', cmap='tab20c', legend=True, figsize=(8, 6),
                                   legend_kwds={'frameon': False})
    plt.title(f'Fold {i + 1} of Spatial CV')
    ax.axis('off')
    plt.show()
_images/bf8417df3b9fc0c17843798a14d9746dc54a6f3b6e12bce3aef63a3f3696f2da.png _images/b1a7521b9a78754045015395a0da96af9fa0345091deded73b34989fbaaf55ca.png _images/327e161c7f239a33c011205dbca37dc256f4bbc068bf74ca06e3f7919c3a44a5.png _images/f81ead393c58b1206cda1e7c7b99ff346e1a12a4e04849bcf867cb042a2738e5.png _images/5ebd401e72c9ce716e6d5a34742a28cafd4103b17e8f1fafb5691f0b661898a8.png

Two important parameter settings when combining GridSearchCV() and grid_search.fit() with spatial CV (a sketch follows this list):

  • cv: pass the group_kfold object created from GroupKFold to GridSearchCV().

  • groups: pass the array/list of group labels (groups_label) for the training set to grid_search.fit().
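A minimal sketch of wiring these two settings together (svm_model and param_grid come from the earlier grid-search sketch; groups_label holds the spatial cluster label of each training grid cell):

# Grid search whose folds respect the spatial clusters (spatial CV)
groups_label = gdf_london_grid_poi_crime_train['cluster'].values
spatial_grid_search = GridSearchCV(svm_model, param_grid, cv=group_kfold,
                                   scoring='neg_mean_squared_error', n_jobs=-1)
spatial_grid_search.fit(X_train_scaled, y_train_scaled.ravel(), groups=groups_label)
print(spatial_grid_search.best_params_)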

Hyperparameter tuning via GridSearch and Spatial CV

Training set performance metrics

Testing set performance metrics
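A minimal sketch of the test-set evaluation (reusing the scaled test data and the best estimator from the spatial grid-search sketch above; all of these names are assumptions carried over from the earlier sketches):

# Evaluate the tuned model on the held-out test set
y_test_pred = spatial_grid_search.best_estimator_.predict(X_test_scaled)
print('Test MSE:', mean_squared_error(y_test_scaled, y_test_pred))
print('Test MAE:', mean_absolute_error(y_test_scaled, y_test_pred))
print('Test R2 :', r2_score(y_test_scaled, y_test_pred))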

We observe that the model’s estimated performance decreases when using spatial cross-validation with grid search compared to the random cross-validation approach, and the optimal hyperparameters change. This suggests that, for spatial prediction, the spatially cross-validated model is more robust than the one we previously trained: its performance estimate is less inflated by spatial data leakage, even though the reported performance is lower.