Week 8 Lab Tutorial Timeline

| Time | Activity |
|---|---|
| 16:00–16:05 | Introduction — Overview of the main tasks for the lab tutorials |
| 16:05–16:45 | Tutorial: Machine learning for spatio-temporal analysis — Follow Sections 8.1–8.2 of the Jupyter Notebook to practice the SVM, CV and spatial CV framework |
| 16:45–17:30 | Tutorial: Machine learning for spatio-temporal analysis — Follow Section 8.3 of the Jupyter Notebook to practice the RF |
| 17:30–17:55 | Quiz — Complete the quiz tasks |
| 17:55–18:00 | Wrap-up — Recap key points and address final questions |
For this module’s lab tutorials, you can download all the required data using the provided link.
Please make sure that the Jupyter Notebook file and the data and img folders are placed in the same directory (specifically within the STBDA_lab folder) so that the code runs correctly.
Week 8 Key Takeaways:
Understand the fundamentals of machine learning (ML) in space and time, including the ML workflow and tasks in spatial and spatio-temporal prediction.
Learn about spatial cross-validation (CV) and its importance in avoiding data leakage in spatial data.
Implement a Support Vector Machine (SVM) regression model for spatial prediction, including hyperparameter tuning using Random CV.
Evaluate the trained SVM model using both Random CV and spatial CV, and understand the differences in performance.
Explore the use of Random Forests (RF) for spatial and temporal prediction tasks.
8 Machine learning for spatio-temporal analysis 1#
8.1 Fundamentals of ML in space and time#
8.1.1 Machine learning (ML) workflow#
Machine learning (ML) is a branch of artificial intelligence that focuses on developing algorithms and models that enable computers to learn and make predictions from data without being explicitly programmed. Key steps in a typical supervised ML workflow (a minimal code sketch follows the list):
Problem Definition: define the objective (e.g., classification or regression).
Data Collection: gather data from relevant sources (\(X\) features/predictors/independent variables and \(y\) labels/targets/responses/dependent variables).
Initial Data Preprocessing: handle missing values, remove duplicates, encode categorical variables (if necessary), and handle outliers (if necessary).
Data Splitting: split the dataset into training, validation, and test sets (e.g., 70/15/15 or 80/20 rule).
Feature Scaling (Normalization/Standardization): fit the scaler on the training set only, then transform both the training and test sets.
Further Preprocessing: feature selection and dimensionality reduction (e.g., PCA), fitted only on the training data if necessary.
Model Training and Hyperparameter Tuning (Cross-Validation): train the model on the training set and optimise it via hyperparameter tuning (grid search, random search, or Bayesian optimization) → performance metrics on the training set.
Model Evaluation and Testing: evaluate on the validation/test set → performance metrics on the test set.
Deployment: deploy the trained model into a production environment.
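To make these steps concrete, here is a minimal, self-contained sketch of the workflow using a small synthetic regression dataset (not the lab data); all names and values are illustrative only.
# Minimal end-to-end sketch of the supervised ML workflow (synthetic data, illustrative only)
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Data collection: synthetic features X and target y
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Data splitting (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling: fit on the training set only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Model training and hyperparameter tuning with 5-fold CV
grid = GridSearchCV(SVR(), {'C': [1, 10], 'kernel': ['linear', 'rbf']},
                    cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train_s, y_train)

# Model evaluation on the held-out test set
print('Test MSE:', mean_squared_error(y_test, grid.best_estimator_.predict(X_test_s)))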
8.1.2 ML tasks in space and time#
1. Spatial prediction – Space (X, Y):
Spatial prediction focuses on predicting values at unobserved spatial units/locations using data from known units/locations (at the same observation time; no time factor).
Example: Assume there are 100 neighborhoods in a city. You have house price data for 90 of these neighborhoods for the year 2024. However, for the remaining 10 neighborhoods, no house price data is available — possibly because no houses were sold in those areas during that year. Your goal is to predict house prices in those 10 neighborhoods based solely on the spatial patterns learned from the 90 neighborhoods and other available spatial data (no historical (temporal) data is considered).
2. Spatio-temporal prediction – Space (X, Y) + Time (T):
Spatio-temporal prediction considers both spatial and temporal dynamics, aiming to predict values at spatial units for a specific time, i.e., to answer how things change over time and across space.
Example: Assume there are 100 neighborhoods in a city. You have house price data for all of these neighborhoods for the years 2013–2023. Your goal is to predict house prices for those 100 neighborhoods in 2024 based on the historical house price data (together with other available data).
8.1.3 Spatial cross-validation (CV)#
Spatial cross-validation (CV) is a validation technique designed specifically for spatial data. It aims to avoid data leakage and produce realistic performance estimates by accounting for spatial autocorrelation (i.e., nearby locations tend to have similar values).
Why we need spatial CV: Random CV assumes data points are independent and identically distributed, but in spatial data, nearby observations are often similar (spatial autocorrelation), so if train and test sets are spatially close, the model may look like it performs well — but it’s just memorizing nearby areas.
Spatial CV methods include the following (Ref link); a minimal sketch of the grid-based approach follows the table:

| Method | Description | Use Cases |
|---|---|---|
| 1. Clustering-Based Spatial CV | Spatial units (e.g., areas or points) are grouped using clustering (e.g., KMeans) | Heterogeneous areas, no clear spatial boundaries |
| 2. Grid-Based Spatial CV | Study area is divided into uniform grid cells (e.g., 5km × 5km blocks), and CV is done across cells | Raster data, environmental monitoring, spatial interpolation |
| 3. Geo-Attribute-Based CV | Spatial units grouped based on spatial and non-spatial features (e.g., census area, administrative area, city, region) | Urban modeling, socio-spatial segregation, enriched context |
| 4. Spatial Leave-One-Out CV | Each spatial unit is left out in turn; training excludes that unit and often a buffer zone around it | Small datasets or models needing strict leakage control |
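The lab below implements the clustering-based approach (method 1). For comparison, here is a minimal sketch of the grid-based approach (method 2), assuming a projected GeoDataFrame gdf (a hypothetical name; coordinates in metres, e.g. EPSG:27700): each spatial unit is assigned to a coarse block from its centroid, and the block id is used as the group label for GroupKFold.
# Minimal sketch of grid-based spatial CV (method 2), assuming a projected GeoDataFrame `gdf`
from sklearn.model_selection import GroupKFold

block_size = 5000  # 5 km x 5 km blocks (metres, since the CRS is projected)
cx = gdf.geometry.centroid.x
cy = gdf.geometry.centroid.y
# each unit gets a block id derived from which 5 km block its centroid falls into
gdf['block_id'] = ((cx // block_size).astype(int).astype(str) + '_' +
                   (cy // block_size).astype(int).astype(str))

gkf = GroupKFold(n_splits=5)
for fold, (tr_idx, va_idx) in enumerate(gkf.split(gdf, groups=gdf['block_id'])):
    print(f"Fold {fold + 1}: {len(tr_idx)} training rows, {len(va_idx)} validation rows")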
8.2 Spatial prediction using Support Vector Machine (SVM)#
8.2.1 SVM regression model and hyperparameter tuning using GridSearch + Random CV#
Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression problems. The core idea of SVM is to find the optimal hyperplane that separates different classes in the feature space with the maximum margin. For non-linearly separable data, SVM uses kernel functions to project the data into a higher-dimensional space where a linear separator is possible.
Key concepts:
Margin: The distance between the hyperplane and the nearest data points from each class (support vectors).
Support Vectors: Data points that lie closest to the decision boundary and influence its position.
Kernel Trick: A technique to transform data into higher dimensions using functions like polynomial, radial basis function (RBF), etc.
The important parameters for SVM in the sklearn library (a short illustrative sketch follows the parameter list):
kernel (str): Specifies the kernel type to be used: linear, poly (polynomial), rbf (radial basis function, default), sigmoid.
C (float): Regularization parameter. A smaller C encourages a wider margin (more tolerance to misclassification), while a larger C aims for accurate classification (less tolerance).
gamma (float, or 'scale' / 'auto'): Kernel coefficient for 'rbf', 'poly', and 'sigmoid'. 'scale' (default): 1 / (n_features * X.var()); 'auto': 1 / n_features; float: custom value.
degree (int): Degree of the polynomial kernel function ('poly'). Ignored by other kernels.
coef0 (float): Independent term in the kernel function. It is significant in 'poly' and 'sigmoid'.
shrinking (bool): Whether to use the shrinking heuristic, which can speed up training.
decision_function_shape (str): One-vs-one ('ovo') or one-vs-rest ('ovr'). Default is 'ovr'.
epsilon (float): Epsilon in the epsilon-SVR model. Specifies the epsilon-tube within which no penalty is associated in the training loss function.
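To connect these parameters to code, here is a minimal illustrative sketch of instantiating sklearn.svm.SVR; the values below are placeholders, not the tuned values found later in this section.
# Illustrative only: an SVR with the key parameters described above
# (placeholder values, not the tuned values found later via GridSearchCV)
from sklearn.svm import SVR
svr = SVR(
    kernel='rbf',    # 'linear', 'poly', 'rbf' (default) or 'sigmoid'
    C=10.0,          # regularisation strength: larger C tolerates fewer errors
    gamma='scale',   # kernel coefficient: 'scale' (default), 'auto' or a float
    epsilon=0.1,     # width of the epsilon-tube with no training penalty
)
print(svr.get_params())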
# Import necessary libraries
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
We use the UK open police data in London to demonstrate machine learning for a spatial prediction task. Instead of the LSOA spatial unit of analysis used previously, we use a different geospatial unit, the GEOSTAT 1km grid from the ONS, and add Point of Interest (POI) data from OpenStreetMap as the feature data source. A point of interest is a specific location or place (like restaurants, cafes, bars and hospitals) that is of particular significance or utility within a geographic context.
Step 1 Problem definition:
We will use the POI data (X features) to predict the crime levels (y targets) across London grids (1km × 1km) via machine learning models.
Step 2 data collection and preprocessing:
# Load the GEOSTAT 1km grid data
gdf_grid = gpd.read_file('data/GEOSTAT_Dec_2011_GEC_in_the_United_Kingdom_2022.geojson')
gdf_grid.head()
| OBJECTID | grd_fixid | grd_floaid | grd_newid | pcd_count | tot_p | tot_m | tot_f | tot_hh | GlobalID | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4.0.31510000.31050000 | 3151.3105.4 | 1kmN3105E3151 | 0 | 0 | 0 | 0 | 0 | c142c42b-5160-4760-a836-956a3920caf8 | POLYGON ((-6.41829 49.87519, -6.41852 49.87527... |
| 1 | 2 | 4.0.31520000.31040000 | 3152.3104.4 | 1kmN3104E3152 | 0 | 0 | 0 | 0 | 0 | 26709ad2-2ae8-4f76-95f4-850c60e9c8e6 | POLYGON ((-6.41461 49.87373, -6.40491 49.87515... |
| 2 | 3 | 4.0.31520000.31050000 | 3152.3105.4 | 1kmN3105E3152 | 0 | 0 | 0 | 0 | 0 | e477a513-48a2-4012-9cd7-3063ac2bc6c6 | POLYGON ((-6.40698 49.8838, -6.40414 49.87568,... |
| 3 | 4 | 4.0.31520000.31060000 | 3152.3106.4 | 1kmN3106E3152 | 0 | 0 | 0 | 0 | 0 | 8d1562d5-3252-40ef-858e-c80391bebefe | POLYGON ((-6.40726 49.88398, -6.42047 49.88204... |
| 4 | 5 | 4.0.31530000.31050000 | 3153.3105.4 | 1kmN3105E3153 | 0 | 0 | 0 | 0 | 0 | 2952357c-da71-4e17-93a0-ace5e73e86d4 | POLYGON ((-6.40414 49.87568, -6.40698 49.8838,... |
# There are 260661 grids (1km * 1km) in the UK
print(len(gdf_grid))
260661
# read the London boundary
gdf_london_bound = gpd.read_file('data/whole_london_boundary.geojson')
# to select the grids in London, we do not use overlay to clip the shape but use a spatial join to select the grids that intersect London
gdf_london_grid = gpd.sjoin(gdf_grid, gdf_london_bound, how='inner', predicate='intersects')
# There are 1733 grids of London
gdf_london_grid
| OBJECTID | grd_fixid | grd_floaid | grd_newid | pcd_count | tot_p | tot_m | tot_f | tot_hh | GlobalID | geometry | index_right | name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 214288 | 214289 | 4.0.35930000.32030000 | 3593.3203.4 | 1kmN3203E3593 | 3 | 9 | 5 | 4 | 5 | 9f0c1e1f-e8fa-40ca-a969-d1aedbefa750 | POLYGON ((-0.50339 51.46588, -0.51765 51.46459... | 0 | Greater London |
| 214289 | 214290 | 4.0.35930000.32040000 | 3593.3204.4 | 1kmN3204E3593 | 54 | 1957 | 1029 | 928 | 833 | e2530627-9630-4200-ae13-788341456d76 | POLYGON ((-0.50545 51.47476, -0.51972 51.47347... | 0 | Greater London |
| 214864 | 214865 | 4.0.35940000.32020000 | 3594.3202.4 | 1kmN3202E3594 | 13 | 445 | 223 | 222 | 169 | bf801f98-5376-4c37-8bcb-d80ae9af1f0b | POLYGON ((-0.48707 51.45829, -0.50133 51.457, ... | 0 | Greater London |
| 214865 | 214866 | 4.0.35940000.32030000 | 3594.3203.4 | 1kmN3203E3594 | 3 | 34 | 18 | 16 | 11 | 472a8640-40e0-4c94-8eb5-37647847dc99 | POLYGON ((-0.48913 51.46717, -0.50339 51.46588... | 0 | Greater London |
| 214866 | 214867 | 4.0.35940000.32040000 | 3594.3204.4 | 1kmN3204E3594 | 1 | 15 | 9 | 6 | 8 | c21982c0-0d13-43df-8fa5-f77c28fba3d6 | POLYGON ((-0.49118 51.47605, -0.50545 51.47476... | 0 | Greater London |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 240547 | 240548 | 4.0.36510000.32040000 | 3651.3204.4 | 1kmN3204E3651 | 0 | 0 | 0 | 0 | 0 | c1be8f1c-6c91-4dca-aad2-ce9f51538b0e | POLYGON ((0.32328 51.54665, 0.30897 51.54546, ... | 0 | Greater London |
| 240548 | 240549 | 4.0.36510000.32050000 | 3651.3205.4 | 1kmN3205E3651 | 2 | 73 | 41 | 32 | 25 | 65336bd5-772b-4398-a0de-4ccb1b438479 | POLYGON ((0.32137 51.55555, 0.30706 51.55436, ... | 0 | Greater London |
| 240549 | 240550 | 4.0.36510000.32060000 | 3651.3206.4 | 1kmN3206E3651 | 2 | 36 | 16 | 20 | 13 | befe6e14-0fef-4c44-8440-c3e80c139c63 | POLYGON ((0.31947 51.56445, 0.30515 51.56326, ... | 0 | Greater London |
| 240928 | 240929 | 4.0.36520000.32030000 | 3652.3203.4 | 1kmN3203E3652 | 2 | 26 | 15 | 11 | 12 | 96832cf0-62a9-49a4-b442-ac283b90db87 | POLYGON ((0.33949 51.53894, 0.32518 51.53776, ... | 0 | Greater London |
| 240929 | 240930 | 4.0.36520000.32040000 | 3652.3204.4 | 1kmN3204E3652 | 0 | 0 | 0 | 0 | 0 | f0c429cf-e001-432e-a41b-259a3c9e6582 | POLYGON ((0.33759 51.54784, 0.32328 51.54665, ... | 0 | Greater London |
1733 rows × 13 columns
# drop the index_right column
gdf_london_grid = gdf_london_grid.drop(columns=['index_right'])
# plot the London grid data
gdf_london_grid.plot(figsize= (6,6))
<Axes: >
# here we only use the gis_osm_pois_free_1.shp file
gdf_poi = gpd.read_file('data/osm_england-latest-free-shp/gis_osm_pois_free_1.shp')
gdf_poi.columns
Index(['osm_id', 'code', 'fclass', 'name', 'geometry'], dtype='object')
gdf_poi.plot(markersize=0.01, figsize=(8,12))
<Axes: >
# spatial join the poi data with the grid data
gdf_poi_grid = gpd.sjoin(gdf_poi, gdf_london_grid, how='left', predicate='intersects')
gdf_poi_grid_s = gdf_poi_grid[~gdf_poi_grid.grd_newid.isna()]
# we calculate the number of POIs and the number of POI types for each grid
df_poi_counts = gdf_poi_grid_s.groupby(['grd_newid']).size().reset_index(name='poi_counts')
df_poi_types = gdf_poi_grid_s.groupby(['grd_newid']).nunique()['fclass'].reset_index(name='poi_types')
# merge the poi data with the grid data one by one
gdf_london_grid_poi = gdf_london_grid.merge(df_poi_counts, on='grd_newid', how='left')
gdf_poi_grid_s_poi_type = gdf_london_grid_poi.merge(df_poi_types, on='grd_newid', how='left')
gdf_poi_grid_s_poi_type
| OBJECTID | grd_fixid | grd_floaid | grd_newid | pcd_count | tot_p | tot_m | tot_f | tot_hh | GlobalID | geometry | name | poi_counts | poi_types | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 214289 | 4.0.35930000.32030000 | 3593.3203.4 | 1kmN3203E3593 | 3 | 9 | 5 | 4 | 5 | 9f0c1e1f-e8fa-40ca-a969-d1aedbefa750 | POLYGON ((-0.50339 51.46588, -0.51765 51.46459... | Greater London | 3.0 | 3.0 |
| 1 | 214290 | 4.0.35930000.32040000 | 3593.3204.4 | 1kmN3204E3593 | 54 | 1957 | 1029 | 928 | 833 | e2530627-9630-4200-ae13-788341456d76 | POLYGON ((-0.50545 51.47476, -0.51972 51.47347... | Greater London | 5.0 | 5.0 |
| 2 | 214865 | 4.0.35940000.32020000 | 3594.3202.4 | 1kmN3202E3594 | 13 | 445 | 223 | 222 | 169 | bf801f98-5376-4c37-8bcb-d80ae9af1f0b | POLYGON ((-0.48707 51.45829, -0.50133 51.457, ... | Greater London | 1.0 | 1.0 |
| 3 | 214866 | 4.0.35940000.32030000 | 3594.3203.4 | 1kmN3203E3594 | 3 | 34 | 18 | 16 | 11 | 472a8640-40e0-4c94-8eb5-37647847dc99 | POLYGON ((-0.48913 51.46717, -0.50339 51.46588... | Greater London | 8.0 | 4.0 |
| 4 | 214867 | 4.0.35940000.32040000 | 3594.3204.4 | 1kmN3204E3594 | 1 | 15 | 9 | 6 | 8 | c21982c0-0d13-43df-8fa5-f77c28fba3d6 | POLYGON ((-0.49118 51.47605, -0.50545 51.47476... | Greater London | 2.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1728 | 240548 | 4.0.36510000.32040000 | 3651.3204.4 | 1kmN3204E3651 | 0 | 0 | 0 | 0 | 0 | c1be8f1c-6c91-4dca-aad2-ce9f51538b0e | POLYGON ((0.32328 51.54665, 0.30897 51.54546, ... | Greater London | NaN | NaN |
| 1729 | 240549 | 4.0.36510000.32050000 | 3651.3205.4 | 1kmN3205E3651 | 2 | 73 | 41 | 32 | 25 | 65336bd5-772b-4398-a0de-4ccb1b438479 | POLYGON ((0.32137 51.55555, 0.30706 51.55436, ... | Greater London | 2.0 | 2.0 |
| 1730 | 240550 | 4.0.36510000.32060000 | 3651.3206.4 | 1kmN3206E3651 | 2 | 36 | 16 | 20 | 13 | befe6e14-0fef-4c44-8440-c3e80c139c63 | POLYGON ((0.31947 51.56445, 0.30515 51.56326, ... | Greater London | NaN | NaN |
| 1731 | 240929 | 4.0.36520000.32030000 | 3652.3203.4 | 1kmN3203E3652 | 2 | 26 | 15 | 11 | 12 | 96832cf0-62a9-49a4-b442-ac283b90db87 | POLYGON ((0.33949 51.53894, 0.32518 51.53776, ... | Greater London | 1.0 | 1.0 |
| 1732 | 240930 | 4.0.36520000.32040000 | 3652.3204.4 | 1kmN3204E3652 | 0 | 0 | 0 | 0 | 0 | f0c429cf-e001-432e-a41b-259a3c9e6582 | POLYGON ((0.33759 51.54784, 0.32328 51.54665, ... | Greater London | NaN | NaN |
1733 rows × 14 columns
gdf_poi_grid_s_poi_type.plot(column='poi_counts', cmap='RdPu', legend=True)
<Axes: >
gdf_poi_grid_s_poi_type.plot(column='poi_types', cmap='RdPu', legend=True)
<Axes: >
# read all the crime data CSV files under data/police_met_city_2024/
import glob
# Get all CSV files in the directory
csv_files = glob.glob('data/police_met_city_2024/2024-**/**/*.csv', recursive=True)
# Read and concatenate all CSV files into a single DataFrame
df_crime = pd.concat((pd.read_csv(file) for file in csv_files), ignore_index=True)
# create a GeoDataFrame from the crime data
import geopandas as gpd
gdf_crime = gpd.GeoDataFrame(df_crime, geometry=gpd.points_from_xy(df_crime['Longitude'], df_crime['Latitude']), crs='EPSG:4326')
gdf_crime
| Crime ID | Month | Reported by | Falls within | Longitude | Latitude | Location | LSOA code | LSOA name | Crime type | Last outcome category | Context | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b74d06161e2425ef2abf28345aa962fa862753669659d3... | 2024-09 | City of London Police | City of London Police | -0.107682 | 51.517786 | On or near B521 | E01000917 | Camden 027C | Robbery | Unable to prosecute suspect | NaN | POINT (-0.10768 51.51779) |
| 1 | ab9f963604d03486ffbefec80f9f9296c7f65ecd4cd047... | 2024-09 | City of London Police | City of London Police | -0.107682 | 51.517786 | On or near B521 | E01000917 | Camden 027C | Violence and sexual offences | Unable to prosecute suspect | NaN | POINT (-0.10768 51.51779) |
| 2 | 7eaad3255f757cf5073dfd3712193f4497e788b3f6519f... | 2024-09 | City of London Police | City of London Police | -0.112096 | 51.515942 | On or near Nightclub | E01000914 | Camden 028B | Violence and sexual offences | Status update unavailable | NaN | POINT (-0.1121 51.51594) |
| 3 | 71c1e6edfb73383e82ca6acdc1c9c0ef015334b529f298... | 2024-09 | City of London Police | City of London Police | -0.111596 | 51.518281 | On or near Chancery Lane | E01000914 | Camden 028B | Violence and sexual offences | Status update unavailable | NaN | POINT (-0.1116 51.51828) |
| 4 | 06e7c6b6541166a8abe71de2593b9576c5e26215e349e4... | 2024-09 | City of London Police | City of London Police | -0.098519 | 51.517332 | On or near Little Britain | E01000001 | City of London 001A | Other theft | Investigation complete; no suspect identified | NaN | POINT (-0.09852 51.51733) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1149316 | 255e00edd07e15d153995d97bfcb22dc9b43055f10fafe... | 2024-10 | Metropolitan Police Service | Metropolitan Police Service | -2.059862 | 52.604816 | On or near Stubby Lane | E01010563 | Wolverhampton 010D | Violence and sexual offences | Status update unavailable | NaN | POINT (-2.05986 52.60482) |
| 1149317 | ada15de8fd97cc0d01e54aa33cc54b1327e38a3b012514... | 2024-10 | Metropolitan Police Service | Metropolitan Police Service | -2.132925 | 52.585743 | On or near Clarence Street | E01034312 | Wolverhampton 020G | Violence and sexual offences | Status update unavailable | NaN | POINT (-2.13292 52.58574) |
| 1149318 | 820ad95840e882e48b74938cb919e05df6b5321d3abce9... | 2024-10 | Metropolitan Police Service | Metropolitan Police Service | -2.123708 | 52.583362 | On or near Old Hall Street | E01034313 | Wolverhampton 020H | Other theft | Status update unavailable | NaN | POINT (-2.12371 52.58336) |
| 1149319 | fdd17e2c1eff3d47060559feaa3f4e9259dbf13e1088e2... | 2024-10 | Metropolitan Police Service | Metropolitan Police Service | -2.095217 | 52.566487 | On or near Detling Drive | E01010450 | Wolverhampton 029B | Other crime | Status update unavailable | NaN | POINT (-2.09522 52.56649) |
| 1149320 | bb019d86db42a192000d2c99522ce53a4e2f5e4a364a34... | 2024-10 | Metropolitan Police Service | Metropolitan Police Service | -2.182265 | 52.225795 | On or near Sling Lane | E01032400 | Wychavon 006B | Violence and sexual offences | Status update unavailable | NaN | POINT (-2.18226 52.2258) |
1149321 rows × 13 columns
gdf_crime.plot(markersize=1, figsize=(6,8))
<Axes: >
# spatial join the crime data with the grid data
gdf_crime_grid = gpd.sjoin(gdf_crime, gdf_london_grid_poi, how='left', predicate='intersects')
# group by the grid id and month, and count the number of crimes across all crime types
df_crime = gdf_crime_grid.groupby(['grd_newid', 'Month']).size().reset_index(name='crime_counts')
df_crime
| grd_newid | Month | crime_counts | |
|---|---|---|---|
| 0 | 1kmN3178E3627 | 2024-07 | 1 |
| 1 | 1kmN3178E3628 | 2024-11 | 1 |
| 2 | 1kmN3178E3629 | 2024-09 | 1 |
| 3 | 1kmN3178E3630 | 2024-11 | 1 |
| 4 | 1kmN3179E3617 | 2024-10 | 1 |
| ... | ... | ... | ... |
| 17989 | 1kmN3223E3625 | 2024-11 | 2 |
| 17990 | 1kmN3224E3620 | 2024-03 | 1 |
| 17991 | 1kmN3224E3620 | 2024-05 | 1 |
| 17992 | 1kmN3224E3620 | 2024-07 | 1 |
| 17993 | 1kmN3224E3620 | 2024-11 | 1 |
17994 rows × 3 columns
# reshape the df_crime to have a column for each month
df_crime_pivot = df_crime.pivot(index='grd_newid', columns='Month', values='crime_counts').fillna(0)
df_crime_pivot
| Month | 2024-01 | 2024-02 | 2024-03 | 2024-04 | 2024-05 | 2024-06 | 2024-07 | 2024-08 | 2024-09 | 2024-10 | 2024-11 | 2024-12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| grd_newid | ||||||||||||
| 1kmN3178E3627 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1kmN3178E3628 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1kmN3178E3629 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1kmN3178E3630 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1kmN3179E3617 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1kmN3222E3630 | 3.0 | 1.0 | 2.0 | 0.0 | 2.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1kmN3223E3621 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1kmN3223E3624 | 5.0 | 1.0 | 2.0 | 2.0 | 1.0 | 0.0 | 5.0 | 4.0 | 3.0 | 1.0 | 2.0 | 0.0 |
| 1kmN3223E3625 | 2.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 4.0 | 2.0 | 2.0 | 0.0 | 2.0 | 0.0 |
| 1kmN3224E3620 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1619 rows × 12 columns
# create a new column for the total crime counts
df_crime_pivot['crime_counts'] = df_crime_pivot.sum(axis=1)
df_crime_pivot
| Month | 2024-01 | 2024-02 | 2024-03 | 2024-04 | 2024-05 | 2024-06 | 2024-07 | 2024-08 | 2024-09 | 2024-10 | 2024-11 | 2024-12 | crime_counts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| grd_newid | |||||||||||||
| 1kmN3178E3627 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1kmN3178E3628 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1kmN3178E3629 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1kmN3178E3630 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1kmN3179E3617 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1kmN3222E3630 | 3.0 | 1.0 | 2.0 | 0.0 | 2.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 11.0 |
| 1kmN3223E3621 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1kmN3223E3624 | 5.0 | 1.0 | 2.0 | 2.0 | 1.0 | 0.0 | 5.0 | 4.0 | 3.0 | 1.0 | 2.0 | 0.0 | 26.0 |
| 1kmN3223E3625 | 2.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 4.0 | 2.0 | 2.0 | 0.0 | 2.0 | 0.0 | 16.0 |
| 1kmN3224E3620 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 4.0 |
1619 rows × 13 columns
# merge the crime data with the grid data
gdf_london_grid_poi_crime = gdf_poi_grid_s_poi_type.merge(df_crime_pivot, on='grd_newid', how='left')
# fill the NaN values with 0 for poi and crime
gdf_london_grid_poi_crime = gdf_london_grid_poi_crime.fillna(0)
gdf_london_grid_poi_crime.to_file('data/london_grid_poi_crime.geojson', driver='GeoJSON')
# transform this gdf to the EPSG:27700
gdf_london_grid_poi_crime = gdf_london_grid_poi_crime.to_crs(epsg=27700)
gdf_london_grid_poi_crime
| OBJECTID | grd_fixid | grd_floaid | grd_newid | pcd_count | tot_p | tot_m | tot_f | tot_hh | GlobalID | ... | 2024-04 | 2024-05 | 2024-06 | 2024-07 | 2024-08 | 2024-09 | 2024-10 | 2024-11 | 2024-12 | crime_counts | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 214289 | 4.0.35930000.32030000 | 3593.3203.4 | 1kmN3203E3593 | 3 | 9 | 5 | 4 | 5 | 9f0c1e1f-e8fa-40ca-a969-d1aedbefa750 | ... | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 |
| 1 | 214290 | 4.0.35930000.32040000 | 3593.3204.4 | 1kmN3204E3593 | 54 | 1957 | 1029 | 928 | 833 | e2530627-9630-4200-ae13-788341456d76 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 4.0 | 2.0 | 0.0 | 1.0 | 1.0 | 13.0 |
| 2 | 214865 | 4.0.35940000.32020000 | 3594.3202.4 | 1kmN3202E3594 | 13 | 445 | 223 | 222 | 169 | bf801f98-5376-4c37-8bcb-d80ae9af1f0b | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 214866 | 4.0.35940000.32030000 | 3594.3203.4 | 1kmN3203E3594 | 3 | 34 | 18 | 16 | 11 | 472a8640-40e0-4c94-8eb5-37647847dc99 | ... | 153.0 | 126.0 | 105.0 | 124.0 | 134.0 | 127.0 | 137.0 | 141.0 | 124.0 | 1519.0 |
| 4 | 214867 | 4.0.35940000.32040000 | 3594.3204.4 | 1kmN3204E3594 | 1 | 15 | 9 | 6 | 8 | c21982c0-0d13-43df-8fa5-f77c28fba3d6 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1728 | 240548 | 4.0.36510000.32040000 | 3651.3204.4 | 1kmN3204E3651 | 0 | 0 | 0 | 0 | 0 | c1be8f1c-6c91-4dca-aad2-ce9f51538b0e | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1729 | 240549 | 4.0.36510000.32050000 | 3651.3205.4 | 1kmN3205E3651 | 2 | 73 | 41 | 32 | 25 | 65336bd5-772b-4398-a0de-4ccb1b438479 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1730 | 240550 | 4.0.36510000.32060000 | 3651.3206.4 | 1kmN3206E3651 | 2 | 36 | 16 | 20 | 13 | befe6e14-0fef-4c44-8440-c3e80c139c63 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1731 | 240929 | 4.0.36520000.32030000 | 3652.3203.4 | 1kmN3203E3652 | 2 | 26 | 15 | 11 | 12 | 96832cf0-62a9-49a4-b442-ac283b90db87 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 3.0 |
| 1732 | 240930 | 4.0.36520000.32040000 | 3652.3204.4 | 1kmN3204E3652 | 0 | 0 | 0 | 0 | 0 | f0c429cf-e001-432e-a41b-259a3c9e6582 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1733 rows × 27 columns
gdf_london_grid_poi_crime.plot(column= 'crime_counts', cmap='Reds', legend=True, k=8, scheme='FisherJenks', figsize=(10,8))
/Users/896604/miniconda3/envs/STBDA/lib/python3.10/site-packages/mapclassify/classifiers.py:1860: UserWarning: Numba not installed. Using slow pure python version.
warnings.warn(
<Axes: >
Step 3 Data Splitting:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
# Define the features (X) and target variable (y)
X = gdf_london_grid_poi_crime[['poi_counts', 'poi_types']]
y = gdf_london_grid_poi_crime['crime_counts']
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Check the shapes of the training and testing sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
X_train shape: (1386, 2)
X_test shape: (347, 2)
y_train shape: (1386,)
y_test shape: (347,)
Step 4 Feature Scaling:
# Feature scaling
from sklearn.preprocessing import StandardScaler
# Initialize the scaler
scaler = StandardScaler()
# Fit the scaler on the training data
scaler.fit(X_train)
# Transform the training and testing data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Check the scaled data
print("X_train_scaled shape:", X_train_scaled.shape)
print("X_test_scaled shape:", X_test_scaled.shape)
# to scale the y train and y test data; reshape(-1, 1) converts the 1D array to a 2D array, as required by the scaler
# we use a separate scaler for y so that the scaler fitted on X is not overwritten
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.values.reshape(-1, 1))
y_test_scaled = y_scaler.transform(y_test.values.reshape(-1, 1))
# Check the scaled data
print("y_train_scaled shape:", y_train_scaled.shape)
print("y_test_scaled shape:", y_test_scaled.shape)
X_train_scaled shape: (1386, 2)
X_test_scaled shape: (347, 2)
y_train_scaled shape: (1386, 1)
y_test_scaled shape: (347, 1)
Step 5 Model Training and Hyperparameter Tuning (Cross-Validation):
# train the SVM regression model and hyperparameter tuning using CV
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
# Define the SVM model
svm_model = SVR()
# Define the hyperparameters to tune
param_grid = {
'kernel': ['linear', 'rbf', 'poly'],
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 'auto', 0.1, 0.01],
'epsilon': [0.1, 0.2, 0.3]
}
Note on RandomizedSearchCV and GridSearchCV: GridSearchCV exhaustively evaluates every combination in param_grid, whereas RandomizedSearchCV randomly samples a fixed number of combinations; we use GridSearchCV in this lab.
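A minimal sketch of the randomized alternative is shown below (n_iter is illustrative); it reuses the svm_model and param_grid defined above.
# Sketch only: RandomizedSearchCV samples n_iter parameter combinations instead of trying all 144
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(
    svm_model, param_distributions=param_grid, n_iter=20,
    cv=5, scoring='neg_mean_squared_error', n_jobs=-1, random_state=42)
# random_search.fit(X_train_scaled, y_train_scaled.ravel())  # we proceed with GridSearchCV below instead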
# Perform grid search with cross-validation, here we use 5-fold CV and negative mean squared error as the scoring metric
grid_search = GridSearchCV(svm_model, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
# there are 4 hyperparameters with 3, 4, 4 and 3 candidate values in the param_grid, so the total number of combinations is 3*4*4*3 = 144;
# considering the 5-fold CV, the total number of models to be trained is 144*5 = 720
# ravel() is used to convert the 2D array to 1D array, this is required by the grid search
grid_search.fit(X_train_scaled, y_train_scaled.ravel())
# print the performance metrics
print("Best cross-validation score (MSE):", -grid_search.best_score_)
Best cross-validation score (MSE): 0.3469068801108338
gscv_results_df = pd.DataFrame(grid_search.cv_results_)
gscv_results_df
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_epsilon | param_gamma | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.014670 | 0.003606 | 0.001402 | 0.000076 | 0.1 | 0.1 | scale | linear | {'C': 0.1, 'epsilon': 0.1, 'gamma': 'scale', '... | -0.161602 | -0.309483 | -0.269594 | -0.213100 | -1.038503 | -0.398456 | 0.323925 | 39 |
| 1 | 0.015693 | 0.003923 | 0.004752 | 0.000310 | 0.1 | 0.1 | scale | rbf | {'C': 0.1, 'epsilon': 0.1, 'gamma': 'scale', '... | -0.178756 | -0.425421 | -0.337690 | -0.235625 | -1.408892 | -0.517277 | 0.453782 | 119 |
| 2 | 0.046964 | 0.009909 | 0.001963 | 0.000640 | 0.1 | 0.1 | scale | poly | {'C': 0.1, 'epsilon': 0.1, 'gamma': 'scale', '... | -0.235991 | -0.381201 | -0.431818 | -0.278688 | -0.515603 | -0.368660 | 0.101424 | 6 |
| 3 | 0.012760 | 0.002559 | 0.001370 | 0.000068 | 0.1 | 0.1 | auto | linear | {'C': 0.1, 'epsilon': 0.1, 'gamma': 'auto', 'k... | -0.161602 | -0.309483 | -0.269594 | -0.213100 | -1.038503 | -0.398456 | 0.323925 | 39 |
| 4 | 0.015724 | 0.002159 | 0.004598 | 0.000059 | 0.1 | 0.1 | auto | rbf | {'C': 0.1, 'epsilon': 0.1, 'gamma': 'auto', 'k... | -0.179099 | -0.424231 | -0.337557 | -0.236367 | -1.408735 | -0.517198 | 0.453636 | 118 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 139 | 0.024504 | 0.001720 | 0.002951 | 0.001099 | 100.0 | 0.3 | 0.1 | rbf | {'C': 100, 'epsilon': 0.3, 'gamma': 0.1, 'kern... | -0.189447 | -0.327770 | -0.273917 | -0.211199 | -1.403278 | -0.481122 | 0.463628 | 97 |
| 140 | 0.123786 | 0.031712 | 0.001063 | 0.000090 | 100.0 | 0.3 | 0.1 | poly | {'C': 100, 'epsilon': 0.3, 'gamma': 0.1, 'kern... | -0.228806 | -0.361929 | -0.396099 | -0.261518 | -1.036504 | -0.456971 | 0.296263 | 74 |
| 141 | 0.099407 | 0.017301 | 0.000878 | 0.000061 | 100.0 | 0.3 | 0.01 | linear | {'C': 100, 'epsilon': 0.3, 'gamma': 0.01, 'ker... | -0.169583 | -0.301983 | -0.259907 | -0.212265 | -1.041531 | -0.397054 | 0.325296 | 23 |
| 142 | 0.013884 | 0.003764 | 0.002413 | 0.000245 | 100.0 | 0.3 | 0.01 | rbf | {'C': 100, 'epsilon': 0.3, 'gamma': 0.01, 'ker... | -0.157879 | -0.318991 | -0.271800 | -0.206021 | -1.139578 | -0.418854 | 0.364540 | 64 |
| 143 | 0.015400 | 0.005429 | 0.001108 | 0.000058 | 100.0 | 0.3 | 0.01 | poly | {'C': 100, 'epsilon': 0.3, 'gamma': 0.01, 'ker... | -0.300499 | -0.815115 | -0.488829 | -0.297477 | -2.221372 | -0.824658 | 0.723373 | 137 |
144 rows × 17 columns
# get the best model from the grid search
best_model = grid_search.best_estimator_
# We can generate MSE manually and other performance metrics in the training set
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Make predictions on the training set
y_train_pred = best_model.predict(X_train_scaled)
# Calculate performance metrics
mse_train = mean_squared_error(y_train_scaled, y_train_pred)
mae_train = mean_absolute_error(y_train_scaled, y_train_pred)
r2_train = r2_score(y_train_scaled, y_train_pred)
print("Training set performance metrics:")
print("Mean Squared Error (MSE):", mse_train)
print("Mean Absolute Error (MAE):", mae_train)
print("R-squared (R2):", r2_train)
Training set performance metrics:
Mean Squared Error (MSE): 0.30264185557747325
Mean Absolute Error (MAE): 0.3689697950764123
R-squared (R2): 0.6973581444225267
# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)
# Get the best model
best_model = grid_search.best_estimator_
Best hyperparameters: {'C': 10, 'epsilon': 0.3, 'gamma': 0.1, 'kernel': 'poly'}
Step 6 Model evaluation and testing:
# Make predictions on the test set
y_test_pred = best_model.predict(X_test_scaled)
# Calculate performance metrics
mse_test = mean_squared_error(y_test_scaled, y_test_pred)
mae_test = mean_absolute_error(y_test_scaled, y_test_pred)
r2_test = r2_score(y_test_scaled, y_test_pred)
print("Test set performance metrics:")
print("Mean Squared Error (MSE):", mse_test)
print("Mean Absolute Error (MAE):", mae_test)
print("R-squared (R2):", r2_test)
Test set performance metrics:
Mean Squared Error (MSE): 0.5879835533727016
Mean Absolute Error (MAE): 0.39449333359072636
R-squared (R2): 0.8348198471666948
8.2.2 Evaluating the trained SVR via random CV and Spatial CV#
Further discussion: is this trained SVR model robust in the context of spatial prediction?
We have previously discussed spatial CV and its importance. Here, random CV may cause data leakage when splitting the original training data into a new training set (80%) and a new test/validation set (20%). We will analyze and compare the differences in the trained SVR's performance between random CV and spatial CV.
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import numpy as np
# random state is used to ensure the reproducibility of the random split, which means the same random seed will generate the same random numbers
kf = KFold(n_splits=5, shuffle=True, random_state=42)
mse_list = []
tr_idx_list = []
va_idx_list = []
for i, (train_idx, va_idx) in enumerate(kf.split(X_train_scaled)):
    X_tr, X_va = X_train_scaled[train_idx], X_train_scaled[va_idx]
    y_tr, y_va = y_train_scaled[train_idx], y_train_scaled[va_idx]
    # two ** means to unpack the dictionary as keyword arguments
    model = svm_model.set_params(**{'C': 10, 'epsilon': 0.3, 'gamma': 0.1, 'kernel': 'poly'})
    model.fit(X_tr, y_tr.ravel())
    preds = model.predict(X_va)
    mse = mean_squared_error(y_va, preds)
    print(f"Fold {i+1}: MSE = {mse:.2f},",
          f"Training samples length = {len(train_idx.tolist())},",
          f"Validation samples length = {len(va_idx.tolist())}")
    mse_list.append(mse)
    print('Mean of mses:', f'{np.mean(mse_list):.2f}')
    tr_idx_list.append(train_idx.tolist())
    va_idx_list.append(va_idx.tolist())
Fold 1: MSE = 0.51, Training samples length = 1108, Validation samples length = 278
Mean of mses: 0.51
Fold 2: MSE = 0.34, Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.42
Fold 3: MSE = 0.29, Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.38
Fold 4: MSE = 0.41, Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.39
Fold 5: MSE = 0.27, Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.36
# we integrate the training and validation indices into the gdf for plotting
for i in range(0, 5):
    gdf_london_grid_poi_crime['index_gr'] = gdf_london_grid_poi_crime.index
    gdf_london_grid_poi_crime[f'Fold {i + 1}'] = gdf_london_grid_poi_crime.index_gr.apply(
        lambda x: 'new traning set' if x in tr_idx_list[i] else (
            'validation set' if x in va_idx_list[i] else 'testing set'))
    # we plot the train and test via the index on the grids
    fig, ax = plt.subplots(1, 1, figsize=(6, 6))
    gdf_london_grid_poi_crime.plot(ax=ax, column=f'Fold {i + 1}', cmap='tab20c', legend=True,
                                   legend_kwds={'frameon': False})
    plt.title(f'Fold {i + 1} of Random CV')
    ax.axis('off')
    plt.show()
# let us use a spatial CV through k-means clustering to separate the grids into 5 groups according to their coordinates
from sklearn.cluster import KMeans
# Define the number of clusters
n_clusters = 5
# Create a KMeans object
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
# Fit the KMeans model according to the x and y coordinates of the grid centroids
gdf_london_grid_poi_crime['centroid_x'] = gdf_london_grid_poi_crime.geometry.centroid.x
gdf_london_grid_poi_crime['centroid_y'] = gdf_london_grid_poi_crime.geometry.centroid.y
# we don't use the testing set for clustering
gdf_london_grid_poi_crime_train = gdf_london_grid_poi_crime.loc[
gdf_london_grid_poi_crime['Fold 1'] != 'testing set'].copy()
# Fit the KMeans model to the training data
kmeans.fit(gdf_london_grid_poi_crime_train[['centroid_x', 'centroid_y']])
# Assign cluster labels to the training data
gdf_london_grid_poi_crime_train.loc[:, 'cluster'] = kmeans.labels_
# we plot the clusters
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
gdf_london_grid_poi_crime_train.plot(ax=ax, column='cluster')
<Axes: >
gdf_london_grid_poi_crime_train
| OBJECTID | grd_fixid | grd_floaid | grd_newid | pcd_count | tot_p | tot_m | tot_f | tot_hh | GlobalID | ... | crime_counts | index_gr | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | centroid_x | centroid_y | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 214289 | 4.0.35930000.32030000 | 3593.3203.4 | 1kmN3203E3593 | 3 | 9 | 5 | 4 | 5 | 9f0c1e1f-e8fa-40ca-a969-d1aedbefa750 | ... | 5.0 | 0 | new traning set | new traning set | validation set | new traning set | new traning set | 503481.416280 | 175573.572278 | 2 |
| 1 | 214290 | 4.0.35930000.32040000 | 3593.3204.4 | 1kmN3204E3593 | 54 | 1957 | 1029 | 928 | 833 | e2530627-9630-4200-ae13-788341456d76 | ... | 13.0 | 1 | new traning set | new traning set | new traning set | new traning set | validation set | 503318.092908 | 176558.231134 | 2 |
| 2 | 214865 | 4.0.35940000.32020000 | 3594.3202.4 | 1kmN3202E3594 | 13 | 445 | 223 | 222 | 169 | bf801f98-5376-4c37-8bcb-d80ae9af1f0b | ... | 0.0 | 2 | new traning set | new traning set | validation set | new traning set | new traning set | 504632.628333 | 174752.571059 | 2 |
| 3 | 214866 | 4.0.35940000.32030000 | 3594.3203.4 | 1kmN3203E3594 | 3 | 34 | 18 | 16 | 11 | 472a8640-40e0-4c94-8eb5-37647847dc99 | ... | 1519.0 | 3 | new traning set | validation set | new traning set | new traning set | new traning set | 504469.320044 | 175737.238521 | 2 |
| 4 | 214867 | 4.0.35940000.32040000 | 3594.3204.4 | 1kmN3204E3594 | 1 | 15 | 9 | 6 | 8 | c21982c0-0d13-43df-8fa5-f77c28fba3d6 | ... | 0.0 | 4 | new traning set | new traning set | new traning set | validation set | new traning set | 504305.994245 | 176721.903963 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1381 | 233221 | 4.0.36330000.32050000 | 3633.3205.4 | 1kmN3205E3633 | 149 | 17774 | 9897 | 7877 | 4810 | 81df1f73-eabc-432a-96cf-354e3b7e1e58 | ... | 3716.0 | 1381 | new traning set | new traning set | validation set | new traning set | new traning set | 542669.454612 | 184090.745646 | 1 |
| 1382 | 233222 | 4.0.36330000.32060000 | 3633.3206.4 | 1kmN3206E3633 | 138 | 17822 | 9494 | 8328 | 4784 | c5616e02-895d-4669-8768-8357a13dd7f2 | ... | 1345.0 | 1382 | new traning set | validation set | new traning set | new traning set | new traning set | 542506.025965 | 185075.698136 | 1 |
| 1383 | 233223 | 4.0.36330000.32070000 | 3633.3207.4 | 1kmN3207E3633 | 55 | 6018 | 3233 | 2785 | 1854 | 8f730656-d48d-4f99-928b-30cb304f3058 | ... | 861.0 | 1383 | new traning set | new traning set | new traning set | new traning set | validation set | 542342.579758 | 186060.646425 | 1 |
| 1384 | 233224 | 4.0.36330000.32080000 | 3633.3208.4 | 1kmN3208E3633 | 19 | 1544 | 734 | 810 | 529 | 40f56c08-6faf-46f0-b86f-43a4633d560e | ... | 152.0 | 1384 | new traning set | new traning set | new traning set | validation set | new traning set | 542179.116799 | 187045.595039 | 1 |
| 1385 | 233225 | 4.0.36330000.32090000 | 3633.3209.4 | 1kmN3209E3633 | 68 | 4602 | 2308 | 2294 | 1387 | 65b9480d-a4f7-4b82-9731-65080d9b5f31 | ... | 419.0 | 1385 | new traning set | new traning set | validation set | new traning set | new traning set | 542015.638307 | 188030.545768 | 1 |
1386 rows × 36 columns
Now we calculate the model performance on the training set using spatial CV; here we use GroupKFold from sklearn.
# use groupcv in sklearn to select the data
from sklearn.model_selection import GroupKFold
# Define the number of splits
n_splits = 5
# Create a GroupKFold object
group_kfold = GroupKFold(n_splits=n_splits)
# Initialize an empty list to store the mean squared errors
g_mse_list = []
# Initialize an empty list to store the training and validation indices
g_tr_idx_list = []
g_va_idx_list = []
# Loop through the splits
for i, (g_train_idx, g_va_idx) in enumerate(
        group_kfold.split(X_train_scaled, y_train_scaled, groups=gdf_london_grid_poi_crime_train['cluster'])):
    X_tr, X_va = X_train_scaled[g_train_idx], X_train_scaled[g_va_idx]
    y_tr, y_va = y_train_scaled[g_train_idx], y_train_scaled[g_va_idx]
    model = svm_model.set_params(**{'C': 10, 'epsilon': 0.3, 'gamma': 0.1, 'kernel': 'poly'})
    model.fit(X_tr, y_tr.ravel())
    preds = model.predict(X_va)
    g_mse = mean_squared_error(y_va, preds)
    print(f"Fold {i + 1}: MSE = {g_mse:.2f}",
          f"Training samples length = {len(g_train_idx.tolist())},",
          f"Validation samples length = {len(g_va_idx.tolist())}")
    g_mse_list.append(g_mse)
    print('Mean of mses:', f'{np.mean(g_mse_list):.2f}')
    g_tr_idx_list.append(g_train_idx.tolist())
    g_va_idx_list.append(g_va_idx.tolist())
Fold 1: MSE = 0.31 Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.31
Fold 2: MSE = 0.36 Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.34
Fold 3: MSE = 0.36 Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.35
Fold 4: MSE = 0.76 Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.45
Fold 5: MSE = 0.24 Training samples length = 1109, Validation samples length = 277
Mean of mses: 0.41
The mean of the fold MSEs (0.41) is higher than with random CV (0.36), which means the model performance appears worse under spatial CV than under random CV.
# We integrate the training and validation indices into the gdf for plotting
for i in range(0, 5):
    gdf_london_grid_poi_crime['index_gr'] = gdf_london_grid_poi_crime.index
    gdf_london_grid_poi_crime[f'Group Fold {i + 1}'] = gdf_london_grid_poi_crime.index_gr.apply(
        lambda x: 'new traning set' if x in g_tr_idx_list[i] else (
            'validation set' if x in g_va_idx_list[i] else 'testing set'))
    # we plot the train and test via the index on the grids
    fig, ax = plt.subplots(1, 1, figsize=(6, 6))
    gdf_london_grid_poi_crime.plot(ax=ax, column=f'Group Fold {i + 1}', cmap='tab20c', legend=True,
                                   legend_kwds={'frameon': False})
    plt.title(f'Fold {i + 1} of Spatial CV')
    ax.axis('off')
    plt.show()
8.2.3 Retrain the SVR model and tune the hyperparameters using GridSearch and Spatial CV#
Two important parameter settings in grid_search.fit() and GridSearchCV():
cv: this should be defined as the group_kfold object created from GroupKFold
groups: this should be defined as the array/list groups_label of cluster labels from the training set
groups_label = gdf_london_grid_poi_crime_train['cluster'].values
Hyperparameter tuning via GridSearch and Spatial CV
# Perform grid search with cross-validation, here we use 5-fold CV and negative mean squared error as the scoring metric
grid_search = GridSearchCV(svm_model, param_grid, cv=group_kfold, scoring='neg_mean_squared_error', n_jobs=-1)
# there are 4 hyperparameters with 3, 4, 4 and 3 candidate values in the param_grid, so the total number of combinations is 3*4*4*3 = 144;
# considering the 5-fold CV, the total number of models to be trained is 144*5 = 720
# ravel() is used to convert the 2D array to 1D array, this is required by the grid search
grid_search.fit(X_train_scaled, y_train_scaled.ravel(), groups=groups_label)
# print the performance metrics
print("Best cross-validation score (MSE):", -grid_search.best_score_)
Best cross-validation score (MSE): 0.3880810150439128
Training set performance metrics
# get the best model from gridsearch and spatial cv
best_model_scv = grid_search.best_estimator_
# We can generate MSE manually and other performance metrics in the training set
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Make predictions on the training set
y_train_pred = best_model_scv.predict(X_train_scaled)
# Calculate performance metrics
mse_train = mean_squared_error(y_train_scaled, y_train_pred)
mae_train = mean_absolute_error(y_train_scaled, y_train_pred)
r2_train = r2_score(y_train_scaled, y_train_pred)
print("Training set performance metrics:")
print("Mean Squared Error (MSE):", mse_train)
print("Mean Absolute Error (MAE):", mae_train)
print("R-squared (R2):", r2_train)
# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)
Training set performance metrics:
Mean Squared Error (MSE): 0.32209763512058076
Mean Absolute Error (MAE): 0.3537723982700458
R-squared (R2): 0.6779023648794191
Best hyperparameters: {'C': 10, 'epsilon': 0.1, 'gamma': 0.1, 'kernel': 'poly'}
Testing set performance metrics
# Make predictions on the test set
y_test_pred = best_model_scv.predict(X_test_scaled)
# Calculate performance metrics
mse_test = mean_squared_error(y_test_scaled, y_test_pred)
mae_test = mean_absolute_error(y_test_scaled, y_test_pred)
r2_test = r2_score(y_test_scaled, y_test_pred)
print("Test set performance metrics:")
print("Mean Squared Error (MSE):", mse_test)
print("Mean Absolute Error (MAE):", mae_test)
print("R-squared (R2):", r2_test)
Test set performance metrics:
Mean Squared Error (MSE): 0.6321250775477194
Mean Absolute Error (MAE): 0.3746731427775627
R-squared (R2): 0.8224193239416807
We observe that the model’s performance decreases when using spatial cross-validation and grid search compared to the random cross-validation approach. Additionally, the optimal hyperparameters have changed. This suggests that, in the context of spatial prediction, the current model is more robust than the one we previously trained, even though its overall performance appears to be lower.
Future work could include incorporating additional spatial features we previously calculated, such as spatial lags of all variables and LISA. Additionally, including more feature data may help improve the model’s performance.
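As an illustration of that future work, here is a minimal sketch of adding a spatial lag of poi_counts as an extra feature, assuming libpysal is available (as in earlier weeks); the column name poi_counts_lag is introduced here purely for illustration.
# Sketch only (assumes libpysal is installed): queen-contiguity spatial lag of poi_counts
from libpysal import weights
w = weights.Queen.from_dataframe(gdf_london_grid_poi_crime)  # contiguity between grid cells
w.transform = 'r'                                            # row-standardise the weights
gdf_london_grid_poi_crime['poi_counts_lag'] = weights.lag_spatial(
    w, gdf_london_grid_poi_crime['poi_counts'].values)
# 'poi_counts_lag' could then be added to the feature list alongside poi_counts and poi_types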
8.3 Random Forests (RF) in the spatial and temporal prediction task#
Random Forests (RF) is an ensemble learning method that combines multiple decision trees to improve predictive accuracy and control overfitting. It is widely used for both classification and regression tasks. Key parameters in RF include:
n_estimators: The number of trees in the forest. More trees can lead to better performance but also increase computation time.
max_depth: The maximum depth of each tree. Limiting depth can help prevent overfitting.
min_samples_split: The minimum number of samples required to split an internal node. Higher values can lead to more conservative trees.
min_samples_leaf: The minimum number of samples required to be at a leaf node. Higher values can help smooth the model.
max_features: The number of features to consider when looking for the best split. It can be set to a fraction of the total number of features or a specific number.
Random Forests can also be used for feature engineering, such as feature importance ranking (based on the Gini index), which helps identify the most relevant features for the prediction task (see the sketch below).
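To connect these parameters to code, a minimal illustrative sketch of a Random Forest regressor is shown below; the values are placeholders, not the tuned values found later via GridSearchCV.
# Illustrative only: a Random Forest regressor with the key parameters described above
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(
    n_estimators=100,      # number of trees in the forest
    max_depth=10,          # limit tree depth to control overfitting
    min_samples_split=5,   # minimum samples needed to split an internal node
    min_samples_leaf=2,    # minimum samples required at a leaf node
    max_features='sqrt',   # number of features considered at each split
    random_state=42)
# after fitting, rf.feature_importances_ gives the impurity-based (Gini) importance of each feature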
We will use the same data as in the SVM section, but we will use the Random Forests model in a spatial and temporal prediction task: we use the first 11 months of crime data (2024-01 to 2024-11) to predict the crime level in December 2024. The features are the same as in the SVM section, i.e., the POI data.
# re-load the gdf_london_grid_poi_crime data
gdf_london_grid_poi_crime = gpd.read_file('data/london_grid_poi_crime.geojson')
gdf_london_grid_poi_crime
| OBJECTID | grd_fixid | grd_floaid | grd_newid | pcd_count | tot_p | tot_m | tot_f | tot_hh | GlobalID | ... | 2024-05 | 2024-06 | 2024-07 | 2024-08 | 2024-09 | 2024-10 | 2024-11 | 2024-12 | crime_counts | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 214289 | 4.0.35930000.32030000 | 3593.3203.4 | 1kmN3203E3593 | 3 | 9 | 5 | 4 | 5 | 9f0c1e1f-e8fa-40ca-a969-d1aedbefa750 | ... | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 | POLYGON ((-0.50339 51.46588, -0.51765 51.46459... |
| 1 | 214290 | 4.0.35930000.32040000 | 3593.3204.4 | 1kmN3204E3593 | 54 | 1957 | 1029 | 928 | 833 | e2530627-9630-4200-ae13-788341456d76 | ... | 1.0 | 0.0 | 0.0 | 4.0 | 2.0 | 0.0 | 1.0 | 1.0 | 13.0 | POLYGON ((-0.50545 51.47476, -0.51972 51.47347... |
| 2 | 214865 | 4.0.35940000.32020000 | 3594.3202.4 | 1kmN3202E3594 | 13 | 445 | 223 | 222 | 169 | bf801f98-5376-4c37-8bcb-d80ae9af1f0b | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((-0.48707 51.45829, -0.50133 51.457, ... |
| 3 | 214866 | 4.0.35940000.32030000 | 3594.3203.4 | 1kmN3203E3594 | 3 | 34 | 18 | 16 | 11 | 472a8640-40e0-4c94-8eb5-37647847dc99 | ... | 126.0 | 105.0 | 124.0 | 134.0 | 127.0 | 137.0 | 141.0 | 124.0 | 1519.0 | POLYGON ((-0.48913 51.46717, -0.50339 51.46588... |
| 4 | 214867 | 4.0.35940000.32040000 | 3594.3204.4 | 1kmN3204E3594 | 1 | 15 | 9 | 6 | 8 | c21982c0-0d13-43df-8fa5-f77c28fba3d6 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((-0.49118 51.47605, -0.50545 51.47476... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1728 | 240548 | 4.0.36510000.32040000 | 3651.3204.4 | 1kmN3204E3651 | 0 | 0 | 0 | 0 | 0 | c1be8f1c-6c91-4dca-aad2-ce9f51538b0e | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((0.32328 51.54665, 0.30897 51.54546, ... |
| 1729 | 240549 | 4.0.36510000.32050000 | 3651.3205.4 | 1kmN3205E3651 | 2 | 73 | 41 | 32 | 25 | 65336bd5-772b-4398-a0de-4ccb1b438479 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((0.32137 51.55555, 0.30706 51.55436, ... |
| 1730 | 240550 | 4.0.36510000.32060000 | 3651.3206.4 | 1kmN3206E3651 | 2 | 36 | 16 | 20 | 13 | befe6e14-0fef-4c44-8440-c3e80c139c63 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((0.31947 51.56445, 0.30515 51.56326, ... |
| 1731 | 240929 | 4.0.36520000.32030000 | 3652.3203.4 | 1kmN3203E3652 | 2 | 26 | 15 | 11 | 12 | 96832cf0-62a9-49a4-b442-ac283b90db87 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 3.0 | POLYGON ((0.33949 51.53894, 0.32518 51.53776, ... |
| 1732 | 240930 | 4.0.36520000.32040000 | 3652.3204.4 | 1kmN3204E3652 | 0 | 0 | 0 | 0 | 0 | f0c429cf-e001-432e-a41b-259a3c9e6582 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((0.33759 51.54784, 0.32328 51.54665, ... |
1733 rows × 27 columns
# we need to reshape the df to have the columns: spatial index (grid), temporal index (month), X (poi_counts and poi_types) and y (crime_counts) as we need to predict the crime level in December 2024
df_X_y = gdf_london_grid_poi_crime.melt(id_vars=['grd_newid'],
value_vars=[f'2024-{i:02d}' for i in range(1, 13)],
var_name='Month', value_name='c_counts')
df_X_y.sort_values(by=['grd_newid', 'Month'], inplace=True)
df_X_y
| grd_newid | Month | c_counts | |
|---|---|---|---|
| 1048 | 1kmN3178E3626 | 2024-01 | 0.0 |
| 2781 | 1kmN3178E3626 | 2024-02 | 0.0 |
| 4514 | 1kmN3178E3626 | 2024-03 | 0.0 |
| 6247 | 1kmN3178E3626 | 2024-04 | 0.0 |
| 7980 | 1kmN3178E3626 | 2024-05 | 0.0 |
| ... | ... | ... | ... |
| 13137 | 1kmN3224E3624 | 2024-08 | 0.0 |
| 14870 | 1kmN3224E3624 | 2024-09 | 0.0 |
| 16603 | 1kmN3224E3624 | 2024-10 | 0.0 |
| 18336 | 1kmN3224E3624 | 2024-11 | 0.0 |
| 20069 | 1kmN3224E3624 | 2024-12 | 0.0 |
20796 rows × 3 columns
# make sure we have all the spatial and temporal units of analysis
df_X_y.grd_newid.nunique() * df_X_y.Month.nunique()
20796
# attach the poi_counts and poi_types to the df_X_y
df_X_y = pd.merge(df_X_y, gdf_london_grid_poi_crime[['grd_newid', 'poi_counts', 'poi_types']], on='grd_newid', how='left')
df_X_y
| grd_newid | Month | c_counts | poi_counts | poi_types | |
|---|---|---|---|---|---|
| 0 | 1kmN3178E3626 | 2024-01 | 0.0 | 0.0 | 0.0 |
| 1 | 1kmN3178E3626 | 2024-02 | 0.0 | 0.0 | 0.0 |
| 2 | 1kmN3178E3626 | 2024-03 | 0.0 | 0.0 | 0.0 |
| 3 | 1kmN3178E3626 | 2024-04 | 0.0 | 0.0 | 0.0 |
| 4 | 1kmN3178E3626 | 2024-05 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 20791 | 1kmN3224E3624 | 2024-08 | 0.0 | 0.0 | 0.0 |
| 20792 | 1kmN3224E3624 | 2024-09 | 0.0 | 0.0 | 0.0 |
| 20793 | 1kmN3224E3624 | 2024-10 | 0.0 | 0.0 | 0.0 |
| 20794 | 1kmN3224E3624 | 2024-11 | 0.0 | 0.0 | 0.0 |
| 20795 | 1kmN3224E3624 | 2024-12 | 0.0 | 0.0 | 0.0 |
20796 rows × 5 columns
# Selecting the training and testing data
# we use the first 11 months as the training data and the last month (December) as the testing data
df_X_y_train = df_X_y[df_X_y['Month'] != '2024-12']
df_X_y_test = df_X_y[df_X_y['Month'] == '2024-12']
X_train = df_X_y_train[['poi_counts', 'poi_types']]
y_train = df_X_y_train['c_counts']
X_test = df_X_y_test[['poi_counts', 'poi_types']]
y_test = df_X_y_test['c_counts']
# hyperparameter tuning using GridSearchCV in Random Forests
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# Define the Random Forest model
rf_model = RandomForestRegressor(random_state=42)
# Define the hyperparameters to tune
param_grid_rf = {
'n_estimators': [50, 100],
'max_depth': [3, 5, 7, 10],
'min_samples_split': [2, 5, 7],
'min_samples_leaf': [1, 2, 6]}
# Perform grid search with cross-validation, here we use 5-fold CV and negative mean squared error as the scoring metric
grid_search_rf = GridSearchCV(rf_model, param_grid_rf, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
# Fit the model on the training data
grid_search_rf.fit(X_train, y_train)
# Print the best hyperparameters and the best score
print("Best hyperparameters:", grid_search_rf.best_params_)
print("Best cross-validation score (MSE):", -grid_search_rf.best_score_)
Best hyperparameters: {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 7, 'n_estimators': 100}
Best cross-validation score (MSE): 6265.360893316517
# Get the best model from the grid search
best_rf_model = grid_search_rf.best_estimator_
# output the importance of the features
importances = best_rf_model.feature_importances_
feature_names = X_train.columns
# Create a DataFrame to hold feature names and their importance scores
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
# Sort the DataFrame by importance scores in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
feature_importance_df
| Feature | Importance | |
|---|---|---|
| 0 | poi_counts | 0.574831 |
| 1 | poi_types | 0.425169 |
# Training set performance metrics
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Make predictions on the training set
y_train_pred_rf = best_rf_model.predict(X_train)
# Calculate performance metrics
mse_train_rf = mean_squared_error(y_train, y_train_pred_rf)
mae_train_rf = mean_absolute_error(y_train, y_train_pred_rf)
r2_train_rf = r2_score(y_train, y_train_pred_rf)
print("Training set performance metrics:")
print("Mean Squared Error (MSE):", mse_train_rf)
print("Mean Absolute Error (MAE):", mae_train_rf)
print("R-squared (R2):", r2_train_rf)
Training set performance metrics:
Mean Squared Error (MSE): 1576.2537522196371
Mean Absolute Error (MAE): 24.499505744829683
R-squared (R2): 0.8532881379884858
# Testing set performance metrics
# Make predictions on the test set
y_test_pred_rf = best_rf_model.predict(X_test)
# Calculate performance metrics
mse_test_rf = mean_squared_error(y_test, y_test_pred_rf)
mae_test_rf = mean_absolute_error(y_test, y_test_pred_rf)
r2_test_rf = r2_score(y_test, y_test_pred_rf)
print("Test set performance metrics:")
print("Mean Squared Error (MSE):", mse_test_rf)
print("Mean Absolute Error (MAE):", mae_test_rf)
print("R-squared (R2):", r2_test_rf)
Test set performance metrics:
Mean Squared Error (MSE): 2058.6680182257987
Mean Absolute Error (MAE): 23.931832242381546
R-squared (R2): 0.8598031898525247
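Aggregate metrics can hide systematic errors, so it can also help to plot predicted against observed crime counts for the test month. A minimal sketch, again assuming matplotlib is available:
import matplotlib.pyplot as plt
# Observed vs predicted crime counts for the December 2024 test set
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_test, y_test_pred_rf, s=10, alpha=0.4)
lims = [0, max(y_test.max(), y_test_pred_rf.max())]
ax.plot(lims, lims, 'r--', label='1:1 line')
ax.set_xlabel('Observed c_counts (Dec 2024)')
ax.set_ylabel('RF predicted c_counts')
ax.legend()
plt.show()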
Quiz#
Q1: Using the same data as in the Random Forests section, train an SVM model to predict the crime level in December 2024. Use the same hyperparameter tuning method (GridSearchCV) as in the Random Forests section. Compare the performance of the SVM model with that of the Random Forests model.
Q2: Using the same data as in the Random Forests section, train a Geographically Weighted Regression (GWR) model to predict the crime level in December 2024. Compare the performance of the GWR model with that of the Random Forests model.
Q1 solution:
from sklearn.svm import SVR
# Define the SVM model
svm_model = SVR()
# Define the hyperparameters to tune
param_grid_svm = {
'kernel': ['linear', 'rbf'],
'C': [0.1, 1, 10],
'gamma': [0.1, 0.01],
'epsilon': [0.1,0.3]
}
# Perform grid search with cross-validation; here we use 5-fold CV and negative mean squared error as the scoring metric
grid_search_svm = GridSearchCV(svm_model, param_grid_svm, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
# Fit the model on the training data
grid_search_svm.fit(X_train, y_train)
# Print the best hyperparameters and the best score
print("Best hyperparameters:", grid_search_svm.best_params_)
print("Best cross-validation score (MSE):", -grid_search_svm.best_score_)
# Get the best model from the grid search
best_svm_model = grid_search_svm.best_estimator_
/Users/896604/miniconda3/envs/STBDA/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py:752: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
warnings.warn(
Best hyperparameters: {'C': 0.1, 'epsilon': 0.1, 'gamma': 0.1, 'kernel': 'linear'}
Best cross-validation score (MSE): 5763.656583753928
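Note that the SVR above is fitted on unscaled counts, and SVMs are sensitive to feature scale. A pipeline that standardises the features inside each CV fold may therefore improve on this result; a hedged sketch of how it could be set up with scikit-learn's Pipeline (not part of the marked solution):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Scale the features within each CV fold before fitting the SVR
svm_pipe = Pipeline([('scaler', StandardScaler()), ('svr', SVR())])
param_grid_svm_scaled = {
    'svr__kernel': ['linear', 'rbf'],
    'svr__C': [0.1, 1, 10],
    'svr__gamma': [0.1, 0.01],
    'svr__epsilon': [0.1, 0.3]}
grid_search_svm_scaled = GridSearchCV(svm_pipe, param_grid_svm_scaled, cv=5,
                                      scoring='neg_mean_squared_error', n_jobs=-1)
# grid_search_svm_scaled.fit(X_train, y_train)  # uncomment to run; this can take a while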
# Training set performance metrics
# Make predictions on the training set
y_train_pred_svm = best_svm_model.predict(X_train)
# Calculate performance metrics
mse_train_svm = mean_squared_error(y_train, y_train_pred_svm)
mae_train_svm = mean_absolute_error(y_train, y_train_pred_svm)
r2_train_svm = r2_score(y_train, y_train_pred_svm)
print("Training set performance metrics:")
print("Mean Squared Error (MSE):", mse_train_svm)
print("Mean Absolute Error (MAE):", mae_train_svm)
print("R-squared (R2):", r2_train_svm)
Training set performance metrics:
Mean Squared Error (MSE): 5172.553455303947
Mean Absolute Error (MAE): 27.01774179526304
R-squared (R2): 0.5185578795843584
# Testing set performance metrics
# Make predictions on the test set
y_test_pred_svm = best_svm_model.predict(X_test)
# Calculate performance metrics
mse_test_svm = mean_squared_error(y_test, y_test_pred_svm)
mae_test_svm = mean_absolute_error(y_test, y_test_pred_svm)
r2_test_svm = r2_score(y_test, y_test_pred_svm)
print("Test set performance metrics:")
print("Mean Squared Error (MSE):", mse_test_svm)
print("Mean Absolute Error (MAE):", mae_test_svm)
print("R-squared (R2):", r2_test_svm)
Test set performance metrics:
Mean Squared Error (MSE): 8500.947557530795
Mean Absolute Error (MAE): 26.248980252989444
R-squared (R2): 0.4210792025496609
Q2 solution:
df_X_y_train
| | grd_newid | Month | c_counts | poi_counts | poi_types |
|---|---|---|---|---|---|
| 0 | 1kmN3178E3626 | 2024-01 | 0.0 | 0.0 | 0.0 |
| 1 | 1kmN3178E3626 | 2024-02 | 0.0 | 0.0 | 0.0 |
| 2 | 1kmN3178E3626 | 2024-03 | 0.0 | 0.0 | 0.0 |
| 3 | 1kmN3178E3626 | 2024-04 | 0.0 | 0.0 | 0.0 |
| 4 | 1kmN3178E3626 | 2024-05 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 20790 | 1kmN3224E3624 | 2024-07 | 0.0 | 0.0 | 0.0 |
| 20791 | 1kmN3224E3624 | 2024-08 | 0.0 | 0.0 | 0.0 |
| 20792 | 1kmN3224E3624 | 2024-09 | 0.0 | 0.0 | 0.0 |
| 20793 | 1kmN3224E3624 | 2024-10 | 0.0 | 0.0 | 0.0 |
| 20794 | 1kmN3224E3624 | 2024-11 | 0.0 | 0.0 | 0.0 |
19063 rows × 5 columns
# we need to prepare the data for GWR, which requires coordinates in addition to the feature and target variables
df_coors = gdf_london_grid_poi_crime[['grd_newid', 'geometry']]
# transform the CRS to the British National Grid (EPSG:27700) so the centroid coordinates are in metres
df_coors = df_coors.to_crs(epsg=27700)
df_coors['east'] = df_coors.geometry.centroid.x
df_coors['north'] = df_coors.geometry.centroid.y
df_coors_s = df_coors[['grd_newid', 'east', 'north']]
# merge the coordinates with the training data
df_X_y_train = df_X_y_train.merge(df_coors_s, on='grd_newid', how='left')
df_X_y_train
| | grd_newid | Month | c_counts | poi_counts | poi_types | east | north |
|---|---|---|---|---|---|---|---|
| 0 | 1kmN3178E3626 | 2024-01 | 0.0 | 0.0 | 0.0 | 540160.898592 | 156351.273341 |
| 1 | 1kmN3178E3626 | 2024-02 | 0.0 | 0.0 | 0.0 | 540160.898592 | 156351.273341 |
| 2 | 1kmN3178E3626 | 2024-03 | 0.0 | 0.0 | 0.0 | 540160.898592 | 156351.273341 |
| 3 | 1kmN3178E3626 | 2024-04 | 0.0 | 0.0 | 0.0 | 540160.898592 | 156351.273341 |
| 4 | 1kmN3178E3626 | 2024-05 | 0.0 | 0.0 | 0.0 | 540160.898592 | 156351.273341 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 19058 | 1kmN3224E3624 | 2024-07 | 0.0 | 0.0 | 0.0 | 530671.170943 | 201329.693106 |
| 19059 | 1kmN3224E3624 | 2024-08 | 0.0 | 0.0 | 0.0 | 530671.170943 | 201329.693106 |
| 19060 | 1kmN3224E3624 | 2024-09 | 0.0 | 0.0 | 0.0 | 530671.170943 | 201329.693106 |
| 19061 | 1kmN3224E3624 | 2024-10 | 0.0 | 0.0 | 0.0 | 530671.170943 | 201329.693106 |
| 19062 | 1kmN3224E3624 | 2024-11 | 0.0 | 0.0 | 0.0 | 530671.170943 | 201329.693106 |
19063 rows × 7 columns
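Because the step above is a left join on grd_newid, it is worth checking that every training row actually picked up coordinates; any missing value here would point to grid IDs absent from gdf_london_grid_poi_crime. A quick sketch:
# Check that the left join attached coordinates to every training row
print(df_X_y_train[['east', 'north']].isna().sum())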
from mgwr.gwr import GWR
from mgwr.sel_bw import Sel_BW
# Prepare data
coords = df_X_y_train[['east', 'north']].values # shape (n_samples, 2)
X = df_X_y_train[['poi_counts', 'poi_types']].values # shape (n_samples, n_features)
y = df_X_y_train[['c_counts']].values # shape (n_samples, 1)
# Bandwidth selection (adaptive kernel: the bandwidth is the number of nearest neighbours)
sel_bw = Sel_BW(coords, y, X, fixed=False, kernel='gaussian')
bw = sel_bw.search(bw_min=2, bw_max=100)
print(f"Selected bandwidth: {bw}")
# GWR model fitting
mgwr_model = GWR(coords, y, X, bw=bw, fixed=False, kernel='gaussian')
results = mgwr_model.fit()  # keep the returned results object for the summary and diagnostics
Selected bandwidth: 25.0
import numpy as np
np.float = float  # Monkey patch: mgwr's summary() still uses np.float, which was removed in NumPy 1.24
results.summary()
===========================================================================
Model type Gaussian
Number of observations: 19063
Number of covariates: 3
Global Regression Results
---------------------------------------------------------------------------
Residual sum of squares: 77741310.342
Log-likelihood: -106288.329
AIC: 212582.657
AICc: 212584.659
BIC: 77553464.425
R2: 0.620
Adj. R2: 0.620
Variable Est. SE t(Est/SE) p-value
------------------------------- ---------- ---------- ---------- ----------
X0 9.506 0.691 13.764 0.000
X1 0.890 0.009 99.121 0.000
X2 -0.225 0.062 -3.624 0.000
Geographically Weighted Regression (GWR) Results
---------------------------------------------------------------------------
Spatial kernel: Adaptive gaussian
Bandwidth used: 25.000
Diagnostic information
---------------------------------------------------------------------------
Residual sum of squares: 18810354.478
Effective number of parameters (trace(S)): 598.280
Degree of freedom (n - trace(S)): 18464.720
Sigma estimate: 31.917
Log-likelihood: -92763.327
AIC: 186725.214
AICc: 186764.183
BIC: 191432.859
R2: 0.908
Adjusted R2: 0.905
Adj. alpha (95%): 0.000
Adj. critical t value (95%): 3.662
Summary Statistics For GWR Parameter Estimates
---------------------------------------------------------------------------
Variable Mean STD Min Median Max
-------------------- ---------- ---------- ---------- ---------- ----------
X0 0.591 46.791 -976.165 3.646 259.681
X1 0.200 0.832 -3.308 0.076 6.296
X2 2.307 2.838 -14.226 2.116 11.568
===========================================================================
# In-sample prediction on the training data (January–November 2024)
y_train_predictions = mgwr_model.predict(coords, X)
# Calculate performance metrics for the training set
mse_train_mgwr = mean_squared_error(y, y_train_predictions.predictions)
mae_train_mgwr = mean_absolute_error(y, y_train_predictions.predictions)
r2_train_mgwr = r2_score(y, y_train_predictions.predictions)
print("Training set performance metrics:")
print("Mean Squared Error (MSE):", mse_train_mgwr)
print("Mean Absolute Error (MAE):", mae_train_mgwr)
print("R-squared (R2):", r2_train_mgwr)
Training set performance metrics:
Mean Squared Error (MSE): 986.74681202883
Mean Absolute Error (MAE): 15.172640017308824
R-squared (R2): 0.9081572608960851
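Unlike the global models, GWR fits a separate local regression at every location, so it is often informative to see how the local fit varies across the study area. A minimal sketch summarising the local R2 values, assuming the fitted results object exposes them via results.localR2 (available for the Gaussian model in mgwr) and that matplotlib is installed:
import matplotlib.pyplot as plt
# Distribution of local R-squared values across the training observations
local_r2 = results.localR2.flatten()
fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(local_r2, bins=40)
ax.set_xlabel('Local R-squared')
ax.set_ylabel('Number of observations')
ax.set_title('GWR local fit across the study area')
plt.tight_layout()
plt.show()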
df_X_y_test = df_X_y_test.merge(df_coors_s, on='grd_newid', how='left')
df_X_y_test
| | grd_newid | Month | c_counts | poi_counts | poi_types | east | north |
|---|---|---|---|---|---|---|---|
| 0 | 1kmN3178E3626 | 2024-12 | 0.0 | 0.0 | 0.0 | 540160.898592 | 156351.273341 |
| 1 | 1kmN3178E3627 | 2024-12 | 0.0 | 13.0 | 7.0 | 541148.801501 | 156514.772306 |
| 2 | 1kmN3178E3628 | 2024-12 | 0.0 | 1.0 | 1.0 | 542136.703610 | 156678.270809 |
| 3 | 1kmN3178E3629 | 2024-12 | 0.0 | 1.0 | 1.0 | 543124.604471 | 156841.769399 |
| 4 | 1kmN3178E3630 | 2024-12 | 0.0 | 1.0 | 1.0 | 544112.504850 | 157005.269369 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1728 | 1kmN3223E3626 | 2024-12 | 0.0 | 0.0 | 0.0 | 532810.468008 | 200672.557618 |
| 1729 | 1kmN3223E3627 | 2024-12 | 0.0 | 0.0 | 0.0 | 533798.279960 | 200836.397365 |
| 1730 | 1kmN3224E3620 | 2024-12 | 0.0 | 5.0 | 4.0 | 526719.906430 | 200674.265154 |
| 1731 | 1kmN3224E3621 | 2024-12 | 0.0 | 1.0 | 1.0 | 527707.725648 | 200838.116449 |
| 1732 | 1kmN3224E3624 | 2024-12 | 0.0 | 0.0 | 0.0 | 530671.170943 | 201329.693106 |
1733 rows × 7 columns
coors_test = df_X_y_test[['east', 'north']].values
X_test = df_X_y_test[['poi_counts', 'poi_types']].values
y_test = df_X_y_test[['c_counts']].values
mgwr_model.bw
25.0
# Testing set performance metrics
# note: a separate GWR is fitted on the December 2024 data using the bandwidth selected above (bw=25),
# so the metrics below are in-sample fits for that month rather than true out-of-sample predictions
gwr_test_model = GWR(coors_test, y_test, X_test, bw=25, fixed=False, kernel='gaussian')
results_test = gwr_test_model.fit()
import numpy as np
np.float = float  # Monkey patch: mgwr's summary() still uses np.float, which was removed in NumPy 1.24
results_test.summary()
===========================================================================
Model type Gaussian
Number of observations: 1733
Number of covariates: 3
Global Regression Results
---------------------------------------------------------------------------
Residual sum of squares: 10908510.231
Log-likelihood: -10038.681
AIC: 20083.362
AICc: 20085.385
BIC: 10895608.567
R2: 0.571
Adj. R2: 0.571
Variable Est. SE t(Est/SE) p-value
------------------------------- ---------- ---------- ---------- ----------
X0 13.559 2.848 4.761 0.000
X1 1.186 0.037 32.022 0.000
X2 -1.891 0.256 -7.376 0.000
Geographically Weighted Regression (GWR) Results
---------------------------------------------------------------------------
Spatial kernel: Adaptive gaussian
Bandwidth used: 25.000
Diagnostic information
---------------------------------------------------------------------------
Residual sum of squares: 6830750.006
Effective number of parameters (trace(S)): 85.372
Degree of freedom (n - trace(S)): 1647.628
Sigma estimate: 64.388
Log-likelihood: -9633.065
AIC: 19438.874
AICc: 19448.046
BIC: 19910.261
R2: 0.732
Adjusted R2: 0.718
Adj. alpha (95%): 0.002
Adj. critical t value (95%): 3.133
Summary Statistics For GWR Parameter Estimates
---------------------------------------------------------------------------
Variable Mean STD Min Median Max
-------------------- ---------- ---------- ---------- ---------- ----------
X0 10.592 31.057 -55.690 4.272 246.785
X1 0.383 0.511 -0.474 0.253 2.418
X2 1.341 3.122 -14.972 1.886 5.325
===========================================================================
# Predictions from the GWR fitted on the December 2024 (test-month) data
y_test_predictions = gwr_test_model.predict(coors_test, X_test)
# Calculate performance metrics for the testing set
mse_test_mgwr = mean_squared_error(y_test, y_test_predictions.predictions)
mae_test_mgwr = mean_absolute_error(y_test, y_test_predictions.predictions)
r2_test_mgwr = r2_score(y_test, y_test_predictions.predictions)
print("Testing set performance metrics:")
print("Mean Squared Error (MSE):", mse_test_mgwr)
print("Mean Absolute Error (MAE):", mae_test_mgwr)
print("R-squared (R2):", r2_test_mgwr)
Testing set performance metrics:
Mean Squared Error (MSE): 3941.575306423653
Mean Absolute Error (MAE): 26.264055429467643
R-squared (R2): 0.7315758150297141
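Finally, to answer the 'compare the performance' part of both quiz questions, the December 2024 test metrics computed above can be collected into a single table; a small sketch reusing the variables already defined:
# Side-by-side comparison of the three models on the December 2024 test set
comparison = pd.DataFrame({
    'Model': ['Random Forest', 'SVM', 'GWR (refit on Dec 2024)'],
    'MSE': [mse_test_rf, mse_test_svm, mse_test_mgwr],
    'MAE': [mae_test_rf, mae_test_svm, mae_test_mgwr],
    'R2': [r2_test_rf, r2_test_svm, r2_test_mgwr]})
print(comparison.round(3))
On these figures the Random Forest generalises best to December 2024 (R2 about 0.86), the refit GWR sits in between (R2 about 0.73), and the unscaled SVM performs worst (R2 about 0.42); keep in mind, however, that the GWR metrics come from a model refit on the test month, so they are not a strictly like-for-like out-of-sample comparison.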