5 Geospatial statistical modelling

Time	Activity
16:00–16:05	Introduction — Overview of the main tasks for the lab tutorials
16:05–16:45	Tutorial: geospatial modelling 1 — Follow Section 5.1-5.3 of the Jupyter Notebook to practice the spatial lag, SLM, SEM
16:45–17:30	Tutorial: geospatial modelling 2 — Follow Section 5.4-5.6 of the Jupyter Notebook to practice the SDM, GWR and mGWR
17:30–17:55	Quiz — Complete quiz tasks
17:55–18:00	Wrap-up — Recap key points and address final questions

For this module’s lab tutorials, you can download all the required data using the provided link (click).

Please make sure that both the Jupyter Notebook file and the data and img folder are placed in the same directory (specifically within the STBDA_lab folder) to ensure the code runs correctly.

Week 5 Key Takeaways:

Understand the spatial lags and compute using libpysal;
Understand and perform the SLM model;
Understand and perform the SEM model;
Understand and perform the SDM model;
Understand and perform the GWR and mGWR model.

5 Geospatial statistical modelling#

Spatial regression is a type of statistical modeling that accounts for spatial dependence (when nearby locations influence each other) and spatial heterogeneity (when the distribution of a single variable and relationships between variables vary across space) in the data. In other words, it’s used when geolocation matters and traditional regression methods (like OLS) might give misleading results due to spatial dependence or spatial heterogeneity.

# Import the required libraries
import numpy as np
import esda
import geopandas as gpd
import libpysal.weights as weights
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

5.1 Spatial lag#

Spatial lag refers to a weighted measurement of neighboring values (based on a spatial weights matrix). It captures how a variable at one location is influenced by its neighbors. In other words, spatial lag helps capture spatial dependence — where values in one area depend on values in neighboring areas/units.

Spatial lag as a new variable is popularly used as a predictor in spatial regression models.

The spatial lag of a variable \(y\) is calculated as:

\[ Wy \]

Where:

\(y\) is the vector of the variable you’re analyzing (e.g., income, house price)
\(W\) is the spatial weights matrix, which defines how locations relate to each other (e.g., neighbors)
\(Wy\) is the spatial lag, a new vector where each element is the weighted sum of neighboring \(y\) values

Each element \((Wy)_i\) is computed as:

\[ (Wy)_i = \sum_{j=1}^{n} w_{ij} y_j \]

Where:

\(w_{ij}\) is the spatial weight between location \(i\) and neighbor \(j\)
\(y_j\) is the value of the variable \(y\) at location \(j\)

We use the defination above to calculate the spatial lag manually or use libpysal library to compute the spatial lag for the crime incidents in London LSOAs used before.

gdf_crime_agg_may = gpd.read_file('data/gdf_crime_agg_may.geojson')

gdf_crime_agg_may

	LSOA21CD	Month	Count	Z_score	p_sim	geometry
0	E01000001	2024-05	11.0	4.896641	0.001	POLYGON ((532153.339 181867.677, 532154.301 18...
1	E01000002	2024-05	38.0	6.055690	0.001	POLYGON ((532636.307 181926.254, 532633.858 18...
2	E01000003	2024-05	13.0	1.267929	0.013	POLYGON ((532155.508 182165.404, 532160.055 18...
3	E01000005	2024-05	72.0	3.965206	0.001	POLYGON ((533620.885 181402.575, 533641.692 18...
4	E01000006	2024-05	10.0	-0.012130	0.339	POLYGON ((545128.728 184311.01, 545147.089 184...
...	...	...	...	...	...	...
4989	E01035718	2024-05	156.0	1.016448	0.015	POLYGON ((528046.674 180618.19, 528048.069 180...
4990	E01035719	2024-05	9.0	0.637040	0.047	POLYGON ((530268.738 178811.47, 530267.488 178...
4991	E01035720	2024-05	3.0	-0.203603	0.226	POLYGON ((529857.882 178761.915, 529858.59 178...
4992	E01035721	2024-05	61.0	0.429416	0.069	POLYGON ((528643.526 178631.049, 528637.399 17...
4993	E01035722	2024-05	7.0	-0.170283	0.302	POLYGON ((529608.089 178302.605, 529608.368 17...

4994 rows × 6 columns

# we have calculated the spatial weights for LSOAs using the Queen contiguity method before
# compute the spatial weights using the Queen contiguity method
w_lsoa_london = weights.Queen.from_dataframe(gdf_crime_agg_may, use_index='True')
df_w_lsoa = w_lsoa_london.to_adjlist(drop_islands=True)
df_w_lsoa

	focal	neighbor	weight
0	0	1	1.0
1	0	2	1.0
2	0	4536	1.0
3	0	4537	1.0
4	0	4601	1.0
...	...	...	...
29463	4992	4623	1.0
29464	4993	4422	1.0
29465	4993	4483	1.0
29466	4993	4485	1.0
29467	4993	4486	1.0

29468 rows × 3 columns

For example, we calculate the spatial lag of LSOA21CD (E01000001) where index is 0 in the gdf_crime_agg_may manually use the spatial weights’ matrix.

# first, we found the neighbors of LSOA21CD (E01000001) in the df_w_lsoa
df_w_0 = df_w_lsoa[df_w_lsoa['focal'] == 0]
df_w_0

	neighbor	weight
0	1	1.0
1	2	1.0
2	4536	1.0
3	4537	1.0
4	4601	1.0

So, the spatial lag of LSOA21CD (E01000001) (index is 0) is the sum of each neighbor’s crime count times the weight:

# we can observe that the neighbors of 0 have five neighbors (1, 2, 4536, 4537, 4601)
gdf_crime_agg_may.iloc[1]['Count'] * df_w_0[df_w_0['neighbor'] == 1]['weight'].values[0] + \
gdf_crime_agg_may.iloc[2]['Count'] * df_w_0[df_w_0['neighbor'] == 2]['weight'].values[0] + \
gdf_crime_agg_may.iloc[4536]['Count'] * df_w_0[df_w_0['neighbor'] == 4536]['weight'].values[0] + \
gdf_crime_agg_may.iloc[4537]['Count'] * df_w_0[df_w_0['neighbor'] == 4537]['weight'].values[0] + \
gdf_crime_agg_may.iloc[4601]['Count'] * df_w_0[df_w_0['neighbor'] == 4601]['weight'].values[0]

811.0

gdf_crime_agg_may.iloc[2]['Count']

13.0

# we can use the libpysal to compute the spatial lag
import libpysal as lps
ylag = lps.weights.lag_spatial(w_lsoa_london, gdf_crime_agg_may['Count'])
print(ylag)

[811. 788. 266. ...  36. 168.  40.]

len(ylag)

# add the spatial lag to the gdf_crime_agg_may
gdf_crime_agg_may['sl_Count'] = ylag
gdf_crime_agg_may.head()

	LSOA21CD	Month	Count	Z_score	p_sim	geometry	sl_Count
0	E01000001	2024-05	11.0	4.896641	0.001	POLYGON ((532153.339 181867.677, 532154.301 18...	811.0
1	E01000002	2024-05	38.0	6.055690	0.001	POLYGON ((532636.307 181926.254, 532633.858 18...	788.0
2	E01000003	2024-05	13.0	1.267929	0.013	POLYGON ((532155.508 182165.404, 532160.055 18...	266.0
3	E01000005	2024-05	72.0	3.965206	0.001	POLYGON ((533620.885 181402.575, 533641.692 18...	805.0
4	E01000006	2024-05	10.0	-0.012130	0.339	POLYGON ((545128.728 184311.01, 545147.089 184...	59.0

# Plot the original data and spatial lag of the crime data for comparison
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
gdf_crime_agg_may.plot(column='Count', ax=ax[0], legend=True, cmap='Reds',
                       edgecolor='black', linewidth=0.05, scheme = 'naturalbreaks',
                       legend_kwds = {'bbox_to_anchor': (1.2, 0.35),'title': "Crime count"})
ax[0].set_title('Crime Count in London LSOAs (May 2024)')
gdf_crime_agg_may.plot(column='sl_Count', ax=ax[1], legend=True, cmap='Reds',
                       edgecolor='black', linewidth=0.05, scheme = 'naturalbreaks',
                       legend_kwds = {'bbox_to_anchor': (1.2, 0.35),'title': "SL values"})
ax[1].set_title('Spatial Lag of Crime Count in London LSOAs (May 2024)')
ax[0].set_axis_off()
ax[1].set_axis_off()
plt.show()

_images/923f6d6f801128b3943e040ff0ca9b03efa556f19e213086b3d8a0c8bdfd550d.png

5.2 Spatial Lag Model (SLM)#

The Spatial Lag Model (SLM) is a type of regression model that explicitly includes the effect of neighboring dependent variable values (i.e., spatial lag) on the value at each location:

\[ \mathbf{y} = \rho \mathbf{W y} + \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]

\(\mathbf{y}\) is the vector of the dependent variable
\(\mathbf{W}\) is the spatial weights matrix, so \(\mathbf{W y}\) is the spatial lag of \(\mathbf{y}\)
\(\rho\) is the spatial autoregressive coefficient
\(\mathbf{X}\) is the matrix of explanatory variables
\(\boldsymbol{\beta}\) is the vector of coefficients
\(\boldsymbol{\epsilon}\) is the error term

We need a new pkg called spreg as a part of libpysal to compute the spatial lag model. The columbus data frame has 49 rows and 22 columns. Unit of analysis: 49 neighbourhoods in Columbus, OH, 1980 data. The Spatial Lag Model documentation can be found here.

import spreg

from libpysal.examples import load_example
from spreg import ML_Lag  # Maximum Likelihood Spatial Lag Model
# Load sample dataset
data = load_example('Columbus')
gdf_columbus = gpd.read_file(data.get_path('columbus.shp'))
gdf_columbus.head()

	AREA	PERIMETER	COLUMBUS_	COLUMBUS_I	POLYID	NEIG	HOVAL	INC	CRIME	OPEN	...	DISCBD	X	Y	NSA	NSB	EW	THOUS	NEIGNO	geometry
0	0.309441	2.440629	2	5	1	5	80.467003	19.531	15.725980	2.850747	...	5.03	38.799999	44.070000	1.0	1.0	1.0	1000.0	1005.0	POLYGON ((8.62413 14.23698, 8.5597 14.74245, 8...
1	0.259329	2.236939	3	1	2	1	44.567001	21.232	18.801754	5.296720	...	4.27	35.619999	42.380001	1.0	1.0	0.0	1000.0	1001.0	POLYGON ((8.25279 14.23694, 8.28276 14.22994, ...
2	0.192468	2.187547	4	6	3	6	26.350000	15.956	30.626781	4.534649	...	3.89	39.820000	41.180000	1.0	1.0	1.0	1000.0	1006.0	POLYGON ((8.65331 14.00809, 8.81814 14.00205, ...
3	0.083841	1.427635	5	2	4	2	33.200001	4.477	32.387760	0.394427	...	3.70	36.500000	40.520000	1.0	1.0	0.0	1000.0	1002.0	POLYGON ((8.4595 13.82035, 8.47341 13.83227, 8...
4	0.488888	2.997133	6	7	5	7	23.225000	11.252	50.731510	0.405664	...	2.83	40.009998	38.000000	1.0	1.0	1.0	1000.0	1007.0	POLYGON ((8.68527 13.63952, 8.67758 13.72221, ...

5 rows × 21 columns

# we need to set the crs to EPSG:26917 (NAD83 / UTM zone 17N) as the polygon valus in the data is in this CRS, not the EPSG:4326.
gdf_columbus = gdf_columbus.set_crs(epsg=26917, allow_override=True)
print(gdf_columbus.crs)

EPSG:26917

gdf_columbus[['INC', 'HOVAL', 'CRIME']]

	INC	HOVAL	CRIME
0	19.531000	80.467003	15.725980
1	21.232000	44.567001	18.801754
2	15.956000	26.350000	30.626781
3	4.477000	33.200001	32.387760
4	11.252000	23.225000	50.731510
5	16.028999	28.750000	26.066658
6	8.438000	75.000000	0.178269
7	11.337000	37.125000	38.425858
8	17.586000	52.599998	30.515917
9	13.598000	96.400002	34.000835
10	7.467000	19.700001	62.275448
11	10.048000	19.900000	56.705669
12	9.549000	41.700001	46.716129
13	9.963000	42.900002	57.066132
14	9.873000	18.000000	48.585487
15	7.625000	18.799999	54.838711
16	9.798000	41.750000	36.868774
17	13.185000	60.000000	43.962486
18	11.618000	30.600000	54.521965
19	31.070000	81.266998	0.223797
20	10.655000	19.975000	40.074074
21	11.709000	30.450001	33.705048
22	21.155001	47.733002	20.048504
23	14.236000	53.200001	38.297871
24	8.461000	17.900000	61.299175
25	8.085000	20.299999	40.969742
26	10.822000	34.099998	52.794430
27	7.856000	22.850000	56.919785
28	8.681000	32.500000	60.750446
29	13.906000	22.500000	68.892044
30	16.940001	31.799999	17.677214
31	18.941999	40.299999	19.145592
32	9.918000	23.600000	41.968163
33	14.948000	28.450001	23.974028
34	12.814000	27.000000	39.175053
35	18.739000	36.299999	14.305556
36	17.017000	43.299999	42.445076
37	11.107000	22.700001	53.710938
38	18.476999	39.599998	19.100863
39	29.833000	61.950001	16.241299
40	22.207001	42.099998	18.905146
41	25.872999	44.333000	16.491890
42	13.380000	25.700001	36.663612
43	16.961000	33.500000	25.962263
44	14.135000	27.733000	29.028488
45	18.323999	76.099998	16.530533
46	18.950001	42.500000	27.822861
47	11.813000	26.799999	26.645266
48	18.796000	35.799999	22.541491

# plot the three maps of the columbus data: income ('INC'), house value ('HOVAL') and crime rate ('CRIME')
fig, ax = plt.subplots(1, 3, figsize=(12, 6))
gdf_columbus.plot(column='INC', ax=ax[0], legend=True, cmap='YlOrRd', edgecolor='black', legend_kwds={'shrink': 0.5})
ax[0].set_title('Income')
ax[0].axis('off')
gdf_columbus.plot(column='HOVAL', ax=ax[1], legend=True, cmap='Blues', edgecolor='black', legend_kwds={'shrink': 0.5})
ax[1].set_title('House Value')
ax[1].axis('off')
gdf_columbus.plot(column='CRIME', ax=ax[2], legend=True, cmap='RdPu', edgecolor='black', legend_kwds={'shrink': 0.5})
ax[2].set_title('Crime Rate')
ax[2].axis('off')
plt.tight_layout()

_images/5a98823245bb98d0d83d7050cc72aab53966de7b994db4789d5ffe6568b33857.png

# Spatial weights
w = weights.Queen.from_dataframe(gdf_columbus, use_index='True')
w.transform = 'R'  # Row-standardized

# Variables
y = gdf_columbus[['CRIME']].values          # Dependent variable / matrix
X = gdf_columbus[['INC', 'HOVAL']].values   # Independent variables are Income and House Value.

# Fit spatial lag model
model_sl = ML_Lag(y, X, w=w, name_y='CRIME', name_x=['INC', 'HOVAL'], name_w='Queen', name_ds='Columbus')

# Output summary
print(model_sl.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL)
-----------------------------------------------------------------
Data set            :    Columbus
Weights matrix      :       Queen
Dependent Variable  :       CRIME                Number of Observations:          49
Mean dependent var  :     35.1288                Number of Variables   :           4
S.D. dependent var  :     16.7321                Degrees of Freedom    :          45
Pseudo R-squared    :      0.6473
Spatial Pseudo R-squared:  0.5806
Log likelihood      :   -182.6740
Sigma-square ML     :     96.8572                Akaike info criterion :     373.348
S.E of regression   :      9.8416                Schwarz criterion     :     380.915

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        45.60325         7.25740         6.28369         0.00000
                 INC        -1.04873         0.30741        -3.41154         0.00065
               HOVAL        -0.26633         0.08910        -2.98929         0.00280
             W_CRIME         0.42333         0.11951         3.54216         0.00040
------------------------------------------------------------------------------------

SPATIAL LAG MODEL IMPACTS
Impacts computed using the 'simple' method.
            Variable         Direct        Indirect          Total
                 INC        -1.0487         -0.7699         -1.8186
               HOVAL        -0.2663         -0.1955         -0.4618
================================ END OF REPORT =====================================

Interpretation of the results:

Number of Variables: Includes intercept/constant, two predictors /independent variables(INC, HOVAL), and spatial lag \(y\) (the spatial lag is inherently calculated as part of the model fitting process) (W_CRIME);
Pseudo \(R^2\): Indicates Goodness of model fit (similar to \(R^2\) in OLS), here, ~65% of variance explained (higher is better);
Spatial Pseudo \(R^2\): Fit due to the spatial component (higher is better);
Sigma-square ML: Maximum likelihood estimate of error variance;
Standard Error of Regression: Residual spread;
Log Likelihood: Used for model comparison (higher is better);
AIC/ Akaike Information Criterion: Used for model comparison (lower is better);
Schwarz Criterion (BIC): Bayesian Information Criterion (penalizes model complexity) (Lower BIC is better);
The coefficient of variable: it reflects the relationship between each independent variable and the dependent variable. A positive coefficient means the variable is positively correlated with the dependent variable — as the independent variable increases, the dependent variable tends to increase. A negative coefficient means the variable is negatively correlated — as the independent variable increases, the dependent variable tends to decrease.
The z-statistic (also called the Wald statistic) tests whether a coefficient is significantly different from zero. A high absolute z-value (e.g., > 2 or < -2) suggests the coefficient is statistically significant.
The p-value tells you the probability of observing a coefficient as extreme as the one calculated, assuming the true coefficient is zero (i.e., null hypothesis is true). Low p-values (< 0.05) -> reject the null hypothesis -> the variable is significant. High p-values (> 0.05) -> fail to reject the null -> variable may not be important.

In summary, the results reflect the spatial lag model equation, where the spatial autoregressive coefficient \(\rho\) corresponds to the coefficient of W_CRIME (0.42). The vector of regression coefficients \(\boldsymbol{\beta}\) includes the coefficients for INC and HOVAL, which are -1.05 and -0.27 respectively. The error term \(\boldsymbol{\epsilon}\) is represented by the standard error of the regression (9.84) and the maximum likelihood estimate of the error variance (Sigma-square ML = 96.86).

Residuals analysis

If residuals still show spatial autocorrelation –> model may be misspecified. In other words, it suggests that your model isn’t fully capturing the spatial structure in the data.

residuals = model_sl.u  # residuals from the SLM
moran_resid = esda.Moran(residuals, w)
print(moran_resid.I, moran_resid.p_sim)

0.03306830572737454 0.258

We can also plot the fitted y values and residuals/errors in residuals analysis:

Mapping fitted values: show where your model thinks the crime rate is high/low.
Mapping residual values: show where the model over/underpredicting.

# get the in sample prediction (fitted) crime rates and the residuals
gdf_columbus["fitted"] = model_sl.predy.flatten()
gdf_columbus["residuals"] = model_sl.u.flatten()

fig, ax = plt.subplots(1, 3, figsize=(12, 6))
gdf_columbus.plot(column='CRIME', ax=ax[0], legend=True, cmap='RdPu', edgecolor='black', legend_kwds={'shrink': 0.5},
                  vmin=gdf_columbus.CRIME.min(), vmax=gdf_columbus.CRIME.max())
ax[0].set_title('Observed crime rate')
ax[0].axis('off')
gdf_columbus.plot(column='fitted', ax=ax[1], legend=True, cmap='RdPu', edgecolor='black', legend_kwds={'shrink': 0.5}, vmin=gdf_columbus.CRIME.min(), vmax=gdf_columbus.CRIME.max())
ax[1].set_title('Predicited/fitted crime rate \n by the SL model')
ax[1].axis('off')
gdf_columbus.plot(column='residuals', ax=ax[2], legend=True, cmap='coolwarm', edgecolor='black', legend_kwds={'shrink': 0.5},
                  vmin=gdf_columbus.residuals.min(), vmax=abs(gdf_columbus.residuals.min()))
ax[2].set_title('Residuals')
ax[2].axis('off')
plt.tight_layout()

_images/91909d08cf395dd93d135be3d7ec126f7e4ff601d661f39286efe8f3917b4c9e.png

5.3 Spatial Error Model (SEM)#

The Spatial Error Model (SEM) accounts for spatial autocorrelation in regression error terms. It is used when spatially correlated unobserved variables that affect the dependent variable are omitted from the model. Rather than directly impacting the dependent variable, spatial effects are captured within the error terms themselves.

The Spatial Error Model (SEM) is given by:

\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{u} \]

where:

\(\mathbf{y}\) is the dependent variable,
\(\mathbf{X}\) is the matrix of independent variables,
\(\boldsymbol{\beta}\) is the vector of coefficients,
\(\mathbf{u}\) is the error term.

The error term is spatially correlated, and can be expressed as:

\[ \mathbf{u} = \lambda W \mathbf{u} + \boldsymbol{\epsilon} \]

where:

\(\lambda\) is the spatial autoregressive parameter,
\(\mathbf{W}\) is the spatial weight matrix,
\(\boldsymbol{\epsilon}\) is the white noise error term.

Thus, the SEM model is:

\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + (\lambda W \mathbf{u} + \boldsymbol{\epsilon}) \]

from spreg import ML_Error  # Maximum Likelihood Spatial Error Model
# Use the same X and y variables in gdf_columbus
# Fit the spatial error model
model_sem = ML_Error(y, X, w=w, name_y='CRIME', name_x=['INC', 'HOVAL'], name_w='Queen', name_ds='Columbus')
# Output summary
print(model_sem.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: ML SPATIAL ERROR (METHOD = full)
---------------------------------------------------
Data set            :    Columbus
Weights matrix      :       Queen
Dependent Variable  :       CRIME                Number of Observations:          49
Mean dependent var  :     35.1288                Number of Variables   :           3
S.D. dependent var  :     16.7321                Degrees of Freedom    :          46
Pseudo R-squared    :      0.5362
Log likelihood      :   -183.7494
Sigma-square ML     :     97.6742                Akaike info criterion :     373.499
S.E of regression   :      9.8830                Schwarz criterion     :     379.174

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        60.27947         5.36559        11.23444         0.00000
                 INC        -0.95731         0.33423        -2.86420         0.00418
               HOVAL        -0.30456         0.09205        -3.30873         0.00094
              lambda         0.54675         0.13805         3.96052         0.00007
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================

Interpretation:

For model comparison, the Spatial Lag Model (SL) demonstrates better performance than the Spatial Error Model (SE), as indicated by its higher R² and log-likelihood values. Although the AIC values for SL are only slightly better than SE, the difference is not substantial enough to definitively favor one model based solely on these criteria.

5.4 Spatial Durbin Model (SDM)#

The Spatial Durbin Model (SDM) extends the spatial lag model (SAR) by including spatially lagged independent variables in addition to the spatially lagged dependent variable. It allows for capturing spatial dependence in both the dependent and independent variables.

The Spatial Durbin Model (SDM) is specified as:

\[ \mathbf{y} = \rho \mathbf{W} \mathbf{y} + \mathbf{X} \boldsymbol{\beta} + \mathbf{W} \mathbf{X} \boldsymbol{\theta} + \boldsymbol{\varepsilon} \]

where:

\( \mathbf{y} \) is the dependent variable vector,
\( \rho \mathbf{W} \mathbf{y} \) represents the spatial lag of the dependent variable,
\( \mathbf{X} \boldsymbol{\beta} \) are the usual covariates,
\( \mathbf{W} \mathbf{X} \boldsymbol{\theta} \) are the spatially lagged covariates,
\( \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I}) \) is the error term.

We also use the function spreg.ML_Lag but need to set the parameter slx_lags(Number of spatial lags of X to include in the model specification. If slx_lags>0, the specification becomes of the Spatial Durbin type) and slx_var (either All(default) or list of booleans to select x variables to be lagged)

# Fit spatial durbin model
model_dur = ML_Lag(y, X, w=w, name_y='CRIME', name_x=['INC', 'HOVAL'], name_w='Queen', name_ds='Columbus', slx_lags=1)

# Output summary
print(model_dur.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG WITH SLX - SPATIAL DURBIN MODEL (METHOD = FULL)
-------------------------------------------------------------------------------------------------
Data set            :    Columbus
Weights matrix      :       Queen
Dependent Variable  :       CRIME                Number of Observations:          49
Mean dependent var  :     35.1288                Number of Variables   :           6
S.D. dependent var  :     16.7321                Degrees of Freedom    :          43
Pseudo R-squared    :      0.6603
Spatial Pseudo R-squared:  0.6120
Log likelihood      :   -181.6393
Sigma-square ML     :     93.2722                Akaike info criterion :     375.279
S.E of regression   :      9.6578                Schwarz criterion     :     386.629

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        44.32001        13.04547         3.39735         0.00068
                 INC        -0.91991         0.33474        -2.74811         0.00599
               HOVAL        -0.29713         0.09042        -3.28625         0.00102
               W_INC        -0.58391         0.57422        -1.01687         0.30921
             W_HOVAL         0.25768         0.18723         1.37626         0.16874
             W_CRIME         0.40346         0.16133         2.50079         0.01239
------------------------------------------------------------------------------------

DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                              DF         VALUE           PROB
Common Factor Hypothesis Test     2          4.474           0.1068

SPATIAL DURBIN MODEL IMPACTS
Impacts computed using the 'simple' method.
            Variable         Direct        Indirect          Total
                 INC        -0.9199         -1.6010         -2.5209
               HOVAL        -0.2971          0.2310         -0.0661
================================ END OF REPORT =====================================

Interpretation:

Based on the results of the report, we can observe that the Spatial Durbin Model (SDM) fits the data using a different approach. Firstly, we notice that the spatial lag variables of the independent variables — specifically, W_INC and W_HOVAL — have p-values greater than 0.05, indicating that they are not statistically significant. However, when considering metrics such as \(R^2\) and log likelihood, the SDM shows slightly better performance compared to the Spatial Lag Model (SLM) and the Spatial Error Model (SEM) which suggest a marginally stronger overall fit.

5.5 Geographically Weighted Regression (GWR)#

Geographically Weighted Regression (GWR) is an extension of traditional regression that allows the relationships between independent variables (X) and the dependent variable (y) to vary across geographic space.

The linear regression /OLS:

\[ y_i = \beta_0 + \sum_{k=1}^{p} \beta_k x_{ik} + \varepsilon_i, i \in \{1, 2, \ldots, n\}, k \in \{1, 2, \ldots, p\} \]

GWR:

\[ y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{p} \beta_k(u_i, v_i) \, x_{ik} + \varepsilon_i, i \in \{1, 2, \ldots, n\}, k \in \{1, 2, \ldots, p\} \]

Where:

\( y_i \): Dependent variable / Response variable at location \( i \);
\( x_{ik} \): \( k -th\) independent variable/predictor at location \( i \);
\( (u_i, v_i) \): Coordinates of location \( i \);
\( \beta_k(u_i, v_i) \): Location-specific regression coefficient for variable \( k \) at location \(i\);
\( \varepsilon_i \): Error term;
\( n \) is the number of observations;
\( p \) is the number of independent variables;
\(\beta_0\) is the intercept.

Weighted Least Squares Estimation

Each local coefficient \( \hat{\boldsymbol{\beta}}(u_i, v_i) \) is estimated by solving:

\[ \hat{\boldsymbol{\beta}}(u_i, v_i) = \left( X^\top W_i X \right)^{-1} X^\top W_i \boldsymbol{y} \]

Where:

\( X \): Matrix of predictors/ dependent variables/ features;
\( W_i \): Spatial weighting matrix for location \( i \);
\( \boldsymbol{y} \): Vector of observed responses.

Kernel Function for Weights (e.g., Gaussian)

\[ w_{ij} = \exp\left( -\frac{d_{ij}^2}{b^2} \right) \]

Where:

\( d_{ij} \): Distance between location \( i \) and \( j \)
\( b \): Bandwidth parameter

Example: Calculating the local coefficient \( \hat{\boldsymbol{\beta}}(u_0, v_0)\)

data

index \(i,j\)	\(x\)	\(y\)	\(u (lon)\)	\(v (lat)\)
1	1.0	2.0	0.0	0.0
2	2.0	3.0	1.0	0.0
3	3.0	6.0	0.0	1.0
4	4.0	8.0	1.0	1.0

Fitting the equation of GWR:

\[y = \beta_0(u_0, v_0) + \beta_1(u_0, v_0) \cdot x\]

Calculating dist

index \(i\)	Coords	Dist to (0, 0)
1	(0, 0)	0.00
2	(1, 0)	1.00
3	(0, 1)	1.00
4	(1, 1)	1.41

Calculating spatial weight (set bandwith = 1) using

\(w_{j} = \exp\left( -\frac{d_{j}^2}{1} \right)\),

Exponential function: \(\exp(x) = e^x\) (\(e\) is Euler’s number, approximately 2.71828)

index \(i,j\)	Dis \(d_j\) to i (0,0)	Spatial weight \(w_1j\)
1	0.00	1.00
2	1.00	0.37
3	1.00	0.37
4	1.41	0.14

Caculating the local coefficent vector:

The \(\mathbf{X} \) and vector \( \mathbf{y}\) are:

\[\begin{split} \mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} \end{split}\]

\[\begin{split} \mathbf{y} = \begin{bmatrix} 2 \\ 3 \\ 6 \\ 8 \end{bmatrix} \end{split}\]

Calculating the diagonal matrix:

\[\begin{split} W=diag([w_{11}, w_{12}, w_{13}, w_{14}]) = \mathbf{W} = \begin{bmatrix} 1.0 & 0 & 0 & 0 \\ 0 & 0.37 & 0 & 0 \\ 0 & 0 & 0.37 & 0 \\ 0 & 0 & 0 & 0.14 \end{bmatrix} \end{split}\]

\[\begin{split}\mathbf{X}^T \mathbf{W} \mathbf{X} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \cdot \begin{bmatrix} 1.0 & 0 & 0 & 0 \\ 0 & 0.37 & 0 & 0 \\ 0 & 0 & 0.37 & 0 \\ 0 & 0 & 0 & 0.14 \end{bmatrix} \cdot \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} = \begin{bmatrix} 1.88 & 3.41 \\ 3.41 & 8.05 \end{bmatrix} \end{split}\]
\[\begin{split} \mathbf{X}^T \mathbf{W} \mathbf{y} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \cdot \begin{bmatrix} 1.0 & 0 & 0 & 0 \\ 0 & 0.37 & 0 & 0 \\ 0 & 0 & 0.37 & 0 \\ 0 & 0 & 0 & 0.14 \end{bmatrix} \cdot \begin{bmatrix} 2 \\ 3 \\ 6\\ 8 \end{bmatrix} = \begin{bmatrix} 6.45 \\ 15.36 \end{bmatrix}\end{split}\]
\[\begin{split} \hat{\boldsymbol{\beta}}(u_0, v_0) = \left( X^\top W X \right)^{-1} X^\top W \boldsymbol{y} = (\begin{bmatrix} 1.88 & 3.41 \\ 3.41 & 8.05 \end{bmatrix})^{-1} \cdot \begin{bmatrix} 6.45 \\ 15.36 \end{bmatrix} = \begin{bmatrix} -0.13 \\ 1.96 \end{bmatrix}\end{split}\]

# an example to calculate the expotional function
import math
print(math.exp(0),
      math.exp(-1),
      math.exp(-1),
      math.exp(-(1.41*1.41)))

1.0 0.36787944117144233 0.36787944117144233 0.13695539364545323

# Calculating the diagonal matrix
W = np.diag([1, 0.3679, 0.3679, 0.1370])
print(W)

[[1.     0.     0.     0.    ]
 [0.     0.3679 0.     0.    ]
 [0.     0.     0.3679 0.    ]
 [0.     0.     0.     0.137 ]]

# set X and y
X = np.array([
    [1, 1],
    [1, 2],
    [1, 3],
    [1, 4]
])
y = np.array([2, 3, 6, 8])
#  Spatial weights matrix W
weights = np.array([1.0, 0.37, 0.37, 0.14])
W = np.diag(weights)
# Calculate XT W X
XTWX = X.T @ W @ X
# Calculate XT W y
XTWy = X.T @ W @ y
# Calculate local coefficients beta_hat
beta_hat = np.linalg.inv(XTWX) @ XTWy
print("XTWX:\n", XTWX)
print("XTWy:\n", XTWy)
print("Estimated local coefficients beta_hat:\n", beta_hat)

XTWX:
 [[1.88 3.41]
 [3.41 8.05]]
XTWy:
 [ 6.45 15.36]
Estimated local coefficients beta_hat:
 [-0.12980975  1.96306227]

As a part of PySAL community, mgwr provide the GWR and mGWR model in python.

from mgwr.gwr import GWR, MGWR
from mgwr.sel_bw import Sel_BW
print(np.__version__)

1.26.4

We use Homicides and selected socio-economic characteristics for continental U.S. counties as an example for gwr and mgwr. Data for four decennial census years: 1960, 1970, 1980 and 1990.

import libpysal
us_hse = libpysal.examples.load_example("NCOVR")
gdf_us_hse = gpd.read_file(us_hse.get_path('NAT.shp'))
gdf_us_hse = gdf_us_hse.to_crs("EPSG:5070")
gdf_us_hse

	NAME	STATE_NAME	STATE_FIPS	CNTY_FIPS	FIPS	STFIPS	COFIPS	FIPSNO	SOUTH	HR60	...	BLK90	GI59	GI69	GI79	GI89	FH60	FH70	FH80	FH90	geometry
0	Lake of the Woods	Minnesota	27	077	27077	27	77	27077	0	0.000000	...	0.024534	0.285235	0.372336	0.342104	0.336455	11.279621	5.400000	5.663881	9.515860	POLYGON ((49050.745 2838555.999, 49054.346 285...
1	Ferry	Washington	53	019	53019	53	19	53019	0	0.000000	...	0.317712	0.256158	0.360665	0.361928	0.360640	10.053476	2.600000	10.079576	11.397059	POLYGON ((-1704187.741 2978490.561, -1690089.6...
2	Stevens	Washington	53	065	53065	53	65	53065	0	1.863863	...	0.210030	0.283999	0.394083	0.357566	0.369942	9.258437	5.600000	6.812127	10.352015	POLYGON ((-1598348.829 2964085.947, -1605943.0...
3	Okanogan	Washington	53	047	53047	53	47	53047	0	2.612330	...	0.155922	0.258540	0.371218	0.381240	0.394519	9.039900	8.100000	10.084926	12.840340	POLYGON ((-1713271.344 2979541.451, -1712880.7...
4	Pend Oreille	Washington	53	051	53051	53	51	53051	0	0.000000	...	0.134605	0.243263	0.365614	0.358706	0.387848	8.243930	4.100000	7.557643	10.313002	POLYGON ((-1574802.183 3066600.159, -1545432.6...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3080	La Paz	Arizona	04	012	04012	4	12	4012	0	5.046682	...	2.628811	0.271556	0.364110	0.372662	0.405743	9.216414	8.000000	9.296093	12.379134	POLYGON ((-1708456.819 1273380.859, -1711763.0...
3081	Cibola	New Mexico	35	006	35006	35	6	35006	0	3.411368	...	1.001029	0.287346	0.347426	0.363889	0.404823	9.645881	9.300000	11.925297	14.907666	POLYGON ((-1016458.583 1340151.199, -1063696.6...
3082	York	Virginia	51	199	51199	51	199	51199	1	1.544425	...	12.534861	0.271201	0.279962	0.310047	0.326229	8.776244	6.300000	8.063281	9.825284	POLYGON ((1713621.948 1740205.362, 1713411.118...
3083	Prince William	Virginia	51	153	51153	51	153	51153	1	9.302820	...	11.365661	0.240268	0.288488	0.289718	0.286416	7.525717	5.300000	8.626065	10.276895	POLYGON ((1584066.804 1880564.938, 1562143.846...
3084	Gallatin	Montana	30	031	30031	30	31	30031	0	3.396162	...	0.228960	0.245227	0.343359	0.344086	0.377684	8.598461	5.651882	8.166075	10.187915	POLYGON ((-1209699.354 2515401.971, -1198622.3...

3085 rows × 70 columns

gdf_us_hse.columns

Index(['NAME', 'STATE_NAME', 'STATE_FIPS', 'CNTY_FIPS', 'FIPS', 'STFIPS',
       'COFIPS', 'FIPSNO', 'SOUTH', 'HR60', 'HR70', 'HR80', 'HR90', 'HC60',
       'HC70', 'HC80', 'HC90', 'PO60', 'PO70', 'PO80', 'PO90', 'RD60', 'RD70',
       'RD80', 'RD90', 'PS60', 'PS70', 'PS80', 'PS90', 'UE60', 'UE70', 'UE80',
       'UE90', 'DV60', 'DV70', 'DV80', 'DV90', 'MA60', 'MA70', 'MA80', 'MA90',
       'POL60', 'POL70', 'POL80', 'POL90', 'DNL60', 'DNL70', 'DNL80', 'DNL90',
       'MFIL59', 'MFIL69', 'MFIL79', 'MFIL89', 'FP59', 'FP69', 'FP79', 'FP89',
       'BLK60', 'BLK70', 'BLK80', 'BLK90', 'GI59', 'GI69', 'GI79', 'GI89',
       'FH60', 'FH70', 'FH80', 'FH90', 'geometry'],
      dtype='object')

For independent variables, we use ‘UE90’, ‘POL90’, ‘GI89’ which refer to the Unemployment rate, Resource deprivation and Gini index of family income inequality of 1990; the dependent variable is ‘HR90’, i.e., Homicide rate per 100,000

gdf_us_hse[['HR90', 'UE90', 'RD90', 'GI89']]

	HR90	UE90	RD90	GI89
0	0.000000	3.894790	-0.802774	0.336455
1	15.885624	16.811594	-0.135483	0.360640
2	6.462453	10.700794	-0.276544	0.369942
3	6.996502	10.203544	0.370762	0.394519
4	7.478033	14.991023	0.107992	0.387848
...	...	...	...	...
3080	5.521552	10.910351	0.494534	0.405743
3081	8.691999	11.002271	0.688425	0.404823
3082	4.367330	4.102888	-1.339770	0.326229
3083	3.727712	3.353981	-1.747367	0.286416
3084	2.048855	5.488553	-0.361145	0.377684

3085 rows × 4 columns

from mgwr.gwr import GWR
from mgwr.sel_bw import Sel_BW
import warnings
warnings.filterwarnings('ignore')

# Ensure geometries are projected (e.g., EPSG:5070)
gdf_us_hse = gdf_us_hse.to_crs(epsg=5070)

# Extract centroids and coordinates
centroids = gdf_us_hse.geometry.centroid
coords = pd.DataFrame({
    'X': centroids.x,
    'Y': centroids.y}).values

# Dependent and independent variables
y= gdf_us_hse[['HR90']].values
X= gdf_us_hse[['UE90', 'RD90', 'GI89']].values

# select bandwidth
bw_selector = Sel_BW(coords, y, X, kernel='gaussian')
bw = bw_selector.search(bw_min=20)
print(f"Optimal bandwidth: {bw}")

Optimal bandwidth: 26.0

# fit the gwr_model
gwr_model = GWR(coords, y, X, bw)
gwr_results = gwr_model.fit()

np.float = float # Monkey patch, as gwr.summary() relies on numpy version < 1.24
gwr_results.summary()

===========================================================================
Model type                                                         Gaussian
Number of observations:                                                3085
Number of covariates:                                                     4

Global Regression Results
---------------------------------------------------------------------------
Residual sum of squares:                                          89071.988
Log-likelihood:                                                   -9564.688
AIC:                                                              19137.376
AICc:                                                             19139.396
BIC:                                                              64318.288
R2:                                                                   0.345
Adj. R2:                                                              0.345

Variable                              Est.         SE  t(Est/SE)    p-value
------------------------------- ---------- ---------- ---------- ----------
X0                                  32.257      2.662     12.119      0.000
X1                                  -0.174      0.041     -4.274      0.000
X2                                   6.283      0.263     23.934      0.000
X3                                 -66.506      7.064     -9.415      0.000

Geographically Weighted Regression (GWR) Results
---------------------------------------------------------------------------
Spatial kernel:                                           Adaptive bisquare
Bandwidth used:                                                      26.000

Diagnostic information
---------------------------------------------------------------------------
Residual sum of squares:                                          34561.697
Effective number of parameters (trace(S)):                         1066.029
Degree of freedom (n - trace(S)):                                  2018.971
Sigma estimate:                                                       4.137
Log-likelihood:                                                   -8104.405
AIC:                                                              18342.868
AICc:                                                             19472.896
BIC:                                                              24781.647
R2:                                                                   0.746
Adjusted R2:                                                          0.612
Adj. alpha (95%):                                                     0.000
Adj. critical t value (95%):                                          3.740

Summary Statistics For GWR Parameter Estimates
---------------------------------------------------------------------------
Variable                   Mean        STD        Min     Median        Max
-------------------- ---------- ---------- ---------- ---------- ----------
X0                       25.316     61.088   -334.721     22.559    296.359
X1                       -0.069      1.135    -10.447     -0.042      6.231
X2                        4.520      7.703    -38.419      4.080     36.367
X3                      -48.895    153.210   -748.675    -45.006    866.587
===========================================================================

# Convert local R2 to df
localR2_df = pd.DataFrame(gwr_results.localR2, columns=["localR2"])
# Convert params to df
params_df = pd.DataFrame(gwr_results.params, columns=['intercept', 'UE90', 'RD90', 'GI89'])
# Convert t-statistic to df
tvalues_df = pd.DataFrame(gwr_results.tvalues, columns=['intercept_tv', 'UE90_tv', 'RD90_tv', 'GI89_tv'])

from scipy.stats import t
# Approximate degrees of freedom
free = gwr_results.n - gwr_results.k
# Compute approximate two-tailed p-values
pvalues = 2 * (1 - t.cdf(abs(gwr_results.tvalues), df=free))
pvalues_df = pd.DataFrame(pvalues, columns=['intercept_p', 'UE90_p', 'RD90_p', 'GI89_p'])

# Concat
df_m = pd.concat([localR2_df, params_df, tvalues_df, pvalues_df], axis=1)
df_m

	localR2	intercept	UE90	RD90	GI89	intercept_tv	UE90_tv	RD90_tv	GI89_tv	intercept_p	UE90_p	RD90_p	GI89_p
0	0.633629	29.324285	-0.544060	8.825613	-54.691183	0.530777	-1.136838	1.206369	-0.379389	0.595612	0.255694	2.277677e-01	0.704425
1	0.867377	39.572338	0.683470	6.282482	-103.678889	0.523849	1.252162	0.661318	-0.556264	0.600421	0.210606	5.084581e-01	0.578071
2	0.830943	19.896255	0.661203	4.685874	-52.824834	0.328547	1.488115	0.596222	-0.352140	0.742521	0.136823	5.510709e-01	0.724757
3	0.883899	69.724504	0.603537	8.337601	-179.250594	0.836823	1.092445	0.775423	-0.867623	0.402757	0.274723	4.381492e-01	0.385668
4	0.796476	3.273553	0.695606	2.205631	-12.398164	0.059606	1.719729	0.308246	-0.090220	0.952473	0.085582	7.579163e-01	0.928118
...	...	...	...	...	...	...	...	...	...	...	...	...	...
3080	0.254210	5.676060	-0.319183	-1.082073	15.286796	0.083111	-0.382010	-0.140611	0.088373	0.933769	0.702480	8.881867e-01	0.929586
3081	0.561805	-30.728438	0.025356	-1.208809	101.497956	-0.709811	0.051656	-0.215739	0.934326	0.477875	0.958806	8.292057e-01	0.350209
3082	0.412670	-32.983666	-1.603241	-0.722064	138.316916	-1.562146	-2.494287	-0.334925	2.340559	0.118356	0.012673	7.377043e-01	0.019318
3083	0.915385	135.867091	5.082617	20.600332	-379.116343	4.149671	2.826873	5.045853	-4.715368	0.000034	0.004731	4.776815e-07	0.000003
3084	0.396840	-42.842168	-0.435088	-3.864959	127.061487	-1.266138	-0.884013	-0.856698	1.428765	0.205559	0.376758	3.916784e-01	0.153173

3085 rows × 13 columns

gdf_us_co = gdf_us_hse[['NAME', 'STATE_NAME', 'geometry']]
gdf_us_co

	NAME	STATE_NAME	geometry
0	Lake of the Woods	Minnesota	POLYGON ((49050.745 2838555.999, 49054.346 285...
1	Ferry	Washington	POLYGON ((-1704187.741 2978490.561, -1690089.6...
2	Stevens	Washington	POLYGON ((-1598348.829 2964085.947, -1605943.0...
3	Okanogan	Washington	POLYGON ((-1713271.344 2979541.451, -1712880.7...
4	Pend Oreille	Washington	POLYGON ((-1574802.183 3066600.159, -1545432.6...
...	...	...	...
3080	La Paz	Arizona	POLYGON ((-1708456.819 1273380.859, -1711763.0...
3081	Cibola	New Mexico	POLYGON ((-1016458.583 1340151.199, -1063696.6...
3082	York	Virginia	POLYGON ((1713621.948 1740205.362, 1713411.118...
3083	Prince William	Virginia	POLYGON ((1584066.804 1880564.938, 1562143.846...
3084	Gallatin	Montana	POLYGON ((-1209699.354 2515401.971, -1198622.3...

3085 rows × 3 columns

gdf_us_co_gwr = pd.concat([gdf_us_co, df_m], axis=1)

gdf_us_co_gwr

	NAME	STATE_NAME	geometry	localR2	intercept	UE90	RD90	GI89	intercept_tv	UE90_tv	RD90_tv	GI89_tv	intercept_p	UE90_p	RD90_p	GI89_p
0	Lake of the Woods	Minnesota	POLYGON ((49050.745 2838555.999, 49054.346 285...	0.633629	29.324285	-0.544060	8.825613	-54.691183	0.530777	-1.136838	1.206369	-0.379389	0.595612	0.255694	2.277677e-01	0.704425
1	Ferry	Washington	POLYGON ((-1704187.741 2978490.561, -1690089.6...	0.867377	39.572338	0.683470	6.282482	-103.678889	0.523849	1.252162	0.661318	-0.556264	0.600421	0.210606	5.084581e-01	0.578071
2	Stevens	Washington	POLYGON ((-1598348.829 2964085.947, -1605943.0...	0.830943	19.896255	0.661203	4.685874	-52.824834	0.328547	1.488115	0.596222	-0.352140	0.742521	0.136823	5.510709e-01	0.724757
3	Okanogan	Washington	POLYGON ((-1713271.344 2979541.451, -1712880.7...	0.883899	69.724504	0.603537	8.337601	-179.250594	0.836823	1.092445	0.775423	-0.867623	0.402757	0.274723	4.381492e-01	0.385668
4	Pend Oreille	Washington	POLYGON ((-1574802.183 3066600.159, -1545432.6...	0.796476	3.273553	0.695606	2.205631	-12.398164	0.059606	1.719729	0.308246	-0.090220	0.952473	0.085582	7.579163e-01	0.928118
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3080	La Paz	Arizona	POLYGON ((-1708456.819 1273380.859, -1711763.0...	0.254210	5.676060	-0.319183	-1.082073	15.286796	0.083111	-0.382010	-0.140611	0.088373	0.933769	0.702480	8.881867e-01	0.929586
3081	Cibola	New Mexico	POLYGON ((-1016458.583 1340151.199, -1063696.6...	0.561805	-30.728438	0.025356	-1.208809	101.497956	-0.709811	0.051656	-0.215739	0.934326	0.477875	0.958806	8.292057e-01	0.350209
3082	York	Virginia	POLYGON ((1713621.948 1740205.362, 1713411.118...	0.412670	-32.983666	-1.603241	-0.722064	138.316916	-1.562146	-2.494287	-0.334925	2.340559	0.118356	0.012673	7.377043e-01	0.019318
3083	Prince William	Virginia	POLYGON ((1584066.804 1880564.938, 1562143.846...	0.915385	135.867091	5.082617	20.600332	-379.116343	4.149671	2.826873	5.045853	-4.715368	0.000034	0.004731	4.776815e-07	0.000003
3084	Gallatin	Montana	POLYGON ((-1209699.354 2515401.971, -1198622.3...	0.396840	-42.842168	-0.435088	-3.864959	127.061487	-1.266138	-0.884013	-0.856698	1.428765	0.205559	0.376758	3.916784e-01	0.153173

3085 rows × 16 columns

fig, ax = plt.subplots(figsize=(10, 8))
gdf_us_co_gwr.plot(ax=ax, column='localR2', cmap='coolwarm',
                   edgecolor='black',linewidth=0.3, legend=True, vmin=-max(gdf_us_co_gwr.localR2), vmax=max(gdf_us_co_gwr.localR2),
                   legend_kwds={'shrink': 0.5, 'orientation': 'vertical', 'label': 'Local $R^2$'})
ax.set_title('Local $R^2$ of independent variables of GWR' )
ax.axis('off')
plt.show()

_images/045ca7a1f321ee74cb71a62afc92f4bc4a9d715e35149812afed7efa7514bd73.png

# plot the local coefficient
fig, ax = plt.subplots(figsize=(10, 8))

gdf_us_co_gwr[gdf_us_co_gwr['UE90_p'] < 0.05].plot(ax=ax, column='UE90',
                                                                    cmap='coolwarm', edgecolor='black',
                                                                    linewidth=0.3, legend=True,
                                                                    vmin=-max(gdf_us_co_gwr.UE90), vmax=max(gdf_us_co_gwr.UE90),
                                                                    legend_kwds={'shrink': 0.5, 'orientation': 'vertical', 'label': 'Local coefficient'})
# plot the not significant areas
gdf_us_co_gwr[gdf_us_co_gwr['UE90_p'] >= 0.05].plot(ax=ax, color='white', edgecolor='black', linewidth=0.3)
ax.set_title('Local coefficient of unemployment rate on homicide rate')
ax.axis('off')
plt.show()

_images/200403689d4a21a0d49f6e903cf5bc0489a3d1897df4e615a95480ddd6621d5c.png

5.6 Multiscale Geographically Weighted Regression (MGWR)#

MGWR (multiscale modelling) extends the GWR by allowing each predictor variable to operate at its own spatial scale (i.e., one \(bw_k\) per variable).

At spatial location \( i \), the MGWR model is expressed as:

\[ y_i = \beta_0(bw_0, u_i, v_i) + \sum_{k=1}^{p} \beta_k(bw_k, u_i, v_i) \cdot x_{ik} + \varepsilon_i \]

\( y_i \): Dependent variable at location \( i \);
\( x_{ik} \): \( k^{th} \) independent variable at location \( i \);
\( \beta_k(bw_k, u_i, v_i) \): Spatially varying coefficient for variable \( k \);
\( bw_k \): Bandwidth specific to variable \( k \);
\( (u_i, v_i) \): Coordinates of location \( i \);
\( \varepsilon_i \): Error term.

In MGWR, each coefficient is estimated independently using its own kernel scale for variable-specific spatial smoothing:

Each variable \( x_k \) has its own bandwidth \( bw_k \).
A separate weight matrix \( W_k(u_i, v_i) \) is used for estimating the coefficient \( \beta_k \) at location \( i \).

\[ \hat{\beta}_k(bw_k, u_i, v_i) = \left( X^T W_k(u_i, v_i) X \right)^{-1} X^T W_k(u_i, v_i) y \]

# Extract centroids and coordinates
centroids = gdf_columbus.geometry.centroid
coords = pd.DataFrame({
    'X': centroids.x,
    'Y': centroids.y}).values

y = gdf_columbus[['CRIME']].values          # Dependent variable
X = gdf_columbus[['INC', 'HOVAL']].values   # Independent variables are Income and House Value.

GWR model

# select bandwidth
bw_selector = Sel_BW(coords, y, X, kernel='gaussian')
bw = bw_selector.search(bw_min=2)
print(f"Optimal bandwidth: {bw}")

Optimal bandwidth: 8.0

# fit the gwr_model
gwr_model = GWR(coords, y, X, bw)
gwr_results = gwr_model.fit()

np.float = float # Monkey patch, as gwr.summary() relies on numpy version < 1.24
gwr_results.summary()

===========================================================================
Model type                                                         Gaussian
Number of observations:                                                  49
Number of covariates:                                                     3

Global Regression Results
---------------------------------------------------------------------------
Residual sum of squares:                                           6014.893
Log-likelihood:                                                    -187.377
AIC:                                                                380.754
AICc:                                                               383.664
BIC:                                                               5835.869
R2:                                                                   0.552
Adj. R2:                                                              0.533

Variable                              Est.         SE  t(Est/SE)    p-value
------------------------------- ---------- ---------- ---------- ----------
X0                                  68.619      4.735     14.490      0.000
X1                                  -1.597      0.334     -4.780      0.000
X2                                  -0.274      0.103     -2.654      0.008

Geographically Weighted Regression (GWR) Results
---------------------------------------------------------------------------
Spatial kernel:                                           Adaptive bisquare
Bandwidth used:                                                       8.000

Diagnostic information
---------------------------------------------------------------------------
Residual sum of squares:                                            438.121
Effective number of parameters (trace(S)):                           33.760
Degree of freedom (n - trace(S)):                                    15.240
Sigma estimate:                                                       5.362
Log-likelihood:                                                    -123.200
AIC:                                                                315.919
AICc:                                                               503.685
BIC:                                                                381.679
R2:                                                                   0.967
Adjusted R2:                                                          0.890
Adj. alpha (95%):                                                     0.004
Adj. critical t value (95%):                                          2.986

Summary Statistics For GWR Parameter Estimates
---------------------------------------------------------------------------
Variable                   Mean        STD        Min     Median        Max
-------------------- ---------- ---------- ---------- ---------- ----------
X0                       61.120     16.533     29.348     61.983     97.408
X1                       -1.277      1.946     -7.292     -1.039      4.073
X2                       -0.204      0.927     -3.809     -0.113      1.963
===========================================================================

MGWR model

# Bandwidth selection
selector = Sel_BW(coords, y, X, multi=True)
bw = selector.search(bw_min=[2])
print(bw)

[44. 48. 44.]

# Fit MGWR model
mgwr_model = MGWR(coords, y, X, selector=selector)
mgwr_results = mgwr_model.fit()
# Summary output
print(mgwr_results.summary())

===========================================================================
Model type                                                         Gaussian
Number of observations:                                                  49
Number of covariates:                                                     3

Global Regression Results
---------------------------------------------------------------------------
Residual sum of squares:                                           6014.893
Log-likelihood:                                                    -187.377
AIC:                                                                380.754
AICc:                                                               383.664
BIC:                                                               5835.869
R2:                                                                   0.552
Adj. R2:                                                              0.533

Variable                              Est.         SE  t(Est/SE)    p-value
------------------------------- ---------- ---------- ---------- ----------
X0                                  68.619      4.735     14.490      0.000
X1                                  -1.597      0.334     -4.780      0.000
X2                                  -0.274      0.103     -2.654      0.008

Multi-Scale Geographically Weighted Regression (MGWR) Results
---------------------------------------------------------------------------
Spatial kernel:                                           Adaptive bisquare
Criterion for optimal bandwidth:                                       AICc
Score of Change (SOC) type:                                     Smoothing f
Termination criterion for MGWR:                                       1e-05

MGWR bandwidths
---------------------------------------------------------------------------
Variable             Bandwidth      ENP_j   Adj t-val(95%)   Adj alpha(95%)
X0                      44.000      1.741            2.255            0.029
X1                      48.000      1.674            2.238            0.030
X2                      44.000      2.029            2.320            0.025

Diagnostic information
---------------------------------------------------------------------------
Residual sum of squares:                                           5115.764
Effective number of parameters (trace(S)):                            5.444
Degree of freedom (n - trace(S)):                                    43.556
Sigma estimate:                                                      10.838
Log-likelihood:                                                    -183.410
AIC:                                                                379.709
AICc:                                                               382.018
BIC:                                                                391.900
R2                                                                    0.619
Adjusted R2                                                           0.571

Summary Statistics For MGWR Parameter Estimates
---------------------------------------------------------------------------
Variable                   Mean        STD        Min     Median        Max
-------------------- ---------- ---------- ---------- ---------- ----------
X0                       73.026      0.761     71.723     72.982     74.566
X1                       -1.832      0.035     -1.928     -1.826     -1.781
X2                       -0.264      0.040     -0.329     -0.249     -0.216
===========================================================================

None

Quiz#

Q1: Identify all Lower Layer Super Output Areas (LSOAs) within Kingston upon Hull and East Riding of Yorkshire. (UK LSOAs data can be found in data/Lower_layer_Super_Output_Areas_(December_2021)_Boundaries_EW_BFC_(V10).geojson); Obtain crime data from the 2024 Humberside Police open dataset folder (data/police_humberside_2024). Use the 2024-08 all crime data type for analysis.
Q2: Retrieve Hull and east yorkshire data on the LSOA mean house value (data/Mean house prices by lower layer super output area- HPSSA dataset 47.csv) and LSOA population data/Lower layer Super Output Area population estimates 2019-2022.csv. Output some maps of these two variables: mean house price value (use column Year ending Dec 2022) and population (use column Total) .
Q3: Apply three different spatial modelling techniques using house prices and population as independent /explanatory variables. Compare the results from these models to evaluate their effectiveness. Try to map the residuals from three models.

Q1 solution

gdf_lsoa_uk = gpd.read_file('data/Lower_layer_Super_Output_Areas_(December_2021)_Boundaries_EW_BFC_(V10).geojson')
gdf_lsoa_uk.head()

	FID	LSOA21CD	LSOA21NM	BNG_E	BNG_N	LAT	LONG	Shape__Area	Shape__Length	GlobalID	geometry
0	1	E01000001	City of London 001A	532123	181632	51.51817	-0.097150	129865.314476	2635.767993	c625aea8-6d73-4b2a-be76-4d5c44cad9f8	POLYGON ((-0.09665 51.52028, -0.09663 51.52025...
1	2	E01000002	City of London 001B	532480	181715	51.51883	-0.091970	228419.782242	2707.816821	52c878e9-ac68-4886-b4a8-fea9cd241a70	POLYGON ((-0.08967 51.52069, -0.08971 51.52058...
2	3	E01000003	City of London 001C	532239	182033	51.52174	-0.095330	59054.204697	1224.573160	b9d8faca-d489-478d-8ce6-acaf76186d7d	POLYGON ((-0.0965 51.52295, -0.09644 51.52282,...
3	4	E01000005	City of London 001E	533581	181283	51.51469	-0.076280	189577.709503	2275.805344	15e1417d-537c-4845-9820-fc7596bd59b0	POLYGON ((-0.07568 51.51575, -0.07539 51.51555...
4	5	E01000006	Barking and Dagenham 016A	544994	184274	51.53875	0.089317	146536.995750	1966.092607	8a6c4ee0-c0ff-4736-9cfa-fb12a6d50da0	POLYGON ((0.09125 51.53905, 0.0915 51.5389, 0....

# select the LSOAs in Kingston upon Hull
gdf_lsoa_hu = gdf_lsoa_uk[ (gdf_lsoa_uk['LSOA21NM'].str.contains('Kingston upon Hull')) | (gdf_lsoa_uk['LSOA21NM'].str.contains('East Riding of Yorkshire'))]
# transfer the crs
gdf_lsoa_hu = gdf_lsoa_hu.to_crs(epsg=27700)
print(len(gdf_lsoa_hu))
gdf_lsoa_hu.plot()

<Axes: >

_images/80458af73d16091de300dfdbd85cc94f5a30e783ecbbf739de461725903c6674.png

gdf_lsoa_hu

	FID	LSOA21CD	LSOA21NM	LSOA21NMW	BNG_E	BNG_N	LAT	LONG	Shape__Area	Shape__Length	GlobalID	geometry
12115	12116	E01012756	Kingston upon Hull 025A		507367	430316	53.75817	-0.37294	198939.277603	3497.174098	b1f95abf-436c-4f5a-b6e1-93be9e0eb24d	POLYGON ((507433.012 430447.008, 507438.614 43...
12116	12117	E01012757	Kingston upon Hull 025B		508017	429519	53.75088	-0.36336	318088.364120	3716.172876	5f00b0d5-45c1-4529-9e08-881c4c58834d	POLYGON ((508039.325 430021.467, 508047.639 42...
12117	12118	E01012758	Kingston upon Hull 018A		507706	430320	53.75814	-0.36780	311920.300018	3772.447527	94cd85a6-b982-4038-9625-333de4b818df	POLYGON ((507791.196 430453.312, 507795.719 43...
12118	12119	E01012759	Kingston upon Hull 025C		507184	429662	53.75233	-0.37594	398184.936035	3984.903843	1516f69d-ec4e-4e2b-8f3d-81d5f680f631	POLYGON ((507088.105 429992.978, 507088.215 42...
12119	12120	E01012760	Kingston upon Hull 025D		507714	429795	53.75342	-0.36786	126002.444229	2081.849577	2e8144b1-63d2-4100-b2a5-a1fe09759513	POLYGON ((507913.848 429950.94, 507917.695 429...
...	...	...	...	...	...	...	...	...	...	...	...	...
33462	33463	E01035468	Kingston upon Hull 031I		507220	427868	53.73621	-0.37602	194811.873474	2342.011663	41c0a371-996d-4c62-8888-fb4830b71528	POLYGON ((507209.199 428143.672, 507211.664 42...
33463	33464	E01035469	Kingston upon Hull 034C		509268	435508	53.80442	-0.34228	602229.054047	4797.372241	ef8cf4ea-28e0-44ba-bdad-f74de2c628c2	POLYGON ((509524.731 435503.71, 509535.919 435...
33464	33465	E01035470	Kingston upon Hull 035A		508599	435590	53.80530	-0.35241	512661.485153	3739.146033	f97f0a63-17c4-461a-84e0-f69e41992515	POLYGON ((508962.524 435417.007, 508963.112 43...
33465	33466	E01035471	Kingston upon Hull 035B		508633	435072	53.80064	-0.35207	204771.100311	2848.973680	1dafa6ed-12b8-45cb-a67f-c559fdccfec6	POLYGON ((508775.831 435260.654, 508776.953 43...
33466	33467	E01035472	Kingston upon Hull 035C		508142	434887	53.79908	-0.35959	503061.344757	3672.250000	25c17302-b93b-4752-a042-f2ceadb41dd8	POLYGON ((507930.326 435252.723, 507986.946 43...

381 rows × 12 columns

df_crime_hu = pd.read_csv('data/police_humberside_2024/2024-08/2024-08-humberside-street.csv')
df_crime_hu

	Crime ID	Month	Reported by	Falls within	Longitude	Latitude	Location	LSOA code	LSOA name	Crime type	Last outcome category	Context
0	00c6cf00a107bbc4bfea4bda86cc111d8641d960029de2...	2024-08	Humberside Police	Humberside Police	-1.444477	53.554880	On or near Cherrys Road	E01007405	Barnsley 011D	Violence and sexual offences	Unable to prosecute suspect	NaN
1	e0b9bd3e5fd219191f6bca6cb730788db2d4a30b49725a...	2024-08	Humberside Police	Humberside Police	-1.771430	53.719914	On or near Birkhouse Road	E01010908	Calderdale 015D	Violence and sexual offences	Status update unavailable	NaN
2	5ced08aeab4910f5616a9a491eff404ea79387aad6d863...	2024-08	Humberside Police	Humberside Police	-1.570098	52.427211	On or near Mackenzie Close	E01009531	Coventry 010D	Violence and sexual offences	Action to be taken by another organisation	NaN
3	08a2e1a76109963fb73e9583b7954cdbfa7cf8136f6bc6...	2024-08	Humberside Police	Humberside Police	-0.956696	53.622353	On or near Maple Road	E01007635	Doncaster 001D	Violence and sexual offences	Awaiting court outcome	NaN
4	d76f6945e7b393056c7b1fda351a5d273be1fb7332a87a...	2024-08	Humberside Police	Humberside Police	-1.027047	53.603907	On or near Park/Open Space	E01007626	Doncaster 004B	Violence and sexual offences	Unable to prosecute suspect	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...
8824	18c37130d9b4c9d501b20c4ee7b263987d4efd9611d84f...	2024-08	Humberside Police	Humberside Police	-1.023274	53.711056	On or near Himsworth Close	E01027885	Selby 008C	Violence and sexual offences	Status update unavailable	NaN
8825	68d04b2c377d180022edfceb558762f546f5948b3d9550...	2024-08	Humberside Police	Humberside Police	-1.340161	53.342703	On or near West Street	E01008030	Sheffield 063A	Violence and sexual offences	Unable to prosecute suspect	NaN
8826	e24f85c02e60a9b330e0ee8bc2232154749b7629ed8898...	2024-08	Humberside Police	Humberside Police	-1.352667	54.637803	On or near Spring Bank Wood	E01033476	Stockton-on-Tees 005F	Violence and sexual offences	Unable to prosecute suspect	NaN
8827	44a158b876156e38830ef93eba54787459556df85a62ef...	2024-08	Humberside Police	Humberside Police	-0.222929	53.512611	On or near Grimsby Road	E01026367	West Lindsey 001A	Violence and sexual offences	Unable to prosecute suspect	NaN
8828	e0aae955f4c4be84cd9d62300261701efd99117bd71636...	2024-08	Humberside Police	Humberside Police	-0.491256	53.446820	On or near Green Lane	E01026411	West Lindsey 005D	Public order	Unable to prosecute suspect	NaN

8829 rows × 12 columns

# create the gdf from the df_crime_hu_2024 with x and y coordinates
gdf_crime_hu = gpd.GeoDataFrame(df_crime_hu, geometry=gpd.points_from_xy(df_crime_hu['Longitude'],
                                                                              df_crime_hu['Latitude']), crs='EPSG:4326')
# transform the CRS to EPSG:27700 (British National Grid)
gdf_crime_hu = gdf_crime_hu.to_crs(epsg=27700)
gdf_crime_hu

	Crime ID	Month	Reported by	Falls within	Longitude	Latitude	Location	LSOA code	LSOA name	Crime type	Last outcome category	Context	geometry
0	00c6cf00a107bbc4bfea4bda86cc111d8641d960029de2...	2024-08	Humberside Police	Humberside Police	-1.444477	53.554880	On or near Cherrys Road	E01007405	Barnsley 011D	Violence and sexual offences	Unable to prosecute suspect	NaN	POINT (436900.168 406613.966)
1	e0b9bd3e5fd219191f6bca6cb730788db2d4a30b49725a...	2024-08	Humberside Police	Humberside Police	-1.771430	53.719914	On or near Birkhouse Road	E01010908	Calderdale 015D	Violence and sexual offences	Status update unavailable	NaN	POINT (415181.204 424856.32)
2	5ced08aeab4910f5616a9a491eff404ea79387aad6d863...	2024-08	Humberside Police	Humberside Police	-1.570098	52.427211	On or near Mackenzie Close	E01009531	Coventry 010D	Violence and sexual offences	Action to be taken by another organisation	NaN	POINT (429328.382 281110.253)
3	08a2e1a76109963fb73e9583b7954cdbfa7cf8136f6bc6...	2024-08	Humberside Police	Humberside Police	-0.956696	53.622353	On or near Maple Road	E01007635	Doncaster 001D	Violence and sexual offences	Awaiting court outcome	NaN	POINT (469104.345 414482.775)
4	d76f6945e7b393056c7b1fda351a5d273be1fb7332a87a...	2024-08	Humberside Police	Humberside Police	-1.027047	53.603907	On or near Park/Open Space	E01007626	Doncaster 004B	Violence and sexual offences	Unable to prosecute suspect	NaN	POINT (464479.321 412364.712)
...	...	...	...	...	...	...	...	...	...	...	...	...	...
8824	18c37130d9b4c9d501b20c4ee7b263987d4efd9611d84f...	2024-08	Humberside Police	Humberside Police	-1.023274	53.711056	On or near Himsworth Close	E01027885	Selby 008C	Violence and sexual offences	Status update unavailable	NaN	POINT (464565.252 424288.812)
8825	68d04b2c377d180022edfceb558762f546f5948b3d9550...	2024-08	Humberside Police	Humberside Police	-1.340161	53.342703	On or near West Street	E01008030	Sheffield 063A	Violence and sexual offences	Unable to prosecute suspect	NaN	POINT (444029.051 383067.892)
8826	e24f85c02e60a9b330e0ee8bc2232154749b7629ed8898...	2024-08	Humberside Police	Humberside Police	-1.352667	54.637803	On or near Spring Bank Wood	E01033476	Stockton-on-Tees 005F	Violence and sexual offences	Unable to prosecute suspect	NaN	POINT (441880.904 527155.758)
8827	44a158b876156e38830ef93eba54787459556df85a62ef...	2024-08	Humberside Police	Humberside Police	-0.222929	53.512611	On or near Grimsby Road	E01026367	West Lindsey 001A	Violence and sexual offences	Unable to prosecute suspect	NaN	POINT (517938.386 403236.345)
8828	e0aae955f4c4be84cd9d62300261701efd99117bd71636...	2024-08	Humberside Police	Humberside Police	-0.491256	53.446820	On or near Green Lane	E01026411	West Lindsey 005D	Public order	Unable to prosecute suspect	NaN	POINT (500301.241 395507.5)

8829 rows × 13 columns

# spatial join the gdf_crime_hu and gdf_lsoa_hull to get the LSOA index for each crime incident (we don't use the 'LSOA codes' in the df_crime_hu_2024)
gdf_crime_hu_lsoa = gpd.sjoin(gdf_crime_hu, gdf_lsoa_hu, how='inner', predicate='within')
gdf_crime_hu_lsoa

	Crime ID	Month	Reported by	Falls within	Longitude	Latitude	Location	LSOA code	LSOA name	Crime type	...	LSOA21CD	LSOA21NM	LSOA21NMW	BNG_E	BNG_N	LAT	LONG	Shape__Area	Shape__Length	GlobalID
16	NaN	2024-08	Humberside Police	Humberside Police	-0.205009	54.101015	On or near Pinfold Grove	E01012934	East Riding of Yorkshire 001B	Anti-social behaviour	...	E01012934	East Riding of Yorkshire 001B		513910	470010	54.11338	-0.25897	2.907797e+07	27656.377873	0ceeaede-a8d0-4686-8d4f-0091b1f243dc
17	NaN	2024-08	Humberside Police	Humberside Police	-0.208800	54.096802	On or near The Hollows	E01012934	East Riding of Yorkshire 001B	Anti-social behaviour	...	E01012934	East Riding of Yorkshire 001B		513910	470010	54.11338	-0.25897	2.907797e+07	27656.377873	0ceeaede-a8d0-4686-8d4f-0091b1f243dc
18	NaN	2024-08	Humberside Police	Humberside Police	-0.269446	54.124844	On or near Mere Lane	E01012934	East Riding of Yorkshire 001B	Anti-social behaviour	...	E01012934	East Riding of Yorkshire 001B		513910	470010	54.11338	-0.25897	2.907797e+07	27656.377873	0ceeaede-a8d0-4686-8d4f-0091b1f243dc
19	NaN	2024-08	Humberside Police	Humberside Police	-0.211149	54.096594	On or near New Pasture Close	E01012934	East Riding of Yorkshire 001B	Anti-social behaviour	...	E01012934	East Riding of Yorkshire 001B		513910	470010	54.11338	-0.25897	2.907797e+07	27656.377873	0ceeaede-a8d0-4686-8d4f-0091b1f243dc
20	NaN	2024-08	Humberside Police	Humberside Police	-0.210440	54.098129	On or near Ellerburn Drive	E01012934	East Riding of Yorkshire 001B	Anti-social behaviour	...	E01012934	East Riding of Yorkshire 001B		513910	470010	54.11338	-0.25897	2.907797e+07	27656.377873	0ceeaede-a8d0-4686-8d4f-0091b1f243dc
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5508	4e0999e7dee15e9aaf8a6bfa21bce92b1de5725b1b506e...	2024-08	Humberside Police	Humberside Police	-0.348154	53.798576	On or near Halecroft Park	E01033107	Kingston upon Hull 035E	Other theft	...	E01033107	Kingston upon Hull 035E		508618	434547	53.79592	-0.35249	3.921917e+05	3331.231133	f698240e-05d9-4a6a-8c12-ba6f537ddef1
5509	5a22e7a96f1e5df4bb33513cc926d6ac49bc2f34eaea74...	2024-08	Humberside Police	Humberside Police	-0.350806	53.799107	On or near Shinewater Park	E01033107	Kingston upon Hull 035E	Public order	...	E01033107	Kingston upon Hull 035E		508618	434547	53.79592	-0.35249	3.921917e+05	3331.231133	f698240e-05d9-4a6a-8c12-ba6f537ddef1
5510	9dab6d637c67dca1f82da99de1bc8853a44a82fd5edd97...	2024-08	Humberside Police	Humberside Police	-0.350433	53.795452	On or near Runnymede Way	E01033107	Kingston upon Hull 035E	Shoplifting	...	E01033107	Kingston upon Hull 035E		508618	434547	53.79592	-0.35249	3.921917e+05	3331.231133	f698240e-05d9-4a6a-8c12-ba6f537ddef1
5511	1c068c671bf0613fe0e9264e872d8abfa43e2253bbcb75...	2024-08	Humberside Police	Humberside Police	-0.350433	53.795452	On or near Runnymede Way	E01033107	Kingston upon Hull 035E	Shoplifting	...	E01033107	Kingston upon Hull 035E		508618	434547	53.79592	-0.35249	3.921917e+05	3331.231133	f698240e-05d9-4a6a-8c12-ba6f537ddef1
5512	cd82f19c01afc3adc33c8567fa1f0a9bdfe4dafccd24b0...	2024-08	Humberside Police	Humberside Police	-0.350443	53.799453	On or near Gilderidge Park	E01033107	Kingston upon Hull 035E	Violence and sexual offences	...	E01033107	Kingston upon Hull 035E		508618	434547	53.79592	-0.35249	3.921917e+05	3331.231133	f698240e-05d9-4a6a-8c12-ba6f537ddef1

5496 rows × 25 columns

gdf_crime_hu_lsoa.plot()

<Axes: >

_images/e2536c54493d459e2a5c11b598122abe13aff8f4d40071199d94e842dede007c.png

df_crime_hu_agg = gdf_crime_hu_lsoa.groupby(['LSOA21CD', 'Month']).agg({'Crime ID': 'count'}).reset_index().rename(columns={'Crime ID': 'Numbers'})
df_crime_hu_agg

	LSOA21CD	Month	Numbers
0	E01012756	2024-08	18
1	E01012757	2024-08	22
2	E01012758	2024-08	16
3	E01012759	2024-08	14
4	E01012760	2024-08	1
...	...	...	...
372	E01035467	2024-08	18
373	E01035468	2024-08	7
374	E01035469	2024-08	11
375	E01035470	2024-08	9
376	E01035471	2024-08	10

377 rows × 3 columns

# set the geometry to the df_crime_hull_agg
gdf_crime_hu_agg = pd.merge(gdf_lsoa_hu, df_crime_hu_agg, on='LSOA21CD', how='left')

fig, ax = plt.subplots(figsize=(8, 8))
# plot the crime data
gdf_crime_hu_agg.plot(ax=ax, column='Numbers', cmap='RdPu', edgecolor='black', linewidth=0.05, legend=True,
                      legend_kwds={'shrink': 0.5})
ax.set_title('Crime Numbers in East yorkshire and Hull LSOAs (Aug 2024)')
ax.axis('off')
plt.show()

_images/08893c9fc47aecbf419ab647d280936780bc40d6f4bc9ee737d49f291f0e9781.png

Q2 solution

# read the csv file
df_house_price = pd.read_csv('data/Mean house prices by lower layer super output area- HPSSA dataset 47.csv')

df_house_price_hu = df_house_price[(df_house_price['Local authority name'].str.contains('Hull')) | (df_house_price['Local authority name'].str.contains('East Riding of Yorkshire')) ]
df_house_price_hu

	Local authority code	Local authority name	LSOA code	LSOA name	Year ending Dec 1995	Year ending Mar 1996	Year ending Jun 1996	Year ending Sep 1996	Year ending Dec 1996	Year ending Mar 1997	...	Year ending Sep 2020	Year ending Dec 2020	Year ending Mar 2021	Year ending Jun 2021	Year ending Sep 2021	Year ending Dec 2021	Year ending Mar 2022	Year ending Jun 2022	Year ending Sep 2022	Year ending Dec 2022
808	E06000010	Kingston upon Hull, City of	E01012756	Kingston upon Hull 025A	40,920	41,652	40,819	42,748	39,059	39,567	...	135,647	151,126	159,225	152,954	146,913	148,260	143,033	141,745	148,008	132,769
809	E06000010	Kingston upon Hull, City of	E01012757	Kingston upon Hull 025B	31,376	29,795	30,367	30,543	30,956	32,781	...	132,178	123,194	127,827	135,008	141,380	148,043	135,320	139,061	130,330	137,602
810	E06000010	Kingston upon Hull, City of	E01012758	Kingston upon Hull 018A	49,324	46,195	43,328	51,442	60,738	64,053	...	224,715	238,168	230,783	239,213	249,940	238,605	246,214	240,353	252,263	275,177
811	E06000010	Kingston upon Hull, City of	E01012759	Kingston upon Hull 025C	29,197	32,087	32,336	30,748	30,246	27,783	...	110,625	114,500	106,262	104,782	106,537	115,615	104,722	106,965	108,325	104,977
812	E06000010	Kingston upon Hull, City of	E01012760	Kingston upon Hull 025D	28,461	28,470	27,965	28,415	28,411	28,692	...	94,397	93,843	92,871	95,963	98,633	102,472	108,855	114,807	121,153	124,452
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1179	E06000011	East Riding of Yorkshire	E01013126	East Riding of Yorkshire 009E	70,635	64,828	60,053	66,088	72,992	78,539	...	324,286	332,693	336,088	345,774	349,465	378,558	389,615	411,366	463,313	455,630
1180	E06000011	East Riding of Yorkshire	E01013127	East Riding of Yorkshire 018E	100,765	98,318	96,342	97,912	89,962	94,182	...	436,643	480,611	507,417	509,694	477,282	438,554	421,262	416,400	503,722	498,563
1181	E06000011	East Riding of Yorkshire	E01032596	East Riding of Yorkshire 043F	41,966	41,335	40,676	41,430	41,593	42,379	...	134,154	127,511	124,427	135,288	139,925	142,370	146,717	150,261	146,702	144,453
1182	E06000011	East Riding of Yorkshire	E01033495	East Riding of Yorkshire 032G	:	:	:	:	:	:	...	152,262	204,612	214,227	227,404	222,521	209,466	203,156	183,714	194,962	195,376
1183	E06000011	East Riding of Yorkshire	E01033496	East Riding of Yorkshire 032H	60,974	62,406	65,388	62,199	58,120	56,913	...	259,775	271,633	300,098	297,239	278,697	276,424	266,227	254,792	285,688	300,487

376 rows × 113 columns

gdf_crime_agg_hp = pd.merge(gdf_crime_hu_agg, df_house_price_hu[['LSOA code', 'Year ending Dec 2022']], left_on='LSOA21CD', right_on='LSOA code', how='left')

gdf_crime_agg_hp['Year ending Dec 2022'].values[0]

'132,769'

# we need to do data cleaning for the column 'Year ending Dec 2022'
# the column 'Year ending Dec 2022' is a string, we need to convert it to float
gdf_crime_agg_hp['Year ending Dec 2022'] = gdf_crime_agg_hp['Year ending Dec 2022'].fillna('')
gdf_crime_agg_hp['Year ending Dec 2022'] =  gdf_crime_agg_hp['Year ending Dec 2022'].replace({':': '', ',': ''}, regex=True)
gdf_crime_agg_hp['Year ending Dec 2022'] = gdf_crime_agg_hp['Year ending Dec 2022'].replace('', 0)
gdf_crime_agg_hp['Year ending Dec 2022'] = gdf_crime_agg_hp['Year ending Dec 2022'].astype(float)

gdf_crime_agg_hp = gdf_crime_agg_hp.rename(columns={'Year ending Dec 2022': 'House Price'})

gdf_crime_agg_hp.head()

	FID	LSOA21CD	LSOA21NM	BNG_E	BNG_N	LAT	LONG	Shape__Area	Shape__Length	GlobalID	geometry	Month	Numbers	LSOA code	House Price
0	12116	E01012756	Kingston upon Hull 025A	507367	430316	53.75817	-0.37294	198939.277603	3497.174098	b1f95abf-436c-4f5a-b6e1-93be9e0eb24d	POLYGON ((507433.012 430447.008, 507438.614 43...	2024-08	18.0	E01012756	132769.0
1	12117	E01012757	Kingston upon Hull 025B	508017	429519	53.75088	-0.36336	318088.364120	3716.172876	5f00b0d5-45c1-4529-9e08-881c4c58834d	POLYGON ((508039.325 430021.467, 508047.639 42...	2024-08	22.0	E01012757	137602.0
2	12118	E01012758	Kingston upon Hull 018A	507706	430320	53.75814	-0.36780	311920.300018	3772.447527	94cd85a6-b982-4038-9625-333de4b818df	POLYGON ((507791.196 430453.312, 507795.719 43...	2024-08	16.0	E01012758	275177.0
3	12119	E01012759	Kingston upon Hull 025C	507184	429662	53.75233	-0.37594	398184.936035	3984.903843	1516f69d-ec4e-4e2b-8f3d-81d5f680f631	POLYGON ((507088.105 429992.978, 507088.215 42...	2024-08	14.0	E01012759	104977.0
4	12120	E01012760	Kingston upon Hull 025D	507714	429795	53.75342	-0.36786	126002.444229	2081.849577	2e8144b1-63d2-4100-b2a5-a1fe09759513	POLYGON ((507913.848 429950.94, 507917.695 429...	2024-08	1.0	E01012760	124452.0

fig, ax = plt.subplots(figsize=(8, 8))
gdf_crime_agg_hp.plot(ax=ax, column='House Price', cmap='Blues', edgecolor='black', linewidth=0.05, legend=True,
                      legend_kwds={'shrink': 0.5})
ax.set_title('House price in East yorkshire and Hull LSOAs')
ax.axis('off')
plt.show()

_images/d2a6b0fb189b565f8546dff3f052cf0a29f07c8af1e7ea21f74a246a9645d8f6.png

# read the population data
df_population = pd.read_csv('data/Lower layer Super Output Area population estimates 2019-2022.csv')

df_population.head()

	LAD 2021 Code	LAD 2021 Name	LSOA 2021 Code	LSOA 2021 Name	Total	F0	F1	F2	F3	F4	...	M81	M82	M83	M84	M85	M86	M87	M88	M89	M90
0	E06000001	Hartlepool	E01011949	Hartlepool 009A	1,870	15	3	11	13	6	...	6	6	3	3	3	3	2	1	3	1
1	E06000001	Hartlepool	E01011950	Hartlepool 008A	1,097	6	5	8	8	5	...	1	1	1	2	0	0	0	0	0	0
2	E06000001	Hartlepool	E01011951	Hartlepool 007A	1,241	8	7	5	8	3	...	1	1	3	1	2	2	1	0	1	0
3	E06000001	Hartlepool	E01011952	Hartlepool 002A	1,615	13	11	17	15	15	...	0	3	1	2	3	3	4	3	0	13
4	E06000001	Hartlepool	E01011953	Hartlepool 002B	1,982	9	12	18	11	13	...	3	4	3	2	0	1	0	0	1	4

5 rows × 187 columns

df_population_hu = df_population[ (df_population['LAD 2021 Name'].str.contains('Kingston upon Hull')) | df_population['LAD 2021 Name'].str.contains('East Riding of Yorkshire')]
df_population_hu.head()

	LAD 2021 Code	LAD 2021 Name	LSOA 2021 Code	LSOA 2021 Name	Total	F0	F1	F2	F3	F4	...	M81	M82	M83	M84	M85	M86	M87	M88	M89	M90
822	E06000010	Kingston upon Hull	E01012756	Kingston upon Hull 025A	1,477	7	2	4	9	6	...	5	4	7	5	0	0	0	0	0	0
823	E06000010	Kingston upon Hull	E01012757	Kingston upon Hull 025B	1,388	4	8	3	6	3	...	2	1	2	1	2	2	2	4	1	1
824	E06000010	Kingston upon Hull	E01012758	Kingston upon Hull 018A	1,538	3	5	5	5	4	...	1	0	2	1	0	2	1	0	0	0
825	E06000010	Kingston upon Hull	E01012759	Kingston upon Hull 025C	1,670	21	8	18	7	16	...	1	0	1	3	0	1	1	1	0	2
826	E06000010	Kingston upon Hull	E01012760	Kingston upon Hull 025D	1,435	10	3	3	10	4	...	2	1	3	1	0	0	1	0	0	0

5 rows × 187 columns

df_population_hu = df_population_hu[['LSOA 2021 Code', 'Total']]

df_population_hu

	LSOA 2021 Code	Total
822	E01012756	1,477
823	E01012757	1,388
824	E01012758	1,538
825	E01012759	1,670
826	E01012760	1,435
...	...	...
1198	E01034756	2,488
1199	E01034757	1,209
1200	E01034758	1,758
1201	E01034759	1,681
1202	E01034760	1,469

381 rows × 2 columns

# merge the population data with the gdf_crime_hull_agg_hp
gdf_crime_agg_hp_pop = pd.merge(gdf_crime_agg_hp, df_population_hu, left_on='LSOA21CD', right_on='LSOA 2021 Code', how='left')

gdf_crime_agg_hp_pop.head()

	FID	LSOA21CD	LSOA21NM	BNG_E	BNG_N	LAT	LONG	Shape__Area	Shape__Length	GlobalID	geometry	Month	Numbers	LSOA code	House Price	LSOA 2021 Code	Total
0	12116	E01012756	Kingston upon Hull 025A	507367	430316	53.75817	-0.37294	198939.277603	3497.174098	b1f95abf-436c-4f5a-b6e1-93be9e0eb24d	POLYGON ((507433.012 430447.008, 507438.614 43...	2024-08	18.0	E01012756	132769.0	E01012756	1,477
1	12117	E01012757	Kingston upon Hull 025B	508017	429519	53.75088	-0.36336	318088.364120	3716.172876	5f00b0d5-45c1-4529-9e08-881c4c58834d	POLYGON ((508039.325 430021.467, 508047.639 42...	2024-08	22.0	E01012757	137602.0	E01012757	1,388
2	12118	E01012758	Kingston upon Hull 018A	507706	430320	53.75814	-0.36780	311920.300018	3772.447527	94cd85a6-b982-4038-9625-333de4b818df	POLYGON ((507791.196 430453.312, 507795.719 43...	2024-08	16.0	E01012758	275177.0	E01012758	1,538
3	12119	E01012759	Kingston upon Hull 025C	507184	429662	53.75233	-0.37594	398184.936035	3984.903843	1516f69d-ec4e-4e2b-8f3d-81d5f680f631	POLYGON ((507088.105 429992.978, 507088.215 42...	2024-08	14.0	E01012759	104977.0	E01012759	1,670
4	12120	E01012760	Kingston upon Hull 025D	507714	429795	53.75342	-0.36786	126002.444229	2081.849577	2e8144b1-63d2-4100-b2a5-a1fe09759513	POLYGON ((507913.848 429950.94, 507917.695 429...	2024-08	1.0	E01012760	124452.0	E01012760	1,435

gdf_crime_agg_hp_pop.Total.values

array(['1,477', '1,388', '1,538', '1,670', '1,435', '1,769', '1,339',
       '1,706', '1,179', '1,391', '1,376', '1,559', '1,314', '1,325',
       '1,594', '1,580', '1,545', '1,534', '1,497', '1,627', '1,403',
       '1,557', '1,381', '1,328', '1,800', '1,280', '1,451', '1,294',
       '2,009', '1,323', '1,260', '1,456', '1,437', '1,290', '1,340',
       '1,633', '1,526', '1,428', '1,404', '1,993', '1,204', '1,312',
       '1,671', '1,225', '1,381', '1,430', '1,342', '1,368', '1,660',
       '1,365', '1,448', '1,378', '1,698', '1,758', '1,837', '1,547',
       '1,320', '1,611', '1,487', '1,508', '1,461', '1,424', '1,214',
       '1,377', '1,316', '1,913', '1,585', '2,005', '1,452', '1,614',
       '1,472', '1,791', '1,391', '1,836', '1,597', '1,739', '1,041',
       '1,445', '1,605', '1,672', '1,338', '1,555', '1,582', '1,435',
       '1,691', '1,286', '1,505', '1,346', '1,264', '1,707', '1,532',
       '1,714', '1,675', '2,175', '2,145', '1,726', '1,825', '2,256',
       '1,650', '2,404', '1,383', '2,441', '1,766', '1,311', '1,754',
       '1,745', '1,876', '1,649', '1,807', '1,673', '1,599', '1,491',
       '1,515', '1,703', '1,670', '1,570', '1,851', '1,410', '1,776',
       '1,509', '1,545', '1,462', '1,899', '1,581', '1,529', '1,607',
       '1,645', '2,420', '2,034', '1,659', '2,193', '1,903', '1,673',
       '1,465', '1,590', '1,448', '1,527', '1,644', '1,215', '1,320',
       '1,576', '1,448', '1,037', '1,176', '1,205', '1,136', '1,629',
       '1,993', '3,612', '1,553', '1,905', '1,900', '1,695', '1,437',
       '1,165', '1,496', '2,176', '1,671', '1,345', '2,324', '1,474',
       '1,701', '1,869', '1,372', '1,526', '1,679', '1,015', '1,818',
       '1,340', '1,166', '2,101', '1,896', '1,544', '1,303', '1,212',
       '1,703', '1,402', '1,491', '1,565', '1,378', '1,394', '1,341',
       '1,317', '1,190', '1,512', '1,379', '1,624', '1,619', '1,192',
       '1,380', '1,039', '1,449', '1,628', '2,301', '1,266', '1,497',
       '1,730', '1,245', '1,597', '1,631', '1,096', '1,810', '1,583',
       '1,544', '1,628', '1,358', '1,753', '1,710', '1,680', '1,325',
       '1,537', '1,784', '1,910', '1,722', '1,540', '1,698', '1,577',
       '1,830', '1,411', '1,589', '1,407', '1,421', '1,846', '1,958',
       '1,711', '1,905', '2,246', '1,732', '2,210', '1,891', '1,760',
       '1,521', '1,837', '2,150', '1,488', '1,491', '1,366', '1,599',
       '1,419', '1,597', '1,808', '1,409', '1,367', '1,747', '1,811',
       '2,101', '1,954', '1,869', '1,896', '1,431', '1,982', '1,868',
       '1,917', '1,308', '1,663', '1,758', '1,540', '1,460', '1,364',
       '1,401', '1,434', '1,562', '1,800', '1,463', '1,447', '1,382',
       '1,471', '2,208', '2,351', '1,581', '1,549', '1,362', '1,995',
       '1,533', '1,582', '1,561', '2,328', '1,652', '1,742', '1,698',
       '1,915', '1,889', '1,898', '1,590', '1,672', '1,489', '1,652',
       '2,448', '1,568', '1,626', '1,442', '1,337', '1,593', '1,539',
       '1,610', '2,321', '1,578', '1,454', '1,558', '1,767', '1,827',
       '1,589', '1,377', '1,518', '2,097', '1,440', '1,540', '1,395',
       '1,503', '1,648', '1,394', '1,295', '1,737', '1,405', '1,357',
       '2,029', '1,707', '1,420', '1,518', '2,662', '1,613', '1,387',
       '1,509', '1,778', '1,273', '2,217', '1,406', '1,631', '1,596',
       '1,408', '1,371', '1,096', '1,417', '1,171', '2,171', '1,762',
       '1,531', '1,366', '1,438', '1,524', '1,536', '1,421', '1,854',
       '1,459', '1,463', '1,475', '2,052', '2,311', '1,488', '2,153',
       '2,231', '1,635', '2,283', '1,635', '1,857', '1,398', '1,533',
       '1,511', '1,313', '1,430', '1,475', '1,587', '1,488', '1,164',
       '1,239', '1,800', '2,488', '1,209', '1,758', '1,681', '1,469',
       '2,397', '2,326', '1,240', '1,342', '1,782', '1,192', '2,406',
       '2,819', '1,291', '1,768'], dtype=object)

# we need to rename the column 'Total' to 'Population' and do data cleaning
gdf_crime_agg_hp_pop['Total'] = gdf_crime_agg_hp_pop['Total'].replace({ ',': ''}, regex=True).astype(float)
gdf_crime_agg_hp_pop = gdf_crime_agg_hp_pop.rename(columns={'Total': 'Population'})
gdf_crime_agg_hp_pop

	FID	LSOA21CD	LSOA21NM	LSOA21NMW	BNG_E	BNG_N	LAT	LONG	Shape__Area	Shape__Length	GlobalID	geometry	Month	Numbers	LSOA code	House Price	LSOA 2021 Code	Population
0	12116	E01012756	Kingston upon Hull 025A		507367	430316	53.75817	-0.37294	198939.277603	3497.174098	b1f95abf-436c-4f5a-b6e1-93be9e0eb24d	POLYGON ((507433.012 430447.008, 507438.614 43...	2024-08	18.0	E01012756	132769.0	E01012756	1477.0
1	12117	E01012757	Kingston upon Hull 025B		508017	429519	53.75088	-0.36336	318088.364120	3716.172876	5f00b0d5-45c1-4529-9e08-881c4c58834d	POLYGON ((508039.325 430021.467, 508047.639 42...	2024-08	22.0	E01012757	137602.0	E01012757	1388.0
2	12118	E01012758	Kingston upon Hull 018A		507706	430320	53.75814	-0.36780	311920.300018	3772.447527	94cd85a6-b982-4038-9625-333de4b818df	POLYGON ((507791.196 430453.312, 507795.719 43...	2024-08	16.0	E01012758	275177.0	E01012758	1538.0
3	12119	E01012759	Kingston upon Hull 025C		507184	429662	53.75233	-0.37594	398184.936035	3984.903843	1516f69d-ec4e-4e2b-8f3d-81d5f680f631	POLYGON ((507088.105 429992.978, 507088.215 42...	2024-08	14.0	E01012759	104977.0	E01012759	1670.0
4	12120	E01012760	Kingston upon Hull 025D		507714	429795	53.75342	-0.36786	126002.444229	2081.849577	2e8144b1-63d2-4100-b2a5-a1fe09759513	POLYGON ((507913.848 429950.94, 507917.695 429...	2024-08	1.0	E01012760	124452.0	E01012760	1435.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
376	33463	E01035468	Kingston upon Hull 031I		507220	427868	53.73621	-0.37602	194811.873474	2342.011663	41c0a371-996d-4c62-8888-fb4830b71528	POLYGON ((507209.199 428143.672, 507211.664 42...	2024-08	7.0	NaN	0.0	E01035468	1192.0
377	33464	E01035469	Kingston upon Hull 034C		509268	435508	53.80442	-0.34228	602229.054047	4797.372241	ef8cf4ea-28e0-44ba-bdad-f74de2c628c2	POLYGON ((509524.731 435503.71, 509535.919 435...	2024-08	11.0	NaN	0.0	E01035469	2406.0
378	33465	E01035470	Kingston upon Hull 035A		508599	435590	53.80530	-0.35241	512661.485153	3739.146033	f97f0a63-17c4-461a-84e0-f69e41992515	POLYGON ((508962.524 435417.007, 508963.112 43...	2024-08	9.0	NaN	0.0	E01035470	2819.0
379	33466	E01035471	Kingston upon Hull 035B		508633	435072	53.80064	-0.35207	204771.100311	2848.973680	1dafa6ed-12b8-45cb-a67f-c559fdccfec6	POLYGON ((508775.831 435260.654, 508776.953 43...	2024-08	10.0	NaN	0.0	E01035471	1291.0
380	33467	E01035472	Kingston upon Hull 035C		508142	434887	53.79908	-0.35959	503061.344757	3672.250000	25c17302-b93b-4752-a042-f2ceadb41dd8	POLYGON ((507930.326 435252.723, 507986.946 43...	NaN	NaN	NaN	0.0	E01035472	1768.0

381 rows × 18 columns

gdf_crime_agg_hp_pop['Numbers'] = gdf_crime_agg_hp_pop['Numbers'].fillna(0)
gdf_crime_agg_hp_pop['House Price'] = gdf_crime_agg_hp_pop['House Price'].fillna(0)
gdf_crime_agg_hp_pop['Population'] = gdf_crime_agg_hp_pop['Population'].fillna(0)

Q3 solution

import libpysal.weights as weights
# use the SlM, SEM and SDM to do the spatial regression
# set the y and X variables
y = gdf_crime_agg_hp_pop[['Numbers']].values
X = gdf_crime_agg_hp_pop[['House Price', 'Population']].values
# set the spatial weights
w_lsoa  = weights.Queen.from_dataframe(gdf_crime_agg_hp_pop, use_index='True')
w = w_lsoa
w.transform = 'R'  # Row-standardized

# Fit spatial lag model
model_sl = ML_Lag(y, X, w=w, name_y='Numbers', name_x=['House Price', 'Population'], name_w='Queen', name_ds='East yorkshire and Hull')
# Output summary
print(model_sl.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL)
-----------------------------------------------------------------
Data set            :East yorkshire and Hull
Weights matrix      :       Queen
Dependent Variable  :     Numbers                Number of Observations:         381
Mean dependent var  :     13.1417                Number of Variables   :           4
S.D. dependent var  :     17.8143                Degrees of Freedom    :         377
Pseudo R-squared    :      0.2035
Spatial Pseudo R-squared:  0.1112
Log likelihood      :  -1601.4457
Sigma-square ML     :    253.8232                Akaike info criterion :    3210.891
S.E of regression   :     15.9318                Schwarz criterion     :    3226.663

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT         7.88095         4.64113         1.69807         0.08950
         House Price        -0.00003         0.00001        -3.93716         0.00008
          Population         0.00368         0.00262         1.40820         0.15907
           W_Numbers         0.39369         0.06406         6.14562         0.00000
------------------------------------------------------------------------------------

SPATIAL LAG MODEL IMPACTS
Impacts computed using the 'simple' method.
            Variable         Direct        Indirect          Total
         House Price        -0.0000         -0.0000         -0.0001
          Population         0.0037          0.0024          0.0061
================================ END OF REPORT =====================================

# Fit spatial error model
model_sem = ML_Error(y, X, w=w, name_y='Numbers', name_x=['House Price', 'Population'], name_w='Queen', name_ds='East yorkshire and Hull')
# Output summary
print(model_sem.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: ML SPATIAL ERROR (METHOD = full)
---------------------------------------------------
Data set            :East yorkshire and Hull
Weights matrix      :       Queen
Dependent Variable  :     Numbers                Number of Observations:         381
Mean dependent var  :     13.1417                Number of Variables   :           3
S.D. dependent var  :     17.8143                Degrees of Freedom    :         378
Pseudo R-squared    :      0.0858
Log likelihood      :  -1604.2593
Sigma-square ML     :    257.5623                Akaike info criterion :    3214.519
S.E of regression   :     16.0487                Schwarz criterion     :    3226.347

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        12.51763         4.97003         2.51862         0.01178
         House Price        -0.00003         0.00001        -3.14638         0.00165
          Population         0.00375         0.00274         1.36661         0.17175
              lambda         0.39451         0.06546         6.02678         0.00000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================

# Fit spatial lag model
model_dur = ML_Lag(y, X, w=w, name_y='Numbers', name_x=['House Price', 'Population'], name_w='Queen', name_ds='East yorkshire and Hull')

# Output summary
print(model_dur.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL)
-----------------------------------------------------------------
Data set            :East yorkshire and Hull
Weights matrix      :       Queen
Dependent Variable  :     Numbers                Number of Observations:         381
Mean dependent var  :     13.1417                Number of Variables   :           4
S.D. dependent var  :     17.8143                Degrees of Freedom    :         377
Pseudo R-squared    :      0.2035
Spatial Pseudo R-squared:  0.1112
Log likelihood      :  -1601.4457
Sigma-square ML     :    253.8232                Akaike info criterion :    3210.891
S.E of regression   :     15.9318                Schwarz criterion     :    3226.663

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT         7.88095         4.64113         1.69807         0.08950
         House Price        -0.00003         0.00001        -3.93716         0.00008
          Population         0.00368         0.00262         1.40820         0.15907
           W_Numbers         0.39369         0.06406         6.14562         0.00000
------------------------------------------------------------------------------------

SPATIAL LAG MODEL IMPACTS
Impacts computed using the 'simple' method.
            Variable         Direct        Indirect          Total
         House Price        -0.0000         -0.0000         -0.0001
          Population         0.0037          0.0024          0.0061
================================ END OF REPORT =====================================

5 Geospatial statistical modelling

Contents

5 Geospatial statistical modelling#

5.1 Spatial lag#

5.2 Spatial Lag Model (SLM)#

5.3 Spatial Error Model (SEM)#

5.4 Spatial Durbin Model (SDM)#

5.5 Geographically Weighted Regression (GWR)#

5.6 Multiscale Geographically Weighted Regression (MGWR)#

Quiz#