Week 7 Lab Tutorial Timeline

Time | Activity

16:00–16:05 | Introduction — Overview of the main tasks for the lab tutorials

16:05–16:45 | Tutorial: Clustering for spatial and temporal big data — Follow Section 7.1 of the Jupyter Notebook to practice GMM, KMeans and DBSCAN

16:45–17:30 | Tutorial: Clustering for spatial and temporal big data — Follow Sections 7.2–7.3 of the Jupyter Notebook to practice the DB index, Silhouette coefficient and ST-DBSCAN

17:30–17:55 | Quiz — Complete the quiz tasks

17:55–18:00 | Wrap-up — Recap key points and address final questions

For this module’s lab tutorials, you can download all the required data using the provided link.

Please make sure that the Jupyter Notebook file and the data and img folders are placed in the same directory (specifically within the STBDA_lab folder) so that the code runs correctly.

Week 7 Key Takeaways:

  • Understand the concept of clustering and its applications in spatial and temporal big data.

  • Learn how to apply clustering algorithms such as Gaussian Mixture Model (GMM), KMeans, and DBSCAN to spatial data.

  • Explore the use of clustering in road accident data analysis.

  • Gain insights into parameter tuning in clustering, including the use of Silhouette Coefficient and Davies-Bouldin Index.

  • Understand the concept of spatial and temporal clustering and its applications in human mobility analysis.

  • Learn how to apply ST-DBSCAN for clustering spatio-temporal data.

7 Clustering for spatial and temporal big data#

Clustering is a machine learning method (unsupervised learning) that groups a collection of objects into clusters, with each cluster containing objects that are highly similar to each other and distinct from those in other clusters.

7.1 Spatial Clustering#

Spatial clustering is a method used in spatial data analysis to group together objects or data points mainly based on their spatial (geographical) location. The goal is to identify regions/areas where the data points/objects are more similar to each other than to those in other regions/areas.

7.1.1 Gaussian Mixture Model (GMM)#

A Gaussian Mixture Model (GMM) is a probabilistic model that assumes that the data is generated from a mixture of several Gaussian distributions with unknown parameters. The equation for a GMM is a weighted sum of multiple Gaussian (Normal) distributions. The model is defined as:

\(p(x) = \sum_{i=1}^{K} w_i \mathcal{N}({x}|\mu_i, \sigma_i)\)

  • \(p(x)\) is the overall probability density of the data point \(x\) under the mixture.

  • \({x}\) is the data point.

  • \(K\) is the total number of Gaussian components (each normal distribution acts as one cluster).

  • \(w_i\) is the weight (size) of the \(i\)th Gaussian; the weights sum to 1.

  • \(\mathcal{N}({x}|\mu_i, \sigma_i)\) is the probability density of \(x\) under the \(i\)th Gaussian.

  • \(\mu_i\) is the mean of the \(i\)th Gaussian.

  • \(\sigma_i\) is the variance of the \(i\)th Gaussian. (A small numeric sketch of this weighted sum follows the list.)
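To make the weighted sum concrete, here is a small numeric sketch (not from the notebook) that evaluates a hypothetical one-dimensional two-component mixture at a single point; all weights, means and standard deviations are made-up values for illustration.

import numpy as np

# A hypothetical 1D mixture of two Gaussians; all parameter values are made up for illustration
weights = np.array([0.4, 0.6])   # w_i, summing to 1
means = np.array([0.0, 5.0])     # mu_i
stds = np.array([1.0, 2.0])      # component standard deviations

def gaussian_pdf(x, mu, sigma):
    # Density of a univariate normal distribution N(x | mu, sigma)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 1.5
# p(x) = sum_i w_i * N(x | mu_i, sigma_i)
p_x = np.sum(weights * gaussian_pdf(x, means, stds))
print(p_x)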

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')
# Read the data
df = pd.read_csv('data/cluster_data.csv')
df
Feature1 Feature2
0 4.862775 1.011907
1 5.743938 0.472795
2 10.177229 -1.014992
3 -0.405046 1.447232
4 1.260348 9.982292
... ... ...
495 9.447088 10.323605
496 5.335410 0.273029
497 5.668767 2.307601
498 8.489488 10.133309
499 6.282093 -0.258828

500 rows × 2 columns

# Plot the data using seaborn
import seaborn as sns
plt.figure(figsize=(10, 8))
g = sns.scatterplot(data=df, x='Feature1', y='Feature2', s=50)
g.set_title('Data for Clustering')
plt.show()
[Figure: Data for Clustering]
from sklearn.mixture import GaussianMixture
# Define the parameters for GMM
gmm = GaussianMixture(n_components=6, covariance_type='full', random_state=50) 
# GMM fitting
gmm.fit(df[['Feature1', 'Feature2']]) 
# Get labels from clusters
gmm_labels = gmm.predict(df[['Feature1', 'Feature2']])
# Assign the labels of clusters to df
df['gmm_labels']=gmm_labels

# Plot the data with clusters
plt.figure(figsize=(10, 8))
g = sns.scatterplot(data=df, x='Feature1', y='Feature2', hue='gmm_labels', palette='Set1', s=50, legend=True)
g.set_title('GMM Clustering')
g.legend(title='Clusters', loc='center right')
plt.show()
[Figure: GMM Clustering]

Applying GMM clustering on road accidents

In this section, we will apply GMM to the road accident data in Camden Town, London, to identify clusters of accidents based on their spatial distribution. The dataset contains information about road accidents, including their geographical coordinates. You can find the dataset here.

In spatial analysis, use projected coordinates, such as the UK National Grid, rather than GPS coordinates (latitude and longitude) for clustering.

  • Projected Coordinates: In a projected coordinate system (e.g., the UK National Grid), distances are measured in consistent units (e.g., meters). This makes it straightforward to compute distances and areas accurately and keeps the results easy to interpret.

  • GPS Coordinates: Latitude and longitude are measured in degrees, and the actual distance represented by one degree varies with the location on the Earth’s surface. The Earth’s curvature affects calculations, requiring more complex spherical trigonometry for accurate distance and area computations. (A short reprojection sketch follows this list.)
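As a minimal sketch of this reprojection step (using made-up latitude/longitude values; the accident data below already provides easting/northing columns), GeoPandas can convert WGS84 coordinates to the British National Grid (EPSG:27700) before clustering:

import pandas as pd
import geopandas as gpd

# Hypothetical WGS84 points (longitude/latitude) to be reprojected before clustering
pts = pd.DataFrame({'longitude': [-0.14, -0.13], 'latitude': [51.54, 51.55]})
gdf_pts = gpd.GeoDataFrame(
    pts,
    geometry=gpd.points_from_xy(pts['longitude'], pts['latitude']),
    crs='EPSG:4326')
# Reproject to the British National Grid (EPSG:27700), whose units are meters
gdf_pts = gdf_pts.to_crs('EPSG:27700')
# Easting/northing columns in meters, suitable as clustering features
pts['easting'] = gdf_pts.geometry.x
pts['northing'] = gdf_pts.geometry.y
print(pts)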

# Read the road accident data
df_accident = pd.read_csv('data/dft-road-casualty-statistics-collision-2023.csv')
df_accident
accident_index accident_year accident_reference location_easting_osgr location_northing_osgr longitude latitude police_force accident_severity number_of_vehicles ... light_conditions weather_conditions road_surface_conditions special_conditions_at_site carriageway_hazards urban_or_rural_area did_police_officer_attend_scene_of_accident trunk_road_flag lsoa_of_accident_location enhanced_severity_collision
0 2023010419171 2023 10419171 525060.0 170416.0 -0.202878 51.418974 1 3 1 ... 4 8 2 0 0 1 1 2 E01003383 -1
1 2023010419183 2023 10419183 535463.0 198745.0 -0.042464 51.671155 1 3 3 ... 4 1 1 0 0 1 1 2 E01001547 -1
2 2023010419189 2023 10419189 508702.0 177696.0 -0.435789 51.487777 1 3 2 ... 4 1 1 0 0 1 1 2 E01002448 -1
3 2023010419191 2023 10419191 520341.0 190175.0 -0.263972 51.597575 1 3 2 ... 4 9 1 0 0 1 1 2 E01000129 -1
4 2023010419192 2023 10419192 527255.0 176963.0 -0.168976 51.477324 1 3 2 ... 4 1 1 0 0 1 1 2 E01004583 -1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
104253 2023991452286 2023 991452286 246754.0 661133.0 -4.447490 55.819059 99 2 1 ... 5 2 2 -1 -1 1 1 -1 -1 7
104254 2023991452640 2023 991452640 224491.0 581627.0 -4.752200 55.097920 99 3 2 ... 1 9 1 0 0 2 2 -1 -1 3
104255 2023991453360 2023 991453360 383341.0 806427.0 -2.276957 57.148422 99 3 2 ... 4 2 2 -1 -1 1 1 -1 -1 3
104256 2023991461915 2023 991461915 271662.0 655488.0 -4.047591 55.775637 99 3 1 ... 4 8 1 -1 -1 1 1 -1 -1 3
104257 2023991462793 2023 991462793 327349.0 666746.0 -3.163102 55.888350 99 3 2 ... 1 1 1 0 0 2 1 -1 -1 3

104258 rows × 37 columns

df_accident.columns
Index(['accident_index', 'accident_year', 'accident_reference',
       'location_easting_osgr', 'location_northing_osgr', 'longitude',
       'latitude', 'police_force', 'accident_severity', 'number_of_vehicles',
       'number_of_casualties', 'date', 'day_of_week', 'time',
       'local_authority_district', 'local_authority_ons_district',
       'local_authority_highway', 'first_road_class', 'first_road_number',
       'road_type', 'speed_limit', 'junction_detail', 'junction_control',
       'second_road_class', 'second_road_number',
       'pedestrian_crossing_human_control',
       'pedestrian_crossing_physical_facilities', 'light_conditions',
       'weather_conditions', 'road_surface_conditions',
       'special_conditions_at_site', 'carriageway_hazards',
       'urban_or_rural_area', 'did_police_officer_attend_scene_of_accident',
       'trunk_road_flag', 'lsoa_of_accident_location',
       'enhanced_severity_collision'],
      dtype='object')
# Camden Town LSOA index
camden_lsoas = ['E01000936', 'E01000937', 'E01000964', 'E01000934', 'E01000935', 'E01000942', 'E01000932', 'E01000933',
                'E01000930', 'E01000931', 'E01000938', 'E01000939', 'E01000943', 'E01000916', 'E01000917', 'E01000914',
                'E01000915', 'E01000912', 'E01000913', 'E01000910', 'E01000911', 'E01000918', 'E01000919', 'E01000876',
                'E01000877', 'E01000874', 'E01000875', 'E01000872', 'E01000873', 'E01000870', 'E01000871', 'E01000878',
                'E01000879', 'E01000966', 'E01000940', 'E01000967', 'E01000965', 'E01000962', 'E01000963', 'E01000960',
                'E01000961', 'E01000941', 'E01000968', 'E01000949', 'E01000969', 'E01000856', 'E01000857', 'E01000854',
                'E01000855', 'E01000852', 'E01000853', 'E01000850', 'E01000851', 'E01000858', 'E01000859', 'E01000946',
                'E01000947', 'E01000944', 'E01000890', 'E01000899', 'E01000945', 'E01000948', 'E01000926', 'E01000927',
                'E01000924', 'E01000925', 'E01000922', 'E01000905', 'E01000923', 'E01000955', 'E01000920', 'E01000921',
                'E01000928', 'E01000929', 'E01000950', 'E01000897', 'E01000896', 'E01000895', 'E01000894', 'E01000893',
                'E01000892', 'E01000891', 'E01000898', 'E01000906', 'E01000907', 'E01000904', 'E01000902', 'E01000903',
                'E01000900', 'E01000901', 'E01000908', 'E01000909', 'E01000866', 'E01000867', 'E01000951', 'E01000864',
                'E01000865', 'E01000862', 'E01000863', 'E01000860', 'E01000861', 'E01000868', 'E01000869', 'E01000846',
                'E01000847', 'E01000844', 'E01000845', 'E01000842', 'E01000843', 'E01000848', 'E01000849', 'E01000974',
                'E01000972', 'E01000973', 'E01000970', 'E01000971', 'E01000956', 'E01000957', 'E01000954', 'E01000952',
                'E01000953', 'E01000958', 'E01000959', 'E01000887', 'E01000886', 'E01000885', 'E01000884', 'E01000883',
                'E01000882', 'E01000881', 'E01000880', 'E01000889', 'E01000888']
print('The number of LSOAs in Camden Town:', len(camden_lsoas))
The number of LSOAs in Camden Town: 133
df_accident_camden = df_accident.query('lsoa_of_accident_location in @camden_lsoas')
df_accident_camden
accident_index accident_year accident_reference location_easting_osgr location_northing_osgr longitude latitude police_force accident_severity number_of_vehicles ... light_conditions weather_conditions road_surface_conditions special_conditions_at_site carriageway_hazards urban_or_rural_area did_police_officer_attend_scene_of_accident trunk_road_flag lsoa_of_accident_location enhanced_severity_collision
5 2023010419198 2023 10419198 524780.0 184471.0 -0.201941 51.545349 1 3 1 ... 4 1 1 0 0 1 1 2 E01000974 -1
20 2023010419281 2023 10419281 525522.0 184858.0 -0.191108 51.548662 1 2 1 ... 4 2 2 0 0 1 1 2 E01000973 -1
21 2023010419282 2023 10419282 530049.0 184288.0 -0.126067 51.542516 1 3 1 ... 4 2 2 0 0 1 1 2 E01000868 -1
48 2023010419391 2023 10419391 528183.0 184396.0 -0.152921 51.543913 1 2 2 ... 4 1 1 0 0 1 1 2 E01000906 -1
52 2023010419398 2023 10419398 526452.0 185517.0 -0.177466 51.554377 1 3 2 ... 4 1 1 0 0 1 3 2 E01000879 -1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22654 2023010487700 2023 10487700 529836.0 181379.0 -0.130208 51.516423 1 2 1 ... 4 1 1 0 9 1 3 2 E01000855 -1
22687 2023010491078 2023 10491078 529537.0 181828.0 -0.134350 51.520526 1 3 2 ... 1 1 1 9 0 1 3 2 E01000851 -1
22695 2023010494344 2023 10494344 526857.0 184926.0 -0.171841 51.548975 1 2 2 ... 1 1 1 0 0 1 1 2 E01000882 -1
22731 2023011366792 2023 11366792 530953.0 181613.0 -0.114033 51.518268 1 3 2 ... 1 1 1 0 0 1 1 2 E01000914 3
87863 2023481407567 2023 481407567 531350.0 181567.0 -0.108331 51.517763 48 2 1 ... 4 8 2 0 0 1 3 2 E01000917 5

722 rows × 37 columns

# we notice only 125/133 LSOAs have road accidents in Camden Town, 2023
df_accident_camden.lsoa_of_accident_location.nunique()
125
# plot the data using folium
import folium
# Create a map centered around Camden Town
m = folium.Map(location=[51.5416, -0.1425], zoom_start=13)
# we use the CartoDB Positron basemap layer
folium.TileLayer('cartodb positron').add_to(m)
# Add road accidents to the map
for _, row in df_accident_camden.iterrows():
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=3,
        color='Royalblue',
        fill=True,
        fill_color='Royalblue',
        fill_opacity=0.8
    ).add_to(m)
m
[Interactive folium map of the road accidents in Camden]
df_accident_camden[['location_easting_osgr', 'location_northing_osgr']]
location_easting_osgr location_northing_osgr
5 524780.0 184471.0
20 525522.0 184858.0
21 530049.0 184288.0
48 528183.0 184396.0
52 526452.0 185517.0
... ... ...
22654 529836.0 181379.0
22687 529537.0 181828.0
22695 526857.0 184926.0
22731 530953.0 181613.0
87863 531350.0 181567.0

722 rows × 2 columns

# clustering the road accidents using GMM
gmm = GaussianMixture(n_components=6, covariance_type='full', random_state=50)
# GMM fitting
gmm.fit(df_accident_camden[['location_easting_osgr', 'location_northing_osgr']])
# Get labels from clusters
gmm_labels = gmm.predict(df_accident_camden[['location_easting_osgr', 'location_northing_osgr']])
# Assign the cluster labels to the DataFrame
df_accident_camden['gmm_labels'] = gmm_labels
df_accident_camden
accident_index accident_year accident_reference location_easting_osgr location_northing_osgr longitude latitude police_force accident_severity number_of_vehicles ... weather_conditions road_surface_conditions special_conditions_at_site carriageway_hazards urban_or_rural_area did_police_officer_attend_scene_of_accident trunk_road_flag lsoa_of_accident_location enhanced_severity_collision gmm_labels
5 2023010419198 2023 10419198 524780.0 184471.0 -0.201941 51.545349 1 3 1 ... 1 1 0 0 1 1 2 E01000974 -1 3
20 2023010419281 2023 10419281 525522.0 184858.0 -0.191108 51.548662 1 2 1 ... 2 2 0 0 1 1 2 E01000973 -1 3
21 2023010419282 2023 10419282 530049.0 184288.0 -0.126067 51.542516 1 3 1 ... 2 2 0 0 1 1 2 E01000868 -1 2
48 2023010419391 2023 10419391 528183.0 184396.0 -0.152921 51.543913 1 2 2 ... 1 1 0 0 1 1 2 E01000906 -1 2
52 2023010419398 2023 10419398 526452.0 185517.0 -0.177466 51.554377 1 3 2 ... 1 1 0 0 1 3 2 E01000879 -1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22654 2023010487700 2023 10487700 529836.0 181379.0 -0.130208 51.516423 1 2 1 ... 1 1 0 9 1 3 2 E01000855 -1 5
22687 2023010491078 2023 10491078 529537.0 181828.0 -0.134350 51.520526 1 3 2 ... 1 1 9 0 1 3 2 E01000851 -1 1
22695 2023010494344 2023 10494344 526857.0 184926.0 -0.171841 51.548975 1 2 2 ... 1 1 0 0 1 1 2 E01000882 -1 0
22731 2023011366792 2023 11366792 530953.0 181613.0 -0.114033 51.518268 1 3 2 ... 1 1 0 0 1 1 2 E01000914 3 5
87863 2023481407567 2023 481407567 531350.0 181567.0 -0.108331 51.517763 48 2 1 ... 8 2 0 0 1 3 2 E01000917 5 5

722 rows × 38 columns

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import contextily as ctx
import geopandas as gpd
from shapely.geometry import Point

# Create a GeoDataFrame from the DataFrame
geometry = [Point(xy) for xy in zip(df_accident_camden['longitude'], df_accident_camden['latitude'])]
gdf = gpd.GeoDataFrame(df_accident_camden, geometry=geometry, crs="EPSG:4326")

# Project to Web Mercator (required by contextily)
gdf = gdf.to_crs(epsg=3857)

# Generate color palette
palette = sns.color_palette("Set2", n_colors=6)
hex_colors = [mcolors.to_hex(color) for color in palette]

# Plot
fig, ax = plt.subplots(figsize=(8, 8))
for label in sorted(gdf['gmm_labels'].unique()):
    gdf[gdf['gmm_labels'] == label].plot(
        ax=ax,
        markersize=13,
        color=hex_colors[label],
        label=f"Cluster {label}"
    )

# Add basemap (CartoDB Positron)
ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)

# plot
ax.set_axis_off()
plt.legend()
plt.title("Road Accidents in Camden by GMM Cluster", fontsize=14)
plt.tight_layout()
plt.show()
[Figure: Road Accidents in Camden by GMM Cluster]

7.1.2 KMeans#

KMeans is a clustering method that partitions the data into K separate clusters, assigning each data point to the cluster whose mean (centroid) is closest in feature space.
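For reference (this formula is not shown in the notebook), KMeans chooses the cluster assignments and centroids that minimize the within-cluster sum of squared distances:

\( J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2 \)

  • \(C_k\) is the set of points assigned to cluster \(k\).

  • \(\mu_k\) is the centroid (mean) of cluster \(k\).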

from sklearn.cluster import KMeans
# Define the parameters for KMeans
kmeans = KMeans(n_clusters=6, random_state=50)
# KMeans fitting
kmeans.fit(df[['Feature1', 'Feature2']])
# Generating labels
kmeans_labels = kmeans.predict(df[['Feature1', 'Feature2']])
# Assign the labels of clusters to df
df['kmeans_labels'] = kmeans_labels
# Plot the data with clusters
plt.figure(figsize=(10, 8))
g = sns.scatterplot(data=df, x='Feature1', y='Feature2', hue='kmeans_labels', palette='Set1', s=50, legend=True)
g.set_title('KMeans Clustering')
g.legend(title='Clusters', loc='center right')
plt.show()
[Figure: KMeans Clustering]

Applying KMeans clustering on road accidents

# KMeans clustering on road accidents in Camden Town
kmeans = KMeans(n_clusters=6, random_state=50)
# KMeans fitting
kmeans.fit(df_accident_camden[['location_easting_osgr', 'location_northing_osgr']])
# Generating labels
kmeans_labels = kmeans.predict(df_accident_camden[['location_easting_osgr', 'location_northing_osgr']])
# Assign the cluster labels to the df and gdf
df_accident_camden['kmeans_labels'] = kmeans_labels
gdf['kmeans_labels'] = kmeans_labels
# Plot the data with clusters
fig, ax = plt.subplots(figsize=(8, 8))
for label in sorted(gdf['kmeans_labels'].unique()):
    gdf[gdf['kmeans_labels'] == label].plot(
        ax=ax,
        markersize=13,
        color=hex_colors[label],
        label=f"Cluster {label}"
    )

# Add basemap (CartoDB Positron)
ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)

# plot
ax.set_axis_off()
plt.legend()
plt.title("Road Accidents in Camden by Kmeans Cluster", fontsize=14)
plt.tight_layout()
plt.show()
[Figure: Road Accidents in Camden by KMeans Cluster]

7.1.3 DBSCAN#

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular density-based clustering algorithm. Unlike partition-based methods such as KMeans, DBSCAN can find clusters of arbitrary shape and handle noise in the data.

Key Parameters:

  • eps (epsilon): The maximum distance between two points for one to be considered as in the neighborhood of the other.

  • min_samples: The minimum number of points required in a point's neighborhood for that point to be considered a core point; this count includes the point itself. (A short sketch of this rule follows the list.)
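Here is a small sketch (not part of the notebook) of how these two parameters identify core points; it uses a handful of made-up 2D points and sklearn's NearestNeighbors purely for the neighborhood counting.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# A handful of made-up 2D points and DBSCAN-style parameters
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0]])
eps, min_samples = 0.5, 3

# For each point, collect the indices of neighbors within eps (the point itself is included)
nn = NearestNeighbors(radius=eps).fit(X)
neighbor_indices = nn.radius_neighbors(X, return_distance=False)
neighbor_counts = np.array([len(idx) for idx in neighbor_indices])

# A point is a core point if its eps-neighborhood contains at least min_samples points
is_core = neighbor_counts >= min_samples
print(is_core)  # the three nearby points are core points; the isolated point is not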

from sklearn.cluster import DBSCAN
# Define the parameters for DBSCAN
dbscan = DBSCAN(eps=0.6, min_samples=5)
# Fit DBSCAN to the data
dbscan.fit(df[['Feature1', 'Feature2']])
DBSCAN(eps=0.6)
# Retrieve the cluster labels
df['dbscan_labels'] = dbscan.labels_
print('The number of clusters from DBSCAN:', df['dbscan_labels'].nunique()-1)
# We replace the -1 label (noise) with 'Noise'
df.dbscan_labels = df.dbscan_labels.replace({-1:'Noise'})
The number of clusters from DBSCAN: 8
# Plot the data with clusters
plt.figure(figsize=(10, 8))
g = sns.scatterplot(data=df, x='Feature1', y='Feature2', hue='dbscan_labels', palette='Set1', s=50, legend=True)
g.set_title('DBSCAN Clustering')
g.legend(title='Clusters', loc='center right')
plt.show()
[Figure: DBSCAN Clustering]

Applying DBSCAN clustering on road accidents

# Define the parameters for DBSCAN: eps=200 is in meters (projected coordinates),
# so a core point has at least 3 points (including itself) within 200 meters of it.
dbscan = DBSCAN(eps=200, min_samples=3)
# Fit DBSCAN to the data
dbscan.fit(df_accident_camden[['location_easting_osgr',
                               'location_northing_osgr']])
# Retrieve the cluster labels
gdf['dbscan_labels'] = dbscan.labels_
print('The number of clusters from DBSCAN:',
      gdf['dbscan_labels'].nunique()-1)
The number of clusters from DBSCAN: 25
# define the colors for clusters
palette = sns.color_palette("Set2", n_colors=gdf['dbscan_labels'].nunique()-1)
hex_colors = [mcolors.to_hex(color) for color in palette]
# Plot the data with clusters
fig, ax = plt.subplots(figsize=(8, 8))
for label in sorted(gdf['dbscan_labels'].unique()):
    if label != -1:  # Exclude noise points
        gdf[gdf['dbscan_labels'] == label].plot(
            ax=ax,
            markersize=13,
            color=hex_colors[label],
            label=f"Cluster {label}"
        )
    else:
        gdf[gdf['dbscan_labels'] == label].plot(
            ax=ax,
            markersize=13,
            color='gray',
            label='Noise'
        )
# Add basemap (CartoDB Positron)
ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)
# plot
ax.set_axis_off()
plt.legend()
plt.title("Road Accidents in Camden by DBSCAN Cluster", fontsize=14)
plt.tight_layout()
plt.show()
[Figure: Road Accidents in Camden by DBSCAN Cluster]

Interpretation:

Although the clusters from GMM and KMeans look similar, DBSCAN identifies more clusters as well as noise points. This is because DBSCAN is a density-based clustering algorithm, which can find clusters of arbitrary shape and handle noise in the data. In practice, the local clusters of road accidents identified by DBSCAN can highlight areas with high accident rates and potential hotspots for road safety interventions.

7.2 Hyperparameter tuning in clustering#

Metrics assess the effectiveness of clustering outcomes. By comparing different clustering algorithms to select the best one, or by tuning the parameters of a single algorithm (e.g., selecting the number of clusters), we can determine which configuration produces clusters that align with our domain knowledge.

Related metrics for clustering in sklearn can be found here.

7.2.1 Silhouette coefficient#

The Silhouette Coefficient is a metric used to evaluate the quality of clustering in unsupervised learning. It measures how similar each data point is to its own cluster compared to other clusters.

\(s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\)

  • \(s(i)\) represents the Silhouette Coefficient for data point \(i\).

  • \(a(i)\) is the average distance from point \(i\) to all other points within the same cluster (intra-cluster distance).

  • \(b(i)\) is the minimum average distance from \(i\) to the points in any other cluster (inter-cluster distance).

  • Dividing by \(\max(a(i), b(i))\) ensures that \(s(i)\) is normalized to the range between -1 and +1. (A small hand computation of \(s(i)\) follows the interpretation list below.)

Interpretation:

  • \(s(i)\) near +1 indicates that point \(i\) is well-clustered, with its distance to points in its own cluster significantly smaller than its distance to points in neighboring clusters.

  • \(s(i)\) near 0 indicates that point \(i\) is on or very close to the decision boundary between two neighboring clusters.

  • \(s(i)\) near -1 indicates that point \(i\) may have been assigned to the wrong cluster.
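Before calling sklearn below, here is a small hand computation of \(s(i)\) for a single point using made-up one-dimensional data (a sketch to connect the formula to the code; not part of the original notebook):

import numpy as np

# Made-up 1D data: the point of interest belongs to cluster A = {0, 1}; the nearest other cluster is B = {5, 6}
point = 1.0
own_cluster_others = np.array([0.0])          # the other member(s) of the point's own cluster
nearest_other_cluster = np.array([5.0, 6.0])

a_i = np.mean(np.abs(own_cluster_others - point))     # a(i) = 1.0, intra-cluster distance
b_i = np.mean(np.abs(nearest_other_cluster - point))  # b(i) = 4.5, nearest-cluster distance
s_i = (b_i - a_i) / max(a_i, b_i)                     # s(i) = 3.5 / 4.5 ≈ 0.78
print(s_i)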

from sklearn.metrics import silhouette_score

# This function returns the mean Silhouette Coefficient of all samples.
# We calculate the gmm clusters for an example
silhouette_avg = silhouette_score(df[['Feature1', 'Feature2']], df['gmm_labels']) 
print("Silhouette Coefficient:", silhouette_avg)
Silhouette Coefficient: 0.6583554245759751

7.2.2 Davies-Bouldin Index#

The Davies-Bouldin Index measures the average similarity between each cluster and its most similar other cluster, where similarity compares within-cluster scatter to the separation between cluster centroids.

\( DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{S_i + S_j}{d(c_i, c_j)} \right) \)

  • \(n\) is the number of clusters.

  • \(S_i\) is the average distance between each point in cluster \(i\) and the centroid \(c_i\).

  • \(d(c_i, c_j)\) is the distance between centroids \(c_i\) and \(c_j\).

  • \(\max_{j \neq i} \left( \frac{S_i + S_j}{d(c_i, c_j)} \right)\) selects, for cluster \(i\), the largest similarity ratio (within-cluster scatter relative to centroid separation) with any other cluster \(j\). (A small hand computation of the index follows the interpretation below.)

Interpretation:

  • Lower values of DB indicate better-defined and well-separated clusters.
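As a sketch (not part of the original notebook), the index can be recomputed by hand for two made-up one-dimensional clusters, following the definitions above; with only two clusters, the maximum over \(j\) has a single candidate.

import numpy as np

# Two made-up 1D clusters
cluster_a = np.array([0.0, 1.0, 2.0])    # centroid c_a = 1.0
cluster_b = np.array([8.0, 9.0, 10.0])   # centroid c_b = 9.0

c_a, c_b = cluster_a.mean(), cluster_b.mean()
S_a = np.mean(np.abs(cluster_a - c_a))   # average distance to own centroid = 2/3
S_b = np.mean(np.abs(cluster_b - c_b))   # = 2/3
d_ab = abs(c_a - c_b)                    # distance between the centroids = 8.0

R_ab = (S_a + S_b) / d_ab                # similarity of the pair = (4/3) / 8 = 1/6
DB = np.mean([R_ab, R_ab])               # each cluster's worst-case ratio is R_ab, so DB = 1/6
print(DB)                                # ≈ 0.167; lower values mean better separation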

from sklearn.metrics import davies_bouldin_score

# Compute Davies-Bouldin Index
# We calculate the gmm clusters for an example
db_index = davies_bouldin_score(df[['Feature1', 'Feature2']], df['gmm_labels'])
print("Davies-Bouldin Index:", db_index)
Davies-Bouldin Index: 0.5099268217594325

7.2.3 Parameter tuning process#

Example: optimising the number of clusters in GMM using the Silhouette Coefficient and the DB index

# Define the range for the number of clusters
cluster_range = range(2, 11)

# Initialize lists to store metric scores
silhouette_scores = []
davies_bouldin_scores = []

for n_clusters in cluster_range:
    # Fit GMM model using different cluster numbers
    gmm = GaussianMixture(n_components=n_clusters, random_state=42)
    gmm.fit(df[['Feature1', 'Feature2']])
    
    # Generating labels
    labels = gmm.predict(df[['Feature1', 'Feature2']])
    
    # Calculate Silhouette Coefficient
    silhouette_avg = silhouette_score(df[['Feature1', 'Feature2']], labels)
    silhouette_scores.append(silhouette_avg)
    
    # Calculate Davies-Bouldin Index
    db_index = davies_bouldin_score(df[['Feature1', 'Feature2']], labels)
    davies_bouldin_scores.append(db_index)
    
    print(f'Number of clusters: {n_clusters}')
    print(f'Silhouette Coefficient: {silhouette_avg}')
    print(f'Davies-Bouldin Index: {db_index}\n')
Number of clusters: 2
Silhouette Coefficient: 0.4759573267675314
Davies-Bouldin Index: 0.9428541532461406

Number of clusters: 3
Silhouette Coefficient: 0.49528077888296457
Davies-Bouldin Index: 0.7161162630982991

Number of clusters: 4
Silhouette Coefficient: 0.5615867904177323
Davies-Bouldin Index: 0.5269555531806575

Number of clusters: 5
Silhouette Coefficient: 0.6167432592503065
Davies-Bouldin Index: 0.5778207829980668

Number of clusters: 6
Silhouette Coefficient: 0.6583554245759751
Davies-Bouldin Index: 0.5099268217594325

Number of clusters: 7
Silhouette Coefficient: 0.582565242046876
Davies-Bouldin Index: 0.5994514303332313

Number of clusters: 8
Silhouette Coefficient: 0.5642378840860819
Davies-Bouldin Index: 0.7263834300455105

Number of clusters: 9
Silhouette Coefficient: 0.5271595593685239
Davies-Bouldin Index: 1.0159927193940836

Number of clusters: 10
Silhouette Coefficient: 0.5212884507182589
Davies-Bouldin Index: 0.983680222721036
# Plotting the metrics with cluster numbers
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
sns.lineplot(x=cluster_range, y=silhouette_scores, marker='o', size=20, ax=ax[0], legend=False)
ax[0].set_title('Silhouette Coefficient')
ax[0].set_xlabel('Number of clusters')
ax[0].set_ylabel('Silhouette Coefficient')

sns.lineplot(x=cluster_range, y=davies_bouldin_scores, marker='o', size=20 ,ax=ax[1], legend=False)
ax[1].set_title('Davies-Bouldin Index')
ax[1].set_xlabel('Number of clusters')
ax[1].set_ylabel('Davies-Bouldin Index')
plt.show()
[Figure: Silhouette Coefficient and Davies-Bouldin Index against the number of clusters]

In summary, we can identify the optimal cluster number (6) for this clustering task, where the Silhouette Coefficient is highest and the Davies-Bouldin Index is lowest.
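As a small follow-up (assuming the cluster_range, silhouette_scores and davies_bouldin_scores lists from the tuning loop above are still in memory), the optimal cluster number can also be read off programmatically:

import numpy as np

# Cluster number with the highest Silhouette Coefficient
best_k_silhouette = cluster_range[int(np.argmax(silhouette_scores))]
# Cluster number with the lowest Davies-Bouldin Index
best_k_db = cluster_range[int(np.argmin(davies_bouldin_scores))]
print(best_k_silhouette, best_k_db)  # both criteria point to 6 for this data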

The selection of the number of clusters in DBSCAN clustering:

In DBSCAN, these indices may not identify an optimal number of clusters, because the DB index and the Silhouette Coefficient often do not reach a clear minimum or maximum. In many cases, the number of clusters produced by DBSCAN is instead determined by domain knowledge and by the DBSCAN parameters eps and min_samples. For example, 500 meters is a reasonable distance for people to walk around a place, so eps = 500 m can be a sensible choice for some clustering tasks in urban analysis. The sketch below shows how the number of clusters changes as eps varies.
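The following sketch (assuming df_accident_camden from Section 7.1 is still in memory; the eps values are arbitrary illustrations) sweeps a few candidate eps values and reports the resulting cluster counts:

from sklearn.cluster import DBSCAN

# Sweep the spatial eps (in meters, since the coordinates are projected) and report cluster counts
for eps in [100, 200, 300, 500]:
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(
        df_accident_camden[['location_easting_osgr', 'location_northing_osgr']])
    # Count clusters, excluding the noise label (-1)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f'eps = {eps} m -> {n_clusters} clusters')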

7.3 Spatial and Temporal Clustering#

Spatial and temporal clustering combines spatial and temporal dimensions to analyze data that varies over both space and time. This approach is particularly useful in applications such as environmental monitoring, traffic analysis, and social media trend analysis, where data points are not only located in space but also change over time.

For example, with human mobility data, we can cluster the data along both spatial and temporal dimensions to identify patterns of human movement in a specific area over time. This can help us understand how people move in and out of an area, how long they stay, and how their movement patterns change over time. In crime science, spatial and temporal clustering can be used to identify near-repeat patterns of crime, which helps us understand the patterns of crime in a specific area over time (ref paper).

ST-DBSCAN

ST-DBSCAN (Spatio-Temporal Density-Based Spatial Clustering of Applications with Noise) is an extension of the classic DBSCAN clustering algorithm designed specifically for spatio-temporal data, that is, data with both spatial (location) and temporal (time) components.

  • eps1: Spatial distance threshold (like radius in regular DBSCAN).

  • eps2: Temporal distance threshold.

  • minPts: Minimum number of points needed to form a dense region.

You can find the ST-DBSCAN documentation here.

We will use mobile phone GPS data to identify visiting behavior from the spatial and temporal patterns of human mobility using ST-DBSCAN. The data come from the GeoLife GPS Trajectories dataset, which was collected in the Geolife project (Microsoft Research Asia) by 182 users over a period of more than three years (from April 2007 to August 2012).

# Read all GPS points from user 001
import glob
path = 'data/Geolife Trajectories 1.3/Data/001/Trajectory/*.plt'
all_files = glob.glob(path)
all_files.sort()  # Sort the files to ensure they are in order
print(len(all_files), all_files)
71 ['data/Geolife Trajectories 1.3/Data/001/Trajectory/20081023055305.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081023234104.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081024234405.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081025231428.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081026081229.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081027111634.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081027233029.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081027235802.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081028102805.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081028233053.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081028235048.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081029110529.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081029234123.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081030233959.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081101004235.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081102030834.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081102233452.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081103133204.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081103233729.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081104054859.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081104234436.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081105110052.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081106051423.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081106133604.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081106233404.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081108034358.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081109033335.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081110233534.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081111114304.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081111234235.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081112234504.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081113121334.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081113234558.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081114130723.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081115003535.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081116024429.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081116235322.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081117042454.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081117234459.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081118003844.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081118132804.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081118235053.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081119104736.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081120121822.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081120234905.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081201102704.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081202000605.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081202145929.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081203000505.plt', 'data/Geolife 
Trajectories 1.3/Data/001/Trajectory/20081203112659.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081203235305.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081204142253.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081205002359.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081205143505.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081205235730.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081206040725.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081206083854.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081206084030.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081207030135.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081208001534.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081208142853.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081209002004.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081210001529.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081210080559.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081210235522.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081211110523.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081212000534.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081212141923.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081213092204.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081214080334.plt', 'data/Geolife Trajectories 1.3/Data/001/Trajectory/20081214235553.plt']
# read all files and concatenate them into a single DataFrame
df_geolife = pd.concat((pd.read_csv(f, header=None, skiprows=6) for f in all_files), ignore_index=True)
# the df includes spatial and temporal information
df_geolife
0 1 2 3 4 5 6
0 39.984094 116.319236 0 492 39744.245197 2008-10-23 05:53:05
1 39.984198 116.319322 0 492 39744.245208 2008-10-23 05:53:06
2 39.984224 116.319402 0 492 39744.245266 2008-10-23 05:53:11
3 39.984211 116.319389 0 492 39744.245324 2008-10-23 05:53:16
4 39.984217 116.319422 0 491 39744.245382 2008-10-23 05:53:21
... ... ... ... ... ... ... ...
108602 39.977969 116.326651 0 311 39797.021505 2008-12-15 00:30:58
108603 39.977946 116.326653 0 310 39797.021562 2008-12-15 00:31:03
108604 39.977897 116.326624 0 310 39797.021620 2008-12-15 00:31:08
108605 39.977882 116.326626 0 310 39797.021678 2008-12-15 00:31:13
108606 39.977879 116.326628 0 310 39797.021736 2008-12-15 00:31:18

108607 rows × 7 columns

# add column names
df_geolife.columns = ['Latitude', 'Longitude', '2', 'Altitude', '4', 'Date', 'Time']
# astype
df_geolife['Date'] = df_geolife['Date'].astype(str)
df_geolife['Time'] = df_geolife['Time'].astype(str)
# combine date and time into a single datetime column
df_geolife['Datetime'] = pd.to_datetime(df_geolife['Date'] + ' ' + df_geolife['Time'], format='%Y-%m-%d %H:%M:%S')
df_geolife['Date'].unique()
array(['2008-10-23', '2008-10-24', '2008-10-25', '2008-10-26',
       '2008-10-27', '2008-10-28', '2008-10-29', '2008-10-30',
       '2008-10-31', '2008-11-01', '2008-11-02', '2008-11-03',
       '2008-11-04', '2008-11-05', '2008-11-06', '2008-11-07',
       '2008-11-08', '2008-11-09', '2008-11-10', '2008-11-11',
       '2008-11-12', '2008-11-13', '2008-11-14', '2008-11-15',
       '2008-11-16', '2008-11-17', '2008-11-18', '2008-11-19',
       '2008-11-20', '2008-11-21', '2008-12-01', '2008-12-02',
       '2008-12-03', '2008-12-04', '2008-12-05', '2008-12-06',
       '2008-12-07', '2008-12-08', '2008-12-09', '2008-12-10',
       '2008-12-11', '2008-12-12', '2008-12-13', '2008-12-14',
       '2008-12-15'], dtype=object)
df_geolife = df_geolife[['Latitude', 'Longitude', 'Altitude', 'Datetime']]
gdf_geolife = gpd.GeoDataFrame(df_geolife, geometry=gpd.points_from_xy(df_geolife['Longitude'], df_geolife['Latitude']), crs="EPSG:4326")
print(gdf_geolife.crs)
EPSG:4326
gdf_geolife
Latitude Longitude Altitude Datetime geometry
0 39.984094 116.319236 492 2008-10-23 05:53:05 POINT (116.31924 39.98409)
1 39.984198 116.319322 492 2008-10-23 05:53:06 POINT (116.31932 39.9842)
2 39.984224 116.319402 492 2008-10-23 05:53:11 POINT (116.3194 39.98422)
3 39.984211 116.319389 492 2008-10-23 05:53:16 POINT (116.31939 39.98421)
4 39.984217 116.319422 491 2008-10-23 05:53:21 POINT (116.31942 39.98422)
... ... ... ... ... ...
108602 39.977969 116.326651 311 2008-12-15 00:30:58 POINT (116.32665 39.97797)
108603 39.977946 116.326653 310 2008-12-15 00:31:03 POINT (116.32665 39.97795)
108604 39.977897 116.326624 310 2008-12-15 00:31:08 POINT (116.32662 39.9779)
108605 39.977882 116.326626 310 2008-12-15 00:31:13 POINT (116.32663 39.97788)
108606 39.977879 116.326628 310 2008-12-15 00:31:18 POINT (116.32663 39.97788)

108607 rows × 5 columns

import folium
m = folium.Map(location=[39.989614, 116.327158], zoom_start=12, tiles='CartoDB Positron')
# only show the 2008-10-28 points for better visualization
gdf_geolife_sample = gdf_geolife[gdf_geolife['Datetime'].dt.date == pd.to_datetime('2008-10-28').date()]
# As there are too many points overall, we only plot this one-day sample
for _, row in gdf_geolife_sample.iterrows():
    folium.CircleMarker(
        location=[row.geometry.y, row.geometry.x],
        radius=1,
        color='RoyalBlue',
        fill=True,
        fill_color='RoyalBlue',
        fill_opacity=0.8
    ).add_to(m)
m
[Interactive folium map of the GPS points on 2008-10-28]

We use ST-DBSCAN to cluster the data, as standard DBSCAN is not designed for spatio-temporal data. If we set eps thresholds for both the spatial and the temporal dimension, an ST-cluster represents a group of points where an individual moves within a certain distance of a place during a certain period of time. This can be used to detect human activities at places in urban analytics. For example, if we set 50 meters and 5 minutes, an ST-cluster represents the user moving within 50 meters of a place over a 5-minute window. Such a cluster can be used to identify the stay points of the user (a stay-point summary sketch is given after the clustering results below). For GPS data more generally, we can also infer the travel mode, such as walking, cycling, or driving. For car GPS data, we can infer traffic jams or traffic flow from the clusters.

# pip install st-dbscan
import st_dbscan
# We need to transform the data to projected coordinates; since the points are in lat/lon in Beijing, China, we can use the UTM Zone 50N projection (EPSG:32650)
gdf_geolife_sample_p = gdf_geolife_sample.to_crs("EPSG:32650")
gdf_geolife_sample_p
Latitude Longitude Altitude Datetime geometry
18430 39.978943 116.325146 392 2008-10-28 00:00:38 POINT (442376.74 4425638.146)
18431 39.978928 116.325205 406 2008-10-28 00:00:43 POINT (442381.765 4425636.443)
18432 39.978911 116.325194 414 2008-10-28 00:00:48 POINT (442380.811 4425634.563)
18433 39.979037 116.325211 422 2008-10-28 00:01:13 POINT (442382.369 4425648.537)
18434 39.978904 116.325214 410 2008-10-28 00:01:14 POINT (442382.513 4425633.773)
... ... ... ... ... ...
19621 39.982950 116.326545 1817 2008-10-28 23:59:22 POINT (442499.555 4426081.984)
19622 39.982934 116.326552 1818 2008-10-28 23:59:27 POINT (442500.139 4426080.204)
19623 39.982920 116.326564 1818 2008-10-28 23:59:32 POINT (442501.152 4426078.642)
19624 39.982920 116.326564 1818 2008-10-28 23:59:35 POINT (442501.152 4426078.642)
19625 39.982913 116.326569 1818 2008-10-28 23:59:37 POINT (442501.573 4426077.862)

1196 rows × 5 columns

# we create x and y columns from the projected coordinates
gdf_geolife_sample_p['x'] = gdf_geolife_sample_p.geometry.x
gdf_geolife_sample_p['y'] = gdf_geolife_sample_p.geometry.y
gdf_geolife_sample_p
Latitude Longitude Altitude Datetime geometry x y
18430 39.978943 116.325146 392 2008-10-28 00:00:38 POINT (442376.74 4425638.146) 442376.739518 4.425638e+06
18431 39.978928 116.325205 406 2008-10-28 00:00:43 POINT (442381.765 4425636.443) 442381.764750 4.425636e+06
18432 39.978911 116.325194 414 2008-10-28 00:00:48 POINT (442380.811 4425634.563) 442380.811214 4.425635e+06
18433 39.979037 116.325211 422 2008-10-28 00:01:13 POINT (442382.369 4425648.537) 442382.368622 4.425649e+06
18434 39.978904 116.325214 410 2008-10-28 00:01:14 POINT (442382.513 4425633.773) 442382.513075 4.425634e+06
... ... ... ... ... ... ... ...
19621 39.982950 116.326545 1817 2008-10-28 23:59:22 POINT (442499.555 4426081.984) 442499.554866 4.426082e+06
19622 39.982934 116.326552 1818 2008-10-28 23:59:27 POINT (442500.139 4426080.204) 442500.139127 4.426080e+06
19623 39.982920 116.326564 1818 2008-10-28 23:59:32 POINT (442501.152 4426078.642) 442501.151975 4.426079e+06
19624 39.982920 116.326564 1818 2008-10-28 23:59:35 POINT (442501.152 4426078.642) 442501.151975 4.426079e+06
19625 39.982913 116.326569 1818 2008-10-28 23:59:37 POINT (442501.573 4426077.862) 442501.573017 4.426078e+06

1196 rows × 7 columns

# We need to calculate each point's speed, which helps to identify stay points or walking points.
# We use the time difference and distance between consecutive points to compute the speed

# Calculate the time difference in seconds
gdf_geolife_sample_p['time_diff'] = gdf_geolife_sample_p['Datetime'].diff().dt.total_seconds().fillna(0)

# Calculate distance between consecutive points
gdf_geolife_sample_p['distance'] = gdf_geolife_sample_p.geometry.distance(gdf_geolife_sample_p.geometry.shift()).fillna(0)

# Calculate speed (meters per second)
gdf_geolife_sample_p['speed'] = gdf_geolife_sample_p['distance'] / gdf_geolife_sample_p['time_diff']

# Filter for slow (likely stationary or walking) points
gdf_geolife_sample_slow = gdf_geolife_sample_p[gdf_geolife_sample_p['speed'] < 2]
gdf_geolife_sample_slow
Latitude Longitude Altitude Datetime geometry x y time_diff distance speed
18431 39.978928 116.325205 406 2008-10-28 00:00:43 POINT (442381.765 4425636.443) 442381.764750 4.425636e+06 5.0 5.305953 1.061191
18432 39.978911 116.325194 414 2008-10-28 00:00:48 POINT (442380.811 4425634.563) 442380.811214 4.425635e+06 5.0 2.107762 0.421552
18433 39.979037 116.325211 422 2008-10-28 00:01:13 POINT (442382.369 4425648.537) 442382.368622 4.425649e+06 25.0 14.060423 0.562417
18436 39.978802 116.325175 398 2008-10-28 00:01:22 POINT (442379.097 4425622.477) 442379.097307 4.425622e+06 5.0 1.334667 0.266933
18437 39.978815 116.325199 394 2008-10-28 00:01:27 POINT (442381.158 4425623.904) 442381.157517 4.425624e+06 5.0 2.506366 0.501273
... ... ... ... ... ... ... ... ... ... ...
19621 39.982950 116.326545 1817 2008-10-28 23:59:22 POINT (442499.555 4426081.984) 442499.554866 4.426082e+06 5.0 2.849030 0.569806
19622 39.982934 116.326552 1818 2008-10-28 23:59:27 POINT (442500.139 4426080.204) 442500.139127 4.426080e+06 5.0 1.873791 0.374758
19623 39.982920 116.326564 1818 2008-10-28 23:59:32 POINT (442501.152 4426078.642) 442501.151975 4.426079e+06 5.0 1.861318 0.372264
19624 39.982920 116.326564 1818 2008-10-28 23:59:35 POINT (442501.152 4426078.642) 442501.151975 4.426079e+06 3.0 0.000000 0.000000
19625 39.982913 116.326569 1818 2008-10-28 23:59:37 POINT (442501.573 4426077.862) 442501.573017 4.426078e+06 2.0 0.886527 0.443264

287 rows × 10 columns

# Convert the datetime to Unix time (seconds)
gdf_geolife_sample_slow['unixtime'] = gdf_geolife_sample_slow['Datetime'].astype(np.int64) // 10**9
# We use ST-DBSCAN to cluster the data: eps1 is the spatial threshold in meters,
# eps2 is the temporal threshold in seconds, and min_samples is the minimum number of points
st_dbscan_model = st_dbscan.ST_DBSCAN(eps1=20, eps2=5*60, min_samples=5)
# Fit the model on the spatial (x, y) and temporal (unixtime) columns
st_dbscan_model.fit(gdf_geolife_sample_slow[['x', 'y', 'unixtime']])
<st_dbscan.st_dbscan.ST_DBSCAN at 0x1711cd720>
st_dbscan_model.labels
array([-1, -1, -1,  0,  0,  0,  0,  0, -1, -1, -1, -1,  1,  1,  1,  1,  1,
        1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  2,  2,  2,  2,  2,  2,
        2,  2,  2,  2,  2, -1, -1, -1, -1,  3,  3,  3,  3,  3,  3,  3,  3,
       -1, -1, -1,  4,  4,  4,  4,  4,  4,  4,  4,  4, -1, -1, -1, -1, -1,
       -1,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,
        5,  5,  5, -1, -1, -1, -1, -1,  6,  6,  6,  6,  6,  6,  6,  6,  6,
        6,  6,  6,  6,  6,  6,  6,  6,  7,  7,  7,  7,  7,  7,  7,  7,  7,
        7,  7,  7,  7, -1, -1, -1, -1, -1, -1,  8,  8,  8,  8,  8,  8,  8,
        8,  8,  8,  8, -1,  9,  9,  9,  9,  9,  9, -1, -1, 10, 10, 10, 10,
       10, 10, 10, -1, 11, 11, 11, 11, 11, -1, -1, -1, -1, -1, -1, -1, 12,
       12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, -1, -1, 13, 13, 13,
       13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, -1, -1, -1, -1, 14,
       14, 14, 14, 14, 14, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       15, 15, 15, 15, 15, -1, -1, -1, 16, 16, 16, 16, 16, -1, -1, -1, -1,
       -1, 17, 17, 17, 17, 17, -1, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 19, 19, 19, 19, 19, 19, 19,
       19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20])
gdf_geolife_sample_slow['labels'] = st_dbscan_model.labels
gdf_geolife_sample_slow.labels.value_counts()
labels
-1     87
 5     19
 6     17
 13    15
 20    14
 12    13
 7     13
 2     11
 8     11
 18    10
 4      9
 3      8
 19     8
 1      7
 14     7
 10     7
 9      6
 0      5
 11     5
 15     5
 16     5
 17     5
Name: count, dtype: int64
gdf_geolife_sample_slow['labels'].nunique()
22
gdf_geolife_sample_slow
Latitude Longitude Altitude Datetime geometry x y time_diff distance speed unixtime labels
18431 39.978928 116.325205 406 2008-10-28 00:00:43 POINT (442381.765 4425636.443) 442381.764750 4.425636e+06 5.0 5.305953 1.061191 1225152043 -1
18432 39.978911 116.325194 414 2008-10-28 00:00:48 POINT (442380.811 4425634.563) 442380.811214 4.425635e+06 5.0 2.107762 0.421552 1225152048 -1
18433 39.979037 116.325211 422 2008-10-28 00:01:13 POINT (442382.369 4425648.537) 442382.368622 4.425649e+06 25.0 14.060423 0.562417 1225152073 -1
18436 39.978802 116.325175 398 2008-10-28 00:01:22 POINT (442379.097 4425622.477) 442379.097307 4.425622e+06 5.0 1.334667 0.266933 1225152082 0
18437 39.978815 116.325199 394 2008-10-28 00:01:27 POINT (442381.158 4425623.904) 442381.157517 4.425624e+06 5.0 2.506366 0.501273 1225152087 0
... ... ... ... ... ... ... ... ... ... ... ... ...
19621 39.982950 116.326545 1817 2008-10-28 23:59:22 POINT (442499.555 4426081.984) 442499.554866 4.426082e+06 5.0 2.849030 0.569806 1225238362 20
19622 39.982934 116.326552 1818 2008-10-28 23:59:27 POINT (442500.139 4426080.204) 442500.139127 4.426080e+06 5.0 1.873791 0.374758 1225238367 20
19623 39.982920 116.326564 1818 2008-10-28 23:59:32 POINT (442501.152 4426078.642) 442501.151975 4.426079e+06 5.0 1.861318 0.372264 1225238372 20
19624 39.982920 116.326564 1818 2008-10-28 23:59:35 POINT (442501.152 4426078.642) 442501.151975 4.426079e+06 3.0 0.000000 0.000000 1225238375 20
19625 39.982913 116.326569 1818 2008-10-28 23:59:37 POINT (442501.573 4426077.862) 442501.573017 4.426078e+06 2.0 0.886527 0.443264 1225238377 20

287 rows × 12 columns
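As a follow-up sketch (assuming gdf_geolife_sample_slow with the labels column assigned above), each ST-cluster can be summarized as a candidate stay point with a mean location, a start and end time, and a dwell duration:

# Summarize each ST-cluster (excluding noise) as a candidate stay point:
# mean location, first and last timestamp, and dwell duration in minutes
stay_points = (
    gdf_geolife_sample_slow[gdf_geolife_sample_slow['labels'] != -1]
    .groupby('labels')
    .agg(lat=('Latitude', 'mean'),
         lon=('Longitude', 'mean'),
         start=('Datetime', 'min'),
         end=('Datetime', 'max'))
)
stay_points['duration_min'] = (stay_points['end'] - stay_points['start']).dt.total_seconds() / 60
stay_points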

from matplotlib.colors import rgb2hex

# Define and convert palette (excluding noise)
unique_labels = sorted(gdf_geolife_sample_slow['labels'].unique())
non_noise_labels = [label for label in unique_labels if label != -1]

palette_rgb = sns.color_palette("Set2", n_colors=len(non_noise_labels))
palette_hex = [rgb2hex(color) for color in palette_rgb]

# Map labels to colors
label_to_color = {label: palette_hex[i] for i, label in enumerate(non_noise_labels)}
label_to_color[-1] = 'black'  # Noise points as black

# Create the map
m = folium.Map(location=[39.989614, 116.327158], zoom_start=14, tiles='CartoDB Positron')

# Plot all points with correct colors
for _, row in gdf_geolife_sample_slow.iterrows():
    label = row['labels']
    folium.CircleMarker(
        location=[row.Latitude, row.Longitude],
        radius=2,
        color=label_to_color[label],
        fill=True,
        fill_color=label_to_color[label],
        fill_opacity=0.8,
        popup=f"Cluster {label}" if label != -1 else "Noise"
    ).add_to(m)

# Build the legend HTML manually
legend_html = """
<div style="
    position: fixed;
    bottom: 50px; left: 50px; width: 180px; height: auto;
    background-color: white;
    border:2px solid grey;
    z-index:9999;
    font-size:14px;
    padding: 10px;
    box-shadow: 2px 2px 5px rgba(0,0,0,0.3);
">
<b>Legend</b><br>
"""

# Add entries for each label
for label in label_to_color:
    color = label_to_color[label]
    label_name = f"Cluster {label}" if label != -1 else "Noise"
    legend_html += f'<i style="background:{color}; width:10px; height:10px; display:inline-block; margin-right:5px;"></i>{label_name}<br>'

legend_html += "</div>"

# Add the legend to the map
m.get_root().html.add_child(folium.Element(legend_html))

m
[Interactive folium map of the ST-DBSCAN clusters with a legend]

Quiz#

  • Q1: For the road accident data, identify the clusters of road accidents in Hull (use local_authority_ons_district == 'E06000010'); use DBSCAN to detect the local clusters (eps = 50m).

  • Q2: Plot the clusters on the interactive map.

  • Q3: Output the cluster’s attribute distribution, like accident_severity. All the schema info can be found in data/dft-road-casualty-statistics-road-safety-open-dataset-data-guide-2024.xlsx

Q1 solution

# select the road accident data in Hull city
df_accident_hull = df_accident[df_accident.local_authority_ons_district == 'E06000010']
# here we use eps = 100 m and min_samples = 8 to detect the local clusters
dbscan_hull = DBSCAN(eps=100, min_samples=8)
# Fit DBSCAN to the data
dbscan_hull.fit(df_accident_hull[['location_easting_osgr', 'location_northing_osgr']])
# Retrieve the cluster labels
df_accident_hull['dbscan_labels'] = dbscan_hull.labels_
# the number of unique labels (including the noise label) is 5, i.e. 4 clusters plus noise
df_accident_hull['dbscan_labels'].nunique()
5
df_accident_hull['dbscan_labels'].value_counts()
dbscan_labels
-1    656
 1     10
 0      9
 2      9
 3      8
Name: count, dtype: int64

Q2 solution

# to plot the clusters on the interactive map, each cluster is represented by a different color
# define the colors for clusters
palette = sns.color_palette("Set1", n_colors=df_accident_hull['dbscan_labels'].nunique()-1)
hex_colors = [mcolors.to_hex(color) for color in palette]
import folium
# Create a map centered around Hull
m_hull = folium.Map(location=[53.7446, -0.3364], zoom_start=13, tiles='CartoDB Positron')
# Add road accidents to the map
for _, row in df_accident_hull.iterrows():
    if row['dbscan_labels'] != -1:  # Exclude noise points
        folium.CircleMarker(
            location=[row['latitude'], row['longitude']],
            radius=5,
            color=hex_colors[row['dbscan_labels']],
            fill=True,
            fill_color=hex_colors[row['dbscan_labels']],
            fill_opacity=0.8
        ).add_to(m_hull)
    else:
        folium.CircleMarker(
            location=[row['latitude'], row['longitude']],
            radius=1,
            color='gray', # noise points in gray and smaller
            fill=True,
            fill_color='gray',
            fill_opacity=0.8
        ).add_to(m_hull)
m_hull
[Interactive folium map of the DBSCAN clusters in Hull]

Q3 solution

df_accident_hull
accident_index accident_year accident_reference location_easting_osgr location_northing_osgr longitude latitude police_force accident_severity number_of_vehicles ... weather_conditions road_surface_conditions special_conditions_at_site carriageway_hazards urban_or_rural_area did_police_officer_attend_scene_of_accident trunk_road_flag lsoa_of_accident_location enhanced_severity_collision dbscan_labels
42911 2023161259218 2023 161259218 512565.0 430430.0 -0.294088 53.758080 16 3 2 ... 1 2 0 0 1 1 2 E01012894 3 -1
42913 2023161259253 2023 161259253 506690.0 428678.0 -0.383764 53.743573 16 3 2 ... 1 1 0 0 1 1 2 E01012865 3 -1
42915 2023161259587 2023 161259587 508294.0 427894.0 -0.359728 53.736199 16 2 2 ... 4 2 0 0 1 1 2 E01012889 5 -1
42919 2023161259760 2023 161259760 506591.0 431025.0 -0.384456 53.764680 16 2 2 ... 1 1 0 0 1 1 2 E01012794 7 -1
42920 2023161259768 2023 161259768 512239.0 429220.0 -0.299469 53.747279 16 3 1 ... 1 1 0 0 1 1 1 E01012848 3 -1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
44870 2023161394255 2023 161394255 510320.0 429252.0 -0.328544 53.747976 16 3 2 ... 1 1 0 0 1 3 2 E01012809 3 -1
44871 2023161394288 2023 161394288 509768.0 430438.0 -0.336490 53.758747 16 3 2 ... 1 1 0 0 1 2 2 E01012867 3 -1
44872 2023161394309 2023 161394309 507077.0 429479.0 -0.377622 53.750691 16 3 1 ... 2 2 0 0 1 1 2 E01012862 3 -1
44874 2023161395038 2023 161395038 512127.0 432042.0 -0.300143 53.772656 16 3 1 ... 1 1 0 0 1 1 2 E01012812 3 -1
44876 2023161396602 2023 161396602 510113.0 433789.0 -0.330065 53.788781 16 3 2 ... 1 1 0 0 2 2 2 E01012789 3 -1

692 rows × 38 columns

accident_severity = {
    1: "Fatal",
    2: "Serious",
    3: "Slight"
}
for label in df_accident_hull['dbscan_labels'].unique():
    if label != -1:  # Exclude noise points
        cluster_data = df_accident_hull[df_accident_hull['dbscan_labels'] == label]
        severity_counts = cluster_data['accident_severity'].value_counts().to_dict()
        print(f"Cluster {label} severity distribution:")
        for severity, count in severity_counts.items():
            print(f"  {accident_severity[severity]}: {count}")
Cluster 0 severity distribution:
  Serious: 5
  Slight: 4
Cluster 3 severity distribution:
  Slight: 8
Cluster 1 severity distribution:
  Slight: 7
  Serious: 3
Cluster 2 severity distribution:
  Slight: 7
  Serious: 2