{ "cells": [ { "cell_type": "markdown", "id": "66e6a8d2-74ed-4417-8cc2-329889cf77ce", "metadata": {}, "source": [ "# 3 Clustering" ] }, { "cell_type": "markdown", "id": "e4194aca-773a-47f8-8456-c4d437abd5fa", "metadata": {}, "source": [ "\n", "Clustering is a machine learning method that groups a collection of objects into clusters, with each cluster containing objects that are highly similar to each other and distinct from those in other clusters. \n" ] }, { "cell_type": "markdown", "id": "76488c51-b50e-4f16-b724-cdbf35b637b3", "metadata": {}, "source": [ "## 3.1 Gaussian Mixture Model (GMM) " ] }, { "cell_type": "markdown", "id": "8038c249-f923-41b1-8b30-15c911c886f4", "metadata": {}, "source": [ "[A Gaussian Mixture Model (GMM)](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html) is a probabilistic model that assumes that the data is generated from a mixture of several Gaussian distributions with unknown parameters. The equation for a GMM is a weighted sum of multiple Gaussian (Normal) distributions. 
\n", "The model is defined as:\n", "\n", "$p(x) = \\sum_{i=1}^{K} w_i \\mathcal{N}({x}|\\mu_i, \\sigma_i)$\n", "\n", "- $p(x)$ is the overall probability density of the data point ${x}$.\n", "- ${x}$ is the data point.\n", "- $K$ is the total number of Gaussian components (each normal distribution is treated as one cluster).\n", "- $w_i$ is the weight (relative size) of the $i$th Gaussian; the weights are non-negative and sum to 1.\n", "- $\\mathcal{N}({x}|\\mu_i, \\sigma_i)$ is the probability density of ${x}$ under the $i$th Gaussian.\n", "- $\\mu_i$ is the mean of the $i$th Gaussian.\n", "- $\\sigma_i$ describes the spread of the $i$th Gaussian: its variance in one dimension, or its covariance matrix for multi-dimensional data.\n", " " ] }, { "cell_type": "code", "id": "b4b89a43-4e5d-405b-ac13-91f2eb2a6a92", "metadata": { "ExecuteTime": { "end_time": "2025-03-28T20:48:44.591829Z", "start_time": "2025-03-28T20:48:44.589915Z" } }, "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import warnings\n", "warnings.filterwarnings('ignore')" ], "outputs": [], "execution_count": 45 }, { "cell_type": "markdown", "id": "2b3cbcfb-6bd2-4438-be02-1cc1d1fa2f56", "metadata": {}, "source": [ "__You can download the sample data from [here](https://hullacuk-my.sharepoint.com/:x:/g/personal/tongxin_chen_hull_ac_uk/EX-R87tgUxlCupINUCh4dgYBAB-1TH765Eh0ujdkIO89NQ?e=5JH382).__\n", "\n", "Download the data file and place it in the same folder as this Jupyter Notebook." 
] }, { "cell_type": "code", "id": "bcc48218-0ad0-4ed7-b9c6-df0164808c17", "metadata": { "ExecuteTime": { "end_time": "2025-03-28T20:48:44.613426Z", "start_time": "2025-03-28T20:48:44.610812Z" } }, "source": [ "# Read the data\n", "df = pd.read_csv('cluster_data.csv')" ], "outputs": [], "execution_count": 46 }, { "cell_type": "code", "id": "166f6610-630a-4f61-89d5-c1029cc1c6e0", "metadata": { "ExecuteTime": { "end_time": "2025-03-28T20:48:44.622403Z", "start_time": "2025-03-28T20:48:44.619309Z" } }, "source": [ "df" ], "outputs": [ { "data": { "text/plain": [ " Feature1 Feature2\n", "0 4.862775 1.011907\n", "1 5.743938 0.472795\n", "2 10.177229 -1.014992\n", "3 -0.405046 1.447232\n", "4 1.260348 9.982292\n", ".. ... ...\n", "495 9.447088 10.323605\n", "496 5.335410 0.273029\n", "497 5.668767 2.307601\n", "498 8.489488 10.133309\n", "499 6.282093 -0.258828\n", "\n", "[500 rows x 2 columns]" ], "text/html": [ "
|  | Feature1 | Feature2 |
|---|---|---|
| 0 | 4.862775 | 1.011907 |
| 1 | 5.743938 | 0.472795 |
| 2 | 10.177229 | -1.014992 |
| 3 | -0.405046 | 1.447232 |
| 4 | 1.260348 | 9.982292 |
| ... | ... | ... |
| 495 | 9.447088 | 10.323605 |
| 496 | 5.335410 | 0.273029 |
| 497 | 5.668767 | 2.307601 |
| 498 | 8.489488 | 10.133309 |
| 499 | 6.282093 | -0.258828 |

500 rows × 2 columns
\n", "DBSCAN(eps=0.6)