2 APM and Apriori#
Association pattern mining (APM) is a technique used in data mining to identify frequent patterns or associations among a set of items in large datasets.
2.1 The metrics used in APM#
These metrics quantify the strength and significance of the relationships discovered in a dataset.
2.1.1 Support#
The support of an item set is the proportion of transactions in the dataset in which the item set appears; it indicates how frequently the item set occurs.
Transactions refer to individual instances or records in a dataset that contain a collection of items.
\(\text{Support}(A \Rightarrow B) = \frac{\text{Freq.} (A \text{ and } B)}{\text{Total number of transactions}}\)
import warnings
warnings.filterwarnings('ignore')  # silence library warnings to keep the output clean
import pandas as pd
# Define the items and records/transactions
data = {'Bread': [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1],
'Milk': [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
'Cheese':[0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
'Eggs': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
'Yogurt':[0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1]
}
# Build the df from the records/transactions
df = pd.DataFrame(data)
df
| | Bread | Milk | Cheese | Eggs | Yogurt |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 1 | 1 | 1 |
3 | 0 | 0 | 1 | 1 | 1 |
4 | 1 | 1 | 1 | 1 | 1 |
5 | 1 | 0 | 0 | 0 | 1 |
6 | 1 | 1 | 1 | 1 | 1 |
7 | 1 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 1 | 1 | 0 |
9 | 0 | 1 | 1 | 1 | 1 |
10 | 1 | 1 | 0 | 0 | 1 |
11 | 0 | 0 | 0 | 0 | 0 |
12 | 1 | 1 | 0 | 0 | 0 |
13 | 1 | 0 | 0 | 0 | 1 |
14 | 0 | 0 | 0 | 0 | 0 |
15 | 1 | 1 | 0 | 0 | 0 |
16 | 1 | 1 | 1 | 1 | 1 |
17 | 0 | 0 | 1 | 1 | 1 |
18 | 0 | 1 | 1 | 1 | 0 |
19 | 1 | 0 | 1 | 1 | 1 |
Example: \(\text{Support}(Bread \Rightarrow Cheese)\)
# Count transactions where Bread and Cheese both appear, i.e., the rule of Bread -> Cheese
both_present = df[(df['Bread'] == 1) & (df['Cheese'] == 1)]
# Count these transactions
both_present_count = len(both_present)
# Calculate
total_transactions = len(df)
support = both_present_count / total_transactions
print(f"Support of Bread => Cheese: {support:.2f}")
Support of Bread => Cheese: 0.25
2.1.2 Confidence#
Confidence measures the strength of association between two items in APM. It is the number of transactions that contain both items divided by the number of transactions that contain the antecedent A (whether or not B is present).
\(\text{Confidence}(A \Rightarrow B) = \frac{\text{Freq.}(A \text{ and } B)}{ \text{Freq.}(A)}\)
Example: \(\text{Confidence}(Bread \Rightarrow Cheese)\)
# Count transactions where Bread and Cheese both appear, i.e., the rule of Bread -> Cheese
both_present = df[(df['Bread'] == 1) & (df['Cheese'] == 1)]
# Count these transactions
both_present_count = len(both_present)
# Calculate
lhs_present_count = len(df[(df['Bread'] == 1)])
confidence = both_present_count / lhs_present_count
print(f"Confidence of Bread => Cheese: {confidence:.2f}")
Confidence of Bread => Cheese: 0.45
2.1.3 Completeness#
Completeness assesses the association rule (A => B) from the consequent's side: it quantifies what fraction of the transactions containing B (the consequent) also contain A (the antecedent), indicating how well the rule covers the occurrences of B in the dataset.
\(\text{Completeness}(A \Rightarrow B) = \frac{\text{Freq.}(A \text{ and } B)}{ \text{Freq.}(B)}\)
Example: \(\text{Completeness}(Bread \Rightarrow Cheese)\)
# Count transactions where Bread and Cheese both appear, i.e., the rule of Bread -> Cheese
both_present = df[(df['Bread'] == 1) & (df['Cheese'] == 1)]
# Count these transactions
both_present_count = len(both_present)
# Calculate
rhs_present_count = len(df[(df['Cheese'] == 1)])
completeness = both_present_count / rhs_present_count
print(f"Completeness of Bread => Cheese: {completeness:.2f}")
Completeness of Bread => Cheese: 0.50
2.1.4 Lift#
Lift quantifies how much more often items A and B occur together than expected if they were statistically independent.
\(\text{Lift}(A \Rightarrow B) = \frac{\text{Support}(A \text{ and } B)}{ \text{Support}(A) \times \text{Support}(B) }\)
Lift > 1: A and B are positively associated (potentially useful).
Lift = 1: A and B are independent (likely not useful).
Lift < 1: A and B are negatively associated (they occur together less often than expected).
Example: \(\text{Lift}(Bread \Rightarrow Cheese)\)
lhs_rhs_support = support # see the support example: support of (bread and cheese)
lhs_support = lhs_present_count / total_transactions # support of bread
rhs_support = rhs_present_count / total_transactions # support of cheese
# calculating lift
lift = lhs_rhs_support / (lhs_support * rhs_support)
print(f"Lift of Bread => Cheese: {lift:.2f}")
Lift of Bread => Cheese: 0.91
2.1.5 Conviction#
Conviction essentially compares how often A would occur without B if the two were independent against how often A actually occurs without B.
A high conviction value indicates that the rule A ⇒ B is unlikely to happen by chance, suggesting a strong association between A and B.
Conversely, a low conviction value implies that A and B occurring together might be due to chance, indicating a weaker association.
\(\text{Conviction}(A \Rightarrow B) = \frac{1 - \text{Support}(B)}{1 - \text{Confidence}(A \Rightarrow B)}\)
Example: \(\text{Conviction}(Bread \Rightarrow Cheese)\)
conviction = (1 - rhs_support) / (1 - confidence)
print(f"Conviction of Bread => Cheese: {conviction:.2f}")
Conviction of Bread => Cheese: 0.92
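To tie the five metrics together, here is a small helper of our own (the name rule_metrics and its signature are not from any library) that computes all of them for a one-item rule A ⇒ B on a 0/1-encoded DataFrame such as df above; note it does not guard against division by zero (e.g., a rule with confidence exactly 1).
# sketch: bundle the five APM metrics for a single-item rule A => B
def rule_metrics(df, antecedent, consequent):
    n = len(df)
    freq_a = (df[antecedent] == 1).sum()
    freq_b = (df[consequent] == 1).sum()
    freq_ab = ((df[antecedent] == 1) & (df[consequent] == 1)).sum()
    support = freq_ab / n
    confidence = freq_ab / freq_a
    return {'support': support,
            'confidence': confidence,
            'completeness': freq_ab / freq_b,
            'lift': support / ((freq_a / n) * (freq_b / n)),
            'conviction': (1 - freq_b / n) / (1 - confidence)}
# reproduces the worked examples above
rule_metrics(df, 'Bread', 'Cheese')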
2.2 Apriori algorithm#
The Apriori algorithm, introduced in ‘Fast Algorithms for Mining Association Rules’ by Rakesh Agrawal and Ramakrishnan Srikant in 1994, is a foundational tool in data mining for discovering frequent item sets and generating association rules.
This algorithm works on transaction databases, such as those recording items purchased by customers in a supermarket, with the primary goal of identifying item sets that frequently appear together. Its key insight is the downward-closure (anti-monotone) property: every subset of a frequent item set must itself be frequent, so the level-wise search can discard many candidates early.
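Before turning to the mlxtend implementation below, the level-wise idea can be sketched in plain Python. The following is our own illustration (not mlxtend's code): join frequent (k-1)-item sets into k-item candidates, prune any candidate that has an infrequent subset, then count support and repeat.
from itertools import combinations

def apriori_sketch(transactions, min_support):
    n = len(transactions)
    # level 1: frequent individual items
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]): sum(i in t for t in transactions) / n for i in items}
    freq = {s: sup for s, sup in freq.items() if sup >= min_support}
    result = dict(freq)
    k = 2
    while freq:
        # join step: merge frequent (k-1)-item sets into k-item candidates
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # count step: keep candidates that meet the support threshold
        freq = {}
        for c in candidates:
            sup = sum(c <= set(t) for t in transactions) / n
            if sup >= min_support:
                freq[c] = sup
        result.update(freq)
        k += 1
    return result

# toy usage: item sets appearing in at least half of four transactions
apriori_sketch([{'bread', 'milk'}, {'bread', 'cheese'},
                {'bread', 'milk', 'cheese'}, {'milk'}], min_support=0.5)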
# ! pip install mlxtend
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
# read csv as df
df_cs = pd.read_csv('cakeshop.csv')
df_cs
| | TransactionID | BREAD | MILK | JAM | EGGS | BUTTER | SUGAR | FLOUR | CHOCOLATE | YEAST | CANDLES |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bread | Milk | Jam | Eggs | Butter | Sugar | Flour | Chocolate | Yeast | Candles |
1 | 2 | NaN | Milk | NaN | Eggs | NaN | Sugar | Flour | Chocolate | NaN | NaN |
2 | 3 | NaN | Milk | NaN | Eggs | NaN | Sugar | NaN | Chocolate | NaN | NaN |
3 | 4 | NaN | Milk | NaN | Eggs | Butter | NaN | NaN | Chocolate | NaN | Candles |
4 | 5 | NaN | Milk | NaN | Eggs | Butter | NaN | Flour | Chocolate | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | 496 | Bread | Milk | NaN | NaN | Butter | Sugar | Flour | NaN | NaN | Candles |
496 | 497 | Bread | Milk | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Candles |
497 | 498 | Bread | Milk | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Candles |
498 | 499 | NaN | Milk | NaN | Eggs | NaN | NaN | NaN | NaN | Yeast | Candles |
499 | 500 | Bread | Milk | Jam | Eggs | Butter | Sugar | NaN | NaN | Yeast | Candles |
500 rows × 11 columns
# delete the TransactionID column
del df_cs['TransactionID']
df_cs
| | BREAD | MILK | JAM | EGGS | BUTTER | SUGAR | FLOUR | CHOCOLATE | YEAST | CANDLES |
---|---|---|---|---|---|---|---|---|---|---|
0 | Bread | Milk | Jam | Eggs | Butter | Sugar | Flour | Chocolate | Yeast | Candles |
1 | NaN | Milk | NaN | Eggs | NaN | Sugar | Flour | Chocolate | NaN | NaN |
2 | NaN | Milk | NaN | Eggs | NaN | Sugar | NaN | Chocolate | NaN | NaN |
3 | NaN | Milk | NaN | Eggs | Butter | NaN | NaN | Chocolate | NaN | Candles |
4 | NaN | Milk | NaN | Eggs | Butter | NaN | Flour | Chocolate | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | Bread | Milk | NaN | NaN | Butter | Sugar | Flour | NaN | NaN | Candles |
496 | Bread | Milk | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Candles |
497 | Bread | Milk | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Candles |
498 | NaN | Milk | NaN | Eggs | NaN | NaN | NaN | NaN | Yeast | Candles |
499 | Bread | Milk | Jam | Eggs | Butter | Sugar | NaN | NaN | Yeast | Candles |
500 rows × 10 columns
One-hot encoding
# Convert categorical variables into dummy/indicator variables (ones and zeros): one-hot encoding
df_cs = pd.get_dummies(df_cs)
df_cs
| | BREAD_Bread | MILK_Milk | JAM_Jam | EGGS_Eggs | BUTTER_Butter | SUGAR_Sugar | FLOUR_Flour | CHOCOLATE_Chocolate | YEAST_Yeast | CANDLES_Candles |
---|---|---|---|---|---|---|---|---|---|---|
0 | True | True | True | True | True | True | True | True | True | True |
1 | False | True | False | True | False | True | True | True | False | False |
2 | False | True | False | True | False | True | False | True | False | False |
3 | False | True | False | True | True | False | False | True | False | True |
4 | False | True | False | True | True | False | True | True | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | True | True | False | False | True | True | True | False | False | True |
496 | True | True | False | False | False | False | False | False | False | True |
497 | True | True | False | False | False | False | False | False | False | True |
498 | False | True | False | True | False | False | False | False | True | True |
499 | True | True | True | True | True | True | False | False | True | True |
500 rows × 10 columns
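As an aside, the TransactionEncoder imported earlier serves the same purpose as get_dummies when the data arrive as lists of items per transaction rather than one column per item; a minimal illustration with made-up transactions:
# TransactionEncoder turns lists of items into a boolean item matrix
sample = [['Bread', 'Milk'], ['Milk', 'Eggs'], ['Bread', 'Milk', 'Eggs']]
te = TransactionEncoder()
te_matrix = te.fit(sample).transform(sample)
pd.DataFrame(te_matrix, columns=te.columns_)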
2.2.1 Calculating support using apriori#
# Applying the Apriori algorithm to identify frequent items in df_cs
# min_support=0.2: Only include item sets that appear in at least 20% of transactions
# use_colnames=True: Use actual column names instead of indices for item sets
df_frequent = apriori(df_cs, min_support=0.2, use_colnames=True)
df_frequent
| | support | itemsets |
---|---|---|
0 | 0.712 | (BREAD_Bread) |
1 | 0.802 | (MILK_Milk) |
2 | 0.246 | (JAM_Jam) |
3 | 0.336 | (EGGS_Eggs) |
4 | 0.392 | (BUTTER_Butter) |
5 | 0.592 | (MILK_Milk, BREAD_Bread) |
6 | 0.240 | (EGGS_Eggs, BREAD_Bread) |
7 | 0.246 | (BUTTER_Butter, BREAD_Bread) |
8 | 0.284 | (EGGS_Eggs, MILK_Milk) |
9 | 0.258 | (BUTTER_Butter, MILK_Milk) |
10 | 0.208 | (MILK_Milk, BREAD_Bread, EGGS_Eggs) |
# get the length of each item set
df_frequent['length'] = df_frequent['itemsets'].apply(len)
df_frequent
| | support | itemsets | length |
---|---|---|---|
0 | 0.712 | (BREAD_Bread) | 1 |
1 | 0.802 | (MILK_Milk) | 1 |
2 | 0.246 | (JAM_Jam) | 1 |
3 | 0.336 | (EGGS_Eggs) | 1 |
4 | 0.392 | (BUTTER_Butter) | 1 |
5 | 0.592 | (MILK_Milk, BREAD_Bread) | 2 |
6 | 0.240 | (EGGS_Eggs, BREAD_Bread) | 2 |
7 | 0.246 | (BUTTER_Butter, BREAD_Bread) | 2 |
8 | 0.284 | (EGGS_Eggs, MILK_Milk) | 2 |
9 | 0.258 | (BUTTER_Butter, MILK_Milk) | 2 |
10 | 0.208 | (MILK_Milk, BREAD_Bread, EGGS_Eggs) | 3 |
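With the length column in place, frequent item sets can be sliced by size and support in one step; the thresholds below are illustrative:
# e.g. keep only the 2-item sets appearing in at least 25% of transactions
df_frequent[(df_frequent['length'] == 2) & (df_frequent['support'] >= 0.25)]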
2.2.2 Metrics for association_rules#
# return all the rules (and related metrics) with confidence above 0.7
association_rules(df_frequent, metric='confidence', min_threshold=0.7)
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (MILK_Milk) | (BREAD_Bread) | 0.802 | 0.712 | 0.592 | 0.738155 | 1.036734 | 1.0 | 0.020976 | 1.099886 | 0.178952 | 0.642082 | 0.090815 | 0.784808 |
1 | (BREAD_Bread) | (MILK_Milk) | 0.712 | 0.802 | 0.592 | 0.831461 | 1.036734 | 1.0 | 0.020976 | 1.174800 | 0.123029 | 0.642082 | 0.148791 | 0.784808 |
2 | (EGGS_Eggs) | (BREAD_Bread) | 0.336 | 0.712 | 0.240 | 0.714286 | 1.003210 | 1.0 | 0.000768 | 1.008000 | 0.004819 | 0.297030 | 0.007937 | 0.525682 |
3 | (EGGS_Eggs) | (MILK_Milk) | 0.336 | 0.802 | 0.284 | 0.845238 | 1.053913 | 1.0 | 0.014528 | 1.279385 | 0.077041 | 0.332553 | 0.218374 | 0.599676 |
4 | (EGGS_Eggs, MILK_Milk) | (BREAD_Bread) | 0.284 | 0.712 | 0.208 | 0.732394 | 1.028644 | 1.0 | 0.005792 | 1.076211 | 0.038891 | 0.263959 | 0.070814 | 0.512265 |
5 | (EGGS_Eggs, BREAD_Bread) | (MILK_Milk) | 0.240 | 0.802 | 0.208 | 0.866667 | 1.080632 | 1.0 | 0.015520 | 1.485000 | 0.098178 | 0.249400 | 0.326599 | 0.563009 |
# return all the rules with lift above 1.05
association_rules(df_frequent, metric='lift', min_threshold=1.05)
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (EGGS_Eggs) | (MILK_Milk) | 0.336 | 0.802 | 0.284 | 0.845238 | 1.053913 | 1.0 | 0.014528 | 1.279385 | 0.077041 | 0.332553 | 0.218374 | 0.599676 |
1 | (MILK_Milk) | (EGGS_Eggs) | 0.802 | 0.336 | 0.284 | 0.354115 | 1.053913 | 1.0 | 0.014528 | 1.028046 | 0.258358 | 0.332553 | 0.027281 | 0.599676 |
2 | (EGGS_Eggs, BREAD_Bread) | (MILK_Milk) | 0.240 | 0.802 | 0.208 | 0.866667 | 1.080632 | 1.0 | 0.015520 | 1.485000 | 0.098178 | 0.249400 | 0.326599 | 0.563009 |
3 | (MILK_Milk) | (EGGS_Eggs, BREAD_Bread) | 0.802 | 0.240 | 0.208 | 0.259352 | 1.080632 | 1.0 | 0.015520 | 1.026128 | 0.376845 | 0.249400 | 0.025463 | 0.563009 |
# define a df of rules for further selection
df_rules = association_rules(df_frequent, metric='lift', min_threshold=1.05)
df_rules['antecedent_len'] = df_rules['antecedents'].apply(len)
df_rules
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski | antecedent_len |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (EGGS_Eggs) | (MILK_Milk) | 0.336 | 0.802 | 0.284 | 0.845238 | 1.053913 | 1.0 | 0.014528 | 1.279385 | 0.077041 | 0.332553 | 0.218374 | 0.599676 | 1 |
1 | (MILK_Milk) | (EGGS_Eggs) | 0.802 | 0.336 | 0.284 | 0.354115 | 1.053913 | 1.0 | 0.014528 | 1.028046 | 0.258358 | 0.332553 | 0.027281 | 0.599676 | 1 |
2 | (EGGS_Eggs, BREAD_Bread) | (MILK_Milk) | 0.240 | 0.802 | 0.208 | 0.866667 | 1.080632 | 1.0 | 0.015520 | 1.485000 | 0.098178 | 0.249400 | 0.326599 | 0.563009 | 2 |
3 | (MILK_Milk) | (EGGS_Eggs, BREAD_Bread) | 0.802 | 0.240 | 0.208 | 0.259352 | 1.080632 | 1.0 | 0.015520 | 1.026128 | 0.376845 | 0.249400 | 0.025463 | 0.563009 | 1 |
# keep rules whose antecedent contains at least two items
df_rules.query('antecedent_len >= 2')
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski | antecedent_len |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | (EGGS_Eggs, BREAD_Bread) | (MILK_Milk) | 0.24 | 0.802 | 0.208 | 0.866667 | 1.080632 | 1.0 | 0.01552 | 1.485 | 0.098178 | 0.2494 | 0.326599 | 0.563009 | 2 |
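Because antecedents and consequents are stored as frozensets, rules can also be filtered by the items they mention; for example:
# rules whose antecedent contains EGGS_Eggs (frozensets support membership tests)
df_rules[df_rules['antecedents'].apply(lambda s: 'EGGS_Eggs' in s)]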
2.3 APM used in road accident data#
In this section, we apply APM to road accident data (‘accident_data_v1.0.0_2023.db’) as an example.
Schema information can be found in ‘dft-road-casualty-statistics-road-safety-open-dataset-data-guide-2023-1.xlsx’ on the Project Report page in Canvas
2.3.1 Database info#
import pandas as pd
import sqlite3
# Connect to accident database
conn = sqlite3.connect('accident_data_v1.0.0_2023.db')
# create cursor handling SQL
cur = conn.cursor()
# output all the tables in db
res_tables = cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
table_names = res_tables.fetchall()
print(table_names)
[('accident',), ('casualty',), ('vehicle',), ('lsoa',)]
# each result row is a 1-tuple; index into it to get the table name
table_names[1][0]
'casualty'
# Check the foreign keys to understand the relationships across tables in the database
# as the output shows, the accident_index column is used to link tables in the db
for tn in table_names:
res_fk = cur.execute(f"PRAGMA foreign_key_list({tn[0]})")
fk_info = res_fk.fetchall()
if len(fk_info) < 1:
print(f'No foreign keys found in Table: {tn[0]}')
else:
print(f'Table {tn[0]} f_keys info --',
'id:', fk_info[0][0], '--',
'seq:', fk_info[0][1],'--',
'table:', fk_info[0][2],'--',
'from:', fk_info[0][3],'--',
'to:', fk_info[0][4],'--',
'on_update:', fk_info[0][5],'--',
'on_delete:', fk_info[0][6],'--',
'match:', fk_info[0][7])
No foreign keys found in Table: accident
Table casualty f_keys info -- id: 0 -- seq: 0 -- table: accident -- from: accident_index -- to: accident_index -- on_update: NO ACTION -- on_delete: NO ACTION -- match: NONE
Table vehicle f_keys info -- id: 0 -- seq: 0 -- table: accident -- from: accident_index -- to: accident_index -- on_update: NO ACTION -- on_delete: NO ACTION -- match: NONE
No foreign keys found in Table: lsoa
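Since accident_index links the tables, a join query shows the relationship in practice; a sketch (the vehicle_type column is an assumption based on the DfT schema guide):
# join accident and vehicle on the shared accident_index key
pd.read_sql("""SELECT a.accident_index, a.accident_severity, v.vehicle_type
               FROM accident AS a
               JOIN vehicle AS v ON a.accident_index = v.accident_index
               LIMIT 5;""", conn)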
2.3.2 APM in accident table#
# using pandas to read table accident
df_ac = pd.read_sql("SELECT * FROM accident;", conn)
# This is an accident-level dataset: each row represents one accident, identified by 'accident_index'
df_ac
| | accident_index | accident_year | accident_reference | location_easting_osgr | location_northing_osgr | longitude | latitude | police_force | accident_severity | number_of_vehicles | ... | pedestrian_crossing_physical_facilities | light_conditions | weather_conditions | road_surface_conditions | special_conditions_at_site | carriageway_hazards | urban_or_rural_area | did_police_officer_attend_scene_of_accident | trunk_road_flag | lsoa_of_accident_location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017010001708 | 2017 | 010001708 | 532920.0 | 196330.0 | -0.080107 | 51.650061 | 1 | 1 | 2 | ... | 0 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | 2 | E01001450 |
1 | 2017010009342 | 2017 | 010009342 | 526790.0 | 181970.0 | -0.173845 | 51.522425 | 1 | 3 | 2 | ... | 0 | 4 | 1 | 2 | 0 | 0 | 1 | 1 | 2 | E01004702 |
2 | 2017010009344 | 2017 | 010009344 | 535200.0 | 181260.0 | -0.052969 | 51.514096 | 1 | 3 | 3 | ... | 0 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | 2 | E01004298 |
3 | 2017010009348 | 2017 | 010009348 | 534340.0 | 193560.0 | -0.060658 | 51.624832 | 1 | 3 | 2 | ... | 4 | 4 | 2 | 2 | 0 | 0 | 1 | 1 | 2 | E01001429 |
4 | 2017010009350 | 2017 | 010009350 | 533680.0 | 187820.0 | -0.072372 | 51.573408 | 1 | 2 | 1 | ... | 5 | 4 | 1 | 2 | 0 | 0 | 1 | 1 | 2 | E01001808 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
461347 | 2020991027064 | 2020 | 991027064 | 343034.0 | 731654.0 | -2.926320 | 56.473539 | 99 | 2 | 2 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | -1 | -1 |
461348 | 2020991029573 | 2020 | 991029573 | 257963.0 | 658891.0 | -4.267565 | 55.802353 | 99 | 3 | 1 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 2 | -1 | -1 |
461349 | 2020991030297 | 2020 | 991030297 | 383664.0 | 810646.0 | -2.271903 | 57.186317 | 99 | 2 | 2 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 2 | 1 | -1 | -1 |
461350 | 2020991030900 | 2020 | 991030900 | 277161.0 | 674852.0 | -3.968753 | 55.950940 | 99 | 3 | 2 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 2 | -1 | -1 |
461351 | 2020991032575 | 2020 | 991032575 | 240402.0 | 681950.0 | -4.561040 | 56.003843 | 99 | 3 | 1 | ... | 0 | 1 | 1 | 1 | 0 | 2 | 1 | 1 | -1 | -1 |
461352 rows × 36 columns
You can use pandas .describe() or .info() to output basic information about a df
You can also use libraries such as ydata-profiling, sweetviz and dtale for the initial EDA
df_ac.describe()
| | accident_year | location_easting_osgr | location_northing_osgr | longitude | latitude | police_force | accident_severity | number_of_vehicles | number_of_casualties | day_of_week | ... | pedestrian_crossing_human_control | pedestrian_crossing_physical_facilities | light_conditions | weather_conditions | road_surface_conditions | special_conditions_at_site | carriageway_hazards | urban_or_rural_area | did_police_officer_attend_scene_of_accident | trunk_road_flag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 461352.000000 | 461236.000000 | 4.612360e+05 | 461226.000000 | 461226.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | ... | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 |
mean | 2018.368378 | 452593.115756 | 2.801803e+05 | -1.246998 | 52.408849 | 28.193388 | 2.784436 | 1.840441 | 1.301245 | 4.108171 | ... | 0.262112 | 1.099518 | 2.036235 | 1.649918 | 1.385239 | 0.217519 | 0.170518 | 1.325463 | 1.334877 | 1.680147 |
std | 1.091566 | 94822.718705 | 1.515501e+05 | 1.389702 | 1.365101 | 24.801609 | 0.443648 | 0.709869 | 0.746398 | 1.927216 | ... | 1.460680 | 2.333113 | 1.724691 | 1.819426 | 0.955180 | 1.240981 | 1.121964 | 0.469057 | 0.556352 | 0.851554 |
min | 2017.000000 | 64084.000000 | 1.023500e+04 | -7.525273 | 49.912362 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 |
25% | 2017.000000 | 388539.000000 | 1.755300e+05 | -2.172668 | 51.465689 | 5.000000 | 3.000000 | 1.000000 | 1.000000 | 2.000000 | ... | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 2.000000 |
50% | 2018.000000 | 459198.500000 | 2.208650e+05 | -1.126264 | 51.870257 | 22.000000 | 3.000000 | 2.000000 | 1.000000 | 4.000000 | ... | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 2.000000 |
75% | 2019.000000 | 529290.000000 | 3.865510e+05 | -0.136389 | 53.372899 | 45.000000 | 3.000000 | 2.000000 | 1.000000 | 6.000000 | ... | 0.000000 | 0.000000 | 4.000000 | 1.000000 | 2.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 |
max | 2020.000000 | 655391.000000 | 1.209512e+06 | 1.759641 | 60.763722 | 99.000000 | 3.000000 | 24.000000 | 59.000000 | 7.000000 | ... | 9.000000 | 9.000000 | 7.000000 | 9.000000 | 9.000000 | 9.000000 | 9.000000 | 3.000000 | 3.000000 | 2.000000 |
8 rows × 29 columns
df_ac.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 461352 entries, 0 to 461351
Data columns (total 36 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 accident_index 461352 non-null object
1 accident_year 461352 non-null int64
2 accident_reference 461352 non-null object
3 location_easting_osgr 461236 non-null float64
4 location_northing_osgr 461236 non-null float64
5 longitude 461226 non-null float64
6 latitude 461226 non-null float64
7 police_force 461352 non-null int64
8 accident_severity 461352 non-null int64
9 number_of_vehicles 461352 non-null int64
10 number_of_casualties 461352 non-null int64
11 date 461352 non-null object
12 day_of_week 461352 non-null int64
13 time 461352 non-null object
14 local_authority_district 461352 non-null int64
15 local_authority_ons_district 461352 non-null object
16 local_authority_highway 461352 non-null object
17 first_road_class 461352 non-null int64
18 first_road_number 461352 non-null int64
19 road_type 461352 non-null int64
20 speed_limit 461352 non-null int64
21 junction_detail 461352 non-null int64
22 junction_control 461352 non-null int64
23 second_road_class 461352 non-null int64
24 second_road_number 461352 non-null int64
25 pedestrian_crossing_human_control 461352 non-null int64
26 pedestrian_crossing_physical_facilities 461352 non-null int64
27 light_conditions 461352 non-null int64
28 weather_conditions 461352 non-null int64
29 road_surface_conditions 461352 non-null int64
30 special_conditions_at_site 461352 non-null int64
31 carriageway_hazards 461352 non-null int64
32 urban_or_rural_area 461352 non-null int64
33 did_police_officer_attend_scene_of_accident 461352 non-null int64
34 trunk_road_flag 461352 non-null int64
35 lsoa_of_accident_location 461352 non-null object
dtypes: float64(4), int64(25), object(7)
memory usage: 126.7+ MB
Select three columns for analysis
accident_severity, speed_limit, and weather_conditions (candidate rules take the form: if speed_limit and weather_conditions, then accident_severity)
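Before encoding, a quick look at each column's category distribution helps when choosing support thresholds; a simple check with value_counts:
# distribution of categories in each selected column
for c in ['speed_limit', 'weather_conditions', 'accident_severity']:
    print(df_ac[c].value_counts().head(), '\n')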
# select three columns
df_ac_s = df_ac[['speed_limit', 'weather_conditions', 'accident_severity']]
df_ac_s
| | speed_limit | weather_conditions | accident_severity |
---|---|---|---|
0 | 30 | 1 | 1 |
1 | 30 | 1 | 3 |
2 | 30 | 1 | 3 |
3 | 30 | 2 | 3 |
4 | 20 | 1 | 2 |
... | ... | ... | ... |
461347 | 30 | 1 | 2 |
461348 | 30 | 1 | 3 |
461349 | 60 | 1 | 2 |
461350 | 30 | 1 | 3 |
461351 | 30 | 1 | 3 |
461352 rows × 3 columns
# One-hot encode the selected columns
df_list = []
for c in ['speed_limit', 'weather_conditions', 'accident_severity']:
df_c = pd.get_dummies(df_ac_s[c], prefix=c)
df_list.append(df_c)
# axis=1 concatenates the dataframes in df_list horizontally (by columns), with indices aligned
df_oh = pd.concat(df_list, axis=1)
df_oh
| | speed_limit_-1 | speed_limit_20 | speed_limit_30 | speed_limit_40 | speed_limit_50 | speed_limit_60 | speed_limit_70 | weather_conditions_-1 | weather_conditions_1 | weather_conditions_2 | weather_conditions_3 | weather_conditions_4 | weather_conditions_5 | weather_conditions_6 | weather_conditions_7 | weather_conditions_8 | weather_conditions_9 | accident_severity_1 | accident_severity_2 | accident_severity_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False |
1 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True |
2 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True |
3 | False | False | True | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | True |
4 | False | True | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
461347 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | True | False |
461348 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True |
461349 | False | False | False | False | False | True | False | False | True | False | False | False | False | False | False | False | False | False | True | False |
461350 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True |
461351 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True |
461352 rows × 20 columns
# create frequent sets for rules
df_fre = apriori(df_oh, min_support=0.2, use_colnames=True)
df_fre
| | support | itemsets |
---|---|---|
0 | 0.595235 | (speed_limit_30) |
1 | 0.795436 | (weather_conditions_1) |
2 | 0.798299 | (accident_severity_3) |
3 | 0.476458 | (speed_limit_30, weather_conditions_1) |
4 | 0.486100 | (speed_limit_30, accident_severity_3) |
5 | 0.631260 | (accident_severity_3, weather_conditions_1) |
6 | 0.387476 | (speed_limit_30, accident_severity_3, weather_... |
# select rules with confidence >0.7
df_rule = association_rules(df_fre, metric='confidence', min_threshold=0.7)
# select rules with lift >1
df_rule.query('lift > 1')
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (speed_limit_30) | (weather_conditions_1) | 0.595235 | 0.795436 | 0.476458 | 0.800454 | 1.006308 | 1.0 | 0.002987 | 1.025146 | 0.015487 | 0.521168 | 0.024529 | 0.699722 |
1 | (speed_limit_30) | (accident_severity_3) | 0.595235 | 0.798299 | 0.486100 | 0.816651 | 1.022989 | 1.0 | 0.010924 | 1.100092 | 0.055518 | 0.535685 | 0.090985 | 0.712785 |
4 | (speed_limit_30, accident_severity_3) | (weather_conditions_1) | 0.486100 | 0.795436 | 0.387476 | 0.797113 | 1.002109 | 1.0 | 0.000815 | 1.008267 | 0.004094 | 0.433390 | 0.008199 | 0.642119 |
5 | (speed_limit_30, weather_conditions_1) | (accident_severity_3) | 0.476458 | 0.798299 | 0.387476 | 0.813243 | 1.018719 | 1.0 | 0.007120 | 1.080016 | 0.035098 | 0.436701 | 0.074088 | 0.649310 |
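To read off the strongest associations quickly, the surviving rules can be sorted by lift (or confidence); for instance:
# rank rules with lift above 1, strongest first
df_rule.query('lift > 1').sort_values('lift', ascending=False).head()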
As ‘weather_conditions_1’ represents ‘Fine no high winds’ and ‘accident_severity_3’ represents ‘Slight’, the useful rule can be read as: ‘weather is fine with no high winds’ and ‘the speed limit is 30’ \(\Rightarrow\) ‘accident severity is slight’.