2 APM and Apriori#
Association pattern mining (APM) is a technique used in data mining to identify frequent patterns or associations among a set of items in large datasets.
2.1 The metrics used in APM#
These metrics quantify the strength and significance of the relationships discovered in a dataset.
2.1.1 Support#
The support of an item set is the proportion of transactions in the dataset in which the item set appears; it indicates how frequently the item set occurs.
Transactions refer to individual instances or records in a dataset that contain a collection of items.
\(\text{Support}(A \Rightarrow B) = \frac{\text{Freq.} (A \text{ and } B)}{\text{Total number of transactions}}\)
import warnings
warnings.filterwarnings('ignore')  # silence library warnings to keep the output clean
import pandas as pd
# Define the items and records/transactions
data = {'Bread': [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1],
'Milk': [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
'Cheese':[0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
'Eggs': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
'Yogurt':[0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1]
}
# Build the df from the records/transactions
df = pd.DataFrame(data)
df
| | Bread | Milk | Cheese | Eggs | Yogurt |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 1 | 1 | 1 |
3 | 0 | 0 | 1 | 1 | 1 |
4 | 1 | 1 | 1 | 1 | 1 |
5 | 1 | 0 | 0 | 0 | 1 |
6 | 1 | 1 | 1 | 1 | 1 |
7 | 1 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 1 | 1 | 0 |
9 | 0 | 1 | 1 | 1 | 1 |
10 | 1 | 1 | 0 | 0 | 1 |
11 | 0 | 0 | 0 | 0 | 0 |
12 | 1 | 1 | 0 | 0 | 0 |
13 | 1 | 0 | 0 | 0 | 1 |
14 | 0 | 0 | 0 | 0 | 0 |
15 | 1 | 1 | 0 | 0 | 0 |
16 | 1 | 1 | 1 | 1 | 1 |
17 | 0 | 0 | 1 | 1 | 1 |
18 | 0 | 1 | 1 | 1 | 0 |
19 | 1 | 0 | 1 | 1 | 1 |
Example: \(\text{Support}(Bread \Rightarrow Cheese)\)
# Count transactions where Bread and Cheese both appear, i.e., the rule of Bread -> Cheese
both_present = df[(df['Bread'] == 1) & (df['Cheese'] == 1)]
# Count these transactions
both_present_count = len(both_present)
# Calculate
total_transactions = len(df)
support = both_present_count / total_transactions
print(f"Support of Bread => Cheese: {support:.2f}")
Support of Bread => Cheese: 0.25
2.1.2 Confidence#
Confidence measures the strength of association between two items in APM. It is the number of transactions that contain both items divided by the number of transactions that contain the antecedent A (whether or not B is present).
\(\text{Confidence}(A \Rightarrow B) = \frac{\text{Freq.}(A \text{ and } B)}{ \text{Freq.}(A)}\)
Example: \(\text{Confidence}(Bread \Rightarrow Cheese)\)
# Count transactions where Bread and Cheese both appear, i.e., the rule of Bread -> Cheese
both_present = df[(df['Bread'] == 1) & (df['Cheese'] == 1)]
# Count these transactions
both_present_count = len(both_present)
# Calculate
lhs_present_count = len(df[(df['Bread'] == 1)])
confidence = both_present_count / lhs_present_count
print(f"Confidence of Bread => Cheese: {confidence:.2f}")
Confidence of Bread => Cheese: 0.45
2.1.3 Completeness#
Completeness assesses the association rule (A => B) from the consequent's side: it quantifies what fraction of the transactions containing B (the consequent) also contain A (the antecedent), indicating how well the rule covers the occurrences of B in the dataset.
\(\text{Completeness}(A \Rightarrow B) = \frac{\text{Freq.}(A \text{ and } B)}{ \text{Freq.}(B)}\)
Example: \(\text{Completeness}(Bread \Rightarrow Cheese)\)
# Count transactions where Bread and Cheese both appear, i.e., the rule of Bread -> Cheese
both_present = df[(df['Bread'] == 1) & (df['Cheese'] == 1)]
# Count these transactions
both_present_count = len(both_present)
# Calculate
rhs_present_count = len(df[(df['Cheese'] == 1)])
completeness = both_present_count / rhs_present_count
print(f"Completeness of Bread => Cheese: {completeness:.2f}")
Completeness of Bread => Cheese: 0.50
2.1.4 Lift#
Lift quantifies how much more often items A and B occur together than expected if they were statistically independent.
\(\text{Lift}(A \Rightarrow B) = \frac{\text{Support}(A \text{ and } B)}{ \text{Support}(A) \times \text{Support}(B) }\)
Lift > 1: A and B are positively associated (potentially useful).
Lift = 1: A and B are independent (likely not useful).
Lift < 1: A and B are negatively associated (they occur together less often than expected).
Example: \(\text{Lift}(Bread \Rightarrow Cheese)\)
lhs_rhs_support = support # see the support example: support of (bread and cheese)
lhs_support = lhs_present_count / total_transactions # support of bread
rhs_support = rhs_present_count / total_transactions # support of cheese
# calculating lift
lift = lhs_rhs_support / (lhs_support * rhs_support)
print(f"Lift of Bread => Cheese: {lift:.2f}")
Lift of Bread => Cheese: 0.91
2.1.5 Conviction#
Conviction essentially compares how often A would occur without B if the two were independent against how often A actually occurs without B.
A high conviction value indicates that the rule A ⇒ B is unlikely to happen by chance, suggesting a strong association between A and B.
Conversely, a low conviction value implies that A and B occurring together might be due to chance, indicating a weaker association.
\(\text{Conviction}(A \Rightarrow B) = \frac{1 - \text{Support}(B)}{1 - \text{Confidence}(A \Rightarrow B)}\)
Example: \(\text{Conviction}(Bread \Rightarrow Cheese)\)
conviction = (1 - rhs_support) / (1 - confidence)
print(f"Conviction of Bread => Cheese: {conviction:.2f}")
Conviction of Bread => Cheese: 0.92
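To tie the five metrics together, here is a small helper of our own (the name rule_metrics and its signature are not from any library) that computes all of them for a one-item rule A ⇒ B on a 0/1-encoded DataFrame such as df above; note it does not guard against division by zero (e.g., a rule with confidence exactly 1).
# sketch: bundle the five APM metrics for a single-item rule A => B
def rule_metrics(df, antecedent, consequent):
    n = len(df)
    freq_a = (df[antecedent] == 1).sum()
    freq_b = (df[consequent] == 1).sum()
    freq_ab = ((df[antecedent] == 1) & (df[consequent] == 1)).sum()
    support = freq_ab / n
    confidence = freq_ab / freq_a
    return {'support': support,
            'confidence': confidence,
            'completeness': freq_ab / freq_b,
            'lift': support / ((freq_a / n) * (freq_b / n)),
            'conviction': (1 - freq_b / n) / (1 - confidence)}
# reproduces the worked examples above
rule_metrics(df, 'Bread', 'Cheese')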
2.2 Apriori algorithm#
The Apriori algorithm, introduced in ‘Fast Algorithms for Mining Association Rules’ by Rakesh Agrawal and Ramakrishnan Srikant in 1994, is a foundational tool in data mining for discovering frequent item sets and generating association rules.
This algorithm works on transaction databases, such as those recording items purchased by customers in a supermarket, with the primary goal of identifying item sets that frequently appear together. Its key insight is the downward-closure (anti-monotone) property: every subset of a frequent item set must itself be frequent, so the level-wise search can discard many candidates early.
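Before turning to the mlxtend implementation below, the level-wise idea can be sketched in plain Python. The following is our own illustration (not mlxtend's code): join frequent (k-1)-item sets into k-item candidates, prune any candidate that has an infrequent subset, then count support and repeat.
from itertools import combinations

def apriori_sketch(transactions, min_support):
    n = len(transactions)
    # level 1: frequent individual items
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]): sum(i in t for t in transactions) / n for i in items}
    freq = {s: sup for s, sup in freq.items() if sup >= min_support}
    result = dict(freq)
    k = 2
    while freq:
        # join step: merge frequent (k-1)-item sets into k-item candidates
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # count step: keep candidates that meet the support threshold
        freq = {}
        for c in candidates:
            sup = sum(c <= set(t) for t in transactions) / n
            if sup >= min_support:
                freq[c] = sup
        result.update(freq)
        k += 1
    return result

# toy usage: item sets appearing in at least half of four transactions
apriori_sketch([{'bread', 'milk'}, {'bread', 'cheese'},
                {'bread', 'milk', 'cheese'}, {'milk'}], min_support=0.5)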
# ! pip install mlxtend
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
# read csv as df
df_cs = pd.read_csv('cakeshop.csv')
df_cs
| | TransactionID | BREAD | MILK | JAM | EGGS | BUTTER | SUGAR | FLOUR | CHOCOLATE | YEAST | CANDLES |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bread | Milk | Jam | Eggs | Butter | Sugar | Flour | Chocolate | Yeast | Candles |
1 | 2 | NaN | Milk | NaN | Eggs | NaN | Sugar | Flour | Chocolate | NaN | NaN |
2 | 3 | NaN | Milk | NaN | Eggs | NaN | Sugar | NaN | Chocolate | NaN | NaN |
3 | 4 | NaN | Milk | NaN | Eggs | Butter | NaN | NaN | Chocolate | NaN | Candles |
4 | 5 | NaN | Milk | NaN | Eggs | Butter | NaN | Flour | Chocolate | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | 496 | Bread | Milk | NaN | NaN | Butter | Sugar | Flour | NaN | NaN | Candles |
496 | 497 | Bread | Milk | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Candles |
497 | 498 | Bread | Milk | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Candles |
498 | 499 | NaN | Milk | NaN | Eggs | NaN | NaN | NaN | NaN | Yeast | Candles |
499 | 500 | Bread | Milk | Jam | Eggs | Butter | Sugar | NaN | NaN | Yeast | Candles |
500 rows × 11 columns
# delete the TransactionID column
del df_cs['TransactionID']
df_cs
| | BREAD | MILK | JAM | EGGS | BUTTER | SUGAR | FLOUR | CHOCOLATE | YEAST | CANDLES |
---|---|---|---|---|---|---|---|---|---|---|
0 | Bread | Milk | Jam | Eggs | Butter | Sugar | Flour | Chocolate | Yeast | Candles |
1 | NaN | Milk | NaN | Eggs | NaN | Sugar | Flour | Chocolate | NaN | NaN |
2 | NaN | Milk | NaN | Eggs | NaN | Sugar | NaN | Chocolate | NaN | NaN |
3 | NaN | Milk | NaN | Eggs | Butter | NaN | NaN | Chocolate | NaN | Candles |
4 | NaN | Milk | NaN | Eggs | Butter | NaN | Flour | Chocolate | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | Bread | Milk | NaN | NaN | Butter | Sugar | Flour | NaN | NaN | Candles |
496 | Bread | Milk | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Candles |
497 | Bread | Milk | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Candles |
498 | NaN | Milk | NaN | Eggs | NaN | NaN | NaN | NaN | Yeast | Candles |
499 | Bread | Milk | Jam | Eggs | Butter | Sugar | NaN | NaN | Yeast | Candles |
500 rows × 10 columns
One-hot encoding
# Convert categorical variables into dummy/indicator variables (ones and zeros): one-hot encoding
df_cs = pd.get_dummies(df_cs)
df_cs
| | BREAD_Bread | MILK_Milk | JAM_Jam | EGGS_Eggs | BUTTER_Butter | SUGAR_Sugar | FLOUR_Flour | CHOCOLATE_Chocolate | YEAST_Yeast | CANDLES_Candles |
---|---|---|---|---|---|---|---|---|---|---|
0 | True | True | True | True | True | True | True | True | True | True |
1 | False | True | False | True | False | True | True | True | False | False |
2 | False | True | False | True | False | True | False | True | False | False |
3 | False | True | False | True | True | False | False | True | False | True |
4 | False | True | False | True | True | False | True | True | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | True | True | False | False | True | True | True | False | False | True |
496 | True | True | False | False | False | False | False | False | False | True |
497 | True | True | False | False | False | False | False | False | False | True |
498 | False | True | False | True | False | False | False | False | True | True |
499 | True | True | True | True | True | True | False | False | True | True |
500 rows × 10 columns
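As an aside, the TransactionEncoder imported earlier serves the same purpose as get_dummies when the data arrive as lists of items per transaction rather than one column per item; a minimal illustration with made-up transactions:
# TransactionEncoder turns lists of items into a boolean item matrix
sample = [['Bread', 'Milk'], ['Milk', 'Eggs'], ['Bread', 'Milk', 'Eggs']]
te = TransactionEncoder()
te_matrix = te.fit(sample).transform(sample)
pd.DataFrame(te_matrix, columns=te.columns_)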
2.2.1 Calculating support using apriori#
# Applying the Apriori algorithm to identify frequent items in df_cs
# min_support=0.2: Only include item sets that appear in at least 20% of transactions
# use_colnames=True: Use actual column names instead of indices for item sets
df_frequent = apriori(df_cs, min_support=0.2, use_colnames=True)
df_frequent
| | support | itemsets |
---|---|---|
0 | 0.712 | (BREAD_Bread) |
1 | 0.802 | (MILK_Milk) |
2 | 0.246 | (JAM_Jam) |
3 | 0.336 | (EGGS_Eggs) |
4 | 0.392 | (BUTTER_Butter) |
5 | 0.592 | (MILK_Milk, BREAD_Bread) |
6 | 0.240 | (EGGS_Eggs, BREAD_Bread) |
7 | 0.246 | (BUTTER_Butter, BREAD_Bread) |
8 | 0.284 | (EGGS_Eggs, MILK_Milk) |
9 | 0.258 | (BUTTER_Butter, MILK_Milk) |
10 | 0.208 | (MILK_Milk, BREAD_Bread, EGGS_Eggs) |
# get the length of each item set
df_frequent['length'] = df_frequent['itemsets'].apply(len)
df_frequent
| | support | itemsets | length |
---|---|---|---|
0 | 0.712 | (BREAD_Bread) | 1 |
1 | 0.802 | (MILK_Milk) | 1 |
2 | 0.246 | (JAM_Jam) | 1 |
3 | 0.336 | (EGGS_Eggs) | 1 |
4 | 0.392 | (BUTTER_Butter) | 1 |
5 | 0.592 | (MILK_Milk, BREAD_Bread) | 2 |
6 | 0.240 | (EGGS_Eggs, BREAD_Bread) | 2 |
7 | 0.246 | (BUTTER_Butter, BREAD_Bread) | 2 |
8 | 0.284 | (EGGS_Eggs, MILK_Milk) | 2 |
9 | 0.258 | (BUTTER_Butter, MILK_Milk) | 2 |
10 | 0.208 | (MILK_Milk, BREAD_Bread, EGGS_Eggs) | 3 |
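With the length column in place, frequent item sets can be sliced by size and support in one step; the thresholds below are illustrative:
# e.g. keep only the 2-item sets appearing in at least 25% of transactions
df_frequent[(df_frequent['length'] == 2) & (df_frequent['support'] >= 0.25)]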
2.2.2 Metrics for association_rules#
# return all the rules (and related metrics) with confidence above 0.7
association_rules(df_frequent, metric='confidence', min_threshold=0.7)
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (MILK_Milk) | (BREAD_Bread) | 0.802 | 0.712 | 0.592 | 0.738155 | 1.036734 | 1.0 | 0.020976 | 1.099886 | 0.178952 | 0.642082 | 0.090815 | 0.784808 |
1 | (BREAD_Bread) | (MILK_Milk) | 0.712 | 0.802 | 0.592 | 0.831461 | 1.036734 | 1.0 | 0.020976 | 1.174800 | 0.123029 | 0.642082 | 0.148791 | 0.784808 |
2 | (EGGS_Eggs) | (BREAD_Bread) | 0.336 | 0.712 | 0.240 | 0.714286 | 1.003210 | 1.0 | 0.000768 | 1.008000 | 0.004819 | 0.297030 | 0.007937 | 0.525682 |
3 | (EGGS_Eggs) | (MILK_Milk) | 0.336 | 0.802 | 0.284 | 0.845238 | 1.053913 | 1.0 | 0.014528 | 1.279385 | 0.077041 | 0.332553 | 0.218374 | 0.599676 |
4 | (EGGS_Eggs, MILK_Milk) | (BREAD_Bread) | 0.284 | 0.712 | 0.208 | 0.732394 | 1.028644 | 1.0 | 0.005792 | 1.076211 | 0.038891 | 0.263959 | 0.070814 | 0.512265 |
5 | (EGGS_Eggs, BREAD_Bread) | (MILK_Milk) | 0.240 | 0.802 | 0.208 | 0.866667 | 1.080632 | 1.0 | 0.015520 | 1.485000 | 0.098178 | 0.249400 | 0.326599 | 0.563009 |
# return all the rules with lift above 1.05
association_rules(df_frequent, metric='lift', min_threshold=1.05)
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (EGGS_Eggs) | (MILK_Milk) | 0.336 | 0.802 | 0.284 | 0.845238 | 1.053913 | 1.0 | 0.014528 | 1.279385 | 0.077041 | 0.332553 | 0.218374 | 0.599676 |
1 | (MILK_Milk) | (EGGS_Eggs) | 0.802 | 0.336 | 0.284 | 0.354115 | 1.053913 | 1.0 | 0.014528 | 1.028046 | 0.258358 | 0.332553 | 0.027281 | 0.599676 |
2 | (EGGS_Eggs, BREAD_Bread) | (MILK_Milk) | 0.240 | 0.802 | 0.208 | 0.866667 | 1.080632 | 1.0 | 0.015520 | 1.485000 | 0.098178 | 0.249400 | 0.326599 | 0.563009 |
3 | (MILK_Milk) | (EGGS_Eggs, BREAD_Bread) | 0.802 | 0.240 | 0.208 | 0.259352 | 1.080632 | 1.0 | 0.015520 | 1.026128 | 0.376845 | 0.249400 | 0.025463 | 0.563009 |
# define a df of rules for further selection
df_rules = association_rules(df_frequent, metric='lift', min_threshold=1.05)
df_rules['antecedent_len'] = df_rules['antecedents'].apply(len)
df_rules
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski | antecedent_len |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (EGGS_Eggs) | (MILK_Milk) | 0.336 | 0.802 | 0.284 | 0.845238 | 1.053913 | 1.0 | 0.014528 | 1.279385 | 0.077041 | 0.332553 | 0.218374 | 0.599676 | 1 |
1 | (MILK_Milk) | (EGGS_Eggs) | 0.802 | 0.336 | 0.284 | 0.354115 | 1.053913 | 1.0 | 0.014528 | 1.028046 | 0.258358 | 0.332553 | 0.027281 | 0.599676 | 1 |
2 | (EGGS_Eggs, BREAD_Bread) | (MILK_Milk) | 0.240 | 0.802 | 0.208 | 0.866667 | 1.080632 | 1.0 | 0.015520 | 1.485000 | 0.098178 | 0.249400 | 0.326599 | 0.563009 | 2 |
3 | (MILK_Milk) | (EGGS_Eggs, BREAD_Bread) | 0.802 | 0.240 | 0.208 | 0.259352 | 1.080632 | 1.0 | 0.015520 | 1.026128 | 0.376845 | 0.249400 | 0.025463 | 0.563009 | 1 |
# keep rules whose antecedent contains at least two items
df_rules.query('antecedent_len >= 2')
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski | antecedent_len |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | (EGGS_Eggs, BREAD_Bread) | (MILK_Milk) | 0.24 | 0.802 | 0.208 | 0.866667 | 1.080632 | 1.0 | 0.01552 | 1.485 | 0.098178 | 0.2494 | 0.326599 | 0.563009 | 2 |
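Because antecedents and consequents are stored as frozensets, rules can also be filtered by the items they mention; for example:
# rules whose antecedent contains EGGS_Eggs (frozensets support membership tests)
df_rules[df_rules['antecedents'].apply(lambda s: 'EGGS_Eggs' in s)]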
2.3 APM used in road accident data#
In this section, we apply APM to road accident data (‘accident_data_v1.0.0_2023.db’) as an example.
Schema information can be found in ‘dft-road-casualty-statistics-road-safety-open-dataset-data-guide-2023-1.xlsx’ on the Project Report page in Canvas
2.3.1 Database info#
import pandas as pd
import sqlite3
# Connect to accident database
conn = sqlite3.connect('accident_data_v1.0.0_2023.db')
# create cursor handling SQL
cur = conn.cursor()
# output all the tables in db
res_tables = cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
table_names = res_tables.fetchall()
print(table_names)
[('accident',), ('casualty',), ('vehicle',), ('lsoa',)]
# each result row is a 1-tuple; index into it to get the table name
table_names[1][0]
'casualty'
# Check the foreign keys to understand the relationships across tables in the database
# as the output shows, the accident_index column is used to link tables in the db
for tn in table_names:
res_fk = cur.execute(f"PRAGMA foreign_key_list({tn[0]})")
fk_info = res_fk.fetchall()
if len(fk_info) < 1:
print(f'No foreign keys found in Table: {tn[0]}')
else:
print(f'Table {tn[0]} f_keys info --',
'id:', fk_info[0][0], '--',
'seq:', fk_info[0][1],'--',
'table:', fk_info[0][2],'--',
'from:', fk_info[0][3],'--',
'to:', fk_info[0][4],'--',
'on_update:', fk_info[0][5],'--',
'on_delete:', fk_info[0][6],'--',
'match:', fk_info[0][7])
No foreign keys found in Table: accident
Table casualty f_keys info -- id: 0 -- seq: 0 -- table: accident -- from: accident_index -- to: accident_index -- on_update: NO ACTION -- on_delete: NO ACTION -- match: NONE
Table vehicle f_keys info -- id: 0 -- seq: 0 -- table: accident -- from: accident_index -- to: accident_index -- on_update: NO ACTION -- on_delete: NO ACTION -- match: NONE
No foreign keys found in Table: lsoa
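Since accident_index links the tables, a join query shows the relationship in practice; a sketch (the vehicle_type column is an assumption based on the DfT schema guide):
# join accident and vehicle on the shared accident_index key
pd.read_sql("""SELECT a.accident_index, a.accident_severity, v.vehicle_type
               FROM accident AS a
               JOIN vehicle AS v ON a.accident_index = v.accident_index
               LIMIT 5;""", conn)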
2.3.2 APM in accident table#
# using pandas to read table accident
df_ac = pd.read_sql("SELECT * FROM accident;", conn)
# This is an accident-level dataset: each row represents one accident, identified by 'accident_index'
df_ac
| | accident_index | accident_year | accident_reference | location_easting_osgr | location_northing_osgr | longitude | latitude | police_force | accident_severity | number_of_vehicles | ... | pedestrian_crossing_physical_facilities | light_conditions | weather_conditions | road_surface_conditions | special_conditions_at_site | carriageway_hazards | urban_or_rural_area | did_police_officer_attend_scene_of_accident | trunk_road_flag | lsoa_of_accident_location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017010001708 | 2017 | 010001708 | 532920.0 | 196330.0 | -0.080107 | 51.650061 | 1 | 1 | 2 | ... | 0 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | 2 | E01001450 |
1 | 2017010009342 | 2017 | 010009342 | 526790.0 | 181970.0 | -0.173845 | 51.522425 | 1 | 3 | 2 | ... | 0 | 4 | 1 | 2 | 0 | 0 | 1 | 1 | 2 | E01004702 |
2 | 2017010009344 | 2017 | 010009344 | 535200.0 | 181260.0 | -0.052969 | 51.514096 | 1 | 3 | 3 | ... | 0 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | 2 | E01004298 |
3 | 2017010009348 | 2017 | 010009348 | 534340.0 | 193560.0 | -0.060658 | 51.624832 | 1 | 3 | 2 | ... | 4 | 4 | 2 | 2 | 0 | 0 | 1 | 1 | 2 | E01001429 |
4 | 2017010009350 | 2017 | 010009350 | 533680.0 | 187820.0 | -0.072372 | 51.573408 | 1 | 2 | 1 | ... | 5 | 4 | 1 | 2 | 0 | 0 | 1 | 1 | 2 | E01001808 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
461347 | 2020991027064 | 2020 | 991027064 | 343034.0 | 731654.0 | -2.926320 | 56.473539 | 99 | 2 | 2 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | -1 | -1 |
461348 | 2020991029573 | 2020 | 991029573 | 257963.0 | 658891.0 | -4.267565 | 55.802353 | 99 | 3 | 1 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 2 | -1 | -1 |
461349 | 2020991030297 | 2020 | 991030297 | 383664.0 | 810646.0 | -2.271903 | 57.186317 | 99 | 2 | 2 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 2 | 1 | -1 | -1 |
461350 | 2020991030900 | 2020 | 991030900 | 277161.0 | 674852.0 | -3.968753 | 55.950940 | 99 | 3 | 2 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 2 | -1 | -1 |
461351 | 2020991032575 | 2020 | 991032575 | 240402.0 | 681950.0 | -4.561040 | 56.003843 | 99 | 3 | 1 | ... | 0 | 1 | 1 | 1 | 0 | 2 | 1 | 1 | -1 | -1 |
461352 rows × 36 columns
You can use pandas .describe() or .info() to output basic information about a df
You can also use libraries such as ydata-profiling, sweetviz and dtale for the initial EDA
df_ac.describe()
| | accident_year | location_easting_osgr | location_northing_osgr | longitude | latitude | police_force | accident_severity | number_of_vehicles | number_of_casualties | day_of_week | ... | pedestrian_crossing_human_control | pedestrian_crossing_physical_facilities | light_conditions | weather_conditions | road_surface_conditions | special_conditions_at_site | carriageway_hazards | urban_or_rural_area | did_police_officer_attend_scene_of_accident | trunk_road_flag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 461352.000000 | 461236.000000 | 4.612360e+05 | 461226.000000 | 461226.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | ... | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 | 461352.000000 |
mean | 2018.368378 | 452593.115756 | 2.801803e+05 | -1.246998 | 52.408849 | 28.193388 | 2.784436 | 1.840441 | 1.301245 | 4.108171 | ... | 0.262112 | 1.099518 | 2.036235 | 1.649918 | 1.385239 | 0.217519 | 0.170518 | 1.325463 | 1.334877 | 1.680147 |
std | 1.091566 | 94822.718705 | 1.515501e+05 | 1.389702 | 1.365101 | 24.801609 | 0.443648 | 0.709869 | 0.746398 | 1.927216 | ... | 1.460680 | 2.333113 | 1.724691 | 1.819426 | 0.955180 | 1.240981 | 1.121964 | 0.469057 | 0.556352 | 0.851554 |
min | 2017.000000 | 64084.000000 | 1.023500e+04 | -7.525273 | 49.912362 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 |
25% | 2017.000000 | 388539.000000 | 1.755300e+05 | -2.172668 | 51.465689 | 5.000000 | 3.000000 | 1.000000 | 1.000000 | 2.000000 | ... | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 2.000000 |
50% | 2018.000000 | 459198.500000 | 2.208650e+05 | -1.126264 | 51.870257 | 22.000000 | 3.000000 | 2.000000 | 1.000000 | 4.000000 | ... | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 2.000000 |
75% | 2019.000000 | 529290.000000 | 3.865510e+05 | -0.136389 | 53.372899 | 45.000000 | 3.000000 | 2.000000 | 1.000000 | 6.000000 | ... | 0.000000 | 0.000000 | 4.000000 | 1.000000 | 2.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 |
max | 2020.000000 | 655391.000000 | 1.209512e+06 | 1.759641 | 60.763722 | 99.000000 | 3.000000 | 24.000000 | 59.000000 | 7.000000 | ... | 9.000000 | 9.000000 | 7.000000 | 9.000000 | 9.000000 | 9.000000 | 9.000000 | 3.000000 | 3.000000 | 2.000000 |
8 rows × 29 columns
df_ac.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 461352 entries, 0 to 461351
Data columns (total 36 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 accident_index 461352 non-null object
1 accident_year 461352 non-null int64
2 accident_reference 461352 non-null object
3 location_easting_osgr 461236 non-null float64
4 location_northing_osgr 461236 non-null float64
5 longitude 461226 non-null float64
6 latitude 461226 non-null float64
7 police_force 461352 non-null int64
8 accident_severity 461352 non-null int64
9 number_of_vehicles 461352 non-null int64
10 number_of_casualties 461352 non-null int64
11 date 461352 non-null object
12 day_of_week 461352 non-null int64
13 time 461352 non-null object
14 local_authority_district 461352 non-null int64
15 local_authority_ons_district 461352 non-null object
16 local_authority_highway 461352 non-null object
17 first_road_class 461352 non-null int64
18 first_road_number 461352 non-null int64
19 road_type 461352 non-null int64
20 speed_limit 461352 non-null int64
21 junction_detail 461352 non-null int64
22 junction_control 461352 non-null int64
23 second_road_class 461352 non-null int64
24 second_road_number 461352 non-null int64
25 pedestrian_crossing_human_control 461352 non-null int64
26 pedestrian_crossing_physical_facilities 461352 non-null int64
27 light_conditions 461352 non-null int64
28 weather_conditions 461352 non-null int64
29 road_surface_conditions 461352 non-null int64
30 special_conditions_at_site 461352 non-null int64
31 carriageway_hazards 461352 non-null int64
32 urban_or_rural_area 461352 non-null int64
33 did_police_officer_attend_scene_of_accident 461352 non-null int64
34 trunk_road_flag 461352 non-null int64
35 lsoa_of_accident_location 461352 non-null object
dtypes: float64(4), int64(25), object(7)
memory usage: 126.7+ MB
Select three columns for analysis
accident_severity, speed_limit, and weather_conditions (candidate rules take the form: if speed_limit and weather_conditions, then accident_severity)
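Before encoding, a quick look at each column's category distribution helps when choosing support thresholds; a simple check with value_counts:
# distribution of categories in each selected column
for c in ['speed_limit', 'weather_conditions', 'accident_severity']:
    print(df_ac[c].value_counts().head(), '\n')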
# select three columns
df_ac_s = df_ac[['speed_limit', 'weather_conditions', 'accident_severity']]
df_ac_s
| | speed_limit | weather_conditions | accident_severity |
---|---|---|---|
0 | 30 | 1 | 1 |
1 | 30 | 1 | 3 |
2 | 30 | 1 | 3 |
3 | 30 | 2 | 3 |
4 | 20 | 1 | 2 |
... | ... | ... | ... |
461347 | 30 | 1 | 2 |
461348 | 30 | 1 | 3 |
461349 | 60 | 1 | 2 |
461350 | 30 | 1 | 3 |
461351 | 30 | 1 | 3 |
461352 rows × 3 columns
# One-hot encode the selected columns
df_list = []
for c in ['speed_limit', 'weather_conditions', 'accident_severity']:
df_c = pd.get_dummies(df_ac_s[c], prefix=c)
df_list.append(df_c)
# axis=1 concatenates the dataframes in df_list horizontally (by columns), with indices aligned
df_oh = pd.concat(df_list, axis=1)
df_oh
| | speed_limit_-1 | speed_limit_20 | speed_limit_30 | speed_limit_40 | speed_limit_50 | speed_limit_60 | speed_limit_70 | weather_conditions_-1 | weather_conditions_1 | weather_conditions_2 | weather_conditions_3 | weather_conditions_4 | weather_conditions_5 | weather_conditions_6 | weather_conditions_7 | weather_conditions_8 | weather_conditions_9 | accident_severity_1 | accident_severity_2 | accident_severity_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False |
1 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True |
2 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True |
3 | False | False | True | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | True |
4 | False | True | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
461347 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | True | False |
461348 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True |
461349 | False | False | False | False | False | True | False | False | True | False | False | False | False | False | False | False | False | False | True | False |
461350 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True |
461351 | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True |
461352 rows × 20 columns
# create frequent sets for rules
df_fre = apriori(df_oh, min_support=0.2, use_colnames=True)
df_fre
| | support | itemsets |
---|---|---|
0 | 0.595235 | (speed_limit_30) |
1 | 0.795436 | (weather_conditions_1) |
2 | 0.798299 | (accident_severity_3) |
3 | 0.476458 | (speed_limit_30, weather_conditions_1) |
4 | 0.486100 | (speed_limit_30, accident_severity_3) |
5 | 0.631260 | (accident_severity_3, weather_conditions_1) |
6 | 0.387476 | (speed_limit_30, accident_severity_3, weather_... |
# select rules with confidence >0.7
df_rule = association_rules(df_fre, metric='confidence', min_threshold=0.7)
# select rules with lift >1
df_rule.query('lift > 1')
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | (speed_limit_30) | (weather_conditions_1) | 0.595235 | 0.795436 | 0.476458 | 0.800454 | 1.006308 | 1.0 | 0.002987 | 1.025146 | 0.015487 | 0.521168 | 0.024529 | 0.699722 |
1 | (speed_limit_30) | (accident_severity_3) | 0.595235 | 0.798299 | 0.486100 | 0.816651 | 1.022989 | 1.0 | 0.010924 | 1.100092 | 0.055518 | 0.535685 | 0.090985 | 0.712785 |
4 | (speed_limit_30, accident_severity_3) | (weather_conditions_1) | 0.486100 | 0.795436 | 0.387476 | 0.797113 | 1.002109 | 1.0 | 0.000815 | 1.008267 | 0.004094 | 0.433390 | 0.008199 | 0.642119 |
5 | (speed_limit_30, weather_conditions_1) | (accident_severity_3) | 0.476458 | 0.798299 | 0.387476 | 0.813243 | 1.018719 | 1.0 | 0.007120 | 1.080016 | 0.035098 | 0.436701 | 0.074088 | 0.649310 |
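To read off the strongest associations quickly, the surviving rules can be sorted by lift (or confidence); for instance:
# rank rules with lift above 1, strongest first
df_rule.query('lift > 1').sort_values('lift', ascending=False).head()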
As ‘weather_conditions_1’ represents ‘Fine no high winds’ and ‘accident_severity_3’ represents ‘Slight’, the useful rule can be read as: ‘weather is fine with no high winds’ and ‘the speed limit is 30’ \(\Rightarrow\) ‘accident severity is slight’.