





Preview text:
lOMoAR cPSD| 23136115 DATA MINING - ASSIGNMENT 5 ITDSIU21001 - Phan Quoc Anh
Question 1: Term-Frequency Vector Analysis
Your job is to write by yourself a cosine similarity based comparator. To do that,
you need to create Term-Frequency Vectors.
Task: Write a program in any computer language, that:
• Accepts two files with some text as the input (e.g., fragments of the book)
• Tokenizes the texts, and then normalizes the text (remove the punctuation, transform to lowercase)
• Computes term-frequency vectors for both files
• Computes cosine similarity according to the formula given in the beginning
• Outputs the cosine as the output Solution:
Listing 1: Python Code for Term-Frequency Vector Analysis
import string import math from collections import Counter
def read file ( filepath ):
with open( file path , ’ r ’ , encoding=’ utf−8’ ) as file : return file . read ()
def normalize text ( text ):
translator = str . maketrans( ’ ’ , ’ ’ , string . punctuation )
return text . translate ( translator ). lower ()
def compute tf ( text ): words = text . split () return Counter(words)
def cosine similarity (vec1 , vec2 ):
all words = set ( vec1 . keys ()). union ( set ( vec2 . keys ())) vec1 tf = [ vec1
[ word ] for word in all words ] lOMoAR cPSD| 23136115
vec2 tf = [ vec2 [ word ] for word in all words ] dot product = sum(a ∗ b for a , b in zip( vec1 tf ,
vec2 tf )) magnitude1 = math. sqrt (sum(a ∗ a for a in vec1
tf )) magnitude2 = math. sqrt (sum(b ∗ b for b in vec2 tf ))
if not magnitude1 or not magnitude2 : return 0.0
return dot product / (magnitude1 ∗ magnitude2)
def main( file1 path , file2 path ): text1 = read file ( file1 path ) text2 = read file (
file2 path ) norm text1 = normalize text ( text1 ) norm text2 = normalize text
( text2 ) tf vector1 = compute tf ( norm text1 ) tf vector2 = compute tf ( norm
text2 ) similarity = cosine similarity ( tf vector1 , tf vector2 ) print( f ’ Cosine
Similarity : { similarity } ’ )
file1 path = ’/path/to/ f i r s t f i l e . txt ’ file2 path =
’/path/to/ second file . txt ’ main( file1 path , file2 path )
Question 2: Analysis of the Confusion Matrix Task:
Your task is to analyze the results of three different classifiers that have
classified data into four classes. Based on the given confusion matrices, calculate
the kappa statistic for each classifier to evaluate their performance.
1. Based on the given confusion matrices, calculate the kappa statistic
foreach of the three classifiers.
2. Show all calculation steps.
3. Compare the kappa statistic results for all classifiers and analyze
whichclassifier is the most efficient Solution: Classifier A Confusion Matrix: |
| Class 1 | Class 2 | Class 3 | Class 4 |
|---------|---------|---------|---------|---------| | Class 1 | 50 | 10 | 5 | 0 | | Class 2 | 8 | 40 | 5 | 2 | | Class 3 | 4 | 8 | 60 | 4 | | Class 4 | 0 | 3 | 7 | 50 | Total samples: 248 lOMoAR cPSD| 23136115 Classifier B Confusion Matrix: |
| Class 1 | Class 2 | Class 3 | Class 4 |
|---------|---------|---------|---------|---------| | Class 1 | 45 | 12 | 8 | 0 | | Class 2 | 10 | 38 | 7 | 0 | | Class 3 | 6 | 9 | 55 | 5 | | Class 4 | 0 | 2 | 5 | 52 | Total samples: 249 lOMoAR cPSD| 23136115 Classifier C Confusion Matrix: |
| Class 1 | Class 2 | Class 3 | Class 4 |
|---------|---------|---------|---------|---------| | Class 1 | 48 | 10 | 7 | 0 | | Class 2 | 9 | 42 | 6 | 0 | | Class 3 | 5 | 8 | 58 | 6 | | Class 4 | 0 | 3 | 6 | 53 | Total samples: 253 Comparison • Classifier A Kappa: 0.738
• Classifier B Kappa: 0.681 • Classifier C Kappa: 0.724
Conclusion: Classifier A is the most efficient with the highest Kappa statistic of 0.738.
Question 3: Playing with Rules and Decision Trees
Task: Design and implementation of algorithm for converting classification rules to decision trees.
Your job is to develop an algorithm that converts a set of classification rules
into a decision tree structure. You are asked to design, implement, and test an
algorithm that transforms a set of classification rules into a decision tree. The
algorithm should optimize the conversion process, especially handling repeated sub-trees efficiently. lOMoAR cPSD| 23136115 -
You should specify how you encode classification rules (e.g. - IF
condition1 AND condition2 THEN class1 - IF (condition3 XOR condition4) AND Attr3 ¿ 0 THEN class2) -
You can implement the solution in any programming language you are
comfortable with (e.g. Java, Python). -
Decide, what to do if rules are inconsistent -
Visualize a tree using some visualization solution, e.g. DOT language within Grpahiviz
Expected input: set of classification rules which follow propositional logic
constraints, Expected output: an image showing the tree. Solution:
Listing 2: Python Code for Converting Classification Rules to Decision Trees from
graphviz import Digraph class Node: def i n i t ( self , name, children=None ):
s e l f .name = name s e l f . children = children if children is not None else {}
def add rule ( tree , rule ): current node =
tree for condition in rule [: −1]:
if condition not in current node . children :
current node . children [ condition ] = Node( condition )
current node = current node . children [ condition ]
current node . children [ ’ class ’ ] = rule [−1]
def print tree (node , graph , parent=None ): graph . node(node .name) if parent :
graph . edge ( parent .name, node .name)
for child in node . children . values ():
if isinstance ( child , Node ): print tree ( child , graph , node) rules = [ [ ’ condition1 ’ , ’AND’ , ’ condition2 ’ , ’THEN’ , ’ class1 ’ ] , [ ’ condition3 ’ , ’XOR’ , ’ condition4 ’ , ’AND’ , ’ Attr3 > 0 ’ , ’THEN’ , ’ class2 ’ ] ]
tree = Node( ’ root ’ ) for rule in rules : add rule ( tree , rule ) graph = Digraph () lOMoAR cPSD| 23136115 print tree ( tree , graph)
graph . render ( ’ decision tree ’ , format=’png ’ , view=True)
Note: Make sure you have Graphviz installed and the Python graphviz package
installed to run this code. You can install it using: pip install graphviz