










Preview text:
  lOMoAR cPSD| 23136115 Introduction to Data Mining   
Lab 5: Putting it all together  5.1.  The data mining process 
In the fifth class, we are going to look at some more global issues about the data mining process. (See the 
lecture of class 5 by Ian H. Witten, [1]1). We are going through four lessons: the data mining process, Pitfalls 
and pratfalls, and data mining and ethics. 
According to [1], the data mining process includes steps: ask a question, gather data, clean the data, define 
new features, and deploy the result. Write down the brief for these steps:  -  Ask a question  
Ask the right kind of question, such as "What do I want to know?". 
This essential step provides the necessary framework for the subsequent stages of the data mining 
process, ensuring a focused and goal-oriented approach. Omitting this step can lead to a lack of clarity and  potential pitfalls.  -  Gather data 
Obtain the required data to answer the research question and/or enrich existing datasets. While there is 
a wealth of data available, challenges such as data quality, relevance, and quantity can limit its usefulness. 
To optimize model performance, increasing the amount of data can be a more advantageous approach 
than solely fine-tuning the algorithm, as the adage 'more data beats a clever algorithm' suggests.  -  Clean the data 
Real-world data is often characterized by noise, missing values, and inconsistencies. To improve data 
quality and facilitate accurate analysis, data preprocessing techniques, such as anomaly detection, 
imputation, integration, normalization, and standardization, can be employed to clean and transform  the data.  -  Define new features   
1 http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/      lOMoAR cPSD| 23136115
Create new attributes or features from the existing data that can provide additional insights and improve 
model performance. This process, often referred to as feature engineering, involves transforming and 
combining existing features to create more informative ones.  -  Deploy the result 
Deploy the discovered knowledge or model into real-world applications or decision-making processes. This 
involves sharing the results with relevant stakeholders and integrating them into business operations. And 
here are the 7 steps of the KDD process according to Han and Kamber (2011):  + Data Cleaning 
Removing noise and inconsistent data to improve data quality.   + Data Selection 
Retrieving relevant data from the database for analysis.  + Data Integration 
Combining data from multiple sources into a coherent data store. 
+ Data Transformation 
Converting data into appropriate forms for mining, often involving normalization and aggregation.  + Data Mining 
Applying intelligent methods to extract data patterns.  + Pattern Evaluation 
Identifying truly interesting patterns in the data that represent valuable knowledge, using appropriate 
interestingness measures to evaluate their significance. 
+ Knowledge Representation 
Presenting the mined knowledge in a clear, concise, and visually appealing format that is easily 
understandable and actionable by the end-user.  5.2.  Pitfalls and pratfalls 
Follow the lecture in [1] to learn what are pitfalls and pratfalls in data mining. 
Do experiments to investigate how OneR and J48 deal with missing values. 
Write down the results in the following table:      lOMoAR cPSD| 23136115 Dataset 
OneR’s classifier model and  J48’s 
  model   and  performance    classifier  performance        lOMoAR cPSD| 23136115 weather  Classifier  Classifier  nominal.arff(original) 
=== Classifier model (full training === Classifier model (full training set)  set) ===  ===  outlook:  J48 pruned tree  sunny -> no  ------------------  overcast  -> yes  rainy  -> yes  outlook = sunny  (10/14 instances correct)  | humidity = high: no (3.0) |  humidity = normal: yes (2.0)  outlook = overcast: yes (4.0)  outlook = rainy  | windy = TRUE: no (2.0)  | windy = FALSE: yes (3.0)   Number of Leaves :  5   Size of the tree :  8  Performance  Performance  === 10-fold Stratified  === 10-fold Stratified cross-  crossvalidation ===  validation ===  === Summary ===  === Summary === 
Correctly Classified Instances 6 Correctly Classified Instances 7  42.8571 %  50 % 
Incorrectly Classified Instances 
Incorrectly Classified Instances 7  8 57.1429 %  50 % 
Kappa statistic -0.1429 Kappa statistic -0.0426  Mean    absolute    error Mean    absolute    error  0.5714  0.4167 
 Root mean squared error Root mean squared error  0.7559  0.5984  Relative absolute error 120  Relative absolute error 87.5  %  %  Root relative squared error   Root relative squared error  153.2194 %  121.2987 %  Total Number of Instances  Total Number of Instances 14  14 
=== Detailed Accuracy By Class === 
=== Detailed Accuracy By Class ===   TP Rate FP Rate Precision   TP Rate FP Rate Precision  Recall F-Measure MCC ROC Area  Recall F-Measure MCC ROC  PRC Area Class  Area PRC Area Class   0.556 0.600 0.625   0.444 0.600 0.571  0.556 0.588 -0.043 0.633      lOMoAR cPSD| 23136115
0.444 0.500 -0.149 0.422 0.611 0.758 yes  yes   0.400 0.444 0.333      lOMoAR cPSD| 23136115      0.400 0.556 0.286 
0.400 0.364 -0.043 0.633 0.457 
0.400 0.333 -0.149 0.422 0.329 no  no  Weighted Avg. 0.500 0.544  Weighted Avg. 0.429 0.584  0.521 0.500 0.508 -0.043  0.469 0.429 0.440 -0.149  0.633 0.650  0.422 0.510  === Confusion Matrix ===  === Confusion Matrix ===   a b <-- classified as   a b <-- classified as   5 4 | a = yes   4 5 | a = yes   3 2 | b = no   3 2 | b = no      lOMoAR cPSD| 23136115 weather  Classifier  Classifier  nominal.arff(with 
=== Classifier model (full training === Classifier model (full training set)  missing values)  set) ===  ===  outlook:  J48 pruned tree  sunny -> yes  -----------------:  overcast  -> yes  yes (14.0/5.0)  rainy  -> yes    ?  -> no   Number of Leaves :  1  (13/14 instances correct)   Size of the tree :  1  Performance  Performance  === 10-fold Stratified  === 10-fold Stratified cross-  crossvalidation ===  validation ===  === Summary ===  === Summary === 
Correctly Classified Instances Correctly Classified Instances 7  13 92.8571 %  50 % 
Incorrectly Classified Instances 
Incorrectly Classified Instances 7  1 7.1429 %  50 %  Kappa statistic 0.8372  Kappa statistic -0.1395  Mean    absolute    error Mean    absolute    error  0.0714  0.5403 
 Root mean squared error Root mean squared error  0.2673  0.5727  Relative absolute error 15   Relative    absolute    error  %  113.4615 % 
Root relative squared error 54.1712   Root relative squared error  %  116.0707 %  Total Number of Instances  Total Number of Instances 14  14 
=== Detailed Accuracy By Class === 
=== Detailed Accuracy By Class ===   TP Rate FP Rate Precision   TP Rate FP Rate Precision  Recall F-Measure MCC ROC Area      lOMoAR cPSD| 23136115   Recall F-Measure MCC ROC  PRC Area Class  Area PRC Area Class   0.667 0.800 0.600   1.000 0.200 0.900  0.667 0.632 -0.141 0.211 
1.000 0.947 0.849 0.900 0.900 0.545 yes  yes   0.200 0.333 0.250   0.800 0.000 1.000 
0.200 0.222 -0.141 0.211 0.306 
0.800 0.889 0.849 0.900 0.871 no  no  Weighted Avg. 0.500 0.633  Weighted Avg. 0.929 0.129  0.475 0.500 0.485 -0.141  0.936 0.929 0.926 0.849  0.211 0.460  0.900 0.890  === Confusion Matrix ===  === Confusion Matrix ===   a b <-- classified as   a b <-- classified as   6 3 | a = yes   9 0 | a = yes   4 1 | b = no   1 4 | b = no 
Remark: how do OneR and J48 deal with missing values? 
- OneR: The mere fact that a value is missing can be as important as the value itself, leading to substantial  changes in the final result 
- J48: Even though some values were missing, the overall results remained unaffected.  5.3.  Data mining and ethics  Reading  5.4.  Association-rule learners 
Do experiments to investigate how Apriori and FP-Growth generate association rules for datasets vote.arff  Dataset 
Apriori based association rules 
FP-Growth based association rules  Vote.arff  Apriori  === Run information ===  ======= 
Scheme: weka.associations.FPGrowth P 
Minimum support: 0.45 (196 instances) 
2 -I -1 -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M  Minimum metric : 0.9  0.1 
Number of cycles performed: 11  Relation: vote  Instances: 435 Attributes: 
Generated sets of large itemsets:  17   handicapped-infants        lOMoAR cPSD| 23136115  
Size of set of large itemsets L(1): 20   water-project-cost-sharing 
adoption-of-the-budget-resolution 
Size of set of large itemsets L(2): 17   physician-fee-freeze el- salvador-aid 
Size of set of large itemsets L(3): 6   religious-groups-in-schools 
anti-satellite-test-ban aid-to-
Size of set of large itemsets L(4): 1  nicaraguan-contras   mx-missile  Best rules found:  immigration   synfuels-corporation-cutback  1.  adoption-of-the-budget- education-spending 
resolution=y physician-fee-freeze=n 219 ==> superfund-right-to-sue 
Class=democrat 219 lift:(1.63)  crime  lev:(0.19) [84] conv:(84.58)   duty-free-exports  2.  adoption-of-the-budget-  export-administration-act-
resolution=y physician-fee-freeze=n aid-to- southafrica 
nicaraguancontras=y 198 ==>   Class 
Class=democrat 198 lift:(1.63)  lev:(0.18) [76] conv: 
=== Associator model (full training set) ===  (76.47)  3.  physician-fee-freeze=n aid-to-
FPGrowth found 41 rules (displaying top 
nicaraguan-contras=y 211 ==>  10) 
Class=democrat 210 lift:(1.62)  lev:(0.19) [80] conv:  1.  [el-salvador-aid=y,  (40.74) 
Class=republican]: 157 ==> [physician-fee- 4. 
physician-fee-freeze=n education- freeze=y]: 156 lift:(2.44) 
spending=n 202 ==> Class=democrat 201  lev:(0.21) conv: 
 lift:(1.62) lev:(0.18) [77] conv:  (46.56)  (39.01)  2. 
[crime=y, Class=republican]: 158  5. 
physician-fee-freeze=n 247 ==> 
==> [physician-fee-freeze=y]: 155 Class=democrat 245 lift: 
(0.98)> lift:(2.41) lev:(0.21) conv:(23.43) 
(1.62) lev:(0.21) [93] conv:(31.8) 
3. [religious-groups-in-schools=y,  6. 
el-salvador-aid=n Class=democrat 
physician-fee-freeze=y]: 160 ==> 
200 ==> aid-to-nicaraguan-contras=y 197  [elsalvador-aid=y]: 156 
 lift:(1.77) lev:(0.2) [85] conv: 
lift:(2) lev:(0.18) conv:(16.4)  (22.18)  4. 
[Class=republican]: 168 ==>  7. 
el-salvador-aid=n 208 ==> aid-to- [physician-fee-freeze=y]: 163  nicaraguan-contras=y 204   lift: 
lift:(1.76) lev:(0.2) [88] conv:(18.46) 
(2.38) lev:(0.22) conv:(16.61)  8.  adoption-of-the-budget- 5.  [adoption-of-the-budget-
resolution=y aid-to-nicaraguan-contras=y 
resolution=y, anti-satellite-test-ban=y, mx-
Class=democrat 203 ==> physician-fee-
missile=y]: 161 ==> [aid-to-nicaraguan-
freeze=n 198 contras=y]: 155 lift:(1.73) 
(0.98)> lift:(1.72) lev:(0.19) [82] conv:(14.62) lev:(0.15) conv:(10.2)  9. el-salvador-aid=n aid-to- 6.  [physician-fee-freeze=y, 
nicaraguancontras=y 204 ==>  Class=democrat 197      lOMoAR cPSD| 23136115
 lift:(1.57) lev:(0.17) [71] conv: 
Class=republican]: 163 ==> [el- (9.85)  salvadoraid=y]: 156 
10. aid-to-nicaraguan-contras=y  lift:(1.96) lev: 
Class=democrat 218 ==> physician- (0.18) conv:(10.45) 
feefreeze=n 210 lift:(1.7) lev:  7. 
[religious-groups-in-schools=y, el-
salvador-aid=y, superfund-right-to-sue=y]:      lOMoAR cPSD| 23136115     (0.2) [86] conv:(10.47) 
160 ==> [crime=y]: 153 lift:  (1.68) lev:(0.14) conv:(8.6)  8.  [el-salvador-aid=y, superfund-
right-to-sue=y]: 170 ==> [crime=y]: 162   lift:(1.67) lev:(0.15)  conv:(8.12)  9. 
[crime=y, physician-fee-freeze=y]: 
168 ==> [el-salvador-aid=y]: 160   lift:(1.95) lev:(0.18)  conv:(9.57)  10.  [el-salvador-aid=y, physician-
feefreeze=y]: 168 ==> [crime=y]: 160   lift:(1.67) lev:(0.15)  conv:(8.02)