Introduction to Data Mining
Lecture 6 – Activities
1. Exercises in Text [1] - 8.7
The following table consists of training data from an employee database. The data have been
generalized. For example, “31 . . . 35” for age represents the age range of 31 to 35. For a given row
entry, count represents the number of data tuples having the values for department, status, age, and
salary given in that row.
Let status be the class label attribute.
(a) How would you modify the basic decision tree algorithm to take into consideration the count
of each generalized data tuple (i.e., of each row entry)?
(b) Use your algorithm to construct a decision tree from the given data.
(c) Given a data tuple having the values “systems”, “26 . . . 30”, and “46–50K” for the attributes
department, age, and salary, respectively, what would a Naïve Bayesian classification of the
status for the tuple be?
Answers
(a) How would you modify the basic decision tree algorithm to take into consideration the count
of each generalized data tuple (i.e., of each row entry)?
The basic decision tree algorithm should be modified as follows to take into consideration the count
of each generalized data tuple:
• The count of each tuple must be integrated into the calculation of the attribute selection measure
(such as information gain).
• The count must also be taken into consideration when determining the most common class among
the tuples.
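The two modifications above can be sketched in Python as follows. This is a minimal illustration, not the textbook's implementation: the function names, the row layout (a dict per generalized row with a "count" key), and the toy rows at the bottom are all assumptions made for the example.

```python
from collections import defaultdict
from math import log2


def entropy(class_counts):
    """Entropy in bits of a class distribution given as {label: count}."""
    total = sum(class_counts.values())
    return -sum((c / total) * log2(c / total)
                for c in class_counts.values() if c > 0)


def info_gain(rows, attr, label="status", weight="count"):
    """Information gain of `attr`; each row stands for row[weight] tuples."""
    overall = defaultdict(int)
    by_value = defaultdict(lambda: defaultdict(int))
    for row in rows:
        # Add the row's count instead of 1, so a generalized row
        # contributes as many tuples as it represents.
        overall[row[label]] += row[weight]
        by_value[row[attr]][row[label]] += row[weight]
    total = sum(overall.values())
    remainder = sum(sum(part.values()) / total * entropy(part)
                    for part in by_value.values())
    return entropy(overall) - remainder


def majority_class(rows, label="status", weight="count"):
    """Most common class, weighted by the count of each row."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[label]] += row[weight]
    return max(totals, key=totals.get)


# Hypothetical toy rows, NOT the table from the exercise.
toy_rows = [
    {"dept": "a", "status": "senior", "count": 30},
    {"dept": "a", "status": "junior", "count": 10},
    {"dept": "b", "status": "junior", "count": 40},
]
print(info_gain(toy_rows, "dept"), majority_class(toy_rows))
```

The only change from the unweighted algorithm is that every tally adds the row's count rather than 1; the rest of the tree construction is unaffected.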
(b) Use your algorithm to construct a decision tree from the given data.
The resulting tree is:
Info(D) = 0.899
Gain(Dept) = 0.0488
Gain(Age) = 0.4247
Gain(Salary) = 0.5375
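As a quick check of Info(D), the entropy can be recomputed from the class totals. The totals used here (52 senior and 113 junior tuples) are an assumption read off the denominators used in part (c), since the table itself is not reproduced in this text.

```python
from math import log2

# Class totals assumed from the denominators in part (c): 52 senior, 113 junior.
class_counts = {"senior": 52, "junior": 113}
total = sum(class_counts.values())

# Info(D) = -sum over classes of p_i * log2(p_i), with p_i weighted by count.
info_d = -sum((c / total) * log2(c / total) for c in class_counts.values())
print(round(info_d, 3))
```

Under that assumption the result rounds to the value of Info(D) stated above.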
Salary has the highest gain, so it is chosen as the root. At the second level, either Dept or Age
can be chosen because both have Gain = 1, and either choice is correct. However, choosing Dept gives
a simpler tree (fewer nodes), so Dept is preferred. In addition, the branch dept = secretary contains
no instances, so the majority class, junior, is assigned as its leaf.
(c) Given a data tuple having the values “systems”, “26 . . . 30”, and “46–50K” for the attributes
department, age, and salary, respectively, what would a Naïve Bayesian classification of the
status for the tuple be?
P(X|senior) = 18/52 × 0/52 × 40/52 = 0. In this case the Laplacian correction was not used.
To apply it, one more tuple is added for each age group. There are 6 age groups, so:
P(X|senior) = 18/52 × (0+1)/(52+6) × 40/52 =
P(X|junior) = 23/113 × 49/113 × 23/113 = 0.018. Thus, without the correction, a Naïve Bayesian
classification predicts “junior”. With the Laplacian correction:
P(X|junior) = 23/113 × (49+1)/(113+6) × 23/113 =
Other ways:
P(X|senior) = (18+1)/(52+165) × (0+1)/(52+165) × (40+1)/(52+165) =
P(X|junior) = (23+1)/(113+165) × (49+1)/(113+165) × (23+1)/(113+165) =
P(X|senior) = 18/52 × (0+6)/(52+6) × 40/52 =
P(X|junior) = 23/113 × (49+6)/(113+6) × 23/113 =
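The class-conditional likelihoods can be evaluated with a short Python sketch. It covers only the uncorrected product and the first correction described above (add one tuple per age group, applied to the age attribute only). The counts are copied from the fractions printed in this answer, and the names laplace and likelihood are made up for the illustration.

```python
from fractions import Fraction


def laplace(count, class_total, n_values):
    """Add-one estimate: (count + 1) / (class_total + n_values)."""
    return Fraction(count + 1, class_total + n_values)


# Matching-tuple counts per class for (department, age, salary),
# copied from the numerators printed in the text.
counts = {"senior": (18, 0, 40), "junior": (23, 49, 23)}
class_totals = {"senior": 52, "junior": 113}
N_AGE_GROUPS = 6  # number of age groups stated in the text


def likelihood(label, smooth_age=False):
    """P(X | label) as an exact fraction, optionally smoothing age only."""
    dept, age, salary = counts[label]
    n = class_totals[label]
    p_age = laplace(age, n, N_AGE_GROUPS) if smooth_age else Fraction(age, n)
    return Fraction(dept, n) * p_age * Fraction(salary, n)


for label in counts:
    print(label,
          float(likelihood(label)),
          float(likelihood(label, smooth_age=True)))
```

Exact fractions are used so that the zero in the uncorrected senior product is visible as an exact 0 rather than a rounding artifact.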
P(senior|X) =
P(junior|X) =
Therefore, X belongs to class …
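One way to evaluate the two posteriors is sketched below in Python. It assumes the class priors are the class totals divided by the 165 training tuples (52/165 and 113/165), which the text does not state explicitly, and it uses the likelihoods with the age-only Laplacian correction. The helper name posteriors is hypothetical.

```python
def posteriors(likelihoods, priors):
    """Normalize P(X|C) * P(C) over all classes to obtain P(C|X)."""
    joint = {c: likelihoods[c] * priors[c] for c in likelihoods}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Likelihoods with the Laplacian correction applied to age only,
# using the fractions printed in the text.
likelihoods = {
    "senior": 18 / 52 * (0 + 1) / (52 + 6) * 40 / 52,
    "junior": 23 / 113 * (49 + 1) / (113 + 6) * 23 / 113,
}

# Assumed priors: class totals over the 165 training tuples.
priors = {"senior": 52 / 165, "junior": 113 / 165}

post = posteriors(likelihoods, priors)
predicted = max(post, key=post.get)
print(post, predicted)
```

Because the evidence P(X) is the same for both classes, comparing P(X|C) × P(C) directly would give the same prediction; the normalization only makes the two values sum to 1.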
