# ALDA HW 2 solution

1. (40 points) [PCA] [Xi Yang] In this problem, you will perform PCA on the provided training dataset ("hw2q1 train.csv") and testing dataset ("hw2q1 test.csv"), which come from the Connectionist Bench dataset (http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks)). In both datasets, each row represents a data point, or sample. The first 60 columns are input features, and the last column "Class" is the output label, with the letters "R" and "M" indicating whether a sample is a Rock or a Mine, respectively.

Write code in MATLAB, R, or Python to perform the following tasks. Report your outputs and key code in the document file, and also include your code file (ending in .m, .r, or .py) in the .zip file.

(a) (2 points) Load the data. Report the size of the training and testing sets. How many Rock (R) and Mine (M) samples are in the training set and the testing set, respectively?
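As one possible starting point in Python, loading and summarizing a dataset with pandas might look like the sketch below (the file names in the usage comment are the ones given in the prompt):

```python
import pandas as pd

def summarize(path):
    """Load a dataset whose last column "Class" holds "R"/"M" labels,
    and return its shape plus the Rock and Mine counts."""
    df = pd.read_csv(path)
    counts = df["Class"].value_counts()
    return df.shape, int(counts.get("R", 0)), int(counts.get("M", 0))

# Usage (uncomment once the csv files are in the working directory):
# print(summarize("hw2q1 train.csv"))
# print(summarize("hw2q1 test.csv"))
```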

(b) (18 points) Preprocessing Data-Normalization: Run normalization on all input features in both the training and testing datasets to obtain the normalized training and normalized testing datasets. (Hint: you will need to use the min/max of the training dataset to normalize the testing dataset, and do NOT normalize the output "Class" column.)
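A minimal Python sketch of the hinted procedure, assuming the input features have already been extracted into NumPy arrays:

```python
import numpy as np

def minmax_normalize(train_X, test_X):
    """Min-max normalize to [0, 1] using the TRAINING min/max only,
    so the test set is scaled with the same statistics (per the hint)."""
    mins = train_X.min(axis=0)
    rng = train_X.max(axis=0) - mins
    rng = np.where(rng == 0, 1.0, rng)   # guard against constant features
    return (train_X - mins) / rng, (test_X - mins) / rng
```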

Use the NEW normalized datasets for the following tasks:

i. (2 points) Calculate the covariance matrix of the NEW training dataset.

ii. (2 points) Calculate the eigenvalues and eigenvectors of the covariance matrix from (i). Report the size of the covariance matrix and the 5 largest eigenvalues.
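In Python, steps i-ii might be sketched as follows (for 60 input features the covariance matrix comes out 60x60):

```python
import numpy as np

def covariance_and_top_eigen(X, k=5):
    """Covariance matrix of the feature matrix X (rows = samples) and its
    k largest eigenvalues with the matching eigenvectors."""
    C = np.cov(X, rowvar=False)        # features are columns
    vals, vecs = np.linalg.eigh(C)     # eigh: C is symmetric
    order = np.argsort(vals)[::-1]     # sort eigenvalues descending
    return C, vals[order][:k], vecs[:, order[:k]]
```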

iii. (1 point) Display the eigenvalues using a bar graph or a plot, and choose a reasonable number of eigenvectors. Justify your answer.

iv. (13 points) Next, you will combine PCA with a K-nearest-neighbor (KNN) classifier. More specifically, PCA will be applied to reduce the dimensionality of the data by transforming the original data into p (p ≤ 60) principal components; then KNN (K = 3, Euclidean distance as the distance metric) will be applied to the p principal components for classification (third-party packages may be used for KNN).
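One way to sketch this PCA+3NN pipeline in Python; scikit-learn's `KNeighborsClassifier` is an assumed third-party choice here, which the prompt permits:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def pca_3nn_accuracy(train_X, train_y, test_X, test_y, p):
    """Project onto the top-p principal components of the TRAINING data,
    then classify the projected test points with 3-NN (Euclidean)."""
    mu = train_X.mean(axis=0)
    C = np.cov(train_X - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(C)
    W = vecs[:, np.argsort(vals)[::-1][:p]]    # top-p eigenvectors
    Ztr, Zte = (train_X - mu) @ W, (test_X - mu) @ W
    knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
    knn.fit(Ztr, train_y)
    return knn.score(Zte, test_y)
```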

• (5 points) Report the accuracy on the NEW testing dataset when using PCA (p = 10) and the 3NN classifier. To show your work, submit the corresponding csv file (and include the name of the csv file in your answer below). Your csv file should have 12 columns: columns 1-10 are the 10 principal components, column 11 is the original ground-truth output "Class", and the last column is the predicted output "Class".

• (6 points) Plot your results, varying p over 2, 4, 8, 10, 20, 40, and 60. In your plot, the x-axis represents the number of principal components and the y-axis the accuracy on the NEW testing dataset using the corresponding number of principal components and 3NN.

• (2 points) Based upon the PCA+3NN results above, what is the most "reasonable" number of principal components among all the choices? Justify your answer.
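The accuracy-versus-p plot from the second bullet could be generated along these lines (the output file name is only an illustrative choice):

```python
import matplotlib
matplotlib.use("Agg")            # headless backend, suitable for scripts
import matplotlib.pyplot as plt

def plot_accuracy_vs_p(ps, accuracies, out_path="pca_3nn_accuracy.png"):
    """Plot test accuracy against the number of principal components."""
    plt.figure()
    plt.plot(ps, accuracies, marker="o")
    plt.xlabel("Number of principal components (p)")
    plt.ylabel("Accuracy on the NEW testing set (3NN)")
    plt.savefig(out_path)
    plt.close()
    return out_path

# Usage: the accuracies would come from running PCA+3NN once per p.
# plot_accuracy_vs_p([2, 4, 8, 10, 20, 40, 60], accuracies)
```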

(c) (18 points) Preprocessing Data-Standardization: Similarly, run standardization on all input features to obtain the standardized training and standardized testing datasets. Then repeat steps i-iv from (b) above on the two NEW standardized datasets.
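The standardized counterpart of the earlier min-max step, again reusing TRAINING statistics for the test set (a sketch mirroring the hint in (b)):

```python
import numpy as np

def standardize(train_X, test_X):
    """Z-score standardize using the TRAINING mean and standard deviation
    only, so the test set is transformed with the same statistics."""
    mu = train_X.mean(axis=0)
    sd = train_X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)    # guard against constant features
    return (train_X - mu) / sd, (test_X - mu) / sd
```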

(d) (2 points) Comparing the results from (b) and (c), which of the two preprocessing procedures, normalization or standardization, would you prefer for the given datasets, and why? (Answers without any justification will get zero points.)

2. (20 points) [Decision Tree] [Song Ju] In the given "hw2q2.csv", all of the input features are nominal except for the first column, which is continuous (ratio-scaled). The output label has two class values: T or F. Complete the following tasks using the decision tree algorithm discussed in lecture. In the case of ties, break ties in favor of the leftmost feature. (You can hand-draw all of your trees on paper and scan your results into the final pdf.)

(a) (10 points) Construct the tree manually using ID3/entropy computations; write down the computation process and show your tree step by step. (No partial credit)

(b) (10 points) Construct the tree manually using the Gini index; write down the computation process and show your tree step by step. (No partial credit)
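For the hand computations, it may help to check intermediate numbers against small helper functions like these (a sketch; the labels here are the T/F values from the prompt):

```python
from math import log2

def entropy(labels):
    """Shannon entropy used by ID3: -sum over classes of p_c * log2(p_c)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def gini(labels):
    """Gini index: 1 - sum over classes of p_c squared."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))
```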

3. (30 points) [Evaluate Classifier] [Song Ju] Sepsis is the leading cause of mortality in the United States. Septic shock, the most advanced complication of sepsis due to severe abnormalities of circulation and/or cellular metabolism, reaches a mortality rate as high as 50%. It is estimated that as many as 80% of sepsis deaths could be prevented with early diagnosis and intervention. To predict whether or not a patient has septic shock (Yes/No), consider the decision tree shown in Figure 1, which involves Systolic Blood Pressure (SBP), Mean Arterial Pressure (MAP), and vasopressor (Vaso). We will focus on the subtree that splits on the attribute "SBP", shown in the red dashed region of Figure 1. Answer the following questions and show your work.

Figure 1: Decision Tree

(a) (13 points) Post-pruning based on optimistic errors.

i. (4 points) Calculate the optimistic errors before and after splitting on SBP, respectively.

ii. (3 points) Based upon the optimistic errors, would the subtree be pruned or retained? If it is pruned, draw the resulting decision tree and use it for the next question; otherwise, use the original decision tree shown in Figure 1 for the next question.

iii. (6 points) Use the decision tree from (a)-(ii) above to classify the provided testing dataset ("hw2q3 test.csv"). Report the Accuracy, Recall, Precision, Specificity, Sensitivity, and F1 Measure.
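The requested metrics all come from the 2x2 confusion matrix; a small helper could look like this (treating "Yes" as the positive class is an assumption about the labels):

```python
def binary_metrics(y_true, y_pred, positive="Yes"):
    """Confusion-matrix metrics for a binary task. Note that Recall and
    Sensitivity are the same quantity, TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,
        "precision": precision,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "sensitivity": recall,
        "f1": (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0),
    }
```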

(b) (13 points) Post-pruning based on pessimistic errors. When calculating pessimistic errors, each leaf node adds a factor of 0.5 to the error.

i. (4 points) Calculate the pessimistic errors before and after splitting on SBP, respectively.

ii. (3 points) Based on the pessimistic errors, would the subtree be pruned or retained? If it is pruned, draw the resulting decision tree and use it for the next question; otherwise, use the original decision tree shown in Figure 1 for the next question.

iii. (6 points) Use the decision tree from (b)-(ii) above to classify the provided testing dataset ("hw2q3 test.csv"). Report the Accuracy, Recall, Precision, Specificity, Sensitivity, and F1 Measure.

(c) (4 points) Compare the performance of the decision trees from (a)-(ii) and (b)-(ii) on the testing dataset ("hw2q3 test.csv"). If we only consider Accuracy, Recall, and Precision, which decision tree would be the better model for the task of septic shock prediction? Justify your answer.

4. (15 points) [Adaboost] [Xi Yang] Consider the labeled data points in Figure 2, where '+' and '-' indicate class labels. We will use AdaBoost with decision stumps to train a classifier for the '+' and '-' labels. Each boosting iteration will select the stump that minimizes the weighted training error, breaking ties by choosing '+'. All of the data points start with uniform weights.

Figure 2: Adaboost

(a) (4 points) In Figure 2, draw the decision boundary corresponding to the first decision stump that the algorithm would choose (the decision boundary should be either a vertical or a horizontal straight line). Label the decision boundary as (1), and also indicate the +/- sides of this boundary.

(b) (2 points) Circle the point(s) that have the highest weight after the first boosting iteration.

(c) (5 points) After the points have been reweighted in the first boosting iteration, what is the weighted error of decision boundary (1)?

(d) (4 points) Draw the decision boundary corresponding to the second decision stump that the algorithm would choose (the decision boundary should be either a vertical or a horizontal straight line). Label the decision boundary as (2), and also indicate the +/- sides of this boundary.

(Please display your answers for (a), (b), and (d) in a single figure.)
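For parts (b) and (c), the standard AdaBoost weight update is the relevant machinery; here is a sketch of one round, assuming the usual alpha = 0.5 * ln((1 - eps) / eps) formulation from lecture:

```python
from math import exp, log

def adaboost_reweight(weights, correct):
    """One AdaBoost round: compute alpha from the weighted error of the
    chosen stump, up-weight its mistakes, down-weight its correct points,
    and renormalize. Returns (new_weights, alpha)."""
    eps = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * log((1 - eps) / eps)
    new = [w * exp(alpha if not c else -alpha)
           for w, c in zip(weights, correct)]
    z = sum(new)
    return [w / z for w in new], alpha

# A known consequence of this update: after reweighting, the same stump's
# weighted error is exactly 0.5, which is what part (c) is probing.
```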

5. (20 points) [Naïve Bayes + Decision Tree] [Ruth Okoilu] For this exercise, use the provided "hw2q5.csv", which contains 24 data points. It has six columns: each data point will be referred to by the first column "Id", and we will use columns 2-5 to predict the final column "Class" (whether or not a patient should have contact lenses).

(a) (15 points) Compare the performance of two classifiers, Naïve Bayes (NB) vs. Decision Tree (DT), using 5-fold cross-validation (CV), and report their 5-fold CV accuracy. For the ith fold, the testing dataset is composed of all the data points with Id mod 5 = i − 1. Follow the lecture's code to build your decision trees, except that multi-way splitting is allowed; use Information Gain (IG) to select the best attribute. In the case of ties, break ties in favor of the leftmost feature. For each fold, show the induced Naïve Bayes and DT models.
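The Id-based fold assignment can be stated directly in code (a sketch; `ids` would be the "Id" column, 1 through 24):

```python
def fold_ids(ids, i):
    """Return (train_ids, test_ids) for the ith fold (i = 1..5):
    the test points are exactly those with Id mod 5 == i - 1."""
    test = [d for d in ids if d % 5 == i - 1]
    train = [d for d in ids if d % 5 != i - 1]
    return train, test
```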

(b) (5 points) Based on the 5-fold CV accuracy from (a), which classifier, NB or DT, would you choose? Report your final model for the selected classifier. Show your work. No partial credit.

6. (15 points) [KNN + CV] [Ruth Okoilu] Consider the dataset in Table 1, with two real-valued inputs x1 and x2 and a binary output class y. Each data point will be referred to by the first column "ID" in the following. Use KNN with unweighted Euclidean distance to predict the class y.

| ID | x1 | x2 | y (Class) |
|----|----|----|-----------|
| 1  | 27 | 6  | -         |
| 2  | -6 | 2  | -         |
| 3  | 2  | 2  | +         |
| 4  | 36 | 2  | +         |
| 5  | -8 | 4  | -         |
| 6  | 40 | 2  | +         |
| 7  | 35 | 4  | -         |
| 8  | 30 | 2  | +         |
| 9  | 20 | 6  | +         |
| 10 | -1 | 4  | -         |

Table 1: KNN + CV

(a) (2 points) What are the 3 nearest neighbors of data points 5 and 10, respectively? (No partial credit.)

(b) (5 points) What is the leave-one-out cross-validation error of 1-NN on this dataset? (No partial credit.)
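Parts (a)-(b) can be cross-checked with a few lines of Python; here is a leave-one-out 1-NN sketch over (x1, x2, y) tuples:

```python
def loocv_error_1nn(points):
    """Leave-one-out error of 1-NN with unweighted Euclidean distance:
    each point is classified by its single nearest neighbor among the
    remaining points; the error is the fraction misclassified."""
    errors = 0
    for i, (x1, x2, y) in enumerate(points):
        _, label = min(
            (((x1 - a) ** 2 + (x2 - b) ** 2, lab)
             for j, (a, b, lab) in enumerate(points) if j != i),
            key=lambda t: t[0])
        errors += label != y
    return errors / len(points)
```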

(c) (5 points) What is the 5-fold cross-validation error of 3-NN on this dataset? For the ith fold, where i = 1, 2, 3, 4, 5, the testing dataset is composed of all the data points with ID mod 5 = i − 1. (No partial credit.)

(d) (3 points) Based on the results of (b) and (c), can we determine which is the better classifier, 1-NN or 3-NN? Why? (Answers without a correct justification will get zero points.)

