# CSE 417T: Homework 6 solved

Problems:

1. (15 points) Entropy for Decision Tree Learning

When deciding which split (feature and category or threshold) to use for the nodes when learning a decision tree, we aim to maximize label purity in the child nodes. One way to achieve this goal is to maximize the divergence from the uniform label distribution.

(a) Show that maximizing the divergence of the label distribution in the child nodes from the uniform distribution is equivalent to minimizing entropy. HINT: use KL-divergence (Kullback-Leibler divergence) to quantify the distance between two discrete probability distributions.
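A sketch of the identity the hint points to, for a label distribution $p$ over $k$ classes and the uniform distribution $u(c) = 1/k$:

```latex
D_{\mathrm{KL}}(p \parallel u)
  = \sum_{c=1}^{k} p(c) \log \frac{p(c)}{1/k}
  = \log k + \sum_{c=1}^{k} p(c) \log p(c)
  = \log k - H(p).
```

Since $\log k$ does not depend on the split, maximizing the divergence from $u$ is the same as minimizing the entropy $H(p)$.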

(b) In the lecture we maximized information gain to decide on the splits. Show that maximizing information gain is equivalent to minimizing entropy. HINT: you can assume binary splits.

(c) What is the time complexity to find the best split using entropy assuming binary features? What is the time complexity assuming continuous features?

2. (35 points) Efficient Decision Tree Learning

You are building a regression tree (CART), and your recursive function is called with data $D = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)\}$, where we have continuous attributes $\vec{x}_i \in \mathbb{R}^d$ and continuous labels $y_i \in \mathbb{R}$. For a leaf, let the prediction be the average predictor $p \leftarrow \bar{y} = \frac{1}{|S|} \sum_{(\vec{x}_i, y_i) \in S} y_i$, where $S$ is the set of all data points in that leaf. We use the label variance as the impurity measure.

(a) Show that $p$ minimizes the average squared loss $L(D) = \frac{1}{|D|} \sum_{(\vec{x}_j, y_j) \in D} (y_j - p)^2$. Explain why this must be the case.
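A sketch of the first-order argument: $L(D)$ is convex in $p$, so setting its derivative to zero gives the minimizer:

```latex
\frac{\partial L}{\partial p}
  = \frac{2}{|D|} \sum_{(\vec{x}_j, y_j) \in D} (p - y_j) = 0
  \quad\Longrightarrow\quad
  p = \frac{1}{|D|} \sum_{(\vec{x}_j, y_j) \in D} y_j = \bar{y}.
```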

(b) If the termination conditions are not met (i.e., you do not create a leaf), you need to split. For this, you need to find the best split for any feature $f$. Assume we have sorted our data set according to some feature $f$, such that $f_1 \le f_2 \le \cdots \le f_n$, where $f_i$ is the value of feature $f$ in example $\vec{x}_i$. Let all relevant splits be $c_1 \le c_2 \le \cdots \le c_{n-1}$, for example $c_i = \frac{f_i + f_{i+1}}{2}$.

i. Define the predictors $\bar{y}_L^i$, $\bar{y}_R^i$ of the left and right sub-trees after cutting at $c_i$ (left: $f \le c_i$) in terms of $y_1, \ldots, y_n$.

ii. Write down the expressions for the losses $L_L^i$ and $L_R^i$ on the left and right sub-trees after cutting at $c_i$. What is the time complexity of computing both terms for any cut point $c_i$?

iii. Express $\bar{y}_L^{i+1}$, $\bar{y}_R^{i+1}$ in terms of $\bar{y}_L^i$, $\bar{y}_R^i$, and $y_{i+1}$.

iv. Let $s^i = \sum_{j=1}^{i} y_j^2$ and $r^i = \sum_{j=i+1}^{n} y_j^2$. Express $L_L^{i+1}$, $L_R^{i+1}$ in terms of $y_{i+1}$, $\bar{y}_L^i$, $\bar{y}_R^i$, $s^i$, $r^i$.

v. Write the full update rule for iteration $i+1$, assuming you have $\bar{y}_L^i$, $\bar{y}_R^i$, $s^i$, $r^i$, $L_L^i$, $L_R^i$.

vi. What is the time complexity of finding the best split using this method, assuming your data is already sorted? What is the time complexity of sorting $D$ for all features? Compare the time complexity of building an entire tree with the approach developed in this problem to that of building an entire tree using entropy as in problem 1(c).
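The constant-time updates above make the whole scan linear. As an illustration only (the course stubs are MATLAB; this standalone Python sketch uses the sum-of-squares identity $\sum (y - \bar{y})^2 = \sum y^2 - (\sum y)^2 / m$ and the names are not from the stubs):

```python
def best_split(f, y):
    """Find the cut on sorted feature values f that minimizes the
    summed squared loss of the two children, in a single O(n) pass.
    Assumes f is sorted ascending and len(f) == len(y) >= 2."""
    n = len(y)
    sum_l, sq_l = 0.0, 0.0                  # running sum / sum of squares, left child
    sum_r, sq_r = sum(y), sum(v * v for v in y)
    best_loss, best_cut = float("inf"), None
    for i in range(n - 1):                  # consider the cut between example i and i+1
        sum_l += y[i]; sq_l += y[i] ** 2    # move example i from right to left
        sum_r -= y[i]; sq_r -= y[i] ** 2
        if f[i] == f[i + 1]:                # identical feature values cannot be separated
            continue
        # squared error around each child's mean via sum y^2 - (sum y)^2 / m
        loss = (sq_l - sum_l ** 2 / (i + 1)) + (sq_r - sum_r ** 2 / (n - i - 1))
        if loss < best_loss:
            best_loss, best_cut = loss, (f[i] + f[i + 1]) / 2
    return best_cut, best_loss
```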


3. (50 points) Implementation: Bagged and Boosted Decision Trees

In this assignment you will implement a decision tree algorithm and then use it for bagging and boosting. Update your SVN repository. All of the following stub files are in the hw6 folder:

Stubs:

- `id3tree.m`: Returns a decision tree using the minimum entropy splitting rule (implemented for you in `entropysplit.m`).
- `evaltree.m`: Evaluates a decision tree on test data.
- `prunetree.m`: Prunes a tree using a validation set.
- `forest.m`: Builds a forest of id3trees.
- `evalforest.m`: Learns and evaluates a forest.
- `boosttree.m`: Applies AdaBoost to your id3tree.
- `evalboost.m`: Learns and evaluates a boosted classifier.

Already implemented:

- `entropysplit.m`: Implements minimum entropy splitting using information gain/entropy.
- `cart_tictoc.m`: Computes train and test accuracy and runtime for all your algorithms (uses `analyze.m`).
- `cart_visual.m`: Visualizes trees (uses `vistree.m` and `scat.m`).

Datasets (download from Piazza resources):

- `iris.mat`: Subset of the Iris flower dataset (https://en.wikipedia.org/wiki/Iris_flower_data_set).
- `heart.mat`: Another dataset for binary classification.

As MATLAB does not support pointers and is really only good with matrices, we will represent a tree through a matrix T. A tree T with q nodes must be of dimensions 6 × q, where each column represents a different node in the tree. The very first column is the root node, and each column stores the following information:

(1) prediction at this node

(2) index of feature to cut

(3) cutoff value c (≤ c : left, > c : right)

(4) index of left subtree (0 = leaf)

(5) index of right subtree (0 = leaf)

(6) parent (0 = root)
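To make the encoding concrete, here is a hypothetical Python sketch of walking such a matrix for one example (the actual assignment is in MATLAB with 1-based columns; `eval_tree` and its argument layout are illustrative, not part of the stubs):

```python
def eval_tree(T, x):
    """Predict for one example x given a tree T encoded as a list of
    columns [prediction, feature, cutoff, left, right, parent].
    0 in the left/right slots marks a leaf; node indices are 1-based
    as in the MATLAB convention, so column j of T is T[j - 1]."""
    node = 1                                  # root is the first column
    while True:
        pred, feat, cut, left, right, _parent = T[node - 1]
        if left == 0 and right == 0:          # leaf: return stored prediction
            return pred
        node = left if x[int(feat) - 1] <= cut else right
```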

entropysplit.m searches through all features and returns the best split (feature, cut-threshold combination) based on information gain (in this case, entropy). You don't need to implement this; it is already done for you in the file entropysplit.m. Please take a minute and familiarize yourself with the implementation. It will make subsequent tasks easier.

(a) Implement the function id3tree.m, which returns a decision tree based on the minimum entropy splitting rule. The function takes training data, test data, a maximum depth, and the weight of each training example (used in entropy splitting). You can visualize your tree with cart_visual. Maximum depth and weight are optional arguments; if they are not provided, you should make the maximum depth infinity and weight each example equally. (Hint: To speed up your code, make sure you initialize the matrix T with the command T = zeros(6, q) with some estimate of q.) You should use the function entropysplit.m.

(b) Implement the function evaltree.m, which evaluates a decision tree on a given test data set. You can test the accuracy of your implementation with cart_tictoc.

(c) Decision trees (without a depth limit) are high-variance classifiers, so overfitting is a serious problem. One way around it is to prune your tree to stop making "silly" splits that overfit to the training set. You prune a tree bottom-up, by removing leaves that do not reduce the validation error (reduced-error pruning). Implement the function prunetree.m, which returns a decision tree pruned for a validation data set. The accuracy on the validation set should be the same or higher after pruning.

(d) A more effective way to prevent overfitting is to use bagging. Implement the function forest.m, which builds a forest (not a random forest) of ID3 decision trees. Each tree should be built using training data drawn by randomly sampling n examples from the training data with replacement; you can use MATLAB's randsample function to do this. (Do not randomly sample features.)

(e) Implement the function evalforest.m, which evaluates a forest on a test data set. Remember that forests make predictions by majority vote.
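The two ingredients of (d) and (e) can be sketched in Python (illustrative only; in MATLAB you would use `randsample` for the sampling, and the function names here are hypothetical):

```python
import random

def bootstrap_sample(data):
    """Draw n examples from the training data with replacement,
    the bagging analogue of randsample(n, n, true) in MATLAB."""
    return [random.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine one label per tree into the most common label."""
    return max(set(predictions), key=predictions.count)
```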

(f) Another option to improve your decision trees is to build trees of small depth (e.g., only depth 3 or 4). These do not have high variance, but instead suffer from high bias. You can reduce the bias of a classifier with boosting. Implement the function boosttree.m, which applies AdaBoost to your id3tree function.
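One AdaBoost round can be sketched as follows (a hedged Python illustration with ±1 labels; `adaboost_step` is not one of the stub functions, and the real `boosttree.m` must also refit the weak learner on the reweighted data each round):

```python
import math

def adaboost_step(weights, y_true, y_pred):
    """One AdaBoost round: given current normalized example weights and
    the weak learner's +-1 predictions, return (alpha, new_weights).
    Assumes the weighted error is strictly between 0 and 0.5."""
    eps = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - eps) / eps)   # vote weight of this learner
    # upweight mistakes (t*p = -1), downweight correct examples (t*p = +1)
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, y_true, y_pred)]
    z = sum(new_w)                            # renormalize to a distribution
    return alpha, [w / z for w in new_w]
```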

(g) Implement the function evalboost.m, which evaluates an ensemble of boosted decision trees.

Once you have implemented all functions, you can run cart_tictoc.m, which computes training and test error as well as runtime for all four algorithms. Try different datasets (you may also use the ones from the previous homeworks[^1]) and play around with the input parameters of your forest and AdaBoost methods.

Submit your implementation by committing all your functions and the partners.txt file to the HW6 folder in your SVN repository.

[^1]: If you want to use digits or faces, you will need to adapt your boosting algorithm for multi-class classification. (Not required for the homework.)
