DD2424 - Assignment 1: One-Layer Perceptron
In this assignment you will train and test a one layer network with multiple outputs to classify images from the CIFAR-10 dataset. You will train the network using mini-batch gradient descent applied to a cost function that computes the cross-entropy loss of the classifier applied to the labelled training data and an L2 regularization term on the weight matrix.

Background 1: Mathematical background
The mathematical details of the network are as follows. Given an input vector, x, of size d × 1 our classifier outputs a vector of probabilities, p (K × 1), for each possible output label:

s = Wx + b                                                    (1)

p = SOFTMAX(s)                                                (2)

where the matrix W has size K × d, the vector b is K × 1 and SOFTMAX is defined as

SOFTMAX(s) = exp(s) / (1^T exp(s))                            (3)
The predicted class corresponds to the label with the highest probability:

k* = arg max_{1≤k≤K} {p_1, ..., p_K}                          (4)


 

Figure 1: Computational graph of (a) the classification function and (b) the loss function that is applied to each input x in this assignment.

The parameters W and b of our classifier are what we have to learn by exploiting labelled training data. Let D = {(x_i, y_i)}_{i=1}^{n}, with each y_i ∈ {1,...,K} and x_i ∈ R^d, represent our labelled training data. In the lectures we have described how to set the parameters by minimizing the cross-entropy loss plus a regularization term on W. Mathematically this cost function is

J(D, λ, W, b) = (1/|D|) Σ_{(x,y)∈D} l_cross(x, y, W, b) + λ Σ_{i,j} W_ij^2         (5)

where

l_cross(x, y, W, b) = −log(p_y)                                                    (6)

and p has been calculated using equations (1, 2). (Note: if the label is encoded by a one-hot representation y, then the cross-entropy loss is defined as −log(y^T p).) The optimization problem we have to solve is

W*, b* = arg min_{W,b} J(D, λ, W, b)                                               (7)

In this assignment (as described in the lectures) we will solve this optimization problem via mini-batch gradient descent.

For mini-batch gradient descent we begin with a sensible random initialization of the parameters W,b and we then update our estimate for the parameters with

W^(t+1) = W^(t) − η ∂J(B^(t+1), λ, W, b)/∂W                                        (8)

b^(t+1) = b^(t) − η ∂J(B^(t+1), λ, W, b)/∂b                                        (9)

where η is the learning rate and B(t+1) is called a mini-batch and is a random subset of the training data D and



∂J(B^(t+1), λ, W, b)/∂W = (1/|B^(t+1)|) Σ_{(x,y)∈B^(t+1)} ∂l_cross(x, y, W, b)/∂W + 2λW        (10)

∂J(B^(t+1), λ, W, b)/∂b = (1/|B^(t+1)|) Σ_{(x,y)∈B^(t+1)} ∂l_cross(x, y, W, b)/∂b              (11)

To compute the relevant gradients for the mini-batch, we then have to compute the gradient of the loss w.r.t. each training example in the mini-batch. You should refer to the lecture notes for the explicit description of how to compute these gradients.

Before Starting
I assume that you will complete the assignment in Matlab. You may complete the assignment in another programming language, but if you do I will not answer programming-specific questions and you will probably have to find your own way to display, plot and graph your results.

Besides invoking Matlab commands, you will be required to run a few operating system commands. For these commands I will assume your computer's operating system is either Linux or Unix. If otherwise, you'll have to fend for yourself. But all the non-Matlab commands needed are more-or-less trivial.

Figure 2: Computational graph of the cost function applied to a mini-batch containing one training example x. If you have a mini-batch of size greater than one, then the loss computations are repeated for each entry in the mini-batch (as in the above graph), but the regularization term is only computed once.

The notes for this assignment, and those to follow, will give you pointers about which Matlab commands to use. However, I will not give detailed explanations about their usage. I assume you have some previous experience with Matlab, are aware of many of its in-built functions, and know how to manipulate vectors and matrices and write your own functions, etc. Keep in mind the function help can be called to obtain information about particular functions. So for example

help plot will display information about the Matlab command plot.

Background 2: Getting Started
Set up your environment

Create a new directory to hold all the Matlab files you will write for this course:

$ mkdir DirName

$ cd DirName

$ mkdir Datasets

$ mkdir Result_Pics

Download the CIFAR-10 dataset stored in its Matlab format from this link.

Move the cifar-10-matlab.tar.gz file to the Datasets directory you have just created, untar the file and then move up to the parent directory. Also download the file montage.m from the Canvas webpage for Assignment 1 and move it to DirName.

$ mv montage.m DirName/

$ mv cifar-10-matlab.tar.gz DirName/Datasets

$ cd DirName/Datasets

$ tar xvfz cifar-10-matlab.tar.gz

$ cd ..

Background 3: Useful Display Function
You have copied a function called montage.m, which is a slightly modified version of the function available at: http://www.mathworks.com/matlabcentral/fileexchange/22387

This is a useful function as it allows you to efficiently view the images in a directory or in a Matlab array or a cell array. To look at some of the images from the CIFAR-10 dataset you can use the following set of commands:

addpath DirName/Datasets/cifar-10-batches-mat/;

A = load('data_batch_1.mat');

I = reshape(A.data', 32, 32, 3, 10000);

I = permute(I, [2, 1, 3, 4]);

montage(I(:, :, :, 1:500), 'Size', [5,5]);

This sequence of commands tells Matlab to add the directory of the CIFAR-10 dataset to its path. Then it loads one of the .mat files containing image and label data. You access the image data with the command A.data. It has size 10000 × 3072. Each row of A.data corresponds to an image of size 32 × 32 × 3 that has been flattened into a row vector. We can re-arrange A.data into an array format expected by montage (32×32×3×10000) using the command reshape. After reshaping the array, the rows and columns of each image still need to be permuted and this is achieved with the permute command. Now you have an array format montage expects. In the above code we just view the first 500 images. Use help to find out the different ways montage can be called.

You have looked at some of the CIFAR-10 images. Now it is time to start writing some code.

Exercise 1: Training a multi-linear classifier
For this assignment you will just use data in the file data_batch_1.mat for training, the file data_batch_2.mat for validation and the file test_batch.mat for testing. Create a file Assignment1.m. In this file you will write the code for this assignment and the necessary (sub-)functions. Here are my recommendations for which functions to write and the order in which to write them:

1.    Write a function that reads in the data from a CIFAR-10 batch file and returns the image and label data in separate variables. Make sure to convert your image data to single or double format. I would suggest the function has the following inputs and outputs

function [X, Y, y] = LoadBatch(filename)

where

•    X contains the image pixel data, has size d×n, is of type double or single and has entries between 0 and 1. n is the number of images (10000) and d the dimensionality of each image (3072=32×32×3).

•    Y is K×n (K= # of labels = 10) and contains the one-hot representation of the label for each image.

•    y is a vector of length n containing the label for each image. A note of caution: CIFAR-10 encodes the labels as integers between 0 and 9, but Matlab indexes matrices and vectors starting at 1. Therefore it may be easier to encode the labels between 1 and 10.

This file will not be long. You just need to call A = load(fname); and then rearrange and store the values in A.data and A.labels.

Top-level: Read in and store the training, validation and test data.
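As a rough sketch, LoadBatch could look like the following (this assumes the label field of the loaded struct is A.labels, as in the standard CIFAR-10 Matlab batches, and uses the 1-10 label encoding):

function [X, Y, y] = LoadBatch(filename)
    % Returns X (d x n, double, entries in [0,1]), Y (K x n one-hot) and y (labels in 1..10).
    A = load(filename);
    X = double(A.data') / 255;            % each column is one flattened image
    y = double(A.labels') + 1;            % shift the labels from 0-9 to 1-10
    K = 10;
    n = size(X, 2);
    Y = zeros(K, n);
    Y(sub2ind([K, n], y, 1:n)) = 1;       % one-hot encode the labels
end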

2.    Next we should pre-process the raw input data as it helps training. You should transform the training data to have zero mean. If trainX is the d×n image data matrix (each column corresponds to an image) for the training data then

mean_X = mean(trainX, 2);
std_X = std(trainX, 0, 2);

Both mean_X and std_X have size d×1.

You should normalize the training, validation and test data with respect to the mean and standard deviation values computed from the training data as follows. If X is a d×n image data matrix then you can normalize X as

X = X - repmat(mean_X, [1, size(X, 2)]);
X = X ./ repmat(std_X, [1, size(X, 2)]);

Top-level: Compute the mean and standard deviation vector for the training data and then normalize the training, validation and test data w.r.t. these mean and standard deviation vectors.
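Putting it together, the top-level normalization step might look like this (a sketch assuming the three data matrices are named trainX, validX and testX):

mean_X = mean(trainX, 2);
std_X  = std(trainX, 0, 2);

% normalize all three sets with the statistics of the training data
trainX = (trainX - repmat(mean_X, [1, size(trainX, 2)])) ./ repmat(std_X, [1, size(trainX, 2)]);
validX = (validX - repmat(mean_X, [1, size(validX, 2)])) ./ repmat(std_X, [1, size(validX, 2)]);
testX  = (testX  - repmat(mean_X, [1, size(testX, 2)]))  ./ repmat(std_X, [1, size(testX, 2)]);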

3.    Top-Level: After reading in and pre-processing the data, you can initialize the parameters of the model W and b as you now know what size they should be. W has size K×d and b is K×1. Initialize each entry to have Gaussian random values with zero mean and standard deviation .01. You should use the Matlab function randn to create this data.
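In code this initialization is only a few lines, for example (a sketch with K=10 and d=3072; fixing the seed with rng is optional but makes runs reproducible):

K = 10; d = 3072;
rng(400);                      % optional: fix the seed for reproducibility
W = 0.01 * randn(K, d);        % Gaussian entries, zero mean, std .01
b = 0.01 * randn(K, 1);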

4.    Write a function that evaluates the network function, i.e. equations (1, 2), on multiple images and returns the results. I would suggest the function has the following form

function P = EvaluateClassifier(X, W, b)

where

•    each column of X corresponds to an image and it has size d×n.

•    W and b are the parameters of the network.

•    each column of P contains the probability for each label for the image in the corresponding column of X. P has size K×n.

Top-level: Check the function runs on a subset of the training data given a random initialization of the network’s parameters:

P = EvaluateClassifier(trainX(:, 1:100), W, b).
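A minimal sketch of EvaluateClassifier along these lines (subtracting the column-wise maximum before exponentiating is an optional safeguard against overflow, not something the assignment requires):

function P = EvaluateClassifier(X, W, b)
    % X is d x n, W is K x d, b is K x 1; P is K x n with one probability column per image.
    n = size(X, 2);
    K = size(W, 1);
    s = W * X + repmat(b, [1, n]);              % equation (1) for every column of X
    s = s - repmat(max(s, [], 1), [K, 1]);      % optional numerical stabilisation
    expS = exp(s);
    P = expS ./ repmat(sum(expS, 1), [K, 1]);   % equation (2), column-wise softmax
end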

5.    Write the function that computes the cost function given by equation (5) for a set of images. I suggest the function has the following inputs and outputs

function J = ComputeCost(X, Y, W, b, lambda)

where

•    each column of X corresponds to an image and X has size d×n.

•    each column of Y (K×n) is the one-hot ground truth label for the corresponding column of X or Y is the (1×n) vector of ground truth labels.

•    J is a scalar corresponding to the sum of the loss of the network’s predictions for the images in X relative to the ground truth labels and the regularization term on W.
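A sketch of ComputeCost for the case where Y is the K×n one-hot matrix (the 1×n label-vector variant only differs in how p_y is picked out):

function J = ComputeCost(X, Y, W, b, lambda)
    % Average cross-entropy loss over the columns of X plus the L2 penalty on W, equation (5).
    n = size(X, 2);
    P = EvaluateClassifier(X, W, b);
    loss = -sum(log(sum(Y .* P, 1))) / n;    % -log(p_y) for each image, then the mean
    J = loss + lambda * sum(W(:).^2);
end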

6.    Write a function that computes the accuracy of the network's predictions given by equation (4) on a set of data. Remember the accuracy of a classifier for a given set of examples is the percentage of examples for which it gets the correct answer. I suggest the function has the following inputs and outputs

function acc = ComputeAccuracy(X, y, W, b)

where

•    each column of X corresponds to an image and X has size d×n.

•    y is the vector of ground truth labels of length n.

•    acc is a scalar value containing the accuracy.
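A sketch of ComputeAccuracy, assuming the labels in y use the 1-10 encoding:

function acc = ComputeAccuracy(X, y, W, b)
    % Fraction of images whose most probable class matches the ground truth label.
    P = EvaluateClassifier(X, W, b);
    [~, kstar] = max(P, [], 1);     % predicted label per image, equation (4)
    acc = mean(kstar == y(:)');     % y(:)' ensures y is a row vector of length n
end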

7.    Write the function that evaluates, for a mini-batch, the gradients of the cost function w.r.t. W and b, that is equations (10, 11). I suggest the function has the form

function [grad_W, grad_b] = ComputeGradients(X, Y, P, W, lambda)

where

•    each column of X corresponds to an image and it has size d×n.

•    each column of Y (K×n) is the one-hot ground truth label for the corresponding column of X.

•    each column of P contains the probability for each label for the image in the corresponding column of X. P has size K×n.

•    grad_W is the gradient matrix of the cost J relative to W and has size K×d.

•    grad_b is the gradient vector of the cost J relative to b and has size K×1.

Be sure to check out how you can efficiently compute the gradient for a batch from the last slide of Lecture 3. This can lead to a much faster implementation (roughly 3 times faster) than looping through each training example in the batch.
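A hedged sketch of such a vectorised ComputeGradients is given below; here G stacks, column by column, the per-example gradient of the loss w.r.t. s, which for the softmax plus cross-entropy loss is −(y − p):

function [grad_W, grad_b] = ComputeGradients(X, Y, P, W, lambda)
    % X: d x n, Y: K x n one-hot, P: K x n probabilities. Gradients of the cost in equation (5).
    n = size(X, 2);
    G = -(Y - P);                             % gradient of the loss w.r.t. s, one column per image
    grad_W = (G * X') / n + 2 * lambda * W;   % equation (10)
    grad_b = sum(G, 2) / n;                   % equation (11)
end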

Everyone makes mistakes when computing gradients. Therefore you must always check your analytic gradient computations against numerical estimations of the gradients! Download code from the Canvas webpage that computes the gradient vectors numerically. Note there are two versions: 1) a slower but more accurate version based on the centered difference formula and 2) a faster but less accurate version based on the finite difference method. You will probably have to make small changes to the functions to make them compatible with your code. It will take some time to run the numerical gradient calculations as your network has 32×32×3×10 different parameters in W. Initially, you should just perform your checks on mini-batches of size 1 and with no regularization (lambda=0). Afterwards you can increase the size of the mini-batch and include regularization in the computations. Another trick is that you should send in a reduced-dimensionality version of trainX and W, so instead of

[ngrad_b, ngrad_W] = ComputeGradsNumSlow(trainX(:, 1), trainY(:, 1), W, b, lambda, 1e-6);

you can compute the gradients numerically for smaller inputs (the first 20 dimensions of the first training example)

[ngrad_b, ngrad_W] = ComputeGradsNumSlow(trainX(1:20, 1), trainY(:, 1), W(:, 1:20), b, lambda, 1e-6);

You should then also compute your analytical gradients on this reduced version of the input data with reduced dimensionality. This will speed up computations and also reduce the risk of numerical precision issues (very possible when the full W is initialized with very small numbers and trainX also contains small numbers).

You could compare the numerically and analytically computed gradient vectors (matrices) by examining their absolute differences and declaring that, if all these absolute differences are small (<1e-6), then they have produced the same result. However, when the gradient vectors have small values this approach may fail. A more reliable method is to compute the relative error between a numerically computed gradient value gn and an analytically computed gradient value ga

|ga − gn| / max(eps, |ga| + |gn|)

where eps is a very small positive number,

and check this is small. There are potentially more issues that can plague numerical gradient checking (especially when you start to train deep rectifier networks), so I suggest you read the relevant section of the Additional material for lecture 3 from Stanford's course Convolutional Neural Networks for Visual Recognition for a more thorough exposition, especially for the subsequent assignments.
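For instance, a small check along these lines (a sketch; grad_W, grad_b are your analytic gradients and ngrad_W, ngrad_b the numerical ones) prints the largest relative error:

eps_val = 1e-9;   % small positive number guarding against division by zero
rel_err_W = max(max(abs(grad_W - ngrad_W) ./ max(eps_val, abs(grad_W) + abs(ngrad_W))));
rel_err_b = max(abs(grad_b - ngrad_b) ./ max(eps_val, abs(grad_b) + abs(ngrad_b)));
fprintf('max relative error: W %e, b %e\n', rel_err_W, rel_err_b);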

Do not continue with the rest of this assignment until you are sure your analytic gradient code is correct. If you are having problems, set the seed of the random number generator with the command rng to ensure at each test W and b have the same values and double/triple check that you have a correct implementation of the gradient equations from the lecture notes.

8. Once you have the gradient computations debugged you are now ready to write the code to perform the mini-batch gradient descent algorithm to learn the network’s parameters where the updates are defined in equations (8, 9). You have a couple of parameters controlling the learning algorithm (for this assignment you will just implement the most vanilla version of the mini-batch gradient descent algorithm, with no adaptive tuning of the learning rate or momentum terms):

•    n_batch the size of the mini-batches

•    eta the learning rate

•    n_epochs the number of runs through the whole training set.

As the images in the CIFAR-10 dataset are in random order, the easiest way to generate each mini-batch is to just run through the images sequentially. Let n_batch be the number of images in a mini-batch. Then for one epoch (a complete run through all the training images), you can generate the set of mini-batches with this snippet of code:

for j=1:n/n_batch
    j_start = (j-1)*n_batch + 1;
    j_end = j*n_batch;
    inds = j_start:j_end;
    Xbatch = Xtrain(:, j_start:j_end);
    Ybatch = Ytrain(:, j_start:j_end);
end

(A slight upgrade of this default implementation is to randomly shuffle your training examples before each epoch. One efficient way to do this is via the command randperm which when given the input n returns a vector containing a random permutation of the integers 1:n.) I suggest the mini-batch learning function has these inputs and outputs

function [Wstar, bstar] = MiniBatchGD(X, Y, GDparams, W, b, lambda)

where X contains all the training images, Y the labels for the training images, W, b are the initial values for the network’s parameters, lambda is the regularization factor in the cost function and

• GDparams is an object containing the parameter values n_batch, eta and n_epochs

For my initial experiments I set n_batch=100, eta=.001, n_epochs=20 and lambda=0. To help you debug I suggest that after each epoch you compute the cost function on all the training data and print it out (and save it). For these parameter settings you should see that the training cost decreases for each epoch. After the first epoch my cost score on all the training data was 1.981428, where I had set the random number generator seed to rng(400) and I had initialized the weight matrix before the bias vector. In Figure 3 you can see the training cost score when I run these parameter settings for 40 epochs. The cost score on the validation set is plotted in red in the same figure.
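A minimal sketch of MiniBatchGD under these assumptions (GDparams is a struct with fields n_batch, eta and n_epochs; the training cost is printed after every epoch for debugging):

function [Wstar, bstar] = MiniBatchGD(X, Y, GDparams, W, b, lambda)
    % Vanilla mini-batch gradient descent, equations (8) and (9).
    n = size(X, 2);
    for epoch = 1:GDparams.n_epochs
        for j = 1:floor(n / GDparams.n_batch)
            j_start = (j-1) * GDparams.n_batch + 1;
            j_end = j * GDparams.n_batch;
            Xbatch = X(:, j_start:j_end);
            Ybatch = Y(:, j_start:j_end);
            P = EvaluateClassifier(Xbatch, W, b);
            [grad_W, grad_b] = ComputeGradients(Xbatch, Ybatch, P, W, lambda);
            W = W - GDparams.eta * grad_W;
            b = b - GDparams.eta * grad_b;
        end
        fprintf('epoch %d: training cost %f\n', epoch, ComputeCost(X, Y, W, b, lambda));
    end
    Wstar = W;
    bstar = b;
end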

(Note: in Tensorflow and other software packages they count in the number of update steps as opposed to the number of training epochs. Given n training images and mini-batches of size n batch then one epoch will result in n/n batch update steps of the parameter values. Obviously, the performance of the resulting network depends on the number of update steps and how much training data is used. Therefore to make fair comparisons one should make sure the number of update steps is consistent across runs - for instance if you change the size of the mini-batch, number of training images, etc.. you may need to run more or fewer epochs of training.)

Figure 3: The graph of the training and validation loss computed after every epoch. The network was trained with the following parameter settings: n_batch=100, eta=.001, n_epochs=40 and lambda=0.

When you have finished training you can compute the accuracy of your learnt classifier on the test data. My network achieves (after 40 epochs) an accuracy of 38.83% (with a random shuffling of the training examples at the beginning of each epoch). Much better than random but not great. Hopefully, we will achieve improved accuracies in the assignments to come when we build and train more complex networks.

After training you can also visualize the weight matrix W as an image and see what class templates your network has learnt. Figure 4 shows the templates my network learnt. Here is a code snippet to re-arrange each row of W (assuming W is a 10×d matrix) into a set of images that can be displayed by Matlab.

for i=1:10
    im = reshape(W(i, :), 32, 32, 3);
    s_im{i} = (im - min(im(:))) / (max(im(:)) - min(im(:)));
    s_im{i} = permute(s_im{i}, [2, 1, 3]);
end

Then you can use either imshow, imagesc or montage to display the images in the cell array s_im.



Figure 4: The learnt W matrix visualized as class template images. The network was trained with the following parameter settings: n_batch=100, eta=.001, n_epochs=40 and lambda=0.

To complete the assignment:

In the DD2424 Canvas Assignment 1 page upload:

•    The final (and cleaned up) version of the code you have written for the assignment. If you have written the code in multiple files please place them in one file for the uploaded version. This will make life easier for me and the TAs! Note we will not run your code.

•    A pdf document containing a short report. Please do not zip your code and pdf document into one file and upload the zip file; then the graders have to download the zip file and uncompress it, which is time consuming. In the report you should state whether you successfully managed to write the functions to correctly compute the gradient analytically, what tests you ran to check against the numerically computed gradient and the results of these tests. You should include the following plots/figures

1.    Graphs of the total loss and the cost function on the training data and the validation data after each epoch of the mini-batch gradient descent algorithm.

2.    Images representing the learnt weight matrix after the completion of training.

for the following parameter settings

–    lambda=0, n_epochs=40, n_batch=100, eta=.1

–    lambda=0, n_epochs=40, n_batch=100, eta=.001

–    lambda=.1, n_epochs=40, n_batch=100, eta=.001

–    lambda=1, n_epochs=40, n_batch=100, eta=.001

You should also report the final test accuracy your network achieves after each of these training runs. You should also make a short comment on the effect of increasing the amount of regularization and the importance of the correct learning rate.

Exercise 2: Optional for Bonus Points
1.    Optimize the performance of the network. It would be interesting to discover the best possible performance achievable by Assignment 1's network on CIFAR-10. Here are some tricks/avenues you can explore to help bump up performance (most of the same tricks can be used for more complicated networks and they would probably have more of an effect for those networks):

(a)    Use all the available training data for training (all five batches minus a small subset of the training images for a validation set). Decrease the size of the validation set down to ∼ 1000.

(b)    Train for a longer time and use your validation set to make sure you don’t overfit or to keep a record of the best model before you begin to overfit.

(c)     Do a grid search to find good values for the amount of regularization, the learning rate and the batch size. There is some significant empirical evidence that training with smaller batch sizes when using SGD training leads to better generalization.

(d)    Play around with decaying the learning rate by a factor ∼ .9 after each epoch. Or you can decay the learning rate by a factor of 10 after every nth epoch.

(e)    Implement Xavier initialization and comment on whether it stabilizes training. (Minimal sketches of (d) and (e) are given after this list.)

(f)    Augment your training data by applying small random geometric and photometric jitter to the original training data. You can do this on the fly by applying a random jitter to each image in the mini-batch before it is used for training.
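Minimal sketches of (d) and (e), assuming K, d and eta are defined as before (the 1/sqrt(d) scaling is one common form of Xavier initialization):

% (e) Xavier-style initialization: scale by 1/sqrt(fan-in) instead of a fixed .01
W = randn(K, d) / sqrt(d);
b = zeros(K, 1);

% (d) decay the learning rate by a factor of roughly .9 after each epoch
eta = eta * 0.9;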
