# ECBM E6040 HOMEWORK #3 solution

INSTRUCTIONS: This homework contains two programming assignments. Submission for this homework will be via Bitbucket repositories created for each student and should contain the following:

• All figures and discussions; document all parameters you used in the IPython notebook file, hw3b.ipynb, which is already included in the homework 3 repository.

• Commit and push all the changes you made to the skeleton code in the Python files, hw3a.py and hw3b.py.

Programming

As the semester progresses, we are shifting our focus more and more towards programming.

In this homework, you will empirically study various regularization methods for neural networks, and experiment with different convolutional neural network (CNN) configurations. You should start by going through the Deep Learning Tutorials Project, especially LeNet. The source code provided in the Homework 3 repository is excerpted from logistic_sgd.py, mlp.py, and convolutional_mlp.py.

As in the previous homework, you will be using the same Street View House Numbers (SVHN) dataset [1]. A recent investigation has achieved superior classification results on the SVHN dataset, with above 95% accuracy (by using a CNN with some modifications) [2].

Instead of reproducing the superior testing accuracy, your task is to explore the CNN framework from various points of view.

As in the previous homework, a Python routine called load_data is provided to you for downloading and preprocessing the dataset. You should use it unless you have a compelling reason not to. The first time you call load_data, it will take some time to download the dataset (about 180 MB). If you already have the dataset on the EC2 volume, you should simply reuse it. Please be careful NOT TO commit the dataset files into the repository. In addition to load_data, you are provided with various skeleton functions.

Note that all the results, figures, and parameters should be placed inside the IPython notebook file hw3.ipynb.

PROBLEM a (50 points)

In this problem, you are asked to empirically test several regularization methods for neural networks, which are discussed in Chapter 7 of the textbook. To better see the effect of regularization, you will be using a smaller training dataset down-sampled from the original SVHN dataset (generated by load_data with an additional input argument ds_rate). The testing dataset remains the same.

You will start by training a neural network model without any regularization, except optionally with L1 or L2 regularization. The testing result of this model serves as a baseline for comparison against different regularization methods. If you do use L1 or L2 regularization for the baseline model, you should also include them with the same parameters for the other models with different regularization methods.
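As a minimal sketch of what the optional L1 or L2 regularization adds to the baseline cost, the NumPy snippet below sums the absolute and squared entries of the weight matrices and adds them, scaled by two coefficients, to a data cost. The names `regularized_cost`, `l1_reg`, and `l2_reg`, as well as all numeric values, are illustrative assumptions and are not taken from the skeleton code.

```python
import numpy as np

def regularized_cost(nll, weights, l1_reg=0.0, l2_reg=0.0):
    """Add L1 and squared-L2 weight penalties to a data cost (hypothetical helper)."""
    l1 = sum(np.abs(W).sum() for W in weights)      # sum of absolute weight values
    l2_sqr = sum((W ** 2).sum() for W in weights)   # sum of squared weight values
    return nll + l1_reg * l1 + l2_reg * l2_sqr

# Toy weights and a made-up negative log-likelihood value, for illustration only.
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0], [-1.0]])
cost = regularized_cost(nll=2.0, weights=[W1, W2], l1_reg=0.01, l2_reg=0.001)
```

Keeping the two coefficients fixed across the baseline and the other models makes the later comparisons reflect only the additional regularization method.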

For the neural network, you could use either an MLP or a CNN (from Problem b). A myMLP class has been provided to you.

i Implement an MLP or a CNN, and train it with the smaller dataset. Then, train the same model again with the complete dataset. Document your choice of parameters, and report the testing accuracy in both cases. For MLP, you could reuse any sets of parameters that you implemented in the previous homework.

ii Noise injection is a common method for regularization when the dataset is limited. For each example in the smaller dataset, generate several copies and add a randomly sampled noise vector to each of them. A skeleton function test_noise_inject_at_input is provided to you. Train the same model from (i) with the new noisy dataset. Repeat the same procedure with another level of noise. Document your choice of noise, discuss the testing accuracy, and compare the result with those in (i).
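One possible way to build the noisy dataset is sketched below in NumPy. It assumes the examples are stored as rows of a 2-D array and uses zero-mean Gaussian noise; the function name `inject_input_noise`, the number of copies, and the noise level are hypothetical choices, not part of the skeleton code.

```python
import numpy as np

def inject_input_noise(x, y, n_copies=3, noise_std=0.05, seed=0):
    """Return n_copies noisy versions of every row of x, with matching labels."""
    rng = np.random.default_rng(seed)
    x_rep = np.repeat(x, n_copies, axis=0)   # each example repeated n_copies times
    y_rep = np.repeat(y, n_copies, axis=0)   # labels follow their examples
    noise = rng.normal(0.0, noise_std, size=x_rep.shape)
    return x_rep + noise, y_rep

# Toy data: 4 flattened 32x32x3 "images" with made-up labels.
x_small = np.zeros((4, 32 * 32 * 3))
y_small = np.array([1, 7, 3, 9])
x_noisy, y_noisy = inject_input_noise(x_small, y_small, n_copies=3, noise_std=0.05)
```

Running the same function with a second value of `noise_std` gives the "another level of noise" experiment; whether to also keep the clean originals in the training set is a design choice worth documenting.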

iii Another way of injecting noise is to add it to the weights of the affine transformations between layers. A skeleton function test_noise_inject_at_weight is provided to you. Train the same model from (i) with the smaller dataset, but inject noise into the weights after each of the updates (more specifically, you need to modify the updates routine in the skeleton code). Document your choice of noise, discuss the testing accuracy, and compare the result with those in (i).
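The idea of perturbing the weights after every update can be illustrated with a plain NumPy stand-in for the update rule. This is only a sketch: the real skeleton builds a Theano updates list, and the function name `sgd_step_with_weight_noise`, the Gaussian noise model, and the numbers below are assumptions.

```python
import numpy as np

def sgd_step_with_weight_noise(weights, grads, learning_rate, noise_std, rng):
    """One SGD step per weight matrix, followed by additive Gaussian noise."""
    new_weights = []
    for W, g in zip(weights, grads):
        updated = W - learning_rate * g                               # usual SGD update
        updated = updated + rng.normal(0.0, noise_std, size=W.shape)  # inject noise
        new_weights.append(updated)
    return new_weights

rng = np.random.default_rng(0)
weights = [np.ones((2, 2))]
grads = [np.full((2, 2), 0.5)]
noisy_weights = sgd_step_with_weight_noise(weights, grads, 0.1, 0.01, rng)
```

Setting `noise_std` to zero recovers the ordinary update, which is a convenient sanity check before comparing against the model from (i).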

iv Data augmentation is another way to overcome the limitation of small datasets. It has been a particularly effective method for object recognition. You are asked to synthesize new data to augment the smaller dataset, and then train the model with the synthesized dataset. To do so, create 4 new examples for each of the examples in the dataset by translating the example by 1 pixel along four different directions, and padding the missing part with zeros. If you have other ideas about data augmentation, you could implement them instead of using the one described here. A skeleton function test_data_augmentation is provided to you. Train the same model from (i) with the new dataset. Document your choice of augmentation, discuss the testing accuracy, and compare the result with those in (i).
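The one-pixel translation with zero padding described above might be sketched as follows. The code assumes images are stored as (height, width, channels) arrays, which may differ from the flattened layout returned by the provided loader; `translate_image` and `augment_with_shifts` are hypothetical helper names.

```python
import numpy as np

def translate_image(img, dy, dx):
    """Shift an (H, W, C) image by (dy, dx) pixels; exposed pixels become zero."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def augment_with_shifts(images, labels):
    """Return the originals plus four 1-pixel translations (up, down, left, right)."""
    shifts = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    all_images, all_labels = [images], [labels]
    for dy, dx in shifts:
        all_images.append(np.stack([translate_image(im, dy, dx) for im in images]))
        all_labels.append(labels)
    return np.concatenate(all_images), np.concatenate(all_labels)

# Toy batch of two 4x4 RGB images with made-up labels.
images = np.arange(2 * 4 * 4 * 3, dtype=float).reshape(2, 4, 4, 3) + 1.0
labels = np.array([5, 8])
aug_images, aug_labels = augment_with_shifts(images, labels)
```

The augmented set is five times the size of the original one, so the number of minibatches per epoch changes accordingly.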

v Recent work has shown that one can fool a neural network with adversarial examples [4]. This phenomenon is also discussed in Section 7.13 of the textbook. You are asked to test and reproduce this phenomenon. To do so, take any of the models you trained in the previous questions, and compute the gradient of the cost function with respect to the input (please review Section 7.13 of the textbook). Then, create an adversarial example by first picking an example that is classified correctly and adding to the input an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient. Use the trained model to classify this adversarial example. Can you fool the model? Discuss your results, and plot the original input along with the adversarial example and a bar plot of the class-specific probabilities (output of the neural network) for the original input and the adversarial example.
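To make the construction concrete, here is a small NumPy sketch that uses a softmax (logistic regression) classifier as a stand-in for the trained network, because its input gradient has a closed form. The random weights, the step size `eps`, and all function names are assumptions for illustration; with a Theano model the gradient with respect to the input would come from symbolic differentiation instead.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def nll(x, y, W, b):
    """Negative log-likelihood of class y for input x."""
    return -np.log(softmax(x @ W + b)[y])

def input_gradient(x, y, W, b):
    """Gradient of the negative log-likelihood with respect to the input x."""
    p = softmax(x @ W + b)
    p[y] -= 1.0               # d(cost)/d(logits) = p - onehot(y)
    return W @ p              # chain rule back to the input

def adversarial_example(x, y, W, b, eps):
    """Add eps times the sign of the input gradient to x."""
    return x + eps * np.sign(input_gradient(x, y, W, b))

rng = np.random.default_rng(0)
W, b = rng.normal(size=(20, 3)), np.zeros(3)   # stand-in "trained" parameters
x = rng.normal(size=20)
y = int(np.argmax(x @ W + b))                  # treat the current prediction as correct
x_adv = adversarial_example(x, y, W, b, eps=0.1)
probs_before, probs_after = softmax(x @ W + b), softmax(x_adv @ W + b)
```

The two probability vectors are what the requested bar plot would show; whether the predicted class actually flips depends on `eps` and on the model.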

PROBLEM b (50 points)

In this problem, you will experiment with convolutional neural networks. The CNN model is similar to LeNet from the Deep Learning tutorial, with the exception that it handles images with 3 color channels, whereas LeNet targets grayscale images.

i Implement a CNN with 2 convolution hidden layers for multi-channel inputs. First, go through the skeleton function test_lenet() in hw3.py, and finish the missing part. After finishing the function, experiment with the parameters, in particular, the number of filters in the hidden layers. Document at least three different sets of parameters explicitly, and discuss the accuracy of your test results.
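When experimenting with filter counts, it helps to track how the feature-map size shrinks through the two layers. The sketch below assumes "valid" convolutions followed by non-overlapping 2x2 pooling, as in the LeNet tutorial; the 5x5 filter size and the count of 50 filters in the last layer are hypothetical example values.

```python
def conv_pool_output_size(in_size, filter_size, pool_size):
    """Side length after a 'valid' convolution and non-overlapping pooling."""
    conv_out = in_size - filter_size + 1
    return conv_out // pool_size

size = 32                                   # SVHN images are 32x32
for filter_size in (5, 5):                  # two convolution layers (assumed 5x5 filters)
    size = conv_pool_output_size(size, filter_size, pool_size=2)

n_filters_last = 50                         # hypothetical number of filters in layer 2
flattened_inputs = n_filters_last * size * size   # inputs to the fully connected layer
```

Changing the filter shape changes `size`, so the input dimension of the fully connected layer has to be recomputed for every parameter set.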

ii Implement a multi-stage CNN, as shown in Figure 1. First, go through the skeleton function test_convnet() in hw3.py, and finish the missing part. After finishing the function, experiment with all the parameters, in particular, the number of filters in the hidden layers as well as the shape of the filters. Document at least three different sets of parameters explicitly, and discuss the accuracy of your test results.
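The multi-stage idea, in which first-stage features are downsampled again and concatenated with second-stage features before the classifier, can be sketched with NumPy arrays standing in for the feature maps. The (batch, channels, height, width) layout, the 2x2 max pooling used for the extra downsampling, and the feature-map shapes are assumptions for illustration.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a (N, C, H, W) array."""
    n, c, h, w = x.shape
    x = x[:, :, :h - h % 2, :w - w % 2]            # drop an odd trailing row/column
    return x.reshape(n, c, h // 2, 2, w // 2, 2).max(axis=(3, 5))

def multi_stage_features(stage1, stage2):
    """Downsample stage-1 maps again, flatten both stages, and concatenate."""
    s1 = max_pool_2x2(stage1)
    return np.concatenate(
        [s1.reshape(len(s1), -1), stage2.reshape(len(stage2), -1)], axis=1)

# Hypothetical shapes: 16 maps of 14x14 after stage 1, 32 maps of 5x5 after stage 2.
stage1 = np.random.default_rng(0).normal(size=(4, 16, 14, 14))
stage2 = np.random.default_rng(1).normal(size=(4, 32, 5, 5))
features = multi_stage_features(stage1, stage2)
```

The width of `features` is the input dimension of the 2-layer classifier, so it changes whenever the number or shape of the filters changes.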

iii The multi-stage CNN model you implemented in the previous question has a nonstandard feed-forward structure, but the Theano package is still able to compute the gradient of the cost function with respect to the different parameters via the back-propagation algorithm. Discuss why the back-propagation algorithm can be applied to this model. You might want to review the section about the back-propagation algorithm in the textbook.

Figure 1: Excerpted from [2]. A 2-stage CNN architecture where multi-stage features (MS) are fed into a 2-layer classifier. The first stage features are branched out, downsampled again and then concatenated to second stage features.

2 Architecture

The ConvNet architecture is composed of repeatedly stacked feature stages. Each stage contains a convolution module, followed by a pooling/subsampling module and a normalization module. While traditional pooling modules in ConvNet are either average or max poolings, we use an Lp pooling here. The normalization module is subtractive only as opposed to subtractive and divisive, i.e. the mean value of each neighborhood is subtracted from the output of each stage (but not divided by the standard deviation as it decreases performance with this dataset). Finally, multi-stage features are also used as opposed to single-stage features. This architecture is trained using stochastic gradient descent (SGD) with the Levenberg-Marquardt diagonal approximation to the Hessian [7].

2.1 Lp-Pooling

Figure 2. L2-pooling applied to a 9x9 feature map with a 3x3 Gaussian kernel and 2x2 stride

Lp pooling is a biologically inspired pooling layer modelled on complex cells [13, 5] whose operation can be summarized in equation (1), where G is a Gaussian kernel, I is the input feature map and O is the output feature map. It can be imagined as giving an increased weight to stronger features and suppressing weaker features. Two special cases of Lp pooling are notable. P = 1 corresponds to a simple Gaussian averaging, whereas P = ∞ corresponds to max-pooling (i.e. only the strongest signal is activated). Lp-pooling has been used previously in [6, 16] and a theoretical analysis of this method is described in [1].

$$O = \left( \sum \sum I(i,j)^{P} \times G(i,j) \right)^{1/P} \qquad (1)$$

Figure 2 demonstrates a simple example of L2-pooling.

2.2 Multi-Stage Features

Multi-Stage features (MS) are obtained by branching out outputs of all stages into the classifier (Figure 3). They provide richer representations compared to Single-Stage features (SS) by adding complementary information such as local textures and fine details lost by higher levels. MS features have consistently improved performance in other work [4, 12, 10] and in this work as well (Figure 4). However we observe minimal gains on this dataset compared to other types of objects such as pedestrians and traffic signs (Table 1). The likely explanation for this observation is that gains are correlated to the amount of texture and multi-scale characteristics of the objects of interest.

Figure 3. A 2-stage ConvNet architecture where Multi-Stage features (MS) are fed to a 2-layer classifier. The 1st stage features are branched out, subsampled again and then concatenated to 2nd stage features.

3. Experiments

3.1. Data Preparation

The SVHN classification dataset [9] contains 32x32 images with 3 color channels. The dataset is divided into three subsets: train set, extra set and test set. The extra set is a large set of easy samples and train set is a smaller set of more difficult samples. Since we are given no information about how the sampling of these images was done, we assume a random order to construct our validation set. We compose our validation set with 2/3 from training samples (400 per class) and 1/3 from extra samples (200 per class), yielding a total of 6000 samples. This distribution allows to measure success on easy samples but puts more emphasis on difficult ones. The training and testing sets contain respectively 598388 and 26032 samples. Samples are pre-processed with a local contrast normalization (with a 7x7 kernel) on the Y channel of the YUV space followed by a global contrast normalization over each channel. No sample distortions were used to improve invariance. For some experiments, a padding of 2 pixels with zero value was added to each side of the input image in order to center the first stage’s 5x5 filters onto image borders.

3.2 Architecture Details

The ConvNet has 2 stages of feature extraction and a two-layer non-linear classifier. The first convolution layer produces 16 features with 5x5 convolution filters while the second convolution layer outputs 512 features with 7x7 filters. The output to the classifier also includes inputs from the first layer, which provides lo
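The Lp-pooling operation of equation (1) in the excerpt can be illustrated with a short NumPy sketch. It assumes a Gaussian kernel normalized to sum to one, takes absolute values of the input so that non-integer P is well defined, and uses the 9x9 map, 3x3 kernel, and stride of 2 mentioned in the Figure 2 caption; the kernel width `sigma` and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size=3, sigma=1.0):
    """Square Gaussian kernel normalized so that its entries sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def lp_pool(feature_map, kernel, p=2.0, stride=2):
    """O = (sum_ij |I(i,j)|^p * G(i,j))^(1/p) over each strided window."""
    k = kernel.shape[0]
    h, w = feature_map.shape
    out_h, out_w = (h - k) // stride + 1, (w - k) // stride + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = feature_map[r * stride:r * stride + k, c * stride:c * stride + k]
            out[r, c] = np.sum(np.abs(patch) ** p * kernel) ** (1.0 / p)
    return out

feature_map = np.random.default_rng(0).random((9, 9))
pooled = lp_pool(feature_map, gaussian_kernel(3, 1.0), p=2.0, stride=2)
```

With `p=1` the function reduces to a Gaussian-weighted average, and larger `p` moves the output toward the maximum of each window, matching the two special cases noted in the excerpt.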

iv State-of-the-art neural networks for object recognition usually implement a CNN in cascade with an MLP [3]. Implement a network with two convolution layers in cascade with an MLP with 2 hidden layers. Train the model, and document the testing accuracy. How does this model perform compared to your implementation of the MLP with 4 hidden layers in Homework 2?

BONUS PROBLEM (25 points)

i A nice advantage of CNNs that separates them from other machine learning models is that they are capable of learning features all the way from pixels to the classifier, whereas other methods usually require multiple hand-crafted features. You are asked to compare the performance of a CNN with hand-picked features versus one with learned features. Specifically, use a CNN from the first question with 3 filter sets at the input layer (each set has 3 filters, one for each color channel), and train the whole network. Then, use the same CNN model, but replace the 3 filter sets of the input layer with your own design (e.g., Gaussian filters). For each filter set, you can use the same filter for each color channel. Train the model without updating the designed filters. Document and compare the testing accuracy of both models, and plot the filters learned via training and the filters you designed.
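One way to build the hand-designed input filters is sketched below: three Gaussian filters of different widths, each repeated across the three color channels. The 5x5 size, the widths in `sigmas`, and the (sets, channels, height, width) layout are assumptions; in the actual model these values would be loaded into the first layer and left out of the list of parameters being updated.

```python
import numpy as np

def gaussian_filter(size, sigma):
    """Square Gaussian filter normalized to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def fixed_filter_bank(sigmas=(0.5, 1.0, 2.0), n_channels=3, size=5):
    """Stack one Gaussian per filter set, copied across all color channels."""
    sets = [np.repeat(gaussian_filter(size, s)[None], n_channels, axis=0)
            for s in sigmas]
    return np.stack(sets)   # shape: (n_sets, n_channels, size, size)

bank = fixed_filter_bank()
```

Plotting `bank[k, 0]` for each set next to the corresponding learned filters gives the comparison the question asks for.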

ii Another advantage of a CNN is that it greatly reduces the number of parameters in the network. Having fewer parameters implies that a CNN can usually be trained in a shorter time than an MLP with the same number of neurons and layers. You are asked to compare a CNN and an MLP. In particular, implement a CNN with two convolution hidden layers, and a fully-connected MLP with three hidden layers (note that in a CNN, the convolution hidden layers are followed by a fully connected perceptron and an output layer). For each layer of the MLP, use the same number of neurons (activation functions) as the corresponding layer in the CNN. You can reuse the CNN from (i). Document the number of parameters of both models (total number of entries in all filters for the CNN, and total number of entries in all weight matrices for the MLP). Discuss the run-time for both models, and the testing accuracy.
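The parameter comparison can be done with simple arithmetic before any training. The sketch below uses a hypothetical configuration (20 and 50 filters of size 5x5, a 500-unit fully connected layer, 10 classes), counts the neurons of a convolution layer after 2x2 pooling, and ignores biases; all of these are assumptions made for illustration.

```python
def conv_filter_entries(n_in_maps, n_out_maps, filter_h, filter_w):
    """Number of weights in one convolution layer's filter bank."""
    return n_out_maps * n_in_maps * filter_h * filter_w

def dense_weight_entries(n_in, n_out):
    """Number of weights in one fully connected layer."""
    return n_in * n_out

# CNN: 32x32x3 input -> conv 5x5 (20 maps) + pool -> conv 5x5 (50 maps) + pool
#      -> 500 hidden units -> 10 outputs. Map sizes: 32 -> 28 -> 14 -> 10 -> 5.
cnn_total = (conv_filter_entries(3, 20, 5, 5)
             + conv_filter_entries(20, 50, 5, 5)
             + dense_weight_entries(50 * 5 * 5, 500)
             + dense_weight_entries(500, 10))

# MLP with the same number of neurons per layer as the CNN above.
layer_sizes = [32 * 32 * 3, 20 * 14 * 14, 50 * 5 * 5, 500, 10]
mlp_total = sum(dense_weight_entries(a, b)
                for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

ratio = mlp_total / cnn_total
```

Under these assumptions the MLP has far more weights than the CNN, mostly because of its first fully connected layer; the measured run-times still need to be reported separately, since parameter count alone does not determine training speed.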

NEED HELP:

If you have any questions, you are advised to use the Piazza forum, which is accessible through the Canvas system.

GOOD LUCK!

References

[1] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng, “Reading Digits in Natural Images with Unsupervised Feature Learning,” NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[2] Pierre Sermanet, Soumith Chintala, Yann LeCun, “Convolutional Neural Networks Applied to House Numbers Digit Classification,” ICPR 2012.

[3] Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, Bhuvana Ramabhadran, “Deep Convolutional Neural Networks for LVCSR,” ICASSP 2013.

[4] Anh Nguyen, Jason Yosinski, Jeff Clune, “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images,” IEEE CVPR 2015.
