# Assignment #1 Solution

1 Softmax (10 points)

(a) (5 points) Prove that softmax is invariant to constant o sets in the input, that is, for any input vector x and any constant c,

softmax(x) = softmax(x + c)

where x + c means adding the constant c to every dimension of x. Remember that

exi

sof tmax(x)i = (1)

Pj exj

Note: In practice, we make use of this property and choose c = maxi xi when computing softmax probabilities for numerical stability (i.e., subtracting its maximum element from all elements of x).

(b) (5 points) Given an input matrix of N rows and D columns, compute the softmax prediction for each row using the optimization in part (a). Write your implementation in q1 softmax.py. You may test by executing python q1 softmax.py.

Note: The provided tests are not exhaustive. Later parts of the assignment will reference this code so it is important to have a correct implementation. Your implementation should also be e cient and vectorized whenever possible (i.e., use numpy matrix operations rather than for loops). A non-vectorized implementation will not receive full credit!

2 Neural Network Basics (30 points)

(a) (3 points) Derive the gradients of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., in some expression where only (x), but not x, is present). Assume that the input x is a scalar for this question. Recall, the sigmoid function is

(x) =

1

(2)

1 + e x

(b) (3 points) Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e., nd the gradients with respect to the softmax input vector , when the prediction is made by y^ = softmax( ). Remember the cross entropy function is

X

CE(y; y^) = yi log(^yi) (3)

i

where y is the one-hot label vector, and y^ is the predicted probability vector for all classes. (Hint: you might want to consider the fact many elements of y are zeros, and assume that only the k-th dimension of y is one.)

(c) (6 points) Derive the gradients with respect to the inputs x to an one-hidden-layer neural network (that

is, nd @@Jx where J = CE(y; y^) is the cost function for the neural network). The neural network employs sigmoid activation function for the hidden layer, and softmax for the output layer. Assume the one-hot

label vector is y, and cross entropy cost is used. (Feel free to use 0(x) as the shorthand for sigmoid gradient, and feel free to de ne any variables whenever you see t.)

x y^

Recall that the forward propagation is as follows

h = sigmoid(xW1 + b1) y^ = softmax(hW2 + b2)

Note that here we're assuming that the input vector (thus the hidden variables and output probabilities) is a row vector to be consistent with the programming assignment. When we apply the sigmoid function to a vector, we are applying it to each of the elements of that vector. Wi and bi (i = 1; 2) are the weights and biases, respectively, of the two layers.

(d) (2 points) How many parameters are there in this neural network, assuming the input is Dx-dimensional, the output is Dy-dimensional, and there are H hidden units?

(e) (4 points) Fill in the implementation for the sigmoid activation function and its gradient in q2 sigmoid.py. Test your implementation using python q2 sigmoid.py. Again, thoroughly test your code as the pro-vided tests may not be exhaustive.

(f) (4 points) To make debugging easier, we will now implement a gradient checker. Fill in the implementa-tion for gradcheck naive in q2 gradcheck.py. Test your code using python q2 gradcheck.py.

(g) (8 points) Now, implement the forward and backward passes for a neural network with one sigmoid hidden layer. Fill in your implementation in q2 neural.py. Sanity check your implementation with python q2 neural.py.

3 word2vec (40 points + 2 bonus)

(a) (3 points) Assume you are given a predicted word vector vc corresponding to the center word c for skipgram, and word prediction is made with the softmax function found in word2vec models

exp(uvc)

y^o = p(o j c) = PW o (4)

exp(uv )

w=1 w c

where w denotes the w-th word and uw (w = 1; : : : ; W ) are the \output" word vectors for all words in the vocabulary. Assume cross entropy cost is applied to this prediction and word o is the expected word (the o-th element of the one-hot label vector is one), derive the gradients with respect to vc.

Hint: It will be helpful to use notation from question 2. For instance, letting y^ be the vector of softmax predictions for every word, y as the expected word vector, and the loss function

Jsoftmax CE(o; vc; U) = CE(y; y^)

(5)

where U = [u1; u2; ; uW ] is the matrix of all the output vectors. Make sure you state the orientation of your vectors and matrices.

(b) (3 points) As in the previous part, derive gradients for the \output" word vectors uw's (including uo).

(c) (6 points) Repeat part (a) and (b) assuming we are using the negative sampling loss for the predicted vector vc, and the expected output word is o. Assume that K negative samples (words) are drawn, and

they are 1; ; K, respectively for simplicity of notation (o 2= f1; : : : ; Kg). Again, for a given word, o, denote its output vector as uo. The negative sampling loss function in this case is

K

log( (uv ))

kX

uv

J

neg sample

(o; v

; U) =

log( (

))

(6)

c

o c

k c

=1

where ( ) is the sigmoid function.

After you've done this, describe with one sentence why this cost function is much more e cient to compute than the softmax-CE loss (you could provide a speed-up ratio, i.e., the runtime of the softmax-CE loss divided by the runtime of the negative sampling loss).

Note: the cost function here is the negative of what Mikolov et al had in their original paper, because we are doing a minimization instead of maximization in our code.

(d) (8 points) Derive gradients for all of the word vectors for skip-gram and CBOW given the previous parts and given a set of context words [wordc m; : : : ; wordc 1; wordc; wordc+1; : : : ; wordc+m], where m is the context size. Denote the \input" and \output" word vectors for wordk as vk and uk respectively.

Hint: feel free to use F (o; vc) (where o is the expected word) as a placeholder for the Jsoftmax CE(o; vc; :::)

or Jneg sample(o; vc; :::) cost functions in this part | you'll see that this is a useful abstraction for the

coding part. That is, your solution may contain terms of the form @F (o;vc) .

@:::

Recall that for skip-gram, the cost for a context centered around c is

X

Jskip-gram(wordc m:::c+m) =

F (wc+j; vc)

(7)

m

j m;j6=0

where wc+j refers to the word at the j-th index from the center.

CBOW is slightly di erent. Instead of using vc as the predicted vector, we use v^ de ned below. For (a simpler variant of) CBOW, we sum up the input word vectors in the context

X

v^ =

vc+j

(8)

m j m;j6=0

then the CBOW cost is

JCBOW(wordc m:::c+m) = F (wc; v^)

(9)

Note: To be consistent with the v^ notation such as for the code portion, for skip-gram v^ = vc.

CS 224N: Assignment #1

(e) (12 points) In this part you will implement the word2vec models and train your own word vectors with stochastic gradient descent (SGD). First, write a helper function to normalize rows of a matrix in q3 word2vec.py. In the same le, ll in the implementation for the softmax and negative sampling cost and gradient functions. Then, ll in the implementation of the cost and gradient functions for the skip-gram model. When you are done, test your implementation by running python q3 word2vec.py. Note: If you choose not to implement CBOW (part h), simply remove the NotImplementedError so that your tests will complete.

(f) (4 points) Complete the implementation for your SGD optimizer in q3 sgd.py. Test your implemen-tation by running python q3 sgd.py.

(g) (4 points) Show time! Now we are going to load some real data and train word vectors with everything you just implemented! We are going to use the Stanford Sentiment Treebank (SST) dataset to train word vectors, and later apply them to a simple sentiment analysis task. You will need to fetch the datasets rst. To do this, run sh get datasets.sh. There is no additional code to write for this part; just run python q3 run.py.

Note: The training process may take a long time depending on the e ciency of your implementation (an e cient implementation takes approximately an hour). Plan accordingly!

When the script nishes, a visualization for your word vectors will appear. It will also be saved as q3 word vectors.png in your project directory. Include the plot in your homework write up. Brie y explain in at most three sentences what you see in the plot.

(h) (Extra credit: 2 points) Implement the CBOW model in q3 word2vec.py. Note: This part is optional but the gradient derivations for CBOW in part (d) are not!.

4 Sentiment Analysis (20 points)

Now, with the word vectors you trained, we are going to perform a simple sentiment analysis. For each sentence in the Stanford Sentiment Treebank dataset, we are going to use the average of all the word vectors in that sentence as its feature, and try to predict the sentiment level of the said sentence. The sentiment level of the phrases are represented as real values in the original dataset, here we'll just use ve classes:

\very negative ( )", \negative ( )", \neutral", \positive (+)", \very positive (++)"

which are represented by 0 to 4 in the code, respectively. For this part, you will learn to train a softmax classi er, and perform train/dev validation to improve generalization.

(a) (2 points) Implement a sentence featurizer. A simple way of representing a sentence is taking the average of the vectors of the words in the sentence. Fill in the implementation in q4 sentiment.py.

(b) (1 points) Explain in at most two sentences why we want to introduce regularization when doing classi-cation (in fact, most machine learning tasks).

(c) (2 points) Fill in the hyperparameter selection code in q4 sentiment.py to search for the \optimal" regularization parameter. Attach your code for chooseBestModel to your written write-up. You should be able to attain at least 36.5% accuracy on the dev and test sets using the pretrained vectors in part (d).

(d) (3 points) Run python q4 sentiment.py --yourvectors to train a model using your word vectors from q3. Now, run python q4 sentiment.py --pretrained to train a model using pretrained GloVe vectors (on Wikipedia data). Compare and report the best train, dev, and test accuracies. Why do you think the pretrained vectors did better? Be speci c and justify with 3 distinct reasons.

Page 4 of 5

CS 224N: Assignment #1

(e) (4 points) Plot the classi cation accuracy on the train and dev set with respect to the regularization

value for the pretrained GloVe vectors, using a logarithmic scale on the x-axis. This should have been done automatically. Include q4 reg acc.png in your homework write up. Brie y explain in at most three sentences what you see in the plot.

(f) (4 points) We will now analyze errors that the model makes (with pretrained GloVe vectors). When you ran python q4 sentiment.py --pretrained, two les should have been generated. Take a look at q4 dev conf.png and include it in your homework writeup. Interpret the confusion matrix in at most three sentences.

(g) (4 points) Next, take a look at q4 dev pred.txt. Choose 3 examples where your classi er made errors and brie y explain the error and what features the classi er would need to classify the example correctly (1 sentence per example). Try to pick examples with di erent reasons.

(a) (5 points) Prove that softmax is invariant to constant o sets in the input, that is, for any input vector x and any constant c,

softmax(x) = softmax(x + c)

where x + c means adding the constant c to every dimension of x. Remember that

exi

sof tmax(x)i = (1)

Pj exj

Note: In practice, we make use of this property and choose c = maxi xi when computing softmax probabilities for numerical stability (i.e., subtracting its maximum element from all elements of x).

(b) (5 points) Given an input matrix of N rows and D columns, compute the softmax prediction for each row using the optimization in part (a). Write your implementation in q1 softmax.py. You may test by executing python q1 softmax.py.

Note: The provided tests are not exhaustive. Later parts of the assignment will reference this code so it is important to have a correct implementation. Your implementation should also be e cient and vectorized whenever possible (i.e., use numpy matrix operations rather than for loops). A non-vectorized implementation will not receive full credit!

2 Neural Network Basics (30 points)

(a) (3 points) Derive the gradients of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., in some expression where only (x), but not x, is present). Assume that the input x is a scalar for this question. Recall, the sigmoid function is

(x) =

1

(2)

1 + e x

(b) (3 points) Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e., nd the gradients with respect to the softmax input vector , when the prediction is made by y^ = softmax( ). Remember the cross entropy function is

X

CE(y; y^) = yi log(^yi) (3)

i

where y is the one-hot label vector, and y^ is the predicted probability vector for all classes. (Hint: you might want to consider the fact many elements of y are zeros, and assume that only the k-th dimension of y is one.)

(c) (6 points) Derive the gradients with respect to the inputs x to an one-hidden-layer neural network (that

is, nd @@Jx where J = CE(y; y^) is the cost function for the neural network). The neural network employs sigmoid activation function for the hidden layer, and softmax for the output layer. Assume the one-hot

label vector is y, and cross entropy cost is used. (Feel free to use 0(x) as the shorthand for sigmoid gradient, and feel free to de ne any variables whenever you see t.)

x y^

Recall that the forward propagation is as follows

h = sigmoid(xW1 + b1) y^ = softmax(hW2 + b2)

Note that here we're assuming that the input vector (thus the hidden variables and output probabilities) is a row vector to be consistent with the programming assignment. When we apply the sigmoid function to a vector, we are applying it to each of the elements of that vector. Wi and bi (i = 1; 2) are the weights and biases, respectively, of the two layers.

(d) (2 points) How many parameters are there in this neural network, assuming the input is Dx-dimensional, the output is Dy-dimensional, and there are H hidden units?

(e) (4 points) Fill in the implementation for the sigmoid activation function and its gradient in q2 sigmoid.py. Test your implementation using python q2 sigmoid.py. Again, thoroughly test your code as the pro-vided tests may not be exhaustive.

(f) (4 points) To make debugging easier, we will now implement a gradient checker. Fill in the implementa-tion for gradcheck naive in q2 gradcheck.py. Test your code using python q2 gradcheck.py.

(g) (8 points) Now, implement the forward and backward passes for a neural network with one sigmoid hidden layer. Fill in your implementation in q2 neural.py. Sanity check your implementation with python q2 neural.py.

3 word2vec (40 points + 2 bonus)

(a) (3 points) Assume you are given a predicted word vector vc corresponding to the center word c for skipgram, and word prediction is made with the softmax function found in word2vec models

exp(uvc)

y^o = p(o j c) = PW o (4)

exp(uv )

w=1 w c

where w denotes the w-th word and uw (w = 1; : : : ; W ) are the \output" word vectors for all words in the vocabulary. Assume cross entropy cost is applied to this prediction and word o is the expected word (the o-th element of the one-hot label vector is one), derive the gradients with respect to vc.

Hint: It will be helpful to use notation from question 2. For instance, letting y^ be the vector of softmax predictions for every word, y as the expected word vector, and the loss function

Jsoftmax CE(o; vc; U) = CE(y; y^)

(5)

where U = [u1; u2; ; uW ] is the matrix of all the output vectors. Make sure you state the orientation of your vectors and matrices.

(b) (3 points) As in the previous part, derive gradients for the \output" word vectors uw's (including uo).

(c) (6 points) Repeat part (a) and (b) assuming we are using the negative sampling loss for the predicted vector vc, and the expected output word is o. Assume that K negative samples (words) are drawn, and

they are 1; ; K, respectively for simplicity of notation (o 2= f1; : : : ; Kg). Again, for a given word, o, denote its output vector as uo. The negative sampling loss function in this case is

K

log( (uv ))

kX

uv

J

neg sample

(o; v

; U) =

log( (

))

(6)

c

o c

k c

=1

where ( ) is the sigmoid function.

After you've done this, describe with one sentence why this cost function is much more e cient to compute than the softmax-CE loss (you could provide a speed-up ratio, i.e., the runtime of the softmax-CE loss divided by the runtime of the negative sampling loss).

Note: the cost function here is the negative of what Mikolov et al had in their original paper, because we are doing a minimization instead of maximization in our code.

(d) (8 points) Derive gradients for all of the word vectors for skip-gram and CBOW given the previous parts and given a set of context words [wordc m; : : : ; wordc 1; wordc; wordc+1; : : : ; wordc+m], where m is the context size. Denote the \input" and \output" word vectors for wordk as vk and uk respectively.

Hint: feel free to use F (o; vc) (where o is the expected word) as a placeholder for the Jsoftmax CE(o; vc; :::)

or Jneg sample(o; vc; :::) cost functions in this part | you'll see that this is a useful abstraction for the

coding part. That is, your solution may contain terms of the form @F (o;vc) .

@:::

Recall that for skip-gram, the cost for a context centered around c is

X

Jskip-gram(wordc m:::c+m) =

F (wc+j; vc)

(7)

m

j m;j6=0

where wc+j refers to the word at the j-th index from the center.

CBOW is slightly di erent. Instead of using vc as the predicted vector, we use v^ de ned below. For (a simpler variant of) CBOW, we sum up the input word vectors in the context

X

v^ =

vc+j

(8)

m j m;j6=0

then the CBOW cost is

JCBOW(wordc m:::c+m) = F (wc; v^)

(9)

Note: To be consistent with the v^ notation such as for the code portion, for skip-gram v^ = vc.

CS 224N: Assignment #1

(e) (12 points) In this part you will implement the word2vec models and train your own word vectors with stochastic gradient descent (SGD). First, write a helper function to normalize rows of a matrix in q3 word2vec.py. In the same le, ll in the implementation for the softmax and negative sampling cost and gradient functions. Then, ll in the implementation of the cost and gradient functions for the skip-gram model. When you are done, test your implementation by running python q3 word2vec.py. Note: If you choose not to implement CBOW (part h), simply remove the NotImplementedError so that your tests will complete.

(f) (4 points) Complete the implementation for your SGD optimizer in q3 sgd.py. Test your implemen-tation by running python q3 sgd.py.

(g) (4 points) Show time! Now we are going to load some real data and train word vectors with everything you just implemented! We are going to use the Stanford Sentiment Treebank (SST) dataset to train word vectors, and later apply them to a simple sentiment analysis task. You will need to fetch the datasets rst. To do this, run sh get datasets.sh. There is no additional code to write for this part; just run python q3 run.py.

Note: The training process may take a long time depending on the e ciency of your implementation (an e cient implementation takes approximately an hour). Plan accordingly!

When the script nishes, a visualization for your word vectors will appear. It will also be saved as q3 word vectors.png in your project directory. Include the plot in your homework write up. Brie y explain in at most three sentences what you see in the plot.

(h) (Extra credit: 2 points) Implement the CBOW model in q3 word2vec.py. Note: This part is optional but the gradient derivations for CBOW in part (d) are not!.

4 Sentiment Analysis (20 points)

Now, with the word vectors you trained, we are going to perform a simple sentiment analysis. For each sentence in the Stanford Sentiment Treebank dataset, we are going to use the average of all the word vectors in that sentence as its feature, and try to predict the sentiment level of the said sentence. The sentiment level of the phrases are represented as real values in the original dataset, here we'll just use ve classes:

\very negative ( )", \negative ( )", \neutral", \positive (+)", \very positive (++)"

which are represented by 0 to 4 in the code, respectively. For this part, you will learn to train a softmax classi er, and perform train/dev validation to improve generalization.

(a) (2 points) Implement a sentence featurizer. A simple way of representing a sentence is taking the average of the vectors of the words in the sentence. Fill in the implementation in q4 sentiment.py.

(b) (1 points) Explain in at most two sentences why we want to introduce regularization when doing classi-cation (in fact, most machine learning tasks).

(c) (2 points) Fill in the hyperparameter selection code in q4 sentiment.py to search for the \optimal" regularization parameter. Attach your code for chooseBestModel to your written write-up. You should be able to attain at least 36.5% accuracy on the dev and test sets using the pretrained vectors in part (d).

(d) (3 points) Run python q4 sentiment.py --yourvectors to train a model using your word vectors from q3. Now, run python q4 sentiment.py --pretrained to train a model using pretrained GloVe vectors (on Wikipedia data). Compare and report the best train, dev, and test accuracies. Why do you think the pretrained vectors did better? Be speci c and justify with 3 distinct reasons.

Page 4 of 5

CS 224N: Assignment #1

(e) (4 points) Plot the classi cation accuracy on the train and dev set with respect to the regularization

value for the pretrained GloVe vectors, using a logarithmic scale on the x-axis. This should have been done automatically. Include q4 reg acc.png in your homework write up. Brie y explain in at most three sentences what you see in the plot.

(f) (4 points) We will now analyze errors that the model makes (with pretrained GloVe vectors). When you ran python q4 sentiment.py --pretrained, two les should have been generated. Take a look at q4 dev conf.png and include it in your homework writeup. Interpret the confusion matrix in at most three sentences.

(g) (4 points) Next, take a look at q4 dev pred.txt. Choose 3 examples where your classi er made errors and brie y explain the error and what features the classi er would need to classify the example correctly (1 sentence per example). Try to pick examples with di erent reasons.

You'll get a 297.9KB .ZIP file.