# Assignment 2 Solution

Note: In this assignment, the inputs to neural network layers will be row vectors because this is standard practice for TensorFlow (some built-in TensorFlow functions assume the inputs are row vectors). This means the weight matrix of a hidden layer will right-multiply instead of left-multiply its input (i.e., xW + b instead of W x + b).

1 Tensor ow Softmax (25 points)

In this question, we will implement a linear classi er with loss function

J(W ) = CE(y; softmax(xW ))

Where x is a row vector of features and W is the weight matrix for the model. We will use TensorFlow's automatic di erentiation capability to t this model to provided data.

(a) (5 points, coding) Implement the softmax function using TensorFlow in q1 softmax.py. Remember that

exi

softmax(x)i = Pj exj

Note that you may not use tf.nn.softmax or related built-in functions. You can run basic (nonex-haustive tests) by running python q1 softmax.py.

(b) (5 points, coding) Implement the cross-entropy loss using TensorFlow in q1 softmax.py. Remember that

Nc

X

CE(y; y^) = yi log(^yi)

i=1

where y 2 RNc is a one-hot label vector and Nc is the number of classes. This loss is summed over all examples (rows) of a minibatch. Note that you may not use TensorFlow's built-in cross-entropy functions for this question. You can run basic (non-exhaustive tests) by running python q1 softmax.py.

(c) (5 points, coding/written) Carefully study the Model class in model.py. Brie y explain the purpose of placeholder variables and feed dictionaries in TensorFlow computations. Fill in the implementations for add placeholders and create feed dict in q1 classifier.py.

1

CS 224n: Assignment #2

Hint: Note that con guration variables are stored in the Con g class. You will need to use these con guration variables in the code.

(d) (5 points, coding) Implement the transformation for a softmax classi er in the function add prediction op in q1 classifier.py. Add cross-entropy loss in the function add loss op in the same le. Use the implementations from the earlier parts of the problem, not TensorFlow built-ins.

(e) (5 points, coding/written) Fill in the implementation for add training op in q1 classifier.py. Explain how TensorFlow's automatic di erentiation removes the need for us to de ne gradients explicitly. Verify that your model is able to t to synthetic data by running python q1 classifier.py and making sure that the tests pass.

Hint: Make sure to use the learning rate speci ed in Config.

2 Neural Transition-Based Dependency Parsing (50 points)

In this section, you'll be implementing a neural-network based dependency parser. A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between \head" words and words which modify those heads. Your implementation will be a transition-based parser, which incrementally builds up a parse one step at a time. At every step it maintains a partial parse, which is represented as follows:

A stack of words that are currently being processed. A bu er of words yet to be processed.

A list of dependencies predicted by the parser.

Initially, the stack only contains ROOT, the dependencies lists is empty, and the bu er contains all words of the sentence in order. At each step, the parse applies a transition to the partial parse until its bu er is empty and the stack is of size 1. The following transitions can be applied:

SHIFT: removes the rst word from the bu er and pushes it onto the stack.

LEFT-ARC: marks the second (second most recently added) item on the stack as a dependent of the rst item and removes the second item from the stack.

RIGHT-ARC: marks the rst (most recently added) item on the stack as a dependent of the second item and removes the rst item from the stack.

Your parser will decide among transitions at each state using a neural network classi er. First, you will implement the partial parse representation and transition functions.

(a) (6 points, written) Go through the sequence of transitions needed for parsing the sentence \I parsed this sentence correctly". The dependency tree for the sentence is shown below. At each step, give the con guration of the stack and bu er, as well as what transition was applied this step and what new dependency was added (if any). The rst three steps are provided below as an example.

ROOT

I

parsed

this

sentence

correctly

stack

bu er

new dependency

transition

[ROOT]

[I, parsed, this, sentence, correctly]

Initial Con guration

[ROOT, I]

[parsed, this, sentence, correctly]

SHIFT

[ROOT, I, parsed]

[this, sentence, correctly]

parsed!I

SHIFT

[ROOT, parsed]

[this, sentence, correctly]

LEFT-ARC

Page 2 of 8

CS 224n: Assignment #2

(b) (2 points, written) A sentence containing n words will be parsed in how many steps (in terms of n)? Brie y explain why.

(c) (6 points, coding) Implement the init and parse step functions in the PartialParse class in q2 parser transitions.py. This implements the transition mechanics your parser will use. You can run basic (not-exhaustive) tests by running python q2 parser transitions.py.

(d) (6 points, coding) Our network will predict which transition should be applied next to a partial parse. We could use it to parse a single sentence by applying predicted transitions until the parse is complete. However, neural networks run much more e ciently when making predictions about batches of data at a time (i.e., predicting the next transition for a many di erent partial parses simultaneously). We can parse sentences in minibatches with the following algorithm.

Algorithm 1 Minibatch Dependency Parsing

Input: sentences, a list of sentences to be parsed and model, our model that makes parse decisions

Initialize partial parses as a list of partial parses, one for each sentence in sentences Initialize unfinished parses as a shallow copy of partial parses while unfinished parses is not empty do

Take the rst batch size parses in unfinished parses as a minibatch

Use the model to predict the next transition for each partial parse in the minibatch

Perform a parse step on each partial parse in the minibatch with its predicted transition

Remove the completed parses from unfinished parses

end while

Return: The dependencies for each (now completed) parse in partial parses.

Implement this algorithm in the minibatch parse function in q2 parser transitions.py. You can run basic (not-exhaustive) tests by running python q2 parser transitions.py.

Note: You will need minibatch parse to be correctly implemented to evaluate the model you will build in part (h). However, you do not need it to train the model, so you should be able to complete most of part (h) even if minibatch parse is not implemented yet.

We are now going to train a neural network to predict, given the state of the stack, bu er, and dependencies, which transition should be applied next. First, the model extracts a feature vector representing the current state. We will be using the feature set presented in the original neural dependency parsing paper: A Fast and Accurate Dependency Parser using Neural Networks1. The function extracting these features has been implemented for you in parser utils. This feature vector consists of a list of tokens (e.g., the last word in the stack, rst word in the bu er, dependent of the second-to-last word in the stack if there is one, etc.). They can be represented as a list of integers

[w1; w2; :::; wm]

where m is the number of features and each 0 wi < jV j is the index of a token in the vocabulary (jV j is the vocabulary size). First our network looks up an embedding for each word and concatenates them into a single input vector:

x = [Lw0 ; Lw1 ; :::; Lwm ] 2 Rdm

1Chen and Manning, 2014, http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf

Page 3 of 8

CS 224n: Assignment #2

where L 2 RjV j d is an embedding matrix with each row Li as the vector for a particular word i. We then compute our prediction as:

h = ReLU(xW + b1)

y^ = softmax(hU + b2)

(recall that ReLU(z) = max(z; 0)). We evaluate using cross-entropy loss:

Nc

X

J( ) = CE(y; y^) = yi log y^i

i=1

To compute the loss for the training set, we average this J( ) across all training examples.

(e) (4 points, coding) In order to avoid neurons becoming too correlated and ending up in poor local

minimina, it is often helpful to randomly initialize parameters. One of the most frequent initializations

used is called Xavier initialization2.

Given a matrix A of dimension m n, Xavier initialization selects values Aij uniformly from [ ; ], where

p

= p 6 m + n

Implement the initialization in xavier weight init in q2 initialization.py. You can run basic (nonexhaustive tests) by running python q2 initialization.py. This function will be used to initialize W and U.

(f) (2 points, written) We will regularize our network by applying Dropout3. During training this randomly sets units in the hidden layer h to zero with probability pdrop and then multiplies h by a constant (dropping di erent units each minibatch). We can write this as

hdrop = d h

where d 2 f0; 1gDh (Dh is the size of h) is a mask vector where each entry is 0 with probability pdrop and 1 with probability (1 pdrop). is chosen such that the value of hdrop in expectation equals h:

Epdrop [hdrop]i = hi

for all 0 < i < Dh. What must equal in terms of pdrop? Brie y justify your answer.

(g) (4 points, written) We will train our model using the Adam4 optimizer. Recall that standard SGD uses the update rule

r Jminibatch( )

where is a vector containing all of the model parameters, J is the loss function, r Jminibatch( ) is the gradient of the loss function with respect to the parameters on a minibatch of data, and is the learning rate. Adam uses a more sophisticated update rule with two additional steps5.

2 This is also referred to as Glorot initialization and was initially described in http://jmlr.org/proceedings/papers/ v9/glorot10a/glorot10a.pdf

3Srivastava et al., 2014, https://www.cs.toronto.edu/ hinton/absps/JMLRdropout.pdf

4Kingma and Ma, 2015, https://arxiv.org/pdf/1412.6980.pdf

5The actual Adam update uses a few additional tricks that are less important, but we won't worry about them for this problem.

Page 4 of 8

CS 224n: Assignment #2

(i) First, Adam uses a trick called momentum by keeping track of m, a rolling average of the gradients:

m 1m + (1 1)r Jminibatch( )

m

where 1 is a hyperparameter between 0 and 1 (often set to 0.9). Brie y explain (you don't need to prove mathematically, just give an intuition) how using m stops the updates from varying as much. Why might this help with learning?

(ii) Adam also uses adaptive learning rates by keeping track of v, a rolling average of the magnitudes of the gradients:

m 1m + (1 1)r Jminibatch( )

v 2v + (1 p2)(r Jminibatch( ) r Jminibatch( ))

m= v

where and = denote elementwise multiplication and division (so z z is elementwise squaring)

and 2 is a hyperparameter between 0 and 1 (often set to 0.99). Since Adam divides the update by p

v, which of the model parameters will get larger updates? Why might this help with learning?

(h) (20 points, coding/written) In q2 parser model.py implement the neural network classi er govern-ing the dependency parser by lling in the appropriate sections. We will train and evaluate our model on the Penn Treebank (annotated with Universal Dependencies). Run python q2 parser model.py to train your model and compute predictions on the test data (make sure to turn o debug settings when doing nal evaluation).

Hints:

When debugging, pass the keyword argument debug=True to the main method (it is set to true by default). This will cause the code to run over a small subset of the data, so the training the model won't take as long.

This code should run within 1 hour on a CPU.

When running with debug=False, you should be able to get a loss smaller than 0.07 on the train set (by the end of the last epoch) and an Unlabeled Attachment Score larger than 88 on the dev set (with the best-performing model out of all the epochs). For comparison, the model in the original neural dependency parsing paper gets 92.5. If you want, you can tweak the hyperparameters for your model (hidden layer size, hyperparameters for Adam, number of epochs, etc.) to improve the performance (but you are not required to do so).

Deliverables:

Working implementation of the neural dependency parser in q2 parser model.py. (We'll look at, and possibly run this code for grading).

Report the best UAS your model achieves on the dev set and the UAS it achieves on the test set. List of predicted labels for the test set in the le q2 test.predicted.

(i) Bonus (1 point). Add an extension to your model (e.g., l2 regularization, an additional hidden layer) and report the change in UAS on the dev set. Brie y explain what your extension is and why it helps (or hurts!) the model. Some extensions may require tweaking the hyperparameters in Config to make them e ective.

Page 5 of 8

CS 224n: Assignment #2

3 Recurrent Neural Networks: Language Modeling (25 points)

In this section, you'll compute the gradients of a recurrent neural network (RNN) for language modeling.

Language modeling is a central task in NLP, and language models can be found at the heart of speech recognition, machine translation, and many other systems. Given a sequence of words (represented as one-hot row vectors) x(1); x(2); : : : ; x(t), a language model predicts the next word x(t+1) by modeling:

P (x(t+1) = vj j x(t); : : : ; x(1))

where vj is a word in the vocabulary.

Your job is to compute the gradients of a recurrent neural network language model, which uses feedback information in the hidden layer to model the \history" x(t); x(t 1); : : : ; x(1). Formally, the model6 is, for t = 1; : : : ; n 1:

e(t) = x(t)L

h(t) = sigmoid h(t 1)H + e(t)I + b1

y^(t) = softmax h(t)U + b2

(t+1)

= vj j x

(t)

; : : : ; x

(1)

(t)

P (x

)) = y^j

where h(0) = h0 2 RDh is some initialization vector for the hidden layer and x(t)L is the product of L with the one-hot row vector x(t) representing the current word. The parameters are:

L 2 RjV j d H 2 RDh Dh I 2 Rd Dh b1 2 RDh U 2 RDh jV j b2 2 RjV j

(1)

where L is the embedding matrix, I the input word representation matrix, H the hidden transformation matrix, and U is the output word representation matrix. b1 and b2 are biases. d is the embedding dimension, jV j is the vocabulary size, and Dh is the hidden layer dimension.

The output vector y^(t) 2 RjV j is a probability distribution over the vocabulary. The model is trained by minimizing the (un-regularized) cross-entropy loss:

jV j

J(t)( ) = CE(y(t); y^(t)) = X yj(t) log y^j(t)

j=1

where y(t) is the one-hot vector corresponding to the target word (which here is equal to x(t+1)). We average the cross-entropy loss across all examples (i.e., words) in a sequence to get the loss for a single sequence.

(a) (5 points, written) Conventionally, when reporting performance of a language model, we evaluate on perplexity, which is de ned as:

1

j x(t); : : : ; x(1))

P

jj=1j

1

y^j

P (xpred = x(t+1)

yj

PP(t) y(t); y^(t)

=

(t+1)

=

V

(t)

(t)

6 This model is adapted from a paper by Toma Mikolov, et al. from 2010: http://www.fit.vutbr.cz/research/groups/ speech/publi/2010/mikolov_interspeech2010_IS100722.pdf

Page 6 of 8

@J(t)

CS 224n: Assignment #2

i.e. the inverse probability of the correct word, according to the model distribution P . Show how you can derive perplexity from the cross-entropy loss (Hint: remember that y(t) is one-hot! ), and thus argue that minimizing the (arithmetic) mean cross-entropy loss will also minimize the (geometric) mean perplexity across the training set. This should be a very short problem - not too perplexing!

For a vocabulary of jV j words, what would you expect perplexity to be if your model predictions were completely random (chosen uniformly from the vocabulary)? Compute the corresponding cross-entropy loss for jV j = 10000.

(b) (8 points, written) Compute the gradients of the loss J with respect to the following model parameters

at a single point in time t (to save a bit of time, you don't have to compute the gradients with the respect to U and b1):

@J(t)

@J(t)

@J(t)

(t)

@J(t)

(t)

@b2

@Lx(t)

@I

@H

where Lx(t) is the row of L corresponding to the current word x(t), and (t) denotes the gradient for the appearance of that parameter at time t (equivalently, h(t 1) is taken to be xed, and you need not backpropagate to earlier timesteps just yet - you'll do that in part (c)).

Additionally, compute the derivative with respect to the previous hidden layer value:

@J(t)

@h(t 1)

(c) (8 points, written) Below is a sketch of the network at a single timestep:

y^(t)

h(t 1) h(t) :::

x(t)

Draw the \unrolled" network for 3 timesteps, and compute the backpropagation-through-time gradients:

@J(t)

@J(t)

@J(t)

@Lx(t 1)

@I

(t

1)

@H

(t

1)

where (t 1) denotes the gradient for the appearance of that parameter at time (t 1). Because param-eters are used multiple times in feed-forward computation, we need to compute the gradient for each time they appear.

You should use the backpropagation rules from Lecture 57 to express these derivatives in terms of

error term (t 1) = @h(t 1) computed in the previous part. (Doing so will allow for re-use of expressions

for t 2, t 3, and so on).

Note that the true gradient with respect to a training example requires us to run backpropagation all the way back to t = 0. In practice, however, we generally truncate this and only backpropagate for a xed number 5 10 timesteps.

7https://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture5.pdf

Page 7 of 8

CS 224n: Assignment #2

(d) (4 points, written) Given h(t 1), how many operations are required to perform one step of forward propagation to compute J(t)( )? How about backpropagation for a single step in time? For steps in time? Express your answer in big-O notation in terms of the dimensions d, Dh and jV j (Equation 1). What is the slow step?

Bonus (1 point, written) Given your knowledge of similar models (i.e. word2vec), suggest a way to speed up this part of the computation. Your approach can be an approximation, but you should ar-gue why it's a good one. The paper \Extensions of recurrent neural network language model" (Mikolov, et al. 2013) may be of interest here.

1 Tensor ow Softmax (25 points)

In this question, we will implement a linear classi er with loss function

J(W ) = CE(y; softmax(xW ))

Where x is a row vector of features and W is the weight matrix for the model. We will use TensorFlow's automatic di erentiation capability to t this model to provided data.

(a) (5 points, coding) Implement the softmax function using TensorFlow in q1 softmax.py. Remember that

exi

softmax(x)i = Pj exj

Note that you may not use tf.nn.softmax or related built-in functions. You can run basic (nonex-haustive tests) by running python q1 softmax.py.

(b) (5 points, coding) Implement the cross-entropy loss using TensorFlow in q1 softmax.py. Remember that

Nc

X

CE(y; y^) = yi log(^yi)

i=1

where y 2 RNc is a one-hot label vector and Nc is the number of classes. This loss is summed over all examples (rows) of a minibatch. Note that you may not use TensorFlow's built-in cross-entropy functions for this question. You can run basic (non-exhaustive tests) by running python q1 softmax.py.

(c) (5 points, coding/written) Carefully study the Model class in model.py. Brie y explain the purpose of placeholder variables and feed dictionaries in TensorFlow computations. Fill in the implementations for add placeholders and create feed dict in q1 classifier.py.

1

CS 224n: Assignment #2

Hint: Note that con guration variables are stored in the Con g class. You will need to use these con guration variables in the code.

(d) (5 points, coding) Implement the transformation for a softmax classi er in the function add prediction op in q1 classifier.py. Add cross-entropy loss in the function add loss op in the same le. Use the implementations from the earlier parts of the problem, not TensorFlow built-ins.

(e) (5 points, coding/written) Fill in the implementation for add training op in q1 classifier.py. Explain how TensorFlow's automatic di erentiation removes the need for us to de ne gradients explicitly. Verify that your model is able to t to synthetic data by running python q1 classifier.py and making sure that the tests pass.

Hint: Make sure to use the learning rate speci ed in Config.

2 Neural Transition-Based Dependency Parsing (50 points)

In this section, you'll be implementing a neural-network based dependency parser. A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between \head" words and words which modify those heads. Your implementation will be a transition-based parser, which incrementally builds up a parse one step at a time. At every step it maintains a partial parse, which is represented as follows:

A stack of words that are currently being processed. A bu er of words yet to be processed.

A list of dependencies predicted by the parser.

Initially, the stack only contains ROOT, the dependencies lists is empty, and the bu er contains all words of the sentence in order. At each step, the parse applies a transition to the partial parse until its bu er is empty and the stack is of size 1. The following transitions can be applied:

SHIFT: removes the rst word from the bu er and pushes it onto the stack.

LEFT-ARC: marks the second (second most recently added) item on the stack as a dependent of the rst item and removes the second item from the stack.

RIGHT-ARC: marks the rst (most recently added) item on the stack as a dependent of the second item and removes the rst item from the stack.

Your parser will decide among transitions at each state using a neural network classi er. First, you will implement the partial parse representation and transition functions.

(a) (6 points, written) Go through the sequence of transitions needed for parsing the sentence \I parsed this sentence correctly". The dependency tree for the sentence is shown below. At each step, give the con guration of the stack and bu er, as well as what transition was applied this step and what new dependency was added (if any). The rst three steps are provided below as an example.

ROOT

I

parsed

this

sentence

correctly

stack

bu er

new dependency

transition

[ROOT]

[I, parsed, this, sentence, correctly]

Initial Con guration

[ROOT, I]

[parsed, this, sentence, correctly]

SHIFT

[ROOT, I, parsed]

[this, sentence, correctly]

parsed!I

SHIFT

[ROOT, parsed]

[this, sentence, correctly]

LEFT-ARC

Page 2 of 8

CS 224n: Assignment #2

(b) (2 points, written) A sentence containing n words will be parsed in how many steps (in terms of n)? Brie y explain why.

(c) (6 points, coding) Implement the init and parse step functions in the PartialParse class in q2 parser transitions.py. This implements the transition mechanics your parser will use. You can run basic (not-exhaustive) tests by running python q2 parser transitions.py.

(d) (6 points, coding) Our network will predict which transition should be applied next to a partial parse. We could use it to parse a single sentence by applying predicted transitions until the parse is complete. However, neural networks run much more e ciently when making predictions about batches of data at a time (i.e., predicting the next transition for a many di erent partial parses simultaneously). We can parse sentences in minibatches with the following algorithm.

Algorithm 1 Minibatch Dependency Parsing

Input: sentences, a list of sentences to be parsed and model, our model that makes parse decisions

Initialize partial parses as a list of partial parses, one for each sentence in sentences Initialize unfinished parses as a shallow copy of partial parses while unfinished parses is not empty do

Take the rst batch size parses in unfinished parses as a minibatch

Use the model to predict the next transition for each partial parse in the minibatch

Perform a parse step on each partial parse in the minibatch with its predicted transition

Remove the completed parses from unfinished parses

end while

Return: The dependencies for each (now completed) parse in partial parses.

Implement this algorithm in the minibatch parse function in q2 parser transitions.py. You can run basic (not-exhaustive) tests by running python q2 parser transitions.py.

Note: You will need minibatch parse to be correctly implemented to evaluate the model you will build in part (h). However, you do not need it to train the model, so you should be able to complete most of part (h) even if minibatch parse is not implemented yet.

We are now going to train a neural network to predict, given the state of the stack, bu er, and dependencies, which transition should be applied next. First, the model extracts a feature vector representing the current state. We will be using the feature set presented in the original neural dependency parsing paper: A Fast and Accurate Dependency Parser using Neural Networks1. The function extracting these features has been implemented for you in parser utils. This feature vector consists of a list of tokens (e.g., the last word in the stack, rst word in the bu er, dependent of the second-to-last word in the stack if there is one, etc.). They can be represented as a list of integers

[w1; w2; :::; wm]

where m is the number of features and each 0 wi < jV j is the index of a token in the vocabulary (jV j is the vocabulary size). First our network looks up an embedding for each word and concatenates them into a single input vector:

x = [Lw0 ; Lw1 ; :::; Lwm ] 2 Rdm

1Chen and Manning, 2014, http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf

Page 3 of 8

CS 224n: Assignment #2

where L 2 RjV j d is an embedding matrix with each row Li as the vector for a particular word i. We then compute our prediction as:

h = ReLU(xW + b1)

y^ = softmax(hU + b2)

(recall that ReLU(z) = max(z; 0)). We evaluate using cross-entropy loss:

Nc

X

J( ) = CE(y; y^) = yi log y^i

i=1

To compute the loss for the training set, we average this J( ) across all training examples.

(e) (4 points, coding) In order to avoid neurons becoming too correlated and ending up in poor local

minimina, it is often helpful to randomly initialize parameters. One of the most frequent initializations

used is called Xavier initialization2.

Given a matrix A of dimension m n, Xavier initialization selects values Aij uniformly from [ ; ], where

p

= p 6 m + n

Implement the initialization in xavier weight init in q2 initialization.py. You can run basic (nonexhaustive tests) by running python q2 initialization.py. This function will be used to initialize W and U.

(f) (2 points, written) We will regularize our network by applying Dropout3. During training this randomly sets units in the hidden layer h to zero with probability pdrop and then multiplies h by a constant (dropping di erent units each minibatch). We can write this as

hdrop = d h

where d 2 f0; 1gDh (Dh is the size of h) is a mask vector where each entry is 0 with probability pdrop and 1 with probability (1 pdrop). is chosen such that the value of hdrop in expectation equals h:

Epdrop [hdrop]i = hi

for all 0 < i < Dh. What must equal in terms of pdrop? Brie y justify your answer.

(g) (4 points, written) We will train our model using the Adam4 optimizer. Recall that standard SGD uses the update rule

r Jminibatch( )

where is a vector containing all of the model parameters, J is the loss function, r Jminibatch( ) is the gradient of the loss function with respect to the parameters on a minibatch of data, and is the learning rate. Adam uses a more sophisticated update rule with two additional steps5.

2 This is also referred to as Glorot initialization and was initially described in http://jmlr.org/proceedings/papers/ v9/glorot10a/glorot10a.pdf

3Srivastava et al., 2014, https://www.cs.toronto.edu/ hinton/absps/JMLRdropout.pdf

4Kingma and Ma, 2015, https://arxiv.org/pdf/1412.6980.pdf

5The actual Adam update uses a few additional tricks that are less important, but we won't worry about them for this problem.

Page 4 of 8

CS 224n: Assignment #2

(i) First, Adam uses a trick called momentum by keeping track of m, a rolling average of the gradients:

m 1m + (1 1)r Jminibatch( )

m

where 1 is a hyperparameter between 0 and 1 (often set to 0.9). Brie y explain (you don't need to prove mathematically, just give an intuition) how using m stops the updates from varying as much. Why might this help with learning?

(ii) Adam also uses adaptive learning rates by keeping track of v, a rolling average of the magnitudes of the gradients:

m 1m + (1 1)r Jminibatch( )

v 2v + (1 p2)(r Jminibatch( ) r Jminibatch( ))

m= v

where and = denote elementwise multiplication and division (so z z is elementwise squaring)

and 2 is a hyperparameter between 0 and 1 (often set to 0.99). Since Adam divides the update by p

v, which of the model parameters will get larger updates? Why might this help with learning?

(h) (20 points, coding/written) In q2 parser model.py implement the neural network classi er govern-ing the dependency parser by lling in the appropriate sections. We will train and evaluate our model on the Penn Treebank (annotated with Universal Dependencies). Run python q2 parser model.py to train your model and compute predictions on the test data (make sure to turn o debug settings when doing nal evaluation).

Hints:

When debugging, pass the keyword argument debug=True to the main method (it is set to true by default). This will cause the code to run over a small subset of the data, so the training the model won't take as long.

This code should run within 1 hour on a CPU.

When running with debug=False, you should be able to get a loss smaller than 0.07 on the train set (by the end of the last epoch) and an Unlabeled Attachment Score larger than 88 on the dev set (with the best-performing model out of all the epochs). For comparison, the model in the original neural dependency parsing paper gets 92.5. If you want, you can tweak the hyperparameters for your model (hidden layer size, hyperparameters for Adam, number of epochs, etc.) to improve the performance (but you are not required to do so).

Deliverables:

Working implementation of the neural dependency parser in q2 parser model.py. (We'll look at, and possibly run this code for grading).

Report the best UAS your model achieves on the dev set and the UAS it achieves on the test set. List of predicted labels for the test set in the le q2 test.predicted.

(i) Bonus (1 point). Add an extension to your model (e.g., l2 regularization, an additional hidden layer) and report the change in UAS on the dev set. Brie y explain what your extension is and why it helps (or hurts!) the model. Some extensions may require tweaking the hyperparameters in Config to make them e ective.

Page 5 of 8

CS 224n: Assignment #2

3 Recurrent Neural Networks: Language Modeling (25 points)

In this section, you'll compute the gradients of a recurrent neural network (RNN) for language modeling.

Language modeling is a central task in NLP, and language models can be found at the heart of speech recognition, machine translation, and many other systems. Given a sequence of words (represented as one-hot row vectors) x(1); x(2); : : : ; x(t), a language model predicts the next word x(t+1) by modeling:

P (x(t+1) = vj j x(t); : : : ; x(1))

where vj is a word in the vocabulary.

Your job is to compute the gradients of a recurrent neural network language model, which uses feedback information in the hidden layer to model the \history" x(t); x(t 1); : : : ; x(1). Formally, the model6 is, for t = 1; : : : ; n 1:

e(t) = x(t)L

h(t) = sigmoid h(t 1)H + e(t)I + b1

y^(t) = softmax h(t)U + b2

(t+1)

= vj j x

(t)

; : : : ; x

(1)

(t)

P (x

)) = y^j

where h(0) = h0 2 RDh is some initialization vector for the hidden layer and x(t)L is the product of L with the one-hot row vector x(t) representing the current word. The parameters are:

L 2 RjV j d H 2 RDh Dh I 2 Rd Dh b1 2 RDh U 2 RDh jV j b2 2 RjV j

(1)

where L is the embedding matrix, I the input word representation matrix, H the hidden transformation matrix, and U is the output word representation matrix. b1 and b2 are biases. d is the embedding dimension, jV j is the vocabulary size, and Dh is the hidden layer dimension.

The output vector y^(t) 2 RjV j is a probability distribution over the vocabulary. The model is trained by minimizing the (un-regularized) cross-entropy loss:

jV j

J(t)( ) = CE(y(t); y^(t)) = X yj(t) log y^j(t)

j=1

where y(t) is the one-hot vector corresponding to the target word (which here is equal to x(t+1)). We average the cross-entropy loss across all examples (i.e., words) in a sequence to get the loss for a single sequence.

(a) (5 points, written) Conventionally, when reporting performance of a language model, we evaluate on perplexity, which is de ned as:

1

j x(t); : : : ; x(1))

P

jj=1j

1

y^j

P (xpred = x(t+1)

yj

PP(t) y(t); y^(t)

=

(t+1)

=

V

(t)

(t)

6 This model is adapted from a paper by Toma Mikolov, et al. from 2010: http://www.fit.vutbr.cz/research/groups/ speech/publi/2010/mikolov_interspeech2010_IS100722.pdf

Page 6 of 8

@J(t)

CS 224n: Assignment #2

i.e. the inverse probability of the correct word, according to the model distribution P . Show how you can derive perplexity from the cross-entropy loss (Hint: remember that y(t) is one-hot! ), and thus argue that minimizing the (arithmetic) mean cross-entropy loss will also minimize the (geometric) mean perplexity across the training set. This should be a very short problem - not too perplexing!

For a vocabulary of jV j words, what would you expect perplexity to be if your model predictions were completely random (chosen uniformly from the vocabulary)? Compute the corresponding cross-entropy loss for jV j = 10000.

(b) (8 points, written) Compute the gradients of the loss J with respect to the following model parameters

at a single point in time t (to save a bit of time, you don't have to compute the gradients with the respect to U and b1):

@J(t)

@J(t)

@J(t)

(t)

@J(t)

(t)

@b2

@Lx(t)

@I

@H

where Lx(t) is the row of L corresponding to the current word x(t), and (t) denotes the gradient for the appearance of that parameter at time t (equivalently, h(t 1) is taken to be xed, and you need not backpropagate to earlier timesteps just yet - you'll do that in part (c)).

Additionally, compute the derivative with respect to the previous hidden layer value:

@J(t)

@h(t 1)

(c) (8 points, written) Below is a sketch of the network at a single timestep:

y^(t)

h(t 1) h(t) :::

x(t)

Draw the \unrolled" network for 3 timesteps, and compute the backpropagation-through-time gradients:

@J(t)

@J(t)

@J(t)

@Lx(t 1)

@I

(t

1)

@H

(t

1)

where (t 1) denotes the gradient for the appearance of that parameter at time (t 1). Because param-eters are used multiple times in feed-forward computation, we need to compute the gradient for each time they appear.

You should use the backpropagation rules from Lecture 57 to express these derivatives in terms of

error term (t 1) = @h(t 1) computed in the previous part. (Doing so will allow for re-use of expressions

for t 2, t 3, and so on).

Note that the true gradient with respect to a training example requires us to run backpropagation all the way back to t = 0. In practice, however, we generally truncate this and only backpropagate for a xed number 5 10 timesteps.

7https://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture5.pdf

Page 7 of 8

CS 224n: Assignment #2

(d) (4 points, written) Given h(t 1), how many operations are required to perform one step of forward propagation to compute J(t)( )? How about backpropagation for a single step in time? For steps in time? Express your answer in big-O notation in terms of the dimensions d, Dh and jV j (Equation 1). What is the slow step?

Bonus (1 point, written) Given your knowledge of similar models (i.e. word2vec), suggest a way to speed up this part of the computation. Your approach can be an approximation, but you should ar-gue why it's a good one. The paper \Extensions of recurrent neural network language model" (Mikolov, et al. 2013) may be of interest here.

You'll get 1 file (34.4MB)