# Assignment #3 Solution

Note: In this assignment, the inputs to neural network layers will be row vectors because this is standard practice for TensorFlow (some built-in TensorFlow functions assume the inputs are row vectors). This means the weight matrix of a hidden layer will right-multiply instead of left-multiply its input (i.e., xW + b instead of Wx + b).

A primer on named entity recognition

In this assignment, we will build several di erent models for named entity recognition (NER). NER is a subtask of information extraction that seeks to locate and classify named entities in text into pre-de ned categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. In the assignment, for a given a word in a context, we want to predict whether it represents one of four categories:

Person (PER): e.g. \Martha Stewart", \Obama", \Tim Wagner", etc. Pronouns like \he" or \she" are not considered named entities.

Organization (ORG): e.g. \American Airlines", \Goldman Sachs", \Department of Defense".

Location (LOC): e.g. \Germany", \Panama Strait", \Brussels", but not unnamed locations like \the bar" or \the farm".

Miscellaneous (MISC): e.g. \Japanese", \USD", \1,000", \Englishmen".

We formulate this as a 5-class classi cation problem, using the four above classes and a null-class (O) for words that do not represent a named entity (most words fall into this category). For an entity that spans multiple words (\Department of Defense"), each word is separately tagged, and every contiguous sequence of non-null tags is considered to be an entity.

Here is a sample sentence (x(t)) with the named entities tagged above each token (y(t)) as well as hypothetical predictions produced by a system (y^(t)):

1

CS 224n

Assignment 3

Page 2 of 10

y(t)   ORG
ORG
O O
O
ORG ORG
. . .
O
PER PER
O
y^(t)
MISC
O
O
O
O
ORG
O
. . .
O
PER
PER
O
x(t)
American
Airlines,
a
unit
of
AMR
Corp.,
. . .
spokesman
Tim
Wagner
said.

In the above example, the system mistakenly predicted \American" to be of the MISC class and ignores \Airlines" and \Corp.". All together, it predicts 3 entities, \American", \AMR" and \Tim Wagner".

To evaluate the quality of a NER system's output, we look at precision, recall and the F1 measure.1 In particular, we will report precision, recall and F1 at both the token-level and the name-entity level. In the former case:

Precision is calculated as the ratio of correct non-null labels predicted to the total number of non-null labels predicted (in the above example, it would be p = 34 ).

Recall is calculated as the ratio of correct non-null labels predicted to the total number of correct non-null labels (in the above example, it would be r = 36 ).

F1 is the harmonic mean of the two: F1 = p2+prr . (in the above example, it would be F1 = 106 ).

For entity-level F1:

Precision is the fraction of predicted entity name spans that line up exactly with spans in the gold standard evaluation data. In our example, \AMR" would be marked incorrectly because it does not cover the whole entity, i.e. \AMR Corp.", as would \American", and we would get a precision score of

13 .

Recall is similarly the number of names in the gold standard that appear at exactly the same location in the predictions. Here, we would get a recall score of 13 .

Finally, the F1 score is still the harmonic mean of the two, and would be 13 in the example.

Our model also outputs a token-level confusion matrix 2. A confusion matrix is a speci c table layout that allows visualization of the classi cation performance. Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class. The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).

1. A window into NER (30 points)

Let's look at a simple baseline model that predicts a label for each token separately using features from a window around it.

Figure 1:  A sample input sequence

def
Figure 1 shows an example of an input sequence and the  rst window from this sequence. Let x =
x(1); x(2); : : : ; x(T )
def
be an input sequence of length T and y = y(1); y(2); : : : ; y(T ) be an output sequence,

1https://en.wikipedia.org/wiki/Precision_and_recall

2https://en.wikipedia.org/wiki/Confusion_matrix

CS 224n                                                                          Assignment 3                                                                  Page 3 of 10

also of length T . Here, each element x(t) and y(t) are one-hot vectors representing the word at the t-th index of the sentence. In a window based classi er, every input sequence is split into T new data points, each representing a window and its label. A new input is constructed from a window around x(t) by

(t)         (t) def             (t  w)                    (t)                (t+w)

concatenating w tokens to the left and right of x : x~ = [x ; : : : ; x ; : : : ; x ]; we continue to use y(t) as its label. For windows centered around tokens at the very beginning of a sentence, we add special start tokens (<START) to the beginning of the window and for windows centered around tokens at the very end of a sentence, we add special end tokens (<END) to the end of the window. For example, consider constructing a window around \Jim" in the sentence above. If window size were 1, we would add a single start token to the window (resulting in a window of [<START, Jim, bought]). If window size were 2, we would add two start tokens to the window (resulting in a window of [<START, <START, Jim, bought, 300]).

With these, each input and output is of a uniform length (w and 1 respectively) and we can use a simple feedforward neural net to predict y(t) from x~(t):

As a simple but e ective model to predict labels from each window, we will use a single hidden layer with a ReLU activation, combined with a softmax output layer and the cross-entropy loss:

e(t) = [x(t  w)L; : : : ; x(t)L; : : : ; x(t+w)L]

h(t) = ReLU(e(t)W + b1)

y^(t) = softmax(h(t)U + b2)

J  = CE(y(t); y^(t))

CE(y(t); y^(t)) =      Xyi(t) log(^yi(t)):

i

where L 2 RV D are word embeddings, h(t) is dimension H and y^(t) is of dimension C, where V is the size of the vocabulary, D is the size of the word embedding, H is the size of the hidden layer and C are the number of classes being predicted (here 5).

(a)   (5 points) (written)

i.   (2 points) Provide 2 examples of sentences containing a named entity with an ambiguous type (e.g. the entity could either be a person or an organization, or it could either be an organization or not an entity).

ii.   (1 point) Why might it be important to use features apart from the word itself to predict named entity labels?

iii.   (2 points) Describe at least two features (apart from the word) that would help in predicting whether a word is part of a named entity or not.

(b)   (5 points) (written)

i.   (2 points) What are the dimensions of e(t), W and U if we use a window of size w?

ii.   (3 points) What is the computational complexity of predicting labels for a sentence of length T ?

(c)   (15 points) (code) Implement a window-based classi er model in q1 window.py using this ap-proach.

To do so, you will have to:

i.   (5 points) Transform a batch of input sequences into a batch of windowed input-output pairs in the make windowed data function. You can test your implementation by running python q1 window.py test1.

ii.   (8 points) Implement the feed-forward model described above by appropriately completing functions in the WindowModel class. You can test your implementation by running python q1 window.py test2.

CS 224n                                                                          Assignment 3                                                                  Page 4 of 10

iii.   (2 points) Train your model using the command python q1 window.py train. The code

should take only about 2{3 minutes to run and you should get a development score of at least 81% F1.

The model and its output will be saved to results/window/<timestamp/, where <timestamp is the date and time at which the program was run. The le results.txt contains formatted

output of the model's predictions on the development set, and the le log contains the printed output, i.e. confusion matrices and F1 scores computed during the training.

Finally, you can interact with your model using:

python q1 window.py shell -m results/window/<timestamp/

Deliverable: After your model has trained, copy the window predictions.conll le from the appropriate results folder into the root folder of your code directory, so that it can be included in your submission.

(d)   (5 points) (written) Analyze the predictions of your model using the  les generated above.

i.   (1 point) Report your best development entity-level F1 score and the corresponding token-level confusion matrix. Brie y describe what the confusion matrix tells you about the errors your model is making.

ii.   (4 points) Describe at least 2 modeling limitations of the window-based model and support these conclusions using examples from your model's output (i.e. identify errors that your model made due to its limitations). You can also support your conclusions using predictions made by your model on examples manually entered through the shell.

2.    Recurrent neural nets for NER (40 points)

We will now tackle the task of NER by using a recurrent neural network (RNN).

Recall that each RNN cell combines the hidden state vector with the input using a sigmoid. We then

CS 224n                                                                          Assignment 3                                                                  Page 5 of 10

use the hidden state to predict the output at each timestep:

e(t) = x(t)L

h(t) =  (h(t  1)Wh + e(t)Wx + b1)

y^(t) = softmax(h(t)U + b2);

where L 2 RV D are word embeddings, Wh 2 RH H , Wx 2 RD H and b1 2 RH are parameters for the RNN cell, and U 2 RH C and b2 2 RC are parameters for the softmax. As before, V is the size of the vocabulary, D is the size of the word embedding, H is the size of the hidden layer and C are the number of classes being predicted (here 5).

In order to train the model, we use a cross-entropy loss for the every predicted token:

T

X

J =         CE(y(t); y^(t))

t=1

CE(y(t); y^(t)) =   Xyi(t) log(^yi(t)):

i

(a)   (4 points) (written)

i.   (1 point) How many more parameters does the RNN model in comparison to the window-based model?

ii.   (3 points) What is the computational complexity of predicting labels for a sentence of length T (for the RNN model)?

(b)   (2 points) (written) Recall that the actual score we want to optimize is entity-level F1.

i.   (1 point) Name at least one scenario in which decreasing the cross-entropy cost would lead to an decrease in entity-level F1 scores.

ii.   (1 point) Why it is di  cult to directly optimize for F1?

(c)   (5 points) (code) Implement an RNN cell using the equations described above in the rnn cell function of q2 rnn cell.py. You can test your implementation by running python q2 rnn cell.py test.

(d)   (8 points) (code/written) Implementing an RNN requires us to unroll the computation over the whole sentence. Unfortunately, each sentence can be of arbitrary length and this would cause the RNN to be unrolled a di erent number of times for di erent sentences, making it impossible to batch process the data.

The most common way to address this problem is pad our input with zeros. Suppose the largest sentence in our input is M tokens long, then, for an input of length T we will need to:

1.  Add 0-vectors to x and y to make them M tokens long.

2.  Create a masking vector, (m(t))Mt=1 which is 1 for all t T and 0 for all t T . This masking vector will allow us to ignore the predictions that the network makes on the padded input.3

3.  Of course, by extending the input and output by M T tokens, we might change our loss and hence gradient updates. In order to tackle this problem, we modify our loss using the masking vector:

M

X

J =         m(t) CE(y(t); y^(t)):

t=1

i.   (3 points) (written) How would the loss and gradient updates change if we did not use masking? How does masking solve this problem?

3In our code, we will actually use the Boolean values of True and False instead of 1 and 0 for computational e  ciency.

CS 224n                                                                          Assignment 3                                                                  Page 6 of 10

ii.   (5 points) (code) Implement pad sequences in your code. You can test your implementation by running python q2 rnn.py test1.

(e)   (12 points) (code) Implement the rest of the RNN model assuming only xed length input by appropriately completing functions in the RNNModel class. This will involve:

2.  Implementing the add prediction op operation that unrolls the RNN loop self.max length

times. Remember to reuse variables in your variable scope from the 2nd timestep onwards to share the RNN cell weights Wx and Wh across timesteps.

3.  Implementing add loss op to handle the mask vector returned in the previous part.

(f)   (3 points) (code) Train your model using the command python q2 rnn.py train.  Training

should take about 2 hours on your CPU and 10{20 minutes if you use the GPUs provided by Microsoft Azure. You should get a development F1 score of at least 85%.

The model and its output will be saved to results/rnn/<timestamp/, where <timestamp is the date and time at which the program was run. The le results.txt contains formatted

output of the model's predictions on the development set, and the le log contains the printed output, i.e. confusion matrices and F1 scores computed during the training.

Finally, you can interact with your model using:

python q2 rnn.py shell -m results/rnn/<timestamp/

Deliverable: After your model has trained, copy the rnn predictions.conll le from the appropriate results folder into your code directory so that it can be included in your submission.

(g)   (6 points) (written)

i.   (3 points) Describe at least 2 modeling limitations of this RNN model and support these con-clusions using examples from your model's output.

ii.   (3 points) For each limitation, suggest some way you could extend the model to overcome the limitation.

3.    Grooving with GRUs (30 points)

In class, we learned that a gated recurrent unit (GRU) is an improved RNN cell that greatly reduces the problem of vanishing gradients. Recall that a GRU is described by the following equations:

z(t) =  (x(t)Uz + h(t  1)Wz + bz)

r(t) =  (x(t)Ur + h(t  1)Wr + br)

~(t)                       (t)               (t)         (t  1)

h   = tanh(x           Uh + r      h              Wh + bh)

(t)     (t)     (t  1)         (t)             ~(t)

h      = z          h          + (1      z     )     h     ;

where z(t) is considered to be an update gate and r(t) is considered to be a reset gate.4

Also, to keep the notation consistent with the GRU, for this problem, let the basic RNN cell be described by the equations:

h(t) =  (x(t)Uh + h(t  1)Wh + bh):

To gain some inutition, let's explore the behavior of the basic RNN cell and the GRU on some generated 1-D sequences.

4         Section 4.2.2 of https://arxiv.org/pdf/1511.07916.pdf provides a good introduction to GRUs. http://colah. github.io/posts/2015-08-Understanding-LSTMs/ provides a more colorful picture of LSTMs and to an extent GRUs.

CS 224n                                                                          Assignment 3                                                                  Page 7 of 10

(a)   (4 points) (written) Modeling latching behavior. Let's say we are given input sequences starting with a 1 or 0, followed by n 0s, e.g. 0, 1, 00, 10, 000, 100, etc. We would like our state h to continue to remember what the rst character was, irrespective of how many 0s follow. This scenario can also be described as wanting the neural network to learn the following simple automaton:

x = 0
x = 0; 1

x = 1
h = 0
h = 1

In other words, when the network sees a 1, it should change its state to also be a 1 and stay there. In the following questions, assume that the state is initialized at 0 (i.e. h(0) = 0), and that all the parameters are scalars. Further, assume that all sigmoid activations and tanh activations are replaced by the indicator function:

(x) ! (
1
if x 0
tanh(x) ! (
1
if x 0
0
otherwise
0
otherwise :

i.   (1 point) Identify values of wh, uh and bh for an RNN cell that would allow it to replicate the behavior described by the automaton above.

ii.   (3 points) Let wr = ur = br = bz = bh = 0. Identify values of wz, uz, wh and uh for a GRU cell that would allow it to replicate the behavior described by the automaton above.

(b)   (6 points) (written) Modeling toggling behavior. Now, let us try modeling a more interesting behavior. We are now given an arbitrary input sequence, and must produce an output sequence that switches from 0 to 1 and vice versa whenever it sees a 1 in the input. For example, the input sequence 00100100 should produce 00111000. This behavior could be described by the following automaton:

x = 0

x = 0

x = 1

h = 0

h = 1

x = 1

Once again, assume that the state is initialized at 0 (i.e. h(0) = 0), that all the parameters are scalars, that all sigmoid activations and tanh activations are replaced by the indicator function.

i. (3 points) Show that a 1D RNN can not replicate the behavior described by the automaton above.

ii.   (3 points) Let wr = ur = bz = bh = 0. Identify values of br, wz, uz, wh and uh for a GRU cell that would allow it to replicate the behavior described by the automaton above.

(c)   (6 points) (code) Implement the GRU cell described above in q3 gru cell.py. You can test your implementation by running python q3 gru cell.py test.

(d)   (6 points) (code) We will now use an RNN model to try and learn the latching behavior described in part (a) using TensorFlow's RNN implementation: tf.nn.dynamic rnn.

i.   In q3 gru.py, implement add prediction op by applying TensorFlow's dynamic RNN model on the sequence input provided. Also apply a sigmoid function on the nal state to normalize the state values between 0 and 1.

ii.   Next, write code to calculate the gradient norm and implement gradient clipping in add training op.

CS 224n                                                                          Assignment 3                                                                  Page 8 of 10

iii.   Run the program:

python q3 gru.py predict -c [rnn|gru] [-g]

to generate a learning curve for this task for the RNN and GRU models. The -g ag activates gradient clipping.

These commands produce a plot of the learning dynamics in q3-noclip-<model.png and q3-clip-<model.png respectively.

Deliverable: Attach the plots of learning dynamics generated for a GRU and RNN, with and without gradient clipping (in total 4 plots) to your write up.

(e)   (5 points) (written) Analyze the graphs obtained above and describe the learning dynamics you see. Make sure you address the following questions:

i.   Does either model experience vanishing or exploding gradients? If so, does gradient clipping help?

ii.   Which model does better? Can you explain why?

(f)   (3 points) (code) Run the NER model from question 2 using the GRU cell using the command: python q2 rnn.py train -c gru

Training should take about 3{4 hours on your CPU and about 30 minutes if you use the GPUs provided by Microsoft Azure. You should get a development F1 score of at least 85%.

The model and its output will be saved to results/gru/<timestamp/, where <timestamp is the date and time at which the program was run. The le results.txt contains formatted

output of the model's predictions on the development set, and the le log contains the printed output, i.e. confusion matrices and F1 scores computed during the training.

Finally, you can interact with your model using:

python q2 rnn.py shell -m results/gru/<timestamp/

Deliverable: After your model has trained, copy the gru predictions.conll le from the appropriate results folder into your code directory so that it can be included in your submission.