Advanced Vector Space Models : Assignment 4 Solution

In this assignment, we will examine some advanced uses of vector representations of words. We are going to look at two different problems:

Solving word relation problems like analogies using word embeddings.
Discovering the different senses of a “polysemous” word by clustering together its synonyms. You will use an open source Python package for creating and manipulating word vectors called gensim. Gensim lets you easily train word embedding models like word2vec.
In order to use the gensim package, you’ll have to be using Python version 3.6 or higher. On my Mac, I did the following:

brew install python3
pip3 install gensim
Then when I ran python, I used the command python3 instead of just python
Here are the materials that you should download for this assignment:

question1.txt A template for answering question 1.
data.zip Contains all the data
vectorcluster.py Main code stub
evaluate.py Evaluation script
writeup.tex Report template.
makecooccurrences.py Script to make cooccurrences (optional use)
Tokenized Reuters RCV1 Corpus
Google’s pretrained word2vec vectors, under the heading “Pretrained word and phrase vectors”
Part 1: Exploring Analogies and Other Word Pair Relationships
Word2vec is a very cool word embedding method that was developed by Thomas Mikolov and his collaborators. One of the noteworthy things about the method is that it can be used to solve word analogy problems like man is to king as woman is to [blank] The way that it works is to perform vector math. They take the vectors representing king, man and woman and perform some vector arithmetic to produce a vector that is close to the expected answer.king−man+woman≈queenking−man+woman≈queen So We can find the nearest a vector in the vocabulary by looking for argmax cos(x,king−man+woman)argmax cos(x,king−man+woman). Omar Levy has a nice explnation of the method in this Quora post, and in his paper Linguistic Regularities in Sparse and Explicit Word Representations.

In addition to solving this sort of analogy problem, the same sort of vector arithmetic was used with word2vec embeddings to find relationships between pairs of words like the following:



In the first part of this homework, you will play around with the gensim library library. You will use gensim load a dense vector model trained using word2vec, and use it to manipulate and analyze the vectors.
You can start by experimenting on your own, or reading through this tutorial on using word2vec with gensim. You should familiarize yourself with the KeyedVectors documentation.

The questions below are designed to familiarize you with the gensim Word2Vec package, and get you thinking about what type of semantic information word embeddings can encode. You’ll submit your answers to these questions when you submit your other homework materials.

Load the word vectors using the following Python commands:

from gensim.models import KeyedVectors
vecfile = 'GoogleNews-vectors-negative300.bin'
vecs = KeyedVectors.load_word2vec_format(vecfile, binary=True)
What is the dimensionality of these word embeddings? Provide an integer answer.
What are the top-5 most similar words to picnic (not including picnic itself)? (Use the function gensim.models.KeyedVectors.wv.most_similar)
According to the word embeddings, which of these words is not like the others? ['tissue', 'papyrus', 'manila', 'newsprint', 'parchment', 'gazette'] (Use the function gensim.models.KeyedVectors.wv.doesnt_match)
Solve the following analogy: “leg” is to “jump” as X is to “throw”. (Use the function gensim.models.KeyedVectors.wv.most_similar with positiveand negative arguments.)
We have provided a file called question1.txt for you to submit answers to the questions above.

Part 2: Creating Word Sense Clusters
Many natural language processing (NLP) tasks require knowing the sense of polysemous words, which are words with multiple meanings. For example, the word bug can mean

a creepy crawly thing
an error in your computer code
a virus or bacteria that makes you sick
a listening device planted by the FBI
In past research my PhD students and I have looked into automatically deriving the different meaning of polysemous words like bug by clustering their paraphrases. We have developed a resource called the paraphrase database (PPDB) that contains of paraphrases for tens of millions words and phrases. For the target word bug, we have an unordered list of paraphrases including: insect, glitch, beetle, error, microbe, wire, cockroach, malfunction, microphone, mosquito, virus, tracker, pest, informer, snitch, parasite, bacterium, fault, mistake, failure and many others. We used automatic clustering group those into sets like:



These clusters approximate the different word senses of bug. You will explore the main idea underlying our word sense clustering method: which measure the similarity between each pair of paraphrases for a target word and then group together the paraphrases that are most similar to each other. This affinity matrix gives an example of one of the methods for measuring similarity that we tried in our paper:



Here the darkness values give an indication of how similar paraprhases are to each other. For instance sim(insect, pest) sim(insect, error).

In this assignment, we will use vector representations in order to measure their similarities of pairs of paraprhases. You will play with different vector space representations of words to create clusters of word senses.

In this image, we have a target word “bug”, and a list of all synonyms (taken from WordNet). The 4 circles are the 4 senses of “bug.” The input to the problem is all the synonyms in a single list, and the task is to separate them correctly. As humans, this is pretty intuitive, but computers aren’t that smart. We will use this task to explore different types of word representations.

You can read more about this task in these papers.

Clustering with Word Vectors
We expect that you have read Jurafsky and Martin, chapters 15 and 16. Word vectors, also known as word embeddings, can be thought of simply as points in some high-dimensional space. Remember in geometry class when you learned about the Euclidean plane, and 2-dimensional points in that plane? It’s not hard to understand distance between those points – you can even measure it with a ruler. Then you learned about 3-dimensional points, and how to calculate the distance between these. These 3-dimensional points can be thought of as positions in physical space.

Now, do your best to stop thinking about physical space, and generalize this idea in your mind: you can calculate a distance between 2-dimensional and 3-dimensional points, now imagine a point with 300 dimensions. The dimensions don’t necessarily have meaning in the same way as the X,Y, and Z dimensions in physical space, but we can calculate distances all the same.

This is how we will use word vectors in this assignment: as points in some high-dimensional space, where distances between points are meaningful. The interpretation of distance between word vectors depends entirely on how they were made, but for our purposes, we will consider distance to measure semantic similarity. Word vectors that are close together should have meanings that are similar.

With this framework, we can see how to solve our synonym clustering problem. Imagine in the image below that each point is a (2-dimensional) word vector. Using the distance between points, we can separate them into 3 clusters. This is our task.

 (Image taken from Wikipedia)

The Data
The data to be used for this assignment consists of sets of paraphrases corresponding to one of 56 polysemous target words, e.g.

Target
Paraphrase set
note.v
comment mark tell observe state notice say remark mention
hot.a
raging spicy blistering red-hot live
(Here the .v following the target note indicates the part of speech.)

Your objective is to automatically cluster each paraphrase set such that each cluster contains words pertaining to a single sense, or meaning, of the target word. Note that a single word from the paraphrase set might belong to one or more clusters.

For evaluation, we take the set of ground truth senses from WordNet.

Development data
The development data consists of two files – a words file (the input), and a clusters file (to evaluate your output). The words file dev_input.txt is formatted such that each line contains one target, its paraphrase set, and the number of ground truth clusters k, separated by a :: symbol:

target.pos :: k :: paraphrase1 paraphrase2 paraphrase3 ...

You can use k as input to your clustering algorithm.

The clusters file dev_output.txt contains the ground truth clusters for each target word’s paraphrase set, split over k lines:

target.pos :: 1 :: paraphrase2 paraphrase6
target.pos :: 2 :: paraphrase3 paraphrase4 paraphrase5
...
target.pos :: k :: paraphrase1 paraphrase9

Test data
For testing, you will receive only words file test_input.txt containing the test target words and their paraphrase sets. Your job is to create an output file, formatted in the same way as dev_output.txt, containing the clusters produced by your system. Neither order of senses, nor order of words in a cluster matter.

Evaluation
There are many possible ways to evaluate clustering solutions. For this homework we will rely on the paired F-score, which you can read more about in this paper.

The general idea behind paired F-score is to treat clustering prediction like a classification problem; given a target word and its paraphrase set, we call a positive instance any pair of paraphrases that appear together in a ground-truth cluster. Once we predict a clustering solution for the paraphrase set, we similarly generate the set of word pairs such that both words in the pair appear in the same predicted cluster. We can then evaluate our set of predicted pairs against the ground truth pairs using precision, recall, and F-score.

We have provided an evaluation script that you can use when developing your own system. You can run it as follows:

python evaluate.py <GROUND-TRUTH-FILE <PREDICTED-CLUSTERS-FILE

Baselines
On the dev data, a random baseline gets about 20%, the word cooccurrence matrix gets about 36%, and the word2vec vectors get about 30%.

1. Sparse Representations
Your next task is to generate clusters for the target words in test_input.txt based on a feature-based (not dense) vector space representation. In this type of VSM, each dimension of the vector space corresponds to a specific feature, such as a context word (see, for example, the term-context matrix described in Chapter 15.1.2 of Jurafsky & Martin).

You will calculate cooccurrence vectors on the Reuters RCV1 corpus. Download a tokenized and cleaned version here. The original is here. Use the provided script, makecooccurrences.py, to build these vectors. Be sure to set D and W to what you want.

It can take a long time to build cooccurrence vectors, so we have pre-built a set, included in the data.zip, called coocvec-500mostfreq-window-3.vec.filter. To save on space, these include only the words used in the given files.

You will add K-means clustering to vectorcluster.py. Here is an example of the K-means code:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=k).fit(X)
print(kmeans.labels_)
The baseline system for this section represents words using a term-context matrix M of size |V| x D, where |V| is the size of the vocabulary and D=500. Each feature corresponds to one of the top 500 most-frequent words in the corpus. The value of matrix entry M[i][j] gives the number of times the context word represented by column j appeared within W=3 words to the left or right of the word represented by row i in the corpus. Using this representation, the baseline system clusters each paraphrase set using K-means.

While experimenting, write out clusters for the dev input to dev_output_features.txt and use the evaluate.py script to compare against the provided dev_output.txt.

Implementing the baseline will score you a B, but why not try and see if you can do better? You might try experimenting with different features, for example:

What if you reduce or increase D in the baseline implementation?
Does it help to change the window W used to extract contexts?
Play around with the feature weighting – instead of raw counts, would it help to use PPMI?
Try a different clustering algorithm that’s included with the scikit-learn clustering package, or implement your own.
What if you include additional types of features, like paraphrases in the Paraphrase Database or the part-of-speech of context words?
The only feature types that are off-limits are WordNet features.

Turn in the predicted clusters that your VSM generates in the file test_output_features.txt. Also provide a brief description of your method in writeup.pdf, making sure to describe the vector space model you chose, the clustering algorithm you used, and the results of any preliminary experiments you might have run on the dev set. We have provided a LaTeX file shell, writeup.tex, which you can use to guide your writeup.

2. Dense Representations
Finally, we’d like to see if dense word embeddings are better for clustering the words in our test set. Run the word clustering task again, but this time use a dense word representation.

For this task, use files:

Google’s pretrained word2vec vectors, under the heading “Pretrained word and phrase vectors”
The Google file is very large (~3.4GB), so we have also included in the data.zip a file called GoogleNews-vectors-negative300.filter, which is filtered to contain only the words in the dev/test splits.
Modify vectorcluster.py to load dense vectors.
The baseline system for this section uses the provided word vectors to represent words, and K-means for clustering.

As before, achieving the baseline score will get you a B, but you might try to see if you can do better. Here are some ideas:

Try downloading a different dense vector space model from the web, like Paragram or fastText.
Train your own word vectors, either on the provided corpus or something you find online. You can use the gensim.models.Word2Vec package for the skip-gram or CBOW models, or GLOVE. Try experimenting with the dimensionality.
Retrofitting is a simple way to add additional semantic knowledge to pre-trained vectors. The retrofitting code is available here. Experiment with different lexicons, or even try counter-fitting.
As in question 2, turn in the predicted clusters that your dense vector representation generates in the file test_output_dense.txt. Also provide a brief description of your method in writeup.pdf that includes the vectors you used, and any experimental results you have from running your model on the dev set.

In addition, do an analysis of different errors made by each system – i.e. look at instances that the word-context matrix representation gets wrong and dense gets right, and vice versa, and see if there are any interesting patterns. There is no right answer for this.

3. The Leaderboard
In order to stir up some friendly competition, we would also like you to submit the clustering from your best model to a leaderboard. Copy the output file from your best model to a file called test_output_leaderboard.txt, and include it with your submission.

Extra Credit
We made the clustering problem deliberately easier by providing you with k, the number of clusters, as an input. But in most clustering situations the best k isn’t obvious. To take this assignment one step further, see if you can come up with a way to automatically choose k. We have provided an additional test set, test_nok_input.txt, where the k field has been zeroed out. See if you can come up with a method that clusters words by sense, and chooses the best k on its own. (Don’t look at the number of WordNet synsets for this, as that would ruin all the fun.) The baseline system for this portion always chooses k=5. You can submit your output to this part in a file called test_nok_output_leaderboard.txt. Be sure to describe your method in writeup.pdf.

Deliverables
Here are the deliverables that you will need to submit:

question1.txt file with answers to questions from Exploration
simple VSM clustering output test_output_features.txt
dense model clustering output test_output_dense.txt
your favorite clustering output for the leaderboard, test_output_leaderboard.txt (this will probably be a copy of either test_output_features.txt or test_output_dense.txt)
writeup.pdf (compiled from writeup.tex)
your code (.zip). It should be written in Python 3.
(optional) the output of your model that automatically chooses the number of clusters, test_nok_output_leaderboard.txt (submit this to the Gradescope assignment ‘Homework 4 EXTRA CREDIT’)
Recommended readings
Vector Semantics. Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd edition draft) .
Semantics with Dense Vectors. Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd edition draft) .
Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. ArXiV 2013. Abstract
Linguistic Regularities in Continuous Space Word Representations. Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig. NAACL 2013. Abstract BibTex
Discovering Word Senses from Text. Patrick Pangel and Dekang Ling. KDD 2002. Abstract BibTex
Clustering Paraphrases by Word Sense. Anne Cocos and Chris Callison-Burch. NAACL 2016. Abstract BibTex
Powered by