# Pagerank soln

Lab Overview

Try to get as far as you can. In this lab, we still use dictionaries and sets, and you will implement pagerank, the method that made Google the success it is today. In the early days of Internet, it was a big problem to ﬁnd “high quality” web pages in answer to a keyword search. Google’s Sergey Brin and Larry Page, then graduate students in Stanford University, introduced a new algorithm called pagerank that can rank pages with respect to how good they were. And, they did in Python! They ran it from a small server in Stanford and they were an overnight success. In today’s Google, pagerank is still a part of why some page comes up on top over the others, but many other methods are added on top. In this lab, we will use a graph of friendships to compute pagerank instead of webpages. But, the algorithm is the same.

Pagerank idea

Understanding the intuition may help with solving this lab, but it is not necessary. You can safely skip this section if you wish. The idea of pagerank comes from social science and is also used very frequently analyzing who is very important in a group of friends. Think of a simple graph, like the one shown below.

1

2

3 4

In this graph, you have four people. If there is a link from person 1 to person 2, then it means that person 1 likes person 2. It is a directed link because friendship is not always two sided. Suppose, now we want to ﬁnd the most liked person. You will start from any person in the graph, and ask her who she likes. She will tell you all the people she likes. You choose one of them in random and go and ask that person. You will continue doing this and travel the graph. This is called navigation. Every now and then you will get tired of asking and just choose a person in random. This is called the random jump. Assuming you navigate some of the time and jump randomly the remaining time, the person you are most likely to be visiting at any point in time will have the highest pagerank. For example, in this example graph, it is clear that more links end up in person 4, so we expect 4 to have high pagerank. An interesting part of pagerank is that if an important person is your friend, then you are also more important. So, most likely being friends with 4 will give a boost to 1 as well. Well, you will see whether this is true or not when you compute it yourself.

Pagerank assigns a ﬂoating point value (score) to each person in the graph. The higher your pagerank, the more important you are. In the next checkpoints, we walk you through how to compute it in detail. You will then complete the lab with a (hopefully fun) exercise.

Checkpoint 1: Reading a graph

To start this lab, download the ﬁle extralab-files.zip which will contain a number of text ﬁles such as edge1.txt, edge2.txt, edge3.txt. Each ﬁle corresponds to a graph. For example, the ﬁle edge1.txt for the picture in page 1 has the following information:

1 3 4 2 4 3 2 4 4 1

which means that person 1 points to person 3 and 4, person 2 points to person 4, and person 4 points to person 1. Person 3 points to person 2 and 4. So, person 1 and 3 have 2 friends each while all the others have only 1 friend. We will refer to each person as a node from now on. Read this ﬁle into a dictionary where each key is a person, and each value is the set of people the person points to. For the above example, you should get the following dictionary:

{1: set([3, 4]), 2: set([4]), 3: set([2, 4]), 4: set([1])}

To complete checkpoint 1, write a program that reads a ﬁle into a dictionary called graph of this form and prints the dictionary. Make sure you convert all the key and values to integers. Expected output for edge1.txt is shown below.

File == edge1.txt 4 people total 1: set([3, 4]) 2: set([4]) 3: set([2, 4]) 4: set([1])

To complete Checkpoint 1: show your code and your output to a TA or a mentor. You will be graded both on correct output and the good program structure.

Checkpoint 2: Computing the pagerank of scores

Copy your code from check point 1 to a new ﬁle for check point 2. In this checkpoint, you will add score computation to your program. First, create a new dictionary called scores such that each key is a node (i.e. person) name and the value is a ﬂoat 1/N where N is the total number of nodes. For the above example, we would get: scores = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}

2

This simply means that we assume each person has exactly the same score. Now, implement the function shown below in pseudo-code. Your function will take as input a score dictionary like the one you just created above and the graph you read in check point 1, construct and return a totally new set of scores based on the old scores and the graph.

def get_newscores(current_scores, graph): new_scores = {}

## N = number of nodes in the graph

for key in current_scores: ##initialize zero scores for everyone new_scores[key]= 0.15/N

## compute new scores for each node in the graph ## based on the scores in the dictionary current_scores ## and the graph (algorithm is given below)

for each person p in graph: M = the number of people p points to for each person q that p points to: new_scores[q] += 0.85 * current_scores[p]/M

return new_scores

To complete Checkpoint 2: after reading a graph, create a score dictionary as described earlier, implement the above function and call it for the initial set of scores and print out the scores before and after the call to the above function. For the test ﬁle edge1.txt, you should get the following:

Filename == edge1.txt 4 people total 1: set(['3', '4']) 2: set(['4']) 3: set(['2', '4']) 4: set(['1']) Initial scores 1: 0.2500 2: 0.2500 3: 0.2500 4: 0.2500 Computed scores 1: 0.2500 2: 0.1437 3: 0.1437 4: 0.4625

To complete Checkpoint 2: show your program and your output to a TA or a mentor. You will be graded both on correct output and the good program structure.

3

Checkpoint 3: Will it converge?

Now you have computed the scores for one time. Can we use the new scores to call the function again and again until we stop? This is very similar in essence to the Bunny and Fox problem. First, write a loop to call the function multiple times, each point updating the scores to be the scores returned by the given function. In the same way as Bunny and Fox problem, we want to compute the scores until they converge. Here is how we deﬁne convergence in this case: First, compute the pairwise diﬀerence between the old and new scores for each person.

diff = abs(0.25-0.25) + abs(0.25-0.14375) + \ abs(0.25-0.14375) + abs(0.25-0.4625) diff 0.42500000000000004

then set a threshold value,

threshold = 0.00001

At the end of an iteration, compute the diﬀerence between the current scores and the scores returned by the function as shown above. If the diﬀerence is less than threshold, then we break out of the loop. Otherwise, continue repeating the loop with the newly computed scores. To check the progress of your program, print out the diff values at each step. If they are not decreasing, then your program will never stop. This means you have a bug. Once done, you should print the ﬁnal scores for each person as before. To complete Checkpoint 3, implement the loop that is described above for the given threshold and print out the ﬁnal scores of each person. TAs will test your algorithm with diﬀerent input ﬁles. File edge1.txt converges in 27 steps for the given threshold, and produces the following output:

Filename == edge1.txt 4 people total 1: set(['3', '4']) 2: set(['4']) 3: set(['2', '4']) 4: set(['1']) Initial scores 1: 0.2500 2: 0.2500 3: 0.2500 4: 0.2500 Computing scores with threshold 1e-05 ... Iteration 26 (diff 0.000011): 1: 0.3426 2: 0.1153 3: 0.1831 4: 0.3590 Iteration 27 (diff 0.000007): 1: 0.3426 2: 0.1153 3: 0.1831 4: 0.3590

4

Interestingly, person 4 has the highest pagerank value, followed closely by person 1 as expected. Person 1 is really beneﬁting greatly from person 4’s clout. Oh yes, the clout score is also computed very similarly. To test it further, try setting the threshold even smaller and see that it still converges. To complete Checkpoint 3: show your code and your output to a TA or a mentor. You will be graded both on correct output and the good program structure.

Checkpoint 4: Now for some fun

We will now do two things. We will ﬁrst try the algorithm with much larger graphs. You have two graphs: lesmis is the graph of all the relationships in the book Les Miserables (imagine horrible singing by Russell Crowe) and hero is the graph containing all the heros in the Marvel universe. For each graph, you have the edges as before. Plus, you have a key of the names of characters in a separate ﬁle. To complete this checkpoint, you must do the following. I will not give much detail here. You should be able to ﬁgure out these steps on your own. Take this as a test of how well your programming skills are progressing:

• Read the graph and compute pagerank as before • Read the names corresponding to the graph • Ask the user for a number (k) • Find the people with the highest top k scores and print their name in the order of their score

To complete Checkpoint 4: show your code that ﬁnds and prints the name of the people with the top k pagerank scores in either ﬁle. NOTE: We will not give any help on this last checkpoint. We will also not provide you with correct output. You must convince yourself that you are doing it correctly with proper testing. NOTE 2: Both ﬁles with names are very large, so you better not print the scores in this checkpoint.

5

Try to get as far as you can. In this lab, we still use dictionaries and sets, and you will implement pagerank, the method that made Google the success it is today. In the early days of Internet, it was a big problem to ﬁnd “high quality” web pages in answer to a keyword search. Google’s Sergey Brin and Larry Page, then graduate students in Stanford University, introduced a new algorithm called pagerank that can rank pages with respect to how good they were. And, they did in Python! They ran it from a small server in Stanford and they were an overnight success. In today’s Google, pagerank is still a part of why some page comes up on top over the others, but many other methods are added on top. In this lab, we will use a graph of friendships to compute pagerank instead of webpages. But, the algorithm is the same.

Pagerank idea

Understanding the intuition may help with solving this lab, but it is not necessary. You can safely skip this section if you wish. The idea of pagerank comes from social science and is also used very frequently analyzing who is very important in a group of friends. Think of a simple graph, like the one shown below.

1

2

3 4

In this graph, you have four people. If there is a link from person 1 to person 2, then it means that person 1 likes person 2. It is a directed link because friendship is not always two sided. Suppose, now we want to ﬁnd the most liked person. You will start from any person in the graph, and ask her who she likes. She will tell you all the people she likes. You choose one of them in random and go and ask that person. You will continue doing this and travel the graph. This is called navigation. Every now and then you will get tired of asking and just choose a person in random. This is called the random jump. Assuming you navigate some of the time and jump randomly the remaining time, the person you are most likely to be visiting at any point in time will have the highest pagerank. For example, in this example graph, it is clear that more links end up in person 4, so we expect 4 to have high pagerank. An interesting part of pagerank is that if an important person is your friend, then you are also more important. So, most likely being friends with 4 will give a boost to 1 as well. Well, you will see whether this is true or not when you compute it yourself.

Pagerank assigns a ﬂoating point value (score) to each person in the graph. The higher your pagerank, the more important you are. In the next checkpoints, we walk you through how to compute it in detail. You will then complete the lab with a (hopefully fun) exercise.

Checkpoint 1: Reading a graph

To start this lab, download the ﬁle extralab-files.zip which will contain a number of text ﬁles such as edge1.txt, edge2.txt, edge3.txt. Each ﬁle corresponds to a graph. For example, the ﬁle edge1.txt for the picture in page 1 has the following information:

1 3 4 2 4 3 2 4 4 1

which means that person 1 points to person 3 and 4, person 2 points to person 4, and person 4 points to person 1. Person 3 points to person 2 and 4. So, person 1 and 3 have 2 friends each while all the others have only 1 friend. We will refer to each person as a node from now on. Read this ﬁle into a dictionary where each key is a person, and each value is the set of people the person points to. For the above example, you should get the following dictionary:

{1: set([3, 4]), 2: set([4]), 3: set([2, 4]), 4: set([1])}

To complete checkpoint 1, write a program that reads a ﬁle into a dictionary called graph of this form and prints the dictionary. Make sure you convert all the key and values to integers. Expected output for edge1.txt is shown below.

File == edge1.txt 4 people total 1: set([3, 4]) 2: set([4]) 3: set([2, 4]) 4: set([1])

To complete Checkpoint 1: show your code and your output to a TA or a mentor. You will be graded both on correct output and the good program structure.

Checkpoint 2: Computing the pagerank of scores

Copy your code from check point 1 to a new ﬁle for check point 2. In this checkpoint, you will add score computation to your program. First, create a new dictionary called scores such that each key is a node (i.e. person) name and the value is a ﬂoat 1/N where N is the total number of nodes. For the above example, we would get: scores = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}

2

This simply means that we assume each person has exactly the same score. Now, implement the function shown below in pseudo-code. Your function will take as input a score dictionary like the one you just created above and the graph you read in check point 1, construct and return a totally new set of scores based on the old scores and the graph.

def get_newscores(current_scores, graph): new_scores = {}

## N = number of nodes in the graph

for key in current_scores: ##initialize zero scores for everyone new_scores[key]= 0.15/N

## compute new scores for each node in the graph ## based on the scores in the dictionary current_scores ## and the graph (algorithm is given below)

for each person p in graph: M = the number of people p points to for each person q that p points to: new_scores[q] += 0.85 * current_scores[p]/M

return new_scores

To complete Checkpoint 2: after reading a graph, create a score dictionary as described earlier, implement the above function and call it for the initial set of scores and print out the scores before and after the call to the above function. For the test ﬁle edge1.txt, you should get the following:

Filename == edge1.txt 4 people total 1: set(['3', '4']) 2: set(['4']) 3: set(['2', '4']) 4: set(['1']) Initial scores 1: 0.2500 2: 0.2500 3: 0.2500 4: 0.2500 Computed scores 1: 0.2500 2: 0.1437 3: 0.1437 4: 0.4625

To complete Checkpoint 2: show your program and your output to a TA or a mentor. You will be graded both on correct output and the good program structure.

3

Checkpoint 3: Will it converge?

Now you have computed the scores for one time. Can we use the new scores to call the function again and again until we stop? This is very similar in essence to the Bunny and Fox problem. First, write a loop to call the function multiple times, each point updating the scores to be the scores returned by the given function. In the same way as Bunny and Fox problem, we want to compute the scores until they converge. Here is how we deﬁne convergence in this case: First, compute the pairwise diﬀerence between the old and new scores for each person.

diff = abs(0.25-0.25) + abs(0.25-0.14375) + \ abs(0.25-0.14375) + abs(0.25-0.4625) diff 0.42500000000000004

then set a threshold value,

threshold = 0.00001

At the end of an iteration, compute the diﬀerence between the current scores and the scores returned by the function as shown above. If the diﬀerence is less than threshold, then we break out of the loop. Otherwise, continue repeating the loop with the newly computed scores. To check the progress of your program, print out the diff values at each step. If they are not decreasing, then your program will never stop. This means you have a bug. Once done, you should print the ﬁnal scores for each person as before. To complete Checkpoint 3, implement the loop that is described above for the given threshold and print out the ﬁnal scores of each person. TAs will test your algorithm with diﬀerent input ﬁles. File edge1.txt converges in 27 steps for the given threshold, and produces the following output:

Filename == edge1.txt 4 people total 1: set(['3', '4']) 2: set(['4']) 3: set(['2', '4']) 4: set(['1']) Initial scores 1: 0.2500 2: 0.2500 3: 0.2500 4: 0.2500 Computing scores with threshold 1e-05 ... Iteration 26 (diff 0.000011): 1: 0.3426 2: 0.1153 3: 0.1831 4: 0.3590 Iteration 27 (diff 0.000007): 1: 0.3426 2: 0.1153 3: 0.1831 4: 0.3590

4

Interestingly, person 4 has the highest pagerank value, followed closely by person 1 as expected. Person 1 is really beneﬁting greatly from person 4’s clout. Oh yes, the clout score is also computed very similarly. To test it further, try setting the threshold even smaller and see that it still converges. To complete Checkpoint 3: show your code and your output to a TA or a mentor. You will be graded both on correct output and the good program structure.

Checkpoint 4: Now for some fun

We will now do two things. We will ﬁrst try the algorithm with much larger graphs. You have two graphs: lesmis is the graph of all the relationships in the book Les Miserables (imagine horrible singing by Russell Crowe) and hero is the graph containing all the heros in the Marvel universe. For each graph, you have the edges as before. Plus, you have a key of the names of characters in a separate ﬁle. To complete this checkpoint, you must do the following. I will not give much detail here. You should be able to ﬁgure out these steps on your own. Take this as a test of how well your programming skills are progressing:

• Read the graph and compute pagerank as before • Read the names corresponding to the graph • Ask the user for a number (k) • Find the people with the highest top k scores and print their name in the order of their score

To complete Checkpoint 4: show your code that ﬁnds and prints the name of the people with the top k pagerank scores in either ﬁle. NOTE: We will not give any help on this last checkpoint. We will also not provide you with correct output. You must convince yourself that you are doing it correctly with proper testing. NOTE 2: Both ﬁles with names are very large, so you better not print the scores in this checkpoint.

5

You'll get 1 file (490.7KB)