Homework 6 Internet (Wikipedia) Domain Analysis solution


It makes substantial use of the code from Lab 8, the code from Lab 9, and dictionaries. Reusing and repurposing of existing, working code, as you will do here, is a common practice in software development.

Your code will include steps to crawl the pages, analyze the links, and analyze the contents of the text on the pages. For your own experiments you are welcome to crawl wherever you like, but the purposes of our grading, we are using pages that have been scraped and cleaned from Wikipedia. Our format does not exactly match the original Wikipedia format. Modified Crawling You must write a Python program that takes an initial URL (call it init_url), and an integer for the number of pages to find (call it N). These values should be provided to the program by the user via raw_input; the program should ask the user for init_url first and then ask for N. The program should gather URLs by following links in the order they are reached (found) on the pages, starting from init_url. At each step, the program should open the next new page on the list, find all URLs on the page, add the new ones to the end of the list, and continue. The program should stop gathering new URLs when either there are no more new urls (not previously seen) on the list, or when the length of the list has reached is at least N. The statement “at least N” means that the reading of URLs should not stop in the middle of the page when N is reached, but instead should complete the reading of a page and then stop. For example, if N==50 and the initial page has 15 URLs, the second page (the first URL on the initial page) has 12 new URLs, not previously seen, the third has 10, the fourth has 16, and the fifth has 21, your program should stop after reading 4 pages because its list will be length 53 at that point. The fifth URL will be on the list, but it will not be opened and the links on its page will NOT be added to the list. This crawling is similar to what you did for Lab 8, but not exactly the same because of the required ordering. To help you, a new version of the URL parser called parse_url_new.py has been provided and you MUST make use of it (see hw6_files.zip). The function parse_url_new.get_links returns the URLs found on a page as a list in the order the links are found (the old version returned a set, which by definition has no order). However, duplicate links can and will appear in this list. Suggestions for approaching this crawling problem are given below. Results and Output Once the URLs have been gathered, the results and output from the program should begin. In particular, you must complete the following: 1. Output all URLs found, in the order that they are first found during crawling. See the example output for the format. 2. Output the number of unique pages among those returned (53 in the above example) found that have the substring “nottingham” somewhere within the link — in other words, by analyzing the link it appears that the page might be about Nottingham. Also output the number of pages that have the substring “london”. 3. If you look at the links we’ve modified from Wikipedia you will notice that the month and year in which the link was last modified is embedded in the URL. For example, http://csci1100.dyndns-wiki.com/ver1/201303/Ada_Lovelace was last modified in March 2013. The date will be in the form of a four digit year followed by a two digit month. The date will always immediately follow .com/verK/, where K is a single digit. There will be no formatting errors. Your job in this third part is to find the number of pages that have been last updated on each date. You will need to use a dictionary to solve this problem where the date is the key and the value is the total count of pages updated on that date. You cannot make any assumptions for the distribution of dates; for example, your code should not assume that all links were updated in the year 2013 or only during the last 4 months. Your solution should handle all month and year date combinations of the form YYYYMM, where YYYY is a positive 4-digit year, and MM is a positive 2-digit month. You do not need to check if the date is in the future. You should output (a) the earliest year and month, together with its associated count, (b) the latest year and month, together with its associated count, and (c) the maximum count and all year/month combinations that have that maximum (one per line in the case of ties). (Note that if you run your program outside of our Wikipedia test set, this part of the assignment will not produce meaningful results, and it may crash. To avoid the latter, we suggest adding some error checks on the format of the URL.) 4. Among the first 25 (or fewer, if you have crawled fewer) pages found and stored in your ordered list during crawling, find and output in lexicographical order the word or words that appear on the most different pages (not the word that appears most often). These words should not appear in ignore.txt — we have a new version of this as well — and should each be length at least 4. This requires the solution to Checkpoint 2 of Lab 9. We have provided this to you as part of the zip file distributed with this handout, but you may use your own solution as well. As an intermediate result for this part of the assignment, output the URL of each page and the number of words on the page that have length at least 4 and are not in ignore.txt. For simplicity, if a word is repeated, allow it to be counted as many times as it appears. Unfortunately, we had to restrict our attention to no more than 25 pages because of the time required to access each page. Note that more realistic questions could be asked here about all the pages we have found, but the submission and testing are too involved. For the latter two problems you must make use of dictionaries. Test Cases and Output The following three test cases will be used. Note that they only differ in the version embedded in the URL and the value of N. Test 1: http://csci1100.dyndns-wiki.com/ver1/201303/Ada_Lovelace 10 Test 2 http://csci1100.dyndns-wiki.com/ver2/201303/Ada_Lovelace 20 Test 3: http://csci1100.dyndns-wiki.com/ver2/201303/Ada_Lovelace 50 We have provided our output as part of the zip file that you must download from the Piazza site. Please format your output to match ours as closely as you can. Of course you are welcome to use other test sets and play with and crawl as far as you like, just be forewarned that the process is slow! Suggestions on Crawling Here are suggestions (not requirements) for approaching the crawling part of the problem. As mentioned above, you will need to adapt your code from Lab 8: • One choice for the crawl function is to maintain three data structures, two being the all_links set and the unvisited list, as before, and the third being a separate list that stores the links (URLs) in the order they are found. In doing so, whenever a new link is found it should be stored in the unvisited list, the all_links set, and the ordered_links list. (Note that the latter two should be the same length.) The crawl function should return ordered_links. If you use this approach, then when you are taking urls off of the unvisited list, call the method pop(0). • A second choice is to only maintain the all_links set and the ordered list of links. Then, add an integer index variable that stores the index of the next URL in the ordered list to test. This ordered list is what should be returned by the crawl function. • A third choice is to keep just the list of unique links and keep an index of the next link (URL) to open.