# Frequent Items

Overview

In this assignment you will explore ﬁnding frequent items in datasets,with emphasis on techniques designed to work at enormous scale. You will use two data sets for this assignment:

These data sets each describe a set of m = 10,000 characters. The order of the ﬁle represents the order of the stream. As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may lose points if your assignment is difﬁcult to read or hard to follow. Find a sample form in this directory:

1 Streaming Algorithms A (20 points): Run the Misra-Gries Algorithm (see L11.3.1) with (k −1) = 9 counters on streams S1 and S2. Report the output of the counters at the end of the stream. In each stream, from just the counters, report how many objects might occur more than 20% of the time, and which must occur more than 20% of the time. B (20 points): BuildaCount-MinSketch(seeL12.1.1)with k = 10 countersusing t = 5 hashfunctions. Run it on streams S1 and S2. For both streams, report the estimated counts for objects a, b, and c. Just from the output of the sketch, which of these objects, with probably 1−δ = 31/32, might occur more than 20% of the time? C (5 points): How would your implementation of these algorithms need to change (to answer the same questions) if each object of the stream was a “word” seen on Twitter, and the stream contained all tweets concatenated together? D (5 points): Describe one advantage of the Count-Min Sketch over the Misra-Gries Algorithm. 2 BONUS The exact heavy-hitter problem is as follows: return all objects that occur more than 10% of the time. It cannot return any false positive for any false negatives. In the streaming setting,this requires Ω(min{m,n}) space if there are n objects that can occur and the stream is of length m. A: (1 point) A 2-Pass streaming algorithm is one that is able to read all of the data in-order exactly twice, but still only has limited memory. Describe a small space algorithm to solve the exactheavyhitterproblem, i.e., with ε = 0, (say for φ = 10% threshold).

In this assignment you will explore ﬁnding frequent items in datasets,with emphasis on techniques designed to work at enormous scale. You will use two data sets for this assignment:

These data sets each describe a set of m = 10,000 characters. The order of the ﬁle represents the order of the stream. As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may lose points if your assignment is difﬁcult to read or hard to follow. Find a sample form in this directory:

1 Streaming Algorithms A (20 points): Run the Misra-Gries Algorithm (see L11.3.1) with (k −1) = 9 counters on streams S1 and S2. Report the output of the counters at the end of the stream. In each stream, from just the counters, report how many objects might occur more than 20% of the time, and which must occur more than 20% of the time. B (20 points): BuildaCount-MinSketch(seeL12.1.1)with k = 10 countersusing t = 5 hashfunctions. Run it on streams S1 and S2. For both streams, report the estimated counts for objects a, b, and c. Just from the output of the sketch, which of these objects, with probably 1−δ = 31/32, might occur more than 20% of the time? C (5 points): How would your implementation of these algorithms need to change (to answer the same questions) if each object of the stream was a “word” seen on Twitter, and the stream contained all tweets concatenated together? D (5 points): Describe one advantage of the Count-Min Sketch over the Misra-Gries Algorithm. 2 BONUS The exact heavy-hitter problem is as follows: return all objects that occur more than 10% of the time. It cannot return any false positive for any false negatives. In the streaming setting,this requires Ω(min{m,n}) space if there are n objects that can occur and the stream is of length m. A: (1 point) A 2-Pass streaming algorithm is one that is able to read all of the data in-order exactly twice, but still only has limited memory. Describe a small space algorithm to solve the exactheavyhitterproblem, i.e., with ε = 0, (say for φ = 10% threshold).

You'll get 1 file (303.6KB)