Starting from:
$30

$24

Homework #3 Solution

PARALLEL QUERY PROCESSING [25%]






The task is to compute the join query Q(x1, x2, x3) : R1(x1), R2(x2), R3(x3) in parallel using p machines. The query is a cartesian product of three unary relations. Let the sizes

of the relations be N1, N2, N3 respectively.




[10%] Suppose that we compute Q in a single round by distributing the data once (using the HyperCube algorithm). What is the smallest possible load (maximum amount of data each machine receives)? What is the total communication?



[10%] Suppose that we compute Q in two rounds. In the first round we compute the cartesian product of R1, R2, and the second round the cartesian product of the intermediate result with R3. What is the load in each round? What is the total communication across both rounds?



[5%] Which of the above two strategies achieves the best load? Which one the best total communication?









DATA STREAMING [30%]






[15%] In reservoir sampling, we want to produce a uniform sample of size k from a stream of unknown size. The algorithm works by placing the i-th item in the reservoir (of size k) with probability k/i (all the first k items go in the reservoir initially). Show that at any point where we have seen n total items, the probability of each item being in the reservoir is k/n.



[15%] Recall the application of the Misra-Gries algorithm for the case where we want to find the top-k most frequent elements. For any element j, let fj be its actual frequency in the stream, and fˆj its estimated frequency. We showed in class that fj mk fˆj fj, where m is the total length of the stream.



Show that this bound can be improved to fj m k mˆ fˆj fj, where mˆ is the sum of the estimated frequencies fˆj.
















1
UNCERTAIN DATA [30%]






[15%] Consider a tuple-independent probabilistic database with the following rela-tions: R(A, B), S(A, C), T(A, D). Suppose we want to answer the boolean query






Q() : R(x, y), S(x, ‘a‘), T(x, w)




Describe a safe plan for computing the probability of Q if one exists; otherwise, ex-plain why it is not possible to obtain one.




[15%] Consider an inconsistent database, where the integrity constraints are pri-mary keys. The database consists of two relations R(A, B) and S(B, C, D). We want to obtain the consistent answers for the query Q(x) : R(x, y), S(y, z, ‘a‘). Write a query in SQL that computes the consistent answers for Q.












PROVENANCE [15%]






Consider the following instance with two relations R(A), S(A, B), where each tuple is tagged with a unique identifier.




fR(a1) : x1, R(a2) : x2, S(a1, b1) : y1, S(a1, b2) : y2, S(a2, b2) : y3g




For each polynomial below, write a Boolean CQ (without inequalities or constants) having that provenance polynomial.




x1y1 + x1y2 + x2y3



x1y21 + x1y22 + x1y2y3 + x2y2y3 + x2y23



(x1 + x2)(y1 + y2 + y3)









DELIVERABLES







Submit a single PDF document using Canvas (Homework 3). It is strongly suggested to use LATEX to write your solution.

























2

More products