# Clustering _SOLUTION

Overview In this assignment you will explore clustering: hierarchical and point-assignment. You will also experiment with high dimensional data. You will use three data sets for this assignment:

These data sets all have the following format. Each line is a data point. The lines have either 3 or 6 tab separated items. The ﬁrst one is an integer describing the index of the points. The next 2 (or 5 for C3) are thecoordinatesofthedatapoint. C1andC2arein2dimensions,andC3isin5dimensions. C1shouldhave n=20 points, C2 should have n=1004 points, and C3 should have n=1000 points. We will always measure distance with Euclidean distance. As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may lose points if your assignment is difﬁcult to read or hard to follow. Find a sample form in this directory: 1 Hierarchical Clustering (25 points) Therearemanyvariantsofhierarchicalclustering;hereweexplore3. Thekeydifferenceishowyoumeasure the distance d(S1,S2) between two clusters S1 and S2. Single-Link: measures the shortest link d(S1,S2) = min (s1,s2)∈S1×S2ks1 −s2k2. Complete-Link: measures the longest link d(S1,S2) = max (s1,s2)∈S1×S2ks1 −s2k2. Mean-Link: measures the distances to the means. First compute a1 = 1 |S1|Ps∈S1 s and a2 = 1 |S2|Ps∈S2 s thend (S1,S2) = ka1 −a2k2 . A (25 points): Run all hierarchical clustering variants on data set C1.txt until there are k = 4 clusters, and report the results as sets. It may be useful to do this pictorially. Whichvariantdidthebestjob, andwhichwastheeasiesttocompute(thinkifthedatawasmuchlarger)? Explain your answers. 2 Assignment-Based Clustering (50 points) Assignment-based clustering works by assigning every point x ∈ X to the closest cluster centers C. Let φC : X → C be this assignment map so that φC(x) = argminc∈C d(x,c). All points that map to the same cluster center are in the same cluster. CS 6140 Data Mining; Spring 2016 Instructor: Jeff M. Phillips, University of Utah Two good heuristics for these types of cluster are the Gonzalez (Algorithm 9.4.1) and k-Means++ (Algorithm 10.1.2) algorithms. A: (20 points) Run Gonzalez and k-Means++ on data set C2.txt for k = 3. To avoid too much variation in the results, choose c1 as the point with index 1. Report the centers and the subsets (as pictures) for Gonzalez. Report: • the 3-center cost maxx∈X d(x,φC(x)) and • the 3-means costq 1 |X|Px∈X(d(x,φC(x)))2 (Note this has been normalized so easy to compare to 3-center cost) For k-Means++, the algorithm is randomized, so you will need to report the variation in this algorithm. Run it several trials (at least 20) and plot the cumulative density function of the 3-means cost. Also report what fraction of the time the subsets are the same as the result from Gonzalez. B: (20 points) Recall that Lloyd’s algorithm for k-means clustering starts with a set of k centers C and runs as described in Algorithm 10.1.1. • RunLloydsAlgorithmwith C initiallywithpointsindexed{1,2,3}. Reporttheﬁnalsubsetandthe 3-means cost. • Run Lloyds Algorithm with C initially as the output of Gonzalez above. Report the ﬁnal subset and the 3-means cost. • Run Lloyds Algorithm with C initially as the output of each run of k-Means++ above. Plot a cumulative density function of the 3-means cost. Also report the fraction of the trials that the subsets are the same as the input. C: (10 points) Consider a set of points S ⊂Rd and d the Euclidean distance. Prove that arg min p∈RdX x∈S (d(x,p))2 = 1 |S|X x∈S x. Here are some suggested steps to follow towards the proof (note there are also other valid ways to prove this, but, for instance, achieving some of these steps will get you partial credit): 1. First prove the same results for S ∈R1. (a) Expand each term (d(x,p))2 = (x−p)2 = x2 + p2 −2xp. (b) Take the derivative of each term. (c) Add the above terms back together and ﬁnd where the total derivative is 0. 2. Show the results for each dimension can be solved independently (use properties of edge lengths in a right triangle – you may want to just redo the above steps using vector calculus). 3 k-Median Clustering (25 points) The k-median clustering problem on a data set P is to ﬁnd a set of k-centers C = {c1,c2,...,ck} to minimize Cost1(P,C) = 1 |P|Pp∈P d(p,φC(p)). We did not explicitly talk much about this formulation in class, but the techniques to solve it are all typically extensions of approaches we did talk about. This problemwillbemoreopen-ended,andwillaskyoutotryvariousapproachestosolvethisproblem. Wewill use data set C3.txt. A: (20 points) Findasetof 4 centers C = {c1,c2,c3,c4}forthe 4-mediansproblemondataset C3.txt. Reportthesetofcenters,aswellasCost1(P,C). Thecentersshouldbeinthewrite-upyouturnin,butalso in a ﬁle formatted the same was as the input so we can verify the cost you found. That is each line has 1 center with 6 tab separated numbers. The ﬁrst being the index (e.g., 1, 2, 3 or 4), and the next 5 being the 5-dimensional coordinates of that center. Your score will be based on how small a Cost1(P,C) you can ﬁnd. You can get 15 points for reasonable solution. The smallest found score in the class will get all 20 points. Other scores will obtain points in between. Very brieﬂy describe how you found the centers. B: (5 points) Run your algorithm again for the 5-medians problem on dataset C3.txt. Report the set of 5 centers and the Cost1(P,C). You do not need to turn in a ﬁle for these, just write it in your report. 4 BONUS (2 points) Recall that the k-center problem is to ﬁnd a set of k centers C to minimize Cost0(P,C) = max p∈P min c∈C d(p,c). Let C∗ be the optimal choice of k centers for the k-center problem, and let V∗ = Cost0(P,C∗). Prove that the Gonzalez algorithm always ﬁnds a set of k centers C such that Cost0(P,C) ≤ 2V∗.

These data sets all have the following format. Each line is a data point. The lines have either 3 or 6 tab separated items. The ﬁrst one is an integer describing the index of the points. The next 2 (or 5 for C3) are thecoordinatesofthedatapoint. C1andC2arein2dimensions,andC3isin5dimensions. C1shouldhave n=20 points, C2 should have n=1004 points, and C3 should have n=1000 points. We will always measure distance with Euclidean distance. As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may lose points if your assignment is difﬁcult to read or hard to follow. Find a sample form in this directory: 1 Hierarchical Clustering (25 points) Therearemanyvariantsofhierarchicalclustering;hereweexplore3. Thekeydifferenceishowyoumeasure the distance d(S1,S2) between two clusters S1 and S2. Single-Link: measures the shortest link d(S1,S2) = min (s1,s2)∈S1×S2ks1 −s2k2. Complete-Link: measures the longest link d(S1,S2) = max (s1,s2)∈S1×S2ks1 −s2k2. Mean-Link: measures the distances to the means. First compute a1 = 1 |S1|Ps∈S1 s and a2 = 1 |S2|Ps∈S2 s thend (S1,S2) = ka1 −a2k2 . A (25 points): Run all hierarchical clustering variants on data set C1.txt until there are k = 4 clusters, and report the results as sets. It may be useful to do this pictorially. Whichvariantdidthebestjob, andwhichwastheeasiesttocompute(thinkifthedatawasmuchlarger)? Explain your answers. 2 Assignment-Based Clustering (50 points) Assignment-based clustering works by assigning every point x ∈ X to the closest cluster centers C. Let φC : X → C be this assignment map so that φC(x) = argminc∈C d(x,c). All points that map to the same cluster center are in the same cluster. CS 6140 Data Mining; Spring 2016 Instructor: Jeff M. Phillips, University of Utah Two good heuristics for these types of cluster are the Gonzalez (Algorithm 9.4.1) and k-Means++ (Algorithm 10.1.2) algorithms. A: (20 points) Run Gonzalez and k-Means++ on data set C2.txt for k = 3. To avoid too much variation in the results, choose c1 as the point with index 1. Report the centers and the subsets (as pictures) for Gonzalez. Report: • the 3-center cost maxx∈X d(x,φC(x)) and • the 3-means costq 1 |X|Px∈X(d(x,φC(x)))2 (Note this has been normalized so easy to compare to 3-center cost) For k-Means++, the algorithm is randomized, so you will need to report the variation in this algorithm. Run it several trials (at least 20) and plot the cumulative density function of the 3-means cost. Also report what fraction of the time the subsets are the same as the result from Gonzalez. B: (20 points) Recall that Lloyd’s algorithm for k-means clustering starts with a set of k centers C and runs as described in Algorithm 10.1.1. • RunLloydsAlgorithmwith C initiallywithpointsindexed{1,2,3}. Reporttheﬁnalsubsetandthe 3-means cost. • Run Lloyds Algorithm with C initially as the output of Gonzalez above. Report the ﬁnal subset and the 3-means cost. • Run Lloyds Algorithm with C initially as the output of each run of k-Means++ above. Plot a cumulative density function of the 3-means cost. Also report the fraction of the trials that the subsets are the same as the input. C: (10 points) Consider a set of points S ⊂Rd and d the Euclidean distance. Prove that arg min p∈RdX x∈S (d(x,p))2 = 1 |S|X x∈S x. Here are some suggested steps to follow towards the proof (note there are also other valid ways to prove this, but, for instance, achieving some of these steps will get you partial credit): 1. First prove the same results for S ∈R1. (a) Expand each term (d(x,p))2 = (x−p)2 = x2 + p2 −2xp. (b) Take the derivative of each term. (c) Add the above terms back together and ﬁnd where the total derivative is 0. 2. Show the results for each dimension can be solved independently (use properties of edge lengths in a right triangle – you may want to just redo the above steps using vector calculus). 3 k-Median Clustering (25 points) The k-median clustering problem on a data set P is to ﬁnd a set of k-centers C = {c1,c2,...,ck} to minimize Cost1(P,C) = 1 |P|Pp∈P d(p,φC(p)). We did not explicitly talk much about this formulation in class, but the techniques to solve it are all typically extensions of approaches we did talk about. This problemwillbemoreopen-ended,andwillaskyoutotryvariousapproachestosolvethisproblem. Wewill use data set C3.txt. A: (20 points) Findasetof 4 centers C = {c1,c2,c3,c4}forthe 4-mediansproblemondataset C3.txt. Reportthesetofcenters,aswellasCost1(P,C). Thecentersshouldbeinthewrite-upyouturnin,butalso in a ﬁle formatted the same was as the input so we can verify the cost you found. That is each line has 1 center with 6 tab separated numbers. The ﬁrst being the index (e.g., 1, 2, 3 or 4), and the next 5 being the 5-dimensional coordinates of that center. Your score will be based on how small a Cost1(P,C) you can ﬁnd. You can get 15 points for reasonable solution. The smallest found score in the class will get all 20 points. Other scores will obtain points in between. Very brieﬂy describe how you found the centers. B: (5 points) Run your algorithm again for the 5-medians problem on dataset C3.txt. Report the set of 5 centers and the Cost1(P,C). You do not need to turn in a ﬁle for these, just write it in your report. 4 BONUS (2 points) Recall that the k-center problem is to ﬁnd a set of k centers C to minimize Cost0(P,C) = max p∈P min c∈C d(p,c). Let C∗ be the optimal choice of k centers for the k-center problem, and let V∗ = Cost0(P,C∗). Prove that the Gonzalez algorithm always ﬁnds a set of k centers C such that Cost0(P,C) ≤ 2V∗.

You'll get 1 file (1.2MB)