So how does clustering work? The first step in clustering is to define the distance between two data points. The most popular way to compute the distance is what's called Euclidean distance. This is the standard way to compute distance that you might have seen before. Suppose we have two data points, i and j. The distance between the two points, which we'll call d_ij, is equal to the square root of the difference between the two points in the first component, squared, plus the difference between the two points in the second component, squared, all the way up to the difference between the two points in the k-th component, squared, where k here is the number of attributes or independent variables.

Let's see how this works by looking at an example. In our MovieLens dataset, we have binary vectors for each movie, classifying that movie into genres. The movie Toy Story is categorized as an animation, comedy, and children's movie. So the data for Toy Story has a 1 in the spot for these three genres and a 0 everywhere else. The movie Batman Forever is categorized as an action, adventure, comedy, and crime movie. So Batman Forever has a 1 in the spot for these four genres and a 0 everywhere else.

So given these two data observations, let's compute the distance between them. The distance d would be equal to the square root of (0-0)^2 + (0-1)^2 + (0-1)^2 + (1-0)^2, et cetera, with one term for each genre. This ends up being equal to the square root of 5.

In addition to Euclidean distance, there are many other popular distance metrics that could be used. One is called Manhattan distance, where the distance is computed as the sum of the absolute values of the differences instead of the sum of squared differences. Another is called maximum coordinate distance, where we only consider the measurement for which the data points deviate the most.

Another important distance that we have to compute for clustering is the distance between clusters, where a cluster is a group of data points.
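To make these distance calculations concrete, here is a minimal Python sketch (the lecture itself does not show code) that computes the Euclidean, Manhattan, and maximum coordinate distances between the Toy Story and Batman Forever genre vectors. The genre list and ordering are simplified assumptions, not the exact MovieLens columns; only the genres named above are set to 1.

```python
import math

# Simplified, assumed genre ordering (not the exact MovieLens column order).
genres = ["Action", "Adventure", "Animation", "Childrens", "Comedy",
          "Crime", "Documentary", "Drama", "Fantasy", "Horror"]

def genre_vector(active):
    """Binary vector with a 1 for each genre the movie belongs to."""
    return [1 if g in active else 0 for g in genres]

toy_story = genre_vector({"Animation", "Comedy", "Childrens"})
batman_forever = genre_vector({"Action", "Adventure", "Comedy", "Crime"})

def euclidean(x, y):
    # Square root of the sum of squared component differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # Sum of the absolute component differences.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def max_coordinate(x, y):
    # Largest single component difference.
    return max(abs(xi - yi) for xi, yi in zip(x, y))

print(euclidean(toy_story, batman_forever))       # sqrt(5) ~= 2.236
print(manhattan(toy_story, batman_forever))       # 5
print(max_coordinate(toy_story, batman_forever))  # 1
```

The two movies differ in five genre components, so the Euclidean distance comes out to the square root of 5, matching the worked example above.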
We just discussed how to compute the distance between two individual points, but how do we compute the distance between groups of points? One way of doing this is by using what's called the minimum distance. This defines the distance between clusters as the distance between the two data points in the clusters that are closest together. For example, we would define the distance between the yellow and red clusters by computing the Euclidean distance between these two points. The other points in the clusters could be really far away, but it doesn't matter if we use minimum distance. The only thing we care about is how close together the closest points are.

Alternatively, we could use maximum distance. This computes the distance between the two clusters as the distance between the two points that are the farthest apart. So for example, we would compute the distance between the yellow and red clusters by looking at these two points. Here, it doesn't matter how close together the other points are. All we care about is how far apart the furthest points are.

The most common distance metric between clusters is called centroid distance, and this is what we'll use. It defines the distance between clusters by computing the centroid of each cluster. The centroid is the point whose value in each component is the average of that component over all data points in the cluster. This takes all data points in each cluster into account and can be thought of as the middle data point. In our example, the centroids of the yellow and red clusters are here, and we would compute the distance between the clusters by computing the Euclidean distance between those two points.

When we compute distances, the result is highly influenced by the scale of the variables. As an example, suppose you're computing the distance between two data points, where one variable is the revenue of a company in thousands of dollars, and another is the age of the company in years.
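As a rough illustration of the three between-cluster distances, here is a short Python sketch. The two small 2-D clusters are made-up stand-ins for the yellow and red clusters shown on the slides.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Invented 2-D points standing in for the yellow and red clusters.
yellow = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
red    = [(5.0, 5.0), (6.0, 5.5), (5.5, 6.5)]

# Minimum distance: closest pair of points, one from each cluster.
min_dist = min(euclidean(p, q) for p in yellow for q in red)

# Maximum distance: farthest pair of points, one from each cluster.
max_dist = max(euclidean(p, q) for p in yellow for q in red)

# Centroid distance: distance between the component-wise averages.
def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

centroid_dist = euclidean(centroid(yellow), centroid(red))

print(min_dist, max_dist, centroid_dist)
```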
The revenue variable would really dominate the distance calculation. The differences between the data points for revenue would be in the thousands, whereas the differences in the age variable would probably be less than 10. To handle this, it's customary to normalize the data first. We can normalize by subtracting the mean of the data and dividing by the standard deviation. We'll see more of this in the homework.

In our movie dataset, all of our genre variables are on the same scale, so we don't have to worry about normalizing. But if we wanted to add a variable, like box office revenue, we would need to normalize so that this variable didn't dominate all of the others.

Now that we've defined how we'll compute the distances, we'll talk about a specific clustering algorithm, hierarchical clustering, in the next video.
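Here is a minimal sketch of the normalization step described above: subtract each variable's mean and divide by its standard deviation, so that a large-scale variable like revenue no longer dominates a small-scale one like age. The revenue and age values are invented for illustration.

```python
import statistics

revenue = [5200.0, 4300.0, 8900.0, 1200.0]  # invented values, in thousands of dollars
age     = [3.0, 12.0, 7.0, 1.0]             # invented values, in years

def normalize(values):
    # Subtract the mean and divide by the standard deviation (z-score).
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

print(normalize(revenue))  # now on a comparable scale to normalized age
print(normalize(age))
```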