The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: I want to pick up with a little bit of overlap just to remind people where we were. We had been looking at clustering, and we looked at a fairly simple example of using agglomerative hierarchical clustering to cluster cities, based upon how far apart they were from each other. So, essentially, using this distance matrix, we could do a clustering that would reflect how close cities were to one another. And we went through an agglomerative clustering, and we saw that we would get a different answer, depending upon which linkage criterion we used. This is an important issue, because as one is using clustering, one has to be aware that the result is related to these things, and if you choose the wrong linkage criterion, you might get an answer other than the most useful one.

All right. I next went on and said, well, this is pretty easy, because when we're comparing the distance between two cities, or the two features, we just subtract one distance from the other and we get a number. It's very straightforward. I then raised the question, suppose when we looked at cities, we looked at a more complicated way of looking at them than airline distance. So the first question, I said, well, suppose in addition to the distance by air, we add the distance by road, or the average temperature. Pick what you will. What do we do?

Well, the answer was we start by generalizing from a feature being a single number to the notion of a feature vector, where the features used to describe the city are now represented by a vector, typically of numbers. If the vectors are all in the same physical units, we could easily imagine how we might compare two vectors.
So we might, for example, look at the Euclidean distance between the two just by, say, subtracting one vector from the other. However, if we think about that, it can be pretty misleading, because, for example, when we look at a city, one element of the vector represents the distance in miles from another city, or in fact, in this case, the distance in miles to each city. And another represents temperature. Well, it's kind of funny to compare distance, which might be thousands of miles, with temperature, which might be 5 degrees. A 5 degree difference in average temperature could be significant. Certainly a 20 degree difference in temperature is very significant, but a 20 mile difference in location might not be very significant. And so to equally weight a 20 degree temperature difference and a 20 mile distance difference might give us a very peculiar answer. And so we have to think about the question of, how are we going to scale the elements of the vectors?

Even if we're in the same units, say inches, it can be an issue. So let's look at this example. Here I've got, on the left, before scaling, something which we can say is in inches, height and width. This is not from a person, but you could imagine if you were trying to cluster people, and you measured their height in inches and their width in inches, maybe you don't want to treat them equally. Right? But there's a lot more variance in height than in width, or maybe there is and maybe there isn't. So here on the left we don't have any scaling, and we see a very natural clustering. On the other hand, we notice on the y-axis the values range from not too far from 0 to not too far from 1, whereas on the x-axis, the dynamic range is much less, not too far from 0 to not too far from 1/2. So we have twice the dynamic range here than we have here. Therefore, not surprisingly, when we end up doing the clustering, width plays a very important role. And we end up clustering it this way, dividing it along here.
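To make that point concrete, here is a small sketch, with invented numbers, showing how an unscaled Euclidean distance between (distance-in-miles, average-temperature) feature vectors is dominated almost entirely by the mileage component; the helper function is illustrative, not part of the course handout.

from math import sqrt

def euclidean(v1, v2):
    # straight-line distance between two equal-length feature vectors
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Feature vectors: (distance in miles to a reference city, average temperature in degrees F).
# The values are invented purely for illustration.
cityA = [1000.0, 50.0]
cityB = [1020.0, 70.0]   # 20 miles farther away, 20 degrees warmer
cityC = [1500.0, 50.0]   # 500 miles farther away, same temperature

print(euclidean(cityA, cityB))  # about 28.3: the 20-degree difference barely registers
print(euclidean(cityA, cityC))  # 500.0: mileage swamps everything else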
On the other hand, if I take exactly the same data and scale it, and now the x-axis runs from 0 to 1/2 and the y-axis, roughly again, from 0 to 1, we see that suddenly, when we look at it geometrically, we end up getting a very different looking clustering. What's the moral? The moral is you have to think hard about how to choose your features and about how to scale your features, because it can have a dramatic influence on your answer. We'll see some real life examples of this shortly. But these are all important things to think about, and they all, in some sense, tie up into the same major point. Whenever you're doing any kind of learning, including clustering, feature selection and scaling are critical. It is where most of the thinking ends up going. And then the rest gets to be fairly mechanical.

How do we decide what features to use and how to scale them? We do that using domain knowledge. So we actually have to think about the objects that we're trying to learn about and what the objective of the learning process is.

So continuing, how do we do the scaling? Most of the time, it's done using some variant of what's called the Minkowski metric. It's not nearly as imposing as it looks. So the distance between two vectors, X1 and X2, and then we use p to talk about, essentially, the degree we're going to be using. So we take the absolute difference between each element of X1 and X2, raise it to the p-th power, sum them, and then raise the sum to the power 1 over p. Not very complicated. So let's say p is 2. That's the one you people are most familiar with. Effectively, all we're doing is getting the Euclidean distance. That's what we looked at when we looked at the mean squared distance between two things, between our errors and our measured data, between our measured data and our predicted data. We used the mean square error. That's essentially a Minkowski distance with p equal to 2.
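In symbols, the metric just described is dist(X1, X2, p) = (sum over i of |X1[i] - X2[i]|^p)^(1/p). A minimal sketch of it in Python follows; the function name is illustrative, not necessarily the one used in the course code.

def minkowskiDist(v1, v2, p):
    """Minkowski distance between two equal-length sequences of numbers.
       p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance."""
    dist = 0.0
    for i in range(len(v1)):
        dist += abs(v1[i] - v2[i]) ** p
    return dist ** (1.0 / p)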
That's probably the most commonly used, but an almost equally common choice sets p equal to 1, and that's something called the Manhattan distance. I suspect at least some of you have spent time walking around Manhattan, a small but densely populated island in New York. And midtown Manhattan has the feature that it's laid out in a grid. So what you have is a grid, and you have the avenues running north-south and the streets running east-west. And if you want to walk from, say, here to here, or drive from here to here, you cannot take the diagonal, because there are a bunch of buildings in the way. And so you have to move either left or right, or up, or down. That's the Manhattan distance between two points. This is used, in fact, for a lot of problems. Typically, when somebody is comparing the distance between two genes, for example, they use a Manhattan metric rather than a Euclidean metric to say how similar two things are. I just wanted to show that, because it is something that you will run across in the literature when you read about these kinds of things.

All right. So far, we've talked about issues where things are comparable. And we've been doing that by representing each element of the feature vector as a floating point number. So we can run a formula like that by subtracting one from the other. But we often, in fact, have to deal with nominal categories, things that have names rather than numbers. So for clustering people, maybe we care about eye color: blue, brown, gray, green. Hair color. Well, how do you compare blue to green? Do you subtract one from the other? Kind of hard to do. What does it mean to subtract green from blue? Well, I guess we could talk about it in the frequency domain of light. Typically, what we have to do in that case is, we convert them to a number and then have some way to relate the numbers. Again, this is a place where domain knowledge is critical.
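The grid intuition is easy to check with the minkowskiDist sketch above (again, a hypothetical helper, not the handout code): walking 3 blocks east and 4 blocks north is 7 blocks on foot, but only 5 as the crow flies.

print(minkowskiDist([0, 0], [3, 4], 1))   # Manhattan distance: 7.0
print(minkowskiDist([0, 0], [3, 4], 2))   # Euclidean distance: 5.0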
So, for example, we might convert blue to 0, green to 0.5, and brown to 1, thus indicating that we think blue eyes are closer to green eyes than they are to brown eyes. I don't know why we think that, but maybe we think that. Red hair is closer to blonde hair than it is to black hair. I don't know. These are the sorts of things that are not mathematical questions, typically, but judgments that people have to make.

Once we've converted things to numbers, we then have to go back to our old friend of scaling, which is often called normalization. Very often we try and contrive to have every feature range between 0 and 1, for example, so that everything is normalized to the same dynamic range, and then we can compare. Is that the right thing to do? Not necessarily, because you might consider some features more important than others and want to give them a greater weight. And, again, that's something we'll come back to and look at.

All this is a bit abstract. I now want to look at an example. Let's look at the example of clustering mammals. There are, essentially, an unbounded number of features you could use: size at birth, gestation period, lifespan, length of tail, speed, eating habits. You name it. The choice of features and weighting will, of course, have an enormous impact on what clusters you get. If you choose size, humans might appear in one cluster. If you choose eating habits, they might appear in another. How should you choose which features you want? You have to begin by thinking about the reason you're doing the clustering in the first place. What is it you're trying to learn about the mammals?

As an example, I'm going to choose the objective of eating habits. I want to cluster mammals somehow based upon what they eat. But I want to do that, and here's a very important thing about what we often see in learning, without any direct information about what they eat.
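Before getting into the mammal example in detail, here is a small sketch of the two ideas just mentioned: mapping a nominal feature to numbers that encode a judgment about similarity, and rescaling a numeric feature so it spans 0 to 1. The mapping values and function name are illustrative, not taken from the course code.

# A judgment call: blue is "closer" to green than to brown (values are illustrative only).
eyeColorToNum = {'blue': 0.0, 'green': 0.5, 'brown': 1.0}

def minMaxNormalize(values):
    """Rescale a list of numbers so it runs from 0 to 1 (simple min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # avoid dividing by zero for a constant feature
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(eyeColorToNum['blue'], eyeColorToNum['green'])
print(minMaxNormalize([2, 4, 6, 10]))   # [0.0, 0.25, 0.5, 1.0]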
Typically, when we're using machine learning, we're trying to learn about something for which we have limited or no data. Remember, when we talked about learning, I talked about learning which was supervised, in which we had some labeled data, and unsupervised, in which, essentially, we don't have any labels. So let's say we don't have any labels about what mammals eat, but we do know a lot about the mammals themselves. And, in fact, the hypothesis I'm going to start with here is that you can infer people's, or creatures', eating habits from their dental records, or their dentition. Because over time we have evolved, all creatures have evolved, to have teeth that are related to what they eat, as we can see. So I managed to procure a database of dentition for various mammals.

There's the laser pointer. So what I've got here is the number of different kinds of teeth. So the right top incisors, the right bottom incisors, molars, pre-molars, et cetera. Don't worry if you don't know about teeth very much. I don't know very much. And then for each animal, I have the number of each kind of tooth. Actually, I don't have it for this particular mammal, but these two I do. I don't even remember what they are. They're cute.

All right. So I've got that database, and now I want to try and see what happens when I cluster them. The code to do this is not very complicated, but I should make a confession about it. Last night, I won't say I learned it, I was reminded of a lesson that I've often preached in 6.00, which is that it's not good to get your programming done at the last minute. So as I was debugging this code at 2:00 and 3:00 in the morning today, I was realizing how inefficient I am at debugging at that hour. Maybe for you guys that's the shank of the day. For me, it's too late. I think it all works, but I was certainly not at my best as I was debugging last night. All right.
But at the moment, I don't want you to spend time working on the code itself. I would like you to think a little bit about the overall class structure of the code, which I've got on the first page of the handout.

So at the bottom of my hierarchy, I've got something called a point, and that's an abstraction of the things to be clustered. And I've done it in quite a generalized way, because, as you're going to see, the code we're looking at today I'm going to use not only for clustering mammals but for clustering all sorts of other things as well. So I decided to take the trouble of building up a set of classes that would be useful. And in this class, I can have the name of a point, its original attributes, that is, its original feature vector, an unscaled feature vector, and then, whether or not I choose to normalize it, I might have normalized features as well. Again, I don't want you to worry too much about the details of the code. And then I have a distance metric, and I'm just, for the moment, using simple Euclidean distance.

The next element in my hierarchy, not yet a hierarchy, it's still flat, is a cluster. And so what a cluster is, you can think of it as, at some abstract level, just a set of points, the points that are in the cluster. But I've got some other operations on it that will be useful. I can compute the distance between two clusters, and as you'll see, I have single linkage, max linkage, and average linkage, the three I talked about last week. And also this notion of a centroid. We'll come back to that when we get to k-means clustering. We don't need to worry right now about what that is.

Then I'm going to have a cluster set. That's another useful data abstraction. And that's what you might guess from its name, just a set of clusters. The most interesting operation there is merge. As you saw, when we looked at hierarchical clustering last week, the key step there is merging two clusters.
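To make that structure concrete, here is a minimal sketch of what a Point and a Cluster along these lines might look like. The attribute and method names are illustrative; the actual handout code has more machinery, such as optional normalization, several linkage criteria, centroids, and a ClusterSet with merge and findClosest.

from math import sqrt

class Point(object):
    """An abstraction of a thing to be clustered: a name plus a feature vector."""
    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs            # the feature vector, a list of floats
    def distance(self, other):
        # simple Euclidean distance between the two feature vectors
        return sqrt(sum((a - b) ** 2 for a, b in zip(self.attrs, other.attrs)))

class Cluster(object):
    """A set of Points, with one example linkage criterion between clusters."""
    def __init__(self, points):
        self.points = points
    def singleLinkageDist(self, other):
        # distance between the two closest members of the two clusters
        return min(p1.distance(p2) for p1 in self.points for p2 in other.points)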
And in doing that, I'm going to have a function called Find Closest, which, given a metric and a cluster, finds the cluster that is most similar to that, to self, because, as you will again recall from hierarchical clustering, what I merge at each step is the two most similar clusters. And then there are some details about how it works, which, again, we don't need to worry about at the moment. And then I'm going to have a subclass of point called Mammal, in which I will represent each mammal by the dentition, as we've looked at before. Then, pretty simply, we can do a bunch of things with it.

Before we look at the other details of the code, I want to now run it and see what we get. So I'm just going to use hierarchical clustering now to cluster the mammals based upon this feature vector, which will be a list of numbers showing how many of each kind of tooth the mammals have. Let's see what we get.

So it's doing the merging. So we can see the first step: it merged beavers with groundhogs, and it merged grey squirrels with porcupines, and wolves with bears. Various other kinds of things, like jaguars and cougars, were a lot alike. Eventually, it starts doing more complicated merges. It merges a cluster containing only the river otter with one containing a marten and a wolverine, beavers and groundhogs with squirrels and porcupines, et cetera. And at the end, I had it stop with two clusters. It came up with these clusters.

Now we can look at these clusters and say, all right, what do we think? Have we learned anything interesting? Do we see anything in any of these, do we think it makes sense? Remember, our goal was to cluster mammals based upon what they might eat. And we can ask, do we think this corresponds to that? No. All right. Who, somebody said, now, why no? Go ahead.

AUDIENCE: We've got, like, a deer, which doesn't eat similar things as a dog.
And we've got one kind of bat in the top cluster and a different kind of bat in the bottom cluster. Seems like they would be even closer together.

PROFESSOR: Well, sorry. Yeah. A deer doesn't eat what a dog eats, and for that matter, we have humans here, and while some humans are by choice vegetarians, genetically, humans are essentially carnivores. We know that. We eat meat. And here we are with a bunch of herbivores, typically. Things are strange. By the way, bats might end up being in different ones, because some bats eat fruit and other bats eat insects, but who knows? So I'm not very happy. Why do you think we got this clustering that maybe isn't helping us very much?

Well, let's go look at what we did here. Let's look at test 0. So I said I wanted two clusters. I don't want it to print all the steps along the way. I'm going to print the history at the end. And scaling is identity.

Well, let's go back and look at some of the data here. What we can see is, or maybe we can't see too quickly, looking at all this, is that some kinds of teeth have a relatively small range. Other kinds of teeth have a big range. And so, at the moment, we're not doing any normalization, and maybe what we're doing is getting something distorted, where we're only looking at a certain kind of tooth because it has a larger dynamic range.

And in fact, if we look at the code, we can go back up, and let's look at Build Mammal Points and Read Mammal Data. So Build Mammal Points calls Read Mammal Data, and then builds the points. So Read Mammal Data is the interesting piece. And what we can see here is, as we read it in, this is just simply reading things in, ignoring comments, keeping track of things, and then we come down here, I might do some scaling. So Point.Scale feature is using the scaling argument. Where's that coming from?
If we look at Mammal Teeth, here from the mammal class, we see that there are two ways to scale it: identity, where we just multiply every element in the vector by 1, and that doesn't change anything, or what I've called 1 over max. And here, I've looked at the maximum number of each kind of tooth, and I'm dividing 1 by that. So here we could have up to three of those. Here we could have four of those. We could have six of this kind of tooth, whatever it is. And so we can see, by dividing by the max, I'm now putting all of the different kinds of teeth on the same scale. I'm normalizing. And now we'll see, does that make a difference? Well, since we're dividing by 6 here and 3 here, it certainly could make a difference. It's a significant scaling factor, 2x.

So let's go and change the code, or change the test. And let's look at Test 0, that's 0, not "O", with scale set to 1 over max. You'll notice, by the way, that rather than using some obscure code, like scale equals 12, I use strings so I remember what they are. It's, I think, a pretty useful programming trick. Whoops. Did I use the wrong name for this? Should be scaling? So off it's going.

Now we get a different set of things, and as far as I know, once we've scaled things, we get what I think is a much more sensible pair, where I think what we essentially have is the herbivores down here and the carnivores up here.

OK. I don't care how much you know about teeth. The point is scaling can really matter. You have to look at it, and you have to think about what you're doing. And the interesting thing here is that without any direct evidence about what mammals eat, we are able to use machine learning, clustering in this case, to infer a new fact: that we have some mammals that are similar in what they eat, and some other mammals that are also similar, some groups. Now I can't infer from this herbivores versus carnivores, because I didn't have any labels to start with.
But what I can infer is that, whatever they eat, there's something similar about these animals, and something similar about these animals. And there's a difference between the groups in C1 and the groups in C0. I can then go off and look at some points in each of these and then try and figure out how to label them later.

OK, let's look at a different data set, a far more interesting one, a richer one. Now, let's not look at that version of it. That's too hard to read. Let's look at the Excel spreadsheet. So this is a database I found online of every county in the United States, and a bunch of features about each county. So for each county in the United States, we have its name. The first part of the name is the state it's in. The second part of the name is the name of the county, and then a bunch of things, like the average value of homes, how much poverty, its population density, its population change, how many people are over 65, et cetera. So the thing I want you to notice, of course, is that while everything is a number, the scales are very different. There's a big difference between the percent of something, which will go between 0 and 100, and the population density, which ranges over a much larger dynamic range. So we can immediately suspect that scaling is going to be an issue here.

So we now have a bunch of code that we can use that I've written to process this. It uses the same clusters that we have here, except I've added a kind of Point called the County. It looks very different from a mammal, but the good news is I got to reuse a lot of my code. Now let's run a test. We'll go down here to Test 3, and we'll see whether we can do hierarchical clustering of the counties. Whoops. Test 3 wants the name of what we're doing. So we'll give it the name. It's Counties.Text. I just exported the spreadsheet as a text file.

Well, we could wait a while for this, but I'm not going to. Let's think about what we know about hierarchical clustering and how long this is likely to take.
I'll give you a hint. There are approximately 3,100 counties in the United States. I'll bet none of you could have guessed that number. How many comparisons do we have to do to find the two counties that are most similar to each other? Comparing each county with every other county, how many comparisons is that going to be? Yeah.

AUDIENCE: It's 3,100 choose 2.

PROFESSOR: Right. So that will be on the order of 3,100 squared. Thank you. And that's just the first step in the clustering. To perform the next merge, we'll have to do it again. So in fact, as we saw last time, it's going to be a very long and tedious process, and one I'm not going to wait for. So I'm going to interrupt, and we're going to look at a smaller example.

Here I've got only the counties in New England, a much smaller number than 3,100, and I'm going to cluster them using the exact same clustering code we used for the mammals. It's just that the points are now counties instead of mammals. And we got two clusters. Middlesex County in Massachusetts happens to be the county in which MIT is located. And all the others, well, you know, MIT is a pretty distinctive place. Maybe that's what did it. I don't quite think so. Has someone got a hypothesis about why we got this rather strange clustering? Is it because Middlesex contains both MIT and Harvard?

This really surprised me, by the way, when I first ran it. I said, how can this be? So I went and I started looking at the data, and what I found is that Middlesex County has about 600,000 more people than any other county in New England. Who knew? I would have guessed Suffolk, where Boston is, was the biggest county. But, in fact, Middlesex is enormous relative to every other county in New England. And it turns out that difference of 600,000, when I didn't scale things, just swamped everything else.
And so all I'm really getting here is a clustering that depends on the population. Middlesex is big relative to everything else and, therefore, that's what I get. And it ignores things like education level and housing prices, and all those other things, because the differences are small relative to 600,000.

Well, let's turn scaling on. To do that, I want to show you how I do this scaling. I did not, given the number of features and the number of counties, do what I did for mammals and count them by hand to see what the maximum was. I decided it would be a lot faster, even at 2:00 in the morning, to write code to do it. So I've got some code here. I've got Build County Points, just like Build Mammal Points, and Read County Data, like Read Mammal Data. But the difference here is that, along the way, as I'm reading in each county, I'm keeping track of the maximum for each feature. And then I'm just going to do the scaling automatically. So it's exactly the one over max scaling I did for mammals' teeth, and I'm going to do it for counties, but I've just written some code to automate that process, because I knew I would never be able to count them.

All right, so now let's see what happens if we run it that way. Test 3, New England, and Scale equals True. I'm either scaling it or not, is the way I wrote this one. And with the scaling on, again, I get a very different set of clusters. What have we got? Where's Middlesex? It's in one of these two clusters. Oh, here it is. It's in C0. But it's with Fairfield, Connecticut, and Hartford, Connecticut, and Providence, Rhode Island. It's a different answer. Is it a better answer? It's not a meaningful question, right?
It depends what I'm trying to infer, what we hope to learn from the clustering, and that's a question we're going to come back to on Tuesday in some detail with the counties, and look at how, by using different kinds of scaling or different kinds of features, we can learn different things about the counties in this country.

Before I do that, however, I want to move away from New England. Remember, we're focusing on New England because it took too long to do hierarchical clustering of 3,100 counties. But that's what I want to do. It's no good to just say, I'm sorry, it took too long, I give up. Well, the good news is there are other clustering mechanisms that are much more efficient. We'll later see they, too, have their own faults. But we're going to look at k-means clustering, which has the big advantage of being fast enough that we can run it on very big data sets. In fact, it is roughly linear in the number of counties. And as we've seen before, when n gets very large, anything that's worse than linear is likely to be ineffective.

So let's think about how k-means works. Step one is you choose k. k is the total number of clusters you want to have when you're done. So you start by saying, I want to take the counties and split them into k clusters: 2 clusters, 20 clusters, 100 clusters, 1,000 clusters. You have to choose k in the beginning. And that is one of the issues that you have with k-means clustering: how do you choose k? We can talk about that later.

Once I've chosen k, step two is I choose k points as initial centroids. You may remember earlier today we saw this centroid method in class Cluster. So what's a centroid? You've got a cluster, and in the cluster, you've got a bunch of points scattered around. The centroid you can think of as, quote, "the average point," the center of the cluster. The centroid need not be any of the points in the cluster. So, again, you need some metric, but let's say we're using Euclidean. It's easy to see on the board.
The centroid is kind of there. Now let's assume that we're going to start by choosing k points from the initial set and labeling each of them as a centroid. We often, in fact, quite typically, choose these at random. So we now have k randomly chosen points, each of which we're going to call a centroid.

The next step is to assign each point to the nearest centroid. So we've got k centroids. We usually choose a small k, say 50. And now we have to compare each of the 3,100 counties to each of the 50 centroids, and put each one in the closest one. So it's 50 times 3,100, which is a lot smaller number than 3,100 squared. So now I've got a clustering. It's kind of strange, because what it looks like depends on this random choice. So there's no reason to expect that the initial assignment will give me anything very useful.

Step (4) is, for each of the k clusters, choose a new centroid. Now remember, I just chose k centroids at random. Now I actually have a cluster with a bunch of points in it, so I could, for example, take the average of those and compute a centroid. And I can either take the average, or I can take the point nearest the average. It doesn't much matter.

And then step (5) is one we've looked at before: assign each point to the nearest centroid. So now I'm going to get a new clustering. And then, (6) is repeat (4) and (5) until the change is small. So each time I do step (5), I can keep track of how many points I've moved from one cluster to another. Or each time I do step (4), I can say, how much have I moved the centroids? Each of those gives me a measure of how much change the new iteration has produced. When I get to the point where the iterations are not making much of a change, and we'll see what we might mean by that, we stop and say, OK, we now have a good clustering.
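Here is a compact sketch of the loop just described: pick k points at random as centroids, assign every point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until the centroids stop moving much. It works on plain lists of feature vectors and is an illustration of the algorithm, not the code we'll look at on Tuesday; all names and parameters are assumptions.

import random
from math import sqrt

def euclidean(v1, v2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def kmeans(points, k, minChange=1e-6, maxIters=100):
    centroids = random.sample(points, k)              # step 2: k random points as centroids
    for _ in range(maxIters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # steps 3 and 5: assign to nearest centroid
            i = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[i].append(p)
        newCentroids = [mean(c) if c else centroids[i]    # step 4: recompute each centroid
                        for i, c in enumerate(clusters)]
        change = max(euclidean(c1, c2) for c1, c2 in zip(centroids, newCentroids))
        centroids = newCentroids
        if change < minChange:                        # step 6: stop once centroids settle
            break
    return clusters, centroids

# Example usage (illustrative): two obvious groups of 2-D points, k = 2 usually separates
# them, though k-means can get stuck in a poor local solution depending on the random start.
# clusters, centroids = kmeans([[0, 0.1], [0.2, 0], [0.1, 0.2],
#                               [5, 5.1], [5.2, 4.9], [4.9, 5.2]], 2)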
So if we think of the complexity, each iteration is order k times n, where k is the number of clusters and n is the number of points. And then we do that step for some number of iterations. So if the number of iterations is small, it will converge quite quickly. And as we'll see, typically for k-means, we don't need a lot of iterations to get an answer. In particular, the number of iterations is typically not proportional to n, which is very important.

All right. Tuesday, we'll go over the code for k-means clustering, and then have some fun playing with counties and see what we can learn about where we live. All right. Thanks a lot.