1 00:00:00,790 --> 00:00:03,130 The following content is provided under a Creative 2 00:00:03,130 --> 00:00:04,550 Commons license. 3 00:00:04,550 --> 00:00:06,760 Your support will help MIT OpenCourseWare 4 00:00:06,760 --> 00:00:10,850 continue to offer high-quality educational resources for free. 5 00:00:10,850 --> 00:00:13,390 To make a donation, or to view additional materials 6 00:00:13,390 --> 00:00:17,320 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,320 --> 00:00:18,570 at ocw.mit.edu. 8 00:00:21,462 --> 00:00:22,920 JOHN GUTTAG: I'm a little reluctant 9 00:00:22,920 --> 00:00:25,880 to say good afternoon, given the weather, 10 00:00:25,880 --> 00:00:28,500 but I'll say it anyway. 11 00:00:28,500 --> 00:00:32,900 I guess now we all do know that we live in Boston. 12 00:00:32,900 --> 00:00:34,880 And I should say, I hope none of you 13 00:00:34,880 --> 00:00:39,740 were affected too much by the fire yesterday in Cambridge, 14 00:00:39,740 --> 00:00:42,650 but that seems to have been a pretty disastrous event 15 00:00:42,650 --> 00:00:44,000 for some. 16 00:00:44,000 --> 00:00:45,740 Anyway, here's the reading. 17 00:00:45,740 --> 00:00:48,840 This is a chapter in the book on clustering, 18 00:00:48,840 --> 00:00:52,610 a topic that Professor Grimson introduced last week. 19 00:00:52,610 --> 00:00:57,560 And I'm going to try and finish up with respect to this course 20 00:00:57,560 --> 00:01:00,080 today, though not with respect to everything 21 00:01:00,080 --> 00:01:02,780 there is to know about clustering. 22 00:01:02,780 --> 00:01:07,700 Quickly just reviewing where we were. 23 00:01:07,700 --> 00:01:10,640 We're in the unit of a course on machine learning, 24 00:01:10,640 --> 00:01:13,190 and we always follow the same paradigm. 25 00:01:13,190 --> 00:01:16,160 We observe some set of examples, which 26 00:01:16,160 --> 00:01:18,440 we call the training data. 27 00:01:18,440 --> 00:01:22,010 We try and infer something about the process 28 00:01:22,010 --> 00:01:25,450 that created those examples. 29 00:01:25,450 --> 00:01:28,390 And then we use inference techniques, different kinds 30 00:01:28,390 --> 00:01:30,760 of techniques, to make predictions 31 00:01:30,760 --> 00:01:33,820 about previously unseen data. 32 00:01:33,820 --> 00:01:36,830 We call that the test data. 33 00:01:36,830 --> 00:01:40,790 As Professor Grimson said, you can think of two broad classes. 34 00:01:40,790 --> 00:01:44,450 Supervised, where we have a set of examples and some label 35 00:01:44,450 --> 00:01:46,670 associated with the example-- 36 00:01:46,670 --> 00:01:50,600 Democrat, Republican, smart, dumb, 37 00:01:50,600 --> 00:01:54,770 whatever you want to associate with them-- 38 00:01:54,770 --> 00:01:57,920 and then we try and infer the labels. 39 00:01:57,920 --> 00:02:02,270 Or unsupervised, where we're given a set of feature vectors 40 00:02:02,270 --> 00:02:05,660 without labels, and then we attempt to group 41 00:02:05,660 --> 00:02:09,860 them into natural clusters. 42 00:02:09,860 --> 00:02:13,470 That's going to be today's topic, clustering. 43 00:02:13,470 --> 00:02:18,440 So clustering is an optimization problem. 44 00:02:18,440 --> 00:02:20,780 As we'll see later, supervised machine learning 45 00:02:20,780 --> 00:02:23,330 is also an optimization problem. 46 00:02:23,330 --> 00:02:26,660 Clustering's a rather simple one. 47 00:02:26,660 --> 00:02:31,180 We're going to start first with the notion of variability. 
48 00:02:31,180 --> 00:02:34,940 So this little c is a single cluster, 49 00:02:34,940 --> 00:02:38,750 and we're going to talk about the variability in that cluster 50 00:02:38,750 --> 00:02:45,440 of the sum of the distance between the mean of the cluster 51 00:02:45,440 --> 00:02:47,880 and each example in the cluster. 52 00:02:47,880 --> 00:02:50,920 And then we square it. 53 00:02:50,920 --> 00:02:51,800 OK? 54 00:02:51,800 --> 00:02:54,860 Pretty straightforward. 55 00:02:54,860 --> 00:02:56,510 For the moment, we can just assume 56 00:02:56,510 --> 00:02:59,720 that we're using Euclidean distance as our distance 57 00:02:59,720 --> 00:03:00,910 metric. 58 00:03:00,910 --> 00:03:04,080 Minkowski with p equals two. 59 00:03:04,080 --> 00:03:10,030 So variability should look pretty similar to something 60 00:03:10,030 --> 00:03:13,010 we've seen before, right? 61 00:03:13,010 --> 00:03:16,100 It's not quite variance, right, but it's very close. 62 00:03:16,100 --> 00:03:19,650 In a minute, we'll look at why it's different. 63 00:03:19,650 --> 00:03:23,160 And then we can look at the dissimilarity 64 00:03:23,160 --> 00:03:27,570 of a set of clusters, a group of clusters, which I'm writing 65 00:03:27,570 --> 00:03:30,600 as capital C, and that's just the sum 66 00:03:30,600 --> 00:03:32,190 of all the variabilities. 67 00:03:34,720 --> 00:03:40,150 Now, if I had divided variability 68 00:03:40,150 --> 00:03:45,514 by the size of the cluster, what would I have? 69 00:03:45,514 --> 00:03:46,680 Something we've seen before. 70 00:03:46,680 --> 00:03:49,410 What would that be? 71 00:03:49,410 --> 00:03:51,890 Somebody? 72 00:03:51,890 --> 00:03:55,070 Isn't that just the variance? 73 00:03:55,070 --> 00:03:57,910 So the question is, why am I not doing that? 74 00:03:57,910 --> 00:04:02,170 If up til now, we always wanted to talk about variance, 75 00:04:02,170 --> 00:04:05,310 why suddenly am I not doing it? 76 00:04:05,310 --> 00:04:07,800 Why do I define this notion of variability 77 00:04:07,800 --> 00:04:10,750 instead of good old variance? 78 00:04:10,750 --> 00:04:11,395 Any thoughts? 79 00:04:15,120 --> 00:04:18,300 What am I accomplishing by not dividing 80 00:04:18,300 --> 00:04:20,459 by the size of the cluster? 81 00:04:20,459 --> 00:04:22,350 Or what would happen if I did divide 82 00:04:22,350 --> 00:04:24,420 by the size of the cluster? 83 00:04:24,420 --> 00:04:25,258 Yes. 84 00:04:25,258 --> 00:04:26,711 AUDIENCE: You normalize it? 85 00:04:26,711 --> 00:04:27,710 JOHN GUTTAG: Absolutely. 86 00:04:27,710 --> 00:04:29,720 I'd normalize it. 87 00:04:29,720 --> 00:04:31,820 That's exactly what it would be doing. 88 00:04:31,820 --> 00:04:36,380 And what might be good or bad about normalizing it? 89 00:04:41,010 --> 00:04:44,280 What does it essentially mean to normalize? 90 00:04:44,280 --> 00:04:48,420 It means that the penalty for a big cluster 91 00:04:48,420 --> 00:04:51,540 with a lot of variance in it is no higher 92 00:04:51,540 --> 00:04:53,520 than the penalty of a tiny little cluster 93 00:04:53,520 --> 00:04:56,720 with a lot of variance in it. 94 00:04:56,720 --> 00:05:00,590 By not normalizing, what I'm saying is 95 00:05:00,590 --> 00:05:05,510 I want to penalize big, highly-diverse clusters 96 00:05:05,510 --> 00:05:09,370 more than small, highly-diverse clusters. 97 00:05:09,370 --> 00:05:09,870 OK? 98 00:05:09,870 --> 00:05:12,990 And if you think about it, that probably makes sense. 99 00:05:15,770 --> 00:05:18,470 Big and bad is worse than small and bad. 
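A minimal numpy sketch of the two definitions above, assuming each example is just a plain feature vector (the names variability and dissimilarity mirror the slide, not any particular course file):

```python
import numpy as np

def variability(cluster):
    """Sum of squared Euclidean distances from each example to the cluster's mean."""
    examples = np.asarray(cluster, dtype=float)   # one row per example
    mean = examples.mean(axis=0)                  # mean (centroid) of the cluster
    return float(((examples - mean) ** 2).sum()) # note: no division by cluster size

def dissimilarity(clusters):
    """Objective for a set of clusters: the sum of their variabilities."""
    return sum(variability(c) for c in clusters)

# A big, spread-out cluster is penalized more than a small, tight one.
print(dissimilarity([[[0, 0], [0, 1]], [[0, 0], [10, 10], [-10, 4]]]))
```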
100 00:05:21,500 --> 00:05:26,110 All right, so now we define the objective function. 101 00:05:26,110 --> 00:05:29,250 And can we say that the optimization problem 102 00:05:29,250 --> 00:05:34,470 we want to solve by clustering is simply finding a capital 103 00:05:34,470 --> 00:05:37,860 C that minimizes dissimilarity? 104 00:05:41,500 --> 00:05:43,460 Is that a reasonable definition? 105 00:05:46,743 --> 00:05:51,050 Well, hint-- no. 106 00:05:51,050 --> 00:05:54,680 What foolish thing could we do that would optimize 107 00:05:54,680 --> 00:05:56,510 that objective function? 108 00:05:56,510 --> 00:05:57,010 Yeah. 109 00:05:57,010 --> 00:05:58,676 AUDIENCE: You could have the same number 110 00:05:58,676 --> 00:05:59,720 of clusters as points? 111 00:05:59,720 --> 00:06:00,500 JOHN GUTTAG: Yeah. 112 00:06:00,500 --> 00:06:02,100 I can have the same number of clusters 113 00:06:02,100 --> 00:06:07,700 as points, assign each point to its own cluster, whoops. 114 00:06:07,700 --> 00:06:10,010 Ooh, almost a relay. 115 00:06:10,010 --> 00:06:14,520 The dissimilarity of each cluster would be 0. 116 00:06:14,520 --> 00:06:17,270 The variability would be 0, so the dissimilarity would be 0, 117 00:06:17,270 --> 00:06:19,630 and I just solved the problem. 118 00:06:19,630 --> 00:06:24,040 Well, that's clearly not a very useful thing to do. 119 00:06:24,040 --> 00:06:28,870 So, well, what do you think we do to get around that? 120 00:06:28,870 --> 00:06:29,370 Yeah. 121 00:06:29,370 --> 00:06:30,750 AUDIENCE: We apply a constraint? 122 00:06:30,750 --> 00:06:32,530 JOHN GUTTAG: We apply a constraint. 123 00:06:32,530 --> 00:06:33,030 Exactly. 124 00:06:35,830 --> 00:06:38,730 And so we have to pick some constraint. 125 00:06:42,970 --> 00:06:48,020 What would be a suitable constraint, for example? 126 00:06:48,020 --> 00:06:51,080 Well, maybe we'd say, OK, the clusters 127 00:06:51,080 --> 00:06:53,450 have to have some minimum distance between them. 128 00:06:55,960 --> 00:06:59,580 Or-- and this is the constraint we'll be using today-- 129 00:06:59,580 --> 00:07:02,740 we could constrain the number of clusters. 130 00:07:02,740 --> 00:07:07,160 Say, all right, I only want to have at most five clusters. 131 00:07:07,160 --> 00:07:11,680 Do the best you can to minimize dissimilarity, 132 00:07:11,680 --> 00:07:14,630 but you're not allowed to use more than five clusters. 133 00:07:14,630 --> 00:07:17,230 That's the most common constraint that 134 00:07:17,230 --> 00:07:20,550 gets placed in the problem. 135 00:07:20,550 --> 00:07:23,036 All right, we're going to look at two algorithms. 136 00:07:23,036 --> 00:07:24,910 Maybe I should say two methods, because there 137 00:07:24,910 --> 00:07:28,780 are multiple implementations of these methods. 138 00:07:28,780 --> 00:07:31,650 The first is called hierarchical clustering, 139 00:07:31,650 --> 00:07:33,750 and the second is called k-means. 140 00:07:33,750 --> 00:07:36,460 There should be an S on the word mean there. 141 00:07:36,460 --> 00:07:38,650 Sorry about that. 142 00:07:38,650 --> 00:07:41,000 All right, let's look at hierarchical clustering first. 143 00:07:44,330 --> 00:07:47,460 It's a strange algorithm. 144 00:07:47,460 --> 00:07:51,870 We start by assigning each item, each example, 145 00:07:51,870 --> 00:07:54,220 to its own cluster. 146 00:07:54,220 --> 00:07:57,610 So this is the trivial solution we talked about before. 
147 00:07:57,610 --> 00:07:59,850 So if you have N items, you now have N clusters, 148 00:07:59,850 --> 00:08:02,280 each containing just one item. 149 00:08:07,050 --> 00:08:12,870 In the next step, we find the two most similar clusters 150 00:08:12,870 --> 00:08:17,470 we have and merge them into a single cluster, 151 00:08:17,470 --> 00:08:19,300 so that now instead of N clusters, 152 00:08:19,300 --> 00:08:20,860 we have N minus 1 clusters. 153 00:08:26,210 --> 00:08:29,140 And we continue this process until all items 154 00:08:29,140 --> 00:08:34,010 are clustered into a single cluster of size N. 155 00:08:34,010 --> 00:08:36,980 Now of course, that's kind of silly, 156 00:08:36,980 --> 00:08:38,456 because if all I wanted was to put them 157 00:08:38,456 --> 00:08:39,830 all in a single cluster, 158 00:08:39,830 --> 00:08:40,829 I don't need to iterate. 159 00:08:40,829 --> 00:08:43,280 I just go wham, right? 160 00:08:43,280 --> 00:08:46,010 But what's interesting about hierarchical clustering 161 00:08:46,010 --> 00:08:50,770 is you stop it, typically, somewhere along the way. 162 00:08:50,770 --> 00:08:53,960 You produce something called a dendrogram. 163 00:08:53,960 --> 00:08:55,240 Let me write that down. 164 00:09:02,960 --> 00:09:08,920 At each step here, it shows you what you've merged thus far. 165 00:09:08,920 --> 00:09:11,330 We'll see an example of that shortly. 166 00:09:11,330 --> 00:09:14,170 And then you can have some stopping criteria. 167 00:09:14,170 --> 00:09:16,730 We'll talk about that. 168 00:09:16,730 --> 00:09:19,820 This is called agglomerative hierarchical 169 00:09:19,820 --> 00:09:23,000 clustering because we start with a bunch of things 170 00:09:23,000 --> 00:09:24,200 and we agglomerate them. 171 00:09:24,200 --> 00:09:28,050 That is to say, we put them together. 172 00:09:28,050 --> 00:09:28,920 All right? 173 00:09:28,920 --> 00:09:31,480 Make sense? 174 00:09:31,480 --> 00:09:34,060 Well, there's a catch. 175 00:09:34,060 --> 00:09:36,760 What do we mean by distance? 176 00:09:36,760 --> 00:09:42,160 And there are multiple plausible definitions of distance, 177 00:09:42,160 --> 00:09:44,470 and you would get a different answer depending 178 00:09:44,470 --> 00:09:45,985 upon which measure you used. 179 00:09:50,410 --> 00:09:53,350 These are called linkage metrics. 180 00:09:53,350 --> 00:09:58,040 The most common one used is probably single-linkage, 181 00:09:58,040 --> 00:10:01,930 and that says the distance between a pair of clusters 182 00:10:01,930 --> 00:10:06,130 is equal to the shortest distance from any member of one 183 00:10:06,130 --> 00:10:08,990 cluster to any member of the other cluster. 184 00:10:12,100 --> 00:10:17,580 So if I have two clusters, here and here, 185 00:10:17,580 --> 00:10:21,230 and they have bunches of points in them, 186 00:10:21,230 --> 00:10:23,510 single-linkage distance would say, well, 187 00:10:23,510 --> 00:10:27,260 let's use these two points which are the closest, 188 00:10:27,260 --> 00:10:29,780 and the distance between these two 189 00:10:29,780 --> 00:10:32,215 is the distance between the clusters. 190 00:10:37,090 --> 00:10:43,990 You can also use complete-linkage, 191 00:10:43,990 --> 00:10:47,140 and that says the distance between any two clusters 192 00:10:47,140 --> 00:10:50,170 is equal to the greatest distance from any member 193 00:10:50,170 --> 00:10:53,441 to any other member. 194 00:10:53,441 --> 00:10:53,940 OK?
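As a rough sketch, the single- and complete-linkage distances between two clusters could be written like this, treating each cluster as a list of feature vectors (the helper names here are assumptions, not the course code):

```python
import numpy as np
from itertools import product

def euclidean(x, y):
    """Euclidean (Minkowski p = 2) distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float)))

def single_linkage(c1, c2):
    """Shortest distance from any member of one cluster to any member of the other."""
    return min(euclidean(e1, e2) for e1, e2 in product(c1, c2))

def complete_linkage(c1, c2):
    """Greatest distance from any member of one cluster to any member of the other."""
    return max(euclidean(e1, e2) for e1, e2 in product(c1, c2))
```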
195 00:10:53,940 --> 00:10:56,150 So if we had the same picture we had before-- 196 00:11:01,860 --> 00:11:04,810 probably not the same picture, but it's a picture. 197 00:11:04,810 --> 00:11:07,450 Whoops. 198 00:11:07,450 --> 00:11:10,930 Then we would say, well, I guess complete-linkage is probably 199 00:11:10,930 --> 00:11:12,760 the distance, maybe, between those two. 200 00:11:19,078 --> 00:11:24,550 And finally, not surprisingly, you 201 00:11:24,550 --> 00:11:28,530 can take the average distance. 202 00:11:28,530 --> 00:11:31,050 These are all plausible metrics. 203 00:11:31,050 --> 00:11:36,450 They're all used and practiced for different kinds of results 204 00:11:36,450 --> 00:11:39,740 depending upon the application of the clustering. 205 00:11:42,740 --> 00:11:45,750 All right, let's look at an example. 206 00:11:45,750 --> 00:11:49,070 So what I have here is the air distance 207 00:11:49,070 --> 00:11:55,200 between six different cities, Boston, New York, Chicago, 208 00:11:55,200 --> 00:11:59,890 Denver, San Francisco, and Seattle. 209 00:11:59,890 --> 00:12:04,910 And now let's say we want to cluster these airports just 210 00:12:04,910 --> 00:12:07,470 based upon their distance. 211 00:12:07,470 --> 00:12:09,620 So we start. 212 00:12:09,620 --> 00:12:12,860 The first piece of our dendrogram says, 213 00:12:12,860 --> 00:12:15,080 well, all right, I have six cities, 214 00:12:15,080 --> 00:12:17,480 I have six clusters, each containing one city. 215 00:12:22,777 --> 00:12:23,985 All right, what happens next? 216 00:12:27,030 --> 00:12:30,550 What's the next level going to look like? 217 00:12:30,550 --> 00:12:31,050 Yeah? 218 00:12:31,050 --> 00:12:32,980 AUDIENCE: You're going from Boston [INAUDIBLE] 219 00:12:32,980 --> 00:12:35,620 JOHN GUTTAG: I'm going to join Boston and New York, as 220 00:12:35,620 --> 00:12:38,860 improbable as that sounds. 221 00:12:38,860 --> 00:12:42,130 All right, so that's the next level. 222 00:12:42,130 --> 00:12:45,640 And if for some reason I only wanted to have five clusters, 223 00:12:45,640 --> 00:12:48,890 well, I could stop here. 224 00:12:48,890 --> 00:12:50,330 Next, what happens? 225 00:12:53,260 --> 00:12:56,100 Well, I look at it, I say well, I'll 226 00:12:56,100 --> 00:12:58,790 join up Chicago with Boston and New York. 227 00:13:04,320 --> 00:13:04,820 All right. 228 00:13:04,820 --> 00:13:06,590 What do I get at the next level? 229 00:13:06,590 --> 00:13:07,150 Somebody? 230 00:13:07,150 --> 00:13:07,650 Yeah. 231 00:13:07,650 --> 00:13:12,150 AUDIENCE: Seattle [INAUDIBLE] 232 00:13:12,150 --> 00:13:14,110 JOHN GUTTAG: Doesn't look like it to me. 233 00:13:14,110 --> 00:13:21,130 If you look at San Francisco and Seattle, they are 808 miles, 234 00:13:21,130 --> 00:13:27,140 and Denver and San Francisco is 1,235. 235 00:13:27,140 --> 00:13:31,241 So I'd end up, in fact, joining San Francisco and Seattle. 236 00:13:31,241 --> 00:13:34,130 AUDIENCE: That's what I said. 237 00:13:34,130 --> 00:13:38,084 JOHN GUTTAG: Well, that explains why I need my hearing fixed. 238 00:13:38,084 --> 00:13:39,380 [LAUGHTER] 239 00:13:39,380 --> 00:13:40,490 All right. 240 00:13:40,490 --> 00:13:44,480 So I combine San Francisco and Seattle, 241 00:13:44,480 --> 00:13:47,110 and now it gets interesting. 242 00:13:47,110 --> 00:13:50,230 I have two choices with Denver. 243 00:13:50,230 --> 00:13:57,520 Obviously, there are only two choices, 244 00:13:57,520 --> 00:14:03,280 and which I choose depends upon which linkage criterion I use.
245 00:14:03,280 --> 00:14:07,030 If I'm using single-linkage, well, then Denver 246 00:14:07,030 --> 00:14:09,910 gets joined with Boston, New York, and Chicago, 247 00:14:09,910 --> 00:14:13,570 because it's closer to Chicago than it is to either San 248 00:14:13,570 --> 00:14:14,760 Francisco or Seattle. 249 00:14:17,420 --> 00:14:20,160 But if I use complete-linkage, it 250 00:14:20,160 --> 00:14:23,950 gets joined up with San Francisco and Seattle, 251 00:14:23,950 --> 00:14:31,060 because it is further from Boston than it is from, 252 00:14:31,060 --> 00:14:32,920 I guess it's San Francisco or Seattle. 253 00:14:32,920 --> 00:14:35,310 Whichever it is, right? 254 00:14:35,310 --> 00:14:37,920 So this is a place where you see what 255 00:14:37,920 --> 00:14:41,160 answer I get depends upon the linkage criterion. 256 00:14:41,160 --> 00:14:44,100 And then if I want, I can continue to the next step 257 00:14:44,100 --> 00:14:46,090 and just join them all. 258 00:14:46,090 --> 00:14:47,100 All right? 259 00:14:47,100 --> 00:14:50,670 That's hierarchical clustering. 260 00:14:50,670 --> 00:14:56,110 So it's good because you get this whole history of the 261 00:14:56,110 --> 00:14:59,320 dendrograms, and you get to look at it, 262 00:14:59,320 --> 00:15:02,600 say, well, all right, that looks pretty good. 263 00:15:02,600 --> 00:15:06,560 I'll stick with this clustering. 264 00:15:06,560 --> 00:15:09,600 It's deterministic. 265 00:15:09,600 --> 00:15:13,680 Given a linkage criterion, you always get the same answer. 266 00:15:13,680 --> 00:15:14,900 There's nothing random here. 267 00:15:17,500 --> 00:15:20,500 Notice, by the way, the answer might not 268 00:15:20,500 --> 00:15:23,680 be optimal with regards to that linkage criterion. 269 00:15:23,680 --> 00:15:26,480 Why not? 270 00:15:26,480 --> 00:15:29,132 What kind of algorithm is this? 271 00:15:29,132 --> 00:15:29,840 AUDIENCE: Greedy. 272 00:15:29,840 --> 00:15:32,420 JOHN GUTTAG: It's a greedy algorithm, exactly. 273 00:15:32,420 --> 00:15:34,940 And so I'm making locally optimal decisions 274 00:15:34,940 --> 00:15:38,510 at each point which may or may not be globally optimal. 275 00:15:43,160 --> 00:15:44,450 It's flexible. 276 00:15:44,450 --> 00:15:46,070 Choosing different linkage criteria, 277 00:15:46,070 --> 00:15:48,050 I get different results. 278 00:15:48,050 --> 00:15:53,660 But it's also potentially really, really slow. 279 00:15:53,660 --> 00:15:58,610 This is not something you want to do on a million examples. 280 00:15:58,610 --> 00:16:02,570 The naive algorithm, the one I just sort of showed you, 281 00:16:02,570 --> 00:16:05,730 is N cubed. 282 00:16:05,730 --> 00:16:10,120 N cubed is typically impractical. 283 00:16:10,120 --> 00:16:14,590 For some linkage criteria, for example, single-linkage, there 284 00:16:14,590 --> 00:16:18,680 exist very clever N squared algorithms. 285 00:16:18,680 --> 00:16:21,380 For others, you can't beat N cubed. 286 00:16:21,380 --> 00:16:27,420 But even N squared is really not very good. 287 00:16:27,420 --> 00:16:30,670 Which gets me to a much faster greedy algorithm called 288 00:16:30,670 --> 00:16:31,170 k-means. 289 00:16:33,740 --> 00:16:40,350 Now, the k in k-means is the number of clusters you want.
290 00:16:40,350 --> 00:16:42,510 So the catch with k-means is if you 291 00:16:42,510 --> 00:16:46,050 don't have any idea how many clusters you want, 292 00:16:46,050 --> 00:16:50,260 it's problematical, whereas hierarchical, you 293 00:16:50,260 --> 00:16:53,640 get to inspect it and see what you're getting. 294 00:16:53,640 --> 00:16:57,330 If you know how many you want, it's a good choice 295 00:16:57,330 --> 00:16:59,010 because it's much faster. 296 00:17:02,170 --> 00:17:07,319 All right, the algorithm, again, is very simple. 297 00:17:07,319 --> 00:17:11,089 This is the one that Professor Grimson briefly discussed. 298 00:17:11,089 --> 00:17:16,349 You randomly choose k examples as your initial centroids. 299 00:17:16,349 --> 00:17:19,970 Doesn't matter which of the examples you choose. 300 00:17:19,970 --> 00:17:24,020 Then you create k clusters by assigning each example 301 00:17:24,020 --> 00:17:31,440 to the closest centroid, compute k new centroids 302 00:17:31,440 --> 00:17:35,470 by averaging the examples in each cluster. 303 00:17:35,470 --> 00:17:40,950 So in the first iteration, the centroids are all examples 304 00:17:40,950 --> 00:17:42,460 that you started with. 305 00:17:42,460 --> 00:17:46,410 But after that, they're probably not examples, 306 00:17:46,410 --> 00:17:49,620 because you're now taking the average of two examples, which 307 00:17:49,620 --> 00:17:53,070 may not correspond to any example you have. 308 00:17:53,070 --> 00:17:56,810 Actually the average of N examples. 309 00:17:56,810 --> 00:17:59,120 And then you just keep doing this 310 00:17:59,120 --> 00:18:02,730 until the centroids don't move. 311 00:18:02,730 --> 00:18:03,230 Right? 312 00:18:03,230 --> 00:18:04,875 Once you go through one iteration 313 00:18:04,875 --> 00:18:06,500 where they don't move, there's no point 314 00:18:06,500 --> 00:18:10,100 in recomputing them again and again and again, 315 00:18:10,100 --> 00:18:12,440 so it is converged. 316 00:18:16,610 --> 00:18:20,730 So let's look at the complexity. 317 00:18:20,730 --> 00:18:23,810 Well, at the moment, we can't tell you 318 00:18:23,810 --> 00:18:25,970 how many iterations you're going to have, 319 00:18:25,970 --> 00:18:28,370 but what's the complexity of one iteration? 320 00:18:34,640 --> 00:18:38,890 Well, let's think about what you're doing here. 321 00:18:38,890 --> 00:18:43,240 You've got k centroids. 322 00:18:43,240 --> 00:18:46,570 Now I have to take each example and compare it 323 00:18:46,570 --> 00:18:50,020 to each-- in a naively, at least-- to each centroid 324 00:18:50,020 --> 00:18:52,750 to see which it's closest to. 325 00:18:52,750 --> 00:18:54,310 Right? 326 00:18:54,310 --> 00:19:01,510 So that's k comparisons per example. 327 00:19:01,510 --> 00:19:07,480 So that's k times n times d, where 328 00:19:07,480 --> 00:19:10,480 how much time each of these comparison takes, 329 00:19:10,480 --> 00:19:12,910 which is likely to depend upon the dimensionality 330 00:19:12,910 --> 00:19:14,740 of the features, right? 331 00:19:14,740 --> 00:19:17,310 Just the Euclidean distance, for example. 332 00:19:20,150 --> 00:19:25,600 But this is a way small number than N squared, typically. 333 00:19:25,600 --> 00:19:27,490 So each iteration is pretty quick, 334 00:19:27,490 --> 00:19:31,330 and in practice, as we'll see, this typically 335 00:19:31,330 --> 00:19:34,540 converges quite quickly, so you usually 336 00:19:34,540 --> 00:19:39,120 need a very small number of iterations. 
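A bare-bones sketch of the loop just described -- random initial centroids drawn from the examples, assign, recompute, stop when nothing moves -- might look like the following. It assumes examples are numpy feature vectors and is not the course's actual implementation:

```python
import random
import numpy as np

def kmeans(examples, k, max_iters=100):
    """Naive k-means over a list of equal-length numpy feature vectors."""
    examples = [np.asarray(e, dtype=float) for e in examples]
    centroids = random.sample(examples, k)          # k randomly chosen examples
    for _ in range(max_iters):
        # Assign each example to the closest centroid: k comparisons per example.
        clusters = [[] for _ in range(k)]
        for e in examples:
            dists = [np.linalg.norm(e - c) for c in centroids]
            clusters[int(np.argmin(dists))].append(e)
        # Recompute each centroid as the average of the examples assigned to it.
        # (An empty cluster just keeps its old centroid here; the lecture's code
        # handles that case differently, as discussed later.)
        new_centroids = [np.mean(c, axis=0) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if all(np.allclose(old, new) for old, new in zip(centroids, new_centroids)):
            break                                    # centroids did not move: converged
        centroids = new_centroids
    return clusters, centroids
```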
337 00:19:39,120 --> 00:19:41,580 So it is quite efficient, and then there 338 00:19:41,580 --> 00:19:43,830 are various ways you can optimize it 339 00:19:43,830 --> 00:19:45,900 to make it even more efficient. 340 00:19:45,900 --> 00:19:49,920 This is the most commonly-used clustering algorithm 341 00:19:49,920 --> 00:19:53,200 because it works really fast. 342 00:19:53,200 --> 00:19:55,220 Let's look at an example. 343 00:19:55,220 --> 00:19:58,880 So I've got a bunch of blue points here, 344 00:19:58,880 --> 00:20:02,090 and I actually wrote the code to do this. 345 00:20:02,090 --> 00:20:03,770 I'm not going to show you the code. 346 00:20:03,770 --> 00:20:13,020 And I chose four centroids at random, colored stars. 347 00:20:13,020 --> 00:20:18,390 A green one, a fuchsia-colored one, a red one, and a blue one. 348 00:20:21,410 --> 00:20:24,480 So maybe they're not the ones you would have chosen, 349 00:20:24,480 --> 00:20:25,380 but there they are. 350 00:20:28,030 --> 00:20:33,630 And I then, having chosen them, assign each point 351 00:20:33,630 --> 00:20:38,550 to one of those centroids, whichever one it's closest to. 352 00:20:38,550 --> 00:20:40,660 All right? 353 00:20:40,660 --> 00:20:41,290 Step one. 354 00:20:45,680 --> 00:20:50,350 And then I recompute the centroid. 355 00:20:50,350 --> 00:20:51,260 So let's go back. 356 00:20:53,780 --> 00:20:59,020 So we're here, and these are the initial centroids. 357 00:20:59,020 --> 00:21:03,280 Now, when I find the new centroids, 358 00:21:03,280 --> 00:21:06,130 if we look at where the red one is, 359 00:21:06,130 --> 00:21:10,540 the red one is this point, this point, and this point. 360 00:21:10,540 --> 00:21:14,170 Clearly, the new centroid is going to move, right? 361 00:21:14,170 --> 00:21:16,750 It's going to move somewhere along in here or something 362 00:21:16,750 --> 00:21:19,950 like that, right? 363 00:21:19,950 --> 00:21:24,154 So we'll get those new centroids. 364 00:21:24,154 --> 00:21:26,460 There it is. 365 00:21:26,460 --> 00:21:31,870 And now we'll re-assign points. 366 00:21:31,870 --> 00:21:38,190 And what we'll see is this point is now closer to the red star 367 00:21:38,190 --> 00:21:41,340 than it is to the fuchsia star, because we've 368 00:21:41,340 --> 00:21:43,920 moved the red star. 369 00:21:43,920 --> 00:21:44,970 Whoops. 370 00:21:44,970 --> 00:21:46,195 That one. 371 00:21:46,195 --> 00:21:47,070 Said the wrong thing. 372 00:21:47,070 --> 00:21:48,660 They were red to start with. 373 00:21:48,660 --> 00:21:53,490 This one is now suddenly closer to the purple, so-- 374 00:21:53,490 --> 00:21:54,150 and to the red. 375 00:21:54,150 --> 00:21:55,920 It will get recolored. 376 00:21:55,920 --> 00:21:57,350 We compute the new centroids. 377 00:21:59,970 --> 00:22:02,100 We're going to move something again. 378 00:22:02,100 --> 00:22:03,570 We continue. 379 00:22:03,570 --> 00:22:05,290 Points will move around. 380 00:22:05,290 --> 00:22:08,620 This time we move two points. 381 00:22:08,620 --> 00:22:09,820 Here we go again. 382 00:22:09,820 --> 00:22:11,980 Notice, again, the centroids don't 383 00:22:11,980 --> 00:22:14,090 correspond to actual examples. 384 00:22:14,090 --> 00:22:16,420 This one is close, but it's not really one of them. 385 00:22:19,210 --> 00:22:20,930 Move two more. 386 00:22:20,930 --> 00:22:24,040 Recompute centroids, and we're done. 
387 00:22:24,040 --> 00:22:29,300 So here we've converged, and I think it was five iterations, 388 00:22:29,300 --> 00:22:31,481 and nothing will move again. 389 00:22:31,481 --> 00:22:31,980 All right? 390 00:22:31,980 --> 00:22:34,354 Does that make sense to everybody? 391 00:22:34,354 --> 00:22:35,270 So it's pretty simple. 392 00:22:38,420 --> 00:22:39,770 What are the downsides? 393 00:22:39,770 --> 00:22:45,170 Well, choosing k foolishly can lead to strange results. 394 00:22:45,170 --> 00:22:49,100 So if I chose k equal to 3, looking 395 00:22:49,100 --> 00:22:51,470 at this particular arrangement of points, 396 00:22:51,470 --> 00:22:55,670 it's not obvious what "the right answer" is, right? 397 00:22:55,670 --> 00:22:58,130 Maybe it's making all of this one cluster. 398 00:22:58,130 --> 00:23:00,100 I don't know. 399 00:23:00,100 --> 00:23:02,890 But there are weird k's, and if you 400 00:23:02,890 --> 00:23:08,050 choose a k that is nonsensical with respect to your data, 401 00:23:08,050 --> 00:23:11,470 then your clustering will be nonsensical. 402 00:23:11,470 --> 00:23:13,240 So that's one problem we have to think about. 403 00:23:13,240 --> 00:23:16,330 How do we choose k? 404 00:23:16,330 --> 00:23:20,120 Another problem, and this is one somebody raised last time, 405 00:23:20,120 --> 00:23:24,560 is that the results can depend upon the initial centroids. 406 00:23:24,560 --> 00:23:29,330 Unlike hierarchical clustering, k-means is non-deterministic. 407 00:23:29,330 --> 00:23:34,460 Depending upon what random examples we choose, 408 00:23:34,460 --> 00:23:36,470 we can get a different number of iterations. 409 00:23:36,470 --> 00:23:40,190 If we choose them poorly, it could take longer to converge. 410 00:23:40,190 --> 00:23:44,110 More worrisome, you get a different answer. 411 00:23:44,110 --> 00:23:45,670 You're running this greedy algorithm, 412 00:23:45,670 --> 00:23:47,920 and you might actually get to a different place, 413 00:23:47,920 --> 00:23:49,720 depending upon which centroids you chose. 414 00:23:52,390 --> 00:23:54,210 So these are the two issues we have 415 00:23:54,210 --> 00:23:57,000 to think about dealing with. 416 00:23:57,000 --> 00:24:00,980 So let's first think about choosing k. 417 00:24:00,980 --> 00:24:04,400 What often happens is people choose 418 00:24:04,400 --> 00:24:07,820 k using a priori knowledge about the application. 419 00:24:10,670 --> 00:24:13,070 If I'm in medicine, I actually know 420 00:24:13,070 --> 00:24:15,080 that there are only five different kinds 421 00:24:15,080 --> 00:24:17,280 of bacteria in the world. 422 00:24:17,280 --> 00:24:19,110 That's true. 423 00:24:19,110 --> 00:24:22,930 I mean, there are subspecies, but five large categories. 424 00:24:22,930 --> 00:24:25,980 And if I had a bunch of bacteria I wanted to cluster, 425 00:24:25,980 --> 00:24:30,050 I may just set k equal to 5. 426 00:24:30,050 --> 00:24:32,390 Maybe I believe there are only two kinds of people 427 00:24:32,390 --> 00:24:35,585 in the world, those who are at MIT and those who are not. 428 00:24:35,585 --> 00:24:37,550 And so I'll choose k equal to 2. 429 00:24:40,200 --> 00:24:45,060 Often, we know enough about the application, we can choose k. 430 00:24:45,060 --> 00:24:49,110 As we'll see later, often we think we do, and we don't. 431 00:24:51,940 --> 00:24:56,160 A better approach is to search for a good k.
432 00:25:01,050 --> 00:25:03,900 So you can try different values of k 433 00:25:03,900 --> 00:25:08,050 and evaluate the quality of the result. 434 00:25:08,050 --> 00:25:09,925 Assume you have some metric, as to say yeah, 435 00:25:09,925 --> 00:25:13,290 I like this clustering, I don't like this clustering. 436 00:25:13,290 --> 00:25:16,410 And we'll talk about do that in detail. 437 00:25:16,410 --> 00:25:22,260 Or you can run hierarchical clustering on a subset of data. 438 00:25:22,260 --> 00:25:23,970 I've got a million points. 439 00:25:23,970 --> 00:25:27,060 All right, what I'm going to do is take a subset of 1,000 440 00:25:27,060 --> 00:25:28,630 of them or 10,000. 441 00:25:28,630 --> 00:25:31,550 Run hierarchical clustering. 442 00:25:31,550 --> 00:25:36,750 From that, get a sense of the structure underlying the data. 443 00:25:36,750 --> 00:25:41,650 Decide k should be 6, and then run k-means with k equals 6. 444 00:25:41,650 --> 00:25:42,940 People often do this. 445 00:25:42,940 --> 00:25:47,380 They run hierarchical clustering on a small subset of the data 446 00:25:47,380 --> 00:25:48,570 and then choose k. 447 00:25:51,860 --> 00:25:57,830 And we'll look-- but one we're going to look at is that one. 448 00:25:57,830 --> 00:26:00,810 What about unlucky centroids? 449 00:26:00,810 --> 00:26:05,640 So here I got the same points we started with. 450 00:26:05,640 --> 00:26:08,390 Different initial centroids. 451 00:26:08,390 --> 00:26:11,310 I've got a fuchsia one, a black one, 452 00:26:11,310 --> 00:26:16,130 and then I've got red and blue down here, 453 00:26:16,130 --> 00:26:21,780 which I happened to accidentally choose close to one another. 454 00:26:21,780 --> 00:26:24,960 Well, if I start with these centroids, 455 00:26:24,960 --> 00:26:27,300 certainly you would expect things 456 00:26:27,300 --> 00:26:29,470 to take longer to converge. 457 00:26:29,470 --> 00:26:31,580 But in fact, what happens is this-- 458 00:26:34,450 --> 00:26:40,060 I get this assignment of blue, this assignment of red, 459 00:26:40,060 --> 00:26:43,160 and I'm done. 460 00:26:43,160 --> 00:26:48,980 It converges on this, which probably is not 461 00:26:48,980 --> 00:26:51,410 what we wanted out of this. 462 00:26:51,410 --> 00:26:54,350 Maybe it is, but the fact that I converged 463 00:26:54,350 --> 00:26:57,500 on some very different place shows 464 00:26:57,500 --> 00:26:59,480 that it's a real weakness of the algorithm, 465 00:26:59,480 --> 00:27:02,420 that it's sensitive to the randomly-chosen initial 466 00:27:02,420 --> 00:27:05,738 conditions. 467 00:27:05,738 --> 00:27:11,000 Well, couple of things you can do about that. 468 00:27:11,000 --> 00:27:17,180 You could be clever and try and select good initial centroids. 469 00:27:17,180 --> 00:27:20,150 So people often will do that, and what they'll do is try 470 00:27:20,150 --> 00:27:24,740 and just make sure that they're distributed over the space. 471 00:27:24,740 --> 00:27:27,290 So they would look at some picture like this 472 00:27:27,290 --> 00:27:31,940 and say, well, let's just put my centroids at the corners 473 00:27:31,940 --> 00:27:35,570 or something like that so that they're far apart. 474 00:27:39,760 --> 00:27:42,960 Another approach is to try multiple sets 475 00:27:42,960 --> 00:27:46,280 of randomly-chosen centroids, and then 476 00:27:46,280 --> 00:27:47,825 just select the best results. 477 00:27:50,830 --> 00:27:55,980 And that's what this little algorithm on the screen does. 
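A rough Python rendering of that multiple-restart idea, assuming kmeans and dissimilarity helpers like the sketches above (this is not the slide's exact code):

```python
def try_kmeans(examples, k, num_trials):
    """Run k-means several times from random starts; keep the least-dissimilar result."""
    best_clusters, best_centroids = kmeans(examples, k)
    for _ in range(num_trials - 1):
        clusters, centroids = kmeans(examples, k)
        if dissimilarity(clusters) < dissimilarity(best_clusters):
            best_clusters, best_centroids = clusters, centroids
    return best_clusters, best_centroids
```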
478 00:27:55,980 --> 00:28:00,540 So I'll say best is equal to k-means of the points 479 00:28:00,540 --> 00:28:05,350 themselves, or something, then for t 480 00:28:05,350 --> 00:28:10,630 in range number of trials, I'll say C equals k-means of points, 481 00:28:10,630 --> 00:28:14,080 and I'll just keep track and choose the one with the least 482 00:28:14,080 --> 00:28:15,406 dissimilarity. 483 00:28:15,406 --> 00:28:16,780 The thing I'm trying to minimize. 484 00:28:16,780 --> 00:28:17,280 OK? 485 00:28:21,450 --> 00:28:24,910 The first one is got all the points in one cluster. 486 00:28:24,910 --> 00:28:27,460 So it's very dissimilar. 487 00:28:27,460 --> 00:28:29,050 And then I'll just keep generating 488 00:28:29,050 --> 00:28:31,210 for different k's and I'll choose 489 00:28:31,210 --> 00:28:34,700 the k that seems to be the best, that 490 00:28:34,700 --> 00:28:39,740 does the best job of minimizing my objective function. 491 00:28:39,740 --> 00:28:42,650 And this is a very common solution, by the way, 492 00:28:42,650 --> 00:28:46,010 for any randomized greedy algorithm. 493 00:28:46,010 --> 00:28:49,280 And there are a lot of randomized greedy algorithms 494 00:28:49,280 --> 00:28:53,270 that you just choose multiple initial conditions, 495 00:28:53,270 --> 00:28:55,580 try them all out and pick the best. 496 00:28:59,450 --> 00:29:00,830 All right, now I want to show you 497 00:29:00,830 --> 00:29:04,585 a slightly more real example. 498 00:29:07,530 --> 00:29:13,470 So this is a file we've got with medical patients, 499 00:29:13,470 --> 00:29:17,280 and we're going to try and cluster them and see 500 00:29:17,280 --> 00:29:19,170 whether the clusters tell us anything 501 00:29:19,170 --> 00:29:21,990 about the probability of them dying 502 00:29:21,990 --> 00:29:26,340 of a heart attack in, say, the next year or some period 503 00:29:26,340 --> 00:29:27,910 of time. 504 00:29:27,910 --> 00:29:30,570 So to simplify things, and this is something 505 00:29:30,570 --> 00:29:33,060 I have done with research, but we're looking 506 00:29:33,060 --> 00:29:35,550 at only four features here-- 507 00:29:35,550 --> 00:29:39,570 the heart rate in beats per minute, 508 00:29:39,570 --> 00:29:46,250 the number of previous heart attacks, the age, and something 509 00:29:46,250 --> 00:29:49,680 called ST elevation, a binary attribute. 510 00:29:49,680 --> 00:29:52,700 So the first three are obvious. 511 00:29:52,700 --> 00:29:57,510 If you take an ECG of somebody's heart, it looks like this. 512 00:29:57,510 --> 00:29:59,900 This is a normal one. 513 00:29:59,900 --> 00:30:01,850 They have the S, the T, and then there's 514 00:30:01,850 --> 00:30:06,480 this region between the S wave and the T wave. 515 00:30:06,480 --> 00:30:11,950 And if it's higher, hence elevated, that's a bad thing. 516 00:30:11,950 --> 00:30:13,890 And so this is about the first thing 517 00:30:13,890 --> 00:30:17,550 that they measure if someone is having cardiac problems. 518 00:30:17,550 --> 00:30:19,490 Do they have ST elevation? 519 00:30:22,370 --> 00:30:24,290 And then with each patient, we're 520 00:30:24,290 --> 00:30:28,270 going to have an outcome, whether they died, 521 00:30:28,270 --> 00:30:31,390 and it's related to the features, 522 00:30:31,390 --> 00:30:35,450 but it's probabilistic not deterministic. 
523 00:30:35,450 --> 00:30:39,920 So for example, an older person with multiple heart attacks 524 00:30:39,920 --> 00:30:42,470 is at higher risk than a young person who's 525 00:30:42,470 --> 00:30:44,692 never had a heart attack. 526 00:30:44,692 --> 00:30:46,400 That doesn't mean, though, that the older 527 00:30:46,400 --> 00:30:48,440 person will die first. 528 00:30:48,440 --> 00:30:49,715 It's just more probable. 529 00:30:54,290 --> 00:30:57,327 We're going to take this data, we're going to cluster it, 530 00:30:57,327 --> 00:30:58,910 and then we're going to look at what's 531 00:30:58,910 --> 00:31:02,970 called the purity of the clusters 532 00:31:02,970 --> 00:31:06,030 relative to the outcomes. 533 00:31:06,030 --> 00:31:11,380 So is the cluster, say, enriched by people who died? 534 00:31:11,380 --> 00:31:14,380 If you have one cluster and everyone in it died, 535 00:31:14,380 --> 00:31:17,410 then the clustering is clearly finding some structure 536 00:31:17,410 --> 00:31:18,490 related to the outcome. 537 00:31:23,990 --> 00:31:27,910 So the file is in the zip file I uploaded. 538 00:31:27,910 --> 00:31:30,235 It looks more or less like this. 539 00:31:30,235 --> 00:31:30,940 Right? 540 00:31:30,940 --> 00:31:33,040 So it's very straightforward. 541 00:31:33,040 --> 00:31:34,310 The outcomes are binary. 542 00:31:34,310 --> 00:31:36,940 1 is a positive outcome. 543 00:31:36,940 --> 00:31:39,220 Strangely enough in the medical jargon, 544 00:31:39,220 --> 00:31:42,220 a death is a positive outcome. 545 00:31:42,220 --> 00:31:44,800 I guess maybe if you're responsible for the medical 546 00:31:44,800 --> 00:31:46,350 bills, it's positive. 547 00:31:46,350 --> 00:31:50,410 If you're the patient, it's hard to think of it as a good thing. 548 00:31:50,410 --> 00:31:53,530 Nevertheless, that's the way that they talk. 549 00:31:53,530 --> 00:31:55,450 And the others are all there, right? 550 00:31:55,450 --> 00:31:59,710 Heart rate, other things. 551 00:31:59,710 --> 00:32:01,480 All right, let's look at some code. 552 00:32:04,160 --> 00:32:05,481 So I've extracted some code. 553 00:32:05,481 --> 00:32:06,980 I'm not going to show you all of it. 554 00:32:06,980 --> 00:32:10,910 There's quite a lot of it, as you'll see. 555 00:32:10,910 --> 00:32:14,450 So we'll start-- one of the files you've got 556 00:32:14,450 --> 00:32:17,180 is called cluster dot pi. 557 00:32:17,180 --> 00:32:18,890 I decided there was enough code, I 558 00:32:18,890 --> 00:32:21,020 didn't want to put it all in one file. 559 00:32:21,020 --> 00:32:22,860 I was getting confused. 560 00:32:22,860 --> 00:32:24,560 So I said, let me create a file that 561 00:32:24,560 --> 00:32:27,950 has some of the code and a different file 562 00:32:27,950 --> 00:32:30,110 that will then import it and use it. 563 00:32:30,110 --> 00:32:33,500 Cluster has things that are pretty much 564 00:32:33,500 --> 00:32:38,700 unrelated to this example, but just useful for clustering. 565 00:32:38,700 --> 00:32:44,970 So an example here has name, features, and label. 566 00:32:44,970 --> 00:32:47,740 And really, the only interesting thing in it-- 567 00:32:47,740 --> 00:32:50,880 and it's not that interesting-- is distance. 568 00:32:50,880 --> 00:32:54,990 And the fact that I'm using Minkowski with 2 569 00:32:54,990 --> 00:32:56,760 says we're using Euclidean distance. 570 00:33:02,290 --> 00:33:04,400 Class cluster. 571 00:33:04,400 --> 00:33:08,410 It's a lot more code to that one. 
572 00:33:08,410 --> 00:33:11,350 So we start with a non-empty list of examples. 573 00:33:11,350 --> 00:33:12,400 That's what init does. 574 00:33:12,400 --> 00:33:14,380 You can imagine what the code looks like, 575 00:33:14,380 --> 00:33:17,080 or you can look at it. 576 00:33:17,080 --> 00:33:25,580 Update is interesting in that it takes the cluster and examples 577 00:33:25,580 --> 00:33:35,550 and puts them into the cluster-- if you think of k-means, these are the examples 578 00:33:35,550 --> 00:33:38,640 that were closest to the cluster's previous centroid-- 579 00:33:38,640 --> 00:33:43,500 and then returns the amount the centroid has changed. 580 00:33:43,500 --> 00:33:45,700 So if the centroid has changed by 0, 581 00:33:45,700 --> 00:33:48,140 then you don't have to do anything, right? 582 00:33:48,140 --> 00:33:50,270 Creates the new cluster. 583 00:33:50,270 --> 00:33:54,050 And the most interesting thing is computeCentroid. 584 00:33:54,050 --> 00:33:55,430 And if you look at this code, you 585 00:33:55,430 --> 00:33:58,820 can see that I'm a slightly unreconstructed Python 2 586 00:33:58,820 --> 00:34:00,290 programmer. 587 00:34:00,290 --> 00:34:01,910 I just noticed this. 588 00:34:01,910 --> 00:34:04,610 I really shouldn't have written 0.0. 589 00:34:04,610 --> 00:34:08,420 I should have just written 0, but in Python 2, 590 00:34:08,420 --> 00:34:10,760 you had to write that 0.0. 591 00:34:10,760 --> 00:34:12,320 Sorry about that. 592 00:34:12,320 --> 00:34:15,449 Thought I'd fixed these. 593 00:34:15,449 --> 00:34:18,880 Anyway, so how do we compute the centroid? 594 00:34:18,880 --> 00:34:25,750 We start by creating an array of all 0s. 595 00:34:25,750 --> 00:34:30,350 The dimensionality is the number of features in the example. 596 00:34:30,350 --> 00:34:34,100 It's one of the methods from-- 597 00:34:34,100 --> 00:34:37,130 I didn't put up on the PowerPoint. 598 00:34:37,130 --> 00:34:40,310 And then for e in examples, I'm going 599 00:34:40,310 --> 00:34:47,790 to add to vals e.getFeatures, and then I'm 600 00:34:47,790 --> 00:34:52,860 just going to divide vals by the length of self.examples, 601 00:34:52,860 --> 00:34:54,480 the number of examples. 602 00:34:54,480 --> 00:34:59,480 So now you see why I made it a pylab array, or a numpy array 603 00:34:59,480 --> 00:35:02,180 rather than a list, so I could do 604 00:35:02,180 --> 00:35:07,890 nice things like divide the whole thing in one expression. 605 00:35:07,890 --> 00:35:10,350 As you do math, any kind of math things, 606 00:35:10,350 --> 00:35:14,010 you'll find these arrays are incredibly convenient. 607 00:35:14,010 --> 00:35:16,440 Rather than having to write recursive functions 608 00:35:16,440 --> 00:35:19,140 or do bunches of iterations, the fact 609 00:35:19,140 --> 00:35:23,820 that you can do it in one keystroke is incredibly nice. 610 00:35:23,820 --> 00:35:25,570 And then I'm going to return the centroid. 611 00:35:30,330 --> 00:35:33,565 Variability is exactly what we saw in the formula. 612 00:35:36,360 --> 00:35:39,690 And then just for fun, so you could see this, 613 00:35:39,690 --> 00:35:42,270 I used an iterator here. 614 00:35:42,270 --> 00:35:43,950 I don't know that any of you have used 615 00:35:43,950 --> 00:35:47,340 the yield statement in Python. 616 00:35:47,340 --> 00:35:48,480 I recommend it. 617 00:35:48,480 --> 00:35:50,500 It's very convenient.
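A rough sketch, under assumed names, of the centroid computation just described, together with a yield-based generator for walking over a cluster's examples (the actual course file differs in its details):

```python
import numpy as np

class Cluster(object):
    def __init__(self, examples):
        """examples: a non-empty list of objects with a getFeatures() method."""
        self.examples = examples
        self.centroid = self.computeCentroid()

    def computeCentroid(self):
        # Start with an array of zeros, one slot per feature...
        vals = np.array([0.0] * len(self.examples[0].getFeatures()))
        for e in self.examples:            # ...add up all the feature vectors...
            vals += e.getFeatures()
        return vals / len(self.examples)   # ...and divide by the number of examples.

    def members(self):
        # yield makes this a generator, so it can be iterated over like a list.
        for e in self.examples:
            yield e
```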
618 00:35:50,500 --> 00:35:52,740 One of the nice things about Python 619 00:35:52,740 --> 00:35:55,770 is almost anything that's built in, 620 00:35:55,770 --> 00:35:58,540 you can make your own version of it. 621 00:35:58,540 --> 00:36:04,470 And so once I've done this, if c is a cluster, 622 00:36:04,470 --> 00:36:11,320 I can now write something like for c in big C, 623 00:36:11,320 --> 00:36:17,740 and this will make it work just like iterating over a list. 624 00:36:17,740 --> 00:36:21,780 Right, so this makes it possible to iterate over it. 625 00:36:21,780 --> 00:36:24,360 If you haven't read about yield, you probably 626 00:36:24,360 --> 00:36:27,660 should read the probably about two paragraphs 627 00:36:27,660 --> 00:36:30,340 in the textbook explaining how it works, 628 00:36:30,340 --> 00:36:33,320 but it's very convenient. 629 00:36:33,320 --> 00:36:35,530 Dissimilarity we've already seen. 630 00:36:38,570 --> 00:36:41,870 All right, now we get to patients. 631 00:36:41,870 --> 00:36:48,300 This is in the file lec 12, lecture 12 dot py. 632 00:36:48,300 --> 00:36:51,810 In addition to importing the usual suspects of pylab 633 00:36:51,810 --> 00:36:57,260 and numpy, and probably it should import random too, 634 00:36:57,260 --> 00:37:01,550 it imports cluster, the one we just looked at. 635 00:37:04,160 --> 00:37:11,590 And so patient is a sub-type of cluster.Example. 636 00:37:11,590 --> 00:37:14,800 Then I'm going to define this interesting thing called 637 00:37:14,800 --> 00:37:18,330 scale attributes. 638 00:37:18,330 --> 00:37:21,720 So you might remember, in the last lecture 639 00:37:21,720 --> 00:37:25,680 when Professor Grimson was looking at these reptiles, 640 00:37:25,680 --> 00:37:28,770 he ran into this problem about alligators 641 00:37:28,770 --> 00:37:31,200 looking like chickens because they each have 642 00:37:31,200 --> 00:37:33,570 a large number of legs. 643 00:37:33,570 --> 00:37:37,330 And he said, well, what can we do to get around this? 644 00:37:37,330 --> 00:37:41,670 Well, we can represent the feature as a binary number. 645 00:37:41,670 --> 00:37:43,215 Has legs, doesn't have legs. 646 00:37:43,215 --> 00:37:45,210 0 or 1. 647 00:37:45,210 --> 00:37:47,940 And the problem he was dealing with 648 00:37:47,940 --> 00:37:51,860 is that when you have a feature vector 649 00:37:51,860 --> 00:37:55,910 and the dynamic range of some features 650 00:37:55,910 --> 00:37:59,210 is much greater than the others, they 651 00:37:59,210 --> 00:38:03,260 tend to dominate because the distances just look bigger when 652 00:38:03,260 --> 00:38:06,190 you get Euclidean distance. 653 00:38:06,190 --> 00:38:08,760 So for example, if we wanted to cluster the people 654 00:38:08,760 --> 00:38:13,980 in this room, and I had one feature that 655 00:38:13,980 --> 00:38:18,510 was, say, 1 for male and 0 for female, 656 00:38:18,510 --> 00:38:21,810 and another feature that was 1 for wears glasses, 657 00:38:21,810 --> 00:38:26,490 0 for doesn't wear glasses, and then a third feature which 658 00:38:26,490 --> 00:38:31,260 was weight, and I clustered them, 659 00:38:31,260 --> 00:38:33,240 well, weight would always completely 660 00:38:33,240 --> 00:38:36,690 dominate the Euclidean distance, right? 661 00:38:36,690 --> 00:38:39,030 Because the dynamic range of the weights in this 662 00:38:39,030 --> 00:38:45,450 room is much higher than the dynamic range of 0 to 1. 
663 00:38:45,450 --> 00:38:51,120 And so for the reptiles, he said, well, OK, we'll 664 00:38:51,120 --> 00:38:53,640 just make it a binary variable. 665 00:38:53,640 --> 00:38:55,410 But maybe we don't want to make weight 666 00:38:55,410 --> 00:38:58,170 a binary variable, because maybe it is something 667 00:38:58,170 --> 00:39:00,880 we want to take into account. 668 00:39:00,880 --> 00:39:04,350 So what we do is we scale it. 669 00:39:04,350 --> 00:39:09,090 So this is a method called z-scaling. 670 00:39:09,090 --> 00:39:14,280 More general than just making things 0 or 1. 671 00:39:14,280 --> 00:39:16,200 It's a simple code. 672 00:39:16,200 --> 00:39:22,240 It takes in all of the values of a specific feature 673 00:39:22,240 --> 00:39:26,030 and then performs some simple calculations, 674 00:39:26,030 --> 00:39:34,970 and when it's done, the resulting array it returns 675 00:39:34,970 --> 00:39:40,320 has a known mean and a known standard deviation. 676 00:39:40,320 --> 00:39:41,960 So what's the mean going to be? 677 00:39:41,960 --> 00:39:44,179 It's always going to be the same thing, independent 678 00:39:44,179 --> 00:39:45,095 of the initial values. 679 00:39:47,660 --> 00:39:48,920 Take a look at the code. 680 00:39:48,920 --> 00:39:50,510 Try and see if you can figure it out. 681 00:39:55,190 --> 00:39:57,970 Anybody want to take a guess at it? 682 00:39:57,970 --> 00:39:59,550 0. 683 00:39:59,550 --> 00:40:00,120 Right? 684 00:40:00,120 --> 00:40:04,160 So the mean will always be 0. 685 00:40:04,160 --> 00:40:07,040 And the standard deviation, a little harder to figure, 686 00:40:07,040 --> 00:40:08,330 but it will always be 1. 687 00:40:13,320 --> 00:40:13,820 OK? 688 00:40:13,820 --> 00:40:17,140 So it's done this scaling. 689 00:40:17,140 --> 00:40:22,160 This is a very common kind of scaling called z-scaling. 690 00:40:22,160 --> 00:40:25,150 The other way people scale is interpolate. 691 00:40:25,150 --> 00:40:29,440 They take the smallest value and call it 0, the biggest value, 692 00:40:29,440 --> 00:40:33,580 they call it 1, and then they do a linear interpolation 693 00:40:33,580 --> 00:40:36,230 of all the values between 0 and 1. 694 00:40:36,230 --> 00:40:39,570 So the range is 0 to 1. 695 00:40:39,570 --> 00:40:43,230 That's also very common. 696 00:40:43,230 --> 00:40:45,600 So this is a general way to get all 697 00:40:45,600 --> 00:40:48,836 of the features sort of in the same ballpark 698 00:40:48,836 --> 00:40:50,002 so that we can compare them. 699 00:40:53,100 --> 00:40:55,140 And we'll look at what happens when we scale 700 00:40:55,140 --> 00:40:57,480 and when we don't scale. 701 00:40:57,480 --> 00:41:01,200 And that's why my getData function has this parameter 702 00:41:01,200 --> 00:41:02,820 to scale. 703 00:41:02,820 --> 00:41:06,150 It either creates a set of examples with the attributes 704 00:41:06,150 --> 00:41:10,090 as initially or scaled. 705 00:41:10,090 --> 00:41:11,980 And then there's k-means. 706 00:41:11,980 --> 00:41:14,920 It's exactly the algorithm I showed you 707 00:41:14,920 --> 00:41:20,200 with one little wrinkle, which is this part. 708 00:41:20,200 --> 00:41:23,200 You don't want to end up with empty clusters. 709 00:41:23,200 --> 00:41:26,170 If I tell you I want four clusters, 710 00:41:26,170 --> 00:41:28,240 I don't mean I want three with examples 711 00:41:28,240 --> 00:41:30,390 and one that's empty, right? 712 00:41:30,390 --> 00:41:34,050 Because then I really don't have four clusters. 
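Before getting to that empty-cluster wrinkle, a minimal sketch of the two kinds of scaling just described (the course's own scaling code may differ):

```python
import numpy as np

def z_scale(vals):
    """Rescale so the result has mean 0 and standard deviation 1 (z-scaling)."""
    vals = np.asarray(vals, dtype=float)
    return (vals - vals.mean()) / vals.std()   # assumes the values are not all equal

def minmax_scale(vals):
    """Linearly interpolate so the smallest value maps to 0 and the largest to 1."""
    vals = np.asarray(vals, dtype=float)
    return (vals - vals.min()) / (vals.max() - vals.min())
```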
713 00:41:34,050 --> 00:41:36,840 And so this is one of multiple ways 714 00:41:36,840 --> 00:41:39,510 to avoid having empty clusters. 715 00:41:39,510 --> 00:41:41,470 Basically what I did here is say, 716 00:41:41,470 --> 00:41:44,640 well, I'm going to try a lot of different initial conditions. 717 00:41:44,640 --> 00:41:47,880 If one of them is so unlucky as to give me an empty cluster, 718 00:41:47,880 --> 00:41:51,550 I'm just going to skip it and go on to the next one 719 00:41:51,550 --> 00:41:55,892 by raising a value error, empty cluster. 720 00:41:55,892 --> 00:41:57,350 And if you look at the code, you'll 721 00:41:57,350 --> 00:42:00,450 see how this value error is used. 722 00:42:00,450 --> 00:42:02,690 And then try k-means. 723 00:42:02,690 --> 00:42:07,490 We'll call k-means numTrials times, each one getting 724 00:42:07,490 --> 00:42:11,060 a different set of initial centroids, 725 00:42:11,060 --> 00:42:13,550 and return the result with the lowest dissimilarity. 726 00:42:16,820 --> 00:42:23,090 Then I have various ways to examine the results. 727 00:42:23,090 --> 00:42:25,040 Nothing very interesting, and here's 728 00:42:25,040 --> 00:42:28,190 the key place where we're going to run the whole thing. 729 00:42:28,190 --> 00:42:31,970 We'll get the data, initially not scaling it, 730 00:42:31,970 --> 00:42:34,200 because remember, it defaults to true. 731 00:42:34,200 --> 00:42:38,770 Then initially, I'm only going to try one k. k equals 2. 732 00:42:38,770 --> 00:42:47,950 And we'll call testClustering with the patients. 733 00:42:47,950 --> 00:42:50,920 The number of clusters, k. 734 00:42:50,920 --> 00:42:53,770 I put in seed as a parameter here 735 00:42:53,770 --> 00:42:56,080 because I wanted to be able to play with it 736 00:42:56,080 --> 00:42:59,710 and make sure I got different things for 0 and 1 and 2 737 00:42:59,710 --> 00:43:01,630 just as a testing thing. 738 00:43:01,630 --> 00:43:06,230 And it's defaulting to five trials. 739 00:43:06,230 --> 00:43:12,480 And then we'll look at what testClustering 740 00:43:12,480 --> 00:43:17,100 returns: the fraction of positive examples 741 00:43:17,100 --> 00:43:19,780 for each cluster. 742 00:43:19,780 --> 00:43:21,730 OK? 743 00:43:21,730 --> 00:43:23,530 So let's see what happens when we run it. 744 00:43:39,690 --> 00:43:41,460 All right. 745 00:43:41,460 --> 00:43:43,710 So we got two clusters. 746 00:43:43,710 --> 00:43:49,590 Cluster of size 118 with .3305, and a cluster 747 00:43:49,590 --> 00:43:55,010 of size 132 with a positive fraction of .3333. 748 00:43:59,230 --> 00:44:03,230 Should we be happy? 749 00:44:03,230 --> 00:44:07,870 Does our clustering tell us anything, does it somehow 750 00:44:07,870 --> 00:44:13,220 correspond to the expected outcome for patients here? 751 00:44:13,220 --> 00:44:15,630 Probably not, right? 752 00:44:15,630 --> 00:44:18,600 Those numbers are pretty much indistinguishable 753 00:44:18,600 --> 00:44:20,280 statistically. 754 00:44:20,280 --> 00:44:23,070 And you'd have to guess that the fraction of positives 755 00:44:23,070 --> 00:44:26,544 in the whole population is around .33, right? 756 00:44:26,544 --> 00:44:27,960 That about a third of these people 757 00:44:27,960 --> 00:44:30,350 died of their heart attack. 758 00:44:30,350 --> 00:44:35,040 And I might as well have assigned them randomly 759 00:44:35,040 --> 00:44:36,584 to the two clusters, right?
760 00:44:36,584 --> 00:44:38,250 There's not much difference between this 761 00:44:38,250 --> 00:44:42,480 and what you would get with the random result. 762 00:44:42,480 --> 00:44:44,490 Well, why do we think that's true? 763 00:44:47,270 --> 00:44:49,550 Because I didn't scale, right? 764 00:44:49,550 --> 00:44:53,150 And so one of the issues we had to deal with 765 00:44:53,150 --> 00:44:56,760 is, well, age had a big dynamic range, 766 00:44:56,760 --> 00:45:02,300 and, say, ST elevation, which I told you was highly diagnostic, 767 00:45:02,300 --> 00:45:04,600 was either 0 or 1. 768 00:45:04,600 --> 00:45:06,280 And so probably everything is getting 769 00:45:06,280 --> 00:45:12,820 swamped by age or something else, right? 770 00:45:12,820 --> 00:45:17,350 All right, so we have an easy way to fix that. 771 00:45:17,350 --> 00:45:20,440 We'll just scale the data. 772 00:45:20,440 --> 00:45:21,670 Now let's see what we get. 773 00:45:26,660 --> 00:45:27,400 All right. 774 00:45:27,400 --> 00:45:31,140 That's interesting. 775 00:45:31,140 --> 00:45:33,090 With casting rule? 776 00:45:33,090 --> 00:45:35,600 Good grief. 777 00:45:35,600 --> 00:45:37,010 That caught me by surprise. 778 00:45:48,150 --> 00:45:51,360 Good thing I have the answers in PowerPoint to show you, 779 00:45:51,360 --> 00:45:53,236 because the code doesn't seem to be working. 780 00:46:00,190 --> 00:46:01,130 Try it once more. 781 00:46:05,310 --> 00:46:05,810 No. 782 00:46:05,810 --> 00:46:09,890 All right, well, in the interest of getting 783 00:46:09,890 --> 00:46:11,630 through this lecture on schedule, 784 00:46:11,630 --> 00:46:14,690 we'll go look at the results that we get-- 785 00:46:14,690 --> 00:46:16,291 I got last time I ran it. 786 00:46:20,281 --> 00:46:20,780 All right. 787 00:46:23,720 --> 00:46:32,110 When I scaled, what we see here is that now there is a pretty 788 00:46:32,110 --> 00:46:34,770 dramatic difference, right? 789 00:46:34,770 --> 00:46:37,170 One of the clusters has a much higher fraction 790 00:46:37,170 --> 00:46:43,030 of positive patients than others, 791 00:46:43,030 --> 00:46:46,910 but it's still a bit problematic. 792 00:46:46,910 --> 00:46:52,670 So this has pretty good specificity, 793 00:46:52,670 --> 00:46:57,275 or positive predictive value, but its sensitivity is lousy. 794 00:47:02,170 --> 00:47:06,640 Remember, a third of our initial population more or less, 795 00:47:06,640 --> 00:47:08,260 was positive. 796 00:47:08,260 --> 00:47:13,320 26 is way less than a third, so in fact I've 797 00:47:13,320 --> 00:47:18,690 got a class, a cluster, that is strongly enriched, 798 00:47:18,690 --> 00:47:23,250 but I'm still lumping most of the positive patients 799 00:47:23,250 --> 00:47:24,350 into the other cluster. 800 00:47:27,030 --> 00:47:31,790 And in fact, there are 83 positives. 801 00:47:31,790 --> 00:47:33,840 Wrote some code to do that. 802 00:47:33,840 --> 00:47:37,870 And so we see that of the 83 positives, 803 00:47:37,870 --> 00:47:41,800 only this class, which is 70% positive, 804 00:47:41,800 --> 00:47:44,710 only has 26 in it to start with it. 805 00:47:44,710 --> 00:47:48,980 So I'm clearly missing most of the positives. 806 00:47:48,980 --> 00:47:51,130 So why? 807 00:47:51,130 --> 00:47:54,640 Well, my hypothesis was that different subgroups 808 00:47:54,640 --> 00:47:58,852 of positive patients have different characteristics. 
809 00:48:01,590 --> 00:48:09,080 And so we could test this by trying other values of k 810 00:48:09,080 --> 00:48:11,570 to see with-- we would get more clusters. 811 00:48:11,570 --> 00:48:14,540 So here, I said, let's try k equals 2, 4, and 6. 812 00:48:18,090 --> 00:48:19,740 And here's what I got when I ran that. 813 00:48:23,870 --> 00:48:32,010 So what you'll notice here, as we get to, say, 4, that I have 814 00:48:32,010 --> 00:48:39,030 two clusters, this one and this one, 815 00:48:39,030 --> 00:48:43,230 which are heavily enriched with positive patients. 816 00:48:43,230 --> 00:48:49,530 26 as before in the first one, but 76 patients 817 00:48:49,530 --> 00:48:51,240 in the third one. 818 00:48:51,240 --> 00:48:55,560 So I'm now getting a much higher fraction of patients 819 00:48:55,560 --> 00:49:00,930 in one of the "risky" clusters. 820 00:49:00,930 --> 00:49:08,930 And I can continue to do that, but if I look at k equals 6, 821 00:49:08,930 --> 00:49:11,420 we now look at the positive clusters. 822 00:49:11,420 --> 00:49:15,560 There were three of them significantly positive. 823 00:49:15,560 --> 00:49:20,210 But I'm not really getting a lot more patients total, 824 00:49:20,210 --> 00:49:22,260 so maybe 4 is the right answer. 825 00:49:24,860 --> 00:49:29,470 So what you see here is that we have at least two parameters 826 00:49:29,470 --> 00:49:32,530 to play with, scaling and k. 827 00:49:32,530 --> 00:49:35,200 Even though I was only wanted a structure 828 00:49:35,200 --> 00:49:37,090 that would separate the risk-- 829 00:49:37,090 --> 00:49:39,640 high-risk patients from the lower-risk, 830 00:49:39,640 --> 00:49:45,140 which is why I started with 2, I later 831 00:49:45,140 --> 00:49:48,260 discovered that, in fact, there are multiple reasons 832 00:49:48,260 --> 00:49:50,390 for being high-risk. 833 00:49:50,390 --> 00:49:52,070 And so maybe one of these clusters 834 00:49:52,070 --> 00:49:54,800 is heavily enriched by old people. 835 00:49:54,800 --> 00:49:56,420 Maybe another one is heavily enriched 836 00:49:56,420 --> 00:50:00,500 by people who have had three heart attacks in the past, 837 00:50:00,500 --> 00:50:03,990 or ST elevation or some combination. 838 00:50:03,990 --> 00:50:05,540 And when I had only two clusters, 839 00:50:05,540 --> 00:50:08,640 I couldn't get that fine gradation. 840 00:50:08,640 --> 00:50:11,520 So this is what data scientists spend 841 00:50:11,520 --> 00:50:14,130 their time doing when they're doing clustering, 842 00:50:14,130 --> 00:50:17,970 is they actually have multiple parameters. 843 00:50:17,970 --> 00:50:19,770 They try different things out. 844 00:50:19,770 --> 00:50:22,020 They look at the results, and that's 845 00:50:22,020 --> 00:50:26,040 why you actually have to think to manipulate data rather 846 00:50:26,040 --> 00:50:28,860 than just push a button and wait for the answer. 847 00:50:28,860 --> 00:50:30,060 All right. 848 00:50:30,060 --> 00:50:34,350 More of this general topic on Wednesday 849 00:50:34,350 --> 00:50:37,440 when we're going to talk about classification. 850 00:50:37,440 --> 00:50:38,828 Thank you.