The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

AUDIENCE: OK. Number (2) -- (2.3). It says if the code [INAUDIBLE] 0 would be the [INAUDIBLE]. I thought it was, you're generating random values for that?

PROFESSOR: Yeah, you were, but if you look at what totes[0] is collecting -- so if you look at where it draws a random number, j is indexing into totes, right? So when j is 0, your standard deviation, which is also being indexed by j, is going to be 0. So you're always going to get the same value.

Any more questions? No? OK, that was easy.

So in lecture we were talking a lot about clustering. We've been talking about clustering for the past -- is it two lectures? And we had two different types of clustering methods. What were they?

AUDIENCE: Hierarchical and --

PROFESSOR: Hierarchical and k-means.

Can someone give me a rundown of what the steps are in hierarchical clustering?

AUDIENCE: Something that breaks everything down into one cluster [INAUDIBLE]

PROFESSOR: So let's say I have a bunch of data points. What would be the first step? You're going to first assign each point to a cluster, so each point gets its own cluster. And then the next step would be what?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Right, so you're going to find the two clusters that are closest to each other and merge them. So in this very contrived example, it would be these guys. And then you're going to keep doing that until you get to a certain number of clusters, right? So you merge these two, then you might merge these two, and you might merge these two, et cetera, et cetera, right?
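To make those steps concrete, here is a minimal sketch of that agglomerative procedure on plain 2D tuples, assuming single linkage and a fixed target number of clusters. The function names are made up for illustration; this is not the course's cluster code:

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points of any dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    # Cluster-to-cluster distance = closest pair of points across them.
    return min(euclidean(p, q) for p in c1 for q in c2)

def hierarchical_cluster(points, num_clusters):
    # Step 1: every point starts out as its own cluster.
    clusters = [[p] for p in points]
    # Step 2: repeatedly merge the two closest clusters
    # until only num_clusters remain.
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        # Remove the two old clusters and add the merged one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

print(hierarchical_cluster([(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)], 2))
```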
AUDIENCE: [INAUDIBLE]

PROFESSOR: So you're going to set the number of clusters that you want at the outset. So I guess, for the mammalian teeth example, the stopping criterion was two clusters, if I'm not mistaken.

So let's take a look at the code here that implements a hierarchical cluster. This is just some infrastructure code; it builds up the number of points. We have a cluster set class, which we'll go over in a second, and then for each point we're going to create a cluster object and add it to the cluster set.

Let's take a look at the cluster set. The cluster set has one attribute, the members attribute, and it just has a set of points -- or a set of clusters, actually. And the key method in here -- or the key methods -- are merge-1 and merge-n. Merge-n is what actually implements the clustering here. So you give it the distance metric that you're going to use for your points, the number of clusters that you want at the end of your clustering, the history tracker, and then you also tell it if you want to print out some debugging information -- which apparently is not used in this method. Oh, now it is, in merge-1.

Anyway, so while we have more clusters than the number of clusters we desire, we're going to keep iterating. And on each step we're going to call this function -- or method -- called merge-1 and just pass it the distance metric. And all merge-1 is going to do is: if there's only one cluster here, then it's just going to return None. If there are exactly two clusters, it's going to merge them. And if there are more than two clusters, it's going to find the two closest, according to the distance metric, and then merge those two. So the return value is going to be the two clusters it merged.

Let's look at the merge clusters code. All it does is take the two clusters and, for each point in both clusters, add it to a new list of points and create a new cluster from those points.
And then it removes the two clusters from members and adds the newly created cluster.

So then, the find closest method. What's this bit of code doing here?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Right, so we'll get to the metric in a second. So it initially looks at the first two members in the cluster set, sets minDistance to be their distance, and sets toMerge to be those two members. And then it iterates through every possible pair of clusters in this cluster set and finds the minimum distance according to the metric.

So let's look at the cluster class. All the cluster object -- or class -- does is hold a set of points. It knows the type of point that it's holding, because that becomes important when we talk about the different types of things that we want to cluster, and then it also has something called a centroid. All a centroid is, is just the middle of the cluster: you take all of the points and average their locations.

So these different functions just compute metrics about this particular cluster, right? So singleLinkageDist -- all this is going to do is find the minimum distance between every pair of points in the cluster. And what does maxLinkageDist do? I'm sorry, I'm mistaken: singleLinkageDist finds the minimum distance between a point in this cluster and a point in another cluster. I misspoke. So what does maxLinkageDist do?

AUDIENCE: [INAUDIBLE]

PROFESSOR: The opposite. I have to keep you talking or you'll fall asleep. And then, averageLinkageDist? Same thing. This is why having meaningful function names is important, because it helps you explain code.

So it also has this method in here called update, and what update does is it takes a new set of points and sets the points that this cluster has to be these new points. And then it computes a new centroid for this cluster. And the return value is the distance of the old centroid from the new centroid.
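Here is a hedged sketch of a cluster class along those lines, assuming points are plain numeric tuples (the course code wraps them in a Point class, and these method names are stand-ins):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class Cluster(object):
    def __init__(self, points):
        self.points = points
        self.centroid = self.compute_centroid()

    def compute_centroid(self):
        # Average each coordinate over all points in the cluster.
        dim = len(self.points[0])
        totals = [0.0] * dim
        for p in self.points:
            for i in range(dim):
                totals[i] += p[i]
        return tuple(t / len(self.points) for t in totals)

    def single_linkage_dist(self, other):
        # Minimum distance between a point here and a point in other.
        return min(euclidean(p, q) for p in self.points for q in other.points)

    def max_linkage_dist(self, other):
        # The opposite: maximum distance across the two clusters.
        return max(euclidean(p, q) for p in self.points for q in other.points)

    def update(self, points):
        # Replace this cluster's points, recompute the centroid,
        # and return how far the centroid moved.
        old_centroid = self.centroid
        self.points = points
        self.centroid = self.compute_centroid()
        return euclidean(old_centroid, self.centroid)

c = Cluster([(0, 0), (2, 2)])
print(c.centroid)                   # (1.0, 1.0)
print(c.update([(0, 0), (4, 4)]))   # centroid moves to (2.0, 2.0); prints ~1.414
```

The value update returns is exactly the "how far did the centroid move" delta that the k-means loop later keys off of.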
And this becomes important in some of the algorithms.

Then there's just some bookkeeping stuff here, like members will just give you all the points in this cluster. You all know what yield does, right?

AUDIENCE: [INAUDIBLE]

PROFESSOR: OK. So yield returns a generator object, which allows you to iterate over elements. So this was asked during the quiz review: what's the difference between range and xrange? Right. So if I use range, it actually returns a different type. So I can print out this list, right? In this case, it will print out the type of object it is. So this is accomplished using yield.

So if I wanted to write this myself, what this is going to do is return something called a generator object. And all it does is, instead of holding all the numbers in memory, it's going to return them one at a time to me. So when I use range here, it constructs a list and it has all of those integers in memory. If I use xrange, it's not going to hold all the integers in memory, but I can still iterate over them one at a time.

AUDIENCE: So within that function [INAUDIBLE] yield a bunch of times before the function, right?

PROFESSOR: Yeah.

AUDIENCE: How is that accomplished? Does it operate [INAUDIBLE] within the way you normally have functions [INAUDIBLE]?

PROFESSOR: Right. So what this tells Python is that when it sees a yield, it's sort of like a return, except it's telling Python that I want to come back to this location at some point. So a return just exits the function completely. What a yield does is it takes the value that is specified after yield and returns that value to the calling place in the program. But then, when it comes time to get a new value, it'll return back to where this yield exited. So a way of seeing this is, if I iterate over my xrange, each time it needs a new value, it's going to go back inside this function and grab it.
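The demo itself isn't captured in the transcript, so here is a sketch of what writing xrange yourself with yield might look like (Python 2 vintage, matching the course; in Python 3, range is already lazy):

```python
def my_xrange(n):
    # A generator function: calling it runs no code yet,
    # it just returns a generator object.
    i = 0
    while i < n:
        # yield hands back one value, then suspends here until
        # the caller asks for the next one.
        yield i
        i += 1

print(type(my_xrange(5)))   # <type 'generator'> in Python 2
for x in my_xrange(5):      # resumes the function body once per value
    print(x)
```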
So it looks like a function. But what it's actually doing is creating what's called a generator object, and it has these special methods for getting the next value. So it's some nice syntactic sugar. But it's pretty neat. That's what's going on with this yield statement here: instead of returning the entire list of points, or instead of doing it in some other way, all it's doing is yielding each point one at a time so that you can iterate over them.

So what else? Here's a method for computing the centroid. All we're going to do is total up where each point is and then take the average over all the points. Does that make sense? All right.

So the example we saw was mammal teeth. And the way that's accomplished in this set of code is we're going to define a subclass of a class, Point, called mammal. What Point does is it has a name for a given data point, it has a set of attributes, and then you can also give it some normalized attributes. If you don't give it the normalized attributes, it'll just use the original attributes. This becomes important when we do scaling over data, which we'll do shortly.

So there's nothing really special about it except for this distance function. It's just defining the Euclidean distance for a given multi-dimensional point. Everyone knows that if you have a point in two dimensions, an xy point, then the distance is just the square root of x-squared plus y-squared. It generalizes to higher dimensions, if you weren't already aware. So if I want to find the straight-line distance for a point in 3D, it's just going to be the square root of x-squared plus y-squared plus z-squared. That's all. And then, so on and so forth.
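In symbols, the distance being described for two $n$-dimensional points $p$ and $q$ is the standard Euclidean distance:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2},$$

which reduces to $\sqrt{x^2 + y^2}$ in 2D and $\sqrt{x^2 + y^2 + z^2}$ in 3D.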
So all the mammal class does is subclass Point, and it has this function, scaleFeatures. What scaleFeatures does is take a key. In this case, we have defined two ways of scaling this data, of scaling this point. We have the identity, which is just going to leave every point alone. And then we have this 1 over max, which is going to scale each attribute by the maximum value in this data set. And if we look at the data set, we know that our max value is 6. You could compute that automatically, but in this case we're using prior knowledge of the data set that we have.

So why don't we do a cluster? This is going to do a hierarchical cluster, right? And if I just specify the default parameters, all it's going to do is look for two clusters, use the identity scaling, and print out the history -- when it's performed the different merges. Unless I have extraneous code that I'm already running.

So what starts off first is we get a lot of merges with just these single-element clusters, right? I have a beaver with a groundhog, so I guess they're pretty similar in terms of teeth. We have a squirrel with a porcupine, a wolf with a bear. Eventually, though, we start finding clusters -- a wolf and a bear, I guess, are more similar, but they're also similar to a dog. So we're going to start merging multi-point clusters.

So we start seeing the beaver and groundhog cluster get merged with the squirrel and porcupine cluster. If you were to visualize this, the reason why it's called hierarchical clustering is -- which one did I say? Beaver, groundhog -- these guys have been merged into a cluster, right? They started out as their own clusters, and they've been merged into one cluster. And then the grey squirrel and the porcupine, same thing: they started off with their own clusters at the beginning, and they got merged. And now what this step is saying is that these two clusters get merged. So we're building this tree, or hierarchy. That's where the hierarchical comes from.
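Before moving on, here is a hedged sketch of that scaleFeatures idea. The names and the dictionary dispatch are mine, not the problem set's; 6 is the known maximum value in the mammal-teeth data, as mentioned above:

```python
def identity(attrs):
    # Leave every attribute alone.
    return attrs

def scale_by_max(attrs, max_val=6.0):
    # Scale each attribute by 1/max, using prior knowledge
    # that 6 is the largest value in this data set.
    return [a / max_val for a in attrs]

SCALINGS = {'identity': identity, '1/max': scale_by_max}

print(SCALINGS['1/max']([3, 1, 4, 2]))  # [0.5, 0.1666..., 0.6666..., 0.3333...]
```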
So we use hierarchical clustering a lot in other fields. In speech recognition, we can do a hierarchical clustering of speech sounds. So if I have, say, different vowels, and maybe a couple of consonants, I would expect to see, say, these kinds clustered together first. And so what I might see is, these would be fricatives, but then I might have some stops, like "t" and "b", that get merged first. So it's a way of making these generalized groupings at different levels.

I don't know. Does anyone have any real questions about hierarchical clustering? So should I move on to k-means? All right.

So what's the general idea with k-means? I start off with a set of data points. What's my first step?

AUDIENCE: Choose your total number of clusters?

PROFESSOR: Right, so I'm going to choose a k. So let's say for giggles we're going to choose k equals 3. And then, what's my next step?

AUDIENCE: Choose k's [INAUDIBLE]?

PROFESSOR: So we're going to pick k random points from our data set. All right, and then, what do I do?

AUDIENCE: Cluster?

PROFESSOR: Then you'd cluster. Yeah, all right. So after we've chosen our three centroids here, these become our clusters, right? And we're going to look at each point, and we're going to figure out which cluster it's closest to. So in this case, this is going to be a pretty easy clustering: all these points are going to belong here, all these points are going to belong here, and all these points are going to belong here, right? And then we're going to update our centroid for each of these clusters. And there's going to be a distance that the centroid moves each time we update it. So in this case, the centroid moved quite a bit, right? Then we're going to find the maximum distance that the centroid moved.
And if it's below a certain cutoff value, then we're going to say, I've got a good enough clustering. If it's above a certain cutoff value, then what I'm going to say is, this centroid moved quite a bit for this cluster, right? So I'm going to try another iteration. I'm going to say, for each one of these points, I'm now going to look and try to find the closest cluster that it belongs to based on these new centroids. And in this case, nothing's really going to change, so all of the deltas for all of the centroids are going to stay the same. So it's going to be below the cutoff value, and it's going to stop.

So what's an advantage of k-means over hierarchical clustering?

AUDIENCE: More efficient?

PROFESSOR: Yeah. So let's say that I have a million points. If I were to hierarchically cluster these, that means I'd start off with a million clusters. And in each iteration, I'm just going to reduce that by 1 -- down to, I don't know, let's say 3, OK? So on each iteration, we're just reducing it by 1, which, if that's all we were doing, would not be so hard. It doesn't take too long to count down from a million on a computer. But on each one of these steps, we have to compute the pairwise distance between each pair of clusters. So it's going to be about n times n minus 1 comparisons on each step which, in this first case, works out to a lot -- approximately, right? And it doesn't get much better as we go down. With k-means, what happens is, if we have a million points and we have k clusters, we just have to perform k times 1 million comparisons on each step, because for each point we need to find the closest centroid, approximately. So the upshot is that k-means winds up being a lot more efficient on each iteration, which is why, if you have a large number of points, you might want to choose k-means over hierarchical clustering.
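To put rough numbers on that (my arithmetic, not the lecture's): with n = 1,000,000 points, the first hierarchical merge already requires on the order of n(n-1)/2, or about 5 x 10^11, pairwise distances, while a single k-means iteration with k = 3 needs only k times n = 3 x 10^6 point-to-centroid distances. That's roughly five orders of magnitude fewer, and the gap persists on every later step.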
What's an advantage, though, of hierarchical clustering over k-means? Even though it's less efficient, what's another --

AUDIENCE: [INAUDIBLE]

PROFESSOR: What's that?

AUDIENCE: More thorough.

PROFESSOR: More thorough.

AUDIENCE: And you can get a lot of different levels that you can look at.

PROFESSOR: Yeah, you can get a lot of different levels. So you can look at the clusterings from different perspectives. But the key thing with --

AUDIENCE: You don't necessarily know how many clusters there actually are. Hierarchical clustering will tell you all of the [INAUDIBLE]. You can just go down the tree and --

PROFESSOR: Right, so you could go down different levels of the tree and pick however many clusters you want. But the big reason -- or one of the main advantages that hierarchical clustering has over k-means -- is that k-means is random. It's non-deterministic. Hierarchical clustering is deterministic: it's always going to give you the same result. With k-means, because your initial starting conditions are random -- because you're choosing k random points -- the end result will be different each time. And so when we do k-means clustering, this means that we don't necessarily want to do it just once. If we choose k equals 3, we might want to do five different k-means clusterings and take the best one. So that's one of the big points with k-means.

There's a degenerate condition with k-means. So if my stopping criterion is that the centroid doesn't move, what's a really easy way to make the centroid not move by choosing k? What was it? k equals n, right? So if I have n points and k equals n, then all of my points are going to be their own cluster. And every time I update, I'm never going to move my centroid.
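One way to put that degeneracy in symbols: if $\mu_j$ is the centroid of cluster $C_j$, a total clustering error can be defined as

$$E = \sum_{j=1}^{k} \sum_{p \in C_j} \lVert p - \mu_j \rVert^2.$$

With k = n, every point is its own centroid, so every term is 0 and E = 0 -- a perfect score from a useless clustering. (This is one common form; the problem set's version may normalize differently, as noted next.)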
So in your problem set, you're going to be asked to compute a standard error for each of the clusters and a total error for the entire clustering. What that is, is: I'm going to take the centroid for each cluster, and I'm going to find the distance from each point in the cluster to the centroid. And then I'm going to sum up all of those distances over the entire clustering. That's going to give me my error. I'm not sure if that equation's totally right -- there might be a bit of division in there. But the general idea, what I'm trying to emphasize, is that we can reduce this number just by increasing k. And if we make k equal to n, then this is going to be 0. So like I was saying with statistics, you never want to trust just one number. With k-means, you never want to trust just one clustering or one measurement of error. You want to look at it from multiple perspectives and vantage points.

So why don't we look at the code and try to match up all of that stuff with what you'll see on your problem set? The big function to look at for k-means is aptly named k-means. And it's going to take a set of points, a number of clusters, a cutoff value, a point type, and a variable named maxIters. So first step, we get our initial centroids: all we're going to do is sample our points randomly and choose k of them. Our clusters we're going to represent as a list, and for each of the points in the initial centroids, we're going to add a cluster with just that point. And then we get into our loop here. What this is saying is, while our biggest change in centroid is greater than the cutoff and we haven't exceeded the maximum number of iterations, we're going to keep trying to refine our clustering. So that brings up a point I actually failed to mention.
Why should we have this cutoff point, maxIters?

AUDIENCE: It'll go forever?

PROFESSOR: Yeah, there's a chance that, if our cutoff value is too small, or we have a point that's on a border and likes to jump between clusters and move the centroid just above the cutoff point, we'll never converge to our cutoff. And so we want to set up a secondary break. So we have this maxIters, which defaults to 100. With this setup, though, there are a couple of things you have to consider. One, you need to make sure that maxIters is not too small, because if it's too small, you're not going to converge. And you don't want to make it too large, because then your algorithm will just take forever to run, right? Likewise, you don't want to make your cutoff too small, either. So sometimes you have to play around with the algorithm to figure out what the best parameters are. And that's oftentimes more of an art than a hard science.

So anyway, continuing on. For each iteration, we're going to set up a new set of clusters, and we're going to set them initially to have no points in them. And then, for all the points in our data set, we are going to look for the smallest distance. So that means we're going to initially set our smallest distance to be the distance from the point to the first centroid, and then we're just going to iterate through all the centroids -- or through all the clusters -- and find the smallest distance. Is anyone lost by that? Make sense?

Once we find that, we're just going to add that point to the new clusters. And then we're going to go through our update. We're going to iterate through each of our clusters, and we are going to update the points in the cluster. So remember, the update method sets the points of the cluster to be this new set of points you've given it, and it also updates the centroid and returns the delta between the old centroid and the new centroid.
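Putting those pieces together, here is a minimal sketch of that k-means loop, reusing the euclidean helper and hypothetical Cluster class sketched earlier. Parameter names like max_iters stand in for the real maxIters:

```python
import random

def kmeans(points, k, cutoff, max_iters=100):
    # Step 1: choose k random points as the initial one-point clusters.
    clusters = [Cluster([p]) for p in random.sample(points, k)]

    num_iters = 0
    max_change = cutoff + 1.0  # force at least one pass through the loop
    while max_change > cutoff and num_iters < max_iters:
        # Fresh, empty point lists, one per cluster.
        new_points = [[] for _ in clusters]
        # Assign each point to the cluster with the nearest centroid.
        for p in points:
            best = min(range(k), key=lambda i: euclidean(p, clusters[i].centroid))
            new_points[best].append(p)
        # Update every cluster; track the biggest centroid move.
        max_change = 0.0
        for i, c in enumerate(clusters):
            if new_points[i]:  # in this sketch, an empty cluster keeps its old points
                max_change = max(max_change, c.update(new_points[i]))
        num_iters += 1
    return clusters
```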
So that's where this change variable is coming from. And then we're just going to look for the biggest change, right? And if, at some point in our clustering, the centroids have stabilized and our clusters are relatively stationary, then our max change will be small, and it'll wind up terminating the algorithm.

And all this function does is, once it's converged or it's gone through the maximum number of iterations, it's just going to find the maximum distance of a point to its centroid. So it's going to look for the point that has the maximum distance from its corresponding centroid, and that's going to be the coherence of this clustering. And then it's just going to return a tuple containing the clusters and the maximum distance.

So it's not a hard algorithm to understand, and it's pretty simple to implement. Are there any real questions about what's going on here?

So the example that he went over in lecture with k-means was this counties clustering example. We had a bunch of data about different counties in the US, and we just played around with clustering them and seeing what we got. So say we made five clusters, we clustered on all the features, and we wanted to see what the distribution would be for, say, incomes. What this function, test, is going to do is take a k, a cutoff, and a number of trials. So remember, we said that because k-means is non-deterministic, we're going to want to run it a number of times: maybe we get a bad initial set of points for our centroids, or for our clusters, and that gives us a bad clustering. So we're going to run it a number of times and try to prevent that from happening.

AUDIENCE: How do we [INAUDIBLE] multiple runs, because [INAUDIBLE] really different clustering happens [INAUDIBLE] after you run it a couple of times?

PROFESSOR: It can be tricky, to be honest. One technique you could use would be to have a training set and a development set. What I mean by that is, you perform a clustering on the training set.
And then you take the development set, and you figure out which clusters its points belong to. And then you measure the error of that development set. So once you've assigned these development points to the clusters, you measure the distance to the centroid, you sum up the squared distances, and you sum up over all the clusters. Then, if you do that a number of times, what you would do is choose the clustering that gave you the smallest error on your development set. And then you'd say, that's probably my best clustering for this data. So that's one way of doing it. There are multiple ways of skinning the cat -- I was trying to think of a good aphorism.

And actually, that's what's on your problem set. One of the problems on your problem set is to cluster based on a holdout set and see what the effect of the error is on this holdout set. So did that answer your question?

AUDIENCE: Yeah.

PROFESSOR: OK. And then, for choosing k, there are different methods for doing that, too. A lot of it is, you run a lot of experiments and you see what you get. This is where the research part comes in for a lot of applications. You can also try some other automatic methods, like entropy or other, more complicated measurements of error. But don't worry about those. For our purposes, if you get below the cutoff value, you run a number of iterations, and you've minimized your error on your test set, we'll be happy. We want you to be familiar with k-means, but not experts in it -- it's a useful tool for your kit.

Anyway, so all this code is going to do is run a number of trials, perform k-means clustering, and look for the clustering with the smallest maximum distance. So remember, the return value of k-means includes the maximum distance from a point to its centroid. We're going to define our best clustering as the clustering that gives us the smallest max distance.
Yeah, and that's all we're going to do for this bit of code here. We're going to find the average income in each cluster and draw a histogram. So I think this is actually done. So we have five clusters. And what they're showing us is that, if we take the average income of the different clusters, they're going to be centered at these average incomes. Let's see what some other plots look like.

This set of examples is using a point type called county. And county, like the mammal class, inherits from Point. And it defines a set of features that can be used -- or a set of, it calls them filters. But basically, if you pass it one of these filter names, like allEducation, noEducational, wealthOnly, noWealth, what this is doing is selecting this tuple of tuples. So if I say wealthOnly, it's going to use this tuple as a filter. And each element in this tuple of tuples is a tuple that has the name of the attribute and whether or not it should be used in the clustering. So if it has a 1, it should be used; if it has a 0, it shouldn't be used. If we look at how that's applied, it'll get a filterSpec, and then it'll just set an attribute called atrFilter. And if we see where that's used, it's going to iterate through all of the attributes, and if the given attribute has a 1, then it's going to include it in the set of features that are used in the distance computation.

So did that make sense at all? No? All right.

So the idea is to illustrate that if you use different features when you're doing your clustering, you'll probably attain different clusterings. In that first example I showed, we used all the features. But in this example, we are going to look at only wealth -- a sketch of how such a filter might work follows.
And that includes these features: if we look at this set of filters here, it's going to include the home value, the income, and the poverty. And then all the other attributes, like population change, it's not going to include, so those aren't going to be used in the clusterings. So this is going to change what our clusters look like. So if we look at what we have for our clusters -- well, that's not very clear. Yeah, so let's see what happens. I'm not sure that's really going to show us. Probably better if I show them all at once, right? So this will take a while.

Yes?

AUDIENCE: Actually, I had just one question [INAUDIBLE] but there is a method showing us [INAUDIBLE] iterative additionally?

PROFESSOR: Mm-hm.

AUDIENCE: [INAUDIBLE]

PROFESSOR: It's like iterValues or iterKeys, yeah.

AUDIENCE: I was wondering how to go about using that in actual code.

PROFESSOR: In actual code? So you know that if you have a dictionary, d, you can do d.keys(). So remember I was demoing that code, the difference between range and xrange? Same thing. d.keys() is going to return an actual list of all the keys, right? What d.iterkeys() returns is a generator object that gives you, one by one, each key in the dictionary. So a lot of times, you guys in your code will use something like for k in d.keys(): to iterate through all the keys in the dictionary. What the keys method does, though, is create an actual copy of that list of keys, right? So when you call d.keys(), if you have a lot of keys in your dictionary, it's going to go one by one through each of those keys, add it to a list, and then return it to you. What d.iterkeys() does is skip that going one by one and adding to a new list. It just gives you a generator object which, when you use it in a for loop, is going to yield the keys one at a time without creating a separate list.
That make sense?

AUDIENCE: So will it be more efficient?

PROFESSOR: Yeah, it's generally more efficient. And then there's also, I think, iterValues, which goes through each of the values. And then, I think there's iterItems. And I think, if I'm not mistaken, if you do something just like that -- for k, v in d.iteritems() -- this is going to iterate through tuples that contain each key and the values associated with that key. And it's equivalent to doing this. Make sense?

AUDIENCE: Yeah.

PROFESSOR: So where were we here? Oh, maybe I shouldn't have done this all at once. Why don't we just look at two? Why don't we take a look at what the average incomes -- what the clustering gives us for average incomes -- for education versus no education. That's probably not going to be a very good comparison, just doing five trials and two trials for each clustering. So k-means is the efficient one, which means that hierarchical would take a long, long time. There we go.

So these are the average incomes if we cluster with k equals 50 on education. And then there should be another one. I didn't create the new figure. So apparently there's a bug in the code. I wanted to show the two plots side by side so you could see the differences. Because what you should see is, we would see a different distribution in incomes among the clusters if we clustered based on no education versus education level. But my code is buggy and not working the way I expected it to. So I apologize.