The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu

PROFESSOR: Good morning. Oh, it's so nice to get a response. Thank you. I appreciate it. I have a confession to make. I stopped at my usual candy store this morning and they didn't have any. So I am bereft of anything other than crummy little Tootsie Rolls for today. But I promise I'll have a new supply by Thursday.

We ended the last lecture looking at pseudocode for k-means clustering and talking a little bit about the whole idea and what it's doing. So I want to start today by moving from the pseudocode to some real code. This was on a previous handout, but it's also on today's handout. So let's look at it. Not so surprisingly, I've chosen to call it k-means. And you'll notice that it's got some arguments.
The points to be clustered; k, and that's an interesting question. Unlike hierarchical clustering, where we could run it and get what's called a dendrogram and stop at any level and see what we liked, k-means involves knowing at the very beginning how many clusters we want. We'll talk a little bit about how we could choose k.

A cutoff. What the cutoff is doing: you may recall that in the pseudocode k-means was iterative, and we keep re-clustering until the change is small enough that we feel it's stable. That is to say, the new clusters are not that much different from the old clusters. The cutoff is the definition of what we mean by small enough. We'll see how that gets used.

The type of point to be clustered. The maximum number of iterations. There's no guarantee that things will converge. As we'll see, they usually converge very quickly in a small number of iterations. But it's prudent to have something like this just in case things go awry. And to print.
That's just my usual trick of being able to print some debugging information if I need it, but not getting buried in output if I don't.

All right. Let's look at the code. It very much follows the outline of the pseudocode we started with last time. We're going to start by choosing k initial centroids at random. So I'm just going to go and take all the points I have. And I'm assuming, by the way, I should have written this down probably, that I have at least k points. Otherwise it doesn't make much sense. If you have 10 points, you're not going to find 100 clusters. So I'll take k random centroids, and those will be my initial centroids. There are more sophisticated ways of choosing centroids, as discussed in the problem set, but most of the time people just choose them at random, because at least if you do it repetitively it guards against some sort of systematic error.

Whoa. What happened? I see. Come back. Thank you. All right.
Then I'm going to say that the clusters I have initially are empty. And then I'm going to create a bunch of singleton clusters, one for each centroid. So all of this is just the initialization, getting things going. I haven't had any iterations yet. And the biggest change so far I'm just setting arbitrarily to the cutoff.

All right. And now I'm going to iterate until the change is smaller than the cutoff: while the biggest change is at least the cutoff and, just in case, numIters is less than the maximum. I'm going to create a list containing k empty lists. So these are the new clusters. And then I'm going to go through, for i in range k, and append an empty cluster. These are going to be the new ones. And then for p in all the points, I'm going to find the centroid in the existing clustering that's closest to p. That's what's going on here. Once I've found that, I'm going to add p to the correct cluster, and go and do it for the next point. Then when I'm done, I'm going to compare the new clustering to the old clustering and get the biggest change. And then go back and do it again.
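The loop just described can be sketched in a few lines. This is not the actual handout code, which works with Point and Cluster classes; it's a simplified stand-in, assuming points are tuples of floats and distance is plain Euclidean:

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(cluster):
    # Component-wise mean of the points in the cluster
    return tuple(sum(vals) / len(vals) for vals in zip(*cluster))

def k_means(points, k, cutoff, max_iters):
    # Choose k initial centroids at random (assumes len(points) >= k)
    centroids = random.sample(points, k)
    # Setting the biggest change to the cutoff guarantees at least one pass
    biggest_change = cutoff
    num_iters = 0
    while biggest_change >= cutoff and num_iters < max_iters:
        # k new, empty clusters
        clusters = [[] for _ in range(k)]
        # Assign each point to the closest current centroid
        for p in points:
            closest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[closest].append(p)
        # Recompute centroids; the biggest change is how far any centroid moved
        biggest_change = 0.0
        for i in range(k):
            if clusters[i]:
                new_c = centroid(clusters[i])
                biggest_change = max(biggest_change,
                                     euclidean(centroids[i], new_c))
                centroids[i] = new_c
        num_iters += 1
    return clusters, num_iters
```

As in the lecture's version, the loop stops either when no centroid moves by at least the cutoff or when the iteration budget runs out.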
All right? People understand that basic structure, and even some of the details of the code? It's not very complicated. But if you haven't seen it before, it can be a little bit tricky.

When I'm done, I'm going to just get some statistics here about the clusters. I'm going to keep track of the number of iterations and the maximum diameter of a cluster, so the cluster in which things are least tightly grouped. And this will give me an indication of how good a clustering I have. OK? Does that make sense to everybody? Any questions about the k-means code?

Well, before we use it, let's look at how we use it. I've written this function testOne that uses it. Some arbitrary values for k and the cutoff. The number of trials is kind of boring here. I've only said one is the default, and I've set print steps to false. The thing I want you to notice here: because I'm choosing the initial clustering at random, I can get different results each time I run this. Because of that, I might want to run it many times and choose the quote, "best clustering."
What metric am I using for best clustering? It's a minmax metric. I'm choosing the minimum of the maximum diameters. So I'm finding the worst cluster and trying to make that as good as I can make it. You could look at the average cluster. This is like the linkage distances we talked about before.

That's the normal kind of thing. It's like when we did Monte Carlo simulations or random walks, flipping coins. You do a lot of trials, and then you can either average over the trials, which wouldn't make sense for the clustering, or select the trial that has some property you like. This is the way people usually use k-means. Typically they may do 100 trials and choose the best, the one that gives them the best clustering.

Let's look at this, and let's try it for a couple of examples here. Let's start it up. And we'll just run testOne on our old mammal teeth database. We get some clustering. Now we'll run it again. We get a clustering. I don't know, is it the same clustering? Kind of looks like it is. No reason to suspect it would be.
We run it again. Well, you know, this is very unfortunate. It's supposed to give different answers here, because it often does. I think they're the same answers, though. Aren't they? Yes? Anyone see a difference? No, they're the same. How unlucky can you be? Every time I ran it at my desk, it came up the first two times with different things. But take my word for it, and we'll see it with other examples: it could come out with different answers.

Let's try it with some printing on. We get some things here. Let's try it. What have we got out of this one? All right. Oh, well. Sometimes you get lucky and sometimes you get unlucky with randomness. All right.

So, why did we start with k-means? Not because we needed it for the mammals' teeth. The hierarchical clustering worked fine. But it was too slow when we tried to look at something big, like the counties. So now let's move on and talk about clustering the counties. We'll use exactly the same k-means code.
That's one of the reasons we're allowed to pass in the point type as an argument. But the interesting thing will be what we do for the counties themselves. This gets a little complicated. In particular, what I've added to the counties is this notion of a filter. The reason I've done this is, as we've seen before, the choice of features can make a big difference in what clustering you get. I didn't want to do a lot of typing as we do these examples, so what I did is I created a bunch of filters. For example, no wealth, which says, all right, we're not going to look at home value. We're giving that a weight of 0. We're giving income a weight of 0, we're giving poverty level a weight of 0. But we're giving the population a weight of 1, et cetera.

OK. What we see here is each filter supplies the weight, in this case either 0 or 1, to a feature. This will allow me, as we go forward, to run some experiments with different features. All features: everything has a weight of 1. I made a mistake, though. That should have been a 1.
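To make the weighting idea concrete, here is a small sketch of how a 0/1 filter can be folded into the distance computation. The filter dictionaries and feature names below are invented for illustration; they are not the handout's actual tables:

```python
import math

# Hypothetical 0/1 filters keyed by feature name, in the spirit of the
# lecture's "no wealth" filter: wealth-related features get weight 0.
FILTERS = {
    'all features': {'home value': 1, 'income': 1, 'poverty': 1, 'population': 1},
    'no wealth':    {'home value': 0, 'income': 0, 'poverty': 0, 'population': 1},
}

def filtered_distance(p, q, feature_names, weights):
    # Euclidean distance that simply ignores features whose weight is 0
    total = 0.0
    for name, a, b in zip(feature_names, p, q):
        total += weights[name] * (a - b) ** 2
    return math.sqrt(total)
```

Weights other than 0 and 1 work the same way, which is how you could make some features count more than others rather than just switching them off.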
Then I have filter names, which is just a dictionary. And that'll make it easy for me to run various kinds of tests with different filters.

Then I've got init, which takes as its arguments the things you would expect, plus the filter name. So it takes the original attributes, the normalized attributes. And you will recall why we need to normalize attributes. If we don't, we have something like population, which can number in the millions, and we're comparing it to percent female, which we know cannot be more than 100. So the small values become totally dominated by the big absolute values, and when we run any clustering it ends up only looking at population, or number of farm acres, or something that's big, that has a big dynamic range. Manhattan has no farm acres. Some county in Iowa has a lot. Maybe they're identical in every other respect. Unlikely, but who knows? Except I guess there are no baseball teams in Iowa. But at any rate, we always scale, or we try to normalize, so that we don't get fooled by that.
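One standard way to do that scaling, shown here as a sketch rather than as the handout's exact code, is z-scaling: shift each feature to mean 0 and divide by its standard deviation, so population and percent female end up on comparable scales:

```python
def z_scale(values):
    """Return the values scaled to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    if sd == 0:
        # A constant feature carries no information; map it all to 0
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]
```

After scaling, a county with a huge population no longer dominates the distance just because its raw numbers are big.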
Then I go through and, if I haven't already, set this class variable, attribute filter, which is initially set to none. Not an instance variable, but a class variable. And what we see here is, if that class variable is still none, which means it's the first time we've generated a point of type county, then what we're going to do is set up the filter to only look at the attributes we care about. So only the attributes which have a weight of 1. And then I'm going to override distance from class point to look at only the features we care about.

OK. Does this basic structure and idea make sense to people? It should. I hope it does, because the current problem set requires you to understand it; you all will be doing some experiments with it.

So now I want to do some experiments with it. I'm not going to spend too much time, even though it would be fun, because I don't want to deprive you of the fun of doing your problem sets. So let's look at an example. I've got test, which is pretty much like testOne. It runs k-means a number of times and chooses the best. And we can start.
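The run-several-times-and-keep-the-best idea, with the min-max metric from earlier, can be sketched generically. The trial argument below stands in for one randomized run of k-means returning a list of clusters; this is an illustration, not the handout's test function:

```python
import math

def diameter(cluster):
    # Max pairwise distance within a cluster: how loosely grouped it is
    return max((math.dist(p, q) for p in cluster for q in cluster),
               default=0.0)

def best_clustering(trial, num_trials):
    """Run `trial` num_trials times; keep the clustering whose *worst*
    (largest-diameter) cluster is as small as possible."""
    best, best_score = None, float('inf')
    for _ in range(num_trials):
        clusters = trial()
        score = max(diameter(c) for c in clusters)
        if score < best_score:
            best, best_score = clusters, score
    return best, best_score
```

(math.dist needs Python 3.8 or later.) Averaging the trials, as one would for a coin-flipping simulation, makes no sense for clusterings; selecting the trial with the best score does.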
Well, let's start by running some examples ourselves. So I'm going to start by clustering on education level only. I'm going to get 20 clusters, 20 chosen just so it wouldn't take too long to run, and we'll filter on education. And we'll see what we get.

Well, I should probably have done more than one trial just to make it work. But we've got it, and just for fun I'm keeping track of which cluster Middlesex County, the county MIT is in, shows up in. So we can see that it's similar to a bunch of other counties. And it happens to have an average income of $28,665, or at least it did then.

And if we look, we should also see-- no, let me go back. I foolishly didn't uncomment pyLab.show. So we'd better go back and do that. Well, we're just going to nuke it and run it again, because it's easy and I wanted to run it with a couple of trials anyway. So, we'll first do the clustering. We get cluster 0. Now we're getting a second one. It's going to choose whichever was the tightest. And we'll see that that's what it looks like.
So we've now clustered the counties based on education level, no other features. And we see that it's got some interesting properties. There is a small number of clusters out here near the right side with high income. And, in fact, we'll see that we are fortunate to be in that cluster, one of the clusters that contains wealthy counties. And you could look at it and see whether you recognize any of the other counties that hang out with Middlesex. Things like Marin County, San Francisco County. Not surprisingly. Remember, we're clustering by education, and these might be counties where you would expect the level of education to be comparable to the level of education in Middlesex.

All right. Let me get rid of that for now. Sure, I ran it. I didn't want you to have to sit through it, but I ran it on a much bigger sample size. So here's what I got when I ran it asking for 100 clusters. And I think it was 5 trials. And you'll notice that in this case, actually, we have a much smaller cluster containing Middlesex.
Not surprising, because I've done 100 rather than 20. And it should be pretty tight, since I chose the best-- you can see we have a distribution here.

Now, remember that the name of the game here is we're trying to see whether we can infer something interesting by clustering. Unsupervised learning. So one of the questions we should ask is, how different is what we're getting here from what we'd get if we chose something at random?

Now, remember, we did not cluster based on income. I happened to plot income here just because I was curious as to how this clustering related to income. Suppose we had just split the counties at random into 100 different clusters. What would you have expected this kind of graph to look like? Do we have something that is obviously different from what we might have gotten from a random division into 100 clusters? Think about it. What would you get?

AUDIENCE: A bell curve?

PROFESSOR: Pardon?

AUDIENCE: We'd get a bell curve.
PROFESSOR: Well, a bell curve is a good guess, because bell curves occur a lot in nature. And as I said, I apologize for the rather miserable quality of the rewards. It's a good guess, but I think it's the wrong guess. What would you expect? Would you expect the different clusters-- yeah, go ahead.

AUDIENCE: You probably might expect them all to average at a certain point, for a very sharp bell curve?

PROFESSOR: A very sharp bell curve was one comment. Well, someone else want to try it? That's kind of close. I thought you were on the right track in the beginning.

Well, take a different example. Let's take students. If I were to select 100 MIT students at random and compute their GPA, would you expect it to be radically different from the average GPA of all MIT students? Probably not, right? So if I take 100 counties and put them into a cluster, the average income of that cluster is probably pretty close to the average income in the country. So you'd actually expect it to be kind of flat, right?
That each of the randomly chosen clusters would have the same average income, more or less. Well, that's clearly not what we have here. So we can clearly infer, from the fact that this is not flat, that there is some interesting correlation between level of income and education. And for those of us who earn our living in education, we're glad to see it's positive, actually. Not negative.

As another experiment, just for fun, I clustered by gender only. So this looked only at the female/male ratio in the counties. And here you'll see Middlesex is with-- remember, we had about 3,000 counties to start with. So the fact that there were so few in the cluster on education was interesting, right? Here we have more. And we get a very different-looking picture. Which says, perhaps, that the female/male ratio is not unrelated to income, but it's a rather different relation than we get from education. This is what would be called a bimodal distribution: a lot here and a lot here, and not much in the middle. But again, the dynamic range is much smaller.
But we do have some counties where the income is pretty miserable. All right. We could play a lot more with this, but I'm not going to.

Before we leave machine learning, I do want to reiterate a few of the major points that I wanted to make sure were the take-home messages. So, we talked about supervised learning much less than we talked about unsupervised. Interestingly, because unsupervised learning is probably used more often in the sciences than supervised. When we did supervised learning, we started with a training set that had labels. Each point had a label. And then we tried to infer the relationships between the features of the points and the associated labels.

We then looked at unsupervised learning. The issue here was, our training set was all unlabeled data. And what we try to infer is relationships among points. So, rather than trying to understand how the features relate to the labels, we're just trying to understand how the points, or actually the features related to the points, relate to one another.
Both of these, as I said earlier, are similar to what we saw when we did regression, where we tried to fit curves to data. You need to be careful and wary of over-fitting, just as you did with regression. In particular, if you have a small set of training data, you may learn things that are true of the training data but not true of the data on which you will subsequently run the algorithm. So you need to be wary of that.

Another important lesson is that features matter. Which features you choose matters. It matters whether they're normalized. And in some cases you can even weight them, if you want to make some features more important than the others. Features need to be relevant to the kind of knowledge that you hope to acquire. For example, when I was trying to look at the eating habits of mammals, I chose features based upon teeth, not features based upon how much hair they had, or their color, or the lengths of their tails. I chose something that I had domain knowledge to suggest was probably relevant to the question at hand.
446 00:29:05,810 --> 00:29:08,180 And then we discovered it was. 447 00:29:08,180 --> 00:29:11,050 Just as here, I said, well, maybe education has something 448 00:29:11,050 --> 00:29:12,990 to do with income. 449 00:29:12,990 --> 00:29:17,195 We ran it and we discovered, thank goodness, that it does. 450 00:29:20,120 --> 00:29:20,520 OK. 451 00:29:20,520 --> 00:29:23,860 So I probably told you ten times that features matter. 452 00:29:23,860 --> 00:29:28,020 If not, I should have because they do. 453 00:29:28,020 --> 00:29:30,610 And it's probably the most important thing to get right 454 00:29:30,610 --> 00:29:33,560 in doing machine learning. 455 00:29:33,560 --> 00:29:38,160 Now, our foray into machine learning is part of a much 456 00:29:38,160 --> 00:29:39,020 larger unit. 457 00:29:39,020 --> 00:29:42,560 In fact, the largest unit of the course really, is about 458 00:29:42,560 --> 00:29:46,730 how to use computation to make sense of the kind of 459 00:29:46,730 --> 00:29:52,070 information one encounters in the world. 460 00:29:52,070 --> 00:29:58,180 A big part of this is finding useful ways to abstract from 461 00:29:58,180 --> 00:30:02,500 the situation you're initially confronted with to create a 462 00:30:02,500 --> 00:30:05,030 model about which one can reason. 463 00:30:07,550 --> 00:30:10,040 We saw that when we did curve fitting. 464 00:30:10,040 --> 00:30:11,980 We would abstract from the points to a 465 00:30:11,980 --> 00:30:13,890 curve to get a model. 466 00:30:13,890 --> 00:30:18,330 And we see that with machine learning, that we abstract 467 00:30:18,330 --> 00:30:23,270 from every detail about a county to say, the education 468 00:30:23,270 --> 00:30:26,090 level, to give us a model of the counties 469 00:30:26,090 --> 00:30:29,040 that might be useful. 470 00:30:29,040 --> 00:30:32,490 I now want to talk about another kind of way to build 471 00:30:32,490 --> 00:30:38,000 models that's as popular a way as there is. 
472 00:30:38,000 --> 00:30:41,260 Probably the most common kinds of models. 473 00:30:41,260 --> 00:30:43,605 Those models are graph theoretic. 474 00:30:54,540 --> 00:30:59,590 There's a whole rich theory about graphs and graph theory 475 00:30:59,590 --> 00:31:03,110 that is used to understand these models. 476 00:31:03,110 --> 00:31:07,340 Suppose, for example, you had a list of all the airline 477 00:31:07,340 --> 00:31:10,360 flights between every city in the United States and what 478 00:31:10,360 --> 00:31:13,610 each flight cost. 479 00:31:13,610 --> 00:31:18,870 Suppose also, counterfactual supposition, that for all 480 00:31:18,870 --> 00:31:25,130 cities A, B and C, the cost of flying from A to C by way of B 481 00:31:25,130 --> 00:31:29,380 was the sum of the cost from A to B and the cost from B to C. We happen to 482 00:31:29,380 --> 00:31:33,520 know that's not true, but we can pretend it is. 483 00:31:33,520 --> 00:31:37,020 So what are some of the questions you might ask if I 484 00:31:37,020 --> 00:31:39,700 gave you all that data? 485 00:31:39,700 --> 00:31:42,660 And in fact, there's a company called ITA Software in 486 00:31:42,660 --> 00:31:46,700 Cambridge, recently acquired by Google for, I think, $700 487 00:31:46,700 --> 00:31:51,070 million, that is built upon answering these kinds of 488 00:31:51,070 --> 00:31:54,300 questions about these kinds of graphs. 489 00:31:54,300 --> 00:31:57,050 So you could ask, for example, what's the smallest number of 490 00:31:57,050 --> 00:31:59,060 hops between two cities? 491 00:31:59,060 --> 00:32:03,180 If I want to fly from here to Juneau, Alaska, what's the 492 00:32:03,180 --> 00:32:05,950 fewest number of stops? 493 00:32:05,950 --> 00:32:08,540 I could ask, what's the least expensive-- 494 00:32:08,540 --> 00:32:10,140 different question-- flight from here to Juneau. 
495 00:32:12,640 --> 00:32:15,430 I could ask what's the least expensive way involving no 496 00:32:15,430 --> 00:32:19,020 more than two stops, just in case I don't want to stop too 497 00:32:19,020 --> 00:32:20,870 many places. 498 00:32:20,870 --> 00:32:22,700 I could say I have ten cities. 499 00:32:22,700 --> 00:32:25,080 What's the least expensive way to visit each 500 00:32:25,080 --> 00:32:27,230 of them on my vacation? 501 00:32:27,230 --> 00:32:32,390 All of these problems are nicely formalized as graphs. 502 00:32:32,390 --> 00:32:39,690 A graph is a set of nodes. 503 00:32:45,810 --> 00:32:48,080 Think of those as objects. 504 00:32:48,080 --> 00:32:52,760 Nodes are also often called vertices or a 505 00:32:52,760 --> 00:32:56,400 vertex for one of them. 506 00:32:56,400 --> 00:33:08,550 Those nodes are connected by a set of edges, 507 00:33:08,550 --> 00:33:09,800 often called arcs. 508 00:33:16,260 --> 00:33:21,360 If the edges are uni-directional, the 509 00:33:21,360 --> 00:33:32,180 equivalent of a one-way street, it's called a digraph, 510 00:33:32,180 --> 00:33:33,430 or directed graph. 511 00:33:48,500 --> 00:33:52,100 Graphs are typically used in situations in which there are 512 00:33:52,100 --> 00:33:54,585 interesting relationships among the parts. 513 00:33:58,120 --> 00:34:02,720 The first documented use of this kind of a graph was in 514 00:34:02,720 --> 00:34:09,179 1735 when the Swiss mathematician Leonhard Euler 515 00:34:09,179 --> 00:34:13,840 used what we now call graph theory to formulate and solve 516 00:34:13,840 --> 00:34:16,580 the Konigsberg Bridges problem. 517 00:34:16,580 --> 00:34:24,120 So this is a map of Konigsberg, which was then the 518 00:34:24,120 --> 00:34:29,360 capital of East Prussia, a part of what's today Germany. 519 00:34:29,360 --> 00:34:32,199 And it was built at the intersection of two rivers and 520 00:34:32,199 --> 00:34:34,580 contained a lot of islands. 
521 00:34:34,580 --> 00:34:37,949 The islands were connected to each other and to the mainland 522 00:34:37,949 --> 00:34:42,230 by seven bridges. 523 00:34:42,230 --> 00:34:45,880 For some bizarre reason which history does not record and I 524 00:34:45,880 --> 00:34:49,679 cannot even imagine, the residents of this city were 525 00:34:49,679 --> 00:34:53,300 obsessed with the question of whether it was possible to 526 00:34:53,300 --> 00:34:57,380 take a walk through the city that involved crossing each 527 00:34:57,380 --> 00:35:00,950 bridge exactly once. 528 00:35:00,950 --> 00:35:03,570 Could you somehow take a walk and go over each bridge 529 00:35:03,570 --> 00:35:05,230 exactly once? 530 00:35:05,230 --> 00:35:06,250 I don't know why they cared. 531 00:35:06,250 --> 00:35:07,550 They seemed to care. 532 00:35:07,550 --> 00:35:10,020 They debated it, they walked around, they did things. 533 00:35:13,520 --> 00:35:15,980 It probably would be unfair for me to ask you to look at 534 00:35:15,980 --> 00:35:18,440 this map and answer the question. 535 00:35:18,440 --> 00:35:19,705 But it's kind of complicated. 536 00:35:22,270 --> 00:35:27,920 Euler's great insight was that you didn't have to actually 537 00:35:27,920 --> 00:35:30,780 look at the level of detail represented by this map to 538 00:35:30,780 --> 00:35:32,440 answer the question. 539 00:35:32,440 --> 00:35:35,700 You could vastly simplify it. 540 00:35:35,700 --> 00:35:40,250 And what he said is, well, let's represent each land mass 541 00:35:40,250 --> 00:35:47,410 by a point, and each bridge as a line. 542 00:35:47,410 --> 00:35:50,550 So, in fact, his map of Konigsberg looked like that. 543 00:35:53,360 --> 00:35:55,900 Considerably simpler. 544 00:35:55,900 --> 00:35:58,450 This is a graph. 545 00:35:58,450 --> 00:36:02,190 We have some vertices and some edges. 
546 00:36:02,190 --> 00:36:06,190 He said, well, we can just look at this problem and now 547 00:36:06,190 --> 00:36:09,090 ask the question. 548 00:36:09,090 --> 00:36:12,220 Once he reformulated the problem this way it became a 549 00:36:12,220 --> 00:36:18,050 lot simpler to think about and he reasoned as follows. 550 00:36:18,050 --> 00:36:23,590 If a walk were to traverse each bridge exactly once, it 551 00:36:23,590 --> 00:36:28,910 must be the case that each node in the walk, except for 552 00:36:28,910 --> 00:36:36,050 the first and the last, has an even number of edges. 553 00:36:36,050 --> 00:36:39,870 So if you were to go to an island and leave the island, 554 00:36:39,870 --> 00:36:43,540 traversing each bridge to the island, then unless there were 555 00:36:43,540 --> 00:36:45,340 an even number of bridges, you couldn't traverse 556 00:36:45,340 --> 00:36:48,310 each one exactly once. 557 00:36:48,310 --> 00:36:50,760 If there were only one bridge, once you got to the island you 558 00:36:50,760 --> 00:36:51,825 were stuck. 559 00:36:51,825 --> 00:36:54,520 If there were two bridges you could get there and leave. 560 00:36:54,520 --> 00:36:56,390 But if there were three bridges you could get there, 561 00:36:56,390 --> 00:36:57,800 leave, get there and you're stuck again. 562 00:37:00,890 --> 00:37:07,450 He then looked at it and said, well, none of these nodes have 563 00:37:07,450 --> 00:37:09,280 an even number of edges. 564 00:37:09,280 --> 00:37:11,530 Therefore you can't do it. 565 00:37:11,530 --> 00:37:13,320 End of story. 566 00:37:13,320 --> 00:37:14,570 Stop arguing. 567 00:37:17,620 --> 00:37:20,280 Kind of a nice piece of logic. 568 00:37:20,280 --> 00:37:24,920 And then Euler later went on to generalize this theorem to 569 00:37:24,920 --> 00:37:27,430 cover a lot of other situations. 
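Euler's parity argument is mechanical enough to check in a few lines. Here is a small sketch (the function name and the bridge encoding are mine, not the lecture's): it counts how many bridges touch each land mass and, assuming the land masses are all connected to each other, a walk crossing every bridge exactly once exists only if at most two land masses touch an odd number of bridges.

```python
from collections import defaultdict

def euler_walk_possible(bridges):
    # bridges: one (landmass, landmass) pair per bridge.
    # Count how many bridge endpoints touch each land mass.
    degree = defaultdict(int)
    for a, b in bridges:
        degree[a] += 1
        degree[b] += 1
    # Every node except possibly the first and last of the walk
    # must have even degree, so at most two odd-degree nodes are
    # allowed (assuming all land masses are connected).
    odd = sum(1 for d in degree.values() if d % 2 == 1)
    return odd <= 2

# The seven bridges of Konigsberg: land masses A (the island),
# B, C, and D.
konigsberg = [('A', 'B'), ('A', 'B'), ('A', 'C'), ('A', 'C'),
              ('A', 'D'), ('B', 'D'), ('C', 'D')]
print(euler_walk_possible(konigsberg))  # False: all four have odd degree
```

On the Konigsberg data the island has degree 5 and the other three land masses degree 3, so all four are odd and the walk is impossible, exactly as Euler argued.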
570 00:37:27,430 --> 00:37:30,960 But what was important was not the fact that he solved this 571 00:37:30,960 --> 00:37:36,950 problem but that he thought about the notion of taking a 572 00:37:36,950 --> 00:37:40,820 map and formulating it as a graph. 573 00:37:40,820 --> 00:37:44,960 This was the first example of that and since then everything 574 00:37:44,960 --> 00:37:47,420 has worked that way. 575 00:37:47,420 --> 00:37:52,950 So if you take this kind of idea and now you extend it to 576 00:37:52,950 --> 00:37:57,700 digraphs, you can deal with one-way 577 00:37:57,700 --> 00:38:00,030 bridges or one-way streets. 578 00:38:00,030 --> 00:38:05,380 Or suppose you want to look at our airline problem, you can 579 00:38:05,380 --> 00:38:08,395 extend it to include weights. 580 00:38:23,170 --> 00:38:28,320 For example, the number of miles between two cities or 581 00:38:28,320 --> 00:38:34,070 the amount of toll you'd have to pay on some road. 582 00:38:34,070 --> 00:38:36,790 So, for example, once you've done this you can easily 583 00:38:36,790 --> 00:38:40,910 represent the entire US highway system or any roadmap 584 00:38:40,910 --> 00:38:42,800 by a weighted directed graph. 585 00:38:45,870 --> 00:38:52,100 Or, used more often, probably, the World Wide Web is today 586 00:38:52,100 --> 00:38:57,430 typically modeled as a directed graph where there's 587 00:38:57,430 --> 00:39:03,210 an edge from page A to page B, if there's a link on page A to 588 00:39:03,210 --> 00:39:05,370 page B. 
589 00:39:05,370 --> 00:39:07,970 And then maybe, if you care, ask the question how 590 00:39:07,970 --> 00:39:12,110 often do people go from A to B, a very important question, 591 00:39:12,110 --> 00:39:15,250 say, to somebody like Google, who wants to know how often 592 00:39:15,250 --> 00:39:18,530 people click on a link to get to another place so they can 593 00:39:18,530 --> 00:39:23,180 charge for those clicks, you use a weighted graph, which 594 00:39:23,180 --> 00:39:27,090 says how often does someone go from here to there. 595 00:39:27,090 --> 00:39:30,710 And so a company like Google maintains a model of what 596 00:39:30,710 --> 00:39:34,920 happens and uses a weighted, directed graph to essentially 597 00:39:34,920 --> 00:39:39,620 represent the Web, the clicks and everything else, and can 598 00:39:39,620 --> 00:39:42,300 do all sorts of analysis on traffic patterns 599 00:39:42,300 --> 00:39:44,830 and things like that. 600 00:39:44,830 --> 00:39:48,700 There are also many less obvious uses of graphs. 601 00:39:48,700 --> 00:39:52,670 Biologists use graphs to model things ranging from 602 00:39:52,670 --> 00:39:57,020 the way proteins interact with each other to, more obviously, 603 00:39:57,020 --> 00:40:01,240 gene expression networks, which are clearly graphs. 604 00:40:01,240 --> 00:40:05,530 Physicists use graphs to model phase transitions with 605 00:40:05,530 --> 00:40:09,350 typically the weight of the edge representing the amount 606 00:40:09,350 --> 00:40:13,360 of energy needed to go from one phase to another. 607 00:40:13,360 --> 00:40:17,320 Those are again weighted directed graphs. 608 00:40:17,320 --> 00:40:20,140 The direction is, can you get from this phase to that phase? 609 00:40:20,140 --> 00:40:23,450 And the weight is how much energy does it require? 610 00:40:23,450 --> 00:40:29,390 Epidemiologists use graphs to model diseases, et cetera. 611 00:40:29,390 --> 00:40:33,700 We'll see an example of that in a bit. 
612 00:40:33,700 --> 00:40:35,080 All right. 613 00:40:35,080 --> 00:40:37,725 Let's look at some code now to implement graphs. 614 00:40:47,480 --> 00:40:52,000 This is also in your handout and I'm going to comment this 615 00:40:52,000 --> 00:40:54,610 out just so we don't run it by accident. 616 00:41:04,400 --> 00:41:06,830 As you might expect, I'm going to use classes 617 00:41:06,830 --> 00:41:09,130 to implement graphs. 618 00:41:09,130 --> 00:41:13,190 I start with a class, Node, which is at this 619 00:41:13,190 --> 00:41:14,880 point pretty simple. 620 00:41:14,880 --> 00:41:17,220 It's just got a name. 621 00:41:17,220 --> 00:41:19,710 Now, you might say, well, why did I even bother introducing 622 00:41:19,710 --> 00:41:20,480 a class here? 623 00:41:20,480 --> 00:41:23,400 Why don't I just use strings? 624 00:41:23,400 --> 00:41:28,380 Well, because I was kind of wary that sometime later I 625 00:41:28,380 --> 00:41:32,240 might want to associate more properties with nodes. 626 00:41:32,240 --> 00:41:35,520 So you could imagine if I'm using a graph to model the 627 00:41:35,520 --> 00:41:40,560 World Wide Web, I might want more than the URL for a page. 628 00:41:40,560 --> 00:41:44,220 I might want to have all the words on the page, or who 629 00:41:44,220 --> 00:41:46,730 knows what else about it. 630 00:41:46,730 --> 00:41:50,580 So I just said, for safety, let's start with a simple 631 00:41:50,580 --> 00:41:54,200 class, but let's make it a class so that any code I write 632 00:41:54,200 --> 00:41:58,250 can be reused if at some later date I decide nodes are going 633 00:41:58,250 --> 00:42:00,135 to be more complicated than just strings. 634 00:42:04,360 --> 00:42:06,750 Good programming practice. 635 00:42:06,750 --> 00:42:11,730 An edge is only a little bit more complicated. 636 00:42:11,730 --> 00:42:18,530 It's got a source, a destination, and a weight. 
637 00:42:18,530 --> 00:42:21,420 So you can see that I'm using the most general form of an 638 00:42:21,420 --> 00:42:25,890 edge so that I will be able to use edges not only for graphs 639 00:42:25,890 --> 00:42:30,220 and digraphs, but also weighted directed graphs by 640 00:42:30,220 --> 00:42:34,530 having all the potential properties I might need and 641 00:42:34,530 --> 00:42:37,210 then some simple things to fetch things. 642 00:42:40,240 --> 00:42:43,690 The next class in the hierarchy is a digraph. 643 00:42:51,520 --> 00:42:57,860 So it's got an init, of course, I can add nodes to it. 644 00:42:57,860 --> 00:42:59,890 I'm not going to allow myself to add the same 645 00:42:59,890 --> 00:43:01,910 node more than once. 646 00:43:01,910 --> 00:43:05,060 I can add edges. 647 00:43:05,060 --> 00:43:07,330 And I'm going to check to make sure that I'm only connecting 648 00:43:07,330 --> 00:43:10,180 nodes that are in the graph. 649 00:43:10,180 --> 00:43:13,180 Then I've got childrenOf, which gives me all the 650 00:43:13,180 --> 00:43:21,245 children of a node, and hasNode and __str__. 651 00:43:24,970 --> 00:43:28,070 And then interestingly enough, maybe surprising to some of 652 00:43:28,070 --> 00:43:32,160 you, I've made graph a sub-class of digraph. 653 00:43:35,640 --> 00:43:37,690 Maybe that seems a little odd. 654 00:43:37,690 --> 00:43:39,710 After all, when I started, I started 655 00:43:39,710 --> 00:43:42,610 talking about graphs, 656 00:43:42,610 --> 00:43:45,060 and then said we can add this feature to get digraphs. 657 00:43:45,060 --> 00:43:46,880 But now I'm going the other way around. 658 00:43:51,470 --> 00:43:52,560 Why is that? 659 00:43:52,560 --> 00:43:55,310 Why do you think that's the right way 660 00:43:55,310 --> 00:43:57,990 to structure a hierarchy? 661 00:43:57,990 --> 00:44:01,630 What's the relation of graphs to digraphs? 662 00:44:06,030 --> 00:44:10,800 Digraphs are more general than graphs. 
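The node, edge, and digraph classes being walked through might look like the following sketch. This is a reconstruction in the spirit of the handout, not the handout itself, so the method names (addNode, addEdge, childrenOf, hasNode) and the storage details are assumptions.

```python
class Node(object):
    # A class rather than a bare string, so that more properties
    # (say, all the words on a web page) can be added later
    # without rewriting any code that uses nodes.
    def __init__(self, name):
        self.name = name
    def getName(self):
        return self.name
    def __str__(self):
        return self.name

class Edge(object):
    # The most general form of an edge: a source, a destination,
    # and a weight, so the same class serves graphs, digraphs,
    # and weighted directed graphs.
    def __init__(self, src, dest, weight=1.0):
        self.src = src
        self.dest = dest
        self.weight = weight
    def getSource(self):
        return self.src
    def getDestination(self):
        return self.dest
    def getWeight(self):
        return self.weight

class Digraph(object):
    def __init__(self):
        self.nodes = set()
        self.edges = {}          # node -> list of child nodes
    def addNode(self, node):
        # Refuse to add the same node more than once.
        if node in self.nodes:
            raise ValueError('Duplicate node')
        self.nodes.add(node)
        self.edges[node] = []
    def addEdge(self, edge):
        # Only connect nodes that are already in the graph.
        src = edge.getSource()
        dest = edge.getDestination()
        if not (src in self.nodes and dest in self.nodes):
            raise ValueError('Node not in graph')
        self.edges[src].append(dest)
    def childrenOf(self, node):
        return self.edges[node]
    def hasNode(self, node):
        return node in self.nodes
```

A quick use: build two Node objects, add them with addNode, connect them with an Edge, and childrenOf on the source returns a list containing the destination.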
663 00:44:10,800 --> 00:44:16,060 A graph is a specialization of a digraph. 664 00:44:16,060 --> 00:44:19,590 Just like a county is a specialization of a point. 665 00:44:22,150 --> 00:44:26,050 So, typically as you design these class hierarchies, the 666 00:44:26,050 --> 00:44:34,430 more specialized something is, the further down it has to be. 667 00:44:34,430 --> 00:44:35,830 More like a subclass. 668 00:44:35,830 --> 00:44:37,430 Does that make sense? 669 00:44:37,430 --> 00:44:39,340 I can't really turn this on its head. 670 00:44:43,640 --> 00:44:46,100 I can specialize a digraph to get a graph. 671 00:44:46,100 --> 00:44:49,420 I can't specialize a graph to get a digraph. 672 00:44:49,420 --> 00:44:53,620 And that's why this hierarchy is organized the way it is. 673 00:44:57,260 --> 00:45:01,930 What else is there interesting to say about this? 674 00:45:01,930 --> 00:45:04,970 A key question, probably the most important question in 675 00:45:04,970 --> 00:45:11,240 the design and implementation of graphs, is the choice of 676 00:45:11,240 --> 00:45:15,760 data structures to represent the digraph in this case. 677 00:45:15,760 --> 00:45:22,120 There are two possibilities that people typically use. 678 00:45:22,120 --> 00:45:23,950 They can use an adjacency matrix. 679 00:45:33,620 --> 00:45:44,300 So if you have N nodes, you have an N by N matrix where 680 00:45:44,300 --> 00:45:48,850 each entry gives, in the case of a weighted 681 00:45:48,850 --> 00:45:54,215 digraph, the weight of the edge connecting the nodes. 682 00:45:58,130 --> 00:46:02,670 Or in the case of a graph it can be just true or false. 683 00:46:02,670 --> 00:46:04,660 So this is, can I get from A to B? 684 00:46:07,160 --> 00:46:08,635 Is that going to be sufficient? 685 00:46:11,530 --> 00:46:17,020 Suppose you have a graph that looks like this. 686 00:46:17,020 --> 00:46:21,490 And we'll call that Boston and this New York. 
687 00:46:21,490 --> 00:46:25,340 And I want to model the roads. 688 00:46:25,340 --> 00:46:29,230 Well, I might have a road that looks like this and a road 689 00:46:29,230 --> 00:46:33,430 that looks like this from Boston to New York. 690 00:46:33,430 --> 00:46:35,055 As in, I might have more than one road. 691 00:46:38,460 --> 00:46:42,850 So I have to be careful when I use an adjacency matrix 692 00:46:42,850 --> 00:46:46,670 representation, to realize that each element of the 693 00:46:46,670 --> 00:46:51,550 matrix could itself be somewhat more complicated in 694 00:46:51,550 --> 00:46:53,290 the case that there are multiple edges 695 00:46:53,290 --> 00:46:56,390 connecting two nodes. 696 00:46:56,390 --> 00:46:59,250 And in fact, in many graphs we will see there are multiple 697 00:46:59,250 --> 00:47:02,240 edges connecting the same two nodes. 698 00:47:02,240 --> 00:47:04,030 It would be surprising if there weren't. 699 00:47:10,430 --> 00:47:10,870 All right. 700 00:47:10,870 --> 00:47:15,570 Now, the other common representation is 701 00:47:15,570 --> 00:47:16,820 an adjacency list. 702 00:47:23,980 --> 00:47:31,340 In an adjacency list, for every node I list all of the 703 00:47:31,340 --> 00:47:33,420 edges emanating from that node. 704 00:47:42,470 --> 00:47:44,350 Which of these is better? 705 00:47:44,350 --> 00:47:45,910 Well, neither. 706 00:47:45,910 --> 00:47:48,580 It depends upon your application. 707 00:47:48,580 --> 00:47:52,910 An adjacency matrix is often the best choice when the 708 00:47:52,910 --> 00:47:56,070 connections are dense. 709 00:47:56,070 --> 00:48:00,230 Everything is connected to everything else. 710 00:48:00,230 --> 00:48:03,960 But is very wasteful if the connections are sparse. 
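One way to handle those parallel roads in an adjacency-matrix representation is to let each matrix entry be a list of edge weights rather than a single value. A small sketch of that idea (the city indices and mileages here are made up for illustration); note how many entries stay empty when connections are sparse:

```python
def make_matrix(n):
    # An n-by-n adjacency matrix in which every entry is a list
    # of edge weights, so parallel edges (two roads between the
    # same pair of cities) can both be recorded.
    return [[[] for _ in range(n)] for _ in range(n)]

BOSTON, NEW_YORK = 0, 1                # hypothetical city indices
roads = make_matrix(2)
roads[BOSTON][NEW_YORK].append(215.0)  # hypothetical mileage, road 1
roads[BOSTON][NEW_YORK].append(230.0)  # hypothetical mileage, road 2
print(len(roads[BOSTON][NEW_YORK]))    # 2 -- both parallel roads kept
print(roads[NEW_YORK][BOSTON])         # [] -- the empty entries are
                                       # the waste when the matrix is sparse
```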
711 00:48:06,460 --> 00:48:10,140 If there are no roads connecting most of your cities 712 00:48:10,140 --> 00:48:12,330 or no airplane flights connecting most of your 713 00:48:12,330 --> 00:48:16,170 cities, then you don't want to have a matrix where most of 714 00:48:16,170 --> 00:48:17,730 the entries are empty. 715 00:48:22,570 --> 00:48:26,440 Just to make sure that people follow the difference, which 716 00:48:26,440 --> 00:48:28,560 am I using in my implementation here? 717 00:48:33,480 --> 00:48:36,490 Am I using an adjacency matrix or an adjacency list? 718 00:48:43,790 --> 00:48:47,180 I heard somebody say an adjacency list and because my 719 00:48:47,180 --> 00:48:49,860 candy supply is so meager they didn't even bother raising 720 00:48:49,860 --> 00:48:51,790 their hand, so I don't know who said it. 721 00:48:51,790 --> 00:48:56,200 But yes, it is an adjacency list. 722 00:48:56,200 --> 00:48:59,220 And we can see that by looking at what happens 723 00:48:59,220 --> 00:49:01,230 when we add an edge. 724 00:49:01,230 --> 00:49:05,280 I'm associating it with the source node. 725 00:49:05,280 --> 00:49:09,420 So from each node, and we can see that when we look at the 726 00:49:09,420 --> 00:49:10,670 children of-- 727 00:49:13,220 --> 00:49:18,370 here, I just return the edges of that node. 728 00:49:18,370 --> 00:49:20,590 And that's the list of all the places you can get 729 00:49:20,590 --> 00:49:21,840 to from that node. 730 00:49:24,890 --> 00:49:28,073 So it's very simple, but it's very useful. 731 00:49:31,020 --> 00:49:32,940 Next lecture we'll look-- 732 00:49:32,940 --> 00:49:33,600 Yeah? 733 00:49:33,600 --> 00:49:34,150 Thank you. 734 00:49:34,150 --> 00:49:34,610 Question. 735 00:49:34,610 --> 00:49:35,270 I love questions. 736 00:49:35,270 --> 00:49:39,426 AUDIENCE: Going back to the digraph, what makes the graph 737 00:49:39,426 --> 00:49:40,830 more specialized? 
738 00:49:40,830 --> 00:49:41,370 PROFESSOR: What makes-- 739 00:49:41,370 --> 00:49:44,290 AUDIENCE: The graph more specialized? 740 00:49:44,290 --> 00:49:45,220 PROFESSOR: Good question. 741 00:49:45,220 --> 00:49:49,670 The question is, what makes the graph more specialized? 742 00:49:49,670 --> 00:49:55,880 What we'll see here when we look at graph, it's not a very 743 00:49:55,880 --> 00:50:00,840 efficient implementation, but every time you add an edge, I 744 00:50:00,840 --> 00:50:04,640 add an edge in the reverse direction. 745 00:50:04,640 --> 00:50:08,030 Because if you can get from node A to node B you can get 746 00:50:08,030 --> 00:50:14,390 from node B to node A. So I've removed the possibility that 747 00:50:14,390 --> 00:50:17,410 you have, say, one-way streets in the graph. 748 00:50:17,410 --> 00:50:20,080 And therefore it's a specialization. 749 00:50:20,080 --> 00:50:23,980 There are things I can not do with graphs that I can do with 750 00:50:23,980 --> 00:50:28,280 digraphs, but anything I can do with a graph I can do with 751 00:50:28,280 --> 00:50:29,690 the digraph. 752 00:50:29,690 --> 00:50:30,760 It's more general. 753 00:50:30,760 --> 00:50:32,010 That make sense? 754 00:50:35,800 --> 00:50:38,620 That's a great question and I'm glad you asked it. 755 00:50:38,620 --> 00:50:38,890 All right. 756 00:50:38,890 --> 00:50:42,040 Thursday we're going to look at a bunch of classic problems 757 00:50:42,040 --> 00:50:45,920 that can be solved using graphs, and I think that 758 00:50:45,920 --> 00:50:47,170 should be fun.
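The specialization just described fits in one method override. Here is a self-contained sketch, not the handout code: the Edge and Digraph classes are compressed to the minimum needed, and the method names (addEdge, childrenOf) are assumptions; the point is that Graph stores every edge in both directions, removing the possibility of one-way streets.

```python
class Edge(object):
    # Minimal recap of the edge described in the lecture:
    # a source, a destination, and a weight.
    def __init__(self, src, dest, weight=1.0):
        self.src, self.dest, self.weight = src, dest, weight
    def getSource(self):
        return self.src
    def getDestination(self):
        return self.dest
    def getWeight(self):
        return self.weight

class Digraph(object):
    # Minimal recap: an adjacency list mapping each node to the
    # nodes you can reach from it in one hop.
    def __init__(self):
        self.edges = {}
    def addNode(self, node):
        self.edges[node] = []
    def addEdge(self, edge):
        self.edges[edge.getSource()].append(edge.getDestination())
    def childrenOf(self, node):
        return self.edges[node]

class Graph(Digraph):
    # Not the most efficient implementation: every edge is stored
    # twice, once in each direction. But that is exactly what makes
    # a graph a specialization of a digraph: if you can get from
    # A to B you can get from B to A, so one-way streets are gone.
    def addEdge(self, edge):
        Digraph.addEdge(self, edge)
        rev = Edge(edge.getDestination(), edge.getSource(),
                   edge.getWeight())
        Digraph.addEdge(self, rev)

g = Graph()
g.addNode('Boston')
g.addNode('New York')
g.addEdge(Edge('Boston', 'New York'))
print(g.childrenOf('New York'))  # ['Boston'] -- the reverse edge
```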