1 00:00:00,040 --> 00:00:02,460 The following content is provided under a Creative 2 00:00:02,460 --> 00:00:03,870 Commons license. 3 00:00:03,870 --> 00:00:06,910 Your support will help MIT OpenCourseWare continue to 4 00:00:06,910 --> 00:00:10,560 offer high quality educational resources for free. 5 00:00:10,560 --> 00:00:13,460 To make a donation or view additional materials from 6 00:00:13,460 --> 00:00:17,390 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,390 --> 00:00:18,640 ocw.mit.edu. 8 00:00:21,490 --> 00:00:25,520 PROFESSOR: Last Tuesday, we ended up the lecture talking 9 00:00:25,520 --> 00:00:27,840 about knapsack problems. 10 00:00:27,840 --> 00:00:30,920 We talked about the continuous knapsack problem and the fact 11 00:00:30,920 --> 00:00:32,920 that you could solve that optimally 12 00:00:32,920 --> 00:00:35,060 with a greedy algorithm. 13 00:00:35,060 --> 00:00:38,680 And we looked at the 0-1 knapsack problem and discussed 14 00:00:38,680 --> 00:00:42,190 the fact that while we could write greedy algorithms that 15 00:00:42,190 --> 00:00:45,560 would solve the problem quickly, we have to be careful 16 00:00:45,560 --> 00:00:49,930 what we mean by "solve," and that while those algorithms 17 00:00:49,930 --> 00:00:54,010 would choose a set of items that we could indeed carry 18 00:00:54,010 --> 00:00:57,390 away, there was no guarantee that it would choose the 19 00:00:57,390 --> 00:01:01,120 optimal items, that is to say, one that would meet the 20 00:01:01,120 --> 00:01:03,700 objective function of maximizing the value. 21 00:01:07,170 --> 00:01:11,960 We looked after that at a brute force algorithm on the 22 00:01:11,960 --> 00:01:16,850 board only for finding an optimal solution, a guaranteed 23 00:01:16,850 --> 00:01:21,800 optimal solution, but observed the fact that on even a 24 00:01:21,800 --> 00:01:25,320 moderately sized set of items, it might take a 25 00:01:25,320 --> 00:01:27,280 decade or so to run. 26 00:01:27,280 --> 00:01:29,910 Decided that wasn't very good. 27 00:01:29,910 --> 00:01:33,940 Nevertheless, I want to start today looking at some code 28 00:01:33,940 --> 00:01:38,930 that implements a brute force algorithm, not because I 29 00:01:38,930 --> 00:01:43,560 expect anyone to actually run this on a real example, but 30 00:01:43,560 --> 00:01:46,100 because a bit later in the term, we'll see how we could 31 00:01:46,100 --> 00:01:49,800 modify this to something that would be practical. 32 00:01:49,800 --> 00:01:53,540 And there's some things to learn by looking at it. 33 00:01:53,540 --> 00:01:57,510 So let's look at some code here. 34 00:01:57,510 --> 00:02:01,090 I don't expect you to understand in real time all 35 00:02:01,090 --> 00:02:02,940 the details of this code. 36 00:02:02,940 --> 00:02:06,280 It's more I want you to understand the basic idea 37 00:02:06,280 --> 00:02:10,310 behind it and then the result we get. 38 00:02:10,310 --> 00:02:13,520 So you'll remember that we looked at the complexity by 39 00:02:13,520 --> 00:02:18,990 saying well, really, it's like binary numbers. 40 00:02:18,990 --> 00:02:23,930 So the first helper subroutine I'm going to use is something 41 00:02:23,930 --> 00:02:28,980 that generates binary numbers. 42 00:02:28,980 --> 00:02:34,330 So it takes an n, some natural number, and the number of 43 00:02:34,330 --> 00:02:38,570 digits and returns a binary string of that length, 44 00:02:38,570 --> 00:02:42,780 representing the decimal number n. 
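A minimal sketch of what such a decimal-to-binary helper might look like, assuming Python and the name dToB with a (number, numDigits) signature; this is an illustration based on the description above, not necessarily the exact code shown in lecture.

def dToB(n, numDigits):
    """Assumes n is a natural number and numDigits a positive int.
       Returns a binary string of length numDigits representing the
       decimal number n, zero-padded on the left."""
    assert type(n) == int and n >= 0 and numDigits > 0
    bStr = ''
    while n > 0:
        bStr = str(n % 2) + bStr   # peel off the low-order bit
        n = n // 2
    while len(bStr) < numDigits:   # zero pad so every item gets a digit
        bStr = '0' + bStr
    return bStr

For example, dToB(5, 4) returns '0101', where each digit will indicate whether the corresponding item is taken.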
45 00:02:42,780 --> 00:02:47,340 Why am I giving it this number of digits? 46 00:02:47,340 --> 00:02:50,970 Because I need to zero pad it. 47 00:02:50,970 --> 00:02:53,970 If I want to have a vector that represents whether or not 48 00:02:53,970 --> 00:03:00,090 I take items, if I take only one item, say the first one, I 49 00:03:00,090 --> 00:03:03,370 don't want just a binary string with one digit in it. 50 00:03:03,370 --> 00:03:06,200 Because I need all those zeros to indicate that I'm not 51 00:03:06,200 --> 00:03:08,700 taking the other items. 52 00:03:08,700 --> 00:03:12,320 And so the second argument tells me, in effect, how many 53 00:03:12,320 --> 00:03:14,820 zeros I'm going to need. 54 00:03:14,820 --> 00:03:20,010 And there's nothing mysterious about the way it does it. 55 00:03:20,010 --> 00:03:21,660 OK. 56 00:03:21,660 --> 00:03:25,320 The next helper function generates the 57 00:03:25,320 --> 00:03:29,940 power set of the items. 58 00:03:29,940 --> 00:03:32,230 What is a power set? 59 00:03:32,230 --> 00:03:37,170 If you take a set, you can then ask the question what are 60 00:03:37,170 --> 00:03:41,430 all the subsets of the set? 61 00:03:41,430 --> 00:03:43,230 What's the smallest subset of a set? 62 00:03:43,230 --> 00:03:45,260 It's the empty set. 63 00:03:45,260 --> 00:03:46,910 No items. 64 00:03:46,910 --> 00:03:49,970 What's the largest subset of a set? 65 00:03:49,970 --> 00:03:52,590 All of the items. 66 00:03:52,590 --> 00:03:56,930 And then we have everything in between, the set that contains 67 00:03:56,930 --> 00:03:59,780 the first item, the set that contains the second item, et 68 00:03:59,780 --> 00:04:02,620 cetera, the set that contains the first and the second, the 69 00:04:02,620 --> 00:04:03,740 first and the third. 70 00:04:03,740 --> 00:04:06,320 There are a lot of them. 71 00:04:06,320 --> 00:04:08,510 And of course, how many is a lot? 72 00:04:08,510 --> 00:04:11,870 Well, 2 to the n is a lot. 73 00:04:11,870 --> 00:04:15,470 But now we're going to generate every 74 00:04:15,470 --> 00:04:19,079 possible subset of items. 75 00:04:19,079 --> 00:04:23,670 And we're going to do this simply using the decimal to 76 00:04:23,670 --> 00:04:27,580 binary function to tell us whether or not we keep each 77 00:04:27,580 --> 00:04:29,560 one, so we can enumerate them. 78 00:04:29,560 --> 00:04:31,590 We can generate them all. 79 00:04:31,590 --> 00:04:36,830 And now we have all of the possible sets of items one might take, 80 00:04:36,830 --> 00:04:40,820 irrespective of whether they obey the constraint of not 81 00:04:40,820 --> 00:04:42,495 weighing too much. 82 00:04:47,090 --> 00:04:51,190 The next function is the one that does the work. 83 00:04:51,190 --> 00:04:53,290 This is the interesting one. 84 00:04:53,290 --> 00:04:55,380 Choose best. 85 00:04:55,380 --> 00:05:04,790 It takes a power set, the constraint, and two functions. 86 00:05:04,790 --> 00:05:06,960 One, getValue-- 87 00:05:06,960 --> 00:05:08,770 it tells me the value of an item. 88 00:05:08,770 --> 00:05:10,315 And the other getWeight-- 89 00:05:10,315 --> 00:05:13,770 it tells me the weight of an item. 90 00:05:13,770 --> 00:05:15,020 Then it just goes through. 91 00:05:17,610 --> 00:05:21,580 And it enumerates all possibilities 92 00:05:21,580 --> 00:05:24,100 and eventually chooses-- 93 00:05:24,100 --> 00:05:27,980 I won't say the best set, because it 94 00:05:27,980 --> 00:05:29,140 might not be unique. 95 00:05:29,140 --> 00:05:31,650 There might be more than one optimal answer.
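Here is a rough sketch of how these two pieces might be written, building on the dToB helper sketched earlier; the names genPowerset and chooseBest, and the use of getVal and getWeight as functions passed in, are assumptions drawn from the description, not a claim about the lecture's exact code.

def genPowerset(items):
    """Returns a list of lists; each inner list is one subset of items.
       The i-th subset is described by the binary representation of i."""
    numSubsets = 2**len(items)
    pset = []
    for i in range(numSubsets):
        template = dToB(i, len(items))   # e.g. '0101' for 4 items
        subset = []
        for j in range(len(template)):
            if template[j] == '1':       # a 1 means "take item j"
                subset.append(items[j])
        pset.append(subset)
    return pset

def chooseBest(pset, constraint, getVal, getWeight):
    """Examines every subset in pset and returns (value, subset) for a
       subset of maximal total value whose total weight <= constraint."""
    bestVal = 0.0
    bestSet = None
    for subset in pset:
        totalVal = 0.0
        totalWeight = 0.0
        for item in subset:
            totalVal += getVal(item)
            totalWeight += getWeight(item)
        if totalWeight <= constraint and totalVal > bestVal:
            bestVal = totalVal
            bestSet = subset
    return (bestVal, bestSet)

Because genPowerset builds all 2**n subsets, this sketch takes exponential time and space, which is exactly the brute-force behavior discussed next.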
96 00:05:31,650 --> 00:05:35,020 But it finds at least one optimal answer. 97 00:05:35,020 --> 00:05:38,470 And then it returns that. 98 00:05:38,470 --> 00:05:42,440 Again it's a very straightforward implementation 99 00:05:42,440 --> 00:05:47,190 of the brute force algorithm I sketched on the board. 100 00:05:47,190 --> 00:05:50,800 And then we can run it with testBest, which is going to 101 00:05:50,800 --> 00:05:52,680 build the items using the function we 102 00:05:52,680 --> 00:05:54,560 looked at last time. 103 00:05:54,560 --> 00:05:58,150 It's then going to get the power set of the items. 104 00:05:58,150 --> 00:06:00,970 It's going to call chooseBest and then print the result. 105 00:06:03,540 --> 00:06:05,490 So let's see what happens if we run it. 106 00:06:18,550 --> 00:06:19,770 We get an error. 107 00:06:19,770 --> 00:06:20,295 Oh dear. 108 00:06:20,295 --> 00:06:23,890 I hadn't expected that. 109 00:06:23,890 --> 00:06:27,420 And it says test-- oh testBest is not defined. 110 00:06:27,420 --> 00:06:27,960 All right. 111 00:06:27,960 --> 00:06:29,930 Let's try that again. 112 00:06:29,930 --> 00:06:31,730 Sure looks like it's defined to me. 113 00:06:38,000 --> 00:06:39,400 There it is. 114 00:06:39,400 --> 00:06:40,670 OK. 115 00:06:40,670 --> 00:06:44,170 And you may recall that this is a better answer than 116 00:06:44,170 --> 00:06:45,940 anything that was generated by the greedy 117 00:06:45,940 --> 00:06:47,290 algorithm on Tuesday. 118 00:06:47,290 --> 00:06:51,300 You may not recall it, but believe me it is. 119 00:06:51,300 --> 00:06:54,460 It happened to have found a better solution. 120 00:06:54,460 --> 00:06:57,320 And not surprisingly, that's because I contrived the 121 00:06:57,320 --> 00:07:00,550 example to make sure that would happen. 122 00:07:00,550 --> 00:07:03,720 Why does it work better in the sense-- or why does it find a 123 00:07:03,720 --> 00:07:04,830 better answer? 124 00:07:04,830 --> 00:07:07,160 Why might it find a better answer? 125 00:07:07,160 --> 00:07:12,680 Well, because the greedy algorithm chose something that 126 00:07:12,680 --> 00:07:14,770 was locally optimal at each step. 127 00:07:21,420 --> 00:07:24,400 But there was no guarantee that a sequence of locally 128 00:07:24,400 --> 00:07:26,750 optimal decisions would reach a global optimum. 129 00:07:32,220 --> 00:07:37,820 What this algorithm does is it finds a global optimum by 130 00:07:37,820 --> 00:07:39,920 looking at all solutions. 131 00:07:39,920 --> 00:07:43,060 And that's something we'll see again and again, as we go 132 00:07:43,060 --> 00:07:47,340 forward, that there's always a temptation to do things one 133 00:07:47,340 --> 00:07:50,930 step at a time, finding local optimum-- 134 00:07:50,930 --> 00:07:52,050 optima-- 135 00:07:52,050 --> 00:07:53,840 because it's fast. 136 00:07:53,840 --> 00:07:55,110 It's easy. 137 00:07:55,110 --> 00:08:00,490 But there's no guarantee it will work well. 138 00:08:00,490 --> 00:08:02,830 Now the problem, of course, with finding the global 139 00:08:02,830 --> 00:08:09,230 optimum is, as we discussed, it is prohibitively expensive. 140 00:08:09,230 --> 00:08:12,430 Now you could ask is it prohibitively expensive 141 00:08:12,430 --> 00:08:15,290 because I chose a stupid algorithm, 142 00:08:15,290 --> 00:08:17,670 the brute force algorithm? 143 00:08:17,670 --> 00:08:20,420 Well, it is a stupid algorithm. 
144 00:08:20,420 --> 00:08:25,080 But in fact, this is a problem that is what we would call 145 00:08:25,080 --> 00:08:26,330 inherently exponential. 146 00:08:38,659 --> 00:08:41,440 We've looked at this concept before. 147 00:08:41,440 --> 00:08:45,910 That in addition to talking about the complexity of an 148 00:08:45,910 --> 00:08:49,860 algorithm, we can talk about the complexity of a problem in 149 00:08:49,860 --> 00:08:54,650 which we ask the question how fast can the absolute best 150 00:08:54,650 --> 00:08:58,170 solution, fastest solution to this problem, be? 151 00:08:58,170 --> 00:09:04,230 And here you can construct a mathematical proof that says 152 00:09:04,230 --> 00:09:08,320 the problem is inherently exponential. 153 00:09:08,320 --> 00:09:11,880 No matter what we do, we're not going to be able to find 154 00:09:11,880 --> 00:09:16,010 something that's guaranteed to find the optimal, that is 155 00:09:16,010 --> 00:09:19,090 faster than exponential. 156 00:09:19,090 --> 00:09:23,428 Well, now let's be careful of that statement. 157 00:09:23,428 --> 00:09:27,420 What that means is the worst case is inherently 158 00:09:27,420 --> 00:09:29,760 exponential. 159 00:09:29,760 --> 00:09:34,070 As we will see in a couple of weeks-- it'll take us a while 160 00:09:34,070 --> 00:09:35,420 to get there-- 161 00:09:35,420 --> 00:09:39,730 there are actually algorithms that people use to solve these 162 00:09:39,730 --> 00:09:43,760 inherently exponential problems and solve them fast 163 00:09:43,760 --> 00:09:45,010 enough to be useful. 164 00:09:47,230 --> 00:09:51,810 So, for example, when you go to look at airline fares on 165 00:09:51,810 --> 00:09:56,800 Kayak to try and find the best fare from A to B, it is an 166 00:09:56,800 --> 00:09:59,310 inherently exponential problem, but you get an answer 167 00:09:59,310 --> 00:10:01,120 pretty quickly. 168 00:10:01,120 --> 00:10:04,740 And that's because there are techniques you can use. 169 00:10:04,740 --> 00:10:06,820 Now, in fact, one of the reasons you get it is they 170 00:10:06,820 --> 00:10:08,330 don't guarantee that you actually 171 00:10:08,330 --> 00:10:10,680 get an optimal solution. 172 00:10:10,680 --> 00:10:13,140 But there are techniques that guarantee to give you an 173 00:10:13,140 --> 00:10:18,000 optimal solution that almost all the time will run quickly. 174 00:10:18,000 --> 00:10:23,110 And we'll look at one of those a bit later in the term. 175 00:10:23,110 --> 00:10:30,620 Before we do that, however, I want to leave for a while the 176 00:10:30,620 --> 00:10:35,350 whole question of complexity behind and look at another 177 00:10:35,350 --> 00:10:38,300 class of optimization problems. 178 00:10:38,300 --> 00:10:39,940 We'll look at several different kinds of 179 00:10:39,940 --> 00:10:43,300 optimization problems as the term goes forward. 180 00:10:43,300 --> 00:10:50,470 The kind I want to look at today is probably what I would 181 00:10:50,470 --> 00:10:52,430 say is the most exciting branch of 182 00:10:52,430 --> 00:10:53,730 computer science today. 183 00:10:53,730 --> 00:10:56,240 And of course I might have a bias. 184 00:10:56,240 --> 00:10:57,490 And that's machine learning. 185 00:11:05,040 --> 00:11:09,030 It's a word you'll hear a lot about. 186 00:11:09,030 --> 00:11:14,090 And it's a technique that many of you will apply. 187 00:11:14,090 --> 00:11:15,810 You might not write your own codes. 
188 00:11:15,810 --> 00:11:19,370 But I guarantee you will either be the beneficiary or 189 00:11:19,370 --> 00:11:23,550 the victim of machine learning almost every time you log on 190 00:11:23,550 --> 00:11:25,890 to the web these days. 191 00:11:25,890 --> 00:11:28,820 I should probably start by defining what 192 00:11:28,820 --> 00:11:30,490 machine learning is. 193 00:11:30,490 --> 00:11:32,490 But that's hard to do. 194 00:11:32,490 --> 00:11:34,540 I really don't know how to do it. 195 00:11:34,540 --> 00:11:38,000 Superficially, you could say that machine learning deals 196 00:11:38,000 --> 00:11:42,920 with the question of how to build programs that learn. 197 00:11:42,920 --> 00:11:46,090 However, I think in a very real sense every program we 198 00:11:46,090 --> 00:11:48,170 write learns something. 199 00:11:48,170 --> 00:11:51,660 If I implement Newton's method, it's learning what the 200 00:11:51,660 --> 00:11:54,730 roots of the polynomial are. 201 00:11:54,730 --> 00:11:58,750 Certainly when we looked at curve fitting-- 202 00:11:58,750 --> 00:12:00,710 fitting curves to data-- 203 00:12:00,710 --> 00:12:03,320 we were learning a model of the data. 204 00:12:03,320 --> 00:12:05,840 That's what that regression is. 205 00:12:08,570 --> 00:12:10,660 Wikipedia says-- 206 00:12:10,660 --> 00:12:14,240 and of course, it must be true if Wikipedia says it-- 207 00:12:14,240 --> 00:12:17,470 that machine learning is a scientific discipline that is 208 00:12:17,470 --> 00:12:20,960 concerned with the design and development of algorithms that 209 00:12:20,960 --> 00:12:23,860 allow computers to evolve behaviors based 210 00:12:23,860 --> 00:12:26,290 on empirical data. 211 00:12:26,290 --> 00:12:28,640 I'm not sure how helpful this definition is. 212 00:12:28,640 --> 00:12:30,310 But it was the best I could find. 213 00:12:30,310 --> 00:12:31,620 And it doesn't really matter. 214 00:12:34,220 --> 00:12:38,910 But it sort of gets at the issue that a major focus of 215 00:12:38,910 --> 00:12:42,940 machine learning research is to automatically learn to 216 00:12:42,940 --> 00:12:48,640 recognize complex patterns and make intelligent decisions 217 00:12:48,640 --> 00:12:51,880 based on data. 218 00:12:51,880 --> 00:12:55,210 This whole process is something 219 00:12:55,210 --> 00:12:56,670 called inductive inference. 220 00:13:07,300 --> 00:13:10,410 The basic idea is one observes-- 221 00:13:10,410 --> 00:13:11,410 actually one doesn't. 222 00:13:11,410 --> 00:13:17,620 The program observes examples that represent incomplete 223 00:13:17,620 --> 00:13:22,750 information about some statistical phenomena and then 224 00:13:22,750 --> 00:13:28,270 tries to generate a model, just like with curve fitting, 225 00:13:28,270 --> 00:13:33,170 that summarizes some statistical properties of that 226 00:13:33,170 --> 00:13:39,350 data and can be used to predict the future, for 227 00:13:39,350 --> 00:13:44,250 example, give you information about unseen data. 228 00:13:44,250 --> 00:13:51,420 There are roughly speaking two distinctive approaches to 229 00:13:51,420 --> 00:14:01,320 machine learning called supervised learning and 230 00:14:01,320 --> 00:14:02,570 unsupervised learning. 231 00:14:17,630 --> 00:14:20,460 Let's first talk about supervised learning. 232 00:14:20,460 --> 00:14:24,650 It's a little easier to appreciate how it might work. 233 00:14:24,650 --> 00:14:37,450 In supervised learning, we associate a label with each 234 00:14:37,450 --> 00:14:39,310 example in a training set.
235 00:14:52,450 --> 00:14:55,770 So think of that as an answer to a query about an example. 236 00:14:58,580 --> 00:15:06,520 If the label is discrete, we typically call it a 237 00:15:06,520 --> 00:15:08,010 classification problem. 238 00:15:25,630 --> 00:15:31,470 So we would try and classify, for example, a transaction on 239 00:15:31,470 --> 00:15:35,150 a credit card as belonging to the owner of that credit card 240 00:15:35,150 --> 00:15:39,360 or not belonging to the owner, as i.e., with some 241 00:15:39,360 --> 00:15:42,680 probability, a stolen credit card. 242 00:15:42,680 --> 00:15:43,490 So it's discrete. 243 00:15:43,490 --> 00:15:44,460 It belongs to the owner. 244 00:15:44,460 --> 00:15:47,930 It doesn't belong to the owner. 245 00:15:47,930 --> 00:15:54,620 If the labels are real valued, we think of it as 246 00:15:54,620 --> 00:15:57,090 a regression problem. 247 00:15:57,090 --> 00:16:01,990 And so indeed, when we did the curve fitting, we were doing 248 00:16:01,990 --> 00:16:04,000 machine learning. 249 00:16:04,000 --> 00:16:05,775 And we were handling a regression problem. 250 00:16:08,870 --> 00:16:13,610 Based on the examples from the training set, the goal is to 251 00:16:13,610 --> 00:16:17,460 build a program that can predict the answer for other 252 00:16:17,460 --> 00:16:23,260 cases before they were explicitly observed. 253 00:16:23,260 --> 00:16:26,490 So we're trying to generalize from the statistical 254 00:16:26,490 --> 00:16:30,790 properties of a training set to be able to make predictions 255 00:16:30,790 --> 00:16:32,430 about things we haven't seen. 256 00:16:36,960 --> 00:16:38,210 Let's look at an example. 257 00:16:48,990 --> 00:16:57,220 So here, I've got red and blue circles. 258 00:16:57,220 --> 00:17:05,240 And I'm trying to learn what makes a circle red or what's 259 00:17:05,240 --> 00:17:09,930 the difference between red and blue, other than the color? 260 00:17:09,930 --> 00:17:14,940 Think of my information as the (x,y) values and the label as 261 00:17:14,940 --> 00:17:16,190 the color, red or blue. 262 00:17:18,920 --> 00:17:20,079 So I've labeled each one. 263 00:17:20,079 --> 00:17:23,910 And now I'm trying to learn something. 264 00:17:23,910 --> 00:17:25,364 Well, it's kind of tricky. 265 00:17:29,040 --> 00:17:32,410 What are the questions I need to answer to think about this? 266 00:17:32,410 --> 00:17:36,610 And then we'll look at how we might do it. 267 00:17:36,610 --> 00:17:42,720 So, a first question I need to ask is 268 00:17:42,720 --> 00:17:45,745 are the labels accurate? 269 00:17:53,070 --> 00:17:56,980 And in fact, in a lot of real world examples, in most real 270 00:17:56,980 --> 00:17:59,410 world examples, there's no guarantee that 271 00:17:59,410 --> 00:18:02,000 the labels are accurate. 272 00:18:02,000 --> 00:18:04,640 So you have to assume that well, maybe some of 273 00:18:04,640 --> 00:18:06,690 the labels are wrong. 274 00:18:06,690 --> 00:18:08,555 How do we deal with that? 275 00:18:13,150 --> 00:18:18,810 Perhaps the most fundamental question is the past 276 00:18:18,810 --> 00:18:20,300 representative of the future? 277 00:18:34,240 --> 00:18:39,640 We've seen many examples where people have learned things, 278 00:18:39,640 --> 00:18:45,110 for example, to predict the price of housing. 279 00:18:45,110 --> 00:18:48,910 And it turns out you hit some singularity which means the 280 00:18:48,910 --> 00:18:51,710 past is not a very good predictor of the future. 
281 00:18:51,710 --> 00:18:55,120 And even if all of your learning is good, you get the 282 00:18:55,120 --> 00:18:56,790 wrong answer. 283 00:18:56,790 --> 00:19:01,160 So you sort of always have to ask that question. 284 00:19:01,160 --> 00:19:05,780 Do you have enough data to generalize? 285 00:19:15,200 --> 00:19:18,430 And by this, I mean enough training data. 286 00:19:18,430 --> 00:19:21,970 If your training set is very small, you shouldn't have a 287 00:19:21,970 --> 00:19:23,560 lot of confidence in what you learn. 288 00:19:27,770 --> 00:19:31,420 A big issue here is feature extraction. 289 00:19:37,720 --> 00:19:41,820 As we'll see when we look at real examples, the world is a 290 00:19:41,820 --> 00:19:44,350 pretty complex place. 291 00:19:44,350 --> 00:19:50,420 And we need to decide what features we're going to use. 292 00:19:50,420 --> 00:19:53,650 If I were to ask 25% of you to come up in the front of the 293 00:19:53,650 --> 00:19:59,560 room and then try and separate you based upon some feature-- 294 00:19:59,560 --> 00:20:01,490 if I were to say, all right, I'm going to separate the good 295 00:20:01,490 --> 00:20:04,460 students from the bad students, but the only 296 00:20:04,460 --> 00:20:08,390 features I have available are the clothes you're wearing, it 297 00:20:08,390 --> 00:20:09,750 might not work so well. 298 00:20:12,670 --> 00:20:17,140 And very importantly, how tight should the fit be? 299 00:20:27,490 --> 00:20:29,535 So now let's go back to our example here. 300 00:20:32,750 --> 00:20:38,660 We can look at two different ways we might 301 00:20:38,660 --> 00:20:41,230 generalize from this data. 302 00:20:41,230 --> 00:20:45,410 And indeed, when we're looking at classification problems in 303 00:20:45,410 --> 00:20:48,320 supervised learning, what we're typically doing is 304 00:20:48,320 --> 00:20:53,405 trying to find some way of dividing our training data. 305 00:20:56,120 --> 00:20:59,210 In this case, I've given you a two-dimensional projection. 306 00:20:59,210 --> 00:21:01,120 As we'll see, it's not always two-dimensional. 307 00:21:01,120 --> 00:21:03,570 It's not usually two-dimensional. 308 00:21:03,570 --> 00:21:07,850 So I might choose this rather eccentric shape 309 00:21:07,850 --> 00:21:09,760 and say that's great. 310 00:21:09,760 --> 00:21:11,680 And why is that great? 311 00:21:11,680 --> 00:21:15,775 It's great because it minimizes training error. 312 00:21:20,290 --> 00:21:27,880 So if we look at it as an optimization problem, we might 313 00:21:27,880 --> 00:21:34,080 say that our objective function is how many points 314 00:21:34,080 --> 00:21:36,520 are correctly classified in the training 315 00:21:36,520 --> 00:21:39,980 data as red or blue. 316 00:21:39,980 --> 00:21:45,680 And this triangular shape has no training error. 317 00:21:45,680 --> 00:21:48,050 Every point is perfectly classified in 318 00:21:48,050 --> 00:21:50,470 the training data. 319 00:21:50,470 --> 00:21:54,480 If I choose this linear separator instead, I have some 320 00:21:54,480 --> 00:21:56,140 training error. 321 00:21:56,140 --> 00:22:01,750 This red point is misclassified in the training. 322 00:22:01,750 --> 00:22:03,380 Does that mean that the triangle is 323 00:22:03,380 --> 00:22:05,560 better than the line? 324 00:22:05,560 --> 00:22:07,340 Not necessarily, right? 325 00:22:07,340 --> 00:22:11,180 Because my goal is to predict future points. 326 00:22:11,180 --> 00:22:16,690 And maybe that's mislabeled or an experimental error. 
327 00:22:16,690 --> 00:22:22,050 Maybe it's accurately labeled but an outlier, very unusual. 328 00:22:22,050 --> 00:22:25,240 And this will not generalize well. 329 00:22:25,240 --> 00:22:28,630 This is analogous to what we talked about as overfitting 330 00:22:28,630 --> 00:22:30,720 when we looked at curve fitting. 331 00:22:30,720 --> 00:22:34,510 And that's-- a very big problem in machine learning is 332 00:22:34,510 --> 00:22:38,170 if you overfit to your training data, it might not 333 00:22:38,170 --> 00:22:41,010 generalize well and might give you bogus 334 00:22:41,010 --> 00:22:42,955 answers going forward. 335 00:22:46,040 --> 00:22:46,300 OK. 336 00:22:46,300 --> 00:22:51,860 So that's a very quick look at supervised learning. 337 00:22:51,860 --> 00:22:53,960 We'll come back to that. 338 00:22:53,960 --> 00:22:56,170 I now want to talk about unsupervised learning. 339 00:22:58,810 --> 00:23:04,940 The big difference here is we have training data, but we 340 00:23:04,940 --> 00:23:06,190 don't have labels. 341 00:23:08,690 --> 00:23:12,890 So I just give you a bunch of points. 342 00:23:12,890 --> 00:23:16,100 It's as if we looked at this picture, and I didn't tell you 343 00:23:16,100 --> 00:23:20,240 which were the red points and which were the blue points. 344 00:23:20,240 --> 00:23:21,730 They were just all points. 345 00:23:24,320 --> 00:23:28,500 So what can I learn? 346 00:23:28,500 --> 00:23:48,420 What typically you're learning in unsupervised learning, is 347 00:23:48,420 --> 00:23:56,195 you're learning about regularities of the data. 348 00:24:04,580 --> 00:24:08,660 So if we looked at this and think away the red and the 349 00:24:08,660 --> 00:24:14,880 blue, we might well say well, at least, if I look at this, 350 00:24:14,880 --> 00:24:18,820 there is some structure to this data. 351 00:24:18,820 --> 00:24:21,730 And maybe what I should do is divide it this way. 352 00:24:21,730 --> 00:24:25,370 It gives me kind of a nice clean separation. 353 00:24:25,370 --> 00:24:28,650 But maybe I should divide it this way. 354 00:24:28,650 --> 00:24:31,300 Or maybe I should put a circle around each 355 00:24:31,300 --> 00:24:34,480 of these four groupings. 356 00:24:34,480 --> 00:24:36,210 Complicated, what to do. 357 00:24:36,210 --> 00:24:40,740 But what we see is there is clearly some structure here. 358 00:24:40,740 --> 00:24:44,860 And the idea of unsupervised learning is to 359 00:24:44,860 --> 00:24:46,200 discover that structure. 360 00:24:49,110 --> 00:24:53,270 Far and away, the dominant form of unsupervised learning 361 00:24:53,270 --> 00:24:56,370 is clustering. 362 00:24:56,370 --> 00:25:00,110 And that's what I was just talking about, is finding the 363 00:25:00,110 --> 00:25:04,450 cluster in this data. 364 00:25:04,450 --> 00:25:08,660 So we'll move forward here. 365 00:25:08,660 --> 00:25:13,060 There it is, with everything the same color. 366 00:25:13,060 --> 00:25:17,100 But here I've labeled the x- and y-axes 367 00:25:17,100 --> 00:25:18,790 as height and weight. 368 00:25:22,640 --> 00:25:24,540 What does clustering mean? 369 00:25:24,540 --> 00:25:29,300 It's the process of organizing the objects or the points into 370 00:25:29,300 --> 00:25:35,280 groups whose members are similar in some way. 371 00:25:35,280 --> 00:25:38,790 A key issue is what do we mean by similar? 372 00:25:38,790 --> 00:25:42,120 What's the metric we want to use? 373 00:25:42,120 --> 00:25:43,860 And we can see that here. 
374 00:25:43,860 --> 00:25:46,650 If I tell you that, really, I want to 375 00:25:46,650 --> 00:25:50,080 cluster people by height-- 376 00:25:50,080 --> 00:25:52,850 say, people are similar if they're the same height-- 377 00:25:52,850 --> 00:25:56,790 then it's pretty clear how I should divide this, right, 378 00:25:56,790 --> 00:25:58,880 what my clusters should be. 379 00:25:58,880 --> 00:26:02,590 My clusters should probably be this group of shorter people 380 00:26:02,590 --> 00:26:04,110 and this group of taller people. 381 00:26:06,780 --> 00:26:11,510 If I tell you I'm interested in weight, then probably I 382 00:26:11,510 --> 00:26:16,390 want to cluster it with the divisor here between the 383 00:26:16,390 --> 00:26:18,950 heavier people and the lighter people. 384 00:26:18,950 --> 00:26:23,670 Or if I say well, I'm interested in some combination 385 00:26:23,670 --> 00:26:27,090 of those two, then maybe I'll get four clusters as I 386 00:26:27,090 --> 00:26:28,340 discussed before. 387 00:26:35,840 --> 00:26:38,680 Clustering algorithms are used all over the place. 388 00:26:38,680 --> 00:26:43,800 For example, in marketing, they're used to find groups of 389 00:26:43,800 --> 00:26:47,910 customers with similar behavior. 390 00:26:47,910 --> 00:26:52,380 Walmart is famous for using clustering to do just that. 391 00:26:52,380 --> 00:26:56,670 They did a clustering to determine when people bought 392 00:26:56,670 --> 00:26:58,010 the same thing. 393 00:26:58,010 --> 00:27:00,240 And then they would rearrange their shelves to encourage 394 00:27:00,240 --> 00:27:02,230 people to buy things. 395 00:27:02,230 --> 00:27:05,360 And sort of the most famous example they discovered was 396 00:27:05,360 --> 00:27:08,170 there was a strong correlation between people 397 00:27:08,170 --> 00:27:11,790 who bought beer and people who bought diapers. 398 00:27:11,790 --> 00:27:13,630 And so there was a period where if you walked in a 399 00:27:13,630 --> 00:27:16,160 Walmart store, you would find the beer and the diapers next 400 00:27:16,160 --> 00:27:17,550 to each other. 401 00:27:17,550 --> 00:27:19,720 And I leave it to you to speculate on 402 00:27:19,720 --> 00:27:21,420 why that was true. 403 00:27:21,420 --> 00:27:23,215 It just happened to be true in Walmart. 404 00:27:26,170 --> 00:27:30,600 Amazon uses clustering to find people who like similar books. 405 00:27:30,600 --> 00:27:33,820 So every time you buy a book on Amazon, they're running a 406 00:27:33,820 --> 00:27:37,090 clustering algorithm to find out who looks like you. 407 00:27:37,090 --> 00:27:39,530 Said, oh, this person looks just like you. 408 00:27:39,530 --> 00:27:42,130 So if they buy a book, maybe you'll get an email suggesting 409 00:27:42,130 --> 00:27:46,430 you buy that book or the next time you log into Amazon. 410 00:27:46,430 --> 00:27:48,800 Or when you look at a book, they tell you here are some 411 00:27:48,800 --> 00:27:50,400 similar books. 412 00:27:50,400 --> 00:27:52,670 And then they've done a clustering to group books as 413 00:27:52,670 --> 00:27:56,380 similar based on buying habits. 414 00:27:56,380 --> 00:28:01,550 Netflix uses that to recommend movies, et cetera. 415 00:28:01,550 --> 00:28:07,420 Biologists spend a lot of time these days doing clustering. 416 00:28:07,420 --> 00:28:09,630 They classify plants or animals 417 00:28:09,630 --> 00:28:10,780 based on their features.
418 00:28:10,780 --> 00:28:14,890 We'll shortly see an example of that, as in right after 419 00:28:14,890 --> 00:28:16,760 Patriot's Day. 420 00:28:16,760 --> 00:28:19,640 But they also use it a lot in genetics. 421 00:28:19,640 --> 00:28:24,140 So clustering is used to try and find genes that look like each other, 422 00:28:24,140 --> 00:28:27,080 or groups of similar genes. 423 00:28:27,080 --> 00:28:30,930 Insurance companies use that to decide how much to charge 424 00:28:30,930 --> 00:28:33,340 you for your automobile insurance. 425 00:28:33,340 --> 00:28:36,990 They cluster drivers based upon-- and use that to predict 426 00:28:36,990 --> 00:28:40,420 who's going to have an accident. 427 00:28:40,420 --> 00:28:45,830 Document classification on the web is used all the time. 428 00:28:45,830 --> 00:28:47,200 It's used a lot in medicine. 429 00:28:47,200 --> 00:28:49,580 Just used all over the place. 430 00:28:49,580 --> 00:28:51,650 So what is it exactly? 431 00:28:51,650 --> 00:28:56,200 Well, the nice thing is we can define it very 432 00:28:56,200 --> 00:28:59,840 straightforwardly as an optimization problem. 433 00:28:59,840 --> 00:29:03,670 And so we can ask what properties does a good 434 00:29:03,670 --> 00:29:04,920 clustering have? 435 00:29:07,610 --> 00:29:16,900 Well, it should have low intra-cluster dissimilarity. 436 00:29:26,100 --> 00:29:29,600 So in a good clustering, all of the points in the same 437 00:29:29,600 --> 00:29:34,060 cluster should be similar, by whatever metric you're using 438 00:29:34,060 --> 00:29:35,800 for similarity. 439 00:29:35,800 --> 00:29:38,910 As we'll see, there are a lot of choices there. 440 00:29:38,910 --> 00:29:41,960 But that's not enough. 441 00:29:41,960 --> 00:29:55,300 We'd also like to have high inter-cluster dissimilarity. 442 00:29:55,300 --> 00:29:57,790 So we'd like the points within a cluster to be a lot like 443 00:29:57,790 --> 00:29:58,600 each other. 444 00:29:58,600 --> 00:30:01,730 But if points are in different clusters, we'd like them to be 445 00:30:01,730 --> 00:30:04,900 quite different from each other. 446 00:30:04,900 --> 00:30:07,894 That tells us that we have a good cluster. 447 00:30:07,894 --> 00:30:09,570 All right, let's look at it. 448 00:30:18,640 --> 00:30:22,160 How might we model dissimilarity? 449 00:30:22,160 --> 00:30:26,760 Well, using a concept we've already seen-- variance. 450 00:30:26,760 --> 00:30:37,940 So we can talk about the variance of some cluster C as 451 00:30:37,940 --> 00:30:45,775 equal to the sum of all elements x in C, of the mean 452 00:30:45,775 --> 00:30:53,210 of C minus x squared. 453 00:30:53,210 --> 00:30:56,310 Or maybe we can take the square root of it, if we want. 454 00:30:56,310 --> 00:30:59,990 But it's exactly the idea we've seen before, right? 455 00:30:59,990 --> 00:31:02,610 Then we say what's the average value of the cluster? 456 00:31:02,610 --> 00:31:06,000 And then we look at how far is each point from the average. 457 00:31:06,000 --> 00:31:07,070 We sum them. 458 00:31:07,070 --> 00:31:09,140 And that tells us how much variance we 459 00:31:09,140 --> 00:31:10,770 have within the cluster. 460 00:31:13,810 --> 00:31:15,060 Make sense? 461 00:31:17,560 --> 00:31:19,680 So that's variance. 462 00:31:22,660 --> 00:31:26,190 So we can use that to talk about how similar or 463 00:31:26,190 --> 00:31:28,430 dissimilar the elements in the cluster are.
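In code, that spoken definition might look like the following sketch; for simplicity it treats each example as a single number, and the function name is an assumption chosen just for illustration.

def clusterVariance(cluster):
    """cluster is a non-empty list of numbers.
       Returns the sum over every x in cluster of (mean - x)**2, the
       measure of intra-cluster dissimilarity described above."""
    mean = sum(cluster) / len(cluster)
    total = 0.0
    for x in cluster:
        total += (mean - x)**2
    return total

As noted in lecture, one could also take a square root of this quantity; the key idea is just summing squared distances from the cluster's mean.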
464 00:31:31,860 --> 00:31:36,490 We can use the same idea to compare points in separate 465 00:31:36,490 --> 00:31:40,660 clusters and compute various different ways-- and we'll 466 00:31:40,660 --> 00:31:42,300 look at different ways-- 467 00:31:42,300 --> 00:31:46,510 to look at the distance between clusters. 468 00:31:46,510 --> 00:31:50,890 So combining these two things, we could get, say, a metric 469 00:31:50,890 --> 00:31:52,140 we'll call badness-- 470 00:31:55,000 --> 00:31:57,480 not a technical word. 471 00:31:57,480 --> 00:32:03,470 And now I'll ask the question is the optimization problem 472 00:32:03,470 --> 00:32:05,690 that we're solving in clustering 473 00:32:05,690 --> 00:32:07,540 finding a set of clusters-- 474 00:32:07,540 --> 00:32:09,230 capital C-- 475 00:32:09,230 --> 00:32:13,120 such that badness of that set of clusters is minimized? 476 00:32:16,080 --> 00:32:19,600 Is that a sufficient definition of the problem 477 00:32:19,600 --> 00:32:22,710 we're trying to solve? 478 00:32:22,710 --> 00:32:25,760 Find a set of clusters C, such that the 479 00:32:25,760 --> 00:32:29,130 badness of C is minimized. 480 00:32:29,130 --> 00:32:32,835 is that good enough? 481 00:32:32,835 --> 00:32:33,310 AUDIENCE: No. 482 00:32:33,310 --> 00:32:35,010 PROFESSOR: No, why not? 483 00:32:35,010 --> 00:32:37,007 AUDIENCE: Just imagine a case where you view cluster-- if 484 00:32:37,007 --> 00:32:38,855 you make a single cluster, every cluster has 485 00:32:38,855 --> 00:32:40,170 one element in it. 486 00:32:40,170 --> 00:32:41,700 And the variance is 0. 487 00:32:41,700 --> 00:32:42,680 PROFESSOR: Exactly. 488 00:32:42,680 --> 00:32:46,890 So that has a trivial solution, which is probably 489 00:32:46,890 --> 00:32:50,330 not the one we want, of putting each 490 00:32:50,330 --> 00:32:54,040 point in its own cluster. 491 00:32:54,040 --> 00:32:54,810 Badness-- 492 00:32:54,810 --> 00:32:55,520 it won't be bad. 493 00:32:55,520 --> 00:32:58,790 It'll be a perfect clustering in some sense. 494 00:32:58,790 --> 00:33:02,640 But it doesn't do us any good really. 495 00:33:02,640 --> 00:33:05,570 So what do we do to fix that? 496 00:33:05,570 --> 00:33:07,270 What do we usually do when we formulate an 497 00:33:07,270 --> 00:33:08,360 optimization problem? 498 00:33:08,360 --> 00:33:09,990 What's missing? 499 00:33:09,990 --> 00:33:11,620 I've given you the objective function. 500 00:33:11,620 --> 00:33:13,620 What have I not giving you? 501 00:33:13,620 --> 00:33:14,100 AUDIENCE: Constraints. 502 00:33:14,100 --> 00:33:16,020 PROFESSOR: A constraint. 503 00:33:16,020 --> 00:33:22,620 So we need to add some constraint here that will 504 00:33:22,620 --> 00:33:27,230 prevent us from finding a trivial solution. 505 00:33:27,230 --> 00:33:31,700 So what kind of constraints might we look at? 506 00:33:31,700 --> 00:33:33,940 There are different ways of doing it. 507 00:33:36,950 --> 00:33:38,380 A couple of ones that is usual. 508 00:33:38,380 --> 00:33:43,520 Sometimes you might have as a constraint, the maximum number 509 00:33:43,520 --> 00:33:44,770 of clusters. 510 00:33:46,965 --> 00:33:52,530 Say, all right, cluster my data, but I want at most K 511 00:33:52,530 --> 00:33:53,460 clusters -- 512 00:33:53,460 --> 00:33:56,160 10 clusters. 513 00:33:56,160 --> 00:33:58,600 That would be my constraint, like the weight for the 514 00:33:58,600 --> 00:34:01,680 knapsack problem. 
515 00:34:01,680 --> 00:34:07,740 Or maybe I'll want to put something on the maximum 516 00:34:07,740 --> 00:34:10,469 distance between clusters. 517 00:34:10,469 --> 00:34:13,110 So I don't want the distance between any two clusters to be 518 00:34:13,110 --> 00:34:14,360 more than something. 519 00:34:19,540 --> 00:34:23,420 In general, solving this optimization problem is 520 00:34:23,420 --> 00:34:25,690 computationally prohibitive. 521 00:34:25,690 --> 00:34:30,190 So once again, in practice, what people typically resort 522 00:34:30,190 --> 00:34:32,699 to is greedy algorithms. 523 00:34:32,699 --> 00:34:36,600 And I want to look at two kinds of greedy algorithms, 524 00:34:36,600 --> 00:34:41,020 probably the two most common approaches to clustering. 525 00:34:41,020 --> 00:34:42,270 One is called k-means. 526 00:34:47,145 --> 00:34:52,385 In k-means clustering, you say I want exactly k clusters. 527 00:34:55,000 --> 00:34:57,950 And find the best k clustering. 528 00:34:57,950 --> 00:35:00,070 We'll talk about how it does that. 529 00:35:00,070 --> 00:35:03,170 And again, it's not guaranteed to find the best. 530 00:35:03,170 --> 00:35:05,000 And the other is hierarchical clustering. 531 00:35:14,210 --> 00:35:15,790 We'll come back to that shortly. 532 00:35:15,790 --> 00:35:22,750 Both are simple to understand and widely used in practice. 533 00:35:22,750 --> 00:35:28,380 So let's first talk about how we do this. 534 00:35:28,380 --> 00:35:30,555 Let's first look at hierarchical clustering. 535 00:35:38,200 --> 00:35:43,060 So we have a set of n items to be clustered. 536 00:35:43,060 --> 00:35:54,670 And let's assume we have an n by n distance matrix that 537 00:35:54,670 --> 00:35:59,790 tells me for each pair of items how far they are from 538 00:35:59,790 --> 00:36:01,040 each other. 539 00:36:03,140 --> 00:36:05,970 So we can look at an example. 540 00:36:05,970 --> 00:36:10,400 So here's an n by n distance matrix for the airline 541 00:36:10,400 --> 00:36:14,230 distance between some cities in the United States. 542 00:36:14,230 --> 00:36:17,480 The distance from Boston to Boston is 0 miles. 543 00:36:17,480 --> 00:36:20,450 Distance from New York is 206. 544 00:36:20,450 --> 00:36:23,420 The distance from Chicago to San Francisco 545 00:36:23,420 --> 00:36:26,480 is 2,142, et cetera. 546 00:36:26,480 --> 00:36:27,370 All right? 547 00:36:27,370 --> 00:36:30,020 So I have my n by n distance matrix there. 548 00:36:33,200 --> 00:36:37,360 Now let's go through how hierarchical clustering would 549 00:36:37,360 --> 00:36:40,540 relate these things to each other. 550 00:36:40,540 --> 00:36:51,490 So we start by assigning each item to its own cluster. 551 00:36:59,910 --> 00:37:03,710 So if we have n items, we now have n clusters. 552 00:37:08,250 --> 00:37:09,070 All right? 553 00:37:09,070 --> 00:37:13,430 That's the trivial solution that you suggested before. 554 00:37:13,430 --> 00:37:34,360 The next step is to find the most similar pair of clusters 555 00:37:34,360 --> 00:37:35,610 and merge them. 556 00:37:42,310 --> 00:37:49,150 So if we look here and we just-- we start, we'll have 557 00:37:49,150 --> 00:37:52,050 six clusters, one for each city. 558 00:37:52,050 --> 00:37:56,185 And we would merge the two most similar, which I guess in 559 00:37:56,185 --> 00:37:59,450 this case is New York and Boston. 560 00:37:59,450 --> 00:38:01,920 Hard to believe that those are the most similar cities. 
561 00:38:01,920 --> 00:38:05,800 But at least by this distance metric they're the closest. 562 00:38:05,800 --> 00:38:07,450 So we would merge those two. 563 00:38:14,530 --> 00:38:24,680 And then you just continue the process in principle, until 564 00:38:24,680 --> 00:38:27,420 all items are in one cluster. 565 00:38:27,420 --> 00:38:30,810 So now you have a whole hierarchy of clusters. 566 00:38:30,810 --> 00:38:33,920 And you can cut it off where you want. 567 00:38:33,920 --> 00:38:36,020 If you want to have six clusters, you could look at 568 00:38:36,020 --> 00:38:37,480 where in the hierarchy you have six. 569 00:38:37,480 --> 00:38:41,160 You can look where you have two, where you have three. 570 00:38:41,160 --> 00:38:43,610 Of course, you don't have to go all the way to finish it if 571 00:38:43,610 --> 00:38:45,670 you don't want to. 572 00:38:45,670 --> 00:38:48,450 This kind of hierarchical clustering is called 573 00:38:48,450 --> 00:38:49,700 agglomerative. 574 00:38:57,880 --> 00:38:58,420 Why? 575 00:38:58,420 --> 00:39:00,850 Well, because we're combining things. 576 00:39:00,850 --> 00:39:02,100 We're agglomerating them. 577 00:39:07,890 --> 00:39:09,310 So this is pretty 578 00:39:09,310 --> 00:39:12,025 straightforward, except for step two. 579 00:39:15,690 --> 00:39:22,460 The complication in step (2) is we have to define what it 580 00:39:22,460 --> 00:39:26,010 means to find the two most similar clusters. 581 00:39:28,560 --> 00:39:32,520 Now it's pretty easy when the clusters each contain one 582 00:39:32,520 --> 00:39:36,590 element, because, well, we have our metric-- in this 583 00:39:36,590 --> 00:39:37,400 case, distance-- 584 00:39:37,400 --> 00:39:41,180 and we can just do that as I did. 585 00:39:41,180 --> 00:39:44,810 But it's not so obvious what you do when they 586 00:39:44,810 --> 00:39:46,740 have multiple elements. 587 00:39:49,940 --> 00:39:55,250 And in fact, different metrics can be used to get different 588 00:39:55,250 --> 00:39:56,940 properties. 589 00:39:56,940 --> 00:39:58,540 So I want to talk about some of the 590 00:39:58,540 --> 00:40:00,690 metrics we use for that. 591 00:40:00,690 --> 00:40:03,645 These are typically called linkage criteria. 592 00:40:13,990 --> 00:40:18,380 So one popular one is what's called single linkage. 593 00:40:24,170 --> 00:40:25,190 It's also called 594 00:40:25,190 --> 00:40:28,040 connectedness, or minimum method. 595 00:40:28,040 --> 00:40:31,160 In this, we consider the distance between a pair of 596 00:40:31,160 --> 00:40:36,680 clusters to be equal to the shortest distance from any 597 00:40:36,680 --> 00:40:37,975 member of one cluster to any member of the other. 598 00:41:00,220 --> 00:41:04,080 So we take the two points in each cluster that are closest 599 00:41:04,080 --> 00:41:06,780 to each other and say that's the distance 600 00:41:06,780 --> 00:41:08,030 between the two clusters. 601 00:41:14,490 --> 00:41:17,335 People also use something called complete linkage-- 602 00:41:21,376 --> 00:41:25,420 It's also called diameter or maximum-- 603 00:41:25,420 --> 00:41:30,010 where we consider the distance between any two clusters to be 604 00:41:30,010 --> 00:41:32,710 the distance between the points that are furthest from 605 00:41:32,710 --> 00:41:33,960 each other. 606 00:41:40,060 --> 00:41:44,940 So in one case, essentially single linkage was looking at 607 00:41:44,940 --> 00:41:47,150 the best case.
608 00:41:47,150 --> 00:41:49,390 Complete-- 609 00:41:49,390 --> 00:41:50,870 in English, not French-- 610 00:41:50,870 --> 00:41:53,450 is looking at the worst case. 611 00:41:58,110 --> 00:42:01,870 And you won't be surprised to hear that you could also look 612 00:42:01,870 --> 00:42:11,175 at the average case, where you take all of the distances. 613 00:42:13,950 --> 00:42:16,160 So you take all of the pairwise things. 614 00:42:16,160 --> 00:42:17,020 You add them up. 615 00:42:17,020 --> 00:42:18,860 You take the average. 616 00:42:18,860 --> 00:42:20,460 You can also take the mean, the 617 00:42:20,460 --> 00:42:23,640 median, if you want instead. 618 00:42:23,640 --> 00:42:27,130 None of these is necessarily best. 619 00:42:27,130 --> 00:42:29,790 But they do give you different answers. 620 00:42:29,790 --> 00:42:32,605 And so I want to look at that now with our example here. 621 00:42:35,370 --> 00:42:38,500 So let's look at it and run it. 622 00:42:38,500 --> 00:42:41,740 So the first step is independent of what linkage 623 00:42:41,740 --> 00:42:42,850 we're using. 624 00:42:42,850 --> 00:42:46,580 We get these six clusters. 625 00:42:46,580 --> 00:42:51,790 All right, now let's look at the second step. 626 00:42:51,790 --> 00:42:55,030 Well, also pretty simple since we only have one 627 00:42:55,030 --> 00:42:57,400 element in each one. 628 00:42:57,400 --> 00:42:59,170 We're going to get that clustering. 629 00:43:02,750 --> 00:43:03,300 All right. 630 00:43:03,300 --> 00:43:08,960 Now, what about the next step? 631 00:43:08,960 --> 00:43:13,750 What do I get if I'm using the minimal 632 00:43:13,750 --> 00:43:15,350 single linkage distance? 633 00:43:15,350 --> 00:43:16,895 What gets merged here? 634 00:43:23,640 --> 00:43:24,980 Somebody? 635 00:43:24,980 --> 00:43:27,070 AUDIENCE: Boston, New York and Chicago. 636 00:43:27,070 --> 00:43:28,550 PROFESSOR: Boston, New York, and Chicago. 637 00:43:31,910 --> 00:43:35,495 And it turns out we'll get the same thing if we use other 638 00:43:35,495 --> 00:43:37,270 linkages in this case. 639 00:43:37,270 --> 00:43:38,745 Let's continue to the next step. 640 00:43:43,060 --> 00:43:45,935 Now we'll end up merging San Francisco and Seattle. 641 00:43:52,840 --> 00:43:55,840 Now we get a difference. 642 00:43:55,840 --> 00:43:57,540 What does the red represent and what 643 00:43:57,540 --> 00:43:58,520 does the blue represent? 644 00:43:58,520 --> 00:44:00,980 Which linkage criteria? 645 00:44:00,980 --> 00:44:04,590 We're saying, we could either merge Denver with Boston, New 646 00:44:04,590 --> 00:44:06,480 York, and Chicago. 647 00:44:06,480 --> 00:44:10,185 Or we could merge Denver with San Francisco and Seattle. 648 00:44:14,920 --> 00:44:16,790 Which is which? 649 00:44:16,790 --> 00:44:20,660 Which linkage criterion has put Denver in which cluster? 650 00:44:30,910 --> 00:44:33,640 Well, suppose we're using single linkage. 651 00:44:33,640 --> 00:44:37,520 Where are we getting it from? 652 00:44:37,520 --> 00:44:39,460 AUDIENCE: Boston, New York and Chicago? 653 00:44:39,460 --> 00:44:39,945 PROFESSOR: Yes. 654 00:44:39,945 --> 00:44:42,370 Because it's not so far from Chicago. 655 00:44:42,370 --> 00:44:45,924 Even though it's pretty far from Boston or New York. 656 00:44:45,924 --> 00:44:50,744 But if we use average linkage, we see on average, it's closer 657 00:44:50,744 --> 00:44:55,180 to San Francisco and Seattle than it is to the average of 658 00:44:55,180 --> 00:44:57,300 Boston, New York, or Chicago. 
659 00:44:57,300 --> 00:45:00,920 So we get a different answer. 660 00:45:00,920 --> 00:45:03,860 And then finally, at the last step, 661 00:45:03,860 --> 00:45:08,270 everything gets merged together. 662 00:45:08,270 --> 00:45:15,940 So you can see, in this case, without having labels, we have 663 00:45:15,940 --> 00:45:22,200 used a feature to produce things and, say, if we wanted 664 00:45:22,200 --> 00:45:25,840 to have three clusters, we would maybe stop here. 665 00:45:25,840 --> 00:45:28,300 And we'd say, all right, these things are one cluster. 666 00:45:28,300 --> 00:45:29,170 This is a cluster. 667 00:45:29,170 --> 00:45:31,340 And this is a cluster. 668 00:45:31,340 --> 00:45:33,640 And that's not a bad geographical clustering, 669 00:45:33,640 --> 00:45:36,990 actually, for deciding how to relate these 670 00:45:36,990 --> 00:45:38,240 things to each other. 671 00:45:40,890 --> 00:45:43,160 This technique is used a lot. 672 00:45:43,160 --> 00:45:46,260 It does have some weaknesses. 673 00:45:46,260 --> 00:45:50,050 One weakness is it's very time consuming. 674 00:45:50,050 --> 00:45:53,310 It doesn't scale well. 675 00:45:53,310 --> 00:45:58,370 The complexity is at least order n-squared, where n is 676 00:45:58,370 --> 00:46:03,900 the number of points to be clustered. 677 00:46:03,900 --> 00:46:06,060 And in fact, in many implementations, it's worse 678 00:46:06,060 --> 00:46:08,230 than n-squared. 679 00:46:08,230 --> 00:46:13,010 And of course, it doesn't necessarily find the optimal 680 00:46:13,010 --> 00:46:17,240 clustering, even giving these criteria. 681 00:46:17,240 --> 00:46:21,190 It might never at any level have the optimal clustering, 682 00:46:21,190 --> 00:46:24,770 because, again, at each step, it's making a locally optimal 683 00:46:24,770 --> 00:46:28,435 decision, not guaranteed to find the best solution. 684 00:46:33,040 --> 00:46:38,950 I should point out that a big issue in deciding to get these 685 00:46:38,950 --> 00:46:42,940 clusters or getting these clusters was 686 00:46:42,940 --> 00:46:46,920 my choice of features. 687 00:46:46,920 --> 00:46:50,580 And this is something we're going to come back to in 688 00:46:50,580 --> 00:46:54,700 spades, because I actually think it is the most important 689 00:46:54,700 --> 00:46:56,980 issue in machine learning-- 690 00:46:56,980 --> 00:47:01,310 is if we're going to say which points are similar to each 691 00:47:01,310 --> 00:47:06,840 other, we need to understand our feature space. 692 00:47:06,840 --> 00:47:10,500 So, for example, the feature I'm using here 693 00:47:10,500 --> 00:47:14,340 is distance by air. 694 00:47:14,340 --> 00:47:19,010 Suppose, instead, I added distance by air and distance 695 00:47:19,010 --> 00:47:22,900 by road and distance by train. 696 00:47:22,900 --> 00:47:25,900 Well, particularly given this sparsity of railroads in this 697 00:47:25,900 --> 00:47:29,180 country, we might get very different clustering, 698 00:47:29,180 --> 00:47:33,090 depending upon where the trains ran. 699 00:47:33,090 --> 00:47:36,590 And suppose I throw in a totally different feature like 700 00:47:36,590 --> 00:47:39,570 population. 701 00:47:39,570 --> 00:47:42,040 Well, I might get another different clustering, 702 00:47:42,040 --> 00:47:43,395 depending on how I use that. 
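To make the algorithm and the linkage criteria concrete, here is a rough agglomerative-clustering sketch over a distance matrix; the function names, the list-of-lists representation, and the stopping rule (a target number of clusters) are illustrative assumptions, not the implementation used later in the course.

def clusterDistance(c1, c2, distMatrix, linkage):
    """c1 and c2 are lists of item indices; distMatrix[i][j] is the
       distance between items i and j. Returns the distance between
       the two clusters under the chosen linkage criterion."""
    dists = [distMatrix[i][j] for i in c1 for j in c2]
    if linkage == 'single':       # shortest member-to-member distance
        return min(dists)
    elif linkage == 'complete':   # furthest member-to-member distance
        return max(dists)
    else:                         # 'average': mean of all pairwise distances
        return sum(dists) / len(dists)

def hCluster(numItems, distMatrix, numClusters, linkage='single'):
    """Starts with each item in its own cluster, then repeatedly merges
       the two most similar clusters until numClusters remain."""
    clusters = [[i] for i in range(numItems)]
    while len(clusters) > numClusters:
        best = None   # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = clusterDistance(clusters[i], clusters[j], distMatrix, linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the pair
        del clusters[j]
    return clusters

Run with a 6-by-6 matrix of the city distances shown earlier and numClusters set to 3, single and average linkage can place Denver in different clusters, which is the difference illustrated above; the nested loops over pairs of clusters are also where the at-least-quadratic cost comes from.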
703 00:47:46,780 --> 00:47:54,910 What we typically need to do in these situations, dealing 704 00:47:54,910 --> 00:47:57,660 with multi-dimensional data-- and most data is 705 00:47:57,660 --> 00:47:59,230 multi-dimensional-- 706 00:47:59,230 --> 00:48:06,790 is we construct something called a feature vector that 707 00:48:06,790 --> 00:48:10,830 incorporates multiple features. 708 00:48:10,830 --> 00:48:14,870 So we might have for each city-- 709 00:48:14,870 --> 00:48:16,805 we'll just take something like the distance. 710 00:48:22,600 --> 00:48:26,720 Or let's say, instead of distance, we'll compute the 711 00:48:26,720 --> 00:48:33,810 distance by having for each city its GPS coordinates, 712 00:48:33,810 --> 00:48:39,110 where it is on the globe, and its population. 713 00:48:39,110 --> 00:48:42,840 And let's say that's how we define a city. 714 00:48:42,840 --> 00:48:45,070 And that would be our feature vector. 715 00:48:45,070 --> 00:48:47,560 And then we would cluster it, say, using hierarchical 716 00:48:47,560 --> 00:48:51,140 clustering to determine which cities are most like which 717 00:48:51,140 --> 00:48:53,340 other cities. 718 00:48:53,340 --> 00:48:56,280 Well, it's a little bit complicated. 719 00:48:56,280 --> 00:49:01,760 I have to ask how do I compare feature vectors? 720 00:49:01,760 --> 00:49:04,990 What distance metric do I use there? 721 00:49:04,990 --> 00:49:09,600 Do I get confused that GPS coordinates and populations 722 00:49:09,600 --> 00:49:12,700 are essentially unrelated? 723 00:49:12,700 --> 00:49:16,240 And I wouldn't like to compare those to each other. 724 00:49:16,240 --> 00:49:19,240 Lots of issues there, and that's what we're going to 725 00:49:19,240 --> 00:49:22,500 talk about when we come back from Patriot's Day-- 726 00:49:22,500 --> 00:49:26,440 is how in the real world problems, we go from the large 727 00:49:26,440 --> 00:49:30,430 number of features associated with objects or things in the 728 00:49:30,430 --> 00:49:34,680 real world to feature vectors that allow us to automatically 729 00:49:34,680 --> 00:49:37,830 deduce which things are quote "most similar" 730 00:49:37,830 --> 00:49:39,080 to which other things.
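As a small preview of that issue, the sketch below uses made-up feature vectors (latitude, longitude, raw population) and a plain Euclidean distance; the numbers are illustrative assumptions, not data from the lecture.

def euclideanDistance(v1, v2):
    """v1 and v2 are equal-length sequences of numbers."""
    return sum((a - b)**2 for a, b in zip(v1, v2))**0.5

# Hypothetical feature vectors: (latitude, longitude, population)
boston = (42.36, -71.06, 650000)
chicago = (41.88, -87.63, 2700000)

# Because population is numerically huge compared with GPS coordinates,
# it dominates euclideanDistance(boston, chicago); without scaling or
# weighting the features, "most similar" ends up meaning "closest in
# population," which may not be what we intended.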