PATRICK WINSTON: We've now almost completed our journey. This will be it for talking about several kinds of learning -- the venerable kind, that's the nearest neighbors and identification tree types of learning. Still useful, still the right thing to do if there's no reason not to do the simple thing.

Then we have the biologically-inspired approaches. Neural nets -- all kinds of problems with local maxima and overfitting and oscillation, if you get the rate constant too big. Genetic algorithms. Like neural nets, both are very naive in their attempt to mimic nature. So maybe they work on a class of problems. They surely do each have a class of problems for which they're good. But as a general-purpose first resort, I don't recommend them.

But now the theorists have come out and done some things that are very remarkable. And in the end, you have to say, wow, these are such powerful ideas. I wonder if nature has discovered them, too? Is there good engineering in the brain, based on good science?
Or given the nature of evolution, is it just random junk that happens to be the best way of doing anything? Who knows?

But today, we're going to talk about an idea that I'll bet is in there somewhere, because it's easy to implement, and it's extremely powerful in what it does, and it's the essential item in anybody's repertoire of learning mechanisms. It's also a mechanism which, if you understand it only by formula, you will never be able to work the problems on the quiz, that's for sure. Because on the surface, it looks like it would be very complicated to simulate this approach. But once you understand how it works, and look at a little bit of the math and let it sing songs to you, it turns out to be extremely easy.

So it's about letting multiple methods work on your behalf. So far, we've been talking about using just one method to do something. And what we're going to do now is look to see if a crowd can be smarter than the individuals in the crowd.

But before we get too far down that abstract path, let me just say that the whole works has to do with classification -- binary classification. Am I holding a piece of chalk in my hand, or a hand grenade?
Is that a cup of coffee or tea? Those are binary classification problems. And so we're going to be talking today strictly about binary classification. We're not going to be talking about finding the right letter in the alphabet that's written on the page -- that's a 26-way choice. We're talking about binary choices.

So we assume that there's a set of classifiers that we can draw on. Here's one -- h. And it produces either a minus 1 or a plus 1. So that's how the classification is done. If it's coffee, plus 1. If it's tea, minus 1. If it's chalk, plus 1. If it's a hand grenade, minus 1. So that's how the classification works.

Now, too bad for us, normally the world doesn't give us very good classifiers. So if we look at the error rate of this classifier, or any other classifier, that error rate will range from 0 to 1, in terms of the fraction of the cases it got wrong on a sample set. So you'd like your error rate to be way down here. You're dead if it's over there. But what about in the middle? What if it's, say, right there?
Just a little bit better than flipping a coin. If it's just a little bit better than flipping a coin, that's a weak classifier. And the question is, can you make a classifier that's way over here, like there -- a strong classifier -- by combining several of these weak classifiers, and letting them vote?

So how would you do that? You might say, well, let us make a big classifier, capital H, that works on some sample x, and its output produces something that depends on the sum of the outputs of the individual classifiers. So we have h1 working on x. We have h2 working on x. And we have h3 also working on x. Let's say three of them, just to start us off. And now let's add those guys up, and take the sign of the sum.

So if two out of the three of those guys agree, then we'll get either a plus 1 or a minus 1. If all three agree, we'll get plus 1 or minus 1. Because we're just taking the sign. We're just taking the sign of the sum of these guys. So this means that one guy can be wrong, as long as the other two guys are right.
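That sign-of-a-sum vote can be written out in a few lines. Here is a minimal sketch in Python; the three weak classifiers are made-up thresholds on a one-dimensional sample, just to show the mechanics, not anything from the board.

```python
# Majority vote: H(x) = sign(h1(x) + h2(x) + h3(x)), where each weak
# classifier returns +1 or -1. The three classifiers below are
# hypothetical stand-ins for illustration.

def make_ensemble(classifiers):
    """Return H(x): the sign of the summed votes."""
    def H(x):
        total = sum(h(x) for h in classifiers)
        return 1 if total >= 0 else -1
    return H

# Three toy weak classifiers on a 1-D sample.
h1 = lambda x: 1 if x > 0 else -1
h2 = lambda x: 1 if x > 1 else -1
h3 = lambda x: 1 if x > -1 else -1

H = make_ensemble([h1, h2, h3])
```

With these toy tests, a sample at 0.5 gets votes +1, -1, +1, so the majority carries it to plus 1 even though h2 is wrong there -- exactly the one-guy-can-be-wrong situation just described.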
But I think it's easier to see how this all works if you think of some space of samples. You say, well, let's let that area here be where h1 is wrong, and this area over here is where h2 is wrong. And then this area over here is where h3 is wrong. So if the situation is like that, then this formula always gives you the right answer on the samples.

I'm going to stop saying "on the samples" right now, because I want it to be a kind of background thing that all of this is on the sample set. We're talking about wrapping this stuff over the sample set. Later on, we'll ask, OK, given that you trained this thing on a sample set, how well does it do on some new examples? Because we want to ask ourselves about overfitting questions.

But for now, we just want to look and see if we believe that this arrangement -- where each of these h's is producing plus 1 or minus 1, and we're adding them up and taking the sign -- is going to give us a better result than the tests individually. And if they look like this when draped over a sample set, then it's clear that we're going to get the right answer every time, because there's no area here where any two of those tests are giving us the wrong answer.
So for the one that's getting the wrong answer -- in this little circle here for h1 -- these other two are getting the right answer. So they'll outvote it, and you'll get the right answer every time.

But it doesn't have to be that simple. It could look like this. There could be a situation where this is h1, wrong answer. This is h2, wrong answer. And this is h3, wrong answer. And now the situation gets a little bit more murky, because we have to ask ourselves whether that area where three out of the three get it wrong is sufficiently big so as to be worse than one of the individual tests.

So if you look at that Venn diagram, and stare at it long enough, and try some things, you can say, well, there is no case where this will give a worse answer. Or, you might end up with the conclusion that there are cases where we can arrange those circles such that the voting scheme will give an answer that's worse than an individual test. But I'm not going to tell you the answer, because I think we'll make that a quiz question. Good idea? OK. So we'll make that a quiz question.
So that looks like a good idea. And we can construct a little algorithm that will help us pick the particular weak classifiers to plug in here. We've got a whole bag of classifiers -- we've got h1, we've got h2, we've got h55. We've got a lot of them we can choose from.

So what we're going to do is use the data, undisturbed, to produce h1. We're just going to try all the tests on the data and see which one gives us the smallest error rate. And that's the good guy, so we're going to use that.

Then we're going to use the data with an exaggeration of h1's errors. In other words -- this is a critical idea -- we're going to run this algorithm again, but instead of just looking at the number of samples that we got wrong, we're going to look at a distorted set of samples, where the ones we're not doing well on have an exaggerated effect on the result. So we're going to weight them, or multiply them, or do something so that we pay more attention to the samples on which h1 produces an error. And that's going to give us h2.
And then we're going to do it one more time, because we've got three things to go with here in this particular little exploratory scheme. And this time, we're going to have an exaggeration of those samples -- which samples are we going to exaggerate now? We might as well look for the ones where h1 gives us a different answer from h2, because we want to be on the good guy's side. So we can say we're going to exaggerate those samples for which h1 gives us a different result from h2. And that's going to give us h3.

All right. So we can think of this whole works here as part one of a multi-part idea.

So let's see. I don't know, what might be step two? Well, this is a good idea. Then what we've got, which we can easily derive from that, is a little tree that looks like this. And we can say that H of x depends on h1, h2, and h3. But now, if that's a good idea, and that gives a better answer than any of the individual tests, maybe we can make this idea a little bit recursive, and say, well, maybe h1 is actually not an atomic test. Maybe it's the vote of three other tests.
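Here is a rough sketch of that three-round picking scheme. One assumption: "exaggeration" is modeled by simply duplicating the emphasized samples before re-scoring each candidate test (the lecture replaces this with proper weights later). The candidate tests and the data are made up for illustration.

```python
# Three-round scheme: pick h1 on the raw data, pick h2 with h1's
# mistakes exaggerated (here: counted twice), pick h3 with the
# h1-vs-h2 disagreements exaggerated.

def error(h, samples):
    """Fraction of (x, label) pairs the test h gets wrong."""
    return sum(1 for x, y in samples if h(x) != y) / len(samples)

def pick_best(tests, samples):
    """The 'good guy': the test with the smallest error rate."""
    return min(tests, key=lambda h: error(h, samples))

def three_rounds(tests, data):
    # Round 1: the data, undisturbed, produces h1.
    h1 = pick_best(tests, data)
    # Round 2: exaggerate the samples h1 got wrong.
    round2 = data + [s for s in data if h1(s[0]) != s[1]]
    h2 = pick_best(tests, round2)
    # Round 3: exaggerate the samples where h1 and h2 disagree.
    round3 = data + [s for s in data if h1(s[0]) != h2(s[0])]
    h3 = pick_best(tests, round3)
    return h1, h2, h3

tests = [lambda x: 1 if x > 0 else -1,   # a decent threshold test
         lambda x: 1]                    # always says plus 1
data = [(-2, -1), (-1, -1), (1, 1), (2, 1)]
h1, h2, h3 = three_rounds(tests, data)
```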
So you can make a tree structure that looks like this. So this is h11, h12, h13, and then h3 here. And then this will be h31, h32, h33. And so that's a sort of get-out-the-vote idea. We're trying to get a whole bunch of individual tests into the act.

So I guess the reason this wasn't discovered until about 10 years ago was because you've got to get so many of these ducks all lined up before the idea gets through that long filter of ideas. So that's only idea number two of quite a few.

Well, the next thing we might think is, well, we keep talking about these classifiers. What kind of classifiers are we talking about? I've got -- oh, shoot, I've spent my last nickel. I don't have a coin to flip. But that's one classifier, right? The trouble with that classifier is it's a weak classifier, because it gives me a 50/50 chance of being right. I guess there are conditions in which a coin flip is better than a -- well, it is a weak classifier. If the two outcomes are not equally probable, then a coin flip is a perfectly good weak classifier.
But what we're going to do is think in terms of a different set of classifiers. And we're going to call them decision tree stumps. Now, you remember decision trees, right? But we're not going to build decision trees. We're going to use decision tree stumps.

So if we have a two-dimensional space that looks like this, then a decision tree stump is a single test. It's not a complete tree that will divide up the samples into homogeneous groups. It's just what you can do with one test. So each possible test is a classifier. How many tests do we get out of that? 12, right? Yeah. It doesn't look like 12 to me, either. But here's how you get to 12.

One decision tree test you can stick in there would be that test right there. And that would be a complete decision tree stump. But, of course, you can also put in this one. That would be another decision tree stump. Now, for this one on the right, I could say everything on the right is a minus. Or, I could say everything on the right is a plus.
It would happen to be wrong, but it's a valid test with a valid outcome. So that's how we double the number of tests that we have lines for. And you know what? You can even have a kind of test out here that says everything is plus, or everything is minus.

So for each dimension, the number of decision tree stumps is the number of lines I can put in, times 2. And then I've got two dimensions here; that's how I got to 12. So there are three lines. I can have the pluses on either the left or the right side, so that's six. And then I've got two dimensions, so that gives me 12. So that's the decision tree stump idea. And here are the other decision tree boundaries, obviously, just like that. So that's one way you can generate a batch of tests to try out with this idea of using a lot of tests to help you get the job done.

STUDENT: Couldn't you also have a decision tree on the right side?

PATRICK WINSTON: The question is, can you also have a test on the right side?
See, this is just a stand-in for saying everything's plus or everything's minus. So it doesn't matter where you put the line. It can be on the right side, or the left side, or the bottom, or the top. Or you don't have to put the line anywhere. It's just an extra test, in addition to the ones you put between the samples.

So this whole idea of boosting, the main idea of the day -- does it depend on using decision tree stumps? The answer is no. Do not be confused. You can use boosting with any kind of classifier. So why do I use decision tree stumps today? Because it makes my life easy. We can look at it, we can see what it's doing. But we could put a bunch of neural nets in there. We could put a bunch of real decision trees in there. We could put a bunch of nearest neighbor things in there. The boosting idea doesn't care. I just use decision tree stumps because I and everybody else use them for illustration.

All right. We're making progress. Now, what's the error rate for any of these tests and lines we drew?
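That 12-stump count can be checked mechanically. Here is a sketch with made-up sample coordinates, arranged so each dimension allows three separating lines; the extra line placed outside the samples stands in for the everything-is-plus / everything-is-minus tests.

```python
# Enumerate decision tree stumps for a 2-D sample set: per dimension,
# one candidate threshold between each pair of adjacent distinct
# coordinate values, plus one outside all the samples (the constant
# everything-plus / everything-minus test), times two polarities.

def enumerate_stumps(points):
    """points: list of (x, y) coordinates. Returns (dim, threshold,
    polarity) triples; a stump answers `polarity` when the coordinate
    exceeds the threshold, and -polarity otherwise."""
    stumps = []
    for dim in range(2):
        values = sorted({p[dim] for p in points})
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
        thresholds.append(values[-1] + 1.0)  # a line outside everything
        for t in thresholds:
            for polarity in (+1, -1):
                stumps.append((dim, t, polarity))
    return stumps

def apply_stump(stump, point):
    dim, t, polarity = stump
    return polarity if point[dim] > t else -polarity

# Three samples with distinct coordinates in both dimensions:
# 3 lines x 2 polarities x 2 dimensions = 12 stumps.
points = [(1, 1), (2, 3), (3, 2)]
```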
Well, I guess the error rate is equal to the sum of 1 over n -- that's the total number of points, the number of samples -- summed over the cases where we are wrong.

So gee, we're going to work on combining some of these ideas. And we've got this notion of exaggeration. At some stage in what we're doing here, we're going to want to be able to exaggerate the effect of some errors relative to other errors. So one thing we can do is assume, or stipulate, or assert that each of these samples has a weight associated with it. That's w1, this is w2, and that's w3. And in the beginning, there's no reason to suppose that any one of these is more or less important than any of the others. So in the beginning, w sub i, at time 1, is equal to 1 over n.

So the error is just adding up the number of samples that we got wrong. And that'll be the fraction of samples that you didn't get right. And that will be the error rate.
So what we want to do is say, instead of using this as the error rate for all time, we want to move that over, and say that the error rate is equal to the sum, over the things you got wrong in the current step, of the weights of the ones that were got wrong. So in step one, everything's got the same weight; it doesn't matter. But if we find a way to change their weights going downstream -- so as to, for example, highly exaggerate that third sample -- then w3 will go up relative to w1 and w2.

The one thing we want to be sure of is that no matter how we adjust the weights, the sum of the weights over the whole space is equal to 1. So in other words, we want to choose the weights so that they emphasize some of the samples, but we also want to put a constraint on the weights such that all of them added together sum to 1. And we'll say that that enforces a distribution. A distribution is a set of weights that sum to 1.

Well, that's just a nice idea. So we're making a little progress. We've got this idea that we can add some plus/minus 1 classifiers together and get a better classifier.
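The weighted error rate just described can be sketched in a few lines. The samples and the test are made up; note that with uniform weights 1 over n it reduces to the plain fraction-wrong rate from before.

```python
# Weighted error: sum the weights of the samples the test got wrong.
# The weights form a distribution (they sum to 1), so the result
# always stays in [0, 1].

def weighted_error(h, samples, weights):
    """samples: list of (x, label) with labels +1/-1; weights sum to 1."""
    return sum(w for (x, y), w in zip(samples, weights) if h(x) != y)

samples = [(-2, -1), (-1, -1), (1, 1), (2, 1)]
uniform = [1 / len(samples)] * len(samples)   # w_i at time 1 is 1/n

h = lambda x: 1 if x > -1.5 else -1   # wrong only on the sample at x = -1
```

With uniform weights this test's error is 1/4. If the weights are skewed to exaggerate the misclassified sample -- say its weight becomes 0.7 -- the same test's error jumps to 0.7, even though it still misses only one point.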
We've got some idea about how to do that. It occurs to us that maybe we want to get a lot of classifiers into the act somehow or another. And maybe we want to think about using decision tree stumps so as to ground our thinking about all this stuff.

So the next step is to say, well, how actually should we combine this stuff? And you will find, in the literature, libraries full of papers that do stuff like that. And that was state of the art for quite a few years. But then people began to say, well, maybe we can build up this classifier, H of x, in multiple steps, and get a lot of classifiers into the act.

So maybe we can say that the classifier is the sign of h1 -- that's the one we picked first. That's the classifier we picked first. That's looking at samples. And then we've got h2. And then we've got h3. And then we've got however many other classifiers we might want, or might need, in order to correctly classify everything in our sample set. So people began to think about whether there might be an algorithm that would develop a classifier that way, one step at a time.
That's why I put that step number in the exponent -- because we're picking this one first, then we're expanding it to have two, and then we're expanding it to have three, and so on. And each of those individual classifiers is separately looking at the sample.

But of course, it would be natural to suppose that just adding things up wouldn't be enough. And it's not. So it isn't too hard to invent the next idea, which is to modify this thing just a little bit by doing what? It looks almost like a scoring polynomial, doesn't it? So what would we do to tart this up a little bit?

STUDENT: [INAUDIBLE].

PATRICK WINSTON: Come again? Do what?

STUDENT: [INAUDIBLE].

PATRICK WINSTON: Somewhere out there someone's murmuring.

STUDENT: Add--

PATRICK WINSTON: Add weights!

STUDENT: --weights. Yeah.

PATRICK WINSTON: Excellent. Good idea. So what we're going to do is have alphas associated with each of these classifiers, and we're going to determine if somebody can build that kind of formula to do the job.
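The weighted vote this suggests -- H of x equals the sign of alpha 1 times h1 of x, plus alpha 2 times h2 of x, and so on -- looks like this as a sketch. The alpha values below are arbitrary placeholders, since how to compute them is exactly what comes next.

```python
# Weighted vote: H(x) = sign(alpha_1*h_1(x) + ... + alpha_T*h_T(x)).
# A bigger alpha means that classifier's opinion counts for more.
# The alphas here are placeholders, not computed from any data.

def weighted_vote(alphas, classifiers):
    def H(x):
        total = sum(a * h(x) for a, h in zip(alphas, classifiers))
        return 1 if total >= 0 else -1
    return H

h1 = lambda x: 1 if x > 0 else -1
h2 = lambda x: -1                 # always votes minus 1
H = weighted_vote([2.0, 0.5], [h1, h2])
```

At x = 1 the sum is 2.0 minus 0.5, so H says plus 1: the heavily weighted h1 outvotes h2.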
So maybe I ought to modify this gold star idea before I get too far downstream. We're not going to treat everybody in a crowd equally. We're going to weight some of the opinions more than others. And by the way, they're all going to make errors in different parts of the space. So maybe it's not the wisdom of even a weighted crowd, but a crowd of experts, each of which is good at different parts of the space.

So anyhow, we've got this formula, and there are a few things that one can say turn out. But first, let's write down an algorithm for what this ought to look like. Before I run out of space, I think I'll exploit the right-hand board here, and put the overall algorithm right here.

So we're going to start out by letting all of the weights at time 1 be equal to 1 over n. That's just saying that they're all equal in the beginning, and they're equal to 1 over n. And n is the number of samples. And then, when I've got that, I want to compute alpha, somehow. Let's see. No, I don't want to do that. I want to pick a classifier that minimizes the error rate.
443 00:25:37,730 --> 00:25:43,050 A classifier that minimizes the error at time t. 444 00:25:43,050 --> 00:25:45,230 And that's going to be at time t. 445 00:25:45,230 --> 00:25:46,340 And we're going to come back in here. 446 00:25:46,340 --> 00:25:50,160 That's why we put a step index in there. 447 00:25:50,160 --> 00:25:56,790 So once we've picked a classifier that produces an 448 00:25:56,790 --> 00:25:59,210 error rate, then we can use the error rate to 449 00:25:59,210 --> 00:26:00,350 determine the alpha. 450 00:26:00,350 --> 00:26:02,260 So I want the alpha over here. 451 00:26:07,910 --> 00:26:11,900 That'll be sort of a byproduct of picking that test. 452 00:26:11,900 --> 00:26:14,890 And with all that stuff in hand, maybe that will be 453 00:26:14,890 --> 00:26:20,480 enough to calculate Wt plus 1. 454 00:26:28,600 --> 00:26:33,162 So we're going to use that classifier that we just picked 455 00:26:33,162 --> 00:26:36,040 to get some revised weights, and then we're going to go 456 00:26:36,040 --> 00:26:41,870 around that loop until this classifier produces a perfect 457 00:26:41,870 --> 00:26:46,290 set of conclusions on all the sample data. 458 00:26:46,290 --> 00:26:49,560 So that's going to be our overall strategy. 459 00:26:49,560 --> 00:26:51,800 Maybe we've got, if we're going to number these things, 460 00:26:51,800 --> 00:26:54,960 that's the fourth big idea. 461 00:26:54,960 --> 00:26:59,350 And this arrangement here is the fifth big idea. 462 00:26:59,350 --> 00:27:01,390 Then we've got the sixth big idea. 463 00:27:01,390 --> 00:27:04,350 And the sixth big idea says this.
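The loop being put on the board can be rendered as a short program. This is a minimal sketch, not the class demo code: the representation of classifiers as plus-1/minus-1 functions, the sample and label lists, and the function names are all my own assumptions.

```python
import math

def boost(samples, labels, classifiers, max_rounds=10):
    """The loop on the board: start with all weights equal to 1/n,
    repeatedly pick the classifier with the lowest weighted error,
    compute its alpha from that error, re-weight the samples, and
    stop when the weighted vote is perfect on every sample."""
    n = len(samples)
    w = [1.0 / n] * n                          # weights at time 1: all 1/n
    ensemble = []                              # (alpha, classifier) pairs
    for _ in range(max_rounds):
        # pick the classifier that minimizes the weighted error rate
        def error(h):
            return sum(wi for wi, x, y in zip(w, samples, labels)
                       if h(x) != y)
        h = min(classifiers, key=error)
        eps = error(h)
        if eps == 0:                           # perfect on its own: done
            ensemble.append((1.0, h))
            break
        # alpha is a byproduct of the error rate
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # exponential re-weighting, then divide by z so weights sum to 1
        w = [wi * math.exp(-alpha if h(x) == y else alpha)
             for wi, x, y in zip(w, samples, labels)]
        z = sum(w)
        w = [wi / z for wi in w]
        if all(vote(ensemble, x) == y for x, y in zip(samples, labels)):
            break                              # ensemble is perfect: stop
    return ensemble

def vote(ensemble, x):
    """Sign of the alpha-weighted sum of the individual opinions."""
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1
```

On three 1-D points labeled plus, minus, plus, with decision stumps as the classifiers, this converges in three rounds, one of which is a stump that calls everything plus, much like the three-classifier solution in the demo described below.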
464 00:27:06,940 --> 00:27:19,340 Suppose that the weight on the ith sample at time t plus 1 is 465 00:27:19,340 --> 00:27:28,600 equal to the weight at time t on that same sample, divided 466 00:27:28,600 --> 00:27:38,150 by some normalizing factor, times e to the minus alpha at 467 00:27:38,150 --> 00:27:52,750 time t, times h at time t, times some function y which is 468 00:27:52,750 --> 00:27:58,160 a function of x, but not a function of time. 469 00:27:58,160 --> 00:28:01,280 Now you say, where did this come from? 470 00:28:01,280 --> 00:28:03,670 And the answer is, it did not spring from the heart of a 471 00:28:03,670 --> 00:28:06,190 mathematician in the first 10 minutes that he 472 00:28:06,190 --> 00:28:07,800 looked at this problem. 473 00:28:07,800 --> 00:28:09,550 In fact, when I asked [INAUDIBLE] 474 00:28:09,550 --> 00:28:13,300 how this worked, he said, well, he was thinking about 475 00:28:13,300 --> 00:28:15,630 this on the couch every Saturday for about a year, and 476 00:28:15,630 --> 00:28:18,200 his wife was getting pretty sore, but he finally found it 477 00:28:18,200 --> 00:28:20,590 and saved their marriage. 478 00:28:20,590 --> 00:28:23,950 So where does stuff like this come from? 479 00:28:23,950 --> 00:28:27,080 Really, it comes from knowing a lot of mathematics, and 480 00:28:27,080 --> 00:28:29,280 seeing a lot of situations, and knowing that something 481 00:28:29,280 --> 00:28:34,570 like this might be mathematically convenient. 482 00:28:34,570 --> 00:28:40,080 Something like this might be mathematically convenient. 483 00:28:40,080 --> 00:28:42,670 But we've got to back up a little and let it sing to us. 484 00:28:42,670 --> 00:28:44,010 What's y? 485 00:28:44,010 --> 00:28:45,100 We saw y last time, with 486 00:28:45,100 --> 00:28:46,910 the support vector machines. 487 00:28:46,910 --> 00:28:47,780 That's just a function.
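The spoken formula, written out in one line (a reconstruction in standard notation; z is the normalizing factor he mentions):

```latex
w_i^{\,t+1} \;=\; \frac{w_i^{\,t}}{z}\; e^{-\,\alpha^{t}\, h^{t}(x_i)\, y(x_i)}
```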
488 00:28:47,780 --> 00:28:51,270 That's plus 1 or minus 1, depending on whether the 489 00:28:51,270 --> 00:28:55,310 output ought to be plus 1 or minus 1. 490 00:28:55,310 --> 00:29:02,200 So if this guy is giving the correct answer, and the 491 00:29:02,200 --> 00:29:06,630 correct answer is plus, and then this guy will be plus 1 492 00:29:06,630 --> 00:29:10,210 too, because it always gives you the correct answer. 493 00:29:10,210 --> 00:29:12,330 So in that case, where this guy is giving the right 494 00:29:12,330 --> 00:29:15,190 answer, these will have the same sign, so that will be a 495 00:29:15,190 --> 00:29:16,960 plus 1 combination. 496 00:29:16,960 --> 00:29:19,000 On the other hand, if that guy's giving the wrong answer, 497 00:29:19,000 --> 00:29:22,450 you're going to get a minus 1 out of that combination. 498 00:29:22,450 --> 00:29:25,680 So it's true even if the right answer should be minus, right? 499 00:29:25,680 --> 00:29:28,320 So if the right answer should be minus, and this is plus, 500 00:29:28,320 --> 00:29:30,820 then this will be minus 1, and the whole combination will 501 00:29:30,820 --> 00:29:31,945 give you minus 1 again. 502 00:29:31,945 --> 00:29:36,360 In other words, the y just flips the sign if you've got 503 00:29:36,360 --> 00:29:39,170 the wrong answer, no matter whether the wrong answer is 504 00:29:39,170 --> 00:29:42,330 plus 1 or minus 1. 505 00:29:42,330 --> 00:29:43,650 These alphas-- 506 00:29:43,650 --> 00:29:46,420 shoot, those are the same alphas that are in this 507 00:29:46,420 --> 00:29:49,950 formula up here, somehow. 508 00:29:49,950 --> 00:29:52,840 And then that z, what's that for? 509 00:29:52,840 --> 00:29:55,650 Well, if you just take the previous weights, and use this 510 00:29:55,650 --> 00:30:00,900 exponential function to produce these W's for the next 511 00:30:00,900 --> 00:30:04,910 generation, that's not going to be a distribution, because 512 00:30:04,910 --> 00:30:07,620 they won't sum up to 1.
513 00:30:07,620 --> 00:30:11,470 So what this thing here, this z is, that's a sort of 514 00:30:11,470 --> 00:30:12,720 normalizer. 515 00:30:18,750 --> 00:30:21,680 And that makes that whole combination of new 516 00:30:21,680 --> 00:30:23,980 weights add up to 1. 517 00:30:23,980 --> 00:30:31,570 So it's whatever you got by adding up all those guys, and 518 00:30:31,570 --> 00:30:34,660 then dividing by that number. 519 00:30:34,660 --> 00:30:35,910 Well, phew. 520 00:30:43,030 --> 00:30:44,350 I don't know. 521 00:30:44,350 --> 00:30:45,600 Now there's some it-turns-out-thats. 522 00:30:50,360 --> 00:30:52,230 We're going to imagine that somebody's done the same sort 523 00:30:52,230 --> 00:30:54,940 of thing we did to the support vector machines. 524 00:30:54,940 --> 00:30:57,730 We're going to find a way to minimize the error. 525 00:30:57,730 --> 00:30:59,540 And the error we're going to minimize is the error produced 526 00:30:59,540 --> 00:31:02,420 by that whole thing up there in 4. 527 00:31:02,420 --> 00:31:05,120 We're going to minimize the error of that entire 528 00:31:05,120 --> 00:31:06,370 expression as we go along. 529 00:31:08,930 --> 00:31:11,970 And what we discover when we do the appropriate 530 00:31:11,970 --> 00:31:13,775 differentiations and stuff-- 531 00:31:13,775 --> 00:31:15,710 you know, that's what we do in calculus-- 532 00:31:15,710 --> 00:31:24,580 what we discover is that you get minimum error for the 533 00:31:24,580 --> 00:31:45,970 whole thing if alpha is equal to 1 minus the error rate at 534 00:31:45,970 --> 00:31:51,190 time t, divided by the error rate at time t. 535 00:31:51,190 --> 00:31:53,950 Now let's take the logarithm of that, and 536 00:31:53,950 --> 00:31:56,220 multiply it by half. 537 00:31:56,220 --> 00:31:57,140 And that's what [INAUDIBLE] 538 00:31:57,140 --> 00:31:59,880 was struggling to find. 539 00:31:59,880 --> 00:32:01,350 But we haven't quite got it right. 
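The it-turns-out-that being described, written out (standard notation; minimizing the error bound of the whole expression in big idea 5 yields this alpha):

```latex
\alpha^{t} \;=\; \frac{1}{2}\,\ln\frac{1-\epsilon^{t}}{\epsilon^{t}}
```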
540 00:32:01,350 --> 00:32:03,800 And so let me add this in separate chunks, so we don't 541 00:32:03,800 --> 00:32:05,926 get confused about this. 542 00:32:05,926 --> 00:32:12,880 It's a bound on that expression up there. 543 00:32:12,880 --> 00:32:16,510 It's a bound on the error rate produced by that expression. 544 00:32:16,510 --> 00:32:22,540 So interestingly enough, this means that the error rate can 545 00:32:22,540 --> 00:32:26,000 actually go up as you add terms to this formula. 546 00:32:26,000 --> 00:32:28,560 All you know is that the error rate is going to be bounded by 547 00:32:28,560 --> 00:32:32,080 an exponentially decaying function. 548 00:32:32,080 --> 00:32:36,910 So it's eventually guaranteed to converge on zero. 549 00:32:36,910 --> 00:32:38,260 So it's a minimal error bound. 550 00:32:38,260 --> 00:32:39,510 It turns out to be exponential. 551 00:32:43,120 --> 00:32:45,630 Well, there it is. 552 00:32:45,630 --> 00:32:46,120 We're done. 553 00:32:46,120 --> 00:32:48,207 Would you like to see a demonstration? 554 00:32:48,207 --> 00:32:49,550 Yeah, OK. 555 00:32:49,550 --> 00:32:51,260 Because you look at that, and you say, well, how could 556 00:32:51,260 --> 00:32:53,800 anything like that possibly work? 557 00:32:53,800 --> 00:32:57,120 And the answer is, surprisingly enough, here's 558 00:32:57,120 --> 00:32:59,720 what happens. 559 00:32:59,720 --> 00:33:02,440 There's a simple little example. 560 00:33:02,440 --> 00:33:05,310 So that's the first test chosen. 561 00:33:05,310 --> 00:33:09,470 The greens are pluses and the reds are minuses, so it's 562 00:33:09,470 --> 00:33:11,480 still got an error. 563 00:33:11,480 --> 00:33:12,620 Still got an error-- boom. 564 00:33:12,620 --> 00:33:13,830 There, in two steps.
565 00:33:13,830 --> 00:33:14,600 It now has-- 566 00:33:14,600 --> 00:33:16,670 we can look in the upper right hand corner-- 567 00:33:16,670 --> 00:33:20,460 we see it's used three classifiers, and we see that 568 00:33:20,460 --> 00:33:22,900 one of those classifiers says that everybody belongs to a 569 00:33:22,900 --> 00:33:27,250 particular class, three different weights. 570 00:33:27,250 --> 00:33:30,540 And the error rate has converged to 0. 571 00:33:30,540 --> 00:33:32,170 So let's look at a couple of other ones. 572 00:33:32,170 --> 00:33:35,060 Here is the one I use for debugging this thing. 573 00:33:35,060 --> 00:33:36,250 We'll let that run. 574 00:33:36,250 --> 00:33:37,690 See how fast it is? 575 00:33:37,690 --> 00:33:38,710 Boom. 576 00:33:38,710 --> 00:33:42,800 It converges to getting all the samples right very fast. 577 00:33:42,800 --> 00:33:44,190 Here's another one. 578 00:33:44,190 --> 00:33:47,350 This is one we gave on an exam a few years back. 579 00:33:47,350 --> 00:33:48,670 First test. 580 00:33:48,670 --> 00:33:50,620 Oh, I let it run, so it got everything 581 00:33:50,620 --> 00:33:52,380 instantaneously right. 582 00:33:52,380 --> 00:33:53,950 Let's take that through a step at a time. 583 00:33:53,950 --> 00:33:56,940 There's the first one, second one. 584 00:33:56,940 --> 00:33:58,800 Still got a lot of errors. 585 00:33:58,800 --> 00:34:01,600 Ah, the error rate's dropping. 586 00:34:01,600 --> 00:34:06,160 And then flattened, flattened, and it goes to 0. 587 00:34:06,160 --> 00:34:08,000 Cool, don't you think? 588 00:34:08,000 --> 00:34:10,010 But you say to me, bah, who cares about that stuff? 589 00:34:10,010 --> 00:34:11,540 Let's try something more interesting. 590 00:34:11,540 --> 00:34:14,190 There's one. 591 00:34:14,190 --> 00:34:15,500 That was pretty fast, too. 592 00:34:15,500 --> 00:34:17,090 Well, there's not too many samples here. 593 00:34:17,090 --> 00:34:20,030 So we can try this.
594 00:34:20,030 --> 00:34:22,230 So there's an array of pluses and minuses. 595 00:34:22,230 --> 00:34:22,940 Boom. 596 00:34:22,940 --> 00:34:24,920 You can see how that error rate is bounded by an 597 00:34:24,920 --> 00:34:26,170 exponential? 598 00:34:27,920 --> 00:34:32,800 So in the bottom graph, you've got the number of classifiers 599 00:34:32,800 --> 00:34:36,650 involved, and that goes up to a total, eventually, of 10. 600 00:34:36,650 --> 00:34:41,230 You can see how positive or negative each of the 601 00:34:41,230 --> 00:34:43,530 classifiers that's added is by looking at 602 00:34:43,530 --> 00:34:45,270 this particular tab. 603 00:34:45,270 --> 00:34:48,045 And this just shows how they evolve over time. 604 00:34:48,045 --> 00:34:52,239 But the progress thing here is the most interesting. 605 00:34:52,239 --> 00:34:57,420 And now you say to me, well, how did the machine do that? 606 00:34:57,420 --> 00:35:00,330 And it's all right here. 607 00:35:00,330 --> 00:35:05,400 We use an alpha that looks like this. 608 00:35:05,400 --> 00:35:08,400 And that allows us to compute the new weights. 609 00:35:08,400 --> 00:35:10,150 It says we've got a preliminary calculation. 610 00:35:10,150 --> 00:35:13,630 We've got to find a z that does the normalization. 611 00:35:13,630 --> 00:35:17,640 And we sure better bring our calculator, because we've got, 612 00:35:17,640 --> 00:35:19,350 first of all, to calculate the error rate. 613 00:35:19,350 --> 00:35:22,365 Then we've got to take its logarithm, divide by 2, plug 614 00:35:22,365 --> 00:35:27,290 it into that formula, take the exponent, and that gives us 615 00:35:27,290 --> 00:35:28,210 the new weight. 616 00:35:28,210 --> 00:35:29,460 And that's how the program works. 617 00:35:29,460 --> 00:35:30,880 And if you try that, I guarantee you 618 00:35:30,880 --> 00:35:33,130 will flunk the exam. 619 00:35:33,130 --> 00:35:34,940 Now, I don't care about my computer.
620 00:35:34,940 --> 00:35:35,920 I really don't. 621 00:35:35,920 --> 00:35:39,050 It's a slave, and it can calculate these logarithms and 622 00:35:39,050 --> 00:35:41,840 exponentials till it turns blue, and I don't care. 623 00:35:41,840 --> 00:35:44,740 Because I've got four cores or something, and who cares. 624 00:35:44,740 --> 00:35:46,220 Might as well do this, than sit around 625 00:35:46,220 --> 00:35:48,391 just burning up heat. 626 00:35:48,391 --> 00:35:49,640 But you don't want to do that. 627 00:35:49,640 --> 00:35:53,010 So what you want to do is you want to know how to do this 628 00:35:53,010 --> 00:35:57,240 sort of thing more expeditiously. 629 00:35:57,240 --> 00:36:00,720 So we're going to have to let the math sing to us a 630 00:36:00,720 --> 00:36:05,470 little bit, with a view towards finding better ways of 631 00:36:05,470 --> 00:36:08,290 doing this sort of thing. 632 00:36:08,290 --> 00:36:11,700 So let's do that. 633 00:36:11,700 --> 00:36:14,080 And we're going to run out of space here before long, so let 634 00:36:14,080 --> 00:36:18,450 me reclaim as much of this board as I can. 635 00:36:18,450 --> 00:36:20,940 So what I'm going to do is I'm going to say, well, now that 636 00:36:20,940 --> 00:36:25,720 we've got this formula for alpha that relates alpha t to 637 00:36:25,720 --> 00:36:31,530 the error, then I can plug that into this formula up 638 00:36:31,530 --> 00:36:32,345 here, number 6. 639 00:36:32,345 --> 00:36:40,390 And what I'll get is that the weight of t plus 1 is equal to 640 00:36:40,390 --> 00:36:46,710 the weight at t divided by that normalizing factor, 641 00:36:46,710 --> 00:36:53,350 multiplied times something that depends on whether it's 642 00:36:53,350 --> 00:36:55,600 categorized correctly or not. 643 00:36:55,600 --> 00:36:59,660 That's what that y's in there for, right?
644 00:36:59,660 --> 00:37:05,630 So we've got a logarithm here, and we've got a sign flipper up 645 00:37:05,630 --> 00:37:10,690 there in terms of that H of x and y combination. 646 00:37:10,690 --> 00:37:18,220 So if the sign of that whole thing at minus alpha and that 647 00:37:18,220 --> 00:37:23,900 y H combination turns out to be negative, then we're going 648 00:37:23,900 --> 00:37:27,740 to have to flip the numerator and denominator here in this 649 00:37:27,740 --> 00:37:29,620 logarithm, right? 650 00:37:29,620 --> 00:37:32,250 And oh, by the way, since we've got a half out here, 651 00:37:32,250 --> 00:37:34,170 that turns out to be the square root of that term 652 00:37:34,170 --> 00:37:37,190 inside the logarithm. 653 00:37:37,190 --> 00:37:43,290 So when we carefully do that, what we discover is that it 654 00:37:43,290 --> 00:37:46,430 depends on whether it's the right thing or not. 655 00:37:46,430 --> 00:37:50,860 But what it turns out to be is something like a multiplier of 656 00:37:50,860 --> 00:37:53,750 the square root. 657 00:37:53,750 --> 00:37:55,960 Better be careful, here. 658 00:37:55,960 --> 00:37:59,300 The square root of what? 659 00:37:59,300 --> 00:38:02,030 STUDENT: [INAUDIBLE]. 660 00:38:02,030 --> 00:38:02,860 PATRICK WINSTON: Well, let's see. 661 00:38:02,860 --> 00:38:04,180 But we have to be careful. 662 00:38:04,180 --> 00:38:08,180 So let's suppose that this is for things that we get correct. 663 00:38:13,740 --> 00:38:17,910 So if we get it correct, then we're going to get the same 664 00:38:17,910 --> 00:38:20,200 sign out of H of x and y. 665 00:38:20,200 --> 00:38:22,350 We've got a minus sign out there, so we're going to flip 666 00:38:22,350 --> 00:38:25,500 the numerator and denominator. 667 00:38:25,500 --> 00:38:30,460 So we're going to get the square root of epsilon of t over 1 668 00:38:30,460 --> 00:38:34,110 minus epsilon of t if that's correct.
669 00:38:34,110 --> 00:38:36,510 If it's wrong, it'll just be the flip of that. 670 00:38:39,350 --> 00:38:44,690 So it'll be the square root of 1 minus the error rate over 671 00:38:44,690 --> 00:38:45,940 the error rate. 672 00:38:48,570 --> 00:38:49,740 Everybody with me on that? 673 00:38:49,740 --> 00:38:51,620 I think that's right. 674 00:38:51,620 --> 00:38:55,930 If it's wrong, I'll have to hang myself and wear a paper 675 00:38:55,930 --> 00:38:57,760 bag over my head like I did last year. 676 00:38:57,760 --> 00:39:00,796 But let's see if we can make this go correctly this time. 677 00:39:05,730 --> 00:39:12,430 So now, we've got this guy here, we've got everything 678 00:39:12,430 --> 00:39:18,110 plugged in all right, and we know that now this z ought to 679 00:39:18,110 --> 00:39:22,630 be selected so that it's equal to the sum of this guy 680 00:39:22,630 --> 00:39:25,070 multiplied by these things as appropriate for whether it's 681 00:39:25,070 --> 00:39:28,220 correct or not. 682 00:39:28,220 --> 00:39:31,710 Because we want, in the end, for all of these w's 683 00:39:31,710 --> 00:39:34,320 to add up to 1. 684 00:39:34,320 --> 00:39:39,830 So let's see what they add up to without the z there. 685 00:39:39,830 --> 00:39:44,840 So what we know is that it must be the case that if we 686 00:39:44,840 --> 00:39:53,670 add over the correct ones, we get the square root of the 687 00:39:53,670 --> 00:39:59,930 error rate over 1 minus the error rate, times the sum of the Wi at time t. 688 00:40:04,100 --> 00:40:09,520 Plus now we've got the square root of 1 minus the error rate over 689 00:40:09,520 --> 00:40:16,010 the error rate, times the sum of the Wi at time t for wrong. 690 00:40:24,340 --> 00:40:27,320 So that's what we get if we added all these 691 00:40:27,320 --> 00:40:30,420 up without the z. 692 00:40:30,420 --> 00:40:33,400 So since everything has to add up to 1, then z ought to be 693 00:40:33,400 --> 00:40:34,650 equal to this sum.
694 00:40:43,880 --> 00:40:47,960 That looks pretty horrible, until we realize that if we 695 00:40:47,960 --> 00:40:51,930 add these guys up over the weights that are wrong, that 696 00:40:51,930 --> 00:40:53,180 is the error rate. 697 00:40:55,880 --> 00:40:57,130 This is e. 698 00:40:59,850 --> 00:41:08,540 So therefore, z is equal to the square root of the error rate 699 00:41:08,540 --> 00:41:10,710 times 1 minus the error rate. 700 00:41:10,710 --> 00:41:14,040 That's the contribution of this term. 701 00:41:14,040 --> 00:41:15,310 Now, let's see. 702 00:41:15,310 --> 00:41:17,700 What is the sum of the weights over the 703 00:41:17,700 --> 00:41:20,320 ones that are correct? 704 00:41:20,320 --> 00:41:25,020 Well, that must be 1 minus the error rate. 705 00:41:25,020 --> 00:41:30,290 Ah, so this thing gives you the same result as this one. 706 00:41:30,290 --> 00:41:34,170 So z is equal to 2 times that. 707 00:41:34,170 --> 00:41:35,420 And that's a good thing. 708 00:41:38,580 --> 00:41:40,540 Now we are getting somewhere. 709 00:41:40,540 --> 00:41:44,380 Because now, it becomes a little bit easier to write 710 00:41:44,380 --> 00:41:46,490 some things down. 711 00:41:46,490 --> 00:41:49,330 Well, we're way past this, so let's get rid of this. 712 00:41:54,090 --> 00:41:57,940 And now we can put some things together. 713 00:41:57,940 --> 00:42:00,910 Let me point out what I'm putting together. 714 00:42:00,910 --> 00:42:06,560 I've got an expression for z right here. 715 00:42:06,560 --> 00:42:11,320 And I've got an expression for the new w's here. 716 00:42:11,320 --> 00:42:19,020 So let's put those together and say that w of t plus 1 is 717 00:42:19,020 --> 00:42:23,150 equal to w of t. 718 00:42:23,150 --> 00:42:26,090 I guess we're going to divide that by 2. 719 00:42:26,090 --> 00:42:33,470 And then we've got this square root times that expression.
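The z derivation just spoken, written out in one chain (standard notation; the two sums over correct and wrong samples are 1 minus epsilon and epsilon respectively):

```latex
z \;=\; \sqrt{\tfrac{\epsilon}{1-\epsilon}}\sum_{\mathrm{correct}} w_i^{\,t}
   \;+\; \sqrt{\tfrac{1-\epsilon}{\epsilon}}\sum_{\mathrm{wrong}} w_i^{\,t}
   \;=\; \sqrt{\tfrac{\epsilon}{1-\epsilon}}\,(1-\epsilon)
   \;+\; \sqrt{\tfrac{1-\epsilon}{\epsilon}}\,\epsilon
   \;=\; 2\sqrt{\epsilon\,(1-\epsilon)}
```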
720 00:42:33,470 --> 00:42:40,470 So if we take that correct one, and divide by that one, 721 00:42:40,470 --> 00:42:44,970 then the [INAUDIBLE] 722 00:42:44,970 --> 00:42:50,360 cancel out, and I get 1 over 1 minus the error rate. 723 00:42:53,560 --> 00:42:53,850 That's it. 724 00:42:53,850 --> 00:42:55,100 That's correct. 725 00:42:59,880 --> 00:43:04,620 And if it's not correct, then it's Wt over 2-- 726 00:43:04,620 --> 00:43:05,670 and working through the math-- 727 00:43:05,670 --> 00:43:08,630 1 over epsilon, if wrong. 728 00:43:11,950 --> 00:43:15,130 Do we feel like we're making any progress? 729 00:43:15,130 --> 00:43:16,030 No. 730 00:43:16,030 --> 00:43:19,090 Because we haven't let it sing to us enough yet. 731 00:43:19,090 --> 00:43:25,130 So I want to draw your attention to what happens to 732 00:43:25,130 --> 00:43:28,500 amateur rock climbers when they're halfway 733 00:43:28,500 --> 00:43:31,360 up a difficult cliff. 734 00:43:31,360 --> 00:43:33,570 They're usually [INAUDIBLE], sometimes they're not. 735 00:43:33,570 --> 00:43:36,800 If they're not, they're scared to death. 736 00:43:36,800 --> 00:43:40,850 And every once in a while, as they're just about to fall, 737 00:43:40,850 --> 00:43:44,410 they find some little tiny hole to stick a fingernail in, 738 00:43:44,410 --> 00:43:46,510 and that keeps them from falling. 739 00:43:46,510 --> 00:43:50,440 That's called a thank-god hole. 740 00:43:50,440 --> 00:43:53,680 So what I'm about to introduce is the analog of those little 741 00:43:53,680 --> 00:43:55,530 places where you can stick your fingernail in. 742 00:43:55,530 --> 00:43:57,380 It's the thank-god hole for dealing 743 00:43:57,380 --> 00:43:58,630 with boosting problems. 744 00:44:04,680 --> 00:44:07,370 So what happens if I add all these [? Wi ?] 745 00:44:07,370 --> 00:44:12,470 up for the ones that the classifier where produces a 746 00:44:12,470 --> 00:44:16,050 correct answer on? 
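Putting the z expression and the square-root multipliers together gives the simplified update being worked out here (a reconstruction in standard notation):

```latex
w_i^{\,t+1} \;=\;
\begin{cases}
\dfrac{w_i^{\,t}}{2\,(1-\epsilon^{t})} & \text{if classified correctly,}\\[2ex]
\dfrac{w_i^{\,t}}{2\,\epsilon^{t}} & \text{if classified wrongly.}
\end{cases}
```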
747 00:44:16,050 --> 00:44:22,110 Well, it'll be 1 over 2, and 1 over 1 minus epsilon, times 748 00:44:22,110 --> 00:44:29,490 the sum of the Wt for which the answer was correct. 749 00:44:29,490 --> 00:44:31,781 What's this sum? 750 00:44:31,781 --> 00:44:32,450 Oh! 751 00:44:32,450 --> 00:44:34,480 My goodness. 752 00:44:34,480 --> 00:44:38,920 1 minus epsilon. 753 00:44:38,920 --> 00:44:50,920 So what I've just discovered is that if I sum new w's over 754 00:44:50,920 --> 00:44:53,880 those samples for which I got a correct answer, 755 00:44:53,880 --> 00:44:56,490 it's equal to 1/2. 756 00:44:56,490 --> 00:44:57,130 And guess what? 757 00:44:57,130 --> 00:45:03,240 That means that if I sum them over wrong, it's equal to 1/2 758 00:45:03,240 --> 00:45:04,490 as well. 759 00:45:07,710 --> 00:45:11,300 So that means that I take all of the weight for which I got 760 00:45:11,300 --> 00:45:18,000 the right answer with the previous test, and those weights 761 00:45:18,000 --> 00:45:19,990 will add up to something. 762 00:45:19,990 --> 00:45:22,263 And to get the weights for the next generation, all I have to 763 00:45:22,263 --> 00:45:24,780 do is scale them so that they equal half. 764 00:45:24,780 --> 00:45:26,710 This was not noticed by the people who 765 00:45:26,710 --> 00:45:27,000 developed this stuff. 766 00:45:27,000 --> 00:45:31,210 This was noticed by Luis Ortiz, who was a 6.034 767 00:45:31,210 --> 00:45:34,160 instructor a few years ago. 768 00:45:34,160 --> 00:45:38,660 The sum of those weights is going to be a scaled version 769 00:45:38,660 --> 00:45:41,400 of what they were before.
770 00:45:41,400 --> 00:45:43,340 So you take all the weights for which this new 771 00:45:43,340 --> 00:45:44,590 classifier-- 772 00:45:44,590 --> 00:45:46,890 this one you selected to give you the minimum error on the 773 00:45:46,890 --> 00:45:48,050 re-weighted stuff-- 774 00:45:48,050 --> 00:45:50,520 you take the ones that it gives a correct answer for, 775 00:45:50,520 --> 00:45:52,775 and you take all of those weights, and you just scale 776 00:45:52,775 --> 00:45:55,770 them so they add up to 1/2. 777 00:45:55,770 --> 00:45:58,730 So do you have to compute any logarithms? 778 00:45:58,730 --> 00:45:59,670 No. 779 00:45:59,670 --> 00:46:01,320 Do you have to compute any exponentials? 780 00:46:01,320 --> 00:46:02,230 No. 781 00:46:02,230 --> 00:46:03,790 Do you have to calculate z? 782 00:46:03,790 --> 00:46:05,170 No. 783 00:46:05,170 --> 00:46:07,120 Do you have to calculate alpha to get the new weights? 784 00:46:07,120 --> 00:46:07,755 No. 785 00:46:07,755 --> 00:46:09,690 All you have to do is scale them. 786 00:46:09,690 --> 00:46:12,730 And that's a pretty good thank-god hole. 787 00:46:12,730 --> 00:46:14,020 So that's thank-god hole number one. 788 00:46:21,890 --> 00:46:26,340 Now, for thank-god hole number two, we need to go back and 789 00:46:26,340 --> 00:46:28,720 think about the fact that we're going to give you problems that in all 790 00:46:28,720 --> 00:46:32,940 probability involve decision tree stumps. 791 00:46:32,940 --> 00:46:35,790 And there are a lot of decision tree stumps that you 792 00:46:35,790 --> 00:46:38,050 might have to pick from. 793 00:46:38,050 --> 00:46:39,940 So we need a thank-god hole for deciding how 794 00:46:39,940 --> 00:46:42,320 to deal with that. 795 00:46:42,320 --> 00:46:43,330 Where can I find some room? 796 00:46:43,330 --> 00:46:44,580 How about right here. 797 00:46:53,870 --> 00:46:56,040 Suppose you've got a space that looks like this.
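The shortcut can be checked numerically against the exponential formula. This is a sketch with made-up function names and data layout; `correct` is a list of booleans saying whether the chosen classifier got each sample right, and `eps` must equal the sum of the weights on the wrong samples.

```python
import math

def reweight_exponential(w, correct, eps):
    """The calculator route: w_i <- w_i * exp(-alpha * h(x_i) * y(x_i)) / z,
    with alpha = (1/2) ln((1 - eps) / eps) and z chosen so the
    new weights sum to 1."""
    alpha = 0.5 * math.log((1 - eps) / eps)
    raw = [wi * math.exp(-alpha if ok else alpha)
           for wi, ok in zip(w, correct)]
    z = sum(raw)
    return [wi / z for wi in raw]

def reweight_rescale(w, correct):
    """The thank-god-hole route: scale the correctly classified
    weights so they sum to 1/2, and the misclassified weights so
    they also sum to 1/2.  No logarithms, exponentials, alphas,
    or z required."""
    right = sum(wi for wi, ok in zip(w, correct) if ok)
    wrong = sum(wi for wi, ok in zip(w, correct) if not ok)
    return [wi * (0.5 / right if ok else 0.5 / wrong)
            for wi, ok in zip(w, correct)]
```

For example, with four equal weights of 1/4 and one sample wrong (so eps is 1/4), both routes give the three correct samples weight 1/6 each and the wrong one weight 1/2.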
799 00:47:06,020 --> 00:47:07,020 So how many-- 800 00:47:07,020 --> 00:47:07,180 let's see. 801 00:47:07,180 --> 00:47:11,300 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. 802 00:47:11,300 --> 00:47:14,315 How many tests do I have to consider in that dimension? 803 00:47:17,598 --> 00:47:19,077 11. 804 00:47:19,077 --> 00:47:22,060 It's 1 plus the number of samples. 805 00:47:22,060 --> 00:47:23,310 That would be horrible. 806 00:47:26,590 --> 00:47:27,080 I don't know. 807 00:47:27,080 --> 00:47:28,990 Do I have actually calculate this one? 808 00:47:33,040 --> 00:47:36,430 How could that possibly be better than that one? 809 00:47:36,430 --> 00:47:39,930 It's got one more thing wrong. 810 00:47:39,930 --> 00:47:45,570 So that one makes sense. 811 00:47:45,570 --> 00:47:48,940 The other one doesn't make sense. 812 00:47:48,940 --> 00:47:55,520 So in the end, no test that lies between two correctly 813 00:47:55,520 --> 00:47:58,530 classified samples will ever be any good. 814 00:47:58,530 --> 00:48:01,830 So that one's a good guy, and that one's a good guy. 815 00:48:01,830 --> 00:48:02,870 And this one's a bad guy. 816 00:48:02,870 --> 00:48:05,600 Bad guy, bad guy bad guy, bad guy. 817 00:48:05,600 --> 00:48:08,910 Bad guy, bad guy, bad buy. 818 00:48:08,910 --> 00:48:14,410 So the actual number of tests you've got is three. 819 00:48:14,410 --> 00:48:17,770 And likewise, in the other dimension-- 820 00:48:17,770 --> 00:48:19,960 well, I haven't drawn it so well here, but would this test 821 00:48:19,960 --> 00:48:20,660 be a good one? 822 00:48:20,660 --> 00:48:21,320 No. 823 00:48:21,320 --> 00:48:21,690 That one? 824 00:48:21,690 --> 00:48:22,940 No. 825 00:48:24,760 --> 00:48:26,465 Actually, I'd better look over here on the right and see what 826 00:48:26,465 --> 00:48:28,400 I've got before I draw too many conclusions. 
827 00:48:28,400 --> 00:48:30,870 Let's look over this, since I don't want to think too hard 828 00:48:30,870 --> 00:48:32,980 about what's going on in the other dimension. 829 00:48:32,980 --> 00:48:35,270 But the idea is that very few of those 830 00:48:35,270 --> 00:48:38,240 tests actually matter. 831 00:48:38,240 --> 00:48:39,770 Now, you say to me, there's one last thing. 832 00:48:39,770 --> 00:48:41,762 What about overfitting? 833 00:48:41,762 --> 00:48:45,800 Because all this does is drape a solution over the samples. 834 00:48:45,800 --> 00:48:49,110 And like support vector machines overfit, neural nets 835 00:48:49,110 --> 00:48:52,580 overfit, identification trees overfit. 836 00:48:52,580 --> 00:48:53,820 Guess what? 837 00:48:53,820 --> 00:48:56,290 This doesn't seem to overfit. 838 00:48:56,290 --> 00:48:59,130 That's an experimental result for which the 839 00:48:59,130 --> 00:49:01,470 literature is confused 840 00:49:01,470 --> 00:49:03,920 when it comes to providing an explanation. 841 00:49:03,920 --> 00:49:06,210 So this stuff is tried on all sorts of problems, like 842 00:49:06,210 --> 00:49:10,100 handwriting recognition, understanding speech, all 843 00:49:10,100 --> 00:49:12,180 sorts of stuff uses boosting. 844 00:49:12,180 --> 00:49:16,010 And unlike other methods, for some reason as yet imperfectly 845 00:49:16,010 --> 00:49:20,260 understood, it doesn't seem to overfit. 846 00:49:20,260 --> 00:49:25,550 But in the end, we leave no stone unturned in 6.034. 847 00:49:25,550 --> 00:49:28,670 Every time we do this, we do some additional experiments. 848 00:49:28,670 --> 00:49:32,410 So here's a sample that I'll leave you with. 849 00:49:32,410 --> 00:49:36,130 Here's a situation in which we have a 10-dimensional space. 850 00:49:36,130 --> 00:49:38,270 We've made a fake distribution, and then we put 851 00:49:38,270 --> 00:49:40,270 in that boxed outlier.
852 00:49:40,270 --> 00:49:42,630 That was just put into the space at random, so it can be 853 00:49:42,630 --> 00:49:45,230 viewed as an error point. 854 00:49:45,230 --> 00:49:47,240 So now what we're going to do is we're going to see what 855 00:49:47,240 --> 00:49:49,560 happens when we run that guy. 856 00:49:49,560 --> 00:49:55,140 And sure enough, in 17 steps, it finds a solution. 857 00:49:55,140 --> 00:49:59,620 But maybe it's overfit that little guy who's an error. 858 00:49:59,620 --> 00:50:03,000 But one thing you can do is you can say, well, all of 859 00:50:03,000 --> 00:50:06,890 these classifiers are dividing this space up into chunks, and 860 00:50:06,890 --> 00:50:11,750 we can compute the size of the space occupied by any sample. 861 00:50:11,750 --> 00:50:13,650 So one thing we can do-- 862 00:50:13,650 --> 00:50:16,370 alas, I'll have to get up a new demonstration. 863 00:50:16,370 --> 00:50:19,750 One thing we can do, now that this guy's over here, we can 864 00:50:19,750 --> 00:50:23,310 switch to the volume tab and watch how the volume occupied 865 00:50:23,310 --> 00:50:29,640 by that error point evolves as we solve the problem. 866 00:50:29,640 --> 00:50:31,820 So look what happens. 867 00:50:31,820 --> 00:50:33,380 This is, of course, randomly generated. 868 00:50:33,380 --> 00:50:35,390 I'm counting on this working. 869 00:50:35,390 --> 00:50:36,640 Never failed before. 870 00:50:39,930 --> 00:50:44,510 So it originally starts out as occupying 26% 871 00:50:44,510 --> 00:50:47,020 of the total volume. 872 00:50:47,020 --> 00:50:52,360 It ends up occupying 1.4 times 10 to the 873 00:50:52,360 --> 00:50:55,910 minus 3rd% of the volume.
874 00:50:55,910 --> 00:51:00,060 So what tends to happen is that these decision tree 875 00:51:00,060 --> 00:51:03,190 stumps tend to wrap themselves so tightly around the error 876 00:51:03,190 --> 00:51:05,350 points, there's no room for overfitting, because nothing 877 00:51:05,350 --> 00:51:07,550 else will fit in that same volume. 878 00:51:07,550 --> 00:51:10,390 So that's why I think that this thing tends to produce 879 00:51:10,390 --> 00:51:12,430 solutions which don't overfit. 880 00:51:12,430 --> 00:51:14,970 So in conclusion, this is magic. 881 00:51:14,970 --> 00:51:16,010 You always want to use it. 882 00:51:16,010 --> 00:51:17,510 It'll work with any kind of [? speed ?] of 883 00:51:17,510 --> 00:51:19,090 classifiers you want. 884 00:51:19,090 --> 00:51:21,590 And you should understand it very thoroughly, because if 885 00:51:21,590 --> 00:51:25,740 anything is useful in this subject in the dimension of learning, 886 00:51:25,740 --> 00:51:26,990 this is it.