PATRICK WINSTON: We've now almost completed our journey. This will be it for talking about several kinds of learning -- the venerable kind, that's the nearest neighbors and identification tree types of learning. Still useful, still the right thing to do if there's no reason not to do the simple thing.

Then we have the biologically-inspired approaches. Neural nets -- all kinds of problems with local maxima and overfitting and oscillation, if you get the rate constant too big. Genetic algorithms. Like neural nets, both are very naive in their attempt to mimic nature. So maybe they work on a class of problems. They surely do each have a class of problems for which they're good. But as a general-purpose first resort, I don't recommend them.

But now the theorists have come out and done some things that are very remarkable. And in the end, you have to say, wow, these are such powerful ideas. I wonder if nature has discovered them, too? Is there good engineering in the brain, based on good science?
Or given the nature of evolution, is it just random junk that happens to be the best way of doing anything? Who knows?

But today, we're going to talk about an idea that I'll bet is in there somewhere, because it's easy to implement, and it's extremely powerful in what it does, and it's the essential item in anybody's repertoire of learning mechanisms. It's also a mechanism which, if you understand it only by formula, you will never be able to work the problems on the quiz, that's for sure. Because on the surface, it looks like it would be very complicated to simulate this approach. But once you understand how it works, and look at a little bit of the math and let it sing songs to you, it turns out to be extremely easy.

So it's about letting multiple methods work on your behalf. So far, we've been talking about using just one method to do something. And what we're going to do now is look to see if a crowd can be smarter than the individuals in the crowd.

But before we get too far down that abstract path, let me just say that the whole works has to do with classification -- binary classification. Am I holding a piece of chalk in my hand, or a hand grenade?
Is that a cup of coffee or tea? Those are binary classification problems. And so we're going to be talking today strictly about binary classification. We're not going to be talking about finding the right letter in the alphabet that's written on the page -- that's a 26-way choice. We're talking about binary choices.

So we assume that there's a set of classifiers that we can draw on. Here's one -- h. And it produces either a minus 1 or a plus 1. So that's how the classification is done. If it's coffee, plus 1. If it's tea, minus 1. If it's chalk, plus 1. If it's a hand grenade, minus 1. So that's how the classification works.

Now, too bad for us, normally the world doesn't give us very good classifiers. So if we look at the error rate of this classifier, or any other classifier, that error rate will range from 0 to 1, in terms of the fraction of the cases it got wrong on a sample set. So you'd like your error rate to be way down here. You're dead if it's over there. But what about in the middle? What if it's, say, right there?
Just a little bit better than flipping a coin. If it's just a little bit better than flipping a coin, that's a weak classifier. And the question is, can you make a classifier that's way over here, like there -- a strong classifier -- by combining several of these weak classifiers, and letting them vote?

So how would you do that? You might say, well, let us make a big classifier, capital H, that works on some sample x, and its output produces something that depends on the sum of the outputs of the individual classifiers. So we have h1 working on x. We have h2 working on x. And we have h3 also working on x. Let's say three of them, just to start us off. And now let's add those guys up, and take the sign of the sum.

So if two out of the three of those guys agree, then we'll get either a plus 1 or a minus 1. If all three agree, we'll get plus 1 or minus 1. Because we're just taking the sign. We're just taking the sign of the sum of these guys. So this means that one guy can be wrong, as long as the other two guys are right.
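That sign-of-a-sum vote can be written out in a few lines. Here is a minimal sketch in Python; the three weak classifiers are made-up thresholds on a one-dimensional sample, just to show the mechanics, not anything from the board.

```python
# Majority vote: H(x) = sign(h1(x) + h2(x) + h3(x)), where each weak
# classifier returns +1 or -1. The three classifiers below are
# hypothetical stand-ins for illustration.

def make_ensemble(classifiers):
    """Return H(x): the sign of the summed votes."""
    def H(x):
        total = sum(h(x) for h in classifiers)
        return 1 if total >= 0 else -1
    return H

# Three toy weak classifiers on a 1-D sample.
h1 = lambda x: 1 if x > 0 else -1
h2 = lambda x: 1 if x > 1 else -1
h3 = lambda x: 1 if x > -1 else -1

H = make_ensemble([h1, h2, h3])
```

With these toy tests, a sample at 0.5 gets votes +1, -1, +1, so the majority carries it to plus 1 even though h2 is wrong there -- exactly the one-guy-can-be-wrong situation just described.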
But I think it's easier to see how this all works if you think of some space of samples. You say, well, let's let that area here be where h1 is wrong, and this area over here is where h2 is wrong. And then this area over here is where h3 is wrong. So if the situation is like that, then this formula always gives you the right answer on the samples.

I'm going to stop saying "on the samples" right now, because I want it to be a kind of background thing that all of this is on the sample set. We're talking about wrapping this stuff over the sample set. Later on, we'll ask, OK, given that you trained this thing on a sample set, how well does it do on some new examples? Because we want to ask ourselves about overfitting questions.

But for now, we just want to look and see if we believe that this arrangement -- where each of these h's is producing plus 1 or minus 1, and we're adding them up and taking the sign -- is going to give us a better result than the tests individually. And if they look like this when draped over a sample set, then it's clear that we're going to get the right answer every time, because there's no area here where any two of those tests are giving us the wrong answer.
So for the one that's getting the wrong answer -- in this little circle here for h1 -- these other two are getting the right answer. So they'll outvote it, and you'll get the right answer every time.

But it doesn't have to be that simple. It could look like this. There could be a situation where this is h1, wrong answer. This is h2, wrong answer. And this is h3, wrong answer. And now the situation gets a little bit more murky, because we have to ask ourselves whether that area where three out of the three get it wrong is sufficiently big so as to be worse than one of the individual tests.

So if you look at that Venn diagram, and stare at it long enough, and try some things, you can say, well, there is no case where this will give a worse answer. Or, you might end up with the conclusion that there are cases where we can arrange those circles such that the voting scheme will give an answer that's worse than an individual test. But I'm not going to tell you the answer, because I think we'll make that a quiz question. Good idea? OK. So we'll make that a quiz question.
So that looks like a good idea. And we can construct a little algorithm that will help us pick the particular weak classifiers to plug in here. We've got a whole bag of classifiers -- we've got h1, we've got h2, we've got h55. We've got a lot of them we can choose from.

So what we're going to do is use the data, undisturbed, to produce h1. We're just going to try all the tests on the data and see which one gives us the smallest error rate. And that's the good guy, so we're going to use that.

Then we're going to use the data with an exaggeration of h1's errors. In other words -- this is a critical idea -- we're going to run this algorithm again, but instead of just looking at the number of samples that we got wrong, we're going to look at a distorted set of samples, where the ones we're not doing well on have an exaggerated effect on the result. So we're going to weight them, or multiply them, or do something so that we pay more attention to the samples on which h1 produces an error. And that's going to give us h2.
And then we're going to do it one more time, because we've got three things to go with here in this particular little exploratory scheme. And this time, we're going to have an exaggeration of those samples -- which samples are we going to exaggerate now? We might as well look for the ones where h1 gives us a different answer from h2, because we want to be on the good guy's side. So we can say we're going to exaggerate those samples for which h1 gives us a different result from h2. And that's going to give us h3.

All right. So we can think of this whole works here as part one of a multi-part idea.

So let's see. I don't know, what might be step two? Well, this is a good idea. Then what we've got, which we can easily derive from that, is a little tree that looks like this. And we can say that H of x depends on h1, h2, and h3. But now, if that's a good idea, and that gives a better answer than any of the individual tests, maybe we can make this idea a little bit recursive, and say, well, maybe h1 is actually not an atomic test. Maybe it's the vote of three other tests.
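Here is a rough sketch of that three-round picking scheme. One assumption: "exaggeration" is modeled by simply duplicating the emphasized samples before re-scoring each candidate test (the lecture replaces this with proper weights later). The candidate tests and the data are made up for illustration.

```python
# Three-round scheme: pick h1 on the raw data, pick h2 with h1's
# mistakes exaggerated (here: counted twice), pick h3 with the
# h1-vs-h2 disagreements exaggerated.

def error(h, samples):
    """Fraction of (x, label) pairs the test h gets wrong."""
    return sum(1 for x, y in samples if h(x) != y) / len(samples)

def pick_best(tests, samples):
    """The 'good guy': the test with the smallest error rate."""
    return min(tests, key=lambda h: error(h, samples))

def three_rounds(tests, data):
    # Round 1: the data, undisturbed, produces h1.
    h1 = pick_best(tests, data)
    # Round 2: exaggerate the samples h1 got wrong.
    round2 = data + [s for s in data if h1(s[0]) != s[1]]
    h2 = pick_best(tests, round2)
    # Round 3: exaggerate the samples where h1 and h2 disagree.
    round3 = data + [s for s in data if h1(s[0]) != h2(s[0])]
    h3 = pick_best(tests, round3)
    return h1, h2, h3

tests = [lambda x: 1 if x > 0 else -1,   # a decent threshold test
         lambda x: 1]                    # always says plus 1
data = [(-2, -1), (-1, -1), (1, 1), (2, 1)]
h1, h2, h3 = three_rounds(tests, data)
```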
So you can make a tree structure that looks like this. So this is h11, h12, h13, and then h3 here. And then this will be h31, h32, h33. And so that's a sort of get-out-the-vote idea. We're trying to get a whole bunch of individual tests into the act.

So I guess the reason this wasn't discovered until about 10 years ago was because you've got to get so many of these ducks all lined up before the idea gets through that long filter of ideas. So that's only idea number two of quite a few.

Well, the next thing we might think is, well, we keep talking about these classifiers. What kind of classifiers are we talking about? I've got -- oh, shoot, I've spent my last nickel. I don't have a coin to flip. But that's one classifier, right? The trouble with that classifier is it's a weak classifier, because it gives me a 50/50 chance of being right. I guess there are conditions in which a coin flip is better than a -- well, it is a weak classifier. If the two outcomes are not equally probable, then a coin flip is a perfectly good weak classifier.
But what we're going to do is think in terms of a different set of classifiers. And we're going to call them decision tree stumps. Now, you remember decision trees, right? But we're not going to build decision trees. We're going to use decision tree stumps.

So if we have a two-dimensional space that looks like this, then a decision tree stump is a single test. It's not a complete tree that will divide up the samples into homogeneous groups. It's just what you can do with one test. So each possible test is a classifier. How many tests do we get out of that? 12, right? Yeah. It doesn't look like 12 to me, either. But here's how you get to 12.

One decision tree test you can stick in there would be that test right there. And that would be a complete decision tree stump. But, of course, you can also put in this one. That would be another decision tree stump. Now, for this one on the right, I could say everything on the right is a minus. Or, I could say everything on the right is a plus.
It would happen to be wrong, but it's a valid test with a valid outcome. So that's how we double the number of tests that we have lines for. And you know what? You can even have a kind of test out here that says everything is plus, or everything is minus.

So for each dimension, the number of decision tree stumps is the number of lines I can put in, times 2. And then I've got two dimensions here; that's how I got to 12. So there are three lines. I can have the pluses on either the left or the right side, so that's six. And then I've got two dimensions, so that gives me 12. So that's the decision tree stump idea. And here are the other decision tree boundaries, obviously, just like that. So that's one way you can generate a batch of tests to try out with this idea of using a lot of tests to help you get the job done.

STUDENT: Couldn't you also have a decision tree on the right side?

PATRICK WINSTON: The question is, can you also have a test on the right side?
See, this is just a stand-in for saying everything's plus or everything's minus. So it doesn't matter where you put the line. It can be on the right side, or the left side, or the bottom, or the top. Or you don't have to put the line anywhere. It's just an extra test, in addition to the ones you put between the samples.

So this whole idea of boosting, the main idea of the day -- does it depend on using decision tree stumps? The answer is no. Do not be confused. You can use boosting with any kind of classifier. So why do I use decision tree stumps today? Because it makes my life easy. We can look at it, we can see what it's doing. But we could put a bunch of neural nets in there. We could put a bunch of real decision trees in there. We could put a bunch of nearest neighbor things in there. The boosting idea doesn't care. I just use decision tree stumps because I and everybody else use them for illustration.

All right. We're making progress. Now, what's the error rate for any of these tests and lines we drew?
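That 12-stump count can be checked mechanically. Here is a sketch with made-up sample coordinates, arranged so each dimension allows three separating lines; the extra line placed outside the samples stands in for the everything-is-plus / everything-is-minus tests.

```python
# Enumerate decision tree stumps for a 2-D sample set: per dimension,
# one candidate threshold between each pair of adjacent distinct
# coordinate values, plus one outside all the samples (the constant
# everything-plus / everything-minus test), times two polarities.

def enumerate_stumps(points):
    """points: list of (x, y) coordinates. Returns (dim, threshold,
    polarity) triples; a stump answers `polarity` when the coordinate
    exceeds the threshold, and -polarity otherwise."""
    stumps = []
    for dim in range(2):
        values = sorted({p[dim] for p in points})
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
        thresholds.append(values[-1] + 1.0)  # a line outside everything
        for t in thresholds:
            for polarity in (+1, -1):
                stumps.append((dim, t, polarity))
    return stumps

def apply_stump(stump, point):
    dim, t, polarity = stump
    return polarity if point[dim] > t else -polarity

# Three samples with distinct coordinates in both dimensions:
# 3 lines x 2 polarities x 2 dimensions = 12 stumps.
points = [(1, 1), (2, 3), (3, 2)]
```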
Well, I guess the error rate is equal to the sum of 1 over n -- that's the total number of points, the number of samples -- summed over the cases where we are wrong.

So gee, we're going to work on combining some of these ideas. And we've got this notion of exaggeration. At some stage in what we're doing here, we're going to want to be able to exaggerate the effect of some errors relative to other errors. So one thing we can do is assume, or stipulate, or assert that each of these samples has a weight associated with it. That's w1, this is w2, and that's w3. And in the beginning, there's no reason to suppose that any one of these is more or less important than any of the others. So in the beginning, w sub i, at time 1, is equal to 1 over n.

So the error is just adding up the number of samples that we got wrong. And that'll be the fraction of samples that you didn't get right. And that will be the error rate.
So what we want to do is say, instead of using this as the error rate for all time, we want to move that over, and say that the error rate is equal to the sum, over the things you got wrong in the current step, of the weights of the ones that were got wrong. So in step one, everything's got the same weight; it doesn't matter. But if we find a way to change their weights going downstream -- so as to, for example, highly exaggerate that third sample -- then w3 will go up relative to w1 and w2.

The one thing we want to be sure of is that no matter how we adjust the weights, the sum of the weights over the whole space is equal to 1. So in other words, we want to choose the weights so that they emphasize some of the samples, but we also want to put a constraint on the weights such that all of them added together sum to 1. And we'll say that that enforces a distribution. A distribution is a set of weights that sum to 1.

Well, that's just a nice idea. So we're making a little progress. We've got this idea that we can add some plus/minus 1 classifiers together and get a better classifier.
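The weighted error rate just described can be sketched in a few lines. The samples and the test are made up; note that with uniform weights 1 over n it reduces to the plain fraction-wrong rate from before.

```python
# Weighted error: sum the weights of the samples the test got wrong.
# The weights form a distribution (they sum to 1), so the result
# always stays in [0, 1].

def weighted_error(h, samples, weights):
    """samples: list of (x, label) with labels +1/-1; weights sum to 1."""
    return sum(w for (x, y), w in zip(samples, weights) if h(x) != y)

samples = [(-2, -1), (-1, -1), (1, 1), (2, 1)]
uniform = [1 / len(samples)] * len(samples)   # w_i at time 1 is 1/n

h = lambda x: 1 if x > -1.5 else -1   # wrong only on the sample at x = -1
```

With uniform weights this test's error is 1/4. If the weights are skewed to exaggerate the misclassified sample -- say its weight becomes 0.7 -- the same test's error jumps to 0.7, even though it still misses only one point.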
We've got some idea about how to do that. It occurs to us that maybe we want to get a lot of classifiers into the act somehow or another. And maybe we want to think about using decision tree stumps so as to ground our thinking about all this stuff.

So the next step is to say, well, how actually should we combine this stuff? And you will find, in the literature, libraries full of papers that do stuff like that. And that was state of the art for quite a few years. But then people began to say, well, maybe we can build up this classifier, H of x, in multiple steps, and get a lot of classifiers into the act.

So maybe we can say that the classifier is the sign of h1 -- that's the one we picked first. That's the classifier we picked first. That's looking at samples. And then we've got h2. And then we've got h3. And then we've got however many other classifiers we might want, or might need, in order to correctly classify everything in our sample set. So people began to think about whether there might be an algorithm that would develop a classifier that way, one step at a time.
That's why I put that step number in the exponent -- because we're picking this one first, then we're expanding it to have two, and then we're expanding it to have three, and so on. And each of those individual classifiers is separately looking at the sample.

But of course, it would be natural to suppose that just adding things up wouldn't be enough. And it's not. So it isn't too hard to invent the next idea, which is to modify this thing just a little bit by doing what? It looks almost like a scoring polynomial, doesn't it? So what would we do to tart this up a little bit?

STUDENT: [INAUDIBLE].

PATRICK WINSTON: Come again? Do what?

STUDENT: [INAUDIBLE].

PATRICK WINSTON: Somewhere out there someone's murmuring.

STUDENT: Add--

PATRICK WINSTON: Add weights!

STUDENT: --weights. Yeah.

PATRICK WINSTON: Excellent. Good idea. So what we're going to do is have alphas associated with each of these classifiers, and we're going to determine if somebody can build that kind of formula to do the job.
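The weighted vote this suggests -- H of x equals the sign of alpha 1 times h1 of x, plus alpha 2 times h2 of x, and so on -- looks like this as a sketch. The alpha values below are arbitrary placeholders, since how to compute them is exactly what comes next.

```python
# Weighted vote: H(x) = sign(alpha_1*h_1(x) + ... + alpha_T*h_T(x)).
# A bigger alpha means that classifier's opinion counts for more.
# The alphas here are placeholders, not computed from any data.

def weighted_vote(alphas, classifiers):
    def H(x):
        total = sum(a * h(x) for a, h in zip(alphas, classifiers))
        return 1 if total >= 0 else -1
    return H

h1 = lambda x: 1 if x > 0 else -1
h2 = lambda x: -1                 # always votes minus 1
H = weighted_vote([2.0, 0.5], [h1, h2])
```

At x = 1 the sum is 2.0 minus 0.5, so H says plus 1: the heavily weighted h1 outvotes h2.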
So maybe I ought to modify this gold star idea before I get too far downstream. We're not going to treat everybody in a crowd equally. We're going to weight some of the opinions more than others. And by the way, they're all going to make errors in different parts of the space. So maybe it's not the wisdom of even a weighted crowd, but a crowd of experts, each of which is good at different parts of the space.

So anyhow, we've got this formula, and there are a few things that one can say turn out. But first, let's write down an algorithm for what this ought to look like. Before I run out of space, I think I'll exploit the right-hand board here, and put the overall algorithm right here.

So we're going to start out by letting all of the weights at time 1 be equal to 1 over n. That's just saying that they're all equal in the beginning, and they're equal to 1 over n. And n is the number of samples. And then, when I've got that, I want to compute alpha, somehow. Let's see. No, I don't want to do that. I want to pick a classifier that minimizes the error rate.
443 00:25:37,730 --> 00:25:43,050 A classifier that minimizes the error at time t. 444 00:25:43,050 --> 00:25:45,230 And that's going to be at time t. 445 00:25:45,230 --> 00:25:46,340 And we're going to come back in here. 446 00:25:46,340 --> 00:25:50,160 That's why we put a step index in there. 447 00:25:50,160 --> 00:25:56,790 So once we've picked a classifier that produces an 448 00:25:56,790 --> 00:25:59,210 error rate, then we can use the error rate to 449 00:25:59,210 --> 00:26:00,350 determine the alpha. 450 00:26:00,350 --> 00:26:02,260 So I want the alpha over here. 451 00:26:07,910 --> 00:26:11,900 That'll be sort of a byproduct of picking that test. 452 00:26:11,900 --> 00:26:14,890 And with all that stuff in hand, maybe that will be 453 00:26:14,890 --> 00:26:20,480 enough to calculate Wt plus 1. 454 00:26:28,600 --> 00:26:33,162 So we're going to use that classifier that we just picked 455 00:26:33,162 --> 00:26:36,040 to get some revised weights, and then we're going to go 456 00:26:36,040 --> 00:26:41,870 around that loop until this classifier produces a perfect 457 00:26:41,870 --> 00:26:46,290 set of conclusions on all the sample data. 458 00:26:46,290 --> 00:26:49,560 So that's going to be our overall strategy. 459 00:26:49,560 --> 00:26:51,800 Maybe we've got, if we're going to number these things, 460 00:26:51,800 --> 00:26:54,960 that's the fourth big idea. 461 00:26:54,960 --> 00:26:59,350 And this arrangement here is the fifth big idea. 462 00:26:59,350 --> 00:27:01,390 Then we've got the sixth big idea. 463 00:27:01,390 --> 00:27:04,350 And the sixth big idea says this.
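The loop being put on the board can be rendered as a short program. This is a minimal sketch, not the class demo code: the representation of classifiers as plus-1/minus-1 functions, the sample and label lists, and the function names are all my own assumptions.

```python
import math

def boost(samples, labels, classifiers, max_rounds=10):
    """The loop on the board: start with all weights equal to 1/n,
    repeatedly pick the classifier with the lowest weighted error,
    compute its alpha from that error, re-weight the samples, and
    stop when the weighted vote is perfect on every sample."""
    n = len(samples)
    w = [1.0 / n] * n                          # weights at time 1: all 1/n
    ensemble = []                              # (alpha, classifier) pairs
    for _ in range(max_rounds):
        # pick the classifier that minimizes the weighted error rate
        def error(h):
            return sum(wi for wi, x, y in zip(w, samples, labels)
                       if h(x) != y)
        h = min(classifiers, key=error)
        eps = error(h)
        if eps == 0:                           # perfect on its own: done
            ensemble.append((1.0, h))
            break
        # alpha is a byproduct of the error rate
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # exponential re-weighting, then divide by z so weights sum to 1
        w = [wi * math.exp(-alpha if h(x) == y else alpha)
             for wi, x, y in zip(w, samples, labels)]
        z = sum(w)
        w = [wi / z for wi in w]
        if all(vote(ensemble, x) == y for x, y in zip(samples, labels)):
            break                              # ensemble is perfect: stop
    return ensemble

def vote(ensemble, x):
    """Sign of the alpha-weighted sum of the individual opinions."""
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1
```

On three 1-D points labeled plus, minus, plus, with decision stumps as the classifiers, this converges in three rounds, one of which is a stump that calls everything plus, much like the three-classifier solution in the demo described below.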
464 00:27:06,940 --> 00:27:19,340 Suppose that the weight on the ith sample at time t plus 1 is 465 00:27:19,340 --> 00:27:28,600 equal to the weight at time t on that same sample, divided 466 00:27:28,600 --> 00:27:38,150 by some normalizing factor, times e to the minus alpha at 467 00:27:38,150 --> 00:27:52,750 time t, times h at time t, times some function y which is 468 00:27:52,750 --> 00:27:58,160 a function of x, but not a function of time. 469 00:27:58,160 --> 00:28:01,280 Now you say, where did this come from? 470 00:28:01,280 --> 00:28:03,670 And the answer is, it did not spring from the heart of a 471 00:28:03,670 --> 00:28:06,190 mathematician in the first 10 minutes that he 472 00:28:06,190 --> 00:28:07,800 looked at this problem. 473 00:28:07,800 --> 00:28:09,550 In fact, when I asked [INAUDIBLE] 474 00:28:09,550 --> 00:28:13,300 how this worked, he said, well, he was thinking about 475 00:28:13,300 --> 00:28:15,630 this on the couch every Saturday for about a year, and 476 00:28:15,630 --> 00:28:18,200 his wife was getting pretty sore, but he finally found it 477 00:28:18,200 --> 00:28:20,590 and saved their marriage. 478 00:28:20,590 --> 00:28:23,950 So where does stuff like this come from? 479 00:28:23,950 --> 00:28:27,080 Really, it comes from knowing a lot of mathematics, and 480 00:28:27,080 --> 00:28:29,280 seeing a lot of situations, and knowing that something 481 00:28:29,280 --> 00:28:34,570 like this might be mathematically convenient. 482 00:28:34,570 --> 00:28:40,080 Something like this might be mathematically convenient. 483 00:28:40,080 --> 00:28:42,670 But we've got to back up a little and let it sing to us. 484 00:28:42,670 --> 00:28:44,010 What's y? 485 00:28:44,010 --> 00:28:45,100 We saw y last time, with 486 00:28:45,100 --> 00:28:46,910 the support vector machines. 487 00:28:46,910 --> 00:28:47,780 That's just a function.
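The spoken formula, written out in one line (a reconstruction in standard notation; z is the normalizing factor he mentions):

```latex
w_i^{\,t+1} \;=\; \frac{w_i^{\,t}}{z}\; e^{-\,\alpha^{t}\, h^{t}(x_i)\, y(x_i)}
```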
488 00:28:47,780 --> 00:28:51,270 That's plus 1 or minus 1, depending on whether the 489 00:28:51,270 --> 00:28:55,310 output ought to be plus 1 or minus 1. 490 00:28:55,310 --> 00:29:02,200 So if this guy is giving the correct answer, and the 491 00:29:02,200 --> 00:29:06,630 correct answer is plus, and then this guy will be plus 1 492 00:29:06,630 --> 00:29:10,210 too, because it always gives you the correct answer. 493 00:29:10,210 --> 00:29:12,330 So in that case, where this guy is giving the right 494 00:29:12,330 --> 00:29:15,190 answer, these will have the same sign, so that will be a 495 00:29:15,190 --> 00:29:16,960 plus 1 combination. 496 00:29:16,960 --> 00:29:19,000 On the other hand, if that guy's giving the wrong answer, 497 00:29:19,000 --> 00:29:22,450 you're going to get a minus 1 out of that combination. 498 00:29:22,450 --> 00:29:25,680 So it's true even if the right answer should be minus, right? 499 00:29:25,680 --> 00:29:28,320 So if the right answer should be minus, and this is plus, 500 00:29:28,320 --> 00:29:30,820 then this will be minus 1, and the whole combination will 501 00:29:30,820 --> 00:29:31,945 give you minus 1 again. 502 00:29:31,945 --> 00:29:36,360 In other words, the y just flips the sign if you've got 503 00:29:36,360 --> 00:29:39,170 the wrong answer, no matter whether the wrong answer is 504 00:29:39,170 --> 00:29:42,330 plus 1 or minus 1. 505 00:29:42,330 --> 00:29:43,650 These alphas-- 506 00:29:43,650 --> 00:29:46,420 shoot, those are the same alphas that are in this 507 00:29:46,420 --> 00:29:49,950 formula up here, somehow. 508 00:29:49,950 --> 00:29:52,840 And then that z, what's that for? 509 00:29:52,840 --> 00:29:55,650 Well, if you just take the previous weights, and use this 510 00:29:55,650 --> 00:30:00,900 exponential function to produce these W's for the next 511 00:30:00,900 --> 00:30:04,910 generation, that's not going to be a distribution, because 512 00:30:04,910 --> 00:30:07,620 they won't sum up to 1.
513 00:30:07,620 --> 00:30:11,470 So what this thing here, this z is, that's a sort of 514 00:30:11,470 --> 00:30:12,720 normalizer. 515 00:30:18,750 --> 00:30:21,680 And that makes that whole combination of new 516 00:30:21,680 --> 00:30:23,980 weights add up to 1. 517 00:30:23,980 --> 00:30:31,570 So it's whatever you got by adding up all those guys, and 518 00:30:31,570 --> 00:30:34,660 then dividing by that number. 519 00:30:34,660 --> 00:30:35,910 Well, phew. 520 00:30:43,030 --> 00:30:44,350 I don't know. 521 00:30:44,350 --> 00:30:45,600 Now there's some it-turns-out-thats. 522 00:30:50,360 --> 00:30:52,230 We're going to imagine that somebody's done the same sort 523 00:30:52,230 --> 00:30:54,940 of thing we did to the support vector machines. 524 00:30:54,940 --> 00:30:57,730 We're going to find a way to minimize the error. 525 00:30:57,730 --> 00:30:59,540 And the error we're going to minimize is the error produced 526 00:30:59,540 --> 00:31:02,420 by that whole thing up there in 4. 527 00:31:02,420 --> 00:31:05,120 We're going to minimize the error of that entire 528 00:31:05,120 --> 00:31:06,370 expression as we go along. 529 00:31:08,930 --> 00:31:11,970 And what we discover when we do the appropriate 530 00:31:11,970 --> 00:31:13,775 differentiations and stuff-- 531 00:31:13,775 --> 00:31:15,710 you know, that's what we do in calculus-- 532 00:31:15,710 --> 00:31:24,580 what we discover is that you get minimum error for the 533 00:31:24,580 --> 00:31:45,970 whole thing if alpha is equal to 1 minus the error rate at 534 00:31:45,970 --> 00:31:51,190 time t, divided by the error rate at time t. 535 00:31:51,190 --> 00:31:53,950 Now let's take the logarithm of that, and 536 00:31:53,950 --> 00:31:56,220 multiply it by half. 537 00:31:56,220 --> 00:31:57,140 And that's what [INAUDIBLE] 538 00:31:57,140 --> 00:31:59,880 was struggling to find. 539 00:31:59,880 --> 00:32:01,350 But we haven't quite got it right. 
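The it-turns-out-that being described, written out (standard notation; minimizing the error bound of the whole expression in big idea 5 yields this alpha):

```latex
\alpha^{t} \;=\; \frac{1}{2}\,\ln\frac{1-\epsilon^{t}}{\epsilon^{t}}
```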
540 00:32:01,350 --> 00:32:03,800 And so let me add this in separate chunks, so we don't 541 00:32:03,800 --> 00:32:05,926 get confused about this. 542 00:32:05,926 --> 00:32:12,880 It's a bound on that expression up there. 543 00:32:12,880 --> 00:32:16,510 It's a bound on the error rate produced by that expression. 544 00:32:16,510 --> 00:32:22,540 So interestingly enough, this means that the error rate can 545 00:32:22,540 --> 00:32:26,000 actually go up as you add terms to this formula. 546 00:32:26,000 --> 00:32:28,560 All you know is that the error rate is going to be bounded by 547 00:32:28,560 --> 00:32:32,080 an exponentially decaying function. 548 00:32:32,080 --> 00:32:36,910 So it's eventually guaranteed to converge on zero. 549 00:32:36,910 --> 00:32:38,260 So it's a minimal error bound. 550 00:32:38,260 --> 00:32:39,510 It turns out to be exponential. 551 00:32:43,120 --> 00:32:45,630 Well, there it is. 552 00:32:45,630 --> 00:32:46,120 We're done. 553 00:32:46,120 --> 00:32:48,207 Would you like to see a demonstration? 554 00:32:48,207 --> 00:32:49,550 Yeah, OK. 555 00:32:49,550 --> 00:32:51,260 Because you look at that, and you say, well, how could 556 00:32:51,260 --> 00:32:53,800 anything like that possibly work? 557 00:32:53,800 --> 00:32:57,120 And the answer is, surprisingly enough, here's 558 00:32:57,120 --> 00:32:59,720 what happens. 559 00:32:59,720 --> 00:33:02,440 There's a simple little example. 560 00:33:02,440 --> 00:33:05,310 So that's the first test chosen. 561 00:33:05,310 --> 00:33:09,470 The greens are pluses and the reds are minuses, so it's 562 00:33:09,470 --> 00:33:11,480 still got an error. 563 00:33:11,480 --> 00:33:12,620 Still got an error-- boom. 564 00:33:12,620 --> 00:33:13,830 There, in two steps.
565 00:33:13,830 --> 00:33:14,600 It now has-- 566 00:33:14,600 --> 00:33:16,670 we can look in the upper right hand corner-- 567 00:33:16,670 --> 00:33:20,460 we see it's used three classifiers, and we see that 568 00:33:20,460 --> 00:33:22,900 one of those classifiers says that everybody belongs to a 569 00:33:22,900 --> 00:33:27,250 particular class, three different weights. 570 00:33:27,250 --> 00:33:30,540 And the error rate has converged to 0. 571 00:33:30,540 --> 00:33:32,170 So let's look at a couple of other ones. 572 00:33:32,170 --> 00:33:35,060 Here is the one I use for debugging this thing. 573 00:33:35,060 --> 00:33:36,250 We'll let that run. 574 00:33:36,250 --> 00:33:37,690 See how fast it is? 575 00:33:37,690 --> 00:33:38,710 Boom. 576 00:33:38,710 --> 00:33:42,800 It converges to getting all the samples right very fast. 577 00:33:42,800 --> 00:33:44,190 Here's another one. 578 00:33:44,190 --> 00:33:47,350 This is one we gave on an exam a few years back. 579 00:33:47,350 --> 00:33:48,670 First test. 580 00:33:48,670 --> 00:33:50,620 Oh, I let it run, so it got everything 581 00:33:50,620 --> 00:33:52,380 instantaneously right. 582 00:33:52,380 --> 00:33:53,950 Let's take that through a step at a time. 583 00:33:53,950 --> 00:33:56,940 There's the first one, second one. 584 00:33:56,940 --> 00:33:58,800 Still got a lot of errors. 585 00:33:58,800 --> 00:34:01,600 Ah, the error rate's dropping. 586 00:34:01,600 --> 00:34:06,160 And then flattened, flattened, and it goes to 0. 587 00:34:06,160 --> 00:34:08,000 Cool, don't you think? 588 00:34:08,000 --> 00:34:10,010 But you say to me, bah, who cares about that stuff? 589 00:34:10,010 --> 00:34:11,540 Let's try something more interesting. 590 00:34:11,540 --> 00:34:14,190 There's one. 591 00:34:14,190 --> 00:34:15,500 That was pretty fast, too. 592 00:34:15,500 --> 00:34:17,090 Well, there's not too many samples here. 593 00:34:17,090 --> 00:34:20,030 So we can try this.
594 00:34:20,030 --> 00:34:22,230 So there's an array of pluses and minuses. 595 00:34:22,230 --> 00:34:22,940 Boom. 596 00:34:22,940 --> 00:34:24,920 You can see how that error rate is bounded by an 597 00:34:24,920 --> 00:34:26,170 exponential? 598 00:34:27,920 --> 00:34:32,800 So in the bottom graph, you've got the number of classifiers 599 00:34:32,800 --> 00:34:36,650 involved, and that goes up to a total, eventually, of 10. 600 00:34:36,650 --> 00:34:41,230 You can see how positive or negative each of the 601 00:34:41,230 --> 00:34:43,530 classifiers that's added is by looking at 602 00:34:43,530 --> 00:34:45,270 this particular tab. 603 00:34:45,270 --> 00:34:48,045 And this just shows how they evolve over time. 604 00:34:48,045 --> 00:34:52,239 But the progress thing here is the most interesting. 605 00:34:52,239 --> 00:34:57,420 And now you say to me, well, how did the machine do that? 606 00:34:57,420 --> 00:35:00,330 And it's all right here. 607 00:35:00,330 --> 00:35:05,400 We use an alpha that looks like this. 608 00:35:05,400 --> 00:35:08,400 And that allows us to compute the new weights. 609 00:35:08,400 --> 00:35:10,150 It says we've got a preliminary calculation. 610 00:35:10,150 --> 00:35:13,630 We've got to find a z that does the normalization. 611 00:35:13,630 --> 00:35:17,640 And we sure better bring our calculator, because we've got, 612 00:35:17,640 --> 00:35:19,350 first of all, to calculate the error rate. 613 00:35:19,350 --> 00:35:22,365 Then we've got to take its logarithm, divide by 2, plug 614 00:35:22,365 --> 00:35:27,290 it into that formula, take the exponent, and that gives us 615 00:35:27,290 --> 00:35:28,210 the new weight. 616 00:35:28,210 --> 00:35:29,460 And that's how the program works. 617 00:35:29,460 --> 00:35:30,880 And if you try that, I guarantee you 618 00:35:30,880 --> 00:35:33,130 will flunk the exam. 619 00:35:33,130 --> 00:35:34,940 Now, I don't care about my computer.
620 00:35:34,940 --> 00:35:35,920 I really don't. 621 00:35:35,920 --> 00:35:39,050 It's a slave, and it can calculate these logarithms and 622 00:35:39,050 --> 00:35:41,840 exponentials till it turns blue, and I don't care. 623 00:35:41,840 --> 00:35:44,740 Because I've got four cores or something, and who cares. 624 00:35:44,740 --> 00:35:46,220 Might as well do this, than sit around 625 00:35:46,220 --> 00:35:48,391 just burning up heat. 626 00:35:48,391 --> 00:35:49,640 But you don't want to do that. 627 00:35:49,640 --> 00:35:53,010 So what you want to do is you want to know how to do this 628 00:35:53,010 --> 00:35:57,240 sort of thing more expeditiously. 629 00:35:57,240 --> 00:36:00,720 So we're going to have to let the math sing to us a 630 00:36:00,720 --> 00:36:05,470 little bit, with a view towards finding better ways of 631 00:36:05,470 --> 00:36:08,290 doing this sort of thing. 632 00:36:08,290 --> 00:36:11,700 So let's do that. 633 00:36:11,700 --> 00:36:14,080 And we're going to run out of space here before long, so let 634 00:36:14,080 --> 00:36:18,450 me reclaim as much of this board as I can. 635 00:36:18,450 --> 00:36:20,940 So what I'm going to do is I'm going to say, well, now that 636 00:36:20,940 --> 00:36:25,720 we've got this formula for alpha that relates alpha t to 637 00:36:25,720 --> 00:36:31,530 the error, then I can plug that into this formula up 638 00:36:31,530 --> 00:36:32,345 here, number 6. 639 00:36:32,345 --> 00:36:40,390 And what I'll get is that the weight of t plus 1 is equal to 640 00:36:40,390 --> 00:36:46,710 the weight at t divided by that normalizing factor, 641 00:36:46,710 --> 00:36:53,350 multiplied times something that depends on whether it's 642 00:36:53,350 --> 00:36:55,600 categorized correctly or not. 643 00:36:55,600 --> 00:36:59,660 That's what that y's in there for, right?
644 00:36:59,660 --> 00:37:05,630 So we've got a logarithm here, and we've got a sign flipper up 645 00:37:05,630 --> 00:37:10,690 there in terms of that H of x and y combination. 646 00:37:10,690 --> 00:37:18,220 So if the sign of that whole thing at minus alpha and that 647 00:37:18,220 --> 00:37:23,900 y H combination turns out to be negative, then we're going 648 00:37:23,900 --> 00:37:27,740 to have to flip the numerator and denominator here in this 649 00:37:27,740 --> 00:37:29,620 logarithm, right? 650 00:37:29,620 --> 00:37:32,250 And oh, by the way, since we've got a half out here, 651 00:37:32,250 --> 00:37:34,170 that turns out to be the square root of that term 652 00:37:34,170 --> 00:37:37,190 inside the logarithm. 653 00:37:37,190 --> 00:37:43,290 So when we carefully do that, what we discover is that it 654 00:37:43,290 --> 00:37:46,430 depends on whether it's the right thing or not. 655 00:37:46,430 --> 00:37:50,860 But what it turns out to be is something like a multiplier of 656 00:37:50,860 --> 00:37:53,750 the square root. 657 00:37:53,750 --> 00:37:55,960 Better be careful, here. 658 00:37:55,960 --> 00:37:59,300 The square root of what? 659 00:37:59,300 --> 00:38:02,030 STUDENT: [INAUDIBLE]. 660 00:38:02,030 --> 00:38:02,860 PATRICK WINSTON: Well, let's see. 661 00:38:02,860 --> 00:38:04,180 But we have to be careful. 662 00:38:04,180 --> 00:38:08,180 So let's suppose that this is for things that we get correct. 663 00:38:13,740 --> 00:38:17,910 So if we get it correct, then we're going to get the same 664 00:38:17,910 --> 00:38:20,200 sign out of H of x and y. 665 00:38:20,200 --> 00:38:22,350 We've got a minus sign out there, so we're going to flip 666 00:38:22,350 --> 00:38:25,500 the numerator and denominator. 667 00:38:25,500 --> 00:38:30,460 So we're going to get the square root of epsilon of t over 1 668 00:38:30,460 --> 00:38:34,110 minus epsilon of t if that's correct.
669 00:38:34,110 --> 00:38:36,510 If it's wrong, it'll just be the flip of that. 670 00:38:39,350 --> 00:38:44,690 So it'll be the square root of 1 minus the error rate over 671 00:38:44,690 --> 00:38:45,940 the error rate. 672 00:38:48,570 --> 00:38:49,740 Everybody with me on that? 673 00:38:49,740 --> 00:38:51,620 I think that's right. 674 00:38:51,620 --> 00:38:55,930 If it's wrong, I'll have to hang myself and wear a paper 675 00:38:55,930 --> 00:38:57,760 bag over my head like I did last year. 676 00:38:57,760 --> 00:39:00,796 But let's see if we can make this go correctly this time. 677 00:39:05,730 --> 00:39:12,430 So now, we've got this guy here, we've got everything 678 00:39:12,430 --> 00:39:18,110 plugged in all right, and we know that now this z ought to 679 00:39:18,110 --> 00:39:22,630 be selected so that it's equal to the sum of this guy 680 00:39:22,630 --> 00:39:25,070 multiplied by these things as appropriate for whether it's 681 00:39:25,070 --> 00:39:28,220 correct or not. 682 00:39:28,220 --> 00:39:31,710 Because we want, in the end, for all of these w's 683 00:39:31,710 --> 00:39:34,320 to add up to 1. 684 00:39:34,320 --> 00:39:39,830 So let's see what they add up to without the z there. 685 00:39:39,830 --> 00:39:44,840 So what we know is that it must be the case that if we 686 00:39:44,840 --> 00:39:53,670 add over the correct ones, we get the square root of the 687 00:39:53,670 --> 00:39:59,930 error rate over 1 minus the error rate, times the sum of the Wi at time t. 688 00:40:04,100 --> 00:40:09,520 Plus now we've got the square root of 1 minus the error rate over 689 00:40:09,520 --> 00:40:16,010 the error rate, times the sum of the Wi at time t for wrong. 690 00:40:24,340 --> 00:40:27,320 So that's what we get if we added all these 691 00:40:27,320 --> 00:40:30,420 up without the z. 692 00:40:30,420 --> 00:40:33,400 So since everything has to add up to 1, then z ought to be 693 00:40:33,400 --> 00:40:34,650 equal to this sum.
694 00:40:43,880 --> 00:40:47,960 That looks pretty horrible, until we realize that if we 695 00:40:47,960 --> 00:40:51,930 add these guys up over the weights that are wrong, that 696 00:40:51,930 --> 00:40:53,180 is the error rate. 697 00:40:55,880 --> 00:40:57,130 This is e. 698 00:40:59,850 --> 00:41:08,540 So therefore, z is equal to the square root of the error rate 699 00:41:08,540 --> 00:41:10,710 times 1 minus the error rate. 700 00:41:10,710 --> 00:41:14,040 That's the contribution of this term. 701 00:41:14,040 --> 00:41:15,310 Now, let's see. 702 00:41:15,310 --> 00:41:17,700 What is the sum of the weights over the 703 00:41:17,700 --> 00:41:20,320 ones that are correct? 704 00:41:20,320 --> 00:41:25,020 Well, that must be 1 minus the error rate. 705 00:41:25,020 --> 00:41:30,290 Ah, so this thing gives you the same result as this one. 706 00:41:30,290 --> 00:41:34,170 So z is equal to 2 times that. 707 00:41:34,170 --> 00:41:35,420 And that's a good thing. 708 00:41:38,580 --> 00:41:40,540 Now we are getting somewhere. 709 00:41:40,540 --> 00:41:44,380 Because now, it becomes a little bit easier to write 710 00:41:44,380 --> 00:41:46,490 some things down. 711 00:41:46,490 --> 00:41:49,330 Well, we're way past this, so let's get rid of this. 712 00:41:54,090 --> 00:41:57,940 And now we can put some things together. 713 00:41:57,940 --> 00:42:00,910 Let me point out what I'm putting together. 714 00:42:00,910 --> 00:42:06,560 I've got an expression for z right here. 715 00:42:06,560 --> 00:42:11,320 And I've got an expression for the new w's here. 716 00:42:11,320 --> 00:42:19,020 So let's put those together and say that w of t plus 1 is 717 00:42:19,020 --> 00:42:23,150 equal to w of t. 718 00:42:23,150 --> 00:42:26,090 I guess we're going to divide that by 2. 719 00:42:26,090 --> 00:42:33,470 And then we've got this square root times that expression.
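The z derivation just spoken, written out in one chain (standard notation; the two sums over correct and wrong samples are 1 minus epsilon and epsilon respectively):

```latex
z \;=\; \sqrt{\tfrac{\epsilon}{1-\epsilon}}\sum_{\mathrm{correct}} w_i^{\,t}
   \;+\; \sqrt{\tfrac{1-\epsilon}{\epsilon}}\sum_{\mathrm{wrong}} w_i^{\,t}
   \;=\; \sqrt{\tfrac{\epsilon}{1-\epsilon}}\,(1-\epsilon)
   \;+\; \sqrt{\tfrac{1-\epsilon}{\epsilon}}\,\epsilon
   \;=\; 2\sqrt{\epsilon\,(1-\epsilon)}
```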
720 00:42:33,470 --> 00:42:40,470 So if we take that correct one, and divide by that one, 721 00:42:40,470 --> 00:42:44,970 then the [INAUDIBLE] 722 00:42:44,970 --> 00:42:50,360 cancel out, and I get 1 over 1 minus the error rate. 723 00:42:53,560 --> 00:42:53,850 That's it. 724 00:42:53,850 --> 00:42:55,100 That's correct. 725 00:42:59,880 --> 00:43:04,620 And if it's not correct, then it's Wt over 2-- 726 00:43:04,620 --> 00:43:05,670 and working through the math-- 727 00:43:05,670 --> 00:43:08,630 1 over epsilon, if wrong. 728 00:43:11,950 --> 00:43:15,130 Do we feel like we're making any progress? 729 00:43:15,130 --> 00:43:16,030 No. 730 00:43:16,030 --> 00:43:19,090 Because we haven't let it sing to us enough yet. 731 00:43:19,090 --> 00:43:25,130 So I want to draw your attention to what happens to 732 00:43:25,130 --> 00:43:28,500 amateur rock climbers when they're halfway 733 00:43:28,500 --> 00:43:31,360 up a difficult cliff. 734 00:43:31,360 --> 00:43:33,570 They're usually [INAUDIBLE], sometimes they're not. 735 00:43:33,570 --> 00:43:36,800 If they're not, they're scared to death. 736 00:43:36,800 --> 00:43:40,850 And every once in a while, as they're just about to fall, 737 00:43:40,850 --> 00:43:44,410 they find some little tiny hole to stick a fingernail in, 738 00:43:44,410 --> 00:43:46,510 and that keeps them from falling. 739 00:43:46,510 --> 00:43:50,440 That's called a thank-god hole. 740 00:43:50,440 --> 00:43:53,680 So what I'm about to introduce is the analog of those little 741 00:43:53,680 --> 00:43:55,530 places where you can stick your fingernail in. 742 00:43:55,530 --> 00:43:57,380 It's the thank-god hole for dealing 743 00:43:57,380 --> 00:43:58,630 with boosting problems. 744 00:44:04,680 --> 00:44:07,370 So what happens if I add all these [? Wi ?] 745 00:44:07,370 --> 00:44:12,470 up for the ones that the classifier where produces a 746 00:44:12,470 --> 00:44:16,050 correct answer on? 
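Putting the z expression and the square-root multipliers together gives the simplified update being worked out here (a reconstruction in standard notation):

```latex
w_i^{\,t+1} \;=\;
\begin{cases}
\dfrac{w_i^{\,t}}{2\,(1-\epsilon^{t})} & \text{if classified correctly,}\\[2ex]
\dfrac{w_i^{\,t}}{2\,\epsilon^{t}} & \text{if classified wrongly.}
\end{cases}
```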
747 00:44:16,050 --> 00:44:22,110 Well, it'll be 1 over 2, and 1 over 1 minus epsilon, times 748 00:44:22,110 --> 00:44:29,490 the sum of the Wt for which the answer was correct. 749 00:44:29,490 --> 00:44:31,781 What's this sum? 750 00:44:31,781 --> 00:44:32,450 Oh! 751 00:44:32,450 --> 00:44:34,480 My goodness. 752 00:44:34,480 --> 00:44:38,920 1 minus epsilon. 753 00:44:38,920 --> 00:44:50,920 So what I've just discovered is that if I sum new w's over 754 00:44:50,920 --> 00:44:53,880 those samples for which I got a correct answer, 755 00:44:53,880 --> 00:44:56,490 it's equal to 1/2. 756 00:44:56,490 --> 00:44:57,130 And guess what? 757 00:44:57,130 --> 00:45:03,240 That means that if I sum them over wrong, it's equal to 1/2 758 00:45:03,240 --> 00:45:04,490 as well. 759 00:45:07,710 --> 00:45:11,300 So that means that I take all of the weight for which I got 760 00:45:11,300 --> 00:45:18,000 the right answer with the previous test, and those weights 761 00:45:18,000 --> 00:45:19,990 will add up to something. 762 00:45:19,990 --> 00:45:22,263 And to get the weights for the next generation, all I have to 763 00:45:22,263 --> 00:45:24,780 do is scale them so that they equal half. 764 00:45:24,780 --> 00:45:26,710 This was not noticed by the people who 765 00:45:26,710 --> 00:45:27,000 developed this stuff. 766 00:45:27,000 --> 00:45:31,210 This was noticed by Luis Ortiz, who was a 6.034 767 00:45:31,210 --> 00:45:34,160 instructor a few years ago. 768 00:45:34,160 --> 00:45:38,660 The sum of those weights is going to be a scaled version 769 00:45:38,660 --> 00:45:41,400 of what they were before.
770 00:45:41,400 --> 00:45:43,340 So you take all the weights for which this new 771 00:45:43,340 --> 00:45:44,590 classifier-- 772 00:45:44,590 --> 00:45:46,890 this one you selected to give you the minimum error on the 773 00:45:46,890 --> 00:45:48,050 re-weighted stuff-- 774 00:45:48,050 --> 00:45:50,520 you take the ones that it gives a correct answer for, 775 00:45:50,520 --> 00:45:52,775 and you take all of those weights, and you just scale 776 00:45:52,775 --> 00:45:55,770 them so they add up to 1/2. 777 00:45:55,770 --> 00:45:58,730 So do you have to compute any logarithms? 778 00:45:58,730 --> 00:45:59,670 No. 779 00:45:59,670 --> 00:46:01,320 Do you have to compute any exponentials? 780 00:46:01,320 --> 00:46:02,230 No. 781 00:46:02,230 --> 00:46:03,790 Do you have to calculate z? 782 00:46:03,790 --> 00:46:05,170 No. 783 00:46:05,170 --> 00:46:07,120 Do you have to calculate alpha to get the new weights? 784 00:46:07,120 --> 00:46:07,755 No. 785 00:46:07,755 --> 00:46:09,690 All you have to do is scale them. 786 00:46:09,690 --> 00:46:12,730 And that's a pretty good thank-god hole. 787 00:46:12,730 --> 00:46:14,020 So that's thank-god hole number one. 788 00:46:21,890 --> 00:46:26,340 Now, for thank-god hole number two, we need to go back and 789 00:46:26,340 --> 00:46:28,720 think about the fact that we're going to give you problems that in all 790 00:46:28,720 --> 00:46:32,940 probability involve decision tree stumps. 791 00:46:32,940 --> 00:46:35,790 And there are a lot of decision tree stumps that you 792 00:46:35,790 --> 00:46:38,050 might have to pick from. 793 00:46:38,050 --> 00:46:39,940 So we need a thank-god hole for deciding how 794 00:46:39,940 --> 00:46:42,320 to deal with that. 795 00:46:42,320 --> 00:46:43,330 Where can I find some room? 796 00:46:43,330 --> 00:46:44,580 How about right here. 797 00:46:53,870 --> 00:46:56,040 Suppose you've got a space that looks like this.
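The shortcut can be checked numerically against the exponential formula. This is a sketch with made-up function names and data layout; `correct` is a list of booleans saying whether the chosen classifier got each sample right, and `eps` must equal the sum of the weights on the wrong samples.

```python
import math

def reweight_exponential(w, correct, eps):
    """The calculator route: w_i <- w_i * exp(-alpha * h(x_i) * y(x_i)) / z,
    with alpha = (1/2) ln((1 - eps) / eps) and z chosen so the
    new weights sum to 1."""
    alpha = 0.5 * math.log((1 - eps) / eps)
    raw = [wi * math.exp(-alpha if ok else alpha)
           for wi, ok in zip(w, correct)]
    z = sum(raw)
    return [wi / z for wi in raw]

def reweight_rescale(w, correct):
    """The thank-god-hole route: scale the correctly classified
    weights so they sum to 1/2, and the misclassified weights so
    they also sum to 1/2.  No logarithms, exponentials, alphas,
    or z required."""
    right = sum(wi for wi, ok in zip(w, correct) if ok)
    wrong = sum(wi for wi, ok in zip(w, correct) if not ok)
    return [wi * (0.5 / right if ok else 0.5 / wrong)
            for wi, ok in zip(w, correct)]
```

For example, with four equal weights of 1/4 and one sample wrong (so eps is 1/4), both routes give the three correct samples weight 1/6 each and the wrong one weight 1/2.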
799 00:47:06,020 --> 00:47:07,020 So how many-- 800 00:47:07,020 --> 00:47:07,180 let's see. 801 00:47:07,180 --> 00:47:11,300 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. 802 00:47:11,300 --> 00:47:14,315 How many tests do I have to consider in that dimension? 803 00:47:17,598 --> 00:47:19,077 11. 804 00:47:19,077 --> 00:47:22,060 It's 1 plus the number of samples. 805 00:47:22,060 --> 00:47:23,310 That would be horrible. 806 00:47:26,590 --> 00:47:27,080 I don't know. 807 00:47:27,080 --> 00:47:28,990 Do I have actually calculate this one? 808 00:47:33,040 --> 00:47:36,430 How could that possibly be better than that one? 809 00:47:36,430 --> 00:47:39,930 It's got one more thing wrong. 810 00:47:39,930 --> 00:47:45,570 So that one makes sense. 811 00:47:45,570 --> 00:47:48,940 The other one doesn't make sense. 812 00:47:48,940 --> 00:47:55,520 So in the end, no test that lies between two correctly 813 00:47:55,520 --> 00:47:58,530 classified samples will ever be any good. 814 00:47:58,530 --> 00:48:01,830 So that one's a good guy, and that one's a good guy. 815 00:48:01,830 --> 00:48:02,870 And this one's a bad guy. 816 00:48:02,870 --> 00:48:05,600 Bad guy, bad guy bad guy, bad guy. 817 00:48:05,600 --> 00:48:08,910 Bad guy, bad guy, bad buy. 818 00:48:08,910 --> 00:48:14,410 So the actual number of tests you've got is three. 819 00:48:14,410 --> 00:48:17,770 And likewise, in the other dimension-- 820 00:48:17,770 --> 00:48:19,960 well, I haven't drawn it so well here, but would this test 821 00:48:19,960 --> 00:48:20,660 be a good one? 822 00:48:20,660 --> 00:48:21,320 No. 823 00:48:21,320 --> 00:48:21,690 That one? 824 00:48:21,690 --> 00:48:22,940 No. 825 00:48:24,760 --> 00:48:26,465 Actually, I'd better look over here on the right and see what 826 00:48:26,465 --> 00:48:28,400 I've got before I draw too many conclusions. 
827 00:48:28,400 --> 00:48:30,870 Let's look over this, since I don't want to think too hard 828 00:48:30,870 --> 00:48:32,980 about what's going on in the other dimension. 829 00:48:32,980 --> 00:48:35,270 But the idea is that very few of those 830 00:48:35,270 --> 00:48:38,240 tests actually matter. 831 00:48:38,240 --> 00:48:39,770 Now, you say to me, there's one last thing. 832 00:48:39,770 --> 00:48:41,762 What about overfitting? 833 00:48:41,762 --> 00:48:45,800 Because all this does is drape a solution over the samples. 834 00:48:45,800 --> 00:48:49,110 And like support vector machines overfit, neural nets 835 00:48:49,110 --> 00:48:52,580 overfit, identification trees overfit. 836 00:48:52,580 --> 00:48:53,820 Guess what? 837 00:48:53,820 --> 00:48:56,290 This doesn't seem to overfit. 838 00:48:56,290 --> 00:48:59,130 That's an experimental result for which the 839 00:48:59,130 --> 00:49:01,470 literature is confused 840 00:49:01,470 --> 00:49:03,920 when it comes to providing an explanation. 841 00:49:03,920 --> 00:49:06,210 So this stuff is tried on all sorts of problems, like 842 00:49:06,210 --> 00:49:10,100 handwriting recognition, understanding speech, all 843 00:49:10,100 --> 00:49:12,180 sorts of stuff uses boosting. 844 00:49:12,180 --> 00:49:16,010 And unlike other methods, for some reason as yet imperfectly 845 00:49:16,010 --> 00:49:20,260 understood, it doesn't seem to overfit. 846 00:49:20,260 --> 00:49:25,550 But in the end, we leave no stone unturned in 6.034. 847 00:49:25,550 --> 00:49:28,670 Every time we do this, we do some additional experiments. 848 00:49:28,670 --> 00:49:32,410 So here's a sample that I'll leave you with. 849 00:49:32,410 --> 00:49:36,130 Here's a situation in which we have a 10-dimensional space. 850 00:49:36,130 --> 00:49:38,270 We've made a fake distribution, and then we put 851 00:49:38,270 --> 00:49:40,270 in that boxed outlier.
852 00:49:40,270 --> 00:49:42,630 That was just put into the space at random, so it can be 853 00:49:42,630 --> 00:49:45,230 viewed as an error point. 854 00:49:45,230 --> 00:49:47,240 So now what we're going to do is we're going to see what 855 00:49:47,240 --> 00:49:49,560 happens when we run that guy. 856 00:49:49,560 --> 00:49:55,140 And sure enough, in 17 steps, it finds a solution. 857 00:49:55,140 --> 00:49:59,620 But maybe it's overfit that little guy who's an error. 858 00:49:59,620 --> 00:50:03,000 But one thing you can do is you can say, well, all of 859 00:50:03,000 --> 00:50:06,890 these classifiers are dividing this space up into chunks, and 860 00:50:06,890 --> 00:50:11,750 we can compute the size of the space occupied by any sample. 861 00:50:11,750 --> 00:50:13,650 So one thing we can do-- 862 00:50:13,650 --> 00:50:16,370 alas, I'll have to get up a new demonstration. 863 00:50:16,370 --> 00:50:19,750 One thing we can do, now that this guy's over here, we can 864 00:50:19,750 --> 00:50:23,310 switch to the volume tab and watch how the volume occupied 865 00:50:23,310 --> 00:50:29,640 by that error point evolves as we solve the problem. 866 00:50:29,640 --> 00:50:31,820 So look what happens. 867 00:50:31,820 --> 00:50:33,380 This is, of course, randomly generated. 868 00:50:33,380 --> 00:50:35,390 I'm counting on this working. 869 00:50:35,390 --> 00:50:36,640 Never failed before. 870 00:50:39,930 --> 00:50:44,510 So it originally starts out as occupying 26% 871 00:50:44,510 --> 00:50:47,020 of the total volume. 872 00:50:47,020 --> 00:50:52,360 It ends up occupying 1.4 times 10 to the 873 00:50:52,360 --> 00:50:55,910 minus 3rd% of the volume.
874 00:50:55,910 --> 00:51:00,060 So what tends to happen is that these decision tree 875 00:51:00,060 --> 00:51:03,190 stumps tend to wrap themselves so tightly around the error 876 00:51:03,190 --> 00:51:05,350 points, there's no room for overfitting, because nothing 877 00:51:05,350 --> 00:51:07,550 else will fit in that same volume. 878 00:51:07,550 --> 00:51:10,390 So that's why I think that this thing tends to produce 879 00:51:10,390 --> 00:51:12,430 solutions which don't overfit. 880 00:51:12,430 --> 00:51:14,970 So in conclusion, this is magic. 881 00:51:14,970 --> 00:51:16,010 You always want to use it. 882 00:51:16,010 --> 00:51:17,510 It'll work with any kind of [? speed ?] of 883 00:51:17,510 --> 00:51:19,090 classifiers you want. 884 00:51:19,090 --> 00:51:21,590 And you should understand it very thoroughly, because if 885 00:51:21,590 --> 00:51:25,740 anything is useful in this subject in the dimension of learning, 886 00:51:25,740 --> 00:51:26,990 this is it.