1
00:00:00,000 --> 00:00:09,234


2
00:00:09,234 --> 00:00:10,300
PATRICK WINSTON: So
where are we?

3
00:00:10,300 --> 00:00:14,962
We started off with simple
methods for learning stuff.

4
00:00:14,962 --> 00:00:20,730
Then, we talked a little about
approaches to learning that

5
00:00:20,730 --> 00:00:24,556
were vaguely inspired by

6
00:00:24,556 --> 00:00:27,300
the fact that our heads are
stuffed with neurons, and that

7
00:00:27,300 --> 00:00:31,095
we seem to have evolved
from primates.

8
00:00:31,095 --> 00:00:34,940
Then, we talked about looking at
the problem and addressing the

9
00:00:34,940 --> 00:00:36,410
issue of [? phrenology ?]

10
00:00:36,410 --> 00:00:40,430
and how it's possible
to learn concepts.

11
00:00:40,430 --> 00:00:43,700
But now, we're coming full
circle back to the beginning

12
00:00:43,700 --> 00:00:47,990
and thinking about how to
divide up a space with

13
00:00:47,990 --> 00:00:49,930
decision boundaries.

14
00:00:49,930 --> 00:00:54,580
But whether you do it with
a neural net or nearest

15
00:00:54,580 --> 00:00:56,510
neighbors or an ID tree,

16
00:00:56,510 --> 00:01:02,115
those are very simple ideas
that work very often.

17
00:01:02,115 --> 00:01:05,895
Today, we're going to talk about
a very sophisticated

18
00:01:05,895 --> 00:01:09,212
idea that still has
an implementation.

19
00:01:09,212 --> 00:01:13,220
So this needs to be
in the tool bag of

20
00:01:13,220 --> 00:01:15,506
every civilized person.

21
00:01:15,506 --> 00:01:18,560
This is about support
vector machines, an

22
00:01:18,560 --> 00:01:20,735
idea that was developed--

23
00:01:20,735 --> 00:01:22,470
Well, I want to talk to
you today about how

24
00:01:22,470 --> 00:01:24,705
ideas develop, actually.

25
00:01:24,705 --> 00:01:27,150
Because you look at stuff like
this in a book, and you think,

26
00:01:27,150 --> 00:01:32,515
well, Vladimir Vapnik just
figured this out one Saturday

27
00:01:32,515 --> 00:01:35,780
afternoon when the weather was
too bad to go outside.

28
00:01:35,780 --> 00:01:37,185
That's not how it happens.

29
00:01:37,185 --> 00:01:38,580
It happens very differently.

30
00:01:38,580 --> 00:01:41,229
I want to talk to you
a little about that.

31
00:01:41,229 --> 00:01:46,950
The nice thing about great
things that were done by

32
00:01:46,950 --> 00:01:49,060
people who are still alive
is you can ask them

33
00:01:49,060 --> 00:01:50,210
how they did it.

34
00:01:50,210 --> 00:01:51,810
You can't do that
with Fourier.

35
00:01:51,810 --> 00:01:54,310
You can't say to Fourier,
how did you do it?

36
00:01:54,310 --> 00:01:56,946
Did you dream it up on
a Saturday afternoon?

37
00:01:56,946 --> 00:02:00,220
But you can call Vapnik on the phone
and ask him questions.

38
00:02:00,220 --> 00:02:02,050
That's the stuff I'm going
to talk about toward

39
00:02:02,050 --> 00:02:04,186
the end of the hour.

40
00:02:04,186 --> 00:02:06,045
Well, it's all about decision
boundaries.

41
00:02:06,045 --> 00:02:11,400
And now, we have several
techniques that we can use to

42
00:02:11,400 --> 00:02:12,620
draw some decision boundaries.

43
00:02:12,620 --> 00:02:14,700
And here's the same problem.

44
00:02:14,700 --> 00:02:18,329
And if we drew decision
boundaries in here, we might

45
00:02:18,329 --> 00:02:21,826
get something that would
look like maybe this,

46
00:02:21,826 --> 00:02:25,790
if we were doing a nearest
neighbor approach. And if

47
00:02:25,790 --> 00:02:31,522
we're doing ID trees, we'll just
draw in a line like that.

48
00:02:31,522 --> 00:02:34,945
And if we're doing neural nets,
well, you can put in a

49
00:02:34,945 --> 00:02:37,550
lot of straight lines wherever
you like with a neural net,

50
00:02:37,550 --> 00:02:39,110
depending on how it's
trained up.

51
00:02:39,110 --> 00:02:42,470
Or if you just simply go in
there and design it. So you

52
00:02:42,470 --> 00:02:45,554
could do that if you wanted.

53
00:02:45,554 --> 00:02:48,110
And you would think that after
people have been working on

54
00:02:48,110 --> 00:02:52,500
this sort of stuff for 50 or 75
years that there wouldn't

55
00:02:52,500 --> 00:02:54,535
be any tricks in the bag left.

56
00:02:54,535 --> 00:02:59,340
And that's when everybody got
surprised, because around the

57
00:02:59,340 --> 00:03:03,880
early '90s Vladimir Vapnik
introduced the ideas I'm about

58
00:03:03,880 --> 00:03:05,916
to talk to you about.

59
00:03:05,916 --> 00:03:11,215
So what Vapnik says is
something like this.

60
00:03:11,215 --> 00:03:17,470
Here you have a space, and you
have some negative examples,

61
00:03:17,470 --> 00:03:20,436
and you have some positive
examples.

62
00:03:20,436 --> 00:03:22,870
How do you divide the positive
examples from

63
00:03:22,870 --> 00:03:24,220
the negative examples?

64
00:03:24,220 --> 00:03:27,710
And what he says we want
to do is we want to draw a

65
00:03:27,710 --> 00:03:29,140
straight line.

66
00:03:29,140 --> 00:03:32,062
But which straight line
is the question.

67
00:03:32,062 --> 00:03:35,140
Well, we want to draw
a straight line.

68
00:03:35,140 --> 00:03:38,141
Well, would this be a
good straight line?

69
00:03:38,141 --> 00:03:40,492
One that went up like that?

70
00:03:40,492 --> 00:03:42,660
Probably not so hot.

71
00:03:42,660 --> 00:03:45,622
How about one that's
just right here?

72
00:03:45,622 --> 00:03:49,460
Well, that might separate them,
but it seems awfully

73
00:03:49,460 --> 00:03:51,765
close to the negative
examples.

74
00:03:51,765 --> 00:03:55,030
So maybe what we ought to do
is we ought to draw our

75
00:03:55,030 --> 00:03:57,220
straight line in here,
sort of like this.

76
00:03:57,220 --> 00:04:00,458


77
00:04:00,458 --> 00:04:07,590
And that line is drawn with a
view toward putting in the

78
00:04:07,590 --> 00:04:13,330
widest street that separates the
positive samples from the

79
00:04:13,330 --> 00:04:14,460
negative samples.

80
00:04:14,460 --> 00:04:17,209
That's why I call it the
widest street approach.

81
00:04:17,209 --> 00:04:21,535
So that way of putting
in the decision boundary--

82
00:04:21,535 --> 00:04:25,560
is to put in a straight line but
in contrast with the way

83
00:04:25,560 --> 00:04:27,440
an ID tree puts in a
straight line,

84
00:04:27,440 --> 00:04:32,165
it tries to put the line in in
such a way that the separation

85
00:04:32,165 --> 00:04:34,680
between the positive and
negative examples,

86
00:04:34,680 --> 00:04:37,236
that street, is as wide
as possible.

87
00:04:37,236 --> 00:04:37,722
All right.

88
00:04:37,722 --> 00:04:41,620
So you might think to do that in
a UROP project, and then,

89
00:04:41,620 --> 00:04:43,205
let it go at that.

90
00:04:43,205 --> 00:04:44,730
What's the big deal?

91
00:04:44,730 --> 00:04:47,340
So what we've got to do is we've
got to go through why

92
00:04:47,340 --> 00:04:49,176
it's a big deal.

93
00:04:49,176 --> 00:04:55,170
So first of all, we like to
think about how you would make

94
00:04:55,170 --> 00:04:59,326
a decision rule that would use
that decision boundary.

95
00:04:59,326 --> 00:05:03,650
So what I'm going to ask you to
imagine is that we've got a

96
00:05:03,650 --> 00:05:09,650
vector of any length that you
like, constrained to be

97
00:05:09,650 --> 00:05:13,715
perpendicular to the median, or
if you like, perpendicular

98
00:05:13,715 --> 00:05:14,630
to the gutters.

99
00:05:14,630 --> 00:05:18,280
It's perpendicular to the median
line of the street.

100
00:05:18,280 --> 00:05:20,540
All right, it's drawn in such
a way that that's true.

101
00:05:20,540 --> 00:05:23,984
We don't know anything about
its length, yet.

102
00:05:23,984 --> 00:05:29,920
Then, we also have some unknown,
say, right here.

103
00:05:29,920 --> 00:05:35,325
And we have a vector that
points to it like so.

104
00:05:35,325 --> 00:05:39,310
So now, what we're really
interested in is whether or

105
00:05:39,310 --> 00:05:42,920
not that unknown is on the right
side of the street or on

106
00:05:42,920 --> 00:05:45,062
the left side of the street.

107
00:05:45,062 --> 00:05:47,909
So what we want to do is
project that vector,

108
00:05:47,909 --> 00:05:51,990
u, down on to one that's
perpendicular to the street.

109
00:05:51,990 --> 00:05:55,205
Because then, we'll have the
distance in this direction or

110
00:05:55,205 --> 00:05:58,490
a number that's proportional
to this in this direction.

111
00:05:58,490 --> 00:06:02,670
And the further out we go, the
closer we'll get to being on

112
00:06:02,670 --> 00:06:05,360
the right side of the street,
where the right side of the

113
00:06:05,360 --> 00:06:08,065
street is not the correct side
but actually the right side of

114
00:06:08,065 --> 00:06:08,985
the street.

115
00:06:08,985 --> 00:06:14,280
So what we can do is we can say,
let's take w and dot it

116
00:06:14,280 --> 00:06:19,930
with u and measure whether or
not that number is equal to or

117
00:06:19,930 --> 00:06:22,646
greater than some constant, c.

118
00:06:22,646 --> 00:06:25,880
So remember that the dot
product is taking the

119
00:06:25,880 --> 00:06:27,896
projection onto w.

120
00:06:27,896 --> 00:06:32,150
And the bigger that projection
is, the further out along this

121
00:06:32,150 --> 00:06:34,255
line the projection will lie.

122
00:06:34,255 --> 00:06:37,490
And eventually it will be so
big that the projection

123
00:06:37,490 --> 00:06:40,440
crosses the median line of the
street, and we'll say it must

124
00:06:40,440 --> 00:06:41,690
be a positive sample.

125
00:06:41,690 --> 00:06:45,707


126
00:06:45,707 --> 00:06:50,880
Or we could say, without loss
of generality that the dot

127
00:06:50,880 --> 00:06:56,360
product plus some constant, b,
is equal to or greater than 0.

128
00:06:56,360 --> 00:07:03,050
If that's true, then it's
a positive sample.

129
00:07:03,050 --> 00:07:04,300
So that's our decision rule.
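
The decision rule just stated can be sketched in a few lines of Python. This is only an illustration: the function names and the particular values of w and b are assumptions made up for the example, not values from the lecture.

```python
# Minimal sketch of the decision rule: call u positive when w . u + b >= 0.
# The vectors and the constant b below are made-up illustrative values.

def dot(a, c):
    """Dot product of two equal-length sequences of numbers."""
    return sum(ai * ci for ai, ci in zip(a, c))

def is_positive(w, u, b):
    """Return True when the unknown u lands on the positive side of the boundary."""
    return dot(w, u) + b >= 0

w = [1.0, 1.0]   # assumed vector perpendicular to the median line of the street
b = -3.0         # assumed constant, so the median line is x1 + x2 = 3

print(is_positive(w, [3.0, 2.0], b))  # 3 + 2 - 3 = 2, which is >= 0
print(is_positive(w, [0.5, 1.0], b))  # 0.5 + 1 - 3 = -1.5, which is < 0
```

Note that the rule only says which side of the median line u is on; nothing here pins down w or b yet, which is the gap the following constraints are meant to close.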

130
00:07:04,300 --> 00:07:11,522


131
00:07:11,522 --> 00:07:17,300
And this is the first of several
elements that we're

132
00:07:17,300 --> 00:07:20,960
going to have to line up to
understand this idea called

133
00:07:20,960 --> 00:07:23,340
support vector machines.

134
00:07:23,340 --> 00:07:24,730
So that's the decision rule.

135
00:07:24,730 --> 00:07:29,460
And the trouble is we don't know
what constant to use, and

136
00:07:29,460 --> 00:07:32,450
we don't know which
w to use either.

137
00:07:32,450 --> 00:07:35,390
We know that w has to be
perpendicular to the median

138
00:07:35,390 --> 00:07:37,476
line of the street.

139
00:07:37,476 --> 00:07:39,880
But there are lots of w's that
are perpendicular to the

140
00:07:39,880 --> 00:07:41,070
median line of the street,
because it

141
00:07:41,070 --> 00:07:42,740
could be of any length.

142
00:07:42,740 --> 00:07:45,750
So we don't have enough
constraint here to fix a

143
00:07:45,750 --> 00:07:49,532
particular b or a
particular w.

144
00:07:49,532 --> 00:07:52,395
Are you with me so far?

145
00:07:52,395 --> 00:07:55,176
All right.

146
00:07:55,176 --> 00:07:57,990
And this, by the way, we get
just by saying that c

147
00:07:57,990 --> 00:07:59,240
equals minus b.

148
00:07:59,240 --> 00:08:02,800


149
00:08:02,800 --> 00:08:05,790
What we're going to do next is
we're going to lay on some

150
00:08:05,790 --> 00:08:08,960
additional constraints with
a view toward putting enough

151
00:08:08,960 --> 00:08:13,330
constraint on the situation that
we can actually calculate

152
00:08:13,330 --> 00:08:16,015
a b and a w.

153
00:08:16,015 --> 00:08:21,290
So what we're going to say is
this, that if we look at this

154
00:08:21,290 --> 00:08:24,680
quantity that we're checking out
to be greater than or less

155
00:08:24,680 --> 00:08:28,040
than 0 to make our decision,
then, what we're going to do

156
00:08:28,040 --> 00:08:32,510
is we're going to say that if we
take that vector w, and we

157
00:08:32,510 --> 00:08:37,789
take the dot product of that
with some x plus, some

158
00:08:37,789 --> 00:08:38,929
positive sample, now.

159
00:08:38,929 --> 00:08:39,760
This is not an unknown.

160
00:08:39,760 --> 00:08:42,272
This is a positive sample.

161
00:08:42,272 --> 00:08:46,500
If we take the dot product of
those two vectors, and we add

162
00:08:46,500 --> 00:08:50,050
b just like in our decision
rule, we're going to want that

163
00:08:50,050 --> 00:08:51,370
to be equal to or
greater than 1.

164
00:08:51,370 --> 00:08:54,220


165
00:08:54,220 --> 00:08:59,080
So in other words, you can be
an unknown anywhere in this

166
00:08:59,080 --> 00:09:02,140
street and be just a little bit
greater or just a little

167
00:09:02,140 --> 00:09:03,610
bit less than 0.

168
00:09:03,610 --> 00:09:06,120
But if you're a positive sample,
we're going to insist

169
00:09:06,120 --> 00:09:08,550
that this decision function
gives the

170
00:09:08,550 --> 00:09:11,476
value of one or greater.

171
00:09:11,476 --> 00:09:21,030
Likewise, if w dotted with
some negative sample is

172
00:09:21,030 --> 00:09:24,380
provided to us, then we're going
to say that has to be

173
00:09:24,380 --> 00:09:25,800
equal to or less than minus 1.

174
00:09:25,800 --> 00:09:28,690


175
00:09:28,690 --> 00:09:29,866
All right.

176
00:09:29,866 --> 00:09:33,790
So if you're a minus sample,
like one of these two guys or

177
00:09:33,790 --> 00:09:38,330
any minus sample that may lie
down here, this function that

178
00:09:38,330 --> 00:09:42,506
gives us the decision rule must
return minus 1 or less.

179
00:09:42,506 --> 00:09:45,020
So there's a separation
of distance here.

180
00:09:45,020 --> 00:09:46,930
Minus 1 to plus 1 for
all of the samples.

181
00:09:46,930 --> 00:09:50,717


182
00:09:50,717 --> 00:09:52,842
So that's cool.

183
00:09:52,842 --> 00:09:58,290
But we're not quite done,
because carrying around two

184
00:09:58,290 --> 00:10:01,534
equations like this,
it's a pain.

185
00:10:01,534 --> 00:10:04,760
So what we're going to do is
we're going to introduce

186
00:10:04,760 --> 00:10:08,190
another variable to make
life a little easier.

187
00:10:08,190 --> 00:10:11,502


188
00:10:11,502 --> 00:10:15,210
Like many things that we do
when we develop this kind

189
00:10:15,210 --> 00:10:19,120
of stuff, introducing this
variable is not something that

190
00:10:19,120 --> 00:10:20,370
God says has to be done.

191
00:10:20,370 --> 00:10:24,380


192
00:10:24,380 --> 00:10:25,310
What is it?

193
00:10:25,310 --> 00:10:28,930
We introduced this additional
stuff to do what?

194
00:10:28,930 --> 00:10:34,140
To make the mathematics more
convenient, so mathematical

195
00:10:34,140 --> 00:10:35,822
convenience.

196
00:10:35,822 --> 00:10:37,730
So what we're going to do is
we're going to introduce a

197
00:10:37,730 --> 00:10:53,600
variable, y sub i, such that y
sub i is equal to plus 1 for

198
00:10:53,600 --> 00:11:10,460
plus samples and minus 1
for negative samples.

199
00:11:10,460 --> 00:11:11,685
All right.

200
00:11:11,685 --> 00:11:14,190
So for each sample, we're going
to have a value for this

201
00:11:14,190 --> 00:11:16,680
new quantity we've
introduced, y.

202
00:11:16,680 --> 00:11:19,910
And the value of y is going to
be determined by whether it's

203
00:11:19,910 --> 00:11:22,370
a positive sample or
negative sample.

204
00:11:22,370 --> 00:11:26,600
If it's a positive sample it's
got to be plus 1 for this

205
00:11:26,600 --> 00:11:29,280
situation up here, and it's
going to be minus 1 for this

206
00:11:29,280 --> 00:11:31,235
situation down here.

207
00:11:31,235 --> 00:11:34,480
So what we're going to do with
this first equation is we're

208
00:11:34,480 --> 00:11:41,605
going to multiply it by y sub
i, and that is now x sub i,

209
00:11:41,605 --> 00:11:46,430
plus b is equal to or
greater than 1.

210
00:11:46,430 --> 00:11:47,740
And then, you know what
we're going to do?

211
00:11:47,740 --> 00:11:53,030
We're going to multiply the left
side of this equation by

212
00:11:53,030 --> 00:11:54,770
y sub i, as well.

213
00:11:54,770 --> 00:12:03,172
So the second equation becomes
y sub i times x sub i plus b.

214
00:12:03,172 --> 00:12:05,876
And now, what does that
do over here?

215
00:12:05,876 --> 00:12:09,480
We multiplied this guy
times minus 1.

216
00:12:09,480 --> 00:12:12,750
So it used to be the case that
that was less than minus 1.

217
00:12:12,750 --> 00:12:14,900
So if we multiply it by minus
1, then it has to be greater

218
00:12:14,900 --> 00:12:16,150
than plus 1.

219
00:12:16,150 --> 00:12:18,990


220
00:12:18,990 --> 00:12:23,220
The two equations are the same,
because of

221
00:12:23,220 --> 00:12:26,580
this little mathematical
convenience.

222
00:12:26,580 --> 00:12:35,430
So now, we can say that y sub
i times x sub i plus b.

223
00:12:35,430 --> 00:12:37,986


224
00:12:37,986 --> 00:12:41,826
Well, what we're going to do--

225
00:12:41,826 --> 00:12:42,675
Brett?

226
00:12:42,675 --> 00:12:44,255
STUDENT: What happened
to the w?

227
00:12:44,255 --> 00:12:45,450
PATRICK WINSTON: Oh, did
I leave out a w?

228
00:12:45,450 --> 00:12:46,050
I'm sorry.

229
00:12:46,050 --> 00:12:48,612
Thank you.

230
00:12:48,612 --> 00:12:51,561
Yeah, I wouldn't have gotten
very far with that.

231
00:12:51,561 --> 00:12:54,210
So that's dot it with
w, dot it with w.

232
00:12:54,210 --> 00:12:55,605
Thank you, Brett.

233
00:12:55,605 --> 00:12:56,710
Those are all vectors.

234
00:12:56,710 --> 00:13:00,010
I'll pretty soon forget to put
the little vector marks on

235
00:13:00,010 --> 00:13:01,090
there, but you know
what I mean.

236
00:13:01,090 --> 00:13:05,256
So that's w plus b.

237
00:13:05,256 --> 00:13:09,660
And now, let me bring that 1
over to the left side, and

238
00:13:09,660 --> 00:13:11,010
that's equal to or
greater than 0.

239
00:13:11,010 --> 00:13:13,535


240
00:13:13,535 --> 00:13:14,730
All right.

241
00:13:14,730 --> 00:13:17,440
With Brett's correction, I
think everything's OK.

242
00:13:17,440 --> 00:13:21,010
But we're going to take one more
step, and we're going to

243
00:13:21,010 --> 00:13:31,270
say that y sub i times x sub
i times w plus b minus 1.

244
00:13:31,270 --> 00:13:33,885


245
00:13:33,885 --> 00:13:35,760
It's always got to be equal
to or greater than 0.

246
00:13:35,760 --> 00:13:42,492
But what I'm going to
say is it's equal to 0 for

247
00:13:42,492 --> 00:13:44,550
x sub i in a gutter.

248
00:13:44,550 --> 00:13:49,092


249
00:13:49,092 --> 00:13:51,140
So it's always going to be equal to or
greater than 0, but we're

250
00:13:51,140 --> 00:13:53,540
going to add the additional
constraint that it's going to

251
00:13:53,540 --> 00:13:58,300
be exactly 0 for all the samples
that end up in the

252
00:13:58,300 --> 00:14:00,190
gutters here of the street.

253
00:14:00,190 --> 00:14:03,010
So the value of that expression
is going to be

254
00:14:03,010 --> 00:14:08,390
exactly 0 for that sample, 0
for this sample and this

255
00:14:08,390 --> 00:14:10,460
sample, not 0 for that sample.

256
00:14:10,460 --> 00:14:12,180
It's got to be greater than 1.

257
00:14:12,180 --> 00:14:13,846
All right?

258
00:14:13,846 --> 00:14:16,760
So that's step number two.
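
As a rough check of the combined constraint, the sketch below evaluates y sub i times (w dot x sub i plus b) minus 1 for a handful of samples. The data set, w, and b are hypothetical values picked so that two samples land exactly in the gutters; they are not from the lecture.

```python
# Sketch: evaluate y_i * (w . x_i + b) - 1 for each sample.
# It should be >= 0 everywhere and exactly 0 for samples in a gutter.
# All numbers here are made-up illustrative values.

def dot(a, c):
    """Dot product of two equal-length sequences of numbers."""
    return sum(ai * ci for ai, ci in zip(a, c))

def constraint_value(w, b, x, y):
    """Value of y * (w . x + b) - 1 for one labeled sample."""
    return y * (dot(w, x) + b) - 1

w = [1.0, 1.0]
b = -3.0

# (sample, label) pairs: label is +1 for plus samples, -1 for minus samples
samples = [
    ([2.0, 2.0], +1),  # on the positive gutter
    ([3.0, 3.0], +1),  # beyond the positive gutter
    ([1.0, 1.0], -1),  # on the negative gutter
    ([0.0, 0.0], -1),  # beyond the negative gutter
]

values = [constraint_value(w, b, x, y) for x, y in samples]
all_satisfied = all(v >= 0 for v in values)
in_gutter = [x for (x, y), v in zip(samples, values) if abs(v) < 1e-9]

print(values)         # one constraint value per sample
print(all_satisfied)  # whether every sample respects the constraint
print(in_gutter)      # the samples whose constraint value is 0
```

Multiplying by the label is what lets one inequality stand in for two: for a minus sample the label flips the sign, so "less than or equal to minus 1" turns into "greater than or equal to plus 1".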

259
00:14:16,760 --> 00:14:25,319


260
00:14:25,319 --> 00:14:27,140
And this is step number one.

261
00:14:27,140 --> 00:14:31,454


262
00:14:31,454 --> 00:14:31,950
OK.

263
00:14:31,950 --> 00:14:34,340
So now, we've just got some
expressions to talk about,

264
00:14:34,340 --> 00:14:36,415
some constraints.

265
00:14:36,415 --> 00:14:37,870
Now, what are we trying
to do here?

266
00:14:37,870 --> 00:14:39,922
I forgot.

267
00:14:39,922 --> 00:14:41,320
Oh, I remember now.

268
00:14:41,320 --> 00:14:45,500
We're trying to figure out how
to arrange for the line to be

269
00:14:45,500 --> 00:14:48,790
such that the street separating
the pluses from the minuses is as

270
00:14:48,790 --> 00:14:51,121
wide as possible.

271
00:14:51,121 --> 00:14:54,300
So maybe we better figure out
how we can express the

272
00:14:54,300 --> 00:14:56,130
distance between the
two gutters.

273
00:14:56,130 --> 00:15:03,645


274
00:15:03,645 --> 00:15:06,822
Let's just repeat our drawing.

275
00:15:06,822 --> 00:15:12,030
We've got some minuses here, got
pluses out here, and we've

276
00:15:12,030 --> 00:15:17,021
got gutters that are
going down here.

277
00:15:17,021 --> 00:15:22,290
And now, we've got a vector here
to a minus, and we've got

278
00:15:22,290 --> 00:15:27,091
a vector here to a plus.

279
00:15:27,091 --> 00:15:33,950
So we'll call that x plus
and this x minus.

280
00:15:33,950 --> 00:15:36,730
So what's the width
of the street?

281
00:15:36,730 --> 00:15:37,600
I don't know, yet.

282
00:15:37,600 --> 00:15:40,360
But what we can do is we can
take the difference of those

283
00:15:40,360 --> 00:15:44,120
two vectors, and that will
be a vector that

284
00:15:44,120 --> 00:15:46,346
looks like this, right?

285
00:15:46,346 --> 00:15:52,016
So that's x plus
minus x minus.

286
00:15:52,016 --> 00:15:56,280
So now, if I only had a unit
normal that's normal to the

287
00:15:56,280 --> 00:16:00,320
median line of the street, if
it's a unit normal, then I

288
00:16:00,320 --> 00:16:02,120
could just take the dot product
of that unit normal

289
00:16:02,120 --> 00:16:03,975
and this difference vector, and
that would be the width of

290
00:16:03,975 --> 00:16:05,980
the street, right?

291
00:16:05,980 --> 00:16:13,090
So in other words, if I had a
unit vector in that direction,

292
00:16:13,090 --> 00:16:15,530
then I could just dot the two
together, and that would be

293
00:16:15,530 --> 00:16:17,896
the width of the street.

294
00:16:17,896 --> 00:16:21,550
So let me write that down
before I forget.

295
00:16:21,550 --> 00:16:31,625
So the width is equal to
x plus minus x minus.

296
00:16:31,625 --> 00:16:34,396
OK.

297
00:16:34,396 --> 00:16:35,580
That's the difference vector.

298
00:16:35,580 --> 00:16:37,510
And now, I've got to multiply
it by a unit vector.

299
00:16:37,510 --> 00:16:38,180
But wait a minute.

300
00:16:38,180 --> 00:16:41,590
I said that that w is
a normal, right?

301
00:16:41,590 --> 00:16:44,032
The w is a normal.

302
00:16:44,032 --> 00:16:50,018
So what I can do is I can
multiply this times w, and

303
00:16:50,018 --> 00:16:54,156
then, we'll divide by the
magnitude of w, and that will

304
00:16:54,156 --> 00:16:56,591
make it a unit vector.

305
00:16:56,591 --> 00:17:05,650
So that dot product, not a
product, that dot product is,

306
00:17:05,650 --> 00:17:10,329
in fact, a scalar, and it's
the width of the street.

307
00:17:10,329 --> 00:17:14,730
It doesn't do us much good,
because it doesn't look like

308
00:17:14,730 --> 00:17:17,053
we get much out of it.

309
00:17:17,053 --> 00:17:18,220
Oh, but I don't know.

310
00:17:18,220 --> 00:17:21,371
Let's see, what can
we get out of it?

311
00:17:21,371 --> 00:17:25,954
Oh gee, we've got this equation
over here, this

312
00:17:25,954 --> 00:17:28,594
equation that constrains
the samples

313
00:17:28,594 --> 00:17:31,310
that lie in the gutter.

314
00:17:31,310 --> 00:17:35,610
So if we have a positive sample,
for example, then this

315
00:17:35,610 --> 00:17:38,530
is plus 1, and we have
this equation.

316
00:17:38,530 --> 00:17:41,150


317
00:17:41,150 --> 00:17:53,900
So it says that x plus times w
is equal to, oh, 1 minus b.

318
00:17:53,900 --> 00:17:58,492


319
00:17:58,492 --> 00:18:02,210
See, I'm just taking this part
here, this vector here, and

320
00:18:02,210 --> 00:18:04,880
I'm dotting it with x plus.

321
00:18:04,880 --> 00:18:08,650
So that's this piece
right here.

322
00:18:08,650 --> 00:18:11,230
y is 1 for this kind
of sample.

323
00:18:11,230 --> 00:18:13,600
So I'll just take the 1 and the
b back over to the other

324
00:18:13,600 --> 00:18:16,212
side, and I've got 1 minus b.

325
00:18:16,212 --> 00:18:18,592
OK?

326
00:18:18,592 --> 00:18:22,241
Well, we can do the same
trick with x minus.

327
00:18:22,241 --> 00:18:24,806
If we've got a negative sample,

328
00:18:24,806 --> 00:18:28,572
then y sub i is negative.

329
00:18:28,572 --> 00:18:34,296
That gives us negative
w dotted with x sub i.

330
00:18:34,296 --> 00:18:37,190
But now, we take this stuff back
over to the right side,

331
00:18:37,190 --> 00:18:40,540
and we get 1 plus b.

332
00:18:40,540 --> 00:18:45,252


333
00:18:45,252 --> 00:18:50,200
So that all licenses us to rewrite
this thing as 2 over

334
00:18:50,200 --> 00:18:52,646
the magnitude of w.

335
00:18:52,646 --> 00:18:54,210
How did I get there?

336
00:18:54,210 --> 00:18:59,270
Well, I decided I was going to
enforce this constraint.

337
00:18:59,270 --> 00:19:03,540
I noted that the width of the
street has got to be this

338
00:19:03,540 --> 00:19:06,105
difference vector times
a unit vector.

339
00:19:06,105 --> 00:19:09,400
Then, I used the constraint to
plug back some values here.

340
00:19:09,400 --> 00:19:12,480
And I discovered to my delight
and amazement that the width

341
00:19:12,480 --> 00:19:15,350
of the street is 2 over
the magnitude of w.
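
The steps spoken above can be collected into one short derivation. The notation is an assumption about what is on the board: x plus and x minus are samples lying in the positive and negative gutters, and the double bars denote the magnitude of w.

```latex
\begin{align*}
\text{width} &= (\vec{x}_+ - \vec{x}_-) \cdot \frac{\vec{w}}{\lVert \vec{w} \rVert} \\
\vec{w} \cdot \vec{x}_+ &= 1 - b
  && \text{(gutter constraint with } y_i = +1\text{)} \\
-\vec{w} \cdot \vec{x}_- &= 1 + b
  && \text{(gutter constraint with } y_i = -1\text{)} \\
\text{width} &= \frac{(1 - b) + (1 + b)}{\lVert \vec{w} \rVert}
  = \frac{2}{\lVert \vec{w} \rVert}
\end{align*}
```

The two b terms cancel, which is why the width depends only on the magnitude of w.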

342
00:19:15,350 --> 00:19:18,340


343
00:19:18,340 --> 00:19:20,388
Yes, Brett?

344
00:19:20,388 --> 00:19:23,881
STUDENT: So your first x
plus is 1 minus b, and x

345
00:19:23,881 --> 00:19:25,378
minus is 1 plus b.

346
00:19:25,378 --> 00:19:25,877
PATRICK WINSTON: Yeah.

347
00:19:25,877 --> 00:19:26,875
STUDENT: So you're
subtracting it?

348
00:19:26,875 --> 00:19:27,750
PATRICK WINSTON: Let's see.

349
00:19:27,750 --> 00:19:31,855
If I've got a minus here, then
that makes that minus, and

350
00:19:31,855 --> 00:19:33,810
then, the b is minus, and when I
take the b over to the other

351
00:19:33,810 --> 00:19:35,579
side it becomes plus.

352
00:19:35,579 --> 00:19:38,573
STUDENT: Yeah, so if you
subtract the left with the

353
00:19:38,573 --> 00:19:41,068
right [INAUDIBLE].

354
00:19:41,068 --> 00:19:41,670
PATRICK WINSTON: No.

355
00:19:41,670 --> 00:19:42,320
No, sorry.

356
00:19:42,320 --> 00:19:46,981
This expression here
is 1 plus b.

357
00:19:46,981 --> 00:19:48,870
Trust me it works.

358
00:19:48,870 --> 00:19:51,370
I haven't got my legs all
tangled up like last Friday,

359
00:19:51,370 --> 00:19:53,786
well, not yet, anyway.

360
00:19:53,786 --> 00:19:55,340
It's possible.

361
00:19:55,340 --> 00:19:58,958
There's going to be a lot of
algebra here eventually.

362
00:19:58,958 --> 00:20:04,995
So this quantity here, this
is miracle number three.

363
00:20:04,995 --> 00:20:09,731
This quantity here is the
width of the street.
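
A quick numeric sanity check of that claim, under assumed values: the sketch below picks a hypothetical w and two hypothetical gutter samples, then compares the projected difference vector with 2 over the magnitude of w.

```python
# Sketch: compare (x_plus - x_minus) . w / |w| with 2 / |w|.
# w, x_plus, and x_minus are made-up values; with b = -3 the two samples
# satisfy w . x + b = +1 and w . x + b = -1, so they lie in the gutters.
import math

def dot(a, c):
    """Dot product of two equal-length sequences of numbers."""
    return sum(ai * ci for ai, ci in zip(a, c))

w = [1.0, 1.0]
x_plus = [2.0, 2.0]
x_minus = [1.0, 1.0]

norm_w = math.sqrt(dot(w, w))
diff = [p - m for p, m in zip(x_plus, x_minus)]

width_from_samples = dot(diff, w) / norm_w  # difference vector dotted with unit normal
width_from_formula = 2.0 / norm_w           # the closed form from the derivation

print(width_from_samples, width_from_formula)
```

Any other pair of gutter samples for the same w and b should give the same width, since only the component along w matters.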

364
00:20:09,731 --> 00:20:13,570
And what we're trying to
do is we're trying to

365
00:20:13,570 --> 00:20:17,158
maximize that, right?

366
00:20:17,158 --> 00:20:27,170
So we want to maximize 2 over
the magnitude of w if we're to

367
00:20:27,170 --> 00:20:29,300
get the widest street under
the constraints that we've

368
00:20:29,300 --> 00:20:32,210
decided that we're going
to work with.

369
00:20:32,210 --> 00:20:33,050
All right.

370
00:20:33,050 --> 00:20:46,281
So that means that it's OK to
maximize 1 over the magnitude of w, instead.

371
00:20:46,281 --> 00:20:48,250
We just drop the constant.

372
00:20:48,250 --> 00:20:53,550
And that means that it's
OK to minimize the

373
00:20:53,550 --> 00:20:56,150
magnitude of w, right?

374
00:20:56,150 --> 00:20:59,572


375
00:20:59,572 --> 00:21:08,710
And that means that it's OK
to minimize 1/2 times the

376
00:21:08,710 --> 00:21:12,070
magnitude of w squared.

377
00:21:12,070 --> 00:21:13,675
Right, Brett?

378
00:21:13,675 --> 00:21:16,075
Why did I do that?

379
00:21:16,075 --> 00:21:19,010
Why did I multiply by
1/2 and square it?

380
00:21:19,010 --> 00:21:19,970
STUDENT: Because it's
mathematically convenient.

381
00:21:19,970 --> 00:21:20,930
PATRICK WINSTON: It's
mathematically convenient.

382
00:21:20,930 --> 00:21:22,850
Thank you.

383
00:21:22,850 --> 00:21:27,840
So this is point number three
in the development.
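
Written out, the chain of equivalent goals and the constraint it is subject to look roughly like this. The notation is an assumption, and the remark about the gradient is a standard calculus fact offered as one plausible reading of "mathematically convenient" rather than something stated at this point in the lecture.

```latex
\begin{gather*}
\max \frac{2}{\lVert \vec{w} \rVert}
\;\Longleftrightarrow\; \max \frac{1}{\lVert \vec{w} \rVert}
\;\Longleftrightarrow\; \min \lVert \vec{w} \rVert
\;\Longleftrightarrow\; \min \tfrac{1}{2} \lVert \vec{w} \rVert^2 \\
\text{subject to}\quad y_i (\vec{w} \cdot \vec{x}_i + b) - 1 \ge 0
\quad \text{for every sample } i \\
\nabla_{\vec{w}} \left( \tfrac{1}{2} \lVert \vec{w} \rVert^2 \right) = \vec{w}
\end{gather*}
```

Squaring removes the square root hidden in the magnitude, and the factor of 1/2 cancels the 2 that differentiation brings down.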

384
00:21:27,840 --> 00:21:28,950
So where do we go?

385
00:21:28,950 --> 00:21:31,170
We decided that was going
to be our decision rule.

386
00:21:31,170 --> 00:21:33,530
We're going to see which side
of the line we're on.

387
00:21:33,530 --> 00:21:36,420
We decided to constrain the
situation, so the value of the

388
00:21:36,420 --> 00:21:40,750
decision rule is plus 1 in the
gutters for the positive

389
00:21:40,750 --> 00:21:42,820
samples and minus 1
in the gutters for

390
00:21:42,820 --> 00:21:44,070
the negative samples.

391
00:21:44,070 --> 00:21:47,470
And then, we discovered that
maximizing the width of the

392
00:21:47,470 --> 00:21:51,090
street led us to an expression
like that,

393
00:21:51,090 --> 00:21:52,340
which we wish to maximize.

394
00:21:52,340 --> 00:21:57,425


395
00:21:57,425 --> 00:21:58,350
Should we take a break?

396
00:21:58,350 --> 00:21:59,460
Should we get coffee?

397
00:21:59,460 --> 00:22:02,365
Too bad, we can't do that in
this kind of situation.

398
00:22:02,365 --> 00:22:04,400
But we would if we could.

399
00:22:04,400 --> 00:22:07,090
And I'm sure when Vapnik
got to this point, he

400
00:22:07,090 --> 00:22:09,826
went out for coffee.

401
00:22:09,826 --> 00:22:13,820
So now, we back up, and we say,
well, let's let these

402
00:22:13,820 --> 00:22:17,252
expressions start developing
into a song.

403
00:22:17,252 --> 00:22:21,030
Not like that, that's vapid,
speaking of Vapnik.

404
00:22:21,030 --> 00:22:29,760


405
00:22:29,760 --> 00:22:31,970
What song is it going to sing?

406
00:22:31,970 --> 00:22:35,680
We've got an expression here
that we'd like to find the

407
00:22:35,680 --> 00:22:38,236
minimum of, the extremum of.

408
00:22:38,236 --> 00:22:41,790
And we've got some constraints
here that we

409
00:22:41,790 --> 00:22:44,040
would like to honor.

410
00:22:44,040 --> 00:22:45,290
What are we going to do?

411
00:22:45,290 --> 00:22:47,600


412
00:22:47,600 --> 00:22:49,300
Let me put what we're going
to do to you in

413
00:22:49,300 --> 00:22:52,385
the form of a puzzle.

414
00:22:52,385 --> 00:22:58,900
Has it got something to
do with Legendre?

415
00:22:58,900 --> 00:23:04,270
Has it got something
to do with Laplace?

416
00:23:04,270 --> 00:23:07,375
Or does it have something
to do with Lagrange?

417
00:23:07,375 --> 00:23:09,400
She says Lagrange.

418
00:23:09,400 --> 00:23:12,850
Actually, all three were said
to be on Fourier's Doctoral

419
00:23:12,850 --> 00:23:15,590
Defense Committee-- must have
been quite an exam.

420
00:23:15,590 --> 00:23:18,960
But we want to talk about
Lagrange, because we've got a

421
00:23:18,960 --> 00:23:20,605
situation here.

422
00:23:20,605 --> 00:23:22,060
Is this 18.01?

423
00:23:22,060 --> 00:23:22,840
18.02?

424
00:23:22,840 --> 00:23:25,000
18.02.

425
00:23:25,000 --> 00:23:28,462
We learned in 18.02 that if we're
going to find the extremum of

426
00:23:28,462 --> 00:23:33,840
a function with constraints,
then we're going to have to

427
00:23:33,840 --> 00:23:35,922
use Lagrange multipliers.

428
00:23:35,922 --> 00:23:39,820
That would give us a new
expression, which we can

429
00:23:39,820 --> 00:23:43,350
maximize or minimize without
thinking about

430
00:23:43,350 --> 00:23:45,090
the constraints anymore.

431
00:23:45,090 --> 00:23:47,755
That's how Lagrange
multipliers work.

432
00:23:47,755 --> 00:23:52,440
So this brings us to miracle
number four, developmental

433
00:23:52,440 --> 00:23:53,770
piece number four.

434
00:23:53,770 --> 00:23:56,420
And it works like this.

435
00:23:56,420 --> 00:23:58,210
We're going to say that L--

436
00:23:58,210 --> 00:24:00,720
the thing we're going to try
to maximize in order to

437
00:24:00,720 --> 00:24:02,660
maximize the width
of the street--

438
00:24:02,660 --> 00:24:08,235
is equal to 1/2 times the
magnitude of that vector, w,

439
00:24:08,235 --> 00:24:12,476
squared minus.

440
00:24:12,476 --> 00:24:16,230
And now, we've got to have
a summation over all the

441
00:24:16,230 --> 00:24:17,480
constraints.

442
00:24:17,480 --> 00:24:18,880


443
00:24:18,880 --> 00:24:21,460
And each of those constraints is
going to have a multiplier,

444
00:24:21,460 --> 00:24:23,412
alpha sub i.

445
00:24:23,412 --> 00:24:26,106
And then, we write down
the constraint.

446
00:24:26,106 --> 00:24:27,575
And when we write down
a constraint,

447
00:24:27,575 --> 00:24:29,100
there it is up there.

448
00:24:29,100 --> 00:24:31,690
And I've got to be hyper
careful here, because,

449
00:24:31,690 --> 00:24:33,830
otherwise, I'll get lost
in the algebra.

450
00:24:33,830 --> 00:24:42,520
So the constraint is y sub i
times vector, w, dotted with

451
00:24:42,520 --> 00:24:49,030
vector x sub i plus b, and
now, I've got a closing

452
00:24:49,030 --> 00:24:52,315
parenthesis, a minus 1.

453
00:24:52,315 --> 00:24:56,690
That's the end of my constraint,
like so.
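
Written out, the expression dictated above appears to be the following. This is a reconstruction from the spoken description, using the lecture's own symbols (w, b, alpha sub i, y sub i, x sub i):

```latex
L = \frac{1}{2}\lVert \vec{w} \rVert^{2}
    - \sum_{i} \alpha_i \left[\, y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \,\right]
```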

454
00:24:56,690 --> 00:25:00,330


455
00:25:00,330 --> 00:25:03,380
I sure hope I've got that right,
because I'll be in deep

456
00:25:03,380 --> 00:25:04,730
trouble if that's wrong.

457
00:25:04,730 --> 00:25:05,940
Anybody see any bugs in that?

458
00:25:05,940 --> 00:25:08,250
That looks right, doesn't it?

459
00:25:08,250 --> 00:25:10,310
We've got the original thing
we're trying to work with.

460
00:25:10,310 --> 00:25:14,425
Now, we've got Lagrange
multipliers all multiplied.

461
00:25:14,425 --> 00:25:16,300
It's back to that constraint
up there, where each

462
00:25:16,300 --> 00:25:20,512
constraint is constrained
to be 0.

463
00:25:20,512 --> 00:25:24,770
Well, there's a little bit of
mathematical sleight of hand

464
00:25:24,770 --> 00:25:27,810
here, because in the end, the
ones that are going to be 0,

465
00:25:27,810 --> 00:25:31,210
the Lagrange multipliers here.

466
00:25:31,210 --> 00:25:33,795
The ones that are going to be
non 0 are going to be the ones

467
00:25:33,795 --> 00:25:36,120
connected with vectors that
lie in the gutter.

468
00:25:36,120 --> 00:25:39,848
The rest are going to be 0.

469
00:25:39,848 --> 00:25:43,380
But in any event, we can pretend
that this is what

470
00:25:43,380 --> 00:25:44,630
we're doing.

471
00:25:44,630 --> 00:25:46,550


472
00:25:46,550 --> 00:25:48,350
I don't care whether it's
a maximum or minimum.

473
00:25:48,350 --> 00:25:49,550
I've lost track.

474
00:25:49,550 --> 00:25:51,290
But what we're going to do is
we're going to try to find an

475
00:25:51,290 --> 00:25:52,360
extremum of that.

476
00:25:52,360 --> 00:25:53,730
So what do we do?

477
00:25:53,730 --> 00:25:58,330
What does 18.01 teach us about?

478
00:25:58,330 --> 00:25:59,465
Finding the maximum--

479
00:25:59,465 --> 00:26:04,760
well, we've got to find the
derivatives and set them to 0.

480
00:26:04,760 --> 00:26:06,500
And then, after we've done that,
a little bit of that

481
00:26:06,500 --> 00:26:08,760
manipulation, we're going
to see a wonderful

482
00:26:08,760 --> 00:26:10,850
song start to emerge.

483
00:26:10,850 --> 00:26:12,890
So let's see if we can do it.

484
00:26:12,890 --> 00:26:17,160
Let's take the partial of L, the
Lagrangian, with respect

485
00:26:17,160 --> 00:26:19,190
to the vector, w.

486
00:26:19,190 --> 00:26:21,430
Oh my God, how do you
differentiate with

487
00:26:21,430 --> 00:26:22,680
respect to a vector?

488
00:26:22,680 --> 00:26:25,255


489
00:26:25,255 --> 00:26:28,050
It turns out that it has a form
that looks exactly like

490
00:26:28,050 --> 00:26:30,450
differentiating with respect
to a scalar.

491
00:26:30,450 --> 00:26:32,580
And the way you prove that to
yourself is you just expand

492
00:26:32,580 --> 00:26:35,530
everything in terms of all of
the vector's components.

493
00:26:35,530 --> 00:26:37,660
You differentiate those with
respect to what you're

494
00:26:37,660 --> 00:26:40,140
differentiating with respect
to, and everything

495
00:26:40,140 --> 00:26:42,380
turns out the same.

496
00:26:42,380 --> 00:26:44,880
So what you get when you
differentiate this with

497
00:26:44,880 --> 00:26:52,280
respect to the vector, w, is 2
comes down, and we have just

498
00:26:52,280 --> 00:26:53,833
magnitude of w.

499
00:26:53,833 --> 00:26:56,090
Was it the magnitude of w?

500
00:26:56,090 --> 00:26:58,000
Yeah, like so.

501
00:26:58,000 --> 00:27:01,629


502
00:27:01,629 --> 00:27:02,910
Was it the magnitude of w?

503
00:27:02,910 --> 00:27:06,510
Oh, it's not the
magnitude of w.

504
00:27:06,510 --> 00:27:12,396
It's just w, like so, no
magnitude involved.

505
00:27:12,396 --> 00:27:16,480
Then, we've got a w over here,
so we've got to differentiate

506
00:27:16,480 --> 00:27:18,270
this part with respect
to w, as well.

507
00:27:18,270 --> 00:27:19,690
But that part's a lot easier,
because all we

508
00:27:19,690 --> 00:27:21,310
have there is a w.

509
00:27:21,310 --> 00:27:22,350
There's no magnitude.

510
00:27:22,350 --> 00:27:24,002
It's not raised to any power.

511
00:27:24,002 --> 00:27:26,290
So what's w multiplied by?

512
00:27:26,290 --> 00:27:31,954
Well, it's multiplied by x and
y sub i and alpha sub i.

513
00:27:31,954 --> 00:27:32,610
All right.

514
00:27:32,610 --> 00:27:36,605
So that means that this
expression, this derivative of

515
00:27:36,605 --> 00:27:41,660
the Lagrangian, with respect to
w is going to be equal to w

516
00:27:41,660 --> 00:27:51,820
minus the sum of alpha sub i,
y sub i, x sub i, and that's

517
00:27:51,820 --> 00:27:54,240
got to be set to 0.

518
00:27:54,240 --> 00:28:02,250
And that implies that w is equal
to the sum of some alpha

519
00:28:02,250 --> 00:28:06,980
i, some scalars, times this
minus 1 or plus 1 variable

520
00:28:06,980 --> 00:28:11,332
times x sub i over i.
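
As a written sketch of the step just described, differentiating the Lagrangian with respect to the vector w and setting the result to zero gives:

```latex
\frac{\partial L}{\partial \vec{w}}
  = \vec{w} - \sum_{i} \alpha_i \, y_i \, \vec{x}_i = 0
\quad\Longrightarrow\quad
\vec{w} = \sum_{i} \alpha_i \, y_i \, \vec{x}_i
```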

521
00:28:11,332 --> 00:28:14,430
And now, the math is
beginning to sing.

522
00:28:14,430 --> 00:28:19,490
Because it tells us that the
vector w is a linear sum of

523
00:28:19,490 --> 00:28:24,492
the samples, all the samples
or some of the samples.

524
00:28:24,492 --> 00:28:27,786
It didn't have to be that way.

525
00:28:27,786 --> 00:28:29,230
It could have been raised
to a power.

526
00:28:29,230 --> 00:28:31,160
It could have been
a logarithm.

527
00:28:31,160 --> 00:28:33,010
All sorts of horrible
things could have

528
00:28:33,010 --> 00:28:34,320
happened when we did this.

529
00:28:34,320 --> 00:28:39,210
But when we did this, we
discovered that w is going to

530
00:28:39,210 --> 00:28:44,620
be equal to a linear sum
of these vectors here.

531
00:28:44,620 --> 00:28:49,060
Some of the vectors in the
sample set, and I say some,

532
00:28:49,060 --> 00:28:51,260
because for some, alpha
will be 0.

533
00:28:51,260 --> 00:28:54,265


534
00:28:54,265 --> 00:28:55,515
All right.

535
00:28:55,515 --> 00:29:01,560
So this is something that we
want to take note of as

536
00:29:01,560 --> 00:29:05,402
something important.

537
00:29:05,402 --> 00:29:09,760
Now, of course, we've got to
differentiate L with respect

538
00:29:09,760 --> 00:29:12,900
to anything else that might
vary, so we've got to

539
00:29:12,900 --> 00:29:15,180
differentiate L with respect
to b, as well.

540
00:29:15,180 --> 00:29:18,436


541
00:29:18,436 --> 00:29:21,222
So what's that going
to be equal to?

542
00:29:21,222 --> 00:29:25,705
Well, there's no b in here, so
that makes no contribution.

543
00:29:25,705 --> 00:29:28,750
This part here doesn't have a
b in it, so that makes no

544
00:29:28,750 --> 00:29:29,335
contribution.

545
00:29:29,335 --> 00:29:32,270
There's no b over here, so that
makes no contribution.

546
00:29:32,270 --> 00:29:37,210
So we've got alpha i times
y sub i times b.

547
00:29:37,210 --> 00:29:39,365
That has a contribution.

548
00:29:39,365 --> 00:29:46,470
So that's going to be the sum
of alpha i times y sub i.

549
00:29:46,470 --> 00:29:48,570
And then, we're differentiating
with respect

550
00:29:48,570 --> 00:29:50,635
to b, so that disappears.

551
00:29:50,635 --> 00:29:55,440
There's a minus sign here, and
that's equal to 0, or that

552
00:29:55,440 --> 00:29:59,490
implies that the sum of the
alpha i times y sub

553
00:29:59,490 --> 00:30:03,012
i is equal to 0.
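
In written form, the derivative with respect to b described above works out as follows. Only the term alpha sub i times y sub i times b contributes:

```latex
\frac{\partial L}{\partial b}
  = - \sum_{i} \alpha_i \, y_i = 0
\quad\Longrightarrow\quad
\sum_{i} \alpha_i \, y_i = 0
```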

554
00:30:03,012 --> 00:30:05,100
Hm, that looks like that might
be helpful somewhere.

555
00:30:05,100 --> 00:30:10,460


556
00:30:10,460 --> 00:30:12,755
And now, it's time
for more coffee.

557
00:30:12,755 --> 00:30:15,520
By the way, these coffee
periods take months.

558
00:30:15,520 --> 00:30:16,905
You stare at it.

559
00:30:16,905 --> 00:30:18,980
You work on something else.

560
00:30:18,980 --> 00:30:22,000
You've got to worry
about your finals.

561
00:30:22,000 --> 00:30:24,020
And you think about
it some more.

562
00:30:24,020 --> 00:30:25,740
And eventually, you come
back from coffee

563
00:30:25,740 --> 00:30:28,930
and do the next thing.

564
00:30:28,930 --> 00:30:31,640
Oh, what is the next thing?

565
00:30:31,640 --> 00:30:34,180
Well, we've still got this
expression that we're trying

566
00:30:34,180 --> 00:30:41,020
to find the minimum for.

567
00:30:41,020 --> 00:30:43,500
And you say to yourself, this
is really a job for the

568
00:30:43,500 --> 00:30:44,480
numerical analysts.

569
00:30:44,480 --> 00:30:47,205
Those guys know about
this sort of stuff.

570
00:30:47,205 --> 00:30:49,620
Because of that little power
in there, that square.

571
00:30:49,620 --> 00:30:54,772
This is a so-called quadratic
optimization problem.

572
00:30:54,772 --> 00:30:57,480
So at this point, you would be
inclined to hand this problem

573
00:30:57,480 --> 00:30:59,290
over to a numerical analyst.

574
00:30:59,290 --> 00:31:01,410
They'll come back in a few
weeks with an algorithm.

575
00:31:01,410 --> 00:31:03,100
You implement the algorithm.

576
00:31:03,100 --> 00:31:04,120
And maybe things work.

577
00:31:04,120 --> 00:31:04,890
Maybe they don't converge.

578
00:31:04,890 --> 00:31:08,325
But in any case, you don't
worry about it.

579
00:31:08,325 --> 00:31:10,360
But we're not going to do that,
because we want to do a

580
00:31:10,360 --> 00:31:12,680
little bit more math, because
we're interested

581
00:31:12,680 --> 00:31:14,890
in stuff like this.

582
00:31:14,890 --> 00:31:18,770
We're interested in the fact
that the decision vector is a

583
00:31:18,770 --> 00:31:21,265
linear sum of the samples.

584
00:31:21,265 --> 00:31:24,030
So we're going to work a little
harder on this stuff.

585
00:31:24,030 --> 00:31:27,730
And in particular, now that
we've got an expression for w,

586
00:31:27,730 --> 00:31:31,010
this one right here, we're
going to plug it back in

587
00:31:31,010 --> 00:31:34,870
there, and we're going to plug
it back in here and see what

588
00:31:34,870 --> 00:31:37,440
happens to that thing
we're trying to find

589
00:31:37,440 --> 00:31:38,690
the extremum of.

590
00:31:38,690 --> 00:31:46,817


591
00:31:46,817 --> 00:31:51,220
Is everybody relaxed,
taking a deep breath?

592
00:31:51,220 --> 00:31:52,530
Actually, this is the
easiest part.

593
00:31:52,530 --> 00:31:55,755
This is just doing a little
bit of the algebra.

594
00:31:55,755 --> 00:31:58,830
So the thing we're trying
to maximize or

595
00:31:58,830 --> 00:32:03,465
minimize is equal to 1/2.

596
00:32:03,465 --> 00:32:10,570
And now, we've got to
have this vector

597
00:32:10,570 --> 00:32:16,781
here in there twice.

598
00:32:16,781 --> 00:32:17,190
Right?

599
00:32:17,190 --> 00:32:21,295
Because we're multiplying
the two together.

600
00:32:21,295 --> 00:32:22,970
So let's see.

601
00:32:22,970 --> 00:32:26,860
We've got from that expression
up there, one of those w's

602
00:32:26,860 --> 00:32:33,670
will just be the sum of the
alpha i times y sub i times

603
00:32:33,670 --> 00:32:36,265
the vector x sub i.

604
00:32:36,265 --> 00:32:38,320
And then, we've got the
other one, too.

605
00:32:38,320 --> 00:32:41,620
So that's just going to
be the sum of alpha.

606
00:32:41,620 --> 00:32:45,280
Now, I'm going to, actually,
eventually, squish those two

607
00:32:45,280 --> 00:32:48,050
sums together into a double
summation, so I have to keep

608
00:32:48,050 --> 00:32:49,990
the indexes straight.

609
00:32:49,990 --> 00:32:53,786
So I'm just going to write
that as alpha sub j, y

610
00:32:53,786 --> 00:32:57,726
sub j, x sub j.

611
00:32:57,726 --> 00:32:59,760
So those are my two vectors and
I'm going to take the dot

612
00:32:59,760 --> 00:33:00,850
product of those.

613
00:33:00,850 --> 00:33:04,310
That's the first piece, right?

614
00:33:04,310 --> 00:33:07,345
Boy, this is hard.

615
00:33:07,345 --> 00:33:13,760
So minus, and now, the next term
looks like alpha i, y sub

616
00:33:13,760 --> 00:33:17,395
i, x sub i times w.

617
00:33:17,395 --> 00:33:19,640
So you've got a whole
bunch of these.

618
00:33:19,640 --> 00:33:26,996
We've got a sum of alpha i times
y sub i times x sub i,

619
00:33:26,996 --> 00:33:30,425
and then, that gets multiplied
times w.

620
00:33:30,425 --> 00:33:39,160
So we'll put this like this, the
sum of alpha j, y sub j, x

621
00:33:39,160 --> 00:33:41,630
sub j in there like that.

622
00:33:41,630 --> 00:33:44,345
And then, that's the dot
product like that.

623
00:33:44,345 --> 00:33:45,890
That wasn't as bad
as I thought.

624
00:33:45,890 --> 00:33:49,731


625
00:33:49,731 --> 00:33:54,150
Now, I've got to deal with the
next term, the alpha i times y

626
00:33:54,150 --> 00:33:55,740
sub i times b.

627
00:33:55,740 --> 00:33:58,475


628
00:33:58,475 --> 00:34:07,746
So that's minus the sum of alpha
i times y sub i times b.

629
00:34:07,746 --> 00:34:13,949
And then, to finish it off, we
have plus the sum of alpha sub

630
00:34:13,949 --> 00:34:18,320
i minus 1 up there, minus 1 in
front of the summation, so

631
00:34:20,059 --> 00:34:21,605
that's the sum of the alphas.

632
00:34:20,059 --> 00:34:21,605
Are you with me so far?

633
00:34:21,605 --> 00:34:24,096
Just a little algebra.

634
00:34:24,096 --> 00:34:24,860
It looks good.

635
00:34:24,860 --> 00:34:28,838
I think I haven't
mucked it up yet.

636
00:34:28,838 --> 00:34:30,952
Let's see.

637
00:34:30,952 --> 00:34:34,364
Alpha i times y sub i times
b. b is a constant.

638
00:34:34,364 --> 00:34:37,409
So pull that out there, and
then, I just got the sum of

639
00:34:37,409 --> 00:34:41,078
alpha sub i times y sub i.

640
00:34:41,078 --> 00:34:42,250
Oh, that's good.

641
00:34:42,250 --> 00:34:43,500
That's 0.

642
00:34:43,500 --> 00:34:48,304


643
00:34:48,304 --> 00:34:51,900
Now, so for every one of these
terms, we dot it with this

644
00:34:51,900 --> 00:34:53,150
whole expression.

645
00:34:53,150 --> 00:34:54,966


646
00:34:54,966 --> 00:35:00,050
So that's just like taking this
thing here and dotting

647
00:35:00,050 --> 00:35:02,145
those two things together,
right?

648
00:35:02,145 --> 00:35:04,240
Oh, but that's just the same
thing we've got here.

649
00:35:04,240 --> 00:35:07,324


650
00:35:07,324 --> 00:35:11,140
So now, what we can do is we
can say that we can rewrite

651
00:35:11,140 --> 00:35:15,560
this Lagrangian as--

652
00:35:15,560 --> 00:35:19,566
we've got that sum of alpha i.

653
00:35:19,566 --> 00:35:22,256
That's the positive element.

654
00:35:22,256 --> 00:35:25,680
And then, we've got one of
these and half of these.

655
00:35:25,680 --> 00:35:28,865
So that's minus 1/2.

656
00:35:28,865 --> 00:35:30,980
And now, I'll just convert that
whole works into a double

657
00:35:30,980 --> 00:35:43,230
sum over both i and j of alpha
i times alpha j times y sub i

658
00:35:43,230 --> 00:35:49,760
times y sub j times x sub
i dotted with x sub j.
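
The substitution above can be checked numerically. The following is a minimal sketch, not the lecture's own code: the function names and the two-point toy data set are assumptions made for illustration. It evaluates the original Lagrangian with w replaced by the sum of alpha i, y sub i, x sub i, and compares it with the rewritten form, the sum of the alphas minus 1/2 times the double sum.

```python
import numpy as np

def dual_objective(alpha, y, X):
    """sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    gram = X @ X.T              # every pairwise dot product x_i . x_j
    ay = alpha * y              # alpha_i * y_i for each sample
    return alpha.sum() - 0.5 * ay @ gram @ ay

def lagrangian(w, b, alpha, y, X):
    """1/2 |w|^2 - sum_i alpha_i [y_i (w . x_i + b) - 1]."""
    margins = y * (X @ w + b)
    return 0.5 * w @ w - np.sum(alpha * (margins - 1.0))

# Hypothetical toy data: one plus sample and one minus sample.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])  # chosen so that sum(alpha * y) == 0

w = (alpha * y) @ X             # w = sum_i alpha_i y_i x_i
b = 0.7                         # arbitrary: b drops out when sum(alpha * y) == 0

primal_value = lagrangian(w, b, alpha, y, X)
dual_value = dual_objective(alpha, y, X)
print(primal_value, dual_value)
```

The two values agree only because the alphas satisfy the condition that the sum of alpha i times y sub i is 0. The samples enter the rewritten form solely through the matrix of pairwise dot products.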

659
00:35:49,760 --> 00:35:52,670


660
00:35:52,670 --> 00:35:55,560
We sure went through a lot of
trouble to get there, but now,

661
00:35:55,560 --> 00:35:56,210
we've got it.

662
00:35:56,210 --> 00:35:59,200
And we know that what we're
trying to do is we're trying

663
00:35:59,200 --> 00:36:03,320
to find a maximum of
that expression.

664
00:36:03,320 --> 00:36:07,212


665
00:36:07,212 --> 00:36:08,910
And that's the one we're
going to hand off to

666
00:36:08,910 --> 00:36:11,010
the numerical analysts.

667
00:36:11,010 --> 00:36:13,090
So if we're going to hand this
off to the numerical analysts

668
00:36:13,090 --> 00:36:16,136
anyway, why did I go to
all this trouble?

669
00:36:16,136 --> 00:36:19,200
Good question.

670
00:36:19,200 --> 00:36:22,626
Do you have any idea why I
went to all this trouble?

671
00:36:22,626 --> 00:36:25,440
Because I wanted to find out
the dependence of this

672
00:36:25,440 --> 00:36:26,950
expression.

673
00:36:26,950 --> 00:36:28,120
Wanda is telling me.

674
00:36:28,120 --> 00:36:29,450
I'm translating as I go.

675
00:36:29,450 --> 00:36:31,555
She's telling me in Romanian.

676
00:36:31,555 --> 00:36:35,510
I want to find what this
maximization depends on with

677
00:36:35,510 --> 00:36:41,160
respect to these vectors, the
x, the sample vectors.

678
00:36:41,160 --> 00:36:46,480
And what I've discovered is that
the optimization depends

679
00:36:46,480 --> 00:36:53,976
only on the dot product
of pairs of samples.

680
00:36:53,976 --> 00:36:55,300
And that's something we
want to keep in mind.

681
00:36:55,300 --> 00:36:56,620
That's why I put it
in royal purple.

682
00:36:56,620 --> 00:36:59,350


683
00:36:59,350 --> 00:37:02,920
Now, up here, so let's see.

684
00:37:02,920 --> 00:37:04,210
What do we call that
one up there?

685
00:37:04,210 --> 00:37:05,715
That's two.

686
00:37:05,715 --> 00:37:10,505
I guess, we'll call this
piece here three.

687
00:37:10,505 --> 00:37:12,600
This piece here is four.

688
00:37:12,600 --> 00:37:15,060
And now, there's
one more piece.

689
00:37:15,060 --> 00:37:20,080
Because I want to take that w,
and not only stick it back

690
00:37:20,080 --> 00:37:22,700
into that Lagrangian, I want
to stick it back into the

691
00:37:22,700 --> 00:37:24,446
decision rule.

692
00:37:24,446 --> 00:37:29,030
So now, my decision rule with
this expression for w is going

693
00:37:29,030 --> 00:37:31,410
to be w plugged into
that thing.

694
00:37:31,410 --> 00:37:37,000
So the decision rule is going to
look like the sum of alpha

695
00:37:37,000 --> 00:37:45,960
i times y sub i times x sub
i dotted with the unknown

696
00:37:45,960 --> 00:37:47,840
vector, like so.

697
00:37:47,840 --> 00:37:51,536
And we're going to,
I guess, add b.

698
00:37:51,536 --> 00:37:53,770
And we're going to say, if
that's greater than or equal

699
00:37:53,770 --> 00:37:57,660
to 0, then plus.
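
A minimal sketch of this decision rule in code, assuming the alphas and b have already been found by some optimizer. The function name, the toy samples, and the values of alpha and b are hypothetical and chosen only for illustration.

```python
import numpy as np

def classify(u, alpha, y, X, b):
    """Return '+' if sum_i alpha_i y_i (x_i . u) + b >= 0, else '-'."""
    dots = X @ u                          # x_i . u for every sample
    score = np.sum(alpha * y * dots) + b
    return "+" if score >= 0 else "-"

# Hypothetical toy samples and multipliers (not from the lecture demo).
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])
b = 0.0

label_a = classify(np.array([2.0, 0.0]), alpha, y, X, b)
label_b = classify(np.array([-3.0, 1.0]), alpha, y, X, b)
print(label_a, label_b)
```

The unknown vector u is used only inside dot products with the samples, and samples with alpha equal to 0 contribute nothing to the score.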

700
00:37:57,660 --> 00:38:00,560


701
00:38:00,560 --> 00:38:04,750
So you see why the math is
beginning to sing to us now.

702
00:38:04,750 --> 00:38:08,840
Because now, we discover that
the decision rule, also,

703
00:38:08,840 --> 00:38:12,700
depends only on the dot product
of those sample

704
00:38:12,700 --> 00:38:15,340
vectors and the unknown.

705
00:38:15,340 --> 00:38:18,640
So the total dependence
of all of the

706
00:38:18,640 --> 00:38:21,106
math is on the dot products.

707
00:38:21,106 --> 00:38:24,034
All right.

708
00:38:24,034 --> 00:38:27,160
And now, I hear a whisper.

709
00:38:27,160 --> 00:38:30,410
Someone is saying, I
don't believe that

710
00:38:30,410 --> 00:38:31,720
mathematicians can do it.

711
00:38:31,720 --> 00:38:33,850
I don't think those numerical
analysts can find the

712
00:38:33,850 --> 00:38:35,100
optimization.

713
00:38:35,100 --> 00:38:37,360


714
00:38:37,360 --> 00:38:38,925
I want to be sure of it.

715
00:38:38,925 --> 00:38:40,850
Give me ocular proof.

716
00:38:40,850 --> 00:38:42,360
So I'd like to run a
demonstration of it.

717
00:38:42,360 --> 00:38:56,596


718
00:38:56,596 --> 00:38:57,090
OK.

719
00:38:57,090 --> 00:38:58,060
There's our sample problem.

720
00:38:58,060 --> 00:38:59,800
The one I started the
hour out with.

721
00:38:59,800 --> 00:39:05,430
Now, if the optimization
algorithm doesn't get stuck in

722
00:39:05,430 --> 00:39:07,720
a local maximum or something,
it should find a nice,

723
00:39:07,720 --> 00:39:10,900
straight line separating those
two guys, finding the widest

724
00:39:10,900 --> 00:39:14,445
street between the minuses
and the pluses.

725
00:39:14,445 --> 00:39:16,880
So in just a couple of steps,
you can see down

726
00:39:16,880 --> 00:39:18,150
there in step 11.

727
00:39:18,150 --> 00:39:20,630
It's decided that it's done
as much as it can on the

728
00:39:20,630 --> 00:39:22,406
optimization.

729
00:39:22,406 --> 00:39:25,480
And it's got three alphas.

730
00:39:25,480 --> 00:39:30,970
And you can see that the two
negative samples both figure

731
00:39:30,970 --> 00:39:34,575
into the solution, the weights
on the Lagrangian multipliers

732
00:39:34,575 --> 00:39:36,820
are given by those little
yellow bars.

733
00:39:36,820 --> 00:39:40,030
So the two negatives participate
in the solution, as does

734
00:39:40,030 --> 00:39:42,040
one of the positives, but the
other positive doesn't.

735
00:39:42,040 --> 00:39:45,500
So it has a 0 weight.

736
00:39:45,500 --> 00:39:47,700
So everything worked out well.

737
00:39:47,700 --> 00:39:50,440
Now, I said, as long as it
doesn't get stuck on a local

738
00:39:50,440 --> 00:39:55,095
maximum, guess what, those
mathematical friends of ours

739
00:39:55,095 --> 00:39:58,120
can tell us and prove
to us that this

740
00:39:58,120 --> 00:40:00,420
thing is a convex space.

741
00:40:00,420 --> 00:40:04,042
That means it can never get
stuck in a local maximum.

742
00:40:04,042 --> 00:40:07,780
So in contrast with things like
neural nets, where you

743
00:40:07,780 --> 00:40:11,160
have a plague of local maxima,
this guy never gets stuck in a

744
00:40:11,160 --> 00:40:12,355
local maximum.

745
00:40:12,355 --> 00:40:15,536
Let's try some other examples.

746
00:40:15,536 --> 00:40:17,250
Here's two vertical points--

747
00:40:17,250 --> 00:40:20,920
no surprises there, right?

748
00:40:20,920 --> 00:40:22,470
Well, you say, well,
maybe it can't deal

749
00:40:22,470 --> 00:40:24,165
with diagonal points.

750
00:40:24,165 --> 00:40:26,830
Sure it can.

751
00:40:26,830 --> 00:40:32,091
How about this thing here?

752
00:40:32,091 --> 00:40:38,510
Yeah, it only needed two of the
points since any two, a

753
00:40:38,510 --> 00:40:41,820
plus and a minus, will
define the street.

754
00:40:41,820 --> 00:40:44,580
Let's try this guy.

755
00:40:44,580 --> 00:40:46,526
Oh.

756
00:40:46,526 --> 00:40:47,110
What do you think?

757
00:40:47,110 --> 00:40:50,046
What happened here?

758
00:40:50,046 --> 00:40:51,345
Well, we're screwed, right?

759
00:40:51,345 --> 00:40:52,595
Because it's linearly
inseparable--

760
00:40:52,595 --> 00:40:56,629


761
00:40:56,629 --> 00:40:57,879
bad news.

762
00:40:57,879 --> 00:41:00,175


763
00:41:00,175 --> 00:41:04,250
So in situations where it's
linearly inseparable, the

764
00:41:04,250 --> 00:41:07,060
mechanism struggles, and
eventually, it will just slow

765
00:41:07,060 --> 00:41:08,570
down and you truncate
it, because it's

766
00:41:08,570 --> 00:41:09,510
not making any progress.

767
00:41:09,510 --> 00:41:14,765
And you see the red dots there
are ones that it got wrong.

768
00:41:14,765 --> 00:41:17,480
So you say, well, too bad for
our side-- doesn't look like

769
00:41:17,480 --> 00:41:19,502
it's all that good anyway.

770
00:41:19,502 --> 00:41:26,020
But then, a powerful idea comes
to the rescue: when

771
00:41:26,020 --> 00:41:28,896
stuck, switch to another
perspective.

772
00:41:28,896 --> 00:41:31,850
So if we don't like the space
that we're in, because it

773
00:41:31,850 --> 00:41:37,680
gives examples that are not
linearly separable, then we

774
00:41:37,680 --> 00:41:39,705
can say, oh, shoot.

775
00:41:39,705 --> 00:41:42,052
Here's our space.

776
00:41:42,052 --> 00:41:43,302
Here are two points.

777
00:41:43,302 --> 00:41:49,486


778
00:41:49,486 --> 00:41:52,944
Here are two other points.

779
00:41:52,944 --> 00:41:54,630
We can't separate them.

780
00:41:54,630 --> 00:41:57,740
But if we could somehow get them
into another space, maybe

781
00:41:57,740 --> 00:42:06,600
we can separate them, because
they look like this in the

782
00:42:06,600 --> 00:42:08,925
other space, and they're
easy to separate.

783
00:42:08,925 --> 00:42:12,820
So what we need, then, is a
transformation that will take

784
00:42:12,820 --> 00:42:16,070
us from the space we're in into
a space where things are

785
00:42:16,070 --> 00:42:17,590
more convenient, so we're
going to call that

786
00:42:17,590 --> 00:42:22,745
transformation phi
with a vector, x.

787
00:42:22,745 --> 00:42:23,855
That's the transformation.

788
00:42:23,855 --> 00:42:26,290
And now, here's the reason
for all the magic.

789
00:42:26,290 --> 00:42:28,950


790
00:42:28,950 --> 00:42:34,880
I said, that the maximization
only depends on dot products.

791
00:42:34,880 --> 00:42:38,810
So all I need to do the
maximization is the

792
00:42:38,810 --> 00:42:43,975
transformation of one vector
dotted with the transformation

793
00:42:43,975 --> 00:42:47,235
of another vector, like so.

794
00:42:47,235 --> 00:42:51,260
That's what I need to maximize,
or to find the

795
00:42:51,260 --> 00:42:52,510
maximum on.

796
00:42:52,510 --> 00:42:55,216
Then, in order to recognize--

797
00:42:55,216 --> 00:42:57,706
where did it go?

798
00:42:57,706 --> 00:42:59,260
Underneath the chalkboard.

799
00:42:59,260 --> 00:43:05,290


800
00:43:05,290 --> 00:43:06,002
Oh, yes.

801
00:43:06,002 --> 00:43:06,900
Here it is.

802
00:43:06,900 --> 00:43:09,620
To recognize, all I need
is dot products, too.

803
00:43:09,620 --> 00:43:17,025
So for that one I need phi of
x dotted with phi of u.

804
00:43:17,025 --> 00:43:19,300
And just to make this a little
bit more consistent, the

805
00:43:19,300 --> 00:43:22,750
notation, I'll call that
x j and this x sub i.

806
00:43:22,750 --> 00:43:23,550
And that's x sub i.

807
00:43:23,550 --> 00:43:27,595
Those are the quantities I
need in order to do it.

808
00:43:27,595 --> 00:43:34,540
So that means that if I have a
function, let's call it k of x

809
00:43:34,540 --> 00:43:45,370
sub i and x sub j, that's equal
to phi of x sub i dotted

810
00:43:45,370 --> 00:43:49,191
with phi of x sub j.

811
00:43:49,191 --> 00:43:50,215
Then, I'm done.

812
00:43:50,215 --> 00:43:52,306
This is what I need.

813
00:43:52,306 --> 00:43:54,020
I don't actually need this.

814
00:43:54,020 --> 00:43:56,955


815
00:43:56,955 --> 00:44:00,990
All I need is that function, k,
which happens to be called

816
00:44:00,990 --> 00:44:04,650
a kernel function, which
provides me with the dot

817
00:44:04,650 --> 00:44:07,745
product of those two vectors
in another space.

818
00:44:07,745 --> 00:44:09,310
I don't have to know
the transformation

819
00:44:09,310 --> 00:44:11,200
into the other space.

820
00:44:11,200 --> 00:44:15,935
And that's the reason that
this stuff is a miracle.

821
00:44:15,935 --> 00:44:19,595
So what are some of the kernels
that are popular?

822
00:44:19,595 --> 00:44:27,200
One is the linear kernel that
says that u dotted with v plus

823
00:44:27,200 --> 00:44:32,515
1 to the n-th is such a kernel,
because it's got u in

824
00:44:32,515 --> 00:44:35,190
it and v in it, the
two vectors.

825
00:44:35,190 --> 00:44:38,060
And this is what the dot product
is in the other space.

826
00:44:38,060 --> 00:44:39,550
So that's one choice.
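
The kernel just described, u dotted with v plus 1 to the n-th, is usually called a polynomial kernel. The following is a minimal sketch for n equals 2 in two dimensions. The explicit transformation phi below is one standard choice that matches this kernel; it is supplied here as an illustration and is not taken from the lecture.

```python
import numpy as np

def poly_kernel(u, v, n=2):
    """(u . v + 1)^n, computed in the original space."""
    return (np.dot(u, v) + 1.0) ** n

def phi(x):
    """Explicit 6-dimensional map whose dot product equals poly_kernel with n=2."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

via_kernel = poly_kernel(u, v)          # no transformation needed
via_phi = np.dot(phi(u), phi(v))        # dot product in the other space
print(via_kernel, via_phi)
```

Both routes give the same number, which is the point of the kernel: the dot product in the other space is available without ever computing phi.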

827
00:44:39,550 --> 00:44:42,450
Another choice is a kernel
that looks like

828
00:44:42,450 --> 00:44:46,295
this, e to the minus.

829
00:44:46,295 --> 00:44:50,440
Let's take the dot product
of the difference

830
00:44:50,440 --> 00:44:51,690
of those two guys.

831
00:44:51,690 --> 00:44:53,880


832
00:44:53,880 --> 00:44:56,360
Let's take the magnitude
of that and

833
00:44:56,360 --> 00:44:57,660
divide it by some sigma.

834
00:44:57,660 --> 00:45:01,160
That's a second kind of kernel
that we can use.
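
A minimal sketch of this second kernel as spoken: e to the minus of the magnitude of the difference, divided by sigma. The function name is an assumption. The widely used Gaussian form of the radial basis kernel squares the distance and divides by 2 sigma squared; the version below follows the spoken description instead.

```python
import numpy as np

def rbf_kernel(u, v, sigma):
    """exp(-|u - v| / sigma), following the spoken description."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.exp(-np.linalg.norm(u - v) / sigma)

same_point = rbf_kernel([0.0, 0.0], [0.0, 0.0], sigma=1.0)
wide_sigma = rbf_kernel([0.0, 0.0], [1.0, 0.0], sigma=10.0)
tiny_sigma = rbf_kernel([0.0, 0.0], [1.0, 0.0], sigma=0.01)
print(same_point, wide_sigma, tiny_sigma)
```

Sigma controls how quickly the kernel value falls off with distance: with a very small sigma, two distinct points look almost completely dissimilar.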

835
00:45:01,160 --> 00:45:04,350
So let's go back and see if we
can solve this problem by

836
00:45:04,350 --> 00:45:06,350
transforming it into another
space where we have another

837
00:45:06,350 --> 00:45:07,600
perspective.

838
00:45:07,600 --> 00:45:10,082


839
00:45:10,082 --> 00:45:15,618
So that's it.

840
00:45:15,618 --> 00:45:17,760
That's another kernel.

841
00:45:17,760 --> 00:45:18,870
And so sure, we can.

842
00:45:18,870 --> 00:45:21,280
And that's the answer when
transformed back into the

843
00:45:21,280 --> 00:45:22,905
original space.

844
00:45:22,905 --> 00:45:24,690
We can also try doing that
with a so-called

845
00:45:24,690 --> 00:45:25,780
radial basis kernel.

846
00:45:25,780 --> 00:45:28,112
That's the one with the
exponential in it.

847
00:45:28,112 --> 00:45:29,310
We can learn on that one.

848
00:45:29,310 --> 00:45:30,480
Boom.

849
00:45:30,480 --> 00:45:33,346
No problem.

850
00:45:33,346 --> 00:45:36,860
So we've got a general method
that's convex and guaranteed

851
00:45:36,860 --> 00:45:39,245
to produce a global solution.

852
00:45:39,245 --> 00:45:42,950
We've got a mechanism that
easily allows us to transform

853
00:45:42,950 --> 00:45:45,470
this into another space.

854
00:45:45,470 --> 00:45:47,695
So it works like a charm.

855
00:45:47,695 --> 00:45:50,736
Of course, it doesn't remove
all possible problems.

856
00:45:50,736 --> 00:45:53,650
Look at that exponential
thing here.

857
00:45:53,650 --> 00:45:59,890
If we choose a sigma that is
small enough, then those

858
00:45:59,890 --> 00:46:02,760
sigmas are essentially shrunk
right around the sample

859
00:46:02,760 --> 00:46:06,092
points, and we could
get overfitting.

860
00:46:06,092 --> 00:46:09,385
So it doesn't immunize us
against overfitting, but it

861
00:46:09,385 --> 00:46:12,500
does immunize us against local
maxima and does provide us

862
00:46:12,500 --> 00:46:16,820
with a general mechanism for
doing a transformation into

863
00:46:16,820 --> 00:46:18,935
another space with a
better perspective.

864
00:46:18,935 --> 00:46:22,435
Now, the history lesson, all
this stuff feels fairly new.

865
00:46:22,435 --> 00:46:25,746
It feels like it's younger
than you are.

866
00:46:25,746 --> 00:46:27,822
Here's the history of it.

867
00:46:27,822 --> 00:46:31,060
Vapnik immigrated from the
Soviet Union to the United

868
00:46:31,060 --> 00:46:33,760
States in about 1991.

869
00:46:33,760 --> 00:46:36,795
Nobody ever heard of this stuff
before he immigrated.

870
00:46:36,795 --> 00:46:40,200
He actually had done this work
on the basic support vector

871
00:46:40,200 --> 00:46:44,355
idea in his Ph.D. thesis
at Moscow University

872
00:46:44,355 --> 00:46:46,590
in the early '60s.

873
00:46:46,590 --> 00:46:49,470
But it wasn't possible for him
to do anything with it,

874
00:46:49,470 --> 00:46:51,220
because they didn't have any
computers they could try

875
00:46:51,220 --> 00:46:53,010
anything out with.

876
00:46:53,010 --> 00:46:57,460
So he spent the next 25 years at
some oncology institute in

877
00:46:57,460 --> 00:47:00,660
the Soviet Union doing
applications.

878
00:47:00,660 --> 00:47:03,440
Somebody from Bell Labs
discovers him, invites him

879
00:47:03,440 --> 00:47:05,445
over to the United States
where, subsequently, he

880
00:47:05,445 --> 00:47:07,466
decides to immigrate.

881
00:47:07,466 --> 00:47:13,580
In 1992, or thereabouts, Vapnik
submits three papers to

882
00:47:13,580 --> 00:47:17,115
NIPS, the Neural Information
Processing Systems journal.

883
00:47:17,115 --> 00:47:19,065
All of them were rejected.

884
00:47:19,065 --> 00:47:23,570
He's still sore about it,
but it's motivating.

885
00:47:23,570 --> 00:47:27,060
So around 1992, 1993, Bell
Labs was interested in

886
00:47:27,060 --> 00:47:28,420
hand-written character
recognition

887
00:47:28,420 --> 00:47:30,456
and in neural nets.

888
00:47:30,456 --> 00:47:33,270
Vapnik thinks that
neural nets--

889
00:47:33,270 --> 00:47:36,295
what would be a good
word to use?

890
00:47:36,295 --> 00:47:38,410
I can think of the vernacular,
but he thinks that

891
00:47:38,410 --> 00:47:40,150
they're not very good.

892
00:47:40,150 --> 00:47:44,320
So he bets a colleague a good
dinner that support vector

893
00:47:44,320 --> 00:47:46,385
machines will eventually do
better at handwriting

894
00:47:46,385 --> 00:47:50,356
recognition than neural nets.

895
00:47:50,356 --> 00:47:51,690
And it's a dinner bet, right?

896
00:47:51,690 --> 00:47:52,600
It's not that big of a deal.

897
00:47:52,600 --> 00:47:55,280
But as Napoleon said, it's
amazing what a soldier will do

898
00:47:55,280 --> 00:47:57,641
for a bit of ribbon.

899
00:47:57,641 --> 00:48:01,380
So that colleague, who's
working on this problem with

900
00:48:01,380 --> 00:48:06,730
handwriting recognition, decides
to try a support

901
00:48:06,730 --> 00:48:12,700
vector machine with a kernel,
in which n equals 2, just

902
00:48:12,700 --> 00:48:14,820
slightly nonlinear, works
like a charm.

903
00:48:14,820 --> 00:48:17,530


904
00:48:17,530 --> 00:48:19,890
Was this the first time anybody
tried a kernel?

905
00:48:19,890 --> 00:48:23,070
Vapnik actually had the idea in
his thesis but never thought

906
00:48:23,070 --> 00:48:25,560
it was very important.

907
00:48:25,560 --> 00:48:29,670
As soon as it was shown to work
in the early '90s on the

908
00:48:29,670 --> 00:48:32,090
problem of handwriting recognition,
Vapnik

909
00:48:32,090 --> 00:48:35,190
resuscitated the idea of the
kernel, began to develop it,

910
00:48:35,190 --> 00:48:38,270
and it became an essential part of
the whole approach of using

911
00:48:38,270 --> 00:48:39,920
support vector machines.

912
00:48:39,920 --> 00:48:43,980
So the main point about this
is that it was 30 years in

913
00:48:43,980 --> 00:48:47,380
between the concept and anybody
ever hearing about it.

914
00:48:47,380 --> 00:48:52,360
It was 30 years between Vapnik's
understanding of

915
00:48:52,360 --> 00:48:55,840
kernels and his appreciation
of their importance.

916
00:48:55,840 --> 00:48:59,870
And that's the way things often
go, great ideas followed

917
00:48:59,870 --> 00:49:03,320
by long periods of nothing
happening, followed by an

918
00:49:03,320 --> 00:49:06,640
epiphanous moment when the
original idea seems to have

919
00:49:06,640 --> 00:49:09,320
great power with just a
little bit of a twist.

920
00:49:09,320 --> 00:49:10,960
And then, the world
never looks back.

921
00:49:10,960 --> 00:49:14,780
And Vapnik, who nobody ever
heard of until the early '90s,

922
00:49:14,780 --> 00:49:18,380
becomes famous for something
that everybody knows about

923
00:49:18,380 --> 00:49:19,630
today who does machine
learning.

924
00:49:19,630 --> 00:49:33,807