The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

[MUSIC PLAYING]

PATRICK H. WINSTON: Well, what we're going to do today is climb a pretty big mountain, because we're going to go from a neural net with two parameters to discussing the kind of neural nets in which people end up dealing with 60 million parameters. So it's going to be a pretty big jump.

Along the way are a couple of things I wanted to underscore from our previous discussion. Last time, I tried to develop some intuition for the kinds of formulas that you use to actually do the calculations in a small neural net about how the weights are going to change. And the main thing I tried to emphasize is that when you have a neural net like this one, everything is sort of divided into columns. You can't have the performance based on this output affect some weight change back here without going through a finite number of output variables, the y's. (And by the way, there's no y2 and y3.)

Dealing with this is really a notational nightmare, and I spent a lot of time yesterday trying to clean it up a little bit. But basically, what I'm trying to say has nothing to do with the notation I've used, but rather with the fact that there's a limited number of ways in which that can influence this, even though the number of paths through this network can grow exponentially.

So those equations underneath derive from trying to figure out how the output performance depends on some of these weights back here. What I've calculated is the dependence of the performance on w1 going that way, and I've also calculated the dependence of the performance on w1 going that way. So that's one of the equations I've got down there.
And another one deals with w3, and it involves going both this way and this way. And all I've done, in all four cases, is just take the partial derivative of performance with respect to those weights and use the chain rule to expand it. And when I do that, this is the stuff I get. It's just a whole bunch of partial derivatives. But if you look at it and let it sing a little bit to you, what you see is that there's a lot of redundancy in the computation.

So, for example, this guy here, the partial of performance with respect to w1, depends on both paths, of course. But look at the first elements here, these guys right here, and look at the first elements in the expression for calculating the partial derivative of performance with respect to w3, these guys. They're the same.

And not only that, if you look inside these expressions and look at this particular piece here, you see that that is an expression that was needed in order to calculate the changes in one of the downstream weights. But it happens to be the same thing as you see over here. And likewise, this piece is the same thing you see over here.

So each time you move further and further back from the outputs toward the inputs, you're reusing a lot of computation that you've already done. I'm trying to find a way to sloganize this, and what I've come up with is: what's done is done and cannot be-- no, no, that's not quite right, is it? It's: what's computed is computed and need not be recomputed. OK? So that's what's going on here. And that's why this is a calculation that's linear in the depth of the neural net, not exponential.
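To make that slogan concrete, here is a minimal sketch in Python of the bookkeeping it implies, assuming fully-connected layers of sigmoid neurons (the layer shapes and the performance derivative handed in are placeholders, not the net on the board): each layer's vector of deltas is computed exactly once from the deltas of the layer downstream, so the total work is linear in depth.

import numpy as np

def backprop_deltas(weights, activations, d_perf_d_output):
    # Walk backward through the layers, computing each layer's
    # deltas exactly once from the layer downstream -- what's
    # computed is computed and need not be recomputed.
    deltas = [None] * len(activations)
    y = activations[-1]
    deltas[-1] = d_perf_d_output * y * (1.0 - y)   # sigmoid slope is y(1 - y)
    for k in range(len(activations) - 2, -1, -1):
        y = activations[k]
        deltas[k] = (weights[k].T @ deltas[k + 1]) * y * (1.0 - y)
    return deltas

# Made-up three-layer example: weights[k] connects layer k to layer k+1.
acts = [np.array([0.2, 0.7]), np.array([0.5, 0.9]), np.array([0.6, 0.4])]
wts = [np.ones((2, 2)) * 0.1, np.ones((2, 2)) * 0.1]
print(backprop_deltas(wts, acts, np.array([1.0, 0.0])))

The change for any particular weight then falls out locally: the gradient for the matrix between layers k and k+1 is proportional to np.outer(deltas[k + 1], activations[k]), with no re-expansion of the exponentially many paths.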
There's another thing I wanted to point out in connection with these neural nets, and that has to do with what happens when we look at a single neuron. What we've got is a bunch of weights that you multiply times a bunch of inputs, like so. And then those are all summed up in a summing box before they enter some kind of nonlinearity, in our case a sigmoid function.

But if I ask you to write down the expression for the value we've got there, what is it? Well, it's just the sum of the w's times the x's. What's that? That's the dot product. Remember, a few lectures ago I said that some of us believe that the dot product is a fundamental calculation that takes place in our heads? So this is why we think so. If neural nets are doing anything like this, then there's a dot product between some weights and some input values.

Now, it's a funny kind of dot product, because in the models that we've been using, these input variables are all or none-- 0 or 1. But that's OK. I have it on good authority that there are neurons in our head for which the values that are produced are not exactly all or none but rather have a kind of proportionality to them. So you get a real dot product type of operation out of that.
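In code, the whole neuron is one dot product plus one nonlinearity. A minimal sketch (the weight and input values here are made up for illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron(w, x):
    # A single neuron: dot product of weights and inputs,
    # then a sigmoid nonlinearity.
    return sigmoid(np.dot(w, x))

# Three inputs, mostly 0-or-1 as in the lecture's models.
w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 0.0, 1.0])
print(neuron(w, x))  # sigmoid(1.3), about 0.786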
So that's by way of a couple of asides that I wanted to underscore before we get into the center of today's discussion, which will be to talk about the so-called deep nets.

Now, let's see, what does a deep net do? Well, from last time, you know that a deep net does that sort of thing, and it's interesting to look at some of the offerings here. By the way, how good was this performance in 2012? Well, if you count it as right when the correct answer was in the system's top five choices, the error rate was about 15%. If you say you only get it right if it was your top choice, then the error rate was about 37%. So pretty good, especially since some of these things are highly ambiguous even to us.

And what kind of a system did that? Well, it wasn't one that looked exactly like that, although that is the essence of it. The system actually looked like that. There's quite a lot of stuff in there. And what I'm going to talk about is not exactly this system, but rather the stuff of which such systems are made, because there's nothing particularly special about this one. It just happens to be a particular assembly of components that tend to reappear when anyone does this sort of neural net stuff.

So let me explain it this way. The first thing I need to talk about is the concept of-- well, I don't like the term. It's called convolution. I don't like the term because in the second-best course at the Institute, Signals and Systems, you learn about impulse responses and convolution integrals and stuff like that. And this hints at that, but it's not the same thing, because there's no memory involved in what's going on as these signals are processed. But they call them convolutional neural nets anyway.

So here you are. You've got some kind of image. And even with lots of computing power and GPUs and all that sort of stuff, we're not talking about images with 4 million pixels. We're talking about images that might be 256 on a side. They're not 1,000 by 1,000 or 4,000 by 4,000 or anything like that; they tend to be compressed into a 256-by-256 image.

And now what we do is we run over this with a neuron that is looking only at a 10-by-10 square, like so, and that produces an output. Next, we go over that again, having shifted this neuron a little bit, like so. And then we shift it again, so we get that output right there. So each of those deployments of a neuron produces an output, and that output is associated with a particular place in the image. This is the process that is called convolution, as a term of art.
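Here is a minimal sketch of that sliding-window process (the stride of 1 and the random placeholder values are assumptions; the 256-by-256 image and 10-by-10 kernel sizes are the ones from the board):

import numpy as np

def convolve(image, kernel, stride=1):
    # Slide the kernel over the image; each placement is one
    # neuron's dot product, producing one output value tied to
    # a particular place in the image.
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.rand(256, 256)   # the compressed 256-by-256 image
kernel = np.random.rand(10, 10)    # one 10-by-10 neuron, the "kernel"
feature_map = convolve(image, kernel)
print(feature_map.shape)           # (247, 247)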
Now, this convolution operation results in a bunch of points over here. And the next thing that we do with those points is we look in local neighborhoods and see what the maximum value is. Then we take that maximum value and construct yet another mapping of the image over here using that maximum value. Then we slide that over, like so, and we produce another value. And then we slide it over one more time, with a different color, and now we've got yet another value. This process is called pooling. And because we're taking the maximum, this particular kind of pooling is called max pooling.

So now let's see what's next. Taking a particular neuron and running it across the image-- we call that a kernel, again sucking some terminology out of Signals and Systems. But now what we're going to say is that we could use a whole bunch of kernels. So the thing that I produced with one kernel can now be repeated many times, like so. In fact, a typical number is 100 times.

So now what we've got is a 256-by-256 image. We've gone over it with a 10-by-10 kernel. We've taken the maximum values that are in the vicinity of each other. And then we've repeated that 100 times. So now we can take that and feed all those results into some kind of neural net, perhaps with a fully-connected job on the final layers. And then, in the ultimate output, we get some sort of indication of how likely it is that the thing being seen is, say, a mite.

So that's roughly how these things work. What have we talked about so far? We've talked about pooling, and we've talked about convolution.
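And a matching sketch of the max pooling step (the 2-by-2 neighborhood and stride are assumptions; the lecture only says "local neighborhoods"):

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Look in each local neighborhood and keep only the maximum value.
    h, w = feature_map.shape
    out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i*stride:i*stride+size,
                                    j*stride:j*stride+size].max()
    return out

# One pooled map per kernel; with 100 kernels you'd build a list of 100.
feature_map = np.random.rand(247, 247)  # e.g. the output of one 10x10 kernel
print(max_pool(feature_map).shape)      # (123, 123)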
And now we can talk about some of the good stuff. But before I get into that, this is what we can do now, and you can compare it with what was done in the old days, before massive amounts of computing became available-- a kind of neural net activity that's a little easier to see.

You might, in the old days, only have enough computing power to deal with a small grid of picture elements, or so-called pixels. And then each of these might be a value that is fed as an input into some kind of neuron. So you might have a column of neurons that are looking at these pixels in your image. And then there might be a small number of columns that follow from that. And finally, something that says this neuron is looking for things that are a number 1-- that is to say, something that looks like a number 1 in the image. So this stuff up here is what you can do when you have a massive amount of computation relative to the kind of thing you used to see in the old days.

So what's different? Well, what's different is that instead of a few hundred parameters, we've got a lot more. Instead of 10 digits, we have 1,000 classes. Instead of a few hundred samples, we have maybe 1,000 examples of each class. So that makes a million samples. And we've got 60 million parameters to play with. And the surprising thing is that the net result is a function approximator that astonishes everybody. No one quite knows why it works, except that when you throw an immense amount of computation into this kind of arrangement, it's possible to get a performance that no one expected would be possible. So that's sort of the bottom line.

But now there are a couple of ideas beyond that that I think are especially interesting, and I want to talk about those. The first idea that's especially interesting is the idea of autocoding, and here's how it works. I'm going to run out of board space, so I think I'll do it right here. You have some input values. They go into a layer of neurons, the input layer. Then there is a so-called hidden layer that's much smaller-- maybe, in the example, there will be 10 neurons here and just a couple here. And then these expand to an output layer, like so.
Now we can take the output layer, z1 through zn, and compare it with the desired values, d1 through dn. Are you following me so far? Now, the trick is to say, well, what are the desired values? Let's let the desired values be the input values. So what we're going to do is train this net up so that the output is the same as the input.

What's the good of that? Well, we're going to force everything down through this necked-down piece of network. So if this network is going to succeed in taking all the possibilities here and cramming them into this smaller inner layer, the so-called hidden layer, such that it can reproduce the input at the output, it must be doing some kind of generalization of the kinds of things it sees on its input. And that's a very clever idea, and it's seen in various forms in a large fraction of the papers that appear on deep neural nets.
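Here is a minimal sketch of that autocoding idea, shaped like the upcoming demo's 10-through-3-back-to-10 net (the shadow encoding, learning rate, iteration count, and training loop are all assumptions of mine, not the demo's actual code):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 10 inputs forced through a 3-neuron hidden layer, back out to 10.
W1 = rng.normal(scale=0.5, size=(3, 10))
W2 = rng.normal(scale=0.5, size=(10, 3))

rate = 0.5
for _ in range(10000):
    x = np.zeros(10)
    x[: rng.integers(1, 11)] = 1.0     # a "shadow" of random height
    h = sigmoid(W1 @ x)                # the necked-down hidden layer
    z = sigmoid(W2 @ h)
    dz = (x - z) * z * (1 - z)         # output deltas: desired d = input x
    dh = (W2.T @ dz) * h * (1 - h)     # hidden deltas (reused computation)
    W2 += rate * np.outer(dz, h)
    W1 += rate * np.outer(dh, x)

After training, the three hidden neurons carry some encoded description of shadow height, which the output layer decodes back into ten levels; nothing in the loop ever mentions animal classes.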
But now I want to talk about an example so I can show you a demonstration. OK? We don't have GPUs, and we don't have three days to do this. So I'm going to make up a very simple example that's reminiscent of what goes on here but involves hardly any computation. What I'm going to imagine is that we're trying to recognize animals from the heights of the shadows that they cast. So we're going to recognize three animals-- a cheetah, a zebra, and a giraffe-- and they will each cast a shadow on the blackboard, like me. No vampires involved here. And what we're going to do is use the shadow as an input to a neural net. All right? So let's see how that would work.

So there is our network. And if I just click into one of these test samples, that's the height of the shadow that a cheetah casts on a wall. There are 10 input neurons corresponding to each level of the shadow. They're rammed through three inner-layer neurons, and from that it spreads out and becomes the outer-layer values. And we're going to compare those outer-layer values to the desired values, but the desired values are the same as the input values. So this column is a column of input values, and on the far right we have our column of desired values.

Now, we haven't trained this neural net yet; all we've got is random values in there. So if we run the test samples through, we get that and that. Yeah, cheetahs are short, zebras are medium height, and giraffes are tall, but our output is just pretty much 0.5 for all of those shadow heights-- no training so far.

So let's run this thing. We're just using simple backprop, just like on our world's simplest neural net. And it's interesting to see what happens. You see all those values changing? Now, I need to mention that when you see a green connection, that means it's a positive weight, and the density of the green indicates how positive it is. The red ones are negative weights, and the intensity of the red indicates how negative it is. So here you can see that from our random initialization we still have a variety of red and green values. We haven't really done much training, so everything correctly looks pretty much random.

So let's run this thing. And after only 1,000 iterations of going through these examples and trying to make the output the same as the input, we've reached a point where the error rate has dropped. In fact, it's dropped so much it's interesting to re-look at the test cases. So here's a test case where we have a cheetah. And now the output value is, in fact, very close to the desired value in all the output neurons. If we look at another one, once again, there's a correspondence in the right two columns. And if we look at the final one-- yeah, there's a correspondence in the right two columns.

Now, you back up from this and say, well, what's going on here? It turns out that you're not training this thing to classify animals.
You're training it to understand the nature of the things that it sees in the environment, because all it sees is the height of a shadow. It doesn't know anything about the classifications you're going to try to get out of that. All it sees is that there's a kind of consistency in the data that it sees on the input values. Right?

Now, you might say, OK, that's cool, because what must be happening is that that hidden layer, because everything is forced through that narrow pipe, must be doing some kind of generalization. So it ought to be the case that if we click on each of those neurons, we ought to see it specialize to a particular height, because that's the sort of stuff that's presented on the input.

Well, let's go see what, in fact, is the maximum stimulation to be seen on the neurons in that hidden layer. When I click on these guys, what we're going to see is the input values that maximally stimulate that neuron. And by the way, I have no idea how this is going to turn out, because the initialization is all random.

Well, that's good-- that one looks like it's generalized the notion of short. Ugh, that doesn't look like medium. And in fact, the maximum stimulation doesn't involve any stimulation from that lower neuron. Here, look at this one. That doesn't look like tall. So we've got one that looks like short and two that just look completely random.

So, in fact, maybe we'd better back off the idea that what's going on in that hidden layer is generalization, and say instead that what's going on in there is the encoding of a generalization. We don't see the generalization in the stimulating values. What we have instead is some kind of encoded generalization.
And because this stuff is encoded, that's what makes these neural nets so extraordinarily difficult to understand. We don't understand what they're doing. We don't understand why they can recognize a cheetah. We don't understand why one can recognize a school bus in some cases but not in others, because we don't really understand what these neurons are responding to. Well, that's not quite true-- there's been a lot of work recently on trying to sort that out-- but there's still a lot of mystery in this world.

In any event, that's the autocoding idea. It comes in various guises; sometimes people talk about Boltzmann machines and things of that sort, but it's basically all the same sort of idea. And so what you can do is go layer by layer. Once you've trained the input layer, you can use that layer to train the next layer, and then that can train the next layer after that. And it's only at the very, very end that you say to yourself, well, now I've accumulated a lot of knowledge about the environment and what can be seen in the environment; maybe it's time to get around to using some samples of particular classes and train on classes. So that's the story on autocoding.

Now, the next thing to talk about is that final layer. So let's see what the final layer might look like. It might look like this. There's a summer. There's a minus 1 up there, and there's a multiplier here, with a threshold value, T, there. Now, likewise, there's another input value here. Let me call it x, and it gets multiplied by some weight, w, and then that goes into the summer as well. And that, in turn, goes into a sigmoid that looks like so. And finally, you get an output, which we'll call z.
So it's clear that if you just write out the value of z as it depends on those inputs, using the formula that we worked with last time, then what you see is that

z = 1 / (1 + e^(-wx + T))

Right? So that's a sigmoid function that depends on the value of that weight and on the value of that threshold. So let's look at how those values might change things.

Here we have an ordinary sigmoid. What happens if we shift it with a threshold value? If we change that threshold value T, it's going to shift the place where the sigmoid comes down. So a change in T could cause this thing to shift over that way. And if we change the value of w, that could change how steep this guy is.

So we might think that the performance, since it depends on w and T, should be adjusted in such a way as to make the classification do the right thing. But what's the right thing? Well, that depends on the samples that we've seen.

Suppose, for example, that this is our sigmoid function, and we see some positive examples of a class that have values that lie at that point and that point and that point. And we have some values that correspond to situations where the class is not one of the things associated with this neuron, and in that case what we see is examples that are over in this vicinity here.

So the probability that we would see this particular guy in this world is associated with the value on the sigmoid curve. You could think of this as the probability of that positive example, and this is the probability of that positive example, and this is the probability of that positive example. What's the probability of this negative example? Well, it's 1 minus the value on that curve. And this one's 1 minus the value on that curve. So we could go through the calculations.
And what we would determine is that to maximize the probability of seeing this data-- this particular stuff in a set of experiments-- we would have to adjust T and w so as to get this curve doing the optimal thing. There's nothing mysterious about it; it's just more partial derivatives and that sort of thing. But the bottom line is that the probability of seeing this data depends on the shape of this curve, and the shape of this curve depends on those parameters. And if we want to maximize the probability of the data we've seen, then we have to adjust those parameters accordingly.

Let's have a look at a demonstration. OK. So there's an ordinary sigmoid curve. Here are a couple of positive examples. Here's a negative example. Let's put in some more positive examples over here. And now let's run the good old gradient ascent algorithm on that. And this is what happens. You've seen how, as we adjust the shape of the curve, the probability of seeing those examples of the class goes up, and the probability of seeing the non-example goes down.

So what if we put some more examples in? If we put a negative example there, not much is going to happen. What would happen if we put a positive example right there? Then we start seeing some dramatic shifts in the shape of the curve. So that's probably a noise point. But we can put some more negative examples in there and see how that adjusts the curve.

All right. So that's what we're doing. We're viewing this output value as something that's related to the probability of seeing a class. And we're adjusting the parameters on that output layer so as to maximize the probability of the sample data that we've got at hand. Right?
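Here is a minimal sketch of that output-layer training, written as gradient ascent on the log probability of the data (the sample positions, learning rate, and iteration count are made up; the sigmoid is the z = 1/(1 + e^(-wx + T)) from the board):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 1-D samples: positives should get probability near 1, negatives near 0.
pos = np.array([2.0, 2.5, 3.0])
neg = np.array([-1.0, -0.5])

w, T = 1.0, 0.0
rate = 0.1
for _ in range(2000):
    # Probability of the data: product of sigmoid(wx - T) over positives
    # and (1 - sigmoid(wx - T)) over negatives. Ascend the log of that:
    grad_w = np.sum((1 - sigmoid(w*pos - T)) * pos) \
           - np.sum(sigmoid(w*neg - T) * neg)
    grad_T = -np.sum(1 - sigmoid(w*pos - T)) + np.sum(sigmoid(w*neg - T))
    w += rate * grad_w
    T += rate * grad_T

print(sigmoid(w*pos - T))  # near 1 for the positives
print(sigmoid(w*neg - T))  # near 0 for the negatives

With the examples cleanly separated, w keeps growing and the curve keeps steepening; a stray positive dropped among the negatives drags the curve around just the way the demo shows.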
Now, there's one more thing. Because, see, what we've got here is the basic idea of back propagation, which has layers and layers of additional-- let me be flattering and call them ideas-- layered on top. So here's the next idea that's layered on top.

We've got an output value here. It's a function, after all, and it's got a value. And if we have 1,000 classes, we're going to have 1,000 output neurons, and each is going to be producing some kind of value. We can think of that value as a probability, but I don't want to write "probability" yet. I just want to say that what we've got for this output neuron is a function of class 1. And then there will be another output neuron, which is a function of class 2, and so on. And these values will presumably be higher-- this one will be higher if we are, in fact, looking at class 1, and this one down here will be higher if we're looking at class m.

So what we would like to do is not just pick one of these outputs and say, well, you've got the highest value, so you win. What we want to do instead is associate some kind of probability with each of the classes, because, after all, we want to do things like find the most probable five. So what we do is we say, all right, the actual probability of class 1 is equal to the output of that function divided by the sum of the outputs over all classes:

p(class 1) = f(class 1) / (f(class 1) + ... + f(class m))

That takes the entire output vector and converts each output value into a probability. When we used that sigmoid function, we did it with a view toward thinking about the output as a probability-- in fact, we assumed it was a probability when we made this argument. But in the end, there's an output for each of those classes, and what we get is not exactly a probability until we divide by a normalizing factor. This, by the way, is called softmax-- not on my list of things, but it soon will be.
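As a minimal sketch (the three output values are placeholders):

import numpy as np

def normalize_outputs(outputs):
    # Convert a vector of output-neuron values into probabilities
    # by dividing each by the sum of all of them.
    outputs = np.asarray(outputs, dtype=float)
    return outputs / outputs.sum()

print(normalize_outputs([0.8, 0.1, 0.1]))  # already sums to 1
print(normalize_outputs([0.9, 0.6, 0.3]))  # [0.5, 0.333..., 0.166...]

Note that this is the divide-by-the-sum normalization described here; the now-standard softmax exponentiates first, computing e^(f_i) divided by the sum of e^(f_k), which has the same normalizing effect but accentuates the largest output.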
So since we're not talking about taking the maximum and using that to classify the picture, what we're going to do is use this softmax: we're going to give a range of classifications, and we're going to associate a probability with each. And that's what you saw in all of those samples. You saw, yes, this is a container ship, but maybe it's also this, that, or a third, or fourth, or fifth thing.

So that is a pretty good summary of the kinds of things that are involved. But now we've got one more step, because what we can do now is take this output layer idea, this softmax idea, and put it together with the autocoding idea. So we've trained just a layer up. And now we're going to detach it from the output layer but retain those weights that connect the input to the hidden layer. And when we do that, what we're going to see is something that looks like this. Now we've got a trained first layer but an untrained output layer. We're going to freeze the input layer and train the output layer using the sigmoid curve, and see what happens when we do that.

Oh, by the way, let's run our test samples through. You can see it's not doing anything; the output is a half for each of the categories even though we've got a trained middle layer. So we have to train the outer layer. Let's see how long it takes. Whoa, that was pretty fast. Now there's an extraordinarily good match between the outputs and the desired outputs. So that's the combination of the autocoding idea and the softmax idea.

There's just one more idea that's worthy of mention, and that's the idea of dropout. The plague of any neural net is that it gets stuck in some kind of local maximum. So it was discovered that these things train better if, on every iteration, you flip a coin for each neuron. And if the coin ends up tails, you assume that neuron has just died and has no influence on the output. It's called dropping out those neurons. On the next iteration, you drop out a different set. What this seems to do is prevent the thing from settling into a frozen local-maximum state.
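A minimal sketch of that coin flipping (the fair coin and the example layer values are assumptions; production versions usually also rescale the survivors by 1/p_keep, which isn't part of the description here):

import numpy as np

rng = np.random.default_rng()

def dropout(layer_outputs, p_keep=0.5):
    # Flip a coin for each neuron; tails means it "dies" for this
    # iteration and contributes nothing downstream. A fresh mask
    # is drawn on every training iteration.
    mask = rng.random(layer_outputs.shape) < p_keep
    return layer_outputs * mask

h = np.array([0.9, 0.2, 0.7, 0.5])
print(dropout(h))  # e.g. [0.9, 0.0, 0.7, 0.0] -- different every call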
So that's deep nets. They should be called, by the way, wide nets, because they tend to be enormously wide but rarely more than 10 columns deep.

Now, let's see, where to go from here? Maybe what we should do is talk about the awesome curiosity in the current state of the art. And that is that all of this sophistication-- output layers that are probabilities, training using autocoding or Boltzmann machines-- doesn't seem to help much relative to plain old back propagation. Back propagation with a convolutional net seems to do just about as well as anything.

And while we're on the subject of an ordinary deep net, I'd like to examine a situation here where we have a deep net-- well, a classroom deep net. We'll put five layers in there, and its job is still the same: to classify an animal as a cheetah, a zebra, or a giraffe based on the height of the shadow it casts. As before, if it's green, that means positive; if it's red, that means negative. And right at the moment, we have no training, so if we run our test samples through, the output is always a half no matter what the animal is. All right? So what we're going to do is just use ordinary backprop on this, the same thing as in that sample that's underneath the blackboard. Only now we've got a lot more parameters: we've got five columns, and each one of them has 9 or 10 neurons in it.

So let's let this one run. Now, look at that stuff on the right. It's all turned red. At first I thought this was a bug in my program, but it makes absolute sense. If you don't know what the actual animal is going to be, and there are a whole bunch of possibilities, you'd better just say no for everybody. It's like when a biologist says, we don't know.
It's the most probable answer. Well, eventually, after about 160,000 iterations, it seems to have got it. Let's run the test samples through. Now it's doing great. Let's do it again, just to see if this is a fluke. All red on the right side, and finally you start seeing some changes going on in the final layers there. And if you look at the error rate down at the bottom, you'll see that it kind of falls off a cliff. So nothing happens for a real long time, and then it falls off a cliff.

Now, what would happen if this neural net were not quite so wide? Good question. But before we get to that question, what I'm going to do is a funny kind of variation on the theme of dropout. I'm going to kill off one neuron in each column, and then see if I can retrain the network to do the right thing. So I'm going to reassign those neurons to some other purpose, and now there's one fewer neuron in each column of the network. If we rerun that, we see that it trains itself up very fast. So we seem to be still close enough to a solution that we can do without one of the neurons in each column.

Let's do it again. Now it goes up a little bit, but it quickly falls down to a solution. Try again. Quickly falls down to a solution. Oh, my god, how much of this am I going to do? Each time I knock something out and retrain, it finds its solution very fast. Whoa-- I got it all the way down to two neurons in each column, and it still has a solution. That's interesting, don't you think?

But let's repeat the experiment, and this time we're going to do it a little differently. We're going to take our five layers, and before we do any training, I'm going to knock out all but two neurons in each column. Now, I know that with two neurons in each column there's a solution-- I just showed one. But let's run it this way.
It looks like increasingly bad news. What's happened is that this sucker has got itself into a local maximum. So now you can see why there's been a breakthrough in this neural net learning stuff. It's because when you widen the net, you turn local maxima into saddle points. So now it's got a way of crawling its way through this vast space without getting stuck on a local maximum, as suggested by this.

All right. So those are some, I think, interesting things to look at by way of these demonstrations. But now I'd like to go back to my slide set and show you some examples that will address the question of whether these things are seeing like we see.

You can try these examples online; there are a variety of websites that allow you to put in your own picture. And there's a cottage industry of producing papers in journals that fool neural nets. So in this case, a very small number of pixels have been changed. You don't see the difference, but it's enough to take this particular neural net from a high confidence that it's looking at a school bus to thinking that it's not a school bus. And those are some things that it thinks are a school bus.

So it appears to be the case that what is triggering this school bus result is that it's seeing enough local evidence that this is not one of the other 999 classes, and enough positive evidence from these local looks, to conclude that it's a school bus. So do you see any of those things? I don't. And here you can say, OK, well, look at that baseball one. Yeah, that looks like it's got a little bit of baseball texture in it. So maybe what it's doing is looking at texture.

These are some examples from a recent and very famous paper by Google using essentially the same ideas to put captions on pictures. This, by the way, is what has stimulated all this enormous concern about artificial intelligence.
740 00:43:53,650 --> 00:43:54,150 All right.
741 00:43:54,150 --> 00:43:57,880 So those are some, I think, interesting things
742 00:43:57,880 --> 00:44:01,810 to look at by way of these demonstrations.
743 00:44:01,810 --> 00:44:04,510 But now I'd like to go back to my slide set
744 00:44:04,510 --> 00:44:06,860 and show you some examples that will address
745 00:44:06,860 --> 00:44:09,670 the question of whether these things are seeing like we see.
746 00:44:20,610 --> 00:44:22,380 So you can try these examples online.
747 00:44:22,380 --> 00:44:24,370 There are a variety of websites that allow
748 00:44:24,370 --> 00:44:27,950 you to put in your own picture.
749 00:44:27,950 --> 00:44:33,510 And there's a cottage industry of journal papers
750 00:44:33,510 --> 00:44:35,840 on fooling neural nets.
751 00:44:35,840 --> 00:44:38,600 So in this case, a very small number of pixels
752 00:44:38,600 --> 00:44:39,420 have been changed.
753 00:44:39,420 --> 00:44:41,640 You don't see the difference, but it's
754 00:44:41,640 --> 00:44:44,290 enough to take this particular neural net
755 00:44:44,290 --> 00:44:47,850 from high confidence that it's looking at a school bus
756 00:44:47,850 --> 00:44:51,777 to thinking that it's not a school bus.
757 00:44:51,777 --> 00:44:54,026 And those are some things that it thinks are a school bus.
758 00:44:56,780 --> 00:44:58,490 So it appears to be the case that what
759 00:44:58,490 --> 00:45:01,320 is triggering this school bus result
760 00:45:01,320 --> 00:45:04,340 is that it's seeing enough local evidence that this is not
761 00:45:04,340 --> 00:45:10,080 one of the other 999 classes and enough positive evidence
762 00:45:10,080 --> 00:45:12,310 from these local looks to conclude
763 00:45:12,310 --> 00:45:13,313 that it's a school bus.
764 00:45:18,020 --> 00:45:20,330 So do you see any of those things?
765 00:45:20,330 --> 00:45:20,870 I don't.
766 00:45:24,494 --> 00:45:28,290 And here you can say, OK, well, look at that baseball one.
767 00:45:28,290 --> 00:45:31,500 Yeah, that looks like it's got a little bit of baseball texture
768 00:45:31,500 --> 00:45:32,020 in it.
769 00:45:32,020 --> 00:45:33,978 So maybe what it's doing is looking at texture.
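The lecture doesn't say how these doctored school-bus pictures were produced. One well-known recipe for this kind of imperceptible fooling, offered here purely as an illustrative stand-in and not as the method from the paper on the slide, is the fast gradient sign method: push every pixel a tiny step in whichever direction increases the classifier's loss on the true label. A minimal sketch, assuming PyTorch and torchvision:

# A minimal fast-gradient-sign sketch -- an illustrative stand-in, not the
# method used in the paper shown in the lecture.  Assumes PyTorch and
# torchvision; proper ImageNet normalization is omitted for brevity.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

def fgsm(x, label, eps=0.003):
    """Nudge every pixel of x by +/- eps in the direction that increases
    the classifier's loss for the true label."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

# Hypothetical usage: x is a (1, 3, 224, 224) image tensor scaled to
# [0, 1], and label is the true class index (school bus is one of the
# 1,000 ImageNet classes).  A perturbation of 0.003 per pixel is invisible
# to us but often flips the predicted class:
# x_adv = fgsm(x, label)
# print(model(x_adv).argmax(dim=1))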
770 00:45:39,130 --> 00:45:43,380 These are some examples from a recent and very famous
771 00:45:43,380 --> 00:45:47,380 paper by Google using essentially the same ideas
772 00:45:47,380 --> 00:45:51,290 to put captions on pictures.
773 00:45:51,290 --> 00:45:53,790 So this, by the way, is what has stimulated
774 00:45:53,790 --> 00:45:56,620 all this enormous concern about artificial intelligence.
775 00:45:56,620 --> 00:45:58,870 Because a naive viewer looks at that picture and says,
776 00:45:58,870 --> 00:46:00,245 oh, my god, this thing knows what
777 00:46:00,245 --> 00:46:06,260 it's like to play, or be young, or move, or what a Frisbee is.
778 00:46:06,260 --> 00:46:08,070 And of course, it knows none of that.
779 00:46:08,070 --> 00:46:10,950 It just knows how to label this picture.
780 00:46:10,950 --> 00:46:14,080 And to the credit of the people who wrote this paper,
781 00:46:14,080 --> 00:46:17,540 they show examples that don't do so well.
782 00:46:17,540 --> 00:46:21,000 So yeah, it's a cat, but it's not lying down.
783 00:46:21,000 --> 00:46:24,620 Oh, it's a little girl, but she's not blowing bubbles.
784 00:46:24,620 --> 00:46:25,884 What about this one?
785 00:46:25,884 --> 00:46:28,848 [LAUGHTER]
786 00:46:31,820 --> 00:46:34,770 So we've been doing our own work in my laboratory
787 00:46:34,770 --> 00:46:36,390 on some of this.
788 00:46:36,390 --> 00:46:39,900 And the way the following set of pictures was produced was this.
789 00:46:39,900 --> 00:46:41,910 You take an image, and you separate it
790 00:46:41,910 --> 00:46:44,310 into a bunch of slices, each representing
791 00:46:44,310 --> 00:46:46,760 a particular frequency band.
792 00:46:46,760 --> 00:46:49,300 And then you go into one of those frequency bands
793 00:46:49,300 --> 00:46:51,680 and you knock out a rectangle from the picture,
794 00:46:51,680 --> 00:46:54,730 and then you reassemble the thing.
795 00:46:54,730 --> 00:46:56,876 And if you hadn't knocked that piece out,
796 00:46:56,876 --> 00:46:58,750 when you reassemble it, it would look exactly
797 00:46:58,750 --> 00:47:00,760 like it did when you started.
798 00:47:00,760 --> 00:47:03,370 So what we're doing is we knock out as much as we can
799 00:47:03,370 --> 00:47:05,827 and still retain the neural net's impression
800 00:47:05,827 --> 00:47:08,160 that it's the thing that it started out thinking it was.
801 00:47:08,160 --> 00:47:09,575 So what do you think this is?
802 00:47:13,640 --> 00:47:17,310 It's identified by a neural net as a railroad car
803 00:47:17,310 --> 00:47:21,589 because this is the image that it started with.
804 00:47:21,589 --> 00:47:22,380 How about this one?
805 00:47:22,380 --> 00:47:23,370 That's easy, right?
806 00:47:23,370 --> 00:47:25,100 That's a guitar.
807 00:47:25,100 --> 00:47:28,090 We weren't able to mutilate that one very much and still retain
808 00:47:28,090 --> 00:47:30,830 the guitar-ness of it.
809 00:47:30,830 --> 00:47:32,320 How about this one?
810 00:47:32,320 --> 00:47:33,029 AUDIENCE: A lamp?
811 00:47:33,029 --> 00:47:34,361 PATRICK H. WINSTON: What's that?
812 00:47:34,361 --> 00:47:35,020 AUDIENCE: Lamp.
813 00:47:35,020 --> 00:47:35,250 PATRICK H. WINSTON: What?
814 00:47:35,250 --> 00:47:36,190 AUDIENCE: Lamp.
815 00:47:36,190 --> 00:47:37,330 PATRICK H. WINSTON: A lamp.
816 00:47:37,330 --> 00:47:38,067 Any other ideas?
817 00:47:38,067 --> 00:47:38,983 AUDIENCE: [INAUDIBLE].
818 00:47:38,983 --> 00:47:40,280 AUDIENCE: [INAUDIBLE].
819 00:47:40,280 --> 00:47:42,321 PATRICK H. WINSTON: Ken, what do you think it is?
820 00:47:42,321 --> 00:47:43,157 AUDIENCE: A toilet.
821 00:47:43,157 --> 00:47:45,490 PATRICK H. WINSTON: See, he's an expert on this subject.
822 00:47:45,490 --> 00:47:46,880 [LAUGHTER]
823 00:47:46,880 --> 00:47:50,480 It was identified as a barbell.
824 00:47:50,480 --> 00:47:51,290 What's that?
825 00:47:51,290 --> 00:47:52,206 AUDIENCE: [INAUDIBLE].
826 00:47:52,206 --> 00:47:53,450 PATRICK H. WINSTON: A what?
827 00:47:53,450 --> 00:47:54,340 AUDIENCE: Cello.
828 00:47:54,340 --> 00:47:55,730 PATRICK H. WINSTON: Cello.
829 00:47:55,730 --> 00:47:59,361 You didn't see the little girl or the instructor.
830 00:47:59,361 --> 00:48:00,152 How about this one?
831 00:48:00,152 --> 00:48:01,330 AUDIENCE: [INAUDIBLE].
832 00:48:01,330 --> 00:48:01,830 PATRICK H. WINSTON: What?
833 00:48:01,830 --> 00:48:02,830 AUDIENCE: [INAUDIBLE].
834 00:48:02,830 --> 00:48:03,788 PATRICK H. WINSTON: No.
835 00:48:07,205 --> 00:48:08,630 AUDIENCE: [INAUDIBLE].
836 00:48:08,630 --> 00:48:10,680 PATRICK H. WINSTON: It's a grasshopper.
837 00:48:10,680 --> 00:48:11,390 What's this?
838 00:48:11,390 --> 00:48:12,330 AUDIENCE: A wolf.
839 00:48:12,330 --> 00:48:13,871 PATRICK H. WINSTON: Wow, you're good.
840 00:48:15,870 --> 00:48:17,693 It's actually not a two-headed wolf.
841 00:48:17,693 --> 00:48:20,000 [LAUGHTER]
842 00:48:20,000 --> 00:48:23,438 It's two wolves that are close together.
843 00:48:23,438 --> 00:48:24,694 AUDIENCE: [INAUDIBLE].
844 00:48:24,694 --> 00:48:26,402 PATRICK H. WINSTON: That's a bird, right?
845 00:48:26,402 --> 00:48:27,775 AUDIENCE: [INAUDIBLE].
846 00:48:27,775 --> 00:48:29,150 PATRICK H. WINSTON: Good for you.
847 00:48:29,150 --> 00:48:29,837 It's a rabbit.
848 00:48:29,837 --> 00:48:32,194 [LAUGHTER]
849 00:48:32,194 --> 00:48:32,819 How about that?
850 00:48:32,819 --> 00:48:33,819 [? AUDIENCE: Giraffe. ?]
851 00:48:36,040 --> 00:48:38,362 PATRICK H. WINSTON: Russian wolfhound.
852 00:48:38,362 --> 00:48:39,278 AUDIENCE: [INAUDIBLE].
853 00:48:46,415 --> 00:48:48,290 PATRICK H. WINSTON: If you've been to Venice,
854 00:48:48,290 --> 00:48:49,314 you recognize this.
855 00:48:49,314 --> 00:48:51,920 AUDIENCE: [INAUDIBLE].
856 00:48:51,920 --> 00:48:54,230 PATRICK H. WINSTON: So the bottom line
857 00:48:54,230 --> 00:48:55,960 is that these things are an engineering
858 00:48:55,960 --> 00:49:00,536 marvel and do great things, but they don't see like we see.
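For readers who want to reproduce the kind of mutilated pictures used in this guessing game: they come from the band-knockout procedure Winston described before the guessing began. His lab's actual code isn't shown in the lecture, so what follows is only a rough sketch of the idea, assuming NumPy and SciPy with a difference-of-Gaussians band decomposition; the band widths and the knocked-out rectangle are stand-in choices.

# A rough sketch of the frequency-band knockout procedure -- not the
# lab's actual code.  The bands sum back to the original image exactly,
# so zeroing a rectangle in one band is the only change introduced.
import numpy as np
from scipy.ndimage import gaussian_filter

def band_slices(img, sigmas=(1, 2, 4, 8)):
    """Split a grayscale image into band-pass slices plus a low-pass residual."""
    blurred = [img] + [gaussian_filter(img, s) for s in sigmas]
    bands = [blurred[i] - blurred[i + 1] for i in range(len(sigmas))]
    return bands + [blurred[-1]]

def knock_out(img, band_idx, rows, cols):
    """Zero a rectangle in one frequency band, then reassemble the image."""
    bands = band_slices(img)
    r0, r1 = rows
    c0, c1 = cols
    bands[band_idx][r0:r1, c0:c1] = 0.0
    return sum(bands)                   # unmodified bands restore img exactly

# Hypothetical usage on a 256x256 grayscale array img scaled to [0, 1]:
# mutilated = knock_out(img, band_idx=1, rows=(64, 192), cols=(64, 192))
# Feed `mutilated` to the classifier, and keep enlarging the rectangle for
# as long as the predicted class stays the same, as in the slides.

Repeating that greedy loop over bands and rectangles yields images like the railroad car: almost nothing a person can recognize, yet enough for the net to keep its original label, which is the lecture's closing point.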