The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

RYAN ALEXANDER: All right, so as this XKCD comic points out, in CS it can be very difficult to figure out when something is just really hard or when something is virtually impossible. And until a couple of years ago, people thought image classification was something closer to the impossible side. But with the advent of deep learning technology, we've made significant strides in image classification, and now the problem is actually quite practical. So today we'll be going through how the process of image classification with deep learning works. We're first going to talk about what deep learning is, and then we'll move into some of the image processing techniques that researchers use, followed by the architecture of convolutional neural networks, which will be the main focus of our presentation. We'll also talk about the training process, and then go through some results and limitations of CNNs in image classification.

So what is deep learning? Well, the term is particularly vague, and purposely so, for a couple of reasons. The first is that mystery is always good for marketing. But the second reason is that deep learning refers to a pretty wide range of machine learning algorithms. They do have some commonalities: they all seek to solve problems of a complexity that previously people thought only humans could solve. So these are more sophisticated classification problems than conventional machine learning algorithms can handle.

So how do they go about doing this? Well, all of these deep learning programs tend to take all the processes that need to happen and split them up.
They've got different parts of the program working on different things, all performing calculations, and then at the end it all comes together and we get a result. Of course, this isn't unique to deep learning, and lots of distributed systems decentralize their calculations. But the key thing about deep learning is that every part is performing these calculations, and the calculations are not simple ones. It's not a matter of doing one simple operation over and over again on a lot of data and then getting a result at the end. Each part is performing some particularly complicated process on all the little parts before they come together.

So why is this architecture a good idea? Why did engineers come up with this sort of decentralized, multi-layered, complex process? Well, take the example of image classification. It turns out that the human brain does a pretty similar process. So here's the human visual system, and it's pretty much a hierarchical process. You begin by moving from the retina into the first areas of the brain, and as the information gets processed, it moves from one region of the brain to another, and each spatial element of your brain is performing an entirely different calculation. For example, the V1 area over here is picking out edges and corners, and then over here, a couple of steps later in V4, you're starting to group those figures together. And so the brain operates in a way that is very similar to the way these networks operate.

So let's talk about how to classify a face. If I asked you how you would classify a face, what is the first thing you might do? Well, as I mentioned before, the first thing our brain does is find these edges. The first thing to do is identify where the face is versus everything else. Now, does anyone have any idea as to what we could do as the next step?

Julian, you have an idea?

AUDIENCE: Maybe you could group these edges together.

RYAN ALEXANDER: Right.
We could maybe identify some of these features that we're working with. So these are things like noses, and lips, and eyes. And then what do we do after we have these individual features?

Steve.

AUDIENCE: Well, maybe we can group some of those together.

RYAN ALEXANDER: Exactly. Yeah, we can organize them into what we know the pattern to be. We know that a face has to have two eyes above a nose, and a nose above the mouth. So that is precisely what a neural network actually ends up doing, and we'll walk through the process of how it does this later on in the talk. But as you can see, the intuitive way that we classify a face, and the way our brains are wired to do it, is pretty similar to the way we got these neural networks to operate.

So like I said, we're going to be talking a lot about these convolutional neural networks. There are other types of architectures involved; like we mentioned before, deep learning covers a pretty wide variety of algorithms, but we're going to focus on CNNs. To give you a sense of how good these CNNs are, here are results from the ImageNet competition. The ImageNet competition is basically exactly what it sounds like: a bunch of computer scientists get together and see how many images they can correctly classify. And the error rate was pretty high, almost a one-third error rate over here in 2010 and 2011, and then when CNNs were introduced in 2012, the error rate plummeted. As you can see over here in 2015, we've got a significant improvement in these ImageNet competitions. So clearly, CNNs have been very effective, and it's definitely something exciting that is happening in the field right now.

All right, so now we're going to move into image processing.

ISHWARYA ANANTHABHOTLA: OK, so Ryan gave us a nice overview of where we get this concept of neural networks, but let's take a time travel and go into a quick history lesson.
So suppose I had a chair, and I wanted the computer to classify this chair. I have some a priori knowledge about what sorts of things make up a chair, so I might be interested in looking at arms, corners of the chair, legs, things like that. So I would go ahead and feature-engineer my discovery scheme to look for specific things. I'm going to talk about some techniques that are traditionally used. For example, chairs and doors have corners, so I might use an image processing technique called a Harris corner detector, where we basically look for large changes in intensity as a window of pixels moves around an image, which indicates the presence of corners, and you can use common corners to say, OK, all of these images are chairs, or doors, or whatever. Similarly, say I have a bunch of pictures of chairs of different sizes, but they all must have so many corners or something. Then typically we'd use the SIFT algorithm, the scale-invariant feature transform, which basically says that across different sizes, I should still be able to extract information about the placement of corners.

Another common technique used in image processing is what we call HOG, the histogram of oriented gradients. So basically, for example, if I want to find all the images that have faces in them, or consist of faces, let's say, I might come up with a template of a face that assigns gradients to groups of pixels forming the outline of what looks like a face, then scan it across my sample images and say, OK, a face is present in this image. Obviously, there are some errors; a cap and a logo back here have been detected as faces, but this is the traditional approach.
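To make this hand-engineered style of feature extraction a little more concrete, here is a minimal sketch of Harris corner detection using OpenCV; the file path and the threshold are placeholder choices, not anything from the lecture.

```python
import cv2
import numpy as np

# Load an image and convert it to grayscale (the path is a placeholder).
img = cv2.imread("chair.jpg")
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# Harris corner response: large where intensity changes sharply in every
# direction as a small window slides over the image.
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# Keep only strong corners (threshold chosen arbitrarily for illustration).
corners = np.argwhere(response > 0.01 * response.max())
print(f"Found {len(corners)} corner pixels")
```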
But here's the problem: what if I don't actually know which features are the most critical for the dataset that I get? I want the system itself to figure out what techniques to apply, without having any a priori knowledge about the dataset.

So this is exactly the idea of CNNs, convolutional neural networks. We want the techniques to be learned automatically by the system. So if I'm trying to classify faces, I want the system to figure out that eyes, and ears, and noses are the most important things. Or if I'm trying to classify elephants, that the ears and trunks are the critical features, without me having to say, OK, we're going to do corner detection, and so on and so forth. So this is the idea.

To be able to understand this process in greater detail, I'm first going to go into a little bit of math, and the idea is to present the most fundamental operation here, which is the convolution. So this is the formal definition of the two-dimensional convolution, and since we're working with images, we're only considering the two-dimensional case. In a more graphical presentation, which is a little bit easier to understand than just seeing the formula, the idea is that we have a kernel, or a convolutional filter, that we apply to an image, and that extracts some information about the image that we can use to help us classify it.

So assume that this is our kernel, or this is our filter, and suppose this-- oh, there it is. So suppose we're applying the kernel [INAUDIBLE] here to the image that's in green. The idea is that we want to slide this filter across the image, and what we're basically doing is a succession of dot products. At each placement on the image, we multiply the overlaid numbers, and the sum becomes the corresponding entry of the convolved output. So this is basically the way the process works. You probably noticed that there's a reduction in dimension, and Henry will talk a little bit more about why this is.
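Here is a minimal NumPy sketch of that sliding dot product, a "valid" two-dimensional convolution; the toy image and kernel values are made up for illustration. (As in most CNN libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.)

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image`, taking a dot product at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h, out_w = ih - kh + 1, iw - kw + 1   # the reduction in dimension
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product of patch and kernel
    return out

# Example: a 3x3 averaging (blur) kernel applied to a small 5x5 image.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0
print(convolve2d_valid(image, kernel))  # 3x3 output: (5 - 3 + 1) in each dimension
```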
So let's see some examples of what information we get by applying a convolution. You see the image of a tiger on the top left. When we apply a filter that's a low-pass filter, basically a Gaussian, then we get the low spatial frequency information about this image. So basically we've blurred it, and this tells us something specific that we might want to learn. The kernel actually looks like a two-dimensional Gaussian function that's been sampled across this three-by-three grid.

Similarly, we might be interested in high spatial frequency information. So in this case, we're looking at sharp features: horizontal edges or vertical edges. So a question for you: if I have this kernel, which of these outputs do you think it produced when applied to the original image?

AUDIENCE: The third one on the right.

ISHWARYA ANANTHABHOTLA: Yeah, that's exactly right, and it's probably pretty easy to see why that's the case, given that the nonzero numbers form horizontal bands here.

Lastly, we may also be interested in extracting information at a particular frequency, so we can take the difference of a high-pass filter and a low-pass filter to get a band-pass filter tuned to that frequency, and extract information about it as well.

OK, one last helpful piece of information is that there's another way you can think about the information that's learned at each stage, because a convolution in the image domain corresponds to a multiplication in the frequency domain, so you can also think of the image transformation through the Fourier transform. From an image perspective, what a Fourier transform gives you is a sum of sinusoidal gratings that differ in frequency, in orientation, in amplitude, and in phase. So you can think of the zebra image here as actually a composite of different gratings that might look like this, and the Fourier coefficients would be how much of each of these pieces comes together to make that final image.

So just to get a sense of what kind of information this could convey, we typically take a Fourier transform and break it apart into magnitude and phase representations. So you see the magnitude, and you see the phase. Those images weren't particularly clear, but this is a really good example: if we take the Fourier transform of all the horizontal text here, you see how the magnitude reflects this, and you can go back to the math to understand why it's reflected as a vertical marking. And similarly, if I were to take that same image, rotate it, and then ask for the Fourier transform, you see how that information is contained very clearly in the magnitude spectrum. So these might be things that a network would learn at each stage to try to identify this as text, or as a body of text that's tilted one way or the other, and so on and so forth.
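As a small illustration of that magnitude and phase decomposition (the slide images themselves aren't reproduced here), this is a NumPy sketch on a synthetic striped image standing in for the block of horizontal text.

```python
import numpy as np

# A tiny synthetic "image" of horizontal stripes, standing in for the text example.
img = np.zeros((64, 64))
img[::4, :] = 1.0               # horizontal bands -> energy along the vertical frequency axis

F = np.fft.fftshift(np.fft.fft2(img))   # 2D FFT, zero frequency moved to the center
magnitude = np.abs(F)                   # "how much" of each sinusoidal grating
phase = np.angle(F)                     # "where" each grating sits

# Rotating the image rotates the magnitude spectrum by the same angle,
# which is the effect described for the tilted block of text.
print(magnitude.shape, phase.shape)
```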
So with that, we can now go into what the actual architecture of a convolutional neural network is.

HENRY NASSIF: All right, so as was said earlier, in order to classify or detect objects, you actually need certain features, and you need to be able to identify those features. And the way you can identify these features is by using certain convolutions, or certain filters. In many cases, we don't know what these features are, and as a result, we don't actually know what filters would extract them. What convolutional neural networks allow us to do is determine what these features are, and also determine what the filters are that extract them.

Now, the idea for convolutional neural networks, or the idea of replicating how the brain works, started in about the 1950s and 1960s after some experiments by Hubel and Wiesel. What happened in these experiments, as can be seen here, is that a cat was shown a band of light at different angles, and the neural activity of the cat was measured using an electrode.
And the outcome from this experiment showed that, based on the angle at which the light was shown, the neural response of the cat was different. As you can see here, the number of neurons firing, as well as which neurons were firing, differed depending on the angle. What you can also see here is a plot of the response versus the orientation of the light. And what this led Hubel and Wiesel to is the idea that neurons in the brain are organized in a certain topographical order, and each one fills a specific role and only fires when its specific input is shown, in this case, when the light is at its preferred angle.

Now, the first step to actually replicating how the brain works in code is really understanding how the building block, the neuron, works. Here's a quick reminder of 7.012. A neuron is a cell with dendrites, a nucleus, an axon, and an axon terminal. What the neuron does is aggregate the action potentials, or the inputs, that it gets from all the neighboring neurons connected to it through its dendrites; it sums these action potentials, compares the sum to a certain threshold that it has internally, and that determines whether or not the neuron fires an action potential of its own. And that very simple idea can actually be replicated in code.

An artificial neuron looks very much like a natural one. What you have is a set of inputs; here we have three inputs that are summed inside of a cell, or a neuron. The sum here is not just a regular sum, it's a weighted sum. The neuron specifies a weight for each input, which you can think of as how much it values the input coming from a specific neuron. Each input is multiplied by its weight, and the total sum that the neuron computes is then fed into an activation function that produces the neuron's output.
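Here is a minimal sketch of that artificial neuron in NumPy: a weighted sum of the inputs passed through an activation function. The input, weight, and bias values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """One artificial neuron: weighted sum of the inputs, then an activation."""
    z = np.dot(weights, inputs) + bias   # weighted sum (a dot product)
    return activation(z)

# Three inputs and three weights, as in the slide's example (values are placeholders).
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
print(neuron(x, w, bias=0.2))
```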
Now, what we just saw is really a simple neuron, a single neuron. You can't really do much with just one neuron, so what you would do is combine these neurons in a certain topology. In this case, we have a network with seven neurons organized in three different layers, and you can think of that as really one big neuron with 12 inputs and one output. So for example, in the case of the chair that was previously mentioned, if you're trying to identify whether a specific image has a chair in it or not, these 12 inputs could be some sub-images, or some small areas of the initial image, that you feed into the network, and the output could be a yes or a no: whether the image has a chair or doesn't have a chair. And that is really the concept behind convolutional neural networks, which we'll go into in detail in a bit.
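As a rough sketch of stacking single neurons into layers, here is a tiny fully connected network in NumPy. The layer sizes loosely mirror the slide's example of seven neurons in three layers with 12 inputs and one output, but the weights are random placeholders, not anything from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of all inputs, then an activation."""
    return sigmoid(weights @ inputs + biases)

rng = np.random.default_rng(0)
x = rng.random(12)                        # 12 inputs (e.g., 12 small image patches)

# Three layers of 4, 2, and 1 neurons (7 total); weight shapes are (out, in).
w1, b1 = rng.normal(size=(4, 12)), np.zeros(4)
w2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
w3, b3 = rng.normal(size=(1, 2)), np.zeros(1)

h1 = layer(x, w1, b1)
h2 = layer(h1, w2, b2)
out = layer(h2, w3, b3)                   # single output: a "chair" vs "not chair" score
print(out)
```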
So what each neuron would be doing in that case is really just performing a dot product, and if you aggregate that with the dot products computed by each of the other neurons, you obtain a convolution. So what we have here is three inputs. If the input, in this case, is an image or a sub-image, then the inputs would be pixels. The weights you would be using here would be the filter weights, that is, the filter that you use in the convolution. And then the sum here would be the dot product of the weights and the inputs, and that sum would be computed by a specific neuron in your network. That would be the convolution step, and that convolution step happens at the first layer in the network. You would be applying this to the input, but you also would be applying this at the second layer and the third layer; in this case, we're only showing what happens in the first layer.

The next step after the convolution is the activation step. There's a function that is applied to the dot product computed here, and that function produces the output of the neuron. And this is where the activation layer is; you also have another activation layer here, and then a final one here.

What we just went through are convolutions and activations, but this is not the only thing that actually happens in a neural net. There is also a step called subsampling, which we will be talking about shortly. For now, we will dig deeper into the activation, and specifically, which activation functions to use. In this case, we can see that that's a neuron, and what the neuron is doing here is the weighted sum that we talked about, or the dot product, and the output from this is fed into a certain activation function. Common activation functions are the sigmoid, tanh, and the rectified linear unit, and we will go through each one independently. So here we can see the sigmoid activation function. What this function essentially does is map any input to an output in the range of zero to one, and it's defined as one divided by one plus e to the minus x. The other common activation function is tanh, and that maps any input to an output between minus one and one. And then finally, there's the rectified linear unit, which maps an input to itself if it's positive, or to zero if it's negative.

Now, in theory, you could use any function as an activation function in your network, but that's not what you want to do in practice. You want your activation functions to be non-linear for one main reason: the goal of the activation function is actually to introduce non-linearity into your system. If all your activation functions were linear, then you would essentially have a linear system, which prevents you from achieving the level of complexity that you would ideally want to achieve with a neural network. And there's a formal proof as to why you need non-linear activation functions. They don't all need to be non-linear, but you need to have at least a few non-linear activation functions in your network. The proof is available in the appendix, along with the link to the paper that contains it.
So after we've discussed what happens at the activation layer, now we want to talk about the convolution layer. As I said earlier, an image is obviously two-dimensional, but we're using RGB images, so we actually need three channels. What this means is that an image is actually three-dimensional, and each 2D matrix represents one channel: one corresponding to R, one corresponding to G, and one corresponding to B. So a 32 by 32 image would essentially be represented by a 32 by 32 by 3 matrix, as can be seen here.

So what happens at the convolution layer? Here we have a nice animation that shows what is happening at each convolutional layer. Assume we have a 5 by 5 by 3 filter. What this would essentially be doing is covering a certain patch of the original image, which is 32 by 32 by 3. So what you can see here is that for that 5 by 5 by 3 patch in the original image, we have a neuron that is performing a dot product on all the pixels in that specific patch. So what is happening here is that the pixel values, which in this case are 5 by 5 by 3 pixels, are being multiplied by the filter values, and this operation is being performed here. Then, after that dot product is performed, it's fed into an activation function, as can be seen here, and this produces the output of this neuron.

Now, this is what a single neuron is doing: it's just covering that one area of the original image. What you would have in a neural net is many neurons, each covering a certain area of the original image, and if you aggregate the outputs of all of these neurons, what you would be performing is, essentially, a convolution on the original image.

And to formalize what happens here, or what output is being produced by that operation, we can look at it from a more mathematical perspective.
So if you have an input of size H1 by W1 by D1, and you're performing a convolution with a filter, then the output width W2 is related to W1 by the following formula: W2 equals W1 minus the filter width, plus one. The same formula applies for the height, and the depth of the output from a single filter is just one, because we're using a filter that has the same depth, three, as the original image.

So what this would produce in aggregate: if you have 28 by 28 by 1 neurons, each one performing a dot product on some patch of pixels in the original image, the output would be an activation map of size 28 by 28 by 1, and the output of each neuron would be one pixel in the activation map.

Now, if we go back to the points we made earlier, one thing we said was that the reason you use a neural network is because you don't know exactly what features you want to extract, and you don't actually have specific filters that you want to apply to the image. So ideally, what you want to do is have multiple filters being applied to the input image, and perform multiple convolutions, and this is what you can do with multiple neuron layers. What we described before was just one neuron layer. In this case, we can assume we have five different neuron layers, each one performing a different convolution on the original image. So we would have 28 by 28 by 1 neurons per layer, and if we aggregate all these neurons together, we need to multiply by five, and that would be the total number of neurons we have in that specific network.

So this actually leaves us with a pretty complicated system. It has many parameters: the neurons have weights, and the number of neurons is itself a parameter. So how do we actually formalize that? If we have an input volume of 32 by 32 by 3, which is our original image, and a filter size of 5 by 5 by 3, then the size of the activation map that would be produced is 28 by 28 by 1. In this case, we also said we have five different neuron layers that perform five different convolutions, so the total number of neurons is 28 by 28 by 5, and the weights per neuron are 5 by 5 by 3, which is 75. Here we're assuming that the neurons independently keep track of their own weights; this could be simplified to each layer sharing one set of weights, which would tremendously reduce the number of parameters. But in this case, just to get an upper bound, this leaves us with a total of 294,000 parameters. And this is just using a 32 by 32 by 3 image, which you can think of as a pretty small image. So if you have a bigger image, you will have many more parameters.
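Here is a short sketch that just reproduces the arithmetic above: the output size of a "valid" convolution and the unshared-weight parameter count for the 32 by 32 by 3 example.

```python
def conv_output_size(w1, h1, filter_w, filter_h):
    """'Valid' convolution: W2 = W1 - filter width + 1 (and the same for height)."""
    return w1 - filter_w + 1, h1 - filter_h + 1

w2, h2 = conv_output_size(32, 32, 5, 5)
num_filters = 5                       # five "neuron layers" (filters) in the example
weights_per_neuron = 5 * 5 * 3        # each neuron sees a 5x5x3 patch -> 75 weights
num_neurons = w2 * h2 * num_filters   # 28 * 28 * 5

# Upper bound where every neuron keeps its own weights, as in the lecture.
print(w2, h2)                              # 28 28
print(num_neurons * weights_per_neuron)    # 294000
```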
Great. So what we just saw and described were convolutions and activations, and these steps happen sequentially in a convolutional neural network, specifically as can be seen here. One step that also happens occasionally is subsampling, and we'll discuss that step in detail here.

There are two main reasons why you would actually subsample your input. One is obviously to reduce the size of your input and your feature space, but the other is that you want to keep the most important information and get rid of everything else that you don't think is going to be relevant to your classification. The common methods used in subsampling are either max pooling or average pooling; we will describe max pooling here.

So what happens in max pooling is, essentially, you divide the image into different non-overlapping sub-images, and you perform a max operation on each. So in this case, if we consider two by two filters, we would split the image, which in this case is four by four, into four sub-images, and for each two by two square, we would take the maximum. In this case, for the first square it would be six, then eight, then three, then four.

And the reason that actually works is because what you want to do is really keep track of the highest response produced by your neurons. In this case, for example, the highest response in the first square is six, and getting that high a response means that something has been detected in that part of the image, and this is something you want to keep track of as you move forward in your network. And although this moves around the location of pixels, because you can think of it as subsampling the image, it does keep the information you care about: at this point you only care about the fact that something has been detected in the image, not exactly where it's located, and you want to keep track of all the features your neurons have detected in order to eventually classify the input correctly.

So if you have multiple feature maps, in this case 224 by 224 by 64, what your subsampling operation does is reduce the height and the width, while the depth remains unchanged. So in this case, you would go from 224 by 224 by 64 to 112 by 112 by 64, which reduces your output size by a factor of four. And formally, if you have an input of size H1 by W1 by D1, the size of your output is related to your input in the following way: W2 is W1 minus the pool width, divided by the stride, plus one, so with non-overlapping two by two pooling, the width is halved. The same applies for H2, and the depth remains unchanged.
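A minimal NumPy sketch of two by two, non-overlapping max pooling on a four by four example; the matrix values are invented for illustration, chosen so the four block maxima come out to 6, 8, 3, and 4 as in the slide.

```python
import numpy as np

def max_pool(x, size=2):
    """Non-overlapping max pooling: split x into size-by-size blocks, keep each block's max."""
    h, w = x.shape
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A 4x4 toy feature map whose 2x2 block maxima are 6, 8, 3, 4.
x = np.array([[1, 6, 2, 8],
              [5, 3, 7, 1],
              [3, 2, 1, 0],
              [1, 2, 4, 3]], dtype=float)
print(max_pool(x))   # [[6. 8.]
                     #  [3. 4.]]
```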
So these are, essentially, the steps that happen in a convolutional neural network, and what you could be doing is repeating these steps a certain number of times in your network. But eventually, you have to make a classification and decide, in our case, whether our image has a chair or doesn't have a chair. So how does that happen?
So after you perform all these steps, there's a step that happens here that allows you to make that prediction, and that step is usually called a fully connected layer, or a multi-layer perceptron. What this essentially is is a set of layers that are very similar to, or exactly the same as, what you had before, except that every neuron in the layer is connected to all the neurons in the previous layer. What this allows you to do is consider everything you currently have about your input, or everything that's left of your input, and compute a dot product on all of it, rather than focusing on a subsample of your input like the previous layers do.

In this case, if you're actually trying to classify your input into four classes, you would ideally have four different neurons in your output layer, each one corresponding to one of your classes, and you would perform the same operation as in a previous layer and compute the dot product. Once you obtain the values at every output neuron, you perform a normalization operation on the outputs. This normalization operation is called softmax, or the normalized exponential, and what it does is put more weight on the highest value. By computing the softmax at the output, you're able to compute the posterior probabilities, which allows you to make a more informed, or basically to make a classification decision on your input.
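Here is a small NumPy sketch of the softmax (normalized exponential) over four hypothetical output-neuron values; the scores are made up for illustration.

```python
import numpy as np

def softmax(scores):
    """Normalized exponential: exponentiate, then divide by the sum, so the
    outputs are positive, sum to one, and the largest score gets most of the weight."""
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])  # raw outputs of four class neurons
probs = softmax(scores)
print(probs, probs.sum())                 # class "probabilities", summing to 1
```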
Great. So that's everything, and the next step will be talking about backpropagation.

ISHWARYA ANANTHABHOTLA: OK. So now that Henry has given us an overview of the entire architecture of a CNN, I'm going to quickly spend some time and talk about standard preprocessing tricks and tips that people might use on an image dataset before they actually feed it through a neural net to classify the images.

So let's suppose we have a dataset X, and there are N data points in the dataset, and each point has dimension D, so there are D features per point. In this example, we use these graphs as an illustration: our original data here has just two dimensions, and it spans this range of values. So for example, if we want to center this data, what we would do is mean subtraction. We basically subtract the mean of each feature across all the points, which centers the data, and you can see that transformation here. Then we might also normalize each dimension, so that the data points span the same range of values in both dimensions; you can see that transformation, and how it's taken place, here. We just divide by the standard deviation to do this.

Something else that's very commonly done is called PCA, or principal component analysis. The idea here is that sometimes we have a dataset with a very, very high dimensionality, and we would like to reduce that dimensionality. So basically, our goal is to project the higher-dimensional space onto a lower-dimensional space by keeping a subset of the principal directions. And if you've seen a little bit of 18.06, linear algebra, the way we do this is by generating a covariance matrix and then doing the singular value decomposition. I'll gloss over the math for now, but that's the idea. And you can see here how the original data spanned two dimensions; we decorrelate it so that it can be reduced to a single dimension. And even with this data, you might want to ensure that it's whitened, which is the same deal: you want the values to span the same range in every dimension. So then you would just divide by your eigenvalues to get the whitened data.
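Here is a rough NumPy sketch of those preprocessing steps (mean subtraction, normalization, PCA decorrelation, and whitening) on a random placeholder dataset; the shapes and the small epsilon constant are assumptions, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])  # N=500 points, D=2 features

# Mean subtraction: center every feature at zero.
X -= X.mean(axis=0)

# Normalization: make every feature span a comparable range.
X_norm = X / X.std(axis=0)

# PCA: covariance matrix, then SVD, then rotate the data onto the principal axes.
cov = (X.T @ X) / X.shape[0]
U, S, _ = np.linalg.svd(cov)
X_decorrelated = X @ U                    # decorrelated; keep only the top columns to reduce dimension

# Whitening: scale each principal component by the square root of its eigenvalue.
X_white = X_decorrelated / np.sqrt(S + 1e-8)
print(X_white.std(axis=0))                # roughly 1 in every direction
```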
This last bit is something that's very commonly done as a preprocessing trick, though people aren't entirely sure why it works so well, or that it really does help, but it's something that people do, and it's called data augmentation. Basically, if I have a dataset that contains a bunch of images of chairs, a bunch of images of tables, and then a bunch of images of, say, trees, I might want to intentionally augment that dataset further by introducing a few variations on these same images. So I might take the chair image, rotate some copies, reflect a few more, scale, crop, or remap the color space, or just have a process that does this randomly to create more variation in the same dataset.

And this is a good illustration of why this makes a difference. I've taken an image here of what looks like a waterfall, or some spot of nature, and simply inverted the colors. If I were to see just this image alone, it maybe looks like a curtain, or a bit of texture, or something. The idea is that even to human perception, these two images have very different meanings, and so it's interesting to see what effect they would have on a neural network.
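Here is a minimal sketch of that kind of random augmentation using NumPy array operations on an image already loaded as an array; the specific transforms and probabilities are illustrative choices, not the lecture's recipe.

```python
import numpy as np

def augment(image, rng):
    """Randomly flip, rotate by a multiple of 90 degrees, and occasionally invert colors.

    `image` is an (H, W, 3) array with values in [0, 255]; this is just a small
    subset of the variations mentioned (reflection, rotation, color-space changes).
    """
    if rng.random() < 0.5:
        image = image[:, ::-1, :]               # horizontal reflection
    image = np.rot90(image, k=rng.integers(4))  # rotate 0/90/180/270 degrees
    if rng.random() < 0.2:
        image = 255 - image                     # color inversion, like the waterfall example
    return image

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64, 3))        # placeholder "image"
augmented = [augment(original, rng) for _ in range(10)]  # ten extra variants
```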
754 00:33:20,480 --> 00:33:24,150 So this just goes to show that these convolutional neural 755 00:33:24,150 --> 00:33:26,646 networks are the state of the art when it comes to image 756 00:33:26,646 --> 00:33:30,740 classification, and that's why we're currently 757 00:33:30,740 --> 00:33:31,770 focusing on that. 758 00:33:31,770 --> 00:33:33,820 But you might be wondering what the ImageNet 759 00:33:33,820 --> 00:33:37,340 competition exactly looks like, what the images look like. 760 00:33:37,340 --> 00:33:39,410 So why don't we take a look at that. 761 00:33:39,410 --> 00:33:41,360 As you can see here, these are images 762 00:33:41,360 --> 00:33:43,310 from the ImageNet competition. 763 00:33:43,310 --> 00:33:45,830 Underneath each image is a bold caption, which 764 00:33:45,830 --> 00:33:47,390 is considered to be the ground truth, 765 00:33:47,390 --> 00:33:50,950 or what the competition believes to be 766 00:33:50,950 --> 00:33:55,190 the correct classification of the image. 767 00:33:55,190 --> 00:33:56,850 Underneath that ground truth, you 768 00:33:56,850 --> 00:33:58,630 see a list of five different labels. 769 00:33:58,630 --> 00:34:00,770 Now, these five labels are produced 770 00:34:00,770 --> 00:34:03,120 by a convolutional neural network, 771 00:34:03,120 --> 00:34:05,390 and the different bars-- 772 00:34:05,390 --> 00:34:08,600 the bars of different lengths, some pink and others blue, 773 00:34:08,600 --> 00:34:11,179 represent how confident the CNN is 774 00:34:11,179 --> 00:34:14,989 that what it sees in that image is that specific label. 775 00:34:14,989 --> 00:34:16,850 As you can see in certain examples, 776 00:34:16,850 --> 00:34:21,469 the CNN is pretty confident that it has the correct answer. 777 00:34:21,469 --> 00:34:23,750 For example, when we look at the container ship, 778 00:34:23,750 --> 00:34:26,540 it's pretty confident that what's in that image is exactly 779 00:34:26,540 --> 00:34:27,650 a container ship. 780 00:34:27,650 --> 00:34:30,350 There are certain cases when it doesn't get the correct label 781 00:34:30,350 --> 00:34:34,070 on its first try, but it does have it in its top five labels. 782 00:34:34,070 --> 00:34:36,010 For example, you can see grill and mushroom. 783 00:34:36,010 --> 00:34:38,870 Now, the funny thing about the mushroom image 784 00:34:38,870 --> 00:34:41,750 is that what it thinks the image should 785 00:34:41,750 --> 00:34:43,658 be classified as is agaric. 786 00:34:43,658 --> 00:34:46,199 And if you don't know, agaric is actually a type of mushroom, 787 00:34:46,199 --> 00:34:48,530 and in fact, it's the mushroom in that image. 788 00:34:48,530 --> 00:34:50,510 And it makes sense that their confidence levels 789 00:34:50,510 --> 00:34:51,590 are pretty much the same. 790 00:34:51,590 --> 00:34:55,909 Agaric is slightly-- it's slightly more specific, saying 791 00:34:55,909 --> 00:34:57,770 that what it sees in the image is agaric. 792 00:35:00,800 --> 00:35:03,140 But there are certain cases when the CNN 793 00:35:03,140 --> 00:35:05,870 fails to classify the image correctly 794 00:35:05,870 --> 00:35:07,200 in its top five labels. 795 00:35:07,200 --> 00:35:09,320 This will be registered as a top five error, 796 00:35:09,320 --> 00:35:12,810 as you just saw in the previous slide about the top five error rate. 797 00:35:12,810 --> 00:35:15,560 One example here on this slide is cherry.
798 00:35:15,560 --> 00:35:18,140 Now, the ImageNet competition believed 799 00:35:18,140 --> 00:35:19,902 that this should be classified correctly 800 00:35:19,902 --> 00:35:21,360 as cherry, even though there's also 801 00:35:21,360 --> 00:35:22,850 a Dalmatian in the background. 802 00:35:22,850 --> 00:35:25,940 The CNN, on the other hand, is pretty confident 803 00:35:25,940 --> 00:35:29,410 that what it sees in this image is the Dalmatian. 804 00:35:29,410 --> 00:35:31,550 But if you look at some of the other results 805 00:35:31,550 --> 00:35:34,640 within the top five, although it doesn't guess cherry at all, 806 00:35:34,640 --> 00:35:36,380 it does guess certain fruits that it 807 00:35:36,380 --> 00:35:38,780 may think look sort of like cherries, 808 00:35:38,780 --> 00:35:40,820 like grape or elderberry. 809 00:35:40,820 --> 00:35:44,510 So the CNN does actually pick up on two 810 00:35:44,510 --> 00:35:47,000 different distinct objects within the image, 811 00:35:47,000 --> 00:35:50,210 but as a result of how it's built, or its training set, 812 00:35:50,210 --> 00:35:52,610 it ends up classifying it as a Dalmatian. 813 00:35:52,610 --> 00:35:54,720 But it goes to show you that CNNs could also 814 00:35:54,720 --> 00:35:56,775 be used not just for image classification, 815 00:35:56,775 --> 00:35:59,987 but also for object detection, which we do not touch 816 00:35:59,987 --> 00:36:02,910 on in this lecture at all. 817 00:36:02,910 --> 00:36:05,460 So I'm not going to go further into that. 818 00:36:05,460 --> 00:36:08,080 Now, this is all fun and all, but what about some real world 819 00:36:08,080 --> 00:36:10,040 applications? 820 00:36:10,040 --> 00:36:13,220 So this is a study that they did at Google with Google Street 821 00:36:13,220 --> 00:36:18,670 View house numbers, where they used a CNN to classify 822 00:36:18,670 --> 00:36:21,637 photographic images of house numbers, as you can see here 823 00:36:21,637 --> 00:36:23,470 from certain examples of these house numbers-- 824 00:36:23,470 --> 00:36:24,500 what they look like. 825 00:36:24,500 --> 00:36:27,100 So what the CNN was tasked with doing 826 00:36:27,100 --> 00:36:31,600 was that it was supposed to recognize the individual digits 827 00:36:31,600 --> 00:36:34,630 within the image, and then understand that it's not 828 00:36:34,630 --> 00:36:36,910 just one digit that it's looking at, 829 00:36:36,910 --> 00:36:38,410 but it's actually a string of digits 830 00:36:38,410 --> 00:36:42,720 connected, and successfully classify it as the correct house 831 00:36:42,720 --> 00:36:44,780 number. 832 00:36:44,780 --> 00:36:47,430 This can be quite challenging, even for humans sometimes, 833 00:36:47,430 --> 00:36:48,930 when the image is quite blurry. 834 00:36:48,930 --> 00:36:54,420 You might not know exactly what the house number is, 835 00:36:54,420 --> 00:36:56,890 but they managed to get the convolutional neural network 836 00:36:56,890 --> 00:37:00,410 to operate around human operator levels. 837 00:37:00,410 --> 00:37:04,420 So that corresponds to around 96% to 97% accuracy, 838 00:37:04,420 --> 00:37:06,270 and what that enables Google to do 839 00:37:06,270 --> 00:37:08,950 is that they can deploy the CNN such 840 00:37:08,950 --> 00:37:12,790 that the CNN automatically extracts the house 841 00:37:12,790 --> 00:37:19,180 numbers from the images online, and uses that to geocode 842 00:37:19,180 --> 00:37:20,516 these addresses.
843 00:37:20,516 --> 00:37:24,190 And it's gotten to a point where the CNN is successfully 844 00:37:24,190 --> 00:37:27,520 able to do this process in less than an hour 845 00:37:27,520 --> 00:37:31,390 for all of the street view house numbers in all of France. 846 00:37:31,390 --> 00:37:35,930 Now, you might be asking where this could be useful. 847 00:37:35,930 --> 00:37:38,560 If you don't have access to a lot of resources 848 00:37:38,560 --> 00:37:40,750 to actually do this geocoding process, where 849 00:37:40,750 --> 00:37:45,670 you match latitude and longitude to street addresses, 850 00:37:45,670 --> 00:37:48,680 then your only resource might actually be photographic images. 851 00:37:48,680 --> 00:37:51,160 So you actually need something, hopefully not human, 852 00:37:51,160 --> 00:37:54,030 but some sort of software that can do this successfully. 853 00:37:54,030 --> 00:37:56,200 And so this is, for example, a place 854 00:37:56,200 --> 00:37:58,000 in South Africa, a bird's eye view. 855 00:37:58,000 --> 00:38:00,670 Not sure if you can exactly see, but there 856 00:38:00,670 --> 00:38:03,459 are these small numbers on top of each of the houses. 857 00:38:03,459 --> 00:38:05,500 All of these numbers were extracted and correctly 858 00:38:05,500 --> 00:38:09,930 classified using this previously seen CNN. 859 00:38:09,930 --> 00:38:15,620 Another example from robotics is recognizing hand gestures. 860 00:38:15,620 --> 00:38:19,120 So obviously, robots come equipped with a lot 861 00:38:19,120 --> 00:38:20,680 of different hardware. 862 00:38:20,680 --> 00:38:22,660 They can sense sounds, and they can also 863 00:38:22,660 --> 00:38:25,330 capture images of their surroundings. 864 00:38:25,330 --> 00:38:28,812 And if you're able to classify what you see-- if the robot is 865 00:38:28,812 --> 00:38:30,400 able to classify what it sees, then it 866 00:38:30,400 --> 00:38:33,280 can actually act upon it, and take certain actions. 867 00:38:33,280 --> 00:38:35,785 That's why it becomes really helpful to successfully 868 00:38:35,785 --> 00:38:37,640 classify the images. 869 00:38:37,640 --> 00:38:40,810 So this is what they did using hand gestures, 870 00:38:40,810 --> 00:38:43,810 where there were five different classes. 871 00:38:43,810 --> 00:38:46,460 Each class corresponds to the number of extended fingers. 872 00:38:46,460 --> 00:38:49,790 So a, b, c, d, the top row, correspond to the same class. 873 00:38:49,790 --> 00:38:51,640 They all have two fingers sticking out. 874 00:38:51,640 --> 00:38:53,620 The bottom row has three fingers sticking out. 875 00:38:53,620 --> 00:38:56,230 So that's another class. 876 00:38:56,230 --> 00:39:00,250 And they got the error rate down all the way to 3%. 877 00:39:00,250 --> 00:39:03,370 So 97% of the time, the convolutional neural net 878 00:39:03,370 --> 00:39:05,650 correctly classified the hand gesture. 879 00:39:05,650 --> 00:39:08,230 And you can use these hand gestures then 880 00:39:08,230 --> 00:39:10,990 to give certain commands to a robot, 881 00:39:10,990 --> 00:39:15,100 and you can train the CNN to act upon something 882 00:39:15,100 --> 00:39:16,360 else besides hand gestures.
883 00:39:16,360 --> 00:39:18,276 For example, if it's in some sort of terrain 884 00:39:18,276 --> 00:39:20,850 and you train it on certain images 885 00:39:20,850 --> 00:39:23,740 that you might find in nature, then it 886 00:39:23,740 --> 00:39:27,490 can take those classifications, and act upon them once it sees, 887 00:39:27,490 --> 00:39:32,230 for example, a tree, or some sort of body of water. 888 00:39:32,230 --> 00:39:35,080 It's all thanks to image classification. 889 00:39:35,080 --> 00:39:38,050 Now, obviously, gestures are not necessarily static. 890 00:39:38,050 --> 00:39:40,390 You could be waving your hand, and so that would 891 00:39:40,390 --> 00:39:42,730 require a temporal component. 892 00:39:42,730 --> 00:39:45,660 So it's not just an image you're looking at, but a video. 893 00:39:45,660 --> 00:39:50,380 And so it follows that we can probably 894 00:39:50,380 --> 00:39:53,050 extend image classification into video classification. 895 00:39:53,050 --> 00:39:57,580 After all, videos are just images with an added component, 896 00:39:57,580 --> 00:39:59,920 specifically time. 897 00:39:59,920 --> 00:40:01,930 Obviously, the added temporal component 898 00:40:01,930 --> 00:40:04,520 comes with a lot of additional complexity. 899 00:40:04,520 --> 00:40:07,517 So we're not going to dive into any of that, but in the end, 900 00:40:07,517 --> 00:40:08,850 it comes down to the same thing. 901 00:40:08,850 --> 00:40:10,900 You extract features from the videos, 902 00:40:10,900 --> 00:40:12,520 and you attempt to classify them using 903 00:40:12,520 --> 00:40:14,180 convolutional neural nets. 904 00:40:14,180 --> 00:40:17,190 So why don't we look at a study done, again, 905 00:40:17,190 --> 00:40:21,390 at Google, where they extracted one million videos 906 00:40:21,390 --> 00:40:26,980 from YouTube, sports videos, with somewhere between 400 907 00:40:26,980 --> 00:40:32,630 and 500 different classes, and they used CNNs to attempt 908 00:40:32,630 --> 00:40:34,360 to classify these videos. 909 00:40:34,360 --> 00:40:36,550 Now, they used different approaches-- 910 00:40:36,550 --> 00:40:38,590 different approaches, different tests, 911 00:40:38,590 --> 00:40:43,360 different types of CNNs that I'm not going to go into. 912 00:40:43,360 --> 00:40:45,795 But as you can see here, these are 913 00:40:45,795 --> 00:40:50,260 certain stills from these videos where the caption highlighted 914 00:40:50,260 --> 00:40:53,770 in blue is what the correct answer should be, 915 00:40:53,770 --> 00:40:56,740 and underneath it, the top five labels 916 00:40:56,740 --> 00:40:59,250 that the convolutional neural network produces. 917 00:40:59,250 --> 00:41:01,990 The one highlighted in green is supposed 918 00:41:01,990 --> 00:41:03,420 to be the correct answer. 919 00:41:03,420 --> 00:41:09,180 So you can see on all of these, it gets it within the top five, 920 00:41:09,180 --> 00:41:11,757 and for the most part, within the top two, 921 00:41:11,757 --> 00:41:14,090 and it's pretty confident when it does get it correctly.
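Since the "top five" error rate keeps coming up in these results, here is a small NumPy sketch of how that metric is typically computed from a classifier's per-class scores. The arrays below are random placeholders, purely to illustrate the calculation.

import numpy as np

# scores: (N, C) class scores for N examples and C classes; labels: (N,) true class indices.
# Random placeholder data, purely for illustration.
scores = np.random.rand(8, 1000)
labels = np.random.randint(0, 1000, size=8)

# For each example, take the indices of the five highest-scoring classes.
top5 = np.argsort(scores, axis=1)[:, -5:]

# An example counts as correct if its true label appears anywhere in those five.
correct = np.any(top5 == labels[:, None], axis=1)

top5_error = 1.0 - correct.mean()
print(f"top five error rate: {top5_error:.2f}")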
922 00:41:14,090 --> 00:41:16,673 Now, when I said that they used different types of classifiers, 923 00:41:16,673 --> 00:41:19,880 some of them were more stacked classifiers, 924 00:41:19,880 --> 00:41:22,970 where they were just trained on stills from these videos, 925 00:41:22,970 --> 00:41:26,120 while others were what they called fusion ones, where they 926 00:41:26,120 --> 00:41:28,865 sort of add the temporal component by fusing 927 00:41:28,865 --> 00:41:34,250 different stills from these videos together. 928 00:41:34,250 --> 00:41:36,680 Now, the current accuracy rate-- 929 00:41:36,680 --> 00:41:38,810 the best one they've achieved so far-- 930 00:41:38,810 --> 00:41:43,550 has been around 80% accuracy within the top five labels. 931 00:41:43,550 --> 00:41:46,190 Now, 80% accuracy is nowhere near what 932 00:41:46,190 --> 00:41:48,020 we saw with the ImageNet classification, 933 00:41:48,020 --> 00:41:50,690 where in 2015, they had managed to get it up 934 00:41:50,690 --> 00:41:56,350 to 98% or 99% accuracy. 935 00:41:56,350 --> 00:42:01,520 But obviously, there's way more complexity involved in this. 936 00:42:01,520 --> 00:42:04,520 So it makes sense that it's not quite there yet. 937 00:42:04,520 --> 00:42:08,090 But it does provide a good benchmark, and something 938 00:42:08,090 --> 00:42:11,310 to improve upon in the future as well. 939 00:42:11,310 --> 00:42:15,740 Now, that being said, convolutional neural networks 940 00:42:15,740 --> 00:42:17,450 do come with certain limitations. 941 00:42:17,450 --> 00:42:19,190 They're not perfect. 942 00:42:19,190 --> 00:42:25,020 And so Julian will now talk about the limitations. 943 00:42:25,020 --> 00:42:26,750 JULIAN BROWN: Thanks, Ali. 944 00:42:26,750 --> 00:42:30,210 So Ali talked about the ImageNet competition, 945 00:42:30,210 --> 00:42:32,540 and talked about how the recent winners have 946 00:42:32,540 --> 00:42:35,180 been convolutional neural nets. 947 00:42:35,180 --> 00:42:40,604 So before, the best was about a 26% top five error rate, 948 00:42:40,604 --> 00:42:42,020 but now they've actually gotten it 949 00:42:42,020 --> 00:42:45,330 down to a 4.9% top five error rate, and that 950 00:42:45,330 --> 00:42:47,920 was-- the winner of that competition, the 2015 one, 951 00:42:47,920 --> 00:42:50,452 was actually Microsoft. 952 00:42:50,452 --> 00:42:52,785 They've got the current state of the art implementation, 953 00:42:52,785 --> 00:42:54,890 and so because it's the ImageNet competition, 954 00:42:54,890 --> 00:42:58,070 that means they can identify exactly 1,000 955 00:42:58,070 --> 00:43:00,770 different categories of images. 956 00:43:00,770 --> 00:43:03,050 So there are a few problems, actually, 957 00:43:03,050 --> 00:43:05,240 with the implementation, or just in general 958 00:43:05,240 --> 00:43:07,300 with convolutional neural nets. 959 00:43:07,300 --> 00:43:09,710 So one of them is that 1,000 categories, well, it 960 00:43:09,710 --> 00:43:10,700 may seem like a lot-- 961 00:43:10,700 --> 00:43:14,012 ImageNet is actually one of the largest competitions-- 962 00:43:14,012 --> 00:43:15,720 but that's not actually that many categories. 963 00:43:15,720 --> 00:43:18,500 So it doesn't contain things like hot air balloons, 964 00:43:18,500 --> 00:43:19,330 for instance. 965 00:43:19,330 --> 00:43:23,180 So these things that children would be able to classify, 966 00:43:23,180 --> 00:43:25,400 the neural nets actually aren't able to, even 967 00:43:25,400 --> 00:43:27,810 in the biggest competition.
968 00:43:27,810 --> 00:43:29,570 And each of these categories also 969 00:43:29,570 --> 00:43:31,940 requires thousands of training images, 970 00:43:31,940 --> 00:43:33,860 whereas you could show a child a couple of 971 00:43:33,860 --> 00:43:36,560 examples of a dog or a cat, and they'd 972 00:43:36,560 --> 00:43:39,740 be able to, generally, get a feel for what a dog or a cat 973 00:43:39,740 --> 00:43:41,000 looks like. 974 00:43:41,000 --> 00:43:43,980 It takes thousands of images per category for the neural 975 00:43:43,980 --> 00:43:46,850 nets to learn, which means that the total number of images 976 00:43:46,850 --> 00:43:49,370 you need to train for the ImageNet competition 977 00:43:49,370 --> 00:43:51,510 is over a million. 978 00:43:51,510 --> 00:43:54,200 And so this leads to very long training times. 979 00:43:54,200 --> 00:43:56,360 Even with all of the heavy optimizations 980 00:43:56,360 --> 00:43:59,300 that Ishwari was telling us about, like how 981 00:43:59,300 --> 00:44:02,480 efficient convolution is, it still 982 00:44:02,480 --> 00:44:07,970 takes weeks to train on multiple parallel GPUs working together 983 00:44:07,970 --> 00:44:10,110 to train the net. 984 00:44:10,110 --> 00:44:12,260 There's actually a more fundamental problem 985 00:44:12,260 --> 00:44:14,500 with neural nets as well. 986 00:44:14,500 --> 00:44:18,410 So here on the left, we have a school bus, some kind of bird, 987 00:44:18,410 --> 00:44:20,174 and an Indian temple. 988 00:44:20,174 --> 00:44:21,840 And all of these images on the left side 989 00:44:21,840 --> 00:44:23,350 are actually correctly identified 990 00:44:23,350 --> 00:44:25,670 by convolutional neural nets. 991 00:44:25,670 --> 00:44:27,980 But when we add this small distortion here 992 00:44:27,980 --> 00:44:32,290 in the middle, which doesn't change any of the images 993 00:44:32,290 --> 00:44:36,230 perceptibly to a human, this actually 994 00:44:36,230 --> 00:44:39,660 causes the neural network to misclassify these images, 995 00:44:39,660 --> 00:44:43,170 and now all three of them are ostriches. 996 00:44:43,170 --> 00:44:44,330 So that's a little weird. 997 00:44:44,330 --> 00:44:46,020 How does this work? 998 00:44:46,020 --> 00:44:47,960 How did we find those distortions? 999 00:44:47,960 --> 00:44:50,750 So here on the left side, we see how a neural network typically 1000 00:44:50,750 --> 00:44:51,250 works. 1001 00:44:51,250 --> 00:44:53,690 You start with some images, you put them 1002 00:44:53,690 --> 00:44:55,970 through the different layers of the neural network, 1003 00:44:55,970 --> 00:44:58,630 and then it tells you a certain probability 1004 00:44:58,630 --> 00:45:00,620 that it is a guitar, or a penguin. 1005 00:45:00,620 --> 00:45:04,160 So it classifies it, and so we can 1006 00:45:04,160 --> 00:45:09,010 use a modification of that method 1007 00:45:09,010 --> 00:45:11,120 by applying an evolutionary algorithm, 1008 00:45:11,120 --> 00:45:13,700 or a hill-climbing or gradient ascent algorithm. 1009 00:45:13,700 --> 00:45:16,840 We take a couple of images, and we put them through, 1010 00:45:16,840 --> 00:45:21,290 it classifies them, and we see what the classification is. 1011 00:45:21,290 --> 00:45:23,590 And then we can do some crossover between the images.
1012 00:45:23,590 --> 00:45:27,170 So we take the ones that look the most like what we're training 1013 00:45:27,170 --> 00:45:31,150 for-- guitars or penguins, in this case-- 1014 00:45:31,150 --> 00:45:33,190 and we take the features of those 1015 00:45:33,190 --> 00:45:35,692 that identify very strongly as a guitar, 1016 00:45:35,692 --> 00:45:37,900 and we combine those together in the crossover phase. 1017 00:45:37,900 --> 00:45:39,910 This is for the evolutionary algorithm. 1018 00:45:39,910 --> 00:45:42,690 Then we mutate the images, which is making small changes 1019 00:45:42,690 --> 00:45:45,275 to each one, and then we re-evaluate 1020 00:45:45,275 --> 00:45:47,720 by plugging them back in through the neural network, 1021 00:45:47,720 --> 00:45:49,850 and only the best images, the ones 1022 00:45:49,850 --> 00:45:51,860 that looked the most like a guitar or penguin, 1023 00:45:51,860 --> 00:45:54,397 are then selected for the next iteration. 1024 00:45:54,397 --> 00:45:56,980 And this continues until you get to very high identification 1025 00:45:56,980 --> 00:45:59,595 rates, even higher than for actual images of the objects. 1026 00:46:01,935 --> 00:46:04,560 So using gradient ascent, these are some of the images that you 1027 00:46:04,560 --> 00:46:09,000 could produce if you start with just a flat grey image, 1028 00:46:09,000 --> 00:46:11,530 and then you run it through this algorithm. 1029 00:46:11,530 --> 00:46:13,890 So here on the side, we have a backpack, 1030 00:46:13,890 --> 00:46:16,290 and we can actually see the outline of what 1031 00:46:16,290 --> 00:46:18,200 looks like a backpack in there. 1032 00:46:18,200 --> 00:46:21,790 And over here, we have what looks like a Windsor tie 1033 00:46:21,790 --> 00:46:24,185 right here. And perhaps 1034 00:46:24,185 --> 00:46:25,810 there are objects in these other images too, 1035 00:46:25,810 --> 00:46:31,610 but they seem to be lost in the LSD trip of colors here. 1036 00:46:31,610 --> 00:46:32,966 So that's kind of strange. 1037 00:46:32,966 --> 00:46:34,590 That's definitely not how humans do it. 1038 00:46:34,590 --> 00:46:35,923 So let's try a different method. 1039 00:46:35,923 --> 00:46:37,580 What if instead of directly encoding, 1040 00:46:37,580 --> 00:46:40,200 which is where we change individual pixels, 1041 00:46:40,200 --> 00:46:42,710 we change patterns in the images, 1042 00:46:42,710 --> 00:46:44,340 like different shapes? 1043 00:46:44,340 --> 00:46:46,980 Then this is the kind of output that we get. 1044 00:46:46,980 --> 00:46:49,830 So in the upper left, we have a starfish. 1045 00:46:49,830 --> 00:46:52,526 So you can see that it has the orange 1046 00:46:52,526 --> 00:46:54,150 of the starfish, and also 1047 00:46:54,150 --> 00:46:56,355 the blue of the ocean environment 1048 00:46:56,355 --> 00:46:59,320 that typical images of starfish are taken in. 1049 00:46:59,320 --> 00:47:02,810 And you can also see that it has the points, the jagged lines, 1050 00:47:02,810 --> 00:47:06,194 the triangles that we associate with the arms of a starfish. 1051 00:47:06,194 --> 00:47:08,110 But the strange thing here is that they're not 1052 00:47:08,110 --> 00:47:10,500 arranged in a circular pattern. 1053 00:47:10,500 --> 00:47:12,840 They're not pointing outwards like this, 1054 00:47:12,840 --> 00:47:16,230 like we would expect of an actual starfish.
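To make the gradient ascent idea above concrete, here is a hedged sketch in TensorFlow of optimizing the pixels of a flat grey image to raise one class score of a pretrained classifier. The model choice, target class index, step size, and step count are all illustrative assumptions, not the setup used in the study being described.

import tensorflow as tf

# Start from a flat grey image and repeatedly nudge its pixels in the direction
# that increases one class score.
model = tf.keras.applications.MobileNetV2(weights="imagenet")  # any pretrained ImageNet classifier
target_class = 414   # hypothetical target index; pick whichever class you want to maximize

image = tf.Variable(tf.fill((1, 224, 224, 3), 0.5))  # flat grey start, pixel values in [0, 1]

for step in range(200):
    with tf.GradientTape() as tape:
        # Scale to the model's expected input range before classifying.
        preprocessed = tf.keras.applications.mobilenet_v2.preprocess_input(image * 255.0)
        score = model(preprocessed)[0, target_class]
    grad = tape.gradient(score, image)
    # A simple signed-gradient step; plain gradient ascent with a tuned step size works too.
    image.assign(tf.clip_by_value(image + 0.01 * tf.sign(grad), 0.0, 1.0))

# After enough steps, the model assigns a high score to `image` for the target class,
# even though the picture may look nothing like that object to a human.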
1055 00:47:16,230 --> 00:47:19,920 So clearly, it's not latching onto the same large scale 1056 00:47:19,920 --> 00:47:22,110 features that humans do. 1057 00:47:22,110 --> 00:47:25,470 It's actually just looking at the low-level features. 1058 00:47:25,470 --> 00:47:27,530 Even though it's a deep neural network, 1059 00:47:27,530 --> 00:47:32,040 it doesn't grab onto these abstract concepts 1060 00:47:32,040 --> 00:47:34,980 like a human would. 1061 00:47:34,980 --> 00:47:37,920 So the reason for this problem, or at least 1062 00:47:37,920 --> 00:47:39,420 why we think neural networks aren't 1063 00:47:39,420 --> 00:47:43,210 as good as humans at things like this, is the type of model. 1064 00:47:43,210 --> 00:47:45,210 So a human would have more of what's 1065 00:47:45,210 --> 00:47:48,150 called a generative model, which means if we have examples 1066 00:47:48,150 --> 00:47:51,990 here, these dark blue dots, say, of lions, 1067 00:47:51,990 --> 00:47:55,660 a few examples of images of lions, 1068 00:47:55,660 --> 00:47:58,080 then we could construct a probability distribution, 1069 00:47:58,080 --> 00:48:01,320 and say that images that fall somewhere in this region 1070 00:48:01,320 --> 00:48:02,770 are lions. 1071 00:48:02,770 --> 00:48:06,060 And over here, we have a few examples of giraffes, say, 1072 00:48:06,060 --> 00:48:08,660 and so anything that falls in this region would be a giraffe. 1073 00:48:08,660 --> 00:48:12,710 And so if you had a red triangle in here, that would be a lion. 1074 00:48:12,710 --> 00:48:14,790 But if the red triangle is instead over here, 1075 00:48:14,790 --> 00:48:17,220 it actually wouldn't classify at all. 1076 00:48:17,220 --> 00:48:18,570 We wouldn't know what that is. 1077 00:48:18,570 --> 00:48:21,370 We would say that's something other than a lion or a giraffe, 1078 00:48:21,370 --> 00:48:23,990 but neural networks don't work the same way. 1079 00:48:23,990 --> 00:48:25,480 They just draw a decision boundary. 1080 00:48:25,480 --> 00:48:29,505 They just draw lines between the different categories. 1081 00:48:29,505 --> 00:48:33,480 So they don't say that something really far away from the lion 1082 00:48:33,480 --> 00:48:37,080 class is necessarily not a lion. 1083 00:48:37,080 --> 00:48:40,420 It just depends how far away it is from the decision boundary. 1084 00:48:40,420 --> 00:48:42,770 So if we have the red triangle way over there, 1085 00:48:42,770 --> 00:48:45,030 it's very far away from giraffes, 1086 00:48:45,030 --> 00:48:47,480 and it's just generally closer to lions, 1087 00:48:47,480 --> 00:48:50,890 even though it isn't explicitly very close to them at all, 1088 00:48:50,890 --> 00:48:54,030 and it will still be identified as a lion. 1089 00:48:54,030 --> 00:48:57,200 So that's why we think we're able to fool 1090 00:48:57,200 --> 00:49:01,290 these neural networks in such a simplistic way or in such a 1091 00:49:01,290 --> 00:49:03,842 really abstract way. 1092 00:49:03,842 --> 00:49:05,715 So the main takeaways from our presentation, 1093 00:49:05,715 --> 00:49:09,980 and the salient points, are that deep learning 1094 00:49:09,980 --> 00:49:13,200 is a very powerful tool for image classification, 1095 00:49:13,200 --> 00:49:17,035 and it relies on multiple layers of a network-- 1096 00:49:17,035 --> 00:49:19,570 so multiple processing layers.
1097 00:49:19,570 --> 00:49:24,190 Also, CNNs outperform basically every other method 1098 00:49:24,190 --> 00:49:27,570 for classifying images, and that's their primary use 1099 00:49:27,570 --> 00:49:28,540 right now. 1100 00:49:28,540 --> 00:49:30,810 We're currently exploring other uses, 1101 00:49:30,810 --> 00:49:33,450 but that's generally where it's at, 1102 00:49:33,450 --> 00:49:35,970 and this is because convolutional filters are just 1103 00:49:35,970 --> 00:49:37,230 so incredibly powerful. 1104 00:49:37,230 --> 00:49:41,250 They're very fast and very efficient. 1105 00:49:41,250 --> 00:49:44,670 Also, backpropagation is the way that we train neural networks. 1106 00:49:44,670 --> 00:49:47,280 Normally, if you were to train a neural network that 1107 00:49:47,280 --> 00:49:50,480 has a lot of layers, there's actually an exponential growth 1108 00:49:50,480 --> 00:49:54,470 in the time it takes to train because of the branching when 1109 00:49:54,470 --> 00:49:56,730 you go backwards, because each neuron is connected 1110 00:49:56,730 --> 00:49:59,590 to a large number of neurons in the previous layer. 1111 00:49:59,590 --> 00:50:03,060 You get this exponential growth in the number of dependencies 1112 00:50:03,060 --> 00:50:03,980 from a given neuron. 1113 00:50:03,980 --> 00:50:06,910 By using backpropagation, it actually 1114 00:50:06,910 --> 00:50:09,690 reduces it to linear time to train the networks. 1115 00:50:09,690 --> 00:50:12,600 So this allows for efficient training. 1116 00:50:12,600 --> 00:50:16,410 And even with backpropagation and convolution 1117 00:50:16,410 --> 00:50:20,130 being so efficient, it still takes a very large number 1118 00:50:20,130 --> 00:50:23,910 of images, and a long time with a lot of processing power, 1119 00:50:23,910 --> 00:50:27,900 to train neural networks. 1120 00:50:27,900 --> 00:50:29,670 Also, if you'd like to get started 1121 00:50:29,670 --> 00:50:31,550 working with neural networks, there 1122 00:50:31,550 --> 00:50:36,870 are a couple of really nice open source programming 1123 00:50:36,870 --> 00:50:38,290 platforms for neural networks. 1124 00:50:38,290 --> 00:50:41,480 So one of them that we used for our pset was actually 1125 00:50:41,480 --> 00:50:44,440 TensorFlow, which is Google's open source neural network 1126 00:50:44,440 --> 00:50:46,600 platform, and another one would be 1127 00:50:46,600 --> 00:50:49,839 Caffe, which is Berkeley's neural network platform. 1128 00:50:49,839 --> 00:50:51,380 And they actually have an online demo 1129 00:50:51,380 --> 00:50:54,240 where you can plug in images, and immediately 1130 00:50:54,240 --> 00:50:55,590 get identifications. 1131 00:50:55,590 --> 00:50:59,560 So you can get started very quickly with that one. 1132 00:50:59,560 --> 00:51:01,410 Thank you.
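For readers who want to follow up on the TensorFlow suggestion above, here is a minimal sketch of classifying a single image with a pretrained ImageNet model through TensorFlow's Keras API. The model choice and file name are placeholders, and the exact utility functions may vary slightly between TensorFlow versions.

import tensorflow as tf

# Load a pretrained ImageNet classifier (the specific model is an arbitrary choice).
model = tf.keras.applications.MobileNetV2(weights="imagenet")

# "my_photo.jpg" is a placeholder for whatever image you want to classify.
img = tf.keras.utils.load_img("my_photo.jpg", target_size=(224, 224))
x = tf.keras.utils.img_to_array(img)[None, ...]             # shape (1, 224, 224, 3)
x = tf.keras.applications.mobilenet_v2.preprocess_input(x)  # scale pixels to the model's range

preds = model.predict(x)
# Print the five most likely ImageNet labels with their confidences.
for _, label, prob in tf.keras.applications.mobilenet_v2.decode_predictions(preds, top=5)[0]:
    print(f"{label}: {prob:.2f}")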