1
00:00:09,215 --> 00:00:11,320
PATRICK WINSTON: You know, it's
unfortunate that politics

2
00:00:11,320 --> 00:00:14,730
has become so serious.

3
00:00:14,730 --> 00:00:17,360
Back when you were little
it was a lot more fun.

4
00:00:17,360 --> 00:00:20,740
You could make fun
of politicians.

5
00:00:20,740 --> 00:00:23,815
Here's a politician some
of you may recognize.

6
00:00:27,480 --> 00:00:31,970
But it's convenient to be able
to vary what this particular

7
00:00:31,970 --> 00:00:34,210
politician looks like.

8
00:00:34,210 --> 00:00:41,274
For example, we can go from
a cookie baker to radical.

9
00:00:41,274 --> 00:00:43,960
[LAUGHTER]

10
00:00:43,960 --> 00:00:51,544
PATRICK WINSTON: We can go
from superwoman to bimbo.

11
00:00:51,544 --> 00:00:54,030
[LAUGHTER]

12
00:00:54,030 --> 00:00:55,920
PATRICK WINSTON: Socialite--

13
00:00:55,920 --> 00:00:59,710
I put socialite into this.

14
00:00:59,710 --> 00:01:02,340
There she is.

15
00:01:02,340 --> 00:01:08,430
Or we can move the slider over
the other way to bag lady.

16
00:01:08,430 --> 00:01:14,550
Alert, asleep, sad, happy.

17
00:01:14,550 --> 00:01:18,830
How does that work?

18
00:01:18,830 --> 00:01:19,340
I don't know.

19
00:01:19,340 --> 00:01:20,950
But I bet by the end
of this hour you'll

20
00:01:20,950 --> 00:01:22,360
know how that works.

21
00:01:22,360 --> 00:01:25,690
And not only that, you'll
understand something about

22
00:01:25,690 --> 00:01:29,940
what it takes to recognize
faces.

23
00:01:29,940 --> 00:01:34,380
It turns out that some theories of
face recognition are based

24
00:01:34,380 --> 00:01:41,791
on the same principles that
this program is based on.

25
00:01:41,791 --> 00:01:45,030
But you can kind of guess
what's happening here.

26
00:01:45,030 --> 00:01:49,500
There are many stored images and
when I move those sliders

27
00:01:49,500 --> 00:01:52,590
it's interpolating
amongst them.

28
00:01:52,590 --> 00:01:53,840
So that's how that works.

29
00:01:56,270 --> 00:02:00,500
But the main subject of
today is this matter

30
00:02:00,500 --> 00:02:02,390
of recognizing objects.

31
00:02:02,390 --> 00:02:04,620
Faces could be the objects,
but they don't have to be.

32
00:02:04,620 --> 00:02:08,430
This could be an object that you
might want to recognize.

33
00:02:08,430 --> 00:02:11,580
And I want to talk to you a
little bit about the history

34
00:02:11,580 --> 00:02:13,930
of this problem and where
it stands today.

35
00:02:13,930 --> 00:02:15,900
It's still not solved.

36
00:02:15,900 --> 00:02:18,760
But it's an interesting exercise
to see how the

37
00:02:18,760 --> 00:02:22,360
attempts at solution
have evolved slowly

38
00:02:22,360 --> 00:02:23,940
over the past 30 years.

39
00:02:23,940 --> 00:02:28,160
So slowly, in fact, that I think
if someone told me how

40
00:02:28,160 --> 00:02:31,380
long it would take to get to
where we are 30 years ago I

41
00:02:31,380 --> 00:02:33,579
think I would have
hung myself.

42
00:02:33,579 --> 00:02:36,440
But things do move slowly.

43
00:02:36,440 --> 00:02:38,500
And it's important to see
how slowly they move.

44
00:02:38,500 --> 00:02:42,170
Because they will continue to
move slowly in the future.

45
00:02:42,170 --> 00:02:43,920
And you have to understand that
that's the way things

46
00:02:43,920 --> 00:02:45,990
work sometimes.

47
00:02:45,990 --> 00:02:49,590
So to start this all off, we
have to go back to the ideas

48
00:02:49,590 --> 00:02:53,060
of the legendary David Marr, who
dropped dead from leukemia

49
00:02:53,060 --> 00:02:56,250
in about 1980.

50
00:02:56,250 --> 00:03:00,700
I say, the gospel according to
Marr, because he was such a

51
00:03:00,700 --> 00:03:03,960
powerful and central figure that
almost anything he said

52
00:03:03,960 --> 00:03:09,800
was believed by a large
collection of devotees.

53
00:03:09,800 --> 00:03:15,240
But Marr articulated a set of
ideas about how computer

54
00:03:15,240 --> 00:03:19,810
vision would work that started
off by suggesting that with

55
00:03:19,810 --> 00:03:26,340
the input from the camera,
you look for edges.

56
00:03:26,340 --> 00:03:27,700
And you find edge fragments.

57
00:03:27,700 --> 00:03:35,720
And normally they wouldn't be
even as well-drawn as I've

58
00:03:35,720 --> 00:03:37,990
done them now.

59
00:03:37,990 --> 00:03:40,410
Or as badly drawn as
I've done them now.

60
00:03:40,410 --> 00:03:43,460
But the first step, then, in
visual recognition would be to

61
00:03:43,460 --> 00:03:46,329
form this edge-based description
of what's out

62
00:03:46,329 --> 00:03:47,720
there in the world.

63
00:03:47,720 --> 00:03:49,875
And Marr called that
the primal sketch.

64
00:03:57,620 --> 00:04:00,960
And from the primal sketch, the
next step was to decorate

65
00:04:00,960 --> 00:04:06,600
the primal sketch with some
vectors, some surface normals,

66
00:04:06,600 --> 00:04:12,440
showing where the faces on
the object were oriented.

67
00:04:12,440 --> 00:04:14,340
He called that the two
and a half D sketch.

68
00:04:21,360 --> 00:04:22,620
Now why is it two
and a half D?

69
00:04:22,620 --> 00:04:26,360
Well, it's sort of 2D in the
sense that it's still

70
00:04:26,360 --> 00:04:31,070
camera-centric in its way of
presenting information.

71
00:04:31,070 --> 00:04:33,360
But at the same time, it attempts
to say something about the

72
00:04:33,360 --> 00:04:37,110
three-dimensional arrangement
of the faces.

73
00:04:37,110 --> 00:04:39,610
So the speculation was that you
couldn't get to where you

74
00:04:39,610 --> 00:04:41,330
wanted to go in one step.

75
00:04:41,330 --> 00:04:43,970
So you needed several steps
to get from the image to

76
00:04:43,970 --> 00:04:45,990
something you could recognize.

77
00:04:45,990 --> 00:04:50,100
And the third step was to
convert the two and a half D

78
00:04:50,100 --> 00:04:51,850
sketch into generalized
cylinders.

79
00:05:03,100 --> 00:05:03,960
And the idea is this.

80
00:05:03,960 --> 00:05:08,140
If you have a regular cylinder,
you can think of it

81
00:05:08,140 --> 00:05:13,650
as a circular area moving
along an axis like so.

82
00:05:13,650 --> 00:05:16,570
So that's the description
of a cylinder.

83
00:05:16,570 --> 00:05:18,990
A circular area moving
along an axis.

84
00:05:18,990 --> 00:05:23,010
You can get a different kind
of cylinder if you go along

85
00:05:23,010 --> 00:05:26,320
the same axis but you allow
the size of the circle to

86
00:05:26,320 --> 00:05:27,820
change as you go.

87
00:05:27,820 --> 00:05:31,220
So for example, if you were
to describe a wine bottle.

88
00:05:35,550 --> 00:05:37,590
It would be a function of
distance along the axis that

89
00:05:37,590 --> 00:05:42,580
would shrink the circle
appropriately to match the

90
00:05:42,580 --> 00:05:45,520
dimensions of a wine bottle.

91
00:05:45,520 --> 00:05:46,780
A fine burgundy, I perceive.

92
00:05:46,780 --> 00:05:50,920
In any case, this one once
converted into a generalized

93
00:05:50,920 --> 00:05:54,260
cylinder, when matched against
a library of such

94
00:05:54,260 --> 00:05:58,875
descriptions, results
in recognition.

95
00:06:04,290 --> 00:06:07,190
Great theory, based on the idea
that you start off by

96
00:06:07,190 --> 00:06:11,330
looking at edges and you end
up, in several steps of

97
00:06:11,330 --> 00:06:14,470
transformation, producing
something that you could look

98
00:06:14,470 --> 00:06:17,350
up in a library of
descriptions.

99
00:06:17,350 --> 00:06:20,100
Great idea.

100
00:06:20,100 --> 00:06:23,840
Trouble is, no one could
make it work.

101
00:06:26,950 --> 00:06:28,976
It was too hard to do this.

102
00:06:28,976 --> 00:06:31,250
It was too hard to do that.

103
00:06:31,250 --> 00:06:32,980
And the generalized cylinders
produced, if

104
00:06:32,980 --> 00:06:35,980
any, were too coarse.

105
00:06:35,980 --> 00:06:37,520
You couldn't tell the difference
between a Ford and

106
00:06:37,520 --> 00:06:40,520
a Chevrolet or between a
Volkswagen and a Cadillac.

107
00:06:40,520 --> 00:06:43,010
Because they were
just too coarse.

108
00:06:43,010 --> 00:06:45,655
So although it was a great idea
based on the idea that

109
00:06:45,655 --> 00:06:49,580
you have to do recognition in
several transformations of

110
00:06:49,580 --> 00:06:55,430
representational apparatus,
it just didn't work.

111
00:06:55,430 --> 00:07:00,880
So much later, maybe 15 years
later or so, we get to the

112
00:07:00,880 --> 00:07:02,610
next part of our story.

113
00:07:02,610 --> 00:07:07,590
Which is the alignment theories,
most notably the one

114
00:07:07,590 --> 00:07:10,060
produced by Shimon Ullman,
one of Marr's students.

115
00:07:12,580 --> 00:07:16,810
So the alignment theory of
recognition is based on a very

116
00:07:16,810 --> 00:07:19,250
strange and exotic idea.

117
00:07:19,250 --> 00:07:22,930
It doesn't seem strange and
exotic to mechanical engineers

118
00:07:22,930 --> 00:07:25,470
for a while, because they're
used to mechanical drawings.

119
00:07:25,470 --> 00:07:28,620
But here's the strange
and miraculous idea.

120
00:07:28,620 --> 00:07:30,110
Imagine this object.

121
00:07:30,110 --> 00:07:33,540
You take three pictures of it.

122
00:07:33,540 --> 00:07:38,390
You can reconstruct any
view of that object.

123
00:07:38,390 --> 00:07:41,860
Now, I have to be a little bit
careful about how I say that.

124
00:07:41,860 --> 00:07:46,960
First of all, some of the
vertexes are not visible in

125
00:07:46,960 --> 00:07:48,490
the views that you have.

126
00:07:48,490 --> 00:07:50,880
So, of course, you can't
do anything with those.

127
00:07:50,880 --> 00:07:53,600
So let's say that we have a
transparent object where you

128
00:07:53,600 --> 00:07:55,570
can see all the vertexes.

129
00:07:55,570 --> 00:07:59,840
If you have three pictures of
that, you can reconstruct any

130
00:07:59,840 --> 00:08:01,990
view of that object.

131
00:08:01,990 --> 00:08:04,090
Now I have to be a little
careful about how I say that,

132
00:08:04,090 --> 00:08:06,430
because it's not true.

133
00:08:06,430 --> 00:08:09,730
What's true is, you can produce
any view of that in

134
00:08:09,730 --> 00:08:11,620
orthographic projection.

135
00:08:11,620 --> 00:08:13,670
So if you're close enough to
the object that you get

136
00:08:13,670 --> 00:08:15,010
perspective, it doesn't work.

137
00:08:15,010 --> 00:08:18,420
But for the most part, you can
neglect perspective after you

138
00:08:18,420 --> 00:08:20,860
get about two and a half
times as far away as

139
00:08:20,860 --> 00:08:22,420
the object is big.

140
00:08:22,420 --> 00:08:24,830
And you can presume that
you've got orthographic

141
00:08:24,830 --> 00:08:27,560
projection.

142
00:08:27,560 --> 00:08:29,500
So that's a strange
and exotic idea.

143
00:08:29,500 --> 00:08:31,080
But how can you make
a recognition

144
00:08:31,080 --> 00:08:32,150
theory out of that?

145
00:08:32,150 --> 00:08:33,740
So let me show you.

146
00:08:33,740 --> 00:08:36,395
Well, here's one drawing of the
object, I need two more.

147
00:08:39,230 --> 00:08:40,020
Let's see.

148
00:08:40,020 --> 00:08:41,270
Let's have this one.

149
00:08:48,440 --> 00:08:50,620
And maybe one that's tilted
up a little bit.

150
00:08:58,140 --> 00:09:05,360
It's important that these
pictures not be just rotations

151
00:09:05,360 --> 00:09:06,030
on one axis.

152
00:09:06,030 --> 00:09:07,860
Because they wouldn't form what
you might think of as a

153
00:09:07,860 --> 00:09:09,870
kind of basis set.

154
00:09:09,870 --> 00:09:10,850
So there are three pictures.

155
00:09:10,850 --> 00:09:12,110
We'll call them a, b, and c.

156
00:09:15,830 --> 00:09:18,860
And then we want a
fourth picture.

157
00:09:18,860 --> 00:09:21,172
Which will look like this.

158
00:09:21,172 --> 00:09:24,570
It doesn't have to
be too precise.

159
00:09:24,570 --> 00:09:26,890
And we'll call that
the unknown.

160
00:09:26,890 --> 00:09:33,220
And what we really want to know
is if the unknown is the

161
00:09:33,220 --> 00:09:37,100
same object that these three
pictures were made from.

162
00:09:41,170 --> 00:09:44,230
So let me begin with
an assertion.

163
00:09:44,230 --> 00:09:47,570
I'll need four colors of chalk
to make this assertion.

164
00:09:47,570 --> 00:09:51,310
What I want to do is I want to
pick a particular place on the

165
00:09:51,310 --> 00:09:52,560
object, like this one.

166
00:09:55,770 --> 00:09:58,220
And maybe the same place on
this object over here.

167
00:09:58,220 --> 00:10:00,790
Those are corresponding
places, right?

168
00:10:00,790 --> 00:10:05,480
So I can now write an equation
that the x-coordinate of that

169
00:10:05,480 --> 00:10:12,690
unknown object is equal to, oh,
I don't know, alpha x sub

170
00:10:12,690 --> 00:10:24,620
a plus beta x sub b plus
gamma x sub c plus

171
00:10:24,620 --> 00:10:27,460
some constant, tau.

172
00:10:27,460 --> 00:10:29,010
Well, of course, that's
obviously true.

173
00:10:29,010 --> 00:10:32,330
Because I'm letting you take
those alpha, beta, gamma, and

174
00:10:32,330 --> 00:10:33,890
tau and make them anything
you want.

175
00:10:36,680 --> 00:10:39,870
So although that's conspicuously
obviously true,

176
00:10:39,870 --> 00:10:41,630
it's not interesting.

177
00:10:41,630 --> 00:10:42,910
So let me take another point.

178
00:10:45,410 --> 00:10:48,680
And of course, I can write the
same equation down for this

179
00:10:48,680 --> 00:10:49,930
purple point.

180
00:11:01,800 --> 00:11:05,190
And now that I'm on a roll and
having a great deal of fun

181
00:11:05,190 --> 00:11:12,760
with this, I can
take this point

182
00:11:12,760 --> 00:11:14,010
and make a blue equation.

183
00:11:26,110 --> 00:11:29,840
And you know I'm destined
to do it, so I've

184
00:11:29,840 --> 00:11:31,050
got one more color.

185
00:11:31,050 --> 00:11:32,872
I might as well use it.

186
00:11:32,872 --> 00:11:36,350
Let's just make sure I get
something that works here.

187
00:11:36,350 --> 00:11:38,810
That's this one, that's
this one.

188
00:11:38,810 --> 00:11:42,180
I hope I've got these
correspondences right.

189
00:11:42,180 --> 00:11:42,700
STUDENT: [INAUDIBLE].

190
00:11:42,700 --> 00:11:44,030
PATRICK WINSTON: Have
I got one off?

191
00:11:44,030 --> 00:11:45,190
STUDENT: [INAUDIBLE].

192
00:11:45,190 --> 00:11:45,865
PATRICK WINSTON: Which color?

193
00:11:45,865 --> 00:11:46,210
STUDENT: Blue.

194
00:11:46,210 --> 00:11:47,570
[INAUDIBLE].

195
00:11:47,570 --> 00:11:47,930
PATRICK WINSTON: OK.

196
00:11:47,930 --> 00:11:51,500
So this one goes with this
one, goes with this one.

197
00:11:51,500 --> 00:11:52,200
Is that one wrong?

198
00:11:52,200 --> 00:11:54,920
STUDENTS: Yeah.

199
00:11:54,920 --> 00:11:56,100
PATRICK WINSTON: Oh, oh, oh.

200
00:11:56,100 --> 00:12:00,442
Of course this one, excuse
me, goes down here.

201
00:12:00,442 --> 00:12:02,380
Right?

202
00:12:02,380 --> 00:12:05,520
And then this one
is off as well.

203
00:12:05,520 --> 00:12:07,870
I wouldn't get a very good
recognition scheme if I can't

204
00:12:07,870 --> 00:12:10,630
get those correspondences
right.

205
00:12:10,630 --> 00:12:16,108
Which is one of the
lessons of today.

206
00:12:16,108 --> 00:12:16,550
OK.

207
00:12:16,550 --> 00:12:17,820
Now I've got them right.

208
00:12:17,820 --> 00:12:19,700
And now that equation
is correct.

209
00:12:19,700 --> 00:12:22,970
I think I've got this
one right already.

210
00:12:22,970 --> 00:12:24,730
So now I can just
write that down.

211
00:12:24,730 --> 00:12:26,910
I'm on a roll, I'm just
copying this.

212
00:12:37,870 --> 00:12:41,460
So those are a bunch
of equations.

213
00:12:41,460 --> 00:12:49,530
And now the astonishing part
is that I can choose alpha,

214
00:12:49,530 --> 00:12:56,070
beta, gamma, and tau
to be all the same.

215
00:12:59,710 --> 00:13:03,200
That is, there's one set of
alpha, beta, gamma, and tau

216
00:13:03,200 --> 00:13:07,220
that works for everything,
for all four points.

217
00:13:07,220 --> 00:13:09,330
So you look at that puzzled.

218
00:13:09,330 --> 00:13:10,450
And that's OK to be puzzled.

219
00:13:10,450 --> 00:13:11,890
Because I certainly
haven't proved it.

220
00:13:11,890 --> 00:13:14,430
I'm asserting it.

221
00:13:14,430 --> 00:13:15,890
But right away, there's
something interesting about

222
00:13:15,890 --> 00:13:18,330
this and that is that the
relationship between the

223
00:13:18,330 --> 00:13:22,020
points on the unknown object and
the points in this stored

224
00:13:22,020 --> 00:13:28,700
library of images are
related linearly.

225
00:13:28,700 --> 00:13:31,570
That's true because it's
orthographic projection.

226
00:13:31,570 --> 00:13:32,880
Linearly related.

227
00:13:32,880 --> 00:13:38,610
So I can generate the points in
some fourth object from the

228
00:13:38,610 --> 00:13:43,740
points in three sample objects
with linear operations.

229
00:13:43,740 --> 00:13:43,990
Christopher?

230
00:13:43,990 --> 00:13:46,950
STUDENT: Is that the
x-coordinate of--

231
00:13:46,950 --> 00:13:48,200
PATRICK WINSTON: It's
the x-coordinate.

232
00:13:50,840 --> 00:13:52,380
Christopher asked about
the x-coordinates.

233
00:13:52,380 --> 00:13:55,850
Each of these x-coordinates is
meant to be color coded.

234
00:13:55,850 --> 00:14:00,130
It gets a little complicated
with notation and stuff.

235
00:14:00,130 --> 00:14:03,720
So that's the reason I'm color
coding the coordinates.

236
00:14:03,720 --> 00:14:09,710
So the orange x sub u is the
x-coordinate of that

237
00:14:09,710 --> 00:14:10,570
particular point.

238
00:14:10,570 --> 00:14:12,390
STUDENT: In 3D space?

239
00:14:12,390 --> 00:14:13,610
PATRICK WINSTON: No.

240
00:14:13,610 --> 00:14:14,480
Not in 3D space.

241
00:14:14,480 --> 00:14:15,536
In the image.

242
00:14:15,536 --> 00:14:17,720
STUDENT: So it's a 2D
projection of it?

243
00:14:17,720 --> 00:14:19,750
PATRICK WINSTON: It's a 2D
projection of it, an

244
00:14:19,750 --> 00:14:20,600
orthographic projection.

245
00:14:20,600 --> 00:14:21,450
OK?

246
00:14:21,450 --> 00:14:24,010
So we're looking at drawings.

247
00:14:24,010 --> 00:14:26,500
And those coordinates over there
are the two-dimensional

248
00:14:26,500 --> 00:14:29,470
coordinates in the drawing.

249
00:14:29,470 --> 00:14:32,615
Just as if it were
on your retina.

250
00:14:32,615 --> 00:14:34,555
STUDENT: [INAUDIBLE]

251
00:14:34,555 --> 00:14:39,410
vertexes on the 3D projection
or can curved surfaces also?

252
00:14:39,410 --> 00:14:41,470
PATRICK WINSTON: So he asked
about curved surfaces.

253
00:14:41,470 --> 00:14:43,810
And the answer is that you have
to find corresponding

254
00:14:43,810 --> 00:14:45,610
points on the object.

255
00:14:45,610 --> 00:14:50,070
So if you have a totally curved
surface and you can't

256
00:14:50,070 --> 00:14:52,940
identify any corresponding
points, you lose.

257
00:14:52,940 --> 00:14:56,230
But if you consider our faces,
there are some obvious points,

258
00:14:56,230 --> 00:14:58,700
even though our faces are
not by any means

259
00:14:58,700 --> 00:15:00,970
flat like these objects.

260
00:15:00,970 --> 00:15:03,260
We have the tip of our nose
and the center of our

261
00:15:03,260 --> 00:15:06,290
eyeballs and so on.

262
00:15:06,290 --> 00:15:11,250
So if that's true, what does
that mean about recovering

263
00:15:11,250 --> 00:15:15,390
alpha, beta, gamma, and tau?

264
00:15:15,390 --> 00:15:18,590
Can we find them?

265
00:15:18,590 --> 00:15:19,890
[INAUDIBLE], what
do you think?

266
00:15:19,890 --> 00:15:21,300
How do we go about
finding them?

267
00:15:21,300 --> 00:15:22,920
You're nodding your head
in the right direction.

268
00:15:22,920 --> 00:15:23,880
[LAUGHTER]

269
00:15:23,880 --> 00:15:26,280
STUDENT: It's four
equations and--

270
00:15:26,280 --> 00:15:26,550
PATRICK WINSTON: Splendid.

271
00:15:26,550 --> 00:15:28,712
It's four equations
and four unknowns.

272
00:15:28,712 --> 00:15:30,400
Four linear equations
and four unknowns.

273
00:15:30,400 --> 00:15:33,560
So obviously, you can solve for
alpha, beta, gamma, and

274
00:15:33,560 --> 00:15:37,960
tau if you know that these
equations are correct.

275
00:15:37,960 --> 00:15:41,390
So how does that help
us with recognition?

276
00:15:41,390 --> 00:15:44,000
It helps us with recognition
because we can take another

277
00:15:44,000 --> 00:15:51,820
point, let me say this square
point here and this

278
00:15:51,820 --> 00:15:58,410
corresponding square point here
and this corresponding

279
00:15:58,410 --> 00:16:00,750
square point here, and what
can we do with those three

280
00:16:00,750 --> 00:16:02,460
points now?

281
00:16:02,460 --> 00:16:05,610
We've got alpha, beta, gamma,
and tau, so we can predict

282
00:16:05,610 --> 00:16:09,060
where it's going to be
in the fourth image.

283
00:16:09,060 --> 00:16:15,310
So we can predict that that
square point is going to be

284
00:16:15,310 --> 00:16:17,060
right there.

285
00:16:17,060 --> 00:16:21,030
And if it isn't, we're highly
suspicious about whether this

286
00:16:21,030 --> 00:16:25,230
object is the kind of object
we think it is.

287
00:16:25,230 --> 00:16:26,480
So you look at me
in disbelief.

288
00:16:26,480 --> 00:16:29,036
You'd like me to demonstrate
this, I imagine.

289
00:16:29,036 --> 00:16:29,472
STUDENT: Yeah.

290
00:16:29,472 --> 00:16:31,220
PATRICK WINSTON: OK.

291
00:16:31,220 --> 00:16:32,500
Let me see if I can
demonstrate this.

292
00:16:51,480 --> 00:16:57,110
So I'm going to do this in a
slightly simplified version.

293
00:16:57,110 --> 00:17:01,320
I'm only going to allow
rotation around

294
00:17:01,320 --> 00:17:03,270
the vertical axis.

295
00:17:03,270 --> 00:17:05,680
And just so you know I'm not
cheating, there's a little

296
00:17:05,680 --> 00:17:09,358
slider here that rotates
that third object.

297
00:17:09,358 --> 00:17:12,770
Let's see, why are there
just two known

298
00:17:12,770 --> 00:17:13,940
objects and one unknown?

299
00:17:13,940 --> 00:17:16,740
Well that's because I've
restricted the motion to

300
00:17:16,740 --> 00:17:20,848
rotation around the vertical
axis and some translation.

301
00:17:20,848 --> 00:17:24,630
So now that I've spun that
around a little bit, let me

302
00:17:24,630 --> 00:17:27,300
pick some corresponding
points.

303
00:17:27,300 --> 00:17:28,508
Oops.

304
00:17:28,508 --> 00:17:29,758
What's happened?

305
00:17:41,240 --> 00:17:41,520
Wow.

306
00:17:41,520 --> 00:17:42,840
Let me run that by again.

307
00:18:04,430 --> 00:18:04,780
OK.

308
00:18:04,780 --> 00:18:07,060
So there's one point
I've selected

309
00:18:07,060 --> 00:18:08,970
from the model objects.

310
00:18:08,970 --> 00:18:10,830
The corresponding point
over here on the

311
00:18:10,830 --> 00:18:12,400
unknown is right there.

312
00:18:12,400 --> 00:18:13,350
I'm going to be a little off.

313
00:18:13,350 --> 00:18:15,120
But that's OK.

314
00:18:15,120 --> 00:18:18,480
So let me just pick that
one and then that

315
00:18:18,480 --> 00:18:20,650
corresponds to this one.

316
00:18:20,650 --> 00:18:23,870
Krishna, would you like to
specify a point so people know

317
00:18:23,870 --> 00:18:25,680
I'm not cheating.

318
00:18:25,680 --> 00:18:27,900
Pick a point.

319
00:18:27,900 --> 00:18:29,050
Pick a point, Krishna.

320
00:18:29,050 --> 00:18:30,874
STUDENT: Oh, the right?

321
00:18:30,874 --> 00:18:32,020
PATRICK WINSTON: The right?

322
00:18:32,020 --> 00:18:32,700
STUDENT: Yeah.

323
00:18:32,700 --> 00:18:33,190
PATRICK WINSTON: This one?

324
00:18:33,190 --> 00:18:35,046
STUDENT: Yep.

325
00:18:35,046 --> 00:18:35,510
PATRICK WINSTON: Oops.

326
00:18:35,510 --> 00:18:37,310
OK, let's pick it out
on the model first.

327
00:18:37,310 --> 00:18:40,175
Now pick it over here.

328
00:18:40,175 --> 00:18:40,670
Boom.

329
00:18:40,670 --> 00:18:43,290
So all the points are where
they're supposed to be.

330
00:18:43,290 --> 00:18:45,690
Isn't that cool?

331
00:18:45,690 --> 00:18:47,640
Well, let's suppose that the
unknown is something else.

332
00:18:50,540 --> 00:18:52,710
This is a carefully
selected object.

333
00:18:52,710 --> 00:18:57,320
Because the points are all in the
correct positions vertically,

334
00:18:57,320 --> 00:18:59,300
but they're not necessarily in the
correct positions in the

335
00:18:59,300 --> 00:19:00,950
other two dimensions.

336
00:19:00,950 --> 00:19:08,830
So let me pick this point, and
this point, and this point,

337
00:19:08,830 --> 00:19:11,160
and this point.

338
00:19:11,160 --> 00:19:14,630
And Krishna had me
pick this point.

339
00:19:14,630 --> 00:19:17,220
So let me pick this point.

340
00:19:17,220 --> 00:19:21,530
So if it thinks that the
unknown is one of these

341
00:19:21,530 --> 00:19:25,420
obelisk objects, then we would
expect to see all of the

342
00:19:25,420 --> 00:19:28,280
corresponding points correctly
identified.

343
00:19:28,280 --> 00:19:29,780
But boom.

344
00:19:29,780 --> 00:19:31,030
All the points are off.

345
00:19:34,850 --> 00:19:37,360
So it seems to work in this
particular example.

346
00:19:37,360 --> 00:19:43,640
I find the alpha and beta
using two images.

347
00:19:43,640 --> 00:19:47,010
And I predict the locations
of the other points.

348
00:19:47,010 --> 00:19:49,630
And I determine whether those
positions are correct.

349
00:19:49,630 --> 00:19:51,780
And if they are correct, then I
have a pretty good idea that

350
00:19:51,780 --> 00:19:54,840
I have in fact identified the
object on the right as either

351
00:19:54,840 --> 00:19:59,540
an obelisk or an organ,
depending on which of the

352
00:19:59,540 --> 00:20:05,420
model choices and the unknown
choices I've selected.
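
[Editor's sketch of the simplified demonstration: with rotation restricted to the vertical axis, two stored images and two coefficients suffice. The point coordinates, the angles, and the names image_x and fit_alpha_beta are hypothetical, and translation is left out to keep the system two by two.]

```python
import math

def image_x(point, theta):
    """Image x-coordinate after rotating (x_s, y_s) by theta about the vertical axis."""
    x_s, y_s = point
    return x_s * math.cos(theta) - y_s * math.sin(theta)

def fit_alpha_beta(xa, xb, xu):
    """Solve x_u = alpha*x_a + beta*x_b from two corresponding points (Cramer's rule)."""
    det = xa[0] * xb[1] - xa[1] * xb[0]
    if abs(det) < 1e-12:
        raise ValueError("degenerate correspondences")
    alpha = (xu[0] * xb[1] - xu[1] * xb[0]) / det
    beta = (xa[0] * xu[1] - xa[1] * xu[0]) / det
    return alpha, beta

# Hypothetical model points (x_s, y_s) in standard position and view angles in radians.
model = [(1.0, 0.5), (-0.5, 1.5), (2.0, -1.0)]
theta_a, theta_b, theta_u = 0.2, 0.9, 0.5

xa = [image_x(p, theta_a) for p in model]
xb = [image_x(p, theta_b) for p in model]
xu = [image_x(p, theta_u) for p in model]    # unknown picture of the same object

# Fit on the first two correspondences, then predict the third.
alpha, beta = fit_alpha_beta(xa[:2], xb[:2], xu[:2])
predicted = alpha * xa[2] + beta * xb[2]
same_object_error = abs(predicted - xu[2])

# An impostor whose third point sits elsewhere in depth lands in the wrong place.
impostor_x = image_x((2.0, 1.0), theta_u)
impostor_error = abs(predicted - impostor_x)
print(same_object_error < 1e-9, impostor_error > 0.1)
```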

353
00:20:05,420 --> 00:20:09,550
So the only thing I have left
to do is to demonstrate that

354
00:20:09,550 --> 00:20:12,160
what I said about
this is true.

355
00:20:12,160 --> 00:20:14,880
So I'm going to actually
demonstrate that what I said

356
00:20:14,880 --> 00:20:18,710
about this is true using the
configuration in this

357
00:20:18,710 --> 00:20:19,640
demonstration.

358
00:20:19,640 --> 00:20:23,120
Because it's much too hard
for me to remember matrix

359
00:20:23,120 --> 00:20:25,930
transformations for generalized
rotation in three

360
00:20:25,930 --> 00:20:27,600
dimensions.

361
00:20:27,600 --> 00:20:28,850
So here's how it's
going to work.

362
00:20:33,410 --> 00:20:37,140
The z-axis is going
up that way.

363
00:20:37,140 --> 00:20:41,640
Or, it's going to be pointing
toward you.

364
00:20:41,640 --> 00:20:43,760
And what I'm going to
do is I'm going to

365
00:20:43,760 --> 00:20:46,820
rotate around this axis.

366
00:20:46,820 --> 00:20:49,750
And what I want to do is I
want to find out how the

367
00:20:49,750 --> 00:20:52,670
x-coordinate in the image
of the points move

368
00:20:52,670 --> 00:20:53,920
as I do that rotation.

369
00:20:56,350 --> 00:21:00,300
So here's the x-axis.

370
00:21:00,300 --> 00:21:03,180
This is the coordinate
that you can see.

371
00:21:03,180 --> 00:21:05,520
Here is the y-axis.

372
00:21:05,520 --> 00:21:08,010
That's in depth, so you can't
tell how far away it is.

373
00:21:10,750 --> 00:21:12,180
And the z-axis--

374
00:21:12,180 --> 00:21:17,060
x, y, z-axis must be pointing
out that way toward you.

375
00:21:17,060 --> 00:21:21,660
So now I'm going to consider
just a single point on the

376
00:21:21,660 --> 00:21:24,310
object and see what
happens to it.

377
00:21:24,310 --> 00:21:31,640
So I'm going to say to myself,
let's put the object in some

378
00:21:31,640 --> 00:21:32,650
kind of standard position.

379
00:21:32,650 --> 00:21:34,300
I don't care what it is.

380
00:21:34,300 --> 00:21:36,450
It can be just random,
just spin it around.

381
00:21:36,450 --> 00:21:42,810
Some position, we'll call that
the standard position, S. And

382
00:21:42,810 --> 00:21:46,770
that means that the x-coordinate
of the standard

383
00:21:46,770 --> 00:21:49,890
position is x sub s.

384
00:21:49,890 --> 00:21:57,520
And the y-coordinate of the
standard position is y sub s.

385
00:21:57,520 --> 00:22:01,930
And now I'm going to rotate
the object three times.

386
00:22:01,930 --> 00:22:05,280
Once to form the a picture, once
to form the b picture,

387
00:22:05,280 --> 00:22:07,110
and once to form
the c picture.

388
00:22:07,110 --> 00:22:09,960
And you can make
those choices.

389
00:22:09,960 --> 00:22:12,330
Those can be anything, right?

390
00:22:12,330 --> 00:22:18,540
So let's say that the a
picture is out here.

391
00:22:18,540 --> 00:22:22,190
So that's the a picture.

392
00:22:22,190 --> 00:22:25,040
The b picture is out here.

393
00:22:25,040 --> 00:22:27,610
And the unknown is
up that way.

394
00:22:32,000 --> 00:22:37,540
And so what I want to know
depends on these vectors.

395
00:22:37,540 --> 00:22:41,220
We'll call that theta sub a,
and this is theta sub b.

396
00:22:45,230 --> 00:22:50,430
And this one is theta sub u.

397
00:22:50,430 --> 00:23:01,950
So I would like to know how
x sub a depends on x

398
00:23:01,950 --> 00:23:05,490
sub s and y sub s.

399
00:23:05,490 --> 00:23:08,490
And I can never remember how
to do that, because I can

400
00:23:08,490 --> 00:23:09,810
never remember the
transformation

401
00:23:09,810 --> 00:23:11,480
equations for rotation.

402
00:23:11,480 --> 00:23:14,880
So I have to figure
it out every time.

403
00:23:14,880 --> 00:23:16,090
And this is no exception.

404
00:23:16,090 --> 00:23:18,870
So what I'm going to say is that
this vector that goes out

405
00:23:18,870 --> 00:23:21,900
to S consists of two pieces.

406
00:23:21,900 --> 00:23:25,570
There's the x part
and the y part.

407
00:23:25,570 --> 00:23:30,390
And I know that I can rotate
this vector by theta sub a by

408
00:23:30,390 --> 00:23:33,130
rotating this vector and
rotating that vector and

409
00:23:33,130 --> 00:23:35,370
adding up the results.

410
00:23:35,370 --> 00:23:39,930
So if I rotate this vector
by theta sub a, then the

411
00:23:39,930 --> 00:23:46,200
contribution of that to the
x-coordinate of a is going to

412
00:23:46,200 --> 00:23:52,790
be given by the cosine
of theta sub a

413
00:23:52,790 --> 00:23:56,530
multiplied by x sub s.

414
00:23:56,530 --> 00:23:59,360
So you can just exaggerate that
motion, say, well if I

415
00:23:59,360 --> 00:24:03,220
pitch it up that way then the
projection down on the x-axis

416
00:24:03,220 --> 00:24:06,520
is going to be this length
of the vector times the

417
00:24:06,520 --> 00:24:07,770
cosine of the angle.

418
00:24:10,410 --> 00:24:15,930
Now there's also going to be
a dependence on y sub s.

419
00:24:15,930 --> 00:24:17,820
Let's figure out what
that's going to be.

420
00:24:17,820 --> 00:24:19,065
I've got this vector here.

421
00:24:19,065 --> 00:24:22,220
And I'm going to rotate it
by theta sub a as well.

422
00:24:22,220 --> 00:24:25,250
If I rotate that by theta sub a
and see what the projection

423
00:24:25,250 --> 00:24:28,020
is on the x-axis, that's
going to be given by

424
00:24:28,020 --> 00:24:30,570
the sine of the angle.

425
00:24:30,570 --> 00:24:34,080
But it's going the wrong way, so
I have to subtract it off.

426
00:24:34,080 --> 00:24:36,825
So that's how I don't have to
remember what the signs are on

427
00:24:36,825 --> 00:24:38,075
these equations.

428
00:24:44,520 --> 00:24:45,450
Well, that was good.

429
00:24:45,450 --> 00:24:47,940
And now that I'm off
and running I can

430
00:24:47,940 --> 00:24:48,710
do what I did before.

431
00:24:48,710 --> 00:24:50,280
It makes it easy to
give the lecture.

432
00:24:50,280 --> 00:24:54,120
Because this is going to be x
sub b is equal to x sub s

433
00:24:54,120 --> 00:25:00,690
times the cosine of theta sub
b minus y sub s times the

434
00:25:00,690 --> 00:25:03,240
cosine of theta--

435
00:25:03,240 --> 00:25:05,690
oh, you're letting
me make mistakes.

436
00:25:05,690 --> 00:25:06,940
Shame.

437
00:25:09,740 --> 00:25:12,050
I can generally tell by all
the troubled looks.

438
00:25:12,050 --> 00:25:14,280
But there should be some
shouting as well.

439
00:25:14,280 --> 00:25:18,150
That's the sine and
that's the sine.

440
00:25:18,150 --> 00:25:19,780
And one more time.

441
00:25:19,780 --> 00:25:26,890
x sub u is equal to x sub s
times the cosine of theta sub

442
00:25:26,890 --> 00:25:33,830
u minus y sub s times the
sine of theta sub u.

443
00:25:33,830 --> 00:25:36,670
And I forgot the b up there.

444
00:25:36,670 --> 00:25:37,880
So there are some equations.
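
For reference, the three board equations described above can be written out as follows. This is a reconstruction from the spoken description, assuming the subscript s marks the standard position and a, b, and u mark the two stored views and the unknown view:

```latex
\begin{aligned}
x_a &= x_s \cos\theta_a - y_s \sin\theta_a \\
x_b &= x_s \cos\theta_b - y_s \sin\theta_b \\
x_u &= x_s \cos\theta_u - y_s \sin\theta_u
\end{aligned}
```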

445
00:25:37,880 --> 00:25:39,710
And we don't know what
we're doing.

446
00:25:39,710 --> 00:25:41,610
We're just going to stare at
them awhile and see if they

447
00:25:41,610 --> 00:25:42,860
sing us a song.

448
00:25:45,200 --> 00:25:48,480
So let's see if they
sing us a song.

449
00:25:48,480 --> 00:25:54,350
What about x sub
a and x sub b?

450
00:25:54,350 --> 00:25:57,160
These are things that
we see in the image.

451
00:25:57,160 --> 00:25:58,520
These are things that
we can measure.

452
00:26:10,850 --> 00:26:14,100
What about all those cosines
and sines of theta

453
00:26:14,100 --> 00:26:16,000
a's and theta b's.

454
00:26:16,000 --> 00:26:18,010
Well, we have no idea
what they are.

455
00:26:18,010 --> 00:26:20,420
But one thing is clear.

456
00:26:20,420 --> 00:26:25,260
They're true for all of the
points on the object.

457
00:26:25,260 --> 00:26:28,740
Because when we rotate the
object around by angle theta,

458
00:26:28,740 --> 00:26:31,510
we're rotating all of
the points through

459
00:26:31,510 --> 00:26:33,790
the same angle, right?

460
00:26:33,790 --> 00:26:35,250
So with respect to any

461
00:26:35,250 --> 00:26:38,790
particular view of the object--

462
00:26:38,790 --> 00:26:41,810
here we are in the standard
position.

463
00:26:41,810 --> 00:26:44,590
Here we are in position a.

464
00:26:44,590 --> 00:26:46,830
The vectors to all of the
points on the object are

465
00:26:46,830 --> 00:26:50,510
rotated by the same angle when
we go from the standard

466
00:26:50,510 --> 00:26:53,240
position to the a position.

467
00:26:53,240 --> 00:27:02,100
So that means that for all of
the images in this particular

468
00:27:02,100 --> 00:27:06,940
rendering, with a particular
rotation by theta a, theta b,

469
00:27:06,940 --> 00:27:09,465
and theta u, those
are constants.

470
00:27:15,820 --> 00:27:18,125
Now remember this is for
a particular theta a, a

471
00:27:18,125 --> 00:27:21,140
particular theta b, and
a particular theta u.

472
00:27:21,140 --> 00:27:23,850
As long as we're talking about
all of the points for each of

473
00:27:23,850 --> 00:27:28,920
those rotations, those angles
and cosines are going to be

474
00:27:28,920 --> 00:27:35,090
the same for all possible
points on the object.

475
00:27:37,780 --> 00:27:38,110
OK.

476
00:27:38,110 --> 00:27:41,790
So now we go back to our high
school algebra experts and we

477
00:27:41,790 --> 00:27:49,880
say, look at these first two
equations. We've got two

478
00:27:49,880 --> 00:27:55,540
equations and what we can now
construe to be two unknowns.

479
00:27:55,540 --> 00:27:57,210
What are the unknowns
that are left?

480
00:27:57,210 --> 00:27:58,990
We can measure a and b.

481
00:27:58,990 --> 00:28:00,700
Whatever the cosines
are, they're the

482
00:28:00,700 --> 00:28:02,660
same for all the pictures.

483
00:28:02,660 --> 00:28:05,580
So if we treat those as
constants, then we can solve

484
00:28:05,580 --> 00:28:08,770
for x sub s and y sub s.

485
00:28:08,770 --> 00:28:10,490
Right?

486
00:28:10,490 --> 00:28:14,860
We can solve for x sub s and y
sub s in terms of x sub a and

487
00:28:14,860 --> 00:28:20,190
x sub b and a whole bunch
of constants.

488
00:28:20,190 --> 00:28:27,220
But, I don't know, a whole bunch
of constants, let's see.

489
00:28:27,220 --> 00:28:30,640
We can gather up all of those
cosines and ratios of sines

490
00:28:30,640 --> 00:28:34,350
and cosines and all that stuff
and put them all together.

491
00:28:34,350 --> 00:28:36,130
Because they're all constants.

492
00:28:36,130 --> 00:28:38,320
And then we can do this.

493
00:28:38,320 --> 00:28:48,060
We can say x sub
u is equal to--

494
00:28:48,060 --> 00:28:54,010
well, it's going to depend
on x sub a and x sub b.

495
00:28:54,010 --> 00:28:58,030
And by the time we wash or
manipulate or screw around

496
00:28:58,030 --> 00:29:03,220
with all those cosines, we can
say that the multiplier for x

497
00:29:03,220 --> 00:29:07,670
sub a is some constant alpha and
the multiplier for x sub b

498
00:29:07,670 --> 00:29:09,910
is some constant beta.

499
00:29:09,910 --> 00:29:11,500
So that's not a sleight
of hand.

500
00:29:11,500 --> 00:29:12,710
That's just linear

501
00:29:12,710 --> 00:29:15,300
manipulation of those equations.

502
00:29:15,300 --> 00:29:17,940
And that's what we wanted to
show, that for orthographic

503
00:29:17,940 --> 00:29:21,130
projection, which this is--
there is no perspective

504
00:29:21,130 --> 00:29:23,530
involved here, we're just taking
the projection along

505
00:29:23,530 --> 00:29:24,780
the x-axis--

506
00:29:26,480 --> 00:29:30,060
we can demonstrate for this
simplified situation that that

507
00:29:30,060 --> 00:29:31,310
equation must hold.
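
The claim can be checked numerically. The following is a minimal Python sketch, not the lecture's own program: the function names, the sample points, and the angles are all made up for illustration. It fits alpha and beta from two corresponding points and then predicts where a third point should appear in the unknown view.

```python
import math

def project_x(point, theta):
    """x-coordinate of a standard-position point (x_s, y_s) after a
    rotation by theta, under orthographic projection onto the x-axis."""
    x_s, y_s = point
    return x_s * math.cos(theta) - y_s * math.sin(theta)

def fit_alpha_beta(xa, xb, xu):
    """Solve x_u = alpha * x_a + beta * x_b from two corresponding points.

    xa, xb, xu each hold the measured x-coordinates of the same two
    points in view a, view b, and the unknown view."""
    det = xa[0] * xb[1] - xa[1] * xb[0]
    alpha = (xu[0] * xb[1] - xu[1] * xb[0]) / det
    beta = (xa[0] * xu[1] - xa[1] * xu[0]) / det
    return alpha, beta

# Hypothetical object points and view angles (radians), for illustration only.
points = [(1.0, 2.0), (-1.5, 0.5), (0.7, -2.2)]
theta_a, theta_b, theta_u = 0.3, 1.1, 0.6

xa = [project_x(p, theta_a) for p in points]
xb = [project_x(p, theta_b) for p in points]
xu = [project_x(p, theta_u) for p in points]

# Fit the constants using only the first two points...
alpha, beta = fit_alpha_beta(xa[:2], xb[:2], xu[:2])

# ...then predict the third point's x-coordinate in the unknown view.
predicted = alpha * xa[2] + beta * xb[2]
actual = xu[2]
print(f"alpha={alpha:.4f} beta={beta:.4f} "
      f"predicted={predicted:.4f} actual={actual:.4f}")
```

The fit never looks at the angles themselves, only at measured x-coordinates, which is the point of the argument. If the third point were measured somewhere other than the predicted position, the unknown image would presumably not be a view of the same object.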

508
00:29:33,880 --> 00:29:35,310
Now I want to give you
a few puzzles.

509
00:29:35,310 --> 00:29:36,730
Because this stuff
is so simple.

510
00:29:36,730 --> 00:29:41,020
Suppose I allow translation
as well as rotation.

511
00:29:41,020 --> 00:29:42,696
What's going to happen?

512
00:29:42,696 --> 00:29:44,094
STUDENT: You just get the tau.

513
00:29:44,094 --> 00:29:44,560
Basically, you get a constant.

514
00:29:44,560 --> 00:29:46,180
PATRICK WINSTON: Yeah, you
add a constant, tau.

515
00:29:46,180 --> 00:29:47,760
But what do we need to do
in order to solve it?

516
00:29:47,760 --> 00:29:49,221
STUDENT: Subtract them
[INAUDIBLE].

517
00:29:49,221 --> 00:29:52,630
You subtract two equations
and then [INAUDIBLE].

518
00:29:52,630 --> 00:29:54,950
PATRICK WINSTON: Let's
see, now we've got

519
00:29:54,950 --> 00:29:56,206
three unknowns, right?

520
00:29:56,206 --> 00:29:56,985
I don't know tau.

521
00:29:56,985 --> 00:29:58,216
I don't know x sub s.

522
00:29:58,216 --> 00:30:00,910
And I don't know y sub s.

523
00:30:00,910 --> 00:30:02,650
So I need another equation.

524
00:30:02,650 --> 00:30:04,534
Where do I get the
other equation?

525
00:30:04,534 --> 00:30:05,430
STUDENT: [INAUDIBLE].

526
00:30:05,430 --> 00:30:06,680
PATRICK WINSTON: From
another picture.

527
00:30:09,910 --> 00:30:14,150
That's why up there I
needed four points.

528
00:30:14,150 --> 00:30:17,690
That covers a situation where
I've got three degrees of

529
00:30:17,690 --> 00:30:20,080
rotation and translation.

530
00:30:20,080 --> 00:30:25,360
Here I got by with just two
pictures in this illustration.

531
00:30:25,360 --> 00:30:27,840
That one involved a tau
translational element, so I

532
00:30:27,840 --> 00:30:28,720
needed three pictures.

533
00:30:28,720 --> 00:30:32,700
And this one's got full
rotation, so I needed four.

534
00:30:32,700 --> 00:30:40,226
So great idea, works fine.

535
00:30:40,226 --> 00:30:45,410
The trouble is it doesn't work
so fine on natural objects.

536
00:30:45,410 --> 00:30:48,630
It works fine on things that are
manufactured because they

537
00:30:48,630 --> 00:30:51,250
all have identical dimensions.

538
00:30:51,250 --> 00:30:55,420
So if I made a million of these
in a factory, I'd have

539
00:30:55,420 --> 00:30:56,950
no trouble recognizing them.

540
00:30:56,950 --> 00:31:02,090
Because all I'd have to do is
take three pictures, record

541
00:31:02,090 --> 00:31:04,840
the coordinates of some of the
points, and I'd be done.

542
00:31:04,840 --> 00:31:07,245
The trouble is the natural
world isn't like this.

543
00:31:10,410 --> 00:31:13,080
And you aren't like
this either.

544
00:31:16,100 --> 00:31:21,020
I don't know, if I'm trying to
recognize faces, it's not that

545
00:31:21,020 --> 00:31:23,700
easy to do all this.

546
00:31:23,700 --> 00:31:27,380
First of all, it's a little
difficult to find the exact

547
00:31:27,380 --> 00:31:30,390
point, the exactly corresponding
points.

548
00:31:30,390 --> 00:31:32,950
I made a mistake in
doing it myself.

549
00:31:32,950 --> 00:31:35,230
And if the computer made a
mistake it would certainly

550
00:31:35,230 --> 00:31:36,060
make an error.

551
00:31:36,060 --> 00:31:39,440
Because it would be using
non-corresponding points to

552
00:31:39,440 --> 00:31:40,120
make the prediction.

553
00:31:40,120 --> 00:31:42,656
So it would be way off.

554
00:31:42,656 --> 00:31:47,070
But this is still in the
tradition of working from

555
00:31:47,070 --> 00:31:51,950
local features in the objects
toward recognition.

556
00:31:51,950 --> 00:31:58,790
So having looked at that theory,
we also find it a

557
00:31:58,790 --> 00:31:59,350
little wanting.

558
00:31:59,350 --> 00:32:02,280
It works great in some
circumstances, doesn't seem to

559
00:32:02,280 --> 00:32:03,790
solve the whole recognition
problem.

560
00:32:07,190 --> 00:32:09,590
Years pass.

561
00:32:09,590 --> 00:32:13,600
Shimon Ullman comes up with
another theory that's not so

562
00:32:13,600 --> 00:32:20,310
much based on edge fragments or
the location of particular

563
00:32:20,310 --> 00:32:27,570
features but rather
on correlation.

564
00:32:27,570 --> 00:32:32,930
Taking a picture of, say,
Krishna's face, taking a

565
00:32:32,930 --> 00:32:37,120
picture of the whole class, and
then using that as a kind

566
00:32:37,120 --> 00:32:40,680
of correlation mask, running it
all over the picture of the

567
00:32:40,680 --> 00:32:43,050
class, seeing where
it maximizes out.

568
00:32:43,050 --> 00:32:43,600
Now that's vague.

569
00:32:43,600 --> 00:32:45,480
I'll explain when I'm talking
about [INAUDIBLE]

570
00:32:45,480 --> 00:32:47,610
correlation in a minute.

571
00:32:47,610 --> 00:32:53,400
But it's basically saying, if
I have a picture of Krishna,

572
00:32:53,400 --> 00:32:54,390
where do I find him?

573
00:32:54,390 --> 00:32:55,902
I'll find him in one place.

574
00:32:55,902 --> 00:32:57,750
But you know what?

575
00:32:57,750 --> 00:33:00,410
Krishna doesn't look
like anybody else.

576
00:33:00,410 --> 00:33:02,810
So I might not find
any other faces.

577
00:33:02,810 --> 00:33:06,840
And if my objective is to find
all the faces, then maybe that

578
00:33:06,840 --> 00:33:09,150
idea won't work either.

579
00:33:09,150 --> 00:33:13,590
Or, to take another example,
here's a dollar bill.

580
00:33:13,590 --> 00:33:18,130
We haven't had raises in quite
a while, so this is my last one.

581
00:33:18,130 --> 00:33:20,950
It's got a picture of George
Washington on it.

582
00:33:20,950 --> 00:33:22,740
And I can look all
over the class.

583
00:33:22,740 --> 00:33:26,630
And if I use this as a face
detector, I'd be sorely

584
00:33:26,630 --> 00:33:27,100
disappointed.

585
00:33:27,100 --> 00:33:29,430
Because I wouldn't
find any faces.

586
00:33:29,430 --> 00:33:32,350
Because thank God, nobody looks
exactly like George

587
00:33:32,350 --> 00:33:32,890
Washington.

588
00:33:32,890 --> 00:33:36,700
So the correlation wouldn't
work very well.

589
00:33:36,700 --> 00:33:37,950
So that idea's a loser.

590
00:33:41,580 --> 00:33:42,290
But wait a minute.

591
00:33:42,290 --> 00:33:45,250
I don't have to look
for the whole face.

592
00:33:45,250 --> 00:33:50,790
I could just look for eyes.

593
00:33:50,790 --> 00:33:53,770
And then I could look for
noses and maybe mouths.

594
00:33:53,770 --> 00:33:57,080
And maybe I could have a library
of 10 different eyes

595
00:33:57,080 --> 00:34:01,280
and 10 different noses and
10 different mouths.

596
00:34:01,280 --> 00:34:02,540
Would that idea work?

597
00:34:06,100 --> 00:34:07,440
Probably not so well.

598
00:34:07,440 --> 00:34:09,420
The trouble with that
one is, I'd find

599
00:34:09,420 --> 00:34:11,676
eyeballs in every doorknob.

600
00:34:11,676 --> 00:34:17,960
There's just not enough stuff
there to give me a reliable

601
00:34:17,960 --> 00:34:19,210
correlation.

602
00:34:20,920 --> 00:34:23,989
So let's make this a little
more concrete by

603
00:34:23,989 --> 00:34:25,239
drawing some pictures.

604
00:34:29,770 --> 00:34:32,880
Halloween is approaching.

605
00:34:32,880 --> 00:34:35,174
So here's a face.

606
00:34:42,387 --> 00:34:44,375
All right?

607
00:34:44,375 --> 00:34:45,866
Here's another face.

608
00:34:55,840 --> 00:34:59,160
So those might be faces in my
pre-recorded library of

609
00:34:59,160 --> 00:35:01,410
pumpkin faces.

610
00:35:01,410 --> 00:35:02,660
Now along comes this face.

611
00:35:13,690 --> 00:35:16,270
What's going to happen?

612
00:35:16,270 --> 00:35:18,490
Well, I don't know.

613
00:35:18,490 --> 00:35:20,200
Let's draw yet another face.

614
00:35:32,020 --> 00:35:33,440
I don't know, that could
be a pretty weird

615
00:35:33,440 --> 00:35:34,460
pumpkin face, I suppose.

616
00:35:34,460 --> 00:35:37,000
But I mean it to be something
that doesn't look very much

617
00:35:37,000 --> 00:35:39,460
like a face.

618
00:35:39,460 --> 00:35:44,380
So if I'm doing a complete
correlation with either of

619
00:35:44,380 --> 00:35:47,280
these faces in my library,
neither one will match this

620
00:35:47,280 --> 00:35:48,530
one very well.

621
00:35:51,150 --> 00:35:55,800
If I'm looking for fine features
like eyes, then I've

622
00:35:55,800 --> 00:36:01,300
got these eyes everywhere.

623
00:36:01,300 --> 00:36:04,190
So it doesn't help very much.

624
00:36:04,190 --> 00:36:05,380
So you can see where
I'm going.

625
00:36:05,380 --> 00:36:10,030
And you can reinvent Ullman's
great idea.

626
00:36:10,030 --> 00:36:11,960
What is it?

627
00:36:11,960 --> 00:36:15,200
We don't look for big features,
like whole faces.

628
00:36:15,200 --> 00:36:16,970
We don't look for
small features,

629
00:36:16,970 --> 00:36:18,846
like individual eyes.

630
00:36:18,846 --> 00:36:22,180
We look for intermediate
features, like two eyes and a

631
00:36:22,180 --> 00:36:25,040
nose, or a mouth and a nose.

632
00:36:25,040 --> 00:36:34,310
So when we do that, then we
can say, now, here are two

633
00:36:34,310 --> 00:36:37,120
eyes and a nose.

634
00:36:37,120 --> 00:36:38,520
Well, that's found
in this one.

635
00:36:42,370 --> 00:36:48,460
And what about the combination
of that nose and that mouth?

636
00:36:48,460 --> 00:36:51,051
Oh, that's over here.

637
00:36:51,051 --> 00:36:53,030
But neither of those features
can be found

638
00:36:53,030 --> 00:36:56,800
in the fourth image.

639
00:36:56,800 --> 00:36:59,410
So that's the Goldilocks
principle.

640
00:36:59,410 --> 00:37:00,945
When you're doing this sort of
thing, you want things that

641
00:37:00,945 --> 00:37:03,500
are not too small
and not too big.

642
00:37:03,500 --> 00:37:07,740
I've got the Rumpelstiltskin
principle up

643
00:37:07,740 --> 00:37:08,970
there, too, by the way.

644
00:37:08,970 --> 00:37:12,020
Because I meant to mention
that Marr was a genius at

645
00:37:12,020 --> 00:37:13,830
naming things.

646
00:37:13,830 --> 00:37:18,650
And even though many of his
theories have faded, he's

647
00:37:18,650 --> 00:37:21,520
still known for these names like
primal sketch and two and

648
00:37:21,520 --> 00:37:23,030
a half D sketch because
he was such an artist

649
00:37:23,030 --> 00:37:24,610
at naming the concepts.

650
00:37:24,610 --> 00:37:27,900
He even got credit for a lot
of stuff that he didn't do.

651
00:37:27,900 --> 00:37:33,440
Not because he was deliberately
trying to get it

652
00:37:33,440 --> 00:37:35,240
inappropriately, but just
because he was so good at

653
00:37:35,240 --> 00:37:36,490
naming stuff.

654
00:37:36,490 --> 00:37:38,450
So we had the Rumpelstiltskin
principle back then.

655
00:37:38,450 --> 00:37:40,050
And now we have the Goldilocks
principle.

656
00:37:40,050 --> 00:37:43,535
Not too big, not too small.

657
00:37:43,535 --> 00:37:48,150
But that leaves us with the
final question, which is, so

658
00:37:48,150 --> 00:37:51,230
if what we want to do is look
for intermediate-size

659
00:37:51,230 --> 00:37:54,410
features, how do we actually
find them in a sea

660
00:37:54,410 --> 00:37:55,770
of faces out there?

661
00:37:55,770 --> 00:37:58,400
See, I might have a library,
I might take 10 of you and

662
00:37:58,400 --> 00:38:01,050
record your eyes.

663
00:38:01,050 --> 00:38:03,530
Take another ten, record
your mouths.

664
00:38:03,530 --> 00:38:06,400
And they may be put together
in a unique way for each of

665
00:38:06,400 --> 00:38:06,870
you out there.

666
00:38:06,870 --> 00:38:10,390
But it's likely that I'll
find Lana's eyes

667
00:38:10,390 --> 00:38:12,430
somewhere else in a crowd.

668
00:38:12,430 --> 00:38:16,850
And Nicola's mouth somewhere
else in a crowd.

669
00:38:16,850 --> 00:38:21,330
So how do we in fact go
about finding them?

670
00:38:21,330 --> 00:38:23,080
And I mentioned the
term correlation a

671
00:38:23,080 --> 00:38:24,720
couple of times now.

672
00:38:24,720 --> 00:38:26,500
Let me make that concrete.

673
00:38:31,270 --> 00:38:38,790
So let's consider a
one-dimensional face that

674
00:38:38,790 --> 00:38:40,040
looks like this.

675
00:38:47,950 --> 00:38:50,810
Which is a signal.

676
00:38:50,810 --> 00:38:53,640
And I'm going to consider
a one-dimensional image.

677
00:38:56,160 --> 00:39:04,390
And in that one-dimensional
image I've got a

678
00:39:04,390 --> 00:39:06,670
facsimile of the face.

679
00:39:06,670 --> 00:39:08,850
And the question is, what kind
of algorithm could I use to

680
00:39:08,850 --> 00:39:14,030
determine the offset in the
image where the face occurs?

681
00:39:14,030 --> 00:39:17,320
So you can see that one
possibility is you just do an

682
00:39:17,320 --> 00:39:25,270
integral of the signal in the
face and the signal out here

683
00:39:25,270 --> 00:39:29,610
over the extent of the face and
see how it multiplies out.

684
00:39:29,610 --> 00:39:34,920
Or, to make it less lawyerly
and more MITish, let's say

685
00:39:34,920 --> 00:39:41,310
that what we're going to do is
we're going to maximize over

686
00:39:41,310 --> 00:39:52,190
some parameter t the integral
over x of some face, which is

687
00:39:52,190 --> 00:40:04,220
a function of x and the image
g, which is a function of x

688
00:40:04,220 --> 00:40:07,830
minus that offset.

689
00:40:07,830 --> 00:40:14,200
So when the offset, t, is equal
to this offset, then

690
00:40:14,200 --> 00:40:17,350
we're essentially multiplying
the thing by itself and

691
00:40:17,350 --> 00:40:19,890
integrating over the
extent of the face.

692
00:40:19,890 --> 00:40:24,610
And that gives you a very big
number if they're lined up and

693
00:40:24,610 --> 00:40:27,420
a very small number
if they're not.

694
00:40:27,420 --> 00:40:32,370
And it's even true if we
add a whole lot of

695
00:40:32,370 --> 00:40:37,130
noise to the images.
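
The one-dimensional case can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the demonstration program used in the lecture: the "face" is a made-up 13-sample pattern (a Barker code, chosen because it correlates poorly with shifted copies of itself), the noise level and seed are arbitrary, and the integral becomes a sum. The sketch indexes the image at x + t; whether the offset enters with a plus or a minus sign is only a convention.

```python
import random

def correlate(face, image, t):
    """Sum over x of face[x] * image[x + t]: a discrete stand-in for the
    integral of the face against the image at offset t."""
    return sum(f * image[t + x] for x, f in enumerate(face))

def best_offset(face, image):
    """Return the offset t that maximizes the correlation."""
    offsets = range(len(image) - len(face) + 1)
    return max(offsets, key=lambda t: correlate(face, image, t))

def make_image(face, offset, length, noise_sigma, seed=0):
    """Build a noisy 1-D image with a copy of the face added at `offset`."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, noise_sigma) for _ in range(length)]
    for x, f in enumerate(face):
        image[offset + x] += f
    return image

# Hypothetical one-dimensional "face": a Barker-13 sequence.
FACE = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]

image = make_image(FACE, offset=20, length=60, noise_sigma=0.3)
found = best_offset(FACE, image)
print("face found at offset", found)
```

When the offset lines up, every term in the sum is a sample multiplied by (nearly) itself, so the terms all add; at other offsets the signs disagree and the terms largely cancel. That is why the peak survives the added noise.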

696
00:40:37,130 --> 00:40:38,660
But these are images.

697
00:40:38,660 --> 00:40:39,595
They're not one dimensional.

698
00:40:39,595 --> 00:40:41,210
But that's OK.

699
00:40:41,210 --> 00:40:44,215
It's easy enough to make
a modification here.

700
00:40:44,215 --> 00:40:46,980
We're going to maximize
over translation

701
00:40:46,980 --> 00:40:49,140
parameters x and y.

702
00:40:49,140 --> 00:40:51,970
And these are no longer
functions of just x, they're

703
00:40:51,970 --> 00:40:53,220
also functions of y.

704
00:40:56,900 --> 00:40:59,750
Like so.

705
00:40:59,750 --> 00:41:01,380
So that's basically
how it works.

706
00:41:01,380 --> 00:41:03,690
We won't go into details about
normalization and all that

707
00:41:03,690 --> 00:41:06,480
sort of thing because that's
the stuff of which other

708
00:41:06,480 --> 00:41:08,825
courses remain the custodians.
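
The two-dimensional version can be sketched the same way. Again this is only an illustration, with a made-up 3-by-3 mask, a blank background, and hypothetical function names; like the lecture, it skips normalization, so it should only be trusted on signals that are roughly zero-mean.

```python
def correlate_2d(mask, image, tx, ty):
    """Sum over x and y of mask[y][x] * image[ty + y][tx + x]."""
    return sum(mask[y][x] * image[ty + y][tx + x]
               for y in range(len(mask))
               for x in range(len(mask[0])))

def best_offset_2d(mask, image):
    """Return the (tx, ty) translation that maximizes the correlation."""
    h, w = len(mask), len(mask[0])
    candidates = [(tx, ty)
                  for ty in range(len(image) - h + 1)
                  for tx in range(len(image[0]) - w + 1)]
    return max(candidates, key=lambda t: correlate_2d(mask, image, t[0], t[1]))

# Hypothetical 3x3 mask standing in for a face or a face fragment.
MASK = [[1, -1, 1],
        [-1, 1, -1],
        [1, 1, -1]]

# A blank 6-row by 8-column image with the mask pasted at tx=4, ty=2.
image_2d = [[0] * 8 for _ in range(6)]
for y, row in enumerate(MASK):
    for x, value in enumerate(row):
        image_2d[2 + y][4 + x] = value

found_2d = best_offset_2d(MASK, image_2d)
print("mask found at (tx, ty) =", found_2d)
```

Sliding the mask over every position costs work proportional to the image area times the mask area, which is acceptable for a sketch but is one reason real systems use faster methods.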

709
00:41:11,340 --> 00:41:13,412
So would you like to see
a demonstration?

710
00:41:13,412 --> 00:41:14,662
OK.

711
00:41:36,410 --> 00:41:37,220
All right.

712
00:41:37,220 --> 00:41:42,000
So without realizing it, Nicola
and Erica have loaned

713
00:41:42,000 --> 00:41:44,080
us their pictures.

714
00:41:44,080 --> 00:41:49,490
And they are embedded in that
big field of noise.

715
00:41:49,490 --> 00:41:52,170
And it's pretty easy to pick out
Erica and Nicola, right?

716
00:41:52,170 --> 00:41:57,120
Because we are actually pretty
good at picking faces out of

717
00:41:57,120 --> 00:41:58,740
these images.

718
00:41:58,740 --> 00:42:01,220
So let's add some noise.

719
00:42:05,640 --> 00:42:08,160
It's a little harder now.

720
00:42:08,160 --> 00:42:10,100
What I'm going to do is I'm going
to run this correlation

721
00:42:10,100 --> 00:42:18,000
program over this whole image
using Nicola's face as a mask

722
00:42:18,000 --> 00:42:20,670
and seeing where the correlation
peaks up, in spite

723
00:42:20,670 --> 00:42:21,920
of all the noise that's
in there.

724
00:42:28,290 --> 00:42:29,540
Boom, there he is.

725
00:42:32,780 --> 00:42:34,820
I don't know, maybe we
can find Erica too.

726
00:42:37,370 --> 00:42:40,110
I forgot where she was.

727
00:42:40,110 --> 00:42:41,360
I can't find her.

728
00:42:44,740 --> 00:42:47,670
There she is.

729
00:42:47,670 --> 00:42:50,490
Unfortunately the parameters
aren't very good here.

730
00:42:50,490 --> 00:42:52,890
Do you see that?

731
00:42:52,890 --> 00:42:55,550
Let me get another
version of this.

732
00:42:55,550 --> 00:42:59,520
I'll just do some real-time
programming.

733
00:43:08,780 --> 00:43:13,210
I've been trying to reset the
parameters so that the images

734
00:43:13,210 --> 00:43:17,070
in the demonstration come
out clearly up there.

735
00:43:17,070 --> 00:43:19,680
Let's see if this works
a little better.

736
00:43:19,680 --> 00:43:20,990
OK, so let's add some noise.

737
00:43:23,860 --> 00:43:25,630
And let's find Erica.

738
00:43:28,750 --> 00:43:30,000
There she is.

739
00:43:32,340 --> 00:43:33,290
There are some other
things that look a

740
00:43:33,290 --> 00:43:34,550
little bit like Erica.

741
00:43:34,550 --> 00:43:36,800
But nothing looks quite
exactly like Erica.

742
00:43:39,450 --> 00:43:42,245
So let's try Nicola's eyes.

743
00:43:46,090 --> 00:43:48,070
So they stand out pretty
clearly against the

744
00:43:48,070 --> 00:43:49,630
background.

745
00:43:49,630 --> 00:43:51,280
Let's see if we can
find Erica's eyes.

746
00:43:54,580 --> 00:43:56,150
So they stand out pretty
clearly against the

747
00:43:56,150 --> 00:43:56,440
background.

748
00:43:56,440 --> 00:43:59,300
Notice that it also gets
Nicola's eyes.

749
00:43:59,300 --> 00:44:04,840
So two eyes is an
intermediate-size constraint.

750
00:44:04,840 --> 00:44:08,780
It's loose enough that it will
match more than one person.

751
00:44:08,780 --> 00:44:12,130
But it's not so loose
that it's as bad as

752
00:44:12,130 --> 00:44:15,050
looking for one eye.

753
00:44:15,050 --> 00:44:17,490
See, they're all
over the place.

754
00:44:17,490 --> 00:44:21,640
So two eyes and a nose, a mouth
and a nose, that would

755
00:44:21,640 --> 00:44:23,870
be even better as an
intermediate feature.

756
00:44:23,870 --> 00:44:25,875
But it doesn't matter what the
best ones are, because you can

757
00:44:25,875 --> 00:44:28,620
work that out experimentally.

758
00:44:28,620 --> 00:44:30,660
So that's how correlation
works.

759
00:44:30,660 --> 00:44:34,690
And it's just amazing how much
noise you can add and it'll

760
00:44:34,690 --> 00:44:36,500
still pick out the
right stuff.

761
00:44:39,160 --> 00:44:39,780
There's Nicola.

762
00:44:39,780 --> 00:44:41,080
Boom.

763
00:44:41,080 --> 00:44:43,290
Very clear.

764
00:44:43,290 --> 00:44:46,790
Want to add some more noise?

765
00:44:46,790 --> 00:44:49,640
I don't know, I can see it,
but that's because I'm a

766
00:44:49,640 --> 00:44:50,910
pretty good correlator, too.

767
00:44:55,300 --> 00:44:56,500
Boom.

768
00:44:56,500 --> 00:44:57,930
I don't know, let's add
some more noise.

769
00:45:04,420 --> 00:45:06,480
It's just hard to
get rid of it.

770
00:45:06,480 --> 00:45:09,650
It's just amazing how well
it picks it out.

771
00:45:09,650 --> 00:45:10,400
That's good.

772
00:45:10,400 --> 00:45:12,640
That's cool.

773
00:45:12,640 --> 00:45:16,690
Now, but the reason that this
is 30 years and we're still

774
00:45:16,690 --> 00:45:19,340
not done is there are still
some questions.

775
00:45:19,340 --> 00:45:22,730
This is recognizing
stuff straight on.

776
00:45:22,730 --> 00:45:25,260
How is it I can recognize you
in the hall from the side?

777
00:45:25,260 --> 00:45:27,600
Nobody knows.

778
00:45:27,600 --> 00:45:31,830
One possibility is that you have
an ability to make those

779
00:45:31,830 --> 00:45:32,630
transformations.

780
00:45:32,630 --> 00:45:37,410
If so, then that alignment
theory has a role to play.

781
00:45:37,410 --> 00:45:41,360
Another theory is that, well,
after I've seen you once I can

782
00:45:41,360 --> 00:45:44,680
watch you turn your head and
keep recording what you look

783
00:45:44,680 --> 00:45:47,220
like at all possible angles.

784
00:45:47,220 --> 00:45:48,480
That would work.

785
00:45:48,480 --> 00:45:52,150
The trouble is, is there
enough stuff in there?

786
00:45:52,150 --> 00:45:52,650
Maybe.

787
00:45:52,650 --> 00:45:53,900
We don't know.

788
00:45:55,770 --> 00:45:59,040
Now what would it take to
break this mechanism?

789
00:45:59,040 --> 00:45:59,910
Well, I don't know.

790
00:45:59,910 --> 00:46:01,200
Let's just see if we can
break the mechanism.

791
00:46:08,600 --> 00:46:11,815
Let's see if you can recognize
some well-known faces.

792
00:46:15,820 --> 00:46:16,400
Who's that?

793
00:46:16,400 --> 00:46:17,430
Quick.

794
00:46:17,430 --> 00:46:18,540
STUDENT: Obama.

795
00:46:18,540 --> 00:46:21,290
PATRICK WINSTON: Oh,
that's too easy.

796
00:46:21,290 --> 00:46:23,110
We'll see if we can make
some harder ones.

797
00:46:25,930 --> 00:46:26,955
Yeah, that's Obama.

798
00:46:26,955 --> 00:46:28,280
Who's this?

799
00:46:28,280 --> 00:46:29,600
STUDENT: Bush.

800
00:46:29,600 --> 00:46:29,940
PATRICK WINSTON: Oh boy.

801
00:46:29,940 --> 00:46:31,440
You're really good at this.

802
00:46:31,440 --> 00:46:32,200
That's Bush.

803
00:46:32,200 --> 00:46:32,960
How about this guy?

804
00:46:32,960 --> 00:46:34,210
STUDENT: Kerry.

805
00:46:38,680 --> 00:46:39,000
PATRICK WINSTON: OK.

806
00:46:39,000 --> 00:46:39,780
Now I've got it.

807
00:46:39,780 --> 00:46:41,220
Some people are starting
to turn their heads.

808
00:46:41,220 --> 00:46:42,722
And that's not fair.

809
00:46:42,722 --> 00:46:44,200
[LAUGHTER]

810
00:46:44,200 --> 00:46:46,030
PATRICK WINSTON: That's
not fair.

811
00:46:46,030 --> 00:46:49,070
Because you see what's happened
is that if this kind

812
00:46:49,070 --> 00:46:52,500
of pumpkin theory is correct,
then when you turn

813
00:46:52,500 --> 00:46:56,020
the face upside down you lose
the correlation of those

814
00:46:56,020 --> 00:46:58,740
features that have vertical
components.

815
00:46:58,740 --> 00:47:01,570
So if you have two eyes and a
nose, they won't match two

816
00:47:01,570 --> 00:47:04,890
eyes and a nose when they're
turned upside down.

817
00:47:04,890 --> 00:47:05,590
Well, let's see.

818
00:47:05,590 --> 00:47:08,470
We'll try some more.

819
00:47:08,470 --> 00:47:10,514
Who's that?

820
00:47:10,514 --> 00:47:11,370
STUDENT: Gorbachev.

821
00:47:11,370 --> 00:47:11,800
PATRICK WINSTON: Gorbachev.

822
00:47:11,800 --> 00:47:13,120
Who said that?

823
00:47:13,120 --> 00:47:14,575
Leonid, where are you?

824
00:47:14,575 --> 00:47:15,430
This is Gorbachev, right?

825
00:47:15,430 --> 00:47:18,340
You can recognize him because of
the little birthmark on the

826
00:47:18,340 --> 00:47:20,150
top of his head.

827
00:47:20,150 --> 00:47:21,105
One more.

828
00:47:21,105 --> 00:47:22,275
Who's--

829
00:47:22,275 --> 00:47:23,140
oh, that's easy.

830
00:47:23,140 --> 00:47:26,050
Who is it?

831
00:47:26,050 --> 00:47:28,050
That's Clinton.

832
00:47:28,050 --> 00:47:29,300
How about this one?

833
00:47:34,520 --> 00:47:39,000
Do you see how insulting
it is to be at MIT?

834
00:47:39,000 --> 00:47:40,076
That's me.

835
00:47:40,076 --> 00:47:43,480
[LAUGHTER]

836
00:47:43,480 --> 00:47:46,720
PATRICK WINSTON: And you
didn't even know.

837
00:47:46,720 --> 00:47:47,970
Oh, god.

838
00:47:52,770 --> 00:47:57,700
So this might be evidence for
the correlation theory.

839
00:47:57,700 --> 00:48:01,280
But of course, turning the face
upside down would make it

840
00:48:01,280 --> 00:48:02,860
very difficult to do
alignment, too.

841
00:48:02,860 --> 00:48:06,060
So it would break our alignment
theory, as well.

842
00:48:06,060 --> 00:48:09,045
Let me get that after class.
Was there a mistake, or?

843
00:48:09,045 --> 00:48:09,500
STUDENT: No, no.

844
00:48:09,500 --> 00:48:13,430
I was just curious [INAUDIBLE]
stretching would break the

845
00:48:13,430 --> 00:48:14,140
correlation.

846
00:48:14,140 --> 00:48:15,461
PATRICK WINSTON: If what would
break the structure?

847
00:48:15,461 --> 00:48:16,383
What?

848
00:48:16,383 --> 00:48:16,844
Stretching?

849
00:48:16,844 --> 00:48:18,094
STUDENT: [INAUDIBLE].

850
00:48:20,120 --> 00:48:21,800
PATRICK WINSTON: Elliot asked if
stretching would break the

851
00:48:21,800 --> 00:48:22,970
correlation.

852
00:48:22,970 --> 00:48:30,800
And the answer is, I think,
stretching in the vertical

853
00:48:30,800 --> 00:48:33,290
dimension is worse than
stretching in

854
00:48:33,290 --> 00:48:34,790
the horizontal dimension.

855
00:48:34,790 --> 00:48:36,455
Because you get a certain amount
of stretching in the

856
00:48:36,455 --> 00:48:38,700
horizontal dimension when
you just turn your head.

857
00:48:38,700 --> 00:48:41,450
By the way, since our faces
are basically mounted on a

858
00:48:41,450 --> 00:48:45,550
cylinder, this kind
of transformation

859
00:48:45,550 --> 00:48:46,890
might actually work.

860
00:48:46,890 --> 00:48:51,140
That's a sidebar to the answer
to your question, Elliot.

861
00:48:51,140 --> 00:48:53,730
But now you say, well, OK, so
this is not completely solved.

862
00:48:53,730 --> 00:48:55,980
You can work this out.

863
00:48:55,980 --> 00:48:59,430
But if you really want to work
something out, let me tell you

864
00:48:59,430 --> 00:49:03,570
what the current questions
are in computer vision.

865
00:49:03,570 --> 00:49:05,090
People have worked for an
awful long time on this

866
00:49:05,090 --> 00:49:16,010
recognition stuff and, to my
mind, have neglected the more

867
00:49:16,010 --> 00:49:18,900
serious questions.

868
00:49:18,900 --> 00:49:21,030
The more serious questions
are, how do you visually

869
00:49:21,030 --> 00:49:24,150
determine what's happening?

870
00:49:24,150 --> 00:49:28,280
If you could write a program
that would reliably determine

871
00:49:28,280 --> 00:49:31,520
when these verbs are happening
in your field of view, I will

872
00:49:31,520 --> 00:49:32,970
sign your Ph.D. thesis
tomorrow.

873
00:49:32,970 --> 00:49:35,680
There are 48 of them there.

874
00:49:35,680 --> 00:49:37,610
And that is today's challenge.

875
00:49:37,610 --> 00:49:40,630
But since we're short on time,
I want to skip over that and

876
00:49:40,630 --> 00:49:42,800
perform an experiment on you.

877
00:49:42,800 --> 00:49:44,892
I want you to tell me
what I'm doing.

878
00:49:44,892 --> 00:49:46,600
STUDENT: [INAUDIBLE].

879
00:49:46,600 --> 00:49:49,960
PATRICK WINSTON: So the best
single-word answer is?

880
00:49:49,960 --> 00:49:51,090
[INAUDIBLE]?

881
00:49:51,090 --> 00:49:51,490
STUDENT: Drinking.

882
00:49:51,490 --> 00:49:54,020
PATRICK WINSTON: OK, this
is not a trick question.

883
00:49:54,020 --> 00:49:56,500
OK, the best single-word
answer.

884
00:49:56,500 --> 00:49:57,778
Christopher, what
do you think?

885
00:49:57,778 --> 00:49:59,272
STUDENT: Toasting.

886
00:49:59,272 --> 00:50:00,770
PATRICK WINSTON: Christopher.

887
00:50:00,770 --> 00:50:02,942
Well, you.

888
00:50:02,942 --> 00:50:04,874
You.

889
00:50:04,874 --> 00:50:06,330
STUDENT: Toasting.

890
00:50:06,330 --> 00:50:07,910
PATRICK WINSTON: What?

891
00:50:07,910 --> 00:50:09,298
Toasting.

892
00:50:09,298 --> 00:50:09,782
OK.

893
00:50:09,782 --> 00:50:12,690
Not a trick question.

894
00:50:12,690 --> 00:50:13,940
What's happening here?

895
00:50:18,066 --> 00:50:20,878
Best single-word answer?

896
00:50:20,878 --> 00:50:21,786
STUDENT: Drinking.

897
00:50:21,786 --> 00:50:24,060
PATRICK WINSTON: Is drinking.

898
00:50:24,060 --> 00:50:25,846
Which pair look more alike?

899
00:50:25,846 --> 00:50:32,170
[LAUGHTER]

900
00:50:32,170 --> 00:50:34,210
PATRICK WINSTON: So that cat is
drinking and nobody has any

901
00:50:34,210 --> 00:50:35,500
trouble recognizing that.

902
00:50:35,500 --> 00:50:43,280
And I believe it's because
you're telling a story.

903
00:50:43,280 --> 00:50:46,260
So our power of storytelling
even reaches down into our

904
00:50:46,260 --> 00:50:47,620
visual apparatus.

905
00:50:47,620 --> 00:50:52,720
So the story here is that some
animal has evidently had an

906
00:50:52,720 --> 00:50:56,860
urge to find something to drink
and water is passing

907
00:50:56,860 --> 00:50:57,920
through that animal's mouth.

908
00:50:57,920 --> 00:50:59,720
That's the drinking story.

909
00:50:59,720 --> 00:51:02,520
So even though they look
enormously different visually,

910
00:51:02,520 --> 00:51:05,360
the stuff at the bottom of our
vision system provides enough

911
00:51:05,360 --> 00:51:08,910
evidence for our story apparatus
so that we can give

912
00:51:08,910 --> 00:51:12,300
the left one and the right one
different labels and recognize

913
00:51:12,300 --> 00:51:13,550
the cat is drinking.

914
00:51:16,410 --> 00:51:17,950
And that's the end
of the story.