1 00:00:09,215 --> 00:00:11,320 PATRICK WINSTON: You know, it's unfortunate that politics 2 00:00:11,320 --> 00:00:14,730 has become so serious. 3 00:00:14,730 --> 00:00:17,360 Back when you were little it was a lot more fun. 4 00:00:17,360 --> 00:00:20,740 You could make fun of politicians. 5 00:00:20,740 --> 00:00:23,815 Here's a politician some of you may recognize. 6 00:00:27,480 --> 00:00:31,970 But it's convenient to be able to vary what this particular 7 00:00:31,970 --> 00:00:34,210 politician looks like. 8 00:00:34,210 --> 00:00:41,274 For example, we can go from a cookie baker to radical. 9 00:00:41,274 --> 00:00:43,960 [LAUGHTER] 10 00:00:43,960 --> 00:00:51,544 PATRICK WINSTON: We can go from superwoman to bimbo. 11 00:00:51,544 --> 00:00:54,030 [LAUGHTER] 12 00:00:54,030 --> 00:00:55,920 PATRICK WINSTON: Socialite-- 13 00:00:55,920 --> 00:00:59,710 I put socialite into this. 14 00:00:59,710 --> 00:01:02,340 There she is. 15 00:01:02,340 --> 00:01:08,430 Or we can move the slider over the other way to bag lady. 16 00:01:08,430 --> 00:01:14,550 Alert, asleep, sad, happy. 17 00:01:14,550 --> 00:01:18,830 How does that work? 18 00:01:18,830 --> 00:01:19,340 I don't know. 19 00:01:19,340 --> 00:01:20,950 But I bet by the end of this hour you'll 20 00:01:20,950 --> 00:01:22,360 know how that works. 21 00:01:22,360 --> 00:01:25,690 And not only that, you'll understand something about 22 00:01:25,690 --> 00:01:29,940 what it takes to recognize faces. 23 00:01:29,940 --> 00:01:34,380 It turns out that some theories of face recognition are based 24 00:01:34,380 --> 00:01:41,791 on the same principles that this program is based on. 25 00:01:41,791 --> 00:01:45,030 But you can kind of guess what's happening here. 26 00:01:45,030 --> 00:01:49,500 There are many stored images and when I move those sliders 27 00:01:49,500 --> 00:01:52,590 it's interpolating amongst them. 28 00:01:52,590 --> 00:01:53,840 So that's how that works. 
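The slider demo is described only as "interpolating amongst" stored images, so the following is a guess at the simplest version of that idea rather than the program shown in lecture. It assumes each stored image is a small grayscale grid and each slider supplies one weight; the name `blend` and the sample grids are hypothetical. A real morphing program would probably also move feature points, which this sketch ignores.

```python
def blend(images, weights):
    """Weighted average of equally sized grayscale images (lists of rows)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    rows, cols = len(images[0]), len(images[0][0])
    return [
        [
            # Each output pixel is the weighted mean of the same pixel
            # in every stored image.
            sum(w * img[r][c] for img, w in zip(images, weights)) / total
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Two made-up 2x2 "stored images" standing in for the slider endpoints.
alert = [[0.0, 100.0], [200.0, 50.0]]
asleep = [[100.0, 100.0], [0.0, 150.0]]

# Slider in the middle: equal weight on both stored images.
halfway = blend([alert, asleep], [0.5, 0.5])
```

Moving the slider all the way to one end corresponds to weights such as `[1.0, 0.0]`, which returns that stored image unchanged.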
29 00:01:56,270 --> 00:02:00,500 But the main subject of today is this matter 30 00:02:00,500 --> 00:02:02,390 of recognizing objects. 31 00:02:02,390 --> 00:02:04,620 Faces could be the objects, but they don't have to be. 32 00:02:04,620 --> 00:02:08,430 This could be an object that you might want to recognize. 33 00:02:08,430 --> 00:02:11,580 And I want to talk to you a little bit about the history 34 00:02:11,580 --> 00:02:13,930 of this problem and where it stands today. 35 00:02:13,930 --> 00:02:15,900 It's still not solved. 36 00:02:15,900 --> 00:02:18,760 But it's an interesting exercise to see how the 37 00:02:18,760 --> 00:02:22,360 attempts at solution have evolved slowly 38 00:02:22,360 --> 00:02:23,940 over the past 30 years. 39 00:02:23,940 --> 00:02:28,160 So slowly, in fact, that I think if someone told me how 40 00:02:28,160 --> 00:02:31,380 long it would take to get to where we are 30 years ago I 41 00:02:31,380 --> 00:02:33,579 think I would have hung myself. 42 00:02:33,579 --> 00:02:36,440 But things do move slowly. 43 00:02:36,440 --> 00:02:38,500 And it's important to see how slowly they move. 44 00:02:38,500 --> 00:02:42,170 Because they will continue to move slowly in the future. 45 00:02:42,170 --> 00:02:43,920 And you have to understand that that's the way things 46 00:02:43,920 --> 00:02:45,990 work sometimes. 47 00:02:45,990 --> 00:02:49,590 So to start this all off, we have to go back to the ideas 48 00:02:49,590 --> 00:02:53,060 of the legendary David Marr, who dropped dead from leukemia 49 00:02:53,060 --> 00:02:56,250 in about 1980. 50 00:02:56,250 --> 00:03:00,700 I say, the gospel according to Marr, because he was such a 51 00:03:00,700 --> 00:03:03,960 powerful and central figure that almost anything he said 52 00:03:03,960 --> 00:03:09,800 was believed by a large collection of devotees. 
53 00:03:09,800 --> 00:03:15,240 But Marr articulated a set of ideas about how computer 54 00:03:15,240 --> 00:03:19,810 vision would work that started off by suggesting that with 55 00:03:19,810 --> 00:03:26,340 the input from the camera, you look for edges. 56 00:03:26,340 --> 00:03:27,700 And you find edge fragments. 57 00:03:27,700 --> 00:03:35,720 And normally they wouldn't be even as well-drawn as I've 58 00:03:35,720 --> 00:03:37,990 done them now. 59 00:03:37,990 --> 00:03:40,410 Or as badly drawn as I've done them now. 60 00:03:40,410 --> 00:03:43,460 But the first step, then, in visual recognition would be to 61 00:03:43,460 --> 00:03:46,329 form this edge-based description of what's out 62 00:03:46,329 --> 00:03:47,720 there in the world. 63 00:03:47,720 --> 00:03:49,875 And Marr called that the primal sketch. 64 00:03:57,620 --> 00:04:00,960 And from the primal sketch, the next step was to decorate 65 00:04:00,960 --> 00:04:06,600 the primal sketch with some vectors, some surface normals, 66 00:04:06,600 --> 00:04:12,440 showing where the faces on the object were oriented. 67 00:04:12,440 --> 00:04:14,340 He called that the two and a half D sketch. 68 00:04:21,360 --> 00:04:22,620 Now why is it two and a half D? 69 00:04:22,620 --> 00:04:26,360 Well, it's sort of 2D in the sense that it's still 70 00:04:26,360 --> 00:04:31,070 camera-centric in its way of presenting information. 71 00:04:31,070 --> 00:04:33,360 But at the same time, it attempts to say something about the 72 00:04:33,360 --> 00:04:37,110 three-dimensional arrangement of the faces. 73 00:04:37,110 --> 00:04:39,610 So the speculation was that you couldn't get to where you 74 00:04:39,610 --> 00:04:41,330 wanted to go in one step. 75 00:04:41,330 --> 00:04:43,970 So you needed several steps to get from the image to 76 00:04:43,970 --> 00:04:45,990 something you could recognize. 
77 00:04:45,990 --> 00:04:50,100 And the third step was to convert the two and a half D 78 00:04:50,100 --> 00:04:51,850 sketch into generalized cylinders. 79 00:05:03,100 --> 00:05:03,960 And the idea is this. 80 00:05:03,960 --> 00:05:08,140 If you have a regular cylinder, you can think of it 81 00:05:08,140 --> 00:05:13,650 as a circular area moving along an axis like so. 82 00:05:13,650 --> 00:05:16,570 So that's the description of a cylinder. 83 00:05:16,570 --> 00:05:18,990 A circular area moving along an axis. 84 00:05:18,990 --> 00:05:23,010 You can get a different kind of cylinder if you go along 85 00:05:23,010 --> 00:05:26,320 the same axis but you allow the size of the circle to 86 00:05:26,320 --> 00:05:27,820 change as you go. 87 00:05:27,820 --> 00:05:31,220 So for example, if you were to describe a wine bottle. 88 00:05:35,550 --> 00:05:37,590 It would be a function of distance along the axis that 89 00:05:37,590 --> 00:05:42,580 would shrink the circle appropriately to match the 90 00:05:42,580 --> 00:05:45,520 dimensions of a wine bottle. 91 00:05:45,520 --> 00:05:46,780 A fine burgundy, I perceive. 92 00:05:46,780 --> 00:05:50,920 In any case, this one once converted into a generalized 93 00:05:50,920 --> 00:05:54,260 cylinder, when matched against a library of such 94 00:05:54,260 --> 00:05:58,875 descriptions, results in recognition. 95 00:06:04,290 --> 00:06:07,190 Great theory, based on the idea that you start off by 96 00:06:07,190 --> 00:06:11,330 looking at edges and you end up, in several steps of 97 00:06:11,330 --> 00:06:14,470 transformation, producing something that you could look 98 00:06:14,470 --> 00:06:17,350 up in a library of descriptions. 99 00:06:17,350 --> 00:06:20,100 Great idea. 100 00:06:20,100 --> 00:06:23,840 Trouble is, no one could make it work. 101 00:06:26,950 --> 00:06:28,976 It was too hard to do this. 102 00:06:28,976 --> 00:06:31,250 It was too hard to do that. 
103 00:06:31,250 --> 00:06:32,980 And the generalized cylinders produced, if 104 00:06:32,980 --> 00:06:35,980 any, were too coarse. 105 00:06:35,980 --> 00:06:37,520 You couldn't tell the difference between a Ford and 106 00:06:37,520 --> 00:06:40,520 a Chevrolet or between a Volkswagen and a Cadillac. 107 00:06:40,520 --> 00:06:43,010 Because they were just too coarse. 108 00:06:43,010 --> 00:06:45,655 So although it was a great idea based on the idea that 109 00:06:45,655 --> 00:06:49,580 you have to do recognition in several transformations of 110 00:06:49,580 --> 00:06:55,430 representational apparatus, it just didn't work. 111 00:06:55,430 --> 00:07:00,880 So much later, maybe 15 years later or so, we get to the 112 00:07:00,880 --> 00:07:02,610 next part of our story. 113 00:07:02,610 --> 00:07:07,590 Which is the alignment theories, most notably the one 114 00:07:07,590 --> 00:07:10,060 produced by Shimon Ullman, one of Marr's students. 115 00:07:12,580 --> 00:07:16,810 So the alignment theory of recognition is based on a very 116 00:07:16,810 --> 00:07:19,250 strange and exotic idea. 117 00:07:19,250 --> 00:07:22,930 It doesn't seem strange and exotic to mechanical engineers 118 00:07:22,930 --> 00:07:25,470 for a while, because they're used to mechanical drawings. 119 00:07:25,470 --> 00:07:28,620 But here's the strange and miraculous idea. 120 00:07:28,620 --> 00:07:30,110 Imagine this object. 121 00:07:30,110 --> 00:07:33,540 You take three pictures of it. 122 00:07:33,540 --> 00:07:38,390 You can reconstruct any view of that object. 123 00:07:38,390 --> 00:07:41,860 Now, I have to be a little bit careful about how I say that. 124 00:07:41,860 --> 00:07:46,960 First of all, some of the vertexes are not visible in 125 00:07:46,960 --> 00:07:48,490 the views that you have. 126 00:07:48,490 --> 00:07:50,880 So, of course, you can't do anything with those. 
127 00:07:50,880 --> 00:07:53,600 So let's say that we have a transparent object where you 128 00:07:53,600 --> 00:07:55,570 can see all the vertexes. 129 00:07:55,570 --> 00:07:59,840 If you have three pictures of that, you can reconstruct any 130 00:07:59,840 --> 00:08:01,990 view of that object. 131 00:08:01,990 --> 00:08:04,090 Now I have to be a little careful about how I say that, 132 00:08:04,090 --> 00:08:06,430 because it's not true. 133 00:08:06,430 --> 00:08:09,730 What's true is, you can produce any view of that in 134 00:08:09,730 --> 00:08:11,620 orthographic projection. 135 00:08:11,620 --> 00:08:13,670 So if you're close enough to the object that you get 136 00:08:13,670 --> 00:08:15,010 perspective, it doesn't work. 137 00:08:15,010 --> 00:08:18,420 But for the most part, you can neglect perspective after you 138 00:08:18,420 --> 00:08:20,860 get about two and a half times as far away as 139 00:08:20,860 --> 00:08:22,420 the object is big. 140 00:08:22,420 --> 00:08:24,830 And you can presume that you've got orthographic 141 00:08:24,830 --> 00:08:27,560 projection. 142 00:08:27,560 --> 00:08:29,500 So that's a strange and exotic idea. 143 00:08:29,500 --> 00:08:31,080 But how can you make a recognition 144 00:08:31,080 --> 00:08:32,150 theory out of that? 145 00:08:32,150 --> 00:08:33,740 So let me show you. 146 00:08:33,740 --> 00:08:36,395 Well, here's one drawing of the object, I need two more. 147 00:08:39,230 --> 00:08:40,020 Let's see. 148 00:08:40,020 --> 00:08:41,270 Let's have this one. 149 00:08:48,440 --> 00:08:50,620 And maybe one that's tilted up a little bit. 150 00:08:58,140 --> 00:09:05,360 It's important that these pictures not be just rotations 151 00:09:05,360 --> 00:09:06,030 on one axis. 152 00:09:06,030 --> 00:09:07,860 Because they wouldn't form what you might think of as a 153 00:09:07,860 --> 00:09:09,870 kind of basis set. 154 00:09:09,870 --> 00:09:10,850 So there are three pictures. 
155 00:09:10,850 --> 00:09:12,110 We'll call them a, b, and c. 156 00:09:15,830 --> 00:09:18,860 And then we want a fourth picture. 157 00:09:18,860 --> 00:09:21,172 Which will look like this. 158 00:09:21,172 --> 00:09:24,570 It doesn't have to be too precise. 159 00:09:24,570 --> 00:09:26,890 And we'll call that the unknown. 160 00:09:26,890 --> 00:09:33,220 And what we really want to know is if the unknown is the 161 00:09:33,220 --> 00:09:37,100 same object that these three pictures were made from. 162 00:09:41,170 --> 00:09:44,230 So let me begin with an assertion. 163 00:09:44,230 --> 00:09:47,570 I'll need four colors of chalk to make this assertion. 164 00:09:47,570 --> 00:09:51,310 What I want to do is I want to pick a particular place on the 165 00:09:51,310 --> 00:09:52,560 object, like this one. 166 00:09:55,770 --> 00:09:58,220 And maybe the same place on this object over here. 167 00:09:58,220 --> 00:10:00,790 Those are corresponding places, right? 168 00:10:00,790 --> 00:10:05,480 So I can now write an equation that the x-coordinate of that 169 00:10:05,480 --> 00:10:12,690 unknown object is equal to, oh, I don't know, alpha x sub 170 00:10:12,690 --> 00:10:24,620 a plus beta x sub b plus gamma x sub c plus 171 00:10:24,620 --> 00:10:27,460 some constant, tau. 172 00:10:27,460 --> 00:10:29,010 Well, of course, that's obviously true. 173 00:10:29,010 --> 00:10:32,330 Because I'm letting you take those alpha, beta, gamma, and 174 00:10:32,330 --> 00:10:33,890 tau and make them anything you want. 175 00:10:36,680 --> 00:10:39,870 So although that's conspicuously obviously true, 176 00:10:39,870 --> 00:10:41,630 it's not interesting. 177 00:10:41,630 --> 00:10:42,910 So let me take another point. 178 00:10:45,410 --> 00:10:48,680 And of course, I can write the same equation down for this 179 00:10:48,680 --> 00:10:49,930 purple point. 
180 00:11:01,800 --> 00:11:05,190 And now that I'm on a roll and having a great deal of fun 181 00:11:05,190 --> 00:11:12,760 with this, I can take this point 182 00:11:12,760 --> 00:11:14,010 and make a blue equation. 183 00:11:26,110 --> 00:11:29,840 And you know I'm destined to do it, so I've 184 00:11:29,840 --> 00:11:31,050 got one more color. 185 00:11:31,050 --> 00:11:32,872 I might as well use it. 186 00:11:32,872 --> 00:11:36,350 Let's just make sure I get something that works here. 187 00:11:36,350 --> 00:11:38,810 That's this one, that's this one. 188 00:11:38,810 --> 00:11:42,180 I hope I've got these correspondences right. 189 00:11:42,180 --> 00:11:42,700 STUDENT: [INAUDIBLE]. 190 00:11:42,700 --> 00:11:44,030 PATRICK WINSTON: Have I got one off? 191 00:11:44,030 --> 00:11:45,190 STUDENT: [INAUDIBLE]. 192 00:11:45,190 --> 00:11:45,865 PATRICK WINSTON: Which color? 193 00:11:45,865 --> 00:11:46,210 STUDENT: Blue. 194 00:11:46,210 --> 00:11:47,570 [INAUDIBLE]. 195 00:11:47,570 --> 00:11:47,930 PATRICK WINSTON: OK. 196 00:11:47,930 --> 00:11:51,500 So this one goes with this one, goes with this one. 197 00:11:51,500 --> 00:11:52,200 Is that one wrong? 198 00:11:52,200 --> 00:11:54,920 STUDENTS: Yeah. 199 00:11:54,920 --> 00:11:56,100 PATRICK WINSTON: Oh, oh, oh. 200 00:11:56,100 --> 00:12:00,442 Of course this one, excuse me, goes down here. 201 00:12:00,442 --> 00:12:02,380 Right? 202 00:12:02,380 --> 00:12:05,520 And then this one is off as well. 203 00:12:05,520 --> 00:12:07,870 I wouldn't get a very good recognition scheme if I can't 204 00:12:07,870 --> 00:12:10,630 get those correspondences right. 205 00:12:10,630 --> 00:12:16,108 Which is one of the lessons of today. 206 00:12:16,108 --> 00:12:16,550 OK. 207 00:12:16,550 --> 00:12:17,820 Now I've got them right. 208 00:12:17,820 --> 00:12:19,700 And now that equation is correct. 209 00:12:19,700 --> 00:12:22,970 I think I've got this one right already. 
210 00:12:22,970 --> 00:12:24,730 So now I can just write that down. 211 00:12:24,730 --> 00:12:26,910 I'm on a roll, I'm just copying this. 212 00:12:37,870 --> 00:12:41,460 So those are a bunch of equations. 213 00:12:41,460 --> 00:12:49,530 And now the astonishing part is that I can choose alpha, 214 00:12:49,530 --> 00:12:56,070 beta, gamma, and tau to be all the same. 215 00:12:59,710 --> 00:13:03,200 That is, there's one set of alpha, beta, gamma, and tau 216 00:13:03,200 --> 00:13:07,220 that works for everything, for all four points. 217 00:13:07,220 --> 00:13:09,330 So you look at that puzzled. 218 00:13:09,330 --> 00:13:10,450 And that's OK to be puzzled. 219 00:13:10,450 --> 00:13:11,890 Because I certainly haven't proved it. 220 00:13:11,890 --> 00:13:14,430 I'm asserting it. 221 00:13:14,430 --> 00:13:15,890 But right away, there's something interesting about 222 00:13:15,890 --> 00:13:18,330 this and that is that the relationship between the 223 00:13:18,330 --> 00:13:22,020 points on the unknown object and the points in this stored 224 00:13:22,020 --> 00:13:28,700 library of images are related linearly. 225 00:13:28,700 --> 00:13:31,570 That's true because it's orthographic projection. 226 00:13:31,570 --> 00:13:32,880 Linearly related. 227 00:13:32,880 --> 00:13:38,610 So I can generate the points in some fourth object from the 228 00:13:38,610 --> 00:13:43,740 points in three sample objects with linear operations. 229 00:13:43,740 --> 00:13:43,990 Christopher? 230 00:13:43,990 --> 00:13:46,950 STUDENT: Is that the x-coordinate of-- 231 00:13:46,950 --> 00:13:48,200 PATRICK WINSTON: It's the x-coordinate. 232 00:13:50,840 --> 00:13:52,380 Christopher asked about the x-coordinates. 233 00:13:52,380 --> 00:13:55,850 Each of these x-coordinates are meant to be color coded. 234 00:13:55,850 --> 00:14:00,130 It gets a little complicated with notation and stuff. 235 00:14:00,130 --> 00:14:03,720 So that's the reason I'm color coding the coordinates. 
236 00:14:03,720 --> 00:14:09,710 So the orange x sub u is the x-coordinate of that 237 00:14:09,710 --> 00:14:10,570 particular point. 238 00:14:10,570 --> 00:14:12,390 STUDENT: In 3D space? 239 00:14:12,390 --> 00:14:13,610 PATRICK WINSTON: No. 240 00:14:13,610 --> 00:14:14,480 Not in 3D space. 241 00:14:14,480 --> 00:14:15,536 In the image. 242 00:14:15,536 --> 00:14:17,720 STUDENT: So it's a 2D projection of it? 243 00:14:17,720 --> 00:14:19,750 PATRICK WINSTON: It's a 2D projection of it, an 244 00:14:19,750 --> 00:14:20,600 orthographic projection. 245 00:14:20,600 --> 00:14:21,450 OK? 246 00:14:21,450 --> 00:14:24,010 So we're looking at drawings. 247 00:14:24,010 --> 00:14:26,500 And those coordinates over there are the two-dimensional 248 00:14:26,500 --> 00:14:29,470 coordinates in the drawing. 249 00:14:29,470 --> 00:14:32,615 Just as if it were on your retina. 250 00:14:32,615 --> 00:14:34,555 STUDENT: [INAUDIBLE] 251 00:14:34,555 --> 00:14:39,410 vertexes on the 3D projection or can curved surfaces also? 252 00:14:39,410 --> 00:14:41,470 PATRICK WINSTON: So he asked about curved surfaces. 253 00:14:41,470 --> 00:14:43,810 And the answer is that you have to find corresponding 254 00:14:43,810 --> 00:14:45,610 points on the object. 255 00:14:45,610 --> 00:14:50,070 So if you have a totally curved surface and you can't 256 00:14:50,070 --> 00:14:52,940 identify any corresponding points, you lose. 257 00:14:52,940 --> 00:14:56,230 But if you consider our faces, there are some obvious points, 258 00:14:56,230 --> 00:14:58,700 even though our faces are not by any means 259 00:14:58,700 --> 00:15:00,970 flat like these objects. 260 00:15:00,970 --> 00:15:03,260 We have the tip of our nose and the center of our 261 00:15:03,260 --> 00:15:06,290 eyeballs and so on. 262 00:15:06,290 --> 00:15:11,250 So if that's true, what does that mean about recovering 263 00:15:11,250 --> 00:15:15,390 alpha, beta, gamma, and tau? 
264 00:15:15,390 --> 00:15:18,590 Can we find them? 265 00:15:18,590 --> 00:15:19,890 [INAUDIBLE], what do you think? 266 00:15:19,890 --> 00:15:21,300 How do we go about finding them? 267 00:15:21,300 --> 00:15:22,920 You're nodding your head in the right direction. 268 00:15:22,920 --> 00:15:23,880 [LAUGHTER] 269 00:15:23,880 --> 00:15:26,280 STUDENT: It's four equations and-- 270 00:15:26,280 --> 00:15:26,550 PATRICK WINSTON: Splendid. 271 00:15:26,550 --> 00:15:28,712 It's four equations and four unknowns. 272 00:15:28,712 --> 00:15:30,400 Four linear equations and four unknowns. 273 00:15:30,400 --> 00:15:33,560 So obviously, you can solve for alpha, beta, gamma, and 274 00:15:33,560 --> 00:15:37,960 tau if you know that these equations are correct. 275 00:15:37,960 --> 00:15:41,390 So how does that help us with recognition? 276 00:15:41,390 --> 00:15:44,000 It helps us with recognition because we can take another 277 00:15:44,000 --> 00:15:51,820 point, let me say this square point here and this 278 00:15:51,820 --> 00:15:58,410 corresponding square point here and this corresponding 279 00:15:58,410 --> 00:16:00,750 square point here, and what can we do with those three 280 00:16:00,750 --> 00:16:02,460 points now? 281 00:16:02,460 --> 00:16:05,610 We've got alpha, beta, gamma, and tau, so we can predict 282 00:16:05,610 --> 00:16:09,060 where it's going to be in the fourth image. 283 00:16:09,060 --> 00:16:15,310 So we can predict that that square point is going to be 284 00:16:15,310 --> 00:16:17,060 right there. 285 00:16:17,060 --> 00:16:21,030 And if it isn't, we're highly suspicious about whether this 286 00:16:21,030 --> 00:16:25,230 object is the kind of object we think it is. 287 00:16:25,230 --> 00:16:26,480 So you look at me in disbelief. 288 00:16:26,480 --> 00:16:29,036 You'd like me to demonstrate this, I imagine. 289 00:16:29,036 --> 00:16:29,472 STUDENT: Yeah. 290 00:16:29,472 --> 00:16:31,220 PATRICK WINSTON: OK. 
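The four-equations, four-unknowns step can be sketched as follows. This is an illustration under stated assumptions, not the code behind the lecture demo: the function names (`solve_linear`, `fit_coefficients`, `predict_x`) are hypothetical, the coordinates are made-up image x-values chosen to be consistent with one rigid object, and plain Gaussian elimination stands in for whatever solver one would really use.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate correspondences; pick other points")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_coefficients(xa, xb, xc, xu):
    """Recover [alpha, beta, gamma, tau] from four corresponding points."""
    rows = [[a, b, c, 1.0] for a, b, c in zip(xa, xb, xc)]
    return solve_linear(rows, xu)


def predict_x(coeffs, a, b, c):
    """Predicted x-coordinate of a further point in the unknown image."""
    alpha, beta, gamma, tau = coeffs
    return alpha * a + beta * b + gamma * c + tau


# Made-up x-coordinates of four corresponding points in images a, b, c
# and in the unknown image.
xa = [2.0, 3.0, 2.0, 2.0]
xb = [-1.0, -1.0, 0.0, -1.0]
xc = [0.5, 0.5, 0.5, 1.5]
xu = [3.0, 3.6, 3.8, 3.0]

coeffs = fit_coefficients(xa, xb, xc, xu)

# A fifth point (the "square point"): predict where it should appear.
predicted = predict_x(coeffs, 3.0, 1.0, 3.5)
```

If the fifth point actually observed in the unknown image lies far from `predicted`, the unknown is probably not the object the three stored pictures came from. The same fit would be repeated for the y-coordinates.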
291 00:16:31,220 --> 00:16:32,500 Let me see if I can demonstrate this. 292 00:16:51,480 --> 00:16:57,110 So I'm going to do this in a slightly simplified version. 293 00:16:57,110 --> 00:17:01,320 I'm only going to allow rotation around 294 00:17:01,320 --> 00:17:03,270 the vertical axis. 295 00:17:03,270 --> 00:17:05,680 And just so you know I'm not cheating, there's a little 296 00:17:05,680 --> 00:17:09,358 slider here that rotates that third object. 297 00:17:09,358 --> 00:17:12,770 Let's see, why are there just two known 298 00:17:12,770 --> 00:17:13,940 objects and one unknown? 299 00:17:13,940 --> 00:17:16,740 Well that's because I've restricted the motion to 300 00:17:16,740 --> 00:17:20,848 rotation around the vertical axis and some translation. 301 00:17:20,848 --> 00:17:24,630 So now that I've spun that around a little bit, let me 302 00:17:24,630 --> 00:17:27,300 pick some corresponding points. 303 00:17:27,300 --> 00:17:28,508 Oops. 304 00:17:28,508 --> 00:17:29,758 What's happened? 305 00:17:41,240 --> 00:17:41,520 Wow. 306 00:17:41,520 --> 00:17:42,840 Let me run that by again. 307 00:18:04,430 --> 00:18:04,780 OK. 308 00:18:04,780 --> 00:18:07,060 So there's one point I've selected 309 00:18:07,060 --> 00:18:08,970 from the model objects. 310 00:18:08,970 --> 00:18:10,830 The corresponding point over here on the 311 00:18:10,830 --> 00:18:12,400 unknown is right there. 312 00:18:12,400 --> 00:18:13,350 I'm going to be a little off. 313 00:18:13,350 --> 00:18:15,120 But that's OK. 314 00:18:15,120 --> 00:18:18,480 So let me just pick that one and then that 315 00:18:18,480 --> 00:18:20,650 corresponds to this one. 316 00:18:20,650 --> 00:18:23,870 Krishna, would you like to specify a point so people know 317 00:18:23,870 --> 00:18:25,680 I'm not cheating. 318 00:18:25,680 --> 00:18:27,900 Pick a point. 319 00:18:27,900 --> 00:18:29,050 Pick a point, Krishna. 320 00:18:29,050 --> 00:18:30,874 STUDENT: Oh, the right? 
321 00:18:30,874 --> 00:18:32,020 PATRICK WINSTON: The right? 322 00:18:32,020 --> 00:18:32,700 STUDENT: Yeah. 323 00:18:32,700 --> 00:18:33,190 PATRICK WINSTON: This one? 324 00:18:33,190 --> 00:18:35,046 STUDENT: Yep. 325 00:18:35,046 --> 00:18:35,510 PATRICK WINSTON: Oops. 326 00:18:35,510 --> 00:18:37,310 OK, let's pick it out on the model first. 327 00:18:37,310 --> 00:18:40,175 Now pick it over here. 328 00:18:40,175 --> 00:18:40,670 Boom. 329 00:18:40,670 --> 00:18:43,290 So all the points are where they're supposed to be. 330 00:18:43,290 --> 00:18:45,690 Isn't that cool? 331 00:18:45,690 --> 00:18:47,640 Well, let's suppose that the unknown is something else. 332 00:18:50,540 --> 00:18:52,710 This is a carefully selected object. 333 00:18:52,710 --> 00:18:57,320 Because the points are all the correct positions vertically, 334 00:18:57,320 --> 00:18:59,300 but they're not necessarily the correct positions in the 335 00:18:59,300 --> 00:19:00,950 other two dimensions. 336 00:19:00,950 --> 00:19:08,830 So let me pick this point, and this point, and this point, 337 00:19:08,830 --> 00:19:11,160 and this point. 338 00:19:11,160 --> 00:19:14,630 And Krishna had me pick this point. 339 00:19:14,630 --> 00:19:17,220 So let me pick this point. 340 00:19:17,220 --> 00:19:21,530 So if it thinks that the unknown is one of these 341 00:19:21,530 --> 00:19:25,420 obelisk objects, then we would expect to see all of the 342 00:19:25,420 --> 00:19:28,280 corresponding points correctly identified. 343 00:19:28,280 --> 00:19:29,780 But boom. 344 00:19:29,780 --> 00:19:31,030 All the points are off. 345 00:19:34,850 --> 00:19:37,360 So it seems to work in this particular example. 346 00:19:37,360 --> 00:19:43,640 I find the alpha and beta using two images. 347 00:19:43,640 --> 00:19:47,010 And I predict the locations of the other points. 348 00:19:47,010 --> 00:19:49,630 And I determine whether those positions are correct. 
349 00:19:49,630 --> 00:19:51,780 And if they are correct, then I have a pretty good idea that 350 00:19:51,780 --> 00:19:54,840 I have in fact identified the object on the right as either 351 00:19:54,840 --> 00:19:59,540 an obelisk or an organ, depending on which of the 352 00:19:59,540 --> 00:20:05,420 model choices and the unknown choices I've selected. 353 00:20:05,420 --> 00:20:09,550 So the only thing I have left to do is to demonstrate that 354 00:20:09,550 --> 00:20:12,160 what I said about this is true. 355 00:20:12,160 --> 00:20:14,880 So I'm going to actually demonstrate that what I said 356 00:20:14,880 --> 00:20:18,710 about this is true using the configuration in this 357 00:20:18,710 --> 00:20:19,640 demonstration. 358 00:20:19,640 --> 00:20:23,120 Because it's much too hard for me to remember matrix 359 00:20:23,120 --> 00:20:25,930 transformations for generalized rotation in three 360 00:20:25,930 --> 00:20:27,600 dimensions. 361 00:20:27,600 --> 00:20:28,850 So here's how it's going to work. 362 00:20:33,410 --> 00:20:37,140 The z-axis is going up that way. 363 00:20:37,140 --> 00:20:41,640 Or, it's going to be pointing toward you. 364 00:20:41,640 --> 00:20:43,760 And what I'm going to do is I'm going to 365 00:20:43,760 --> 00:20:46,820 rotate around this axis. 366 00:20:46,820 --> 00:20:49,750 And what I want to do is I want to find out how the 367 00:20:49,750 --> 00:20:52,670 x-coordinate in the image of the points move 368 00:20:52,670 --> 00:20:53,920 as I do that rotation. 369 00:20:56,350 --> 00:21:00,300 So here's the x-axis. 370 00:21:00,300 --> 00:21:03,180 This is the coordinate that you can see. 371 00:21:03,180 --> 00:21:05,520 Here is the y-axis. 372 00:21:05,520 --> 00:21:08,010 That's in depth, so you can't tell how far away it is. 373 00:21:10,750 --> 00:21:12,180 And the z-axis-- 374 00:21:12,180 --> 00:21:17,060 x, y, z-axis must be pointing out that way toward you. 
375 00:21:17,060 --> 00:21:21,660 So now I'm going to consider just a single point on the 376 00:21:21,660 --> 00:21:24,310 object and see what happens to it. 377 00:21:24,310 --> 00:21:31,640 So I'm going to say to myself, let's put the object in some 378 00:21:31,640 --> 00:21:32,650 kind of standard position. 379 00:21:32,650 --> 00:21:34,300 I don't care what it is. 380 00:21:34,300 --> 00:21:36,450 It can be just random, just spin it around. 381 00:21:36,450 --> 00:21:42,810 Some position, we'll call that the standard position, S. And 382 00:21:42,810 --> 00:21:46,770 that means that the x-coordinate of the standard 383 00:21:46,770 --> 00:21:49,890 position is x sub s. 384 00:21:49,890 --> 00:21:57,520 And the y-coordinate of the standard position is y sub s. 385 00:21:57,520 --> 00:22:01,930 And now I'm going to rotate the object three times. 386 00:22:01,930 --> 00:22:05,280 Once to form the a picture, once to form the b picture, 387 00:22:05,280 --> 00:22:07,110 and once to form the c picture. 388 00:22:07,110 --> 00:22:09,960 And you can make those choices. 389 00:22:09,960 --> 00:22:12,330 Those can be anything, right? 390 00:22:12,330 --> 00:22:18,540 So let's say that the a picture is out here. 391 00:22:18,540 --> 00:22:22,190 So that's the a picture. 392 00:22:22,190 --> 00:22:25,040 The B picture is out here. 393 00:22:25,040 --> 00:22:27,610 And the unknown is up that way. 394 00:22:32,000 --> 00:22:37,540 And so what I want to know depends on these vectors. 395 00:22:37,540 --> 00:22:41,220 We'll call that theta sub a, and this is theta sub b. 396 00:22:45,230 --> 00:22:50,430 And this one is theta sub u. 397 00:22:50,430 --> 00:23:01,950 So I would like to know how x sub a depends on x 398 00:23:01,950 --> 00:23:05,490 sub s and y sub s. 
399 00:23:05,490 --> 00:23:08,490 And I can never remember how to do that, because I can 400 00:23:08,490 --> 00:23:09,810 never remember the transformation 401 00:23:09,810 --> 00:23:11,480 equations for rotation. 402 00:23:11,480 --> 00:23:14,880 So I have to figure it out every time. 403 00:23:14,880 --> 00:23:16,090 And this is no exception. 404 00:23:16,090 --> 00:23:18,870 So what I'm going to say is that this vector that goes out 405 00:23:18,870 --> 00:23:21,900 to S consists of two pieces. 406 00:23:21,900 --> 00:23:25,570 There's the x part and the y part. 407 00:23:25,570 --> 00:23:30,390 And I know that I can rotate this vector by theta sub a by 408 00:23:30,390 --> 00:23:33,130 rotating this vector and rotating that vector and 409 00:23:33,130 --> 00:23:35,370 adding up the results. 410 00:23:35,370 --> 00:23:39,930 So if I rotate this vector by theta sub a, then the 411 00:23:39,930 --> 00:23:46,200 contribution of that to the x-coordinate of a is going to 412 00:23:46,200 --> 00:23:52,790 be given by the cosine of theta sub a 413 00:23:52,790 --> 00:23:56,530 multiplied by x sub s. 414 00:23:56,530 --> 00:23:59,360 So you can just exaggerate that motion, say, well if I 415 00:23:59,360 --> 00:24:03,220 pitch it up that way then the projection down on the x-axis 416 00:24:03,220 --> 00:24:06,520 is going to be this length of the vector times the 417 00:24:06,520 --> 00:24:07,770 cosine of the angle. 418 00:24:10,410 --> 00:24:15,930 Now there's also going to be a dependence on y sub s. 419 00:24:15,930 --> 00:24:17,820 Let's figure out what that's going to be. 420 00:24:17,820 --> 00:24:19,065 I've got this vector here. 421 00:24:19,065 --> 00:24:22,220 And I'm going to rotate it by theta sub a as well. 422 00:24:22,220 --> 00:24:25,250 If I rotate that by theta sub a and see what the projection 423 00:24:25,250 --> 00:24:28,020 is on the x-axis, that's going to be given by 424 00:24:28,020 --> 00:24:30,570 the sine of the angle. 
425 00:24:30,570 --> 00:24:34,080 But it's going the wrong way, so I have to subtract it off. 426 00:24:34,080 --> 00:24:36,825 So that's how I don't have to remember what the signs are on 427 00:24:36,825 --> 00:24:38,075 these equations. 428 00:24:44,520 --> 00:24:45,450 Well, that was good. 429 00:24:45,450 --> 00:24:47,940 And now that I'm off and running I can 430 00:24:47,940 --> 00:24:48,710 do what I did before. 431 00:24:48,710 --> 00:24:50,280 It makes it easy to give the lecture. 432 00:24:50,280 --> 00:24:54,120 Because this is going to be x sub b is equal to x sub s 433 00:24:54,120 --> 00:25:00,690 times the cosine of theta sub b minus y sub s times the 434 00:25:00,690 --> 00:25:03,240 cosine of theta-- 435 00:25:03,240 --> 00:25:05,690 oh, you're letting me make mistakes. 436 00:25:05,690 --> 00:25:06,940 Shame. 437 00:25:09,740 --> 00:25:12,050 I can generally tell by all the troubled looks. 438 00:25:12,050 --> 00:25:14,280 But there should be some shouting as well. 439 00:25:14,280 --> 00:25:18,150 That's the sine and that's the sine. 440 00:25:18,150 --> 00:25:19,780 And one more time. 441 00:25:19,780 --> 00:25:26,890 x sub u is equal to x sub s times the cosine of theta sub 442 00:25:26,890 --> 00:25:33,830 u minus y sub s times the sine of theta sub u. 443 00:25:33,830 --> 00:25:36,670 And I forgot the b up there. 444 00:25:36,670 --> 00:25:37,880 So there are some equations. 445 00:25:37,880 --> 00:25:39,710 And we don't know what we're doing. 446 00:25:39,710 --> 00:25:41,610 We're just going to stare at them awhile and see if they 447 00:25:41,610 --> 00:25:42,860 sing us a song. 448 00:25:45,200 --> 00:25:48,480 So let's see if they sing us a song. 449 00:25:48,480 --> 00:25:54,350 What about x sub a and x sub b? 450 00:25:54,350 --> 00:25:57,160 These are things that we see in the image. 451 00:25:57,160 --> 00:25:58,520 These are things that we can measure. 
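The blackboard argument, rotating the x part and the y part separately and adding their projections onto the x-axis, can be written as a small function. This is a sketch that assumes counterclockwise rotation and angles in radians; the name `image_x` is hypothetical.

```python
import math


def image_x(x_s, y_s, theta):
    """Image x-coordinate after rotating the standard-position point by theta."""
    # The x piece of the vector keeps cos(theta) of its length on the x-axis.
    from_x_part = x_s * math.cos(theta)
    # The y piece swings toward negative x, so its projection is subtracted.
    from_y_part = -y_s * math.sin(theta)
    return from_x_part + from_y_part
```

Calling it with `theta_a`, `theta_b`, and `theta_u` gives `x_a`, `x_b`, and `x_u` for the same standard-position point.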
452 00:26:10,850 --> 00:26:14,100 What about all those cosines and sines of theta 453 00:26:14,100 --> 00:26:16,000 a's and theta b's? 454 00:26:16,000 --> 00:26:18,010 Well, we have no idea what they are. 455 00:26:18,010 --> 00:26:20,420 But one thing is clear. 456 00:26:20,420 --> 00:26:25,260 They're true for all of the points on the object. 457 00:26:25,260 --> 00:26:28,740 Because when we rotate the object around by angle theta, 458 00:26:28,740 --> 00:26:31,510 we're rotating all of the points through 459 00:26:31,510 --> 00:26:33,790 the same angle, right? 460 00:26:33,790 --> 00:26:35,250 So with respect to any 461 00:26:35,250 --> 00:26:38,790 particular view of the object-- 462 00:26:38,790 --> 00:26:41,810 here we are in the standard position. 463 00:26:41,810 --> 00:26:44,590 Here we are in position a. 464 00:26:44,590 --> 00:26:46,830 The vectors to all of the points on the object are 465 00:26:46,830 --> 00:26:50,510 rotated by the same angle when we go from the standard 466 00:26:50,510 --> 00:26:53,240 position to the a position. 467 00:26:53,240 --> 00:27:02,100 So that means that for all of the images in this particular 468 00:27:02,100 --> 00:27:06,940 rendering, with a particular rotation by theta a, theta b, 469 00:27:06,940 --> 00:27:09,465 and theta u, those are constants. 470 00:27:15,820 --> 00:27:18,125 Now remember this is for a particular theta a, a 471 00:27:18,125 --> 00:27:21,140 particular theta b, and a particular theta u. 472 00:27:21,140 --> 00:27:23,850 As long as we're talking about all of the points for each of 473 00:27:23,850 --> 00:27:28,920 those rotations, those angles and cosines are going to be 474 00:27:28,920 --> 00:27:35,090 the same for all possible points on the object. 475 00:27:37,780 --> 00:27:38,110 OK.
476 00:27:38,110 --> 00:27:41,790 So now we go back to our high school algebra experts and we 477 00:27:41,790 --> 00:27:49,880 say, look at these first two equations. We've got two 478 00:27:49,880 --> 00:27:55,540 equations and what we can now construe to be two unknowns. 479 00:27:55,540 --> 00:27:57,210 What are the unknowns that are left? 480 00:27:57,210 --> 00:27:58,990 We can measure a and b. 481 00:27:58,990 --> 00:28:00,700 Whatever the cosines are, they're the 482 00:28:00,700 --> 00:28:02,660 same for all the pictures. 483 00:28:02,660 --> 00:28:05,580 So if we treat those as constants, then we can solve 484 00:28:05,580 --> 00:28:08,770 for x sub s and y sub s. 485 00:28:08,770 --> 00:28:10,490 Right? 486 00:28:10,490 --> 00:28:14,860 We can solve for x sub s and y sub s in terms of x sub a and 487 00:28:14,860 --> 00:28:20,190 x sub b and a whole bunch of constants. 488 00:28:20,190 --> 00:28:27,220 But, I don't know, a whole bunch of constants, let's see. 489 00:28:27,220 --> 00:28:30,640 We can gather up all of those cosines and ratios of sines 490 00:28:30,640 --> 00:28:34,350 and cosines and all that stuff and put them all together. 491 00:28:34,350 --> 00:28:36,130 Because they're all constants. 492 00:28:36,130 --> 00:28:38,320 And then we can do this. 493 00:28:38,320 --> 00:28:48,060 We can say x sub u is equal to-- 494 00:28:48,060 --> 00:28:54,010 well, it's going to depend on x sub a and x sub b. 495 00:28:54,010 --> 00:28:58,030 And by the time we wash or manipulate or screw around 496 00:28:58,030 --> 00:29:03,220 with all those cosines, we can say that the multiplier for x 497 00:29:03,220 --> 00:29:07,670 sub a is some constant alpha and the multiplier for x sub b 498 00:29:07,670 --> 00:29:09,910 is some constant beta. 499 00:29:09,910 --> 00:29:11,500 So that's not a sleight of hand. 500 00:29:11,500 --> 00:29:12,710 That's just linear 501 00:29:12,710 --> 00:29:15,300 manipulation of those equations.
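The linear manipulation just described can be checked numerically. What follows is a minimal sketch, not anything shown in the lecture: the function names `project_x`, `solve_alpha_beta`, and `fit_alpha_beta` are made up for illustration, and the angles and points are arbitrary. It assumes the rotation-only, orthographic setup above, where x sub a equals x sub s times cosine theta sub a minus y sub s times sine theta sub a.

```python
import math


def project_x(point, theta):
    """x-coordinate of a point after rotating it by theta (radians) about the origin."""
    x_s, y_s = point
    return x_s * math.cos(theta) - y_s * math.sin(theta)


def solve_alpha_beta(theta_a, theta_b, theta_u):
    """Return (alpha, beta) with x_u = alpha * x_a + beta * x_b for every point.

    Matching the coefficients of x_s and y_s gives a 2x2 linear system:
        alpha * cos(theta_a) + beta * cos(theta_b) = cos(theta_u)
        alpha * sin(theta_a) + beta * sin(theta_b) = sin(theta_u)
    which is solved here with Cramer's rule.
    """
    det = math.sin(theta_b - theta_a)
    if abs(det) < 1e-12:
        raise ValueError("views a and b must not be the same view or 180 degrees apart")
    alpha = math.sin(theta_b - theta_u) / det
    beta = math.sin(theta_u - theta_a) / det
    return alpha, beta


def fit_alpha_beta(pairs, targets):
    """Recover (alpha, beta) from two corresponding points, without knowing any angle.

    pairs   -- [(x_a, x_b), (x_a, x_b)] measured in the two stored views
    targets -- [x_u, x_u] measured for the same two points in the unknown view
    """
    (xa1, xb1), (xa2, xb2) = pairs
    xu1, xu2 = targets
    det = xa1 * xb2 - xb1 * xa2
    if abs(det) < 1e-12:
        raise ValueError("the two points do not determine alpha and beta")
    alpha = (xu1 * xb2 - xb1 * xu2) / det
    beta = (xa1 * xu2 - xu1 * xa2) / det
    return alpha, beta


if __name__ == "__main__":
    theta_a, theta_b, theta_u = (math.radians(d) for d in (20, 65, 140))
    alpha, beta = solve_alpha_beta(theta_a, theta_b, theta_u)
    for point in [(1.0, 0.0), (0.3, -2.0), (-4.0, 1.5)]:
        predicted = alpha * project_x(point, theta_a) + beta * project_x(point, theta_b)
        print(point, predicted, project_x(point, theta_u))
```

In the recognition setting the angles are not known, so `fit_alpha_beta` is the closer analogue: alpha and beta are recovered from a couple of corresponding points and then used to predict where every other point should appear in the unknown view.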
502 00:29:15,300 --> 00:29:17,940 And that's what we wanted to show, that for orthographic 503 00:29:17,940 --> 00:29:21,130 projection, which this is-- there is no perspective 504 00:29:21,130 --> 00:29:23,530 involved here, we're just taking the projection along 505 00:29:23,530 --> 00:29:24,780 the x-axis-- 506 00:29:26,480 --> 00:29:30,060 we can demonstrate for this simplified situation that that 507 00:29:30,060 --> 00:29:31,310 equation must hold. 508 00:29:33,880 --> 00:29:35,310 Now I want to give you a few puzzles. 509 00:29:35,310 --> 00:29:36,730 Because this stuff is so simple. 510 00:29:36,730 --> 00:29:41,020 Suppose I allow translation as well as rotation. 511 00:29:41,020 --> 00:29:42,696 What's going to happen? 512 00:29:42,696 --> 00:29:44,094 STUDENT: You just get the tau. 513 00:29:44,094 --> 00:29:44,560 Basically, you get a constant. 514 00:29:44,560 --> 00:29:46,180 PATRICK WINSTON: Yeah, you add a constant, tau. 515 00:29:46,180 --> 00:29:47,760 But what do we need to do in order to solve it? 516 00:29:47,760 --> 00:29:49,221 STUDENT: Subtract them [INAUDIBLE]. 517 00:29:49,221 --> 00:29:52,630 You subtract two equations and then [INAUDIBLE]. 518 00:29:52,630 --> 00:29:54,950 PATRICK WINSTON: Let's see, now we've got 519 00:29:54,950 --> 00:29:56,206 three unknowns, right? 520 00:29:56,206 --> 00:29:56,985 I don't know tau. 521 00:29:56,985 --> 00:29:58,216 I don't know x sub s. 522 00:29:58,216 --> 00:30:00,910 And I don't know y sub s. 523 00:30:00,910 --> 00:30:02,650 So I need another equation. 524 00:30:02,650 --> 00:30:04,534 Where do I get the other equation? 525 00:30:04,534 --> 00:30:05,430 STUDENT: [INAUDIBLE]. 526 00:30:05,430 --> 00:30:06,680 PATRICK WINSTON: From another picture. 527 00:30:09,910 --> 00:30:14,150 That's why up there I needed four points. 528 00:30:14,150 --> 00:30:17,690 That covers a situation where I've got three degrees of 529 00:30:17,690 --> 00:30:20,080 rotation and translation.
530 00:30:20,080 --> 00:30:25,360 Here I got by with just two pictures in this illustration. 531 00:30:25,360 --> 00:30:27,840 That one involved a tau translational element, so I 532 00:30:27,840 --> 00:30:28,720 needed three pictures. 533 00:30:28,720 --> 00:30:32,700 And this one's got full rotation, so I needed four. 534 00:30:32,700 --> 00:30:40,226 So great idea, works fine. 535 00:30:40,226 --> 00:30:45,410 The trouble is it doesn't work so fine on natural objects. 536 00:30:45,410 --> 00:30:48,630 It works fine on things that are manufactured because they 537 00:30:48,630 --> 00:30:51,250 all have identical dimensions. 538 00:30:51,250 --> 00:30:55,420 So if I made a million of these in a factory, I'd have 539 00:30:55,420 --> 00:30:56,950 no trouble recognizing them. 540 00:30:56,950 --> 00:31:02,090 Because all I'd have to do is take three pictures, record 541 00:31:02,090 --> 00:31:04,840 the coordinates of some of the points, and I'd be done. 542 00:31:04,840 --> 00:31:07,245 The trouble is the natural world isn't like this. 543 00:31:10,410 --> 00:31:13,080 And you aren't like this either. 544 00:31:16,100 --> 00:31:21,020 I don't know, if I'm trying to recognize faces, it's not that 545 00:31:21,020 --> 00:31:23,700 easy to do all this. 546 00:31:23,700 --> 00:31:27,380 First of all, it's a little difficult to find the exact 547 00:31:27,380 --> 00:31:30,390 point, the exactly corresponding points. 548 00:31:30,390 --> 00:31:32,950 I made a mistake in doing it myself. 549 00:31:32,950 --> 00:31:35,230 And if the computer made a mistake it would certainly 550 00:31:35,230 --> 00:31:36,060 make an error. 551 00:31:36,060 --> 00:31:39,440 Because it would be using non-corresponding points to 552 00:31:39,440 --> 00:31:40,120 make the prediction. 553 00:31:40,120 --> 00:31:42,656 So it would be way off. 
554 00:31:42,656 --> 00:31:47,070 But this is still in the tradition of working from 555 00:31:47,070 --> 00:31:51,950 local features in the objects toward recognition. 556 00:31:51,950 --> 00:31:58,790 So having looked at that theory, we also find it a 557 00:31:58,790 --> 00:31:59,350 little wanting. 558 00:31:59,350 --> 00:32:02,280 It works great in some circumstances, doesn't seem to 559 00:32:02,280 --> 00:32:03,790 solve the whole recognition problem. 560 00:32:07,190 --> 00:32:09,590 Years pass. 561 00:32:09,590 --> 00:32:13,600 Shimon Ullman comes up with another theory that's not so 562 00:32:13,600 --> 00:32:20,310 much based on edge fragments or the location of particular 563 00:32:20,310 --> 00:32:27,570 features but rather on correlation. 564 00:32:27,570 --> 00:32:32,930 Taking a picture of, say, Krishna's face, taking a 565 00:32:32,930 --> 00:32:37,120 picture of the whole class, and then using that as a kind 566 00:32:37,120 --> 00:32:40,680 of correlation mask, running it all over the picture of the 567 00:32:40,680 --> 00:32:43,050 class, seeing where it maximizes out. 568 00:32:43,050 --> 00:32:43,600 Now that's vague. 569 00:32:43,600 --> 00:32:45,480 I'll explain when I'm talking about [INAUDIBLE] 570 00:32:45,480 --> 00:32:47,610 correlation in a minute. 571 00:32:47,610 --> 00:32:53,400 But it's basically saying, if I have a picture of Krishna, 572 00:32:53,400 --> 00:32:54,390 where do I find him? 573 00:32:54,390 --> 00:32:55,902 I'll find him in one place. 574 00:32:55,902 --> 00:32:57,750 But you know what? 575 00:32:57,750 --> 00:33:00,410 Krishna doesn't look like anybody else. 576 00:33:00,410 --> 00:33:02,810 So I might not find any other faces. 577 00:33:02,810 --> 00:33:06,840 And if my objective is to find all the faces, then maybe that 578 00:33:06,840 --> 00:33:09,150 idea won't work either. 579 00:33:09,150 --> 00:33:13,590 Or, to take another example, here's a dollar bill.
580 00:33:13,590 --> 00:33:18,130 We haven't had raises in quite a while, so this is my last one. 581 00:33:18,130 --> 00:33:20,950 It's got a picture of George Washington on it. 582 00:33:20,950 --> 00:33:22,740 And I can look all over the class. 583 00:33:22,740 --> 00:33:26,630 And if I use this as a face detector, I'd be sorely 584 00:33:26,630 --> 00:33:27,100 disappointed. 585 00:33:27,100 --> 00:33:29,430 Because I wouldn't find any faces. 586 00:33:29,430 --> 00:33:32,350 Because thank God, nobody looks exactly like George 587 00:33:32,350 --> 00:33:32,890 Washington. 588 00:33:32,890 --> 00:33:36,700 So the correlation wouldn't work very well. 589 00:33:36,700 --> 00:33:37,950 So that idea's a loser. 590 00:33:41,580 --> 00:33:42,290 But wait a minute. 591 00:33:42,290 --> 00:33:45,250 I don't have to look for the whole face. 592 00:33:45,250 --> 00:33:50,790 I could just look for eyes. 593 00:33:50,790 --> 00:33:53,770 And then I could look for noses and maybe mouths. 594 00:33:53,770 --> 00:33:57,080 And maybe I could have a library of 10 different eyes 595 00:33:57,080 --> 00:34:01,280 and 10 different noses and 10 different mouths. 596 00:34:01,280 --> 00:34:02,540 Would that idea work? 597 00:34:06,100 --> 00:34:07,440 Probably not so well. 598 00:34:07,440 --> 00:34:09,420 The trouble with that one is, I'd find 599 00:34:09,420 --> 00:34:11,676 eyeballs in every doorknob. 600 00:34:11,676 --> 00:34:17,960 There's just not enough stuff there to give me a reliable 601 00:34:17,960 --> 00:34:19,210 correlation. 602 00:34:20,920 --> 00:34:23,989 So let's make this a little more concrete by 603 00:34:23,989 --> 00:34:25,239 drawing some pictures. 604 00:34:29,770 --> 00:34:32,880 Halloween is approaching. 605 00:34:32,880 --> 00:34:35,174 So here's a face. 606 00:34:42,387 --> 00:34:44,375 All right? 607 00:34:44,375 --> 00:34:45,866 Here's another face.
608 00:34:55,840 --> 00:34:59,160 So those might be faces in my pre-recorded library of 609 00:34:59,160 --> 00:35:01,410 pumpkin faces. 610 00:35:01,410 --> 00:35:02,660 Now along comes this face. 611 00:35:13,690 --> 00:35:16,270 What's going to happen? 612 00:35:16,270 --> 00:35:18,490 Well, I don't know. 613 00:35:18,490 --> 00:35:20,200 Let's draw yet another face. 614 00:35:32,020 --> 00:35:33,440 I don't know, that could be a pretty weird 615 00:35:33,440 --> 00:35:34,460 pumpkin face, I suppose. 616 00:35:34,460 --> 00:35:37,000 But I mean it to be something that doesn't look very much 617 00:35:37,000 --> 00:35:39,460 like a face. 618 00:35:39,460 --> 00:35:44,380 So if I'm doing a complete correlation with either of 619 00:35:44,380 --> 00:35:47,280 these faces in my library, neither one will match this 620 00:35:47,280 --> 00:35:48,530 one very well. 621 00:35:51,150 --> 00:35:55,800 If I'm looking for fine features like eyes, then I've 622 00:35:55,800 --> 00:36:01,300 got these eyes everywhere. 623 00:36:01,300 --> 00:36:04,190 So it doesn't help very much. 624 00:36:04,190 --> 00:36:05,380 So you can see where I'm going. 625 00:36:05,380 --> 00:36:10,030 And you can reinvent Ullman's great idea. 626 00:36:10,030 --> 00:36:11,960 What is it? 627 00:36:11,960 --> 00:36:15,200 We don't look for big features, like whole faces. 628 00:36:15,200 --> 00:36:16,970 We don't look for small features, 629 00:36:16,970 --> 00:36:18,846 like individual eyes. 630 00:36:18,846 --> 00:36:22,180 We look for intermediate features, like two eyes and a 631 00:36:22,180 --> 00:36:25,040 nose, or a mouth and a nose. 632 00:36:25,040 --> 00:36:34,310 So when we do that, then we can say, now, here are two 633 00:36:34,310 --> 00:36:37,120 eyes and a nose. 634 00:36:37,120 --> 00:36:38,520 Well, that's found in this one. 635 00:36:42,370 --> 00:36:48,460 And what about the combination of that nose and that mouth? 636 00:36:48,460 --> 00:36:51,051 Oh, that's over here. 
637 00:36:51,051 --> 00:36:53,030 But neither of those features can be found 638 00:36:53,030 --> 00:36:56,800 in the fourth image. 639 00:36:56,800 --> 00:36:59,410 So that's the Goldilocks principle. 640 00:36:59,410 --> 00:37:00,945 When you're doing this sort of thing, you want things that 641 00:37:00,945 --> 00:37:03,500 are not too small and not too big. 642 00:37:03,500 --> 00:37:07,740 I've got the Rumpelstiltskin principle up 643 00:37:07,740 --> 00:37:08,970 there, too, by the way. 644 00:37:08,970 --> 00:37:12,020 Because I meant to mention that Marr was a genius at 645 00:37:12,020 --> 00:37:13,830 naming things. 646 00:37:13,830 --> 00:37:18,650 And even though many of his theories have faded, he's 647 00:37:18,650 --> 00:37:21,520 still known for these names like primal sketch and two and 648 00:37:21,520 --> 00:37:23,030 a half D sketch because he was such an artist 649 00:37:23,030 --> 00:37:24,610 at naming the concepts. 650 00:37:24,610 --> 00:37:27,900 He even got credit for a lot of stuff that he didn't do. 651 00:37:27,900 --> 00:37:33,440 Not because he was deliberately trying to get it 652 00:37:33,440 --> 00:37:35,240 inappropriately, but just because he was so good at 653 00:37:35,240 --> 00:37:36,490 naming stuff. 654 00:37:36,490 --> 00:37:38,450 So we had the Rumpelstiltskin principle back then. 655 00:37:38,450 --> 00:37:40,050 And now we have the Goldilocks principle. 656 00:37:40,050 --> 00:37:43,535 Not too big, not too small. 657 00:37:43,535 --> 00:37:48,150 But that leaves us with the final question, which is, so 658 00:37:48,150 --> 00:37:51,230 if what we want to do is look for intermediate-size 659 00:37:51,230 --> 00:37:54,410 features, how do we actually find them in a sea 660 00:37:54,410 --> 00:37:55,770 of faces out there? 661 00:37:55,770 --> 00:37:58,400 See, I might have a library, I might take 10 of you and 662 00:37:58,400 --> 00:38:01,050 record your eyes. 
663 00:38:01,050 --> 00:38:03,530 Take another ten, record your mouths. 664 00:38:03,530 --> 00:38:06,400 And they may be put together in a unique way for each of 665 00:38:06,400 --> 00:38:06,870 you out there. 666 00:38:06,870 --> 00:38:10,390 But it's likely that I'll find Lana's eyes 667 00:38:10,390 --> 00:38:12,430 somewhere else in a crowd. 668 00:38:12,430 --> 00:38:16,850 And Nicola's mouth somewhere else in a crowd. 669 00:38:16,850 --> 00:38:21,330 So how do we in fact go about finding them? 670 00:38:21,330 --> 00:38:23,080 And I mentioned the term correlation a 671 00:38:23,080 --> 00:38:24,720 couple of times now. 672 00:38:24,720 --> 00:38:26,500 Let me make that concrete. 673 00:38:31,270 --> 00:38:38,790 So let's consider a one-dimensional face that 674 00:38:38,790 --> 00:38:40,040 looks like this. 675 00:38:47,950 --> 00:38:50,810 Which is a signal. 676 00:38:50,810 --> 00:38:53,640 And I'm going to consider a one-dimensional image. 677 00:38:56,160 --> 00:39:04,390 And in that one-dimensional image I've got a 678 00:39:04,390 --> 00:39:06,670 facsimile of the face. 679 00:39:06,670 --> 00:39:08,850 And the question is, what kind of algorithm could I use to 680 00:39:08,850 --> 00:39:14,030 determine the offset in the image where the face occurs? 681 00:39:14,030 --> 00:39:17,320 So you can see that one possibility is you just do an 682 00:39:17,320 --> 00:39:25,270 integral of the signal in the face and the signal out here 683 00:39:25,270 --> 00:39:29,610 over the extent of the face and see how it multiplies out. 684 00:39:29,610 --> 00:39:34,920 Or, to make it less lawyerly and more MITish, let's say 685 00:39:34,920 --> 00:39:41,310 that what we're going to do is we're going to maximize over 686 00:39:41,310 --> 00:39:52,190 some parameter t the integral over x of some face, which is 687 00:39:52,190 --> 00:40:04,220 a function of x and the image g, which is a function of x 688 00:40:04,220 --> 00:40:07,830 minus that offset.
689 00:40:07,830 --> 00:40:14,200 So when the offset, t, is equal to this offset, then 690 00:40:14,200 --> 00:40:17,350 we're essentially multiplying the thing by itself and 691 00:40:17,350 --> 00:40:19,890 integrating over the extent of the face. 692 00:40:19,890 --> 00:40:24,610 And that gives you a very big number if they're lined up and 693 00:40:24,610 --> 00:40:27,420 a very small number if they're not. 694 00:40:27,420 --> 00:40:32,370 And it's even true if we add a whole lot of 695 00:40:32,370 --> 00:40:37,130 noise to the images. 696 00:40:37,130 --> 00:40:38,660 But these are images. 697 00:40:38,660 --> 00:40:39,595 They're not one dimensional. 698 00:40:39,595 --> 00:40:41,210 But that's OK. 699 00:40:41,210 --> 00:40:44,215 It's easy enough to make a modification here. 700 00:40:44,215 --> 00:40:46,980 We're going to maximize over translation 701 00:40:46,980 --> 00:40:49,140 parameters x and y. 702 00:40:49,140 --> 00:40:51,970 And these are no longer functions of just x, they're 703 00:40:51,970 --> 00:40:53,220 also functions of y. 704 00:40:56,900 --> 00:40:59,750 Like so. 705 00:40:59,750 --> 00:41:01,380 So that's basically how it works. 706 00:41:01,380 --> 00:41:03,690 We won't go into details about normalization and all that 707 00:41:03,690 --> 00:41:06,480 sort of thing because that's the stuff of which other 708 00:41:06,480 --> 00:41:08,825 courses remain the custodians. 709 00:41:11,340 --> 00:41:13,412 So would you like to see a demonstration? 710 00:41:13,412 --> 00:41:14,662 OK. 711 00:41:36,410 --> 00:41:37,220 All right. 712 00:41:37,220 --> 00:41:42,000 So without realizing it, Nicola and Erica have loaned 713 00:41:42,000 --> 00:41:44,080 us their pictures. 714 00:41:44,080 --> 00:41:49,490 And they are embedded in that big field of noise. 715 00:41:49,490 --> 00:41:52,170 And it's pretty easy to pick out Erica and Nicola, right? 
716 00:41:52,170 --> 00:41:57,120 Because we are actually pretty good at picking faces out of 717 00:41:57,120 --> 00:41:58,740 these images. 718 00:41:58,740 --> 00:42:01,220 So let's add some noise. 719 00:42:05,640 --> 00:42:08,160 It's a little harder now. 720 00:42:08,160 --> 00:42:10,100 What I'm going to do is I'm going to run this correlation 721 00:42:10,100 --> 00:42:18,000 program over this whole image using Nicola's face as a mask 722 00:42:18,000 --> 00:42:20,670 and seeing where the correlation peaks up, in spite 723 00:42:20,670 --> 00:42:21,920 of all the noise that's in there. 724 00:42:28,290 --> 00:42:29,540 Boom, there he is. 725 00:42:32,780 --> 00:42:34,820 I don't know, maybe we can find Erica too. 726 00:42:37,370 --> 00:42:40,110 I forgot where she was. 727 00:42:40,110 --> 00:42:41,360 I can't find her. 728 00:42:44,740 --> 00:42:47,670 There she is. 729 00:42:47,670 --> 00:42:50,490 Unfortunately the parameters aren't very good here. 730 00:42:50,490 --> 00:42:52,890 Do you see that? 731 00:42:52,890 --> 00:42:55,550 Let me get another version of this. 732 00:42:55,550 --> 00:42:59,520 I'll just do some real-time programming. 733 00:43:08,780 --> 00:43:13,210 I've been trying to reset the parameters so that the images 734 00:43:13,210 --> 00:43:17,070 in the demonstration come out clearly up there. 735 00:43:17,070 --> 00:43:19,680 Let's see if this works a little better. 736 00:43:19,680 --> 00:43:20,990 OK, so let's add some noise. 737 00:43:23,860 --> 00:43:25,630 And let's find Erica. 738 00:43:28,750 --> 00:43:30,000 There she is. 739 00:43:32,340 --> 00:43:33,290 There are some other things that look a 740 00:43:33,290 --> 00:43:34,550 little bit like Erica. 741 00:43:34,550 --> 00:43:36,800 But nothing looks quite exactly like Erica. 742 00:43:39,450 --> 00:43:42,245 So let's try Nicola's eyes. 743 00:43:46,090 --> 00:43:48,070 So they stand out pretty clearly against the 744 00:43:48,070 --> 00:43:49,630 background.
745 00:43:49,630 --> 00:43:51,280 Let's see if we can find Erica's eyes. 746 00:43:54,580 --> 00:43:56,150 So they stand out pretty clearly against the 747 00:43:56,150 --> 00:43:56,440 background. 748 00:43:56,440 --> 00:43:59,300 Notice that it also gets Nicola's eyes. 749 00:43:59,300 --> 00:44:04,840 So two eyes is an intermediate-size constraint. 750 00:44:04,840 --> 00:44:08,780 It's loose enough that it will match more than one person. 751 00:44:08,780 --> 00:44:12,130 But it's not so loose that it's as bad as 752 00:44:12,130 --> 00:44:15,050 looking for one eye. 753 00:44:15,050 --> 00:44:17,490 See, they're all over the place. 754 00:44:17,490 --> 00:44:21,640 So two eyes and a nose, a mouth and a nose, that would 755 00:44:21,640 --> 00:44:23,870 be even better as an intermediate feature. 756 00:44:23,870 --> 00:44:25,875 But it doesn't matter what the best ones are, because you can 757 00:44:25,875 --> 00:44:28,620 work that out experimentally. 758 00:44:28,620 --> 00:44:30,660 So that's how correlation works. 759 00:44:30,660 --> 00:44:34,690 And it's just amazing how much noise you can add and it'll 760 00:44:34,690 --> 00:44:36,500 still pick out the right stuff. 761 00:44:39,160 --> 00:44:39,780 There's Nicola. 762 00:44:39,780 --> 00:44:41,080 Boom. 763 00:44:41,080 --> 00:44:43,290 Very clear. 764 00:44:43,290 --> 00:44:46,790 Want to add some more noise? 765 00:44:46,790 --> 00:44:49,640 I don't know, I can see it, but that's because I'm a 766 00:44:49,640 --> 00:44:50,910 pretty good correlator, too. 767 00:44:55,300 --> 00:44:56,500 Boom. 768 00:44:56,500 --> 00:44:57,930 I don't know, let's add some more noise. 769 00:45:04,420 --> 00:45:06,480 It's just hard to get rid of it. 770 00:45:06,480 --> 00:45:09,650 It's just amazing how well it picks it out. 771 00:45:09,650 --> 00:45:10,400 That's good. 772 00:45:10,400 --> 00:45:12,640 That's cool. 
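The one-dimensional correlation described before the demonstration can be sketched in a few lines. This is an illustrative, unnormalized version under stated assumptions, not the demo program: the function name `correlate_1d` and the toy signals are made up, the integral becomes a discrete sum, and the offset convention is that `scores[t]` multiplies the face against the stretch of image starting at position `t`.

```python
def correlate_1d(face, image):
    """Slide `face` along `image` and score each offset by a sum of products.

    Returns (best_offset, scores), where scores[t] is the discrete stand-in
    for the integral of face times image with the face placed at offset t.
    No normalization is applied, so very bright image regions can also
    score highly; the lecture leaves that refinement to other courses.
    """
    n = len(face)
    if n == 0 or len(image) < n:
        raise ValueError("image must be at least as long as a non-empty face")
    scores = [
        sum(f * image[t + i] for i, f in enumerate(face))
        for t in range(len(image) - n + 1)
    ]
    best_offset = max(range(len(scores)), key=scores.__getitem__)
    return best_offset, scores


if __name__ == "__main__":
    face = [1.0, -2.0, 3.0, -1.0, 2.0]          # toy one-dimensional "face"
    image = [0.1 * (-1) ** i for i in range(20)]  # small deterministic "noise"
    true_offset = 7
    for i, f in enumerate(face):                # embed a copy of the face
        image[true_offset + i] += f
    best, scores = correlate_1d(face, image)
    print("best offset:", best)
    print("peak score:", scores[best])
```

The two-dimensional case follows the same pattern with a double sum over x and y and a maximization over both translation offsets. An intermediate-size feature, such as two eyes and a nose, would simply be a smaller two-dimensional `face` patch.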
773 00:45:12,640 --> 00:45:16,690 Now, but the reason that this is 30 years and we're still 774 00:45:16,690 --> 00:45:19,340 not done is there are still some questions. 775 00:45:19,340 --> 00:45:22,730 This is recognizing stuff straight on. 776 00:45:22,730 --> 00:45:25,260 How is it I can recognize you in the hall from the side? 777 00:45:25,260 --> 00:45:27,600 Nobody knows. 778 00:45:27,600 --> 00:45:31,830 One possibility is that you have an ability to make those 779 00:45:31,830 --> 00:45:32,630 transformations. 780 00:45:32,630 --> 00:45:37,410 If so, then that alignment theory has a role to play. 781 00:45:37,410 --> 00:45:41,360 Another theory is that, well, after I've seen you once I can 782 00:45:41,360 --> 00:45:44,680 watch you turn your head and keep recording what you look 783 00:45:44,680 --> 00:45:47,220 like at all possible angles. 784 00:45:47,220 --> 00:45:48,480 That would work. 785 00:45:48,480 --> 00:45:52,150 The trouble is, is there enough stuff in there? 786 00:45:52,150 --> 00:45:52,650 Maybe. 787 00:45:52,650 --> 00:45:53,900 We don't know. 788 00:45:55,770 --> 00:45:59,040 Now what would it take to break this mechanism? 789 00:45:59,040 --> 00:45:59,910 Well, I don't know. 790 00:45:59,910 --> 00:46:01,200 Let's just see if we can break the mechanism. 791 00:46:08,600 --> 00:46:11,815 Let's see if you can recognize some well-known faces. 792 00:46:15,820 --> 00:46:16,400 Who's that? 793 00:46:16,400 --> 00:46:17,430 Quick. 794 00:46:17,430 --> 00:46:18,540 STUDENT: Obama. 795 00:46:18,540 --> 00:46:21,290 PATRICK WINSTON: Oh, that's too easy. 796 00:46:21,290 --> 00:46:23,110 We'll see if we can make some harder ones. 797 00:46:25,930 --> 00:46:26,955 Yeah, that's Obama. 798 00:46:26,955 --> 00:46:28,280 Who's this? 799 00:46:28,280 --> 00:46:29,600 STUDENT: Bush. 800 00:46:29,600 --> 00:46:29,940 PATRICK WINSTON: Oh boy. 801 00:46:29,940 --> 00:46:31,440 You're really good at this. 802 00:46:31,440 --> 00:46:32,200 That's Bush. 
803 00:46:32,200 --> 00:46:32,960 How about this guy? 804 00:46:32,960 --> 00:46:34,210 STUDENT: Kerry. 805 00:46:38,680 --> 00:46:39,000 PATRICK WINSTON: OK. 806 00:46:39,000 --> 00:46:39,780 Now I've got it. 807 00:46:39,780 --> 00:46:41,220 Some people are starting to turn their heads. 808 00:46:41,220 --> 00:46:42,722 And that's not fair. 809 00:46:42,722 --> 00:46:44,200 [LAUGHTER] 810 00:46:44,200 --> 00:46:46,030 PATRICK WINSTON: That's not fair. 811 00:46:46,030 --> 00:46:49,070 Because you see what's happened is that if this kind 812 00:46:49,070 --> 00:46:52,500 of pumpkin theory is correct, then when you turn 813 00:46:52,500 --> 00:46:56,020 the face upside down you lose the correlation of those 814 00:46:56,020 --> 00:46:58,740 features that have vertical components. 815 00:46:58,740 --> 00:47:01,570 So if you have two eyes and a nose, they won't match two 816 00:47:01,570 --> 00:47:04,890 eyes and a nose when they're turned upside down. 817 00:47:04,890 --> 00:47:05,590 Well, let's see. 818 00:47:05,590 --> 00:47:08,470 We'll try some more. 819 00:47:08,470 --> 00:47:10,514 Who's that? 820 00:47:10,514 --> 00:47:11,370 STUDENT: Gorbachev. 821 00:47:11,370 --> 00:47:11,800 PATRICK WINSTON: Gorbachev. 822 00:47:11,800 --> 00:47:13,120 Who said that? 823 00:47:13,120 --> 00:47:14,575 Leonid, where are you? 824 00:47:14,575 --> 00:47:15,430 This is Gorbachev, right? 825 00:47:15,430 --> 00:47:18,340 You can recognize him because of the little birthmark on the 826 00:47:18,340 --> 00:47:20,150 top of his head. 827 00:47:20,150 --> 00:47:21,105 One more. 828 00:47:21,105 --> 00:47:22,275 Who's-- 829 00:47:22,275 --> 00:47:23,140 oh, that's easy. 830 00:47:23,140 --> 00:47:26,050 Who is it? 831 00:47:26,050 --> 00:47:28,050 That's Clinton. 832 00:47:28,050 --> 00:47:29,300 How about this one? 833 00:47:34,520 --> 00:47:39,000 Do you see how insulting it is to be at MIT? 834 00:47:39,000 --> 00:47:40,076 That's me.
835 00:47:40,076 --> 00:47:43,480 [LAUGHTER] 836 00:47:43,480 --> 00:47:46,720 PATRICK WINSTON: And you didn't even know. 837 00:47:46,720 --> 00:47:47,970 Oh, god. 838 00:47:52,770 --> 00:47:57,700 So this might be evidence for the correlation theory. 839 00:47:57,700 --> 00:48:01,280 But of course, turning the face upside down would make it 840 00:48:01,280 --> 00:48:02,860 very difficult to do alignment, too. 841 00:48:02,860 --> 00:48:06,060 So it would break our alignment theory, as well. 842 00:48:06,060 --> 00:48:09,045 Let me get that after class. Was there a mistake, or? 843 00:48:09,045 --> 00:48:09,500 STUDENT: No, no. 844 00:48:09,500 --> 00:48:13,430 I was just curious [INAUDIBLE] stretching would break the 845 00:48:13,430 --> 00:48:14,140 correlation. 846 00:48:14,140 --> 00:48:15,461 PATRICK WINSTON: If what would break the structure? 847 00:48:15,461 --> 00:48:16,383 What? 848 00:48:16,383 --> 00:48:16,844 Stretching? 849 00:48:16,844 --> 00:48:18,094 STUDENT: [INAUDIBLE]. 850 00:48:20,120 --> 00:48:21,800 PATRICK WINSTON: Elliot asked if stretching would break the 851 00:48:21,800 --> 00:48:22,970 correlation. 852 00:48:22,970 --> 00:48:30,800 And the answer is, I think, stretching in the vertical 853 00:48:30,800 --> 00:48:33,290 dimension is worse than stretching in 854 00:48:33,290 --> 00:48:34,790 the horizontal dimension. 855 00:48:34,790 --> 00:48:36,455 Because you get a certain amount of stretching in the 856 00:48:36,455 --> 00:48:38,700 horizontal dimension when you just turn your head. 857 00:48:38,700 --> 00:48:41,450 By the way, since our faces are basically mounted on a 858 00:48:41,450 --> 00:48:45,550 cylinder, this kind of transformation 859 00:48:45,550 --> 00:48:46,890 might actually work. 860 00:48:46,890 --> 00:48:51,140 That's a sidebar to the answer to your question, Elliot. 861 00:48:51,140 --> 00:48:53,730 But now you say, well, OK, so this is not completely solved. 862 00:48:53,730 --> 00:48:55,980 You can work this out.
863 00:48:55,980 --> 00:48:59,430 But if you really want to work something out, let me tell you 864 00:48:59,430 --> 00:49:03,570 what the current questions are in computer vision. 865 00:49:03,570 --> 00:49:05,090 People have worked for an awful long time on this 866 00:49:05,090 --> 00:49:16,010 recognition stuff and, to my mind, have neglected the more 867 00:49:16,010 --> 00:49:18,900 serious questions. 868 00:49:18,900 --> 00:49:21,030 The more serious questions are, how do you visually 869 00:49:21,030 --> 00:49:24,150 determine what's happening? 870 00:49:24,150 --> 00:49:28,280 If you could write a program that would reliably determine 871 00:49:28,280 --> 00:49:31,520 when these verbs are happening in your field of view, I will 872 00:49:31,520 --> 00:49:32,970 sign your Ph.D. thesis tomorrow. 873 00:49:32,970 --> 00:49:35,680 There are 48 of them there. 874 00:49:35,680 --> 00:49:37,610 And that is today's challenge. 875 00:49:37,610 --> 00:49:40,630 But since we're short on time, I want to skip over that and 876 00:49:40,630 --> 00:49:42,800 perform an experiment on you. 877 00:49:42,800 --> 00:49:44,892 I want you to tell me what I'm doing. 878 00:49:44,892 --> 00:49:46,600 STUDENT: [INAUDIBLE]. 879 00:49:46,600 --> 00:49:49,960 PATRICK WINSTON: So the best single-word answer is? 880 00:49:49,960 --> 00:49:51,090 [INAUDIBLE]? 881 00:49:51,090 --> 00:49:51,490 STUDENT: Drinking. 882 00:49:51,490 --> 00:49:54,020 PATRICK WINSTON: OK, this is not a trick question. 883 00:49:54,020 --> 00:49:56,500 OK, the best single-word answer. 884 00:49:56,500 --> 00:49:57,778 Christopher, what do you think? 885 00:49:57,778 --> 00:49:59,272 STUDENT: Toasting. 886 00:49:59,272 --> 00:50:00,770 PATRICK WINSTON: Christopher. 887 00:50:00,770 --> 00:50:02,942 Well, you. 888 00:50:02,942 --> 00:50:04,874 You. 889 00:50:04,874 --> 00:50:06,330 STUDENT: Toasting. 890 00:50:06,330 --> 00:50:07,910 PATRICK WINSTON: What? 891 00:50:07,910 --> 00:50:09,298 Toasting.
892 00:50:09,298 --> 00:50:09,782 OK. 893 00:50:09,782 --> 00:50:12,690 Not a trick question. 894 00:50:12,690 --> 00:50:13,940 What's happening here? 895 00:50:18,066 --> 00:50:20,878 Best single-word answer? 896 00:50:20,878 --> 00:50:21,786 STUDENT: Drinking. 897 00:50:21,786 --> 00:50:24,060 PATRICK WINSTON: Is drinking. 898 00:50:24,060 --> 00:50:25,846 Which pair look more alike? 899 00:50:25,846 --> 00:50:32,170 [LAUGHTER] 900 00:50:32,170 --> 00:50:34,210 PATRICK WINSTON: So that cat is drinking and nobody has any 901 00:50:34,210 --> 00:50:35,500 trouble recognizing that. 902 00:50:35,500 --> 00:50:43,280 And I believe it's because you're telling a story. 903 00:50:43,280 --> 00:50:46,260 So our power of storytelling even reaches down into our 904 00:50:46,260 --> 00:50:47,620 visual apparatus. 905 00:50:47,620 --> 00:50:52,720 So the story here is that some animal has evidently had an 906 00:50:52,720 --> 00:50:56,860 urge to find something to drink and water is passing 907 00:50:56,860 --> 00:50:57,920 through that animal's mouth. 908 00:50:57,920 --> 00:50:59,720 That's the drinking story. 909 00:50:59,720 --> 00:51:02,520 So even though they look enormously different visually, 910 00:51:02,520 --> 00:51:05,360 the stuff at the bottom of our vision system provides enough 911 00:51:05,360 --> 00:51:08,910 evidence for our story apparatus so that we can give 912 00:51:08,910 --> 00:51:12,300 the left one and the right one different labels and recognize 913 00:51:12,300 --> 00:51:13,550 the cat is drinking. 914 00:51:16,410 --> 00:51:17,950 And that's the end of the story.