The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

[MUSIC PLAYING]

PATRICK H. WINSTON: Well, what we're going to do today is climb a pretty big mountain, because we're going to go from a neural net with two parameters to discussing the kind of neural nets in which people end up dealing with 60 million parameters. So it's going to be a pretty big jump.

Along the way are a couple of things I wanted to underscore from our previous discussion. Last time, I tried to develop some intuition for the kinds of formulas that you use to actually do the calculations in a small neural net about how the weights are going to change. And the main thing I tried to emphasize is that when you have a neural net like this one, everything is sort of divided into columns. You can't have the performance based on this output affect some weight change back here without going through a finite number of output variables, the y's. (And by the way, there's no y2 and y3.)

Dealing with this is really a notational nightmare, and I spent a lot of time yesterday trying to clean it up a little bit. But basically, what I'm trying to say has nothing to do with the notation I've used, but rather with the fact that there's a limited number of ways in which that can influence this, even though the number of paths through this network can grow exponentially.

So those equations underneath derive from trying to figure out how the output performance depends on some of these weights back here. What I've calculated is the dependence of the performance on w1 going that way, and I've also calculated the dependence of the performance on w1 going that way. So that's one of the equations I've got down there.
And another one deals with w3, and it involves going both this way and this way. And all I've done, in all four cases, is just take the partial derivative of performance with respect to those weights and use the chain rule to expand it. And when I do that, this is the stuff I get. It's just a whole bunch of partial derivatives. But if you look at it and let it sing a little bit to you, what you see is that there's a lot of redundancy in the computation.

So, for example, this guy here, the partial of performance with respect to w1, depends on both paths, of course. But look at the first elements here, these guys right here, and look at the first elements in the expression for calculating the partial derivative of performance with respect to w3, these guys. They're the same.

And not only that, if you look inside these expressions and look at this particular piece here, you see that that is an expression that was needed in order to calculate the changes in one of the downstream weights. But it happens to be the same thing as you see over here. And likewise, this piece is the same thing you see over here.

So each time you move further and further back from the outputs toward the inputs, you're reusing a lot of computation that you've already done. I'm trying to find a way to sloganize this, and what I've come up with is: what's done is done and cannot be-- no, no, that's not quite right, is it? It's: what's computed is computed and need not be recomputed. OK? So that's what's going on here. And that's why this is a calculation that's linear in the depth of the neural net, not exponential.
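To make that slogan concrete, here is a minimal sketch in Python of the bookkeeping it implies, assuming fully-connected layers of sigmoid neurons (the layer shapes and the performance derivative handed in are placeholders, not the net on the board): each layer's vector of deltas is computed exactly once from the deltas of the layer downstream, so the total work is linear in depth.

import numpy as np

def backprop_deltas(weights, activations, d_perf_d_output):
    # Walk backward through the layers, computing each layer's
    # deltas exactly once from the layer downstream -- what's
    # computed is computed and need not be recomputed.
    deltas = [None] * len(activations)
    y = activations[-1]
    deltas[-1] = d_perf_d_output * y * (1.0 - y)   # sigmoid slope is y(1 - y)
    for k in range(len(activations) - 2, -1, -1):
        y = activations[k]
        deltas[k] = (weights[k].T @ deltas[k + 1]) * y * (1.0 - y)
    return deltas

# Made-up three-layer example: weights[k] connects layer k to layer k+1.
acts = [np.array([0.2, 0.7]), np.array([0.5, 0.9]), np.array([0.6, 0.4])]
wts = [np.ones((2, 2)) * 0.1, np.ones((2, 2)) * 0.1]
print(backprop_deltas(wts, acts, np.array([1.0, 0.0])))

The change for any particular weight then falls out locally: the gradient for the matrix between layers k and k+1 is proportional to np.outer(deltas[k + 1], activations[k]), with no re-expansion of the exponentially many paths.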
There's another thing I wanted to point out in connection with these neural nets, and that has to do with what happens when we look at a single neuron. What we've got is a bunch of weights that you multiply times a bunch of inputs, like so. And then those are all summed up in a summing box before they enter some kind of nonlinearity, in our case a sigmoid function.

But if I ask you to write down the expression for the value we've got there, what is it? Well, it's just the sum of the w's times the x's. What's that? That's the dot product. Remember, a few lectures ago I said that some of us believe that the dot product is a fundamental calculation that takes place in our heads? So this is why we think so. If neural nets are doing anything like this, then there's a dot product between some weights and some input values.

Now, it's a funny kind of dot product, because in the models that we've been using, these input variables are all or none-- 0 or 1. But that's OK. I have it on good authority that there are neurons in our head for which the values that are produced are not exactly all or none but rather have a kind of proportionality to them. So you get a real dot product type of operation out of that.
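In code, the whole neuron is one dot product plus one nonlinearity. A minimal sketch (the weight and input values here are made up for illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron(w, x):
    # A single neuron: dot product of weights and inputs,
    # then a sigmoid nonlinearity.
    return sigmoid(np.dot(w, x))

# Three inputs, mostly 0-or-1 as in the lecture's models.
w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 0.0, 1.0])
print(neuron(w, x))  # sigmoid(1.3), about 0.786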
So that's by way of a couple of asides that I wanted to underscore before we get into the center of today's discussion, which will be to talk about the so-called deep nets.

Now, let's see, what does a deep net do? Well, from last time, you know that a deep net does that sort of thing, and it's interesting to look at some of the offerings here. By the way, how good was this performance in 2012? Well, if you count it as right when the correct answer was in the system's top five choices, the error rate was about 15%. If you say you only get it right if it was your top choice, then the error rate was about 37%. So pretty good, especially since some of these things are highly ambiguous even to us.

And what kind of a system did that? Well, it wasn't one that looked exactly like that, although that is the essence of it. The system actually looked like that. There's quite a lot of stuff in there. And what I'm going to talk about is not exactly this system, but rather the stuff of which such systems are made, because there's nothing particularly special about this one. It just happens to be a particular assembly of components that tend to reappear when anyone does this sort of neural net stuff.

So let me explain it this way. The first thing I need to talk about is the concept of-- well, I don't like the term. It's called convolution. I don't like the term because in the second-best course at the Institute, Signals and Systems, you learn about impulse responses and convolution integrals and stuff like that. And this hints at that, but it's not the same thing, because there's no memory involved in what's going on as these signals are processed. But they call them convolutional neural nets anyway.

So here you are. You've got some kind of image. And even with lots of computing power and GPUs and all that sort of stuff, we're not talking about images with 4 million pixels. We're talking about images that might be 256 on a side. They're not 1,000 by 1,000 or 4,000 by 4,000 or anything like that; they tend to be compressed into a 256-by-256 image.

And now what we do is we run over this with a neuron that is looking only at a 10-by-10 square, like so, and that produces an output. Next, we go over that again, having shifted this neuron a little bit, like so. And then we shift it again, so we get that output right there. So each of those deployments of a neuron produces an output, and that output is associated with a particular place in the image. This is the process that is called convolution, as a term of art.
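Here is a minimal sketch of that sliding-window process (the stride of 1 and the random placeholder values are assumptions; the 256-by-256 image and 10-by-10 kernel sizes are the ones from the board):

import numpy as np

def convolve(image, kernel, stride=1):
    # Slide the kernel over the image; each placement is one
    # neuron's dot product, producing one output value tied to
    # a particular place in the image.
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.rand(256, 256)   # the compressed 256-by-256 image
kernel = np.random.rand(10, 10)    # one 10-by-10 neuron, the "kernel"
feature_map = convolve(image, kernel)
print(feature_map.shape)           # (247, 247)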
Now, this convolution operation results in a bunch of points over here. And the next thing that we do with those points is we look in local neighborhoods and see what the maximum value is. Then we take that maximum value and construct yet another mapping of the image over here using that maximum value. Then we slide that over, like so, and we produce another value. And then we slide it over one more time, with a different color, and now we've got yet another value. This process is called pooling. And because we're taking the maximum, this particular kind of pooling is called max pooling.

So now let's see what's next. Taking a particular neuron and running it across the image-- we call that a kernel, again sucking some terminology out of Signals and Systems. But now what we're going to say is that we could use a whole bunch of kernels. So the thing that I produced with one kernel can now be repeated many times, like so. In fact, a typical number is 100 times.

So now what we've got is a 256-by-256 image. We've gone over it with a 10-by-10 kernel. We've taken the maximum values that are in the vicinity of each other. And then we've repeated that 100 times. So now we can take that and feed all those results into some kind of neural net, perhaps with a fully-connected job on the final layers. And then, in the ultimate output, we get some sort of indication of how likely it is that the thing being seen is, say, a mite.

So that's roughly how these things work. What have we talked about so far? We've talked about pooling, and we've talked about convolution.
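And a matching sketch of the max pooling step (the 2-by-2 neighborhood and stride are assumptions; the lecture only says "local neighborhoods"):

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Look in each local neighborhood and keep only the maximum value.
    h, w = feature_map.shape
    out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i*stride:i*stride+size,
                                    j*stride:j*stride+size].max()
    return out

# One pooled map per kernel; with 100 kernels you'd build a list of 100.
feature_map = np.random.rand(247, 247)  # e.g. the output of one 10x10 kernel
print(max_pool(feature_map).shape)      # (123, 123)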
And now we can talk about some of the good stuff. But before I get into that, this is what we can do now, and you can compare it with what was done in the old days, before massive amounts of computing became available-- a kind of neural net activity that's a little easier to see.

You might, in the old days, only have enough computing power to deal with a small grid of picture elements, or so-called pixels. And then each of these might be a value that is fed as an input into some kind of neuron. So you might have a column of neurons that are looking at these pixels in your image. And then there might be a small number of columns that follow from that. And finally, something that says this neuron is looking for things that are a number 1-- that is to say, something that looks like a number 1 in the image. So this stuff up here is what you can do when you have a massive amount of computation relative to the kind of thing you used to see in the old days.

So what's different? Well, what's different is that instead of a few hundred parameters, we've got a lot more. Instead of 10 digits, we have 1,000 classes. Instead of a few hundred samples, we have maybe 1,000 examples of each class. So that makes a million samples. And we've got 60 million parameters to play with. And the surprising thing is that the net result is a function approximator that astonishes everybody. No one quite knows why it works, except that when you throw an immense amount of computation into this kind of arrangement, it's possible to get a performance that no one expected would be possible. So that's sort of the bottom line.

But now there are a couple of ideas beyond that that I think are especially interesting, and I want to talk about those. The first idea that's especially interesting is the idea of autocoding, and here's how it works. I'm going to run out of board space, so I think I'll do it right here. You have some input values. They go into a layer of neurons, the input layer. Then there is a so-called hidden layer that's much smaller-- maybe, in the example, there will be 10 neurons here and just a couple here. And then these expand to an output layer, like so.
Now we can take the output layer, z1 through zn, and compare it with the desired values, d1 through dn. Are you following me so far? Now, the trick is to say, well, what are the desired values? Let's let the desired values be the input values. So what we're going to do is train this net up so that the output is the same as the input.

What's the good of that? Well, we're going to force everything down through this necked-down piece of network. So if this network is going to succeed in taking all the possibilities here and cramming them into this smaller inner layer, the so-called hidden layer, such that it can reproduce the input at the output, it must be doing some kind of generalization of the kinds of things it sees on its input. And that's a very clever idea, and it's seen in various forms in a large fraction of the papers that appear on deep neural nets.
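Here is a minimal sketch of that autocoding idea, shaped like the upcoming demo's 10-through-3-back-to-10 net (the shadow encoding, learning rate, iteration count, and training loop are all assumptions of mine, not the demo's actual code):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 10 inputs forced through a 3-neuron hidden layer, back out to 10.
W1 = rng.normal(scale=0.5, size=(3, 10))
W2 = rng.normal(scale=0.5, size=(10, 3))

rate = 0.5
for _ in range(10000):
    x = np.zeros(10)
    x[: rng.integers(1, 11)] = 1.0     # a "shadow" of random height
    h = sigmoid(W1 @ x)                # the necked-down hidden layer
    z = sigmoid(W2 @ h)
    dz = (x - z) * z * (1 - z)         # output deltas: desired d = input x
    dh = (W2.T @ dz) * h * (1 - h)     # hidden deltas (reused computation)
    W2 += rate * np.outer(dz, h)
    W1 += rate * np.outer(dh, x)

After training, the three hidden neurons carry some encoded description of shadow height, which the output layer decodes back into ten levels; nothing in the loop ever mentions animal classes.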
But now I want to talk about an example so I can show you a demonstration. OK? We don't have GPUs, and we don't have three days to do this. So I'm going to make up a very simple example that's reminiscent of what goes on here but involves hardly any computation. What I'm going to imagine is that we're trying to recognize animals from the heights of the shadows that they cast. So we're going to recognize three animals-- a cheetah, a zebra, and a giraffe-- and they will each cast a shadow on the blackboard, like me. No vampires involved here. And what we're going to do is use the shadow as an input to a neural net. All right? So let's see how that would work.

So there is our network. And if I just click into one of these test samples, that's the height of the shadow that a cheetah casts on a wall. There are 10 input neurons corresponding to each level of the shadow. They're rammed through three inner-layer neurons, and from that it spreads out and becomes the outer-layer values. And we're going to compare those outer-layer values to the desired values, but the desired values are the same as the input values. So this column is a column of input values, and on the far right we have our column of desired values.

Now, we haven't trained this neural net yet; all we've got is random values in there. So if we run the test samples through, we get that and that. Yeah, cheetahs are short, zebras are medium height, and giraffes are tall, but our output is just pretty much 0.5 for all of those shadow heights-- no training so far.

So let's run this thing. We're just using simple backprop, just like on our world's simplest neural net. And it's interesting to see what happens. You see all those values changing? Now, I need to mention that when you see a green connection, that means it's a positive weight, and the density of the green indicates how positive it is. The red ones are negative weights, and the intensity of the red indicates how negative it is. So here you can see that from our random initialization we still have a variety of red and green values. We haven't really done much training, so everything correctly looks pretty much random.

So let's run this thing. And after only 1,000 iterations of going through these examples and trying to make the output the same as the input, we've reached a point where the error rate has dropped. In fact, it's dropped so much it's interesting to re-look at the test cases. So here's a test case where we have a cheetah. And now the output value is, in fact, very close to the desired value in all the output neurons. If we look at another one, once again, there's a correspondence in the right two columns. And if we look at the final one-- yeah, there's a correspondence in the right two columns.

Now, you back up from this and say, well, what's going on here? It turns out that you're not training this thing to classify animals.
You're training it to understand the nature of the things that it sees in the environment, because all it sees is the height of a shadow. It doesn't know anything about the classifications you're going to try to get out of that. All it sees is that there's a kind of consistency in the data that it sees on the input values. Right?

Now, you might say, OK, that's cool, because what must be happening is that that hidden layer, because everything is forced through that narrow pipe, must be doing some kind of generalization. So it ought to be the case that if we click on each of those neurons, we ought to see it specialize to a particular height, because that's the sort of stuff that's presented on the input.

Well, let's go see what, in fact, is the maximum stimulation to be seen on the neurons in that hidden layer. When I click on these guys, what we're going to see is the input values that maximally stimulate that neuron. And by the way, I have no idea how this is going to turn out, because the initialization is all random.

Well, that's good-- that one looks like it's generalized the notion of short. Ugh, that doesn't look like medium. And in fact, the maximum stimulation doesn't involve any stimulation from that lower neuron. Here, look at this one. That doesn't look like tall. So we've got one that looks like short and two that just look completely random.

So, in fact, maybe we'd better back off the idea that what's going on in that hidden layer is generalization, and say instead that what's going on in there is the encoding of a generalization. We don't see the generalization in the stimulating values. What we have instead is some kind of encoded generalization.
And because this stuff is encoded, that's what makes these neural nets so extraordinarily difficult to understand. We don't understand what they're doing. We don't understand why they can recognize a cheetah. We don't understand why one can recognize a school bus in some cases but not in others, because we don't really understand what these neurons are responding to. Well, that's not quite true-- there's been a lot of work recently on trying to sort that out-- but there's still a lot of mystery in this world.

In any event, that's the autocoding idea. It comes in various guises; sometimes people talk about Boltzmann machines and things of that sort, but it's basically all the same sort of idea. And so what you can do is go layer by layer. Once you've trained the input layer, you can use that layer to train the next layer, and then that can train the next layer after that. And it's only at the very, very end that you say to yourself, well, now I've accumulated a lot of knowledge about the environment and what can be seen in the environment; maybe it's time to get around to using some samples of particular classes and train on classes. So that's the story on autocoding.

Now, the next thing to talk about is that final layer. So let's see what the final layer might look like. It might look like this. There's a summer. There's a minus 1 up there, and there's a multiplier here, with a threshold value, T, there. Now, likewise, there's another input value here. Let me call it x, and it gets multiplied by some weight, w, and then that goes into the summer as well. And that, in turn, goes into a sigmoid that looks like so. And finally, you get an output, which we'll call z.
So it's clear that if you just write out the value of z as it depends on those inputs, using the formula that we worked with last time, then what you see is that

z = 1 / (1 + e^(-wx + T))

Right? So that's a sigmoid function that depends on the value of that weight and on the value of that threshold. So let's look at how those values might change things.

Here we have an ordinary sigmoid. What happens if we shift it with a threshold value? If we change that threshold value T, it's going to shift the place where the sigmoid comes down. So a change in T could cause this thing to shift over that way. And if we change the value of w, that could change how steep this guy is.

So we might think that the performance, since it depends on w and T, should be adjusted in such a way as to make the classification do the right thing. But what's the right thing? Well, that depends on the samples that we've seen.

Suppose, for example, that this is our sigmoid function, and we see some positive examples of a class that have values that lie at that point and that point and that point. And we have some values that correspond to situations where the class is not one of the things associated with this neuron, and in that case what we see is examples that are over in this vicinity here.

So the probability that we would see this particular guy in this world is associated with the value on the sigmoid curve. You could think of this as the probability of that positive example, and this is the probability of that positive example, and this is the probability of that positive example. What's the probability of this negative example? Well, it's 1 minus the value on that curve. And this one's 1 minus the value on that curve. So we could go through the calculations.
And what we would determine is that to maximize the probability of seeing this data-- this particular stuff in a set of experiments-- we would have to adjust T and w so as to get this curve doing the optimal thing. There's nothing mysterious about it; it's just more partial derivatives and that sort of thing. But the bottom line is that the probability of seeing this data depends on the shape of this curve, and the shape of this curve depends on those parameters. And if we want to maximize the probability of the data we've seen, then we have to adjust those parameters accordingly.

Let's have a look at a demonstration. OK. So there's an ordinary sigmoid curve. Here are a couple of positive examples. Here's a negative example. Let's put in some more positive examples over here. And now let's run the good old gradient ascent algorithm on that. And this is what happens. You've seen how, as we adjust the shape of the curve, the probability of seeing those examples of the class goes up, and the probability of seeing the non-example goes down.

So what if we put some more examples in? If we put a negative example there, not much is going to happen. What would happen if we put a positive example right there? Then we start seeing some dramatic shifts in the shape of the curve. So that's probably a noise point. But we can put some more negative examples in there and see how that adjusts the curve.

All right. So that's what we're doing. We're viewing this output value as something that's related to the probability of seeing a class. And we're adjusting the parameters on that output layer so as to maximize the probability of the sample data that we've got at hand. Right?
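Here is a minimal sketch of that output-layer training, written as gradient ascent on the log probability of the data (the sample positions, learning rate, and iteration count are made up; the sigmoid is the z = 1/(1 + e^(-wx + T)) from the board):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 1-D samples: positives should get probability near 1, negatives near 0.
pos = np.array([2.0, 2.5, 3.0])
neg = np.array([-1.0, -0.5])

w, T = 1.0, 0.0
rate = 0.1
for _ in range(2000):
    # Probability of the data: product of sigmoid(wx - T) over positives
    # and (1 - sigmoid(wx - T)) over negatives. Ascend the log of that:
    grad_w = np.sum((1 - sigmoid(w*pos - T)) * pos) \
           - np.sum(sigmoid(w*neg - T) * neg)
    grad_T = -np.sum(1 - sigmoid(w*pos - T)) + np.sum(sigmoid(w*neg - T))
    w += rate * grad_w
    T += rate * grad_T

print(sigmoid(w*pos - T))  # near 1 for the positives
print(sigmoid(w*neg - T))  # near 0 for the negatives

With the examples cleanly separated, w keeps growing and the curve keeps steepening; a stray positive dropped among the negatives drags the curve around just the way the demo shows.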
Now, there's one more thing. Because, see, what we've got here is the basic idea of back propagation, which has layers and layers of additional-- let me be flattering and call them ideas-- layered on top. So here's the next idea that's layered on top.

We've got an output value here. It's a function, after all, and it's got a value. And if we have 1,000 classes, we're going to have 1,000 output neurons, and each is going to be producing some kind of value. We can think of that value as a probability, but I don't want to write "probability" yet. I just want to say that what we've got for this output neuron is a function of class 1. And then there will be another output neuron, which is a function of class 2, and so on. And these values will presumably be higher-- this one will be higher if we are, in fact, looking at class 1, and this one down here will be higher if we're looking at class m.

So what we would like to do is not just pick one of these outputs and say, well, you've got the highest value, so you win. What we want to do instead is associate some kind of probability with each of the classes, because, after all, we want to do things like find the most probable five. So what we do is we say, all right, the actual probability of class 1 is equal to the output of that function divided by the sum of the outputs over all classes:

p(class 1) = f(class 1) / (f(class 1) + ... + f(class m))

That takes the entire output vector and converts each output value into a probability. When we used that sigmoid function, we did it with a view toward thinking about the output as a probability-- in fact, we assumed it was a probability when we made this argument. But in the end, there's an output for each of those classes, and what we get is not exactly a probability until we divide by a normalizing factor. This, by the way, is called softmax-- not on my list of things, but it soon will be.
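As a minimal sketch (the three output values are placeholders):

import numpy as np

def normalize_outputs(outputs):
    # Convert a vector of output-neuron values into probabilities
    # by dividing each by the sum of all of them.
    outputs = np.asarray(outputs, dtype=float)
    return outputs / outputs.sum()

print(normalize_outputs([0.8, 0.1, 0.1]))  # already sums to 1
print(normalize_outputs([0.9, 0.6, 0.3]))  # [0.5, 0.333..., 0.166...]

Note that this is the divide-by-the-sum normalization described here; the now-standard softmax exponentiates first, computing e^(f_i) divided by the sum of e^(f_k), which has the same normalizing effect but accentuates the largest output.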
So since we're not talking about taking the maximum and using that to classify the picture, what we're going to do is use this softmax: we're going to give a range of classifications, and we're going to associate a probability with each. And that's what you saw in all of those samples. You saw, yes, this is a container ship, but maybe it's also this, that, or a third, or fourth, or fifth thing.

So that is a pretty good summary of the kinds of things that are involved. But now we've got one more step, because what we can do now is take this output layer idea, this softmax idea, and put it together with the autocoding idea. So we've trained just a layer up. And now we're going to detach it from the output layer but retain those weights that connect the input to the hidden layer. And when we do that, what we're going to see is something that looks like this. Now we've got a trained first layer but an untrained output layer. We're going to freeze the input layer and train the output layer using the sigmoid curve, and see what happens when we do that.

Oh, by the way, let's run our test samples through. You can see it's not doing anything; the output is a half for each of the categories even though we've got a trained middle layer. So we have to train the outer layer. Let's see how long it takes. Whoa, that was pretty fast. Now there's an extraordinarily good match between the outputs and the desired outputs. So that's the combination of the autocoding idea and the softmax idea.

There's just one more idea that's worthy of mention, and that's the idea of dropout. The plague of any neural net is that it gets stuck in some kind of local maximum. So it was discovered that these things train better if, on every iteration, you flip a coin for each neuron. And if the coin ends up tails, you assume that neuron has just died and has no influence on the output. It's called dropping out those neurons. On the next iteration, you drop out a different set. What this seems to do is prevent the thing from settling into a frozen local-maximum state.
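A minimal sketch of that coin flipping (the fair coin and the example layer values are assumptions; production versions usually also rescale the survivors by 1/p_keep, which isn't part of the description here):

import numpy as np

rng = np.random.default_rng()

def dropout(layer_outputs, p_keep=0.5):
    # Flip a coin for each neuron; tails means it "dies" for this
    # iteration and contributes nothing downstream. A fresh mask
    # is drawn on every training iteration.
    mask = rng.random(layer_outputs.shape) < p_keep
    return layer_outputs * mask

h = np.array([0.9, 0.2, 0.7, 0.5])
print(dropout(h))  # e.g. [0.9, 0.0, 0.7, 0.0] -- different every call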
So that's deep nets. They should be called, by the way, wide nets, because they tend to be enormously wide but rarely more than 10 columns deep.

Now, let's see, where to go from here? Maybe what we should do is talk about the awesome curiosity in the current state of the art. And that is that all of this sophistication-- output layers that are probabilities, training using autocoding or Boltzmann machines-- doesn't seem to help much relative to plain old back propagation. Back propagation with a convolutional net seems to do just about as well as anything.

And while we're on the subject of an ordinary deep net, I'd like to examine a situation here where we have a deep net-- well, a classroom deep net. We'll put five layers in there, and its job is still the same: to classify an animal as a cheetah, a zebra, or a giraffe based on the height of the shadow it casts. As before, if it's green, that means positive; if it's red, that means negative. And right at the moment, we have no training, so if we run our test samples through, the output is always a half no matter what the animal is. All right? So what we're going to do is just use ordinary backprop on this, the same thing as in that sample that's underneath the blackboard. Only now we've got a lot more parameters: we've got five columns, and each one of them has 9 or 10 neurons in it.

So let's let this one run. Now, look at that stuff on the right. It's all turned red. At first I thought this was a bug in my program, but it makes absolute sense. If you don't know what the actual animal is going to be, and there are a whole bunch of possibilities, you'd better just say no for everybody. It's like when a biologist says, we don't know.
It's the most probable answer. Well, eventually, after about 160,000 iterations, it seems to have got it. Let's run the test samples through. Now it's doing great. Let's do it again, just to see if this is a fluke. All red on the right side, and finally you start seeing some changes going on in the final layers there. And if you look at the error rate down at the bottom, you'll see that it kind of falls off a cliff. So nothing happens for a real long time, and then it falls off a cliff.

Now, what would happen if this neural net were not quite so wide? Good question. But before we get to that question, what I'm going to do is a funny kind of variation on the theme of dropout. I'm going to kill off one neuron in each column, and then see if I can retrain the network to do the right thing. So I'm going to reassign those neurons to some other purpose, and now there's one fewer neuron in each column of the network. If we rerun that, we see that it trains itself up very fast. So we seem to be still close enough to a solution that we can do without one of the neurons in each column.

Let's do it again. Now it goes up a little bit, but it quickly falls down to a solution. Try again. Quickly falls down to a solution. Oh, my god, how much of this am I going to do? Each time I knock something out and retrain, it finds its solution very fast. Whoa-- I got it all the way down to two neurons in each column, and it still has a solution. That's interesting, don't you think?

But let's repeat the experiment, and this time we're going to do it a little differently. We're going to take our five layers, and before we do any training, I'm going to knock out all but two neurons in each column. Now, I know that with two neurons in each column there's a solution-- I just showed one. But let's run it this way.
It looks like increasingly bad news. What's happened is that this sucker has got itself into a local maximum. So now you can see why there's been a breakthrough in this neural net learning stuff. It's because when you widen the net, you turn local maxima into saddle points. So now it's got a way of crawling its way through this vast space without getting stuck on a local maximum, as suggested by this.

All right. So those are some, I think, interesting things to look at by way of these demonstrations. But now I'd like to go back to my slide set and show you some examples that will address the question of whether these things are seeing like we see.

You can try these examples online; there are a variety of websites that allow you to put in your own picture. And there's a cottage industry of producing papers in journals that fool neural nets. So in this case, a very small number of pixels have been changed. You don't see the difference, but it's enough to take this particular neural net from a high confidence that it's looking at a school bus to thinking that it's not a school bus. And those are some things that it thinks are a school bus.

So it appears to be the case that what is triggering this school bus result is that it's seeing enough local evidence that this is not one of the other 999 classes, and enough positive evidence from these local looks, to conclude that it's a school bus. So do you see any of those things? I don't. And here you can say, OK, well, look at that baseball one. Yeah, that looks like it's got a little bit of baseball texture in it. So maybe what it's doing is looking at texture.

These are some examples from a recent and very famous paper by Google using essentially the same ideas to put captions on pictures. This, by the way, is what has stimulated all this enormous concern about artificial intelligence.
740 00:43:53,650 --> 00:43:54,150 All right.
741 00:43:54,150 --> 00:43:57,880 So those are some, I think, interesting things
742 00:43:57,880 --> 00:44:01,810 to look at by way of these demonstrations.
743 00:44:01,810 --> 00:44:04,510 But now I'd like to go back to my slide set
744 00:44:04,510 --> 00:44:06,860 and show you some examples that will address
745 00:44:06,860 --> 00:44:09,670 the question of whether these things are seeing like we see.
746 00:44:20,610 --> 00:44:22,380 So you can try these examples online.
747 00:44:22,380 --> 00:44:24,370 There are a variety of websites that allow
748 00:44:24,370 --> 00:44:27,950 you to put in your own picture.
749 00:44:27,950 --> 00:44:33,510 And there's a cottage industry of journal papers
750 00:44:33,510 --> 00:44:35,840 on fooling neural nets.
751 00:44:35,840 --> 00:44:38,600 So in this case, a very small number of pixels
752 00:44:38,600 --> 00:44:39,420 have been changed.
753 00:44:39,420 --> 00:44:41,640 You don't see the difference, but it's
754 00:44:41,640 --> 00:44:44,290 enough to take this particular neural net
755 00:44:44,290 --> 00:44:47,850 from high confidence that it's looking at a school bus
756 00:44:47,850 --> 00:44:51,777 to thinking that it's not a school bus.
757 00:44:51,777 --> 00:44:54,026 And those are some things that it thinks are a school bus.
758 00:44:56,780 --> 00:44:58,490 So it appears to be the case that what
759 00:44:58,490 --> 00:45:01,320 is triggering this school bus result
760 00:45:01,320 --> 00:45:04,340 is that it's seeing enough local evidence that this is not
761 00:45:04,340 --> 00:45:10,080 one of the other 999 classes and enough positive evidence
762 00:45:10,080 --> 00:45:12,310 from these local looks to conclude
763 00:45:12,310 --> 00:45:13,313 that it's a school bus.
764 00:45:18,020 --> 00:45:20,330 So do you see any of those things?
765 00:45:20,330 --> 00:45:20,870 I don't.
766 00:45:24,494 --> 00:45:28,290 And here you can say, OK, well, look at that baseball one.
767 00:45:28,290 --> 00:45:31,500 Yeah, that looks like it's got a little bit of baseball texture
768 00:45:31,500 --> 00:45:32,020 in it.
769 00:45:32,020 --> 00:45:33,978 So maybe what it's doing is looking at texture.
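The lecture doesn't say how these doctored school-bus pictures were produced. One well-known recipe for this kind of imperceptible fooling, offered here purely as an illustrative stand-in and not as the method from the paper on the slide, is the fast gradient sign method: push every pixel a tiny step in whichever direction increases the classifier's loss on the true label. A minimal sketch, assuming PyTorch and torchvision:

# A minimal fast-gradient-sign sketch -- an illustrative stand-in, not the
# method used in the paper shown in the lecture.  Assumes PyTorch and
# torchvision; proper ImageNet normalization is omitted for brevity.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

def fgsm(x, label, eps=0.003):
    """Nudge every pixel of x by +/- eps in the direction that increases
    the classifier's loss for the true label."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

# Hypothetical usage: x is a (1, 3, 224, 224) image tensor scaled to
# [0, 1], and label is the true class index (school bus is one of the
# 1,000 ImageNet classes).  A perturbation of 0.003 per pixel is invisible
# to us but often flips the predicted class:
# x_adv = fgsm(x, label)
# print(model(x_adv).argmax(dim=1))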
770 00:45:39,130 --> 00:45:43,380 These are some examples from a recent and very famous
771 00:45:43,380 --> 00:45:47,380 paper by Google using essentially the same ideas
772 00:45:47,380 --> 00:45:51,290 to put captions on pictures.
773 00:45:51,290 --> 00:45:53,790 So this, by the way, is what has stimulated
774 00:45:53,790 --> 00:45:56,620 all this enormous concern about artificial intelligence.
775 00:45:56,620 --> 00:45:58,870 Because a naive viewer looks at that picture and says,
776 00:45:58,870 --> 00:46:00,245 oh, my god, this thing knows what
777 00:46:00,245 --> 00:46:06,260 it's like to play, or be young, or move, or what a Frisbee is.
778 00:46:06,260 --> 00:46:08,070 And of course, it knows none of that.
779 00:46:08,070 --> 00:46:10,950 It just knows how to label this picture.
780 00:46:10,950 --> 00:46:14,080 And to the credit of the people who wrote this paper,
781 00:46:14,080 --> 00:46:17,540 they show examples that don't do so well.
782 00:46:17,540 --> 00:46:21,000 So yeah, it's a cat, but it's not lying down.
783 00:46:21,000 --> 00:46:24,620 Oh, it's a little girl, but she's not blowing bubbles.
784 00:46:24,620 --> 00:46:25,884 What about this one?
785 00:46:25,884 --> 00:46:28,848 [LAUGHTER]
786 00:46:31,820 --> 00:46:34,770 So we've been doing our own work in my laboratory
787 00:46:34,770 --> 00:46:36,390 on some of this.
788 00:46:36,390 --> 00:46:39,900 And the way the following set of pictures was produced was this.
789 00:46:39,900 --> 00:46:41,910 You take an image, and you separate it
790 00:46:41,910 --> 00:46:44,310 into a bunch of slices, each representing
791 00:46:44,310 --> 00:46:46,760 a particular frequency band.
792 00:46:46,760 --> 00:46:49,300 And then you go into one of those frequency bands
793 00:46:49,300 --> 00:46:51,680 and you knock out a rectangle from the picture,
794 00:46:51,680 --> 00:46:54,730 and then you reassemble the thing.
795 00:46:54,730 --> 00:46:56,876 And if you hadn't knocked that piece out,
796 00:46:56,876 --> 00:46:58,750 when you reassemble it, it would look exactly
797 00:46:58,750 --> 00:47:00,760 like it did when you started.
798 00:47:00,760 --> 00:47:03,370 So what we're doing is we knock out as much as we can
799 00:47:03,370 --> 00:47:05,827 and still retain the neural net's impression
800 00:47:05,827 --> 00:47:08,160 that it's the thing that it started out thinking it was.
801 00:47:08,160 --> 00:47:09,575 So what do you think this is?
802 00:47:13,640 --> 00:47:17,310 It's identified by a neural net as a railroad car
803 00:47:17,310 --> 00:47:21,589 because this is the image that it started with.
804 00:47:21,589 --> 00:47:22,380 How about this one?
805 00:47:22,380 --> 00:47:23,370 That's easy, right?
806 00:47:23,370 --> 00:47:25,100 That's a guitar.
807 00:47:25,100 --> 00:47:28,090 We weren't able to mutilate that one very much and still retain
808 00:47:28,090 --> 00:47:30,830 the guitar-ness of it.
809 00:47:30,830 --> 00:47:32,320 How about this one?
810 00:47:32,320 --> 00:47:33,029 AUDIENCE: A lamp?
811 00:47:33,029 --> 00:47:34,361 PATRICK H. WINSTON: What's that?
812 00:47:34,361 --> 00:47:35,020 AUDIENCE: Lamp.
813 00:47:35,020 --> 00:47:35,250 PATRICK H. WINSTON: What?
814 00:47:35,250 --> 00:47:36,190 AUDIENCE: Lamp.
815 00:47:36,190 --> 00:47:37,330 PATRICK H. WINSTON: A lamp.
816 00:47:37,330 --> 00:47:38,067 Any other ideas?
817 00:47:38,067 --> 00:47:38,983 AUDIENCE: [INAUDIBLE].
818 00:47:38,983 --> 00:47:40,280 AUDIENCE: [INAUDIBLE].
819 00:47:40,280 --> 00:47:42,321 PATRICK H. WINSTON: Ken, what do you think it is?
820 00:47:42,321 --> 00:47:43,157 AUDIENCE: A toilet.
821 00:47:43,157 --> 00:47:45,490 PATRICK H. WINSTON: See, he's an expert on this subject.
822 00:47:45,490 --> 00:47:46,880 [LAUGHTER]
823 00:47:46,880 --> 00:47:50,480 It was identified as a barbell.
824 00:47:50,480 --> 00:47:51,290 What's that?
825 00:47:51,290 --> 00:47:52,206 AUDIENCE: [INAUDIBLE].
826 00:47:52,206 --> 00:47:53,450 PATRICK H. WINSTON: A what?
827 00:47:53,450 --> 00:47:54,340 AUDIENCE: Cello.
828 00:47:54,340 --> 00:47:55,730 PATRICK H. WINSTON: Cello.
829 00:47:55,730 --> 00:47:59,361 You didn't see the little girl or the instructor.
830 00:47:59,361 --> 00:48:00,152 How about this one?
831 00:48:00,152 --> 00:48:01,330 AUDIENCE: [INAUDIBLE].
832 00:48:01,330 --> 00:48:01,830 PATRICK H. WINSTON: What?
833 00:48:01,830 --> 00:48:02,830 AUDIENCE: [INAUDIBLE].
834 00:48:02,830 --> 00:48:03,788 PATRICK H. WINSTON: No.
835 00:48:07,205 --> 00:48:08,630 AUDIENCE: [INAUDIBLE].
836 00:48:08,630 --> 00:48:10,680 PATRICK H. WINSTON: It's a grasshopper.
837 00:48:10,680 --> 00:48:11,390 What's this?
838 00:48:11,390 --> 00:48:12,330 AUDIENCE: A wolf.
839 00:48:12,330 --> 00:48:13,871 PATRICK H. WINSTON: Wow, you're good.
840 00:48:15,870 --> 00:48:17,693 It's actually not a two-headed wolf.
841 00:48:17,693 --> 00:48:20,000 [LAUGHTER]
842 00:48:20,000 --> 00:48:23,438 It's two wolves that are close together.
843 00:48:23,438 --> 00:48:24,694 AUDIENCE: [INAUDIBLE].
844 00:48:24,694 --> 00:48:26,402 PATRICK H. WINSTON: That's a bird, right?
845 00:48:26,402 --> 00:48:27,775 AUDIENCE: [INAUDIBLE].
846 00:48:27,775 --> 00:48:29,150 PATRICK H. WINSTON: Good for you.
847 00:48:29,150 --> 00:48:29,837 It's a rabbit.
848 00:48:29,837 --> 00:48:32,194 [LAUGHTER]
849 00:48:32,194 --> 00:48:32,819 How about that?
850 00:48:32,819 --> 00:48:33,819 [? AUDIENCE: Giraffe. ?]
851 00:48:36,040 --> 00:48:38,362 PATRICK H. WINSTON: Russian wolfhound.
852 00:48:38,362 --> 00:48:39,278 AUDIENCE: [INAUDIBLE].
853 00:48:46,415 --> 00:48:48,290 PATRICK H. WINSTON: If you've been to Venice,
854 00:48:48,290 --> 00:48:49,314 you recognize this.
855 00:48:49,314 --> 00:48:51,920 AUDIENCE: [INAUDIBLE].
856 00:48:51,920 --> 00:48:54,230 PATRICK H. WINSTON: So the bottom line
857 00:48:54,230 --> 00:48:55,960 is that these things are an engineering
858 00:48:55,960 --> 00:49:00,536 marvel and do great things, but they don't see like we see.
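For readers who want to reproduce the kind of mutilated pictures used in this guessing game: they come from the band-knockout procedure Winston described before the guessing began. His lab's actual code isn't shown in the lecture, so what follows is only a rough sketch of the idea, assuming NumPy and SciPy with a difference-of-Gaussians band decomposition; the band widths and the knocked-out rectangle are stand-in choices.

# A rough sketch of the frequency-band knockout procedure -- not the
# lab's actual code.  The bands sum back to the original image exactly,
# so zeroing a rectangle in one band is the only change introduced.
import numpy as np
from scipy.ndimage import gaussian_filter

def band_slices(img, sigmas=(1, 2, 4, 8)):
    """Split a grayscale image into band-pass slices plus a low-pass residual."""
    blurred = [img] + [gaussian_filter(img, s) for s in sigmas]
    bands = [blurred[i] - blurred[i + 1] for i in range(len(sigmas))]
    return bands + [blurred[-1]]

def knock_out(img, band_idx, rows, cols):
    """Zero a rectangle in one frequency band, then reassemble the image."""
    bands = band_slices(img)
    r0, r1 = rows
    c0, c1 = cols
    bands[band_idx][r0:r1, c0:c1] = 0.0
    return sum(bands)                   # unmodified bands restore img exactly

# Hypothetical usage on a 256x256 grayscale array img scaled to [0, 1]:
# mutilated = knock_out(img, band_idx=1, rows=(64, 192), cols=(64, 192))
# Feed `mutilated` to the classifier, and keep enlarging the rectangle for
# as long as the predicted class stays the same, as in the slides.

Repeating that greedy loop over bands and rectangles yields images like the railroad car: almost nothing a person can recognize, yet enough for the net to keep its original label, which is the lecture's closing point.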