The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PATRICK WINSTON: It was in 2010, yes, that's right. It was in 2010. We were having our annual discussion about what we would dump from 6.034 in order to make room for some other stuff. And we almost killed off neural nets. That might seem strange, because our heads are stuffed with neurons. If you open up your skull and pluck them all out, you don't think anymore. So it would seem that neural nets would be a fundamental and unassailable topic. But many of us felt that the neural models of the day weren't much in the way of faithful models of what actually goes on inside our heads. And besides that, nobody had ever made a neural net that was worth a darn for doing anything. So we almost killed it off.
But then we said, well, everybody would feel cheated if they take a course in artificial intelligence, don't learn anything about neural nets, and then they go off and invent them themselves and waste all sorts of time. So we kept the subject in. Then, two years later, Geoff Hinton from the University of Toronto stunned the world with some neural network he had done on recognizing and classifying pictures. And he published a paper from which I am now going to show you a couple of examples. Geoff's neural net, by the way, had 60 million parameters in it. And its purpose was to determine which of 1,000 categories best characterized a picture.

So there it is. There's a sample of things that the Toronto neural net was able to recognize or make mistakes on. I'm going to blow that up a little bit. I think I'm going to look particularly at the example labeled container ship.
So what you see here is that the program returned its best estimates of what it was, ranked, the first five, according to the likelihood, probability, or certainty that it felt that a particular class was characteristic of the picture. And so you can see this one is extremely confident that it's a container ship. It also was fairly moved by the idea that it might be a lifeboat. Now, I'm not sure about you, but I don't think this looks much like a lifeboat. But it does look like a container ship. So if I look at only the best choice, it looks pretty good.

Here are other things it did pretty well on, where it got the right answer as its first choice. So over on the left, you see that it's decided that the picture is a picture of a mite. The mite is not anywhere near the center of the picture, but somehow it managed to find it. Then the container ship again. There is a motor scooter with a couple of people sitting on it. But it correctly characterized the picture as a motor scooter. And then on the right, a leopard. And everything else is a cat of some sort. So it seems to be doing pretty well.
In fact, it does do pretty well. But anyone who does this kind of work has an obligation to show you some of the stuff that it doesn't work so well on or doesn't get quite right. And so these pictures also occurred in Hinton's paper.

So the first one is characterized as a grille. But the right answer was supposed to be convertible. Oh, no, yes, yeah, the right answer was convertible. In the second case, the characterization is of a mushroom. And the alleged right answer is agaric. Is that pronounced right? It turns out that's a kind of mushroom, so no problem there. In the next case, it said it was a cherry. But it was supposed to be a dalmatian. Now, I think a dalmatian is a perfectly legitimate answer for that particular picture, so it's hard to fault it for that. And in the last case, the correct answer was not in any of the top five. I'm not sure if you've ever seen a Madagascar cat. But that's a picture of one. And it's interesting to compare that with the first choice of the program, the squirrel monkey. Here are the two side by side.
So in a way, it's not surprising that it thought that the Madagascar cat was a picture of a squirrel monkey. So, pretty impressive. It blew away the competition. It did so much better that second place wasn't even close. And for the first time, it demonstrated that a neural net could actually do something. And since that time, in the three years since that time, there's been an enormous amount of effort put into neural net technology, which some say is the answer. So what we're going to do today and tomorrow is have a look at this stuff and ask ourselves why it works, when it might not work, what needs to be done, what has been done, and all those kinds of questions will emerge.

So I guess the first thing to do is think about what it is that we are being inspired by. We're being inspired by those things that are inside our head, all 10 to the 11th of them. And so if we take one of those 10 to the 11th and look at it, you know from 7.01-something or other approximately what a neuron looks like.
And by the way, I'm going to teach you in this lecture how to answer questions about neurobiology with an 80% probability that you will give the same answer as a neurobiologist. So let's go.

So here's a neuron. It's got a cell body. And there is a nucleus. And then out here is a long thingamajigger which divides maybe a little bit, but not much. And we call that the axon. So then over here, we've got this much more branching type of structure that looks maybe a little bit like so. Maybe like that. And this stuff branches a whole lot. And that part is called the dendritic tree.

Now, there are a couple of things we can note about this. One is that these guys are connected axon to dendrite. So over here, there'll be a so-called presynaptic thickening. And over here will be some other neuron's dendrite. And likewise, over here some other neuron's axon is coming in and hitting the dendrite of the one that occupies most of our picture. So if there is enough stimulation from this side, in the dendritic tree, then a spike will go down that axon.
It acts like a transmission line. And then after that happens, the neuron will go quiet for a while as it's recovering its strength. That's called the refractory period.

Now, if we look at that connection in a little more detail, this little piece right here sort of looks like this. Here's the axon coming in. It's got a whole bunch of little vesicles in it. And then there's a dendrite over here. And when the axon is stimulated, it dumps all these vesicles into this little synaptic space. For a long time, it wasn't known whether those things were actually separated. I think it was Ramón y Cajal who demonstrated that one neuron is actually not part of the next one. They're actually separated by these synaptic gaps.

So there it is. How can we model that sort of thing? Well, here's what's usually done. Here's what is done in the neural net literature. First of all, we've got some kind of binary input, because these things either fire or they don't fire. So it's an all-or-none kind of situation. So over here, we have some kind of input value. We'll call it x1.
And it's either a 0 or a 1. So it comes in here. And then it gets multiplied by some kind of weight. We'll call it w1. So this part here is modeling the synaptic connection. It may be more or less strong. And if it's more strong, this weight goes up. And if it's less strong, this weight goes down. So that reflects the influence of the synapse on whether or not the whole axon decides it's stimulated.

Then we've got other inputs down here: x sub n, also 0 or 1. It's also multiplied by a weight. We'll call that w sub n. And now we have to somehow represent the way in which these inputs are collected together, how they have collective force. And we're going to model that very, very simply, just by saying, OK, we'll run it through a summer, like so. But then we have to decide if the collective influence of all those inputs is sufficient to make the neuron fire. So we're going to do that by running this guy through a threshold box, like so. Here is what the box looks like in terms of the relationship between the input and the output.
And what you can see here is that nothing happens until the input exceeds some threshold t. If that happens, then the output z is a 1. Otherwise, it's a 0.

So binary in, binary out. We model the synaptic weights by these multipliers. We model the cumulative effect of all that input to the neuron by a summer. We decide if it's going to be an all-or-none 1 by running it through this threshold box and seeing if the sum of the products adds up to more than the threshold. If so, we get a 1.

So what, in the end, are we in fact modeling? Well, with this model, we have, number one, all-or-none; number two, cumulative influence; number three, oh, I suppose, synaptic weight.
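As a minimal sketch (in Python, with hypothetical names, not anything from the lecture itself), the model just described, binary inputs multiplied by synaptic weights, a summer, and a threshold box, might look like this:

```python
def threshold_unit(xs, ws, t):
    """All-or-none model neuron: output 1 if the weighted sum
    of the binary inputs exceeds the threshold t, else 0."""
    total = sum(w * x for w, x in zip(ws, xs))  # the summer
    return 1 if total > t else 0                # the threshold box

# With weights of 1, the threshold picks the logic:
# t = 1.5 makes two inputs behave like AND; t = 0.5 like OR.
and_out = threshold_unit([1, 1], [1.0, 1.0], 1.5)  # fires
or_out = threshold_unit([1, 0], [1.0, 1.0], 0.5)   # fires
```

The choice of threshold alone turns the same pair of weights into different logic gates, which is one way to see how much the weights and thresholds together determine what the unit computes.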
But that's not all that there might be to model in a real neuron. We might want to deal with the refractory period. In these biological models that we build neural nets out of, we might want to model axonal bifurcation. We do get some division in the axon of the neuron. And it turns out that that pulse will either go down one branch or the other. And which branch it goes down depends on electrical activity in the vicinity of the division. So these things might actually be fantastic coincidence detectors. But we're not modeling that. We don't know how it works. So axonal bifurcation might be modeled.

We might also have a look at time patterns. See, what we don't know is we don't know if the timing of the arrival of these pulses in the dendritic tree has anything to do with what that neuron is going to recognize. So, a lot of unknowns here. And now I'm going to show you how to answer a question about neurobiology with 80% probability you'll get it right. Just say, we don't know. And that will be, with 80% probability, what the neurobiologist would say.

So this is a model inspired by what goes on in our heads. But it's far from clear if what we're modeling is the essence of why those guys make possible what we can do. Nevertheless, that's where we're going to start. That's where we're going to go. So we've got this model of what a neuron does. So what about what does a collection of these neurons do? Well, we can think of your skull as a big box full of neurons.
Maybe a better way to think of this is that your head is full of neurons. And they in turn are full of weights and thresholds, like so. So into this box come a variety of inputs, x1 through xm. And these find their way to the inside of this gaggle of neurons. And out here come a bunch of outputs, z1 through zn. And there are a whole bunch of these, maybe, like so. And there are a lot of inputs, like so. And somehow these inputs, through the influence of the weights and the thresholds, come out as a set of outputs.

So we can write that down a little fancier by just saying that z is a vector, which is a function of, certainly, the input vector, but also the weight vector and the threshold vector. So that's all a neural net is. And when we train a neural net, all we're going to be able to do is adjust those weights and thresholds so that what we get out is what we want. So a neural net is a function approximator. It's good to think about that. It's a function approximator.

So maybe we've got some sample data that gives us an output vector that's desired as another function of the input, forgetting about what the weights and the thresholds are.
That's what we want to get out. And so how well we're doing can be figured out by comparing the desired value with the actual value. So we might think, then, that we can get a handle on how well we're doing by constructing some performance function, which is determined by the desired vector and the input vector-- sorry, the desired vector and the actual output vector for some particular input or for some set of inputs. And the question is, what should that function be? How should we measure performance, given that we have what we want out here and what we actually got out here?

Well, one simple thing to do is just to measure the magnitude of the difference. That makes sense. But of course, that would give us a performance function that, as a function of the distance between those vectors, would look like this. But this turns out to be mathematically inconvenient in the end. So how do you think we're going to tune it up a little bit?

AUDIENCE: Normalize it?

PATRICK WINSTON: What's that?

AUDIENCE: Normalize it?

PATRICK WINSTON: Well, I don't know.
How about we just square it? And that way we're going to go from this little sharp point down there to something that looks more like that. So it's best when the difference is 0, of course. And it gets worse as you move away from 0. But what we're trying to do here is we're trying to get to a minimum value. And I hope you'll forgive me. I just don't like the direction we're going here, because I like to think in terms of improvement as going uphill instead of downhill. So I'm going to dress this up one more step and put a minus sign out there. And then our performance function looks like this. It's always negative. And the best value it can possibly be is 0. So that's what we're going to use, just because I am who I am. And it doesn't matter, right? Still, you're trying to either minimize or maximize some performance function.

OK, so what have we got to do? I guess what we could do is we could treat this thing-- well, we already know what to do.
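The performance function arrived at here, the negated squared magnitude of the difference between the desired vector and the actual output vector, can be sketched in a couple of lines (Python; the function name is just for illustration). It is always negative, and the best value it can possibly be is 0, when desired and actual agree exactly:

```python
def performance(desired, actual):
    # P = -|d - z|^2: 0 when the vectors agree exactly,
    # increasingly negative as they diverge.
    return -sum((d - z) ** 2 for d, z in zip(desired, actual))

perfect = performance([1, 0], [1, 0])   # 0, the best possible
off_by_one = performance([1, 0], [0, 0])  # -1
```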
I'm not even sure why we're devoting our lecture to this, because it's clear that what we're trying to do is take our weights and our thresholds and adjust them so as to maximize performance. So we can make a little contour map here for a simple neural net with just two weights in it. And maybe it looks like this. Contour map. And at any given time, we've got a particular w1 and a particular w2. And we're trying to find a better w1 and w2. So here we are right now. And there's the contour map. And this is 6.034. So what do we do?

AUDIENCE: Climb.

PATRICK WINSTON: Simple matter of hill climbing, right? So we'll take a step in every direction. If we take a step in that direction, not so hot. That one actually goes pretty bad. These two are really ugly. Ah, but that one-- that one takes us up the hill a little bit. So we're done, except that I just mentioned that Hinton's neural net had 60 million parameters in it.
So we're not going to hill climb with 60 million parameters, because it explodes exponentially in the number of weights you've got to deal with-- the number of steps you can take. So this approach is computationally intractable.

Fortunately, you've all taken 18.01 or the equivalent thereof. So you have a better idea. Instead of just taking a step in every direction, what we're going to do is take some partial derivatives. And we're going to see what they suggest to us in terms of how we're going to get around in space. So we might have a partial of that performance function up there with respect to w1. And we might also take a partial derivative of that guy with respect to w2. And these will tell us how much improvement we're getting by making a little movement in those directions, right? How much of a change we get, given that we're just going right along the axis.

So maybe what we ought to do is, if this guy is much bigger than this guy, it would suggest we mostly want to move in this direction. Or, to put it in 18.01 terms, what we're going to do is follow the gradient.
And so the change in the w vector is going to be equal to this partial derivative times i plus this partial derivative times j. So what we're going to end up doing in this particular case, by following that formula, is moving off in that direction, right up the steepest part of the hill. And how much we move is a question. So let's just have a rate constant R that decides how big our step is going to be.

And now you'd think we were done. Well, too bad for our side. We're not done. There's a reason why we can't use gradient ascent, or, in the case that I've drawn, gradient descent, if we take the performance function the other way. Why can't we use it?

AUDIENCE: Local maxima.

PATRICK WINSTON: The remark is local maxima. And that is certainly true. But it's not our first obstacle. Why doesn't gradient ascent work?

AUDIENCE: So you're using a step function.

PATRICK WINSTON: Ah, there's something wrong with our function. That's right. It's not just that it's non-linear; rather, it's discontinuous.
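On a performance surface that is smooth, the update just described, step the weight vector along the gradient scaled by the rate constant R, can be sketched numerically (Python; the names and the toy one-peak surface are illustrative assumptions, not from the lecture):

```python
def ascent_step(perf, ws, r=0.1, h=1e-6):
    # Estimate each partial dP/dw_i with a finite difference,
    # then move the whole weight vector along the gradient,
    # scaled by the rate constant r.
    grads = []
    for i in range(len(ws)):
        bumped = list(ws)
        bumped[i] += h
        grads.append((perf(bumped) - perf(ws)) / h)
    return [w + r * g for w, g in zip(ws, grads)]

# Toy smooth performance surface with its single peak at w = (1, 2).
peak = lambda w: -((w[0] - 1) ** 2 + (w[1] - 2) ** 2)

w = [0.0, 0.0]
for _ in range(200):
    w = ascent_step(peak, w)
# w has climbed very close to the peak at [1, 2]
```

The finite-difference estimate stands in for the analytic partial derivatives; the point is only the shape of the rule, each weight moves in proportion to its partial, so the vector as a whole follows the gradient uphill.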
So gradient ascent requires a continuous space, a continuous surface. So, too bad for our side. It isn't. So what to do? Well, nobody knew what to do for 25 years. People were screwing around with training neural nets for 25 years before Paul Werbos, sadly at Harvard, in 1974 gave us the answer. And now I want to tell you what the answer is.

The first part of the answer is that those thresholds are annoying. They're just extra baggage to deal with. What we'd really like, instead of z being a function of x, w, and t, is for z prime to be a function f prime of x and the weights. But we've got to account for the threshold somehow. So here's how you do that. What you do is you say, let us add another input to this neuron. And it's going to have a weight w0. And it's going to be connected to an input that's always minus 1. You with me so far? Now what we're going to do is we're going to say, let w0 equal t. What does that do to the movement of the threshold? What it does is it takes that threshold and moves it back to 0.
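A small sketch of this trick (Python, illustrative names): a unit with an explicit threshold t is equivalent to a unit with an extra weight w0 = t attached to a constant input of minus 1 and a threshold of 0, since sum(w*x) - t > 0 is the same condition as sum(w*x) > t.

```python
def fires_with_threshold(xs, ws, t):
    # Original unit: fire when the weighted sum exceeds t.
    return 1 if sum(w * x for w, x in zip(ws, xs)) > t else 0

def fires_with_bias(xs, ws, t):
    # Fold the threshold into the weights: prepend w0 = t,
    # feed it a constant input of -1, and threshold at 0.
    xs2 = [-1] + list(xs)
    ws2 = [t] + list(ws)
    return 1 if sum(w * x for w, x in zip(ws2, xs2)) > 0 else 0
```

The two agree on every input, so once the trick is applied we can forget about thresholds and think only about weights.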
430 00:24:16,060 --> 00:24:19,750 So this little trick here takes this pink threshold 431 00:24:19,750 --> 00:24:24,550 and redoes it so that the new threshold box looks like this. 432 00:24:30,370 --> 00:24:31,480 Think about it. 433 00:24:31,480 --> 00:24:35,930 If this is t, and this is minus 1, then this is minus t. 434 00:24:35,930 --> 00:24:38,490 And so this thing ought to fire if everything's over-- 435 00:24:38,490 --> 00:24:39,750 if the sum is over 0. 436 00:24:39,750 --> 00:24:41,080 So it makes sense. 437 00:24:41,080 --> 00:24:43,420 And it gets rid of the threshold thing for us. 438 00:24:43,420 --> 00:24:46,920 So now we can just think about weights. 439 00:24:46,920 --> 00:24:53,740 But still, we've got that step function there. 440 00:24:53,740 --> 00:24:55,714 And that's not good. 441 00:24:55,714 --> 00:24:57,130 So what we're going to do is we're 442 00:24:57,130 --> 00:25:00,700 going to smooth that guy out. 443 00:25:00,700 --> 00:25:03,846 So this is trick number two. 444 00:25:03,846 --> 00:25:05,220 Instead of a step function, we're 445 00:25:05,220 --> 00:25:07,610 going to have this thing we lovingly 446 00:25:07,610 --> 00:25:09,840 call a sigmoid function, because it's 447 00:25:09,840 --> 00:25:12,110 kind of an s-type shape. 448 00:25:12,110 --> 00:25:18,280 And the function we're going to use is this one-- one, 449 00:25:18,280 --> 00:25:23,710 well, better make it a little bit different-- 1 over 1 plus 450 00:25:23,710 --> 00:25:27,230 e to the minus whatever the input is. 451 00:25:27,230 --> 00:25:30,070 Let's call the input alpha. 452 00:25:30,070 --> 00:25:32,610 Does that make sense? 453 00:25:32,610 --> 00:25:37,560 If alpha is 0, then it's 1 over 1 plus 1, or one half. 454 00:25:37,560 --> 00:25:40,960 If alpha is extremely big, then e to the minus alpha 455 00:25:40,960 --> 00:25:42,060 is extremely small. 456 00:25:42,060 --> 00:25:44,100 And it becomes one.
457 00:25:44,100 --> 00:25:47,460 It goes up to an asymptotic value of one here. 458 00:25:47,460 --> 00:25:50,510 On the other hand, if alpha is extremely negative, 459 00:25:50,510 --> 00:25:53,840 then minus alpha is extremely positive. 460 00:25:53,840 --> 00:25:56,470 And it goes to 0 asymptotically. 461 00:25:56,470 --> 00:25:59,830 So we've got the right look to that function. 462 00:25:59,830 --> 00:26:01,680 It's a very convenient function. 463 00:26:01,680 --> 00:26:05,990 Did God say that neurons ought to be-- that threshold 464 00:26:05,990 --> 00:26:08,080 ought to work like that? 465 00:26:08,080 --> 00:26:09,140 No, God didn't say so. 466 00:26:09,140 --> 00:26:11,760 Who said so? 467 00:26:11,760 --> 00:26:13,540 The math says so. 468 00:26:13,540 --> 00:26:16,960 It has the right shape and the right look. 469 00:26:16,960 --> 00:26:19,522 And it turns out to have the right math, 470 00:26:19,522 --> 00:26:20,605 as you'll see in a moment. 471 00:26:23,530 --> 00:26:24,357 So let's see. 472 00:26:24,357 --> 00:26:25,450 Where are we? 473 00:26:25,450 --> 00:26:26,950 We decided that what we'd like to do 474 00:26:26,950 --> 00:26:29,022 is take these partial derivatives. 475 00:26:29,022 --> 00:26:31,230 We know that it was awkward to have those thresholds. 476 00:26:31,230 --> 00:26:32,294 So we got rid of them. 477 00:26:32,294 --> 00:26:34,460 And we noted that it was impossible to have the step 478 00:26:34,460 --> 00:26:34,960 function. 479 00:26:34,960 --> 00:26:36,450 So we got rid of it. 480 00:26:36,450 --> 00:26:38,520 Now, we're in a situation where we can actually 481 00:26:38,520 --> 00:26:41,170 take those partial derivatives, and see if it gives us 482 00:26:41,170 --> 00:26:43,180 a way of training the neural net so as 483 00:26:43,180 --> 00:26:46,155 to bring the actual output into alignment with what we desire.
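Both tricks can be checked numerically: trick one trades the threshold t for an extra weight w0 = t on an input wired to minus 1, so the comparison is against 0; and the sigmoid behaves exactly as described-- one half at 0, approaching 1 and 0 at the extremes. (A sketch; the particular weights and threshold are made up.)

```python
import math

def step_unit(x, w, t):
    # Original neuron: fires when the weighted sum exceeds threshold t.
    return sum(wi * xi for wi, xi in zip(w, x)) > t

def step_unit_no_threshold(x, w, t):
    # Trick one: append an input fixed at -1 with weight w0 = t,
    # and compare the sum against 0 instead of t.
    return sum(wi * xi for wi, xi in zip(w + [t], x + [-1.0])) > 0.0

def sigmoid(alpha):
    # Trick two: the smooth threshold, 1 / (1 + e^(-alpha)).
    return 1.0 / (1.0 + math.exp(-alpha))
```

The two step-unit formulations agree on every input, which is all the first trick claims.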
484 00:26:48,525 --> 00:26:49,900 So to deal with that, we're going 485 00:26:49,900 --> 00:26:54,120 to have to work with the world's simplest neural net. 486 00:26:54,120 --> 00:26:57,896 Now, if we've got one neuron, it's not a net. 487 00:26:57,896 --> 00:27:00,020 But if we've got two neurons, we've got a net. 488 00:27:00,020 --> 00:27:02,560 And it turns out that's the world's simplest net. 489 00:27:02,560 --> 00:27:05,880 So we're going to look at it-- not 60 million parameters, 490 00:27:05,880 --> 00:27:11,390 but just a few, actually, just two parameters. 491 00:27:11,390 --> 00:27:13,350 So let's draw it out. 492 00:27:13,350 --> 00:27:16,090 We've got input x. 493 00:27:16,090 --> 00:27:18,560 That goes into a multiplier. 494 00:27:18,560 --> 00:27:22,790 And it gets multiplied times w1. 495 00:27:22,790 --> 00:27:27,700 And that goes into a sigmoid box like so. 496 00:27:27,700 --> 00:27:30,750 We'll call this p1, by the way, product number one. 497 00:27:30,750 --> 00:27:33,270 Out here comes y. 498 00:27:33,270 --> 00:27:37,030 y gets multiplied times another weight. 499 00:27:37,030 --> 00:27:40,630 We'll call that w2. 500 00:27:40,630 --> 00:27:44,940 That produces another product, which we'll call p2. 501 00:27:44,940 --> 00:27:49,200 And that goes into a sigmoid box. 502 00:27:49,200 --> 00:27:51,920 And then that comes out as z. 503 00:27:51,920 --> 00:27:54,230 And z is the number that we use to determine 504 00:27:54,230 --> 00:27:55,820 how well we're doing. 505 00:27:55,820 --> 00:28:00,270 And our performance function P is 506 00:28:00,270 --> 00:28:02,944 going to be minus one half, 507 00:28:02,944 --> 00:28:04,360 because I like things going in 508 00:28:04,360 --> 00:28:08,330 a particular direction, times the difference between the desired 509 00:28:08,330 --> 00:28:11,006 output and the actual output, squared.
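Written out as code, the world's simplest net is just two multiplies and two sigmoids, scored by P = minus one half times (d minus z) squared. (A sketch; the input, weights, and desired value below are arbitrary.)

```python
import math

def sigmoid(alpha):
    return 1.0 / (1.0 + math.exp(-alpha))

def forward(x, w1, w2):
    # x times w1 gives product p1; through a sigmoid box comes y.
    # y times w2 gives product p2; through a sigmoid box comes z.
    p1 = x * w1
    y = sigmoid(p1)
    p2 = y * w2
    z = sigmoid(p2)
    return y, z

def performance(d, z):
    # P = -1/2 (d - z)^2: zero when the actual output z matches
    # the desired output d, more negative the further apart they are.
    return -0.5 * (d - z) ** 2

y, z = forward(1.0, 0.3, 0.7)
```

Training is then gradient ascent on P with respect to w1 and w2, which is where the partial derivatives come in.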
510 00:28:14,480 --> 00:28:18,364 So now let's decide what those partial derivatives 511 00:28:18,364 --> 00:28:19,030 are going to be. 512 00:28:25,220 --> 00:28:26,180 Let me do it over here. 513 00:28:32,976 --> 00:28:34,350 So what are we trying to compute? 514 00:28:34,350 --> 00:28:39,100 Partial of the performance function p with respect to w2. 515 00:28:42,553 --> 00:28:43,052 OK. 516 00:28:47,970 --> 00:28:50,324 Well, let's see. 517 00:28:50,324 --> 00:28:51,990 We're trying to figure out how much this 518 00:28:51,990 --> 00:28:54,536 wiggles when we wiggle that. 519 00:28:57,390 --> 00:29:01,176 But you know it goes through this variable p2. 520 00:29:01,176 --> 00:29:02,800 And so maybe what we could do is figure 521 00:29:02,800 --> 00:29:05,750 out how much this wiggles-- how much z wiggles 522 00:29:05,750 --> 00:29:08,830 when we wiggle p2 and then how much p2 523 00:29:08,830 --> 00:29:13,290 wiggles when we wiggle w2. 524 00:29:13,290 --> 00:29:15,580 And just multiply those together. 525 00:29:15,580 --> 00:29:16,080 I forget. 526 00:29:16,080 --> 00:29:18,840 What's that called? 527 00:29:18,840 --> 00:29:20,310 The something-or-other rule. 528 00:29:20,310 --> 00:29:21,310 AUDIENCE: The chain rule. 529 00:29:21,310 --> 00:29:22,754 PATRICK WINSTON: The chain rule. 530 00:29:22,754 --> 00:29:24,170 So what we're going to do is we're 531 00:29:24,170 --> 00:29:27,230 going to rewrite that partial derivative using the chain rule. 532 00:29:27,230 --> 00:29:29,200 And all it's doing is saying that there's 533 00:29:29,200 --> 00:29:31,250 an intermediate variable. 534 00:29:31,250 --> 00:29:35,380 And we can compute how much one end wiggles with respect 535 00:29:35,380 --> 00:29:39,755 to how much the other end wiggles by multiplying 536 00:29:39,755 --> 00:29:41,545 how much the guys in between wiggle. 537 00:29:41,545 --> 00:29:42,420 Let me write it down. 538 00:29:42,420 --> 00:29:45,420 It makes more sense in mathematics.
539 00:29:45,420 --> 00:29:48,260 So that's going to be equal to the partial of p 540 00:29:48,260 --> 00:29:58,650 with respect to z times the partial of z with respect 541 00:29:58,650 --> 00:29:59,950 to-- 542 00:30:04,140 --> 00:30:06,200 keep me on track here-- 543 00:30:06,200 --> 00:30:09,490 the partial of z with respect to w2. 544 00:30:12,310 --> 00:30:15,920 Now, I'm going to do something for which I will hate myself. 545 00:30:15,920 --> 00:30:17,780 I'm going to erase something on the board. 546 00:30:17,780 --> 00:30:18,780 I don't like to do that. 547 00:30:18,780 --> 00:30:21,900 But you know what I'm going to do, don't you? 548 00:30:21,900 --> 00:30:27,910 I'm going to say this is true by the chain rule. 549 00:30:27,910 --> 00:30:30,550 But look, I can take this guy here 550 00:30:30,550 --> 00:30:34,060 and screw around with it with the chain rule too. 551 00:30:34,060 --> 00:30:35,880 And in fact, what I'm going to do 552 00:30:35,880 --> 00:30:39,996 is I'm going to replace that with partial of z 553 00:30:39,996 --> 00:30:48,139 with respect to p2 and partial of p2 with respect to w2. 554 00:30:48,139 --> 00:30:49,430 So I didn't erase it after all. 555 00:30:49,430 --> 00:30:52,110 But you can see what I'm going to do next. 556 00:30:52,110 --> 00:30:53,610 Now, I'm going to do the same thing with 557 00:30:53,610 --> 00:30:55,780 the other partial derivative. 558 00:30:55,780 --> 00:30:58,890 But this time, instead of writing down and writing over, 559 00:30:58,890 --> 00:31:02,580 I'm just going to expand it all out in one go, I think. 560 00:31:05,200 --> 00:31:10,620 So partial of p with respect to w1 561 00:31:10,620 --> 00:31:15,140 is equal to the partial of p with respect to z, 562 00:31:15,140 --> 00:31:21,810 the partial of z with respect to p2, the partial of p2 563 00:31:21,810 --> 00:31:23,700 with respect to what? 564 00:31:23,700 --> 00:31:26,260 y?
565 00:31:26,260 --> 00:31:35,170 Partial of y with respect to p1-- partial of p1 566 00:31:35,170 --> 00:31:38,950 with respect to w1. 567 00:31:38,950 --> 00:31:43,680 So that's going like a zipper down that string of variables, 568 00:31:43,680 --> 00:31:45,940 expanding each by using the chain 569 00:31:45,940 --> 00:31:48,490 rule until we get to the end. 570 00:31:48,490 --> 00:31:50,330 So there are some expressions that provide 571 00:31:50,330 --> 00:31:51,910 those partial derivatives. 572 00:31:56,660 --> 00:32:03,030 But now, if you'll forgive me, it 573 00:32:03,030 --> 00:32:05,370 was convenient to write them out that way. 574 00:32:05,370 --> 00:32:07,050 That matched the intuition in my head. 575 00:32:07,050 --> 00:32:08,674 But I'm just going to turn them around. 576 00:32:11,080 --> 00:32:12,960 It's just a product. 577 00:32:12,960 --> 00:32:14,790 I'm just going to turn them around. 578 00:32:14,790 --> 00:32:22,610 So partial of p2, partial of w2, times partial of z, 579 00:32:22,610 --> 00:32:28,360 partial of p2, times the partial of p with respect 580 00:32:28,360 --> 00:32:30,040 to z-- same thing. 581 00:32:30,040 --> 00:32:31,860 And now, this one. 582 00:32:31,860 --> 00:32:34,190 Keep me on track, because if there's a mutation here, 583 00:32:34,190 --> 00:32:35,740 it will be fatal. 584 00:32:35,740 --> 00:32:41,860 Partial of p1-- partial of w1, partial of y, 585 00:32:41,860 --> 00:32:52,180 partial of p1, partial of p2, partial of y, partial of z, 586 00:32:52,180 --> 00:32:56,920 partial of p2, and the partial of the performance 587 00:32:56,920 --> 00:32:58,142 function with respect to z. 588 00:33:01,380 --> 00:33:04,740 Now, all we have to do is figure out what those partials are. 589 00:33:04,740 --> 00:33:08,709 And we have solved this simple neural net. 590 00:33:08,709 --> 00:33:09,750 So it's going to be easy. 591 00:33:14,530 --> 00:33:15,880 Where is my board space?
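Those chain-rule products can be sanity-checked: multiplying the five partials for w1 should match a finite-difference estimate of how much P wiggles when w1 wiggles. (A sketch with arbitrary test values; it leans on the fact, derived a bit further on in the lecture, that a sigmoid's derivative is its output times 1 minus its output.)

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_w1(x, w1, w2, d):
    # The zipper, turned around:
    # dP/dw1 = dp1/dw1 * dy/dp1 * dp2/dy * dz/dp2 * dP/dz
    y = sigmoid(x * w1)
    z = sigmoid(y * w2)
    return x * (y * (1 - y)) * w2 * (z * (1 - z)) * (d - z)

def grad_w1_numeric(x, w1, w2, d, h=1e-6):
    # Finite-difference estimate of the same partial derivative,
    # wiggling w1 directly and watching P.
    def P(w):
        z = sigmoid(sigmoid(x * w) * w2)
        return -0.5 * (d - z) ** 2
    return (P(w1 + h) - P(w1 - h)) / (2.0 * h)
```

If the two numbers agree, the zipper was expanded correctly.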
592 00:33:15,880 --> 00:33:22,360 Let's see, partial of p2 with respect to-- what? 593 00:33:22,360 --> 00:33:23,220 That's the product. 594 00:33:23,220 --> 00:33:25,740 The partial of the performance function 595 00:33:25,740 --> 00:33:27,130 with respect to z. 596 00:33:27,130 --> 00:33:30,201 Oh, now I can see why I wrote it down this way. 597 00:33:30,201 --> 00:33:30,700 Let's see. 598 00:33:30,700 --> 00:33:33,699 It's going to be d minus z. 599 00:33:33,699 --> 00:33:34,990 We can do that one in our head. 600 00:33:41,110 --> 00:33:43,634 What about the partial of p2 with respect to w2? 601 00:33:46,520 --> 00:33:50,250 Well, p2 is equal to y times w2, so that's easy. 602 00:33:50,250 --> 00:33:51,050 That's just y. 603 00:33:57,830 --> 00:34:00,110 Now, all we have to do is figure out the partial 604 00:34:00,110 --> 00:34:02,110 of z with respect to p2. 605 00:34:02,110 --> 00:34:07,020 Oh, crap, it's going through this threshold box. 606 00:34:07,020 --> 00:34:11,070 So I don't know exactly what that partial derivative is. 607 00:34:11,070 --> 00:34:13,780 So we'll have to figure that out, right? 608 00:34:13,780 --> 00:34:18,414 Because the function relating them is this guy here. 609 00:34:18,414 --> 00:34:20,955 And so we have to figure out the partial of that with respect 610 00:34:20,955 --> 00:34:24,030 to alpha. 611 00:34:24,030 --> 00:34:26,120 All right, so we got to do it. 612 00:34:26,120 --> 00:34:28,330 There's no way around it. 613 00:34:28,330 --> 00:34:32,620 So we have to destroy something. 614 00:34:32,620 --> 00:34:36,440 OK, we're going to destroy our neuron. 615 00:34:49,989 --> 00:34:52,060 So the function we're dealing with 616 00:34:52,060 --> 00:34:55,620 is, we'll call it beta, equal to 1 over 1 617 00:34:55,620 --> 00:35:00,100 plus e to the minus alpha. 618 00:35:00,100 --> 00:35:02,711 And what we want is the derivative 619 00:35:02,711 --> 00:35:07,050 with respect to alpha of beta.
620 00:35:07,050 --> 00:35:13,080 And that's equal to d by d alpha of-- you know, 621 00:35:13,080 --> 00:35:16,530 I can never remember those quotient formulas. 622 00:35:16,530 --> 00:35:19,260 So I am going to rewrite it a little different way. 623 00:35:19,260 --> 00:35:23,518 I am going to write it as 1 minus e to the minus alpha 624 00:35:23,518 --> 00:35:28,340 to the minus 1, because I can't remember the formula 625 00:35:28,340 --> 00:35:31,490 for differentiating a quotient. 626 00:35:31,490 --> 00:35:33,030 OK, so let's differentiate it. 627 00:35:33,030 --> 00:35:45,572 So that's equal to 1 minus e to the minus alpha to the minus 2. 628 00:35:48,380 --> 00:35:51,140 And that minus sign comes out of that part of it. 629 00:35:51,140 --> 00:35:56,660 Then we got to differentiate the inside of that expression. 630 00:35:56,660 --> 00:35:59,410 And when we differentiate the inside of that expression, 631 00:35:59,410 --> 00:36:01,156 we get e to the minus alpha. 632 00:36:01,156 --> 00:36:02,142 AUDIENCE: Dr. Winston-- 633 00:36:02,142 --> 00:36:03,130 PATRICK WINSTON: Yeah? 634 00:36:03,130 --> 00:36:05,267 AUDIENCE: That should be 1 plus. 635 00:36:05,267 --> 00:36:06,850 PATRICK WINSTON: Oh, sorry, thank you. 636 00:36:06,850 --> 00:36:09,183 That was one of those fatal mistakes you just prevented. 637 00:36:09,183 --> 00:36:10,680 So that's 1 plus. 638 00:36:10,680 --> 00:36:12,400 That's 1 plus here too. 639 00:36:12,400 --> 00:36:15,590 OK, so we've differentiated that. 640 00:36:15,590 --> 00:36:17,170 We've turned that into a minus 2. 641 00:36:17,170 --> 00:36:18,890 We brought the minus sign outside. 642 00:36:18,890 --> 00:36:21,320 Then we're differentiating the inside. 643 00:36:21,320 --> 00:36:23,640 The derivative of an exponential is an exponential. 644 00:36:23,640 --> 00:36:25,979 Then we got to differentiate that guy.
645 00:36:25,979 --> 00:36:27,770 And that just helps us get rid of the minus 646 00:36:27,770 --> 00:36:29,690 sign we introduced. 647 00:36:29,690 --> 00:36:32,380 So that's the derivative. 648 00:36:32,380 --> 00:36:36,640 I'm not sure how much that helps except that I'm 649 00:36:36,640 --> 00:36:40,040 going to perform a parlor trick here and rewrite 650 00:36:40,040 --> 00:36:43,510 that expression thusly. 651 00:36:43,510 --> 00:36:47,170 We want to say that's going to be 652 00:36:47,170 --> 00:36:53,988 e to the minus alpha over 1 plus e to the minus 653 00:36:53,988 --> 00:37:01,415 alpha times 1 over 1 plus e to the minus alpha. 654 00:37:01,415 --> 00:37:03,919 That OK? 655 00:37:03,919 --> 00:37:05,460 I've got a lot of nodding heads here. 656 00:37:05,460 --> 00:37:08,465 So I think I'm on safe ground. 657 00:37:08,465 --> 00:37:10,590 But now, I'm going to perform another parlor trick. 658 00:37:13,700 --> 00:37:19,770 I am going to add 1, which means I also have to subtract 1. 659 00:37:24,270 --> 00:37:24,840 All right? 660 00:37:24,840 --> 00:37:27,520 That's legitimate isn't it? 661 00:37:27,520 --> 00:37:32,540 So now, I can rewrite this as 1 plus e 662 00:37:32,540 --> 00:37:38,820 to the minus alpha over 1 plus e to the minus alpha 663 00:37:38,820 --> 00:37:48,085 minus 1 over 1 plus e to the minus alpha times 1 over 1 plus 664 00:37:48,085 --> 00:37:51,660 e to the minus alpha. 665 00:37:51,660 --> 00:37:53,200 Any high school kid could do that. 666 00:37:53,200 --> 00:37:55,580 I think I'm on safe ground. 667 00:37:55,580 --> 00:38:02,150 Oh, wait, this is beta. 668 00:38:02,150 --> 00:38:04,464 This is beta. 669 00:38:04,464 --> 00:38:05,940 AUDIENCE: That's the wrong side. 670 00:38:05,940 --> 00:38:08,440 PATRICK WINSTON: Oh, sorry, wrong side. 671 00:38:08,440 --> 00:38:11,320 Better make this beta and this 1. 672 00:38:11,320 --> 00:38:13,964 Any high school kid could do it. 
673 00:38:13,964 --> 00:38:16,490 OK, so what we've got then is that this 674 00:38:16,490 --> 00:38:22,310 is equal to 1 minus beta times beta. 675 00:38:22,310 --> 00:38:23,590 That's the derivative. 676 00:38:23,590 --> 00:38:25,950 And that's weird because the derivative 677 00:38:25,950 --> 00:38:27,960 of the output with respect to the input 678 00:38:27,960 --> 00:38:31,520 is given exclusively in terms of the output. 679 00:38:31,520 --> 00:38:33,020 It's strange. 680 00:38:33,020 --> 00:38:34,350 It doesn't really matter. 681 00:38:34,350 --> 00:38:36,240 But it's a curiosity. 682 00:38:36,240 --> 00:38:39,560 And what we get out of this is that partial derivative there-- 683 00:38:39,560 --> 00:38:47,680 that's equal to-- well, the output is p2. 684 00:38:47,680 --> 00:38:48,680 No, the output is z. 685 00:38:48,680 --> 00:38:52,340 So it's z times 1 minus z. 686 00:38:52,340 --> 00:38:54,380 So whenever we see the derivative of one 687 00:38:54,380 --> 00:38:57,300 of these sigmoids with respect to its input, 688 00:38:57,300 --> 00:38:59,500 we can just write the output times 1 minus the output, 689 00:38:59,500 --> 00:39:00,230 and we've got it. 690 00:39:00,230 --> 00:39:02,290 So that's why it's mathematically convenient. 691 00:39:02,290 --> 00:39:04,081 It's mathematically convenient because when 692 00:39:04,081 --> 00:39:08,640 we do this differentiation, we get a very simple expression 693 00:39:08,640 --> 00:39:10,597 in terms of the output. 694 00:39:10,597 --> 00:39:11,930 We get a very simple expression. 695 00:39:11,930 --> 00:39:13,015 That's all we really need. 696 00:39:16,050 --> 00:39:20,360 So would you like to see a demonstration? 697 00:39:20,360 --> 00:39:22,800 It's a demonstration of the world's smallest neural 698 00:39:22,800 --> 00:39:23,494 net in action. 699 00:39:31,080 --> 00:39:32,430 Where is neural nets? 700 00:39:32,430 --> 00:39:32,930 Here we go. 701 00:39:37,707 --> 00:39:38,790 So there's our neural net.
702 00:39:38,790 --> 00:39:40,248 And what we're going to do is we're 703 00:39:40,248 --> 00:39:42,100 going to train it to do absolutely nothing. 704 00:39:42,100 --> 00:39:44,308 What we're going to do is train it to make the output 705 00:39:44,308 --> 00:39:47,460 the same as the input. 706 00:39:47,460 --> 00:39:49,660 Not what I'd call a fantastic leap of intelligence. 707 00:39:49,660 --> 00:39:50,785 But let's see what happens. 708 00:39:58,930 --> 00:39:59,430 Wow! 709 00:39:59,430 --> 00:40:00,263 Nothing's happening. 710 00:40:07,050 --> 00:40:09,120 Well, it finally got to the point 711 00:40:09,120 --> 00:40:12,530 where the maximum error, not the performance, 712 00:40:12,530 --> 00:40:14,590 but the maximum error went below a threshold 713 00:40:14,590 --> 00:40:16,600 that I had previously determined. 714 00:40:16,600 --> 00:40:18,810 So if you look at the input here and compare that 715 00:40:18,810 --> 00:40:21,200 with the desired output on the far right, 716 00:40:21,200 --> 00:40:24,060 you see it produces an output which, compared with the desired 717 00:40:24,060 --> 00:40:26,010 output, is pretty close. 718 00:40:26,010 --> 00:40:29,070 So we can test the other way like so. 719 00:40:29,070 --> 00:40:30,950 And we can see that the desired output 720 00:40:30,950 --> 00:40:34,300 is pretty close to the actual output in that case too. 721 00:40:34,300 --> 00:40:37,130 And it took 694 iterations to get that done. 722 00:40:37,130 --> 00:40:37,952 Let's try it again. 723 00:40:56,090 --> 00:40:59,190 It took 823-- of course, this is all a consequence of just starting 724 00:40:59,190 --> 00:41:01,265 off with random weights. 725 00:41:01,265 --> 00:41:03,890 By the way, if you started with all the weights being the same, 726 00:41:03,890 --> 00:41:04,780 what would happen? 727 00:41:04,780 --> 00:41:07,495 Nothing, because it would always stay the same.
728 00:41:07,495 --> 00:41:09,120 So you've got to put some randomization 729 00:41:09,120 --> 00:41:11,580 in in the beginning. 730 00:41:11,580 --> 00:41:12,670 So it took a long time. 731 00:41:12,670 --> 00:41:15,450 Maybe the problem is our rate constant is too small. 732 00:41:15,450 --> 00:41:18,106 So let's crank up the rate constant a little bit 733 00:41:18,106 --> 00:41:18,980 and see what happens. 734 00:41:22,430 --> 00:41:23,730 That was pretty fast. 735 00:41:23,730 --> 00:41:26,510 Let's see if it was a consequence of random chance. 736 00:41:29,510 --> 00:41:30,920 Run. 737 00:41:30,920 --> 00:41:38,110 No, it's pretty fast there-- 57 iterations-- third try-- 67. 738 00:41:38,110 --> 00:41:42,020 So it looks like my initial rate constant was too small. 739 00:41:42,020 --> 00:41:45,240 So if 0.5 was not as good as 5.0, 740 00:41:45,240 --> 00:41:47,698 why don't we crank it up to 50 and see what happens. 741 00:41:51,830 --> 00:41:54,702 Oh, in this case, 124-- let's try it again. 742 00:41:58,470 --> 00:42:02,546 Ah, in this case 117-- so it's actually gotten worse. 743 00:42:02,546 --> 00:42:03,920 And not only has it gotten worse. 744 00:42:03,920 --> 00:42:09,270 You'll see there's a little bit of instability showing up 745 00:42:09,270 --> 00:42:12,510 as it courses along its way toward a solution. 746 00:42:12,510 --> 00:42:15,200 So what it looks like is that if you've got a rate constant 747 00:42:15,200 --> 00:42:17,040 that's too small, it takes forever. 748 00:42:17,040 --> 00:42:19,020 If you've got a rate constant that's too big, 749 00:42:19,020 --> 00:42:25,310 it can kind of jump too far, as in my diagram, which is somewhere 750 00:42:25,310 --> 00:42:29,207 underneath the board; you can go all the way across the hill 751 00:42:29,207 --> 00:42:30,290 and get to the other side. 752 00:42:30,290 --> 00:42:31,900 So you have to be careful about the rate constant.
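The behavior in the demonstration is easy to reproduce on the world's simplest net: the same gradient-ascent loop with a small rate constant crawls, and a bigger one converges in far fewer iterations. (A sketch; the training pair, starting weights, and stopping criterion here are invented, not the ones in the demo.)

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train(rate, w1=0.2, w2=0.9, x=1.0, d=0.9, max_iters=200000):
    # Gradient ascent on P = -1/2 (d - z)^2 for the two-weight net.
    # Returns the number of iterations until the error is small.
    for i in range(max_iters):
        y = sigmoid(x * w1)
        z = sigmoid(y * w2)
        if abs(d - z) < 0.01:
            return i
        dP_dz = d - z                        # from P = -1/2 (d - z)^2
        delta2 = dP_dz * z * (1 - z)         # back through second sigmoid
        delta1 = delta2 * w2 * y * (1 - y)   # back through first sigmoid
        w2 += rate * delta2 * y              # dp2/dw2 = y
        w1 += rate * delta1 * x              # dp1/dw1 = x
    return max_iters

slow = train(0.5)
fast = train(5.0)
# The larger (but not absurd) rate constant takes far fewer steps;
# crank it up much further and instability shows up instead.
```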
753 00:42:31,900 --> 00:42:33,399 So what you really want to do is you 754 00:42:33,399 --> 00:42:36,010 want your rate constant to vary with what 755 00:42:36,010 --> 00:42:43,920 is happening as you progress toward an optimal performance. 756 00:42:43,920 --> 00:42:46,420 So if your performance is going down when you make the jump, 757 00:42:46,420 --> 00:42:48,572 you know you've got a rate constant that's too big. 758 00:42:48,572 --> 00:42:50,780 If your performance is going up when you make a jump, 759 00:42:50,780 --> 00:42:52,404 maybe you want to increase-- bump it up 760 00:42:52,404 --> 00:42:57,460 a little bit until it doesn't look so good. 761 00:42:57,460 --> 00:42:58,960 So is that all there is to it? 762 00:42:58,960 --> 00:43:03,010 Well, not quite, because this is the world's simplest 763 00:43:03,010 --> 00:43:04,002 neural net. 764 00:43:04,002 --> 00:43:05,710 And maybe we ought to look at the world's 765 00:43:05,710 --> 00:43:08,450 second simplest neural net. 766 00:43:08,450 --> 00:43:13,982 Now, let's call this-- well, let's call this x. 767 00:43:13,982 --> 00:43:18,410 What we're going to do is we're going to have a second input. 768 00:43:18,410 --> 00:43:19,850 And I don't know. 769 00:43:19,850 --> 00:43:21,096 Maybe this is screwy. 770 00:43:21,096 --> 00:43:22,720 I'm just going to use color coding here 771 00:43:22,720 --> 00:43:26,734 to differentiate between the two inputs and the stuff 772 00:43:26,734 --> 00:43:27,400 they go through. 773 00:43:34,010 --> 00:43:39,666 Maybe I'll call this z2 and this z1 and this x1 and x2. 774 00:43:42,300 --> 00:43:45,140 Now, if I do that-- if I've got two inputs and two outputs, 775 00:43:45,140 --> 00:43:47,760 then my performance function is going 776 00:43:47,760 --> 00:43:51,800 to have two numbers in it-- the two desired values and the two 777 00:43:51,800 --> 00:43:53,560 actual values. 778 00:43:53,560 --> 00:43:55,350 And I'm going to have two inputs. 
779 00:43:55,350 --> 00:43:57,680 But it's the same stuff. 780 00:43:57,680 --> 00:44:01,031 I just repeat what I did in white, only I make it orange. 781 00:44:07,440 --> 00:44:12,654 Oh, but what happens if-- what happens if I do this? 782 00:44:28,850 --> 00:44:31,750 Say put little cross connections in there. 783 00:44:31,750 --> 00:44:35,030 So these two streams are going to interact. 784 00:44:35,030 --> 00:44:37,340 And then there might be some-- this y can 785 00:44:37,340 --> 00:44:43,290 go into another multiplier here and go into a summer here. 786 00:44:43,290 --> 00:44:46,220 And likewise, this y can go up here 787 00:44:46,220 --> 00:44:50,920 and into a multiplier like so. 788 00:44:50,920 --> 00:45:01,330 And there are weights all over the place like so. 789 00:45:01,330 --> 00:45:05,070 This guy goes up in here. 790 00:45:05,070 --> 00:45:06,430 And now what happens? 791 00:45:06,430 --> 00:45:08,800 Now, we've got a disaster on our hands, 792 00:45:08,800 --> 00:45:11,900 because there are all kinds of paths through this network. 793 00:45:11,900 --> 00:45:16,260 And you can imagine that if this was not just two neurons deep, 794 00:45:16,260 --> 00:45:19,170 but three neurons deep, what I would find 795 00:45:19,170 --> 00:45:22,300 is expressions that look like that. 796 00:45:22,300 --> 00:45:25,890 But you could go this way, and then down through, and out 797 00:45:25,890 --> 00:45:27,470 here. 798 00:45:27,470 --> 00:45:33,150 Or you could go this way and then back up through here. 799 00:45:33,150 --> 00:45:37,470 So it looks like there is an exponentially growing number 800 00:45:37,470 --> 00:45:39,910 of paths through that network. 801 00:45:39,910 --> 00:45:41,820 And so we're back to an exponential blowup. 802 00:45:41,820 --> 00:45:42,570 And it won't work. 803 00:45:50,890 --> 00:45:53,396 Yeah, it won't work except that we 804 00:45:53,396 --> 00:45:55,270 need to let the math sing to us a little bit. 
805 00:45:55,270 --> 00:45:57,670 And we need to look at the picture. 806 00:45:57,670 --> 00:46:01,190 And the reason I turned this guy around was actually 807 00:46:01,190 --> 00:46:06,580 because, from the point of view of letting the math sing to us, 808 00:46:06,580 --> 00:46:11,500 this piece here is the same as this piece here. 809 00:46:11,500 --> 00:46:13,570 So part of what we needed to do to calculate 810 00:46:13,570 --> 00:46:16,064 the partial derivative with respect to w1 811 00:46:16,064 --> 00:46:17,730 has already been done when we calculated 812 00:46:17,730 --> 00:46:22,550 the partial derivative with respect to w2. 813 00:46:22,550 --> 00:46:26,970 And not only that, if we calculated 814 00:46:26,970 --> 00:46:29,200 the partials with respect to these green w's 815 00:46:29,200 --> 00:46:32,460 at both levels, what we would discover 816 00:46:32,460 --> 00:46:37,840 is that sort of repetition occurs over and over again. 817 00:46:37,840 --> 00:46:41,330 And now, I'm going to try to give you an intuitive 818 00:46:41,330 --> 00:46:44,070 idea of what's going on here rather than just write down 819 00:46:44,070 --> 00:46:46,720 the math and salute it. 820 00:46:46,720 --> 00:46:49,980 And here's a way to think about it from an intuitive 821 00:46:49,980 --> 00:46:50,748 point of view. 822 00:46:53,740 --> 00:46:56,960 Whatever happens to this performance function, 823 00:46:56,960 --> 00:47:04,440 the stuff back of these p's here-- the stuff over there-- 824 00:47:04,440 --> 00:47:07,150 can influence p, can influence the performance, 825 00:47:07,150 --> 00:47:09,830 only by going through this column 826 00:47:09,830 --> 00:47:12,460 of p's. 827 00:47:12,460 --> 00:47:13,960 And there's a fixed number of those. 828 00:47:13,960 --> 00:47:16,335 So it depends on the width, not the depth, of the network.
829 00:47:19,350 --> 00:47:26,030 So the influence of that stuff back there on p 830 00:47:26,030 --> 00:47:28,620 is going to end up going through these guys. 831 00:47:28,620 --> 00:47:34,840 And it's going to end up being so that we're 832 00:47:34,840 --> 00:47:38,050 going to discover that a lot of what we need to compute in one 833 00:47:38,050 --> 00:47:43,150 column has already been computed in the column on the right. 834 00:47:43,150 --> 00:47:47,430 So it isn't going to explode exponentially, 835 00:47:47,430 --> 00:47:50,643 because the influence-- let me say it one more time. 836 00:47:54,120 --> 00:47:58,440 The influence of changes in these p's on the performance 837 00:47:58,440 --> 00:48:01,370 is all we care about when we come back to this part 838 00:48:01,370 --> 00:48:05,450 of the network, because this stuff cannot influence 839 00:48:05,450 --> 00:48:09,859 the performance except by going through this column of p's. 840 00:48:09,859 --> 00:48:11,650 So it's not going to blow up exponentially. 841 00:48:11,650 --> 00:48:14,560 We're going to be able to reuse a lot of the computation. 842 00:48:14,560 --> 00:48:17,040 So it's the reuse principle. 843 00:48:17,040 --> 00:48:21,350 Have we ever seen the reuse principle at work before? 844 00:48:21,350 --> 00:48:22,109 Not exactly. 845 00:48:22,109 --> 00:48:23,650 But you remember that little business 846 00:48:23,650 --> 00:48:25,770 about the extended list? 847 00:48:25,770 --> 00:48:31,184 We know that we've seen-- we know 848 00:48:31,184 --> 00:48:32,350 we've seen something before. 849 00:48:32,350 --> 00:48:34,450 So we can stop computing. 850 00:48:34,450 --> 00:48:35,810 It's like that. 851 00:48:35,810 --> 00:48:37,870 We're going to be able to reuse the computation 852 00:48:37,870 --> 00:48:40,721 we've already done, to prevent an exponential blowup.
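That reuse is what an implementation of backpropagation does explicitly: compute the column of deltas at the output, then get each earlier column from the column to its right, so the work grows linearly with depth, and with the square of the width for fully connected layers. (A standard backprop sketch on a small fully connected sigmoid net, not Winston's board notation; the delta recurrence is the reused computation.)

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, layers):
    # layers[k][j][i] is the weight into neuron j of layer k from
    # output i of the previous layer; outs[k] is the input to
    # layer k, with outs[0] being the net's input x.
    outs = [x]
    for w in layers:
        prev = outs[-1]
        outs.append([sigmoid(sum(wj[i] * prev[i] for i in range(len(prev))))
                     for wj in w])
    return outs

def delta_columns(outs, layers, desired):
    # One column of deltas per layer, each column computed from the
    # column to its right -- the reuse that keeps the cost linear
    # in depth.  The gradient of weight layers[k][j][i] is then
    # just cols[k][j] * outs[k][i].
    z = outs[-1]
    cols = [[(dj - zj) * zj * (1 - zj) for dj, zj in zip(desired, z)]]
    for k in range(len(layers) - 1, 0, -1):
        right = cols[0]
        y = outs[k]
        cols.insert(0, [y[i] * (1 - y[i]) *
                        sum(layers[k][j][i] * right[j]
                            for j in range(len(right)))
                        for i in range(len(y))])
    return cols
```

Each column costs roughly width-squared multiplies, and there is one column per layer, matching the complexity claims that follow.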
853 00:48:40,721 --> 00:48:42,720 By the way, for those of you who know about fast 854 00:48:42,720 --> 00:48:45,880 Fourier transform-- same kind of idea-- reuse 855 00:48:45,880 --> 00:48:48,570 of partial results. 856 00:48:48,570 --> 00:48:52,590 So in the end, what can we say about this stuff? 857 00:48:52,590 --> 00:49:02,725 In the end, what we can say is that it's linear in depth. 858 00:49:05,710 --> 00:49:08,680 That is to say if we increase the number of layers 859 00:49:08,680 --> 00:49:10,720 to so-called depth, then we're going 860 00:49:10,720 --> 00:49:12,430 to increase the amount of computation 861 00:49:12,430 --> 00:49:15,990 necessary in a linear way, because the computation we 862 00:49:15,990 --> 00:49:20,420 need in any column is going to be fixed. 863 00:49:20,420 --> 00:49:26,900 What about how it goes with respect to the width? 864 00:49:31,070 --> 00:49:33,500 Well, with respect to the width, any neuron 865 00:49:33,500 --> 00:49:36,902 here can be connected to any neuron in the next row. 866 00:49:36,902 --> 00:49:38,860 So the amount of work we're going to have to do 867 00:49:38,860 --> 00:49:41,550 will be proportional to the number of connections. 868 00:49:41,550 --> 00:49:47,210 So with respect to width, it's going to be w-squared. 869 00:49:47,210 --> 00:49:52,260 But the fact is that in the end, this stuff is readily computed. 870 00:49:52,260 --> 00:49:58,120 And this, phenomenally enough, was overlooked for 25 years. 871 00:49:58,120 --> 00:50:00,740 So what is it in the end? 872 00:50:00,740 --> 00:50:02,490 In the end, it's an extremely simple idea. 873 00:50:02,490 --> 00:50:03,670 All great ideas are simple. 874 00:50:03,670 --> 00:50:05,150 How come there aren't more of them? 875 00:50:05,150 --> 00:50:08,272 Well, because frequently, that simplicity 876 00:50:08,272 --> 00:50:09,730 involves finding a couple of tricks 877 00:50:09,730 --> 00:50:12,150 and making a couple of observations. 
878 00:50:12,150 --> 00:50:14,950 So usually, we humans hardly ever 879 00:50:14,950 --> 00:50:16,944 go beyond one trick or one observation. 880 00:50:16,944 --> 00:50:18,360 But if you cascade a few together, 881 00:50:18,360 --> 00:50:20,270 sometimes something miraculous falls out 882 00:50:20,270 --> 00:50:23,250 that looks in retrospect extremely simple. 883 00:50:23,250 --> 00:50:25,856 So that's how we got the reuse principle at work-- 884 00:50:25,856 --> 00:50:27,510 our reuse of computation. 885 00:50:27,510 --> 00:50:29,580 In this case, the miracle was a consequence 886 00:50:29,580 --> 00:50:31,590 of two tricks plus an observation. 887 00:50:31,590 --> 00:50:33,960 And the overall idea is that all great ideas 888 00:50:33,960 --> 00:50:37,500 are simple and easy to overlook for a quarter century.