1 00:00:00,000 --> 00:00:01,990 OPERATOR: The following content is provided under a 2 00:00:01,990 --> 00:00:03,840 Creative Commons license. 3 00:00:03,840 --> 00:00:06,840 Your support will help MIT OpenCourseWare continue to 4 00:00:06,840 --> 00:00:10,530 offer high quality educational resources for free. 5 00:00:10,530 --> 00:00:13,390 To make a donation or view additional materials from 6 00:00:13,390 --> 00:00:17,490 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,490 --> 00:00:19,930 ocw.mit.edu. 8 00:00:19,930 --> 00:00:22,810 PROFESSOR: So let's start. 9 00:00:22,810 --> 00:00:25,920 I have written a number on the board here. 10 00:00:25,920 --> 00:00:32,130 Anyone want to speculate what that number represents? 11 00:00:32,130 --> 00:00:34,620 Well, you may recall at the end of the last lecture, we 12 00:00:34,620 --> 00:00:39,480 were simulating pi, and I started up running it with a 13 00:00:39,480 --> 00:00:41,840 billion darts. 14 00:00:41,840 --> 00:00:45,400 And when it finally terminated, this was the 15 00:00:45,400 --> 00:00:50,310 estimate of pi it gave me with a billion. 16 00:00:50,310 --> 00:00:57,360 Not bad, not quite perfect, but still pretty good. 17 00:00:57,360 --> 00:01:00,790 In fact when I later ran it with 10 billion darts, which 18 00:01:00,790 --> 00:01:04,660 took a rather long time to run, didn't do much better. 19 00:01:04,660 --> 00:01:10,810 So it's converging very slowly now near the end. 20 00:01:10,810 --> 00:01:14,480 When we use an algorithm like that one to perform a Monte 21 00:01:14,480 --> 00:01:18,890 Carlo simulation, we're trusting, as I said, that fate 22 00:01:18,890 --> 00:01:22,610 will give us an unbiased sample, a sample that would be 23 00:01:22,610 --> 00:01:27,540 representative of true random throws. 24 00:01:27,540 --> 00:01:30,220 And, indeed in this case, that's a pretty good 25 00:01:30,220 --> 00:01:31,500 assumption. 
26 00:01:31,500 --> 00:01:34,990 The random number generator is not truly random, it's what's 27 00:01:34,990 --> 00:01:38,510 called pseudo-random, in that if you start it with the same 28 00:01:38,510 --> 00:01:42,360 initial conditions, it will give you the same results. 29 00:01:42,360 --> 00:01:47,090 But it's close enough for, at least for government work, and 30 00:01:47,090 --> 00:01:51,980 other useful projects. 31 00:01:51,980 --> 00:01:55,820 We do have to think about the question, how many samples 32 00:01:55,820 --> 00:01:57,690 should we run? 33 00:01:57,690 --> 00:02:00,730 Was a billion darts enough? 34 00:02:00,730 --> 00:02:04,280 Now since we sort of all started knowing what pi was, 35 00:02:04,280 --> 00:02:07,460 we could look at it and say, yeah, pretty good. 36 00:02:07,460 --> 00:02:14,440 But suppose we had no clue about the actual value of pi. 37 00:02:14,440 --> 00:02:16,100 We still have to think about the 38 00:02:16,100 --> 00:02:28,370 question of how many samples? 39 00:02:28,370 --> 00:02:38,020 And also, how accurate do we believe our result is, given 40 00:02:38,020 --> 00:02:40,660 the number of samples? 41 00:02:40,660 --> 00:02:45,790 As you might guess, these two questions are closely related. 42 00:02:45,790 --> 00:02:52,170 That, if we know in advance how much accuracy we want, we 43 00:02:52,170 --> 00:02:54,820 can sometimes use that to calculate how 44 00:02:54,820 --> 00:03:03,900 many samples we need. 45 00:03:03,900 --> 00:03:10,050 But there's still always the issue. 46 00:03:10,050 --> 00:03:13,910 It's never possible to achieve perfect 47 00:03:13,910 --> 00:03:16,200 accuracy through sampling. 48 00:03:16,200 --> 00:03:20,110 Unless you sample the entire population. 49 00:03:20,110 --> 00:03:25,100 No matter how many samples you take, you can never be sure 50 00:03:25,100 --> 00:03:30,260 that the sample set is typical until you've checked every 51 00:03:30,260 --> 00:03:32,220 last element. 
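The dart-throwing Monte Carlo estimator the lecture refers to can be sketched like this (a minimal version; the function and variable names are mine, not the course's actual code):

```python
import random

def estimate_pi(num_darts, seed=None):
    """Estimate pi by throwing random darts at the unit square and
    counting the fraction that land inside the quarter circle."""
    if seed is not None:
        random.seed(seed)  # pseudo-random: same seed, same result
    in_circle = 0
    for _ in range(num_darts):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            in_circle += 1
    # The quarter circle has area pi/4 inside the unit square,
    # so the hit fraction times 4 estimates pi.
    return 4.0 * in_circle / num_darts
```

Note that calling it twice with the same seed returns exactly the same estimate, which is the pseudo-randomness point made above: identical initial conditions give identical results.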
52 00:03:32,220 --> 00:03:38,610 So if I went around MIT and sampled 100 students to try 53 00:03:38,610 --> 00:03:43,110 and, for example, guess the fraction of students at MIT 54 00:03:43,110 --> 00:03:46,940 who are of Chinese descent. 55 00:03:46,940 --> 00:03:52,470 Maybe 100 students would be enough, but maybe I would get 56 00:03:52,470 --> 00:03:55,420 unlucky and draw the wrong 100. 57 00:03:55,420 --> 00:03:59,680 In the sense of, by accident, 100 Chinese descent, or 100 58 00:03:59,680 --> 00:04:01,820 non-Chinese descent, which would give 59 00:04:01,820 --> 00:04:04,090 me the wrong answer. 60 00:04:04,090 --> 00:04:08,200 And there would be no way I could be sure that I had not 61 00:04:08,200 --> 00:04:18,880 drawn a biased sample, unless I really did have the whole 62 00:04:18,880 --> 00:04:22,770 population to look at. 63 00:04:22,770 --> 00:04:28,560 So we can never know that our estimate is correct. 64 00:04:28,560 --> 00:04:32,270 Now maybe I took a billion darts, and for some reason got 65 00:04:32,270 --> 00:04:35,330 really unlucky and they all ended up inside 66 00:04:35,330 --> 00:04:38,440 or outside the circle. 67 00:04:38,440 --> 00:04:42,590 But what we can know, is how likely it is that our answer 68 00:04:42,590 --> 00:04:46,030 is correct, given the assumptions. 69 00:04:46,030 --> 00:04:48,400 And that's the topic we'll spend the next few lectures 70 00:04:48,400 --> 00:04:50,510 on, at least one of the topics. 71 00:04:50,510 --> 00:04:54,520 It's saying, how can we know how likely it is that our 72 00:04:54,520 --> 00:04:56,090 answer is good. 73 00:04:56,090 --> 00:05:01,290 But it's always given some set of assumptions, and we have to 74 00:05:01,290 --> 00:05:04,860 worry a lot about those assumptions. 
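The unlucky-sample risk just described can be illustrated with a quick simulation (the population size and its 10% trait fraction are made up purely for illustration):

```python
import random

def sample_fraction(population, sample_size):
    """Estimate the fraction of True elements from one random sample."""
    sample = random.sample(population, sample_size)
    return sum(sample) / sample_size

random.seed(1)
# A made-up population of 10,000 people, 10% of whom have the trait.
population = [True] * 1000 + [False] * 9000

# Draw 100-person samples many times; the estimates scatter around 0.1,
# and any single draw can be well above or below the true fraction.
estimates = [sample_fraction(population, 100) for _ in range(1000)]
print(min(estimates), max(estimates))
```

The spread between the smallest and largest estimate is the point: any one 100-person sample can mislead, even though the average over many samples is close to the truth.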
75 00:05:04,860 --> 00:05:10,870 Now in the case of our pi example, our assumption was 76 00:05:10,870 --> 00:05:14,810 that the random number generator was indeed giving us 77 00:05:14,810 --> 00:05:18,860 random numbers in the interval 0 to 1. 78 00:05:18,860 --> 00:05:23,980 So that was our underlying assumption. 79 00:05:23,980 --> 00:05:28,700 Then using that, we looked at a plot, and we saw that after 80 00:05:28,700 --> 00:05:33,310 time the answer wasn't changing very much. 81 00:05:33,310 --> 00:05:36,290 And we used that to say, OK, it looks like we're actually 82 00:05:36,290 --> 00:05:40,010 converging on an answer. 83 00:05:40,010 --> 00:05:44,320 And then I ran it again, with another trial, and it 84 00:05:44,320 --> 00:05:49,040 converged again at the same place. 85 00:05:49,040 --> 00:05:52,710 And the fact that that happened several times led me 86 00:05:52,710 --> 00:05:56,640 to at least have some reason to believe that I was actually 87 00:05:56,640 --> 00:06:04,920 finding a good approximation of pi. 88 00:06:04,920 --> 00:06:07,260 That's a good thing to do. 89 00:06:07,260 --> 00:06:09,040 It's a necessary thing to do. 90 00:06:09,040 --> 00:06:11,880 But it is not sufficient. 91 00:06:11,880 --> 00:06:16,300 Because errors can creep into many places. 92 00:06:16,300 --> 00:06:20,120 So that kind of technique, and in fact, almost all 93 00:06:20,120 --> 00:06:26,160 statistical techniques, are good at establishing, in some 94 00:06:26,160 --> 00:06:30,540 sense, the reproducibility of the result, and that it is 95 00:06:30,540 --> 00:06:35,480 statistically valid, and that there's no error, for example, 96 00:06:35,480 --> 00:06:40,250 in the way I'm generating the numbers. 97 00:06:40,250 --> 00:06:43,480 Or I didn't get very unlucky. 98 00:06:43,480 --> 00:06:48,410 However, there are other places, other than bad luck, where 99 00:06:48,410 --> 00:06:51,270 errors can creep in.
100 00:06:51,270 --> 00:06:53,940 So let's look at an example here. 101 00:06:53,940 --> 00:06:59,900 I've taken the algorithm we looked at last time for 102 00:06:59,900 --> 00:07:08,450 finding pi, and I've made a change. 103 00:07:08,450 --> 00:07:13,310 You'll remember that we were before using 4 as our 104 00:07:13,310 --> 00:07:17,110 multiplier, and here what I've done is, just gone in and 105 00:07:17,110 --> 00:07:20,710 replaced 4 by 2. 106 00:07:20,710 --> 00:07:25,190 Assume that I made a programming error. 107 00:07:25,190 --> 00:07:35,420 Now let's see what happens when we run it. 108 00:07:35,420 --> 00:07:42,480 Well, a bad thing has happened. 109 00:07:42,480 --> 00:07:48,560 Sure enough, we ran it and it converged, started to 110 00:07:48,560 --> 00:07:53,850 converge, and if I ran 100 trials each one would converge 111 00:07:53,850 --> 00:07:56,800 at roughly the same place. 112 00:07:56,800 --> 00:08:00,450 Any statistical test I would do, would say that my 113 00:08:00,450 --> 00:08:03,770 statistics are sound, I've chosen enough samples, and for 114 00:08:03,770 --> 00:08:05,960 some accuracy, it's converging. 115 00:08:05,960 --> 00:08:10,280 Everything is perfect, except for what? 116 00:08:10,280 --> 00:08:13,360 It's the wrong answer. 117 00:08:13,360 --> 00:08:18,540 The moral here, is that just because an answer is 118 00:08:18,540 --> 00:08:45,810 statistically valid, does not mean it's the right answer. 119 00:08:45,810 --> 00:08:49,700 And that's really important to understand, because you see 120 00:08:49,700 --> 00:08:53,180 this, and we'll see more examples later, not today, but 121 00:08:53,180 --> 00:08:56,460 after Thanksgiving, it comes up all the time in the 122 00:08:56,460 --> 00:09:00,660 newspapers, in scientific articles, where people do a 123 00:09:00,660 --> 00:09:04,370 million tests, do all the statistics right, say here's 124 00:09:04,370 --> 00:09:08,420 the answer, and it turns out to be completely wrong.
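A sketch of the bug just described: with the multiplier changed from 4 to 2, repeated trials still agree with one another, so every statistical check passes, yet the converged value is roughly pi/2, not pi (this is a reconstruction of the idea, not the course's actual code):

```python
import random

def estimate_with_multiplier(multiplier, num_darts):
    """Dart-throwing estimator with a configurable multiplier;
    only multiplier=4 gives the correct algebra for pi."""
    hits = sum(1 for _ in range(num_darts)
               if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return multiplier * hits / num_darts

random.seed(0)
# Five independent trials with the buggy multiplier of 2.
trials = [estimate_with_multiplier(2, 100000) for _ in range(5)]
# Every trial lands near 1.57: reproducible, statistically sound, and wrong.
print(trials)
```

No statistical test on these trials can reveal the problem; only checking against physical reality, such as the area of an actual circle, can.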
125 00:09:08,420 --> 00:09:12,310 And that's because it was some underlying assumption that 126 00:09:12,310 --> 00:09:17,120 went into the decision, that was not true. 127 00:09:17,120 --> 00:09:20,750 So here, the assumption is, that I've done my algebra 128 00:09:20,750 --> 00:09:27,440 right for computing pi based upon where the darts land. 129 00:09:27,440 --> 00:09:32,990 And it turns out, if I put 2 here, my algebra is wrong. 130 00:09:32,990 --> 00:09:35,850 Now how could I discover this? 131 00:09:35,850 --> 00:09:38,700 Since I've already told you no statistical test is 132 00:09:38,700 --> 00:09:40,060 going to help me. 133 00:09:40,060 --> 00:09:42,710 What's the obvious thing I should be doing when I get 134 00:09:42,710 --> 00:09:44,830 this answer? 135 00:09:44,830 --> 00:09:45,930 Somebody? 136 00:09:45,930 --> 00:09:50,420 Yeah? 137 00:09:50,420 --> 00:09:50,800 STUDENT: [INAUDIBLE] 138 00:09:50,800 --> 00:09:54,830 PROFESSOR: Exactly. 139 00:09:54,830 --> 00:09:57,890 Checking against reality. 140 00:09:57,890 --> 00:10:01,160 I started with the notion that pi had some relation to the 141 00:10:01,160 --> 00:10:03,690 area of a circle. 142 00:10:03,690 --> 00:10:08,030 So I could use this value of pi, draw a 143 00:10:08,030 --> 00:10:11,850 circle with a radius. 144 00:10:11,850 --> 00:10:13,980 Do my best to measure the area. 145 00:10:13,980 --> 00:10:17,120 I wouldn't need to get a very good, accurate measurement, 146 00:10:17,120 --> 00:10:20,890 and I would say, whoa, this isn't even close. 147 00:10:20,890 --> 00:10:25,480 And that would tell me I have a problem. 148 00:10:25,480 --> 00:10:37,240 So the moral here is, to check results 149 00:10:37,240 --> 00:10:50,040 against physical reality. 150 00:10:50,040 --> 00:10:53,750 So for example, the current problem set, you're doing a 151 00:10:53,750 --> 00:10:56,340 simulation about what happens to viruses 152 00:10:56,340 --> 00:11:00,020 when drugs are applied. 
153 00:11:00,020 --> 00:11:03,660 If you were doing this for a pharmaceutical company, in 154 00:11:03,660 --> 00:11:06,660 addition to the simulation, you'd want to run some real 155 00:11:06,660 --> 00:11:08,530 experiments. 156 00:11:08,530 --> 00:11:17,390 And make sure that things matched. 157 00:11:17,390 --> 00:11:29,720 OK, what this suggests, is that we often use simulation, 158 00:11:29,720 --> 00:11:35,930 and other computational techniques, to try and model 159 00:11:35,930 --> 00:11:38,320 the real world, or the physical world, in 160 00:11:38,320 --> 00:11:42,190 which we all live. 161 00:11:42,190 --> 00:11:46,690 And we can use data to do that. 162 00:11:46,690 --> 00:11:51,190 I now want to go through another set of examples, and 163 00:11:51,190 --> 00:11:55,230 we're going to look at the interplay of three things: 164 00:11:55,230 --> 00:12:02,770 what happens when you have data, say from measurements, 165 00:12:02,770 --> 00:12:15,220 and models that at least claim to explain the data. 166 00:12:15,220 --> 00:12:26,870 And then, consequences that follow from the models. 167 00:12:26,870 --> 00:12:30,570 This is often the way science works, it's the way engineering 168 00:12:30,570 --> 00:12:35,870 works, we have some measurements, we have a theory 169 00:12:35,870 --> 00:12:38,360 that explains the measurements, and then we 170 00:12:38,360 --> 00:12:43,770 write software to explore the consequences of that theory. 171 00:12:43,770 --> 00:12:48,270 Including, is it plausible that it's really true? 172 00:12:48,270 --> 00:12:53,030 So I want to start, as an example, with a 173 00:12:53,030 --> 00:12:57,650 classic chosen from 8.01. 174 00:12:57,650 --> 00:13:01,780 So I presume, everyone here has taken 8.01? 175 00:13:01,780 --> 00:13:02,860 Or in 8.01? 176 00:13:02,860 --> 00:13:08,390 Anyone here who's not had an experience with 8.01? 177 00:13:08,390 --> 00:13:11,020 All right, well.
178 00:13:11,020 --> 00:13:13,290 I hope you know about springs, because we're going to talk 179 00:13:13,290 --> 00:13:15,120 about springs. 180 00:13:15,120 --> 00:13:18,780 So if you think about it, I'm now just talking not about 181 00:13:18,780 --> 00:13:21,680 springs that have water in them, but springs that you 182 00:13:21,680 --> 00:13:25,920 compress, you know, and expand, and things like that. 183 00:13:25,920 --> 00:13:28,760 And there's typically something called the spring 184 00:13:28,760 --> 00:13:42,090 constant that tells us how stiff the spring is, how much 185 00:13:42,090 --> 00:13:45,750 energy it takes to compress this spring. 186 00:13:45,750 --> 00:13:49,570 Or equivalently, how much pop the spring has when you're no 187 00:13:49,570 --> 00:13:53,990 longer holding it down. 188 00:13:53,990 --> 00:13:56,720 Some springs are easy to stretch, they have a small 189 00:13:56,720 --> 00:13:58,070 spring constant. 190 00:13:58,070 --> 00:14:01,230 Some springs, for example, the ones that hold up an 191 00:14:01,230 --> 00:14:04,470 automobile suspension, are much harder 192 00:14:04,470 --> 00:14:08,370 to stretch and compress. 193 00:14:08,370 --> 00:14:20,320 There's a theory about them called Hooke's Law. 194 00:14:20,320 --> 00:14:28,070 And it's quite simple. 195 00:14:28,070 --> 00:14:32,870 Force, the amount of force exerted by a spring, is equal 196 00:14:32,870 --> 00:14:38,560 to minus some constant times the distance you have 197 00:14:38,560 --> 00:14:44,170 compressed the spring. 198 00:14:44,170 --> 00:14:47,640 It's minus, because the force is exerted in an opposite 199 00:14:47,640 --> 00:14:50,580 direction, trying to spring up. 200 00:14:50,580 --> 00:14:55,350 So for example, we could look at it this way. 201 00:14:55,350 --> 00:15:01,160 We've got a spring, excuse my art here.
202 00:15:01,160 --> 00:15:05,840 And we put some weight on the spring, which has therefore 203 00:15:05,840 --> 00:15:08,630 compressed it a little bit. 204 00:15:08,630 --> 00:15:13,440 And the spring is exerting some upward force. 205 00:15:13,440 --> 00:15:18,080 And the amount of force it's exerting is proportional to 206 00:15:18,080 --> 00:15:26,590 the distance x. 207 00:15:26,590 --> 00:15:34,670 So, if we believe Hooke's Law, and I give you a spring, how 208 00:15:34,670 --> 00:15:38,790 can we find out what this constant is? 209 00:15:38,790 --> 00:15:45,090 Well, we can do it by putting a weight on top of the spring. 210 00:15:45,090 --> 00:15:50,100 It will compress the spring a certain amount, and then the 211 00:15:50,100 --> 00:15:53,380 spring will stop moving. 212 00:15:53,380 --> 00:15:56,290 Now gravity would normally have had this weight go all 213 00:15:56,290 --> 00:16:00,490 the way down to the bottom, if there was no spring. 214 00:16:00,490 --> 00:16:03,330 So clearly the spring is exerting some force in the 215 00:16:03,330 --> 00:16:08,660 upward direction, to keep that mass from going down to the 216 00:16:08,660 --> 00:16:14,310 table, right? 217 00:16:14,310 --> 00:16:17,760 So we know what that force is there. 218 00:16:17,760 --> 00:16:23,180 If we compress the spring to a bunch of different distances, 219 00:16:23,180 --> 00:16:29,570 by putting, say, different size weights on it, we can 220 00:16:29,570 --> 00:16:35,780 then solve for the spring constant, just the way, 221 00:16:35,780 --> 00:16:39,350 before, we solved for pi. 222 00:16:39,350 --> 00:16:47,280 So it just so happens, not quite by accident, that I've 223 00:16:47,280 --> 00:16:50,540 got some data from a spring. 224 00:16:50,540 --> 00:16:52,290 So let's look at it. 225 00:16:52,290 --> 00:16:57,310 So here's some data taken from measuring a spring. 
226 00:16:57,310 --> 00:17:01,300 This is distance and force, force computed from the mass, 227 00:17:01,300 --> 00:17:02,720 basically, right? 228 00:17:02,720 --> 00:17:07,130 Because we know that these have to be in balance. 229 00:17:07,130 --> 00:17:10,630 And I'm not going to ask you to in your head estimate the 230 00:17:10,630 --> 00:17:16,440 constant from these, but what you'll see is, the format is, 231 00:17:16,440 --> 00:17:22,700 there's a distance, and then a colon, and then the force. 232 00:17:22,700 --> 00:17:22,950 Yeah? 233 00:17:22,950 --> 00:17:29,940 STUDENT: [INAUDIBLE] 234 00:17:29,940 --> 00:17:37,780 PROFESSOR: OK, right, yes, thank you. 235 00:17:37,780 --> 00:17:41,010 All right, want to repeat that more loudly for everyone? 236 00:17:41,010 --> 00:17:42,850 STUDENT: [INAUDIBLE] 237 00:17:42,850 --> 00:17:48,680 PROFESSOR: Right, right, because the x in the equation 238 00:17:48,680 --> 00:17:53,280 -- right, here we're getting an equilibrium. 239 00:17:53,280 --> 00:17:57,140 OK, so let's look at what happens when we try and 240 00:17:57,140 --> 00:17:59,790 examine this. 241 00:17:59,790 --> 00:18:05,300 We'll look at spring dot py. 242 00:18:05,300 --> 00:18:07,520 So it's pretty simple. 243 00:18:07,520 --> 00:18:10,590 First thing is, I've got a function that reads in the 244 00:18:10,590 --> 00:18:12,700 data and parses it. 245 00:18:12,700 --> 00:18:15,400 You've all done more complicated parsing of data 246 00:18:15,400 --> 00:18:16,930 files than this. 247 00:18:16,930 --> 00:18:19,820 So I won't belabor the details. 248 00:18:19,820 --> 00:18:22,640 I called it get data rather than get spring data, because 249 00:18:22,640 --> 00:18:24,610 I'm going to use the same thing for a lot of 250 00:18:24,610 --> 00:18:26,640 other kinds of data. 251 00:18:26,640 --> 00:18:29,630 And the only thing I want you to notice, is that it's 252 00:18:29,630 --> 00:18:36,190 returning a pair of arrays.
253 00:18:36,190 --> 00:18:38,650 OK, not lists. 254 00:18:38,650 --> 00:18:41,350 The usual thing is, I'm building them up using lists, 255 00:18:41,350 --> 00:18:44,470 because lists have append and arrays don't, and then I'm 256 00:18:44,470 --> 00:18:48,310 converting them to arrays so I can do matrix kinds of 257 00:18:48,310 --> 00:18:50,910 operations on them. 258 00:18:50,910 --> 00:18:54,210 So I'll get the distances and the forces. 259 00:18:54,210 --> 00:18:57,010 And then I'm just going to plot them, and we'll see what 260 00:18:57,010 --> 00:18:58,970 they look like. 261 00:18:58,970 --> 00:19:10,640 So let's do that. 262 00:19:10,640 --> 00:19:14,560 There they are. 263 00:19:14,560 --> 00:19:19,990 Now, if you believe Hooke's Law, you could look at this 264 00:19:19,990 --> 00:19:25,370 data, and maybe you wouldn't like it. 265 00:19:25,370 --> 00:19:29,330 Because Hooke's Law implies that, in fact, these points 266 00:19:29,330 --> 00:19:33,950 should lie in a straight line, right? 267 00:19:33,950 --> 00:19:44,690 If I just plug in values here, what am I going to get? 268 00:19:44,690 --> 00:19:46,020 A straight line, right? 269 00:19:46,020 --> 00:19:49,620 I'm just multiplying k times x. 270 00:19:49,620 --> 00:19:51,885 But I don't have a straight line, I have a little scatter 271 00:19:51,885 --> 00:19:54,740 of points, it kind of looks like a straight 272 00:19:54,740 --> 00:19:57,360 line, but it's not. 273 00:19:57,360 --> 00:19:59,250 And why do you think that's true? 274 00:19:59,250 --> 00:20:03,300 What's going on here? 275 00:20:03,300 --> 00:20:12,130 What could cause this line not to be straight? 276 00:20:12,130 --> 00:20:17,310 Have any of you ever done a physics experiment? 277 00:20:17,310 --> 00:20:21,320 And when you did it, did your results actually match the 278 00:20:21,320 --> 00:20:23,300 theory that your high school teacher, say, 279 00:20:23,300 --> 00:20:27,080 explained to you.
280 00:20:27,080 --> 00:20:30,590 No, and why not. 281 00:20:30,590 --> 00:20:35,230 Yeah, you have various kinds of experimental or measurement 282 00:20:35,230 --> 00:20:39,690 error, right? 283 00:20:39,690 --> 00:20:44,580 Because, when you're doing these experiments, at least 284 00:20:44,580 --> 00:20:47,390 I'm not perfect, and I suspect at least most of you are not 285 00:20:47,390 --> 00:20:50,200 perfect, you get mistakes. 286 00:20:50,200 --> 00:20:54,330 A little bit of error creeps in inevitably. 287 00:20:54,330 --> 00:20:57,830 And so, when we acquired this data, sure enough there was 288 00:20:57,830 --> 00:21:00,940 measurement error. 289 00:21:00,940 --> 00:21:04,080 And so the points are scattered around. 290 00:21:04,080 --> 00:21:06,720 This is something to be expected. 291 00:21:06,720 --> 00:21:13,100 Real data almost never matches the theory precisely. 292 00:21:13,100 --> 00:21:16,960 Because there usually is some sort of experimental error 293 00:21:16,960 --> 00:21:24,250 that creeps into things. 294 00:21:24,250 --> 00:21:28,370 So what should we do about that? 295 00:21:28,370 --> 00:21:32,050 Well, what usually people do, when they think about this, is 296 00:21:32,050 --> 00:21:36,685 they would look at this data and say, well, let me fit a 297 00:21:36,685 --> 00:21:37,730 line to this. 298 00:21:37,730 --> 00:21:43,240 Somehow, say, what would be the line that best 299 00:21:43,240 --> 00:21:47,580 approximates these points? 300 00:21:47,580 --> 00:21:51,540 And then the slope of that line would give 301 00:21:51,540 --> 00:21:57,570 me the spring constant. 302 00:21:57,570 --> 00:22:05,210 So that raises the next question, what do I mean by 303 00:22:05,210 --> 00:22:09,460 finding a line that best fits these points? 304 00:22:09,460 --> 00:22:26,260 How do we, fit, in this case, a line, to the data? 
305 00:22:26,260 --> 00:22:29,460 First of all, I should ask the question, why did I say let's 306 00:22:29,460 --> 00:22:31,020 fit a line? 307 00:22:31,020 --> 00:22:34,380 Maybe I should have said, let's fit a parabola, or let's 308 00:22:34,380 --> 00:22:38,580 fit a circle? 309 00:22:38,580 --> 00:22:45,530 Why should I have said let's fit a line? 310 00:22:45,530 --> 00:22:45,850 Yeah? 311 00:22:45,850 --> 00:22:49,546 STUDENT: [INAUDIBLE] 312 00:22:49,546 --> 00:22:51,580 PROFESSOR: Well, how do I know that the 313 00:22:51,580 --> 00:22:57,230 plot is a linear function? 314 00:22:57,230 --> 00:23:01,010 Pardon? 315 00:23:01,010 --> 00:23:04,860 Well, so, two things. 316 00:23:04,860 --> 00:23:08,260 One is, I had a theory. 317 00:23:08,260 --> 00:23:13,720 You know, I had up there a model, and my model suggested 318 00:23:13,720 --> 00:23:17,840 that I expected it to be linear. 319 00:23:17,840 --> 00:23:20,670 And so if I'm testing my model, I should try and fit a 320 00:23:20,670 --> 00:23:23,150 line, my theory, if you will. 321 00:23:23,150 --> 00:23:26,410 But also when I look at it, it looks kind of like a line. 322 00:23:26,410 --> 00:23:29,800 So you know, if I looked at it, and it didn't look like a 323 00:23:29,800 --> 00:23:34,820 line, I might have said, well, my model must be badly broken. 324 00:23:34,820 --> 00:23:38,960 So let's try and see if we can fit it. 325 00:23:38,960 --> 00:23:43,730 Whenever we try and fit something, we need some sort 326 00:23:43,730 --> 00:23:53,770 of an objective function that captures the 327 00:23:53,770 --> 00:23:56,120 goodness of a fit. 328 00:23:56,120 --> 00:23:59,680 I'm trying to find, this is an optimization problem of the 329 00:23:59,680 --> 00:24:01,910 sort that we've looked at before. 330 00:24:01,910 --> 00:24:06,070 I'm trying to find a line that optimizes 331 00:24:06,070 --> 00:24:10,310 some objective function.
332 00:24:10,310 --> 00:24:15,150 So a very simple objective function here, is called the 333 00:24:15,150 --> 00:24:24,390 least squares fit. 334 00:24:24,390 --> 00:24:37,610 I want to find the line that minimizes the sum of 335 00:24:37,610 --> 00:24:47,410 observation sub i, the i'th data point I have, minus what 336 00:24:47,410 --> 00:24:54,390 the line, the model, predicts that point should have been, 337 00:24:54,390 --> 00:24:59,740 and then I'll square it. 338 00:24:59,740 --> 00:25:03,160 So I want to minimize this value. 339 00:25:03,160 --> 00:25:07,490 I want to find the line that gives me the 340 00:25:07,490 --> 00:25:10,210 smallest value for this. 341 00:25:10,210 --> 00:25:12,610 Why do you think I'm squaring the difference? 342 00:25:12,610 --> 00:25:17,310 What would happen if I didn't square the difference? 343 00:25:17,310 --> 00:25:18,950 Yeah? 344 00:25:18,950 --> 00:25:24,670 Positive and negative errors might cancel each other out. 345 00:25:24,670 --> 00:25:28,760 And in judging the quality of the fit, I don't really care 346 00:25:28,760 --> 00:25:31,770 deeply -- you're going to get very fat the way you're 347 00:25:31,770 --> 00:25:34,410 collecting candy here -- 348 00:25:34,410 --> 00:25:37,530 I don't care deeply which side the error is 349 00:25:37,530 --> 00:25:39,860 on, just that it's wrong. 350 00:25:39,860 --> 00:25:43,070 And so by squaring it, it's kind of like taking the 351 00:25:43,070 --> 00:25:47,150 absolute value of the error, among other things. 352 00:25:47,150 --> 00:25:54,690 All right, so if we look at our example here, 353 00:25:54,690 --> 00:26:02,290 what would this be? 354 00:26:02,290 --> 00:26:07,250 I want to minimize, want to find a line that minimizes it. 355 00:26:07,250 --> 00:26:10,470 So how do I do that? 356 00:26:10,470 --> 00:26:13,930 I could easily do it using successive 357 00:26:13,930 --> 00:26:17,480 approximation, right?
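The least squares objective just described, the sum over points of (observed minus predicted) squared, can be written directly, together with one simple successive-approximation search over the slope (a sketch with made-up measurements; the interval-narrowing search shown here is just one way to do the optimization, and the lecture goes on to note that pylab does it for you):

```python
def sum_squared_error(slope, xs, ys):
    """Sum over all points of (observed - predicted)^2 for y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys))

def fit_slope(xs, ys, lo=0.0, hi=100.0, iterations=60):
    """Successive approximation: repeatedly narrow the interval that
    must contain the slope minimizing the sum of squared errors."""
    for _ in range(iterations):
        third = (hi - lo) / 3
        # The objective is convex in the slope, so comparing two interior
        # points tells us which end of the interval to discard.
        if sum_squared_error(lo + third, xs, ys) < sum_squared_error(hi - third, xs, ys):
            hi = hi - third
        else:
            lo = lo + third
    return (lo + hi) / 2

# Made-up measurements roughly following y = 31.5 * x, with a bit of noise.
xs = [0.1, 0.2, 0.3, 0.4, 0.5]
ys = [3.2, 6.2, 9.6, 12.5, 15.9]
```

Calling `fit_slope(xs, ys)` returns a slope close to the closed-form least squares answer, sum(x*y) / sum(x*x), for the same data.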
358 00:26:17,480 --> 00:26:20,230 I could choose a line, basically what I am, is I'm 359 00:26:20,230 --> 00:26:23,680 choosing a slope, here, right? 360 00:26:23,680 --> 00:26:27,690 And, I could, just like Newton Raphson, do successive 361 00:26:27,690 --> 00:26:34,880 approximation for awhile, and get the best fit. 362 00:26:34,880 --> 00:26:37,810 That's one way to do the optimization. 363 00:26:37,810 --> 00:26:41,940 It turns out that for this particular optimization 364 00:26:41,940 --> 00:26:44,790 there's something more efficient. 365 00:26:44,790 --> 00:26:48,070 You can actually, there is a closed form way of attacking 366 00:26:48,070 --> 00:26:52,380 this, and I could explain that, but in fact, I'll 367 00:26:52,380 --> 00:26:55,080 explain something even better. 368 00:26:55,080 --> 00:27:04,350 It's built into Pylab. 369 00:27:04,350 --> 00:27:14,900 So Pylab has a function built-in called polyfit. 370 00:27:14,900 --> 00:27:19,670 Which, given a set of points, finds the polynomial that 371 00:27:19,670 --> 00:27:21,660 gives you the best least squares 372 00:27:21,660 --> 00:27:28,790 approximation to those points. 373 00:27:28,790 --> 00:27:33,040 It's called polynomial because it isn't necessarily going to 374 00:27:33,040 --> 00:27:36,520 be first order, that is to say, a line. 375 00:27:36,520 --> 00:27:42,030 It can find polynomials of arbitrary degree. 376 00:27:42,030 --> 00:27:48,790 So let's look at the example here, we'll see how it works. 377 00:27:48,790 --> 00:27:59,410 So let me uncomment it. 378 00:27:59,410 --> 00:28:07,480 So I'm going to get k and b equals Pylab dot polyfit here. 379 00:28:07,480 --> 00:28:14,780 What it's going to do is, think about a polynomial. 380 00:28:14,780 --> 00:28:18,640 I give you a polynomial of degree one, you have all 381 00:28:18,640 --> 00:28:26,940 learned that it's a x plus b, b is the constant, and x is 382 00:28:26,940 --> 00:28:29,290 the single variable. 
383 00:28:29,290 --> 00:28:34,330 And so I multiply a by x and I add b to it, and as I vary x I 384 00:28:34,330 --> 00:28:36,130 get new values. 385 00:28:36,130 --> 00:28:47,090 And so polyfit, in this case, will take the set of points 386 00:28:47,090 --> 00:28:52,870 defined by these two arrays and return me a value for a 387 00:28:52,870 --> 00:28:57,230 and a value for b. 388 00:28:57,230 --> 00:29:05,290 Now here I've assigned a to k, but don't worry about that. 389 00:29:05,290 --> 00:29:11,930 And then, I'm gonna now generate the predictions that 390 00:29:11,930 --> 00:29:19,570 I would get from this k and b, and plot those. 391 00:29:19,570 --> 00:29:32,050 So let's look at it. 392 00:29:32,050 --> 00:29:39,970 So here it said the k is 31.475, etc., and it's plotted 393 00:29:39,970 --> 00:29:43,300 the line that it's found. 394 00:29:43,300 --> 00:29:45,590 Or I've plotted the line. 395 00:29:45,590 --> 00:29:48,320 You'll note, a lot of the points don't lie on the line, 396 00:29:48,320 --> 00:29:53,380 in fact, most of the points don't lie on the line. 397 00:29:53,380 --> 00:29:56,100 But it's asserting that this is the best it 398 00:29:56,100 --> 00:29:58,830 can do with the line. 399 00:29:58,830 --> 00:30:02,890 And there's some points, for example, up here, that are 400 00:30:02,890 --> 00:30:07,790 kind of outliers, that are pretty far from the line. 401 00:30:07,790 --> 00:30:11,560 But it has minimized the error, if you will, for all of 402 00:30:11,560 --> 00:30:15,620 the points it has. 403 00:30:15,620 --> 00:30:18,670 That's quite different from, say, finding the line that 404 00:30:18,670 --> 00:30:22,640 touches the most points, right? 405 00:30:22,640 --> 00:30:29,790 It's minimizing the sum of the errors. 406 00:30:29,790 --> 00:30:32,950 Now, given that I was just looking for a constant to 407 00:30:32,950 --> 00:30:40,930 start with, why did I bother even plotting the data? 
408 00:30:40,930 --> 00:30:43,490 I happen to have known before I did this that polyfit 409 00:30:43,490 --> 00:30:49,230 existed, and what I was really looking for was this line. 410 00:30:49,230 --> 00:30:51,450 So maybe I should have just done the polyfit and said 411 00:30:51,450 --> 00:30:55,710 here's k and I'm done. 412 00:30:55,710 --> 00:31:01,500 Would that have been a good idea? 413 00:31:01,500 --> 00:31:01,810 Yeah? 414 00:31:01,810 --> 00:31:06,292 STUDENT: You can't know without seeing the actual data 415 00:31:06,292 --> 00:31:10,276 how well it's actually fitting it. 416 00:31:10,276 --> 00:31:11,910 PROFESSOR: Right. 417 00:31:11,910 --> 00:31:12,790 Exactly right. 418 00:31:12,790 --> 00:31:15,090 That says, well how would I know that it was fitting it 419 00:31:15,090 --> 00:31:19,150 badly or well, and in fact, how would I know that my 420 00:31:19,150 --> 00:31:23,220 notion of the model is sound, or that my experiment isn't 421 00:31:23,220 --> 00:31:25,410 completely broken? 422 00:31:25,410 --> 00:31:31,720 So always, I think, always look at the real data. 423 00:31:31,720 --> 00:31:34,460 Don't just, I've seen too many papers where people show me 424 00:31:34,460 --> 00:31:38,180 the curve that fits the data, and don't show me the data, 425 00:31:38,180 --> 00:31:40,770 and it always makes me very nervous. 426 00:31:40,770 --> 00:31:44,880 So always look at the data, as well as however you're 427 00:31:44,880 --> 00:31:46,980 choosing to fit it. 428 00:31:46,980 --> 00:31:53,140 As an example of that, let's look at another set of inputs. 429 00:31:53,140 --> 00:32:05,090 This is not a spring. 430 00:32:05,090 --> 00:32:08,350 It's the same get data function as before, ignore 431 00:32:08,350 --> 00:32:13,720 that thing at the top. 432 00:32:13,720 --> 00:32:26,980 I'm going to analyze it and we'll look at it. 433 00:32:26,980 --> 00:32:33,340 So here I'm plotting the speed of something over time. 
434 00:32:33,340 --> 00:32:39,640 So I plotted it, and I've done a least squares fit using 435 00:32:39,640 --> 00:32:44,620 polyfit just as before to get a line, and I put the line vs. 436 00:32:44,620 --> 00:32:51,100 the data, and here I'm a little suspicious. 437 00:32:51,100 --> 00:32:55,930 Right, I fit a line, but when I look at it, I don't think 438 00:32:55,930 --> 00:32:59,180 it's a real good fit for the data. 439 00:32:59,180 --> 00:33:14,840 Somehow modeling this data as a line is probably not right. 440 00:33:14,840 --> 00:33:17,920 A linear model is not good for this data. 441 00:33:17,920 --> 00:33:20,210 This data is derived from something, a 442 00:33:20,210 --> 00:33:23,470 more complex process. 443 00:33:23,470 --> 00:33:27,120 So take a look at it, and tell me what order of 444 00:33:27,120 --> 00:33:29,650 polynomial do you think might fit this data? 445 00:33:29,650 --> 00:33:34,290 What shape does this look like to you? 446 00:33:34,290 --> 00:33:35,850 Pardon? 447 00:33:35,850 --> 00:33:36,100 STUDENT: Quadratic. 448 00:33:36,100 --> 00:33:40,140 PROFESSOR: Quadratic, because the shape is a what? 449 00:33:40,140 --> 00:33:41,830 It's a parabola. 450 00:33:41,830 --> 00:33:43,660 Well, I don't know if I dare try this one all 451 00:33:43,660 --> 00:33:45,680 the way to the back. 452 00:33:45,680 --> 00:33:50,400 Ooh, at least I didn't hurt anybody. 453 00:33:50,400 --> 00:33:54,470 All right, fortunately it's just as easy to fit a 454 00:33:54,470 --> 00:33:59,540 parabola as a line. 455 00:33:59,540 --> 00:34:06,470 So let's look down here. 456 00:34:06,470 --> 00:34:11,760 I've done the same thing, but instead of passing it one, as 457 00:34:11,760 --> 00:34:15,090 I did up here as the argument, I'm passing it two. 458 00:34:15,090 --> 00:34:18,630 Saying, instead of fitting a polynomial of degree one, fit 459 00:34:18,630 --> 00:34:21,510 a polynomial of degree two.
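The change from a line to a parabola is just the degree argument. A sketch with invented data standing in for the "speed over time" measurements the lecture reads from a file:

```python
import numpy as np

# Invented stand-in for the speed-vs-time data: an exact parabola.
times = np.arange(0.0, 10.0, 1.0)
speeds = 3.0 * times**2 - 2.0 * times + 1.0

# Degree 1 fits a line; degree 2 fits a parabola. Same call, different order.
line_coeffs = np.polyfit(times, speeds, 1)
quad_coeffs = np.polyfit(times, speeds, 2)  # [a, b, c] for a*x^2 + b*x + c

# polyval evaluates a fitted polynomial at the given points.
line_pred = np.polyval(line_coeffs, times)
quad_pred = np.polyval(quad_coeffs, times)
```

On data that is really quadratic, the degree-2 fit recovers the parabola, while the line leaves large residuals.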
460 00:34:21,510 --> 00:34:32,800 And now let's see what it looks like. 461 00:34:32,800 --> 00:34:39,000 Well, my eyes tell me this is a much better 462 00:34:39,000 --> 00:34:44,830 fit than the line. 463 00:34:44,830 --> 00:34:49,910 So again, that's why I wanted to see the scatter plot, so 464 00:34:49,910 --> 00:34:53,130 that I could at least look at it with my eyes, and say, 465 00:34:53,130 --> 00:34:58,120 yeah, this looks like a better fit. 466 00:34:58,120 --> 00:35:07,470 All right, any question about what's going on here? 467 00:35:07,470 --> 00:35:13,400 What we've been looking at is something called linear 468 00:35:13,400 --> 00:35:23,640 regression. 469 00:35:23,640 --> 00:35:30,370 It's called linear because the relationship of the dependent 470 00:35:30,370 --> 00:35:39,380 variable y to the independent variables is assumed to be a 471 00:35:39,380 --> 00:35:43,390 linear function of the parameters. 472 00:35:43,390 --> 00:35:47,350 It's not because it has to be a linear function of 473 00:35:47,350 --> 00:35:50,460 the value of x, OK? 474 00:35:50,460 --> 00:35:53,950 Because as you can see, we're not getting a line, we're 475 00:35:53,950 --> 00:35:56,360 getting a parabola. 476 00:35:56,360 --> 00:36:00,100 Don't worry about the details, the point I want to make is, 477 00:36:00,100 --> 00:36:03,500 people sometimes see the word linear regression and think it 478 00:36:03,500 --> 00:36:06,750 can only be used to find lines. 479 00:36:06,750 --> 00:36:11,780 It's not so. 480 00:36:11,780 --> 00:36:16,580 So when, for example, we did the quadratic, what we had is 481 00:36:16,580 --> 00:36:26,210 y equals a x squared plus b x plus c. 482 00:36:26,210 --> 00:36:30,590 The graph vs. x will not be a straight line, right, because 483 00:36:30,590 --> 00:36:34,810 I'm squaring x. 484 00:36:34,810 --> 00:36:43,780 But it is linear in the parameters, in this case, not in the single variable x.
485 00:36:43,780 --> 00:36:49,910 Now, when I looked at this, I said, all right, it's clear 486 00:36:49,910 --> 00:36:55,530 that the yellow curve is a better fit than the red. 487 00:36:55,530 --> 00:36:59,130 It's a red line. 488 00:36:59,130 --> 00:37:03,740 But that was a pretty informal statement. 489 00:37:03,740 --> 00:37:11,550 I can actually look at this much more formally. 490 00:37:11,550 --> 00:37:14,340 And we're going to look at something that the 491 00:37:14,340 --> 00:37:18,410 statisticians call r squared. 492 00:37:18,410 --> 00:37:26,750 Which in the case of a linear regression is the coefficient 493 00:37:26,750 --> 00:37:34,480 of determination. 494 00:37:34,480 --> 00:37:38,640 Now, this is a big fancy word for something that's actually 495 00:37:38,640 --> 00:37:41,510 pretty simple. 496 00:37:41,510 --> 00:37:44,630 So what r squared is going to be, and this is on your 497 00:37:44,630 --> 00:37:58,660 handout, is 1 minus EE over DV. So EE is going to be the 498 00:37:58,660 --> 00:38:02,200 errors in the estimation. 499 00:38:02,200 --> 00:38:06,630 So I've got some estimated values, some predicted values, 500 00:38:06,630 --> 00:38:11,810 if you will, given to me by the model, either the line or 501 00:38:11,810 --> 00:38:14,070 the parabola in this case. 502 00:38:14,070 --> 00:38:18,710 And I've got some real values, corresponding to each of those 503 00:38:18,710 --> 00:38:25,040 points, and I can look at the difference between the two. And 504 00:38:25,040 --> 00:38:29,340 that will tell me how much difference there is between 505 00:38:29,340 --> 00:38:35,720 the estimated data and the, well, between the predicted 506 00:38:35,720 --> 00:38:42,840 data and the measured data, in this case. 507 00:38:42,840 --> 00:38:49,160 And then I want to divide that by the variance in the 508 00:38:49,160 --> 00:38:51,390 measured data. 509 00:38:51,390 --> 00:38:59,230 The data variance.
510 00:38:59,230 --> 00:39:03,620 How broadly scattered the measured points are. 511 00:39:03,620 --> 00:39:10,530 And I'll do that by comparing the mean of the measured data, 512 00:39:10,530 --> 00:39:13,190 to the measured data. 513 00:39:13,190 --> 00:39:16,020 So I get the average value of the measured data, and I look 514 00:39:16,020 --> 00:39:21,690 at how different the points I measure are. 515 00:39:21,690 --> 00:39:26,330 So I just want to give this to you informally, because I really 516 00:39:26,330 --> 00:39:28,780 don't care if you understand all the math. 517 00:39:28,780 --> 00:39:32,300 What I do want you to understand, when someone tells 518 00:39:32,300 --> 00:39:37,310 you, here's the r squared value, is, informally, what it 519 00:39:37,310 --> 00:39:39,490 really is saying. 520 00:39:39,490 --> 00:39:47,150 It's attempting to capture the proportion of the response 521 00:39:47,150 --> 00:39:51,850 variation explained by the variables in the model. 522 00:39:51,850 --> 00:39:56,030 In this case, x. 523 00:39:56,030 --> 00:40:04,110 So you'll have some amount of variation that is explained by 524 00:40:04,110 --> 00:40:08,320 changing the values of the variables. 525 00:40:08,320 --> 00:40:11,170 So, actually, I'm going to give an example and then come 526 00:40:11,170 --> 00:40:12,960 back to it more informally. 527 00:40:12,960 --> 00:40:21,140 So if, for example, r squared were to equal 0.9, that would 528 00:40:21,140 --> 00:40:26,470 mean that approximately 90 percent of the variation in 529 00:40:26,470 --> 00:40:34,380 the data can be explained by the model. 530 00:40:34,380 --> 00:40:36,660 OK, so we have some amount of variation in the measured 531 00:40:36,660 --> 00:40:42,610 data, and if r squared is 0.9, it says that 90 percent can be 532 00:40:42,610 --> 00:40:49,290 explained by the model, and the other 10 percent cannot.
533 00:40:49,290 --> 00:40:54,640 Now, that other 10 percent could be experimental error, 534 00:40:54,640 --> 00:40:57,840 or it could be that, in fact, you need more 535 00:40:57,840 --> 00:41:00,550 variables in the model. 536 00:41:00,550 --> 00:41:05,440 That there are what are called lurking variables. 537 00:41:05,440 --> 00:41:09,530 I love this term. 538 00:41:09,530 --> 00:41:12,770 A lurking variable is something that actually 539 00:41:12,770 --> 00:41:18,860 affects the result, but is not reflected in the model. 540 00:41:18,860 --> 00:41:26,260 As we'll see a little bit later, this is a very 541 00:41:26,260 --> 00:41:29,320 important thing to worry about, when you're looking at 542 00:41:29,320 --> 00:41:32,530 experimental data and you're building models. 543 00:41:32,530 --> 00:41:36,370 So we see this, for example, in the medical literature, 544 00:41:36,370 --> 00:41:41,530 that they will do some experiment, and they'll say 545 00:41:41,530 --> 00:41:46,870 that this drug explains x, or has this effect. 546 00:41:46,870 --> 00:41:49,250 And the variables they are looking at are, say, the 547 00:41:49,250 --> 00:41:55,710 disease the patient has, and the age of the patient. 548 00:41:55,710 --> 00:42:00,700 Well, maybe the gender of the patient is also important, but 549 00:42:00,700 --> 00:42:04,860 it doesn't happen to be in the model. 550 00:42:04,860 --> 00:42:09,400 Now, if when they did a fit, it came out with 0.9, that 551 00:42:09,400 --> 00:42:12,980 says at worst case, the variables we didn't consider 552 00:42:12,980 --> 00:42:19,080 could cause a 10 percent error. 553 00:42:19,080 --> 00:42:23,880 But, that could be big, that could matter a lot. 554 00:42:23,880 --> 00:42:29,760 And so as you get farther from 1, you ought to get very 555 00:42:29,760 --> 00:42:33,460 worried about whether you actually have 556 00:42:33,460 --> 00:42:35,590 all the right variables.
557 00:42:35,590 --> 00:42:37,730 Now you might have the right variables, and the experiment 558 00:42:37,730 --> 00:42:42,410 was just not conducted well. But it's usually the case that the 559 00:42:42,410 --> 00:42:46,930 problem is not that, but that there are lurking variables. 560 00:42:46,930 --> 00:42:49,370 And we'll see examples of that. 561 00:42:49,370 --> 00:42:52,400 So, easier to read than the math, at least by me, easier 562 00:42:52,400 --> 00:43:07,940 to read than the math, is the implementation of r squared. 563 00:43:07,940 --> 00:43:12,510 So it takes measured and estimated values, and I get the diffs, the 564 00:43:12,510 --> 00:43:15,820 differences, between the estimated and the measured. 565 00:43:15,820 --> 00:43:19,080 These are both arrays, so I subtract one array from the 566 00:43:19,080 --> 00:43:20,970 other, and then I square it. 567 00:43:20,970 --> 00:43:24,000 Remember, this'll do an element-wise subtraction, and 568 00:43:24,000 --> 00:43:26,830 then square each element. 569 00:43:26,830 --> 00:43:32,560 Then I can get the mean, by dividing the sum of the array 570 00:43:32,560 --> 00:43:38,850 measured by the length of it. 571 00:43:38,850 --> 00:43:42,590 I can get the variance, which is the measured mean minus the 572 00:43:42,590 --> 00:43:46,060 measured value, again squared. 573 00:43:46,060 --> 00:43:53,590 And then I'll return 1 minus this. 574 00:43:53,590 --> 00:43:55,360 All right? 575 00:43:55,360 --> 00:43:59,710 So, just to make sure we sort of understand the code, and 576 00:43:59,710 --> 00:44:04,320 the theory here as well, what would we get if we had 577 00:44:04,320 --> 00:44:08,210 absolutely perfect prediction? 578 00:44:08,210 --> 00:44:11,930 So if every measured point actually fit on the curve 579 00:44:11,930 --> 00:44:19,200 predicted by our model, what would r squared return? 580 00:44:19,200 --> 00:44:24,980 So in this case, measured and estimated would be identical.
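A sketch of the rSquared function as it is described here, assuming `measured` and `estimated` are numpy arrays (the exact code on the handout may differ in details):

```python
import numpy as np

def r_squared(measured, estimated):
    # Element-wise subtraction, then square each element:
    # the squared errors of the estimation, summed (EE).
    EE = ((estimated - measured) ** 2).sum()
    # Mean of the measured data: sum of the array divided by its length.
    measured_mean = measured.sum() / float(len(measured))
    # Data variance (DV): measured mean minus each measured value, squared.
    DV = ((measured_mean - measured) ** 2).sum()
    return 1.0 - EE / DV
```

With a perfect prediction, EE is 0 and the function returns 1; the worse the fit, the closer the result drops toward 0.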
581 00:44:24,980 --> 00:44:30,940 What gets returned by this? 582 00:44:30,940 --> 00:44:32,940 Yeah, 1. 583 00:44:32,940 --> 00:44:38,720 Exactly right. 584 00:44:38,720 --> 00:44:43,480 Because when I compute it, it will turn out that the 585 00:44:43,480 --> 00:44:50,600 numerator will be 0, and 1 minus 0 is 1, right? 586 00:44:50,600 --> 00:44:55,900 Because the differences will be zero. 587 00:44:55,900 --> 00:44:59,340 OK? 588 00:44:59,340 --> 00:45:04,900 So I can use this, now, to actually get a notion of how 589 00:45:04,900 --> 00:45:08,480 good my fit is. 590 00:45:08,480 --> 00:45:13,130 So let's look at speed dot py again here, and now I'm going 591 00:45:13,130 --> 00:45:17,790 to uncomment these two things, where I'm going to, after I 592 00:45:17,790 --> 00:45:30,970 compute the fit, I'm going to then measure it. 593 00:45:30,970 --> 00:45:34,420 And you'll see here that the r squared value for the linear 594 00:45:34,420 --> 00:45:42,730 fit is 0.896, and for the quadratic fit is 0.973. 595 00:45:42,730 --> 00:45:47,990 So indeed, we get a much better fit here. 596 00:45:47,990 --> 00:45:51,550 So not only does our eye tell us we have a better fit, our 597 00:45:51,550 --> 00:45:55,210 more formal statistical measure tells us we have a 598 00:45:55,210 --> 00:45:57,690 better fit, and it tells us how good it is. 599 00:45:57,690 --> 00:46:02,490 It's not a perfect fit, but it's a pretty 600 00:46:02,490 --> 00:46:07,140 good fit, for sure. 601 00:46:07,140 --> 00:46:13,930 Now, interestingly enough, it isn't surprising that the 602 00:46:13,930 --> 00:46:20,770 quadratic fit is better than the linear fit. 603 00:46:20,770 --> 00:46:24,950 In fact, the mathematics of this should tell us it can 604 00:46:24,950 --> 00:46:28,620 never be worse. 605 00:46:28,620 --> 00:46:31,820 How do I know it can never be worse? 606 00:46:31,820 --> 00:46:35,640 That's just, never is a really strong word.
607 00:46:35,640 --> 00:46:38,720 How do I know that? 608 00:46:38,720 --> 00:46:42,980 Because, when I do the quadratic fit, if I had 609 00:46:42,980 --> 00:46:47,660 perfectly linear data, then this coefficient, whoops, not 610 00:46:47,660 --> 00:46:56,670 that coefficient, wrong, this coefficient, could be 0. 611 00:46:56,670 --> 00:47:01,800 So if I ask it to do a quadratic fit to linear data, 612 00:47:01,800 --> 00:47:06,120 and the data is truly perfectly linear, this coefficient will 613 00:47:06,120 --> 00:47:09,380 be 0, and my model will turn out to be the same as the 614 00:47:09,380 --> 00:47:12,880 linear model. 615 00:47:12,880 --> 00:47:19,950 So I will always get at least as good a fit. 616 00:47:19,950 --> 00:47:25,240 Now, does this mean that it's always better to use a higher 617 00:47:25,240 --> 00:47:27,990 order polynomial? 618 00:47:27,990 --> 00:47:38,710 The answer is no, and let's look at why. 619 00:47:38,710 --> 00:47:48,400 So here what I've done is, I've taken seven points, and 620 00:47:48,400 --> 00:47:54,470 I've generated, if you look at this line here, the y-values, 621 00:47:54,470 --> 00:47:57,070 for x in x vals, points dot append x 622 00:47:57,070 --> 00:48:00,520 plus some random number. 623 00:48:00,520 --> 00:48:04,230 So basically I've got something linear in x, but I'm 624 00:48:04,230 --> 00:48:08,620 perturbing, if you will, my data by some random value. 625 00:48:08,620 --> 00:48:11,930 Something between 0 and 1 is getting added to things. 626 00:48:11,930 --> 00:48:14,320 And I'm doing this so my points won't lie on a 627 00:48:14,320 --> 00:48:19,430 perfectly straight line. 628 00:48:19,430 --> 00:48:24,340 And then we'll try and fit a line to it. 629 00:48:24,340 --> 00:48:28,580 And also, just for fun, we'll try and fit a fifth order 630 00:48:28,580 --> 00:48:30,840 polynomial to it. 631 00:48:30,840 --> 00:48:40,500 And let's see what we get.
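The experiment just described can be sketched as follows; the variable names and the fixed seed are invented here so the sketch is reproducible:

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is reproducible

# Seven points, linear in x, each perturbed by a random value in [0, 1),
# so they don't lie on a perfectly straight line.
x_vals = np.arange(7.0)
y_vals = x_vals + np.random.random(7)

line = np.polyfit(x_vals, y_vals, 1)     # degree 1: the line
quintic = np.polyfit(x_vals, y_vals, 5)  # fifth order, just for fun

# On the fitted points themselves, the fifth-order fit can never do worse,
# since it could always set its higher coefficients to zero.
line_err = ((np.polyval(line, x_vals) - y_vals) ** 2).sum()
quintic_err = ((np.polyval(quintic, x_vals) - y_vals) ** 2).sum()
```
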
632 00:48:40,500 --> 00:48:44,170 Well, there's my line, and there's my fifth order 633 00:48:44,170 --> 00:48:45,570 polynomial. 634 00:48:45,570 --> 00:48:50,160 Neither is quite perfect, but which do you think looks like 635 00:48:50,160 --> 00:48:53,890 a closer fit? 636 00:48:53,890 --> 00:49:00,650 With your eye. 637 00:49:00,650 --> 00:49:04,960 Well, I would say the red line, the red curve, if you 638 00:49:04,960 --> 00:49:09,910 will, is a better fit, and sure enough if we look at the 639 00:49:09,910 --> 00:49:16,830 statistics, we'll see it's 0.99, as opposed to 0.978. 640 00:49:16,830 --> 00:49:21,890 So it's clearly a closer fit. 641 00:49:21,890 --> 00:49:30,140 But that raises the very important question: does 642 00:49:30,140 --> 00:49:36,850 closer equal better, or tighter, which is another word 643 00:49:36,850 --> 00:49:40,980 for closer? 644 00:49:40,980 --> 00:49:44,830 And the answer is no. 645 00:49:44,830 --> 00:49:49,140 It's a tighter fit, but it's not necessarily better, in the 646 00:49:49,140 --> 00:49:52,780 sense of more useful. 647 00:49:52,780 --> 00:49:55,400 Because one of the things I want to do when I build a 648 00:49:55,400 --> 00:49:56,930 model like this, is have something 649 00:49:56,930 --> 00:50:00,030 with predictive power. 650 00:50:00,030 --> 00:50:05,120 I don't really necessarily need a model to tell me where 651 00:50:05,120 --> 00:50:08,510 the points I've measured lie, because I have them. 652 00:50:08,510 --> 00:50:12,160 The whole purpose of the model is to give me some way to 653 00:50:12,160 --> 00:50:17,260 predict where unmeasured points would lie, where future 654 00:50:17,260 --> 00:50:19,330 points would lie. 655 00:50:19,330 --> 00:50:23,390 OK, I understand how the spring works, and I can guess 656 00:50:23,390 --> 00:50:26,620 where it would be if things I haven't had the time to 657 00:50:26,620 --> 00:50:31,410 measure, or the ability to measure. 
658 00:50:31,410 --> 00:50:38,080 So let's look at that. 659 00:50:38,080 --> 00:50:41,350 Let's see, where'd that figure go. 660 00:50:41,350 --> 00:50:47,720 It's lurking somewhere. 661 00:50:47,720 --> 00:50:54,950 All right, we'll just kill this for now. 662 00:50:54,950 --> 00:51:00,810 So let's generate some more points, and I'm going to use 663 00:51:00,810 --> 00:51:05,100 exactly the same algorithm. 664 00:51:05,100 --> 00:51:09,670 But I'm going to generate twice as many points. 665 00:51:09,670 --> 00:51:14,600 But I'm only fitting it to the first half. 666 00:51:14,600 --> 00:51:24,990 So if I run this one, figure one is what 667 00:51:24,990 --> 00:51:26,910 we looked at before. 668 00:51:26,910 --> 00:51:29,900 The red line is fitting them a little better. 669 00:51:29,900 --> 00:51:33,460 But here's figure two. 670 00:51:33,460 --> 00:51:37,370 What happens when I extrapolate the curve to the 671 00:51:37,370 --> 00:51:39,500 new points? 672 00:51:39,500 --> 00:51:43,780 Well, you can see, it's a terrible fit. 673 00:51:43,780 --> 00:51:46,820 And you would expect that, because my data was basically 674 00:51:46,820 --> 00:51:52,440 linear, and I fit a non-linear curve to it. 675 00:51:52,440 --> 00:51:56,780 And if you look at it you can see that, OK, look at this, to 676 00:51:56,780 --> 00:51:59,790 get from here to here, it thought I had to take off 677 00:51:59,790 --> 00:52:02,540 pretty sharply. 678 00:52:02,540 --> 00:52:06,270 And so sure enough, as I get new points, the prediction 679 00:52:06,270 --> 00:52:11,240 will postulate that it's still going up, much more steeply 680 00:52:11,240 --> 00:52:14,300 than it really does. 681 00:52:14,300 --> 00:52:18,350 So you can see it's a terrible prediction. 682 00:52:18,350 --> 00:52:28,510 And that's because what I've done is, I over-fit the data.
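The extrapolation experiment can be sketched the same way: generate twice as many points from the same roughly linear process, fit only the first half, and compare the errors on the unseen second half (again with invented names and a fixed seed):

```python
import numpy as np

np.random.seed(0)

# Twice as many points from the same roughly linear process.
x_all = np.arange(14.0)
y_all = x_all + np.random.random(14)

# Fit only the first half.
x_fit, y_fit = x_all[:7], y_all[:7]
line = np.polyfit(x_fit, y_fit, 1)
quintic = np.polyfit(x_fit, y_fit, 5)

# Extrapolate both models to the second half and compare the errors.
x_new, y_new = x_all[7:], y_all[7:]
line_err = ((np.polyval(line, x_new) - y_new) ** 2).sum()
quintic_err = ((np.polyval(quintic, x_new) - y_new) ** 2).sum()
# The over-fit quintic takes off sharply past the fitted range,
# so its error on the new points dwarfs the line's.
```
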
683 00:52:28,510 --> 00:52:32,430 I've taken a very high degree polynomial, which has given me 684 00:52:32,430 --> 00:52:36,860 a good close fit, and I can always get a fit, by the way. 685 00:52:36,860 --> 00:52:40,020 If I choose a high enough degree polynomial, I can fit 686 00:52:40,020 --> 00:52:43,810 lots and lots of data sets. 687 00:52:43,810 --> 00:52:47,050 But I have reason to be very suspicious. 688 00:52:47,050 --> 00:52:49,950 The fact that I took a fifth order polynomial to get six 689 00:52:49,950 --> 00:52:57,720 points should make me very nervous. 690 00:52:57,720 --> 00:52:59,980 And it's a very important moral. 691 00:52:59,980 --> 00:53:01,920 Beware of over-fitting. 692 00:53:01,920 --> 00:53:08,790 If you have a very complex model, there's a good chance 693 00:53:08,790 --> 00:53:12,810 that it's over-fit. 694 00:53:12,810 --> 00:53:18,790 The larger moral is, beware of statistics without any theory. 695 00:53:18,790 --> 00:53:21,670 You're just cranking away, you get a great r squared, you say 696 00:53:21,670 --> 00:53:23,850 it's a beautiful fit. 697 00:53:23,850 --> 00:53:26,140 But there was no real theory there. 698 00:53:26,140 --> 00:53:29,280 You can always find a fit. 699 00:53:29,280 --> 00:53:32,340 As Disraeli is alleged to have said, there are three kinds of 700 00:53:32,340 --> 00:53:38,120 lies: lies, damned lies, and statistics. 701 00:53:38,120 --> 00:53:41,570 And we'll spend some more time when we get back from 702 00:53:41,570 --> 00:53:44,480 Thanksgiving looking at how to lie with statistics. 703 00:53:44,480 --> 00:53:46,580 Have a great holiday, everybody.