The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: I want to go back to where I stopped at the end of Tuesday's lecture, when you let me pull a fast one on you. I ended up with a strong statement that was effectively a lie. I told you that when we drop a large enough number of pins, and do a large enough number of trials, we can look at the small standard deviation we get across trials and say, that means we have a good answer. It doesn't change much. And I said, so we can tell you that with 95% confidence, the answer lies between x and y, where we had the two standard deviations from the mean.

That's not actually true. I was confusing the notion of a statistically sound conclusion with truth. The utility of every statistical test rests on certain assumptions. We talked about independence and things like that.
But the key assumption is that our simulation is actually a model of reality. You may recall that in designing the simulation, we looked at the Buffon-Laplace mathematics, did a little algebra from which we derived the code, wrote the code, ran the simulation, looked at the results, did the statistical analysis, and smiled.

Well, suppose I had made a coding error. For example, instead of that 4 there, which the algebra said we should have, I had mistakenly typed a 2. Not an impossible error. Now if we run it, what we're going to see is that it converges quite quickly, it gives me a small standard deviation, and I can feel very confident in my answer that pi is somewhere around 1.569.

Well, it isn't, of course. We know that that's nowhere close to the value of pi. But there's nothing wrong with my statistics. It's just that my statistics are about the simulation, not about pi itself.

So what's the moral here? Before believing the results of any simulation, we have to have confidence that our conceptual model is correct, and that we have correctly implemented that conceptual model.
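The simulation code itself isn't shown in the transcript, so as a sketch, here is a minimal Monte Carlo pi estimator (assuming the standard drop-points-in-a-square formulation rather than the full Buffon-Laplace needle apparatus; the function name is mine) that shows how a single mistyped constant yields a tight, convincing, and wrong answer:

```python
import random

def estimate_pi(num_pins, scale=4.0):
    """Estimate pi by dropping random points in the unit square.

    The fraction landing inside the quarter circle of radius 1
    approximates pi/4, so the algebra says to multiply by 4.
    Passing scale=2.0 reproduces the mistyped-constant bug: the
    statistics still converge nicely, but to roughly pi/2 (about 1.57).
    """
    in_circle = 0
    for _ in range(num_pins):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            in_circle += 1
    return scale * in_circle / num_pins
```

Run with enough pins, both versions produce a small standard deviation across trials; only checking the answer against reality reveals that one of them is nowhere near pi.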
How can we do that? Well, one thing we can do is test our results against reality. So if I ran this and I said pi is about 1.57, I could go draw a circle, crudely measure the circumference, and I would immediately know I'm nowhere close to the right answer. And that's the right thing to do. In fact, when a scientist uses a simulation model to derive something, they always run some experiments to see whether the derived result is at least plausibly correct. Statistics are good for showing that we've got the little details right at the end, but we've got to do a sanity check first.

So that's a really important moral to keep in mind. Don't get seduced by a statistical test and confuse that with truth.

All right, I now want to move on to look at some more examples that do the same kind of thing we've been doing. What we're going to be looking at is the interplay between physical reality (some physical system, just in the real world), theoretical models of the physical system, and computational models.
Because this is really the way modern science and engineering is done. We start with some physical situation. And by physical I don't mean it has to be bricks and mortar, or physics, or biology; the physical situation could be the stock market, if you will, some real situation in the world. We use some theory to give us some insight into that, and when the theory gets too complicated or doesn't get us all the way to the answer, we use computation. And I now want to talk about how those things relate to each other.

So imagine, for example, that you're a bright student in high school biology, chemistry, or physics, a situation probably all of you have been in. You perform some experiment to the best of your ability. But you've done the math, and you know your experimental results don't actually match the theory. What should you do?

Well, I suspect you've all been in this situation. You could just turn in the results and risk getting criticized for poor laboratory technique. Some of you may have done this.
More likely what you've done is you've calculated the correct results and turned those in, risking some suspicion that they're too good to be true. But being smart, I suspect what all of you did in high school is you calculated the correct results, looked at your experimental results, and met somewhere in between, introducing a little error but not looking too foolish.

Have any of you cheated that way in high school? Yeah, well, all right. We have about two people who would admit it. The rest of you are either exceedingly honorable, or just don't want to admit it. I confess, I had fudged experimental results in high school. But no longer; I've seen the truth.

All right, to do this correctly you need to have a sense of how best to model not only reality, but also experimental errors. Typically, the best way to model experimental errors (and we need to do this even when we're not attempting to cheat) is to assume some sort of random perturbation of the actual data.
And in fact, one of the key steps forward, which was really Gauss's big contribution, was to say we can typically model experimental error as normally distributed, as a Gaussian distribution.

So let's look at an example. Let's consider a spring. Not the current time of year, or a spring of water, but the kind of spring you looked at in 8.01. The things you compress with some force and they expand, or you stretch and they contract. Springs are great things. We use them in our cars, our mattresses, seat belts. We use them to launch projectiles, lots of things. And in fact, as we'll see later, they occur frequently in biology as well.

I don't want to belabor this; I presume you've all taken 8.01. Do they still do springs in 8.01? Yes, good, all right. So as you know, in 1676 (maybe you didn't know the date) the British physicist Robert Hooke formulated Hooke's Law to explain the behavior of springs. And the law is very simple: f = -kx.
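Gauss's error model is easy to play with in code. As a hypothetical sketch (the function name, true value, and noise level below are made up for illustration, not from the lecture), each measurement is the true value plus a draw from a normal distribution:

```python
import random

def simulate_measurements(true_value, num_trials, sigma):
    # Gauss's model of experimental error: each measurement is the
    # true value plus an error drawn from a normal distribution with
    # mean 0 and standard deviation sigma.
    return [true_value + random.gauss(0.0, sigma)
            for _ in range(num_trials)]
```

Because the errors are symmetric about zero, the mean of many such measurements converges to the true value, which is exactly why taking many measurements beats taking one.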
In other words, the force, f, stored in the spring is linearly related to x, the distance the spring has been either compressed or stretched.

OK, so that's Hooke's law; you've all seen that. The law holds true for a wide variety of materials and systems, including many biological systems. Of course, it does not hold for an arbitrarily large force. All springs have an elastic limit, and if you stretch them beyond that, the law fails. Has anyone here ever broken a Slinky that way? Where you've just taken the spring and stretched it so much that it's no longer useful? Well, you've exceeded its elastic limit.

The constant of proportionality here, k, is called the spring constant. And every spring has a constant, k, that explains its behavior. If the spring is stiff, like the suspension in an automobile, k is big. If the spring is not stiff, like the spring in a ballpoint pen, k is small. The negative sign is there to indicate that the force exerted by the spring is in the opposite direction of the displacement.
If you pull a spring down, the force exerted by the spring is going up. Knowing the spring constant of a spring is actually a matter of considerable practical importance. It's used to do things like calibrate the scales one can use to weigh oneself, if one wants to know the truth, and atomic force microscopes, lots of kinds of things. And in fact, recently people have started thinking that you should model DNA as a spring, and finding the spring constant for DNA turns out to be of considerable use in some biological experiments.

All right, so generations of students have learned to estimate spring constants using a very simple experiment. Probably most of you have done this. Let me get a picture up here, all right. What you do is you take a spring and hang it on some sort of apparatus, then you put a weight of known mass at the bottom of the spring, and you measure how much the spring has stretched.

You can then do the math. If f equals minus kx, we also have to know that f equals m times a, mass times acceleration.
We know that on this planet, at least, the acceleration due to gravity is roughly 9.81 meters per second per second, and we can just do the algebra and calculate k. So we hang one weight on the spring, we measure it, and we say we're done. We now know what k is for that spring. It's not so easy, of course, to do this experiment if the spring is a strand of DNA. You need a slightly more complicated apparatus to do that.

This would be all well and good if we didn't have experimental error, but we do. In any experiment we typically have errors. So what people do instead is, rather than hanging one weight on the spring, they hang different weights (weights of different mass), they wait for the spring to stop moving, they measure it, and now they have a series of points. And they assume that, well, I've got some errors, and if we believe that our errors are normally distributed, some will be positive and some will be negative. And if we do enough experiments, it will all kind of balance out, and we'll be able to actually get a good estimate of the spring constant, k.

I did such an experiment and put the results in a file.
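For the single-weight version, the algebra is just k = mg/x. A minimal sketch (the function name is mine, not the course's):

```python
GRAVITY = 9.81  # acceleration due to gravity, m/s^2

def spring_constant(mass, displacement):
    # Hooke's law gives |F| = k*x, and the hanging weight supplies
    # F = m*g, so k = m*g / x, in newtons per meter.
    return mass * GRAVITY / displacement

# With the first data point from the file described below (a 0.1 kg
# mass stretching the spring 0.0865 m), this gives k of roughly
# 11.3 N/m -- a single noisy measurement, which is exactly why the
# experiment is repeated with many different weights.
k = spring_constant(0.1, 0.0865)
```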
This is just the format of the file. The first line tells us what it is: a distance in meters and a mass in kilograms. And then I just have the two things separated by a space, in this case. So in my first experiment, the distance I measured was 0.0865 and the mass was 0.1 kilograms.

All right, so I've now got the data; that's the physical reality. I've done my experiment. I've done some theory telling me how to calculate k. And now I'm going to put them together and write some code. So let's look at the code.

I think we'll skip over this, and I'll comment this out, so we don't see pi get estimated over and over again. The first piece of code is pretty simple; it's just getting the data. And again, this is typically the way one ought to structure these things. I/O, input/output, is typically done in a separate function, so that if the format of the data were changed, I'd only have to change this, and not the rest of my computation. It opens the file, discards the header, and then uses a split to get the x values and the y values.
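The course's actual code isn't reproduced in the transcript, but a getData along the lines just described (a separate I/O function that discards the header and splits each line) might look like this sketch:

```python
def get_data(file_name):
    """Read measurements from a file whose first line is a header and
    whose remaining lines are 'distance mass' pairs separated by a
    space.  Returns the distances and masses as two lists of floats.
    """
    distances, masses = [], []
    with open(file_name) as data_file:
        data_file.readline()  # discard the header line
        for line in data_file:
            d, m = line.split()
            distances.append(float(d))
            masses.append(float(m))
    return distances, masses
```

Isolating the I/O this way means that a change in file format touches only this function, not the rest of the computation.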
So now I just get all the distances and all the masses (not the x's and the y's yet, just distances and masses). Then I close the file and return them. Nothing that you haven't seen before, and nothing that you won't get to write again, and again, and again.

Then I plot the data. Here we see something that's a little bit different from what we've seen before. The first thing I do is get my x and y by calling getData. Then I do a type conversion. What getData is returning is a list. I'm here going to convert a list to another type called an array. This is a type implemented by a class supplied by PyLab, which is built on top of something called NumPy; that's where it comes from.

An array is kind of like a list. It's a sequence of things. There are some list methods that are not available, like append, but arrays have some other things that are extremely valuable. For example, I can do point-wise operations on an array. So if I multiply an array by 3, what that does is it multiplies each element by 3.
If I multiply one array by another, it does the point-wise products, multiplying the two arrays element by element. OK, so they're very valuable for these kinds of things. Typically, in Python, one starts with a list, because lists are more convenient to build up incrementally than arrays, and then converts it to an array so that you can do the math on it.

For those of you who've seen MATLAB, you're very familiar with the concept of what PyLab calls an array. For those of you who know C or Pascal, what those languages call an array has nothing to do with what Python or PyLab calls an array. So it can be a little bit confusing.

At any rate, I convert them to arrays. And then, now that I have an array, I'll multiply my x values by the acceleration due to gravity, this constant 9.81. And then I'm just going to plot them.

All right, so let's see what we get here. So here I've now plotted the measured displacement of the spring. Force is in newtons; that's the standard international unit for measuring force. It's the amount of force needed to accelerate a mass of 1 kilogram at a rate of 1 meter per second per second.
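Here is a small sketch of those point-wise operations, using NumPy directly (which is where PyLab's arrays come from); the sample values are made up:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

tripled = 3 * a    # scalar times array: every element multiplied by 3
products = a * b   # array times array: element-by-element products

# The same idiom converts the masses (kg) into forces (newtons):
masses = np.array([0.1, 0.15, 0.2])
forces = masses * 9.81
```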
So I've plotted the force in newtons against the distance in meters. OK. Now I can go and calculate k. Well, how am I going to do that? Before I do that, I'm going to do something to see whether or not my data is sensible.

What we often do is we have a theoretical model, and the model here is that the data should fall on a line, roughly speaking, modulo experimental errors. I'm going to now find out what that line is. Because if I know that line, I can compute k. How does k relate to that line? I plot a line, and then I can look at the slope of that line, how quickly it's changing. And k will be simply the inverse of that.

How do I get the line? Well, I'm going to find a line that is the best approximation to the points I have. So if, for example, I have two points, a point here and a point here, I know I can, quote, fit a line to those points, and it will always be perfect. It will be a line. So this is what's called a fit.
Now if I have a bunch of points sort of scattered around, I then have to figure out, OK, what line is the closest to those points? What fits them the best? And I might say, OK, it's a line like this. But in order to fit a line to more than two points, I need some measure of the goodness of the fit. Because what I want to choose here is the best fit. What line is the best approximation of the data I've actually got?

In order to do that, I need some objective function that tells me how good a particular fit is. It lets me compare two fits so that I can choose the better one.

OK, now we have to ask, what should that objective function be? There are lots of possibilities. One could say, all right, let's find the line that goes through the most points, that actually touches the most points. The problem with that is it's really hard, may be totally irrelevant, and in fact you may not find a line that touches more than one point. So we need something different.
And there is a standard measure that's typically used, and that's called the least squares fit. That's the objective function that's almost always used in measuring how well any curve fits a set of points.

What it looks like is the sum, from i = 0 to i = len(observed) - 1 (just because of the way things will work in Python), of (observed[i] - predicted[i])². And since we're looking for the least squares fit, we want to minimize that: the smallest difference we can get.

So there are some things to notice about this. Once we have a, quote, fit, in this case a line, for every x value the fit predicts a y value. Right? That's what our model does. Our model in this case will take the independent variable, x, the mass, and predict the dependent variable, the displacement. But in addition to the predicted values, we have the observed values, these guys.
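Written as code, that objective function is just a few lines (a sketch; the helper name is mine):

```python
def sum_of_squared_errors(observed, predicted):
    # The least squares objective: the sum, over every point, of the
    # squared difference between the observed value and the value the
    # fit predicts.  Squaring discards the sign, so points above and
    # below the line count equally; smaller totals mean better fits.
    total = 0.0
    for i in range(len(observed)):
        total += (observed[i] - predicted[i]) ** 2
    return total
```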
And now we just measure the difference between the predicted and the observed, and square it. Notice that by squaring the difference we have discarded whether it's above or below the line, because we don't care; we just care how far it is from the line. Then we sum all of those up, and the smaller we can make that, the better our fit is. Makes sense?

So now how do we find the best fit? Well, there are several different methods you could use. You can actually do this using Newton's method. Under many conditions there are analytical solutions, so you don't have to use approximation; you can just compute it. And the best news of all: it's built into PyLab. So that's how you actually do it. You call the PyLab function that does it for you. That function is called polyfit.

Polyfit takes three arguments: all of the observed x values, all of the observed y values, and the degree of the polynomial. So I've been talking about fitting lines. As we'll see, polyfit can be used to fit polynomials of arbitrary degree to data.
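A quick sanity check of that interface, using np.polyfit (the same function PyLab re-exports) on made-up points that lie exactly on a line:

```python
import numpy as np

x_vals = np.array([0.0, 1.0, 2.0, 3.0])
y_vals = 2.0 * x_vals + 1.0   # points lying exactly on y = 2x + 1

# Degree 1 asks polyfit for the best-fit line; the coefficients come
# back highest power first, here the slope and then the intercept.
a, b = np.polyfit(x_vals, y_vals, 1)
```

Since the points are exactly collinear, the fit recovers a ≈ 2 and b ≈ 1.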
So you can fit a line, you can fit a parabola, you can fit a cubic. I don't know what it's called, but you can fit a 10th order polynomial, whatever you choose. And then it returns some values. So if we think about it being a line, we know that it's defined by the y value being equal to ax + b: some constant times the x value, plus b, the y-intercept.

So now let's look at it. We see here, in fitData, that I've gotten my values as before, and now I'm going to say a, b = pylab.polyfit(xVals, yVals, 1). Since I'm looking for a line, the degree is 1. Once I've got that, I can then compute the estimated y values: a times pylab.array of the x values, plus b. (I'm turning the x values into an array; actually I didn't need to do that since I'd already done it, but that's okay.) And now I'll plot it. And, by the way, now that I've got my line, I can also compute k.

And let's see what we get. All right, I fit a line, I've got a linear fit, and it says my spring constant k is 21 point... I've rounded it to 5 digits just so it would fit nicely on my plot. OK.
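The lecture's data file isn't included in the transcript, so this sketch uses synthetic measurements with a made-up spring constant, but it follows the same recipe: fit distance against force with polyfit, and recover k as the inverse of the slope:

```python
import numpy as np

rng = np.random.default_rng(0)
true_k = 21.5                          # hypothetical spring constant, N/m
forces = np.linspace(1.0, 10.0, 20)    # applied forces, newtons
# Hooke's law says x = F/k; add Gaussian measurement noise in meters.
distances = forces / true_k + rng.normal(0.0, 0.002, forces.size)

a, b = np.polyfit(forces, distances, 1)  # fit distance = a*force + b
est_distances = a * forces + b           # the fitted line's predictions
k = 1.0 / a                              # the slope is 1/k, so k = 1/slope
```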
425 00:28:48,170 --> 00:28:54,230 The method that's used to do this in PyLab is called 426 00:28:54,230 --> 00:28:55,480 linear regression. 427 00:29:00,880 --> 00:29:03,200 Now you might think it's called linear regression 428 00:29:03,200 --> 00:29:07,760 because I just used it to find a line, but in fact that's not 429 00:29:07,760 --> 00:29:10,010 why it's called linear regression, 430 00:29:10,010 --> 00:29:12,870 because we can use linear regression to find a parabola, 431 00:29:12,870 --> 00:29:17,500 or a cubic, or anything else. 432 00:29:17,500 --> 00:29:23,680 The reason it's called linear-- well, let's look at an example. 433 00:29:23,680 --> 00:29:26,546 So if I wanted a parabola, I would have y equals ax-squared 434 00:29:26,546 --> 00:29:27,796 plus bx plus c. 435 00:29:35,070 --> 00:29:41,340 We think of the variables, the independent variables, as 436 00:29:41,340 --> 00:29:43,515 x-squared and x. 437 00:29:46,300 --> 00:29:54,720 And y is indeed a linear function of those variables, 438 00:29:54,720 --> 00:29:56,165 because we're adding terms. 439 00:29:58,710 --> 00:30:01,380 It's not important that you understand the details; it is 440 00:30:01,380 --> 00:30:03,970 important that you know that linear regression can be used 441 00:30:03,970 --> 00:30:06,615 to find polynomials other than lines. 442 00:30:12,160 --> 00:30:17,990 All right, so we got this done. 443 00:30:17,990 --> 00:30:21,510 Should we be happy? 444 00:30:21,510 --> 00:30:24,820 We can look at this, we fit the best line to these data 445 00:30:24,820 --> 00:30:28,735 points, we computed k, are we done? 446 00:30:34,970 --> 00:30:39,020 Well I'm kind of concerned, because when I look at my 447 00:30:39,020 --> 00:30:47,670 picture it is the best line I can fit to this, but wow, it's 448 00:30:47,670 --> 00:30:50,750 not a very good fit in some sense, right? 449 00:30:50,750 --> 00:30:53,730 I look at that line, the points are pretty 450 00:30:53,730 --> 00:30:55,890 far away from it.
451 00:30:55,890 --> 00:30:58,640 And if it's not a good fit, then I have to be suspicious 452 00:30:58,640 --> 00:31:03,380 about my value of k, which is derived from the model 453 00:31:03,380 --> 00:31:05,820 I get by doing this fit. 454 00:31:05,820 --> 00:31:08,645 Well, all right, let's try something else. 455 00:31:11,330 --> 00:31:20,340 Let's look at fitData1, where in addition to doing a linear 456 00:31:20,340 --> 00:31:22,580 fit, I'm going to fit a cubic-- 457 00:31:25,280 --> 00:31:27,310 partly to show you how to do it. 458 00:31:27,310 --> 00:31:32,980 Here I'm going to say abcd equals pyLab.polyfit of xVals, 459 00:31:32,980 --> 00:31:36,320 yVals, and 3 instead of 1. 460 00:31:36,320 --> 00:31:39,160 So it's a more complex function. 461 00:31:39,160 --> 00:31:43,020 Let's see what that gives us. 462 00:31:43,020 --> 00:31:45,250 First let me comment that out. 463 00:31:48,960 --> 00:31:53,250 So we're going to now compare visually what we get with 464 00:31:53,250 --> 00:31:57,310 a line fit versus a cubic fit to the same data. 465 00:32:04,680 --> 00:32:10,660 Well, it looks to me like a cubic is a much better 466 00:32:10,660 --> 00:32:14,010 description of the data, a much better model of the data, 467 00:32:14,010 --> 00:32:15,260 than a line. 468 00:32:19,960 --> 00:32:21,990 Pretty good. 469 00:32:21,990 --> 00:32:23,970 Well, should I be happy with this? 470 00:32:27,260 --> 00:32:29,880 Well, let's ask ourselves one question: why are we 471 00:32:29,880 --> 00:32:31,940 building the model? 472 00:32:31,940 --> 00:32:35,480 We're building the model so that we can better understand 473 00:32:35,480 --> 00:32:37,180 the spring. 474 00:32:37,180 --> 00:32:40,920 One of the things we often do with models is use them to 475 00:32:40,920 --> 00:32:43,780 predict values for experiments that we have not been able 476 00:32:43,780 --> 00:32:46,250 to run.
477 00:32:46,250 --> 00:32:48,970 So, for example, if you're building a model of a nuclear 478 00:32:48,970 --> 00:32:52,970 reactor you might want to know what happens when the power is 479 00:32:52,970 --> 00:32:56,500 turned off for some period of time. 480 00:32:56,500 --> 00:32:59,000 In fact, if you read today's paper you noticed they've just 481 00:32:59,000 --> 00:33:01,800 done a simulation model of a nuclear reactor, in, I think, 482 00:33:01,800 --> 00:33:05,720 Tennessee, and discovered that if it lost power for more than 483 00:33:05,720 --> 00:33:08,120 two days, it would start to look like the 484 00:33:08,120 --> 00:33:11,080 nuclear reactors in Japan. 485 00:33:11,080 --> 00:33:13,010 Not a very good thing. 486 00:33:13,010 --> 00:33:14,710 But of course, that's not an experiment 487 00:33:14,710 --> 00:33:17,230 anyone wants to run. 488 00:33:17,230 --> 00:33:19,770 No one wants to blow up this nuclear reactor just to see 489 00:33:19,770 --> 00:33:21,190 what happens. 490 00:33:21,190 --> 00:33:25,770 So they do use a simulation model to predict what would 491 00:33:25,770 --> 00:33:28,840 happen in an experiment you can't run. 492 00:33:28,840 --> 00:33:33,380 So let's use our model here to do some predictions. 493 00:33:40,730 --> 00:33:44,350 So here I've taken the same program, I've called it 494 00:33:44,350 --> 00:33:49,720 FitData2, but what I've done is I've added a point. 495 00:33:49,720 --> 00:33:54,350 So instead of just looking at the x values, I'm looking at 496 00:33:54,350 --> 00:34:00,220 something I'm calling extended x, where I've added a weight 497 00:34:00,220 --> 00:34:06,370 of 1 and a 1/2 kilos to the spring just to see what would 498 00:34:06,370 --> 00:34:11,110 happen, what the model would predict. 499 00:34:11,110 --> 00:34:13,940 And other than that, everything is the same. 500 00:34:26,838 --> 00:34:29,230 Oops, what's happened here? 
501 00:34:37,560 --> 00:34:39,710 Probably shouldn't be computing k here with a 502 00:34:39,710 --> 00:34:40,960 non-linear model. 503 00:34:45,250 --> 00:34:48,969 All right, why is it not? 504 00:34:48,969 --> 00:34:51,670 Come on, there it is. 505 00:34:51,670 --> 00:34:56,169 And now we have to uncomment this, uncomment this. 506 00:35:04,470 --> 00:35:09,990 Well, it fit the existing data pretty darn well, but it has a 507 00:35:09,990 --> 00:35:13,180 very strange prediction here. 508 00:35:13,180 --> 00:35:15,640 If you think about our experiment, it's predicting 509 00:35:15,640 --> 00:35:20,010 not only that the spring stopped stretching, but that 510 00:35:20,010 --> 00:35:23,810 it goes back up above where it started. 511 00:35:23,810 --> 00:35:27,150 Highly unlikely in a physical world. 512 00:35:27,150 --> 00:35:33,570 So what we see here is that while I can easily fit a curve 513 00:35:33,570 --> 00:35:38,430 to the data-- and it fits beautifully-- it turns out to 514 00:35:38,430 --> 00:35:40,025 have very bad predictive value. 515 00:35:43,470 --> 00:35:45,460 What's going on here? 516 00:35:45,460 --> 00:35:51,130 Well, I started this whole endeavor under an assumption 517 00:35:51,130 --> 00:35:54,930 that there was some theory about springs, Hooke's law, 518 00:35:54,930 --> 00:35:58,260 and that it should be a linear model. 519 00:35:58,260 --> 00:36:02,620 Just because my data maybe didn't fit that theory 520 00:36:02,620 --> 00:36:05,700 doesn't mean I should just fit an arbitrary curve and see 521 00:36:05,700 --> 00:36:06,950 what happens. 522 00:36:08,840 --> 00:36:12,780 It is the case that if you're willing to use a high enough 523 00:36:12,780 --> 00:36:15,070 degree polynomial, you can get a pretty good fit 524 00:36:15,070 --> 00:36:17,690 to almost any data. 525 00:36:17,690 --> 00:36:19,920 But that doesn't prove anything. 526 00:36:19,920 --> 00:36:21,170 It's not useful.
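Those three points -- the cubic fits more tightly than the line, extrapolates strangely, and a high enough degree fits anything -- can be sketched in a few lines. The arrays, the extra x value, and the names are all illustrative, not the course's fitData2 file:

```python
import numpy as np

# Invented spring data that flattens out at the end.
x_vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_vals = np.array([1.1, 2.0, 2.9, 3.5, 3.7, 3.8])

def sse(observed, predicted):
    """Sum of squared errors between observed and predicted values."""
    return float(((observed - predicted) ** 2).sum())

# A cubic hugs this data more tightly than a line does...
line_fit = np.polyval(np.polyfit(x_vals, y_vals, 1), x_vals)
cubic_coeffs = np.polyfit(x_vals, y_vals, 3)
cubic_fit = np.polyval(cubic_coeffs, x_vals)

# ...but evaluate the cubic at an x we never measured (the extra
# weight in the lecture) and the prediction can swing off wildly.
extended_x = np.append(x_vals, 9.0)
predictions = np.polyval(cubic_coeffs, extended_x)

# And with degree len(x_vals) - 1 the fit is essentially exact --
# which proves nothing about the spring.
exact_coeffs = np.polyfit(x_vals, y_vals, len(x_vals) - 1)
residual = sse(y_vals, np.polyval(exact_coeffs, x_vals))
```

The degree-5 residual through six points is numerically zero, yet that polynomial is the least trustworthy of the three outside the measured range.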
527 00:36:23,920 --> 00:36:26,990 It's one of the reasons why, when I read papers, I always 528 00:36:26,990 --> 00:36:29,550 like to see the raw data. 529 00:36:29,550 --> 00:36:31,910 I hate it when I read a technical paper and it just 530 00:36:31,910 --> 00:36:34,600 shows me the curve that they fit to the data, rather than 531 00:36:34,600 --> 00:36:42,950 the data, because it's easy to get to the wrong place here. 532 00:36:42,950 --> 00:36:49,160 So let's for the moment ignore the curves and 533 00:36:49,160 --> 00:36:51,930 look at the raw data. 534 00:36:51,930 --> 00:36:54,970 What do we see here about the raw data? 535 00:36:54,970 --> 00:37:02,110 Well, it looks like at the end it's flattening out. 536 00:37:02,110 --> 00:37:06,870 Well, that violates Hooke's law, which says I should have 537 00:37:06,870 --> 00:37:09,170 a linear relationship. 538 00:37:09,170 --> 00:37:12,660 Suddenly it stopped being linear. 539 00:37:12,660 --> 00:37:14,590 Have we violated Hooke's law? 540 00:37:18,520 --> 00:37:21,070 Have I done something so strange that maybe I should 541 00:37:21,070 --> 00:37:24,190 just give up on this experiment? 542 00:37:24,190 --> 00:37:25,420 What's the deal here? 543 00:37:25,420 --> 00:37:28,950 So, does this data contradict Hooke's law? 544 00:37:28,950 --> 00:37:30,930 Let me ask that question. 545 00:37:30,930 --> 00:37:32,180 Yes or no? 546 00:37:34,070 --> 00:37:35,320 Who says no? 547 00:37:37,550 --> 00:37:41,711 AUDIENCE: Hooke's law applies only for small displacements. 548 00:37:41,711 --> 00:37:44,110 PROFESSOR: Well, not necessarily small. 549 00:37:44,110 --> 00:37:46,875 But only up to an elastic limit. 550 00:37:46,875 --> 00:37:48,767 AUDIENCE: Which is in the scheme of infinitely small. 551 00:37:48,767 --> 00:37:51,505 PROFESSOR: Compared to infinity [INAUDIBLE]. 552 00:37:51,505 --> 00:37:54,135 AUDIENCE: Yes, sorry, up to the limit where the linearity 553 00:37:54,135 --> 00:37:54,460 breaks down.
554 00:37:54,460 --> 00:37:58,140 PROFESSOR: Exactly right. 555 00:37:58,140 --> 00:38:00,762 Oh, I overthrew my hand here. 556 00:38:00,762 --> 00:38:02,654 AUDIENCE: I'll get it. 557 00:38:02,654 --> 00:38:06,290 PROFESSOR: Pick it up on your way out. 558 00:38:06,290 --> 00:38:07,310 Exactly, it doesn't. 559 00:38:07,310 --> 00:38:10,880 It just says, probably I exceeded the elastic limit of 560 00:38:10,880 --> 00:38:13,890 my spring in this experiment. 561 00:38:13,890 --> 00:38:21,920 Well now, let's go back to our original 562 00:38:21,920 --> 00:38:42,330 code and see what happens if I discard the last six points, 563 00:38:42,330 --> 00:38:43,420 where it's flattened out. 564 00:38:43,420 --> 00:38:46,900 The points that seem to be where I've exceeded the limit. 565 00:38:46,900 --> 00:38:48,315 So I can easily do that 566 00:38:51,640 --> 00:38:52,895 with this little coding hack. 567 00:38:56,210 --> 00:38:58,520 It's so much easier to do experiments with code than 568 00:38:58,520 --> 00:39:01,810 with physical objects. 569 00:39:01,810 --> 00:39:03,060 Now let's see what we get. 570 00:39:19,820 --> 00:39:22,920 Well, we get something that's visually a much better fit. 571 00:39:26,620 --> 00:39:28,695 And we get a very different value of k. 572 00:39:32,630 --> 00:39:35,760 So we're a lot happier here. 573 00:39:35,760 --> 00:39:38,810 And if I fit a cubic to this you would find that the cubic and 574 00:39:38,810 --> 00:39:43,940 the line actually look a lot alike. 575 00:39:43,940 --> 00:39:50,220 So this is a good thing, I guess. 576 00:39:50,220 --> 00:39:57,520 On the other hand, how do we know which line is a better 577 00:39:57,520 --> 00:40:03,180 representation of physical reality, a better model?
578 00:40:03,180 --> 00:40:09,240 After all, I could delete all the points except any two and 579 00:40:09,240 --> 00:40:12,100 then I would get a line that was a perfect fit -- the mean 580 00:40:12,100 --> 00:40:17,110 squared error would be 0, right? 581 00:40:17,110 --> 00:40:19,350 Because you can fit a line to any two points. 582 00:40:23,890 --> 00:40:26,340 So again, we're seeing that we have a question here that 583 00:40:26,340 --> 00:40:29,240 can't be answered by statistics. 584 00:40:29,240 --> 00:40:33,120 It's not just a question of how good my fit is. 585 00:40:33,120 --> 00:40:37,600 I have to go back to the theory. 586 00:40:37,600 --> 00:40:43,820 And what my theory tells me is that it should be linear, and 587 00:40:43,820 --> 00:40:46,800 I have a theoretical justification for discarding 588 00:40:46,800 --> 00:40:49,060 those last six points. 589 00:40:49,060 --> 00:40:51,350 It's plausible that I exceeded the limit. 590 00:40:54,400 --> 00:40:57,960 I don't have a theoretical justification for deleting six 591 00:40:57,960 --> 00:41:00,750 arbitrary points somewhere in the middle that I didn't 592 00:41:00,750 --> 00:41:04,550 happen to like because they didn't fit the line. 593 00:41:04,550 --> 00:41:10,040 So again, the theme that I'm getting to is this interplay 594 00:41:10,040 --> 00:41:12,650 between physical reality-- 595 00:41:12,650 --> 00:41:14,300 in this case the experiment-- 596 00:41:14,300 --> 00:41:17,390 the theoretical model-- in this case Hooke's law-- 597 00:41:17,390 --> 00:41:21,360 and my computational model-- the line I fit to the 598 00:41:21,360 --> 00:41:24,820 experimental data. 599 00:41:24,820 --> 00:41:29,910 OK, let's continue down this path and I want to look at 600 00:41:29,910 --> 00:41:33,710 another experiment, also with a spring, but this is a 601 00:41:33,710 --> 00:41:36,080 different spring.
602 00:41:36,080 --> 00:41:38,520 Maybe I'll bring in that spring in the next lecture and 603 00:41:38,520 --> 00:41:39,770 show it to you. 604 00:41:39,770 --> 00:41:41,260 This spring is a bow and arrow. 605 00:41:41,260 --> 00:41:44,120 Actually the bow is the spring. 606 00:41:44,120 --> 00:41:47,200 Anyone here ever shot a bow and arrow? 607 00:41:47,200 --> 00:41:51,260 Well what you know is the bow has the limbs in it. 608 00:41:51,260 --> 00:41:55,630 And when you pull back the string, you are putting force 609 00:41:55,630 --> 00:41:58,750 in the limbs, which are essentially a spring. 610 00:41:58,750 --> 00:42:02,560 And when you release the spring goes back to the place 611 00:42:02,560 --> 00:42:07,545 it wants to be and fires the projectile on some trajectory. 612 00:42:12,760 --> 00:42:18,690 I now am interested in looking at the trajectory followed by 613 00:42:18,690 --> 00:42:20,840 such a projectile. 614 00:42:20,840 --> 00:42:26,410 This, by the way, is where a lot of this math came from. 615 00:42:26,410 --> 00:42:29,200 People were looking at projectiles, not typically of 616 00:42:29,200 --> 00:42:33,390 bows, but of artillery shells, where the force there was the 617 00:42:33,390 --> 00:42:37,710 force of some chemical reaction. 618 00:42:37,710 --> 00:42:40,680 OK, so once again I've got some data. 619 00:42:50,250 --> 00:42:54,880 In a file, similar kind of format. 620 00:42:54,880 --> 00:42:58,160 And I'm going to read that data in and plot it. 621 00:42:58,160 --> 00:42:59,460 So let's do that. 622 00:43:10,040 --> 00:43:14,310 So I'm going to get my trajectory data. 623 00:43:14,310 --> 00:43:18,540 The way I did this, by the way, is I actually did this 624 00:43:18,540 --> 00:43:19,120 experiment. 625 00:43:19,120 --> 00:43:25,580 I fired four arrows from different distances and 626 00:43:25,580 --> 00:43:29,980 measured the mean height of the four. 627 00:43:29,980 --> 00:43:34,720 So I'm getting at heights 1, 2, 3, and 4. 
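The averaging step described above can be sketched like this. The distances, heights, and names are invented for illustration; the course's file format may differ:

```python
import numpy as np

# Four measured arrow heights per launch distance: each row is one
# distance from the launch point, each column one of the four arrows.
trials = np.array([
    [0.0,  0.1,  0.0,  0.1],   # at the launch point
    [6.4,  6.6,  6.5,  6.3],
    [10.0, 10.3, 10.2, 10.1],
])

# Reduce the four trials at each distance to a single mean height.
mean_heights = trials.mean(axis=1)
```

It is these per-distance means, one number per launch distance, that get fit in what follows.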
628 00:43:34,720 --> 00:43:36,140 Again, don't worry about this. 629 00:43:36,140 --> 00:43:38,640 And then I'm going to try some fits. 630 00:43:38,640 --> 00:43:40,000 And let's see what we get here. 631 00:43:57,770 --> 00:44:05,160 So I got my data: inches from launch point, and inches above 632 00:44:05,160 --> 00:44:06,410 launch point. 633 00:44:08,950 --> 00:44:11,073 And then I fit a line to it. 634 00:44:11,073 --> 00:44:13,600 And you can see there's a little point way down here in 635 00:44:13,600 --> 00:44:16,480 the corner. 636 00:44:16,480 --> 00:44:19,690 The launch point and the target were actually at the 637 00:44:19,690 --> 00:44:22,450 same height for this experiment. 638 00:44:22,450 --> 00:44:26,480 And not surprisingly, the bow was angled up, I guess, the 639 00:44:26,480 --> 00:44:28,710 arrow went up, and then it came down, and 640 00:44:28,710 --> 00:44:31,010 ended up in the target. 641 00:44:31,010 --> 00:44:32,580 I fit a line to it. 642 00:44:32,580 --> 00:44:35,890 That's the best line I can fit to these points. 643 00:44:35,890 --> 00:44:40,300 Well, it's not really good. 644 00:44:40,300 --> 00:44:45,390 So let's pretend I didn't know anything about projectiles. 645 00:44:45,390 --> 00:44:52,020 I can now use computation to try and understand the theory. 646 00:44:52,020 --> 00:44:53,570 Assume I didn't know the theory. 647 00:44:53,570 --> 00:44:56,770 And what the computation 648 00:44:56,770 --> 00:45:00,100 tells me is that the theory that the arrow travels in a straight 649 00:45:00,100 --> 00:45:01,440 line is not a very good one. 650 00:45:04,240 --> 00:45:08,150 All right, this does not actually conform at all to the 651 00:45:08,150 --> 00:45:12,440 data, so I probably should reject this theory that says the 652 00:45:12,440 --> 00:45:14,870 arrow goes straight.
653 00:45:14,870 --> 00:45:17,120 If you looked at the arrows, by the way, in a short 654 00:45:17,120 --> 00:45:19,310 distance it would kind of look to your eyes like it was 655 00:45:19,310 --> 00:45:21,340 actually going straight. 656 00:45:21,340 --> 00:45:25,520 But in fact, physics tells us it can't and the model tells 657 00:45:25,520 --> 00:45:27,670 us it didn't. 658 00:45:27,670 --> 00:45:29,150 All right let's try a different one. 659 00:45:32,620 --> 00:45:36,970 Let's compare the linear fit to a quadratic fit. 660 00:45:36,970 --> 00:45:39,985 So now I'm using polyfit with a degree of 2. 661 00:45:44,530 --> 00:45:45,780 See what we get here. 662 00:45:48,100 --> 00:45:52,770 Well our eyes tell us it's not a perfect fit, but it's a lot 663 00:45:52,770 --> 00:45:56,430 better fit, right. 664 00:45:56,430 --> 00:46:00,470 So this is suggesting that maybe the arrow is traveling 665 00:46:00,470 --> 00:46:02,365 in a parabola, rather than a straight line. 666 00:46:06,840 --> 00:46:10,770 The next question is, our eyes tell us it's better. 667 00:46:10,770 --> 00:46:13,420 How much better? 668 00:46:13,420 --> 00:46:17,570 How do we go about measuring which fit is better? 669 00:46:21,330 --> 00:46:25,370 Recall that we started by saying what polyfit is doing 670 00:46:25,370 --> 00:46:29,230 is minimizing the mean square error. 671 00:46:29,230 --> 00:46:32,090 So one way to compare two fits would be to say what's the 672 00:46:32,090 --> 00:46:34,600 mean square error of the line? 673 00:46:34,600 --> 00:46:37,570 What's the mean square error of the parabola? 674 00:46:37,570 --> 00:46:39,860 Well, pretty clear it's going to be 675 00:46:39,860 --> 00:46:42,470 smaller for the parabola. 676 00:46:42,470 --> 00:46:46,930 So that would tell us OK it is a better fit. 677 00:46:46,930 --> 00:46:52,790 And in fact computing the mean square error is a good way to 678 00:46:52,790 --> 00:46:57,380 compare the fit of two different curves. 
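Comparing the two fits by mean squared error, as just described, looks like this. The trajectory-shaped data is invented for illustration:

```python
import numpy as np

# Invented trajectory data: distance from the launch point vs. height.
distances = np.array([0.0, 20.0, 40.0, 60.0, 80.0, 100.0])
heights = np.array([0.0, 6.5, 10.2, 10.1, 6.4, 0.2])

def mean_squared_error(observed, predicted):
    """Average squared difference -- the quantity polyfit minimizes."""
    return float(((observed - predicted) ** 2).mean())

# Evaluate a degree-1 fit (line) and a degree-2 fit (parabola)
# at the observed distances.
line = np.polyval(np.polyfit(distances, heights, 1), distances)
parabola = np.polyval(np.polyfit(distances, heights, 2), distances)

mse_line = mean_squared_error(heights, line)
mse_parabola = mean_squared_error(heights, parabola)
```

For arc-shaped data like this, `mse_parabola` comes out much smaller than `mse_line`, confirming what the eye sees.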
679 00:46:57,380 --> 00:47:02,810 On the other hand, it's not particularly useful for 680 00:47:02,810 --> 00:47:07,720 telling us the goodness of the fit in absolute terms. 681 00:47:07,720 --> 00:47:10,220 So I can tell you that the parabola is better than the 682 00:47:10,220 --> 00:47:15,610 line, but in some sense mean square error can't be used to 683 00:47:15,610 --> 00:47:19,075 tell me how good it is in an absolute sense. 684 00:47:21,880 --> 00:47:24,070 Why is that so? 685 00:47:24,070 --> 00:47:27,610 It's because mean square error 686 00:47:27,610 --> 00:47:31,400 has a lower bound of 0, but no upper bound. 687 00:47:34,950 --> 00:47:37,920 It can go arbitrarily high. 688 00:47:37,920 --> 00:47:41,250 And that is not so good when we're trying 689 00:47:41,250 --> 00:47:45,160 to measure things. 690 00:47:45,160 --> 00:47:48,880 So instead, what we typically use is something called the 691 00:47:48,880 --> 00:47:50,215 coefficient of determination. 692 00:48:09,450 --> 00:48:11,620 Usually written, for reasons you'll see 693 00:48:11,620 --> 00:48:12,970 shortly, as R-squared. 694 00:48:18,720 --> 00:48:22,940 So the coefficient of determination, R-squared, is 695 00:48:22,940 --> 00:48:36,100 equal to 1 minus the estimated error, EE, over MV, the 696 00:48:36,100 --> 00:48:39,570 variance in the measured data. 697 00:48:39,570 --> 00:48:43,200 So we're comparing the ratio of the estimated error, our 698 00:48:43,200 --> 00:48:47,860 best estimate of the error, and a measurement of how 699 00:48:47,860 --> 00:48:50,970 variable the data is to start with. 700 00:48:58,440 --> 00:49:03,010 As we'll see, this value is always less than 701 00:49:03,010 --> 00:49:06,650 or equal to 1, and therefore R-squared is always going to 702 00:49:06,650 --> 00:49:10,260 be between 0 and 1.
703 00:49:10,260 --> 00:49:13,930 Which gives us a nice way of thinking about it in an 704 00:49:13,930 --> 00:49:16,980 absolute sense. 705 00:49:16,980 --> 00:49:20,920 All right, so where are these values? 706 00:49:20,920 --> 00:49:22,920 How do we compute them? 707 00:49:22,920 --> 00:49:27,570 Well, I'm going to explain it the easiest way I know, which 708 00:49:27,570 --> 00:49:29,195 is by showing you the code. 709 00:49:33,100 --> 00:49:37,450 So I have the measured values and the estimated values. 710 00:49:37,450 --> 00:49:43,550 The estimated error is going to be-- 711 00:49:43,550 --> 00:49:49,240 I take the estimated value, the value given me by the model, 712 00:49:49,240 --> 00:49:51,710 subtract the measured value, and square it, and 713 00:49:51,710 --> 00:49:52,960 then I just sum them. 714 00:49:55,940 --> 00:49:58,410 All right, this is like what we looked at for the mean 715 00:49:58,410 --> 00:50:01,960 square error, but I'm not computing the mean, right? 716 00:50:01,960 --> 00:50:06,400 I'm getting the total of the estimated errors. 717 00:50:06,400 --> 00:50:10,910 I can then get the measured mean, which is the measured 718 00:50:10,910 --> 00:50:15,760 sum divided by the number of measurements. 719 00:50:15,760 --> 00:50:19,060 That gives me the mean of the measured data. 720 00:50:19,060 --> 00:50:22,740 And then my measured variance is going to be the mean of the 721 00:50:22,740 --> 00:50:30,480 measured data minus each point of the measured data, squared, 722 00:50:30,480 --> 00:50:31,730 and then summed. 723 00:50:34,340 --> 00:50:36,880 So just as we looked at before when we looked at the 724 00:50:36,880 --> 00:50:40,230 coefficient of variation, and standard deviation, by 725 00:50:40,230 --> 00:50:44,210 comparing how far things stray from the mean, that tells us 726 00:50:44,210 --> 00:50:47,380 how much variance there is in the data. 727 00:50:47,380 --> 00:50:50,440 And then I'll return 1 minus that.
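A sketch of the function just described, paraphrased with illustrative names rather than the course's exact code:

```python
import numpy as np

def r_squared(measured, estimated):
    """Coefficient of determination:
    1 - (estimated error / variance of the measured data)."""
    measured = np.asarray(measured, dtype=float)
    estimated = np.asarray(estimated, dtype=float)

    # Total of the squared errors -- no mean taken, unlike MSE.
    estimate_error = ((estimated - measured) ** 2).sum()

    # Mean of the measured data: the measured sum over the count.
    measured_mean = measured.sum() / len(measured)

    # How far the measured points stray from their own mean, squared
    # and summed -- the measured variance term.
    measured_variance = ((measured_mean - measured) ** 2).sum()

    return 1.0 - estimate_error / measured_variance
```

A perfect model gives 1, and a model that does no better than always predicting the mean gives 0; note that because both the error and the variance are sums rather than means, the lengths cancel in the ratio.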
728 00:50:50,440 --> 00:50:55,600 OK, Tuesday we'll go look at this in more detail. 729 00:50:55,600 --> 00:50:56,850 Thank you.