The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIC GRIMSON: OK, welcome back-- or welcome, depending on whether you've been away or not. I'm going to start with two simple announcements. There is a reading assignment for this lecture, actually for the next two lectures, which is chapter 18. And on a much happier note, there is no lecture Wednesday, because we hope you're going to be busy preparing to get that tryptophan poisoning as you eat way too much turkey and fall asleep. More importantly, I hope you have a great break over Thanksgiving, whether you're here or back home or wherever you are. But no lecture Wednesday.

The topic for today: I'm going to start with what's going to seem like a really obvious statement. We're living in a data-intensive world. Whether you're a scientist, an engineer, a social scientist, a financial worker, a politician, or the manager of a sports team, you're spending increasingly large amounts of time dealing with data. And if you're in one of those positions, that often means you're either writing code or hiring somebody to write code for you to figure out that data.

This section of the course focuses on exactly that issue. We want to help you understand what you can try to do with software that manipulates data, how you can write code that will do that manipulation of data for you, and especially what you should believe about what that software tells you about the data, because sometimes it tells you stuff that isn't exactly what you need to know. And today we're going to start by looking particularly at the case where we get data from experiments. So think of this lecture and the next one as statistics meets experimental science.

So what do I mean by that?
Imagine you're doing a physics lab, a biology lab, a chemistry lab, or even something in sociology or anthropology: you conduct an experiment to gather some data. It could be measurements in a lab; it could be answers on a questionnaire. You get a set of data.

Once you've got the data, you want to think about what you can do with it, and that usually involves using some model, some theory about the underlying process, to generate questions about the data. What do this data and the model associated with it tell me about future expectations? Do they help me predict other results that will come out of this process? In the social case, it could be how I think about how people are going to respond to a poll about who you're voting for in the next election, for example.

Given the data and the model, the third thing we typically want to do is design a computation to help us answer questions about the data: run a computational experiment to complement the physical experiment or the social experiment we used to gather the data in the first place. And that computation could be something deep, or it could be something a little more interesting, depending on how you're thinking about it. But we want to think about how we use computation to run additional experiments for us.

So I'm going to start with an example of gathering experimental data, and I want to start with the idea of a spring. How would I model a spring? How would I gather data about a spring? And how would I write software to help me answer questions about a spring?

So what's a spring? Well, there's one kind of spring-- a little hard to model, although it could be interesting to ask what's swimming around in there and what the ecological implications of that spring are. Here's a second kind of spring.
It's about four or five months away, but eventually we'll get through this winter and get to that spring, and that would be nice. But I'm not going to model that one either. And yes, my jokes are really bad, and yes, you can't do a darn thing about them, because I am tenured.

While I'd like to model those two springs, we're going to stick with the ones you see in physics labs, these kinds of springs, so-called linear springs. These are springs with the property that you can stretch or compress them by applying a force, and when you release them, they literally spring back to the position they were in originally. So we're going to deal with these kinds of springs. And the distinguishing characteristic of these two springs, and others in this class, is that the force you require to compress or stretch them a certain amount varies linearly with the distance. If it takes some amount of force to compress a spring some distance, it takes twice as much force to compress it twice that distance. It's linearly related.

So each of these kinds of springs has that property: the amount of force needed to stretch or compress it is linear in the distance. Associated with these springs is something called a spring constant, usually represented by k, that determines how much force you need to stretch or compress the spring. Now, it turns out that the spring constant can vary a lot. The slinky actually has a very low spring constant: one newton per meter. The spring on the suspension of a motorcycle has a much bigger spring constant. It's a lot stiffer: 35,000 newtons per meter. And just in case you don't remember, a newton is the amount of force you need to accelerate a one-kilogram mass at one meter per second squared. We'll come back to that in a second.
But the idea is, we'd like to think about how we model these kinds of springs. Well, it turns out, fortunately for us, that this was done about 300-plus years ago by a British physicist named Robert Hooke. Back in 1676 he formulated Hooke's law of elasticity, a simple expression, F = -kd, that says the force you need to compress or stretch a spring is linearly related to the distance d by which you've compressed or stretched it. Another way of saying it: if I compress a spring some amount, the force stored in it is linearly related to that distance. And the negative sign basically says the force points in the opposite direction. If I compress the spring, the force is going to push it back out; if I stretch it, the force is going to pull it back to that resting position.

Now, this law holds for a wide range of springs, which is kind of nice. It holds both in biological systems and in physical systems. It doesn't hold perfectly, though. There's a limit to how much you can stretch a spring, in particular, before the law breaks down-- and maybe you did this as a kid, right? If you take a slinky and pull it too far apart, it stops working, because you've exceeded what's called the elastic limit of the spring. Similarly, if you compress it too far-- although I think you have to compress it a long way-- it'll stop working as well.

So it doesn't hold completely, and it also doesn't hold for all springs, only those that satisfy this linear law, which is a lot of them. For example, it doesn't apply to rubber bands, and it doesn't apply to recurve bows. Those are two examples of springs that do not obey this linear relationship. But nonetheless, there's Hooke's law. And one of the things we can do is use it to do a little bit of reasoning about a spring. So we can ask the question: how much does a rider have to weigh to compress this spring by one centimeter?
We've got Hooke's law, and I also gave you a little bit of a hint here: I told you that this spring has a spring constant of 35,000 newtons per meter. So I can just plug things in. One centimeter is 1/100 of a meter-- there's the spring constant, and there's the amount we're going to compress it. Do a little math, and that says the force I need is 350 newtons.

So what's a newton? A small town in Massachusetts, an interesting cookie, and a force that we want to think about. I keep telling you guys, the jokes are really bad.

So how do I get force? Well, you know that: mass times acceleration, F = ma. For the acceleration here, I'm going to make an assumption, which is that the spring is oriented basically perpendicular to the earth, so the acceleration is just the acceleration of gravity, roughly 9.8 meters per second squared, pulling it down. So I can plug that back in, because remember, what I want to figure out is the mass. Substituting for the force, I've got: mass times 9.8 meters per second squared equals 350 newtons. Divide both sides by 9.8, do a little bit of math, and it says the mass I need is 350 divided by 9.8 kilograms. (The k in kg refers to kilograms here, not to the spring constant-- poor choice of example, but there I am.) And if I do the math, it says I need a rider who weighs 35.68 kilos. If you're not big on the metric system, that's actually a fairly light rider, about 79 pounds. So a 79-pound rider would compress that spring one centimeter.

So we can figure out how to use Hooke's law. We're thinking about what we want to do with springs. That's kind of nice.
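In code, that back-of-the-envelope calculation looks like this (a minimal sketch; the variable names are mine, and it uses the rounded g = 9.8 from above):

```python
# Hooke's law with the motorcycle-spring constant from the slide:
# how heavy must a rider be to compress the spring one centimeter?
k = 35000.0        # spring constant, newtons per meter
d = 0.01           # compression, meters (one centimeter)
g = 9.8            # acceleration of gravity, meters per second squared

force = k * d      # 350.0 newtons to compress the spring 1 cm
mass = force / g   # F = m*g, so m = F/g
print(mass)        # roughly 35.7 kg, the ~79-pound rider from the slide
```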
How will we actually get the spring constant? It's really valuable to know what the spring constant is, and just to give you a sense of that, it's not just for dealing with things like slinkies. With atomic force microscopes, you need to know the spring constants of the components in order to calibrate them properly. The force you need to deform a strand of DNA is directly related to the spring constants of the biological structures themselves. So I'd really like to figure out how to get them.

How many of you have done this experiment in physics and hated it? Right. Well, I don't know if you hated it or not, but you've done it, right? The standard way to do it is: I take a spring, I suspend it from some point, and I let it come to a resting position. Then I put a mass on the bottom of the spring. It bounces around a bit, and when it settles, I measure the distance from where the spring's end was before I put the mass on to where it is after I've added the mass. I measure that distance, and then I just plug into that formula there: the force is minus k times d. So k, the spring constant, is the force-- forget the minus sign-- divided by the distance. And the force here is 9.8 meters per second squared times the mass, so k is 9.8 times the mass divided by d. I could just plug it in.

In an ideal world, I'd plug it in and I'm done: one measurement. Not so much, right? Masses aren't always perfectly calibrated. Maybe the spring hasn't got perfect materials in it. So ideally I'd do multiple trials: take different weights, put them on the spring, make the measurements, and record those. So that's what I'm going to do-- and I've actually done it; I'm not going to make you do it-- and I get out a set of measurements.

What have I done here? I've used different masses, each increasing by 0.05 kilograms, and I've measured the distance the spring has deformed. Ideally, these would all have that nice linear relationship, so I could just plug them in and figure out what the spring constant is.

So let's take this data and plot it.
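For concreteness, here's a minimal sketch of the kind of file-reading and plotting code that's about to be walked through. The file name springData.txt and the one-header-line, distance-then-mass column layout are assumptions about the posted file:

```python
import pylab

def getData(fileName):
    # Read one "distance mass" measurement pair per line.
    dataFile = open(fileName, 'r')
    distances, masses = [], []
    dataFile.readline()                 # discard the header line
    for line in dataFile:
        d, m = line.split()
        distances.append(float(d))
        masses.append(float(m))
    dataFile.close()
    # Masses are the independent values, distances the dependent ones.
    return (masses, distances)

def plotData(fileName):
    xVals, yVals = getData(fileName)
    xVals = pylab.array(xVals)          # arrays support elementwise math
    yVals = pylab.array(yVals)
    xVals = xVals * 9.81                # scale mass (kg) to force (newtons)
    pylab.plot(xVals, yVals, 'bo', label='Measured displacements')
    pylab.xlabel('|Force| (Newtons)')
    pylab.ylabel('Distance (meters)')

plotData('springData.txt')
pylab.show()
```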
By the way, you'll be able to see all the code when you download the file; I'm going to walk through some of it quickly. This is a simple way to deal with it, and I'm going to back up for a second. There's my data, and I've actually laid it out, in some ways, in the wrong order. These are my independent measures, the different masses; I'm going to plot those along the x-axis, the horizontal axis. These are the dependent things, the things I'm measuring; I'm going to plot those along the y-axis. So I really should have put them in the other order. Just cross your eyes and make this column go over to that column, and we'll be in good shape.

Let's plot this. So here's a little file. Having stored those measurements away in a file, I'm just going to read them in with getData, which does the obvious thing of reading the values in and returning two tuples or lists: one for the x values-- or, if you like, going back to the table, this set of values-- and one for the y values.

Now I'm going to play a little trick that you may have seen before, and it's going to be handy. I'm going to call a function out of the PyLab library called array. I pass in the tuple, and it converts it into an array, which is a data structure with a fixed number of slots that has a really nice property I want to take advantage of. I could do all of this with lists, but by converting each tuple into an array and giving it the same name-- xVals, and similarly for yVals-- I can now do math on the array without having to write loops. In particular, right here, notice what I'm doing: I'm taking xVals, which is an array, and multiplying it by a number. What that does is take every entry in the array, multiply that entry by the number, and put the results into basically a new version of the array, which I then store back into xVals. If you've programmed in MATLAB, this is the same kind of feeling, right?
I can take an array, do something to it, and that's really nice. So I'm going to scale all of my values, and then I'm going to plot them with some appropriate labels. And if I do it, I get that.

I thought we said Hooke's law was a linear relationship. So in an ideal world, all of these points ought to lie along a line somewhere, where the slope of the line would tell me the spring constant. Not so good, right? In fact, if you look at it, you can kind of see-- in here you can imagine there's a line, and something funky is going on up here. We're going to come back to that at the end of the lecture. But how do we think about actually finding the line?

Well, we know there's noise in the measurements, so our best option is to say: could we just fit a line to this data? And how would we do that? That's the first big thing we want to do today: figure out, given that we've got measurement noise, how we fit a line to the data.

So how do we fit a curve to data? What we're basically going to try to do is find a way to relate the independent values, which are on the x-axis, to the dependent value: the actual displacement we're going to see. Another way of saying it is, if I go back to here, I want to know, for every point along here, how to fit something that predicts what the y value is. So I need to figure out how to do that fit.

And even if I had a curve-- a line-- that I thought was a good fit, I'd need to decide how good it is. Imagine I was lucky and somebody said, here's a line that I think describes Hooke's law in this case. Great. I could draw the line on that data. I could draw it on this chunk of data here. I still need to decide how I know whether it's a good fit.
And for that, we need something we call an objective function, and it's going to measure how close the line is to the data to which I'm trying to fit it. Once we've defined the objective function, then what we say is: OK, now let's find the line that minimizes it-- the best possible line, the line that makes that objective function as small as possible-- because that's going to be the best fit to the data.

And so that's what I'd like to do. We're going to do it for general curves, but we're going to start just with lines, with linear functions. So in this case, we want the line such that some function of the sum of the distances from the line to the measured points is minimized. I'm going to come back in a second to how we find the line, but first we've got to think about what it means to measure that distance.

So I've got a point, and imagine I've got a line that I think is a good match for fitting the data. How do I measure distance? Well, there's one option: I could measure the displacement along the x-axis. There's a second option: I could measure the displacement vertically. Or, a third option, I could measure the distance to the closest point on the line, which would be that perpendicular distance there.

You're way too quiet, which is always dangerous. What do you think? I'm going to look for a show of hands here. How many people think we should use x as the thing that we measure here? Hands up. Please don't use a single finger when you put your hand up. All right, good. How many people think we should use p, the perpendicular distance? A reasonable number of hands. And how about y? I see actually about an even split between p and y, and that's actually really good.
x doesn't make a lot of sense, right, because I know my values along the x-axis are independent measurements, so displacement in that direction doesn't make a lot of sense. p makes a lot of sense, but unfortunately it isn't what I want. We're going to see examples later on where minimizing that perpendicular distance is the right thing to do: when we do machine learning, that is how you find what's called a classifier, or separator. But here we're going to pick y, and the reason is important. I'm trying to predict the dependent value-- the y value-- given a new independent x value. And so the uncertainty is, in fact, the vertical displacement. So I'm going to use y: that displacement is the thing I'm going to measure as the distance.

How do I find this? I need an objective function that's going to tell me the closeness of the fit. So here's how I'm going to do it. I'm going to have some set of observed values-- think of it as an array. I've got some index into them, so the indices are giving me the x values, and the observed values are the things I've actually measured. If you want to think of it this way-- going back to this slide really quickly-- the observed values are the displacements, the values along the y-axis. Sorry about that.

Let's assume I have some hypothesized line that I think fits this data: y = ax + b. I know the a and the b; I've hypothesized them. Then the predicted value basically says: given the x value, here is what the line predicts the y value should be. And I'm going to take the difference between those two-- observed minus predicted-- and square it. The difference makes sense: it tells me how far away the observed value is from what the line predicts it should be.
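In code form, that objective might look like the following (my own sketch of the summation on the slide; the function name is mine):

```python
import pylab

def sumSquaredError(observed, a, b, xVals):
    # For a hypothesized line y = a*x + b, sum the squared vertical
    # differences between the observed values and the line's predictions.
    predicted = a * xVals + b
    return pylab.sum((observed - predicted) ** 2)
```

Both observed and xVals are assumed to be arrays, so the arithmetic is elementwise, just like the scaling trick above.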
Why am I squaring the differences? Well, there are two reasons. The first is that squaring gets rid of the sign. It shouldn't matter whether my observed value is some amount above the predicted value or the same amount below it; the direction of the displacement shouldn't matter, only how far away it is. Now, you could say, why not just use the absolute value? And the answer is: you could, but we're going to see in a couple of slides that by using the square we get a really nice property that helps us find the best-fitting line.

So my objective function basically says: given a bunch of observed values, use the hypothesized line to predict what each value should be, measure the difference in the y direction-- which is what I'm doing, because I'm comparing predicted and observed y values-- square the differences, and sum them all up. It's called least squares. That's going to give me a measure of how close the line is to a fit. In a second, I'll get to how you find the best line, but this hopefully looks familiar. Anybody recognize it? You've seen it earlier in this class. Boy, that's a terrible thing to ask, because you don't even remember the last thing you did in this class other than the problem set.

AUDIENCE: [INAUDIBLE]

ERIC GRIMSON: Sorry?

AUDIENCE: Variance.

ERIC GRIMSON: Variance. Thank you. Absolutely. Sorry, I didn't bring any candy today; that's Professor Guttag. I've got a better arm than he does, but I still didn't bring any candy today. Yeah, it's variance-- well, not quite. It's almost variance. That's the variance times the number of observations. Another way of saying it: if I divided this by the number of observations, that would be the variance, and if I took the square root, it would be the standard deviation. Why is that valuable?
Because that tells you something about how badly things are dispersed, how much variation there is in this measurement. So if I can minimize this expression, that's great, because it not only finds what I hope is the best fit, it also minimizes the variance between what I predict and what I measure, which makes intuitive sense. That's exactly the thing I would like to minimize.

This was all built on the assumption that I had a line I thought was a good fit, and it lets me measure how good that fit is. But I still have to do a little bit more: I have to figure out how to find the best-fitting line. And for that, we need to come up with a minimization technique. To minimize this objective function, I want to find the curve for the predicted values-- this thing here-- some way of representing it that leads to the best possible solution.

And I'm going to make a simple assumption: my model for this predicted curve-- I've been using the example of a line, but we're going to say curve-- is a polynomial. It's a polynomial in one variable, and the one variable is the x value of the samples. In the simplest case, degree one, it's a line; in the next case, degree two, it's a parabola. And I'm going to use a technique called linear regression to find the polynomial that best fits the data, that minimizes that objective function.

A quick aside, just to remind you-- I'm sure you remember. A polynomial is either zero, which is really boring, or a finite sum of non-zero terms that all have the form c times x to the p. c is a constant, a real number; p is a power, a non-negative integer; and x is the free variable that's going to capture this. So, an easy way to say it: a line would be represented as a degree-one polynomial, ax + b.
A parabola is a degree-two polynomial, ax² + bx + c. And we can go up to higher-order terms. We refer to the degree of the polynomial as the largest degree of any term in that polynomial. So again: degree one, linear; degree two, quadratic.

Now, how do I use that? Well, here's the basic idea. Let's take a simple example and assume I'm still just trying to fit a line. My assumption is that I want to find a degree-one polynomial, y = ax + b, as our model of the data. That means for every sample, I plug in x, and if I know a and b, it gives me the predicted value. I've already seen that this gives me a good measure of the closeness of the fit. The question is: how do I find a and b? My goal is to find a and b such that, when we use this polynomial to compute the predicted y values, the sum of squared differences is minimized. The sum of squared differences is my measure of fit; all I have to do is find a and b.

And that's where linear regression comes in, and I want to give you a visualization of it. If a line is described by ax + b, then I can represent every possible line as a point in a two-dimensional space: one axis is possible values of a, the other axis is possible values of b. Take any point in that space, and it gives me an a and a b value, which describes a line.

Why should you care about that? Because I can put a surface over that space. In other words, every a and b gives me a line, and I can therefore compute this objective function, given the observed values and the predicted values, and it gives me a number: the height of the surface above that point in the space.

If you're with me on the visualization, why is that nice? Because linear regression gives me a very easy way to find the lowest point on that surface, which is exactly the solution I want, because that's the best-fitting line.
605 00:26:03,820 --> 00:26:05,620 And it's called linear regression 606 00:26:05,620 --> 00:26:07,570 not because we're solving for a line, 607 00:26:07,570 --> 00:26:10,230 but because of how you do that solution. 608 00:26:10,230 --> 00:26:11,560 If you think of this as being-- 609 00:26:11,560 --> 00:26:13,690 take a marble on this two-dimensional surface, 610 00:26:13,690 --> 00:26:15,250 you want to place the marble on it, 611 00:26:15,250 --> 00:26:17,170 you want to let it run down to the lowest 612 00:26:17,170 --> 00:26:19,580 point in the surface. 613 00:26:19,580 --> 00:26:22,510 And oh, yeah, I promised you why do we use sum squares, 614 00:26:22,510 --> 00:26:24,340 because if we used the sum of the squares, 615 00:26:24,340 --> 00:26:28,690 that surface always has only one minimum. 616 00:26:28,690 --> 00:26:30,850 So it's not a really funky, convoluted surface. 617 00:26:30,850 --> 00:26:32,600 It has exactly one minimum. 618 00:26:32,600 --> 00:26:34,750 It's called linear regression because the way 619 00:26:34,750 --> 00:26:38,160 to find it is to start at some point and walk downhill. 620 00:26:38,160 --> 00:26:41,650 I linearly regress or walk downhill along the gradient 621 00:26:41,650 --> 00:26:43,510 some distance, measure the new gradient, 622 00:26:43,510 --> 00:26:45,700 and do that until I get down to the lowest 623 00:26:45,700 --> 00:26:49,870 point in the surface. 624 00:26:49,870 --> 00:26:51,990 Could you write code to do it? 625 00:26:51,990 --> 00:26:53,070 Sure. 626 00:26:53,070 --> 00:26:54,810 Are we going to ask you to do it? 627 00:26:54,810 --> 00:26:57,554 No, because fortunately-- 628 00:26:57,554 --> 00:26:59,220 I was hoping to get a cheer out of that. 629 00:26:59,220 --> 00:26:59,670 Too bad. 630 00:26:59,670 --> 00:27:01,628 OK, maybe we will ask you to do it on the exam. 631 00:27:01,628 --> 00:27:03,624 What the hell. 632 00:27:03,624 --> 00:27:04,290 You could do it. 633 00:27:04,290 --> 00:27:06,720 In fact, you've seen a version of this. 634 00:27:06,720 --> 00:27:08,340 The typical algorithm for doing it 635 00:27:08,340 --> 00:27:10,020 is very similar to Newton's method 636 00:27:10,020 --> 00:27:13,140 that we used way back in the beginning of 60001 637 00:27:13,140 --> 00:27:15,330 when we found square roots. 638 00:27:15,330 --> 00:27:17,200 You could write that kind of a solution, 639 00:27:17,200 --> 00:27:19,140 but the good news is that the nice people who 640 00:27:19,140 --> 00:27:21,480 wrote Python, or particularly PyLab, 641 00:27:21,480 --> 00:27:24,160 have given you code to do it. 642 00:27:24,160 --> 00:27:25,980 And we're going to take advantage of it. 643 00:27:25,980 --> 00:27:30,370 So in PyLab there is a built-in function called polyFit. 644 00:27:30,370 --> 00:27:33,360 It takes a collection of x values, 645 00:27:33,360 --> 00:27:34,984 takes a collection of equal length 646 00:27:34,984 --> 00:27:36,900 of y values-- they need to be the same length. 647 00:27:36,900 --> 00:27:38,630 I'm going to assume they're arrays. 648 00:27:38,630 --> 00:27:43,140 And it takes an integer n, which is the degree of fit, 649 00:27:43,140 --> 00:27:44,600 that I want to apply. 650 00:27:44,600 --> 00:27:46,770 And what polyFit will do is it will 651 00:27:46,770 --> 00:27:50,550 find the coefficients of a polynomial of that degree that 652 00:27:50,550 --> 00:27:54,010 provides the best least squares fit. 
Could you write code to do it? Sure. Are we going to ask you to do it? No, because fortunately-- I was hoping to get a cheer out of that. Too bad. OK, maybe we will ask you to do it on the exam. What the hell. You could do it. In fact, you've seen a version of this: the typical algorithm is very similar to Newton's method, which we used way back in the beginning of 6.0001 when we found square roots. You could write that kind of solution, but the good news is that the nice people who wrote Python, or particularly PyLab, have given you code to do it, and we're going to take advantage of it.

In PyLab there is a built-in function called polyFit. It takes a collection of x values and a collection of y values-- they need to be the same length, and I'm going to assume they're arrays-- and it takes an integer n, which is the degree of the fit I want to apply. What polyFit will do is find the coefficients of a polynomial of that degree that provides the best least-squares fit. So think of polyFit as walking along that surface to find the best a and b. If I give it a value of n = 1, it gives me back the a and b of the best line. If I give it n = 2, it gives me back the a, b, and c of the parabola ax² + bx + c that best fits the data. And I could pick n to be any non-negative integer, and it would come up with a fit.

So let's use it. I'm going to write a little function called fitData. The first part up here just comes from plotData-- it's exactly the same thing. I read in the data, I convert the values into arrays, and I scale this one because I want to get out the force. I go ahead and plot it. And then notice what I do: I use polyFit right here, on the input x values and y values with degree one, and it gives me back a tuple, the a and b of the best-fit line. It finds the point in that space that best fits the data.

Once I've got that, I can go ahead and compute the estimated, or predicted, values-- what the line says I should have seen. I do the same thing as before: take the x values, convert them into an array, multiply by a-- which scales every entry by a-- and add b to every entry. So I'm just computing ax + b for all the x's. That gives me an estimated set of y values, and I can plot those out.

I'm cheating here. Sorry-- I'm misdirecting you; I never cheat. I actually don't need to do the conversion to an array there, because I did it up here. But because I borrowed this from plotData, I wanted to show you that I can redundantly do it here, to remind you that I want it to be an array so I can do that kind of algebra on it.

The last thing I do, once I've shown you the fit of this line, is get out the spring constant.
Now, the slope of this line is the change in distance over the change in force, and the spring constant is the reciprocal of that. So I can simply take the slope of the line, which is a, invert it, and that gives me the spring constant.

So let's see what happens if we actually run this. I'm going to go over to my code, hoping that it works properly. Here's my Python; I've loaded this in; I'm going to run it. And there you go. It fits a line, and it prints out the value of a, which is about 0.046, and the value of b. And if I go back and look at this-- there we go-- the spring constant is about 21 and a half, which is about the reciprocal of 0.046, if you can figure that out. And you can see it's not a bad fit of a line through that data. Again, there's still something funky going on over here, which we're going to come back to, but it's a pretty good fit to the data. Great. So now I've got a fit.

I'm going to show you a variation of this that we're going to use in a second. I could do the same thing, but after I've done polyFit here, I'm going to use another built-in function called polyval. It takes a polynomial, which is captured by the model that polyFit returned. Notice that polyFit returns this as a tuple, and since it's coming back as a tuple, I can give it a name: model. Polyval will take that tuple plus the x values and do the same thing: give me back an array of predicted values. But the nice thing here is that this model could be a line, a parabola, a quartic, a quintic-- a polynomial of any order. I hope you like the abstraction, which we're going to see pay off in a little bit: it allows me to use the same code for different orders of model. And if I ran this, it would do exactly the same thing.
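Putting the pieces together, here's a sketch of the polyval version of fitData just described. It's a reconstruction consistent with the walkthrough (reusing the getData sketched earlier), not necessarily the posted code verbatim:

```python
import pylab

def fitData(fileName):
    xVals, yVals = getData(fileName)         # getData as sketched earlier
    xVals = pylab.array(xVals)
    yVals = pylab.array(yVals)
    xVals = xVals * 9.81                     # convert mass to force (newtons)
    pylab.plot(xVals, yVals, 'bo', label='Measured points')
    model = pylab.polyfit(xVals, yVals, 1)   # coefficients of best-fit line
    estYVals = pylab.polyval(model, xVals)   # predicted y for every x
    # Because polyval accepts a polynomial of any degree, changing the
    # 1 above to a 2 fits a parabola with no other edits to this code.
    pylab.plot(xVals, estYVals, 'r',
               label='Linear fit, k = ' + str(round(1 / model[0], 5)))
    pylab.legend(loc='best')

fitData('springData.txt')
pylab.show()
```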
I'm going to come back to thinking about what's going on with that spring in a second, but I want to show you another example. So here's another set of data. In a little bit, I'll show you where this mystery data came from, but here's another set of data that I've plotted out. I can run exactly the same code and fit a line to it. And if I do, I get that.

What do you think? Good fit? Show of hands: how many people like this fit to the data? Show of hands: how many people don't like this fit to the data? Show of hands: how many hope that I'll stop asking you questions? Don't put your hands up. Yeah, thank you. I know. Too bad.

It's a lousy fit, and you kind of know it, right? It's clear that this doesn't look like it's coming from a line-- or if it is, it's a really noisy line. So let's think about this. What if I were to try a higher degree? Let's change the one to a two. I've changed the one to a two, which says I'm still using the polynomial fit, but now I'm asking for the best-fitting parabola, ax² + bx + c. A simple change, and because I was using polyval, exactly the same code will work; it's going to do the fit.

This is, by the way, still an example of linear regression. So think of what I'm doing now. I have a three-dimensional space: one axis is a values, the second is b values, the third is c values. Any point in that space describes a parabola, and the points of that space describe every possible parabola. Now you've got to twist your head a little bit: put a surface over that three-dimensional space-- a surface sitting in four dimensions-- where the height at each point is the value of that objective function. Play the same game.
787 00:33:15,830 --> 00:33:19,920 This is, by the way, still an example of linear regression. 788 00:33:19,920 --> 00:33:22,820 So think of what I'm doing now. 789 00:33:22,820 --> 00:33:24,950 I have a three-dimensional space. 790 00:33:24,950 --> 00:33:26,390 One axis is a values. 791 00:33:26,390 --> 00:33:28,440 Second axis is b values. 792 00:33:28,440 --> 00:33:30,560 Third axis is c values. 793 00:33:30,560 --> 00:33:34,890 Any point in that space describes a parabola, 794 00:33:34,890 --> 00:33:36,560 and every point in that space describes 795 00:33:36,560 --> 00:33:38,215 every possible parabola. 796 00:33:38,215 --> 00:33:40,340 And now you've got to twist your head a little bit. 797 00:33:40,340 --> 00:33:42,800 Put a four-dimensional surface over 798 00:33:42,800 --> 00:33:46,330 that three-dimensional space, where the height of the surface 799 00:33:46,330 --> 00:33:48,890 is the value of the objective function. 800 00:33:48,890 --> 00:33:50,570 Play the same game. 801 00:33:50,570 --> 00:33:51,080 And you can. 802 00:33:51,080 --> 00:33:52,550 It's just a higher-dimensional thing. 803 00:33:52,550 --> 00:33:54,591 So you're, again, going to walk down the gradient 804 00:33:54,591 --> 00:33:56,461 to find the solution, and be glad you don't 805 00:33:56,461 --> 00:33:58,460 have to write this code because PyLab will do it 806 00:33:58,460 --> 00:33:59,084 for you for free. 807 00:33:59,084 --> 00:34:04,190 But it's still an example of regression, which is great. 808 00:34:04,190 --> 00:34:07,243 And if we do that, we get that fit. 809 00:34:07,243 --> 00:34:09,409 Actually, just to show you that, I'm going to run it, 810 00:34:09,409 --> 00:34:12,949 but it will do exactly the same thing. 811 00:34:12,949 --> 00:34:15,158 If I go over to Python-- 812 00:34:15,158 --> 00:34:16,199 wherever I have it here-- 813 00:34:19,989 --> 00:34:23,350 I'm going to change that order of the model. 814 00:34:23,350 --> 00:34:25,070 Oops, it went a little too far for me. 815 00:34:25,070 --> 00:34:27,100 Sorry about that. 816 00:34:27,100 --> 00:34:29,940 Let me go back and do this again. 817 00:34:29,940 --> 00:34:35,310 There's the first one, and there's the second one. 818 00:34:40,170 --> 00:34:44,090 So I could fit different models to it. 819 00:34:44,090 --> 00:34:47,090 Quadratic clearly looks like it's a better fit. 820 00:34:47,090 --> 00:34:49,540 I hope you'll agree. 821 00:34:49,540 --> 00:34:53,710 So how do I decide which one's better 822 00:34:53,710 --> 00:34:55,820 other than eyeballing it? 823 00:34:55,820 --> 00:34:57,920 And then if I could fit a quadratic to it, 824 00:34:57,920 --> 00:34:59,810 what about other orders of polynomials? 825 00:34:59,810 --> 00:35:01,970 Maybe there's an even better fit out there. 826 00:35:01,970 --> 00:35:05,870 So how do I figure out what's the best way to do the fit? 827 00:35:05,870 --> 00:35:08,540 And that leads to the second big thing for this lecture. 828 00:35:08,540 --> 00:35:10,550 How good are these fits? 829 00:35:10,550 --> 00:35:12,050 What's the first big thing? 830 00:35:12,050 --> 00:35:14,600 The idea of linear regression, a way of finding 831 00:35:14,600 --> 00:35:16,614 fits of curves to data. 832 00:35:16,614 --> 00:35:18,530 But now I've got to decide how good are these. 833 00:35:18,530 --> 00:35:21,760 And I could ask this question two ways. 834 00:35:21,760 --> 00:35:23,641 One is just relative to each other, 835 00:35:23,641 --> 00:35:25,890 how do I measure which one's better other than looking 836 00:35:25,890 --> 00:35:27,570 at it by eye? 837 00:35:27,570 --> 00:35:30,370 And then the second part of it is in an absolute sense, 838 00:35:30,370 --> 00:35:33,025 how do I know where the best solution is? 839 00:35:33,025 --> 00:35:35,120 Is quadratic the best I could do? 840 00:35:35,120 --> 00:35:36,639 Or should I be doing something else 841 00:35:36,639 --> 00:35:38,680 to try and figure out a better solution, a better 842 00:35:38,680 --> 00:35:39,430 fit to the data? 843 00:35:41,860 --> 00:35:44,350 First, the relative fit. 844 00:35:44,350 --> 00:35:45,860 What are we doing here? 845 00:35:45,860 --> 00:35:48,370 We're fitting a curve, which is a function 846 00:35:48,370 --> 00:35:51,157 of the independent variable, to the dependent variable. 847 00:35:51,157 --> 00:35:52,240 What do I mean by that? 848 00:35:52,240 --> 00:35:53,590 I've got a set of x values.
849 00:35:53,590 --> 00:35:55,930 I'm trying to predict what the y values should be, 850 00:35:55,930 --> 00:35:57,070 what the displacement should be. 851 00:35:57,070 --> 00:35:59,170 I want to get a good fit to that. 852 00:35:59,170 --> 00:36:01,360 The idea is that given an independent value, 853 00:36:01,360 --> 00:36:03,280 it gives me an estimate of what it should be, 854 00:36:03,280 --> 00:36:05,620 and I really want to know which fit provides the better 855 00:36:05,620 --> 00:36:07,030 estimates. 856 00:36:07,030 --> 00:36:10,240 And since I was simply minimizing mean squared error, 857 00:36:10,240 --> 00:36:12,910 average squared error, an obvious thing to do 858 00:36:12,910 --> 00:36:17,050 is just to measure the goodness of fit by looking at that error. 859 00:36:17,050 --> 00:36:19,720 Why not just measure where am I on that surface 860 00:36:19,720 --> 00:36:21,100 and see which one does better? 861 00:36:21,100 --> 00:36:22,641 Or actually it would be two surfaces, 862 00:36:22,641 --> 00:36:25,110 one for a linear fit, one for a quadratic one. 863 00:36:27,026 --> 00:36:28,150 We'll do what we always do. 864 00:36:28,150 --> 00:36:29,550 Let's write a little bit of code. 865 00:36:29,550 --> 00:36:30,966 I can write something that's going 866 00:36:30,966 --> 00:36:32,830 to get the average mean squared error. 867 00:36:32,830 --> 00:36:36,010 Takes in a set of data points, a set of predicted values, 868 00:36:36,010 --> 00:36:38,102 simply measures the difference between them, 869 00:36:38,102 --> 00:36:40,060 squares them, adds them all up in a little loop 870 00:36:40,060 --> 00:36:42,970 here and returns that divided by the number of samples I have. 871 00:36:42,970 --> 00:36:45,540 So it gives me the average squared error. 872 00:36:45,540 --> 00:36:47,740 And I could do it for that first model I built, 873 00:36:47,740 --> 00:36:49,552 which was for a linear fit, and I 874 00:36:49,552 --> 00:36:51,260 could do it for the second model I built, 875 00:36:51,260 --> 00:36:53,010 which is a quadratic fit. 876 00:36:53,010 --> 00:36:57,760 And if I run it, I get those values. 877 00:36:57,760 --> 00:36:59,320 Looks pretty good. 878 00:36:59,320 --> 00:37:01,970 You knew by eye that the quadratic was a better fit. 879 00:37:01,970 --> 00:37:05,440 And look, this says it's about six times better, 880 00:37:05,440 --> 00:37:08,350 that the residual error is six times smaller 881 00:37:08,350 --> 00:37:11,720 with the quadratic model than it is with the linear model.
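Here's a minimal version of that error function, following the loop just described, together with the comparison of the two models. The function name and variables are guesses, continuing the earlier sketch:

```python
def aveMeanSquareError(data, predicted):
    # average of the squared differences between measured and predicted values
    error = 0.0
    for i in range(len(data)):
        error += (data[i] - predicted[i])**2
    return error/len(data)

# model1 and model2: the degree-one and degree-two fits to the mystery data
model1 = pylab.polyfit(xVals, yVals, 1)
model2 = pylab.polyfit(xVals, yVals, 2)
print('MSE for linear model =',
      aveMeanSquareError(yVals, pylab.polyval(model1, xVals)))
print('MSE for quadratic model =',
      aveMeanSquareError(yVals, pylab.polyval(model2, xVals)))
```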
882 00:37:15,970 --> 00:37:19,200 But with that, I still have a problem, which is-- 883 00:37:19,200 --> 00:37:22,740 OK, so it's useful for comparing two models. 884 00:37:22,740 --> 00:37:26,352 But is 1524 a good number? 885 00:37:26,352 --> 00:37:28,310 Certainly better than 9,000-something or other. 886 00:37:28,310 --> 00:37:31,720 But how do I know that 1524 is a good number? 887 00:37:31,720 --> 00:37:35,324 How do I know there isn't a better fit out there somewhere? 888 00:37:35,324 --> 00:37:37,740 Well, good news is we're going to be able to measure that. 889 00:37:37,740 --> 00:37:41,730 It's hard to know because there's no bound on the values. 890 00:37:41,730 --> 00:37:44,370 And more importantly, this is not scale independent. 891 00:37:44,370 --> 00:37:45,340 What do I mean by that? 892 00:37:45,340 --> 00:37:49,520 If I take all of the values and multiply them by some factor, 893 00:37:49,520 --> 00:37:52,530 I would still fit the same models to them. 894 00:37:52,530 --> 00:37:53,880 They would just scale. 895 00:37:53,880 --> 00:37:56,397 But that measure would increase by that amount. 896 00:37:56,397 --> 00:37:58,230 So I could make the error as big or as small 897 00:37:58,230 --> 00:38:00,720 as I want by just changing the size of the values. 898 00:38:00,720 --> 00:38:02,830 That doesn't make any sense. 899 00:38:02,830 --> 00:38:06,250 I'd like a way to measure goodness of fit 900 00:38:06,250 --> 00:38:08,960 that is scale independent and that tells me 901 00:38:08,960 --> 00:38:11,860 for any fit how close it comes to being 902 00:38:11,860 --> 00:38:14,105 the perfect fit to the data. 903 00:38:14,105 --> 00:38:15,980 And so for that, we're going to use something 904 00:38:15,980 --> 00:38:18,170 called the coefficient of determination, 905 00:38:18,170 --> 00:38:20,089 written as r squared. 906 00:38:20,089 --> 00:38:22,130 So let me show you what this does, and then we're 907 00:38:22,130 --> 00:38:24,400 going to use it. 908 00:38:24,400 --> 00:38:26,890 The y's are measured values. 909 00:38:26,890 --> 00:38:29,470 Those are my samples I got from my experiment. 910 00:38:29,470 --> 00:38:31,870 The p's are the predicted values. 911 00:38:31,870 --> 00:38:34,360 That is, for this curve, here's what I 912 00:38:34,360 --> 00:38:36,250 predict those values should be. 913 00:38:36,250 --> 00:38:38,440 So the top here is basically measuring, 914 00:38:38,440 --> 00:38:42,510 as we saw before, the sum of squared errors in those predictions. 915 00:38:42,510 --> 00:38:47,380 Mu down here is the average, or mean, of the measured values. 916 00:38:47,380 --> 00:38:48,660 It's the average of the y's. 917 00:38:50,650 --> 00:38:53,040 So what I've got here is in the numerator-- 918 00:38:53,040 --> 00:38:56,150 this is basically the error in the estimates from my curve 919 00:38:56,150 --> 00:38:58,240 fit. 920 00:38:58,240 --> 00:39:01,510 And in the denominator I've got the amount of variation 921 00:39:01,510 --> 00:39:03,830 in the data itself. 922 00:39:03,830 --> 00:39:07,270 This is telling me how much does the data change from just being 923 00:39:07,270 --> 00:39:09,850 a constant value, and this is telling me how much 924 00:39:09,850 --> 00:39:13,270 do my errors vary around it. 925 00:39:13,270 --> 00:39:16,416 That ratio is scale independent because it's a ratio. 926 00:39:16,416 --> 00:39:18,040 So even if I increase all of the values 927 00:39:18,040 --> 00:39:19,831 by some amount, that's going to divide out, 928 00:39:19,831 --> 00:39:22,070 which is kind of nice.
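Written out, the expression being described is the standard coefficient of determination, where the y_i are the measured values, the p_i are the predicted values, and \mu is the mean of the measured values:

R^2 = 1 - \frac{\sum_i (y_i - p_i)^2}{\sum_i (y_i - \mu)^2}, \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} y_i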
929 00:39:22,070 --> 00:39:25,040 So I could compute that, and there it is. 930 00:39:25,040 --> 00:39:27,340 R squared is, again, that expression. 931 00:39:27,340 --> 00:39:29,120 I'll take in a set of observed values, 932 00:39:29,120 --> 00:39:32,450 a set of predicted values, and I'll measure the error-- 933 00:39:32,450 --> 00:39:33,620 again, these are arrays. 934 00:39:33,620 --> 00:39:35,730 So I'm going to take the difference between the arrays. 935 00:39:35,730 --> 00:39:37,604 That's going to give me that difference, 936 00:39:37,604 --> 00:39:38,510 pairwise. 937 00:39:38,510 --> 00:39:39,205 I'll square it. 938 00:39:39,205 --> 00:39:41,330 That's going to give me at every point in the array 939 00:39:41,330 --> 00:39:43,280 the square of that distance. 940 00:39:43,280 --> 00:39:44,720 And then because it's an array, I 941 00:39:44,720 --> 00:39:47,150 can just use the built-in sum function to add them all up. 942 00:39:47,150 --> 00:39:48,780 So this is going to give me the-- 943 00:39:48,780 --> 00:39:51,380 if you like, the values up there. 944 00:39:51,380 --> 00:39:54,060 And then I'm going to play a little trick. 945 00:39:54,060 --> 00:39:55,980 I'm going to compute the mean error, which 946 00:39:55,980 --> 00:40:00,310 is that thing divided by the number of observations. 947 00:40:00,310 --> 00:40:02,350 Why would I do that? 948 00:40:02,350 --> 00:40:05,350 Well, because then I can compute this really simply. 949 00:40:05,350 --> 00:40:07,840 I could write a little loop to compute it. 950 00:40:07,840 --> 00:40:10,000 But in fact, I've already said what that is. 951 00:40:10,000 --> 00:40:14,510 If I take that sum and divide it by the number of samples, 952 00:40:14,510 --> 00:40:16,380 that's the variance. 953 00:40:16,380 --> 00:40:17,340 So that's really nice. 954 00:40:17,340 --> 00:40:19,950 Right here I can say, get the variance, 955 00:40:19,950 --> 00:40:23,990 using the NumPy version, of the observed data. 956 00:40:23,990 --> 00:40:27,090 And because that has associated with it division 957 00:40:27,090 --> 00:40:29,460 by the number of samples, the ratio 958 00:40:29,460 --> 00:40:31,110 of the mean error to the variance 959 00:40:31,110 --> 00:40:35,290 is exactly the same as the ratio of that to that. 960 00:40:35,290 --> 00:40:35,990 Little trick. 961 00:40:35,990 --> 00:40:38,620 It lets me save doing a little bit of computation. 962 00:40:38,620 --> 00:40:40,970 So I can compute r squared values.
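A sketch of that function as described, where the trick is that numpy.var also divides by the number of samples, so the division cancels in the ratio (the lecture's actual code may differ in details):

```python
import numpy

def rSquared(observed, predicted):
    # observed and predicted are arrays, so the subtraction is pairwise
    error = ((predicted - observed)**2).sum()
    meanError = error/len(observed)
    # numpy.var divides by len(observed) too, so meanError/variance equals
    # the ratio of the two sums in the formula for r squared
    return 1 - (meanError/numpy.var(observed))
```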
963 00:40:43,510 --> 00:40:47,300 So what does r squared actually tell us? 964 00:40:47,300 --> 00:40:50,210 What we're doing is we're trying to compare the estimation 965 00:40:50,210 --> 00:40:53,120 errors, the top part, with the variability 966 00:40:53,120 --> 00:40:55,949 in the original values, the bottom part. 967 00:40:55,949 --> 00:40:57,740 So r squared, as you're going to see there, 968 00:40:57,740 --> 00:41:00,020 it's intended to capture what portion 969 00:41:00,020 --> 00:41:05,310 of the variability in the data is accounted for by my model. 970 00:41:05,310 --> 00:41:06,690 If my model's a really good fit, 971 00:41:06,690 --> 00:41:10,920 it should account for almost all of that variability. 972 00:41:10,920 --> 00:41:15,860 So what we see then is if we do a fit with a linear regression, 973 00:41:15,860 --> 00:41:19,140 r squared is always going to be between zero and one. 974 00:41:19,140 --> 00:41:21,920 And I want to just show you some examples. 975 00:41:21,920 --> 00:41:24,390 If r squared is equal to one, this is great. 976 00:41:24,390 --> 00:41:28,350 It says the model explains all of the variability in the data. 977 00:41:28,350 --> 00:41:31,097 And you can see it if we go back here. 978 00:41:31,097 --> 00:41:32,680 How do we make r squared equal to one? 979 00:41:32,680 --> 00:41:35,710 We need the numerator to be zero, which says 980 00:41:35,710 --> 00:41:38,140 that the variability in the data is perfectly 981 00:41:38,140 --> 00:41:40,630 predicted by my model. 982 00:41:40,630 --> 00:41:42,469 Every point lies exactly along the curve. 983 00:41:42,469 --> 00:41:43,010 That's great. 984 00:41:47,050 --> 00:41:48,610 Second option at the other extreme 985 00:41:48,610 --> 00:41:51,040 is if r squared is equal to zero, 986 00:41:51,040 --> 00:41:54,670 you basically got bupkis, which is a well-known technical term, 987 00:41:54,670 --> 00:41:57,640 meaning there's no relationship between the values predicted 988 00:41:57,640 --> 00:42:00,280 by the model and the actual data. 989 00:42:00,280 --> 00:42:03,010 That basically says that the error in my estimates 990 00:42:03,010 --> 00:42:06,430 is exactly the same as all the variability in the data. 991 00:42:06,430 --> 00:42:07,900 The model doesn't capture anything, 992 00:42:07,900 --> 00:42:12,320 so the ratio is one, which makes the whole thing zero. 993 00:42:12,320 --> 00:42:14,320 And then in between, an r squared of about a half 994 00:42:14,320 --> 00:42:16,430 says you're capturing about half the variability. 995 00:42:16,430 --> 00:42:18,730 So what you would like is a system 996 00:42:18,730 --> 00:42:22,480 in which your fit is as close to an r squared value of one 997 00:42:22,480 --> 00:42:24,880 as possible because it says my model is 998 00:42:24,880 --> 00:42:28,130 capturing all the variability in the data really well. 999 00:42:31,570 --> 00:42:33,651 So two functions that will do this for us. 1000 00:42:33,651 --> 00:42:35,900 We're going to come back to these in the next lecture. 1001 00:42:35,900 --> 00:42:38,500 The first one, called generate fits, or genFits, 1002 00:42:38,500 --> 00:42:41,480 will take a set of x values, a set of y values, 1003 00:42:41,480 --> 00:42:43,670 and a list or a tuple of degrees, 1004 00:42:43,670 --> 00:42:45,560 and these will be the different degrees 1005 00:42:45,560 --> 00:42:47,180 of models I'd like to fit. 1006 00:42:47,180 --> 00:42:48,687 I could just give it one. 1007 00:42:48,687 --> 00:42:49,520 I could give it two. 1008 00:42:49,520 --> 00:42:52,636 I could give it 1, 2, 4, 8, 16, whatever. 1009 00:42:52,636 --> 00:42:54,260 And I'll just run through a little loop 1010 00:42:54,260 --> 00:42:56,270 here where I'm going to build up a set of models 1011 00:42:56,270 --> 00:42:58,670 for each degree-- for d in degrees. 1012 00:42:58,670 --> 00:43:00,452 I'll do the fit exactly as I had before. 1013 00:43:00,452 --> 00:43:01,910 It's going to return a model, which 1014 00:43:01,910 --> 00:43:03,860 is a tuple of coefficients. 1015 00:43:03,860 --> 00:43:07,760 And I'm going to store that in models and then return it. 1016 00:43:07,760 --> 00:43:10,510 And then I'm going to use that, because in testFits I 1017 00:43:10,510 --> 00:43:13,480 will take the models that come from genFits, 1018 00:43:13,480 --> 00:43:15,770 I'll take the set of degrees that I also passed 1019 00:43:15,770 --> 00:43:17,780 in there as well as the values. 1020 00:43:17,780 --> 00:43:19,160 I'll plot them out, and then I'll 1021 00:43:19,160 --> 00:43:25,120 simply run through each of the models and generate a fit, 1022 00:43:25,120 --> 00:43:29,241 compute the r squared value, plot it, and then print out 1023 00:43:29,241 --> 00:43:29,740 some data.
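Here's a sketch of those two functions, assembled from the description just given. The plotting and labeling details are guesses, and rSquared is the function from above:

```python
def genFits(xVals, yVals, degrees):
    # build one model per requested degree; each model is a tuple of coefficients
    models = []
    for d in degrees:
        model = pylab.polyfit(xVals, yVals, d)
        models.append(model)
    return models

def testFits(models, degrees, xVals, yVals, title):
    pylab.plot(xVals, yVals, 'o', label='Data')
    for i in range(len(models)):
        estYVals = pylab.polyval(models[i], xVals)
        error = rSquared(yVals, estYVals)
        pylab.plot(xVals, estYVals,
                   label='Fit of degree ' + str(degrees[i])
                         + ', R2 = ' + str(round(error, 5)))
    pylab.legend(loc='best')
    pylab.title(title)
    pylab.show()

# Usage, as in the demo: a line against a parabola on the mystery data
models = genFits(xVals, yVals, (1, 2))
testFits(models, (1, 2), xVals, yVals, 'Mystery data')
```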
1024 00:43:31,890 --> 00:43:37,200 With that in mind, let's see what happens if we run this. 1025 00:43:37,200 --> 00:43:40,360 So I'm going to take, again, that example 1026 00:43:40,360 --> 00:43:43,445 of that data that I started with, 1027 00:43:43,445 --> 00:43:45,570 assuming I picked the right one here, which I think 1028 00:43:45,570 --> 00:43:46,110 is this one. 1029 00:43:46,110 --> 00:43:49,860 I'm going to do a fit with a degree one and a degree 1030 00:43:49,860 --> 00:43:51,114 two curve. 1031 00:43:51,114 --> 00:43:52,530 So I'm going to fit the best line. 1032 00:43:52,530 --> 00:43:55,020 I'm going to fit the best quadratic, the best parabola, 1033 00:43:55,020 --> 00:43:58,200 and I want to see how well that comes out. 1034 00:43:58,200 --> 00:44:00,950 So I do that. 1035 00:44:00,950 --> 00:44:02,000 I got some data there. 1036 00:44:02,000 --> 00:44:02,570 Looks good. 1037 00:44:02,570 --> 00:44:05,870 And what does the data tell me? 1038 00:44:05,870 --> 00:44:09,112 Data says, oh, cool-- 1039 00:44:09,112 --> 00:44:10,570 I know you don't believe it, but it 1040 00:44:10,570 --> 00:44:12,040 is cool, because notice what it says: it 1041 00:44:12,040 --> 00:44:17,670 says the r squared value for the line is horrible. 1042 00:44:17,670 --> 00:44:24,314 It accounts for less than 0.05% of the variability in the data. 1043 00:44:24,314 --> 00:44:25,730 You could say, OK, I can see that. 1044 00:44:25,730 --> 00:44:26,530 I look at it. 1045 00:44:26,530 --> 00:44:28,380 It does a lousy job. 1046 00:44:28,380 --> 00:44:31,290 On the other hand, the quadratic is really pretty good. 1047 00:44:31,290 --> 00:44:35,820 It's accounting for about 84% of the variability in the data. 1048 00:44:35,820 --> 00:44:37,770 This is a nice high value. 1049 00:44:37,770 --> 00:44:40,270 It's not one, but it's a nice high value. 1050 00:44:40,270 --> 00:44:42,660 So this is now reinforcing what I already 1051 00:44:42,660 --> 00:44:44,040 knew, but in a nice way. 1052 00:44:44,040 --> 00:44:47,190 It's telling me that that r squared value tells me 1053 00:44:47,190 --> 00:44:51,851 that the quadratic is a much better fit than the linear fit 1054 00:44:51,851 --> 00:44:52,350 was. 1055 00:44:55,840 --> 00:44:57,940 But then you say maybe, wait a minute. 1056 00:44:57,940 --> 00:45:01,510 I could have done this by just comparing the fits themselves. 1057 00:45:01,510 --> 00:45:03,010 I already saw that. 1058 00:45:03,010 --> 00:45:05,500 Part of my goal is how do I know if I've got 1059 00:45:05,500 --> 00:45:08,360 the best fit possible or not. 1060 00:45:08,360 --> 00:45:09,970 So I'm going to do the same thing, 1061 00:45:09,970 --> 00:45:16,597 but now I'm going to run it with another set of degrees. 1062 00:45:16,597 --> 00:45:17,680 I'm going to go over here. 1063 00:45:17,680 --> 00:45:19,510 I'm going to take exactly the same code. 1064 00:45:19,510 --> 00:45:23,840 But let's try it with a quadratic, 1065 00:45:23,840 --> 00:45:27,500 with a quartic, an order eight, and an order 16 fit. 1066 00:45:27,500 --> 00:45:30,590 So I'm going to take different size polynomials. 1067 00:45:30,590 --> 00:45:33,200 As a quick aside, this is why I want 1068 00:45:33,200 --> 00:45:35,750 to use the PyLab kind of code, because now I'm 1069 00:45:35,750 --> 00:45:38,930 simply optimizing over a 17-dimensional space-- 1070 00:45:38,930 --> 00:45:41,350 a 16th-degree polynomial has 17 coefficients, and every point in that space 1071 00:45:41,350 --> 00:45:44,010 defines a 16th-degree polynomial. 1072 00:45:44,010 --> 00:45:45,950 And I can still use linear regression, 1073 00:45:45,950 --> 00:45:47,690 meaning walking down the gradient, 1074 00:45:47,690 --> 00:45:50,210 to find the best solution. 1075 00:45:50,210 --> 00:45:53,342 I'm going to run this. 1076 00:45:53,342 --> 00:45:56,460 And I get out a set of values. 1077 00:45:56,460 --> 00:45:57,357 Looks good. 1078 00:45:57,357 --> 00:45:58,440 And let's go look at them.
1079 00:46:03,780 --> 00:46:08,960 Here is the r squared value for quadratic, about 84%. 1080 00:46:08,960 --> 00:46:11,330 Degree four does a little bit better. 1081 00:46:11,330 --> 00:46:13,130 Degree eight does a little bit better. 1082 00:46:13,130 --> 00:46:16,590 But wow, look at that, degree 16-- 1083 00:46:16,590 --> 00:46:20,640 a 16th-order polynomial does a really good job, 1084 00:46:20,640 --> 00:46:26,100 accounts for almost 97% of the variability in the data. 1085 00:46:26,100 --> 00:46:26,850 That sounds great. 1086 00:46:29,225 --> 00:46:31,320 Now, to quote something that your parents probably 1087 00:46:31,320 --> 00:46:34,020 said to you when you were much younger, just because something 1088 00:46:34,020 --> 00:46:37,350 looks good doesn't mean we should do it. 1089 00:46:37,350 --> 00:46:40,725 And in fact, just because this has a really high r 1090 00:46:40,725 --> 00:46:44,190 squared value doesn't mean that we want to use the order-16 1091 00:46:44,190 --> 00:46:46,560 polynomial. 1092 00:46:46,560 --> 00:46:49,770 And I will wonderfully leave you waiting in suspense 1093 00:46:49,770 --> 00:46:52,564 because we're going to answer that question next Monday. 1094 00:46:52,564 --> 00:46:54,730 And with that, I'll let you out a few minutes early. 1095 00:46:54,730 --> 00:46:57,110 Have a great Thanksgiving break.