Flash and JavaScript are required for this feature.
Download the video from iTunes U or the Internet Archive.
Description: Prof. Grimson talks about how to model experimental data in a way that gives a sense of the underlying mechanism and to predict behaviour in new settings.
Instructor: Eric Grimson
Lecture 9: Understanding Ex...
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
ERIC GRIMSON: OK, welcome back or welcome, depending on whether you've been away or not. I'm going to start with two simple announcements. There is a reading assignment for this lecture, actually for the next two lectures, which is chapter 18. And on a much happier note, there is no lecture Wednesday because we hope that you're going to be busy preparing to get that tryptophan poisoning as you eat way too much turkey and you fall asleep. More importantly, I hope you have a great break over Thanksgiving, whether you're here or you're back home or wherever you are. But no lecture Wednesday.
Topic for today, I'm going to start with seems like-- sorry, what's going to seem like a really obvious statement. We're living in a data intensive world. Whether you're a scientist, an engineer, social scientist, financial worker, politician, manager of a sports team, you're spending increasingly larger amounts of time dealing with data. And if you're in one of those positions, that often means that you're either writing code or you're hiring somebody to write code for you to figure out that data.
And this section of the course is focusing on exactly that issue. We want to help you understand what you can try to do with software that manipulates data, how you can write code that would do that manipulation of data for you, and especially what you should believe about what that software tells you about data, because sometimes it tells you stuff that isn't exactly what you need to know. And today we're going to start that by looking at particularly the case where we get data from experiments. So think of this lecture and the next one as sort of being statistics meets experimental science.
So what do I mean by that? Imagine you're doing a physics lab, biology lab, a chemistry lab, or even something in sociology or anthropology, you conduct an experiment to gather some data. It could be measurements in a lab. It could be answers on a questionnaire. You get a set of data. Once you've got the data, you want to think about what can I do with it, and that usually will involve using some model, some theory about the underlying process to generate questions about the data. What does this data and the model associated with it tell me about future expectations, help me predict other results that will come out of this data. In the social case, it could be how do I think about how people are going to respond to a poll about who are you voting for in the next election, for example.
Given the data, given the model, the third thing we're typically going to want to do is then design a computation to help us answer questions about the data, run a computational experiment to complement the physical experiment or the social experiment we used to gather the data in the first place. And that computation could be something deep. It could be something a little more interesting, depending on how you're thinking about it. But we want to think about how do we use computation to run additional experiments for us.
So I'm going to start by using an example of gathering experimental data, and I want to start with the idea of a spring. How would I model a spring? How would I gather data about a spring? And how would I write software to help me answer questions about a spring? So what's spring? Well, there's one kind of spring, a little hard to model, although it could be interesting what's swimming around in there and how do I think about the ecological implications of that spring. Here's a second kind of spring. It's about four or five months away, but eventually we'll get through this winter and get to that spring and that would be nice, but I'm not going to model that one either. And yes, my jokes are really bad, and yes, you can't do a darn thing about them because I am tenured because-- while I'd like to model these two springs, we're going to stick with the one that you see in physics labs, these kinds of springs, so-called linear springs. And these are springs that have the property that you can stretch or compress them by applying a force to it. And when you release them, they literally spring back to the position they were originally.
So we're going to deal with these kinds of springs. And the distinguishing characteristics of these two springs and others in this class is that that force you require to compress it or stretch it a certain amount-- the amount of force you require varies linearly in the distance. So if it takes some amount of force to compress it some amount of distance, it takes twice as much force to compress it twice as much of a distance. It's linearly related.
So each one of these springs-- these kinds of springs has that property. The amount of force needed to stretch or compress it's linear in that distance. Associated with these springs there is something called a spring constant-- usually represented by the number k-- that determines how much force do you need to stretch or compress the spring. Now, it turns out that that spring constant can vary a lot. The slinky actually has a very low spring constant. It's one newton per meter. That spring on the suspension of a motorcycle has a much bigger spring constant. It's a lot stiffer, 35,000 newtons per meter. And just in case you don't remember, a newton is the amount of force you need to accelerate a one-kilogram mass one meter per second squared. We'll come back to that in a second. But the idea is we'd like to think about how do we model these kinds of springs.
Well, turns out, fortunately for us, that that was done about 300-plus years ago by a British physicist named Robert Hooke. Back in 1676 he formulated Hooke's law of elasticity. Simple expression that says the force you need to compress or stretch a spring is linearly related to the distance, d, that you've actually done that compression in, or another way of saying it is, if I compress a spring some amount, the force that's stored in it is linearly related to that distance. And the negative sign here basically says it's pointing in the opposite direction. So if I compress, the force is going to push it back out. If I stretch it, the force is going to push back into that resting position.
Now, this law holds for a wide range of springs, which is kind of nice. It's going to hold both in biological systems as well as in physical systems. It doesn't hold perfectly. There's a limit to how much you can stretch, in particular, a spring before the law breaks down, and maybe you did this as a kid, right. If you take a slinky and pull it too far apart, it stops working because you've exceeded what's called the elastic limit of the spring. Similarly, if you compress it too far, although I think you have to compress it a long ways, it'll stop working as well. So it doesn't hold completely, and it also doesn't hold for all springs. Only those springs that satisfy this linear law, which are a lot of them. So, for example, it doesn't apply to rubber bands, it doesn't apply to recurved bows. Those are two examples of springs that do not obey this linear relationship.
But nonetheless, there's Hooke's law. And one of the things we can do is say, well, let's use it to do a little bit of reasoning about this spring. So we can ask the question, how much does a rider have to weigh to compress this spring by one centimeter? And we've got Hooke's law, and I also gave you a little bit of hint here. So I told you that this spring has a spring constant of 35,000 newtons per meter. So I could just plug this in, right, one centimeter, it's 1/100 of a meter times-- so that's the-- there's the spring constant. There's the amount we're going to compress it. Do a little math, and that says that the force I need is 350 newtons. So what's a newton? A small town in Massachusetts, an interesting cookie, and a force that we want to think about. I keep telling you guys, the jokes are really bad.
So how do I get force? Well, you know that. Mass times acceleration, right, F equals ma. For acceleration here, I'm going to make an assumption, which is that the spring is basically oriented perpendicular to the earth, so that the acceleration is just the acceleration of gravity, which is roughly 9.8 meters per second squared. It's basically pulling it down. So I could plug that back in because remember what I want to do is figure out what's the mass I need.
So for the force, I'm substituting that in. I've got that expression, mass times 9.8 meters divided by seconds squared is 350 newtons, divide through by 9.8 both sides, do a little bit of math. And it says that the mass I need is 350 kilograms divided by 9.8. And that k refers to kilograms, not to the spring constant. Poor choice of example, but there I am. And if I do the math, it says I need a rider that weighs 35.68 kilos. And if you're not big on the metric system, it's actually a fairly light rider. That's about 79 pounds. So a 79-pound rider would compress that spring one centimeter.
So we can figure out how to use Hooke's law. We're thinking about what we want to do with springs. That's kind of nice. How will we actually get the spring constant? It's really valuable to know what the spring constant is. And just to give you a sense of that, it's not just to deal with things like slinkies. Atomic force microscopes, need to know the spring constants of the components in order to calibrate them properly. The force you need to deform a strand of DNA is directly related to the spring constants of the biological structures themselves. So I'd really like to figure out how do I get them.
How many of you have done this experiment in physics and hated it? Right. Well, I don't know if you hated it or not, but you've done it, right? Standard way to do it is I'd take a spring, I suspend it from some point. Let it come to a resting position. And then I put a mass on the bottom of the spring. It kind of bounces around. And when it settles, I measure the distance from where it was before I put the mass on to the distance of where it is after I've added the mass. I measure that distance. And then I just plug in.
I plug into that formula there. The force is minus k times d. So k the spring constant is the force, forget the minus sign, divided by the distance, and the force here would be 9.8 meters per second squared or-- kilograms per second squared times the mass divided by d. So I could just plug it in. In an ideal world, I'd plug it in, I'm done, one measurement. Not so much, right. Masses aren't always perfectly calibrated. Maybe the spring has got not perfect materials in it. So ideally I'd actually do multiple trials. I would take different weights, put them on the spring, make the measurements, and just record those.
So that's what I'm going to do, and I've actually done that. I'm not going to make you do it. But I get out a set of measurements.
What have I done here? I've used different masses, all increasing by now 0.05 kilograms, and I've measured the distance that the spring has deformed. And ideally, these would all have that nice linear relationship, so I could just plug them in and I could figure out what the spring constant is.
So let's take this data and let's plot it. And by the way, all the code you'll be able to see when you download the file, I'm going to walk through some of it quickly. This is a simple way to deal with it, and I'm going to back up for a second. There's my data, and I actually have done this in some ways the wrong order. These are my independent measures, different masses. I'm going to plot those along the x-axis, the horizontal axis. These are the dependent things. These are the things I'm measuring. I'm going to plot those along the y-axis. So I really should have put them in the other order. So just cross your eyes and make this column go over to that column, and we'll be in good shape.
Let's plot this. So here's a little file. Having stored those away in a file, I'm just going to read them in, get data. Just going to do the obvious thing of read in these things and return two tuples or lists, one for the x values-- or if you like, again going back to it, this set of values, and one for the y values. Now I'm going to play a little trick that you may have seen before that's going to be handy to me. I'm going to actually call this function out of the PyLab library called array. I pass in that tuple, and what it does is it converts it into an array, which is a data structure that has a fixed number of slots in it but has a really nice property I want to take advantage of. I could do all of this with lists.
But by converting that into array and then giving it the same name xVals and similarly for the yVals, I can now do math on the array without having to write loops. And in particular right here, notice what I'm doing. I'm taking xVals, which is an array, multiplying it by a number. And what that does is it takes every entry in the array, multiplies that entry, and puts it into basically a new version of the array, which I then store into xVals.
If you've programmed in Matlab, this is the same kind of feeling, right. I can take an array, do something to it, and that's really nice. So I'm going to scale all of my values, and then I'm going to plot them out some appropriate things. And if I do it, I get that.
I thought we said Hooke's law was a linear relationship. So in an ideal world, all of these points ought to lay along a line somewhere, where the slope of the line would tell me the spring constant. Not so good, right. And in fact, if you look at it, you can kind of see-- in here you can kind of imagine there's a line there, something funky is going on up here. And we're going to come back to that at the end of the lecture. But how do we think about actually finding the line? Well, we know there's noise in the measurement, so our best thing to do is to say, well, could we just fit a line to this data? And how would we do that? And that's the first big thing we want to do today. We want to try and figure out, given that we've got measurement noise, how do we fit a line to it.
So how do we fit a curve to data? Well, what we're basically going to try and do is find a way to relate an independent variable, which were the masses, the y values, to the dependent-- sorry, wrong way. The independent values, which are the x-axis, to the dependent value, what is the actual displacement we're going to see? So another way of saying it is if I go back to here, I want to know for every point along here, how do I fit something that predicts what the y value is? So I need to figure out how to do that fit.
To decide-- even if I had a curve, a line that I thought was a good fit to that, I need to decide how good it is. So imagine I was lucky and somebody said, here's a line that I think describes Hooke's law in this case. Great. I could draw the line on that data. I could draw it on this chunk of data here. I still need to decide how do I know if it's a good fit. And for that, we need something we call an objective function, and it's going to measure how close is the line to the data to which I'm trying to fit it.
Once we've defined the objective function, then what we say is, OK, now let's find the line that minimizes it, the best possible line, the line that makes that objective function as small as possible, because that's going to be the best fit to the data. And so that's what I'd like to do. We're going to see-- we're going to do it for general curves, but we're going to start just with lines, with linear function. So in this case, we want to say what's the line such that some function of the sum of the distances from the line to the measured points is minimized. And I'm going to come back in a second to how do we find the line. But first we've got to think about what does it mean to measure it.
So I've got a point. Imagine I got a line that I think is a good match for the thing fitting the data. How do I measure distance? Well, there's one option. I could measure just the displacement along the x-axis. There's a second option. I could measure the displacement vertically. Or a third option is I could actually measure the distance to the closest point on the line, which would be that perpendicular distance there.
You're way too quiet, which is always dangerous. What do you think? I'm going to look for a show of hands here. How many people think we should use x as the thing that we measure here? Hands up. Please don't use a single finger when you put your hand up. All right. Good. How many people think we should use p, the perpendicular distance? Reasonable number of hands. And how about y? And I see actually about split between p and y. And that's actually really good. X doesn't make a lot of sense, right, because I know that my values along the x-axis are independent measurements. So the displacement in that direction doesn't make a lot of sense. P makes a lot of sense, but unfortunately isn't what I want. We're going to see examples later on where, in fact, minimizing things where you minimize that distance is the right thing to do. When we do machine learning, that is how you find what's called a classifier or a separator.
But actually here we're going to pick y, and the reason is important. I'm trying to predict the dependent value, which is the y value, given an independent new x value. And so the displacement, the uncertainty is, in fact, the vertical displacement. And so I'm going to use y. That displacement is the thing I'm going to measure as the distance.
How do I find this? I need an objective function that's going to tell me what is the closeness of the fit. So here's how I'm going to do it. I'm going to have some set of observed values. Think of it as an array. I've got some index into them, so the indices are giving me the x values. And the observed values are the things I've actually measured. If you want to think of it this way, I'm going to go back to this slide really quickly. The observed values are the displacements or the values along the y-axis.
Sorry about that.
Let's assume that I have some hypothesized line that I think fits this data, y equals ax plus b. I know the a and the b. I've hypothesized it. Then predicted will basically say given the x value, the line predicts here's what the y value should be. And so I'm going to take the difference between those two and square them. So the difference makes sense. It tells me how far away is the observed value from what the line predicts it should be. Why am I squaring it? Well, there are two reasons. The first one is that squaring is going to get rid of the sign. It shouldn't matter if my observed value is some amount above the predicted value or some amount below-- the same amount below the predicted value. The displacement in direction shouldn't matter. It's how far away is it. Now, you could say, well, why not just use absolute value? And the answer is you could, but we're going to see in a couple of slides that by using the square we get a really nice property that helps us find the best fitting line.
So my objective function here basically says, given a bunch of observed values, use the hypothesized line to predict what the value should be, measure the difference in the y direction-- which is what I'm doing because I'm measuring predicted and observed y values-- square them, sum them all up. It's called least squares. That's going to give me a measure of how close that line is to a fit. In a second, I'll get to how you find the best line. But this hopefully looks familiar. Anybody recognize this? You've seen it earlier in this class. Boy, that's a terrible thing to ask because you don't even remember the last thing you did in this class other than the problem set.
AUDIENCE: [INAUDIBLE]
ERIC GRIMSON: Sorry?
AUDIENCE: Variance.
ERIC GRIMSON: Variance. Thank you. Absolutely. Sorry, I didn't bring any candy today. That's Professor Guttag. I got a better arm than he does, but I still didn't bring any candy today. Yeah, it's variance, not quite. It's almost variance. That's the variance times the number of observations, or another way of saying it is if I divided this by the number of observations, that would be the variance. If I took the square root, it would be the standard deviation. Why is that valuable? Because that tells you something about how badly things are dispersed, how much variation there is in this measurement. And so if it says, if I can minimize this expression, that's great because it not only will find what I hope is the best fit, but it's going to minimize the variance between what I predict and what I measure, which makes intuitive sense. That's exactly the thing I would like to minimize.
This was built on the assumption that I had a line that I thought was a good fit, and this lets me measure how good a fit I have. But I still have to do a little bit more. I have to now figure out, OK, how do I find the best-fitting line? And for that, we need to come up with a minimization technique. So to minimize this objective function, I want to find the curve for the predicted values-- this thing here-- some way of representing that that leads to the best possible solution.
And I'm going to make a simple assumption. I'm going to assume that my model for this predicted curve-- I've been using the example of a line, but we're going to say curve-- is a polynomial. It's a polynomial and one variable. The one variable is what are the x values of the samples. And I'm going to assume that the curve is a polynomial. In the simplest case, it's a line in case order, and two, it's going to be a parabola. And I'm going to use a technique called linear regression to find the polynomial that best fits the data, that minimizes that objective function.
Quick aside, just to remind you, I'm sure you remember, so polynomial-- polynomials, either the value is zero, which is really boring, or it is a finite sum of non-zero terms that all have the form c times x to the p. C is a constant, a real number. P is a power, a non-negative integer. And this is basically-- x is the free variable that's going to capture this. So easy way to say it is a line would be represented as a degree one polynomial ax plus b. A parabola is a second-degree polynomial, ax squared plus bx plus c. And we can go up to higher order terms.
We're going to refer to the degree of the polynomial as the largest degree of any term in that polynomial. So again, degree one, linear degree two, quadratic. Now how do I use that? Well, here's the basic idea. Let's take a simple example. Let's assume I'm still just trying to fit a line. So my assumption is I want to find a degree one polynomial, y equals ax plus b, as our model of the day. That means for every sample, I'm going to plug in x, and if I know a and b, it gives me the predicted value. I've already seen that's going to give me a good measure of the closeness of the fit. And the question is, how do I find a and b.
My goal is find a and b such that when we use this polynomial to compute those y values, that sum squared difference is minimized. So the sum squared difference is my measure of fit. All I have to do is find a and b. And that's where linear regression comes in, and I want to just give you a visualization of this. If a line is described by ax plus b, then I can represent every possible line in a two-dimensional space. One axis is possible values for a. The other axis is possible values for b. So if you think about it, I take any point in that space. It gives me an a and a B value. That describes a line. Why should you care about that? Because I can put a two-dimensional surface over that space. In other words, for every a and b, that gives me a line, and I could, therefore, compute this function, given the observed values and the predicted values, and it would give me a value, which is the height of the surface in that space.
If you're with me with the visualization, why is that nice? Because linear regression gives me a very easy way to find the lowest point on that surface, which is exactly the solution I want, because that's the best fitting line. And it's called linear regression not because we're solving for a line, but because of how you do that solution. If you think of this as being-- take a marble on this two-dimensional surface, you want to place the marble on it, you want to let it run down to the lowest point in the surface. And oh, yeah, I promised you why do we use sum squares, because if we used the sum of the squares, that surface always has only one minimum. So it's not a really funky, convoluted surface. It has exactly one minimum. It's called linear regression because the way to find it is to start at some point and walk downhill. I linearly regress or walk downhill along the gradient some distance, measure the new gradient, and do that until I get down to the lowest point in the surface.
Could you write code to do it? Sure. Are we going to ask you to do it? No, because fortunately-- I was hoping to get a cheer out of that. Too bad. OK, maybe we will ask you to do it on the exam. What the hell. You could do it. In fact, you've seen a version of this. The typical algorithm for doing it is very similar to Newton's method that we used way back in the beginning of 60001 when we found square roots. You could write that kind of a solution, but the good news is that the nice people who wrote Python, or particularly PyLab, have given you code to do it. And we're going to take advantage of it.
So in PyLab there is a built-in function called polyFit. It takes a collection of x values, takes a collection of equal length of y values-- they need to be the same length. I'm going to assume they're arrays. And it takes an integer n, which is the degree of fit, that I want to apply. And what polyFit will do is it will find the coefficients of a polynomial of that degree that provides the best least squares fit. So think of it as polyFit walking along that surface to find the best a and b that will come back. So if I give it a value of n equals one, it'll give me back the a and b that gives me the best line. If I get a value of n equal two, it gives me back a, b, and c that would fit an ax squared plus bx plus c parabola to best fit the data. And I could pick n to be any non-negative integer, and it would actually come up with a good fit.
So let's use it. I'm going to write a little function called fitData. The first part up here just comes from plotData. It's exactly the same thing. I read in the data. I convert them into arrays. I convert this because I want to get out the force. I go ahead and plot it. And then notice what I do, I use polyFit right here to take the inputted x values and y values and a degree one, and it's going to give me back a tuple, an a and a b that are the best fit line. Finds that point in the space that best fits it. Once I've got that, I could go ahead and actually compute now what are the estimated or predicted values. The line's going to tell me what I should have seen as those values, and I'm going to do the same thing. I'm going to take x values, convert it into array, multiply it by a, which says every entry in the array is scaled by a. Add b to every entry. So I'm just computing ax plus b for all possible x's. And that then gives me an estimated set of y values, and I can plot those out.
I'm cheating here. Sorry. I'm misdirecting you. I never cheat. I actually don't need to do the conversion to an array there because I did it up here. But because I've borrowed this from plot lab, I wanted to show you that I can redundantly do it here to remind you that I want to convert it into array to make sure I can do that kind of algebra on it. The last thing I could do is say even if I can-- once I show you the fit of this line, I also want to get out the spring constant. Now, the slope of this line is difference in force over difference in distance. The spring constant is the opposite of it. So I could simply take the slope of the line, which is a, invert it, and that gives me the spring constant.
So let's see what happens if we actually run this. So I'm going to go over to my code, hoping that it works properly.
Here's my Python. I've loaded this in. I'm going to run it. And there you go. Fits a line, and it prints out the value of a, which is about 0.46, and the value of b. And if I go back and look at this, there we go, spring constant is about 21 and a half, which is about the reciprocal of 0.046 if you can figure that out. And you can see, it's not a bad fit to a line through that data. Again, there's still something funky going on over here that we're going to come back to. But it's a pretty good fit to the data.
Great. So now I've got a fit. I'm going to show you a variation of this that we're going to use in a second. I could do the same thing, but after I've done polyFit here, I'm going to use another built-in function called polyval. It's going to take a polynomial, which is captured by that model of the thing that I returned, and I'm going to show you the difference again. Back sure we returned this as a tuple. Since it's coming back as a tuple, I can give it a name model. Polyval will take that tuple plus the x values and do the same thing. It will give me back an array of predicted values. But the nice thing here is that this model could be a line. It could be a parabola. It could be a quartic. It could be a quintic. It could be any order polynomial.
If you like the abstraction here-- which we're going to see in a little bit, that it allows me to use the same code for different orders of model. And if I ran this, it would do exactly the same thing.
I'm going to come back to thinking about what's going on in that spring in a second. But I want to show you another example. So here's another set of data. In a little bit, I'll show you where that mystery data came from. But here's another set of data that I've plotted out. I could run the same thing. I could run exactly the same code and fit a line to it. And if I do it, I get that.
What do you think? Good fit? Show of hands, how many people like this fit to the data? Show of hands, how many people don't like this fit to the data? Show of hands, how many hope that I'll stop asking you questions? Don't put your hands up. Yeah, thank you. I know. Too bad. It's a lousy fit. And you kind of know it, right. It's clear that this doesn't look like it's coming from a line, or if it is, it's a really noisy line. So let's think about this. What if I were to try a higher order degree. Let's change the one to a two. So I'm going to come back to it in a second. I've changed the one to a two. That says I'm still using the polynomial fit, but now I'm going to ask what's the best fitting parabola, ax squared plus bx plus c. Simple change. Because I was using polyval, exactly the same code will work. It's going to do the fit to it.
This is, by the way, still an example of linear regression. So think of what I'm doing now. I have a three-dimensional space. One axis is a values. Second axis is b values. Third axis is c values. Any point in that space describes a parabola, and every point in that space describes every possible parabola. And now you've got to twist your head a little bit. Put a four-dimensional surface on that three-dimensional basis, where the point in that surface is the value of that objective function. Play the same game. And you can. It's just a higher-dimensional thing. So you're, again, going to walk down the gradient to find the solution, and be glad you don't have to write this code because PyLab will do it for you freely. But it's still an example of regression, which is great.
And if we do that, we get that fit. Actually just to show you that, I'm going to run it, but it will do exactly the same thing. If I go over to Python-- wherever I have it here-- I'm going to change that order of the model. Oops, it went a little too far for me. Sorry about that. Let me go back and do this again. There's the first one, and there's the second one.
So I could fit different models to it. Quadratic clearly looks like it's a better fit. I hope you'll agree. So how do I decide which one's better other than eyeballing it? And then if I could fit a quadratic to it, what about other orders of polynomials? Maybe there's an even better fit out there. So how do I figure out what's the best way to do the fit? And that leads to the second big thing for this lecture. How good are these fits? What's the first big thing? The idea of linear regression, a way of finding fits of curves to data. But now I've got to decide how good are these. And I could ask this question two ways. One is just relative to each other, how do I measure which one's better other than looking at it by eye? And then the second part of it is in an absolute sense, how do I know where the best solution is? Is quadratic the best I could do? Or should I be doing something else to try and figure out a better solution, a better fit to the data?
The relative fit. What are we doing here? We're fitting a curve, which is a function of the independent variable to the dependent variable. What does it mean by that? I've got a set of x values. I'm trying to predict what the y values should be, the displacement should be. I want to get a good fit to that. The idea is that given an independent value, it gives me an estimate of what it should be, and I really want to know which fit provides the better estimates. And since I was simply minimizing mean squared error, average square error, an obvious thing to do is just to use the goodness of fit by looking at that error. Why not just measure where am I on that surface and see which one does better? Or actually it would be two surfaces, one for a linear fit, one for a quadratic one.
We'll do what we always do. Let's write a little bit of code. I can write something that's going to get the average, mean squared error. Takes in a set of data points, a set of predicted values, simply measures the difference between them, squares them, adds them all up in a little loop here and returns that divided by the number of samples I have. So it gives me the average squared error. And I could do it for that first model I built, which was for a linear fit, and I could do it for the second model I built, which is a quadratic fit. And if I run it, I get those values. Looks pretty good. You knew by eye that the quadratic was a better fit. And look, this says it's about six times better, that the residual error is six times smaller with the quadratic model than it is the linear model.
But with that, I still have a problem, which is-- OK, so it's useful for comparing two models. But is 1524 a good number? Certainly better than 9,000-something or other. But how do I know that 1524 is a good number? How do I know there isn't a better fit out there somewhere? Well, good news is we're going to be able to measure that. It's hard to know because there's no bound on the values. And more importantly, this is not scale independent. What do I mean by that? If I take all of the values and multiply them by some factor, I would still fit the same models to them. They would just scale. But that measure would increase by that amount. So I could make the error as big or as small as I want by just changing the size of the values. That doesn't make any sense.
I'd like a way to measure goodness of fit that is scale independent and that tells me for any fit how close it comes to being the perfect fit to the data. And so for that, we're going to use something called the coefficient of determination written as r squared. So let me show you what this does, and then we're going to use it. The y's are measured values. Those are my samples I got from my experiment. The p's are the predicted values. That is, for this curve, here's what I predict those values should be. So the top here is basically measuring as we saw before the sum squared error in those pieces. Mu down here is the average, or mean, of the measured values. It's the average of the y's.
So what I've got here is in the numerator-- this is basically the error in the estimates from my curve fit. And in the denominator I've got the amount of variation in the data itself. This is telling me how much does the data change from just being a constant value, and this is telling me how much do my errors vary around it. That ratio is scale independent because it's a ratio. So even if I increase all of the values by some amount, that's going to divide out, which is kind of nice. So I could compute that, and there it is. R squared is, again, that expression. I'll take in a set of observed values, a set of predicted values, and I'll measure the error-- again, these are arrays. So I'm going to take the difference between the arrays. That's going to give me piecewise or pairwise that difference. I'll square it. That's going to give me at every point in the array the square of that distance. And then because it's an array, I can just use the built-in sum function to add them all up. So this is going to give me the-- if you like, the values up there.
And then I'm going to play a little trick. I'm going to compute the mean error, which is that thing divided by the number of observations. Why would I do that? Well, because then I can compute this really simply. I could write a little loop to compute it. But in fact, I've already said what is that? If I take that sum and divide it by the number of samples, that's the variance. So that's really nice. Right here I can say, get the variance using the non-p version of the observed data. And because that has associated with it division by the number of samples, the ratio of the mean error to the variance is exactly the same as the ratio of that to that. Little trick. It lets me save doing a little bit of computation. So I can compute r squared values.
So what does r squared actually tell us? What we're doing is we're trying to compare the estimation errors, the top part, with the variability in the original values, the bottom part. So r squared, as you're going to see there, it's intended to capture what portion of the variability in the data is accounted for by my model. My model's a really good fit. It should account for almost all of that data. So what we see then is if we do a fit with a linear regression, r squared is always going to be between zero and one. And I want to just show you some examples. If r squared is equal to one, this is great. It says the model explains all of the variability in the data. And you can see it if we go back here. How do we make r squared equal to one? We need this to be zero, which says that the variability in the data is perfectly predicted by my model. Every point lies exactly along the curve. That's great.
Second option at the other extreme is if r squared is equal to zero, you basically got bupkis, which is a well-known technical term, meaning there's no relationship between the values predicted by the model and the actual data. That basically says that all of the variability here is exactly the same as all the variability in the data. The model doesn't capture anything, and it's making this one, which is making the whole thing zero. And then in between an r squared of about a half says you're capturing about half the variability. So what you would like is a system in which your fit is as close to an r squared value of one as possible because it says my model is capturing all the variability in the data really well.
So two functions that will do this for us. We're going to come back to these in the next lecture. The first one called generate fits, or genFits, will take a set of x values, a set of y values, and a list or a tuple of degrees, and these will be the different degrees of models I'd like to fit. I could just give it one. I could give it two. I could give a 1, 2, 4, 8, 16, whatever. And I'll just run through a little loop here where I'm going to build up a set of models for each degree-- or d in degrees. I'll do the fit exactly as I had before. It's going to return a model, which is a tuple of coefficients. And I'm going to store that in models and then return it. And then I'm going to use that, because in testFits I will take the models that come from genFits, I'll take the set of degrees that I also passed in there as well as the values. I'll plot them out, and then I'll simply run through each of the models and generate a fit, compute the r squared value, plot it, and then print out some data.
With that in mind, let's see what happens if we run this. So I'm going to take, again, that example of that data that I started with, assuming I picked the right one here, which I think is this one. I'm going to do a fit with a degree one and a degree two curve. So I'm going to fit the best line. I'm going to fit the best quadratic, the best parabola, and I want to see how well that comes out. So I do that. I got some data there. Looks good. And what does the data tell me? Data says, oh, cool-- I know you don't believe it, but it is because notice what it says, it says the r squared value for the line is horrible. It accounts for less than 0.05% of the data. You could say, OK, I can see that. I look at it. It does a lousy job. On the other hand, the quadratic is really pretty good. It's accounting for about 84% of the variability in the data. This is a nice high value. It's not one, but it's a nice high value. So this is now reinforcing what I already knew, but in a nice way. It's telling me that that r squared value tells me that the quadratic is a much better fit than the linear fit was.
But then you say maybe, wait a minute. I could have done this by just comparing the fits themselves. I already saw that. Part of my goal is how do I know if I've got the best fit possible or not. So I'm going to do the same thing, but now I'm going to run it with another set of degrees. I'm going to go over here. I'm going to take exactly the same code. But let's try it with a quadratic, with a quartic, an order eight, and an order 16 fit. So I'm going to take different size polynomials. As a quick aside, this is why I want to use the PyLab kind of code because now I'm simply optimizing over a 16-dimensional space. Every point in that 16-dimensional space defines a 16th-degree polynomial. And I can still use linear regression, meaning walking down the gradient, to find the best solution. I'm going to run this. And I get out a set of values. Looks good. And let's go look at them.
Here is the r squared value for quadratic, about 84%. Degree four does a little bit better. Degree eight does a little bit better. But wow, look at that, degree 16-- 16th order polynomial does a really good job, accounts for almost 97% of the variability in the data. That sounds great.
Now, to quote something that your parents probably said to you when you were much younger, just because something looks good doesn't mean we should do it. And in fact, just because this has a really high r squared value doesn't mean that we want to use the order 16th polynomial. And I will wonderfully leave you waiting in suspense because we're going to answer that question next Monday. And with that, I'll let you out a few minutes early. Have a great Thanksgiving break.