The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Welcome back. I hope you didn't spend time doing 6.0002 problem sets while eating turkey. It's not recommended for digestion. But I hope you're ready to dive back into the material. And since it's been a week since we got together, let me remind you of what we were doing.

We were looking at the issue of how to understand experimental data. Data could come from a physical experiment; we had the example of measuring the spring constant of a linear spring. It could come from biological data. It could come from social data. And what we looked at was the idea of how we actually fit models to that data in order to understand them.

So I want to start with that high-level reminder of what we were after. I want to do about a five-minute recap of what we were doing last time, because it has been a while. And then we're going to talk about how you actually validate the models that you're fitting to data, to understand whether they are really good fits or not.

And if you remember (I know you spent all your time thinking about 6.0002) I left you with a puzzle, where I fit models to some noisy data. And there was a question of whether the data really called for an order-16 fit.

Right, so what are we trying to do? Remember, our goal is to model experimental data. And really what we want is a model that both explains the phenomena underlying what we see, giving us a sense of what might be the underlying physical mechanism or the underlying social mechanism, and lets us make predictions about the behavior in new settings.
In the case of my spring, that means being able to predict what the displacement will be when I put on a different weight than the ones I measured. Or, if you want to think from a design perspective, working the other direction: I don't want my spring to deflect more than this amount under certain kinds of weights, so how do I use the model to tell me what the spring constant should be for the spring I want in that case? So we want to be able to predict behavior in new settings.

The last piece we know is that if the data were perfect, this would be easy. But it ain't. There's always going to be noise. There's always going to be experimental uncertainty. And so I really want to account for that uncertainty when I fit the model. And while sometimes I'll have theories that help (Hooke says models of springs are linear), in some cases I don't. In those cases, I want to figure out the best model to fit even when I don't know what theory tells me.

OK, so quick recap: what do we use to solve this? We've got a set of observed values. In my spring case, for different masses I measured the displacements. Those displacements are my observed values. And if I had a model that would predict what the displacement should be, I can measure how good the fit is by looking at that expression right there: the sum of the squares of the differences between the observed and the predicted data. As I said, we could use other measures; we could use a first-order measure, an absolute value. The square is actually really handy, because it makes the solution space very easy to deal with, which we'll get to in a second.

So given observed data and a prediction, I can use the sum of squared differences to measure how good the fit is. And then the second piece is that I now want to find the best way to predict the data. What's the best curve that fits the data? What's the best model for predicting the values?
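Just to make that objective concrete, here is a minimal sketch of the measure in Python. The function name is mine, not from the lecture code:

```python
def sum_squared_error(observed, predicted):
    """Sum of squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))
```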
And we suggested last time that we'll focus on mathematical expressions, polynomials. (Professor Guttag is so excited about polynomial expressions, he's throwing laptops on the floor. Please don't do that to your laptop.) We're going to fit polynomials to these data. And since the polynomials have coefficients, the game is basically: how do I find the coefficients of the polynomial that minimize that expression? And that, we said, was an example of linear regression.

So let me just remind you what linear regression says. Simple example, the case of the spring. I'm going to fit a degree-1 polynomial, that is, something of the form y equals ax plus b. a and b are the two variables, the parameters I can change. And the idea is that for every x (in the case of my spring, for every mass), I'm going to use that model to predict the displacement, measure the differences, and find the thing that minimizes them. So I just want to find values of a and b that let me predict values that minimize that expression.

As I suggested, you could solve this yourself. You could write code to do it; it's a neat little piece of code to write. But fortunately, pylab provides that for you. I just want to give you the visualization of what we're doing here, and then we're going to look at examples.

I'm going to try to find the best line. It's represented by two values, a and b. I could represent all possible lines in a space that has one axis with a values and the other axis with b values. Every point in that plane defines a line for me. Now imagine a surface laid over this two-dimensional space, where the height of the surface is the value of that objective function at every point. Don't worry about computing it all, just imagine I could do that. And by the way, one of the nice things about using the sum of squares is that the surface is always convex: it has a single bowl shape. And now the idea of linear regression is that I'm going to start at some point on that surface.
And I'm just going to walk downhill until I get to the bottom. There will always be one bottom, one lowest point. And once I get to that point, that a and b value tells me the best line. It's called linear regression because I'm linearly walking downhill on this surface. Now, I'm doing this for a line with two parameters, a and b, because it's easy to visualize. If you're a good mathematician, or even if you're not, you can generalize this to arbitrary dimensions. A surface over four parameters, sitting in a five-dimensional space, for example, would let you solve the cubic version of this. That's the idea of linear regression. That's what we're going to use to find the best solution.

So here was the example I used. I gave you a set of data. In about three slides, I'm going to tell you where the data came from. But I gave you a set of data, and we could fit the best line to it using that linear regression idea.

And again, the last piece of the reminder: I'm going to use polyfit from pylab. It just solves that linear regression problem. I give it a set of x values, I give it a corresponding set of y values (there need to be the same number in each case), and I give it a degree. In this case, 1 says: find the best-fitting line. It will produce that and return it as a tuple, which I'll store under the name model1. And I could plot it out. So just to remind you, polyfit will find the best-fitting polynomial of degree n, n being that last parameter there, and return it. In a second, we're going to use polyval, which says: given that model and a set of x values, predict what the y values should be, and apply them.
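As a minimal sketch of that workflow (the variable names and sample data here are mine, but polyfit and polyval are the real pylab functions):

```python
import pylab

x_vals = pylab.array([1, 2, 3, 4, 5])
y_vals = pylab.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit the best line (degree 1) by linear regression.
model1 = pylab.polyfit(x_vals, y_vals, 1)

# Use the model to predict y values for the observed x values.
est_y_vals = pylab.polyval(model1, x_vals)

pylab.plot(x_vals, y_vals, 'bo', label='measured')
pylab.plot(x_vals, est_y_vals, 'r', label='linear fit')
pylab.legend()
pylab.show()
```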
OK, so I fit the line. What do you think? Good fit? Not so much, right? Pretty ugly. I mean, you can see it's probably the best, or not probably, it is the best-fitting line. It sort of accounts for the variation on either side of it. But it's not a very good fit.

So then the question is, well, why not try fitting a higher-order model? So I could fit a quadratic, that is, a second-order model: y equals ax squared plus bx plus c. Run the same code. Plot that out. And I get that. That's the linear model. There's the quadratic model. At least to my eye, it looks a lot better, right? It looks like it's following that data reasonably well.

OK, I can fit a linear model. I can fit a quadratic model. What about higher-order models? What about a fourth-order model, an eighth-order model, a 644th-order model? How do I know which one is going to be best?

For that, I'm going to remind you of the last thing we used, and then we're going to start talking about how to use it further: if we try fitting higher-order polynomials, do we get a better fit? To answer that, we need a measure of what it means for the model to fit the data.

If I don't have any other information, for example a theory that tells me this should be linear, as in the case of Hooke, then the best way to do it is to use what's called the coefficient of determination, r-squared. It's a scale-independent measure, which is good. By scale independent, I mean that if I take all the data and stretch it out, this will still give me back the same value for the fit. It doesn't depend on the size of the data. And what it does is basically give me a value between 0 and 1 that says how well this model fits the data.

So just to remind you, in this case the y's are the measured values, and the p's are the predicted values: what my model says for each one of these cases. And mu down here is the mean, or the average, of the measured values. The way to think about this is that the top expression here is exactly what I'm trying to minimize, right? It's giving me a measure of the error in the estimates, between what the model says and what I actually measure. And the denominator down here basically tells me how much the data vary away from the mean value.

Now here's the idea. If I can get that numerator to 0, so that I have a model that completely accounts for all the variation in the data, that's great. It says the model fits perfectly. That means the ratio is 0, so this r-squared value is 1. On the other hand, if the numerator is equal to the denominator, meaning that the variation in the estimates accounts for none of the variation in the data, then the ratio is 1 and r-squared goes to 0. So the idea is that an r-squared value close to 1 is great: the model is a good fit to the data. An r-squared value getting closer to 0, not so good.
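The slide's formula, in other words, is r² = 1 − Σᵢ(yᵢ − pᵢ)² / Σᵢ(yᵢ − μ)². As a minimal sketch of it in code (my own helper, assuming numpy-style arrays):

```python
import numpy as np

def r_squared(measured, predicted):
    """Coefficient of determination: 1 - (error variation / total variation)."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = ((predicted - measured) ** 2).sum()              # variation the fit fails to explain
    variability = ((measured - measured.mean()) ** 2).sum()  # total variation around the mean
    return 1 - error / variability
```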
OK, so I ran this, fitting models of order 2, 4, 8, and 16. Now, you can see that model 2, the green line here, is the one that we saw before. It's basically a parabolic kind of arc. It follows the data pretty well. But look at those r-squared values. Wow, look at that. The order-16 fit accounts for all but 3% of the variation in the data. It's a great fit. And you can see how it follows the data; it actually goes through most, but not quite all, of the data points. So it's following them pretty well.

OK, so if that's the case, and the order-16 fit is really the best fit, should we just use it? I left you last time with that quote from your parents, right, your mother telling you: just because you can do something doesn't mean you should do something. I'll leave it at that. The same thing applies here.

Why are we building the model? Remember, I said two reasons. One is to be able to explain the phenomena. And the second one is to be able to make predictions.
So I want to be able to explain the phenomena. In the case of a spring, saying it's linear gives me a sense of a linear relationship between compression and force. In this case, a 16th-order model? What kind of physical process has an order-16 variation? Sounds a little painful. So maybe not a great insight into the process.

But the second reason is that I want to be able to predict future behavior of the system. In the case of the spring, I put on a different weight than I've used before, and I want to predict what the displacement is going to be. I've done a set of trials for an FDA approval of a drug, and now I want to predict the effect of the treatment on a new patient. How do I use the model to help me with that? One where we're maybe not so good currently: I want to predict the outcome of an election. Maybe those models need to be fixed, given, at least, what happened the last time around. But I need to be able to make the prediction. So another way of saying it is: a good model both explains the phenomena and lets me make predictions.

OK, so let's go back, then, to our example. And before I do, let me tell you where that data came from. I actually built that data by looking at another kind of physical phenomenon, and there were a lot to choose from: things that follow a parabolic arc. So for example, comets. Any particle under the influence of a uniform gravitational field follows a parabolic arc, which is why Halley's comet gets really close for a while, then goes way off into the solar system, and comes back around. My favorite example, and I'm biased on this, and I know you all know which team I root for: there is Tom Brady throwing a pass against the Pittsburgh Steelers. The center of mass of the pass follows a nice parabolic arc. Even in design, you see parabolic arcs in lots of places. They have nice properties in terms of dispersing loads and forces, which is why architects like to use them.
So here's how I generated the data. I wrote a little function. Actually, I didn't; Professor Guttag did, but I borrowed it. It takes in three parameters, a, b, and c, for ax squared plus bx plus c. I gave it a set of x values. Those are the independent measurements, the things along the horizontal axis. And notice what I did: I generated y values, given a, b, and c, from that equation. And then I added in some noise. random.gauss takes a mean and a standard deviation, and it generates noise following that bell-shaped curve, the Gaussian distribution. The 0 says it's zero-mean, meaning there's no bias: the noise is equally likely to be above or below the value, positive or negative. But 35 is a pretty big standard deviation. This is putting a lot of noise into the data. And then I just added that noise into the y values. The rest of this, you can see, simply writes a set of x and y values into a file. But this will generate, given values for a, b, and c, data from a parabolic arc with noise added to it. And in this case, I took it as y equals 3x squared, with b and c equal to 0. And that's how I generated it.

What I want to do is see how well this model actually predicts behavior. So the question I want to ask is: whoa, if I generated the data from a degree-2 polynomial, a quadratic, why in the world is the 16th-order polynomial the "best fit"?

So let's test it out. I'm going to give 3... sorry, 4. I can't count. 4 different degrees: order 2, order 4, order 8, order 16. And I've generated two different datasets, using exactly that code. I just ran it twice. They'll have slightly different values, because the noise is going to be different in each case. But they're both coming from that y equals 3x squared equation.
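Here is a minimal sketch of that generator; the function and file names are mine, but the shape follows the lecture's description (a parabola plus zero-mean Gaussian noise with standard deviation 35):

```python
import random

def gen_noisy_parabolic_data(a, b, c, x_vals, f_name):
    """Write x,y pairs for y = ax^2 + bx + c plus zero-mean Gaussian noise."""
    with open(f_name, 'w') as f:
        f.write('x        y\n')
        for x in x_vals:
            # Evaluate the parabola, then add noise with mean 0 and std dev 35.
            y = a * x**2 + b * x + c + random.gauss(0, 35)
            f.write(str(x) + ' ' + str(y) + '\n')

# Two datasets from the same process, y = 3x^2, differing only in the noise.
x_vals = range(-10, 11)
gen_noisy_parabolic_data(3, 0, 0, x_vals, 'Dataset1.txt')
gen_noisy_parabolic_data(3, 0, 0, x_vals, 'Dataset2.txt')
```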
And the code here basically says: I'm going to take those two data sets, get the x and y values out, and then fit models. So I'll remind you, genFits takes in a collection of x and y values and a list or tuple of degrees, and for each degree it finds, using polyfit, the best model. So models1 will be four models, for orders 2, 4, 8, and 16. And similarly, down here, I'm going to do the same thing, but using the second data set, and again fit a set of models. And then, I'll remind you, testFits, which you saw last time (I know, it's a while ago), basically takes a set of models, a corresponding set of degrees, and x and y values, and says: for each model of each degree, measure how well that model fits the data, using that r-squared value. So testFits is going to give us back a set of r-squared values.

All right, with that in mind, I've got the code here. Let's run it. And here we go. I'm going to run that code. Ha, I get two fits. Looks good. Let's look at the values.

So there's the first data set. All right, the green line is still doing not a bad job. The purple line, boy, is fitting it really well. And again, notice, here's the best fit. That's amazing. That is accounting for all but 0.4% of the variation in the data. Great fit. Order 16, and it came from an order-2 process.

All right, what about the second data set? Oh, grumph. It also says the order-16 fit is the best fit. Not quite as good: it accounts for all but about 2% of the variation. Again, the green line and the red line do OK. But in this case, again, that purple line is still the best fit. So I've still got this puzzle.

But I didn't quite test what I wanted, right? I said I want to see how well it predicts new behavior. Here, what I did was take two datasets and fit models, and I got two different sets of fits, one for each dataset.
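For reference, here is a minimal sketch of what helpers like genFits and testFits could look like, reusing the hypothetical r_squared helper from above; the lecture's actual code may differ in details:

```python
import pylab

def gen_fits(x_vals, y_vals, degrees):
    """Fit one polynomial model per degree, using linear regression."""
    return [pylab.polyfit(x_vals, y_vals, d) for d in degrees]

def test_fits(models, degrees, x_vals, y_vals, title):
    """Plot each model against the data and report its r-squared value."""
    pylab.plot(x_vals, y_vals, 'o', label='Data')
    for model, d in zip(models, degrees):
        est_y_vals = pylab.polyval(model, x_vals)
        r2 = r_squared(y_vals, est_y_vals)  # r_squared as sketched earlier
        pylab.plot(x_vals, est_y_vals,
                   label='Fit of degree ' + str(d) + ', R2 = ' + str(round(r2, 4)))
    pylab.legend(loc='best')
    pylab.title(title)
```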
They both fit well for order 16. But they're not quite right. OK, so the best-fitting model is still order 16, but we know it came from an order-2 polynomial. So how could I get a handle on how good this model really is?

Well, what we're seeing here comes from training error. Another way of saying it: what we're measuring is how well the model performs on the data from which it was learned. How well do I fit the model to the training data? I want a small training error. And if you think about it, go back to the first example. When I fit a line to this data, it did not do well. It was not a good model. When I fit a quadratic, it was pretty decent. And then I got better and better as the order went up. So I certainly need a small training error. But it's, to use the mathematical terms, a necessary but not sufficient condition for a great model. I need a small training error, but I really want to make sure that the model is capturing what I'd like.

And for that, I want to see how well it does on other data generated from the same process, whether it's weights on springs, different comets besides Halley's comet, or different voters than those surveyed when we tried to figure out what's going to happen in an election. And I'm set up to do that by using a really important tool called validation, or cross-validation.

Let's set the stage, and then we're going to do the example. I'm going to get a set of data. I want to fit models to it: different models, different degrees, different kinds of models. To see how well they work, I want to see how well they predict behavior on data other than the data on which I did the training. So I could do that right here. I could generate the models from one data set, but test them on the other. In fact, I had two data sets. I built a set of models for the first data set, and I compared how well they did on that data set. But I could now apply them to the second dataset.
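As a sketch, the cross-test is just a matter of calling testFits with one dataset's models and the other dataset's points. This reuses the hypothetical helpers sketched above and assumes the two datasets have been read into xVals1/yVals1 and xVals2/yVals2:

```python
import pylab

degrees = (2, 4, 8, 16)

# Train on each dataset separately...
models1 = gen_fits(x_vals1, y_vals1, degrees)
models2 = gen_fits(x_vals2, y_vals2, degrees)

# ...then evaluate each set of models on the *other* dataset.
pylab.figure()
test_fits(models1, degrees, x_vals2, y_vals2, 'Dataset 2 / Model 1')
pylab.figure()
test_fits(models2, degrees, x_vals1, y_vals1, 'Dataset 1 / Model 2')
```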
How well do those models account for that data set? And similarly, I can take the models I built for the second data set, and see how well they predict the points from the first dataset.

What do I expect? Certainly, I expect that the testing error is likely to be larger than the training error, because I trained on a different set of data. And that means this ought to be a better way to think about how well the model generalizes: how well does it predict other behavior, besides what I started with?

So here's the code I'm going to use. It's pretty straightforward. All I want to draw your attention to here is: remember, models1 I built by fitting models of degree 2, 4, 8, and 16 to the first data set. And I'm going to apply those models to the second dataset, xVals2 and yVals2. Similarly, I'm going to take the models built for the second data set and test them on the first dataset, to see how well they fit.

I know you're eagerly anticipating this, as I've been setting it up for a whole week. All right, let's look at what happens when I do this. I'm going to run it, and then we'll look at the examples. If I go back over to Python (and this code was distributed earlier, if you want to play with it yourself; this should be the right place to do it), I am going to run that code.

Now I get something a little different. In fact, if I go look at it, here is model one applied to data set 2. And we can both eyeball it and look at the numbers. Eyeballing it, there's that green line, still generally following the form of this pretty well. What about the purple line, the order-16 fit? Remember, that's the purple line from model 1, from training set 1. Wow, this misses a bunch of points, pretty badly. And in fact, look at the r-squared values. Order 2 and order 4: pretty good fits, accounting for all but about 14%, 13% of the data. Look what happened to the degree-16 fit. Way down in last place: 0.7.
Last time around, it was 0.997. What about the other direction, taking the models built on the second data set and testing them on the first data set? Again, notice a nice fit for degrees 2 and 4, not so good for degree 16. And just to give you a sense of this, I'm going to go back. There is the model-one case. There is the model in the other case. You can see that the model that accounts for the variation in one doesn't account for the variation in the other, when I look at the order-16 fit.

OK, so this says something important. Now I can see it. In fact, if I look back at this, just looking at the coefficient of determination, it says that in order to predict other behavior, I'm better off with an order-2 or maybe order-4 polynomial. Those r-squared values are both about the same. I happen to know it's order 2, because that's where I generated it from. But that's a whole lot better than order 16.

And what you're seeing here is an example of something that happens a lot in statistics, and in fact, I would suggest, is often misused in fitting models to statistical samples. It's called overfitting. And what it means is that I've allowed too many degrees of freedom in my model, too many free parameters. And what it's fitting isn't just the underlying process. It's also fitting to the noise.

The message I want you to take out of this part of the lecture is: if we only fit the model to training data and look at how well it does there, we could get what looks like a great fit, but we may actually have come up with far too complex a model. Order 16 instead of order 2. And the only way you are likely to detect that is to train on one data set and test on a different one. If you do that, it's likely to expose whether, in fact, you have done a good job of fitting or whether you have overfit to the data.
There are lots of horror stories in the literature, especially from the early days of machine learning, of people overfitting to data and coming up with models that they thought wonderfully predicted an effect, and then, when they ran on new data, really hit the big one. All right, so this is something you want to try and stay away from. And the best way to do it is validation.

You can see it here, right? The upper left is my training data, dataset one. There's the set of models. This is now taking those models and applying them to a different dataset from the same process. And notice, for the degree-2 polynomial, the coefficient of determination was 0.86, and is now 0.87. The fact that it's slightly higher is just accidental, but it's really about the same level. It's doing the same kind of job on the training data and on the test data. On the other hand, for degree 16, the coefficient of determination is a wonderful 0.96 here and a pretty awful 0.7 down there. And that's a sign that we're not in good shape, when in fact our coefficient of determination drops significantly when we try to handle new data.

OK, so why do we get a better fit on the training data with a higher-order model, but then do less well when we're actually handling new data? Or, another way of saying it: with that data, I started with a linear model and it didn't fit well, and then I got to a quadratic model. Why didn't that quadratic model remain the best? Why was it the case that, as I added more degrees of freedom, I did better? Or, another way of asking it: can I actually get a worse fit to training data as I increase the model complexity? And I see at least one negative head shake. Thank you. You're right. I cannot. Let's look at why.

If I add in some higher-order terms, and they actually don't matter, then with perfect data the coefficients will just be 0. The fit will basically say: this term doesn't matter, ignore it. And that'll work with perfect data.
But if the data is noisy, what the model is going to do is actually start fitting the noise. And while that may lead to a better r-squared value on the training data, it's not really a better fit.

Let me show you an example of that. I'm going to fit a quadratic to a straight line. Easy thing to do, but I want to show you the effect of overfitting, of adding in those extra terms. So let me set it up a little better. I'm going to start off with 3... sorry, I'm doing it again today. 4 simple x values: 0, 1, 2, 3. The y values are the same as the x values, so the points are (0, 0), (1, 1), (2, 2), (3, 3). They're all lying on a line. But I'm going to plot them out, and then I'm going to fit a quadratic, y equals ax squared plus bx plus c, to this. Now, I know it's a line, but I want to see what happens if I fit a quadratic. So I'm going to use polyfit to fit my quadratic, print out some data about it, and then use polyval to estimate what those values should be, plot them out, compute the r-squared value, and see what happens.

All right, and let me set this up better. What am I doing? I know it's a line, but I'm going to fit a quadratic to it. And what I'd expect is that, even though there's an extra term there, it shouldn't matter.

So if I go to Python and run exactly that example, look at that: a equals 0, b equals 1, c equals 0. Look at the r-squared value. I'll pull that together for you. It says, in this perfect case, there's what I get. The blue line is drawn through the actual values. The dotted red line is drawn through the predicted values. They exactly line up. And in fact, the solution implied says the higher-order term's coefficient is 0: it doesn't matter. So what it found was y equals x. I know you're totally impressed I could find a straight line.
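Here is a minimal sketch of that experiment (variable names are mine; r_squared is the hypothetical helper sketched earlier):

```python
import pylab

x_vals = (0, 1, 2, 3)
y_vals = (0, 1, 2, 3)          # points lie exactly on the line y = x
pylab.plot(x_vals, y_vals, label='Actual values')

# Fit a quadratic even though the data is linear.
a, b, c = pylab.polyfit(x_vals, y_vals, 2)
print('a =', round(a, 4), 'b =', round(b, 4), 'c =', round(c, 4))

est_y_vals = pylab.polyval((a, b, c), x_vals)
pylab.plot(x_vals, est_y_vals, 'r--', label='Predicted values')
print('R-squared =', r_squared(y_vals, est_y_vals))
```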
But notice what happened there. I dropped, or rather the system said you don't need, the higher-order term. Wonderful r-squared value.

OK, let's see how well it predicts. Let's add in one more point, out at 20. So this was 0, 1, 2, 3, and I'm going to add 20 in there, so the points are (0, 0), (1, 1), (2, 2), (3, 3), (20, 20). Again, I can estimate using the same model. I'm not recomputing the model; it's the model I computed from fitting that first set of four points. I can get the estimated y values, plot those out, and again compute the r-squared value. And even adding that point in, there's the line. And guess what: it perfectly predicts it. No big surprise. So it says that in the case of perfect data, adding the higher-order terms isn't going to cause a problem. The system will say the coefficients are 0; that's all I need.

All right, now let's go back and add in just a tiny bit of noise, right there: (0, 0), (1, 1), (2, 2), and (3, 3.1). So I've got a slight deviation in the y value there. Again, I can plot the points. I'm going to fit a quadratic to them, print out some information about it, and then get the estimated values using that new model, to see what it looks like. I'm not going to run it; I'm going to show you the result. I get a really good r-squared value. And there's the equation it comes up with.

Not so bad, right? It's almost y equals x. But because of that little bit of noise there, there's a small second-order term here and a little constant term down there. The r-squared value is really pretty good. And if you really squint and look carefully at this, you'll actually see there's a little bit of a deviation between the red and the blue lines. It overshoots there, undershoots here, but it's really pretty close.

All right, so am I just whistling in the dark here? What's the difference? Well, now let's add in that extra point.
And what happens? So again, I'm now taking the same set of points, (0, 0), (1, 1), (2, 2), (3, 3.1), and I'm going to add (20, 20). Using the model I captured from fitting that first set, I want to see what happens here.

Crap. I'm sorry, shouldn't say that. Darn. Pick some other word. It shouldn't surprise you, right? A small variation here is now causing a really large variation up there. And this is why, in the ideal case, overfitting is not a problem, because the coefficients get zeroed out, but even a little bit of noise can cause a problem. Now, I'll grant you, we set this up deliberately to show a big effect here. But a 3% error in one data point is causing a huge problem when I get further out on this curve. And by the way, there are the r-squared values: 0.7. It doesn't do a particularly good job.

OK, so how would I fix this? Well, what if I had simply done a first-degree fit in the same situation? Let's fit a line to this rather than fitting a quadratic. Remember, my question was: what's the harm of fitting a higher-order model if the coefficients would be zeroed out? We've seen they won't be zeroed out. But if I were just to fit a line to this, in exactly the same experiment, (0, 0), (1, 1), (2, 2), (3, 3.1), then (20, 20), now you can see it still does a really good job of fitting. The r-squared value is 0.9988. So, fitting the right level of model, the noise doesn't cause nearly as much of a problem. And just to pull that together: it basically says that the predictive ability of the first-order model is much better than that of the second-order model. And that's why, in this case, I would want to use that first-order model.
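Here is a minimal sketch of that comparison, fitting both orders to the slightly noisy points and extrapolating out to x = 20 (again using the earlier hypothetical r_squared helper):

```python
import pylab

x_vals = (0, 1, 2, 3)
y_vals = (0, 1, 2, 3.1)        # one slightly noisy point

quad_model = pylab.polyfit(x_vals, y_vals, 2)   # overfits the noise
line_model = pylab.polyfit(x_vals, y_vals, 1)   # matches the real process

# Evaluate both models on data that includes a point far outside the training range.
new_x = (0, 1, 2, 3, 20)
new_y = (0, 1, 2, 3.1, 20)
for name, model in (('quadratic', quad_model), ('line', line_model)):
    est = pylab.polyval(model, new_x)
    print(name, 'R-squared =', round(r_squared(new_y, est), 4))
```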
So, take-home message, and then we're going to amplify this. If I pick an overly complex model, I run the danger of overfitting to the training data. Overfitting means that I'm not only fitting the underlying process, I'm fitting the noise. I get that an order-16 model is the best fit when it was in fact an order-2 model that generated the data. That increases the risk that the model won't do well with new data, which is not what I'd like. I want to be able to predict well.

On the other hand, that would seem to say: just stick with the simplest possible model. But there's a trade-off here. And we already saw that when I tried to fit a line to data that was basically quadratic: I didn't get a good fit. So I want to find the balance. An insufficiently complex model won't explain the data well. An overly complex model will overfit the training data. So I'd like to find the place where the model is as simple as possible, but still explains the data. And I can't resist the quote from Einstein that captures it pretty well: "Everything should be made as simple as possible, but not simpler." In the case of where I started, it should be fit with a quadratic, because that's the right fit, but not more than that, because anything more is getting overly complex.

Now, how might we go about finding the right model? We're not going to dwell on this, but here is a standard way in which you might do it. Start with a low-order model. Again, take the data, fit a linear model to it, and look at not only the r-squared value, but how well it accounts for new data. Then increase the order of the model and repeat the process. Keep doing that until you find a point at which a model does a good job both on the training data and on predicting new data, and after which performance starts to fall off. That gives you a point where you might say: there's a good-sized model.
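As a sketch of that procedure, under the assumption that we hold out a validation set and reuse the hypothetical r_squared helper (the selection rule here, keeping the order with the best validation score, is one reasonable choice, not the lecture's code):

```python
import pylab

def choose_order(train_x, train_y, valid_x, valid_y, max_order=16):
    """Increase model order and track r-squared on held-out data."""
    scores = {}
    for order in range(1, max_order + 1):
        model = pylab.polyfit(train_x, train_y, order)
        est = pylab.polyval(model, valid_x)
        scores[order] = r_squared(valid_y, est)
    # Keep the order whose *validation* score is best; a simpler rule
    # would stop at the first order after which the score falls off.
    return max(scores, key=scores.get), scores
```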
802 00:35:19,830 --> 00:35:22,080 In the case of this data, whether I would have stopped 803 00:35:22,080 --> 00:35:24,413 at a quadratic or used a cubic or a quartic 804 00:35:24,413 --> 00:35:26,104 depends on the values. 805 00:35:26,104 --> 00:35:28,270 But I certainly wouldn't have gone much beyond that. 806 00:35:28,270 --> 00:35:30,228 And this is one way, if you don't have a theory 807 00:35:30,228 --> 00:35:32,490 to drive you, to think about how to actually fit 808 00:35:32,490 --> 00:35:33,960 the model the way I would like. 809 00:35:37,200 --> 00:35:38,635 Let's go back to where we started. 810 00:35:38,635 --> 00:35:40,260 We still have one more big topic to do, 811 00:35:40,260 --> 00:35:41,801 and we still have a few minutes left. 812 00:35:41,801 --> 00:35:45,870 But let's go back to where we started: Hooke's law. 813 00:35:45,870 --> 00:35:48,730 There was the data from measuring displacements 814 00:35:48,730 --> 00:35:50,920 of a spring, as I added different weights 815 00:35:50,920 --> 00:35:53,290 to the bottom of the spring. 816 00:35:53,290 --> 00:35:54,940 And there's the linear fit. 817 00:35:54,940 --> 00:35:56,600 It's not bad. 818 00:35:56,600 --> 00:35:58,927 There's the quadratic fit. 819 00:35:58,927 --> 00:36:01,260 And it's certainly got a better r-squared value, though 820 00:36:01,260 --> 00:36:03,930 that could be just fitting to the noise. 821 00:36:03,930 --> 00:36:05,370 But you actually can see, I think, 822 00:36:05,370 --> 00:36:08,340 that that green curve probably does a better 823 00:36:08,340 --> 00:36:11,450 job of fitting the data. 824 00:36:11,450 --> 00:36:13,120 Well, wait a minute. 825 00:36:13,120 --> 00:36:15,830 Even though the quadratic fit is tighter here, 826 00:36:15,830 --> 00:36:20,030 Hooke says, this is linear. 827 00:36:20,030 --> 00:36:21,959 So what's going on? 828 00:36:21,959 --> 00:36:23,500 Well, this is another place where you 829 00:36:23,500 --> 00:36:24,970 want to think about your model. 830 00:36:24,970 --> 00:36:28,066 And I'll remind you, in case you don't remember your physics, 831 00:36:28,066 --> 00:36:30,011 unless we believe that Hooke was wrong, 832 00:36:30,011 --> 00:36:31,260 this should tell us something. 833 00:36:31,260 --> 00:36:33,790 And in particular, Hooke's law says the model 834 00:36:33,790 --> 00:36:38,750 holds until you reach the elastic limit of the spring. 835 00:36:38,750 --> 00:36:42,090 You stretch a slinky too far, it never springs back. 836 00:36:42,090 --> 00:36:44,250 You've gone beyond that elastic limit. 837 00:36:44,250 --> 00:36:47,870 And that's probably what's happening right up there. 838 00:36:47,870 --> 00:36:50,930 Through here, it's following that linear relationship. 839 00:36:50,930 --> 00:36:53,630 Up at this point, I've essentially broken the spring. 840 00:36:53,630 --> 00:36:56,730 Beyond the elastic limit, the linear model doesn't hold anymore. 841 00:36:56,730 --> 00:36:58,840 And so really, in this case, I should probably 842 00:36:58,840 --> 00:37:02,940 fit different models to different segments. 843 00:37:02,940 --> 00:37:05,550 And there's a much better fit. 844 00:37:05,550 --> 00:37:07,620 Linear through the first part, and another 845 00:37:07,620 --> 00:37:11,480 line once I hit that elastic limit. 846 00:37:11,480 --> 00:37:12,980 How might I find this?
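Here's a sketch of the kind of break-point search described next; it assumes the x values are sorted and that two line segments suffice:

    import numpy as np

    def fit_two_segments(x, y):
        # Try each break point, fit a line to each side, and keep the
        # split with the smallest total squared error.
        x, y = np.asarray(x, float), np.asarray(y, float)
        best = None
        for k in range(2, len(x) - 1):           # at least two points per side
            left = np.polyfit(x[:k], y[:k], 1)
            right = np.polyfit(x[k:], y[k:], 1)
            err = (((y[:k] - np.polyval(left, x[:k])) ** 2).sum()
                   + ((y[k:] - np.polyval(right, x[k:])) ** 2).sum())
            if best is None or err < best[0]:
                best = (err, k, left, right)
        return best   # (error, break index, left-line coeffs, right-line coeffs)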
847 00:37:12,980 --> 00:37:16,520 Well, you could imagine a little search process in which you try 848 00:37:16,520 --> 00:37:19,970 and find where's the best place along here to break 849 00:37:19,970 --> 00:37:23,720 the data into two sets, fit linear segments to both, 850 00:37:23,720 --> 00:37:27,112 and get really good fits for both segments. 851 00:37:27,112 --> 00:37:29,070 And I raise it because that's the kind of thing 852 00:37:29,070 --> 00:37:30,070 you've also seen before. 853 00:37:30,070 --> 00:37:32,580 You could imagine writing code to do that search 854 00:37:32,580 --> 00:37:35,530 to find that good fit. 855 00:37:35,530 --> 00:37:38,940 OK, that gives you a sense, then, 856 00:37:38,940 --> 00:37:41,070 of why you want to be careful about overfitting, 857 00:37:41,070 --> 00:37:43,070 why you want to not just look at the coefficient 858 00:37:43,070 --> 00:37:46,920 of determination, but see how well does this predict behavior 859 00:37:46,920 --> 00:37:48,980 on new data sets. 860 00:37:48,980 --> 00:37:52,850 Now suppose I don't have a theory, like Hooke, 861 00:37:52,850 --> 00:37:54,460 to guide me. 862 00:37:54,460 --> 00:37:58,340 Can I still figure out what's a good model to fit to the data? 863 00:37:58,340 --> 00:37:59,680 And the answer is, you bet. 864 00:37:59,680 --> 00:38:01,810 We're going to use cross-validation to guide 865 00:38:01,810 --> 00:38:04,760 the choice of the model complexity. 866 00:38:04,760 --> 00:38:07,470 And I want to show you two examples. 867 00:38:07,470 --> 00:38:10,000 If the data set's small, we can use 868 00:38:10,000 --> 00:38:12,700 what's called leave-one-out cross-validation. 869 00:38:12,700 --> 00:38:15,600 I'll give you a definition of that in a second. 870 00:38:15,600 --> 00:38:17,390 If the data set's bigger than that, 871 00:38:17,390 --> 00:38:20,210 we can use k-fold cross-validation. 872 00:38:20,210 --> 00:38:21,980 I'll give you a definition of that in a second. 873 00:38:21,980 --> 00:38:24,660 Or just what's called repeated random sampling. 874 00:38:24,660 --> 00:38:27,879 But we can use this same idea of validating against new data 875 00:38:27,879 --> 00:38:30,170 to try and figure out whether the model is a good model 876 00:38:30,170 --> 00:38:32,490 or not. 877 00:38:32,490 --> 00:38:33,920 Leave-one-out cross-validation. 878 00:38:33,920 --> 00:38:35,330 This is written in pseudocode, 879 00:38:35,330 --> 00:38:37,480 but the idea is pretty simple. 880 00:38:37,480 --> 00:38:38,440 I'm given a data set. 881 00:38:38,440 --> 00:38:40,580 It's not too large. 882 00:38:40,580 --> 00:38:44,710 The idea is to walk through a number of trials, 883 00:38:44,710 --> 00:38:47,130 a number of trials equal to the size of the data set. 884 00:38:47,130 --> 00:38:51,170 And for each one, take the data set, or a copy of it, 885 00:38:51,170 --> 00:38:52,640 and drop out one of the samples. 886 00:38:52,640 --> 00:38:54,305 So leave one out. 887 00:38:54,305 --> 00:38:55,930 Start off by leaving out the first one, 888 00:38:55,930 --> 00:38:57,280 then leaving out the second one, and then 889 00:38:57,280 --> 00:38:58,390 leaving out the third one. 890 00:38:58,390 --> 00:39:02,160 For each one of those training sets, build the model. 891 00:39:02,160 --> 00:39:04,610 For example, by using linear regression. 892 00:39:04,610 --> 00:39:10,340 And then test that model on the data point that you left out.
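That pseudocode translates into Python along these lines; this sketch assumes a polynomial model fit with numpy, though the idea works for any model builder:

    import numpy as np

    def leave_one_out(x, y, degree):
        # For each sample: drop it, fit on the rest, test on the one left out.
        x, y = np.asarray(x, float), np.asarray(y, float)
        squared_errors = []
        for i in range(len(x)):
            model = np.polyfit(np.delete(x, i), np.delete(y, i), degree)
            prediction = np.polyval(model, x[i])
            squared_errors.append((prediction - y[i]) ** 2)
        return np.mean(squared_errors)   # average test error over all trials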
893 00:39:10,340 --> 00:39:12,087 So leave out the first one, build a model 894 00:39:12,087 --> 00:39:13,670 on all of the other ones, and then see 895 00:39:13,670 --> 00:39:15,461 how well that model predicts the first one. 896 00:39:15,461 --> 00:39:17,150 Leave out the second one, build a model 897 00:39:17,150 --> 00:39:18,350 using all of them but the second one, 898 00:39:18,350 --> 00:39:20,016 see how well it predicts the second one. 899 00:39:20,016 --> 00:39:21,946 And just average the results. Works 900 00:39:21,946 --> 00:39:23,570 when you don't have a really large data 901 00:39:23,570 --> 00:39:25,100 set, because it won't take too long. 902 00:39:25,100 --> 00:39:30,150 But it's a nice way of actually doing validation. 903 00:39:30,150 --> 00:39:32,130 If the data set's a lot bigger, you 904 00:39:32,130 --> 00:39:33,420 can still use the same idea. 905 00:39:33,420 --> 00:39:35,990 You can use what's called k-fold. 906 00:39:35,990 --> 00:39:40,730 Divide the data set up into k equal-sized chunks. 907 00:39:40,730 --> 00:39:41,700 Leave one of them out. 908 00:39:41,700 --> 00:39:43,880 Use the rest to build the model. 909 00:39:43,880 --> 00:39:45,770 And then use that model to predict 910 00:39:45,770 --> 00:39:47,300 that first chunk you left out. 911 00:39:47,300 --> 00:39:49,640 Leave out the second chunk, and keep doing it. 912 00:39:49,640 --> 00:39:51,320 Same idea, but now with groups of things 913 00:39:51,320 --> 00:39:55,270 rather than leaving out single data points. 914 00:39:55,270 --> 00:39:57,790 All right, the other way you can deal with it, 915 00:39:57,790 --> 00:40:00,070 which has a nice effect to it, is 916 00:40:00,070 --> 00:40:03,700 to use what's called repeated random sampling. 917 00:40:03,700 --> 00:40:05,710 OK, start out with some data set. 918 00:40:05,710 --> 00:40:07,210 And what I'm going to do here is I'm 919 00:40:07,210 --> 00:40:09,001 going to run through some number of trials. 920 00:40:09,001 --> 00:40:10,090 I'm going to call that k. 921 00:40:10,090 --> 00:40:13,480 But I'm also going to pick some number of random samples 922 00:40:13,480 --> 00:40:15,650 from the data set. 923 00:40:15,650 --> 00:40:17,410 Usually, as I recall, 924 00:40:17,410 --> 00:40:20,470 it is somewhere between 20% and 50% 925 00:40:20,470 --> 00:40:22,250 of the samples that get reserved. 926 00:40:22,250 --> 00:40:25,630 But the idea is, again, walk over all of those k trials. 927 00:40:25,630 --> 00:40:29,290 And in each one, pick out at random n elements 928 00:40:29,290 --> 00:40:30,820 for the test set. 929 00:40:30,820 --> 00:40:32,920 Use the remainder as the training set. 930 00:40:32,920 --> 00:40:34,990 Build the model on the training set. 931 00:40:34,990 --> 00:40:38,180 And then apply that model to the test set. 932 00:40:38,180 --> 00:40:40,730 So rather than doing k-fold, where I select each chunk, 933 00:40:40,730 --> 00:40:42,080 in turn, and keep the rest, 934 00:40:42,080 --> 00:40:46,450 this just randomly selects which ones to pull out. 935 00:40:46,450 --> 00:40:48,960 So I'm going to show you one last example. 936 00:40:48,960 --> 00:40:51,400 Let's look at that idea of, I don't have a model here. 937 00:40:51,400 --> 00:40:54,400 I want to use this idea of cross-validation 938 00:40:54,400 --> 00:40:57,390 to try and figure out what's the best possible model. 939 00:40:57,390 --> 00:41:00,250 And for this, I'm going to use a different data set.
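A corresponding sketch of k-fold; repeated random sampling would simply replace the fixed folds with a freshly drawn random test set on each of the k trials:

    import numpy as np

    def k_fold(x, y, degree, k=5):
        # Split the indices into k roughly equal chunks; hold each out in turn.
        x, y = np.asarray(x, float), np.asarray(y, float)
        indices = np.arange(len(x))
        np.random.shuffle(indices)
        errors = []
        for fold in np.array_split(indices, k):
            train = np.setdiff1d(indices, fold)       # everything not in the fold
            model = np.polyfit(x[train], y[train], degree)
            preds = np.polyval(model, x[fold])
            errors.append(((preds - y[fold]) ** 2).mean())
        return np.mean(errors)   # average test error across the k folds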
940 00:41:00,250 --> 00:41:02,230 The task here is I want 941 00:41:02,230 --> 00:41:04,810 to try to model 942 00:41:04,810 --> 00:41:07,510 how the mean daily high temperature in the US 943 00:41:07,510 --> 00:41:14,060 has varied over about a 55-year period, from '61 to 2015. 944 00:41:14,060 --> 00:41:15,280 Got a set of data. 945 00:41:15,280 --> 00:41:18,540 It's the daily high for every day of the year 946 00:41:18,540 --> 00:41:19,839 through that entire period. 947 00:41:19,839 --> 00:41:21,380 And what I'm going to do is I'm going 948 00:41:21,380 --> 00:41:24,249 to compute the means for each year and plot them out. 949 00:41:24,249 --> 00:41:26,290 And then I'm going to try and fit models to them. 950 00:41:26,290 --> 00:41:28,430 And in particular, I'm going to take 951 00:41:28,430 --> 00:41:30,590 a set of different dimensionalities: 952 00:41:30,590 --> 00:41:34,477 linear, quadratic, cubic, quartic. And in each case, 953 00:41:34,477 --> 00:41:36,060 I'm going to run through a trial where 954 00:41:36,060 --> 00:41:39,142 I train on one half of the data and test on the other. 955 00:41:39,142 --> 00:41:41,100 There, again, is that idea of seeing how well it 956 00:41:41,100 --> 00:41:42,480 predicts other data. 957 00:41:42,480 --> 00:41:45,110 Record the coefficient of determination. 958 00:41:45,110 --> 00:41:47,370 And do that and get out an average, 959 00:41:47,370 --> 00:41:50,670 and report what I get as the mean for each of those values 960 00:41:50,670 --> 00:41:53,810 across each dimensionality. 961 00:41:53,810 --> 00:41:55,630 OK, here we go. 962 00:41:55,630 --> 00:41:57,212 Here's a set of code that's pretty easy to follow. 963 00:41:57,212 --> 00:41:59,170 Hopefully, you can just look at it and grok it. 964 00:41:59,170 --> 00:42:01,660 We start off with a boring class, 965 00:42:01,660 --> 00:42:04,086 which Professor Guttag suggests refers to this lecture. 966 00:42:04,086 --> 00:42:04,710 But it doesn't. 967 00:42:04,710 --> 00:42:07,126 This may be a boring lecture, but it's not a boring class. 968 00:42:07,126 --> 00:42:08,580 This is a great class. 969 00:42:08,580 --> 00:42:10,800 And boy, those jokes are really awful, aren't they? 970 00:42:10,800 --> 00:42:11,910 But here we go. 971 00:42:11,910 --> 00:42:15,810 A simple class that builds temperature data. 972 00:42:15,810 --> 00:42:19,800 This reads in some information, splits it up, and basically 973 00:42:19,800 --> 00:42:23,280 records the high for the day and the year in which I got that. 974 00:42:23,280 --> 00:42:26,020 So for each day, I've got a high temperature for that day. 975 00:42:26,020 --> 00:42:28,540 I'm going to give you back the high temperature and the year 976 00:42:28,540 --> 00:42:30,456 in which it was recorded, because I don't care 977 00:42:30,456 --> 00:42:32,590 whether it was in January or June. 978 00:42:32,590 --> 00:42:35,480 Then a little function that opens up a file. 979 00:42:35,480 --> 00:42:38,260 We've actually given you the file, if you want to go look at it. 980 00:42:38,260 --> 00:42:40,380 And it simply walks through the file, reading it in 981 00:42:40,380 --> 00:42:45,200 and returning a big list of all those data objects. 982 00:42:45,200 --> 00:42:48,140 OK, then what I want to do is I want 983 00:42:48,140 --> 00:42:52,545 to get the mean high temperature for each year. 984 00:42:52,545 --> 00:42:54,920 Given that data, I'm going to set up a dictionary called 985 00:42:54,920 --> 00:42:55,545 years.
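As an aside, the class and reader might look roughly like this; the class name, file name, and record format here are guesses, since the actual file isn't shown in the transcript:

    class TempDatum(object):
        # One observation: the daily high and the year it was recorded in.
        def __init__(self, high, year):
            self.high = float(high)
            self.year = int(year)
        def get_high(self):
            return self.high
        def get_year(self):
            return self.year

    def get_temp_data(filename='temperatures.csv'):
        # Hypothetical format: one 'city,high,MMDDYYYY' record per line.
        data = []
        with open(filename) as f:
            for line in f:
                fields = line.strip().split(',')
                data.append(TempDatum(fields[1], fields[2][-4:]))
        return data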
986 00:42:55,545 --> 00:42:57,920 I'm just going to run through a loop over all the data 987 00:42:57,920 --> 00:43:01,280 points, storing each one in the dictionary under its year. 988 00:43:01,280 --> 00:43:02,430 So there's a data point. 989 00:43:02,430 --> 00:43:04,910 I use the method get year to get out the year. 990 00:43:04,910 --> 00:43:09,350 At that point, I add in the high temperature corresponding 991 00:43:09,350 --> 00:43:11,110 to that data point. 992 00:43:11,110 --> 00:43:13,090 And I'm using that nice little try-except block. 993 00:43:13,090 --> 00:43:14,881 I'll do that, unless I haven't had anything 994 00:43:14,881 --> 00:43:17,510 yet for this year, in which case this'll fail. 995 00:43:17,510 --> 00:43:20,260 And I'll simply store the first one in as a list. 996 00:43:20,260 --> 00:43:22,510 So after I've run through this loop, in the dictionary, 997 00:43:22,510 --> 00:43:24,759 under each year, I have a list of the high temperatures 998 00:43:24,759 --> 00:43:27,201 for each day associated with it. 999 00:43:27,201 --> 00:43:27,700 Excuse me. 1000 00:43:27,700 --> 00:43:30,970 And then I can just compute the average. That 1001 00:43:30,970 --> 00:43:33,100 is, for each year in the years, 1002 00:43:33,100 --> 00:43:34,150 I get that list. 1003 00:43:34,150 --> 00:43:34,912 I add the values up. 1004 00:43:34,912 --> 00:43:35,620 I get the length. 1005 00:43:35,620 --> 00:43:36,369 I divide them out. 1006 00:43:36,369 --> 00:43:39,229 And I store that in as the average high temperature 1007 00:43:39,229 --> 00:43:39,770 for the year. 1008 00:43:42,290 --> 00:43:44,000 Now I can plot it. 1009 00:43:44,000 --> 00:43:46,310 Get the data, get out the information 1010 00:43:46,310 --> 00:43:49,460 by computing those yearly means, run through a little loop 1011 00:43:49,460 --> 00:43:52,460 that basically puts the year in the x values and the 1012 00:43:52,460 --> 00:43:56,180 mean high temperature in the y values. 1013 00:43:56,180 --> 00:43:58,250 And I can do a plot. 1014 00:43:58,250 --> 00:44:02,830 And if I do that, I get that. 1015 00:44:02,830 --> 00:44:05,700 I'll let you run this yourself. 1016 00:44:05,700 --> 00:44:09,245 Now this is a little bit deceptive, because of the scale 1017 00:44:09,245 --> 00:44:09,870 I've used here. 1018 00:44:09,870 --> 00:44:12,150 But nonetheless, it shows, in the US, 1019 00:44:12,150 --> 00:44:15,581 over a 55-year period, the mean daily high 1020 00:44:15,581 --> 00:44:16,080 has gone 1021 00:44:16,080 --> 00:44:19,320 from about 15.5 1022 00:44:19,320 --> 00:44:23,740 degrees Celsius up to about 17 and 1/2. 1023 00:44:23,740 --> 00:44:26,150 So what's changed? 1024 00:44:26,150 --> 00:44:29,390 Now the question is, how could I model this? 1025 00:44:29,390 --> 00:44:31,280 Could I actually get a model that 1026 00:44:31,280 --> 00:44:34,100 would give me a sense of how this is changing? 1027 00:44:34,100 --> 00:44:37,250 And that's why I'm going to use cross-validation. 1028 00:44:37,250 --> 00:44:41,510 I'm going to run through a number of trials, 10 trials. 1029 00:44:41,510 --> 00:44:43,790 I'm going to try and fit four different models: 1030 00:44:43,790 --> 00:44:47,730 linear, quadratic, cubic, quartic. 1031 00:44:47,730 --> 00:44:49,822 And for each of these dimensions, 1032 00:44:49,822 --> 00:44:51,780 I'm going to get out a set of r-squared values. 1033 00:44:51,780 --> 00:44:56,850 So I'm just going to initialize that dictionary with an empty list for each dimension.
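Put together, the means computation and the plot described above might look like this sketch; it reuses the hypothetical get_temp_data reader from earlier, pylab is assumed as elsewhere in the course, and the plot title is a placeholder:

    import pylab

    def get_yearly_means(data):
        # Map each year to the list of daily highs, then replace the
        # list with its mean.
        years = {}
        for d in data:
            try:
                years[d.get_year()].append(d.get_high())
            except KeyError:
                years[d.get_year()] = [d.get_high()]
        for year in years:
            years[year] = sum(years[year]) / len(years[year])
        return years

    years = get_yearly_means(get_temp_data())
    x_vals = sorted(years)
    y_vals = [years[year] for year in x_vals]
    pylab.plot(x_vals, y_vals)
    pylab.title('Mean Daily High Temperature by Year')
    pylab.show()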
1034 00:44:56,850 --> 00:44:59,490 Now here is how I'm going to do this. 1035 00:44:59,490 --> 00:45:00,554 Got a list of x values. 1036 00:45:00,554 --> 00:45:01,220 Those are years. 1037 00:45:01,220 --> 00:45:02,360 Got a list of y values. 1038 00:45:02,360 --> 00:45:06,280 Those are the average daily highs. 1039 00:45:06,280 --> 00:45:10,730 I'm going to create a list of random samples. 1040 00:45:10,730 --> 00:45:13,840 So if you haven't seen this before, random.sample says, 1041 00:45:13,840 --> 00:45:15,880 given this iterable, which you can think 1042 00:45:15,880 --> 00:45:18,910 of as the collection from 0 up to n minus 1, 1043 00:45:18,910 --> 00:45:23,620 it's going to select this many of those numbers at random, 1044 00:45:23,620 --> 00:45:26,810 half of them, in this case. 1045 00:45:26,810 --> 00:45:30,310 So if I give it 0 up to 9, and I say, pick five of them, 1046 00:45:30,310 --> 00:45:33,970 it will, at random, give me back 5 of those 10 numbers, 1047 00:45:33,970 --> 00:45:36,110 with no duplicates. 1048 00:45:36,110 --> 00:45:38,270 Ah, that's nice. 1049 00:45:38,270 --> 00:45:39,710 Because now notice what I can do. 1050 00:45:39,710 --> 00:45:41,440 I'm going to set up 1051 00:45:41,440 --> 00:45:44,320 x and y values for a training set, and x 1052 00:45:44,320 --> 00:45:45,670 and y values for the test set. 1053 00:45:45,670 --> 00:45:47,586 And I'm just going to run through a loop here, 1054 00:45:47,586 --> 00:45:51,490 where if this index is in that list, 1055 00:45:51,490 --> 00:45:53,600 I'll stick it in the training set. 1056 00:45:53,600 --> 00:45:57,290 Otherwise, I'll stick it in the test set. 1057 00:45:57,290 --> 00:45:59,380 And then I just return them. 1058 00:45:59,380 --> 00:46:02,140 So this is a really nice way of, at random, just 1059 00:46:02,140 --> 00:46:08,520 splitting the data set into a test set and a training set. 1060 00:46:08,520 --> 00:46:12,180 And then finally, I can run over the number of trials 1061 00:46:12,180 --> 00:46:13,500 I want to deal with. 1062 00:46:13,500 --> 00:46:15,330 In each case, get a different training 1063 00:46:15,330 --> 00:46:17,460 and test set, at random. 1064 00:46:17,460 --> 00:46:20,640 And then, for each dimension, do the fit. 1065 00:46:20,640 --> 00:46:23,550 There's polyfit on the training x and training y values 1066 00:46:23,550 --> 00:46:24,720 in that dimension. 1067 00:46:24,720 --> 00:46:26,670 Gives you back a model. 1068 00:46:26,670 --> 00:46:29,370 I could just check to see how well it does on the training set, 1069 00:46:29,370 --> 00:46:32,250 but I really want to look at, given that model, 1070 00:46:32,250 --> 00:46:37,120 how well does polyval predict the test set, right? 1071 00:46:37,120 --> 00:46:39,640 The model will say, here's what I expect the values to be. 1072 00:46:39,640 --> 00:46:41,710 I'm going to compare that to the actual values 1073 00:46:41,710 --> 00:46:44,740 that I saw from the test set, 1074 00:46:44,740 --> 00:46:48,341 computing that r-squared value and adding it in. 1075 00:46:48,341 --> 00:46:49,840 And then the last of this just says, 1076 00:46:49,840 --> 00:46:53,718 I'll run this through a set of examples. 1077 00:46:53,718 --> 00:46:56,880 OK, here's what happens if I do that. 1078 00:46:56,880 --> 00:46:59,992 I'm not going to run it, although the code will run it. 1079 00:46:59,992 --> 00:47:01,700 Let me, again, remind you what I'm doing.
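Before the results, here is a sketch of that random split and trial loop; it reuses the x_vals and y_vals lists built for the plot above, and the function name and output formatting are mine:

    import random
    import numpy as np

    def split_data(x_vals, y_vals):
        # Reserve a random half of the indices for training; the rest is test.
        to_train = set(random.sample(range(len(x_vals)), len(x_vals) // 2))
        train_x, train_y, test_x, test_y = [], [], [], []
        for i in range(len(x_vals)):
            if i in to_train:
                train_x.append(x_vals[i])
                train_y.append(y_vals[i])
            else:
                test_x.append(x_vals[i])
                test_y.append(y_vals[i])
        return train_x, train_y, test_x, test_y

    num_trials, dimensions = 10, (1, 2, 3, 4)
    r_squares = {d: [] for d in dimensions}
    for _ in range(num_trials):
        train_x, train_y, test_x, test_y = split_data(x_vals, y_vals)
        for d in dimensions:
            model = np.polyfit(train_x, train_y, d)   # fit on the training half
            preds = np.polyval(model, test_x)         # predict the test half
            resid = ((np.array(test_y) - preds) ** 2).sum()
            total = ((np.array(test_y) - np.mean(test_y)) ** 2).sum()
            r_squares[d].append(1 - resid / total)
    for d in dimensions:
        print('dimension %d: mean r-squared %.4f, std %.4f'
              % (d, np.mean(r_squares[d]), np.std(r_squares[d])))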
1080 00:47:01,700 --> 00:47:03,190 I've got a big set of data. I'm going 1081 00:47:03,190 --> 00:47:05,650 to pick out, at random, subsets of it, 1082 00:47:05,650 --> 00:47:09,380 build the model on one part, and test it on the other part. 1083 00:47:09,380 --> 00:47:16,640 And if I run it, I get a linear fit, quadratic fit, cubic fit, 1084 00:47:16,640 --> 00:47:18,240 and a quartic fit. 1085 00:47:18,240 --> 00:47:21,092 And here's the standard deviation of those samples. 1086 00:47:21,092 --> 00:47:22,550 Remember, I've got multiple trials. 1087 00:47:22,550 --> 00:47:24,050 I've got 10 trials, in this case. 1088 00:47:24,050 --> 00:47:26,270 So this gives me the average over those trials. 1089 00:47:26,270 --> 00:47:29,150 And this tells me how much they vary. 1090 00:47:29,150 --> 00:47:32,850 What can I conclude from this? 1091 00:47:32,850 --> 00:47:34,950 Well, I would argue that the linear fit's probably 1092 00:47:34,950 --> 00:47:36,570 the winner here. 1093 00:47:36,570 --> 00:47:37,530 Goes back to Einstein. 1094 00:47:37,530 --> 00:47:41,550 I want the simplest possible model that accounts for the data. 1095 00:47:41,550 --> 00:47:44,700 And you can see it's got the highest r-squared value, which 1096 00:47:44,700 --> 00:47:46,800 is already a good sign. 1097 00:47:46,800 --> 00:47:49,650 It's got the smallest deviation across the trials, 1098 00:47:49,650 --> 00:47:52,170 which says it's probably a pretty good fit. 1099 00:47:52,170 --> 00:47:54,410 And it's the simplest model. 1100 00:47:54,410 --> 00:47:58,250 So linear sounds like a pretty good fit. 1101 00:47:58,250 --> 00:48:01,910 Now, why should we run multiple trials to test this? 1102 00:48:01,910 --> 00:48:04,940 I ran 10 trials for each one of these dimensions. 1103 00:48:04,940 --> 00:48:07,280 Why bother with it? 1104 00:48:07,280 --> 00:48:09,440 Well, notice that those deviations-- 1105 00:48:09,440 --> 00:48:11,547 I'll go back to it here-- 1106 00:48:11,547 --> 00:48:12,380 they're pretty good. 1107 00:48:12,380 --> 00:48:13,860 They're about an order of magnitude 1108 00:48:13,860 --> 00:48:16,350 less than the actual mean, which says they're pretty tight, 1109 00:48:16,350 --> 00:48:20,032 but they're still a reasonable size. 1110 00:48:20,032 --> 00:48:22,240 And that suggests that, while there's good agreement, 1111 00:48:22,240 --> 00:48:24,130 the deviations are large enough that you 1112 00:48:24,130 --> 00:48:28,330 could see a range of variation across the trials. 1113 00:48:28,330 --> 00:48:31,617 So in fact, if I had just run one trial, 1114 00:48:31,617 --> 00:48:32,700 I could have been screwed. 1115 00:48:32,700 --> 00:48:35,771 Sorry, oh-- sorry, pick your favorite [INAUDIBLE] here. 1116 00:48:35,771 --> 00:48:37,270 [? Hose ?] is a Canadian expression, 1117 00:48:37,270 --> 00:48:39,180 in case you haven't seen it. 1118 00:48:39,180 --> 00:48:42,300 Here are the r-squared values for each trial 1119 00:48:42,300 --> 00:48:44,004 of the linear fit. 1120 00:48:44,004 --> 00:48:45,920 And you can see the mean comes out pretty well. 1121 00:48:45,920 --> 00:48:48,140 But notice, if I'd only run one trial 1122 00:48:48,140 --> 00:48:52,470 and I happened to get that one, oh, darn. 1123 00:48:52,470 --> 00:48:54,300 That's a really low r-squared value. 1124 00:48:54,300 --> 00:48:56,430 And we might have reached, in this case, 1125 00:48:56,430 --> 00:49:00,730 a different conclusion: that the linear fit was not a good fit.
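In the sketch above, you could print the individual trials for the linear fit rather than only the mean, which is exactly the point being made here:

    # Continuing the sketch above: one unlucky trial can have a much
    # lower r-squared than the mean across all ten trials suggests.
    for trial, r2 in enumerate(r_squares[1]):
        print('trial %d: r-squared %.4f' % (trial, r2))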
1126 00:49:00,730 --> 00:49:04,230 So this is a way of saying, even with random sampling, run 1127 00:49:04,230 --> 00:49:06,630 multiple trials, because it lets you 1128 00:49:06,630 --> 00:49:09,480 get statistics on those trials, as well as statistics 1129 00:49:09,480 --> 00:49:10,700 within each trial. 1130 00:49:10,700 --> 00:49:12,450 So within any trial, I'm doing a whole bunch 1131 00:49:12,450 --> 00:49:14,904 of different random samples and measuring those values. 1132 00:49:14,904 --> 00:49:16,320 And then, across those trials, I'm 1133 00:49:16,320 --> 00:49:18,744 seeing what the deviation is. 1134 00:49:18,744 --> 00:49:20,410 I'm going to hope my machine comes back, 1135 00:49:20,410 --> 00:49:24,030 because what I want to do is then pull this together. 1136 00:49:24,030 --> 00:49:25,090 What have we done? 1137 00:49:25,090 --> 00:49:26,340 Something you're going to use. 1138 00:49:26,340 --> 00:49:28,550 We've seen how you can use linear regression 1139 00:49:28,550 --> 00:49:33,800 to fit a curve to data, 2D, 3D, 6D, however many dimensions 1140 00:49:33,800 --> 00:49:35,330 the data has. 1141 00:49:35,330 --> 00:49:37,790 It gives us a mapping from the independent values 1142 00:49:37,790 --> 00:49:39,440 to the dependent values. 1143 00:49:39,440 --> 00:49:42,560 And that can then be used to predict values 1144 00:49:42,560 --> 00:49:44,390 associated with independent values 1145 00:49:44,390 --> 00:49:46,340 that we haven't seen yet. 1146 00:49:46,340 --> 00:49:49,070 That leads, naturally, both to a way 1147 00:49:49,070 --> 00:49:52,430 to measure fit, which is r-squared, and especially 1148 00:49:52,430 --> 00:49:55,700 to seeing that we want to look at how well the model 1149 00:49:55,700 --> 00:49:59,300 actually predicts new data, because that lets us select 1150 00:49:59,300 --> 00:50:03,470 the simplest model we can that accounts for the data 1151 00:50:03,470 --> 00:50:06,140 but predicts new data in an effective way. 1152 00:50:06,140 --> 00:50:07,610 And that complexity can either be 1153 00:50:07,610 --> 00:50:11,510 based on theory, as in the case of Hooke, or, more likely, 1154 00:50:11,510 --> 00:50:14,180 chosen by doing cross-validation to try and figure out 1155 00:50:14,180 --> 00:50:16,640 which is the simplest model that 1156 00:50:16,640 --> 00:50:19,510 still does a good job of predicting 1157 00:50:19,510 --> 00:50:21,980 out-of-sample behavior. 1158 00:50:21,980 --> 00:50:25,000 And with that, I'll see you next time.