The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Welcome back. I hope you didn't spend time doing 6.0002 problem sets while eating turkey. It's not recommended for digestion. But I hope you're ready to dive back into the material. And since it's been a week since we got together, let me remind you of what we were doing.

We were looking at the issue of how to understand experimental data. Data could come from a physical experiment; we had the example of measuring the spring constant of a linear spring. It could come from biological data. It could come from social data. And what we looked at was the idea of how we actually fit models to that data in order to understand them.

So I want to start with that high-level reminder of what we were after. I want to do about a five-minute recap of what we were doing last time, because it has been a while. And then we're going to talk about how you actually validate the models that you're fitting to data, to understand whether they are really good fits or not.

And if you remember (I know you spent all your time thinking about 6.0002) I left you with a puzzle, where I fit models to some noisy data. And there was a question of whether the data really called for an order-16 fit.

Right, so what are we trying to do? Remember, our goal is to model experimental data. And really what we want is a model that both explains the phenomena underlying what we see, giving us a sense of what might be the underlying physical mechanism or the underlying social mechanism, and lets us make predictions about the behavior in new settings.
In the case of my spring, that means being able to predict what the displacement will be when I put on a different weight than the ones I measured. Or, if you want to think from a design perspective, working the other direction: I don't want my spring to deflect more than this amount under certain kinds of weights, so how do I use the model to tell me what the spring constant should be for the spring I want in that case? So we want to be able to predict behavior in new settings.

The last piece we know is that if the data were perfect, this would be easy. But it ain't. There's always going to be noise. There's always going to be experimental uncertainty. And so I really want to account for that uncertainty when I fit the model. And while sometimes I'll have theories that help (Hooke says models of springs are linear), in some cases I don't. In those cases, I want to figure out the best model to fit even when I don't know what theory tells me.

OK, so quick recap: what do we use to solve this? We've got a set of observed values. In my spring case, for different masses I measured the displacements. Those displacements are my observed values. And if I had a model that would predict what the displacement should be, I can measure how good the fit is by looking at that expression right there: the sum of the squares of the differences between the observed and the predicted data. As I said, we could use other measures; we could use a first-order measure, an absolute value. The square is actually really handy, because it makes the solution space very easy to deal with, which we'll get to in a second.

So given observed data and a prediction, I can use the sum of squared differences to measure how good the fit is. And then the second piece is that I now want to find the best way to predict the data. What's the best curve that fits the data? What's the best model for predicting the values?
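Just to make that objective concrete, here is a minimal sketch of the measure in Python. The function name is mine, not from the lecture code:

```python
def sum_squared_error(observed, predicted):
    """Sum of squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))
```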
And we suggested last time that we'll focus on mathematical expressions, polynomials. (Professor Guttag is so excited about polynomial expressions, he's throwing laptops on the floor. Please don't do that to your laptop.) We're going to fit polynomials to these data. And since the polynomials have coefficients, the game is basically: how do I find the coefficients of the polynomial that minimize that expression? And that, we said, was an example of linear regression.

So let me just remind you what linear regression says. Simple example, the case of the spring. I'm going to fit a degree-1 polynomial, that is, something of the form y equals ax plus b. a and b are the two variables, the parameters I can change. And the idea is that for every x (in the case of my spring, for every mass), I'm going to use that model to predict the displacement, measure the differences, and find the thing that minimizes them. So I just want to find values of a and b that let me predict values that minimize that expression.

As I suggested, you could solve this yourself. You could write code to do it; it's a neat little piece of code to write. But fortunately, pylab provides that for you. I just want to give you the visualization of what we're doing here, and then we're going to look at examples.

I'm going to try to find the best line. It's represented by two values, a and b. I could represent all possible lines in a space that has one axis with a values and the other axis with b values. Every point in that plane defines a line for me. Now imagine a surface laid over this two-dimensional space, where the height of the surface is the value of that objective function at every point. Don't worry about computing it all, just imagine I could do that. And by the way, one of the nice things about using the sum of squares is that the surface is always convex: it has a single bowl shape. And now the idea of linear regression is that I'm going to start at some point on that surface.
And I'm just going to walk downhill until I get to the bottom. There will always be one bottom, one lowest point. And once I get to that point, that a and b value tells me the best line. It's called linear regression because I'm linearly walking downhill on this surface. Now, I'm doing this for a line with two parameters, a and b, because it's easy to visualize. If you're a good mathematician, or even if you're not, you can generalize this to arbitrary dimensions. A surface over four parameters, sitting in a five-dimensional space, for example, would let you solve the cubic version of this. That's the idea of linear regression. That's what we're going to use to find the best solution.

So here was the example I used. I gave you a set of data. In about three slides, I'm going to tell you where the data came from. But I gave you a set of data, and we could fit the best line to it using that linear regression idea.

And again, the last piece of the reminder: I'm going to use polyfit from pylab. It just solves that linear regression problem. I give it a set of x values, I give it a corresponding set of y values (there need to be the same number in each case), and I give it a degree. In this case, 1 says: find the best-fitting line. It will produce that and return it as a tuple, which I'll store under the name model1. And I could plot it out. So just to remind you, polyfit will find the best-fitting polynomial of degree n, n being that last parameter there, and return it. In a second, we're going to use polyval, which says: given that model and a set of x values, predict what the y values should be, and apply them.
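As a minimal sketch of that workflow (the variable names and sample data here are mine, but polyfit and polyval are the real pylab functions):

```python
import pylab

x_vals = pylab.array([1, 2, 3, 4, 5])
y_vals = pylab.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit the best line (degree 1) by linear regression.
model1 = pylab.polyfit(x_vals, y_vals, 1)

# Use the model to predict y values for the observed x values.
est_y_vals = pylab.polyval(model1, x_vals)

pylab.plot(x_vals, y_vals, 'bo', label='measured')
pylab.plot(x_vals, est_y_vals, 'r', label='linear fit')
pylab.legend()
pylab.show()
```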
OK, so I fit the line. What do you think? Good fit? Not so much, right? Pretty ugly. I mean, you can see it's probably the best, or not probably, it is the best-fitting line. It sort of accounts for the variation on either side of it. But it's not a very good fit.

So then the question is, well, why not try fitting a higher-order model? So I could fit a quadratic, that is, a second-order model: y equals ax squared plus bx plus c. Run the same code. Plot that out. And I get that. That's the linear model. There's the quadratic model. At least to my eye, it looks a lot better, right? It looks like it's following that data reasonably well.

OK, I can fit a linear model. I can fit a quadratic model. What about higher-order models? What about a fourth-order model, an eighth-order model, a 644th-order model? How do I know which one is going to be best?

For that, I'm going to remind you of the last thing we used, and then we're going to start talking about how to use it further: if we try fitting higher-order polynomials, do we get a better fit? To answer that, we need a measure of what it means for the model to fit the data.

If I don't have any other information, for example a theory that tells me this should be linear, as in the case of Hooke, then the best way to do it is to use what's called the coefficient of determination, r-squared. It's a scale-independent measure, which is good. By scale independent, I mean that if I take all the data and stretch it out, this will still give me back the same value for the fit. It doesn't depend on the size of the data. And what it does is basically give me a value between 0 and 1 that says how well this model fits the data.

So just to remind you, in this case the y's are the measured values, and the p's are the predicted values: what my model says for each one of these cases. And mu down here is the mean, or the average, of the measured values. The way to think about this is that the top expression here is exactly what I'm trying to minimize, right? It's giving me a measure of the error in the estimates, between what the model says and what I actually measure. And the denominator down here basically tells me how much the data vary away from the mean value.

Now here's the idea. If I can get that numerator to 0, so that I have a model that completely accounts for all the variation in the data, that's great. It says the model fits perfectly. That means the ratio is 0, so this r-squared value is 1. On the other hand, if the numerator is equal to the denominator, meaning that the variation in the estimates accounts for none of the variation in the data, then the ratio is 1 and r-squared goes to 0. So the idea is that an r-squared value close to 1 is great: the model is a good fit to the data. An r-squared value getting closer to 0, not so good.
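The slide's formula, in other words, is r² = 1 − Σᵢ(yᵢ − pᵢ)² / Σᵢ(yᵢ − μ)². As a minimal sketch of it in code (my own helper, assuming numpy-style arrays):

```python
import numpy as np

def r_squared(measured, predicted):
    """Coefficient of determination: 1 - (error variation / total variation)."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = ((predicted - measured) ** 2).sum()              # variation the fit fails to explain
    variability = ((measured - measured.mean()) ** 2).sum()  # total variation around the mean
    return 1 - error / variability
```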
OK, so I ran this, fitting models of order 2, 4, 8, and 16. Now, you can see that model 2, the green line here, is the one that we saw before. It's basically a parabolic kind of arc. It follows the data pretty well. But look at those r-squared values. Wow, look at that. The order-16 fit accounts for all but 3% of the variation in the data. It's a great fit. And you can see how it follows the data; it actually goes through most, but not quite all, of the data points. So it's following them pretty well.

OK, so if that's the case, and the order-16 fit is really the best fit, should we just use it? I left you last time with that quote from your parents, right, your mother telling you: just because you can do something doesn't mean you should do something. I'll leave it at that. The same thing applies here.

Why are we building the model? Remember, I said two reasons. One is to be able to explain the phenomena. And the second one is to be able to make predictions.
So I want to be able to explain the phenomena. In the case of a spring, saying it's linear gives me a sense of a linear relationship between compression and force. In this case, a 16th-order model? What kind of physical process has an order-16 variation? Sounds a little painful. So maybe not a great insight into the process.

But the second reason is that I want to be able to predict future behavior of the system. In the case of the spring, I put on a different weight than I've used before, and I want to predict what the displacement is going to be. I've done a set of trials for an FDA approval of a drug, and now I want to predict the effect of the treatment on a new patient. How do I use the model to help me with that? One where we're maybe not so good currently: I want to predict the outcome of an election. Maybe those models need to be fixed, given, at least, what happened the last time around. But I need to be able to make the prediction. So another way of saying it is: a good model both explains the phenomena and lets me make predictions.

OK, so let's go back, then, to our example. And before I do, let me tell you where that data came from. I actually built that data by looking at another kind of physical phenomenon, and there were a lot to choose from: things that follow a parabolic arc. So for example, comets. Any particle under the influence of a uniform gravitational field follows a parabolic arc, which is why Halley's comet gets really close for a while, then goes way off into the solar system, and comes back around. My favorite example, and I'm biased on this, and I know you all know which team I root for: there is Tom Brady throwing a pass against the Pittsburgh Steelers. The center of mass of the pass follows a nice parabolic arc. Even in design, you see parabolic arcs in lots of places. They have nice properties in terms of dispersing loads and forces, which is why architects like to use them.
So here's how I generated the data. I wrote a little function. Actually, I didn't; Professor Guttag did, but I borrowed it. It takes in three parameters, a, b, and c, for ax squared plus bx plus c. I gave it a set of x values. Those are the independent measurements, the things along the horizontal axis. And notice what I did: I generated y values, given a, b, and c, from that equation. And then I added in some noise. random.gauss takes a mean and a standard deviation, and it generates noise following that bell-shaped curve, the Gaussian distribution. The 0 says it's zero-mean, meaning there's no bias: the noise is equally likely to be above or below the value, positive or negative. But 35 is a pretty big standard deviation. This is putting a lot of noise into the data. And then I just added that noise into the y values. The rest of this, you can see, simply writes a set of x and y values into a file. But this will generate, given values for a, b, and c, data from a parabolic arc with noise added to it. And in this case, I took it as y equals 3x squared, with b and c equal to 0. And that's how I generated it.

What I want to do is see how well this model actually predicts behavior. So the question I want to ask is: whoa, if I generated the data from a degree-2 polynomial, a quadratic, why in the world is the 16th-order polynomial the "best fit"?

So let's test it out. I'm going to give 3... sorry, 4. I can't count. 4 different degrees: order 2, order 4, order 8, order 16. And I've generated two different datasets, using exactly that code. I just ran it twice. They'll have slightly different values, because the noise is going to be different in each case. But they're both coming from that y equals 3x squared equation.
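Here is a minimal sketch of that generator; the function and file names are mine, but the shape follows the lecture's description (a parabola plus zero-mean Gaussian noise with standard deviation 35):

```python
import random

def gen_noisy_parabolic_data(a, b, c, x_vals, f_name):
    """Write x,y pairs for y = ax^2 + bx + c plus zero-mean Gaussian noise."""
    with open(f_name, 'w') as f:
        f.write('x        y\n')
        for x in x_vals:
            # Evaluate the parabola, then add noise with mean 0 and std dev 35.
            y = a * x**2 + b * x + c + random.gauss(0, 35)
            f.write(str(x) + ' ' + str(y) + '\n')

# Two datasets from the same process, y = 3x^2, differing only in the noise.
x_vals = range(-10, 11)
gen_noisy_parabolic_data(3, 0, 0, x_vals, 'Dataset1.txt')
gen_noisy_parabolic_data(3, 0, 0, x_vals, 'Dataset2.txt')
```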
And the code here basically says: I'm going to take those two data sets, get the x and y values out, and then fit models. So I'll remind you, genFits takes in a collection of x and y values and a list or tuple of degrees, and for each degree it finds, using polyfit, the best model. So models1 will be four models, for orders 2, 4, 8, and 16. And similarly, down here, I'm going to do the same thing, but using the second data set, and again fit a set of models. And then, I'll remind you, testFits, which you saw last time (I know, it's a while ago), basically takes a set of models, a corresponding set of degrees, and x and y values, and says: for each model of each degree, measure how well that model fits the data, using that r-squared value. So testFits is going to give us back a set of r-squared values.

All right, with that in mind, I've got the code here. Let's run it. And here we go. I'm going to run that code. Ha, I get two fits. Looks good. Let's look at the values.

So there's the first data set. All right, the green line is still doing not a bad job. The purple line, boy, is fitting it really well. And again, notice, here's the best fit. That's amazing. That is accounting for all but 0.4% of the variation in the data. Great fit. Order 16, and it came from an order-2 process.

All right, what about the second data set? Oh, grumph. It also says the order-16 fit is the best fit. Not quite as good: it accounts for all but about 2% of the variation. Again, the green line and the red line do OK. But in this case, again, that purple line is still the best fit. So I've still got this puzzle.

But I didn't quite test what I wanted, right? I said I want to see how well it predicts new behavior. Here, what I did was take two datasets and fit models, and I got two different sets of fits, one for each dataset.
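For reference, here is a minimal sketch of what helpers like genFits and testFits could look like, reusing the hypothetical r_squared helper from above; the lecture's actual code may differ in details:

```python
import pylab

def gen_fits(x_vals, y_vals, degrees):
    """Fit one polynomial model per degree, using linear regression."""
    return [pylab.polyfit(x_vals, y_vals, d) for d in degrees]

def test_fits(models, degrees, x_vals, y_vals, title):
    """Plot each model against the data and report its r-squared value."""
    pylab.plot(x_vals, y_vals, 'o', label='Data')
    for model, d in zip(models, degrees):
        est_y_vals = pylab.polyval(model, x_vals)
        r2 = r_squared(y_vals, est_y_vals)  # r_squared as sketched earlier
        pylab.plot(x_vals, est_y_vals,
                   label='Fit of degree ' + str(d) + ', R2 = ' + str(round(r2, 4)))
    pylab.legend(loc='best')
    pylab.title(title)
```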
They both fit well for order 16. But they're not quite right. OK, so the best-fitting model is still order 16, but we know it came from an order-2 polynomial. So how could I get a handle on how good this model really is?

Well, what we're seeing here comes from training error. Another way of saying it: what we're measuring is how well the model performs on the data from which it was learned. How well do I fit the model to the training data? I want a small training error. And if you think about it, go back to the first example. When I fit a line to this data, it did not do well. It was not a good model. When I fit a quadratic, it was pretty decent. And then I got better and better as the order went up. So I certainly need a small training error. But it's, to use the mathematical terms, a necessary but not sufficient condition for a great model. I need a small training error, but I really want to make sure that the model is capturing what I'd like.

And for that, I want to see how well it does on other data generated from the same process, whether it's weights on springs, different comets besides Halley's comet, or different voters than those surveyed when we tried to figure out what's going to happen in an election. And I'm set up to do that by using a really important tool called validation, or cross-validation.

Let's set the stage, and then we're going to do the example. I'm going to get a set of data. I want to fit models to it: different models, different degrees, different kinds of models. To see how well they work, I want to see how well they predict behavior on data other than the data on which I did the training. So I could do that right here. I could generate the models from one data set, but test them on the other. In fact, I had two data sets. I built a set of models for the first data set, and I compared how well they did on that data set. But I could now apply them to the second dataset.
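As a sketch, the cross-test is just a matter of calling testFits with one dataset's models and the other dataset's points. This reuses the hypothetical helpers sketched above and assumes the two datasets have been read into xVals1/yVals1 and xVals2/yVals2:

```python
import pylab

degrees = (2, 4, 8, 16)

# Train on each dataset separately...
models1 = gen_fits(x_vals1, y_vals1, degrees)
models2 = gen_fits(x_vals2, y_vals2, degrees)

# ...then evaluate each set of models on the *other* dataset.
pylab.figure()
test_fits(models1, degrees, x_vals2, y_vals2, 'Dataset 2 / Model 1')
pylab.figure()
test_fits(models2, degrees, x_vals1, y_vals1, 'Dataset 1 / Model 2')
```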
How well do those models account for that data set? And similarly, I can take the models I built for the second data set, and see how well they predict the points from the first dataset.

What do I expect? Certainly, I expect that the testing error is likely to be larger than the training error, because I trained on a different set of data. And that means this ought to be a better way to think about how well the model generalizes: how well does it predict other behavior, besides what I started with?

So here's the code I'm going to use. It's pretty straightforward. All I want to draw your attention to here is: remember, models1 I built by fitting models of degree 2, 4, 8, and 16 to the first data set. And I'm going to apply those models to the second dataset, xVals2 and yVals2. Similarly, I'm going to take the models built for the second data set and test them on the first dataset, to see how well they fit.

I know you're eagerly anticipating this, as I've been setting it up for a whole week. All right, let's look at what happens when I do this. I'm going to run it, and then we'll look at the examples. If I go back over to Python (and this code was distributed earlier, if you want to play with it yourself; this should be the right place to do it), I am going to run that code.

Now I get something a little different. In fact, if I go look at it, here is model one applied to data set 2. And we can both eyeball it and look at the numbers. Eyeballing it, there's that green line, still generally following the form of this pretty well. What about the purple line, the order-16 fit? Remember, that's the purple line from model 1, from training set 1. Wow, this misses a bunch of points, pretty badly. And in fact, look at the r-squared values. Order 2 and order 4: pretty good fits, accounting for all but about 14%, 13% of the data. Look what happened to the degree-16 fit. Way down in last place: 0.7.
Last time around, it was 0.997. What about the other direction, taking the models built on the second data set and testing them on the first data set? Again, notice a nice fit for degrees 2 and 4, not so good for degree 16. And just to give you a sense of this, I'm going to go back. There is the model-one case. There is the model in the other case. You can see that the model that accounts for the variation in one doesn't account for the variation in the other, when I look at the order-16 fit.

OK, so this says something important. Now I can see it. In fact, if I look back at this, just looking at the coefficient of determination, it says that in order to predict other behavior, I'm better off with an order-2 or maybe order-4 polynomial. Those r-squared values are both about the same. I happen to know it's order 2, because that's where I generated it from. But that's a whole lot better than order 16.

And what you're seeing here is an example of something that happens a lot in statistics, and in fact, I would suggest, is often misused in fitting models to statistical samples. It's called overfitting. And what it means is that I've allowed too many degrees of freedom in my model, too many free parameters. And what it's fitting isn't just the underlying process. It's also fitting to the noise.

The message I want you to take out of this part of the lecture is: if we only fit the model to training data and look at how well it does there, we could get what looks like a great fit, but we may actually have come up with far too complex a model. Order 16 instead of order 2. And the only way you are likely to detect that is to train on one data set and test on a different one. If you do that, it's likely to expose whether, in fact, you have done a good job of fitting or whether you have overfit to the data.
There are lots of horror stories in the literature, especially from the early days of machine learning, of people overfitting to data and coming up with models that they thought wonderfully predicted an effect, and then, when they ran on new data, really hit the big one. All right, so this is something you want to try and stay away from. And the best way to do it is validation.

You can see it here, right? The upper left is my training data, dataset one. There's the set of models. This is now taking those models and applying them to a different dataset from the same process. And notice, for the degree-2 polynomial, the coefficient of determination was 0.86, and is now 0.87. The fact that it's slightly higher is just accidental, but it's really about the same level. It's doing the same kind of job on the training data and on the test data. On the other hand, for degree 16, the coefficient of determination is a wonderful 0.96 here and a pretty awful 0.7 down there. And that's a sign that we're not in good shape, when in fact our coefficient of determination drops significantly when we try to handle new data.

OK, so why do we get a better fit on the training data with a higher-order model, but then do less well when we're actually handling new data? Or, another way of saying it: with that data, I started with a linear model and it didn't fit well, and then I got to a quadratic model. Why didn't that quadratic model remain the best? Why was it the case that, as I added more degrees of freedom, I did better? Or, another way of asking it: can I actually get a worse fit to training data as I increase the model complexity? And I see at least one negative head shake. Thank you. You're right. I cannot. Let's look at why.

If I add in some higher-order terms, and they actually don't matter, then with perfect data the coefficients will just be 0. The fit will basically say: this term doesn't matter, ignore it. And that'll work with perfect data.
But if the data is noisy, what the model is going to do is actually start fitting the noise. And while that may lead to a better r-squared value on the training data, it's not really a better fit.

Let me show you an example of that. I'm going to fit a quadratic to a straight line. Easy thing to do, but I want to show you the effect of overfitting, of adding in those extra terms. So let me set it up a little better. I'm going to start off with 3... sorry, I'm doing it again today. 4 simple x values: 0, 1, 2, 3. The y values are the same as the x values, so the points are (0, 0), (1, 1), (2, 2), (3, 3). They're all lying on a line. But I'm going to plot them out, and then I'm going to fit a quadratic, y equals ax squared plus bx plus c, to this. Now, I know it's a line, but I want to see what happens if I fit a quadratic. So I'm going to use polyfit to fit my quadratic, print out some data about it, and then use polyval to estimate what those values should be, plot them out, compute the r-squared value, and see what happens.

All right, and let me set this up better. What am I doing? I know it's a line, but I'm going to fit a quadratic to it. And what I'd expect is that, even though there's an extra term there, it shouldn't matter.

So if I go to Python and run exactly that example, look at that: a equals 0, b equals 1, c equals 0. Look at the r-squared value. I'll pull that together for you. It says, in this perfect case, there's what I get. The blue line is drawn through the actual values. The dotted red line is drawn through the predicted values. They exactly line up. And in fact, the solution implied says the higher-order term's coefficient is 0: it doesn't matter. So what it found was y equals x. I know you're totally impressed I could find a straight line.
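Here is a minimal sketch of that experiment (variable names are mine; r_squared is the hypothetical helper sketched earlier):

```python
import pylab

x_vals = (0, 1, 2, 3)
y_vals = (0, 1, 2, 3)          # points lie exactly on the line y = x
pylab.plot(x_vals, y_vals, label='Actual values')

# Fit a quadratic even though the data is linear.
a, b, c = pylab.polyfit(x_vals, y_vals, 2)
print('a =', round(a, 4), 'b =', round(b, 4), 'c =', round(c, 4))

est_y_vals = pylab.polyval((a, b, c), x_vals)
pylab.plot(x_vals, est_y_vals, 'r--', label='Predicted values')
print('R-squared =', r_squared(y_vals, est_y_vals))
```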
But notice what happened there. I dropped, or rather the system said you don't need, the higher-order term. Wonderful r-squared value.

OK, let's see how well it predicts. Let's add in one more point, out at 20. So this was 0, 1, 2, 3, and I'm going to add 20 in there, so the points are (0, 0), (1, 1), (2, 2), (3, 3), (20, 20). Again, I can estimate using the same model. I'm not recomputing the model; it's the model I computed from fitting that first set of four points. I can get the estimated y values, plot those out, and again compute the r-squared value. And even adding that point in, there's the line. And guess what: it perfectly predicts it. No big surprise. So it says that in the case of perfect data, adding the higher-order terms isn't going to cause a problem. The system will say the coefficients are 0; that's all I need.

All right, now let's go back and add in just a tiny bit of noise, right there: (0, 0), (1, 1), (2, 2), and (3, 3.1). So I've got a slight deviation in the y value there. Again, I can plot the points. I'm going to fit a quadratic to them, print out some information about it, and then get the estimated values using that new model, to see what it looks like. I'm not going to run it; I'm going to show you the result. I get a really good r-squared value. And there's the equation it comes up with.

Not so bad, right? It's almost y equals x. But because of that little bit of noise there, there's a small second-order term here and a little constant term down there. The r-squared value is really pretty good. And if you really squint and look carefully at this, you'll actually see there's a little bit of a deviation between the red and the blue lines. It overshoots there, undershoots here, but it's really pretty close.

All right, so am I just whistling in the dark here? What's the difference? Well, now let's add in that extra point.
And what happens? So again, I'm now taking the same set of points, (0, 0), (1, 1), (2, 2), (3, 3.1), and I'm going to add (20, 20). Using the model I captured from fitting that first set, I want to see what happens here.

Crap. I'm sorry, shouldn't say that. Darn. Pick some other word. It shouldn't surprise you, right? A small variation here is now causing a really large variation up there. And this is why, in the ideal case, overfitting is not a problem, because the coefficients get zeroed out, but even a little bit of noise can cause a problem. Now, I'll grant you, we set this up deliberately to show a big effect here. But a 3% error in one data point is causing a huge problem when I get further out on this curve. And by the way, there are the r-squared values: 0.7. It doesn't do a particularly good job.

OK, so how would I fix this? Well, what if I had simply done a first-degree fit in the same situation? Let's fit a line to this rather than fitting a quadratic. Remember, my question was: what's the harm of fitting a higher-order model if the coefficients would be zeroed out? We've seen they won't be zeroed out. But if I were just to fit a line to this, in exactly the same experiment, (0, 0), (1, 1), (2, 2), (3, 3.1), then (20, 20), now you can see it still does a really good job of fitting. The r-squared value is 0.9988. So, fitting the right level of model, the noise doesn't cause nearly as much of a problem. And just to pull that together: it basically says that the predictive ability of the first-order model is much better than that of the second-order model. And that's why, in this case, I would want to use that first-order model.
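Here is a minimal sketch of that comparison, fitting both orders to the slightly noisy points and extrapolating out to x = 20 (again using the earlier hypothetical r_squared helper):

```python
import pylab

x_vals = (0, 1, 2, 3)
y_vals = (0, 1, 2, 3.1)        # one slightly noisy point

quad_model = pylab.polyfit(x_vals, y_vals, 2)   # overfits the noise
line_model = pylab.polyfit(x_vals, y_vals, 1)   # matches the real process

# Evaluate both models on data that includes a point far outside the training range.
new_x = (0, 1, 2, 3, 20)
new_y = (0, 1, 2, 3.1, 20)
for name, model in (('quadratic', quad_model), ('line', line_model)):
    est = pylab.polyval(model, new_x)
    print(name, 'R-squared =', round(r_squared(new_y, est), 4))
```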
So, take-home message, and then we're going to amplify this. If I pick an overly complex model, I run the danger of overfitting to the training data. Overfitting means that I'm not only fitting the underlying process, I'm fitting the noise. I get that an order-16 model is the best fit when it was in fact an order-2 model that generated the data. That increases the risk that the model won't do well with new data, which is not what I'd like. I want to be able to predict well.

On the other hand, that would seem to say: just stick with the simplest possible model. But there's a trade-off here. And we already saw that when I tried to fit a line to data that was basically quadratic: I didn't get a good fit. So I want to find the balance. An insufficiently complex model won't explain the data well. An overly complex model will overfit the training data. So I'd like to find the place where the model is as simple as possible, but still explains the data. And I can't resist the quote from Einstein that captures it pretty well: "Everything should be made as simple as possible, but not simpler." In the case of where I started, it should be fit with a quadratic, because that's the right fit, but not more than that, because anything more is getting overly complex.

Now, how might we go about finding the right model? We're not going to dwell on this, but here is a standard way in which you might do it. Start with a low-order model. Again, take the data, fit a linear model to it, and look at not only the r-squared value, but how well it accounts for new data. Then increase the order of the model and repeat the process. Keep doing that until you find a point at which a model does a good job both on the training data and on predicting new data, and after which performance starts to fall off. That gives you a point where you might say: there's a good-sized model.
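As a sketch of that procedure, under the assumption that we hold out a validation set and reuse the hypothetical r_squared helper (the selection rule here, keeping the order with the best validation score, is one reasonable choice, not the lecture's code):

```python
import pylab

def choose_order(train_x, train_y, valid_x, valid_y, max_order=16):
    """Increase model order and track r-squared on held-out data."""
    scores = {}
    for order in range(1, max_order + 1):
        model = pylab.polyfit(train_x, train_y, order)
        est = pylab.polyval(model, valid_x)
        scores[order] = r_squared(valid_y, est)
    # Keep the order whose *validation* score is best; a simpler rule
    # would stop at the first order after which the score falls off.
    return max(scores, key=scores.get), scores
```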
802 00:35:19,830 --> 00:35:22,080 In the case of this data, whether I would have stopped 803 00:35:22,080 --> 00:35:24,413 at a quadratic or used a cubic or a quartic 804 00:35:24,413 --> 00:35:26,104 depends on the values. 805 00:35:26,104 --> 00:35:28,270 But I certainly wouldn't have gone much beyond that. 806 00:35:28,270 --> 00:35:30,228 And this is one way, if you don't have a theory 807 00:35:30,228 --> 00:35:32,490 to drive you, to think about how to actually fit 808 00:35:32,490 --> 00:35:33,960 the model the way I would like. 809 00:35:37,200 --> 00:35:38,635 Let's go back to where we started. 810 00:35:38,635 --> 00:35:40,260 We still have one more big topic to do, 811 00:35:40,260 --> 00:35:41,801 and we still have a few minutes left. 812 00:35:41,801 --> 00:35:45,870 But let's go back to where we started: Hooke's law. 813 00:35:45,870 --> 00:35:48,730 There was the data from measuring displacements 814 00:35:48,730 --> 00:35:50,920 of a spring, as I added different weights 815 00:35:50,920 --> 00:35:53,290 to the bottom of the spring. 816 00:35:53,290 --> 00:35:54,940 And there's the linear fit. 817 00:35:54,940 --> 00:35:56,600 It's not bad. 818 00:35:56,600 --> 00:35:58,927 There's the quadratic fit. 819 00:35:58,927 --> 00:36:01,260 And it's certainly got a better r-squared value, though 820 00:36:01,260 --> 00:36:03,930 that could be just fitting to the noise. 821 00:36:03,930 --> 00:36:05,370 But you actually can see, I think, 822 00:36:05,370 --> 00:36:08,340 that that green curve probably does a better 823 00:36:08,340 --> 00:36:11,450 job of fitting the data. 824 00:36:11,450 --> 00:36:13,120 Well, wait a minute. 825 00:36:13,120 --> 00:36:15,830 Even though the quadratic fit is tighter here, 826 00:36:15,830 --> 00:36:20,030 Hooke says, this is linear. 827 00:36:20,030 --> 00:36:21,959 So what's going on? 828 00:36:21,959 --> 00:36:23,500 Well, this is another place where you 829 00:36:23,500 --> 00:36:24,970 want to think about your model. 830 00:36:24,970 --> 00:36:28,066 And I'll remind you, in case you don't remember your physics, 831 00:36:28,066 --> 00:36:30,011 unless we believe that Hooke was wrong, 832 00:36:30,011 --> 00:36:31,260 this should tell us something. 833 00:36:31,260 --> 00:36:33,790 And in particular, Hooke's law says the model 834 00:36:33,790 --> 00:36:38,750 holds until you reach the elastic limit of the spring. 835 00:36:38,750 --> 00:36:42,090 You stretch a slinky too far, it never springs back. 836 00:36:42,090 --> 00:36:44,250 You've gone beyond that elastic limit. 837 00:36:44,250 --> 00:36:47,870 And that's probably what's happening right up there. 838 00:36:47,870 --> 00:36:50,930 Through here, it's following that linear relationship. 839 00:36:50,930 --> 00:36:53,630 Up at this point, I've essentially broken the spring. 840 00:36:53,630 --> 00:36:56,730 Beyond the elastic limit, the linear model doesn't hold anymore. 841 00:36:56,730 --> 00:36:58,840 And so really, in this case, I should probably 842 00:36:58,840 --> 00:37:02,940 fit different models to different segments. 843 00:37:02,940 --> 00:37:05,550 And there's a much better fit. 844 00:37:05,550 --> 00:37:07,620 Linear through the first part, and another 845 00:37:07,620 --> 00:37:11,480 line once I hit that elastic limit. 846 00:37:11,480 --> 00:37:12,980 How might I find this?
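Here's a sketch of the kind of break-point search described next; it assumes the x values are sorted and that two line segments suffice:

    import numpy as np

    def fit_two_segments(x, y):
        # Try each break point, fit a line to each side, and keep the
        # split with the smallest total squared error.
        x, y = np.asarray(x, float), np.asarray(y, float)
        best = None
        for k in range(2, len(x) - 1):           # at least two points per side
            left = np.polyfit(x[:k], y[:k], 1)
            right = np.polyfit(x[k:], y[k:], 1)
            err = (((y[:k] - np.polyval(left, x[:k])) ** 2).sum()
                   + ((y[k:] - np.polyval(right, x[k:])) ** 2).sum())
            if best is None or err < best[0]:
                best = (err, k, left, right)
        return best   # (error, break index, left-line coeffs, right-line coeffs)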
847 00:37:12,980 --> 00:37:16,520 Well, you could imagine a little search process in which you try 848 00:37:16,520 --> 00:37:19,970 and find where's the best place along here to break 849 00:37:19,970 --> 00:37:23,720 the data into two sets, fit linear segments to both, 850 00:37:23,720 --> 00:37:27,112 and get really good fits for both segments. 851 00:37:27,112 --> 00:37:29,070 And I raise it because that's the kind of thing 852 00:37:29,070 --> 00:37:30,070 you've also seen before. 853 00:37:30,070 --> 00:37:32,580 You could imagine writing code to do that search 854 00:37:32,580 --> 00:37:35,530 to find that good fit. 855 00:37:35,530 --> 00:37:38,940 OK, that gives you a sense, then, 856 00:37:38,940 --> 00:37:41,070 of why you want to be careful about overfitting, 857 00:37:41,070 --> 00:37:43,070 why you want to not just look at the coefficient 858 00:37:43,070 --> 00:37:46,920 of determination, but see how well does this predict behavior 859 00:37:46,920 --> 00:37:48,980 on new data sets. 860 00:37:48,980 --> 00:37:52,850 Now suppose I don't have a theory, like Hooke, 861 00:37:52,850 --> 00:37:54,460 to guide me. 862 00:37:54,460 --> 00:37:58,340 Can I still figure out what's a good model to fit to the data? 863 00:37:58,340 --> 00:37:59,680 And the answer is, you bet. 864 00:37:59,680 --> 00:38:01,810 We're going to use cross-validation to guide 865 00:38:01,810 --> 00:38:04,760 the choice of the model complexity. 866 00:38:04,760 --> 00:38:07,470 And I want to show you two examples. 867 00:38:07,470 --> 00:38:10,000 If the data set's small, we can use 868 00:38:10,000 --> 00:38:12,700 what's called leave-one-out cross-validation. 869 00:38:12,700 --> 00:38:15,600 I'll give you a definition of that in a second. 870 00:38:15,600 --> 00:38:17,390 If the data set's bigger than that, 871 00:38:17,390 --> 00:38:20,210 we can use k-fold cross-validation. 872 00:38:20,210 --> 00:38:21,980 I'll give you a definition of that in a second. 873 00:38:21,980 --> 00:38:24,660 Or just what's called repeated random sampling. 874 00:38:24,660 --> 00:38:27,879 But we can use this same idea of validating against new data 875 00:38:27,879 --> 00:38:30,170 to try and figure out whether the model is a good model 876 00:38:30,170 --> 00:38:32,490 or not. 877 00:38:32,490 --> 00:38:33,920 Leave-one-out cross-validation. 878 00:38:33,920 --> 00:38:35,330 This is written in pseudocode, 879 00:38:35,330 --> 00:38:37,480 but the idea is pretty simple. 880 00:38:37,480 --> 00:38:38,440 I'm given a data set. 881 00:38:38,440 --> 00:38:40,580 It's not too large. 882 00:38:40,580 --> 00:38:44,710 The idea is to walk through a number of trials, 883 00:38:44,710 --> 00:38:47,130 a number of trials equal to the size of the data set. 884 00:38:47,130 --> 00:38:51,170 And for each one, take the data set, or a copy of it, 885 00:38:51,170 --> 00:38:52,640 and drop out one of the samples. 886 00:38:52,640 --> 00:38:54,305 So leave one out. 887 00:38:54,305 --> 00:38:55,930 Start off by leaving out the first one, 888 00:38:55,930 --> 00:38:57,280 then leaving out the second one, and then 889 00:38:57,280 --> 00:38:58,390 leaving out the third one. 890 00:38:58,390 --> 00:39:02,160 For each one of those training sets, build the model. 891 00:39:02,160 --> 00:39:04,610 For example, by using linear regression. 892 00:39:04,610 --> 00:39:10,340 And then test that model on the data point that you left out.
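That pseudocode translates into Python along these lines; this sketch assumes a polynomial model fit with numpy, though the idea works for any model builder:

    import numpy as np

    def leave_one_out(x, y, degree):
        # For each sample: drop it, fit on the rest, test on the one left out.
        x, y = np.asarray(x, float), np.asarray(y, float)
        squared_errors = []
        for i in range(len(x)):
            model = np.polyfit(np.delete(x, i), np.delete(y, i), degree)
            prediction = np.polyval(model, x[i])
            squared_errors.append((prediction - y[i]) ** 2)
        return np.mean(squared_errors)   # average test error over all trials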
893 00:39:10,340 --> 00:39:12,087 So leave out the first one, build a model 894 00:39:12,087 --> 00:39:13,670 on all of the other ones, and then see 895 00:39:13,670 --> 00:39:15,461 how well that model predicts the first one. 896 00:39:15,461 --> 00:39:17,150 Leave out the second one, build a model 897 00:39:17,150 --> 00:39:18,350 using all of them but the second one, 898 00:39:18,350 --> 00:39:20,016 see how well it predicts the second one. 899 00:39:20,016 --> 00:39:21,946 And just average the results. Works 900 00:39:21,946 --> 00:39:23,570 when you don't have a really large data 901 00:39:23,570 --> 00:39:25,100 set, because it won't take too long. 902 00:39:25,100 --> 00:39:30,150 But it's a nice way of actually doing validation. 903 00:39:30,150 --> 00:39:32,130 If the data set's a lot bigger, you 904 00:39:32,130 --> 00:39:33,420 can still use the same idea. 905 00:39:33,420 --> 00:39:35,990 You can use what's called k-fold. 906 00:39:35,990 --> 00:39:40,730 Divide the data set up into k equal-sized chunks. 907 00:39:40,730 --> 00:39:41,700 Leave one of them out. 908 00:39:41,700 --> 00:39:43,880 Use the rest to build the model. 909 00:39:43,880 --> 00:39:45,770 And then use that model to predict 910 00:39:45,770 --> 00:39:47,300 that first chunk you left out. 911 00:39:47,300 --> 00:39:49,640 Leave out the second chunk, and keep doing it. 912 00:39:49,640 --> 00:39:51,320 Same idea, but now with groups of things 913 00:39:51,320 --> 00:39:55,270 rather than leaving out single data points. 914 00:39:55,270 --> 00:39:57,790 All right, the other way you can deal with it, 915 00:39:57,790 --> 00:40:00,070 which has a nice effect to it, is 916 00:40:00,070 --> 00:40:03,700 to use what's called repeated random sampling. 917 00:40:03,700 --> 00:40:05,710 OK, start out with some data set. 918 00:40:05,710 --> 00:40:07,210 And what I'm going to do here is I'm 919 00:40:07,210 --> 00:40:09,001 going to run through some number of trials. 920 00:40:09,001 --> 00:40:10,090 I'm going to call that k. 921 00:40:10,090 --> 00:40:13,480 But I'm also going to pick some number of random samples 922 00:40:13,480 --> 00:40:15,650 from the data set. 923 00:40:15,650 --> 00:40:17,410 Usually, as I recall, 924 00:40:17,410 --> 00:40:20,470 it is somewhere between 20% and 50% 925 00:40:20,470 --> 00:40:22,250 of the samples that get reserved. 926 00:40:22,250 --> 00:40:25,630 But the idea is, again, walk over all of those k trials. 927 00:40:25,630 --> 00:40:29,290 And in each one, pick out at random n elements 928 00:40:29,290 --> 00:40:30,820 for the test set. 929 00:40:30,820 --> 00:40:32,920 Use the remainder as the training set. 930 00:40:32,920 --> 00:40:34,990 Build the model on the training set. 931 00:40:34,990 --> 00:40:38,180 And then apply that model to the test set. 932 00:40:38,180 --> 00:40:40,730 So rather than doing k-fold, where I select each chunk, 933 00:40:40,730 --> 00:40:42,080 in turn, and keep the rest, 934 00:40:42,080 --> 00:40:46,450 this just randomly selects which ones to pull out. 935 00:40:46,450 --> 00:40:48,960 So I'm going to show you one last example. 936 00:40:48,960 --> 00:40:51,400 Let's look at that idea of, I don't have a model here. 937 00:40:51,400 --> 00:40:54,400 I want to use this idea of cross-validation 938 00:40:54,400 --> 00:40:57,390 to try and figure out what's the best possible model. 939 00:40:57,390 --> 00:41:00,250 And for this, I'm going to use a different data set.
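A corresponding sketch of k-fold; repeated random sampling would simply replace the fixed folds with a freshly drawn random test set on each of the k trials:

    import numpy as np

    def k_fold(x, y, degree, k=5):
        # Split the indices into k roughly equal chunks; hold each out in turn.
        x, y = np.asarray(x, float), np.asarray(y, float)
        indices = np.arange(len(x))
        np.random.shuffle(indices)
        errors = []
        for fold in np.array_split(indices, k):
            train = np.setdiff1d(indices, fold)       # everything not in the fold
            model = np.polyfit(x[train], y[train], degree)
            preds = np.polyval(model, x[fold])
            errors.append(((preds - y[fold]) ** 2).mean())
        return np.mean(errors)   # average test error across the k folds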
940 00:41:00,250 --> 00:41:02,230 The task here is I want 941 00:41:02,230 --> 00:41:04,810 to try to model 942 00:41:04,810 --> 00:41:07,510 how the mean daily high temperature in the US 943 00:41:07,510 --> 00:41:14,060 has varied over about a 55-year period, from '61 to 2015. 944 00:41:14,060 --> 00:41:15,280 Got a set of data. 945 00:41:15,280 --> 00:41:18,540 It's the daily high for every day of the year 946 00:41:18,540 --> 00:41:19,839 through that entire period. 947 00:41:19,839 --> 00:41:21,380 And what I'm going to do is I'm going 948 00:41:21,380 --> 00:41:24,249 to compute the means for each year and plot them out. 949 00:41:24,249 --> 00:41:26,290 And then I'm going to try and fit models to them. 950 00:41:26,290 --> 00:41:28,430 And in particular, I'm going to take 951 00:41:28,430 --> 00:41:30,590 a set of different dimensionalities: 952 00:41:30,590 --> 00:41:34,477 linear, quadratic, cubic, quartic. And in each case, 953 00:41:34,477 --> 00:41:36,060 I'm going to run through a trial where 954 00:41:36,060 --> 00:41:39,142 I train on one half of the data and test on the other. 955 00:41:39,142 --> 00:41:41,100 There, again, is that idea of seeing how well it 956 00:41:41,100 --> 00:41:42,480 predicts other data. 957 00:41:42,480 --> 00:41:45,110 Record the coefficient of determination. 958 00:41:45,110 --> 00:41:47,370 And do that and get out an average, 959 00:41:47,370 --> 00:41:50,670 and report what I get as the mean for each of those values 960 00:41:50,670 --> 00:41:53,810 across each dimensionality. 961 00:41:53,810 --> 00:41:55,630 OK, here we go. 962 00:41:55,630 --> 00:41:57,212 Here's a set of code that's pretty easy to follow. 963 00:41:57,212 --> 00:41:59,170 Hopefully, you can just look at it and grok it. 964 00:41:59,170 --> 00:42:01,660 We start off with a boring class, 965 00:42:01,660 --> 00:42:04,086 which Professor Guttag suggests refers to this lecture. 966 00:42:04,086 --> 00:42:04,710 But it doesn't. 967 00:42:04,710 --> 00:42:07,126 This may be a boring lecture, but it's not a boring class. 968 00:42:07,126 --> 00:42:08,580 This is a great class. 969 00:42:08,580 --> 00:42:10,800 And boy, those jokes are really awful, aren't they? 970 00:42:10,800 --> 00:42:11,910 But here we go. 971 00:42:11,910 --> 00:42:15,810 A simple class that builds temperature data. 972 00:42:15,810 --> 00:42:19,800 This reads in some information, splits it up, and basically 973 00:42:19,800 --> 00:42:23,280 records the high for the day and the year in which I got that. 974 00:42:23,280 --> 00:42:26,020 So for each day, I've got a high temperature for that day. 975 00:42:26,020 --> 00:42:28,540 I'm going to give you back the high temperature and the year 976 00:42:28,540 --> 00:42:30,456 in which it was recorded, because I don't care 977 00:42:30,456 --> 00:42:32,590 whether it was in January or June. 978 00:42:32,590 --> 00:42:35,480 Then a little function that opens up a file. 979 00:42:35,480 --> 00:42:38,260 We've actually given you the file, if you want to go look at it. 980 00:42:38,260 --> 00:42:40,380 And it simply walks through the file, reading it in 981 00:42:40,380 --> 00:42:45,200 and returning a big list of all those data objects. 982 00:42:45,200 --> 00:42:48,140 OK, then what I want to do is I want 983 00:42:48,140 --> 00:42:52,545 to get the mean high temperature for each year. 984 00:42:52,545 --> 00:42:54,920 Given that data, I'm going to set up a dictionary called 985 00:42:54,920 --> 00:42:55,545 years.
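As an aside, the class and reader might look roughly like this; the class name, file name, and record format here are guesses, since the actual file isn't shown in the transcript:

    class TempDatum(object):
        # One observation: the daily high and the year it was recorded in.
        def __init__(self, high, year):
            self.high = float(high)
            self.year = int(year)
        def get_high(self):
            return self.high
        def get_year(self):
            return self.year

    def get_temp_data(filename='temperatures.csv'):
        # Hypothetical format: one 'city,high,MMDDYYYY' record per line.
        data = []
        with open(filename) as f:
            for line in f:
                fields = line.strip().split(',')
                data.append(TempDatum(fields[1], fields[2][-4:]))
        return data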
986 00:42:55,545 --> 00:42:57,920 I'm just going to run through a loop over all the data 987 00:42:57,920 --> 00:43:01,280 points, storing each one in the dictionary under its year. 988 00:43:01,280 --> 00:43:02,430 So there's a data point. 989 00:43:02,430 --> 00:43:04,910 I use the method get year to get out the year. 990 00:43:04,910 --> 00:43:09,350 At that point, I add in the high temperature corresponding 991 00:43:09,350 --> 00:43:11,110 to that data point. 992 00:43:11,110 --> 00:43:13,090 And I'm using that nice little try-except block. 993 00:43:13,090 --> 00:43:14,881 I'll do that, unless I haven't had anything 994 00:43:14,881 --> 00:43:17,510 yet for this year, in which case this'll fail. 995 00:43:17,510 --> 00:43:20,260 And I'll simply store the first one in as a list. 996 00:43:20,260 --> 00:43:22,510 So after I've run through this loop, in the dictionary, 997 00:43:22,510 --> 00:43:24,759 under each year, I have a list of the high temperatures 998 00:43:24,759 --> 00:43:27,201 for each day associated with it. 999 00:43:27,201 --> 00:43:27,700 Excuse me. 1000 00:43:27,700 --> 00:43:30,970 And then I can just compute the average. That 1001 00:43:30,970 --> 00:43:33,100 is, for each year in the years, 1002 00:43:33,100 --> 00:43:34,150 I get that list. 1003 00:43:34,150 --> 00:43:34,912 I add the values up. 1004 00:43:34,912 --> 00:43:35,620 I get the length. 1005 00:43:35,620 --> 00:43:36,369 I divide them out. 1006 00:43:36,369 --> 00:43:39,229 And I store that in as the average high temperature 1007 00:43:39,229 --> 00:43:39,770 for the year. 1008 00:43:42,290 --> 00:43:44,000 Now I can plot it. 1009 00:43:44,000 --> 00:43:46,310 Get the data, get out the information 1010 00:43:46,310 --> 00:43:49,460 by computing those yearly means, run through a little loop 1011 00:43:49,460 --> 00:43:52,460 that basically puts the year in the x values and the 1012 00:43:52,460 --> 00:43:56,180 mean high temperature in the y values. 1013 00:43:56,180 --> 00:43:58,250 And I can do a plot. 1014 00:43:58,250 --> 00:44:02,830 And if I do that, I get that. 1015 00:44:02,830 --> 00:44:05,700 I'll let you run this yourself. 1016 00:44:05,700 --> 00:44:09,245 Now this is a little bit deceptive, because of the scale 1017 00:44:09,245 --> 00:44:09,870 I've used here. 1018 00:44:09,870 --> 00:44:12,150 But nonetheless, it shows, in the US, 1019 00:44:12,150 --> 00:44:15,581 over a 55-year period, the mean daily high 1020 00:44:15,581 --> 00:44:16,080 has gone 1021 00:44:16,080 --> 00:44:19,320 from about 15.5 1022 00:44:19,320 --> 00:44:23,740 degrees Celsius up to about 17 and 1/2. 1023 00:44:23,740 --> 00:44:26,150 So what's changed? 1024 00:44:26,150 --> 00:44:29,390 Now the question is, how could I model this? 1025 00:44:29,390 --> 00:44:31,280 Could I actually get a model that 1026 00:44:31,280 --> 00:44:34,100 would give me a sense of how this is changing? 1027 00:44:34,100 --> 00:44:37,250 And that's why I'm going to use cross-validation. 1028 00:44:37,250 --> 00:44:41,510 I'm going to run through a number of trials, 10 trials. 1029 00:44:41,510 --> 00:44:43,790 I'm going to try and fit four different models: 1030 00:44:43,790 --> 00:44:47,730 linear, quadratic, cubic, quartic. 1031 00:44:47,730 --> 00:44:49,822 And for each of these dimensions, 1032 00:44:49,822 --> 00:44:51,780 I'm going to get out a set of r-squared values. 1033 00:44:51,780 --> 00:44:56,850 So I'm just going to initialize that dictionary with an empty list for each dimension.
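Put together, the means computation and the plot described above might look like this sketch; it reuses the hypothetical get_temp_data reader from earlier, pylab is assumed as elsewhere in the course, and the plot title is a placeholder:

    import pylab

    def get_yearly_means(data):
        # Map each year to the list of daily highs, then replace the
        # list with its mean.
        years = {}
        for d in data:
            try:
                years[d.get_year()].append(d.get_high())
            except KeyError:
                years[d.get_year()] = [d.get_high()]
        for year in years:
            years[year] = sum(years[year]) / len(years[year])
        return years

    years = get_yearly_means(get_temp_data())
    x_vals = sorted(years)
    y_vals = [years[year] for year in x_vals]
    pylab.plot(x_vals, y_vals)
    pylab.title('Mean Daily High Temperature by Year')
    pylab.show()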
1034 00:44:56,850 --> 00:44:59,490 Now here is how I'm going to do this. 1035 00:44:59,490 --> 00:45:00,554 Got a list of x values. 1036 00:45:00,554 --> 00:45:01,220 Those are years. 1037 00:45:01,220 --> 00:45:02,360 Got a list of y values. 1038 00:45:02,360 --> 00:45:06,280 Those are the average daily highs. 1039 00:45:06,280 --> 00:45:10,730 I'm going to create a list of random samples. 1040 00:45:10,730 --> 00:45:13,840 So if you haven't seen this before, random.sample says, 1041 00:45:13,840 --> 00:45:15,880 given this iterable, which you can think 1042 00:45:15,880 --> 00:45:18,910 of as the collection from 0 up to n minus 1, 1043 00:45:18,910 --> 00:45:23,620 it's going to select this many of those numbers at random, 1044 00:45:23,620 --> 00:45:26,810 half of them, in this case. 1045 00:45:26,810 --> 00:45:30,310 So if I give it 0 up to 9, and I say, pick five of them, 1046 00:45:30,310 --> 00:45:33,970 it will, at random, give me back 5 of those 10 numbers, 1047 00:45:33,970 --> 00:45:36,110 with no duplicates. 1048 00:45:36,110 --> 00:45:38,270 Ah, that's nice. 1049 00:45:38,270 --> 00:45:39,710 Because now notice what I can do. 1050 00:45:39,710 --> 00:45:41,440 I'm going to set up 1051 00:45:41,440 --> 00:45:44,320 x and y values for a training set, and x 1052 00:45:44,320 --> 00:45:45,670 and y values for the test set. 1053 00:45:45,670 --> 00:45:47,586 And I'm just going to run through a loop here, 1054 00:45:47,586 --> 00:45:51,490 where if this index is in that list, 1055 00:45:51,490 --> 00:45:53,600 I'll stick it in the training set. 1056 00:45:53,600 --> 00:45:57,290 Otherwise, I'll stick it in the test set. 1057 00:45:57,290 --> 00:45:59,380 And then I just return them. 1058 00:45:59,380 --> 00:46:02,140 So this is a really nice way of, at random, just 1059 00:46:02,140 --> 00:46:08,520 splitting the data set into a test set and a training set. 1060 00:46:08,520 --> 00:46:12,180 And then finally, I can run over the number of trials 1061 00:46:12,180 --> 00:46:13,500 I want to deal with. 1062 00:46:13,500 --> 00:46:15,330 In each case, get a different training 1063 00:46:15,330 --> 00:46:17,460 and test set, at random. 1064 00:46:17,460 --> 00:46:20,640 And then, for each dimension, do the fit. 1065 00:46:20,640 --> 00:46:23,550 There's polyfit on the training x and training y values 1066 00:46:23,550 --> 00:46:24,720 in that dimension. 1067 00:46:24,720 --> 00:46:26,670 Gives you back a model. 1068 00:46:26,670 --> 00:46:29,370 I could just check to see how well it does on the training set, 1069 00:46:29,370 --> 00:46:32,250 but I really want to look at, given that model, 1070 00:46:32,250 --> 00:46:37,120 how well does polyval predict the test set, right? 1071 00:46:37,120 --> 00:46:39,640 The model will say, here's what I expect the values to be. 1072 00:46:39,640 --> 00:46:41,710 I'm going to compare that to the actual values 1073 00:46:41,710 --> 00:46:44,740 that I saw from the test set, 1074 00:46:44,740 --> 00:46:48,341 computing that r-squared value and adding it in. 1075 00:46:48,341 --> 00:46:49,840 And then the last of this just says, 1076 00:46:49,840 --> 00:46:53,718 I'll run this through a set of examples. 1077 00:46:53,718 --> 00:46:56,880 OK, here's what happens if I do that. 1078 00:46:56,880 --> 00:46:59,992 I'm not going to run it, although the code will run it. 1079 00:46:59,992 --> 00:47:01,700 Let me, again, remind you what I'm doing.
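Before the results, here is a sketch of that random split and trial loop; it reuses the x_vals and y_vals lists built for the plot above, and the function name and output formatting are mine:

    import random
    import numpy as np

    def split_data(x_vals, y_vals):
        # Reserve a random half of the indices for training; the rest is test.
        to_train = set(random.sample(range(len(x_vals)), len(x_vals) // 2))
        train_x, train_y, test_x, test_y = [], [], [], []
        for i in range(len(x_vals)):
            if i in to_train:
                train_x.append(x_vals[i])
                train_y.append(y_vals[i])
            else:
                test_x.append(x_vals[i])
                test_y.append(y_vals[i])
        return train_x, train_y, test_x, test_y

    num_trials, dimensions = 10, (1, 2, 3, 4)
    r_squares = {d: [] for d in dimensions}
    for _ in range(num_trials):
        train_x, train_y, test_x, test_y = split_data(x_vals, y_vals)
        for d in dimensions:
            model = np.polyfit(train_x, train_y, d)   # fit on the training half
            preds = np.polyval(model, test_x)         # predict the test half
            resid = ((np.array(test_y) - preds) ** 2).sum()
            total = ((np.array(test_y) - np.mean(test_y)) ** 2).sum()
            r_squares[d].append(1 - resid / total)
    for d in dimensions:
        print('dimension %d: mean r-squared %.4f, std %.4f'
              % (d, np.mean(r_squares[d]), np.std(r_squares[d])))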
1080 00:47:01,700 --> 00:47:03,190 I've got a big set of data. I'm going 1081 00:47:03,190 --> 00:47:05,650 to pick out, at random, subsets of it, 1082 00:47:05,650 --> 00:47:09,380 build the model on one part, and test it on the other part. 1083 00:47:09,380 --> 00:47:16,640 And if I run it, I get a linear fit, quadratic fit, cubic fit, 1084 00:47:16,640 --> 00:47:18,240 and a quartic fit. 1085 00:47:18,240 --> 00:47:21,092 And here's the standard deviation of those samples. 1086 00:47:21,092 --> 00:47:22,550 Remember, I've got multiple trials. 1087 00:47:22,550 --> 00:47:24,050 I've got 10 trials, in this case. 1088 00:47:24,050 --> 00:47:26,270 So this gives me the average over those trials. 1089 00:47:26,270 --> 00:47:29,150 And this tells me how much they vary. 1090 00:47:29,150 --> 00:47:32,850 What can I conclude from this? 1091 00:47:32,850 --> 00:47:34,950 Well, I would argue that the linear fit's probably 1092 00:47:34,950 --> 00:47:36,570 the winner here. 1093 00:47:36,570 --> 00:47:37,530 Goes back to Einstein. 1094 00:47:37,530 --> 00:47:41,550 I want the simplest possible model that accounts for the data. 1095 00:47:41,550 --> 00:47:44,700 And you can see it's got the highest r-squared value, which 1096 00:47:44,700 --> 00:47:46,800 is already a good sign. 1097 00:47:46,800 --> 00:47:49,650 It's got the smallest deviation across the trials, 1098 00:47:49,650 --> 00:47:52,170 which says it's probably a pretty good fit. 1099 00:47:52,170 --> 00:47:54,410 And it's the simplest model. 1100 00:47:54,410 --> 00:47:58,250 So linear sounds like a pretty good fit. 1101 00:47:58,250 --> 00:48:01,910 Now, why should we run multiple trials to test this? 1102 00:48:01,910 --> 00:48:04,940 I ran 10 trials for each one of these dimensions. 1103 00:48:04,940 --> 00:48:07,280 Why bother with it? 1104 00:48:07,280 --> 00:48:09,440 Well, notice that those deviations-- 1105 00:48:09,440 --> 00:48:11,547 I'll go back to it here-- 1106 00:48:11,547 --> 00:48:12,380 they're pretty good. 1107 00:48:12,380 --> 00:48:13,860 They're about an order of magnitude 1108 00:48:13,860 --> 00:48:16,350 less than the actual mean, which says they're pretty tight, 1109 00:48:16,350 --> 00:48:20,032 but they're still a reasonable size. 1110 00:48:20,032 --> 00:48:22,240 And that suggests that, while there's good agreement, 1111 00:48:22,240 --> 00:48:24,130 the deviations are large enough that you 1112 00:48:24,130 --> 00:48:28,330 could see a range of variation across the trials. 1113 00:48:28,330 --> 00:48:31,617 So in fact, if I had just run one trial, 1114 00:48:31,617 --> 00:48:32,700 I could have been screwed. 1115 00:48:32,700 --> 00:48:35,771 Sorry, oh-- sorry, pick your favorite [INAUDIBLE] here. 1116 00:48:35,771 --> 00:48:37,270 [? Hose ?] is a Canadian expression, 1117 00:48:37,270 --> 00:48:39,180 in case you haven't seen it. 1118 00:48:39,180 --> 00:48:42,300 Here are the r-squared values for each trial 1119 00:48:42,300 --> 00:48:44,004 of the linear fit. 1120 00:48:44,004 --> 00:48:45,920 And you can see the mean comes out pretty well. 1121 00:48:45,920 --> 00:48:48,140 But notice, if I'd only run one trial 1122 00:48:48,140 --> 00:48:52,470 and I happened to get that one, oh, darn. 1123 00:48:52,470 --> 00:48:54,300 That's a really low r-squared value. 1124 00:48:54,300 --> 00:48:56,430 And we might have reached, in this case, 1125 00:48:56,430 --> 00:49:00,730 a different conclusion: that the linear fit was not a good fit.
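In the sketch above, you could print the individual trials for the linear fit rather than only the mean, which is exactly the point being made here:

    # Continuing the sketch above: one unlucky trial can have a much
    # lower r-squared than the mean across all ten trials suggests.
    for trial, r2 in enumerate(r_squares[1]):
        print('trial %d: r-squared %.4f' % (trial, r2))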
1126 00:49:00,730 --> 00:49:04,230 So this is a way of saying, even with random sampling, run 1127 00:49:04,230 --> 00:49:06,630 multiple trials, because it lets you 1128 00:49:06,630 --> 00:49:09,480 get statistics on those trials, as well as statistics 1129 00:49:09,480 --> 00:49:10,700 within each trial. 1130 00:49:10,700 --> 00:49:12,450 So within any trial, I'm doing a whole bunch 1131 00:49:12,450 --> 00:49:14,904 of different random samples and measuring those values. 1132 00:49:14,904 --> 00:49:16,320 And then, across those trials, I'm 1133 00:49:16,320 --> 00:49:18,744 seeing what the deviation is. 1134 00:49:18,744 --> 00:49:20,410 I'm going to hope my machine comes back, 1135 00:49:20,410 --> 00:49:24,030 because what I want to do is then pull this together. 1136 00:49:24,030 --> 00:49:25,090 What have we done? 1137 00:49:25,090 --> 00:49:26,340 Something you're going to use. 1138 00:49:26,340 --> 00:49:28,550 We've seen how you can use linear regression 1139 00:49:28,550 --> 00:49:33,800 to fit a curve to data, 2D, 3D, 6D, however many dimensions 1140 00:49:33,800 --> 00:49:35,330 the data has. 1141 00:49:35,330 --> 00:49:37,790 It gives us a mapping from the independent values 1142 00:49:37,790 --> 00:49:39,440 to the dependent values. 1143 00:49:39,440 --> 00:49:42,560 And that can then be used to predict values 1144 00:49:42,560 --> 00:49:44,390 associated with independent values 1145 00:49:44,390 --> 00:49:46,340 that we haven't seen yet. 1146 00:49:46,340 --> 00:49:49,070 That leads, naturally, both to a way 1147 00:49:49,070 --> 00:49:52,430 to measure fit, which is r-squared, and especially 1148 00:49:52,430 --> 00:49:55,700 to seeing that we want to look at how well the model 1149 00:49:55,700 --> 00:49:59,300 actually predicts new data, because that lets us select 1150 00:49:59,300 --> 00:50:03,470 the simplest model we can that accounts for the data 1151 00:50:03,470 --> 00:50:06,140 but predicts new data in an effective way. 1152 00:50:06,140 --> 00:50:07,610 And that complexity can either be 1153 00:50:07,610 --> 00:50:11,510 based on theory, as in the case of Hooke, or, more likely, 1154 00:50:11,510 --> 00:50:14,180 chosen by doing cross-validation to try and figure out 1155 00:50:14,180 --> 00:50:16,640 which is the simplest model that 1156 00:50:16,640 --> 00:50:19,510 still does a good job of predicting 1157 00:50:19,510 --> 00:50:21,980 out-of-sample behavior. 1158 00:50:21,980 --> 00:50:25,000 And with that, I'll see you next time.