1
00:00:00,740 --> 00:00:03,080
The following content is
provided under a Creative

2
00:00:03,080 --> 00:00:04,500
Commons license.

3
00:00:04,500 --> 00:00:06,710
Your support will help
MIT OpenCourseWare

4
00:00:06,710 --> 00:00:10,800
continue to offer high quality
educational resources for free.

5
00:00:10,800 --> 00:00:13,340
To make a donation or to
view additional materials

6
00:00:13,340 --> 00:00:17,300
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,300 --> 00:00:18,210
at ocw.mit.edu.

8
00:00:30,101 --> 00:00:34,370
ERIC GRIMSON: OK,
welcome back or welcome,

9
00:00:34,370 --> 00:00:37,360
depending on whether
you've been away or not.

10
00:00:37,360 --> 00:00:40,970
I'm going to start with
two simple announcements.

11
00:00:40,970 --> 00:00:43,910
There is a reading
assignment for this lecture,

12
00:00:43,910 --> 00:00:47,780
actually for the next two
lectures, which is chapter 18.

13
00:00:47,780 --> 00:00:50,000
And on a much
happier note, there

14
00:00:50,000 --> 00:00:52,375
is no lecture
Wednesday because we

15
00:00:52,375 --> 00:00:53,750
hope that you're
going to be busy

16
00:00:53,750 --> 00:00:56,030
preparing to get that
tryptophan poisoning

17
00:00:56,030 --> 00:00:59,115
as you eat way too much
turkey and you fall asleep.

18
00:00:59,115 --> 00:01:00,490
More importantly,
I hope you have

19
00:01:00,490 --> 00:01:02,656
a great break over Thanksgiving,
whether you're here

20
00:01:02,656 --> 00:01:04,769
or you're back home
or wherever you are.

21
00:01:04,769 --> 00:01:08,260
But no lecture Wednesday.

22
00:01:08,260 --> 00:01:13,390
Topic for today, I'm going
to start with seems like--

23
00:01:13,390 --> 00:01:17,270
sorry, what's going to seem
like a really obvious statement.

24
00:01:17,270 --> 00:01:20,870
We're living in a
data intensive world.

25
00:01:20,870 --> 00:01:25,400
Whether you're a scientist,
an engineer, social scientist,

26
00:01:25,400 --> 00:01:31,280
financial worker, politician,
manager of a sports team,

27
00:01:31,280 --> 00:01:33,800
you're spending
increasingly larger amounts

28
00:01:33,800 --> 00:01:36,124
of time dealing with data.

29
00:01:36,124 --> 00:01:37,790
And if you're in one
of those positions,

30
00:01:37,790 --> 00:01:40,460
that often means that
you're either writing code

31
00:01:40,460 --> 00:01:43,040
or you're hiring somebody
to write code for you

32
00:01:43,040 --> 00:01:45,970
to figure out that data.

33
00:01:45,970 --> 00:01:48,060
And this section
of the course is

34
00:01:48,060 --> 00:01:49,710
focusing on exactly that issue.

35
00:01:49,710 --> 00:01:52,380
We want to help
you understand what

36
00:01:52,380 --> 00:01:56,310
you can try to do with software
that manipulates data, how you

37
00:01:56,310 --> 00:01:58,890
can write code that would
do that manipulation of data

38
00:01:58,890 --> 00:02:02,700
for you, and especially what
you should believe about what

39
00:02:02,700 --> 00:02:06,450
that software tells you about
data, because sometimes it

40
00:02:06,450 --> 00:02:10,212
tells you stuff that isn't
exactly what you need to know.

41
00:02:10,212 --> 00:02:11,670
And today we're
going to start that

42
00:02:11,670 --> 00:02:14,970
by looking at particularly
the case where we get data

43
00:02:14,970 --> 00:02:16,630
from experiments.

44
00:02:16,630 --> 00:02:18,540
So think of this
lecture and the next one

45
00:02:18,540 --> 00:02:22,680
as sort of being statistics
meets experimental science.

46
00:02:22,680 --> 00:02:25,490
So what do I mean by that?

47
00:02:25,490 --> 00:02:28,260
Imagine you're doing a physics
lab, biology lab, a chemistry

48
00:02:28,260 --> 00:02:31,920
lab, or even something in
sociology or anthropology,

49
00:02:31,920 --> 00:02:34,410
you conduct an experiment
to gather some data.

50
00:02:34,410 --> 00:02:36,060
It could be
measurements in a lab.

51
00:02:36,060 --> 00:02:38,170
It could be answers
on a questionnaire.

52
00:02:38,170 --> 00:02:40,224
You get a set of data.

53
00:02:40,224 --> 00:02:41,640
Once you've got
the data, you want

54
00:02:41,640 --> 00:02:43,980
to think about what
can I do with it,

55
00:02:43,980 --> 00:02:47,850
and that usually will involve
using some model, some theory

56
00:02:47,850 --> 00:02:51,860
about the underlying process
to generate questions

57
00:02:51,860 --> 00:02:53,440
about the data.

58
00:02:53,440 --> 00:02:55,870
What does this data and the
model associated with it

59
00:02:55,870 --> 00:02:58,890
tell me about
future expectations,

60
00:02:58,890 --> 00:03:00,660
help me predict
other results that

61
00:03:00,660 --> 00:03:02,100
will come out of this data.

62
00:03:02,100 --> 00:03:04,219
In the social case,
it could be how

63
00:03:04,219 --> 00:03:05,760
do I think about
how people are going

64
00:03:05,760 --> 00:03:07,950
to respond to a
poll about who are

65
00:03:07,950 --> 00:03:12,130
you voting for in the next
election, for example.

66
00:03:12,130 --> 00:03:15,700
Given the data, given the
model, the third thing

67
00:03:15,700 --> 00:03:17,170
we're typically
going to want to do

68
00:03:17,170 --> 00:03:21,190
is then design a computation
to help us answer questions

69
00:03:21,190 --> 00:03:24,190
about the data, run a
computational experiment

70
00:03:24,190 --> 00:03:26,890
to complement the
physical experiment

71
00:03:26,890 --> 00:03:28,750
or the social experiment
we used to gather

72
00:03:28,750 --> 00:03:30,940
the data in the first place.

73
00:03:30,940 --> 00:03:33,200
And that computation
could be something deep.

74
00:03:33,200 --> 00:03:35,200
It could be something a
little more interesting,

75
00:03:35,200 --> 00:03:37,280
depending on how you're
thinking about it.

76
00:03:37,280 --> 00:03:38,980
But we want to think
about how do we

77
00:03:38,980 --> 00:03:43,530
use computation to run
additional experiments for us.

78
00:03:43,530 --> 00:03:46,650
So I'm going to start by
using an example of gathering

79
00:03:46,650 --> 00:03:48,690
experimental data,
and I want to start

80
00:03:48,690 --> 00:03:51,540
with the idea of a spring.

81
00:03:51,540 --> 00:03:53,005
How would I model a spring?

82
00:03:53,005 --> 00:03:54,630
How would I gather
data about a spring?

83
00:03:54,630 --> 00:03:57,046
And how would I write software
to help me answer questions

84
00:03:57,046 --> 00:03:58,740
about a spring?

85
00:03:58,740 --> 00:04:01,630
So what's spring?

86
00:04:01,630 --> 00:04:04,006
Well, there's one kind of
spring, a little hard to model,

87
00:04:04,006 --> 00:04:06,671
although it could be interesting
what's swimming around in there

88
00:04:06,671 --> 00:04:08,980
and how do I think about
the ecological implications

89
00:04:08,980 --> 00:04:11,250
of that spring.

90
00:04:11,250 --> 00:04:13,530
Here's a second kind of spring.

91
00:04:13,530 --> 00:04:15,060
It's about four or
five months away,

92
00:04:15,060 --> 00:04:16,769
but eventually we'll
get through this winter

93
00:04:16,769 --> 00:04:18,420
and get to that spring
and that would be nice,

94
00:04:18,420 --> 00:04:20,279
but I'm not going to
model that one either.

95
00:04:20,279 --> 00:04:21,926
And yes, my jokes
are really bad,

96
00:04:21,926 --> 00:04:23,800
and yes, you can't do
a darn thing about them

97
00:04:23,800 --> 00:04:25,890
because I am tenured because--

98
00:04:25,890 --> 00:04:27,815
while I'd like to model
these two springs,

99
00:04:27,815 --> 00:04:29,190
we're going to
stick with the one

100
00:04:29,190 --> 00:04:33,090
that you see in physics
labs, these kinds of springs,

101
00:04:33,090 --> 00:04:35,000
so-called linear springs.

102
00:04:35,000 --> 00:04:38,400
And these are springs that
have the property that you

103
00:04:38,400 --> 00:04:41,642
can stretch or compress them
by applying a force to it.

104
00:04:41,642 --> 00:04:43,350
And when you release
them, they literally

105
00:04:43,350 --> 00:04:46,680
spring back to the position
they were originally.

106
00:04:46,680 --> 00:04:49,620
So we're going to deal with
these kinds of springs.

107
00:04:49,620 --> 00:04:51,330
And the distinguishing
characteristics

108
00:04:51,330 --> 00:04:54,240
of these two springs
and others in this class

109
00:04:54,240 --> 00:04:58,080
is that that force you require
to compress it or stretch

110
00:04:58,080 --> 00:04:59,250
it a certain amount--

111
00:04:59,250 --> 00:05:03,030
the amount of force you
require varies linearly

112
00:05:03,030 --> 00:05:04,452
in the distance.

113
00:05:04,452 --> 00:05:05,910
So if it takes some
amount of force

114
00:05:05,910 --> 00:05:08,520
to compress it some
amount of distance,

115
00:05:08,520 --> 00:05:12,197
it takes twice as much force
to compress it twice as much

116
00:05:12,197 --> 00:05:12,780
of a distance.

117
00:05:12,780 --> 00:05:13,890
It's linearly related.

118
00:05:16,590 --> 00:05:18,610
So each one of these springs--

119
00:05:18,610 --> 00:05:21,169
these kinds of springs
has that property.

120
00:05:21,169 --> 00:05:22,710
The amount of force
needed to stretch

121
00:05:22,710 --> 00:05:25,470
or compress it's linear
in that distance.

122
00:05:25,470 --> 00:05:27,300
Associated with
these springs there

123
00:05:27,300 --> 00:05:29,890
is something called
a spring constant--

124
00:05:29,890 --> 00:05:32,100
usually represented
by the number k--

125
00:05:32,100 --> 00:05:34,650
that determines
how much force do

126
00:05:34,650 --> 00:05:38,100
you need to stretch or
compress the spring.

127
00:05:38,100 --> 00:05:42,780
Now, it turns out that that
spring constant can vary a lot.

128
00:05:42,780 --> 00:05:45,420
The slinky actually has a
very low spring constant.

129
00:05:45,420 --> 00:05:47,980
It's one newton per meter.

130
00:05:47,980 --> 00:05:51,060
That spring on the
suspension of a motorcycle

131
00:05:51,060 --> 00:05:52,800
has a much bigger
spring constant.

132
00:05:52,800 --> 00:05:56,250
It's a lot stiffer,
35,000 newtons per meter.

133
00:05:56,250 --> 00:05:57,750
And just in case
you don't remember,

134
00:05:57,750 --> 00:05:59,790
a newton is the
amount of force you

135
00:05:59,790 --> 00:06:02,970
need to accelerate a
one-kilogram mass one

136
00:06:02,970 --> 00:06:05,200
meter per second squared.

137
00:06:05,200 --> 00:06:06,740
We'll come back to
that in a second.

138
00:06:06,740 --> 00:06:09,250
But the idea is we'd like
to think about how do we

139
00:06:09,250 --> 00:06:10,570
model these kinds of springs.

140
00:06:13,120 --> 00:06:15,240
Well, turns out,
fortunately for us,

141
00:06:15,240 --> 00:06:18,270
that that was done
about 300-plus years ago

142
00:06:18,270 --> 00:06:20,680
by a British physicist
named Robert Hooke.

143
00:06:20,680 --> 00:06:26,950
Back in 1676 he formulated
Hooke's law of elasticity.

144
00:06:26,950 --> 00:06:28,990
Simple expression that
says the force you

145
00:06:28,990 --> 00:06:33,490
need to compress
or stretch a spring

146
00:06:33,490 --> 00:06:37,034
is linearly related
to the distance, d,

147
00:06:37,034 --> 00:06:38,950
that you've actually
done that compression in,

148
00:06:38,950 --> 00:06:40,824
or another way of saying
it is, if I compress

149
00:06:40,824 --> 00:06:44,710
a spring some amount, the
force that's stored in it

150
00:06:44,710 --> 00:06:47,110
is linearly related
to that distance.

151
00:06:47,110 --> 00:06:49,630
And the negative sign
here basically says

152
00:06:49,630 --> 00:06:52,070
it's pointing in the
opposite direction.

153
00:06:52,070 --> 00:06:55,110
So if I compress, the force
is going to push it back out.

154
00:06:55,110 --> 00:06:57,310
If I stretch it, the force
is going to push back

155
00:06:57,310 --> 00:06:58,490
into that resting position.

156
00:07:01,530 --> 00:07:06,300
Now, this law holds for
a wide range of springs,

157
00:07:06,300 --> 00:07:08,640
which is kind of nice.

158
00:07:08,640 --> 00:07:11,910
It's going to hold both
in biological systems

159
00:07:11,910 --> 00:07:14,280
as well as in physical systems.

160
00:07:14,280 --> 00:07:16,530
It doesn't hold perfectly.

161
00:07:16,530 --> 00:07:18,750
There's a limit to how
much you can stretch,

162
00:07:18,750 --> 00:07:22,105
in particular, a spring
before the law breaks down,

163
00:07:22,105 --> 00:07:23,730
and maybe you did
this as a kid, right.

164
00:07:23,730 --> 00:07:26,250
If you take a slinky and
pull it too far apart,

165
00:07:26,250 --> 00:07:28,290
it stops working
because you've exceeded

166
00:07:28,290 --> 00:07:31,050
what's called the elastic
limit of the spring.

167
00:07:31,050 --> 00:07:32,722
Similarly, if you
compress it too far,

168
00:07:32,722 --> 00:07:34,930
although I think you have
to compress it a long ways,

169
00:07:34,930 --> 00:07:38,100
it'll stop working as well.

170
00:07:38,100 --> 00:07:41,890
So it doesn't hold
completely, and it also

171
00:07:41,890 --> 00:07:44,170
doesn't hold for all springs.

172
00:07:44,170 --> 00:07:46,210
Only those springs that
satisfy this linear law,

173
00:07:46,210 --> 00:07:47,380
which are a lot of them.

174
00:07:47,380 --> 00:07:50,260
So, for example, it doesn't
apply to rubber bands,

175
00:07:50,260 --> 00:07:52,240
it doesn't apply
to recurved bows.

176
00:07:52,240 --> 00:07:55,150
Those are two examples
of springs that do not

177
00:07:55,150 --> 00:07:57,760
obey this linear relationship.

178
00:07:57,760 --> 00:08:00,625
But nonetheless,
there's Hooke's law.

179
00:08:00,625 --> 00:08:02,500
And one of the things
we can do is say, well,

180
00:08:02,500 --> 00:08:05,290
let's use it to do a little bit
of reasoning about this spring.

181
00:08:05,290 --> 00:08:07,510
So we can ask the
question, how much

182
00:08:07,510 --> 00:08:11,390
does a rider have to weigh to
compress this spring by one

183
00:08:11,390 --> 00:08:12,924
centimeter?

184
00:08:12,924 --> 00:08:14,840
And we've got Hooke's
law, and I also gave you

185
00:08:14,840 --> 00:08:15,923
a little bit of hint here.

186
00:08:15,923 --> 00:08:19,280
So I told you that this
spring has a spring constant

187
00:08:19,280 --> 00:08:22,380
of 35,000 newtons per meter.

188
00:08:22,380 --> 00:08:24,620
So I could just plug this
in, right, one centimeter,

189
00:08:24,620 --> 00:08:27,620
it's 1/100 of a meter times--

190
00:08:27,620 --> 00:08:30,170
so that's the-- there's
the spring constant.

191
00:08:30,170 --> 00:08:32,830
There's the amount we're
going to compress it.

192
00:08:32,830 --> 00:08:35,299
Do a little math, and that
says that the force I need

193
00:08:35,299 --> 00:08:38,250
is 350 newtons.

194
00:08:38,250 --> 00:08:39,584
So what's a newton?

195
00:08:39,584 --> 00:08:43,530
A small town in Massachusetts,
an interesting cookie,

196
00:08:43,530 --> 00:08:45,645
and a force that we
want to think about.

197
00:08:45,645 --> 00:08:49,090
I keep telling you guys,
the jokes are really bad.

198
00:08:49,090 --> 00:08:50,887
So how do I get force?

199
00:08:50,887 --> 00:08:51,720
Well, you know that.

200
00:08:51,720 --> 00:08:55,200
Mass times acceleration,
right, F equals ma.

201
00:08:55,200 --> 00:08:57,780
For acceleration here, I'm going
to make an assumption, which

202
00:08:57,780 --> 00:09:01,530
is that the spring is basically
oriented perpendicular

203
00:09:01,530 --> 00:09:03,980
to the earth, so that
the acceleration is just

204
00:09:03,980 --> 00:09:07,020
the acceleration of
gravity, which is roughly

205
00:09:07,020 --> 00:09:09,130
9.8 meters per second squared.

206
00:09:09,130 --> 00:09:11,450
It's basically pulling it down.

207
00:09:11,450 --> 00:09:15,260
So I could plug that
back in because remember

208
00:09:15,260 --> 00:09:17,840
what I want to do is figure
out what's the mass I need.

209
00:09:17,840 --> 00:09:19,880
So for the force, I'm
substituting that in.

210
00:09:19,880 --> 00:09:23,240
I've got that expression,
mass times 9.8 meters divided

211
00:09:23,240 --> 00:09:27,560
by seconds squared
is 350 newtons,

212
00:09:27,560 --> 00:09:34,010
divide through by 9.8 both
sides, do a little bit of math.

213
00:09:34,010 --> 00:09:38,220
And it says that the mass I
need is 350 kilograms divided

214
00:09:38,220 --> 00:09:39,180
by 9.8.

215
00:09:39,180 --> 00:09:43,920
And that k refers to kilograms,
not to the spring constant.

216
00:09:43,920 --> 00:09:46,650
Poor choice of example,
but there I am.

217
00:09:46,650 --> 00:09:49,500
And if I do the math, it
says I need a rider that

218
00:09:49,500 --> 00:09:52,509
weighs 35.68 kilos.

219
00:09:52,509 --> 00:09:54,300
And if you're not big
on the metric system,

220
00:09:54,300 --> 00:09:55,758
it's actually a
fairly light rider.

221
00:09:55,758 --> 00:09:57,250
That's about 79 pounds.

222
00:09:57,250 --> 00:10:01,050
So a 79-pound rider would
compress that spring one

223
00:10:01,050 --> 00:10:03,927
centimeter.

224
00:10:03,927 --> 00:10:05,760
So we can figure out
how to use Hooke's law.

225
00:10:05,760 --> 00:10:07,968
We're thinking about what
we want to do with springs.

226
00:10:07,968 --> 00:10:10,030
That's kind of nice.

227
00:10:10,030 --> 00:10:14,030
How will we actually
get the spring constant?

228
00:10:14,030 --> 00:10:16,580
It's really valuable to know
what the spring constant is.

229
00:10:16,580 --> 00:10:18,680
And just to give
you a sense of that,

230
00:10:18,680 --> 00:10:21,850
it's not just to deal
with things like slinkies.

231
00:10:21,850 --> 00:10:24,790
Atomic force microscopes,
need to know the spring

232
00:10:24,790 --> 00:10:26,350
constants of the
components in order

233
00:10:26,350 --> 00:10:28,630
to calibrate them properly.

234
00:10:28,630 --> 00:10:31,810
The force you need to
deform a strand of DNA

235
00:10:31,810 --> 00:10:34,690
is directly related to
the spring constants

236
00:10:34,690 --> 00:10:36,970
of the biological
structures themselves.

237
00:10:36,970 --> 00:10:41,920
So I'd really like to figure
out how do I get them.

238
00:10:41,920 --> 00:10:45,100
How many of you have done this
experiment in physics and hated

239
00:10:45,100 --> 00:10:46,191
it?

240
00:10:46,191 --> 00:10:46,690
Right.

241
00:10:46,690 --> 00:10:47,700
Well, I don't know if
you hated it or not,

242
00:10:47,700 --> 00:10:48,840
but you've done it, right?

243
00:10:48,840 --> 00:10:51,160
Standard way to do it
is I'd take a spring,

244
00:10:51,160 --> 00:10:52,860
I suspend it from some point.

245
00:10:52,860 --> 00:10:55,110
Let it come to a
resting position.

246
00:10:55,110 --> 00:10:58,440
And then I put a mass on
the bottom of the spring.

247
00:10:58,440 --> 00:10:59,580
It kind of bounces around.

248
00:10:59,580 --> 00:11:02,250
And when it settles,
I measure the distance

249
00:11:02,250 --> 00:11:04,170
from where it was
before I put the mass

250
00:11:04,170 --> 00:11:08,780
on to the distance of where it
is after I've added the mass.

251
00:11:08,780 --> 00:11:10,320
I measure that distance.

252
00:11:10,320 --> 00:11:12,250
And then I just plug in.

253
00:11:13,919 --> 00:11:15,210
I plug into that formula there.

254
00:11:15,210 --> 00:11:17,550
The force is minus k times d.

255
00:11:17,550 --> 00:11:21,360
So k the spring constant is the
force, forget the minus sign,

256
00:11:21,360 --> 00:11:23,610
divided by the distance,
and the force here

257
00:11:23,610 --> 00:11:27,600
would be 9.8 meters per
second squared or-- kilograms

258
00:11:27,600 --> 00:11:30,360
per second squared times
the mass divided by d.

259
00:11:30,360 --> 00:11:32,876
So I could just plug it in.

260
00:11:32,876 --> 00:11:38,440
In an ideal world, I'd plug it
in, I'm done, one measurement.

261
00:11:38,440 --> 00:11:40,570
Not so much, right.

262
00:11:40,570 --> 00:11:42,610
Masses aren't always
perfectly calibrated.

263
00:11:42,610 --> 00:11:47,080
Maybe the spring has got
not perfect materials in it.

264
00:11:47,080 --> 00:11:49,960
So ideally I'd actually
do multiple trials.

265
00:11:49,960 --> 00:11:53,230
I would take different weights,
put them on the spring,

266
00:11:53,230 --> 00:11:56,130
make the measurements,
and just record those.

267
00:11:56,130 --> 00:11:58,630
So that's what I'm going to do,
and I've actually done that.

268
00:11:58,630 --> 00:12:00,640
I'm not going to make you do it.

269
00:12:00,640 --> 00:12:05,540
But I get out a set
of measurements.

270
00:12:05,540 --> 00:12:07,010
What have I done here?

271
00:12:07,010 --> 00:12:11,540
I've used different masses,
all increasing by now 0.05

272
00:12:11,540 --> 00:12:14,000
kilograms, and I've
measured the distance

273
00:12:14,000 --> 00:12:17,510
that the spring has deformed.

274
00:12:17,510 --> 00:12:19,070
And ideally, these
would all have

275
00:12:19,070 --> 00:12:21,770
that nice linear relationship,
so I could just plug them in

276
00:12:21,770 --> 00:12:24,320
and I could figure out what
the spring constant is.

277
00:12:26,630 --> 00:12:30,412
So let's take this
data and let's plot it.

278
00:12:30,412 --> 00:12:31,870
And by the way,
all the code you'll

279
00:12:31,870 --> 00:12:33,510
be able to see when
you download the file,

280
00:12:33,510 --> 00:12:35,384
I'm going to walk through
some of it quickly.

281
00:12:35,384 --> 00:12:37,807
This is a simple
way to deal with it,

282
00:12:37,807 --> 00:12:39,390
and I'm going to
back up for a second.

283
00:12:39,390 --> 00:12:42,820
There's my data, and I
actually have done this

284
00:12:42,820 --> 00:12:44,170
in some ways the wrong order.

285
00:12:44,170 --> 00:12:49,240
These are my independent
measures, different masses.

286
00:12:49,240 --> 00:12:52,340
I'm going to plot
those along the x-axis,

287
00:12:52,340 --> 00:12:54,200
the horizontal axis.

288
00:12:54,200 --> 00:12:55,582
These are the dependent things.

289
00:12:55,582 --> 00:12:57,040
These are the things
I'm measuring.

290
00:12:57,040 --> 00:12:59,264
I'm going to plot
those along the y-axis.

291
00:12:59,264 --> 00:13:01,430
So I really should have put
them in the other order.

292
00:13:01,430 --> 00:13:03,610
So just cross your eyes and
make this column go over

293
00:13:03,610 --> 00:13:06,720
to that column, and
we'll be in good shape.

294
00:13:06,720 --> 00:13:08,370
Let's plot this.

295
00:13:08,370 --> 00:13:09,420
So here's a little file.

296
00:13:09,420 --> 00:13:11,520
Having stored those
away in a file,

297
00:13:11,520 --> 00:13:13,270
I'm just going to read
them in, get data.

298
00:13:13,270 --> 00:13:15,720
Just going to do the obvious
thing of read in these things

299
00:13:15,720 --> 00:13:19,920
and return two tuples or
lists, one for the x values--

300
00:13:19,920 --> 00:13:22,200
or if you like, again
going back to it,

301
00:13:22,200 --> 00:13:27,571
this set of values, and
one for the y values.

302
00:13:27,571 --> 00:13:29,070
Now I'm going to
play a little trick

303
00:13:29,070 --> 00:13:32,130
that you may have seen before
that's going to be handy to me.

304
00:13:32,130 --> 00:13:34,170
I'm going to actually
call this function out

305
00:13:34,170 --> 00:13:36,930
of the PyLab library
called array.

306
00:13:36,930 --> 00:13:39,000
I pass in that
tuple, and what it

307
00:13:39,000 --> 00:13:41,130
does is it converts it
into an array, which

308
00:13:41,130 --> 00:13:45,480
is a data structure that has
a fixed number of slots in it

309
00:13:45,480 --> 00:13:48,000
but has a really nice property
I want to take advantage of.

310
00:13:48,000 --> 00:13:49,260
I could do all of
this with lists.

311
00:13:49,260 --> 00:13:50,730
But by converting
that into array

312
00:13:50,730 --> 00:13:53,880
and then giving it the same
name xVals and similarly

313
00:13:53,880 --> 00:13:57,570
for the yVals, I can
now do math on the array

314
00:13:57,570 --> 00:13:59,920
without having to write loops.

315
00:13:59,920 --> 00:14:03,010
And in particular right
here, notice what I'm doing.

316
00:14:03,010 --> 00:14:05,490
I'm taking xVals, which is
an array, multiplying it

317
00:14:05,490 --> 00:14:07,170
by a number.

318
00:14:07,170 --> 00:14:10,890
And what that does is it takes
every entry in the array,

319
00:14:10,890 --> 00:14:13,320
multiplies that
entry, and puts it

320
00:14:13,320 --> 00:14:16,410
into basically a new version
of the array, which I then

321
00:14:16,410 --> 00:14:19,110
store into xVals.

322
00:14:19,110 --> 00:14:20,670
If you've programmed
in Matlab, this

323
00:14:20,670 --> 00:14:22,140
is the same kind
of feeling, right.

324
00:14:22,140 --> 00:14:23,940
I can take an array,
do something to it,

325
00:14:23,940 --> 00:14:24,898
and that's really nice.

326
00:14:24,898 --> 00:14:27,450
So I'm going to scale
all of my values,

327
00:14:27,450 --> 00:14:31,520
and then I'm going to plot them
out some appropriate things.

328
00:14:31,520 --> 00:14:33,390
And if I do it, I get that.

329
00:14:38,170 --> 00:14:43,310
I thought we said Hooke's law
was a linear relationship.

330
00:14:43,310 --> 00:14:45,410
So in an ideal world,
all of these points

331
00:14:45,410 --> 00:14:49,040
ought to lay along
a line somewhere,

332
00:14:49,040 --> 00:14:51,050
where the slope
of the line would

333
00:14:51,050 --> 00:14:54,650
tell me the spring constant.

334
00:14:54,650 --> 00:14:56,252
Not so good, right.

335
00:14:56,252 --> 00:14:58,210
And in fact, if you look
at it, you can kind of

336
00:14:58,210 --> 00:15:01,810
see-- in here you can kind of
imagine there's a line there,

337
00:15:01,810 --> 00:15:03,700
something funky is
going on up here.

338
00:15:03,700 --> 00:15:06,400
And we're going to come back to
that at the end of the lecture.

339
00:15:06,400 --> 00:15:11,432
But how do we think about
actually finding the line?

340
00:15:11,432 --> 00:15:13,390
Well, we know there's
noise in the measurement,

341
00:15:13,390 --> 00:15:15,290
so our best thing to
do is to say, well,

342
00:15:15,290 --> 00:15:19,000
could we just fit a
line to this data?

343
00:15:19,000 --> 00:15:20,505
And how would we do that?

344
00:15:20,505 --> 00:15:22,630
And that's the first big
thing we want to do today.

345
00:15:22,630 --> 00:15:24,329
We want to try and
figure out, given

346
00:15:24,329 --> 00:15:25,870
that we've got
measurement noise, how

347
00:15:25,870 --> 00:15:29,240
do we fit a line to it.

348
00:15:29,240 --> 00:15:32,880
So how do we fit
a curve to data?

349
00:15:32,880 --> 00:15:35,120
Well, what we're basically
going to try and do

350
00:15:35,120 --> 00:15:38,900
is find a way to relate an
independent variable, which

351
00:15:38,900 --> 00:15:43,500
were the masses, the y values,
to the dependent-- sorry,

352
00:15:43,500 --> 00:15:44,000
wrong way.

353
00:15:44,000 --> 00:15:46,340
The independent values,
which are the x-axis,

354
00:15:46,340 --> 00:15:48,950
to the dependent value, what is
the actual displacement we're

355
00:15:48,950 --> 00:15:49,580
going to see?

356
00:15:49,580 --> 00:15:52,740
So another way of saying
it is if I go back to here,

357
00:15:52,740 --> 00:15:56,090
I want to know for
every point along here,

358
00:15:56,090 --> 00:16:00,860
how do I fit something that
predicts what the y value is?

359
00:16:00,860 --> 00:16:04,290
So I need to figure
out how to do that fit.

360
00:16:04,290 --> 00:16:07,440
To decide-- even if I had
a curve, a line that I

361
00:16:07,440 --> 00:16:09,540
thought was a good
fit to that, I

362
00:16:09,540 --> 00:16:11,852
need to decide how good it is.

363
00:16:11,852 --> 00:16:13,560
So imagine I was lucky
and somebody said,

364
00:16:13,560 --> 00:16:15,900
here's a line that
I think describes

365
00:16:15,900 --> 00:16:17,350
Hooke's law in this case.

366
00:16:17,350 --> 00:16:18,570
Great.

367
00:16:18,570 --> 00:16:20,490
I could draw the
line on that data.

368
00:16:20,490 --> 00:16:22,800
I could draw it on this
chunk of data here.

369
00:16:22,800 --> 00:16:26,440
I still need to decide how
do I know if it's a good fit.

370
00:16:26,440 --> 00:16:31,020
And for that, we need something
we call an objective function,

371
00:16:31,020 --> 00:16:33,660
and it's going to
measure how close is

372
00:16:33,660 --> 00:16:36,420
the line to the data to
which I'm trying to fit it.

373
00:16:40,190 --> 00:16:44,360
Once we've defined the objective
function, then what we say

374
00:16:44,360 --> 00:16:48,140
is, OK, now let's find the
line that minimizes it,

375
00:16:48,140 --> 00:16:50,755
the best possible line, the
line that makes that objective

376
00:16:50,755 --> 00:16:52,880
function as small as
possible, because that's going

377
00:16:52,880 --> 00:16:55,620
to be the best fit to the data.

378
00:16:55,620 --> 00:16:57,897
And so that's what
I'd like to do.

379
00:16:57,897 --> 00:16:58,730
We're going to see--

380
00:16:58,730 --> 00:17:00,396
we're going to do it
for general curves,

381
00:17:00,396 --> 00:17:02,180
but we're going to
start just with lines,

382
00:17:02,180 --> 00:17:03,080
with linear function.

383
00:17:03,080 --> 00:17:04,663
So in this case, we
want to say what's

384
00:17:04,663 --> 00:17:07,700
the line such that some
function of the sum

385
00:17:07,700 --> 00:17:10,670
of the distances from the
line to the measured points

386
00:17:10,670 --> 00:17:11,525
is minimized.

387
00:17:11,525 --> 00:17:13,400
And I'm going to come
back in a second to how

388
00:17:13,400 --> 00:17:14,250
do we find the line.

389
00:17:14,250 --> 00:17:15,619
But first we've got to
think about what does it

390
00:17:15,619 --> 00:17:16,859
mean to measure it.

391
00:17:19,390 --> 00:17:21,280
So I've got a point.

392
00:17:21,280 --> 00:17:22,900
Imagine I got a
line that I think

393
00:17:22,900 --> 00:17:25,960
is a good match for the
thing fitting the data.

394
00:17:25,960 --> 00:17:28,820
How do I measure distance?

395
00:17:28,820 --> 00:17:30,880
Well, there's one option.

396
00:17:30,880 --> 00:17:35,340
I could measure just the
displacement along the x-axis.

397
00:17:35,340 --> 00:17:36,790
There's a second option.

398
00:17:36,790 --> 00:17:40,230
I could measure the
displacement vertically.

399
00:17:40,230 --> 00:17:43,800
Or a third option is I could
actually measure the distance

400
00:17:43,800 --> 00:17:46,589
to the closest point
on the line, which

401
00:17:46,589 --> 00:17:48,380
would be that perpendicular
distance there.

402
00:17:50,631 --> 00:17:52,630
You're way too quiet,
which is always dangerous.

403
00:17:52,630 --> 00:17:53,610
What do you think?

404
00:17:53,610 --> 00:17:55,401
I'm going to look for
a show of hands here.

405
00:17:55,401 --> 00:17:57,610
How many people think we
should use x as the thing

406
00:17:57,610 --> 00:18:00,000
that we measure here?

407
00:18:00,000 --> 00:18:00,500
Hands up.

408
00:18:00,500 --> 00:18:02,810
Please don't use a single finger
when you put your hand up.

409
00:18:02,810 --> 00:18:03,309
All right.

410
00:18:03,309 --> 00:18:04,190
Good.

411
00:18:04,190 --> 00:18:06,530
How many people think we
should use p, the perpendicular

412
00:18:06,530 --> 00:18:09,216
distance?

413
00:18:09,216 --> 00:18:10,340
Reasonable number of hands.

414
00:18:10,340 --> 00:18:12,670
And how about y?

415
00:18:12,670 --> 00:18:15,880
And I see actually about
split between p and y.

416
00:18:15,880 --> 00:18:18,410
And that's actually really good.

417
00:18:18,410 --> 00:18:20,460
X doesn't make a
lot of sense, right,

418
00:18:20,460 --> 00:18:23,570
because I know that my
values along the x-axis

419
00:18:23,570 --> 00:18:24,980
are independent measurements.

420
00:18:24,980 --> 00:18:26,540
So the displacement
in that direction

421
00:18:26,540 --> 00:18:29,060
doesn't make a lot of sense.

422
00:18:29,060 --> 00:18:32,540
P makes a lot of sense,
but unfortunately isn't

423
00:18:32,540 --> 00:18:33,980
what I want.

424
00:18:33,980 --> 00:18:35,420
We're going to
see examples later

425
00:18:35,420 --> 00:18:37,190
on where, in fact,
minimizing things

426
00:18:37,190 --> 00:18:39,710
where you minimize that distance
is the right thing to do.

427
00:18:39,710 --> 00:18:41,390
When we do machine
learning, that

428
00:18:41,390 --> 00:18:44,750
is how you find what's called
a classifier or a separator.

429
00:18:44,750 --> 00:18:47,990
But actually here
we're going to pick y,

430
00:18:47,990 --> 00:18:50,660
and the reason is important.

431
00:18:50,660 --> 00:18:53,360
I'm trying to predict the
dependent value, which

432
00:18:53,360 --> 00:18:57,750
is the y value, given an
independent new x value.

433
00:18:57,750 --> 00:19:00,770
And so the displacement,
the uncertainty

434
00:19:00,770 --> 00:19:03,620
is, in fact, the
vertical displacement.

435
00:19:03,620 --> 00:19:04,954
And so I'm going to use y.

436
00:19:04,954 --> 00:19:06,620
That displacement is
the thing I'm going

437
00:19:06,620 --> 00:19:08,310
to measure as the distance.

438
00:19:12,480 --> 00:19:13,452
How do I find this?

439
00:19:13,452 --> 00:19:14,910
I need an objective
function that's

440
00:19:14,910 --> 00:19:18,716
going to tell me what is
the closeness of the fit.

441
00:19:18,716 --> 00:19:20,090
So here's how I'm
going to do it.

442
00:19:20,090 --> 00:19:23,490
I'm going to have some
set of observed values.

443
00:19:23,490 --> 00:19:25,590
Think of it as an array.

444
00:19:25,590 --> 00:19:27,870
I've got some index into
them, so the indices

445
00:19:27,870 --> 00:19:29,130
are giving me the x values.

446
00:19:29,130 --> 00:19:31,960
And the observed values are the
things I've actually measured.

447
00:19:31,960 --> 00:19:33,460
If you want to think
of it this way,

448
00:19:33,460 --> 00:19:35,730
I'm going to go back to
this slide really quickly.

449
00:19:35,730 --> 00:19:37,770
The observed values
are the displacements

450
00:19:37,770 --> 00:19:39,220
or the values along the y-axis.

451
00:19:41,950 --> 00:19:42,790
Sorry about that.

452
00:19:45,900 --> 00:19:49,890
Let's assume that I have
some hypothesized line that I

453
00:19:49,890 --> 00:19:52,787
think fits this data,
y equals ax plus b.

454
00:19:52,787 --> 00:19:53,745
I know the a and the b.

455
00:19:53,745 --> 00:19:55,710
I've hypothesized it.

456
00:19:55,710 --> 00:20:00,460
Then predicted will basically
say given the x value,

457
00:20:00,460 --> 00:20:04,120
the line predicts here's
what the y value should be.

458
00:20:04,120 --> 00:20:07,380
And so I'm going to take the
difference between those two

459
00:20:07,380 --> 00:20:09,131
and square them.

460
00:20:09,131 --> 00:20:10,380
So the difference makes sense.

461
00:20:10,380 --> 00:20:12,720
It tells me how far away is
the observed value from what

462
00:20:12,720 --> 00:20:15,060
the line predicts it should be.

463
00:20:15,060 --> 00:20:15,974
Why am I squaring it?

464
00:20:15,974 --> 00:20:17,140
Well, there are two reasons.

465
00:20:17,140 --> 00:20:18,570
The first one is
that squaring is

466
00:20:18,570 --> 00:20:20,790
going to get rid of the sign.

467
00:20:20,790 --> 00:20:23,430
It shouldn't matter
if my observed value

468
00:20:23,430 --> 00:20:25,620
is some amount above
the predicted value

469
00:20:25,620 --> 00:20:27,420
or some amount below--
the same amount

470
00:20:27,420 --> 00:20:28,660
below the predicted value.

471
00:20:28,660 --> 00:20:31,140
The displacement in
direction shouldn't matter.

472
00:20:31,140 --> 00:20:33,090
It's how far away is it.

473
00:20:33,090 --> 00:20:36,640
Now, you could say, well, why
not just use absolute value?

474
00:20:36,640 --> 00:20:38,490
And the answer is
you could, but we're

475
00:20:38,490 --> 00:20:42,030
going to see in a couple of
slides that by using the square

476
00:20:42,030 --> 00:20:43,920
we get a really nice
property that helps

477
00:20:43,920 --> 00:20:46,770
us find the best fitting line.

478
00:20:46,770 --> 00:20:50,220
So my objective function
here basically says,

479
00:20:50,220 --> 00:20:52,180
given a bunch of
observed values,

480
00:20:52,180 --> 00:20:55,140
use the hypothesized line to
predict what the value should

481
00:20:55,140 --> 00:20:57,589
be, measure the difference
in the y direction--

482
00:20:57,589 --> 00:20:59,880
which is what I'm doing
because I'm measuring predicted

483
00:20:59,880 --> 00:21:01,530
and observed y values--

484
00:21:01,530 --> 00:21:02,970
square them, sum them all up.

485
00:21:02,970 --> 00:21:06,074
It's called least squares.

486
00:21:06,074 --> 00:21:07,740
That's going to give
me a measure of how

487
00:21:07,740 --> 00:21:09,120
close that line is to a fit.

488
00:21:09,120 --> 00:21:12,330
In a second, I'll get to
how you find the best line.

489
00:21:12,330 --> 00:21:17,200
But this hopefully
looks familiar.

490
00:21:17,200 --> 00:21:19,330
Anybody recognize this?

491
00:21:19,330 --> 00:21:21,774
You've seen it
earlier in this class.

492
00:21:21,774 --> 00:21:24,190
Boy, that's a terrible thing
to ask because you don't even

493
00:21:24,190 --> 00:21:26,315
remember the last thing
you did in this class other

494
00:21:26,315 --> 00:21:28,279
than the problem set.

495
00:21:28,279 --> 00:21:29,154
AUDIENCE: [INAUDIBLE]

496
00:21:29,154 --> 00:21:29,846
ERIC GRIMSON: Sorry?

497
00:21:29,846 --> 00:21:30,625
AUDIENCE: Variance.

498
00:21:30,625 --> 00:21:31,200
ERIC GRIMSON: Variance.

499
00:21:31,200 --> 00:21:31,740
Thank you.

500
00:21:31,740 --> 00:21:32,327
Absolutely.

501
00:21:32,327 --> 00:21:33,910
Sorry, I didn't bring
any candy today.

502
00:21:33,910 --> 00:21:34,980
That's Professor Guttag.

503
00:21:34,980 --> 00:21:36,840
I got a better arm than
he does, but I still

504
00:21:36,840 --> 00:21:39,180
didn't bring any candy today.

505
00:21:39,180 --> 00:21:40,830
Yeah, it's variance, not quite.

506
00:21:40,830 --> 00:21:41,960
It's almost variance.

507
00:21:41,960 --> 00:21:44,610
That's the variance times
the number of observations,

508
00:21:44,610 --> 00:21:46,830
or another way of saying
it is if I divided this

509
00:21:46,830 --> 00:21:49,500
by the number of observations,
that would be the variance.

510
00:21:49,500 --> 00:21:50,958
If I took the square
root, it would

511
00:21:50,958 --> 00:21:52,210
be the standard deviation.

512
00:21:52,210 --> 00:21:54,060
Why is that valuable?

513
00:21:54,060 --> 00:21:57,900
Because that tells you
something about how badly things

514
00:21:57,900 --> 00:21:59,670
are dispersed,
how much variation

515
00:21:59,670 --> 00:22:01,790
there is in this measurement.

516
00:22:01,790 --> 00:22:05,130
And so if it says, if I can
minimize this expression,

517
00:22:05,130 --> 00:22:06,930
that's great because
it not only will

518
00:22:06,930 --> 00:22:09,730
find what I hope
is the best fit,

519
00:22:09,730 --> 00:22:12,720
but it's going to minimize the
variance between what I predict

520
00:22:12,720 --> 00:22:15,570
and what I measure, which
makes intuitive sense.

521
00:22:15,570 --> 00:22:18,482
That's exactly the thing
I would like to minimize.

522
00:22:21,490 --> 00:22:23,620
This was built on
the assumption that I

523
00:22:23,620 --> 00:22:26,560
had a line that I
thought was a good fit,

524
00:22:26,560 --> 00:22:29,772
and this lets me measure
how good a fit I have.

525
00:22:29,772 --> 00:22:31,480
But I still have to
do a little bit more.

526
00:22:31,480 --> 00:22:33,190
I have to now figure
out, OK, how do

527
00:22:33,190 --> 00:22:36,780
I find the best-fitting line?

528
00:22:36,780 --> 00:22:40,380
And for that, we need to come up
with a minimization technique.

529
00:22:40,380 --> 00:22:42,780
So to minimize this
objective function,

530
00:22:42,780 --> 00:22:45,732
I want to find the curve
for the predicted values--

531
00:22:45,732 --> 00:22:46,440
this thing here--

532
00:22:46,440 --> 00:22:48,210
some way of representing
that that leads

533
00:22:48,210 --> 00:22:50,880
to the best possible solution.

534
00:22:50,880 --> 00:22:54,830
And I'm going to make
a simple assumption.

535
00:22:54,830 --> 00:22:57,680
I'm going to assume
that my model for this

536
00:22:57,680 --> 00:22:59,307
predicted curve--

537
00:22:59,307 --> 00:23:00,890
I've been using the
example of a line,

538
00:23:00,890 --> 00:23:02,450
but we're going to say curve--

539
00:23:02,450 --> 00:23:04,070
is a polynomial.

540
00:23:04,070 --> 00:23:05,690
It's a polynomial
and one variable.

541
00:23:05,690 --> 00:23:09,399
The one variable is what are
the x values of the samples.

542
00:23:09,399 --> 00:23:11,690
And I'm going to assume that
the curve is a polynomial.

543
00:23:11,690 --> 00:23:15,110
In the simplest case, it's a
line in case order, and two,

544
00:23:15,110 --> 00:23:16,756
it's going to be a parabola.

545
00:23:16,756 --> 00:23:19,130
And I'm going to use a technique
called linear regression

546
00:23:19,130 --> 00:23:23,990
to find the polynomial that best
fits the data, that minimizes

547
00:23:23,990 --> 00:23:25,130
that objective function.

548
00:23:27,329 --> 00:23:29,620
Quick aside, just to remind
you, I'm sure you remember,

549
00:23:29,620 --> 00:23:31,670
so polynomial--

550
00:23:31,670 --> 00:23:34,520
polynomials, either the value
is zero, which is really boring,

551
00:23:34,520 --> 00:23:38,180
or it is a finite
sum of non-zero terms

552
00:23:38,180 --> 00:23:42,120
that all have the form
c times x to the p.

553
00:23:42,120 --> 00:23:45,210
C is a constant, a real number.

554
00:23:45,210 --> 00:23:47,730
P is a power, a
non-negative integer.

555
00:23:47,730 --> 00:23:50,024
And this is basically-- x
is the free variable that's

556
00:23:50,024 --> 00:23:50,940
going to capture this.

557
00:23:50,940 --> 00:23:53,970
So easy way to say
it is a line would

558
00:23:53,970 --> 00:23:57,870
be represented as a degree
one polynomial ax plus b.

559
00:23:57,870 --> 00:24:00,550
A parabola is a
second-degree polynomial,

560
00:24:00,550 --> 00:24:02,070
ax squared plus bx plus c.

561
00:24:02,070 --> 00:24:05,214
And we can go up to
higher order terms.

562
00:24:05,214 --> 00:24:07,380
We're going to refer to the
degree of the polynomial

563
00:24:07,380 --> 00:24:10,690
as the largest degree of
any term in that polynomial.

564
00:24:10,690 --> 00:24:15,470
So again, degree one, linear
degree two, quadratic.

565
00:24:15,470 --> 00:24:17,200
Now how do I use that?

566
00:24:17,200 --> 00:24:18,940
Well, here's the basic idea.

567
00:24:18,940 --> 00:24:20,220
Let's take a simple example.

568
00:24:20,220 --> 00:24:22,970
Let's assume I'm still
just trying to fit a line.

569
00:24:22,970 --> 00:24:26,240
So my assumption is I
want to find a degree one

570
00:24:26,240 --> 00:24:30,590
polynomial, y equals ax plus
b, as our model of the day.

571
00:24:30,590 --> 00:24:34,420
That means for every sample,
I'm going to plug in x,

572
00:24:34,420 --> 00:24:37,572
and if I know a and b, it
gives me the predicted value.

573
00:24:37,572 --> 00:24:39,280
I've already seen
that's going to give me

574
00:24:39,280 --> 00:24:42,310
a good measure of the
closeness of the fit.

575
00:24:42,310 --> 00:24:44,805
And the question is,
how do I find a and b.

576
00:24:47,200 --> 00:24:50,510
My goal is find a
and b such that when

577
00:24:50,510 --> 00:24:53,780
we use this polynomial to
compute those y values,

578
00:24:53,780 --> 00:24:57,640
that sum squared
difference is minimized.

579
00:24:57,640 --> 00:25:00,430
So the sum squared difference
is my measure of fit.

580
00:25:00,430 --> 00:25:03,617
All I have to do
is find a and b.

581
00:25:03,617 --> 00:25:05,450
And that's where linear
regression comes in,

582
00:25:05,450 --> 00:25:09,480
and I want to just give you
a visualization of this.

583
00:25:09,480 --> 00:25:12,780
If a line is described
by ax plus b,

584
00:25:12,780 --> 00:25:15,630
then I can represent
every possible line

585
00:25:15,630 --> 00:25:17,980
in a two-dimensional space.

586
00:25:17,980 --> 00:25:21,450
One axis is possible
values for a.

587
00:25:21,450 --> 00:25:23,815
The other axis is
possible values for b.

588
00:25:23,815 --> 00:25:26,190
So if you think about it, I
take any point in that space.

589
00:25:26,190 --> 00:25:27,870
It gives me an a and a B value.

590
00:25:27,870 --> 00:25:30,370
That describes a line.

591
00:25:30,370 --> 00:25:32,460
Why should you care about that?

592
00:25:32,460 --> 00:25:35,460
Because I can put a
two-dimensional surface

593
00:25:35,460 --> 00:25:37,210
over that space.

594
00:25:37,210 --> 00:25:40,020
In other words, for every a
and b, that gives me a line,

595
00:25:40,020 --> 00:25:43,396
and I could, therefore,
compute this function,

596
00:25:43,396 --> 00:25:45,520
given the observed values
and the predicted values,

597
00:25:45,520 --> 00:25:46,978
and it would give
me a value, which

598
00:25:46,978 --> 00:25:50,602
is the height of the
surface in that space.

599
00:25:50,602 --> 00:25:52,310
If you're with me with
the visualization,

600
00:25:52,310 --> 00:25:53,180
why is that nice?

601
00:25:53,180 --> 00:25:57,050
Because linear regression
gives me a very easy way

602
00:25:57,050 --> 00:25:59,490
to find the lowest
point on that surface,

603
00:25:59,490 --> 00:26:01,760
which is exactly
the solution I want,

604
00:26:01,760 --> 00:26:03,820
because that's the
best fitting line.

605
00:26:03,820 --> 00:26:05,620
And it's called
linear regression

606
00:26:05,620 --> 00:26:07,570
not because we're
solving for a line,

607
00:26:07,570 --> 00:26:10,230
but because of how
you do that solution.

608
00:26:10,230 --> 00:26:11,560
If you think of this as being--

609
00:26:11,560 --> 00:26:13,690
take a marble on this
two-dimensional surface,

610
00:26:13,690 --> 00:26:15,250
you want to place
the marble on it,

611
00:26:15,250 --> 00:26:17,170
you want to let it
run down to the lowest

612
00:26:17,170 --> 00:26:19,580
point in the surface.

613
00:26:19,580 --> 00:26:22,510
And oh, yeah, I promised you
why do we use sum squares,

614
00:26:22,510 --> 00:26:24,340
because if we used the
sum of the squares,

615
00:26:24,340 --> 00:26:28,690
that surface always
has only one minimum.

616
00:26:28,690 --> 00:26:30,850
So it's not a really
funky, convoluted surface.

617
00:26:30,850 --> 00:26:32,600
It has exactly one minimum.

618
00:26:32,600 --> 00:26:34,750
It's called linear
regression because the way

619
00:26:34,750 --> 00:26:38,160
to find it is to start at
some point and walk downhill.

620
00:26:38,160 --> 00:26:41,650
I linearly regress or walk
downhill along the gradient

621
00:26:41,650 --> 00:26:43,510
some distance, measure
the new gradient,

622
00:26:43,510 --> 00:26:45,700
and do that until I
get down to the lowest

623
00:26:45,700 --> 00:26:49,870
point in the surface.

624
00:26:49,870 --> 00:26:51,990
Could you write code to do it?

625
00:26:51,990 --> 00:26:53,070
Sure.

626
00:26:53,070 --> 00:26:54,810
Are we going to
ask you to do it?

627
00:26:54,810 --> 00:26:57,554
No, because fortunately--

628
00:26:57,554 --> 00:26:59,220
I was hoping to get
a cheer out of that.

629
00:26:59,220 --> 00:26:59,670
Too bad.

630
00:26:59,670 --> 00:27:01,628
OK, maybe we will ask
you to do it on the exam.

631
00:27:01,628 --> 00:27:03,624
What the hell.

632
00:27:03,624 --> 00:27:04,290
You could do it.

633
00:27:04,290 --> 00:27:06,720
In fact, you've seen
a version of this.

634
00:27:06,720 --> 00:27:08,340
The typical algorithm
for doing it

635
00:27:08,340 --> 00:27:10,020
is very similar
to Newton's method

636
00:27:10,020 --> 00:27:13,140
that we used way back in
the beginning of 60001

637
00:27:13,140 --> 00:27:15,330
when we found square roots.

638
00:27:15,330 --> 00:27:17,200
You could write that
kind of a solution,

639
00:27:17,200 --> 00:27:19,140
but the good news is
that the nice people who

640
00:27:19,140 --> 00:27:21,480
wrote Python, or
particularly PyLab,

641
00:27:21,480 --> 00:27:24,160
have given you code to do it.

642
00:27:24,160 --> 00:27:25,980
And we're going to
take advantage of it.

643
00:27:25,980 --> 00:27:30,370
So in PyLab there is a built-in
function called polyFit.

644
00:27:30,370 --> 00:27:33,360
It takes a collection
of x values,

645
00:27:33,360 --> 00:27:34,984
takes a collection
of equal length

646
00:27:34,984 --> 00:27:36,900
of y values-- they need
to be the same length.

647
00:27:36,900 --> 00:27:38,630
I'm going to assume
they're arrays.

648
00:27:38,630 --> 00:27:43,140
And it takes an integer n,
which is the degree of fit,

649
00:27:43,140 --> 00:27:44,600
that I want to apply.

650
00:27:44,600 --> 00:27:46,770
And what polyFit
will do is it will

651
00:27:46,770 --> 00:27:50,550
find the coefficients of a
polynomial of that degree that

652
00:27:50,550 --> 00:27:54,010
provides the best
least squares fit.

653
00:27:54,010 --> 00:27:55,900
So think of it as
polyFit walking along

654
00:27:55,900 --> 00:27:59,644
that surface to find the best
a and b that will come back.

655
00:27:59,644 --> 00:28:01,310
So if I give it a
value of n equals one,

656
00:28:01,310 --> 00:28:03,768
it'll give me back the a and
b that gives me the best line.

657
00:28:03,768 --> 00:28:05,710
If I get a value
of n equal two, it

658
00:28:05,710 --> 00:28:07,960
gives me back a,
b, and c that would

659
00:28:07,960 --> 00:28:12,740
fit an ax squared plus bx plus
c parabola to best fit the data.

660
00:28:12,740 --> 00:28:14,864
And I could pick n to be
any non-negative integer,

661
00:28:14,864 --> 00:28:16,780
and it would actually
come up with a good fit.

662
00:28:20,130 --> 00:28:22,250
So let's use it.

663
00:28:22,250 --> 00:28:25,240
I'm going to write a little
function called fitData.

664
00:28:25,240 --> 00:28:27,470
The first part up here
just comes from plotData.

665
00:28:27,470 --> 00:28:28,690
It's exactly the same thing.

666
00:28:28,690 --> 00:28:29,980
I read in the data.

667
00:28:29,980 --> 00:28:31,330
I convert them into arrays.

668
00:28:31,330 --> 00:28:33,520
I convert this because I
want to get out the force.

669
00:28:33,520 --> 00:28:35,830
I go ahead and plot it.

670
00:28:35,830 --> 00:28:38,920
And then notice what I do,
I use polyFit right here

671
00:28:38,920 --> 00:28:43,000
to take the inputted x values
and y values and a degree one,

672
00:28:43,000 --> 00:28:45,690
and it's going to
give me back a tuple,

673
00:28:45,690 --> 00:28:48,310
an a and a b that are
the best fit line.

674
00:28:48,310 --> 00:28:51,740
Finds that point in the
space that best fits it.

675
00:28:51,740 --> 00:28:55,580
Once I've got that, I could go
ahead and actually compute now

676
00:28:55,580 --> 00:28:58,250
what are the estimated
or predicted values.

677
00:28:58,250 --> 00:28:59,870
The line's going
to tell me what I

678
00:28:59,870 --> 00:29:01,332
should have seen
as those values,

679
00:29:01,332 --> 00:29:02,790
and I'm going to
do the same thing.

680
00:29:02,790 --> 00:29:04,873
I'm going to take x values,
convert it into array,

681
00:29:04,873 --> 00:29:08,330
multiply it by a, which says
every entry in the array

682
00:29:08,330 --> 00:29:09,290
is scaled by a.

683
00:29:09,290 --> 00:29:11,450
Add b to every entry.

684
00:29:11,450 --> 00:29:15,320
So I'm just computing ax
plus b for all possible x's.

685
00:29:15,320 --> 00:29:18,110
And that then gives me an
estimated set of y values,

686
00:29:18,110 --> 00:29:21,200
and I can plot those out.

687
00:29:21,200 --> 00:29:22,540
I'm cheating here.

688
00:29:22,540 --> 00:29:23,350
Sorry.

689
00:29:23,350 --> 00:29:24,310
I'm misdirecting you.

690
00:29:24,310 --> 00:29:26,200
I never cheat.

691
00:29:26,200 --> 00:29:29,260
I actually don't need to do
the conversion to an array

692
00:29:29,260 --> 00:29:31,580
there because I did it up here.

693
00:29:31,580 --> 00:29:33,700
But because I've borrowed
this from plot lab,

694
00:29:33,700 --> 00:29:35,560
I wanted to show you
that I can redundantly

695
00:29:35,560 --> 00:29:37,643
do it here to remind you
that I want to convert it

696
00:29:37,643 --> 00:29:41,740
into array to make sure I can
do that kind of algebra on it.

697
00:29:41,740 --> 00:29:44,410
The last thing I could
do is say even if I can--

698
00:29:44,410 --> 00:29:47,270
once I show you the
fit of this line,

699
00:29:47,270 --> 00:29:49,840
I also want to get out
the spring constant.

700
00:29:49,840 --> 00:29:53,980
Now, the slope of this
line is difference in force

701
00:29:53,980 --> 00:29:56,140
over difference in distance.

702
00:29:56,140 --> 00:29:58,150
The spring constant
is the opposite of it.

703
00:29:58,150 --> 00:30:00,970
So I could simply take the
slope of the line, which

704
00:30:00,970 --> 00:30:04,600
is a, invert it, and that
gives me the spring constant.

705
00:30:06,890 --> 00:30:09,840
So let's see what happens
if we actually run this.

706
00:30:09,840 --> 00:30:11,300
So I'm going to go
over to my code,

707
00:30:11,300 --> 00:30:12,549
hoping that it works properly.

708
00:30:14,906 --> 00:30:16,520
Here's my Python.

709
00:30:16,520 --> 00:30:17,570
I've loaded this in.

710
00:30:17,570 --> 00:30:20,302
I'm going to run it.

711
00:30:20,302 --> 00:30:22,570
And there you go.

712
00:30:22,570 --> 00:30:25,090
Fits a line, and it
prints out the value

713
00:30:25,090 --> 00:30:29,530
of a, which is about
0.46, and the value of b.

714
00:30:29,530 --> 00:30:38,080
And if I go back
and look at this,

715
00:30:38,080 --> 00:30:41,150
there we go, spring constant
is about 21 and a half,

716
00:30:41,150 --> 00:30:43,330
which is about the
reciprocal of 0.046

717
00:30:43,330 --> 00:30:45,100
if you can figure that out.

718
00:30:45,100 --> 00:30:47,740
And you can see,
it's not a bad fit

719
00:30:47,740 --> 00:30:49,919
to a line through that data.

720
00:30:49,919 --> 00:30:52,210
Again, there's still something
funky going on over here

721
00:30:52,210 --> 00:30:53,585
that we're going
to come back to.

722
00:30:53,585 --> 00:30:56,350
But it's a pretty
good fit to the data.

723
00:30:56,350 --> 00:30:57,760
Great.

724
00:30:57,760 --> 00:31:00,612
So now I've got a fit.

725
00:31:00,612 --> 00:31:02,320
I'm going to show you
a variation of this

726
00:31:02,320 --> 00:31:03,819
that we're going
to use in a second.

727
00:31:03,819 --> 00:31:06,339
I could do the same thing, but
after I've done polyFit here,

728
00:31:06,339 --> 00:31:08,380
I'm going to use another
built-in function called

729
00:31:08,380 --> 00:31:10,097
polyval.

730
00:31:10,097 --> 00:31:11,680
It's going to take
a polynomial, which

731
00:31:11,680 --> 00:31:14,462
is captured by that model of
the thing that I returned,

732
00:31:14,462 --> 00:31:16,420
and I'm going to show
you the difference again.

733
00:31:16,420 --> 00:31:19,580
Back sure we returned
this as a tuple.

734
00:31:19,580 --> 00:31:21,200
Since it's coming
back as a tuple,

735
00:31:21,200 --> 00:31:23,320
I can give it a name model.

736
00:31:23,320 --> 00:31:26,714
Polyval will take that
tuple plus the x values

737
00:31:26,714 --> 00:31:27,630
and do the same thing.

738
00:31:27,630 --> 00:31:30,750
It will give me back an
array of predicted values.

739
00:31:30,750 --> 00:31:34,030
But the nice thing here is that
this model could be a line.

740
00:31:34,030 --> 00:31:35,610
It could be a parabola.

741
00:31:35,610 --> 00:31:36,704
It could be a quartic.

742
00:31:36,704 --> 00:31:37,620
It could be a quintic.

743
00:31:37,620 --> 00:31:41,704
It could be any
order polynomial.

744
00:31:41,704 --> 00:31:43,370
If you like the
abstraction here-- which

745
00:31:43,370 --> 00:31:44,828
we're going to see
in a little bit,

746
00:31:44,828 --> 00:31:47,930
that it allows me
to use the same code

747
00:31:47,930 --> 00:31:50,260
for different orders of model.

748
00:31:50,260 --> 00:31:52,510
And if I ran this, it would
do exactly the same thing.

749
00:31:54,790 --> 00:31:57,130
I'm going to come back to
thinking about what's going

750
00:31:57,130 --> 00:31:59,150
on in that spring in a second.

751
00:31:59,150 --> 00:32:01,130
But I want to show
you another example.

752
00:32:01,130 --> 00:32:03,101
So here's another set of data.

753
00:32:03,101 --> 00:32:04,600
In a little bit,
I'll show you where

754
00:32:04,600 --> 00:32:06,320
that mystery data came from.

755
00:32:06,320 --> 00:32:09,576
But here's another set of
data that I've plotted out.

756
00:32:09,576 --> 00:32:10,700
I could run the same thing.

757
00:32:10,700 --> 00:32:14,100
I could run exactly the same
code and fit a line to it.

758
00:32:14,100 --> 00:32:16,160
And if I do it, I get that.

759
00:32:19,020 --> 00:32:20,080
What do you think?

760
00:32:20,080 --> 00:32:22,190
Good fit?

761
00:32:22,190 --> 00:32:25,581
Show of hands, how many people
like this fit to the data?

762
00:32:25,581 --> 00:32:27,080
Show of hands, how
many people don't

763
00:32:27,080 --> 00:32:28,917
like this fit to the data?

764
00:32:28,917 --> 00:32:30,500
Show of hands, how
many hope that I'll

765
00:32:30,500 --> 00:32:31,661
stop asking you questions?

766
00:32:31,661 --> 00:32:32,660
Don't put your hands up.

767
00:32:32,660 --> 00:32:33,320
Yeah, thank you.

768
00:32:33,320 --> 00:32:33,819
I know.

769
00:32:33,819 --> 00:32:36,040
Too bad.

770
00:32:36,040 --> 00:32:37,689
It's a lousy fit.

771
00:32:37,689 --> 00:32:38,980
And you kind of know it, right.

772
00:32:38,980 --> 00:32:40,397
It's clear that
this doesn't look

773
00:32:40,397 --> 00:32:41,980
like it's coming
from a line, or if it

774
00:32:41,980 --> 00:32:45,550
is, it's a really noisy line.

775
00:32:45,550 --> 00:32:46,830
So let's think about this.

776
00:32:46,830 --> 00:32:51,700
What if I were to try
a higher order degree.

777
00:32:51,700 --> 00:32:54,507
Let's change the one to a two.

778
00:32:54,507 --> 00:32:56,340
So I'm going to come
back to it in a second.

779
00:32:56,340 --> 00:32:57,590
I've changed the one to a two.

780
00:32:57,590 --> 00:33:00,180
That says I'm still
using the polynomial fit,

781
00:33:00,180 --> 00:33:04,770
but now I'm going to ask what's
the best fitting parabola, ax

782
00:33:04,770 --> 00:33:07,910
squared plus bx plus c.

783
00:33:07,910 --> 00:33:08,840
Simple change.

784
00:33:08,840 --> 00:33:12,961
Because I was using polyval,
exactly the same code

785
00:33:12,961 --> 00:33:13,460
will work.

786
00:33:13,460 --> 00:33:15,830
It's going to do the fit to it.

787
00:33:15,830 --> 00:33:19,920
This is, by the way, still an
example of linear regression.

788
00:33:19,920 --> 00:33:22,820
So think of what I'm doing now.

789
00:33:22,820 --> 00:33:24,950
I have a
three-dimensional space.

790
00:33:24,950 --> 00:33:26,390
One axis is a values.

791
00:33:26,390 --> 00:33:28,440
Second axis is b values.

792
00:33:28,440 --> 00:33:30,560
Third axis is c values.

793
00:33:30,560 --> 00:33:34,890
Any point in that space
describes a parabola,

794
00:33:34,890 --> 00:33:36,560
and every point in
that space describes

795
00:33:36,560 --> 00:33:38,215
every possible parabola.

796
00:33:38,215 --> 00:33:40,340
And now you've got to twist
your head a little bit.

797
00:33:40,340 --> 00:33:42,800
Put a four-dimensional
surface on

798
00:33:42,800 --> 00:33:46,330
that three-dimensional basis,
where the point in that surface

799
00:33:46,330 --> 00:33:48,890
is the value of that
objective function.

800
00:33:48,890 --> 00:33:50,570
Play the same game.

801
00:33:50,570 --> 00:33:51,080
And you can.

802
00:33:51,080 --> 00:33:52,550
It's just a
higher-dimensional thing.

803
00:33:52,550 --> 00:33:54,591
So you're, again, going
to walk down the gradient

804
00:33:54,591 --> 00:33:56,461
to find the solution,
and be glad you don't

805
00:33:56,461 --> 00:33:58,460
have to write this code
because PyLab will do it

806
00:33:58,460 --> 00:33:59,084
for you freely.

807
00:33:59,084 --> 00:34:04,190
But it's still an example of
regression, which is great.

808
00:34:04,190 --> 00:34:07,243
And if we do that,
we get that fit.

809
00:34:07,243 --> 00:34:09,409
Actually just to show you
that, I'm going to run it,

810
00:34:09,409 --> 00:34:12,949
but it will do exactly
the same thing.

811
00:34:12,949 --> 00:34:15,158
If I go over to Python--

812
00:34:15,158 --> 00:34:16,199
wherever I have it here--

813
00:34:19,989 --> 00:34:23,350
I'm going to change
that order of the model.

814
00:34:23,350 --> 00:34:25,070
Oops, it went a
little too far for me.

815
00:34:25,070 --> 00:34:27,100
Sorry about that.

816
00:34:27,100 --> 00:34:29,940
Let me go back
and do this again.

817
00:34:29,940 --> 00:34:35,310
There's the first one, and
there's the second one.

818
00:34:40,170 --> 00:34:44,090
So I could fit
different models to it.

819
00:34:44,090 --> 00:34:47,090
Quadratic clearly looks
like it's a better fit.

820
00:34:47,090 --> 00:34:49,540
I hope you'll agree.

821
00:34:49,540 --> 00:34:53,710
So how do I decide
which one's better

822
00:34:53,710 --> 00:34:55,820
other than eyeballing it?

823
00:34:55,820 --> 00:34:57,920
And then if I could
fit a quadratic to it,

824
00:34:57,920 --> 00:34:59,810
what about other
orders of polynomials?

825
00:34:59,810 --> 00:35:01,970
Maybe there's an even
better fit out there.

826
00:35:01,970 --> 00:35:05,870
So how do I figure out what's
the best way to do the fit?

827
00:35:05,870 --> 00:35:08,540
And that leads to the second
big thing for this lecture.

828
00:35:08,540 --> 00:35:10,550
How good are these fits?

829
00:35:10,550 --> 00:35:12,050
What's the first big thing?

830
00:35:12,050 --> 00:35:14,600
The idea of linear
regression, a way of finding

831
00:35:14,600 --> 00:35:16,614
fits of curves to data.

832
00:35:16,614 --> 00:35:18,530
But now I've got to
decide how good are these.

833
00:35:18,530 --> 00:35:21,760
And I could ask this
question two ways.

834
00:35:21,760 --> 00:35:23,641
One is just relative
to each other,

835
00:35:23,641 --> 00:35:25,890
how do I measure which one's
better other than looking

836
00:35:25,890 --> 00:35:27,570
at it by eye?

837
00:35:27,570 --> 00:35:30,370
And then the second part of
it is in an absolute sense,

838
00:35:30,370 --> 00:35:33,025
how do I know where
the best solution is?

839
00:35:33,025 --> 00:35:35,120
Is quadratic the
best I could do?

840
00:35:35,120 --> 00:35:36,639
Or should I be
doing something else

841
00:35:36,639 --> 00:35:38,680
to try and figure out a
better solution, a better

842
00:35:38,680 --> 00:35:39,430
fit to the data?

843
00:35:41,860 --> 00:35:44,350
The relative fit.

844
00:35:44,350 --> 00:35:45,860
What are we doing here?

845
00:35:45,860 --> 00:35:48,370
We're fitting a curve,
which is a function

846
00:35:48,370 --> 00:35:51,157
of the independent variable
to the dependent variable.

847
00:35:51,157 --> 00:35:52,240
What does it mean by that?

848
00:35:52,240 --> 00:35:53,590
I've got a set of x values.

849
00:35:53,590 --> 00:35:55,930
I'm trying to predict what
the y values should be,

850
00:35:55,930 --> 00:35:57,070
the displacement should be.

851
00:35:57,070 --> 00:35:59,170
I want to get a
good fit to that.

852
00:35:59,170 --> 00:36:01,360
The idea is that given
an independent value,

853
00:36:01,360 --> 00:36:03,280
it gives me an estimate
of what it should be,

854
00:36:03,280 --> 00:36:05,620
and I really want to know
which fit provides the better

855
00:36:05,620 --> 00:36:07,030
estimates.

856
00:36:07,030 --> 00:36:10,240
And since I was simply
minimizing mean squared error,

857
00:36:10,240 --> 00:36:12,910
average square error,
an obvious thing to do

858
00:36:12,910 --> 00:36:17,050
is just to use the goodness of
fit by looking at that error.

859
00:36:17,050 --> 00:36:19,720
Why not just measure
where am I on that surface

860
00:36:19,720 --> 00:36:21,100
and see which one does better?

861
00:36:21,100 --> 00:36:22,641
Or actually it would
be two surfaces,

862
00:36:22,641 --> 00:36:25,110
one for a linear fit,
one for a quadratic one.

863
00:36:27,026 --> 00:36:28,150
We'll do what we always do.

864
00:36:28,150 --> 00:36:29,550
Let's write a
little bit of code.

865
00:36:29,550 --> 00:36:30,966
I can write something
that's going

866
00:36:30,966 --> 00:36:32,830
to get the average,
mean squared error.

867
00:36:32,830 --> 00:36:36,010
Takes in a set of data points,
a set of predicted values,

868
00:36:36,010 --> 00:36:38,102
simply measures the
difference between them,

869
00:36:38,102 --> 00:36:40,060
squares them, adds them
all up in a little loop

870
00:36:40,060 --> 00:36:42,970
here and returns that divided
by the number of samples I have.

871
00:36:42,970 --> 00:36:45,540
So it gives me the
average squared error.

872
00:36:45,540 --> 00:36:47,740
And I could do it for
that first model I built,

873
00:36:47,740 --> 00:36:49,552
which was for a
linear fit, and I

874
00:36:49,552 --> 00:36:51,260
could do it for the
second model I built,

875
00:36:51,260 --> 00:36:53,010
which is a quadratic fit.

876
00:36:53,010 --> 00:36:57,760
And if I run it, I
get those values.

877
00:36:57,760 --> 00:36:59,320
Looks pretty good.

878
00:36:59,320 --> 00:37:01,970
You knew by eye that the
quadratic was a better fit.

879
00:37:01,970 --> 00:37:05,440
And look, this says it's
about six times better,

880
00:37:05,440 --> 00:37:08,350
that the residual error
is six times smaller

881
00:37:08,350 --> 00:37:11,720
with the quadratic model
than it is the linear model.

882
00:37:15,970 --> 00:37:19,200
But with that, I still
have a problem, which is--

883
00:37:19,200 --> 00:37:22,740
OK, so it's useful for
comparing two models.

884
00:37:22,740 --> 00:37:26,352
But is 1524 a good number?

885
00:37:26,352 --> 00:37:28,310
Certainly better than
9,000-something or other.

886
00:37:28,310 --> 00:37:31,720
But how do I know that
1524 is a good number?

887
00:37:31,720 --> 00:37:35,324
How do I know there isn't a
better fit out there somewhere?

888
00:37:35,324 --> 00:37:37,740
Well, good news is we're going
to be able to measure that.

889
00:37:37,740 --> 00:37:41,730
It's hard to know because
there's no bound on the values.

890
00:37:41,730 --> 00:37:44,370
And more importantly, this
is not scale independent.

891
00:37:44,370 --> 00:37:45,340
What do I mean by that?

892
00:37:45,340 --> 00:37:49,520
If I take all of the values and
multiply them by some factor,

893
00:37:49,520 --> 00:37:52,530
I would still fit the
same models to them.

894
00:37:52,530 --> 00:37:53,880
They would just scale.

895
00:37:53,880 --> 00:37:56,397
But that measure would
increase by that amount.

896
00:37:56,397 --> 00:37:58,230
So I could make the
error as big or as small

897
00:37:58,230 --> 00:38:00,720
as I want by just changing
the size of the values.

898
00:38:00,720 --> 00:38:02,830
That doesn't make any sense.

899
00:38:02,830 --> 00:38:06,250
I'd like a way to
measure goodness of fit

900
00:38:06,250 --> 00:38:08,960
that is scale independent
and that tells me

901
00:38:08,960 --> 00:38:11,860
for any fit how close
it comes to being

902
00:38:11,860 --> 00:38:14,105
the perfect fit to the data.

903
00:38:14,105 --> 00:38:15,980
And so for that, we're
going to use something

904
00:38:15,980 --> 00:38:18,170
called the coefficient
of determination

905
00:38:18,170 --> 00:38:20,089
written as r squared.

906
00:38:20,089 --> 00:38:22,130
So let me show you what
this does, and then we're

907
00:38:22,130 --> 00:38:24,400
going to use it.

908
00:38:24,400 --> 00:38:26,890
The y's are measured values.

909
00:38:26,890 --> 00:38:29,470
Those are my samples I
got from my experiment.

910
00:38:29,470 --> 00:38:31,870
The p's are the
predicted values.

911
00:38:31,870 --> 00:38:34,360
That is, for this
curve, here's what I

912
00:38:34,360 --> 00:38:36,250
predict those values should be.

913
00:38:36,250 --> 00:38:38,440
So the top here is
basically measuring

914
00:38:38,440 --> 00:38:42,510
as we saw before the sum
squared error in those pieces.

915
00:38:42,510 --> 00:38:47,380
Mu down here is the average, or
mean, of the measured values.

916
00:38:47,380 --> 00:38:48,660
It's the average of the y's.

917
00:38:50,650 --> 00:38:53,040
So what I've got here
is in the numerator--

918
00:38:53,040 --> 00:38:56,150
this is basically the error
in the estimates from my curve

919
00:38:56,150 --> 00:38:58,240
fit.

920
00:38:58,240 --> 00:39:01,510
And in the denominator I've
got the amount of variation

921
00:39:01,510 --> 00:39:03,830
in the data itself.

922
00:39:03,830 --> 00:39:07,270
This is telling me how much does
the data change from just being

923
00:39:07,270 --> 00:39:09,850
a constant value, and this
is telling me how much

924
00:39:09,850 --> 00:39:13,270
do my errors vary around it.

925
00:39:13,270 --> 00:39:16,416
That ratio is scale independent
because it's a ratio.

926
00:39:16,416 --> 00:39:18,040
So even if I increase
all of the values

927
00:39:18,040 --> 00:39:19,831
by some amount, that's
going to divide out,

928
00:39:19,831 --> 00:39:22,070
which is kind of nice.

929
00:39:22,070 --> 00:39:25,040
So I could compute
that, and there it is.

930
00:39:25,040 --> 00:39:27,340
R squared is, again,
that expression.

931
00:39:27,340 --> 00:39:29,120
I'll take in a set
of observed values,

932
00:39:29,120 --> 00:39:32,450
a set of predicted values,
and I'll measure the error--

933
00:39:32,450 --> 00:39:33,620
again, these are arrays.

934
00:39:33,620 --> 00:39:35,730
So I'm going to take the
difference between the arrays.

935
00:39:35,730 --> 00:39:37,604
That's going to give me
piecewise or pairwise

936
00:39:37,604 --> 00:39:38,510
that difference.

937
00:39:38,510 --> 00:39:39,205
I'll square it.

938
00:39:39,205 --> 00:39:41,330
That's going to give me at
every point in the array

939
00:39:41,330 --> 00:39:43,280
the square of that distance.

940
00:39:43,280 --> 00:39:44,720
And then because
it's an array, I

941
00:39:44,720 --> 00:39:47,150
can just use the built-in sum
function to add them all up.

942
00:39:47,150 --> 00:39:48,780
So this is going
to give me the--

943
00:39:48,780 --> 00:39:51,380
if you like, the
values up there.

944
00:39:51,380 --> 00:39:54,060
And then I'm going to
play a little trick.

945
00:39:54,060 --> 00:39:55,980
I'm going to compute
the mean error, which

946
00:39:55,980 --> 00:40:00,310
is that thing divided by
the number of observations.

947
00:40:00,310 --> 00:40:02,350
Why would I do that?

948
00:40:02,350 --> 00:40:05,350
Well, because then I can
compute this really simply.

949
00:40:05,350 --> 00:40:07,840
I could write a little
loop to compute it.

950
00:40:07,840 --> 00:40:10,000
But in fact, I've already
said what is that?

951
00:40:10,000 --> 00:40:14,510
If I take that sum and divide
it by the number of samples,

952
00:40:14,510 --> 00:40:16,380
that's the variance.

953
00:40:16,380 --> 00:40:17,340
So that's really nice.

954
00:40:17,340 --> 00:40:19,950
Right here I can
say, get the variance

955
00:40:19,950 --> 00:40:23,990
using the non-p version
of the observed data.

956
00:40:23,990 --> 00:40:27,090
And because that has
associated with it division

957
00:40:27,090 --> 00:40:29,460
by the number of
samples, the ratio

958
00:40:29,460 --> 00:40:31,110
of the mean error
to the variance

959
00:40:31,110 --> 00:40:35,290
is exactly the same as
the ratio of that to that.

960
00:40:35,290 --> 00:40:35,990
Little trick.

961
00:40:35,990 --> 00:40:38,620
It lets me save doing a
little bit of computation.

962
00:40:38,620 --> 00:40:40,970
So I can compute
r squared values.

963
00:40:43,510 --> 00:40:47,300
So what does r squared
actually tell us?

964
00:40:47,300 --> 00:40:50,210
What we're doing is we're
trying to compare the estimation

965
00:40:50,210 --> 00:40:53,120
errors, the top part,
with the variability

966
00:40:53,120 --> 00:40:55,949
in the original values,
the bottom part.

967
00:40:55,949 --> 00:40:57,740
So r squared, as you're
going to see there,

968
00:40:57,740 --> 00:41:00,020
it's intended to
capture what portion

969
00:41:00,020 --> 00:41:05,310
of the variability in the data
is accounted for by my model.

970
00:41:05,310 --> 00:41:06,690
My model's a really good fit.

971
00:41:06,690 --> 00:41:10,920
It should account for
almost all of that data.

972
00:41:10,920 --> 00:41:15,860
So what we see then is if we do
a fit with a linear regression,

973
00:41:15,860 --> 00:41:19,140
r squared is always going
to be between zero and one.

974
00:41:19,140 --> 00:41:21,920
And I want to just
show you some examples.

975
00:41:21,920 --> 00:41:24,390
If r squared is equal
to one, this is great.

976
00:41:24,390 --> 00:41:28,350
It says the model explains all
of the variability in the data.

977
00:41:28,350 --> 00:41:31,097
And you can see it
if we go back here.

978
00:41:31,097 --> 00:41:32,680
How do we make r
squared equal to one?

979
00:41:32,680 --> 00:41:35,710
We need this to be
zero, which says

980
00:41:35,710 --> 00:41:38,140
that the variability in
the data is perfectly

981
00:41:38,140 --> 00:41:40,630
predicted by my model.

982
00:41:40,630 --> 00:41:42,469
Every point lies
exactly along the curve.

983
00:41:42,469 --> 00:41:43,010
That's great.

984
00:41:47,050 --> 00:41:48,610
Second option at
the other extreme

985
00:41:48,610 --> 00:41:51,040
is if r squared
is equal to zero,

986
00:41:51,040 --> 00:41:54,670
you basically got bupkis, which
is a well-known technical term,

987
00:41:54,670 --> 00:41:57,640
meaning there's no relationship
between the values predicted

988
00:41:57,640 --> 00:42:00,280
by the model and
the actual data.

989
00:42:00,280 --> 00:42:03,010
That basically says that
all of the variability

990
00:42:03,010 --> 00:42:06,430
here is exactly the same as all
the variability in the data.

991
00:42:06,430 --> 00:42:07,900
The model doesn't
capture anything,

992
00:42:07,900 --> 00:42:12,320
and it's making this one, which
is making the whole thing zero.

993
00:42:12,320 --> 00:42:14,320
And then in between an r
squared of about a half

994
00:42:14,320 --> 00:42:16,430
says you're capturing
about half the variability.

995
00:42:16,430 --> 00:42:18,730
So what you would
like is a system

996
00:42:18,730 --> 00:42:22,480
in which your fit is as close
to an r squared value of one

997
00:42:22,480 --> 00:42:24,880
as possible because
it says my model is

998
00:42:24,880 --> 00:42:28,130
capturing all the variability
in the data really well.

999
00:42:31,570 --> 00:42:33,651
So two functions that
will do this for us.

1000
00:42:33,651 --> 00:42:35,900
We're going to come back to
these in the next lecture.

1001
00:42:35,900 --> 00:42:38,500
The first one called
generate fits, or genFits,

1002
00:42:38,500 --> 00:42:41,480
will take a set of x
values, a set of y values,

1003
00:42:41,480 --> 00:42:43,670
and a list or a
tuple of degrees,

1004
00:42:43,670 --> 00:42:45,560
and these will be
the different degrees

1005
00:42:45,560 --> 00:42:47,180
of models I'd like to fit.

1006
00:42:47,180 --> 00:42:48,687
I could just give it one.

1007
00:42:48,687 --> 00:42:49,520
I could give it two.

1008
00:42:49,520 --> 00:42:52,636
I could give a 1, 2,
4, 8, 16, whatever.

1009
00:42:52,636 --> 00:42:54,260
And I'll just run
through a little loop

1010
00:42:54,260 --> 00:42:56,270
here where I'm going to
build up a set of models

1011
00:42:56,270 --> 00:42:58,670
for each degree--
or d in degrees.

1012
00:42:58,670 --> 00:43:00,452
I'll do the fit exactly
as I had before.

1013
00:43:00,452 --> 00:43:01,910
It's going to return
a model, which

1014
00:43:01,910 --> 00:43:03,860
is a tuple of coefficients.

1015
00:43:03,860 --> 00:43:07,760
And I'm going to store that
in models and then return it.

1016
00:43:07,760 --> 00:43:10,510
And then I'm going to use
that, because in testFits I

1017
00:43:10,510 --> 00:43:13,480
will take the models
that come from genFits,

1018
00:43:13,480 --> 00:43:15,770
I'll take the set of
degrees that I also passed

1019
00:43:15,770 --> 00:43:17,780
in there as well as the values.

1020
00:43:17,780 --> 00:43:19,160
I'll plot them
out, and then I'll

1021
00:43:19,160 --> 00:43:25,120
simply run through each of
the models and generate a fit,

1022
00:43:25,120 --> 00:43:29,241
compute the r squared value,
plot it, and then print out

1023
00:43:29,241 --> 00:43:29,740
some data.

1024
00:43:31,890 --> 00:43:37,200
With that in mind, let's see
what happens if we run this.

1025
00:43:37,200 --> 00:43:40,360
So I'm going to take,
again, that example

1026
00:43:40,360 --> 00:43:43,445
of that data that
I started with,

1027
00:43:43,445 --> 00:43:45,570
assuming I picked the right
one here, which I think

1028
00:43:45,570 --> 00:43:46,110
is this one.

1029
00:43:46,110 --> 00:43:49,860
I'm going to do a fit with
a degree one and a degree

1030
00:43:49,860 --> 00:43:51,114
two curve.

1031
00:43:51,114 --> 00:43:52,530
So I'm going to
fit the best line.

1032
00:43:52,530 --> 00:43:55,020
I'm going to fit the best
quadratic, the best parabola,

1033
00:43:55,020 --> 00:43:58,200
and I want to see how
well that comes out.

1034
00:43:58,200 --> 00:44:00,950
So I do that.

1035
00:44:00,950 --> 00:44:02,000
I got some data there.

1036
00:44:02,000 --> 00:44:02,570
Looks good.

1037
00:44:02,570 --> 00:44:05,870
And what does the data tell me?

1038
00:44:05,870 --> 00:44:09,112
Data says, oh, cool--

1039
00:44:09,112 --> 00:44:10,570
I know you don't
believe it, but it

1040
00:44:10,570 --> 00:44:12,040
is because notice
what it says, it

1041
00:44:12,040 --> 00:44:17,670
says the r squared value
for the line is horrible.

1042
00:44:17,670 --> 00:44:24,314
It accounts for less
than 0.05% of the data.

1043
00:44:24,314 --> 00:44:25,730
You could say, OK,
I can see that.

1044
00:44:25,730 --> 00:44:26,530
I look at it.

1045
00:44:26,530 --> 00:44:28,380
It does a lousy job.

1046
00:44:28,380 --> 00:44:31,290
On the other hand, the
quadratic is really pretty good.

1047
00:44:31,290 --> 00:44:35,820
It's accounting for about 84%
of the variability in the data.

1048
00:44:35,820 --> 00:44:37,770
This is a nice high value.

1049
00:44:37,770 --> 00:44:40,270
It's not one, but it's
a nice high value.

1050
00:44:40,270 --> 00:44:42,660
So this is now
reinforcing what I already

1051
00:44:42,660 --> 00:44:44,040
knew, but in a nice way.

1052
00:44:44,040 --> 00:44:47,190
It's telling me that that
r squared value tells me

1053
00:44:47,190 --> 00:44:51,851
that the quadratic is a much
better fit than the linear fit

1054
00:44:51,851 --> 00:44:52,350
was.

1055
00:44:55,840 --> 00:44:57,940
But then you say
maybe, wait a minute.

1056
00:44:57,940 --> 00:45:01,510
I could have done this by just
comparing the fits themselves.

1057
00:45:01,510 --> 00:45:03,010
I already saw that.

1058
00:45:03,010 --> 00:45:05,500
Part of my goal is how
do I know if I've got

1059
00:45:05,500 --> 00:45:08,360
the best fit possible or not.

1060
00:45:08,360 --> 00:45:09,970
So I'm going to
do the same thing,

1061
00:45:09,970 --> 00:45:16,597
but now I'm going to run it
with another set of degrees.

1062
00:45:16,597 --> 00:45:17,680
I'm going to go over here.

1063
00:45:17,680 --> 00:45:19,510
I'm going to take
exactly the same code.

1064
00:45:19,510 --> 00:45:23,840
But let's try it
with a quadratic,

1065
00:45:23,840 --> 00:45:27,500
with a quartic, an order
eight, and an order 16 fit.

1066
00:45:27,500 --> 00:45:30,590
So I'm going to take
different size polynomials.

1067
00:45:30,590 --> 00:45:33,200
As a quick aside,
this is why I want

1068
00:45:33,200 --> 00:45:35,750
to use the PyLab kind
of code because now I'm

1069
00:45:35,750 --> 00:45:38,930
simply optimizing over
a 16-dimensional space.

1070
00:45:38,930 --> 00:45:41,350
Every point in that
16-dimensional space

1071
00:45:41,350 --> 00:45:44,010
defines a 16th-degree
polynomial.

1072
00:45:44,010 --> 00:45:45,950
And I can still use
linear regression,

1073
00:45:45,950 --> 00:45:47,690
meaning walking
down the gradient,

1074
00:45:47,690 --> 00:45:50,210
to find the best solution.

1075
00:45:50,210 --> 00:45:53,342
I'm going to run this.

1076
00:45:53,342 --> 00:45:56,460
And I get out a set of values.

1077
00:45:56,460 --> 00:45:57,357
Looks good.

1078
00:45:57,357 --> 00:45:58,440
And let's go look at them.

1079
00:46:03,780 --> 00:46:08,960
Here is the r squared value
for quadratic, about 84%.

1080
00:46:08,960 --> 00:46:11,330
Degree four does a
little bit better.

1081
00:46:11,330 --> 00:46:13,130
Degree eight does a
little bit better.

1082
00:46:13,130 --> 00:46:16,590
But wow, look at
that, degree 16--

1083
00:46:16,590 --> 00:46:20,640
16th order polynomial
does a really good job,

1084
00:46:20,640 --> 00:46:26,100
accounts for almost 97% of
the variability in the data.

1085
00:46:26,100 --> 00:46:26,850
That sounds great.

1086
00:46:29,225 --> 00:46:31,320
Now, to quote something
that your parents probably

1087
00:46:31,320 --> 00:46:34,020
said to you when you were much
younger, just because something

1088
00:46:34,020 --> 00:46:37,350
looks good doesn't
mean we should do it.

1089
00:46:37,350 --> 00:46:40,725
And in fact, just because
this has a really high r

1090
00:46:40,725 --> 00:46:44,190
squared value doesn't mean
that we want to use the order

1091
00:46:44,190 --> 00:46:46,560
16th polynomial.

1092
00:46:46,560 --> 00:46:49,770
And I will wonderfully leave
you waiting in suspense

1093
00:46:49,770 --> 00:46:52,564
because we're going to answer
that question next Monday.

1094
00:46:52,564 --> 00:46:54,730
And with that, I'll let you
out a few minutes early.

1095
00:46:54,730 --> 00:46:57,110
Have a great Thanksgiving break.