1 00:00:00,000 --> 00:00:01,990 OPERATOR: The following content is provided under a 2 00:00:01,990 --> 00:00:03,840 Creative Commons license. 3 00:00:03,840 --> 00:00:06,840 Your support will help MIT OpenCourseWare continue to 4 00:00:06,840 --> 00:00:10,530 offer high quality educational resources for free. 5 00:00:10,530 --> 00:00:13,390 To make a donation or view additional materials from 6 00:00:13,390 --> 00:00:17,490 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,490 --> 00:00:19,930 ocw.mit.edu. 8 00:00:19,930 --> 00:00:22,810 PROFESSOR: So let's start. 9 00:00:22,810 --> 00:00:25,920 I have written a number on the board here. 10 00:00:25,920 --> 00:00:32,130 Anyone want to speculate what that number represents? 11 00:00:32,130 --> 00:00:34,620 Well, you may recall at the end of the last lecture, we 12 00:00:34,620 --> 00:00:39,480 were simulating pi, and I started up running it with a 13 00:00:39,480 --> 00:00:41,840 billion darts. 14 00:00:41,840 --> 00:00:45,400 And when it finally terminated, this was the 15 00:00:45,400 --> 00:00:50,310 estimate of pi it gave me with a billion. 16 00:00:50,310 --> 00:00:57,360 Not bad, not quite perfect, but still pretty good. 17 00:00:57,360 --> 00:01:00,790 In fact when I later ran it with 10 billion darts, which 18 00:01:00,790 --> 00:01:04,660 took a rather long time to run, didn't do much better. 19 00:01:04,660 --> 00:01:10,810 So it's converging very slowly now near the end. 20 00:01:10,810 --> 00:01:14,480 When we use an algorithm like that one to perform a Monte 21 00:01:14,480 --> 00:01:18,890 Carlo simulation, we're trusting, as I said, that fate 22 00:01:18,890 --> 00:01:22,610 will give us an unbiased sample, a sample that would be 23 00:01:22,610 --> 00:01:27,540 representative of true random throws. 24 00:01:27,540 --> 00:01:30,220 And, indeed in this case, that's a pretty good 25 00:01:30,220 --> 00:01:31,500 assumption. 
26 00:01:31,500 --> 00:01:34,990 The random number generator is not truly random, it's what's 27 00:01:34,990 --> 00:01:38,510 called pseudo-random, in that if you start it with the same 28 00:01:38,510 --> 00:01:42,360 initial conditions, it will give you the same results. 29 00:01:42,360 --> 00:01:47,090 But it's close enough for, at least for government work, and 30 00:01:47,090 --> 00:01:51,980 other useful projects. 31 00:01:51,980 --> 00:01:55,820 We do have to think about the question, how many samples 32 00:01:55,820 --> 00:01:57,690 should we run? 33 00:01:57,690 --> 00:02:00,730 Was a billion darts enough? 34 00:02:00,730 --> 00:02:04,280 Now since we sort of all started knowing what pi was, 35 00:02:04,280 --> 00:02:07,460 we could look at it and say, yeah, pretty good. 36 00:02:07,460 --> 00:02:14,440 But suppose we had no clue about the actual value of pi. 37 00:02:14,440 --> 00:02:16,100 We still have to think about the 38 00:02:16,100 --> 00:02:28,370 question of how many samples? 39 00:02:28,370 --> 00:02:38,020 And also, how accurate do we believe our result is, given 40 00:02:38,020 --> 00:02:40,660 the number of samples? 41 00:02:40,660 --> 00:02:45,790 As you might guess, these two questions are closely related. 42 00:02:45,790 --> 00:02:52,170 That, if we know in advance how much accuracy we want, we 43 00:02:52,170 --> 00:02:54,820 can sometimes use that to calculate how 44 00:02:54,820 --> 00:03:03,900 many samples we need. 45 00:03:03,900 --> 00:03:10,050 But there's still always the issue. 46 00:03:10,050 --> 00:03:13,910 It's never possible to achieve perfect 47 00:03:13,910 --> 00:03:16,200 accuracy through sampling. 48 00:03:16,200 --> 00:03:20,110 Unless you sample the entire population. 49 00:03:20,110 --> 00:03:25,100 No matter how many samples you take, you can never be sure 50 00:03:25,100 --> 00:03:30,260 that the sample set is typical until you've checked every 51 00:03:30,260 --> 00:03:32,220 last element. 
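The dart-throwing Monte Carlo estimator the lecture refers to can be sketched like this (a minimal version; the function and variable names are mine, not the course's actual code):

```python
import random

def estimate_pi(num_darts, seed=None):
    """Estimate pi by throwing random darts at the unit square and
    counting the fraction that land inside the quarter circle."""
    if seed is not None:
        random.seed(seed)  # pseudo-random: same seed, same result
    in_circle = 0
    for _ in range(num_darts):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            in_circle += 1
    # The quarter circle has area pi/4 inside the unit square,
    # so the hit fraction times 4 estimates pi.
    return 4.0 * in_circle / num_darts
```

Note that calling it twice with the same seed returns exactly the same estimate, which is the pseudo-randomness point made above: identical initial conditions give identical results.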
52 00:03:32,220 --> 00:03:38,610 So if I went around MIT and sampled 100 students to try 53 00:03:38,610 --> 00:03:43,110 and, for example, guess the fraction of students at MIT 54 00:03:43,110 --> 00:03:46,940 who are of Chinese descent. 55 00:03:46,940 --> 00:03:52,470 Maybe 100 students would be enough, but maybe I would get 56 00:03:52,470 --> 00:03:55,420 unlucky and draw the wrong 100. 57 00:03:55,420 --> 00:03:59,680 In the sense of, by accident, 100 Chinese descent, or 100 58 00:03:59,680 --> 00:04:01,820 non-Chinese descent, which would give 59 00:04:01,820 --> 00:04:04,090 me the wrong answer. 60 00:04:04,090 --> 00:04:08,200 And there would be no way I could be sure that I had not 61 00:04:08,200 --> 00:04:18,880 drawn a biased sample, unless I really did have the whole 62 00:04:18,880 --> 00:04:22,770 population to look at. 63 00:04:22,770 --> 00:04:28,560 So we can never know that our estimate is correct. 64 00:04:28,560 --> 00:04:32,270 Now maybe I took a billion darts, and for some reason got 65 00:04:32,270 --> 00:04:35,330 really unlucky and they all ended up inside 66 00:04:35,330 --> 00:04:38,440 or outside the circle. 67 00:04:38,440 --> 00:04:42,590 But what we can know, is how likely it is that our answer 68 00:04:42,590 --> 00:04:46,030 is correct, given the assumptions. 69 00:04:46,030 --> 00:04:48,400 And that's the topic we'll spend the next few lectures 70 00:04:48,400 --> 00:04:50,510 on, at least one of the topics. 71 00:04:50,510 --> 00:04:54,520 It's saying, how can we know how likely it is that our 72 00:04:54,520 --> 00:04:56,090 answer is good. 73 00:04:56,090 --> 00:05:01,290 But it's always given some set of assumptions, and we have to 74 00:05:01,290 --> 00:05:04,860 worry a lot about those assumptions. 
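The unlucky-sample risk just described can be illustrated with a quick simulation (the population size and its 10% trait fraction are made up purely for illustration):

```python
import random

def sample_fraction(population, sample_size):
    """Estimate the fraction of True elements from one random sample."""
    sample = random.sample(population, sample_size)
    return sum(sample) / sample_size

random.seed(1)
# A made-up population of 10,000 people, 10% of whom have the trait.
population = [True] * 1000 + [False] * 9000

# Draw 100-person samples many times; the estimates scatter around 0.1,
# and any single draw can be well above or below the true fraction.
estimates = [sample_fraction(population, 100) for _ in range(1000)]
print(min(estimates), max(estimates))
```

The spread between the smallest and largest estimate is the point: any one 100-person sample can mislead, even though the average over many samples is close to the truth.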
75 00:05:04,860 --> 00:05:10,870 Now in the case of our pi example, our assumption was 76 00:05:10,870 --> 00:05:14,810 that the random number generator was indeed giving us 77 00:05:14,810 --> 00:05:18,860 random numbers in the interval 0 to 1. 78 00:05:18,860 --> 00:05:23,980 So that was our underlying assumption. 79 00:05:23,980 --> 00:05:28,700 Then using that, we looked at a plot, and we saw that after 80 00:05:28,700 --> 00:05:33,310 time the answer wasn't changing very much. 81 00:05:33,310 --> 00:05:36,290 And we used that to say, OK, it looks like we're actually 82 00:05:36,290 --> 00:05:40,010 converging on an answer. 83 00:05:40,010 --> 00:05:44,320 And then I ran it again, with another trial, and it 84 00:05:44,320 --> 00:05:49,040 converged again at the same place. 85 00:05:49,040 --> 00:05:52,710 And the fact that that happened several times led me 86 00:05:52,710 --> 00:05:56,640 to at least have some reason to believe that I was actually 87 00:05:56,640 --> 00:06:04,920 finding a good approximation of pi. 88 00:06:04,920 --> 00:06:07,260 That's a good thing to do. 89 00:06:07,260 --> 00:06:09,040 It's a necessary thing to do. 90 00:06:09,040 --> 00:06:11,880 But it is not sufficient. 91 00:06:11,880 --> 00:06:16,300 Because errors can creep into many places. 92 00:06:16,300 --> 00:06:20,120 So that kind of technique, and in fact, almost all 93 00:06:20,120 --> 00:06:26,160 statistical techniques, are good at establishing, in some 94 00:06:26,160 --> 00:06:30,540 sense, the reproducibility of the result, and that it is 95 00:06:30,540 --> 00:06:35,480 statistically valid, and that there's no error, for example, 96 00:06:35,480 --> 00:06:40,250 in the way I'm generating the numbers. 97 00:06:40,250 --> 00:06:43,480 Or I didn't get very unlucky. 98 00:06:43,480 --> 00:06:48,410 However, there are other places, other than bad luck, where 99 00:06:48,410 --> 00:06:51,270 errors can creep in.
100 00:06:51,270 --> 00:06:53,940 So let's look at an example here. 101 00:06:53,940 --> 00:06:59,900 I've taken the algorithm we looked at last time for 102 00:06:59,900 --> 00:07:08,450 finding pi, and I've made a change. 103 00:07:08,450 --> 00:07:13,310 You'll remember that we were before using 4 as our 104 00:07:13,310 --> 00:07:17,110 multiplier, and here what I've done is, just gone in and 105 00:07:17,110 --> 00:07:20,710 replaced 4 by 2. 106 00:07:20,710 --> 00:07:25,190 Assume that I made a programming error. 107 00:07:25,190 --> 00:07:35,420 Now let's see what happens when we run it. 108 00:07:35,420 --> 00:07:42,480 Well, a bad thing has happened. 109 00:07:42,480 --> 00:07:48,560 Sure enough, we ran it and it converged, started to 110 00:07:48,560 --> 00:07:53,850 converge, and if I ran 100 trials each one would converge 111 00:07:53,850 --> 00:07:56,800 at roughly the same place. 112 00:07:56,800 --> 00:08:00,450 Any statistical test I would do, would say that my 113 00:08:00,450 --> 00:08:03,770 statistics are sound, I've chosen enough samples, and for 114 00:08:03,770 --> 00:08:05,960 some accuracy, it's converging. 115 00:08:05,960 --> 00:08:10,280 Everything is perfect, except for what? 116 00:08:10,280 --> 00:08:13,360 It's the wrong answer. 117 00:08:13,360 --> 00:08:18,540 The moral here, is that just because an answer is 118 00:08:18,540 --> 00:08:45,810 statistically valid, does not mean it's the right answer. 119 00:08:45,810 --> 00:08:49,700 And that's really important to understand, because you see 120 00:08:49,700 --> 00:08:53,180 this, and we'll see more examples later, not today, but 121 00:08:53,180 --> 00:08:56,460 after Thanksgiving, it comes up all the time in the 122 00:08:56,460 --> 00:09:00,660 newspapers, in scientific articles, where people do a 123 00:09:00,660 --> 00:09:04,370 million tests, do all the statistics right, say here's 124 00:09:04,370 --> 00:09:08,420 the answer, and it turns out to be completely wrong.
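A sketch of the bug just described: with the multiplier changed from 4 to 2, repeated trials still agree with one another, so every statistical check passes, yet the converged value is roughly pi/2, not pi (this is a reconstruction of the idea, not the course's actual code):

```python
import random

def estimate_with_multiplier(multiplier, num_darts):
    """Dart-throwing estimator with a configurable multiplier;
    only multiplier=4 gives the correct algebra for pi."""
    hits = sum(1 for _ in range(num_darts)
               if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return multiplier * hits / num_darts

random.seed(0)
# Five independent trials with the buggy multiplier of 2.
trials = [estimate_with_multiplier(2, 100000) for _ in range(5)]
# Every trial lands near 1.57: reproducible, statistically sound, and wrong.
print(trials)
```

No statistical test on these trials can reveal the problem; only checking against physical reality, such as the area of an actual circle, can.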
125 00:09:08,420 --> 00:09:12,310 And that's because it was some underlying assumption that 126 00:09:12,310 --> 00:09:17,120 went into the decision, that was not true. 127 00:09:17,120 --> 00:09:20,750 So here, the assumption is, that I've done my algebra 128 00:09:20,750 --> 00:09:27,440 right for computing pi based upon where the darts land. 129 00:09:27,440 --> 00:09:32,990 And it turns out, if I put 2 here, my algebra is wrong. 130 00:09:32,990 --> 00:09:35,850 Now how could I discover this? 131 00:09:35,850 --> 00:09:38,700 Since I've already told you no statistical test is 132 00:09:38,700 --> 00:09:40,060 going to help me. 133 00:09:40,060 --> 00:09:42,710 What's the obvious thing I should be doing when I get 134 00:09:42,710 --> 00:09:44,830 this answer? 135 00:09:44,830 --> 00:09:45,930 Somebody? 136 00:09:45,930 --> 00:09:50,420 Yeah? 137 00:09:50,420 --> 00:09:50,800 STUDENT: [INAUDIBLE] 138 00:09:50,800 --> 00:09:54,830 PROFESSOR: Exactly. 139 00:09:54,830 --> 00:09:57,890 Checking against reality. 140 00:09:57,890 --> 00:10:01,160 I started with the notion that pi had some relation to the 141 00:10:01,160 --> 00:10:03,690 area of a circle. 142 00:10:03,690 --> 00:10:08,030 So I could use this value of pi, draw a 143 00:10:08,030 --> 00:10:11,850 circle with a radius. 144 00:10:11,850 --> 00:10:13,980 Do my best to measure the area. 145 00:10:13,980 --> 00:10:17,120 I wouldn't need to get a very good, accurate measurement, 146 00:10:17,120 --> 00:10:20,890 and I would say, whoa, this isn't even close. 147 00:10:20,890 --> 00:10:25,480 And that would tell me I have a problem. 148 00:10:25,480 --> 00:10:37,240 So the moral here is, to check results 149 00:10:37,240 --> 00:10:50,040 against physical reality. 150 00:10:50,040 --> 00:10:53,750 So for example, the current problem set, you're doing a 151 00:10:53,750 --> 00:10:56,340 simulation about what happens to viruses 152 00:10:56,340 --> 00:11:00,020 when drugs are applied. 
153 00:11:00,020 --> 00:11:03,660 If you were doing this for a pharmaceutical company, in 154 00:11:03,660 --> 00:11:06,660 addition to the simulation, you'd want to run some real 155 00:11:06,660 --> 00:11:08,530 experiments. 156 00:11:08,530 --> 00:11:17,390 And make sure that things matched. 157 00:11:17,390 --> 00:11:29,720 OK, what this suggests, is that we often use simulation, 158 00:11:29,720 --> 00:11:35,930 and other computational techniques, to try and model 159 00:11:35,930 --> 00:11:38,320 the real world, or the physical world, in 160 00:11:38,320 --> 00:11:42,190 which we all live. 161 00:11:42,190 --> 00:11:46,690 And we can use data to do that. 162 00:11:46,690 --> 00:11:51,190 I now want to go through another set of examples, and 163 00:11:51,190 --> 00:11:55,230 we're going to look at the interplay of three things: 164 00:11:55,230 --> 00:12:02,770 what happens when you have data, say from measurements, 165 00:12:02,770 --> 00:12:15,220 and models that at least claim to explain the data. 166 00:12:15,220 --> 00:12:26,870 And then, consequences that follow from the models. 167 00:12:26,870 --> 00:12:30,570 This is often the way science works, it's the way engineering 168 00:12:30,570 --> 00:12:35,870 works, we have some measurements, we have a theory 169 00:12:35,870 --> 00:12:38,360 that explains the measurements, and then we 170 00:12:38,360 --> 00:12:43,770 write software to explore the consequences of that theory. 171 00:12:43,770 --> 00:12:48,270 Including, is it plausible that it's really true? 172 00:12:48,270 --> 00:12:53,030 So I want to start, as an example, with a 173 00:12:53,030 --> 00:12:57,650 classic chosen from 8.01. 174 00:12:57,650 --> 00:13:01,780 So I presume, everyone here has taken 8.01? 175 00:13:01,780 --> 00:13:02,860 Or in 8.01? 176 00:13:02,860 --> 00:13:08,390 Anyone here who's not had an experience with 8.01? 177 00:13:08,390 --> 00:13:11,020 All right, well.
178 00:13:11,020 --> 00:13:13,290 I hope you know about springs, because we're going to talk 179 00:13:13,290 --> 00:13:15,120 about springs. 180 00:13:15,120 --> 00:13:18,780 So if you think about it, I'm now just talking not about 181 00:13:18,780 --> 00:13:21,680 springs that have water in them, but springs that you 182 00:13:21,680 --> 00:13:25,920 compress, you know, and expand, and things like that. 183 00:13:25,920 --> 00:13:28,760 And there's typically something called the spring 184 00:13:28,760 --> 00:13:42,090 constant that tells us how stiff the spring is, how much 185 00:13:42,090 --> 00:13:45,750 energy it takes to compress this spring. 186 00:13:45,750 --> 00:13:49,570 Or equivalently, how much pop the spring has when you're no 187 00:13:49,570 --> 00:13:53,990 longer holding it down. 188 00:13:53,990 --> 00:13:56,720 Some springs are easy to stretch, they have a small 189 00:13:56,720 --> 00:13:58,070 spring constant. 190 00:13:58,070 --> 00:14:01,230 Some springs, for example, the ones that hold up an 191 00:14:01,230 --> 00:14:04,470 automobile suspension, are much harder 192 00:14:04,470 --> 00:14:08,370 to stretch and compress. 193 00:14:08,370 --> 00:14:20,320 There's a theory about them called Hooke's Law. 194 00:14:20,320 --> 00:14:28,070 And it's quite simple. 195 00:14:28,070 --> 00:14:32,870 Force, the amount of force exerted by a spring, is equal 196 00:14:32,870 --> 00:14:38,560 to minus some constant times the distance you have 197 00:14:38,560 --> 00:14:44,170 compressed the spring. 198 00:14:44,170 --> 00:14:47,640 It's minus, because the force is exerted in an opposite 199 00:14:47,640 --> 00:14:50,580 direction, trying to spring up. 200 00:14:50,580 --> 00:14:55,350 So for example, we could look at it this way. 201 00:14:55,350 --> 00:15:01,160 We've got a spring, excuse my art here.
202 00:15:01,160 --> 00:15:05,840 And we put some weight on the spring, which has therefore 203 00:15:05,840 --> 00:15:08,630 compressed it a little bit. 204 00:15:08,630 --> 00:15:13,440 And the spring is exerting some upward force. 205 00:15:13,440 --> 00:15:18,080 And the amount of force it's exerting is proportional to 206 00:15:18,080 --> 00:15:26,590 the distance x. 207 00:15:26,590 --> 00:15:34,670 So, if we believe Hooke's Law, and I give you a spring, how 208 00:15:34,670 --> 00:15:38,790 can we find out what this constant is? 209 00:15:38,790 --> 00:15:45,090 Well, we can do it by putting a weight on top of the spring. 210 00:15:45,090 --> 00:15:50,100 It will compress the spring a certain amount, and then the 211 00:15:50,100 --> 00:15:53,380 spring will stop moving. 212 00:15:53,380 --> 00:15:56,290 Now gravity would normally have had this weight go all 213 00:15:56,290 --> 00:16:00,490 the way down to the bottom, if there was no spring. 214 00:16:00,490 --> 00:16:03,330 So clearly the spring is exerting some force in the 215 00:16:03,330 --> 00:16:08,660 upward direction, to keep that mass from going down to the 216 00:16:08,660 --> 00:16:14,310 table, right? 217 00:16:14,310 --> 00:16:17,760 So we know what that force is there. 218 00:16:17,760 --> 00:16:23,180 If we compress the spring to a bunch of different distances, 219 00:16:23,180 --> 00:16:29,570 by putting, say, different size weights on it, we can 220 00:16:29,570 --> 00:16:35,780 then solve for the spring constant, just the way, 221 00:16:35,780 --> 00:16:39,350 before, we solved for pi. 222 00:16:39,350 --> 00:16:47,280 So it just so happens, not quite by accident, that I've 223 00:16:47,280 --> 00:16:50,540 got some data from a spring. 224 00:16:50,540 --> 00:16:52,290 So let's look at it. 225 00:16:52,290 --> 00:16:57,310 So here's some data taken from measuring a spring. 
226 00:16:57,310 --> 00:17:01,300 This is distance and force, force computed from the mass, 227 00:17:01,300 --> 00:17:02,720 basically, right? 228 00:17:02,720 --> 00:17:07,130 Because we know that these have to be in balance. 229 00:17:07,130 --> 00:17:10,630 And I'm not going to ask you to in your head estimate the 230 00:17:10,630 --> 00:17:16,440 constant from these, but what you'll see is, the format is, 231 00:17:16,440 --> 00:17:22,700 there's a distance, and then a colon, and then the force. 232 00:17:22,700 --> 00:17:22,950 Yeah? 233 00:17:22,950 --> 00:17:29,940 STUDENT: [INAUDIBLE] 234 00:17:29,940 --> 00:17:37,780 PROFESSOR: OK, right, yes, thank you. 235 00:17:37,780 --> 00:17:41,010 All right, want to repeat that more loudly for everyone? 236 00:17:41,010 --> 00:17:42,850 STUDENT: [INAUDIBLE] 237 00:17:42,850 --> 00:17:48,680 PROFESSOR: Right, right, because the x in the equation 238 00:17:48,680 --> 00:17:53,280 -- right, here we're getting an equilibrium. 239 00:17:53,280 --> 00:17:57,140 OK, so let's look at what happens when we try and 240 00:17:57,140 --> 00:17:59,790 examine this. 241 00:17:59,790 --> 00:18:05,300 We'll look at spring dot py. 242 00:18:05,300 --> 00:18:07,520 So it's pretty simple. 243 00:18:07,520 --> 00:18:10,590 First thing is, I've got a function that reads in the 244 00:18:10,590 --> 00:18:12,700 data and parses it. 245 00:18:12,700 --> 00:18:15,400 You've all done more complicated parsing of data 246 00:18:15,400 --> 00:18:16,930 files than this. 247 00:18:16,930 --> 00:18:19,820 So I won't belabor the details. 248 00:18:19,820 --> 00:18:22,640 I called it get data rather than get spring data, because 249 00:18:22,640 --> 00:18:24,610 I'm going to use the same thing for a lot of 250 00:18:24,610 --> 00:18:26,640 other kinds of data. 251 00:18:26,640 --> 00:18:29,630 And the only thing I want you to notice, is that it's 252 00:18:29,630 --> 00:18:36,190 returning a pair of arrays.
253 00:18:36,190 --> 00:18:38,650 OK, not lists. 254 00:18:38,650 --> 00:18:41,350 The usual thing is, I'm building them up using lists, 255 00:18:41,350 --> 00:18:44,470 because lists have append and arrays don't, and then I'm 256 00:18:44,470 --> 00:18:48,310 converting them to arrays so I can do matrix kinds of 257 00:18:48,310 --> 00:18:50,910 operations on them. 258 00:18:50,910 --> 00:18:54,210 So I'll get the distances and the forces. 259 00:18:54,210 --> 00:18:57,010 And then I'm just going to plot them, and we'll see what 260 00:18:57,010 --> 00:18:58,970 they look like. 261 00:18:58,970 --> 00:19:10,640 So let's do that. 262 00:19:10,640 --> 00:19:14,560 There they are. 263 00:19:14,560 --> 00:19:19,990 Now, if you believe Hooke's Law, you could look at this 264 00:19:19,990 --> 00:19:25,370 data, and maybe you wouldn't like it. 265 00:19:25,370 --> 00:19:29,330 Because Hooke's Law implies that, in fact, these points 266 00:19:29,330 --> 00:19:33,950 should lie in a straight line, right? 267 00:19:33,950 --> 00:19:44,690 If I just plug in values here, what am I going to get? 268 00:19:44,690 --> 00:19:46,020 A straight line, right? 269 00:19:46,020 --> 00:19:49,620 I'm just multiplying k times x. 270 00:19:49,620 --> 00:19:51,885 But I don't have a straight line, I have a little scatter 271 00:19:51,885 --> 00:19:54,740 of points, it kind of looks like a straight 272 00:19:54,740 --> 00:19:57,360 line, but it's not. 273 00:19:57,360 --> 00:19:59,250 And why do you think that's true? 274 00:19:59,250 --> 00:20:03,300 What's going on here? 275 00:20:03,300 --> 00:20:12,130 What could cause this line not to be straight? 276 00:20:12,130 --> 00:20:17,310 Have any of you ever done a physics experiment? 277 00:20:17,310 --> 00:20:21,320 And when you did it, did your results actually match the 278 00:20:21,320 --> 00:20:23,300 theory that your high school teacher, say, 279 00:20:23,300 --> 00:20:27,080 explained to you.
280 00:20:27,080 --> 00:20:30,590 No, and why not. 281 00:20:30,590 --> 00:20:35,230 Yeah, you have various kinds of experimental or measurement 282 00:20:35,230 --> 00:20:39,690 error, right? 283 00:20:39,690 --> 00:20:44,580 Because, when you're doing these experiments, at least 284 00:20:44,580 --> 00:20:47,390 I'm not perfect, and I suspect at least most of you are not 285 00:20:47,390 --> 00:20:50,200 perfect, you get mistakes. 286 00:20:50,200 --> 00:20:54,330 A little bit of error creeps in inevitably. 287 00:20:54,330 --> 00:20:57,830 And so, when we acquired this data, sure enough there was 288 00:20:57,830 --> 00:21:00,940 measurement error. 289 00:21:00,940 --> 00:21:04,080 And so the points are scattered around. 290 00:21:04,080 --> 00:21:06,720 This is something to be expected. 291 00:21:06,720 --> 00:21:13,100 Real data almost never matches the theory precisely. 292 00:21:13,100 --> 00:21:16,960 Because there usually is some sort of experimental error 293 00:21:16,960 --> 00:21:24,250 that creeps into things. 294 00:21:24,250 --> 00:21:28,370 So what should we do about that? 295 00:21:28,370 --> 00:21:32,050 Well, what usually people do, when they think about this, is 296 00:21:32,050 --> 00:21:36,685 they would look at this data and say, well, let me fit a 297 00:21:36,685 --> 00:21:37,730 line to this. 298 00:21:37,730 --> 00:21:43,240 Somehow, say, what would be the line that best 299 00:21:43,240 --> 00:21:47,580 approximates these points? 300 00:21:47,580 --> 00:21:51,540 And then the slope of that line would give 301 00:21:51,540 --> 00:21:57,570 me the spring constant. 302 00:21:57,570 --> 00:22:05,210 So that raises the next question, what do I mean by 303 00:22:05,210 --> 00:22:09,460 finding a line that best fits these points? 304 00:22:09,460 --> 00:22:26,260 How do we, fit, in this case, a line, to the data? 
305 00:22:26,260 --> 00:22:29,460 First of all, I should ask the question, why did I say let's 306 00:22:29,460 --> 00:22:31,020 fit a line? 307 00:22:31,020 --> 00:22:34,380 Maybe I should have said, let's fit a parabola, or let's 308 00:22:34,380 --> 00:22:38,580 fit a circle? 309 00:22:38,580 --> 00:22:45,530 Why should I have said let's fit a line? 310 00:22:45,530 --> 00:22:45,850 Yeah? 311 00:22:45,850 --> 00:22:49,546 STUDENT: [INAUDIBLE] 312 00:22:49,546 --> 00:22:51,580 PROFESSOR: Well, how do I know that the 313 00:22:51,580 --> 00:22:57,230 plot is a linear function? 314 00:22:57,230 --> 00:23:01,010 Pardon? 315 00:23:01,010 --> 00:23:04,860 Well, so, two things. 316 00:23:04,860 --> 00:23:08,260 One is, I had a theory. 317 00:23:08,260 --> 00:23:13,720 You know, I had up there a model, and my model suggested 318 00:23:13,720 --> 00:23:17,840 that I expected it to be linear. 319 00:23:17,840 --> 00:23:20,670 And so if I'm testing my model, I should try and fit a 320 00:23:20,670 --> 00:23:23,150 line, my theory, if you will. 321 00:23:23,150 --> 00:23:26,410 But also when I look at it, it looks kind of like a line. 322 00:23:26,410 --> 00:23:29,800 So you know, if I looked at it, and it didn't look like a 323 00:23:29,800 --> 00:23:34,820 line, I might have said, well, my model must be badly broken. 324 00:23:34,820 --> 00:23:38,960 So let's try and see if we can fit it. 325 00:23:38,960 --> 00:23:43,730 Whenever we try and fit something, we need some sort 326 00:23:43,730 --> 00:23:53,770 of an objective function that captures the 327 00:23:53,770 --> 00:23:56,120 goodness of a fit. 328 00:23:56,120 --> 00:23:59,680 I'm trying to find, this is an optimization problem of the 329 00:23:59,680 --> 00:24:01,910 sort that we've looked at before. 330 00:24:01,910 --> 00:24:06,070 I'm trying to find a line that optimizes 331 00:24:06,070 --> 00:24:10,310 some objective function.
332 00:24:10,310 --> 00:24:15,150 So a very simple objective function here, is called the 333 00:24:15,150 --> 00:24:24,390 least squares fit. 334 00:24:24,390 --> 00:24:37,610 I want to find the line that minimizes the sum of 335 00:24:37,610 --> 00:24:47,410 observation sub i, the i'th data point I have, minus what 336 00:24:47,410 --> 00:24:54,390 the line, the model, predicts that point should have been, 337 00:24:54,390 --> 00:24:59,740 and then I'll square it. 338 00:24:59,740 --> 00:25:03,160 So I want to minimize this value. 339 00:25:03,160 --> 00:25:07,490 I want to find the line that gives me the 340 00:25:07,490 --> 00:25:10,210 smallest value for this. 341 00:25:10,210 --> 00:25:12,610 Why do you think I'm squaring the difference? 342 00:25:12,610 --> 00:25:17,310 What would happen if I didn't square the difference? 343 00:25:17,310 --> 00:25:18,950 Yeah? 344 00:25:18,950 --> 00:25:24,670 Positive and negative errors might cancel each other out. 345 00:25:24,670 --> 00:25:28,760 And in judging the quality of the fit, I don't really care 346 00:25:28,760 --> 00:25:31,770 deeply -- you're going to get very fat the way you're 347 00:25:31,770 --> 00:25:34,410 collecting candy here -- 348 00:25:34,410 --> 00:25:37,530 I don't care deeply which side the error is 349 00:25:37,530 --> 00:25:39,860 on, just that it's wrong. 350 00:25:39,860 --> 00:25:43,070 And so by squaring it, it's kind of like taking the 351 00:25:43,070 --> 00:25:47,150 absolute value of the error, among other things. 352 00:25:47,150 --> 00:25:54,690 All right, so if we look at our example here, 353 00:25:54,690 --> 00:26:02,290 what would this be? 354 00:26:02,290 --> 00:26:07,250 I want to minimize, want to find a line that minimizes it. 355 00:26:07,250 --> 00:26:10,470 So how do I do that? 356 00:26:10,470 --> 00:26:13,930 I could easily do it using successive 357 00:26:13,930 --> 00:26:17,480 approximation, right?
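The least squares objective just described, the sum over points of (observed minus predicted) squared, can be written directly, together with one simple successive-approximation search over the slope (a sketch with made-up measurements; the interval-narrowing search shown here is just one way to do the optimization, and the lecture goes on to note that pylab does it for you):

```python
def sum_squared_error(slope, xs, ys):
    """Sum over all points of (observed - predicted)^2 for y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys))

def fit_slope(xs, ys, lo=0.0, hi=100.0, iterations=60):
    """Successive approximation: repeatedly narrow the interval that
    must contain the slope minimizing the sum of squared errors."""
    for _ in range(iterations):
        third = (hi - lo) / 3
        # The objective is convex in the slope, so comparing two interior
        # points tells us which end of the interval to discard.
        if sum_squared_error(lo + third, xs, ys) < sum_squared_error(hi - third, xs, ys):
            hi = hi - third
        else:
            lo = lo + third
    return (lo + hi) / 2

# Made-up measurements roughly following y = 31.5 * x, with a bit of noise.
xs = [0.1, 0.2, 0.3, 0.4, 0.5]
ys = [3.2, 6.2, 9.6, 12.5, 15.9]
```

Calling `fit_slope(xs, ys)` returns a slope close to the closed-form least squares answer, sum(x*y) / sum(x*x), for the same data.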
358 00:26:17,480 --> 00:26:20,230 I could choose a line, basically what I am, is I'm 359 00:26:20,230 --> 00:26:23,680 choosing a slope, here, right? 360 00:26:23,680 --> 00:26:27,690 And, I could, just like Newton Raphson, do successive 361 00:26:27,690 --> 00:26:34,880 approximation for awhile, and get the best fit. 362 00:26:34,880 --> 00:26:37,810 That's one way to do the optimization. 363 00:26:37,810 --> 00:26:41,940 It turns out that for this particular optimization 364 00:26:41,940 --> 00:26:44,790 there's something more efficient. 365 00:26:44,790 --> 00:26:48,070 You can actually, there is a closed form way of attacking 366 00:26:48,070 --> 00:26:52,380 this, and I could explain that, but in fact, I'll 367 00:26:52,380 --> 00:26:55,080 explain something even better. 368 00:26:55,080 --> 00:27:04,350 It's built into Pylab. 369 00:27:04,350 --> 00:27:14,900 So Pylab has a function built-in called polyfit. 370 00:27:14,900 --> 00:27:19,670 Which, given a set of points, finds the polynomial that 371 00:27:19,670 --> 00:27:21,660 gives you the best least squares 372 00:27:21,660 --> 00:27:28,790 approximation to those points. 373 00:27:28,790 --> 00:27:33,040 It's called polynomial because it isn't necessarily going to 374 00:27:33,040 --> 00:27:36,520 be first order, that is to say, a line. 375 00:27:36,520 --> 00:27:42,030 It can find polynomials of arbitrary degree. 376 00:27:42,030 --> 00:27:48,790 So let's look at the example here, we'll see how it works. 377 00:27:48,790 --> 00:27:59,410 So let me uncomment it. 378 00:27:59,410 --> 00:28:07,480 So I'm going to get k and b equals Pylab dot polyfit here. 379 00:28:07,480 --> 00:28:14,780 What it's going to do is, think about a polynomial. 380 00:28:14,780 --> 00:28:18,640 I give you a polynomial of degree one, you have all 381 00:28:18,640 --> 00:28:26,940 learned that it's a x plus b, b is the constant, and x is 382 00:28:26,940 --> 00:28:29,290 the single variable. 
383 00:28:29,290 --> 00:28:34,330 And so I multiply a by x and I add b to it, and as I vary x I 384 00:28:34,330 --> 00:28:36,130 get new values. 385 00:28:36,130 --> 00:28:47,090 And so polyfit, in this case, will take the set of points 386 00:28:47,090 --> 00:28:52,870 defined by these two arrays and return me a value for a 387 00:28:52,870 --> 00:28:57,230 and a value for b. 388 00:28:57,230 --> 00:29:05,290 Now here I've assigned a to k, but don't worry about that. 389 00:29:05,290 --> 00:29:11,930 And then, I'm gonna now generate the predictions that 390 00:29:11,930 --> 00:29:19,570 I would get from this k and b, and plot those. 391 00:29:19,570 --> 00:29:32,050 So let's look at it. 392 00:29:32,050 --> 00:29:39,970 So here it said the k is 31.475, etc., and it's plotted 393 00:29:39,970 --> 00:29:43,300 the line that it's found. 394 00:29:43,300 --> 00:29:45,590 Or I've plotted the line. 395 00:29:45,590 --> 00:29:48,320 You'll note, a lot of the points don't lie on the line, 396 00:29:48,320 --> 00:29:53,380 in fact, most of the points don't lie on the line. 397 00:29:53,380 --> 00:29:56,100 But it's asserting that this is the best it 398 00:29:56,100 --> 00:29:58,830 can do with the line. 399 00:29:58,830 --> 00:30:02,890 And there's some points, for example, up here, that are 400 00:30:02,890 --> 00:30:07,790 kind of outliers, that are pretty far from the line. 401 00:30:07,790 --> 00:30:11,560 But it has minimized the error, if you will, for all of 402 00:30:11,560 --> 00:30:15,620 the points it has. 403 00:30:15,620 --> 00:30:18,670 That's quite different from, say, finding the line that 404 00:30:18,670 --> 00:30:22,640 touches the most points, right? 405 00:30:22,640 --> 00:30:29,790 It's minimizing the sum of the errors. 406 00:30:29,790 --> 00:30:32,950 Now, given that I was just looking for a constant to 407 00:30:32,950 --> 00:30:40,930 start with, why did I bother even plotting the data? 
408 00:30:40,930 --> 00:30:43,490 I happen to have known before I did this that polyfit 409 00:30:43,490 --> 00:30:49,230 existed, and what I was really looking for was this line. 410 00:30:49,230 --> 00:30:51,450 So maybe I should have just done the polyfit and said 411 00:30:51,450 --> 00:30:55,710 here's k and I'm done. 412 00:30:55,710 --> 00:31:01,500 Would that have been a good idea? 413 00:31:01,500 --> 00:31:01,810 Yeah? 414 00:31:01,810 --> 00:31:06,292 STUDENT: You can't know without seeing the actual data 415 00:31:06,292 --> 00:31:10,276 how well it's actually fitting it. 416 00:31:10,276 --> 00:31:11,910 PROFESSOR: Right. 417 00:31:11,910 --> 00:31:12,790 Exactly right. 418 00:31:12,790 --> 00:31:15,090 That says, well how would I know that it was fitting it 419 00:31:15,090 --> 00:31:19,150 badly or well, and in fact, how would I know that my 420 00:31:19,150 --> 00:31:23,220 notion of the model is sound, or that my experiment isn't 421 00:31:23,220 --> 00:31:25,410 completely broken? 422 00:31:25,410 --> 00:31:31,720 So always, I think, always look at the real data. 423 00:31:31,720 --> 00:31:34,460 Don't just, I've seen too many papers where people show me 424 00:31:34,460 --> 00:31:38,180 the curve that fits the data, and don't show me the data, 425 00:31:38,180 --> 00:31:40,770 and it always makes me very nervous. 426 00:31:40,770 --> 00:31:44,880 So always look at the data, as well as however you're 427 00:31:44,880 --> 00:31:46,980 choosing to fit it. 428 00:31:46,980 --> 00:31:53,140 As an example of that, let's look at another set of inputs. 429 00:31:53,140 --> 00:32:05,090 This is not a spring. 430 00:32:05,090 --> 00:32:08,350 It's the same get data function as before, ignore 431 00:32:08,350 --> 00:32:13,720 that thing at the top. 432 00:32:13,720 --> 00:32:26,980 I'm going to analyze it and we'll look at it. 433 00:32:26,980 --> 00:32:33,340 So here I'm plotting the speed of something over time. 
434 00:32:33,340 --> 00:32:39,640 So I plotted it, and I've done a least squares fit using 435 00:32:39,640 --> 00:32:44,620 polyfit just as before to get a line, and I put the line vs. 436 00:32:44,620 --> 00:32:51,100 the data, and here I'm a little suspicious. 437 00:32:51,100 --> 00:32:55,930 Right, I fit a line, but when I look at it, I don't think 438 00:32:55,930 --> 00:32:59,180 it's a real good fit for the data. 439 00:32:59,180 --> 00:33:14,840 Somehow modeling this data as a line is probably not right. 440 00:33:14,840 --> 00:33:17,920 A linear model is not good for this data. 441 00:33:17,920 --> 00:33:20,210 This data is derived from something, a 442 00:33:20,210 --> 00:33:23,470 more complex process. 443 00:33:23,470 --> 00:33:27,120 So take a look at it, and tell me what order of 444 00:33:27,120 --> 00:33:29,650 polynomial do you think might fit this data? 445 00:33:29,650 --> 00:33:34,290 What shape does this look like to you? 446 00:33:34,290 --> 00:33:35,850 Pardon? 447 00:33:35,850 --> 00:33:36,100 STUDENT: Quadratic. 448 00:33:36,100 --> 00:33:40,140 PROFESSOR: Quadratic, because the shape is a what? 449 00:33:40,140 --> 00:33:41,830 It's a parabola. 450 00:33:41,830 --> 00:33:43,660 Well, I don't know if I dare try this one all 451 00:33:43,660 --> 00:33:45,680 the way to the back. 452 00:33:45,680 --> 00:33:50,400 Ooh, at least I didn't hurt anybody. 453 00:33:50,400 --> 00:33:54,470 All right, fortunately it's just as easy to fit a 454 00:33:54,470 --> 00:33:59,540 parabola as a line. 455 00:33:59,540 --> 00:34:06,470 So let's look down here. 456 00:34:06,470 --> 00:34:11,760 I've done the same thing, but instead of passing it one, as 457 00:34:11,760 --> 00:34:15,090 I did up here as the argument, I'm passing it two. 458 00:34:15,090 --> 00:34:18,630 Saying, instead of fitting a polynomial of degree one, fit 459 00:34:18,630 --> 00:34:21,510 a polynomial of degree two.
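The change from a line to a parabola is just the degree argument. A sketch with invented data standing in for the "speed over time" measurements the lecture reads from a file:

```python
import numpy as np

# Invented stand-in for the speed-vs-time data: an exact parabola.
times = np.arange(0.0, 10.0, 1.0)
speeds = 3.0 * times**2 - 2.0 * times + 1.0

# Degree 1 fits a line; degree 2 fits a parabola. Same call, different order.
line_coeffs = np.polyfit(times, speeds, 1)
quad_coeffs = np.polyfit(times, speeds, 2)  # [a, b, c] for a*x^2 + b*x + c

# polyval evaluates a fitted polynomial at the given points.
line_pred = np.polyval(line_coeffs, times)
quad_pred = np.polyval(quad_coeffs, times)
```

On data that is really quadratic, the degree-2 fit recovers the parabola, while the line leaves large residuals.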
460 00:34:21,510 --> 00:34:32,800 And now let's see what it looks like. 461 00:34:32,800 --> 00:34:39,000 Well, my eyes tell me this is a much better 462 00:34:39,000 --> 00:34:44,830 fit than the line. 463 00:34:44,830 --> 00:34:49,910 So again, that's why I wanted to see the scatter plot, so 464 00:34:49,910 --> 00:34:53,130 that I could at least look at it with my eyes, and say, 465 00:34:53,130 --> 00:34:58,120 yeah, this looks like a better fit. 466 00:34:58,120 --> 00:35:07,470 All right, any question about what's going on here? 467 00:35:07,470 --> 00:35:13,400 What we've been looking at is something called linear 468 00:35:13,400 --> 00:35:23,640 regression. 469 00:35:23,640 --> 00:35:30,370 It's called linear because the relationship of the dependent 470 00:35:30,370 --> 00:35:39,380 variable y to the independent variables is assumed to be a 471 00:35:39,380 --> 00:35:43,390 linear function of the parameters. 472 00:35:43,390 --> 00:35:47,350 It's not because it has to be a linear function of 473 00:35:47,350 --> 00:35:50,460 the value of x, OK? 474 00:35:50,460 --> 00:35:53,950 Because as you can see, we're not getting a line, we're 475 00:35:53,950 --> 00:35:56,360 getting a parabola. 476 00:35:56,360 --> 00:36:00,100 Don't worry about the details, the point I want to make is, 477 00:36:00,100 --> 00:36:03,500 people sometimes see the word linear regression and think it 478 00:36:03,500 --> 00:36:06,750 can only be used to find lines. 479 00:36:06,750 --> 00:36:11,780 It's not so. 480 00:36:11,780 --> 00:36:16,580 So when, for example, we did the quadratic, what we had is 481 00:36:16,580 --> 00:36:26,210 y equals a x squared plus b x plus c. 482 00:36:26,210 --> 00:36:30,590 The graph vs. x will not be a straight line, right, because 483 00:36:30,590 --> 00:36:34,810 I'm squaring x. 484 00:36:34,810 --> 00:36:43,780 But it is linear in the parameters, in this case, not in the single variable x.
485 00:36:43,780 --> 00:36:49,910 Now, when I looked at this, I said, all right, it's clear 486 00:36:49,910 --> 00:36:55,530 that the yellow curve is a better fit than the red. 487 00:36:55,530 --> 00:36:59,130 It's a red line. 488 00:36:59,130 --> 00:37:03,740 But that was a pretty informal statement. 489 00:37:03,740 --> 00:37:11,550 I can actually look at this much more formally. 490 00:37:11,550 --> 00:37:14,340 And we're going to look at something that the 491 00:37:14,340 --> 00:37:18,410 statisticians call r squared. 492 00:37:18,410 --> 00:37:26,750 Which in the case of a linear regression is the coefficient 493 00:37:26,750 --> 00:37:34,480 of determination. 494 00:37:34,480 --> 00:37:38,640 Now, this is a big fancy word for something that's actually 495 00:37:38,640 --> 00:37:41,510 pretty simple. 496 00:37:41,510 --> 00:37:44,630 So what r squared is going to be, and this is on your 497 00:37:44,630 --> 00:37:58,660 handout, is 1 minus EE over DV. So EE is going to be the 498 00:37:58,660 --> 00:38:02,200 errors in the estimation. 499 00:38:02,200 --> 00:38:06,630 So I've got some estimated values, some predicted values, 500 00:38:06,630 --> 00:38:11,810 if you will, given to me by the model, either the line or 501 00:38:11,810 --> 00:38:14,070 the parabola in this case. 502 00:38:14,070 --> 00:38:18,710 And I've got some real values, corresponding to each of those 503 00:38:18,710 --> 00:38:25,040 points, and I can look at the difference between the two. And 504 00:38:25,040 --> 00:38:29,340 that will tell me how much difference there is between 505 00:38:29,340 --> 00:38:35,720 the estimated data and the, well, between the predicted 506 00:38:35,720 --> 00:38:42,840 data and the measured data, in this case. 507 00:38:42,840 --> 00:38:49,160 And then I want to divide that by the variance in the 508 00:38:49,160 --> 00:38:51,390 measured data. 509 00:38:51,390 --> 00:38:59,230 The data variance.
510 00:38:59,230 --> 00:39:03,620 How broadly scattered the measured points are. 511 00:39:03,620 --> 00:39:10,530 And I'll do that by comparing the mean of the measured data, 512 00:39:10,530 --> 00:39:13,190 to the measured data. 513 00:39:13,190 --> 00:39:16,020 So I get the average value of the measured data, and I look 514 00:39:16,020 --> 00:39:21,690 at how different the points I measure are. 515 00:39:21,690 --> 00:39:26,330 So I just want to give this to you informally, because I really 516 00:39:26,330 --> 00:39:28,780 don't care if you understand all the math. 517 00:39:28,780 --> 00:39:32,300 What I do want you to understand, when someone tells 518 00:39:32,300 --> 00:39:37,310 you, here's the r squared value, is, informally, what it 519 00:39:37,310 --> 00:39:39,490 really is saying. 520 00:39:39,490 --> 00:39:47,150 It's attempting to capture the proportion of the response 521 00:39:47,150 --> 00:39:51,850 variation explained by the variables in the model. 522 00:39:51,850 --> 00:39:56,030 In this case, x. 523 00:39:56,030 --> 00:40:04,110 So you'll have some amount of variation that is explained by 524 00:40:04,110 --> 00:40:08,320 changing the values of the variables. 525 00:40:08,320 --> 00:40:11,170 So, actually, I'm going to give an example and then come 526 00:40:11,170 --> 00:40:12,960 back to it more informally. 527 00:40:12,960 --> 00:40:21,140 So if, for example, r squared were to equal 0.9, that would 528 00:40:21,140 --> 00:40:26,470 mean that approximately 90 percent of the variation in 529 00:40:26,470 --> 00:40:34,380 the data can be explained by the model. 530 00:40:34,380 --> 00:40:36,660 OK, so we have some amount of variation in the measured 531 00:40:36,660 --> 00:40:42,610 data, and if r squared is 0.9, it says that 90 percent can be 532 00:40:42,610 --> 00:40:49,290 explained by the model, and the other 10 percent cannot.
533 00:40:49,290 --> 00:40:54,640 Now, that other 10 percent could be experimental error, 534 00:40:54,640 --> 00:40:57,840 or it could be that, in fact, you need more 535 00:40:57,840 --> 00:41:00,550 variables in the model. 536 00:41:00,550 --> 00:41:05,440 That there are what are called lurking variables. 537 00:41:05,440 --> 00:41:09,530 I love this term. 538 00:41:09,530 --> 00:41:12,770 A lurking variable is something that actually 539 00:41:12,770 --> 00:41:18,860 affects the result, but is not reflected in the model. 540 00:41:18,860 --> 00:41:26,260 As we'll see a little bit later, this is a very 541 00:41:26,260 --> 00:41:29,320 important thing to worry about, when you're looking at 542 00:41:29,320 --> 00:41:32,530 experimental data and you're building models. 543 00:41:32,530 --> 00:41:36,370 So we see this, for example, in the medical literature, 544 00:41:36,370 --> 00:41:41,530 that they will do some experiment, and they'll say 545 00:41:41,530 --> 00:41:46,870 that this drug explains x, or has this effect. 546 00:41:46,870 --> 00:41:49,250 And the variables they are looking at are, say, the 547 00:41:49,250 --> 00:41:55,710 disease the patient has, and the age of the patient. 548 00:41:55,710 --> 00:42:00,700 Well, maybe the gender of the patient is also important, but 549 00:42:00,700 --> 00:42:04,860 it doesn't happen to be in the model. 550 00:42:04,860 --> 00:42:09,400 Now, if when they did a fit, it came out with 0.9, that 551 00:42:09,400 --> 00:42:12,980 says at worst case, the variables we didn't consider 552 00:42:12,980 --> 00:42:19,080 could cause a 10 percent error. 553 00:42:19,080 --> 00:42:23,880 But, that could be big, that could matter a lot. 554 00:42:23,880 --> 00:42:29,760 And so as you get farther from 1, you ought to get very 555 00:42:29,760 --> 00:42:33,460 worried about whether you actually have 556 00:42:33,460 --> 00:42:35,590 all the right variables.
557 00:42:35,590 --> 00:42:37,730 Now you might have the right variables, and the experiment 558 00:42:37,730 --> 00:42:42,410 was just not conducted well. But it's usually the case that the 559 00:42:42,410 --> 00:42:46,930 problem is not that, but that there are lurking variables. 560 00:42:46,930 --> 00:42:49,370 And we'll see examples of that. 561 00:42:49,370 --> 00:42:52,400 So, easier to read than the math, at least by me, easier 562 00:42:52,400 --> 00:43:07,940 to read than the math, is the implementation of r squared. 563 00:43:07,940 --> 00:43:12,510 So it takes measured and estimated values, and I get the diffs, the 564 00:43:12,510 --> 00:43:15,820 differences, between the estimated and the measured. 565 00:43:15,820 --> 00:43:19,080 These are both arrays, so I subtract one array from the 566 00:43:19,080 --> 00:43:20,970 other, and then I square it. 567 00:43:20,970 --> 00:43:24,000 Remember, this'll do an element-wise subtraction, and 568 00:43:24,000 --> 00:43:26,830 then square each element. 569 00:43:26,830 --> 00:43:32,560 Then I can get the mean, by dividing the sum of the array 570 00:43:32,560 --> 00:43:38,850 measured by the length of it. 571 00:43:38,850 --> 00:43:42,590 I can get the variance, which is the measured mean minus the 572 00:43:42,590 --> 00:43:46,060 measured value, again squared. 573 00:43:46,060 --> 00:43:53,590 And then I'll return 1 minus this. 574 00:43:53,590 --> 00:43:55,360 All right? 575 00:43:55,360 --> 00:43:59,710 So, just to make sure we sort of understand the code, and 576 00:43:59,710 --> 00:44:04,320 the theory here as well, what would we get if we had 577 00:44:04,320 --> 00:44:08,210 absolutely perfect prediction? 578 00:44:08,210 --> 00:44:11,930 So if every measured point actually fit on the curve 579 00:44:11,930 --> 00:44:19,200 predicted by our model, what would r squared return? 580 00:44:19,200 --> 00:44:24,980 So in this case, measured and estimated would be identical.
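A sketch of the rSquared function as it is described here, assuming `measured` and `estimated` are numpy arrays (the exact code on the handout may differ in details):

```python
import numpy as np

def r_squared(measured, estimated):
    # Element-wise subtraction, then square each element:
    # the squared errors of the estimation, summed (EE).
    EE = ((estimated - measured) ** 2).sum()
    # Mean of the measured data: sum of the array divided by its length.
    measured_mean = measured.sum() / float(len(measured))
    # Data variance (DV): measured mean minus each measured value, squared.
    DV = ((measured_mean - measured) ** 2).sum()
    return 1.0 - EE / DV
```

With a perfect prediction, EE is 0 and the function returns 1; the worse the fit, the closer the result drops toward 0.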
581 00:44:24,980 --> 00:44:30,940 What gets returned by this? 582 00:44:30,940 --> 00:44:32,940 Yeah, 1. 583 00:44:32,940 --> 00:44:38,720 Exactly right. 584 00:44:38,720 --> 00:44:43,480 Because when I compute it, it will turn out that the 585 00:44:43,480 --> 00:44:50,600 numerator will be 0, and 1 minus 0 is 1, right? 586 00:44:50,600 --> 00:44:55,900 Because the differences will be zero. 587 00:44:55,900 --> 00:44:59,340 OK? 588 00:44:59,340 --> 00:45:04,900 So I can use this, now, to actually get a notion of how 589 00:45:04,900 --> 00:45:08,480 good my fit is. 590 00:45:08,480 --> 00:45:13,130 So let's look at speed dot py again here, and now I'm going 591 00:45:13,130 --> 00:45:17,790 to uncomment these two things, where I'm going to, after I 592 00:45:17,790 --> 00:45:30,970 compute the fit, I'm going to then measure it. 593 00:45:30,970 --> 00:45:34,420 And you'll see here that the r squared value for the linear 594 00:45:34,420 --> 00:45:42,730 fit is 0.896, and for the quadratic fit is 0.973. 595 00:45:42,730 --> 00:45:47,990 So indeed, we get a much better fit here. 596 00:45:47,990 --> 00:45:51,550 So not only does our eye tell us we have a better fit, our 597 00:45:51,550 --> 00:45:55,210 more formal statistical measure tells us we have a 598 00:45:55,210 --> 00:45:57,690 better fit, and it tells us how good it is. 599 00:45:57,690 --> 00:46:02,490 It's not a perfect fit, but it's a pretty 600 00:46:02,490 --> 00:46:07,140 good fit, for sure. 601 00:46:07,140 --> 00:46:13,930 Now, interestingly enough, it isn't surprising that the 602 00:46:13,930 --> 00:46:20,770 quadratic fit is better than the linear fit. 603 00:46:20,770 --> 00:46:24,950 In fact, the mathematics of this should tell us it can 604 00:46:24,950 --> 00:46:28,620 never be worse. 605 00:46:28,620 --> 00:46:31,820 How do I know it can never be worse? 606 00:46:31,820 --> 00:46:35,640 That's just, never is a really strong word.
607 00:46:35,640 --> 00:46:38,720 How do I know that? 608 00:46:38,720 --> 00:46:42,980 Because, when I do the quadratic fit, if I had 609 00:46:42,980 --> 00:46:47,660 perfectly linear data, then this coefficient, whoops, not 610 00:46:47,660 --> 00:46:56,670 that coefficient, wrong, this coefficient, could be 0. 611 00:46:56,670 --> 00:47:01,800 So if I ask it to do a quadratic fit to linear data, 612 00:47:01,800 --> 00:47:06,120 and the data is truly perfectly linear, this coefficient will 613 00:47:06,120 --> 00:47:09,380 be 0, and my model will turn out to be the same as the 614 00:47:09,380 --> 00:47:12,880 linear model. 615 00:47:12,880 --> 00:47:19,950 So I will always get at least as good a fit. 616 00:47:19,950 --> 00:47:25,240 Now, does this mean that it's always better to use a higher 617 00:47:25,240 --> 00:47:27,990 order polynomial? 618 00:47:27,990 --> 00:47:38,710 The answer is no, and let's look at why. 619 00:47:38,710 --> 00:47:48,400 So here what I've done is, I've taken seven points, and 620 00:47:48,400 --> 00:47:54,470 I've generated, if you look at this line here, the y-values, 621 00:47:54,470 --> 00:47:57,070 for x in x vals, points dot append x 622 00:47:57,070 --> 00:48:00,520 plus some random number. 623 00:48:00,520 --> 00:48:04,230 So basically I've got something linear in x, but I'm 624 00:48:04,230 --> 00:48:08,620 perturbing, if you will, my data by some random value. 625 00:48:08,620 --> 00:48:11,930 Something between 0 and 1 is getting added to things. 626 00:48:11,930 --> 00:48:14,320 And I'm doing this so my points won't lie on a 627 00:48:14,320 --> 00:48:19,430 perfectly straight line. 628 00:48:19,430 --> 00:48:24,340 And then we'll try and fit a line to it. 629 00:48:24,340 --> 00:48:28,580 And also, just for fun, we'll try and fit a fifth order 630 00:48:28,580 --> 00:48:30,840 polynomial to it. 631 00:48:30,840 --> 00:48:40,500 And let's see what we get.
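The experiment just described can be sketched as follows; the variable names and the fixed seed are invented here so the sketch is reproducible:

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is reproducible

# Seven points, linear in x, each perturbed by a random value in [0, 1),
# so they don't lie on a perfectly straight line.
x_vals = np.arange(7.0)
y_vals = x_vals + np.random.random(7)

line = np.polyfit(x_vals, y_vals, 1)     # degree 1: the line
quintic = np.polyfit(x_vals, y_vals, 5)  # fifth order, just for fun

# On the fitted points themselves, the fifth-order fit can never do worse,
# since it could always set its higher coefficients to zero.
line_err = ((np.polyval(line, x_vals) - y_vals) ** 2).sum()
quintic_err = ((np.polyval(quintic, x_vals) - y_vals) ** 2).sum()
```
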
632 00:48:40,500 --> 00:48:44,170 Well, there's my line, and there's my fifth order 633 00:48:44,170 --> 00:48:45,570 polynomial. 634 00:48:45,570 --> 00:48:50,160 Neither is quite perfect, but which do you think looks like 635 00:48:50,160 --> 00:48:53,890 a closer fit? 636 00:48:53,890 --> 00:49:00,650 With your eye. 637 00:49:00,650 --> 00:49:04,960 Well, I would say the red line, the red curve, if you 638 00:49:04,960 --> 00:49:09,910 will, is a better fit, and sure enough if we look at the 639 00:49:09,910 --> 00:49:16,830 statistics, we'll see it's 0.99, as opposed to 0.978. 640 00:49:16,830 --> 00:49:21,890 So it's clearly a closer fit. 641 00:49:21,890 --> 00:49:30,140 But that raises the very important question: does 642 00:49:30,140 --> 00:49:36,850 closer equal better, or tighter, which is another word 643 00:49:36,850 --> 00:49:40,980 for closer? 644 00:49:40,980 --> 00:49:44,830 And the answer is no. 645 00:49:44,830 --> 00:49:49,140 It's a tighter fit, but it's not necessarily better, in the 646 00:49:49,140 --> 00:49:52,780 sense of more useful. 647 00:49:52,780 --> 00:49:55,400 Because one of the things I want to do when I build a 648 00:49:55,400 --> 00:49:56,930 model like this, is have something 649 00:49:56,930 --> 00:50:00,030 with predictive power. 650 00:50:00,030 --> 00:50:05,120 I don't really necessarily need a model to tell me where 651 00:50:05,120 --> 00:50:08,510 the points I've measured lie, because I have them. 652 00:50:08,510 --> 00:50:12,160 The whole purpose of the model is to give me some way to 653 00:50:12,160 --> 00:50:17,260 predict where unmeasured points would lie, where future 654 00:50:17,260 --> 00:50:19,330 points would lie. 655 00:50:19,330 --> 00:50:23,390 OK, I understand how the spring works, and I can guess 656 00:50:23,390 --> 00:50:26,620 where it would be if things I haven't had the time to 657 00:50:26,620 --> 00:50:31,410 measure, or the ability to measure. 
658 00:50:31,410 --> 00:50:38,080 So let's look at that. 659 00:50:38,080 --> 00:50:41,350 Let's see, where'd that figure go. 660 00:50:41,350 --> 00:50:47,720 It's lurking somewhere. 661 00:50:47,720 --> 00:50:54,950 All right, we'll just kill this for now. 662 00:50:54,950 --> 00:51:00,810 So let's generate some more points, and I'm going to use 663 00:51:00,810 --> 00:51:05,100 exactly the same algorithm. 664 00:51:05,100 --> 00:51:09,670 But I'm going to generate twice as many points. 665 00:51:09,670 --> 00:51:14,600 But I'm only fitting it to the first half. 666 00:51:14,600 --> 00:51:24,990 So if I run this one, figure one is what 667 00:51:24,990 --> 00:51:26,910 we looked at before. 668 00:51:26,910 --> 00:51:29,900 The red line is fitting them a little better. 669 00:51:29,900 --> 00:51:33,460 But here's figure two. 670 00:51:33,460 --> 00:51:37,370 What happens when I extrapolate the curve to the 671 00:51:37,370 --> 00:51:39,500 new points? 672 00:51:39,500 --> 00:51:43,780 Well, you can see, it's a terrible fit. 673 00:51:43,780 --> 00:51:46,820 And you would expect that, because my data was basically 674 00:51:46,820 --> 00:51:52,440 linear, and I fit a non-linear curve to it. 675 00:51:52,440 --> 00:51:56,780 And if you look at it you can see that, OK, look at this, to 676 00:51:56,780 --> 00:51:59,790 get from here to here, it thought I had to take off 677 00:51:59,790 --> 00:52:02,540 pretty sharply. 678 00:52:02,540 --> 00:52:06,270 And so sure enough, as I get new points, the prediction 679 00:52:06,270 --> 00:52:11,240 will postulate that it's still going up, much more steeply 680 00:52:11,240 --> 00:52:14,300 than it really does. 681 00:52:14,300 --> 00:52:18,350 So you can see it's a terrible prediction. 682 00:52:18,350 --> 00:52:28,510 And that's because what I've done is, I over-fit the data.
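The extrapolation experiment can be sketched the same way: generate twice as many points from the same roughly linear process, fit only the first half, and compare the errors on the unseen second half (again with invented names and a fixed seed):

```python
import numpy as np

np.random.seed(0)

# Twice as many points from the same roughly linear process.
x_all = np.arange(14.0)
y_all = x_all + np.random.random(14)

# Fit only the first half.
x_fit, y_fit = x_all[:7], y_all[:7]
line = np.polyfit(x_fit, y_fit, 1)
quintic = np.polyfit(x_fit, y_fit, 5)

# Extrapolate both models to the second half and compare the errors.
x_new, y_new = x_all[7:], y_all[7:]
line_err = ((np.polyval(line, x_new) - y_new) ** 2).sum()
quintic_err = ((np.polyval(quintic, x_new) - y_new) ** 2).sum()
# The over-fit quintic takes off sharply past the fitted range,
# so its error on the new points dwarfs the line's.
```
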
683 00:52:28,510 --> 00:52:32,430 I've taken a very high degree polynomial, which has given me 684 00:52:32,430 --> 00:52:36,860 a good close fit, and I can always get a fit, by the way. 685 00:52:36,860 --> 00:52:40,020 If I choose a high enough degree polynomial, I can fit 686 00:52:40,020 --> 00:52:43,810 lots and lots of data sets. 687 00:52:43,810 --> 00:52:47,050 But I have reason to be very suspicious. 688 00:52:47,050 --> 00:52:49,950 The fact that I took a fifth order polynomial to get six 689 00:52:49,950 --> 00:52:57,720 points should make me very nervous. 690 00:52:57,720 --> 00:52:59,980 And it's a very important moral. 691 00:52:59,980 --> 00:53:01,920 Beware of over-fitting. 692 00:53:01,920 --> 00:53:08,790 If you have a very complex model, there's a good chance 693 00:53:08,790 --> 00:53:12,810 that it's over-fit. 694 00:53:12,810 --> 00:53:18,790 The larger moral is, beware of statistics without any theory. 695 00:53:18,790 --> 00:53:21,670 You're just cranking away, you get a great r squared, you say 696 00:53:21,670 --> 00:53:23,850 it's a beautiful fit. 697 00:53:23,850 --> 00:53:26,140 But there was no real theory there. 698 00:53:26,140 --> 00:53:29,280 You can always find a fit. 699 00:53:29,280 --> 00:53:32,340 As Disraeli is alleged to have said, there are three kinds of 700 00:53:32,340 --> 00:53:38,120 lies: lies, damned lies, and statistics. 701 00:53:38,120 --> 00:53:41,570 And we'll spend some more time when we get back from 702 00:53:41,570 --> 00:53:44,480 Thanksgiving looking at how to lie with statistics. 703 00:53:44,480 --> 00:53:46,580 Have a great holiday, everybody.