1
00:00:00,530 --> 00:00:02,960
The following content is
provided under a Creative

2
00:00:02,960 --> 00:00:04,370
Commons license.

3
00:00:04,370 --> 00:00:07,410
Your support will help MIT
OpenCourseWare continue to

4
00:00:07,410 --> 00:00:11,060
offer high-quality educational
resources for free.

5
00:00:11,060 --> 00:00:13,960
To make a donation or view
additional materials from

6
00:00:13,960 --> 00:00:19,790
hundreds of MIT courses, visit
MIT OpenCourseWare at

7
00:00:19,790 --> 00:00:21,040
ocw.mit.edu.

8
00:00:22,775 --> 00:00:24,130
PROFESSOR: All right.

9
00:00:24,130 --> 00:00:27,580
So we've got three main
topics to talk about.

10
00:00:27,580 --> 00:00:29,320
One is distributions.

11
00:00:29,320 --> 00:00:30,980
The other is Monte
Carlo methods.

12
00:00:30,980 --> 00:00:33,340
And one is on regression.

13
00:00:33,340 --> 00:00:38,720
So for distributions, which
distributions have we learned

14
00:00:38,720 --> 00:00:39,970
about in class?

15
00:00:43,336 --> 00:00:44,340
Hmm?

16
00:00:44,340 --> 00:00:46,340
AUDIENCE: Normal.

17
00:00:46,340 --> 00:00:46,720
PROFESSOR: OK.

18
00:00:46,720 --> 00:00:47,970
So we have normal.

19
00:00:51,990 --> 00:00:53,382
What's another one?

20
00:00:53,382 --> 00:00:54,326
AUDIENCE: Uniform.

21
00:00:54,326 --> 00:00:55,576
PROFESSOR: OK.

22
00:00:59,530 --> 00:01:01,160
And there's one more
that he's kind of

23
00:01:01,160 --> 00:01:03,542
mentioned, I think, in passing.

24
00:01:03,542 --> 00:01:04,760
AUDIENCE: Exponential?

25
00:01:04,760 --> 00:01:05,780
PROFESSOR: Yes.

26
00:01:05,780 --> 00:01:07,030
So exponential.

27
00:01:13,430 --> 00:01:17,010
So for uniform, what would this
look like if I were to

28
00:01:17,010 --> 00:01:23,315
plot this as a histogram, and
I have endpoints A and B?

29
00:01:26,358 --> 00:01:30,354
Someone clue me in?

30
00:01:30,354 --> 00:01:30,852
Hmm?

31
00:01:30,852 --> 00:01:32,350
AUDIENCE: [INAUDIBLE]
straight line.

32
00:01:32,350 --> 00:01:34,140
PROFESSOR: So it's going to be
a horizontal line, right?

33
00:01:37,790 --> 00:01:39,200
And if we were to look
at the function for

34
00:01:39,200 --> 00:01:40,470
this, it would be--

35
00:01:46,890 --> 00:01:52,430
the probability would be 1 over
b minus a for all points

36
00:01:52,430 --> 00:01:54,690
between a and b.

37
00:01:54,690 --> 00:01:57,335
So let's look at this
graphically.

38
00:02:02,160 --> 00:02:05,010
So this chunk of code should
not be too difficult to

39
00:02:05,010 --> 00:02:07,280
understand at this
point, right?

40
00:02:07,280 --> 00:02:09,639
All we're doing is we're
using the random

41
00:02:09,639 --> 00:02:11,580
number generator, randint.

42
00:02:11,580 --> 00:02:14,720
It's going to return us an
integer, random integer from a

43
00:02:14,720 --> 00:02:17,410
uniform distribution
between a and b.

44
00:02:17,410 --> 00:02:20,810
Is there anyone that's
puzzled by that?

45
00:02:20,810 --> 00:02:21,640
All right.

46
00:02:21,640 --> 00:02:24,280
We're going to do that for
numpoints, and then we're

47
00:02:24,280 --> 00:02:25,810
going to plot a histogram.

48
00:02:25,810 --> 00:02:27,980
The only parameter that I don't
think you've seen here

49
00:02:27,980 --> 00:02:29,230
is this normed=True.

50
00:02:32,100 --> 00:02:37,460
What this does is, normally,
when you use the hist command

51
00:02:37,460 --> 00:02:41,260
in Python, it's going to give
you raw frequency counts on

52
00:02:41,260 --> 00:02:42,130
the y-axis.

53
00:02:42,130 --> 00:02:45,380
What normed=True does is it
gives you the proportion of

54
00:02:45,380 --> 00:02:47,890
the points that wound up
in a particular bin.
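
The sampling step described above can be sketched as follows. This is a minimal illustration, not the code shown in class: the names `sample_uniform_ints` and `bin_proportions` are hypothetical, and the proportions are computed by hand rather than with a plotting library. As far as I know, `normed=True` belongs to older matplotlib releases and newer ones use `density=True`, which also divides by bin width, so bar heights equal plain proportions only when each bin is one unit wide.

```python
import random
from collections import Counter


def sample_uniform_ints(a, b, num_points, seed=None):
    """Draw num_points integers uniformly from a..b (both ends included)."""
    rng = random.Random(seed)  # seeded generator so runs are repeatable
    return [rng.randint(a, b) for _ in range(num_points)]


def bin_proportions(samples):
    """Return {value: fraction of samples equal to that value}."""
    counts = Counter(samples)  # raw frequency counts, one "bin" per integer
    total = len(samples)
    return {value: count / total for value, count in counts.items()}


# 100 possible values and 100,000 points: each proportion should sit near 0.01.
samples = sample_uniform_ints(1, 100, 100_000, seed=0)
proportions = bin_proportions(samples)
```

The raw counts correspond to the un-normed histogram; dividing by the number of points gives the normed view.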

55
00:02:47,890 --> 00:02:52,350
So I can actually show
you both ways.

56
00:02:56,430 --> 00:02:59,400
So does that look about right?

57
00:02:59,400 --> 00:03:05,770
For 100 bins, got, what,
100,000 points?

58
00:03:05,770 --> 00:03:11,620
Each one has about 0.01, so it
looks right, 1% in each bin.

59
00:03:11,620 --> 00:03:13,330
So that was normed.

60
00:03:13,330 --> 00:03:14,580
If we do it un-normed--

61
00:03:21,990 --> 00:03:24,620
see how the y-axis
here has changed?

62
00:03:24,620 --> 00:03:29,230
Before, it was from like
0 to 0.12, or [? 0-1 ?]

63
00:03:29,230 --> 00:03:33,060
Now, it's from 0
to like 1,000.

64
00:03:33,060 --> 00:03:34,870
That's all that the normed
parameter does.

65
00:03:34,870 --> 00:03:38,150
But this is what we
would expect.

66
00:03:38,150 --> 00:03:39,380
This is for integers.

67
00:03:39,380 --> 00:03:46,700
And then, of course, Python also
has a way of doing it for

68
00:03:46,700 --> 00:03:47,880
floating point.

69
00:03:47,880 --> 00:03:51,210
So here, we are going to use
the uniform command.

70
00:03:51,210 --> 00:03:54,650
And then when I say, show
continuous uniform, going to

71
00:03:54,650 --> 00:03:59,920
give it the a and
b, 0 and 1.0.

72
00:03:59,920 --> 00:04:03,660
And it's really not going to
look all that much different.

73
00:04:03,660 --> 00:04:08,265
It's just that the x-axis
is from 0 to 1.

74
00:04:15,440 --> 00:04:15,632
OK.

75
00:04:15,632 --> 00:04:19,329
So uniform is easy.

76
00:04:19,329 --> 00:04:21,594
What does a Gaussian look
like, or a normal?

77
00:04:27,710 --> 00:04:31,880
Like if I were to plot it, what
should this look like?

78
00:04:31,880 --> 00:04:33,700
AUDIENCE: Bell curve.

79
00:04:33,700 --> 00:04:36,040
PROFESSOR: OK, it'll
be a bell curve.

80
00:04:36,040 --> 00:04:37,661
Where is its peak going to be?

81
00:04:37,661 --> 00:04:39,425
AUDIENCE: Exactly
in the middle?

82
00:04:39,425 --> 00:04:40,310
AUDIENCE: At the mean.

83
00:04:40,310 --> 00:04:40,850
PROFESSOR: At the mean.

84
00:04:40,850 --> 00:04:42,200
Thank you.

85
00:04:42,200 --> 00:04:44,190
So the peak is going
to be at the mean.

86
00:04:44,190 --> 00:04:46,430
We usually denote it with mu.

87
00:04:46,430 --> 00:04:49,540
And then it's going to fall
off asymmetrically or

88
00:04:49,540 --> 00:04:52,370
symmetrically off
on either side?

89
00:04:52,370 --> 00:04:53,620
Symmetrically.

90
00:04:57,330 --> 00:05:01,200
Now, a Gaussian can be specified
fully using two

91
00:05:01,200 --> 00:05:01,820
parameters.

92
00:05:01,820 --> 00:05:04,730
What are they?

93
00:05:04,730 --> 00:05:07,920
You have one here, and then you
have standard deviation.

94
00:05:07,920 --> 00:05:12,976
So mean and sigma.

95
00:05:17,040 --> 00:05:20,020
Now, the function for this is
not something you're going to

96
00:05:20,020 --> 00:05:20,530
have to know.

97
00:05:20,530 --> 00:05:24,770
But I wanted to show
it to you.

98
00:05:24,770 --> 00:05:26,810
And the stats major can correct
me if I'm wrong.
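
The formula written on the board is not captured in the transcript. Assuming it is the standard normal density with mean $\mu$ and standard deviation $\sigma$, it would read:

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}
  \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```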

99
00:05:45,580 --> 00:05:47,760
So it might be a little scary.

100
00:05:47,760 --> 00:05:48,300
I don't know.

101
00:05:48,300 --> 00:05:50,130
It intimidated me the
first time I saw it.

102
00:05:50,130 --> 00:05:51,540
Does that look about
right to you?

103
00:05:51,540 --> 00:05:51,830
AUDIENCE: Yes.

104
00:05:51,830 --> 00:05:52,830
PROFESSOR: All right.

105
00:05:52,830 --> 00:05:56,170
So the reason why I threw that
out there is because what I

106
00:05:56,170 --> 00:06:03,380
want to do is show you the ideal
form when we plot out

107
00:06:03,380 --> 00:06:07,960
this function, versus a bunch of
random samples we've drawn

108
00:06:07,960 --> 00:06:10,280
from a distribution
that is Gaussian.

109
00:06:15,200 --> 00:06:19,580
So, I have a function
make Gaussian plot.

110
00:06:19,580 --> 00:06:23,240
All it takes is the mean,
standard deviation, how many

111
00:06:23,240 --> 00:06:26,480
points we want to draw from
the distribution.

112
00:06:26,480 --> 00:06:28,720
And then I have a parameter
here, show ideal.

113
00:06:28,720 --> 00:06:33,520
And we'll get to that
in a second.

114
00:06:33,520 --> 00:06:38,160
The function that we use
is called random.gauss.

115
00:06:38,160 --> 00:06:41,410
And it just takes a mean and
the standard deviation.

116
00:06:44,930 --> 00:06:48,860
We're also going to compute
the ideal points.

117
00:06:48,860 --> 00:06:54,010
So if I take the mean, and
I go a couple of standard

118
00:06:54,010 --> 00:06:58,590
deviations in either direction
on the x-axis, then I can plot

119
00:06:58,590 --> 00:07:01,270
out what the y should be
according to this function

120
00:07:01,270 --> 00:07:06,690
here, and then just
do a histogram.

121
00:07:09,710 --> 00:07:12,030
If I want to show this
plot, that's what

122
00:07:12,030 --> 00:07:13,920
that parameter controls.

123
00:07:13,920 --> 00:07:15,940
It'll plot out the function.

124
00:07:15,940 --> 00:07:19,120
And if not, then it'll just
plot the histogram.
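
A rough sketch of the two ingredients just described, the random samples and the ideal curve, might look like this. The function names and the `num_std_devs` and `steps` parameters are assumptions made for illustration; only `random.gauss` comes from the standard library, and plotting is left out.

```python
import math
import random


def gaussian_pdf(x, mean, std_dev):
    """Ideal Gaussian density evaluated at x."""
    coefficient = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    exponent = -((x - mean) ** 2) / (2.0 * std_dev ** 2)
    return coefficient * math.exp(exponent)


def sample_gaussian(mean, std_dev, num_points, seed=None):
    """Draw num_points random samples from a Gaussian distribution."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std_dev) for _ in range(num_points)]


def ideal_points(mean, std_dev, num_std_devs=4, steps=200):
    """x and y values of the ideal curve, a few standard deviations either side of the mean."""
    x_lo = mean - num_std_devs * std_dev
    width = 2 * num_std_devs * std_dev
    xs = [x_lo + width * i / steps for i in range(steps + 1)]
    ys = [gaussian_pdf(x, mean, std_dev) for x in xs]
    return xs, ys
```

A normed histogram of `sample_gaussian(0, 1, 100_000)` should then sit close to the curve returned by `ideal_points(0, 1)`.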

125
00:07:19,120 --> 00:07:21,340
So let's see what this looks
like with just the histogram.

126
00:07:34,370 --> 00:07:37,290
So it looks like what
we would expect.

127
00:07:37,290 --> 00:07:38,610
We have the nice bell shape.

128
00:07:38,610 --> 00:07:42,350
It's centered at 0, and it's got
a standard deviation of 1.

129
00:07:46,760 --> 00:07:49,680
These are the relative
frequencies of a random

130
00:07:49,680 --> 00:07:54,150
sampling of points from a
Gaussian distribution.

131
00:07:54,150 --> 00:07:59,950
And we can see that if we look
at the ideal version or the

132
00:07:59,950 --> 00:08:14,595
actual function, it matches
very closely.

133
00:08:20,200 --> 00:08:25,850
And then for various shapes,
standard deviation of 2,

134
00:08:25,850 --> 00:08:28,520
different mean, different
standard deviation.

135
00:08:28,520 --> 00:08:31,500
So it's pretty easy, right?

136
00:08:31,500 --> 00:08:34,200
Are there any questions on
Gaussian distributions or

137
00:08:34,200 --> 00:08:35,450
normal distributions?

138
00:08:45,760 --> 00:08:45,790
OK.

139
00:08:45,790 --> 00:08:48,410
So, the last one we have--

140
00:08:48,410 --> 00:08:49,660
AUDIENCE: [INAUDIBLE].

141
00:08:52,170 --> 00:08:53,030
PROFESSOR: Oh --

142
00:08:53,030 --> 00:08:55,060
frange is a custom function.

143
00:08:55,060 --> 00:08:56,890
So we actually define
it up here.

144
00:08:56,890 --> 00:08:57,320
AUDIENCE: [INAUDIBLE].

145
00:08:57,320 --> 00:09:01,640
PROFESSOR: Was kind of hoping
I could slip that past you.

146
00:09:01,640 --> 00:09:05,930
It's just like range, except
instead of integers, it

147
00:09:05,930 --> 00:09:09,010
returns a list of floating point
numbers separated by

148
00:09:09,010 --> 00:09:10,270
step argument.

149
00:09:10,270 --> 00:09:18,400
So it starts at a lower-end
range start, and stops at the

150
00:09:18,400 --> 00:09:24,950
stop, and then increments by
step, until it returns a bunch

151
00:09:24,950 --> 00:09:26,200
of floating point numbers.
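
The actual definition of `frange` is not shown in the transcript. One possible version, assuming that `stop` is excluded (as with `range`) and that `step` is positive, is:

```python
def frange(start, stop, step):
    """Like range, but returns a list of floats; stop is excluded (an assumption)."""
    if step <= 0:
        raise ValueError("step must be positive in this sketch")
    values = []
    i = 0
    # Compute start + i * step each time instead of adding step repeatedly,
    # which avoids accumulating floating point error.
    while start + i * step < stop:
        values.append(start + i * step)
        i += 1
    return values
```

For example, `frange(0.0, 1.0, 0.25)` gives the four values 0.0, 0.25, 0.5 and 0.75.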

152
00:09:37,240 --> 00:09:39,710
The last one is the exponential
distribution.

153
00:09:39,710 --> 00:09:42,490
And I don't know-- did he really
explain what the shape

154
00:09:42,490 --> 00:09:47,350
looked like for this at all?

155
00:09:47,350 --> 00:09:50,740
So we can go really quickly
through it, because it doesn't

156
00:09:50,740 --> 00:09:55,850
sound like he actually expects you
to know it too deeply.

157
00:09:55,850 --> 00:09:59,250
Basically, it'll look like that.

158
00:09:59,250 --> 00:10:01,270
And the function is--

159
00:10:13,410 --> 00:10:14,840
you don't need to know it.

160
00:10:14,840 --> 00:10:16,865
It's just there for
your edification.

161
00:10:21,030 --> 00:10:24,400
Lambda is greater than 0.
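
The function on the board is not captured in the transcript. Assuming it is the standard exponential density with rate $\lambda$, it would be:

```latex
f(x \mid \lambda) = \lambda e^{-\lambda x}, \qquad x \ge 0, \quad \lambda > 0
```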

162
00:10:24,400 --> 00:10:27,340
So I'm just going to show you
what it looks like, and then

163
00:10:27,340 --> 00:10:28,590
we'll move on.

164
00:10:38,990 --> 00:10:44,250
So here, the blue are the sample
points, and the red is

165
00:10:44,250 --> 00:10:45,500
the ideal curve.

166
00:10:47,940 --> 00:10:50,559
Just different values
of lambda.

167
00:10:50,559 --> 00:10:52,924
AUDIENCE: Does it always have a
downward slope like that for

168
00:10:52,924 --> 00:10:55,290
it to be exponential?

169
00:10:55,290 --> 00:10:58,940
PROFESSOR: Yeah, in this case.

170
00:10:58,940 --> 00:11:02,180
There's another family of
distributions that we're not

171
00:11:02,180 --> 00:11:03,430
going to touch on.

172
00:11:07,310 --> 00:11:12,660
But that is that for
distributions for today.

173
00:11:12,660 --> 00:11:14,110
Unless anyone has any
questions, I'm

174
00:11:14,110 --> 00:11:17,150
going to move on.

175
00:11:17,150 --> 00:11:19,310
OK.

176
00:11:19,310 --> 00:11:28,950
So the next big topic is
Monte Carlo methods.

177
00:11:28,950 --> 00:11:34,670
So can someone give me an
informal definition of what a

178
00:11:34,670 --> 00:11:36,265
Monte Carlo method is?

179
00:11:40,971 --> 00:11:46,350
AUDIENCE: Really roughly, is
it based on using a random

180
00:11:46,350 --> 00:11:48,900
method to try to approximate
something that's not random,

181
00:11:48,900 --> 00:11:52,370
by doing it many,
many times over?

182
00:11:52,370 --> 00:11:53,550
PROFESSOR: Yeah, more or less.

183
00:11:53,550 --> 00:11:57,470
It's trying to arrive at a
solution by repeated sampling,

184
00:11:57,470 --> 00:11:58,720
or random sampling.

185
00:12:00,950 --> 00:12:05,040
And we've seen many different
applications of this.

186
00:12:05,040 --> 00:12:10,240
But we're going to review them
and kind of try and get a

187
00:12:10,240 --> 00:12:11,820
better understanding.

188
00:12:11,820 --> 00:12:15,840
So the Monty Hall problem.

189
00:12:15,840 --> 00:12:18,670
This is a Monte Carlo
simulation.

190
00:12:18,670 --> 00:12:23,582
So, one, what's the action that
a person should take?

191
00:12:23,582 --> 00:12:24,430
AUDIENCE: [INAUDIBLE].

192
00:12:24,430 --> 00:12:24,800
PROFESSOR: All right.

193
00:12:24,800 --> 00:12:27,430
And does anyone remember what
proportion of the time if they

194
00:12:27,430 --> 00:12:28,754
switch they won?

195
00:12:28,754 --> 00:12:29,640
AUDIENCE: 2/3.

196
00:12:29,640 --> 00:12:31,700
PROFESSOR: Two-thirds, OK.

197
00:12:31,700 --> 00:12:34,770
So I happen to know
this works--

198
00:12:37,470 --> 00:12:38,934
maybe.

199
00:12:38,934 --> 00:12:41,419
I think my program died.

200
00:12:48,380 --> 00:12:52,640
OK, so it works.

201
00:12:52,640 --> 00:12:57,710
Is this code confusing
to anyone or cryptic?

202
00:12:57,710 --> 00:13:00,420
I tried to make it a little bit
simpler than the code that

203
00:13:00,420 --> 00:13:01,770
was in the handout for class.

204
00:13:06,020 --> 00:13:07,620
We have a number of trials.

205
00:13:07,620 --> 00:13:10,750
We're going to pick a
door for the prize.

206
00:13:10,750 --> 00:13:12,170
The player's going
to choose a door.

207
00:13:14,820 --> 00:13:18,720
If they choose to stay, and the
prize is in the door that

208
00:13:18,720 --> 00:13:21,400
they chose, then stay wins.

209
00:13:21,400 --> 00:13:27,680
And if they choose to switch,
and the prize door is not the

210
00:13:27,680 --> 00:13:31,060
door that they originally
chose, then switch wins.
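
The simulation just described can be sketched as below. This is not the handout code or the simplified version shown on screen; the name `simulate_monty_hall` and the `seed` parameter are assumptions. It relies on the fact stated above: staying wins exactly when the first choice is the prize door, and switching wins otherwise.

```python
import random


def simulate_monty_hall(num_trials, seed=None):
    """Return (fraction of trials stay wins, fraction of trials switch wins)."""
    rng = random.Random(seed)
    doors = [1, 2, 3]
    stay_wins = 0
    switch_wins = 0
    for _ in range(num_trials):
        prize_door = rng.choice(doors)   # where the prize is hidden
        chosen_door = rng.choice(doors)  # the player's first pick
        if chosen_door == prize_door:
            stay_wins += 1               # staying keeps the prize
        else:
            switch_wins += 1             # Monty reveals a goat, so switching lands on the prize
    return stay_wins / num_trials, switch_wins / num_trials
```

With a large number of trials, the second value should come out close to 2/3.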

211
00:13:34,447 --> 00:13:36,350
So it's easy.

212
00:13:36,350 --> 00:13:40,380
What I wanted to try and do
is look at an intuitive

213
00:13:40,380 --> 00:13:41,730
explanation for this.

214
00:13:45,120 --> 00:13:47,870
At office hours, we were kicking
around different ways

215
00:13:47,870 --> 00:13:49,200
of explaining this.

216
00:13:49,200 --> 00:13:53,550
And we went to Wikipedia, and
we found this explanation.

217
00:13:53,550 --> 00:13:59,700
So the idea is let's say
that the contestant

218
00:13:59,700 --> 00:14:01,130
chooses door One.

219
00:14:01,130 --> 00:14:04,890
So there's a 1/3 probability
that they've chosen the door

220
00:14:04,890 --> 00:14:07,210
that has the prize behind it.

221
00:14:07,210 --> 00:14:10,860
And then there's a 1/3
probability that it's behind

222
00:14:10,860 --> 00:14:13,120
door number Two, 1/3 probability
it's behind door

223
00:14:13,120 --> 00:14:14,790
number Three.

224
00:14:14,790 --> 00:14:18,080
The key to this kind of
explanation is that if you

225
00:14:18,080 --> 00:14:21,080
consider both Two and Three
together, then there's a 2/3

226
00:14:21,080 --> 00:14:25,600
probability that the prize is
behind one of those two doors.

227
00:14:28,880 --> 00:14:31,210
So the player chooses, and
then Monty opens a door.

228
00:14:31,210 --> 00:14:34,390
There's a goat behind
door number Three.

229
00:14:34,390 --> 00:14:37,140
This new knowledge doesn't
change, though, the

230
00:14:37,140 --> 00:14:41,180
probability that you chose
the correct door.

231
00:14:41,180 --> 00:14:45,520
So you still have 1/3 chance
that One was the correct door.

232
00:14:45,520 --> 00:14:49,300
And there's still 2/3
chance on this side.

233
00:14:49,300 --> 00:14:52,100
But you know this one is 0,
because you see the goat.

234
00:14:52,100 --> 00:14:57,542
So this door has to have a 2/3 chance
of having the prize.

235
00:14:57,542 --> 00:14:58,960
Does that agree with you?

236
00:15:01,550 --> 00:15:04,960
So it's one way of
explaining it.

237
00:15:04,960 --> 00:15:05,370
I don't know.

238
00:15:05,370 --> 00:15:09,950
I had problems getting
this into my head.

239
00:15:09,950 --> 00:15:12,170
Does anyone want me
to try again?

240
00:15:12,170 --> 00:15:14,066
All right.

241
00:15:14,066 --> 00:15:15,316
AUDIENCE: [INAUDIBLE]

242
00:15:17,538 --> 00:15:22,498
two doors the probability that
your goat is going to be

243
00:15:22,498 --> 00:15:25,308
[INAUDIBLE] behind the door you
chose [INAUDIBLE], so it's

244
00:15:25,308 --> 00:15:26,820
basically the same
[INAUDIBLE]?

245
00:15:26,820 --> 00:15:29,490
PROFESSOR: Same idea, but kind
of negating it, and thinking

246
00:15:29,490 --> 00:15:31,073
of it from the negative
direction.

247
00:15:34,900 --> 00:15:37,780
Another explanation that was
good was if you had a million

248
00:15:37,780 --> 00:15:43,670
doors, and you had 999,999
goats, and you had one prize,

249
00:15:43,670 --> 00:15:45,000
you have a one in a
million chance of

250
00:15:45,000 --> 00:15:46,510
choosing the right door.

251
00:15:46,510 --> 00:15:49,770
So now imagine Monty walking
down and opening up

252
00:15:49,770 --> 00:15:54,400
999,998 doors, each with
a goat behind it.

253
00:15:54,400 --> 00:15:57,560
Well, now you have your door
that's still closed, and the

254
00:15:57,560 --> 00:16:02,230
mystery door that's
also closed.

255
00:16:02,230 --> 00:16:04,993
The probability that you chose
the correct door is still one

256
00:16:04,993 --> 00:16:06,310
in a million.

257
00:16:06,310 --> 00:16:11,380
So if you see 999,998 goats, and
one closed door, and you

258
00:16:11,380 --> 00:16:14,080
know that your door only has a
one in a million chance, you

259
00:16:14,080 --> 00:16:15,770
want to switch to the other
door, because that probably

260
00:16:15,770 --> 00:16:18,440
has the prize.

261
00:16:18,440 --> 00:16:20,860
So different ways of
thinking about it.

262
00:16:20,860 --> 00:16:23,960
With probability problems and
statistics problems, it always

263
00:16:23,960 --> 00:16:26,410
helps to-- or at least, I think
it does-- to have an

264
00:16:26,410 --> 00:16:28,850
intuitive idea of
what's going on.

265
00:16:28,850 --> 00:16:33,670
So with that said, let's
talk about pi.

266
00:16:33,670 --> 00:16:39,870
Because this is one of my
favorite Monte Carlo methods.

267
00:16:39,870 --> 00:16:42,250
Because it's got a
nice explanation.

268
00:16:42,250 --> 00:16:48,500
So does anyone need me to talk
about the idea behind this,

269
00:16:48,500 --> 00:16:52,435
like how this method works,
or to go through it?

270
00:16:56,200 --> 00:16:57,450
Someone's nodding.

271
00:17:00,710 --> 00:17:05,705
So the idea is we
have a square.

272
00:17:10,140 --> 00:17:17,630
And its side is 2r units long.

273
00:17:17,630 --> 00:17:19,020
So what's the area
of the square?

274
00:17:22,450 --> 00:17:23,700
So A_sq = (2r)

275
00:17:27,390 --> 00:17:29,970
squared, right?

276
00:17:29,970 --> 00:17:32,170
Now, we still have
a circle that's

277
00:17:32,170 --> 00:17:33,420
inscribed in the square.

278
00:17:37,100 --> 00:17:39,850
And it's got a radius of r.

279
00:17:39,850 --> 00:17:41,150
So area of circle.

280
00:17:46,490 --> 00:17:51,800
If we take the ratio of the
circle to the area of the

281
00:17:51,800 --> 00:17:57,410
square, then we find we
have pi over 4.

282
00:17:57,410 --> 00:18:01,875
Now, let's assume that I
throw darts at this.

283
00:18:04,430 --> 00:18:07,630
Wakes people up.

284
00:18:07,630 --> 00:18:12,670
And there's a uniform
probability that the point

285
00:18:12,670 --> 00:18:15,856
will land somewhere in
the square here.

286
00:18:15,856 --> 00:18:23,800
If I throw N of these, then I
can expect pi over 4 of them,

287
00:18:23,800 --> 00:18:25,865
times N, to wind up
in the circle.

288
00:18:28,500 --> 00:18:31,130
And since I find this number and
this number, and I want to

289
00:18:31,130 --> 00:18:33,326
find pi, I can just
rearrange this.

290
00:18:36,295 --> 00:18:37,545
That's how we get pi.
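
The board work is not in the transcript, but the rearrangement described can be written out. Here $M$ denotes the number of darts that land in the circle and $N$ the total number thrown; $M$ is a label borrowed from the later code discussion.

```latex
\frac{A_{\mathrm{circle}}}{A_{\mathrm{sq}}}
  = \frac{\pi r^2}{(2r)^2}
  = \frac{\pi}{4},
\qquad
\frac{M}{N} \approx \frac{\pi}{4}
\;\Longrightarrow\;
\pi \approx \frac{4M}{N}
```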

291
00:18:40,590 --> 00:18:42,500
So let's go to the code.

292
00:18:45,800 --> 00:18:47,830
We just have some easy code.

293
00:18:47,830 --> 00:18:51,280
It gets a random point within
a square that's from minus r

294
00:18:51,280 --> 00:18:55,580
to r, so 2r units long.

295
00:18:55,580 --> 00:18:57,840
I have a function that makes
a whole bunch of points.

296
00:19:00,470 --> 00:19:02,760
And then I have a function
that checks if a point is

297
00:19:02,760 --> 00:19:09,740
within a circle of radius r
and another function that

298
00:19:09,740 --> 00:19:11,910
looks at a bunch of points
and counts how many

299
00:19:11,910 --> 00:19:15,330
are within the circle.

300
00:19:15,330 --> 00:19:17,290
And then I have my compute
pi function here.

301
00:19:19,900 --> 00:19:23,500
And all it does is you can
either pass it some points

302
00:19:23,500 --> 00:19:29,200
that are already made, or just
say, I want to have 100,000

303
00:19:29,200 --> 00:19:31,990
darts thrown at this square.

304
00:19:31,990 --> 00:19:36,230
And it'll make a whole bunch of
those random points, figure

305
00:19:36,230 --> 00:19:38,470
out how many are in the circle.

306
00:19:38,470 --> 00:19:40,380
And then we have--

307
00:19:40,380 --> 00:19:48,080
this would be m and numpoints
N. If we multiply it by 4,

308
00:19:48,080 --> 00:19:53,190
that gives us pi,
more or less.
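
A sketch of the helper functions described above follows. The names mirror the spoken descriptions, but the signatures, the default radius of 1.0, and the `seed` parameter are assumptions rather than the code shown on screen.

```python
import random


def get_point(r, rng):
    """Random point in the square from -r to r on both axes."""
    return rng.uniform(-r, r), rng.uniform(-r, r)


def make_points(r, num_points, rng):
    """Make a whole bunch of random points."""
    return [get_point(r, rng) for _ in range(num_points)]


def in_circle(point, r):
    """True if the point lies within the circle of radius r centered at the origin."""
    x, y = point
    return x * x + y * y <= r * r


def count_in_circle(points, r):
    """Count how many points fall within the circle."""
    return sum(1 for point in points if in_circle(point, r))


def compute_pi(num_points=100_000, points=None, r=1.0, seed=None):
    """Estimate pi as 4 * M / N, from given points or freshly thrown darts."""
    if points is None:
        points = make_points(r, num_points, random.Random(seed))
    return 4.0 * count_in_circle(points, r) / len(points)
```

Calling `compute_pi()` throws 100,000 darts; passing `points=` reuses darts that were already thrown.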

309
00:19:53,190 --> 00:19:57,893
So let's look at a
couple of plots.

310
00:19:57,893 --> 00:20:00,530
I have a function
here, runtrials.

311
00:20:00,530 --> 00:20:02,920
And what it's going to do is
it's going to run a number of

312
00:20:02,920 --> 00:20:07,020
trials for a given
number of points.

313
00:20:07,020 --> 00:20:15,900
So what I want to do is I'm
going to run 50 trials for

314
00:20:15,900 --> 00:20:17,900
each number of points.

315
00:20:17,900 --> 00:20:21,590
And I'm going to have a points
list that goes from 10 to

316
00:20:21,590 --> 00:20:23,950
10,000 in 1,000-point
increments.

317
00:20:26,770 --> 00:20:28,470
I'm going to run the trials
and get the results.

318
00:20:28,470 --> 00:20:31,880
And then I'm going to
plot my results.

319
00:20:31,880 --> 00:20:33,380
And why don't we just throw
that out there?

320
00:20:48,410 --> 00:20:48,650
OK.

321
00:20:48,650 --> 00:20:52,480
So on the plot, the blue line,
the blue horizontal line, that's

322
00:20:52,480 --> 00:20:55,750
the actual value of pi, as
near as a computer can

323
00:20:55,750 --> 00:20:58,240
approximate it.

324
00:20:58,240 --> 00:21:00,800
On the x-axis, we have the
number of darts that we threw

325
00:21:00,800 --> 00:21:03,320
at the square.

326
00:21:03,320 --> 00:21:07,260
And each red dot represents
the result of one trial of

327
00:21:07,260 --> 00:21:12,120
throwing however many
darts at a board.

328
00:21:12,120 --> 00:21:16,945
So when you're down here, and
you're only throwing 10 darts,

329
00:21:16,945 --> 00:21:19,130
you tend to have a very
wide spread for the

330
00:21:19,130 --> 00:21:21,120
estimated value of pi.

331
00:21:21,120 --> 00:21:28,770
As you increase the number of
darts, you get much closer--

332
00:21:28,770 --> 00:21:31,330
I would say shot group, but
grouping is probably more

333
00:21:31,330 --> 00:21:34,060
appropriate.

334
00:21:34,060 --> 00:21:38,646
And it's much closer to
the actual value of pi.

335
00:21:38,646 --> 00:21:41,980
There's nothing really unusual
about this, right?

336
00:21:41,980 --> 00:21:44,760
Nothing confusing?
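
The narrowing spread can also be checked numerically rather than by eye. The sketch below is one hypothetical way to do it: `run_trials` and `spread` are assumed names, a unit radius is assumed, and the standard deviation of the estimates stands in for the visual spread of the red dots.

```python
import random
import statistics


def estimate_pi(num_darts, rng):
    """One trial: throw num_darts darts at the square from -1 to 1, return 4 * hits / darts."""
    hits = 0
    for _ in range(num_darts):
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / num_darts


def run_trials(points_list, num_trials, seed=None):
    """Map each dart count to the list of estimates from num_trials trials."""
    rng = random.Random(seed)
    return {n: [estimate_pi(n, rng) for _ in range(num_trials)] for n in points_list}


def spread(results):
    """Sample standard deviation of the estimates for each dart count."""
    return {n: statistics.stdev(estimates) for n, estimates in results.items()}
```

Running `spread(run_trials([10, 1000, 10000], 50))` should show the standard deviation shrinking as the dart count grows.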

337
00:21:44,760 --> 00:21:53,020
So another way of visualizing
this is to actually, well,

338
00:21:53,020 --> 00:21:55,380
look at the darts
that are thrown.

339
00:22:02,880 --> 00:22:06,600
So I have a function here,
plot pi scatter.

340
00:22:06,600 --> 00:22:10,520
And this is actually just
going to plot this.

341
00:22:13,320 --> 00:22:17,280
And it's going to do it for 10
points, 100 points, 1,000

342
00:22:17,280 --> 00:22:19,450
points, and 10,000 points.

343
00:22:19,450 --> 00:22:26,500
And we'll see why we can
start converging on pi.

344
00:22:26,500 --> 00:22:30,990
So this is with only 10 darts
thrown at the square.

345
00:22:30,990 --> 00:22:34,430
The value for pi is
really pretty off.

346
00:22:34,430 --> 00:22:36,140
And it doesn't really look
very compelling.

347
00:22:42,810 --> 00:22:47,360
In fact, one of the
darts actually

348
00:22:47,360 --> 00:22:49,150
fell outside the circle.

349
00:22:49,150 --> 00:22:50,730
Nine of the darts fell
inside the circle.

350
00:22:50,730 --> 00:22:54,310
So you're not going to get a
real good estimate there.

351
00:22:54,310 --> 00:22:57,240
The blue dots there represent
being in the circle.

352
00:22:57,240 --> 00:22:59,810
Red is outside.

353
00:22:59,810 --> 00:23:02,250
So if we do it with 100 points,
it starts getting a

354
00:23:02,250 --> 00:23:04,832
little better.

355
00:23:04,832 --> 00:23:08,770
If we do it with 1,000 points,
it starts getting better.

356
00:23:12,690 --> 00:23:20,720
If we do it with
10,000 points.

357
00:23:20,720 --> 00:23:21,970
Anyone confused?

358
00:23:25,330 --> 00:23:29,840
So I'm going to move on and show
you how we can use the

359
00:23:29,840 --> 00:23:32,895
same method to do numeric
integration.

360
00:23:39,740 --> 00:23:41,880
So here we go.

361
00:23:41,880 --> 00:23:44,570
Here's that frange function
again, so it's

362
00:23:44,570 --> 00:23:48,360
not confusing anyone.

363
00:23:48,360 --> 00:23:53,170
What we're going to do is we're
going to use a Monte

364
00:23:53,170 --> 00:23:58,946
Carlo method to integrate
a polynomial.

365
00:24:01,790 --> 00:24:04,030
So let's say that I have--

366
00:24:11,310 --> 00:24:12,560
what I want to find.

367
00:24:19,730 --> 00:24:22,070
I'm going to do it for--

368
00:24:22,070 --> 00:24:26,780
because this is a numeric
method, let's say do it from

369
00:24:26,780 --> 00:24:27,820
negative 5 to 5.

370
00:24:27,820 --> 00:24:29,155
So I want to do this.

371
00:24:37,180 --> 00:24:39,990
If you haven't had calculus or
anything like that, don't

372
00:24:39,990 --> 00:24:41,010
worry about this.

373
00:24:41,010 --> 00:24:45,006
But I think a lot of people
have, with a couple of

374
00:24:45,006 --> 00:24:46,710
exceptions.

375
00:24:46,710 --> 00:24:53,350
So this is an easy function
to integrate, right?

376
00:24:53,350 --> 00:24:55,570
But there are also some
functions that are really hard

377
00:24:55,570 --> 00:24:56,540
or impossible to.

378
00:24:56,540 --> 00:25:02,360
So that's where a lot of
software packages actually use

379
00:25:02,360 --> 00:25:06,740
Monte Carlo methods to do a
numeric integration for you.

380
00:25:06,740 --> 00:25:14,080
But the idea is the same. I'm
going to take a function.

381
00:25:14,080 --> 00:25:17,000
And this is going
to be x-squared.

382
00:25:17,000 --> 00:25:19,576
And then I'm going to take
an x-min and an x-max.

383
00:25:23,750 --> 00:25:26,020
These become my left and
right boundaries.

384
00:25:26,020 --> 00:25:28,310
And then I'm going to find the
minimum of the function

385
00:25:28,310 --> 00:25:34,000
between these limits and the
maximum of the function.

386
00:25:34,000 --> 00:25:35,460
So you see what I'm doing?

387
00:25:35,460 --> 00:25:37,715
I'm defining a rectangle.

388
00:25:40,270 --> 00:25:41,975
So again, same thing.

389
00:25:47,260 --> 00:25:48,250
Same principle.

390
00:25:48,250 --> 00:25:49,500
I have the area of
the rectangle.

391
00:25:54,070 --> 00:25:57,260
I don't have the area
of this guy.

392
00:25:57,260 --> 00:25:59,080
That's what I'm trying
to find.

393
00:25:59,080 --> 00:26:04,350
But I know that if I find the
ratio, the number of points

394
00:26:04,350 --> 00:26:07,740
that land in the square--

395
00:26:07,740 --> 00:26:13,220
or the ratio that land in this
curve versus the total in the

396
00:26:13,220 --> 00:26:17,565
square, then I can find this
area pretty easily.

397
00:26:20,360 --> 00:26:28,940
So this function, find function,
y-min, y-max.

398
00:26:28,940 --> 00:26:30,200
Does exactly what it says.

399
00:26:33,750 --> 00:26:37,540
Just goes between x-min and
x-max, and then finds where

400
00:26:37,540 --> 00:26:43,150
the function is a minimum and
where it's a maximum.

401
00:26:43,150 --> 00:26:46,546
So the function I'm calling f.

402
00:26:46,546 --> 00:26:49,350
It's one of the few
single-letter variable names

403
00:26:49,350 --> 00:26:52,850
I'll use that isn't
an index counter.

404
00:26:56,610 --> 00:26:58,910
My random point generator, it's

405
00:26:58,910 --> 00:27:00,840
going to take the bounds--

406
00:27:00,840 --> 00:27:03,090
x-min, x-max, y-min, y-max.

407
00:27:03,090 --> 00:27:05,640
So it's going to uniformly
produce a point that falls

408
00:27:05,640 --> 00:27:06,890
within this rectangle.

409
00:27:10,240 --> 00:27:11,520
My make-points --

410
00:27:11,520 --> 00:27:14,590
it just makes a whole
bunch of these.

411
00:27:14,590 --> 00:27:17,790
Then I have this function
between curve.

412
00:27:17,790 --> 00:27:21,610
What this tells me is if I
have a point here, it'll

413
00:27:21,610 --> 00:27:24,220
return true, because
it's between the

414
00:27:24,220 --> 00:27:28,540
curve and the x-axis.

415
00:27:28,540 --> 00:27:31,510
If it's up here, it's
false, right?

416
00:27:34,460 --> 00:27:37,130
Does anyone not understand
how that works?

417
00:27:37,130 --> 00:27:38,380
Ah, you're all smart.

418
00:27:40,990 --> 00:27:49,335
So here is our estimate of our
main function, estimate area.

419
00:27:49,335 --> 00:27:52,780
You give it a function,
x-min, x-max.

420
00:27:52,780 --> 00:27:55,690
I'm going to tell it how
many points to toss.

421
00:27:55,690 --> 00:27:58,600
And optionally, we can tell it
that we already have points

422
00:27:58,600 --> 00:28:01,160
that have been tossed.

423
00:28:01,160 --> 00:28:04,220
And the first thing we do is
find the y-min and the y-max.

424
00:28:07,010 --> 00:28:10,960
And then if we don't have
points, we make them.

425
00:28:10,960 --> 00:28:14,910
And then point counter counts
how many times a point wound

426
00:28:14,910 --> 00:28:18,030
up between the curve
and the x-axis.

427
00:28:21,110 --> 00:28:24,150
And we just iterate through
the points.

428
00:28:24,150 --> 00:28:27,185
If it's between the curve,
that means it's here.

429
00:28:30,150 --> 00:28:32,340
Then, if it's above the
x-axis, we're going to

430
00:28:32,340 --> 00:28:33,590
increment the point counter.

431
00:28:33,590 --> 00:28:37,910
And then if it's below the
x-axis, we're going to

432
00:28:37,910 --> 00:28:38,800
decrement the point counter.

433
00:28:38,800 --> 00:28:40,820
So we're accounting
for signs here.

434
00:28:40,820 --> 00:28:46,770
So if we had a function that
did this, we'd be able to

435
00:28:46,770 --> 00:28:48,020
properly handle it.

436
00:28:51,170 --> 00:28:55,190
Now we get the rectangular
area.

437
00:28:55,190 --> 00:29:00,110
And then all we do is we
multiply the rectangular area

438
00:29:00,110 --> 00:29:04,240
by the ratio of the number of
points between the curve and

439
00:29:04,240 --> 00:29:08,060
the x-axis and the total number
of points thrown.

440
00:29:08,060 --> 00:29:10,250
And that gives us the
function area.
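
The dart-throwing estimate described above can be sketched roughly as follows. This is a minimal sketch, not the code shown in the recitation: the name `estimate_area` and its parameters are assumptions, and the bounding box (`y_min`, `y_max`) is passed in by the caller rather than computed from the function.

```python
import random

def estimate_area(f, x_min, x_max, y_min, y_max, num_points, rng=random):
    """Estimate the signed area between f and the x-axis on [x_min, x_max].

    Assumes the box [y_min, y_max] contains both the curve and the x-axis.
    """
    counter = 0
    for _ in range(num_points):
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(y_min, y_max)
        fx = f(x)
        if 0 <= y <= fx:
            counter += 1   # point lies between the x-axis and a curve above it
        elif fx <= y < 0:
            counter -= 1   # point lies between the x-axis and a curve below it
    rect_area = (x_max - x_min) * (y_max - y_min)
    # Scale the rectangle by the signed fraction of points that counted.
    return rect_area * counter / num_points

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000, 100000):
        print(n, estimate_area(lambda x: x * x, 0, 2, 0, 4, n))
```

For x-squared on the interval from 0 to 2 the exact area is 8/3, so the printed estimates should drift toward roughly 2.67 as the number of points grows.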

441
00:29:13,060 --> 00:29:18,680
So here's my function,
x-squared.

442
00:29:18,680 --> 00:29:21,040
And this is just a plot
function scatter.

443
00:29:21,040 --> 00:29:23,810
All this is going to do is just
do the same thing I did

444
00:29:23,810 --> 00:29:25,060
with the circle.

445
00:29:27,070 --> 00:29:31,920
And I am going to
do this for--

446
00:29:31,920 --> 00:29:35,910
if I tossed 10 points,
100 points, 1,000,

447
00:29:35,910 --> 00:29:38,540
10,000, or a 100,000.

448
00:29:38,540 --> 00:29:39,815
So let's see what
this looks like.

449
00:29:47,230 --> 00:29:49,260
Assuming that Python
doesn't crash.

450
00:29:57,190 --> 00:29:59,485
So not too nice.

451
00:30:06,690 --> 00:30:26,230
100 points, 1,000 points,
10,000 points.

452
00:30:26,230 --> 00:30:27,480
And then a whole
mess of points.

453
00:30:36,956 --> 00:30:40,372
Oh, I crashed it.

454
00:30:40,372 --> 00:30:41,348
Hm?

455
00:30:41,348 --> 00:30:42,812
AUDIENCE: Can't we
just [INAUDIBLE]?

456
00:30:49,160 --> 00:30:49,580
PROFESSOR: I'm sorry.

457
00:30:49,580 --> 00:30:50,323
Say that again?

458
00:30:50,323 --> 00:30:51,573
AUDIENCE: Calculate
[INAUDIBLE]

459
00:30:54,187 --> 00:30:58,534
split up the x-axis to a lot of
points, and then multiply

460
00:30:58,534 --> 00:31:01,432
those by the value function
[INAUDIBLE]

461
00:31:01,432 --> 00:31:02,420
add them up?

462
00:31:02,420 --> 00:31:04,490
PROFESSOR: You're talking
about doing a Riemann

463
00:31:04,490 --> 00:31:05,390
approximation?

464
00:31:05,390 --> 00:31:07,000
AUDIENCE: Yeah, [INAUDIBLE].

465
00:31:07,000 --> 00:31:09,630
PROFESSOR: Or a Riemann sum?

466
00:31:09,630 --> 00:31:16,640
So his question is, why don't
you do something like this?

467
00:31:23,650 --> 00:31:39,610
Divide up the x-axis into very
small portions, like that, and

468
00:31:39,610 --> 00:31:41,265
then sum up the areas
of these rectangles.

469
00:31:43,991 --> 00:31:46,426
Yeah, you could do that.
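
The alternative raised in the question, slicing the x-axis into thin rectangles and summing their areas, might look like the sketch below. The function name `riemann_sum` and the choice of evaluating the function at the midpoint of each slice are assumptions made for illustration.

```python
def riemann_sum(f, x_min, x_max, num_slices):
    """Approximate the area under f on [x_min, x_max] with thin rectangles."""
    width = (x_max - x_min) / num_slices
    total = 0.0
    for i in range(num_slices):
        midpoint = x_min + (i + 0.5) * width   # sample each slice at its middle
        total += f(midpoint) * width           # rectangle height times width
    return total

if __name__ == "__main__":
    print(riemann_sum(lambda x: x * x, 0, 2, 1000))
```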

470
00:31:46,426 --> 00:31:47,676
AUDIENCE: [INAUDIBLE]?

471
00:31:51,310 --> 00:31:53,460
PROFESSOR: You know, I don't
have an answer for that.

472
00:31:53,460 --> 00:31:57,644
I can't say which one
would work better.

473
00:31:57,644 --> 00:31:59,000
Do you know, Serena?

474
00:32:03,230 --> 00:32:08,920
I would say that right now,
whichever one you prefer.

475
00:32:11,810 --> 00:32:15,090
But I'll see if there's any
actual research on whether or

476
00:32:15,090 --> 00:32:16,620
not one is better
than the other.

477
00:32:16,620 --> 00:32:19,950
It might turn out that there
are certain instances where

478
00:32:19,950 --> 00:32:22,500
doing this sort of approximation
is better than

479
00:32:22,500 --> 00:32:24,340
doing the approximation
I'm talking about.

480
00:32:27,570 --> 00:32:30,832
But I don't know.

481
00:32:30,832 --> 00:32:34,095
Yeah, for this problem, you
could definitely use that.

482
00:32:38,800 --> 00:32:41,280
Is everyone good with this?

483
00:32:41,280 --> 00:32:42,260
Anyone confused?

484
00:32:42,260 --> 00:32:45,080
Any questions?

485
00:32:45,080 --> 00:32:45,370
Yeah?

486
00:32:45,370 --> 00:32:49,314
AUDIENCE: I think my concern
is that you need a

487
00:32:49,314 --> 00:32:52,765
fantastically large number of
darts to get a reasonably good

488
00:32:52,765 --> 00:32:54,750
integration [INAUDIBLE].

489
00:32:54,750 --> 00:32:56,090
PROFESSOR: Yeah.

490
00:32:56,090 --> 00:32:59,970
That is one issue with Monte
Carlo methods, is that they do

491
00:32:59,970 --> 00:33:02,660
rely on large numbers.

492
00:33:02,660 --> 00:33:09,218
So, yeah, sometimes they
can take a while.

493
00:33:09,218 --> 00:33:11,628
AUDIENCE: At least for the
purposes of this class, we

494
00:33:11,628 --> 00:33:15,002
don't need to be able to
quantify the error or anything

495
00:33:15,002 --> 00:33:16,930
like that, right?

496
00:33:16,930 --> 00:33:18,180
PROFESSOR: No.

497
00:33:22,080 --> 00:33:26,480
You do need to understand
that there can be error.

498
00:33:26,480 --> 00:33:29,750
And you should also understand
stuff like confidence

499
00:33:29,750 --> 00:33:31,190
intervals and confidence
levels.

500
00:33:34,200 --> 00:33:35,448
Are you OK with that?

501
00:33:35,448 --> 00:33:37,938
AUDIENCE: Mostly.

502
00:33:37,938 --> 00:33:41,175
But in order to get a confidence
interval, you'd

503
00:33:41,175 --> 00:33:44,412
have to do several
trials at, say,

504
00:33:44,412 --> 00:33:46,420
100,000 points, and then--

505
00:33:46,420 --> 00:33:47,670
PROFESSOR: Right, exactly.

506
00:33:51,260 --> 00:33:56,380
You could estimate the error.

507
00:33:56,380 --> 00:33:58,220
Like you could estimate it.

508
00:33:58,220 --> 00:34:00,860
But in order to really get
a good sense for how much

509
00:34:00,860 --> 00:34:03,820
variance there is, you'd have
to do repeated trials.
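
One way to picture the repeated trials is the sketch below, which reruns a dart-throwing estimate several times and reports the spread. The names `one_trial` and `repeated_trials` are hypothetical, and the 1.96 multiplier assumes the estimates are approximately normally distributed.

```python
import random
import statistics

def one_trial(num_points, rng):
    """One dart-throwing estimate of the area under x**2 on [0, 2]."""
    hits = 0
    for _ in range(num_points):
        x = rng.uniform(0, 2)
        y = rng.uniform(0, 4)
        if y <= x * x:
            hits += 1
    return 8.0 * hits / num_points   # the bounding box is 2 wide by 4 tall

def repeated_trials(num_trials, num_points, seed=0):
    """Run several independent trials and summarize how much they vary."""
    rng = random.Random(seed)
    estimates = [one_trial(num_points, rng) for _ in range(num_trials)]
    mean = statistics.mean(estimates)
    std = statistics.stdev(estimates)
    # About 95% of trials should fall within 1.96 standard deviations
    # of the mean, under the normality assumption stated above.
    return mean, std, (mean - 1.96 * std, mean + 1.96 * std)

if __name__ == "__main__":
    print(repeated_trials(20, 2000))
```

Throwing more points per trial should narrow the interval, which is the trade-off between running time and tolerated error.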

510
00:34:03,820 --> 00:34:05,810
So yeah.

511
00:34:05,810 --> 00:34:08,260
AUDIENCE: What I guess I was
getting at was in order to get

512
00:34:08,260 --> 00:34:10,220
a sense of how big the
error is relative to

513
00:34:10,220 --> 00:34:11,989
the number of trials--

514
00:34:11,989 --> 00:34:13,430
PROFESSOR: Yeah.

515
00:34:13,430 --> 00:34:14,855
AUDIENCE: --without sort
of analytically.

516
00:34:14,855 --> 00:34:17,710
But I guess that's probably
[INAUDIBLE].

517
00:34:17,710 --> 00:34:19,113
PROFESSOR: I'm sorry, what?

518
00:34:19,113 --> 00:34:20,965
AUDIENCE: That's not something
that we're going to be asked

519
00:34:20,965 --> 00:34:22,360
to do, at least in
this course?

520
00:34:22,360 --> 00:34:24,475
PROFESSOR: Yeah, no.

521
00:34:24,475 --> 00:34:27,400
The purpose is we want you to
understand that when you do

522
00:34:27,400 --> 00:34:32,280
things like this, that there is
some thought that has to go

523
00:34:32,280 --> 00:34:33,929
into, well, how many trials
do I need to do?

524
00:34:33,929 --> 00:34:36,280
How many points do
I need to throw?

525
00:34:36,280 --> 00:34:39,174
And you have to ask yourself,
how much error am

526
00:34:39,174 --> 00:34:41,040
I willing to tolerate?

527
00:34:41,040 --> 00:34:45,940
There's the joke that
mathematicians call pi pi, and

528
00:34:45,940 --> 00:34:54,810
then engineers call it 3.14.

529
00:34:54,810 --> 00:34:59,770
OK, so if everyone's done with
integration, I'm going to move

530
00:34:59,770 --> 00:35:01,020
on to regression.

531
00:35:07,370 --> 00:35:08,260
Oh, wait, no.

532
00:35:08,260 --> 00:35:11,870
There's one thing I wanted
to touch on.

533
00:35:11,870 --> 00:35:20,880
So we kind of looked at some toy
problems with Monte Carlo.

534
00:35:20,880 --> 00:35:24,070
And this is, I guess, a toy
problem too, because it has to

535
00:35:24,070 --> 00:35:24,670
do with a toy.

536
00:35:24,670 --> 00:35:28,110
Is everyone familiar with
the game of Monopoly?

537
00:35:28,110 --> 00:35:32,780
So I don't have to explain the
rules too much in depth?

538
00:35:32,780 --> 00:35:33,550
OK.

539
00:35:33,550 --> 00:35:41,150
So let's assume that there are
no factors that modify this

540
00:35:41,150 --> 00:35:43,590
distribution.

541
00:35:43,590 --> 00:35:49,390
If I roll the die twice, then
each one of these spaces has

542
00:35:49,390 --> 00:35:51,940
an equal probability
of being landed on.

543
00:35:51,940 --> 00:35:53,890
It's about 2 and 1/2%.

544
00:35:53,890 --> 00:35:57,800
But there are certain rules
that distort this

545
00:35:57,800 --> 00:35:58,700
distribution.

546
00:35:58,700 --> 00:36:01,200
So you can land on Go To Jail.

547
00:36:01,200 --> 00:36:05,070
You can roll three doubles,
and get sent to Jail.

548
00:36:05,070 --> 00:36:08,660
You can draw a Chance card, and
get sent to Jail, sent to

549
00:36:08,660 --> 00:36:12,070
Go, or sent anywhere
on the board.

550
00:36:12,070 --> 00:36:15,660
And there are 10 out of 16
Chance cards that modify this

551
00:36:15,660 --> 00:36:17,560
distribution.

552
00:36:17,560 --> 00:36:19,230
And for Community Chest,
same thing.

553
00:36:19,230 --> 00:36:22,340
There's 2 out of the 16 cards
that distort the distribution.

554
00:36:22,340 --> 00:36:27,570
So the question is, how do
you do this analytically?

555
00:36:27,570 --> 00:36:28,710
And I've tried.

556
00:36:28,710 --> 00:36:30,700
It's hard.

557
00:36:30,700 --> 00:36:33,020
I'm actually not sure
if it's possible.

558
00:36:33,020 --> 00:36:36,270
Well, this is a perfect example
of where you would use

559
00:36:36,270 --> 00:36:39,940
a Monte Carlo simulation in
order to arrive at the answer.
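
A heavily simplified sketch of such a simulation is below. It is not a solution to the Project Euler problem: it assumes two six-sided dice and models only the Go To Jail square, ignoring Chance, Community Chest, and the three-doubles rule. The square numbering (Go as 0, Jail as 10, Go To Jail as 30) and all names are assumptions for illustration.

```python
import random
from collections import Counter

NUM_SQUARES = 40
JAIL = 10
GO_TO_JAIL = 30

def simulate(num_rolls, seed=0):
    """Return the fraction of turns that end on each square."""
    rng = random.Random(seed)
    position = 0
    visits = Counter()
    for _ in range(num_rolls):
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        position = (position + roll) % NUM_SQUARES
        if position == GO_TO_JAIL:
            position = JAIL          # the only distorting rule modeled here
        visits[position] += 1
    return {square: count / num_rolls for square, count in visits.items()}

if __name__ == "__main__":
    freqs = simulate(100000)
    top_three = sorted(freqs, key=freqs.get, reverse=True)[:3]
    print(top_three)
```

Even with this single rule, Jail ends up well above the roughly 2.5% share of an undistorted board. The card rules could be layered on in the same loop.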

560
00:36:39,940 --> 00:36:44,780
So if you actually want to take
a whack at this problem,

561
00:36:44,780 --> 00:36:48,090
you can go to this site called
projecteuler.net.

562
00:36:48,090 --> 00:36:50,880
They have a whole bunch of mathy
questions on there that

563
00:36:50,880 --> 00:36:55,740
are meant to get people to think
about math and computer

564
00:36:55,740 --> 00:36:58,550
programming.

565
00:36:58,550 --> 00:37:01,890
And you get little rankings the
more questions you answer

566
00:37:01,890 --> 00:37:02,890
correctly, and stuff
like that.

567
00:37:02,890 --> 00:37:04,790
So there's a little
competition.

568
00:37:04,790 --> 00:37:10,650
But the question in this
particular case was, what are

569
00:37:10,650 --> 00:37:14,650
the top three places
you'll land on

570
00:37:14,650 --> 00:37:16,500
with all these factors?

571
00:37:16,500 --> 00:37:20,950
And if you represent them as a
number that is concatenated

572
00:37:20,950 --> 00:37:23,110
one after the other,
what is the number?

573
00:37:23,110 --> 00:37:26,140
What is the six-digit number?

574
00:37:26,140 --> 00:37:28,190
But that's a fun problem.

575
00:37:30,990 --> 00:37:34,075
So going onto something that's
less fun, regression.

576
00:37:37,280 --> 00:37:43,030
So can someone tell me what
purposes we would use

577
00:37:43,030 --> 00:37:44,280
regression for?

578
00:37:52,190 --> 00:37:52,960
Take a stab.

579
00:37:52,960 --> 00:37:53,300
AUDIENCE: Sure.

580
00:37:53,300 --> 00:37:58,100
If you have experimental data
which you believe to fit some

581
00:37:58,100 --> 00:37:59,540
type of theoretical model.

582
00:37:59,540 --> 00:38:00,980
But experiments being

583
00:38:00,980 --> 00:38:04,010
experiments, they're not perfect.

584
00:38:04,010 --> 00:38:06,710
You can't--

585
00:38:06,710 --> 00:38:09,690
the data points don't exactly fall in
the model, so you have to

586
00:38:09,690 --> 00:38:14,078
find which parameters from the
model to pick so that your

587
00:38:14,078 --> 00:38:16,874
experiment [UNINTELLIGIBLE]
best fits [INAUDIBLE].

588
00:38:16,874 --> 00:38:17,810
PROFESSOR: Uh-huh.

589
00:38:17,810 --> 00:38:21,940
So the idea is that you have a
bunch of experimental data

590
00:38:21,940 --> 00:38:24,000
that has error.

591
00:38:24,000 --> 00:38:28,260
And you want to be able to
maybe find the underlying

592
00:38:28,260 --> 00:38:32,610
function of those
observations.

593
00:38:32,610 --> 00:38:35,420
And you would do that
using regression.

594
00:38:35,420 --> 00:38:39,880
So we have a couple
of nice tools in

595
00:38:39,880 --> 00:38:41,130
Python for doing that.

596
00:38:44,070 --> 00:38:47,710
Actually, before I move on,
another reason is you can find

597
00:38:47,710 --> 00:38:48,090
the function.

598
00:38:48,090 --> 00:38:50,310
But you can also then, once you
find that function, you

599
00:38:50,310 --> 00:38:52,560
can use it to predict
additional values.

600
00:38:52,560 --> 00:38:56,460
So say you have a gap in your
data, or you want to predict

601
00:38:56,460 --> 00:38:58,510
values beyond the range
that you collected

602
00:38:58,510 --> 00:39:00,050
observations for.

603
00:39:00,050 --> 00:39:03,270
If you do a regression, you find
the function, find the

604
00:39:03,270 --> 00:39:05,510
parameters for the function,
then you can use it to predict

605
00:39:05,510 --> 00:39:08,450
those values.

606
00:39:08,450 --> 00:39:14,130
And what we mainly want you
to understand here are the

607
00:39:14,130 --> 00:39:17,120
functions that you would use
to do it, and how you would

608
00:39:17,120 --> 00:39:21,650
tell if you have a good fit or
not a good fit, and the idea

609
00:39:21,650 --> 00:39:23,190
of overfitting.

610
00:39:23,190 --> 00:39:30,710
So we have a little bit of code
that demonstrates this,

611
00:39:30,710 --> 00:39:36,980
so a couple of helper functions
that compute various

612
00:39:36,980 --> 00:39:40,330
values that you've
seen before.

613
00:39:40,330 --> 00:39:44,170
So MSE is the sum of the
residual squares.

614
00:39:44,170 --> 00:39:48,230
And then you have the total
sum of squares.

615
00:39:48,230 --> 00:39:57,160
So these will help you compute
the coefficient of

616
00:39:57,160 --> 00:39:58,410
determination.
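
The two helper quantities combine into the coefficient of determination roughly as in this sketch. The function names are assumptions and need not match the helpers in the recitation code. It uses the standard definition, 1 minus the residual sum of squares divided by the total sum of squares.

```python
def residual_sum_of_squares(measured, predicted):
    """Sum of squared differences between observations and the fit."""
    return sum((m - p) ** 2 for m, p in zip(measured, predicted))

def total_sum_of_squares(measured):
    """Sum of squared differences between observations and their mean."""
    mean = sum(measured) / len(measured)
    return sum((m - mean) ** 2 for m in measured)

def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = residual_sum_of_squares(measured, predicted)
    ss_tot = total_sum_of_squares(measured)
    return 1 - ss_res / ss_tot

if __name__ == "__main__":
    print(r_squared([1, 2, 3], [1, 2, 3]))   # a perfect fit
    print(r_squared([1, 2, 3], [2, 2, 2]))   # predicting the mean everywhere
```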

617
00:40:00,280 --> 00:40:06,680
And what I'm going to
show is let's say I

618
00:40:06,680 --> 00:40:07,930
define a function here.

619
00:40:10,380 --> 00:40:12,470
In this case, I have it
defined as x-cubed

620
00:40:12,470 --> 00:40:13,820
plus 5x plus 3.

621
00:40:16,580 --> 00:40:22,310
I am going to, for a certain
number of x values, apply the

622
00:40:22,310 --> 00:40:25,860
function and get the y value.

623
00:40:25,860 --> 00:40:28,800
And then to simulate
observational data, I'm going

624
00:40:28,800 --> 00:40:33,250
to perturb it using a Gaussian
distribution.

625
00:40:33,250 --> 00:40:34,910
So it's going to jitter
the points.

626
00:40:39,650 --> 00:40:42,220
And that's what the make
observations function does, is

627
00:40:42,220 --> 00:40:47,040
it just adds noise
to the y values.
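
A minimal sketch of that idea follows, assuming a hypothetical `make_observations` that takes the x values, the function, and a noise standard deviation `sigma`. The actual signature in the recitation code is not shown in the transcript.

```python
import random

def true_function(x):
    """The underlying model used in the demonstration: x^3 + 5x + 3."""
    return x ** 3 + 5 * x + 3

def make_observations(x_values, f, sigma, seed=None):
    """Apply f to each x and jitter the result with Gaussian noise."""
    rng = random.Random(seed)
    return [f(x) + rng.gauss(0, sigma) for x in x_values]

if __name__ == "__main__":
    xs = [-2 + 0.5 * i for i in range(9)]   # x values from -2 to 2
    print(make_observations(xs, true_function, sigma=1.0, seed=0))
```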

628
00:40:47,040 --> 00:40:49,470
And then I'm going to--

629
00:40:49,470 --> 00:40:55,730
this function here plots out
the measured or observed

630
00:40:55,730 --> 00:41:00,430
values, the simulated.

631
00:41:00,430 --> 00:41:07,750
It computes a fit
for one degree.

632
00:41:07,750 --> 00:41:10,297
So in this case, I have two
parameters, fit degree 1 and

633
00:41:10,297 --> 00:41:13,000
fit degree 2, because I want
to do comparisons.

634
00:41:13,000 --> 00:41:19,080
So it'll compute fit using the
first degree and predict some

635
00:41:19,080 --> 00:41:20,500
values for the curve.

636
00:41:23,320 --> 00:41:27,590
And then it'll compute the
residual error and the

637
00:41:27,590 --> 00:41:32,290
coefficient of determination
and plot it out.

638
00:41:32,290 --> 00:41:36,790
And then it'll do the same thing
for the second degree.

639
00:41:42,290 --> 00:41:44,410
Let's see what this
looks like.

640
00:41:52,460 --> 00:41:56,885
Let's see Python not
behave badly.

641
00:41:59,795 --> 00:42:01,250
There we go.

642
00:42:10,640 --> 00:42:15,130
The function that we plotted
was, what, x-squared

643
00:42:15,130 --> 00:42:17,310
something, 5x-squared?

644
00:42:17,310 --> 00:42:18,560
Let me see.

645
00:42:22,100 --> 00:42:24,830
x-cubed plus 5x plus 3.

646
00:42:27,960 --> 00:42:30,780
And we're plotting it from
negative 2 to 2.

647
00:42:30,780 --> 00:42:34,170
So this is what I'm talking
about with the noise.

648
00:42:34,170 --> 00:42:37,840
So each of these red dots
represents some observation

649
00:42:37,840 --> 00:42:41,280
that's been disturbed
a little bit.

650
00:42:41,280 --> 00:42:45,800
And then I try to fit this
with a first degree

651
00:42:45,800 --> 00:42:48,380
polynomial, and then
a second degree.

652
00:42:48,380 --> 00:42:51,570
And I see--

653
00:42:51,570 --> 00:42:57,304
actually, my residual error is
lower for my first degree fit.

654
00:42:57,304 --> 00:42:58,620
That's interesting.

655
00:43:01,310 --> 00:43:04,340
So I don't know.

656
00:43:04,340 --> 00:43:06,690
At this point, I'd
say just stop and

657
00:43:06,690 --> 00:43:07,450
don't proceed further.

658
00:43:07,450 --> 00:43:09,940
But we know that that's not
the right function.

659
00:43:09,940 --> 00:43:16,260
So let's look at what we have
for a third degree fit.

660
00:43:16,260 --> 00:43:19,025
It's actually worse, huh.

661
00:43:24,470 --> 00:43:26,530
This is the problem with random
programs, is that

662
00:43:26,530 --> 00:43:27,780
sometimes they fail you.

663
00:43:34,950 --> 00:43:37,440
I would say that these are nice
pretty plots, but they're

664
00:43:37,440 --> 00:43:41,000
not really telling me much,
other than I can fit some

665
00:43:41,000 --> 00:43:43,892
lines to some points.

666
00:43:43,892 --> 00:43:45,620
AUDIENCE: What should
it look like?

667
00:43:45,620 --> 00:43:48,540
What are you looking for
that's not there?

668
00:43:48,540 --> 00:43:51,590
PROFESSOR: So we know that the
function that we made the

669
00:43:51,590 --> 00:43:55,900
observations on is a third
degree polynomial.

670
00:43:55,900 --> 00:44:06,850
So it's a little puzzling why
this first degree fit is

671
00:44:06,850 --> 00:44:14,120
better than our third
degree fit.

672
00:44:14,120 --> 00:44:18,800
That's the conundrum.

673
00:44:18,800 --> 00:44:20,150
So maybe--

674
00:44:20,150 --> 00:44:23,350
I wonder what would happen if
I expanded the x range.

675
00:44:23,350 --> 00:44:27,570
So let's say I go from
negative 5 to 5.

676
00:44:27,570 --> 00:44:29,820
Maybe it's just too
little data.

677
00:44:34,150 --> 00:44:35,400
That's looking a
little better.

678
00:44:43,790 --> 00:44:45,040
Now I feel better.

679
00:44:48,080 --> 00:44:51,190
So the issue was that we just
were going from negative 2 to

680
00:44:51,190 --> 00:44:54,070
2, and basically it looked
linear there.

681
00:44:54,070 --> 00:44:57,370
So the first degree polynomial
was doing fine.

682
00:44:57,370 --> 00:45:01,610
But as soon as we go out and get
a little curvy in there,

683
00:45:01,610 --> 00:45:04,820
we see that both the first and
the second degree fits, they

684
00:45:04,820 --> 00:45:07,770
have pretty high error.

685
00:45:07,770 --> 00:45:09,460
Their R-squared is pretty good.

686
00:45:09,460 --> 00:45:14,940
But when you compare them with,
say, a third degree fit,

687
00:45:14,940 --> 00:45:18,240
you see that the error drops
down dramatically.

688
00:45:18,240 --> 00:45:22,050
And it's got higher coefficient
of determination.

689
00:45:22,050 --> 00:45:25,130
So what we would say in this
case is that this third degree

690
00:45:25,130 --> 00:45:29,800
fit here is a lot better
than the first or

691
00:45:29,800 --> 00:45:32,660
second degree fit.

692
00:45:32,660 --> 00:45:35,970
And then we can also look at,
say, a fourth degree fit,

693
00:45:35,970 --> 00:45:39,040
which in this case happens
to have a higher error.

694
00:45:39,040 --> 00:45:41,580
So that's a good thing.

695
00:45:41,580 --> 00:45:45,080
And then if we look at a fifth
degree fit, it also has a

696
00:45:45,080 --> 00:45:45,630
higher error.

697
00:45:45,630 --> 00:45:50,810
So we'd say in this case that
the third degree fit is

698
00:45:50,810 --> 00:45:53,650
probably our best bet, and we
probably have a pretty good

699
00:45:53,650 --> 00:45:56,600
idea of what the function is
for the underlying model.

700
00:45:59,960 --> 00:46:02,780
AUDIENCE: Which part of
this is regression?

701
00:46:02,780 --> 00:46:06,220
PROFESSOR: Well, the part of
this that is regression is--

702
00:46:10,170 --> 00:46:12,560
the part that actually does the
regression is this polyfit

703
00:46:12,560 --> 00:46:15,200
method here.
fit method here.

704
00:46:15,200 --> 00:46:19,030
And what you do is you pass it
in the x values, the y values,

705
00:46:19,030 --> 00:46:20,860
and the degree of the
polynomial that you

706
00:46:20,860 --> 00:46:22,110
want to fit to it.
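
As a rough illustration of that call, the sketch below uses NumPy's `polyfit` and `polyval` on simulated data and compares degrees 1 through 3. The noise level, the number of points, and the helper name `fit_and_score` are assumptions. The recitation code may import `polyfit` differently.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 50)
# Simulated observations of x^3 + 5x + 3 with Gaussian noise added.
y = x ** 3 + 5 * x + 3 + rng.normal(0, 5, size=x.size)

def fit_and_score(x, y, degree):
    """Fit a polynomial of the given degree and report its error and R-squared."""
    coeffs = np.polyfit(x, y, degree)        # least-squares coefficients, highest power first
    predicted = np.polyval(coeffs, x)        # evaluate the fitted polynomial at x
    ss_res = np.sum((y - predicted) ** 2)    # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return coeffs, ss_res, 1 - ss_res / ss_tot

if __name__ == "__main__":
    for degree in (1, 2, 3):
        _, ss_res, r2 = fit_and_score(x, y, degree)
        print(degree, ss_res, r2)
```

On one fixed set of observations, raising the degree never increases the residual sum of squares, so a lower error alone is not evidence of a better model. That is the concern behind overfitting.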

707
00:46:29,950 --> 00:46:32,270
I've hit the end
of my material,

708
00:46:32,270 --> 00:46:33,785
unless someone has questions.

709
00:46:36,950 --> 00:46:41,476
Comments, fears, trepidations?

710
00:46:41,476 --> 00:46:42,853
AUDIENCE: Just [INAUDIBLE]

711
00:46:42,853 --> 00:46:46,266
having done some stuff-- like
in Excel, you can fit curves

712
00:46:46,266 --> 00:46:47,238
with the R-squares?

713
00:46:47,238 --> 00:46:47,724
PROFESSOR: Yeah.

714
00:46:47,724 --> 00:46:50,640
AUDIENCE: The R-squared values
are really, really high, like

715
00:46:50,640 --> 00:46:51,890
really, really [? wanting ?]

716
00:46:51,890 --> 00:46:55,240
these fits, even though the
fits are pretty terrible.

717
00:46:55,240 --> 00:46:55,830
PROFESSOR: Yeah.

718
00:46:55,830 --> 00:46:57,510
AUDIENCE: So that's
weird to me.

719
00:46:57,510 --> 00:47:00,150
PROFESSOR: That is puzzling.

720
00:47:00,150 --> 00:47:04,925
And it's quite possible
that I have a bug.

721
00:47:04,925 --> 00:47:07,350
AUDIENCE: I wonder whether
there were different

722
00:47:07,350 --> 00:47:09,290
definitions for R-squared that
are maybe floating around in

723
00:47:09,290 --> 00:47:10,430
different places?

724
00:47:10,430 --> 00:47:12,200
PROFESSOR: No.

725
00:47:12,200 --> 00:47:14,130
I made a correction
to this earlier.

726
00:47:14,130 --> 00:47:16,360
And like I said, maybe
I introduced a bug.

727
00:47:16,360 --> 00:47:20,140
So I'm going to have to
double-check my math.

728
00:47:20,140 --> 00:47:21,803
Unfortunately, I'm
not perfect.

729
00:47:21,803 --> 00:47:23,053
I wish I was.