JOHN GUTTAG: Hi, everybody. Welcome back to class. I have a lot to cover today, so I'm just going to get right to it.

You may remember that at the end of the last lecture I talked about the empirical rule and said that there were a couple of assumptions underlying it: first, that the mean estimation error is zero, and second, that the errors are normally distributed. I probably didn't mention at the time that we often call this distribution Gaussian, after the mathematician Carl Friedrich Gauss. And it looks like that.

Normal distributions are very easy to generate in Python. I have a little example here of generating, not a true normal distribution, but a discrete approximation of one. The thing to really notice here is the line random.gauss. That's a built-in function of the random library. The first argument is the mean and the second argument is the standard deviation, or mu and sigma as they're usually called. Every time I call it, I get a different--or usually a different--random value drawn from a Gaussian with, in this case, a mean of 0 and a standard deviation of 100.
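Here is a minimal sketch of that call; the surrounding names are mine, not necessarily the lecture's:

```python
import random

# random.gauss(mu, sigma) draws one value from a normal
# distribution with mean mu and standard deviation sigma.
for i in range(3):
    print(random.gauss(0, 100))  # usually different each call
```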
I'm then going to produce a plot of those values so you can see what the distribution looks like. And I'm going to do that using some things we haven't seen before.

First of all, we've seen histograms: pylab.hist produces a histogram. bins tells it how many bins we want in the histogram; I said we want 100, and the default is 10. dist is the list of values it will use. And then this is something we haven't seen before: weights, a keyword argument.

Normally when we produce a histogram, we take all of the values, from the minimum to the maximum, and divide them into bins--in this case into 100 bins, because I said bins equals 100. So, for example, if we only had values ranging from 0 to 100, the first bin would hold all the 0's, the next all the 1's, and so on up to all the 99's. And each value in a bin is weighted by 1. So if a bin had 10 values falling into it, the y-axis would read 10; if the bin had 50 values falling into it, the y-axis would go up to 50.

But you can tell it how much you want each of the elements in the bins to weigh. You can say, no, I don't want each of them to count as 1; I want them to count as, say, 2 or a half, and that will change the y-axis.

That's what I've done here. I've created a list that gives a weight for each value. In this case I'm going to weight each of them the same way: the weight is 1 over the number of samples, and I replicate it len(dist) times, so there's one weight for each item I have. So, for example, if I had 1,000 items, I could instead give 1,000 different values and say: I want this item weighted by 1, and I want that item over there weighted by 2, or by a half. We rarely do that. Usually what we want is to give every item the same weight. So why would I want the weight to be something other than 1? Because I want my y-axis to be more easily interpreted: with these weights it essentially gives me the fraction of the values that fell into each bin. And that's what I'm doing here.

The other new thing I'm doing is this: the plotting commands, including pylab.hist, many of them return values. Usually we just ignore the returned value when we say pylab.plot or pylab.hist. Here, I am keeping the value.
In the case of a histogram, the returned value is a tuple. One element is a list or an array giving how many items--or, with weights, how much weight--landed in each bin; it also gives you the bin edges and the patches used to produce the beautiful pictures we're used to seeing. So here what I'm going to do is capture that value so that I can use it at the end.

Now, why would I want to look at the fraction of values within approximately 200 of the mean? What does that correspond to in this case? Well, if I divide 200 by 2, I get 100, which happens to be the standard deviation. So what I'm going to be looking at is: what fraction of the values fall within two standard deviations of the mean? It's a check on the empirical rule, right?

All right, when I run the code, I get this. It's a discrete approximation to the probability density function. You'll notice that, unlike the previous picture I showed you, which was nice and smooth, this one is jagged. You would expect that. And you can see--it's very nice--that the peak is where we said the mean should be, at 0, and then it falls off. And indeed, slightly more than 95% of the values fall within two standard deviations of the mean. I'm not even surprised that it's a little more than 95%, because remember, the magic number is 1.96, not 2. But since this is only a finite sample, I only want it to be around 95%; I'm not going to worry too much about whether it's a bit bigger or smaller.

So random.gauss does a nice job of giving us Gaussian values. We plotted them, and you can see that I've got relative frequencies--that's why the y-axis shows fractions rather than counts--and that is because of the way I used the weights argument.
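Here is a small, self-contained sketch of that experiment; it assumes matplotlib's pylab module, and the variable names and bin bookkeeping are mine rather than the lecture's:

```python
import random
import pylab

numSamples = 100000
dist = [random.gauss(0, 100) for i in range(numSamples)]

# Weight each sample by 1/numSamples so the y-axis shows the
# fraction of the values in each bin rather than a raw count.
weights = [1/numSamples]*len(dist)

# pylab.hist returns the bin heights, the bin edges, and the
# patches used to draw the bars.
heights, edges, patches = pylab.hist(dist, bins=100,
                                     weights=weights)
pylab.title('Discrete approximation of a Gaussian')

# Add up the (weighted) heights of the bins lying within
# roughly 200, i.e. two standard deviations, of the mean.
width = edges[1] - edges[0]
frac = sum(h for h, left in zip(heights, edges)
           if left >= -200 and left + width <= 200)
print('Fraction within ~200 of mean =', round(frac, 4))
pylab.show()
```

With these numbers, the printed fraction should come out close to 0.95, which is the empirical rule at work.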
All right, let's return to PDFs. As we said last time, distributions can be defined by probability density functions, which give us the probability of some random variable lying between two values. A PDF defines a curve whose x-axis values lie between the minimum and maximum values of the variable, and it's the area under the curve between two points--we'll come back to this--that gives us the probability of an example falling in that range.

So now let's look at this for a normal distribution. You may recall from the last lecture the rather exotic-looking formula that defines the PDF of a normal distribution. And here's some code that is as straightforward an implementation of that formula as one could imagine. So gaussian is a function that, given an x, a mu, and a sigma, gives me the value of the density at that x. And you'll notice there's nothing random about this; it just computes a value.

Now, let's go down here. For a set of x's, I'm going to get the set of y's corresponding to them and then plot it. I'm going to set mu to 0 and sigma to 1, the so-called standard normal distribution, and I'm going to look at the distribution from minus 4 to 4. There's nothing magic about that, other than, as you'll see, it's roughly where the curve asymptotes toward 0. So while x is less than 4, I record the x-value, get the y-value corresponding to that x by calling gaussian, and increment x by 0.05, until I'm done. Then I simply plot the x-values against the y-values and throw a title on it using pylab.title. All right, does the code make sense?
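Here is a sketch along the lines just described; it's my reconstruction, not necessarily the lecture's exact code. The formula being implemented is the normal density f(x) = (1/(σ√(2π))) · e^(−(x−μ)²/(2σ²)).

```python
import pylab

def gaussian(x, mu, sigma):
    # Straightforward implementation of the normal PDF.
    factor1 = 1.0/(sigma*((2*pylab.pi)**0.5))
    factor2 = pylab.e**-(((x - mu)**2)/(2*sigma**2))
    return factor1*factor2

xVals, yVals = [], []
mu, sigma = 0, 1              # the standard normal
x = -4
while x <= 4:
    xVals.append(x)
    yVals.append(gaussian(x, mu, sigma))
    x += 0.05                 # points close enough to look smooth
pylab.plot(xVals, yVals)
pylab.title('Normal Distribution, mu = ' + str(mu)
            + ', sigma = ' + str(sigma))
pylab.show()
```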
Well, this is where I got that beautiful picture we've looked at before. When I plot it here, it looks quite smooth. It isn't, really: it's a bunch of tiny line segments connected together, but I made the points close enough together that, at this resolution at least, it looks smooth.

So we know what the values on the x-axis are; those are the values I happened to want to look at, from minus 4 to 4. But what are the values on the y-axis? We'd kind of like to interpret them as probabilities, right? But we should be pretty suspicious about that, because if we take this one point up at the top and say the probability of that single point is 0.4, that doesn't make any sense. In fact, we know that the probability of any particular point is, in some sense, 0.

Furthermore, if I chose a different value for sigma, I could actually get the curve to go above 1 on the y-axis. If you take sigma to be, say, 0.1, the y-axis goes up to about 4. So we know we don't have probabilities here. And if these aren't probabilities, what are the y-values?

Well, not too surprisingly, since I claimed this was a probability density function, they're densities. What's a density? I'll say it, and then I'll try to explain it: it's the derivative of the cumulative distribution function. Now, why are we talking about derivatives in the first place? Remember what we're trying to say. If we want to ask what the probability is of a value falling between here and here, we claim that it's the area under the curve between those two points--the integral. Well, as you know from 18.01, there's a very clear relationship between derivatives and integrals. And so if we interpret each of these points as a derivative--in some sense, the slope of the cumulative distribution--then we can get the probability as the area, just by integrating under the curve.

So when we interpret a PDF--we always do it mathematically; actually, I mostly do it just by looking at it--the only really interesting mathematical questions to ask have to do with area. Once we have the area, we can talk about the probability of a value falling within some region of the curve. So what's interesting here is not the numbers on the y-axis per se, but the shape of the curve, because those numbers have to be related to the numbers on the x-axis--dy relative to dx, right? We're looking at derivatives.
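As a quick check of that claim about sigma: the density's peak, at x = mu, is 1/(σ√(2π)), so shrinking sigma pushes the peak well above 1. A standalone sketch:

```python
import math

def gaussian(x, mu, sigma):
    return (1/(sigma*math.sqrt(2*math.pi)) *
            math.e**(-((x - mu)**2)/(2*sigma**2)))

# A density, unlike a probability, can exceed 1:
for sigma in (1, 0.5, 0.1):
    print(sigma, round(gaussian(0, 0, sigma), 3))
# prints roughly 0.399, 0.798, and 3.989
```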
All right, so now we have to talk about integration. I promise you'll only hear about it for another few minutes, and then we'll leave the topic. I mentioned before that SciPy is a library containing a lot of useful mathematical functions. One of them is scipy.integrate.quad. The integrate part is obvious: it means integration. Quad tells you the algorithm being used to do the integration. All of these integrals are going to be approximations to the real integral: SciPy is not doing some clever mathematics to get an analytical solution; it's using a numerical technique to approximate the integral. The one here happens to be called quadrature, but the details don't matter.

You can pass quad up to four arguments. You must pass it the function to be integrated--that makes sense--a number representing the lower limit of the integration, and a number representing the upper limit. And then the optional fourth argument is a tuple supplying values for all of the arguments, except the first, of the function you're integrating. I'll show you an example of that on the next slide. And we'll see that it returns a tuple: an approximation to the result--what it thinks the integral is--and an estimate of how much error there might be in that answer. We'll ignore the error for the moment.

All right, let's look at the code. We start by importing scipy.integrate, which gives us the different integration methods. I'm not going to show you the code for gaussian again, since I showed it to you a couple of minutes ago, but remember that it takes three arguments: x, mu, and sigma. That matters because when we get down here to the integration, we pass quad the function gaussian and then the limits we want to integrate over--the values that x can take on. Those limits run from mu minus the number of standard deviations times sigma, up to mu plus the number of standard deviations times sigma. And then there's the optional fourth argument, the tuple (mu, sigma). Why do I need to pass that in? Because gaussian is a ternary function--a function that takes three arguments.
Since I'm integrating over values of x, I have to fix mu and sigma to constants, and that's what the tuple does down here. Then I take the zeroth element of what quad returns, which is its estimate of the integral.

All right, so that's the new thing; the rest of the code is all stuff you've seen. For t in range(numTrials), I choose a random mu between minus 10 and 10 and a random sigma between 1 and 10--it doesn't matter what those constants are. And then, for the number of standard deviations taking the values 1, 1.96, and 3, I integrate gaussian over the corresponding range and see what fraction of the area falls within it. In some sense, what we're doing is checking the empirical rule. We're saying: take a Gaussian--I don't care what mu and sigma are; it shouldn't matter, the empirical rule should still hold, I think--and let's just check it.
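Here is a sketch of that check; it mirrors the structure just described, though the names are mine. scipy.integrate.quad returns a (value, error-estimate) pair, and we keep element 0:

```python
import math
import random
import scipy.integrate

def gaussian(x, mu, sigma):
    return (1/(sigma*math.sqrt(2*math.pi)) *
            math.e**(-((x - mu)**2)/(2*sigma**2)))

def checkEmpirical(numTrials):
    for t in range(numTrials):
        mu = random.randint(-10, 10)
        sigma = random.randint(1, 10)
        print('For mu =', mu, 'and sigma =', sigma)
        for numStd in (1, 1.96, 3):
            # Element 0 of quad's result is the estimate of
            # the integral; element 1 is an error estimate.
            area = scipy.integrate.quad(gaussian,
                                        mu - numStd*sigma,
                                        mu + numStd*sigma,
                                        (mu, sigma))[0]
            print('  Fraction within', numStd,
                  'std =', round(area, 4))

checkEmpirical(3)
```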
Well, here are the results. For mu equals 9 and sigma equals 6--those are values I happened to draw--we see the fraction within 1, within 1.96, and within 3 standard deviations. And for all of these random mus and sigmas--you can set them to whatever you want when you get your hands on the code--essentially what we see is that the empirical rule actually works. It's one of those beautiful cases where you can test the theory and see that the theory really is sound. So there we go.

So why am I making such a big deal of normal distributions? They have lots of nice mathematical properties, some of which we've already talked about. But all of that would be irrelevant if we never saw them. The good news is that they're all over the place; I've just taken a few examples here. Up here we see SAT scores--I would never show that to high school students, or GREs to you guys--but you can see that they fit a normal distribution amazingly well. Down here is a plot of the percent change in oil prices, and again we see something very close to a normal distribution. And here are the heights of men and women; again, they clearly look very normal. So it's really quite impressive how often normal distributions occur.

But not everything is normal. We saw that the empirical rule works for normal distributions--I won't say I proved it for you; I illustrated it with a bunch of examples. But are the outcomes of the spins of a roulette wheel normal? No. They're totally uniform, right? Everything is equally probable: a 4, a 6, an 11, a 13, the double-0 if you're in Las Vegas. So if I plotted those, I'd basically just get a flat line, with everything at 1 over however many pockets there are.

So in that case, why did the empirical rule work? We did some estimates about returns, we used the empirical rule, we checked it, and, by George, it was telling us the truth. The reason is that we were not reasoning about a single spin of the wheel, but about the mean of a set of spins. If you think about it, what we were reasoning about was the return of betting. Look at one spin--say we bet $1. The return is either minus 1, because we've lost our dollar, or, if we get lucky and our pocket happens to come up, plus 36, I think, or 35--I forget which. But that's all. So if we plotted a histogram of single-spin returns, we would see a huge peak at minus 1, a little bump over at 36, and nothing in the middle. Clearly not a normal distribution.

But what we're reasoning about is not the return of a single spin; it's the return of many spins. If we play 1,000 spins, what is our expected return? As soon as we end up reasoning not about a single event but about the mean of something, we can apply something called the Central Limit Theorem. And here it is.
For something so important, it's actually very simple. It says that, given a sufficiently large sample--and I love terms like "sufficiently large," but we'll put a little meat on that later--the following three things are true.

First, the means of the samples in a set of samples--the so-called sample means--will be approximately normally distributed. So that says: if I take a sample--and remember, a sample contains multiple examples; just to remind people, a population is a set of examples, and a sample is a subset of the population, so it too is, typically, a set of examples--and if that sample is sufficiently large (certainly a sample of 1 is not sufficiently large), then if I take the mean of each sample and plot all of those means, they will be approximately normally distributed.

Second, this distribution of sample means will have a mean that is close to the mean of the population. The mean of the means will be close to the mean of the population.

And third, the variance of the sample means will be close to the variance of the population divided by the sample size.

It's really amazing that this is true, and it's dramatically useful.
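Stated compactly--this is a standard formulation, not the lecture slide's exact wording--if X1, ..., Xn are independent draws from any distribution with mean μ and variance σ², then the sample mean X̄ = (X1 + ... + Xn)/n satisfies

```latex
\mathbb{E}[\bar{X}] = \mu, \qquad
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}, \qquad
\bar{X} \ \text{is approximately} \
\mathcal{N}\!\left(\mu,\, \frac{\sigma^2}{n}\right)
\ \text{for large } n.
```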
So, to get some insight, let's check it. To do that, postulate that we have a kind of miraculous die. Instead of a die that gives you 1, 2, 3, 4, 5, or 6 when you roll it, this particular die is continuous: it gives you a real number between 0 and 5 (or maybe it should be between 1 and 6, but here it's 0 to 5). So it's a continuous die, and we're going to roll it a lot of times.

We ask: how many dice? And then: how many times are we going to roll that number of dice? The number of dice will be the sample size, and then we'll take a bunch of samples, which I'm calling the number of rolls. Then we'll plot it; I'm choosing some bins and some colors and some styles and various other things, just to show you how to use the keyword arguments.

Actually, I said the number of rolls is the number of trials, but it isn't quite that, because I get the number of trials by dividing the number of rolls by the number of dice. So if I have more dice, I get fewer samples--more dice per sample. Then we just do it. Each roll will be between 0 and 5, because random.random() returns a number between 0 and 1 and I'm multiplying it by 5. Then we look at the means and plot it all. Again, we're playing games with weights, just to make the plot a little easier to read.
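A minimal sketch of that experiment (the names are mine; the lecture's plotting is fancier):

```python
import random
import pylab

def plotMeans(numDice, numRolls, numBins, legend):
    means = []
    # One sample is the mean of numDice rolls, so a fixed
    # number of rolls yields numRolls//numDice samples.
    for i in range(numRolls//numDice):
        total = 0.0
        for j in range(numDice):
            total += 5*random.random()  # continuous die, 0 to 5
        means.append(total/numDice)
    # Weights turn bin counts into fractions on the y-axis.
    pylab.hist(means, numBins,
               weights=[1/len(means)]*len(means), label=legend)
    mean = sum(means)/len(means)
    std = (sum((m - mean)**2 for m in means)/len(means))**0.5
    return mean, std

mean, std = plotMeans(1, 100000, 19, '1 die')
print('Mean of rolling 1 die =', round(mean, 4),
      'Std =', round(std, 4))
mean, std = plotMeans(50, 100000, 19, 'Mean of 50 dice')
print('Mean of rolling 50 dice =', round(mean, 4),
      'Std =', round(std, 4))
pylab.legend()
pylab.show()
```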
And here's what we get. If we roll one die, the mean is very close to 2.5. That's certainly what you'd expect: it's a random number between 0 and 5, and 2.5 is a pretty good guess at the average. And it has a standard deviation of about 1.44. That's a little harder to guess, but you could figure it out with a little math (for a uniform distribution on [0, 5] it's 5 divided by the square root of 12, about 1.44), or, as I did here, with the simulation.

But now, if I roll 50 dice, the mean is again close to 2.5--that's what you'd expect; I roll 50 dice and get the mean value of those 50--but look how much smaller the standard deviation is. More importantly, while the probability is flat across all possible values between 0 and 5 for a single die, the distribution of the means is, if not quite Gaussian, pretty close. Why is it not quite Gaussian? Well, I didn't run it an infinite number of times. I ran it quite a few times--enough that you wouldn't have wanted to sit here while it ran--but not an infinite number. Still, you can see the amazing thing here: when I go from looking at one die to looking at the mean of 50, suddenly I have a normal distribution. And that means I can bring the Central Limit Theorem to bear on the problem.

We can try the same thing for roulette. Again, I'm not going to make you sit through a million trials of 200 spins each, and I'll do it only for fair roulette. It's a very simple simulation, and we'll see what we get.
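Here is a sketch of that simulation. I've stripped the wheel down to its essentials and assumed the usual payoff convention for a fair 36-pocket wheel: a winning $1 bet on a single pocket nets +35, a losing one nets -1, so the expected return is exactly 0:

```python
import random
import pylab

def meanReturn(numSpins):
    # Bet $1 on pocket 1 each spin of a fair 36-pocket wheel.
    tot = 0
    for spin in range(numSpins):
        if random.randint(1, 36) == 1:
            tot += 35      # win: net +35
        else:
            tot -= 1       # lose: net -1
    return tot/numSpins    # mean return per dollar bet

# Distribution of the mean return over many trials of 200
# spins each; by the CLT it should look roughly normal.
means = [meanReturn(200) for t in range(10000)]
pylab.hist(means, bins=19,
           weights=[1/len(means)]*len(means))
pylab.title('Mean return per bet, 200 spins per trial')
pylab.show()
```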
What we see is that it's not quite normal, again, but it definitely has that shape. Now, it's always going to be a little bit strange, because if I'm betting 1, I can't lose more than 1 per spin. So it will never be quite normal: the distribution is truncated on the left side, whereas the right tail can be arbitrarily long. Mathematically it can't be normal, but it's close enough in the main region, where most of the values lie, that we can get away with applying the empirical rule. And indeed, as we saw, it does work.

So what's the moral here? It doesn't matter what the shape of the distribution of the original values happens to be. If we're trying to estimate the mean using samples that are sufficiently large, the Central Limit Theorem allows us to use the empirical rule when computing confidence intervals.

Even that anomaly over on the left shrinks: what do you think would happen if, instead of 200 spins per trial, I used, say, 1,000? What's the probability of the average return of 1,000 bets being minus 1? Much smaller than for 200 bets. To lose 1,000 times in a row is pretty unlikely, so getting all the way out to the left becomes less likely, and the distribution will look more and more normal as the samples get bigger. So we can use the CLT to justify using the empirical rule when we compute confidence intervals.

All right, I want to look at one more example in detail, and it's kind of an interesting one. You might think that randomness is of no use for, say, finding the value of pi, because there's nothing random about pi. Similarly, you might think that randomness is of no use in integrating a function. But in fact, the way those numerical algorithms work, they use randomness. What you're about to see is that randomness--and indeed, things related to Monte Carlo simulations--can be enormously useful even when you're computing something that is inherently not random, like the value of pi here. And we won't ask you to remember that many digits on the quiz or the exam.

All right, so what's pi? People have known about pi for thousands and thousands of years. What they knew was that there was some constant--we'll call it pi; it wasn't always called that--such that the circumference of a circle equals pi times the diameter, and furthermore the area equals pi r squared. People knew that as far back as the Babylonians and the Egyptians. What they didn't know was the value of that constant.

The earliest known estimate of pi was by the Egyptians, on something called the Rhind Papyrus, pictured here. They estimated pi to be 4 times (8/9) squared, or about 3.16. Not so bad.

About 1,100 years later, an estimate of pi appeared in the Bible, in 1 Kings 7:23. It didn't say "pi is," but it described--I think this was the construction of King Solomon's Temple--"He made a molten sea, ten cubits from the one brim to the other"--it was round, a circle--"his height was five cubits, and a line of thirty cubits did compass it round about." Well, if you do the arithmetic, what does this imply the value of pi to be? 3--and I'm sure that's what Mike Pence thinks it is.

About 300 years later, Archimedes did a better job of it. He estimated pi by constructing a 96-sided polygon. There's a picture of one: it looks a lot like a circle, and on the left is one with fewer sides.
Since it was a polygon, he knew how to compute its area and its perimeter, and by working the numbers he concluded that pi is somewhere between 223/71 and 22/7. It turns out that if you take the average of those two bounds, you get a really good estimate.

About 2,000 years after Archimedes, the French mathematicians Buffon and Laplace proposed finding the value of pi using what we would today call a Monte Carlo simulation. They did not, of course, have a computer to execute it on. What they proposed started with some mathematics. Take a circle and inscribe it in a square that is two units--of whatever unit it is--on a side. We therefore know that the area of the square is 2 times 2, or 4. We know that the area of the circle is pi r squared, and since r is 1--because of the square it's inscribed in--the area of the circle is exactly pi.

What Buffon then proposed was dropping a bunch of needles at random--kind of like when Professor Grimson was sorting things--and seeing where they land. Some would land in the square but not in the circle; some would land in the circle; and any that landed outside the square you would ignore. Then, they said, since the needles are falling at random, the ratio of the needles in the circle to the needles in the square should equal the ratio of the area of the circle to the area of the square--exactly so, if you dropped an infinite number of needles. Does that make sense?

Given that, you can do some algebra and solve for the area of the circle: it has to equal the area of the square times the number of needles in the circle, divided by the number of needles in the square. And since we know that the area of the circle is pi, that tells us that pi is going to equal 4--the area of the square--times the needles in the circle, divided by the number of needles in the square.
And so the argument was: you can just drop a bunch of these needles, see where they land, add them up, and from that you would magically know the actual value of pi.

Well, we tried a simulation one year in class, but rather than using needles we had an archer, and we blindfolded him so he would shoot arrows at random, and we would see where they ended up. There's a video of this here if you want to see it; I'm going to play only the very end of the video, which describes the results: "Maybe we should just use Python to build a Monte Carlo simulation instead." So that's what happened. After it was all over, Anna came up and gave me some sound advice. And yes, I was very proud of that particular shot with the apple. I had hoped a student would put it on his or her head, but no one volunteered. That one was not done blindfolded.

At any rate, here is the simulation. First we have throwing the needles. For each needle from 1 to the number of needles, we choose a random x and a random y, and we see whether or not the point is in the circle--and there we use the Pythagorean theorem to tell us that. We just do that for every needle, and then we return exactly the formula we saw on the previous slide.

Now comes an interesting part: we need to get an estimate. We start with a number of needles and a number of trials. For each trial in range(numTrials), we get a guess at pi by throwing that number of needles, and we append that guess to our list of estimates. We then compute the standard deviation of the estimates, and we get the current estimate, which is the sum of the estimates divided by the number of estimates--just the mean. Then we print it. So, given a number of needles and a number of trials, we estimate pi, and we report the standard deviation of that estimate.
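Here are sketches of those two functions, close in spirit to what's being described, though the names are my reconstruction:

```python
import random

def throwNeedles(numNeedles):
    """One guess at pi from numNeedles random drops."""
    inCircle = 0
    for needle in range(1, numNeedles + 1):
        # A random point in one quadrant of the square;
        # by symmetry, one quadrant is enough.
        x = random.random()
        y = random.random()
        # Pythagorean theorem: is the point inside the circle?
        if (x*x + y*y)**0.5 <= 1:
            inCircle += 1
    # pi = area of square (4) * needles in circle
    #      / needles in square
    return 4*(inCircle/numNeedles)

def getEst(numNeedles, numTrials):
    estimates = []
    for t in range(numTrials):
        estimates.append(throwNeedles(numNeedles))
    curEst = sum(estimates)/len(estimates)   # mean of guesses
    sDev = (sum((e - curEst)**2 for e in estimates)
            / len(estimates))**0.5
    print('Est. =', round(curEst, 5),
          'Std. dev. =', round(sDev, 5),
          'Needles =', numNeedles)
    return curEst, sDev
```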
However, to do that to within a certain precision, I'm going to have yet another loop around it. And I'm showing you this because we often structure simulations this way. What I have here is the number of trials, which is not so interesting, and the precision. I'm saying: I would like to know the value of pi, and I would like good reason to believe that the value you give me is within, in this case, 0.005 of the true value. Or, maybe more precisely, I would like to know, with a confidence of 95%, that the true value lies within that range. So what this does--and I should probably use 1.96 instead of 2, but oh well--is keep increasing the number of needles, doubling it each time, until it's confident enough about the estimate. This is a very common pattern: we don't know how many needles we should need, so we just start with some small number and keep going until we're good.
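A sketch of that outer loop, using getEst from the sketch above; the 2 stands in for 1.96, as just mentioned:

```python
def estPi(precision, numTrials):
    numNeedles = 1000
    sDev = precision          # anything that enters the loop
    # Keep doubling the number of needles until two standard
    # deviations (roughly a 95% confidence interval on either
    # side of the estimate) fit within the requested precision.
    while sDev >= precision/2:
        curEst, sDev = getEst(numNeedles, numTrials)
        numNeedles *= 2
    return curEst

estPi(0.005, 100)
```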
All right, what happens when we run it? Well, we start with some estimates: with 1,000 needles, it told us that pi was 3.148, with a standard deviation of 0.047, et cetera. So a couple of things to notice. One: are my estimates of pi getting monotonically better? You'd like to think that as I add more needles, my estimates get more accurate. So that's question one: are they getting more accurate? Not monotonically, right? If we look at where they get worse: 3.148, and then 3.13 is already worse. I doubled the number of needles and got a worse estimate. Now, the trend is good; by the time I get down here, I've got a pretty good estimate. So overall, as I look at larger samples--a bigger subset of the population--my estimates are trending toward better, but not at every step.

On the other hand, let's look at the standard deviations. What we see is that the standard deviations are, indeed, getting monotonically better. Now, there's nothing mathematical guaranteeing that, because there is randomness here. But I'm increasing the number of needles by so much that it pretty much overrides the bad luck of drawing an unlucky random set. And we see that they keep getting better.

So the important thing to see here is not that I happened to get a better estimate, but that I know more about the estimate. I can have more confidence in the estimate, because the interval is closing in on it. The moral is that it's not sufficient to produce a good answer--I've said this before--we need to have reason to believe that the answer is close to the right answer. In this case I'm using the standard deviation: given that it's gotten quite small--1.96 times 0.002 is indeed a small number, and I could make it smaller if I wanted--I have good reason to believe I'm close to the right value of pi.

Does everyone agree with that? Is that right? Not quite, actually. It would be nice if it were true, but it isn't. So let's look at some statements. Is it correct to state that 95% of the time we run this simulation, we'll get an estimate of pi between these two values? I don't expect you to do the arithmetic in your head, but the answer is yes. That is something we believe is true by the math we've been looking at for the last two lectures.

Next statement: with a probability of 0.95, the actual value of pi is between these two values. Is that true? In fact, if I were to say that with a probability of 1 the actual value of pi is between those two values, would that be true? Yes. So they are both true facts. However, only the first of them can be inferred from our simulation. While the second fact is true, we cannot infer it from the simulation.

And to show you that statistically valid is not the same as true, we'll look at this. I've introduced a bug in my simulation: I've replaced the 4 that we saw we needed by a 2--an easy kind of mistake to make. Now, what do you think will happen if we go to the code and run it? We'll try it. We'll go down to the code and make that 4 a 2. And what you'll see as it runs is that, once again, we're getting very nice, tight confidence intervals--but totally bogus values of pi. So the statistics can tell us something about how reproducible our simulation is, but not about whether the simulation is an accurate model of reality. So what do you need to do? Something like a sanity check. Here, for example, you might compare against a crude polygon-based estimate like Archimedes' and say: well, clearly that's a totally wrong number; something is wrong with my code.

OK, so, just to wrap up: what we've shown is a way to find pi, and it's a generally useful technique. To estimate the area of any region R, you pick an enclosing region, call it E, such that it's easy to compute the area of E and R lies within it. Pick a random set of points that lie within E, let F be the fraction of those points that fall within R, multiply the area of E by F, and you're done.

This, for example, is a very common way to do integration--I promised you we'd talk about integration. So here's the sine of x. If I want to integrate sin(x) over some region, as done here, all I need to do is pick a bunch of random points--red and black in this case--and look at the ratio of one kind to the other, as in the sketch below.
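Here is a sketch of that idea applied to the integral of sin(x) from 0 to pi, whose true value is 2. The enclosing region is my choice--the rectangle from 0 to pi in x and 0 to 1 in y--and this simple version assumes the function is non-negative on the interval:

```python
import math
import random

def integrate(f, a, b, fMax, numPoints):
    # Enclose the curve in the rectangle [a, b] x [0, fMax]
    # and count the fraction of random points under the curve.
    under = 0
    for i in range(numPoints):
        x = random.uniform(a, b)
        y = random.uniform(0, fMax)
        if y <= f(x):
            under += 1
    rectArea = (b - a)*fMax
    return rectArea*(under/numPoints)

print(integrate(math.sin, 0, math.pi, 1, 1000000))
# prints something close to 2.0
```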
So again, we're using randomness to compute something that is not inherently random. This is a trick people use over and over and over again when confronted with situations where it's not easy to solve for things mathematically: you just do a simulation, and if you do it right, you get a very good answer. All right, we will move on to a different topic on Wednesday.