The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: We're going to finish today our discussion of limit theorems. I'm going to remind you what the central limit theorem is, which we introduced briefly last time. We're going to discuss what exactly it says and its implications, and then we're going to apply it to a couple of examples, mostly on the binomial distribution.

OK, so the situation is that we are dealing with a large number of independent, identically distributed random variables, and we want to look at their sum and say something about the distribution of the sum. We might want to say that the sum is distributed approximately as a normal random variable, although, formally, this is not quite right. As n goes to infinity, the distribution of the sum becomes very spread out, and it doesn't converge to a limiting distribution.

In order to get an interesting limit, we first need to take the sum and standardize it. By standardizing it, what we mean is to subtract the mean and then divide by the standard deviation, so Zn = (Sn - n E[X]) / (sigma sqrt(n)). Now, the mean of the sum is, of course, n times the expected value of each one of the X's. And the standard deviation is the square root of the variance. The variance of the sum is n times sigma squared, where sigma is the standard deviation of the X's, so the standard deviation of the sum is sigma times the square root of n. After we do this, we obtain a random variable that has 0 mean -- it's centered -- and variance equal to 1. And the variance stays the same, no matter how large n is going to be. So the distribution of Zn keeps changing with n, but it cannot change too much. It stays in place. The mean is 0, and the width remains roughly the same because the variance is 1.
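Here is a quick simulation sketch of that standardization step. The choice of exponential summands is arbitrary and purely illustrative; the point is that the mean and variance of Zn stay pinned at 0 and 1 no matter what n is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standardize the sum: Zn = (Sn - n*mu) / (sigma * sqrt(n)).
# Exponential(1) summands are an arbitrary illustrative choice.
mu, sigma = 1.0, 1.0                 # mean and std of Exponential(1)
for n in (5, 50, 500):
    S = rng.exponential(1.0, size=(10_000, n)).sum(axis=1)
    Z = (S - n * mu) / (sigma * np.sqrt(n))
    print(n, round(Z.mean(), 3), round(Z.var(), 3))  # roughly 0 and 1 for every n
```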
The surprising thing is that, as n grows, the distribution of Zn settles into a certain asymptotic shape, and that's the shape of a standard normal random variable. Standard normal means that it has 0 mean and unit variance. More precisely, what the central limit theorem tells us is a relation between the cumulative distribution function of Zn and the cumulative distribution function of the standard normal. For any given number c, the probability that Zn is less than or equal to c becomes, in the limit, the same as the probability that the standard normal is less than or equal to c. And of course, this is useful because these probabilities are available from the normal tables, whereas the distribution of Zn might be a very complicated expression if you were to calculate it exactly.

So, some comments about the central limit theorem. The first thing is that it's quite amazing that it's universal. It doesn't matter what the distribution of the X's is. It can be any distribution whatsoever, as long as it has finite mean and finite variance. And when you go and do your approximations using the central limit theorem, the only things that you need to know about the distribution of the X's are the mean and the variance. You need those in order to standardize Sn -- to subtract the mean and divide by the standard deviation, you need to know the mean and the variance. But these are the only things you need to know in order to apply it.

In addition, it's a very accurate computational shortcut. The distribution of Zn, in principle, you can calculate by convolving the distribution of the X's with itself many, many times. But this is tedious, and if you try to do it analytically, it might be a very complicated expression. Whereas by just appealing to the table for the standard normal random variable, things are done in a very quick way. So it's a nice computational shortcut if you don't need an exact answer to a probability problem.
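Here's a minimal Monte Carlo check of exactly that statement -- that P(Zn <= c) is close to the standard normal CDF at c. The uniform summands are an illustrative choice, with mu = 1/2 and sigma^2 = 1/12.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

# Compare P(Zn <= c) with Phi(c) for Uniform(0,1) summands.
mu, sigma = 0.5, (1 / 12) ** 0.5
n, c = 30, 1.0
S = rng.uniform(size=(200_000, n)).sum(axis=1)
Z = (S - n * mu) / (sigma * np.sqrt(n))
print((Z <= c).mean())         # Monte Carlo estimate of P(Zn <= c)
print(NormalDist().cdf(c))     # Phi(1) = 0.8413...
```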
Now, at a more philosophical level, it justifies why we are really interested in normal random variables. Whenever you have a phenomenon which is noisy, and the noise that you observe is created by adding lots of little pieces of randomness that are independent of each other, the overall effect that you're going to observe can be described by a normal random variable.

So in a classic example that goes 100 years back or so, suppose that you have a fluid, and inside that fluid there's a little particle of dust or whatever that's suspended in there. That little particle gets hit by molecules completely at random, and so what you're going to see is that particle moving randomly inside that liquid. Now, for that random motion, if you ask, after one second, how much is my particle displaced -- let's say along the x direction -- that displacement is very, very well modeled by a normal random variable. And the reason is that the position of that particle is decided by the cumulative effect of lots of random hits by molecules. So that's a celebrated physical model that goes under the name of Brownian motion.

And it's the same model that some people use to describe movements in the financial markets. The argument might go that the movement of prices has to do with lots of little decisions and lots of little events by many, many different actors that are involved in the market. So the distribution of stock prices might be well described by normal random variables. At least that's what people wanted to believe until somewhat recently. Now, the evidence is that, actually, these distributions are a little more heavy-tailed, in the sense that extreme events are a little more likely to occur than what normal random variables would seem to indicate. But as a first model, again, it could be a plausible argument to have, at least as a starting model, one that involves normal random variables. So this is the philosophical side of things.
On the more accurate, mathematical side, it's important to appreciate exactly what kind of statement the central limit theorem is. It's a statement about the convergence of the CDF of these standardized random variables to the CDF of a normal. So it's a statement about convergence of CDFs. It's not a statement about convergence of PMFs, or convergence of PDFs. Now, if one makes additional mathematical assumptions, there are variations of the central limit theorem that talk about PDFs and PMFs. But in general, that's not necessarily the case.

And I'm going to illustrate this with a plot which is not in your slides, but just to make the point. Consider two different discrete distributions. The first discrete distribution takes the values 1, 4, 7. The second can take the values 1, 2, 4, 6, and 7. So the first one has a sort of periodicity of 3; for the second one, the range of values is a little more interesting. The numbers in these two distributions are cooked up so that they have the same mean and the same variance.

Now, what I'm going to do is to take eight independent copies of the random variable and plot the PMF of the sum of the eight random variables. If I plot the PMF of the sum of 8 of these, I get the plot which corresponds to the bullets in this diagram. If I take 8 random variables according to the second distribution, add them up, and compute their PMF, the PMF I get is the one denoted here by the X's. The two PMFs look really different, at least when you eyeball them. On the other hand, if you were to plot their CDFs and compare them with the normal CDF, which is this continuous curve -- the CDF, of course, goes up in steps, because we're looking at discrete random variables -- each of them is very close to the normal CDF. And if, instead of n equal to 8, we were to take 16, then the agreement would be even better. So in terms of CDFs, when we add 8 or 16 of these, we get very close to the normal CDF, and we would get essentially the same picture whichever of the two distributions we started from.
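Here's a sketch of that experiment in code. The first PMF is uniform on {1, 4, 7}, which has mean 4 and variance 6; the weights on {1, 2, 4, 6, 7} are one possible choice matching those moments -- the exact weights used on the plot aren't given, so these are illustrative.

```python
import numpy as np
from statistics import NormalDist

# Two PMFs on {0,...,7} with the same mean (4) and variance (6).
pmf_a = np.zeros(8); pmf_a[[1, 4, 7]] = 1 / 3
pmf_b = np.zeros(8); pmf_b[[1, 7]] = 0.25; pmf_b[[2, 6]] = 0.1875; pmf_b[4] = 0.125

def pmf_of_sum(pmf, n):
    """PMF of the sum of n i.i.d. copies, by repeated convolution."""
    out = pmf
    for _ in range(n - 1):
        out = np.convolve(out, pmf)
    return out

n = 8
sum_a, sum_b = pmf_of_sum(pmf_a, n), pmf_of_sum(pmf_b, n)
normal = NormalDist(n * 4, (n * 6) ** 0.5)       # matching mean and variance
for k in range(28, 37):                          # a few values around the mean 32
    print(k,
          round(sum_a[: k + 1].sum(), 3),        # CDF of the first sum
          round(sum_b[: k + 1].sum(), 3),        # CDF of the second sum
          round(normal.cdf(k + 0.5), 3))         # normal CDF, midpoint cutoff
```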
So the CDFs sit, essentially, on top of each other, although the two PMFs look quite different. This is to appreciate that, formally speaking, we only have a statement about CDFs, not about PMFs.

Now, in practice, how do you use the central limit theorem? Well, it tells us that we can calculate probabilities by treating Zn as if it were a standard normal random variable. Now, Zn is a linear function of Sn; conversely, Sn is a linear function of Zn. Linear functions of normals are normal. So if I pretend that Zn is normal, it's essentially the same as pretending that Sn is normal. And so we can calculate probabilities that have to do with Sn as if Sn were normal. Now, the central limit theorem does not tell us that Sn is approximately normal -- the formal statement is about Zn -- but, practically speaking, when you use the result, you can just pretend that Sn is normal.

Finally, it's a limit theorem, so it tells us about what happens when n goes to infinity. If we are to use it in practice, of course, n is not going to be infinity. Maybe n is equal to 15. Can we use a limit theorem when n is a number as small as 15? Well, it turns out that it's a very good approximation. Even for quite small values of n, it gives us very accurate answers. So n on the order of 15, or 20, or so gives us very good results in practice. There are no good theorems that will give us hard guarantees, because the quality of the approximation does depend on the details of the distribution of the X's. If the X's have a distribution that, from the outset, looks a little bit like the normal, then for small values of n you are going to see, essentially, a normal distribution for the sum. If the distribution of the X's is very different from the normal, it's going to take a larger value of n for the central limit theorem to take effect.

So let's illustrate this with a few representative plots. Here, we're starting with a discrete uniform distribution that goes from 1 to 8.
Let's add 2 of these random variables -- 2 random variables with this PMF -- and find the PMF of the sum. This is a convolution of 2 discrete uniforms, and I believe you have seen this exercise before. When you convolve this with itself, you get a triangle. So this is the PMF for the sum of two discrete uniforms. Now let's continue: let's convolve this with itself. This is going to give us the PMF of a sum of 4 discrete uniforms. And we get this, which starts looking like a normal. If we go to n equal to 32, then it looks, essentially, exactly like a normal, and it's an excellent approximation. So this is the PMF of the sum of 32 discrete random variables with this uniform distribution.

Now, this distribution is symmetric around the mean. If we start with a PMF which is non-symmetric -- this one here is a truncated geometric PMF -- then things do not work out as nicely when I add 8 of these. That is, if I convolve this with itself 8 times, I get this PMF, which maybe resembles the normal one a little bit. But you can really tell that it's different from the normal if you focus on the details here and there. Here it rises sharply; here it tails off a bit more slowly. So there's an asymmetry present, which is a consequence of the asymmetry of the distribution we started with. If we go to 16, it looks a little better, but still you can see the asymmetry between this tail and that tail. If you get to 32, there's still a little bit of asymmetry, but at least now it starts looking like a normal distribution.

So the moral from these plots is that it might vary a little bit what kind of values of n you need before you get a really good approximation. But for values of n in the range of 20 to 30 or so, usually you expect to get a pretty good approximation. At least that's what visual inspection of these graphs tells us.
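One way to quantify the asymmetry seen in these plots: the skewness of a sum of n i.i.d. random variables equals the skewness of one of them divided by the square root of n, so it fades, but only slowly. A small sketch, using an illustrative truncated geometric on {1, ..., 8} (the parameter 0.3 is an arbitrary choice, not the one on the slide):

```python
import numpy as np

# Truncated geometric PMF on {1,...,8}; the parameter 0.3 is illustrative.
p = 0.3
pmf = np.zeros(9)
pmf[1:] = (1 - p) ** np.arange(8) * p
pmf /= pmf.sum()

def skewness(pmf):
    """Skewness E[(X - mu)^3] / sigma^3 of a PMF on {0, 1, 2, ...}."""
    k = np.arange(len(pmf))
    mu = (k * pmf).sum()
    var = ((k - mu) ** 2 * pmf).sum()
    return ((k - mu) ** 3 * pmf).sum() / var ** 1.5

conv = pmf.copy()
print(1, round(skewness(pmf), 3))
for n in range(2, 33):
    conv = np.convolve(conv, pmf)           # PMF of the sum of n copies
    if n in (8, 16, 32):
        print(n, round(skewness(conv), 3))  # decays like 1/sqrt(n)
```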
So now that we know we have a good approximation in our hands, let's use it, by revisiting an example from last time: the polling problem. We're interested in the fraction f of the population that has a certain habit, and we try to find what f is. The way we do it is by polling people at random and recording the answers that they give -- whether they have the habit or not. So for each person, we get a Bernoulli random variable: with probability f, a person is going to respond 1, or yes; and with the remaining probability 1 minus f, the person responds no. We record Mn, which is how many people answered yes divided by the total number of people that we asked. That's the fraction inside our sample that answered yes.

As we discussed last time, you might start with some specs for the poll, and the specs have two parameters: the accuracy that you want, and the confidence that you want to have that you really did obtain the desired accuracy. So the spec here is that we want probability 95% that our estimate is within 1 percentage point of the true answer. So the event of interest -- the bad event -- is that the distance of the result of the poll, Mn, from the true answer f is bigger than 1 percentage point. And we're interested in calculating, or approximating, this particular probability using the central limit theorem.

One way of arranging the mechanics of this calculation is to take the event of interest and massage it, by subtracting and dividing things on both sides of this inequality, so that you bring into the picture the standardized random variable Zn, and then apply the central limit theorem. So the event of interest, written in full: Mn is the sum of the Xi's divided by n, and f is the same as nf divided by n. So this is the same as that event, and we're going to calculate the probability of this. But this is not exactly in the form in which we apply the central limit theorem.
To apply the central limit theorem, we need, down here in the denominator, to have sigma square root n. So how can I put sigma square root n here? I can divide both sides of this inequality by sigma, and then I can take a factor of square root n from here and send it to the other side. So this event is the same as that event -- this will happen if and only if that will happen -- so calculating the probability of this event is the same as calculating the probability that that event happens. And now we are in business, because the random variable that we have in here is Zn -- or rather, the absolute value of Zn -- and we're talking about the probability that the absolute value of Zn is bigger than a certain number. Since Zn is to be approximated by a standard normal random variable, our approximation is going to be: instead of asking for the absolute value of Zn to be bigger than this number, we will ask for the absolute value of Z to be bigger than this number. So this is the probability that we want to calculate, where Z is a standard normal random variable.

There's a small difficulty, the one that we also encountered last time. The difficulty is that the standard deviation sigma of the Xi's is not known. Sigma, in this example, is the square root of f times (1 minus f), and since f is unknown, the only thing that we know about sigma is that it's going to be a number at most 1/2.

OK, so we're going to have to use an inequality here. We're going to use a conservative value of sigma -- the value 1/2 -- instead of the exact value of sigma. And this gives us an inequality going this way. Let's just make sure why the inequality goes this way. We've got, on our axis, two numbers. One number is 0.01 square root n divided by sigma, and the other number is 0.02 square root n. And my claim is that the numbers are related to each other in this particular way. Why is this? Sigma is at most 1/2, so 1 over sigma is at least 2. And since 1 over sigma is at least 2, this means that this number sits to the right of that number.
So here we have the probability that Z is bigger in absolute value than this number. The probability of falling beyond the farther point out there is less than the probability of falling beyond the nearer point. So that's what that last inequality is saying -- this probability is smaller than that probability. This is the probability that we're interested in, but since we don't know sigma, we take the conservative value, and we use an upper bound in terms of the probability of being outside this smaller interval.

And now we are in business. We can start using our normal tables to calculate probabilities of interest. So, for example, let's say that we take n to be 10,000. How is the calculation going to go? We want to calculate the probability that the absolute value of Z is bigger than 0.02 times the square root of 10,000 -- that is, 0.02 times 100 -- which is the probability that the absolute value of Z is larger than or equal to 2.

And here let's do some mechanics, just to stay in shape. Since the normal is symmetric around its mean, the probability of being larger than or equal to 2 in absolute value is twice the probability that Z is larger than or equal to 2. Can we use the cumulative distribution function of Z to calculate this? Well, almost -- the CDF gives us probabilities of being less than something, not bigger than something. So we need one more step and write this as twice the quantity 1 minus the probability that Z is less than or equal to 2. This probability, now, you can read off from the normal tables, and the normal tables will tell you that it is 0.9772. And you do get an answer: the answer is 0.0456. OK, so we tried 10,000, and we find that our probability of error is about 4.5%, so we're doing better than the spec that we had.
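Here is that forward calculation as a couple of lines of Python -- a sketch using the standard library's NormalDist for the standard normal:

```python
from statistics import NormalDist

Z = NormalDist()               # standard normal, mean 0, std 1
n = 10_000
c = 0.02 * n ** 0.5            # conservative threshold: 0.02 * sqrt(n) = 2

p_error = 2 * (1 - Z.cdf(c))   # P(|Z| >= c), by symmetry of the normal
print(p_error)                 # about 0.0455 (the table's 0.9772 gives 0.0456)
```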
So this tells us that maybe we have some leeway. Maybe we can use a smaller sample size and still stay within our specs. Let's try to find how much we can push the envelope -- how much smaller can we take n?

To answer that question, we need to do this kind of calculation, essentially, going backwards. We're going to fix this probability to be 0.05 and work backwards to find what value of n we need. So we want to find n such that the probability that Z is bigger in absolute value than 0.02 square root n is 0.05. Z is a standard normal random variable, and we want the probability that we are outside this range -- the probability of those two tails together. Those two tails together should have probability 0.05. This means that this tail, by itself, should have probability 0.025, and this means that this cumulative probability should be 0.975.

Now, if this probability is to be 0.975, what should that number be? You go to the normal tables, and you find which entry corresponds to that number. I actually brought a normal table with me, and 0.975 is down here. It tells you that the number that corresponds to it is 1.96. So this tells us that 0.02 square root n should be equal to 1.96. And now, from here, you do the calculations, and you find that n is 9,604. So with a sample of 10,000, we got a probability of error of 4.5%. With a slightly smaller sample size of 9,604, we can get the probability of a mistake to be 0.05, which was exactly our spec.
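The same backwards calculation in code -- a sketch, with inv_cdf playing the role of the table lookup:

```python
from statistics import NormalDist

Z = NormalDist()

# Two-tail budget 0.05 -> one tail 0.025 -> Phi(z) = 0.975.
z = Z.inv_cdf(0.975)       # about 1.96, the table lookup
n = (z / 0.02) ** 2        # solve 0.02 * sqrt(n) = z for n
print(z, n)                # 1.9599..., about 9603.6 -> a sample of 9,604
```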
So these are essentially the two ways that you're going to be using the central limit theorem: either you're given n, and you try to calculate probabilities; or you're given the probabilities, and you want to work backwards to find n itself.

So in this example, the random variable that we dealt with was, of course, a binomial random variable. The Xi's were Bernoulli, so the sum of the Xi's was binomial. So the central limit theorem certainly applies to the binomial distribution. To be more precise, of course, it applies to the standardized version of the binomial random variable. So here's what we did, essentially, in the previous example. We fix the number p, which is the probability of success in our experiment -- p corresponds to f in the previous example. We let every Xi be a Bernoulli random variable, and our standing assumption is that these random variables are independent. When we add them, we get a random variable that has a binomial distribution. We know the mean and the variance of the binomial, so we take Sn, subtract the mean, and divide by the standard deviation. The central limit theorem tells us that the cumulative distribution function of this standardized random variable converges to the CDF of a standard normal.

So let's do one more example of a calculation, with specific numbers to work with: let's take n to be 36 and p to be 1/2, and let's approximate the probability that Sn is less than or equal to 21. The first thing to do is to find the expected value of Sn, which is n times p -- it's 18. Then we need to write down the standard deviation. The variance of Sn is the sum of the variances; it's np times (1-p). In this particular example, p times (1-p) is 1/4 and n is 36, so the variance is 9, and that tells us that the standard deviation of Sn is equal to 3.

So what we're going to do is to take the event of interest, which is Sn less than or equal to 21, and rewrite it in a way that involves the standardized random variable. To do that, we need to subtract the mean, so we write this as Sn minus 18 less than or equal to 21 minus 18. This is the same event.
And then we divide by the standard deviation, which is 3, and we end up with a nice number on the right-hand side, which is 1. So the event of interest, that Sn is less than 21, is the same as the event that the standardized random variable -- approximately a standard normal -- is less than or equal to 1. And once more, you can look this up in the normal tables, and you find that the answer you get is 0.8413.

Now it's interesting to compare this answer that we got through the central limit theorem with the exact answer. The exact answer involves the exact binomial distribution. What we have here is the binomial probability that Sn is equal to k, given by this formula, and we add over all values of k going from 0 up to 21. We write two lines of code to calculate this sum, and we get the exact answer, which is 0.8785. So there's pretty good agreement between the two, although you wouldn't necessarily call it excellent agreement.

Can we do a little better than that? OK. It turns out that we can, and here's the idea. So our random variable Sn has a mean of 18. It has a binomial distribution, described by a PMF that has a shape roughly like this and which keeps going on. Using the central limit theorem is basically pretending that Sn is normal with the right mean and variance. So, just as we approximate Zn, which has 0 mean and unit variance, by Z, which has 0 mean and unit variance, pretending that Sn is normal means approximating it with a normal that has the correct mean and correct variance. It would still be centered at 18, and it would have the same variance as the binomial PMF. So using the central limit theorem essentially means that we keep the mean and the variance what they are, but we pretend that our distribution is normal. We want to calculate the probability that Sn is less than or equal to 21.
I pretend that my random variable is normal, so I draw a line here, at 21, and I calculate the area under the normal curve going up to 21. That's essentially what we did.

Now, a smart person comes around and says: Sn is a discrete random variable, so the event that Sn is less than or equal to 21 is the same as Sn being strictly less than 22, because nothing in between can happen. So I'm going to use the central limit theorem approximation by again pretending that Sn is normal, and finding the probability of this event. What this person would do is draw a line here, at 22, and calculate the area under the normal curve all the way up to 22. Who is right? Which one is better? Well, neither, but we can do better than both if we sort of split the difference. So another way of writing the same event for Sn is to write it as Sn being less than 21.5. In terms of the discrete random variable Sn, all three of these are exactly the same event. But when you do the continuous approximation, they give you different probabilities. It's a matter of whether you integrate the area under the normal curve up to 21, up to the midway point, or up to 22. It turns out that integrating up to the midpoint is what gives us the better numerical results. So we take 21 and 1/2, and we integrate the area under the normal curve up to there.

So let's do this calculation and see what we get. What would we change here? Instead of 21, we would now write 21 and 1/2. The 18 stays what it is, but the 21 becomes 21 and 1/2, and so the number on the right-hand side becomes (21.5 minus 18) divided by 3, which is 1 plus 0.5/3, approximately 1.17. So we now look up in the normal tables the probability that Z is less than 1.17 -- this here gets approximated by the probability that the standard normal is less than 1.17 -- and the normal tables will tell us this is 0.879. Going back to the previous slide, what we got this time with this improved approximation is 0.879.
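Here's the whole comparison in code -- a sketch with n = 36 and p = 1/2, including the "two lines of code" that sum the exact binomial PMF:

```python
from math import comb
from statistics import NormalDist

n, p = 36, 0.5
mean, std = n * p, (n * p * (1 - p)) ** 0.5   # 18 and 3

# The exact answer: sum the binomial PMF from k = 0 up to 21.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(22))
print(exact)                                  # 0.8785...

# Normal approximation with three cutoffs: 21, 22, and the midpoint 21.5.
Z = NormalDist()
for cutoff in (21, 22, 21.5):
    print(cutoff, round(Z.cdf((cutoff - mean) / std), 4))
# 21 -> Phi(1) = 0.8413, 22 -> Phi(4/3) = 0.9088,
# 21.5 -> Phi(7/6) = 0.8783 (the table's 1.17 rounds this to 0.879)
```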
So 0.879 is a really good approximation of the correct number, 0.8785. The 0.8413 is what we got using 21; the 0.879 is what we get using 21 and 1/2, and it's an approximation that's right on -- a very good one. The moral from this numerical example is that doing this 1/2 correction does give us better approximations.

In fact, we can use this 1/2 idea to calculate even individual probabilities. So suppose you want to approximate the probability that Sn is equal to 19. If you were to pretend that Sn is normal and calculate this probability directly, the probability that a normal random variable is exactly equal to 19 is 0, so you don't get an interesting answer. You get a more interesting answer by writing the event that Sn is 19 as the event of falling between 18 and 1/2 and 19 and 1/2, and using the normal approximation to calculate this probability.

In terms of our previous picture, this corresponds to the following. We are interested in the probability that Sn is equal to 19, so we're interested in the height of this bar. We consider the area under the normal curve going from 18.5 to 19.5, and use this area as an approximation for the height of that particular bar. So what we're basically doing is taking the probability under the normal curve, which is assigned over a continuum of values, and attributing it to the different discrete values. Whatever is above the midpoint gets attributed to 19; whatever is below that midpoint gets attributed to 18. So this green area is our approximation of the value of the PMF at 19. Similarly, if you wanted to approximate the value of the PMF at this other point, you would take that interval and integrate the area under the normal curve over it.

It turns out that this gives a very good approximation of the PMF of the binomial. And actually, this was the context in which the central limit theorem was proved in the first place, when this business started. So this business goes back a few hundred years.
And the central limit theorem was first proved by considering the PMF of a binomial random variable when p is equal to 1/2. People did the algebra, and they found out that the exact expression for the PMF is quite well approximated by the expression that you would get from a normal distribution. Then the proof was extended to binomials for more general values of p. So here we talk about this as a refinement of the general central limit theorem, but, historically, that refinement was where the whole business got started in the first place.

All right, so let's go through the mechanics of approximating the probability that Sn is equal to 19 -- exactly 19. As we said, we're going to write this event as an event that covers an interval of unit length, from 18 and 1/2 to 19 and 1/2. This is the event of interest. The first step is to massage the event of interest so that it involves our Zn random variable: subtract 18 on all sides, and divide by the standard deviation of 3 on all sides. That's an equivalent representation of the event. This in the middle is our standardized random variable Zn, and these on either side are just numbers. To do the approximation, we want to find the probability of this event; but Zn is approximately normal, so we plug in Z, the standard normal. So we want to find the probability that the standard normal falls inside this interval. You find this using CDFs, because this is the probability that you're less than this number but not less than that one -- a difference between two cumulative probabilities. Then you look up your normal tables, you find the two numbers for these quantities, and, finally, you get a numerical answer for an individual entry of the PMF of the binomial. This is a pretty good approximation, it turns out. If you were to do the calculation using the exact formula, you would get something which is pretty close -- an error in the third digit. This is pretty good.
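The same mechanics in code -- a sketch, again with n = 36 and p = 1/2:

```python
from math import comb
from statistics import NormalDist

n, p = 36, 0.5
mean, std = n * p, (n * p * (1 - p)) ** 0.5   # 18 and 3
Z = NormalDist()

# P(Sn = 19) as the normal area between 18.5 and 19.5, after standardizing.
approx = Z.cdf((19.5 - mean) / std) - Z.cdf((18.5 - mean) / std)

# The exact binomial PMF entry, for comparison.
exact = comb(n, 19) * p**19 * (1 - p)**(n - 19)
print(round(approx, 4), round(exact, 4))      # 0.1253 vs 0.1251
```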
So I guess what we did here with our discussion of the binomial slightly contradicts what I said before -- that the central limit theorem is a statement about cumulative distribution functions. In general, it doesn't tell you how to approximate PMFs themselves, and that's indeed the case in general. On the other hand, for the special case of the binomial distribution, the central limit theorem approximation, with this 1/2 correction, is a very good approximation even for the individual PMF.

All right, so we spent quite a bit of time on mechanics. Let's spend the last few minutes today thinking a bit, and look at a small puzzle. The puzzle is the following. Consider a Poisson process that runs over a unit interval, where the arrival rate is equal to 1. So this is the unit interval, and let X be the number of arrivals. X is Poisson, with mean 1. Now, let me take this interval and divide it into n little pieces, so that each piece has length 1/n, and let Xi be the number of arrivals during the i-th little interval.

OK, what do we know about the random variables Xi? They are themselves Poisson -- each is the number of arrivals during a small interval. We also know that when n is big, so that the length of each interval is small, these Xi's are approximately Bernoulli, with mean 1/n. I guess it doesn't matter whether we model them as Bernoulli or not. What matters is that the Xi's are independent. Why are they independent? Because, in a Poisson process, disjoint intervals are independent of each other. So the Xi's are independent, and they also have the same distribution. And we have that X, the total number of arrivals, is the sum of the Xi's. So the central limit theorem tells us that, approximately, the sum of independent, identically distributed random variables, when we have lots of these random variables, behaves like a normal random variable.
725 00:45:01,530 --> 00:45:07,475 So by using this decomposition of X into a sum of i.i.d 726 00:45:07,475 --> 00:45:11,540 random variables, and by using values of n that are bigger 727 00:45:11,540 --> 00:45:16,540 and bigger, by taking the limit, it should follow that X 728 00:45:16,540 --> 00:45:19,510 has a normal distribution. 729 00:45:19,510 --> 00:45:22,120 On the other hand, we know that X has a Poisson 730 00:45:22,120 --> 00:45:23,370 distribution. 731 00:45:25,270 --> 00:45:32,640 So something must be wrong in this argument here. 732 00:45:32,640 --> 00:45:34,900 Can we really use the central limit 733 00:45:34,900 --> 00:45:38,330 theorem in this situation? 734 00:45:38,330 --> 00:45:41,300 So what do we need for the central limit theorem? 735 00:45:41,300 --> 00:45:44,160 We need to have independent, identically 736 00:45:44,160 --> 00:45:46,700 distributed random variables. 737 00:45:46,700 --> 00:45:49,060 We have it here. 738 00:45:49,060 --> 00:45:53,410 We want them to have a finite mean and finite variance. 739 00:45:53,410 --> 00:45:57,610 We also have it here, means variances are finite. 740 00:45:57,610 --> 00:46:02,050 What is another assumption that was never made explicit, 741 00:46:02,050 --> 00:46:04,080 but essentially was there? 742 00:46:07,680 --> 00:46:13,260 Or in other words, what is the flaw in this argument that 743 00:46:13,260 --> 00:46:15,520 uses the central limit theorem here? 744 00:46:15,520 --> 00:46:16,770 Any thoughts? 745 00:46:24,110 --> 00:46:29,640 So in the central limit theorem, we said, consider-- 746 00:46:29,640 --> 00:46:34,820 fix a probability distribution, and let the Xi's 747 00:46:34,820 --> 00:46:38,280 be distributed according to that probability distribution, 748 00:46:38,280 --> 00:46:42,935 and add a larger and larger number or Xi's. 749 00:46:42,935 --> 00:46:47,410 But the underlying, unstated assumption is that we fix the 750 00:46:47,410 --> 00:46:49,490 distribution of the Xi's. 751 00:46:49,490 --> 00:46:52,810 As we let n increase, the statistics of 752 00:46:52,810 --> 00:46:55,930 each Xi do not change. 753 00:46:55,930 --> 00:46:59,010 Whereas here, I'm playing a trick on you. 754 00:46:59,010 --> 00:47:03,700 As I'm taking more and more random variables, I'm actually 755 00:47:03,700 --> 00:47:07,850 changing what those random variables are. 756 00:47:07,850 --> 00:47:12,960 When I take a larger n, the Xi's are random variables with 757 00:47:12,960 --> 00:47:15,720 a different mean and different variance. 758 00:47:15,720 --> 00:47:19,800 So I'm adding more of these, but at the same time, in this 759 00:47:19,800 --> 00:47:23,420 example, I'm changing their distributions. 760 00:47:23,420 --> 00:47:26,380 That's something that doesn't fit the setting of the central 761 00:47:26,380 --> 00:47:27,000 limit theorem. 762 00:47:27,000 --> 00:47:29,910 In the central limit theorem, you first fix the distribution 763 00:47:29,910 --> 00:47:31,200 of the X's. 764 00:47:31,200 --> 00:47:35,290 You keep it fixed, and then you consider adding more and 765 00:47:35,290 --> 00:47:38,950 more according to that particular fixed distribution. 766 00:47:38,950 --> 00:47:40,020 So that's the catch. 767 00:47:40,020 --> 00:47:42,240 That's why the central limit theorem does not 768 00:47:42,240 --> 00:47:43,970 apply to this situation. 
769 00:47:43,970 --> 00:47:46,230 And we're lucky that it doesn't apply because, 770 00:47:46,230 --> 00:47:50,220 otherwise, we would have a huge contradiction destroying 771 00:47:50,220 --> 00:47:52,770 probability theory. 772 00:47:52,770 --> 00:48:02,240 OK, but now that still leaves us with a 773 00:48:02,240 --> 00:48:05,040 little bit of a dilemma. 774 00:48:05,040 --> 00:48:08,510 Suppose that, here, essentially, we're adding 775 00:48:08,510 --> 00:48:12,815 independent Bernoulli random variables. 776 00:48:22,650 --> 00:48:25,300 So the issue is that the central limit theorem has to 777 00:48:25,300 --> 00:48:28,920 do with asymptotics as n goes to infinity. 778 00:48:28,920 --> 00:48:34,260 And if we consider a binomial, and somebody gives us specific 779 00:48:34,260 --> 00:48:38,870 numbers about the parameters of that binomial, it might not 780 00:48:38,870 --> 00:48:40,830 necessarily be obvious what kind of 781 00:48:40,830 --> 00:48:42,790 approximation to use. 782 00:48:42,790 --> 00:48:45,660 In particular, we do have two different approximations for 783 00:48:45,660 --> 00:48:47,100 the binomial. 784 00:48:47,100 --> 00:48:51,610 If we fix p, then the binomial is the sum of Bernoulli's that 785 00:48:51,610 --> 00:48:54,930 come from a fixed distribution, and we consider more 786 00:48:54,930 --> 00:48:56,450 and more of these. 787 00:48:56,450 --> 00:48:58,990 When we add them, the central limit theorem tells us that we 788 00:48:58,990 --> 00:49:01,190 get the normal distribution. 789 00:49:01,190 --> 00:49:04,430 There's another sort of limit, which has the flavor of this 790 00:49:04,430 --> 00:49:10,770 example, in which we still deal with a binomial, a sum of n 791 00:49:10,770 --> 00:49:11,170 Bernoulli's. 792 00:49:11,170 --> 00:49:14,310 We let the number of 793 00:49:14,310 --> 00:49:16,090 Bernoulli's in that sum go to infinity. 794 00:49:16,090 --> 00:49:18,890 But each Bernoulli has a probability of success that 795 00:49:18,890 --> 00:49:23,830 goes to 0, and we do this in a way so that np, the expected 796 00:49:23,830 --> 00:49:27,090 number of successes, stays finite. 797 00:49:27,090 --> 00:49:30,660 This is the situation that we dealt with when we first 798 00:49:30,660 --> 00:49:32,960 defined our Poisson process. 799 00:49:32,960 --> 00:49:37,540 We have a very, very large number of time slots, 800 00:49:37,540 --> 00:49:40,920 but during each time slot, there's a tiny probability of 801 00:49:40,920 --> 00:49:42,950 obtaining an arrival. 802 00:49:42,950 --> 00:49:48,460 Under that setting, in discrete time, we have a 803 00:49:48,460 --> 00:49:51,670 binomial distribution, or a Bernoulli process, but when we 804 00:49:51,670 --> 00:49:54,530 take the limit, we obtain the Poisson process and the 805 00:49:54,530 --> 00:49:56,470 Poisson approximation. 806 00:49:56,470 --> 00:49:58,510 So these are two equally valid 807 00:49:58,510 --> 00:50:00,550 approximations of the binomial. 808 00:50:00,550 --> 00:50:03,300 But they're valid in different asymptotic regimes. 809 00:50:03,300 --> 00:50:06,180 In one regime, we fix p and let n go to infinity. 810 00:50:06,180 --> 00:50:09,360 In the other regime, we let both n and p change 811 00:50:09,360 --> 00:50:11,540 simultaneously. 812 00:50:11,540 --> 00:50:14,240 Now, in real life, you're never dealing with the 813 00:50:14,240 --> 00:50:15,290 limiting situations. 814 00:50:15,290 --> 00:50:17,870 You're dealing with actual numbers.
815 00:50:17,870 --> 00:50:21,820 So if somebody tells you that the numbers are like this, 816 00:50:21,820 --> 00:50:25,160 then you should probably say that this is the situation 817 00:50:25,160 --> 00:50:27,380 that fits the Poisson description-- 818 00:50:27,380 --> 00:50:30,180 a large number of slots, with each slot having a tiny 819 00:50:30,180 --> 00:50:32,460 probability of success. 820 00:50:32,460 --> 00:50:36,890 On the other hand, if p is something like this, and n is 821 00:50:36,890 --> 00:50:40,460 500, then you can look at the distribution of the 822 00:50:40,460 --> 00:50:41,680 number of successes. 823 00:50:41,680 --> 00:50:45,740 It's going to have a mean of 50 and a fair amount 824 00:50:45,740 --> 00:50:47,280 of spread around there. 825 00:50:47,280 --> 00:50:50,150 It turns out that the normal approximation would be better 826 00:50:50,150 --> 00:50:51,500 in this context. 827 00:50:51,500 --> 00:50:57,120 As a rule of thumb, if n times p is bigger than 10 or 20, you 828 00:50:57,120 --> 00:50:59,320 can start using the normal approximation. 829 00:50:59,320 --> 00:51:04,310 If n times p is a small number, then you prefer to use 830 00:51:04,310 --> 00:51:06,090 the Poisson approximation. 831 00:51:06,090 --> 00:51:08,840 But there are no hard theorems or rules about 832 00:51:08,840 --> 00:51:11,650 how to go about this. 833 00:51:11,650 --> 00:51:15,440 OK, so from next time we're going to switch gears again. 834 00:51:15,440 --> 00:51:17,830 And we're going to put together everything we learned 835 00:51:17,830 --> 00:51:20,620 in this class to start solving inference problems.
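[As a rough numerical illustration of this rule of thumb, here is a sketch comparing the two approximations in both regimes, assuming Python's standard library. The pair n = 100, p = 0.01 is a hypothetical stand-in for the small-np case; n = 500 with p = 0.1 matches the mean of 50 mentioned above.]

```python
import math

def binom_pmf(k, n, p):
    """Exact PMF of Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, mu):
    # computed in log space to avoid overflow for large k
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def norm_cdf(x):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pmf(k, n, p):
    """Normal approximation to the binomial PMF with the 1/2 correction."""
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    return norm_cdf((k + 0.5 - mu) / sigma) - norm_cdf((k - 0.5 - mu) / sigma)

for n, p in ((100, 0.01), (500, 0.1)):   # np = 1 versus np = 50
    err_poisson = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, n * p))
                      for k in range(n + 1))
    err_normal = max(abs(binom_pmf(k, n, p) - normal_pmf(k, n, p))
                     for k in range(n + 1))
    print(f"n = {n}, p = {p}: Poisson error = {err_poisson:.5f}, "
          f"normal error = {err_normal:.5f}")
```

[For np = 1, the Poisson approximation wins by a wide margin; for np = 50, the normal approximation with the 1/2 correction is the better one.]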