The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Good afternoon, everybody. Welcome to Lecture 8. We're now more than halfway through the lectures. All right, the topic of today is sampling.

I want to start by reminding you about this whole business of inferential statistics. We make inferences about populations by examining one or more random samples drawn from that population. We used Monte Carlo simulation over the last two lectures. And the key idea there, as we saw in trying to find the value of pi, was that we can generate lots of random samples, and then use them to compute confidence intervals. And then we use the empirical rule to say, all right, we really have good reason to believe that 95% of the time we run this simulation, our answer will be between here and here.

Well, that's all well and good when we're doing simulations. But what happens when you have to actually sample something real? For example, you run an experiment, and you get some data points. And it's too hard to do it over and over again. Think about political polls. Here was an interesting poll. How were these numbers created? Not by simulation. They didn't run 1,000 polls and then compute the confidence interval. They ran one poll-- of 835 people, in this case. And yet they claim to have a confidence interval. That's what that margin of error is. Obviously they needed that large confidence interval. So how is this done?

Backing up for a minute, let's talk about how sampling is done when you are not running a simulation. You want to do what's called probability sampling, in which each member of the population has a non-zero probability of being included in a sample. There are, roughly speaking, two kinds.
We'll spend, really, all of our time on something called simple random sampling. And the key idea here is that each member of the population has an equal probability of being chosen for the sample, so there's no bias.

Now, that's not always appropriate. I do want to take a minute to talk about why. So suppose we wanted to survey MIT students to find out what fraction of them are nerds-- which, by the way, I consider a compliment. Suppose we wanted to consider a random sample of 100 students. We could walk around campus and choose 100 people at random. And if 12% of them were nerds, we would say 12% of the MIT undergraduates are nerds-- if 98%, et cetera.

Well, the problem with that is, let's look at the majors by school. This is actually the distribution of majors at MIT by school. And you can see that they're not exactly evenly distributed. And so if you went around and just sampled 100 students at random, there'd be a reasonably high probability that they would all be from engineering and science. And that might give you a misleading notion of the fraction of MIT students that were nerds, or it might not.

In such situations we do something called stratified sampling, where we partition the population into subgroups, and then take a simple random sample from each subgroup, proportional to the size of the subgroups. So we would certainly want to take more students from engineering than from architecture. But we probably want to make sure we got somebody from architecture in our sample.

This, by the way, is the way most political polls are done. They're stratified. They say, we want to get so many rural people, so many city people, so many minorities-- things like that. And in fact, that's probably where the recent election polls all got messed up. They did-- retrospectively, at least-- a bad job of stratifying.

So we use stratified sampling when there are small subgroups that we want to make sure are represented, and we want to represent them proportional to their size in the population.
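To make that concrete, here is a minimal sketch of proportional stratified sampling. It is not code from the lecture, and the subgroups and their sizes are made up for illustration:

    import random

    # Hypothetical subgroups; in the MIT example these would be the
    # students in each school.
    strata = {'engineering': list(range(5000)),
              'science': list(range(3000)),
              'architecture': list(range(200))}

    total = sum(len(members) for members in strata.values())
    sample_size = 100

    sample = []
    for name, members in strata.items():
        # Take a simple random sample from each subgroup, proportional
        # to the subgroup's share of the population (at least 1 each).
        k = max(1, round(sample_size * len(members) / total))
        sample.extend(random.sample(members, k))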
Stratifying can also be used to reduce the needed size of the sample. If we wanted to make sure we got some architecture students in a simple random sample, we'd need to start with more than 100 people. But if we stratify, we can take fewer samples. It works well when you do it properly. But it can be tricky to do it properly. And we are going to stick to simple random samples here.

All right, let's look at an example. This is a map of temperatures in the United States. And so our running example today will be sampling to get information about average temperatures. And of course, as you can see, they're highly variable. And we live in one of the cooler areas.

The data we're going to use is real data-- it's in the zip file that I put up for the class-- from the US Centers for Environmental Information. And it's got the daily high and low temperatures for 21 different American cities, every day from 1961 through 2015-- a total of about 422,000 examples. So it's a fairly good-sized dataset, and it's fun to play with.

All right, so we're now in the part of the course where the next series of lectures, including today, is going to be about data science-- how to analyze data. I always like to start by actually looking at the data-- not looking at all 422,000 examples, but producing a plot to give me a sense of what the data looks like. I'm not going to walk you through the code that does this plot. I do want to point out that there are two things in it that we may not have seen before.

Simply enough, I'm going to use numpy.std to get standard deviations instead of my own code for it, and random.sample to take simple random samples from the population. random.sample takes two arguments. The first is some sort of a sequence of values. And the second is an integer telling you how many samples you want.
And it returns a list containing that many randomly chosen, distinct elements. Distinct elements is important, because there are two ways that people do sampling. You can do sampling without replacement, which is what's done here. You take a sample, and then it's out of the population, so you won't draw it the next time. Or you can do sampling with replacement, which allows you to draw the same example multiple times. We'll see later in the term that there are good reasons that we sometimes prefer sampling with replacement. But usually we're doing sampling without replacement. And that's what we'll do here. So we won't get Boston on April 3rd of the same year multiple times.
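Here is a minimal sketch of the difference-- not from the lecture-- using a toy population of integers; random.choices is one standard way to sample with replacement:

    import random

    population = list(range(100))  # a toy population

    # Without replacement: random.sample returns a new list of
    # distinct elements, so nothing can be drawn twice.
    without_replacement = random.sample(population, 10)

    # With replacement: random.choices can return the same element
    # more than once.
    with_replacement = random.choices(population, k=10)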
All right. So here's the histogram the code produces. You can run it yourself now, if you want, or you can run it later. And here's what it looks like. For the daily high temperatures, the mean is 16.3 degrees Celsius. I sort of vaguely know what that feels like. And as you can see, it's kind of an interesting distribution. It's not normal. But it's not that far off, right? We have a little tail of cold temperatures on the left. And it is what it is. It's not a normal distribution. And we'll later see that doesn't really matter.

OK, so this gives me a sense. The next thing I'll get is some statistics. So we know the mean is 16.3 and the standard deviation is approximately 9.4 degrees. If you look at it, you can believe that.

Well, here's a histogram of one random sample of size 100. Looks pretty different, as you might expect. Its standard deviation is 10.4, its mean 17.7. So even though the figures look a little different, in fact, the means and standard deviations are pretty similar. If we look at the population mean and the sample mean-- and I'll try to be careful to use those terms-- they're not the same. But they're in the same ballpark. And the same is true of the two standard deviations.

Well, that raises the question: did we get lucky, or is this something we should expect? If we draw 100 random examples, should we expect them to correspond to the population as a whole? And the answer is sometimes yes and sometimes no. And that's one of the issues I want to explore today.

So one way to see whether it's a happy accident is to try it 1,000 times. We can draw 1,000 samples of size 100 and plot the results. Again, I'm not going to go over the code. There's something in that code, as well, that we haven't seen before. And that's the axvline plotting command. V for vertical. It just, in this case, will draw a red line-- because I've said the color is r-- at the population mean on the x-axis. So just a vertical line that shows us where the mean is. If we wanted to draw a horizontal line, we'd use axhline. Just showing you a couple of useful functions.
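A sketch of that experiment-- not the lecture's actual code, and using a synthetic stand-in population since the temperature file isn't reproduced here:

    import random
    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in population; the lecture uses about 422,000 real daily highs.
    population = [random.gauss(16.3, 9.4) for _ in range(422000)]
    pop_mean = np.mean(population)

    # Draw 1,000 samples of size 100 and record each sample's mean.
    sample_means = [np.mean(random.sample(population, 100))
                    for _ in range(1000)]

    plt.hist(sample_means, bins=20)
    plt.axvline(x=pop_mean, color='r')  # vertical red line at the mean
    plt.xlabel('Mean Daily High (C)')
    plt.ylabel('Number of Samples')
    plt.show()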
When we try it 1,000 times, here's what it looks like. So here we see what we had originally, the same picture I showed you before. And here's what we get when we look at the means of the 1,000 samples of size 100. The plot of the means looks a lot more like a normal distribution than the plot of the raw data. Should that surprise us, or is there a reason we should have expected that to happen?

Well, what's the answer? Someone tell me why we should have expected it. It's because of the central limit theorem, right? That's exactly what the central limit theorem promised us would happen. And, sure enough, it's pretty close to normal. So that's a good thing.

And now if we look at it, we can see that the mean of the sample means is 16.3, and the standard deviation of the sample means is 0.94. So if we go back to what we saw before, we see that, actually, when we run it 1,000 times and look at the means, we get very close to what we had initially. So, indeed, it's not a happy accident. It's something we can in general expect.

All right, what's the 95% confidence interval here? Well, it's going to be 16.28 plus or minus 1.96 times 0.94, the standard deviation of the sample means. And so it tells us that the mean high temperature is somewhere between 14.5 and 18.1. Well, that's actually a pretty big range, right? It's roughly the difference between weather where you wear a sweater and weather where you don't.

So the good news is it includes the population mean. That's nice. But the bad news is it's pretty wide. Suppose we wanted a tighter bound. I said, all right, sure enough, the central limit theorem tells me the mean of the means is going to give me a good estimate of the actual population mean. But I want a tighter bound. What can I do? Well, let's think about a couple of things we could try.

One thing we could think about is drawing more samples. Suppose instead of 1,000 samples, I'd taken 2,000 or 3,000 samples. We can ask, would that have given me a smaller standard deviation? For those of you who have not looked ahead, what do you think? Who thinks it will give you a smaller standard deviation? Who thinks it won't? And the rest of you have either looked ahead or refused to think. I prefer to believe you looked ahead.

Well, we can run the experiment. You can go to the code, and you'll see that there is a constant of 1,000, which you can easily change to 2,000. And lo and behold, the standard deviation barely budges. It got a little bit bigger, as it happens, but that's kind of an accident. It just, more or less, doesn't change. And it won't change if I go to 3,000 or 4,000 or 5,000. It'll wiggle around. But it won't help much. What we can see is that drawing more samples is not going to help.

Suppose we take larger samples? Is that going to help? Who thinks that will help? And who thinks it won't? OK.
Well, we can again run the experiment. I did run the experiment. I changed the sample size from 100 to 200. And, again, you can run this if you want. And if you run it, you'll get a result-- maybe not exactly this, but something very similar-- that, indeed, as I increase the size of the samples rather than the number of samples, the standard deviation drops fairly dramatically-- in this case from 0.94 to 0.66. That's consistent with the standard deviation of the sample means scaling as one over the square root of the sample size: doubling the sample size divides it by the square root of 2, and 0.94 divided by the square root of 2 is about 0.66. So that's a good thing.

I now want to digress a little bit before we come back to this, because this is a technique you'll want to use as you write papers and things like that: how do we visualize the variability of the data? It's usually done with something called an error bar. You've all seen these things. This is one I took from the literature. It plots resting pulse rate against how frequently you exercise. And what you can see here is there's definitely a downward trend, suggesting that the more you exercise, the lower your average resting pulse. That's probably worth knowing. And these error bars give us the 95% confidence intervals for different subpopulations.

And what we can see here is that some of them overlap. So, yes, once a fortnight-- two weeks, for those of you who don't speak British-- does get a little bit lower than rarely or never. But the confidence intervals are very big. And so maybe we really shouldn't feel very comfortable that it would actually help.

The thing we can say is that if the confidence intervals don't overlap, we can conclude that the means are actually statistically significantly different-- in this case, at the 95% level. So here we see that more than weekly does not overlap with rarely or never. And from that, we can conclude that the difference is statistically significant-- that if you exercise more than weekly, your pulse is likely to be lower than if you don't.
If confidence intervals do overlap, you cannot conclude that there is no statistically significant difference. There might be one, and you can use other tests to find out. When they don't overlap, it's a good thing-- we can conclude something strong. When they do overlap, we need to investigate further.

All right, let's look at the error bars for our temperatures. We can plot those using something called pylab.errorbar. What it takes is two sequences of values-- the usual x values and y values-- and then another sequence of the same length, which gives the y errors. And here I'm just going to use 1.96 times the standard deviations. Where these variables come from, you can tell by looking at the code. And then I can give the format-- fmt stands for format-- saying I want an o to show each mean, and then a label. errorbar has different keyword arguments than plot. You'll find that as you look at different kinds of plots-- histograms, bar plots, scatterplots-- they all have different available keyword arguments, so you have to look up each one individually. But other than this, everything in the code should look very familiar to you.
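A sketch of the kind of call involved-- again with a synthetic stand-in population, since the lecture's code reads the temperature file; the bars span 1.96 standard deviations of the sample means on either side:

    import random
    import numpy as np
    import matplotlib.pyplot as plt

    population = [random.gauss(16.3, 9.4) for _ in range(422000)]

    x_vals, means, y_errs = [], [], []
    for sample_size in range(50, 650, 50):
        # 100 trials at each sample size
        sample_means = [np.mean(random.sample(population, sample_size))
                        for _ in range(100)]
        x_vals.append(sample_size)
        means.append(np.mean(sample_means))
        y_errs.append(1.96 * np.std(sample_means))

    # fmt='o' marks each mean with a dot rather than a connecting line.
    plt.errorbar(x_vals, means, yerr=y_errs, fmt='o',
                 label='Mean and 1.96*std')
    plt.xlabel('Sample Size')
    plt.ylabel('Mean Daily High (C)')
    plt.legend()
    plt.show()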
And when I run the code, I get this. So what I've plotted here is the mean against the sample size, with error bars-- 100 trials at each size, in this case. What you can see is that, as the sample size gets bigger, the error bars get smaller.

The estimates of the mean don't necessarily get any better. In fact, we can look here, and this is actually a worse estimate, relative to the true mean, than the previous two estimates. But we can have more confidence in it. It's the same thing we saw on Monday when we looked at estimating pi: dropping more needles didn't necessarily give us a more accurate estimate, but it gave us more confidence in our estimate. And the same thing is happening here. And we can see that, steadily, we get more and more confidence.

So larger samples seem to be better. That's a good thing. Going from a sample size of 50 to a sample size of 600 reduced the spread, as you can see, from a fairly large one, running from just below 14 to almost 19, to one running from about 15 and a half to 17. I said confidence interval here. I should not have. I should have said standard deviations. That's an error on the slides.

OK, what's the catch? Well, we're now looking at 100 samples, each of size 600. Adding up everything that went into that plot, we've looked at several hundred thousand examples. What has this bought us? Absolutely nothing. The entire population only contained about 422,000 examples. We might as well have looked at the whole thing, rather than sample it. It's like holding an election versus asking 800 people a million times who they're going to vote for. Sure, the answer is good. But the sampling bought us nothing.

Suppose we did it only once. Suppose we took only one sample, as we see in political polls. What can we conclude from that? And the answer is actually kind of surprising-- how much we can conclude, in a real mathematical sense, from one sample. And, again, this is thanks to our old friend, the central limit theorem.

So if you recall the theorem, it had three parts. Up till now, we've exploited the first two. We've used the fact that the means will be normally distributed, so that we could use the empirical rule to get confidence intervals, and the fact that the mean of the sample means will be close to the mean of the population. Now I want to use the third piece of it, which is that the variance of the sample means will be close to the variance of the population divided by the sample size. And we're going to use that to compute something called the standard error-- formally, the standard error of the mean.
People often just call it the standard error. And I will be, alas, inconsistent-- I sometimes call it one, sometimes the other. It's an incredibly simple formula. It says the standard error is equal to sigma divided by the square root of n-- SE = sigma/sqrt(n)-- where sigma is the population standard deviation and n is the size of the sample. And then there's just this very small function that implements it. So we can compute this thing called the standard error of the mean in a very straightforward way.

We can compute it. But does it work? What do I mean by work? I mean, what's the relationship of the standard error to the standard deviation? Because, remember, our goal was to understand the standard deviation so we could use the empirical rule.

Well, let's test the standard error of the mean. So here's a slightly longer piece of code. I'm going to look at a bunch of different sample sizes, from 25 to 600, with 50 trials each. getHighs is just a function that returns the temperatures. I'm going to get the standard deviation of the whole population, and then both the standard errors of the mean and the sample standard deviations. And then I'm just going to go through and run it. So for each size in the sample sizes, I'm going to append the standard error of the mean. And remember, that uses the population standard deviation and the size of the sample. So I'll compute all the SEMs. And then I'm going to compute all the actual standard deviations, as well. And then we'll produce a plot.

All right, so let's see what that plot looks like. Pretty striking. The blue solid line is the standard deviation of the 50 means. And the red dotted line is the standard error of the mean. So we can see, quite strikingly, that they really track each other very well.
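A sketch of that test, with a stand-in population. Note the sanity check against the numbers from earlier in the lecture: with sigma of about 9.4 and n = 100, sigma over the square root of n is about 0.94, which matches the standard deviation of the sample means we observed; with n = 200, it's about 0.66.

    import math
    import random
    import numpy as np

    def sem(pop_sd, sample_size):
        # Standard error of the mean: sigma / sqrt(n)
        return pop_sd / math.sqrt(sample_size)

    # Stand-in population; the lecture uses getHighs() on the real data.
    population = [random.gauss(16.3, 9.4) for _ in range(422000)]
    pop_sd = np.std(population)

    for size in (25, 50, 100, 200, 400, 600):
        # Standard deviation actually observed across 50 sample means...
        means = [np.mean(random.sample(population, size))
                 for _ in range(50)]
        # ...tracks the standard error predicted by the formula.
        print(size, round(np.std(means), 3), round(sem(pop_sd, size), 3))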
And this is saying that I can anticipate what the standard deviation would be by computing the standard error. Which is really useful, because now I can take one sample, compute its standard error, and get something very similar to the standard deviation I would see if I took 50 samples and looked at the standard deviation of those 50 sample means. So it's not obvious that this would be true, right? That I could use this simple formula, and the two things would track each other so well. And it's not a coincidence, by the way, that as I get out near the end, they're really lying on top of each other. As the sample size gets much larger, they really will coincide.

So, first: does everyone understand the difference between the standard deviation and the standard error? No. OK. How do we compute a standard deviation? To do that, we have to look at many samples-- in this case 50-- and we compute how much variation there is across those 50 sample means. For the standard error, we look at one sample, and we compute this thing called the standard error. And we argue that we get more or less the same number that we would have gotten had we taken 50 samples or 100 samples and computed the standard deviation. So if my only reason for taking all 50 samples was to get the standard deviation, I can skip them, take one sample instead, and use the standard error of the mean.

So going back to my temperatures-- instead of having to look at lots of samples, I only have to look at one. And I can get a confidence interval. Does that make sense? OK. There's a catch.

Notice that the formula for the standard error includes the standard deviation of the population-- not the standard deviation of the sample. Well, that's kind of a bummer. Because how can I get the standard deviation of the population without looking at the whole population?
And if we're going to look at the whole population, then what's the point of sampling in the first place? So we have a catch: we've got something that's a really good approximation, but it uses a value we don't know. So what should we do about that?

Well, what would be, really, the only obvious thing to try? What's our best guess at the standard deviation of the population if we have only one sample to look at? What would you use? Somebody? I know I forgot to bring the candy today, so no one wants to answer any questions.

AUDIENCE: The standard deviation of the sample?

PROFESSOR: The standard deviation of the sample. It's all I've got. So let's ask the question, how good is that? Shockingly good.

So I looked at our example here for the temperatures. And I'm plotting the sample standard deviation versus the population standard deviation for different sample sizes, ranging from 0 to 600 in steps of one, I think. What you can see here is that when the sample size is small, I'm pretty far off-- I'm off by 14% here, and I think that's at a sample size of 25. But when the sample size is larger, say 600, I'm off by about 2%. So what we see, at least for this data set of temperatures, is that if the sample size is large enough, the sample standard deviation is a pretty good approximation of the population standard deviation.
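A sketch of that comparison, with the usual stand-in population; the percentages it prints won't exactly match the lecture's, which come from the real temperature data:

    import random
    import numpy as np

    population = [random.gauss(16.3, 9.4) for _ in range(422000)]
    pop_sd = np.std(population)

    for size in (25, 100, 300, 600):
        # Average, over 100 trials, how far the sample standard
        # deviation is from the population standard deviation.
        diffs = [abs(pop_sd - np.std(random.sample(population, size)))
                 for _ in range(100)]
        pct_off = 100 * (sum(diffs) / len(diffs)) / pop_sd
        print(size, round(pct_off, 1), '% off on average')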
Well, now we should ask the question, what good is this? As I said, once the sample reaches a reasonable size-- and we see here, reasonable is probably somewhere around 500-- it becomes a good approximation. But is it true only for this example? The fact that it happened to work for high temperatures in the US doesn't mean that it will always be true.

So there are at least two things we should consider in asking when this will be true and when it won't. One is, does the distribution of the population matter? Here we saw, in our very first plot, the distribution of the high temperatures. And it was kind of symmetric around a point-- not perfectly. But not everything looks that way, right? So we should say, well, suppose we have a different distribution. Would that change this conclusion? And the other thing we should ask is, suppose we had a different sized population. Suppose instead of roughly 400,000 temperatures I had 20 million temperatures. Would I need more than 600 samples for the two things to be about the same?

Well, let's explore both of those questions. First, let's look at the distributions. We'll look at three common distributions-- a uniform distribution, a normal distribution, and an exponential distribution. And we'll look at each of them for, what is this, 100,000 points.

So we know we can generate a uniform distribution by calling random.random, which gives me uniformly distributed real numbers between 0 and 1. We know that we can generate our normal distribution by calling random.gauss-- in this case with a mean of 0 and a standard deviation of 1, but as we saw in the last lecture, the shape will be the same independent of those values. And, finally, an exponential distribution, which we get by calling random.expovariate. This number, 0.5, is something called lambda, which has to do with how quickly the exponential decays. I'm not going to give you the formula for it at the moment. But we'll look at the pictures, and we'll plot discrete approximations to each of these distributions.

So here's what they look like. Quite different, right? We've looked at uniform and we've looked at Gaussian before. And here we see an exponential, which basically decays and asymptotes towards zero, never quite getting there. But as you can see, it is certainly not symmetric around the mean.
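For reference, a minimal sketch of generating and plotting those three distributions; the figure layout here is an assumption, not the lecture's code:

    import random
    import matplotlib.pyplot as plt

    n = 100000
    uniform = [random.random() for _ in range(n)]        # uniform on [0, 1)
    normal = [random.gauss(0, 1) for _ in range(n)]      # mean 0, std dev 1
    exponential = [random.expovariate(0.5) for _ in range(n)]  # lambda = 0.5

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, data, title in zip(axes,
                               (uniform, normal, exponential),
                               ('Uniform', 'Normal', 'Exponential')):
        ax.hist(data, bins=100)  # discrete approximation of the density
        ax.set_title(title)
    plt.show()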
All right, so let's see what happens. If we run the experiment on these three distributions, each with 100,000 points, and look at different sample sizes, we actually see that the gap between the sample standard deviation and the population standard deviation is not the same across distributions. Down here, the uniform and the normal look kind of like what we saw before. But the exponential one is really quite different. Its worst case is up here, around 25%, where the normal's is about 14%. That's not too surprising, since our temperatures were kind of normally distributed when we looked at them. And the uniform is a much better approximation, even initially.

The reason for this has to do with a fundamental difference in these distributions, something called skew. Skew is a measure of the asymmetry of a probability distribution. And what we can see here is that skew actually matters. The more skew you have, the more samples you're going to need to get a good approximation. So if the population is very skewed, very asymmetric in its distribution, you need a lot of samples to figure out what's going on. If it's very uniform, as in, for example, the uniform population, you need many fewer samples.

OK, so that's an important thing. When we go about deciding how many samples we need, we need to have some estimate of the skew in our population.
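The lecture doesn't show code for measuring skew, but as an aside, scipy.stats.skew is one standard way to compute it; a sketch, with approximate expected values in the comments:

    import random
    from scipy.stats import skew

    n = 100000
    print(skew([random.random() for _ in range(n)]))          # ~0, symmetric
    print(skew([random.gauss(0, 1) for _ in range(n)]))       # ~0, symmetric
    print(skew([random.expovariate(0.5) for _ in range(n)]))  # ~2, skewed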
All right, how about size? Does size matter? Shockingly-- at least it was shocking to me the first time I looked at this-- the answer is no. If we look at this-- and I'm showing just the uniform distribution, but we'd see the same thing for all three-- it more or less doesn't matter. Quite amazing, right? If you have a bigger population, you don't need more samples. And it's really almost counterintuitive to think that you don't need any more samples whether you have a million people or 100 million people. And that's why, when we look at, say, political polls, they're amazingly small. They poll 1,000 people and claim they're representative of Massachusetts.

This is good news. So to estimate the mean of a population, given a single sample, we choose a sample size based upon some estimate of the skew in the population. This is important, because if we get that wrong, we might choose a sample size that is too small. And in some sense, you always want to choose the smallest sample size that will give you an accurate answer, because it's more economical to take small samples than big ones. I've been talking about polls, but the same is true in an experiment: how many pieces of data do you need to collect when you run an experiment in a lab? How many will depend, again, on the skew of the data, and that will help you decide.

When you know the size, you choose a random sample from the population. Then you compute the mean and the standard deviation of that sample. And then you use the standard deviation of that sample to estimate the standard error. And I want to emphasize that what you're getting here is an estimate of the standard error, not the standard error itself, which would require you to know the population standard deviation. But if you've chosen the sample size appropriately, this will turn out to be a good estimate. And then, once we've done that, we use the estimated standard error to generate confidence intervals around the sample mean. And we're done.

Now, this works great when we choose independent random samples. As we've seen before, if you don't choose independent samples, it doesn't work so well. And, again, this is an issue where, if you assume that in an election each state is independent of every other state, you'll get the wrong answer, because they're not.

All right, let's go back to our temperature example and pose a simple question. Are 200 samples enough? I don't know why I chose 200. I did.
So we'll do an experiment here. This is similar to an experiment we saw on Monday. I start by initializing the number of mistakes I make. Then, for t in range(numTrials), sample is random.sample of the temperatures, with the given sample size.

This is a key step. The first time I did this, I messed it up. Instead of doing this very simple thing, I did a more complicated thing: I chose some point in my list of temperatures and took the next 200 temperatures. Why did that give me the wrong answer? Because the data is organized by city. So if I happened to choose the first day of Phoenix, all 200 temperatures were from Phoenix-- which is not a very good approximation of the temperature in the country as a whole. But this will work, because I'm using random.sample.

I'll then get the sample mean. Then I'll compute my estimate of the standard error, by taking that as seen here. And then, if the absolute value of the population mean minus the sample mean is more than 1.96 estimated standard errors, I'm going to say I messed up-- it's outside. And then at the end, I'm going to look at the fraction outside the 95% confidence intervals.

And what do I hope it should print? What would be the perfect answer when I run this? What fraction should lie outside? It's a pretty simple calculation. Five percent, right? Because if they all were inside, then I'm being too conservative in my interval. I want 5% of the tests to fall outside the 95% confidence interval. If I wanted fewer to fall outside, I would look at three standard deviations instead of 1.96; then I would expect less than 1% to fall outside.

So this is something we have to always keep in mind when we do this kind of thing. If your answer is too good, you've messed up. It shouldn't be too bad, but it shouldn't be too good, either. That's what probabilities are all about. If you called every election correctly, then your math is wrong.
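Putting that together, a sketch of the experiment-- again with a stand-in population where the lecture uses the real temperatures; it should print a fraction near 0.05:

    import math
    import random
    import numpy as np

    temps = [random.gauss(16.3, 9.4) for _ in range(422000)]  # stand-in
    pop_mean = np.mean(temps)

    num_bad, num_trials, sample_size = 0, 10000, 200
    for t in range(num_trials):
        sample = random.sample(temps, sample_size)  # simple random sample
        sample_mean = np.mean(sample)
        # Estimate the standard error from the sample's own std deviation.
        est_se = np.std(sample) / math.sqrt(sample_size)
        if abs(pop_mean - sample_mean) > 1.96 * est_se:
            num_bad += 1

    print('Fraction outside 95% confidence interval:',
          num_bad / num_trials)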
Well, when we run this, we get this lovely answer: the fraction outside the 95% confidence interval is 0.0511. That's close to exactly what you want-- almost exactly 5%. And if I run it multiple times, I get slightly different numbers, but they're all in that range, showing that, here, it really does work.

So that's what I wanted to say, and it's really important, this notion of the standard error. When I talk to other departments about what we should cover in 6.0002, about the only thing everybody agreed on was that we should talk about standard error. So now I hope I have made everyone happy. And we will talk about fitting curves to experimental data starting next week. All right, thanks a lot.