The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK, if you have not yet done it, please take a moment to go through the course evaluation website and enter your comments for the class.

So what we're going to do today to wrap things up is go through a tour of the world of hypothesis testing. We'll see a few examples of hypothesis tests, starting from simple ones, such as the setting that we discussed last time, in which you just have two hypotheses and you're trying to choose between them. But we'll also look at more complicated situations in which you have one basic hypothesis. Let's say that you have a fair coin and you want to test it against the hypothesis that your coin is not fair, but that alternative hypothesis is really lots of different hypotheses. So: is my coin fair? Is my die fair? Do I have the correct distribution for a random variable, and so on. And I'm going to end up with a few general comments about this whole business.

So the setting in simple hypothesis testing problems is the following: we have two possible models, and this is the classical world, so we do not have any prior probabilities on the two hypotheses. Usually we want to think of these hypotheses as not being completely symmetrical; rather, one is the default hypothesis, and usually it's referred to as the null hypothesis. And you want to check whether the null hypothesis is true, whether things are normal as you would have expected them to be, or whether it turns out to be false, in which case an alternative hypothesis would be correct.

So how does one go about it? No matter what approach you use, in the end you're going to end up doing the following.
You have the space of all possible observations that you may obtain. So when you do the experiment you're going to get an X vector, a vector of data, that's somewhere. And for some vectors you're going to decide that you accept H0, and for some vectors you reject H0 and accept H1. So what you will end up doing is that you're going to have some division of the space of all X's into two parts: one part is the rejection region, and one part is the acceptance region. If you fall in here you accept H0; if you fall here you reject H0.

So to design a hypothesis test, basically you need to come up with a division of your X space into two pieces. Figuring out how to do this involves two elements. One element is to decide: what kind of shape do I want for my dividing curve? And having chosen the shape of the dividing curve, where exactly do I put it? So if you were to cut this space using, let's say, a straight cut, you might put it here, or you might put it there, or you might put it there. Where exactly are you going to put it?

So let's look at those two steps. The first issue is to decide the general shape of your rejection region, which is the structure of your test. And the way this is done for the case of two hypotheses is by writing down the likelihood ratio between the two hypotheses. So let's call that quantity l of X. It's something that you can compute given the data that you have. A high value of l of X basically means that this probability here tends to be bigger than this probability. It means that the data that you have seen are quite likely to have occurred under H1, but less likely to have occurred under H0. So if you see data that are more plausible, can be better explained, under H1, then this ratio is big, and you're going to choose in favor of H1, or reject H0. That's what you do if you have discrete data: you use the PMFs. If you have densities, in the case of continuous data, again you consider the ratio of the two densities.
So a big l of X is evidence that your data are more compatible with H1 rather than H0. Once you accept this kind of structure, then your decision is really made in terms of that single number. That is, you had your data, which was some kind of vector, and you condense your data into a single number -- a statistic, as it's called -- in this case the likelihood ratio, and you put the dividing point somewhere here; call it xi. In this region you accept H1; in this region you accept H0.

So by committing ourselves to using the likelihood ratio in order to carry out the test, we have gone from this complicated picture of finding a dividing line in X-space to the simpler problem of just finding a dividing point on the real line.

OK, so how are we doing? What's left to do is to choose this threshold, xi -- or, as it's called, the critical value -- for making our decision. And you could place it anywhere, but one way of deciding where to place it is the following: look at the distribution of this random variable, l of X. It has a certain distribution under H0, and it has some other distribution under H1. If I put my threshold here, here's what's going to happen. When H0 is true, there is this much probability that I'm going to end up making an incorrect decision. If H0 is true there's still a probability that my likelihood ratio will be bigger than xi, and that's the probability of making an incorrect decision of this particular type, that is, of making a false rejection of H0.

Usually one sets this probability to a certain number, alpha -- for example, alpha being 5%. And once you decide that you want this to be 5%, that determines where this number xi is going to be. So the idea here is that I'm going to reject H0 if the data that I have seen are quite incompatible with H0, if they're quite unlikely to have occurred under H0. And I take this level, 5%. So I see my data, and then I say: well, if H0 were true, the probability that I would have seen data of this kind would be less than 5%.
Given that I saw those data, that suggests that H0 is not true, and I end up rejecting H0.

Now of course there's the other type of error probability. If I put my threshold here, and H1 is true but my likelihood ratio falls here, I'm going to make a mistake of the opposite kind. H1 is true, but my likelihood ratio turned out to be small, and I decided in favor of H0. This is an error of the other kind, and this probability of error we call beta. And you can see that there's a trade-off between alpha and beta. If you move your threshold this way, alpha becomes smaller, but beta becomes larger.

And the general picture of your trade-off, depending on where you put your threshold, is as follows. You can make this beta be 0 if you put your threshold out here, but in that case you are certain that you're going to make a mistake of the opposite kind. So beta equals 0, alpha equals 1 is one possibility. Beta equals 1, alpha equals 0 is the other possibility, if you send your threshold completely to the other side. And in general you're going to get a trade-off curve of some sort. And if you want to use a specific value of alpha, for example alpha being 0.05, then that's going to determine for you the probability beta.

Now there's a general, and quite important, theorem in statistics, which we are not proving, and which tells us that when we use likelihood ratio tests we get the best possible trade-off curve. You could think of other ways of making your decisions, other ways of cutting your X-space into a rejection and an acceptance region. But any other way that you do it is going to end up with some probabilities of error that are going to be above this particular curve. So the likelihood ratio test turns out to give you the best possible way of dealing with this trade-off between alpha and beta. We cannot minimize alpha and beta simultaneously; there's a trade-off between them.
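To make this trade-off concrete, here is a minimal numerical sketch (an illustration added here, not part of the lecture), assuming the simplest possible setting: a single observation X, with H0: X ~ Normal(0,1) and H1: X ~ Normal(1,1). In that case the likelihood ratio is monotonic in X, so the test reduces to rejecting H0 when X exceeds a threshold xi, and one can compute alpha and beta for each placement of xi:

```python
# Sweep the threshold xi and watch alpha fall while beta rises.
# Assumed setting: one observation, H0: N(0,1) vs H1: N(1,1),
# where the likelihood ratio test reduces to "reject H0 if X > xi".
import numpy as np
from scipy.stats import norm

for xi in np.linspace(-1.0, 2.0, 7):
    alpha = norm.sf(xi)          # P(X > xi | H0): false rejection probability
    beta = norm.cdf(xi - 1.0)    # P(X <= xi | H1): missed detection probability
    print(f"xi = {xi:5.2f}   alpha = {alpha:.3f}   beta = {beta:.3f}")
```

Sliding xi from one extreme to the other traces out exactly the trade-off curve just described.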
But at least we would like to have a test that deals with this trade-off in the best possible way: for a given value of alpha, we want to have the smallest possible value of beta. And the theorem says that likelihood ratio tests do have this optimality property: for a given value of alpha, they minimize the probability of error of the other kind.

So let's make all this concrete and look at a simple example. We have two normal distributions with different means. Under H0 you have a mean of 0; under H1 you have a mean of 1. You get your data -- you actually get several data points drawn from one of the two distributions -- and you want to make a decision: which one of the two is true?

So what you do is you write down the likelihood ratio: the density for a vector of data if that vector was generated according to H0, which is this one, and the density if it was generated according to H1. Since we have multiple data points, the density of a vector is the product of the densities of the individual elements. Since we're dealing with normals we have those exponential factors, and a product of exponentials gives us an exponential of the sum. I'll spare you the details, but this is the form of the likelihood ratio. The likelihood ratio test tells us that we should calculate this quantity after we get our data, and compare it with a threshold.

Now you can do some algebra here and simplify. By tracing down the inequalities, taking logarithms of both sides, and so on, one comes to the conclusion that using a test that puts a threshold on this ratio is equivalent to calculating this quantity and comparing it with a threshold. Basically, this quantity here is monotonic in that quantity: this being larger than the threshold is equivalent to this being larger than the threshold. So this tells us the general structure of the likelihood ratio test in this particular case.
And it's nice because it tells us that we can make our decisions by looking at this simple summary of the data. This quantity, this summary of the data on the basis of which we make our decision, is called a statistic. So you take your data, which is a multi-dimensional vector, you condense it to a single number, and then you make a decision on the basis of that number.

So this is the structure of the test. If I get a large sum of Xi's, this is evidence in favor of H1, because there the mean is larger. And so I'm going to decide in favor of H1, or reject H0, if the sum is bigger than the threshold. How do I choose my threshold? Well, I would like to choose my threshold so that the probability of an incorrect decision when H0 is true, the probability of a false rejection, equals a certain number alpha, such as, for example, 5%. So you're given here that this is 5%. You know the distribution of this random variable; it's normal. And you want to find the threshold value that makes this true. So this is a type of problem that you have seen several times: you go to the normal tables, and you figure it out. The sum of the Xi's has some distribution, it's normal, so that's the distribution of the sum of the Xi's, and you want this probability here to be alpha. For this to happen, what is the threshold value that makes this true? You know how to solve problems of this kind using the normal tables.
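As a sketch of that last step, here is how the normal-tables lookup for this two-means example might look in code. The sample size, the unit variances, and alpha = 5% are assumptions picked just for illustration:

```python
# Fix the critical value for the sum-of-observations statistic in the
# two-means example: H0 mean 0, H1 mean 1, unit variances, n observations.
import numpy as np
from scipy.stats import norm

n, alpha = 25, 0.05
# Under H0, S = X_1 + ... + X_n is normal with mean 0 and variance n,
# so P(S > xi | H0) = alpha pins down the threshold xi.
xi = np.sqrt(n) * norm.ppf(1 - alpha)
print(f"reject H0 when the sum of the X_i exceeds {xi:.2f}")

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=n)   # hypothetical data drawn under H0
print("reject H0" if x.sum() > xi else "do not reject H0")
```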
A slightly different example is one in which you have two normal distributions that have the same mean -- let's take it to be 0 -- but they have a different variance. So it's sort of natural that here, if the X's that you see are kind of big on either side, you would choose H1; if your X's are near 0, then that's evidence for the smaller variance, and you would choose H0.

To proceed formally, you again write down the form of the likelihood ratio. So again, the density of an X vector under H0 is this one. It's the product of the densities of each one of the Xi's. A product of normal densities gives you a product of exponentials, which is an exponential of the sum, and that's the expression that you get. Under the other hypothesis the only thing that changes is the variance. And the variance, in the normal distribution, shows up here in the denominator of the exponent, so you put it there.

So this is the general structure of the likelihood ratio test, and now you do some algebra. These terms are constants, so comparing this ratio to a constant is the same as just comparing the ratio of the exponentials to a constant. Then you take logarithms: you want to compare the logarithm of this thing to a constant. You do a little bit of algebra, and in the end you find that the structure of the test is to reject H0 if the sum of the squares of the Xi's is bigger than the threshold.

So by committing to a likelihood ratio test, you are told that you should be making your decision according to a rule of this type. This fixes the shape, or the structure, of the decision region, of the rejection region. And the only thing that's left, once more, is to pick this threshold in order to have the property that the probability of a false rejection is equal to, say, 5%. So that's the probability that H0 is true, but the sum of the squares accidentally happens to be bigger than my threshold, in which case I end up deciding H1.

How do I find the value of xi prime? Well, what I need to do is to look at a picture, more or less of this kind, but now I need to look at the distribution of the sum of the squares of the Xi's. Actually the sum of the squares of the Xi's is a non-negative random variable, so it's going to have a distribution that's something like this. I look at that distribution, and once more I want this tail probability to be alpha, and that determines where my threshold is going to be. So that's again a simple exercise, provided that you know the distribution of this quantity.
Do you know it? Well, we don't really know it; we have not dealt with this particular distribution in this class. But in principle you should be able to find what it is. It's a derived distribution problem. You know the distribution of Xi; it's normal. Therefore, by solving a derived distribution problem, you can find the distribution of Xi squared. And the Xi squareds are independent of each other, because the Xi's are independent. So you want to find the distribution of a sum of independent random variables with known distributions, and since they're independent, in principle you can do this using the convolution formula. So in principle, and if you're patient enough, you will be able to find the distribution of this random variable. And then you plot it or tabulate it, find where exactly the 95th percentile of that distribution is, and that determines your threshold.

This distribution actually turns out to have a nice and simple closed-form formula. Because this is a pretty common test, people have tabulated that distribution. It's called the chi-square distribution, and there are tables available for it. You look up in the tables, you find the 95th percentile of the distribution, and this way you determine your threshold.

So what's the moral of the story? The structure of the likelihood ratio test tells you what kind of decision region you're going to have. It tells you that for this particular test you should be using the sum of the squares of the Xi's as your statistic, as the basis for making your decision. And then you need to solve a derived distribution problem to find the probability distribution of your statistic -- find the distribution of this quantity under H0 -- and finally, based on that distribution, after you have derived it, determine your threshold.
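In code, the table lookup for this variance test might look as follows; the sample size, the H0 variance, and alpha are again assumptions made here for illustration:

```python
# Critical value for the variance test: reject H0 when sum(X_i^2) is large.
# Under H0 with X_i ~ N(0, sigma0^2) i.i.d., sum(X_i^2) / sigma0^2 is
# chi-square with n degrees of freedom, so we read off its 95th percentile.
from scipy.stats import chi2

n, alpha, sigma0_sq = 20, 0.05, 1.0
xi_prime = sigma0_sq * chi2.ppf(1 - alpha, df=n)
print(f"reject H0 when the sum of squares exceeds {xi_prime:.2f}")
```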
So now let's move on to a somewhat more complicated situation. You have a coin, and you are told: I tried to make a fair coin. Is it fair? So you have the hypothesis, which is the default -- the null hypothesis -- that the coin is fair. But maybe it isn't. So you have the alternative hypothesis that your coin is not fair.

Now, what's different in this context is that your alternative hypothesis is not just one specific hypothesis. Your alternative hypothesis consists of many alternatives. It includes the hypothesis that p is 0.6. It includes the hypothesis that p is 0.51. It includes the hypothesis that p is 0.48, and so on. So you're testing this hypothesis versus all this family of alternative hypotheses.

What you will end up doing is essentially the following. You get some data; that is, you flip the coin a number of times. Let's say you flip it 1,000 times and you observe some outcome; let's say you saw 472 heads. And you ask the question: if this hypothesis is true, is this value really plausible under that hypothesis? Or would it be very much of an outlier? If it looks like an extreme outlier under this hypothesis, then I reject it, and I accept the alternative. If this number turns out to be something within the range that you would have expected, then you keep, or accept, your null hypothesis.

OK, so what does it mean to be an outlier or not? First you take your data, and you condense them to a single number. Your detailed data actually would have been a sequence of heads/tails, heads/tails, and all that. Any reasonable person would tell you that you shouldn't really care about the exact sequence of heads and tails; let's just base our decision on the number of heads that we have observed. So using some kind of reasoning -- which could be mathematical, or intuitive, or involving artistry -- you pick a one-dimensional, or scalar, summary of the data that you have seen. In this case, the summary of the data is just the number of heads; that's a quite reasonable one. And so you commit yourself to make a decision on the basis of this quantity.
And you ask: the quantity that I'm seeing, does it look like an outlier? Or does it look more or less OK?

OK, what does it mean to be an outlier? You want to choose the shape of this rejection region, but on the basis of that single number s. And again, the reasonable thing to do in this context would be to argue as follows: if my coin is fair, I expect to see n over 2 heads; that's the expected value. If the number of heads I see is far from the expected number of heads, then I consider this to be an outlier. So if this distance is bigger than some threshold xi, I consider it to be an outlier, and then I'm going to reject my hypothesis.

So we picked our statistic. We picked the general form of how we're going to make our decision, and then we pick a certain significance, or confidence, level that we want -- again, this famous 5% number. And we're going to declare something to be an outlier if it lies in the region that has 5% or less probability of occurring. That is, I'm picking my rejection region so that if H0 is true -- under the default, or null, hypothesis -- there's only a 5% chance that by accident I fall there, and the thing makes me think that H1 is going to be true.

So now what's left to do is to pick the value of this threshold. This is a calculation of the usual kind. I want to pick my threshold, my number xi, so that the probability that s is further from the mean than an amount xi is less than 5%. Or, that the probability of being inside the acceptance region -- so that the distance from the default is less than my threshold -- I want that to be 95%. So this is an equality that you can get using the central limit theorem and the normal tables: there's 95% probability that the number of heads is going to be within 31 of the correct mean. The way the exercise is done, of course, is that we start with this number, 5%, which translates to this number, 95%.
And once we have fixed that number, then you ask the question: what number should we have here to make this equality true? It's again a problem of this kind. You have a quantity whose distribution you know. Why do you know it? The number of heads, by the central limit theorem, is approximately normal. So this here talks about the normal distribution. You set your alpha to be 5%, and you ask: where should I put my threshold so that this probability of being out there is only 5%?

Now, in our particular example the threshold turned out to be 31, and this number turned out to be just 28 away from the correct mean. So this distance was less than the threshold, and we end up not rejecting H0.
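Here is a short sketch of this whole coin test in code, using the same numbers as in the lecture (1,000 tosses, 472 heads, alpha = 5%); the normal approximation to the head count is the only ingredient:

```python
# Two-sided fairness test for a coin: reject H0 (p = 1/2) when the head
# count is more than xi away from n/2. Under H0, the CLT gives
# S approximately Normal(n/2, n/4).
import numpy as np
from scipy.stats import norm

n, alpha = 1000, 0.05
xi = np.sqrt(n / 4) * norm.ppf(1 - alpha / 2)   # about 31, as in the lecture
s = 472                                          # observed number of heads

print(f"threshold = {xi:.1f}, observed distance = {abs(s - n / 2):.0f}")
print("reject H0" if abs(s - n / 2) > xi else "do not reject H0")
# |472 - 500| = 28 < 31, so the fair-coin hypothesis is not rejected.
```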
So we have our rejection region. The way we designed it is that when H0 is true, there's only a small chance, 5%, that we get data out there -- data that we would call an outlier. If we see such an outlier, we reject H0. If what we see is not an outlier, as in this case, where that distance turned out to be kind of small, then we do not reject H0.

An interesting little piece of language here: people generally prefer to say that H0 is not rejected by the data, instead of saying that H0 is accepted. In some sense they're both saying the same thing, but the difference is sort of subtle. When I say "not rejected," what I mean is that I got some data that are compatible with my hypothesis. That is, the data that I got do not falsify the hypothesis that I had, my null hypothesis. So my null hypothesis is still alive, and may be true. But from data you can never really prove that the hypothesis is correct. Perhaps my coin is not fair in some other complicated way. Perhaps I was just lucky, and even though my coin is not fair, I ended up with an outcome that suggests that it's fair. Perhaps my coin flips are not independent, as I assumed in my model. So there are many ways that my null hypothesis could be wrong while I still got data telling me that my hypothesis is OK.

This is the general way that things work in science. One comes up with a model or a theory. This is the default theory, and we work with that theory, trying to find whether there are examples that violate it. If you find data and examples that violate the theory, your theory is falsified, and you need to look for a new one. But when you have your theory, really no amount of data can prove that your theory is correct. So we have the default theory that the speed of light is constant, as long as we do not find any data that runs counter to it. We stay with that theory, but there's no way of really proving it, no matter how many experiments we do. But there could be experiments that falsify that theory, in which case we need to look for a new one.

So there's a bit of an asymmetry here in how we treat the alternative hypothesis. H0 is the default, which we accept until we see some evidence to the contrary. And if we see some evidence to the contrary, we reject it. As long as we do not see evidence to the contrary, we keep working with it, but we always take it with a grain of salt. You can never really prove that a coin has a bias exactly equal to 1/2. Maybe the bias is equal to 0.50001, so the bias is not 1/2, but with an experiment with 1,000 coin tosses you wouldn't be able to see this effect.

OK, so that's how you go about testing whether your coin is fair. You can also think about testing whether a die is fair. For a die, the null hypothesis would be that every possible result, when you roll the die, has equal probability, equal to 1/6. And you also make the hypothesis that your die rolls are statistically independent of each other. So I take my die, I roll it a number of times, little n, and I count how many 1's I got, how many 2's I got, how many 3's I got, and these are my data.
I count how many times I observed each specific result i in my die rolls, and now I ask the question: the Ni's that I observed, are they compatible with my hypothesis or not? What does compatible with my hypothesis mean? Under the null hypothesis, Ni should be approximately equal, or is equal in expectation, to n times little pi. And in our example this little pi is of course 1/6. So if my die is fair, the number of 1's I expect to see is equal to the number of rolls times 1/6. The number of 2's I expect to see is again that same number. Of course there's randomness, so I do not expect to get exactly that number. But I can ask how far away from the expected values the Ni's were. If my capital Ni's turn out to be very different from n/6, this is evidence that my die is not fair. If those numbers turn out to be close to n times 1/6, then I'm going to say there's no evidence that would lead me to reject this hypothesis, so this hypothesis remains alive.

Now, someone has come up with the thought that maybe the right statistic to use, or the right way of quantifying how far away the Ni's are from their means, is to look at this quantity. So I'm looking at the expected value of Ni under the null hypothesis, seeing what I got, taking the square of the difference, and adding it over all i's -- but also throwing in these terms in the denominator. And why that term is there, that's a longer story. One can write down certain likelihood ratios, do certain Taylor series approximations, and there's a heuristic argument that justifies why this would be a good form for the test to use. So there's a certain art involved in this step, but some people somehow decided that a reasonable thing to do is the following: once you get your results, calculate this one-dimensional summary of your results -- this is going to be your statistic -- and compare that statistic to a threshold. That's how you make your decision.

So by this point we have fixed the type of the rejection region that we're going to have.
We've chosen the qualitative structure of our test, and the only thing that's now left is to choose the particular threshold we're going to use. And the recipe, once more, is the same. We want to set our threshold so that the probability of a false rejection is 5%; we want the probability that our data fall in here to be only 5% when the null hypothesis is true. That's the same as setting our threshold xi so that the probability that our test statistic is bigger than that threshold is only 0.05.

So to solve a problem of this kind, what is it that you need to do? You need to find the probability distribution of capital T. Once more it's the same picture. You need to do some calculations of some sort and come up with the distribution of the random variable T, where T is defined this way, and you want to find this distribution under hypothesis H0. Once you find what that distribution is, then you can solve the usual problem: I want this probability here to be 5%; what should my threshold be?

So what does this boil down to? Finding the distribution of capital T is, in some sense, a messy, difficult, derived distribution problem. From this model we know the distribution of the capital Ni's, and actually we can even write down the joint distribution of the capital Ni's. In fact, we can make an approximation here. Capital Ni is a binomial random variable -- say, the number of 1's that I got in little n rolls of my die. So that's a binomial random variable, and when little n is big, it is approximately normal. So we have normal, or approximately normal, random variables minus a constant; they're still approximately normal. We take the squares of these and scale them, so you can solve a derived distribution problem to find the distribution of this quantity. You can do more work, more derived distribution work, and find the distribution of capital T.

So this is a tedious matter, but because this test is used quite often, again people have done those calculations. They have found the distribution of capital T, and it's available in tables. You go to those tables, and you find the appropriate threshold for making a decision of this type.
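Here is a sketch of this die test end to end. The counts are made-up illustration data, and the tabulated distribution used for the threshold is the chi-square distribution with 6 - 1 = 5 degrees of freedom, which is the standard limiting distribution for this statistic (a fact the lecture leaves to the tables):

```python
# Pearson-style test for a fair die: T = sum_i (N_i - n p_i)^2 / (n p_i),
# compared against the tabulated 95th percentile of its distribution.
import numpy as np
from scipy.stats import chi2

counts = np.array([161, 172, 158, 179, 166, 164])   # hypothetical N_i's
n = counts.sum()
expected = n * np.full(6, 1 / 6)                     # n * p_i under H0

T = np.sum((counts - expected) ** 2 / expected)
threshold = chi2.ppf(0.95, df=5)                     # 6 categories, 5 dof
print(f"T = {T:.2f}, threshold = {threshold:.2f}")
print("reject H0" if T > threshold else "do not reject H0")
```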
Now, to give you a sense of how complicated the hypotheses one might have to deal with can get, let's make things one level more complicated. Here you can think of X as a discrete random variable; this is the outcome of my roll. And I had a model in which the possible values of my discrete random variable have probabilities all equal to 1/6. So my null hypothesis here was a particular PMF for the random variable capital X. Another way of phrasing what happened in this problem is the question: is my PMF correct? So this is the PMF of the result of one die roll, and you're asking the question: is my PMF correct?

Make it more complicated. How about a question of the type: is my PDF correct, when I have continuous data? So I have hypothesized that the probability distribution that I have is, let's say, a particular normal. I get lots of results from that random variable. Can I tell whether my results look normal or not? What are some ways of going about it?

Well, we saw in the previous slide that there is a methodology for deciding if your PMF is correct. So you could take your normal results, the data that you got from your experiment, and discretize them, so that now you're dealing with discrete data. And then you can use the previous methodology on the discretized data to address the question "is my PDF correct?"

So in practice the way this is done is that you get all your data -- let's say data points of this kind. You split your space into bins, and you count how many you have in each bin. So you get this, and that, and that, and nothing. That's a histogram that you get from the data that you have, like the very familiar histograms that you see after each one of our quizzes. So you look at this histogram, and you ask: does it look normal? OK, we need a systematic way of going about it. If it were normal, you could calculate the probability of falling in this interval, the probability of falling in that interval, the probability of falling into that interval. So you would have expected values for how many results, or data points, you would have in each interval. And you compare these expected values for each interval with the actual ones that you observed, then take the sum of squares, and so on, exactly as in the previous slide. And this gives you a way of going about it.
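A sketch of that discretization idea, reusing the statistic from the die test; the bin edges, the sample size, and the standard normal null are all choices made here for illustration:

```python
# Bin continuous data, compare observed bin counts with the counts expected
# under H0 (standard normal), and form the same sum-of-squares statistic.
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=500)        # stand-in for the observed data

cuts = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])             # interior bin edges
observed = np.bincount(np.digitize(data, cuts), minlength=len(cuts) + 1)
p = np.diff(norm.cdf(np.concatenate(([-np.inf], cuts, [np.inf]))))
expected = len(data) * p                      # expected count per bin under H0

T = np.sum((observed - expected) ** 2 / expected)
threshold = chi2.ppf(0.95, df=len(p) - 1)     # k bins, k - 1 degrees of freedom
print("reject H0" if T > threshold else "do not reject H0")
```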
This is a little messy. It gets hard to do, because you have the difficult decision of how to choose the bin size. If you take your bins to be very narrow, you would get lots of bins with 0's and a few bins that only have one outcome in them; that probably wouldn't feel right. If you choose your bins to be very wide, then you're losing a lot of information. Is there some way of making a test without creating bins?

This is just to illustrate the clever ideas that statisticians have thought about, and here's a really cute way of testing whether my distribution is correct or not. Here we were essentially plotting a PMF, or an approximation of a PDF, and asking: does it look like the PDF we assumed? Instead of working with PDFs, let's work with cumulative distribution functions.

So how does this go? The true normal distribution that I have hypothesized, the density that I'm hypothesizing -- my null hypothesis -- has a certain CDF that I can plot. So suppose that my hypothesis H0 is that the X's are standard normal, and I plot the CDF of the standard normal, which is the sort of continuous-looking curve here. Now I get my data, and I plot the empirical CDF. What's the empirical CDF?
In the empirical CDF, you ask the question: what fraction of the data fell below 0? You get a number. What fraction of my data fell below 1? I get a number. What fraction of my data fell below 2, and so on. So you're talking about the fractions of the data that fell below each particular number. And by plotting those fractions as a function of this number, you get something that looks like a CDF. It's the CDF suggested by the data.

Now, the fraction of the data that fall below 0 in my experiment is, if my hypothesis were true, expected to be 1/2; 1/2 is the value of the true CDF. I look at the fraction that I got; it's expected to be that number, but there's randomness, so it might be a little different from that. For any particular value -- say, the fraction of data that were below 2 -- its expectation is the probability of falling below 2, which is the correct CDF. So if my hypothesis is true, the empirical CDF that I get based on data should, when n is large, be very close to the true CDF.

So a way of judging whether my model is correct or not is to look at the assumed CDF, the CDF under hypothesis H0, look at the CDF that I constructed based on the data, and see whether they're close enough or not. And by close enough, I mean I'm going to look at all the possible X's and look at the maximum distance between those two curves. And I'm going to have a test that decides in favor of H0 if this distance is small, and in favor of H1 if this distance is large.

That still leaves me the problem of coming up with a threshold: where exactly do I put my threshold? Because this test is important enough, and is used frequently, people have made the effort to try to understand the probability distribution of this quite difficult random variable.
776 00:43:25,280 --> 00:43:28,220 One needs to do lots of approximations and clever 777 00:43:28,220 --> 00:43:32,550 calculations, but these have led to tabulated 778 00:43:32,550 --> 00:43:34,570 values for the probability distribution 779 00:43:34,570 --> 00:43:36,210 of this random variable. 780 00:43:36,210 --> 00:43:39,340 And, for example, those tabulated values tell us that 781 00:43:39,340 --> 00:43:45,030 if we want 5% false rejection probability, then our 782 00:43:45,030 --> 00:43:48,860 threshold should be 1.36 divided by the 783 00:43:48,860 --> 00:43:50,570 square root of n. 784 00:43:50,570 --> 00:43:53,870 So we know where to put our threshold for 785 00:43:53,870 --> 00:43:55,280 this particular test, 786 00:43:55,280 --> 00:43:59,680 if we want this particular error 787 00:43:59,680 --> 00:44:02,380 probability to occur. 788 00:44:02,380 --> 00:44:06,320 So that's about as hard and sophisticated as classical 789 00:44:06,320 --> 00:44:08,070 statistics gets. 790 00:44:08,070 --> 00:44:12,920 You want to have tests for hypotheses that are not so 791 00:44:12,920 --> 00:44:15,910 easy to handle. 792 00:44:15,910 --> 00:44:21,260 People somehow think of clever ways of doing 793 00:44:21,260 --> 00:44:22,500 tests of this kind. 794 00:44:22,500 --> 00:44:26,970 How to compare the theoretical predictions 795 00:44:26,970 --> 00:44:29,650 with the observed data. 796 00:44:29,650 --> 00:44:34,430 Come up with some measure of the difference between theory 797 00:44:34,430 --> 00:44:38,270 and data, and if that difference is big, then you 798 00:44:38,270 --> 00:44:39,520 reject your hypothesis. 799 00:44:39,520 --> 00:44:42,340 800 00:44:42,340 --> 00:44:45,640 OK, of course that's not the end of the field of 801 00:44:45,640 --> 00:44:49,000 statistics, there's a lot more. 802 00:44:49,000 --> 00:44:52,000 In some ways, as we kept moving through today's 803 00:44:52,000 --> 00:44:55,240 lecture, the way that we constructed those rejection 804 00:44:55,240 --> 00:44:57,680 regions was more and more ad hoc. 805 00:44:57,680 --> 00:45:02,220 I pulled out of a hat a particular measure of fit 806 00:45:02,220 --> 00:45:04,980 between data and the model. 807 00:45:04,980 --> 00:45:09,470 And I said let's just use a test based on this. 808 00:45:09,470 --> 00:45:13,890 There are attempts at more or less systematic ways of coming 809 00:45:13,890 --> 00:45:17,350 up with the general shape of rejection regions that have at 810 00:45:17,350 --> 00:45:20,540 least some desirable or favorable theoretical 811 00:45:20,540 --> 00:45:21,790 properties. 812 00:45:21,790 --> 00:45:24,620 813 00:45:24,620 --> 00:45:28,300 Some more specific problems that people study-- 814 00:45:28,300 --> 00:45:31,690 instead of having a test of, is this the correct PDF, 815 00:45:31,690 --> 00:45:33,140 yes or no, 816 00:45:33,140 --> 00:45:37,670 I just give you data, and I ask you: tell me, give me a 817 00:45:37,670 --> 00:45:41,270 model or a PDF for those data. 818 00:45:41,270 --> 00:45:45,000 OK, methods of this kind come in many types. 819 00:45:45,000 --> 00:45:50,640 One general method is you form a histogram, and then you take 820 00:45:50,640 --> 00:45:54,570 your histogram and plot a smooth line that kind of fits 821 00:45:54,570 --> 00:45:55,680 the histogram. 822 00:45:55,680 --> 00:45:59,140 This still leaves the question of how you choose the bins, 823 00:45:59,140 --> 00:46:00,780 the bin size in your histogram.
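The lecture doesn't name a specific smoothing method; kernel density estimation is one standard way of fitting such a smooth line, so here is a minimal sketch under that assumption. The kernel bandwidth plays the same role as the bin size, which is exactly the question taken up next.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=300)            # hypothetical data to be smoothed

# Gaussian kernel density estimate: a smooth curve in place of a histogram.
# scipy's default bandwidth rule (Scott's rule) stands in for the bin size.
kde = stats.gaussian_kde(x)

grid = np.linspace(-4.0, 4.0, 81)   # points at which to evaluate the estimate
density = kde(grid)                 # estimated PDF values along the grid
print(grid[np.argmax(density)])     # mode of the smoothed estimate, near 0
```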
824 00:46:00,780 --> 00:46:02,620 How narrow should the bins be? 825 00:46:02,620 --> 00:46:05,920 That depends on how many data points you have, and there's a 826 00:46:05,920 --> 00:46:09,190 lot of theory that tells you about the best way of choosing 827 00:46:09,190 --> 00:46:12,890 the bin sizes, and the best ways of smoothing the data 828 00:46:12,890 --> 00:46:14,640 that you have. 829 00:46:14,640 --> 00:46:18,090 A completely different topic is signal processing, where 830 00:46:18,090 --> 00:46:20,200 you want to do your inference. 831 00:46:20,200 --> 00:46:22,810 Not only do you want it to be good, but you also want it to 832 00:46:22,810 --> 00:46:25,520 be computationally fast. 833 00:46:25,520 --> 00:46:28,010 You get data in real time, lots of data. 834 00:46:28,010 --> 00:46:31,330 You want to keep processing and revising your estimates 835 00:46:31,330 --> 00:46:35,220 and your decisions as the data keep coming in. 836 00:46:35,220 --> 00:46:38,950 Another topic that was briefly touched upon in the last couple 837 00:46:38,950 --> 00:46:43,010 of lectures is that when you set up a model, like a linear 838 00:46:43,010 --> 00:46:46,540 regression model, you choose some explanatory variables, 839 00:46:46,540 --> 00:46:50,230 and you try to predict Y from those X variables. 840 00:46:50,230 --> 00:46:52,720 You have a choice of what to take as 841 00:46:52,720 --> 00:46:55,440 your explanatory variables. 842 00:46:55,440 --> 00:47:02,560 Are there systematic ways of picking the right X variables 843 00:47:02,560 --> 00:47:04,520 to try to estimate a Y? 844 00:47:04,520 --> 00:47:08,360 For example, should I try to estimate Y on the basis of X? 845 00:47:08,360 --> 00:47:10,320 Or on the basis of X-squared? 846 00:47:10,320 --> 00:47:12,960 How do I decide between the two? 847 00:47:12,960 --> 00:47:17,000 Finally, the rage these days has to do with anything big, 848 00:47:17,000 --> 00:47:18,490 high-dimensional. 849 00:47:18,490 --> 00:47:23,410 Complicated models of complicated things, and tons 850 00:47:23,410 --> 00:47:24,650 and tons of data. 851 00:47:24,650 --> 00:47:27,430 So these days data are generated everywhere. 852 00:47:27,430 --> 00:47:30,230 The amounts of data are humongous. 853 00:47:30,230 --> 00:47:33,120 Also, the problems that people are interested in tend to be 854 00:47:33,120 --> 00:47:35,500 very complicated with lots of parameters. 855 00:47:35,500 --> 00:47:39,800 So you need specially tailored methods that can give you good 856 00:47:39,800 --> 00:47:44,220 results, or at least decent results, even in the face of these huge 857 00:47:44,220 --> 00:47:47,290 amounts of data, and possibly with computational 858 00:47:47,290 --> 00:47:48,310 constraints. 859 00:47:48,310 --> 00:47:50,720 So with huge amounts of data you want methods that are 860 00:47:50,720 --> 00:47:56,460 simple, but can still deliver meaningful answers. 861 00:47:56,460 --> 00:48:00,170 Now as I mentioned some time ago, this whole field of 862 00:48:00,170 --> 00:48:03,960 statistics is very different from the field of probability. 863 00:48:03,960 --> 00:48:06,530 In some sense all that we're doing in statistics is 864 00:48:06,530 --> 00:48:08,100 probabilistic calculations. 865 00:48:08,100 --> 00:48:10,360 That's what the theory kind of does. 866 00:48:10,360 --> 00:48:12,870 But there's a big element of art. 867 00:48:12,870 --> 00:48:16,550 You saw that we chose the shape of some decision regions 868 00:48:16,550 --> 00:48:19,840 or rejection regions in a somewhat ad hoc way.
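As an aside on the X versus X-squared question raised a moment ago: one simple systematic recipe, which is my illustration rather than anything from the lecture, is to fit each candidate model on part of the data and keep whichever predicts the held-out rest better.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: Y actually depends on X squared, plus noise.
x = rng.uniform(-2.0, 2.0, size=200)
y = 1.0 + 0.5 * x**2 + rng.normal(scale=0.3, size=200)

# Split into a training half and a held-out validation half.
x_train, x_val = x[:100], x[100:]
y_train, y_val = y[:100], y[100:]

def validation_error(feature):
    """Least-squares fit of y ~ a + b*feature(x) on the training half,
    scored by mean squared error on the held-out half."""
    A_train = np.column_stack([np.ones(100), feature(x_train)])
    A_val = np.column_stack([np.ones(100), feature(x_val)])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return np.mean((A_val @ coef - y_val) ** 2)

# Compare the two candidate models: Y from X, versus Y from X-squared.
print(validation_error(lambda t: t))      # linear model
print(validation_error(lambda t: t**2))   # quadratic model: smaller error wins
```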
869 00:48:19,840 --> 00:48:21,660 There are even more basic things. 870 00:48:21,660 --> 00:48:23,260 How do you organize your data? 871 00:48:23,260 --> 00:48:26,690 How do you think about which hypotheses you would like to 872 00:48:26,690 --> 00:48:28,300 test, and so on. 873 00:48:28,300 --> 00:48:31,710 There's a lot of art that's involved here, and there's a 874 00:48:31,710 --> 00:48:33,510 lot that can go wrong. 875 00:48:33,510 --> 00:48:36,630 So I'm going to close with a note that you can take either 876 00:48:36,630 --> 00:48:39,050 as pessimistic or optimistic. 877 00:48:39,050 --> 00:48:42,880 There is a famous paper that came out a few years ago and 878 00:48:42,880 --> 00:48:46,440 has been cited about 1,000 times or so. 879 00:48:46,440 --> 00:48:50,110 And the title of the paper is Why Most Published Research 880 00:48:50,110 --> 00:48:51,850 Findings Are False. 881 00:48:51,850 --> 00:48:56,080 And it's actually a very good argument why, in fields like 882 00:48:56,080 --> 00:48:59,900 psychology or the medical sciences, a lot of 883 00:48:59,900 --> 00:49:01,160 what you see published-- 884 00:49:01,160 --> 00:49:03,410 that yes, this drug has an effect on 885 00:49:03,410 --> 00:49:05,000 that particular disease-- 886 00:49:05,000 --> 00:49:08,030 is actually false, because people do not do their 887 00:49:08,030 --> 00:49:09,780 statistics correctly. 888 00:49:09,780 --> 00:49:12,130 There are lots of biases in what people do. 889 00:49:12,130 --> 00:49:16,300 I mean, an obvious bias is that you only publish a result 890 00:49:16,300 --> 00:49:19,190 when you see something. 891 00:49:19,190 --> 00:49:22,770 So the null hypothesis is that the drug doesn't work. 892 00:49:22,770 --> 00:49:26,820 You do your tests, the drug didn't work, OK, you just go 893 00:49:26,820 --> 00:49:27,960 home and cry. 894 00:49:27,960 --> 00:49:33,380 But suppose that by accident that 5% event happens: even though the 895 00:49:33,380 --> 00:49:37,320 drug doesn't work, you got some outlier data, and it 896 00:49:37,320 --> 00:49:38,760 seemed to be working. 897 00:49:38,760 --> 00:49:40,990 Then you're excited, you publish it. 898 00:49:40,990 --> 00:49:42,760 So that's clearly a bias. 899 00:49:42,760 --> 00:49:46,980 That gets results published, even though they do 900 00:49:46,980 --> 00:49:50,330 not have a solid foundation behind them. 901 00:49:50,330 --> 00:49:53,050 Then there's another thing, OK? 902 00:49:53,050 --> 00:49:55,440 I'm picking my 5%. 903 00:49:55,440 --> 00:49:59,940 So if H0 is true there's a small probability that the data will 904 00:49:59,940 --> 00:50:04,160 look like an outlier, and in that case I 905 00:50:04,160 --> 00:50:06,270 publish my result. 906 00:50:06,270 --> 00:50:08,160 OK, it's only 5% -- 907 00:50:08,160 --> 00:50:10,300 it's not going to happen too often. 908 00:50:10,300 --> 00:50:15,200 But suppose that I go and do 1,000 different tests? 909 00:50:15,200 --> 00:50:18,540 Test H0 against this hypothesis, test H0 against 910 00:50:18,540 --> 00:50:22,000 that hypothesis, test H0 against that hypothesis. 911 00:50:22,000 --> 00:50:26,230 Some of these tests, just by accident, might turn out to be 912 00:50:26,230 --> 00:50:29,350 in favor of H1, and again these are 913 00:50:29,350 --> 00:50:31,170 selected to be published.
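A quick simulation makes that selection effect concrete. The setup here is my own: 1,000 independent studies, each a two-sided test at the 5% level, and H0 true in every single one, so each rejection is a false discovery.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_tests, n, alpha = 1000, 50, 0.05

false_positives = 0
for _ in range(n_tests):
    sample = rng.normal(size=n)       # H0 is true: the drug does nothing
    z = sample.mean() * np.sqrt(n)    # z-statistic for a standard normal sample
    p = 2 * stats.norm.sf(abs(z))     # two-sided p-value
    if p < alpha:
        false_positives += 1          # an "exciting" result that gets published

print(false_positives)                # around 0.05 * 1000 = 50 spurious findings
```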
914 00:50:31,170 --> 00:50:35,720 So if you do lots and lots of tests and in each one you have 915 00:50:35,720 --> 00:50:38,980 a 5% probability of error, when you consider the 916 00:50:38,980 --> 00:50:41,980 collection of all those tests, actually the probability of 917 00:50:41,980 --> 00:50:46,940 making incorrect inferences is a lot more than 5%. 918 00:50:46,940 --> 00:50:51,400 One basic principle in being systematic about such studies 919 00:50:51,400 --> 00:50:55,950 is that you should first pick the hypothesis that you're 920 00:50:55,950 --> 00:50:59,230 going to test, then get your data, and do 921 00:50:59,230 --> 00:51:00,880 your hypothesis testing. 922 00:51:00,880 --> 00:51:05,640 What would be wrong is to get your data, look at them, and 923 00:51:05,640 --> 00:51:08,890 say, OK, I'm now going to test for these 100 different 924 00:51:08,890 --> 00:51:13,060 hypotheses, and I'm going to choose my hypotheses to be about 925 00:51:13,060 --> 00:51:16,580 features that look abnormal in my data. 926 00:51:16,580 --> 00:51:19,520 Well, given enough data, you can always find some 927 00:51:19,520 --> 00:51:21,650 abnormalities just by chance. 928 00:51:21,650 --> 00:51:24,380 And if you choose to make a statistical test-- 929 00:51:24,380 --> 00:51:26,710 is this abnormality present?-- 930 00:51:26,710 --> 00:51:28,090 yes, it will be present. 931 00:51:28,090 --> 00:51:31,020 Because you first found the abnormality, and then you 932 00:51:31,020 --> 00:51:32,130 tested for it. 933 00:51:32,130 --> 00:51:35,210 So that's another way that things can go wrong. 934 00:51:35,210 --> 00:51:37,520 So the moral of this story is that the world of 935 00:51:37,520 --> 00:51:40,200 probability is really beautiful and solid: you have 936 00:51:40,200 --> 00:51:40,960 your axioms, and 937 00:51:40,960 --> 00:51:44,630 every question has a unique answer that by now all of you 938 00:51:44,630 --> 00:51:48,250 can find in a very reliable way. 939 00:51:48,250 --> 00:51:50,740 Statistics, on the other hand, is a dirty and difficult business. 940 00:51:50,740 --> 00:51:53,010 And that's why the subject is not over. 941 00:51:53,010 --> 00:51:55,430 And if you're interested in it, it's worth taking 942 00:51:55,430 --> 00:51:58,920 follow-on courses in that direction. 943 00:51:58,920 --> 00:52:03,950 OK, so good luck on the final, do well, and have a 944 00:52:03,950 --> 00:52:05,200 nice vacation afterwards. 945 00:52:05,200 --> 00:52:06,260