The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK, if you have not yet done it, please take a moment to go through the course evaluation website and enter your comments for the class.

So what we're going to do today to wrap things up is go through a tour of the world of hypothesis testing. We'll see a few examples of hypothesis tests, starting from simple ones, such as the setting that we discussed last time, in which you just have two hypotheses and you're trying to choose between them. But we'll also look at more complicated situations in which you have one basic hypothesis. Let's say that you have a fair coin and you want to test it against the hypothesis that your coin is not fair, but that alternative hypothesis is really lots of different hypotheses. So: is my coin fair? Is my die fair? Do I have the correct distribution for a random variable, and so on. And I'm going to end up with a few general comments about this whole business.

So the setting in simple hypothesis testing problems is the following: we have two possible models, and this is the classical world, so we do not have any prior probabilities on the two hypotheses. Usually we want to think of these hypotheses as not being completely symmetrical; rather, one is the default hypothesis, and usually it's referred to as the null hypothesis. And you want to check whether the null hypothesis is true, whether things are normal as you would have expected them to be, or whether it turns out to be false, in which case an alternative hypothesis would be correct.

So how does one go about it? No matter what approach you use, in the end you're going to end up doing the following.
You have the space of all possible observations that you may obtain. So when you do the experiment you're going to get an X vector, a vector of data, that's somewhere. And for some vectors you're going to decide that you accept H0, and for some vectors you reject H0 and accept H1. So what you will end up doing is that you're going to have some division of the space of all X's into two parts: one part is the rejection region, and one part is the acceptance region. If you fall in here you accept H0; if you fall here you reject H0.

So to design a hypothesis test, basically you need to come up with a division of your X space into two pieces. Figuring out how to do this involves two elements. One element is to decide: what kind of shape do I want for my dividing curve? And having chosen the shape of the dividing curve, where exactly do I put it? So if you were to cut this space using, let's say, a straight cut, you might put it here, or you might put it there, or you might put it there. Where exactly are you going to put it?

So let's look at those two steps. The first issue is to decide the general shape of your rejection region, which is the structure of your test. And the way this is done for the case of two hypotheses is by writing down the likelihood ratio between the two hypotheses. So let's call that quantity l of X. It's something that you can compute given the data that you have. A high value of l of X basically means that this probability here tends to be bigger than this probability. It means that the data that you have seen are quite likely to have occurred under H1, but less likely to have occurred under H0. So if you see data that are more plausible, can be better explained, under H1, then this ratio is big, and you're going to choose in favor of H1, or reject H0. That's what you do if you have discrete data: you use the PMFs. If you have densities, in the case of continuous data, again you consider the ratio of the two densities.
So a big l of X is evidence that your data are more compatible with H1 rather than H0. Once you accept this kind of structure, then your decision is really made in terms of that single number. That is, you had your data, which was some kind of vector, and you condense your data into a single number -- a statistic, as it's called -- in this case the likelihood ratio, and you put the dividing point somewhere here; call it xi. In this region you accept H1; in this region you accept H0.

So by committing ourselves to using the likelihood ratio in order to carry out the test, we have gone from this complicated picture of finding a dividing line in X-space to the simpler problem of just finding a dividing point on the real line.

OK, so how are we doing? What's left to do is to choose this threshold, xi -- or, as it's called, the critical value -- for making our decision. And you could place it anywhere, but one way of deciding where to place it is the following: look at the distribution of this random variable, l of X. It has a certain distribution under H0, and it has some other distribution under H1. If I put my threshold here, here's what's going to happen. When H0 is true, there is this much probability that I'm going to end up making an incorrect decision. If H0 is true there's still a probability that my likelihood ratio will be bigger than xi, and that's the probability of making an incorrect decision of this particular type, that is, of making a false rejection of H0.

Usually one sets this probability to a certain number, alpha -- for example, alpha being 5%. And once you decide that you want this to be 5%, that determines where this number xi is going to be. So the idea here is that I'm going to reject H0 if the data that I have seen are quite incompatible with H0, if they're quite unlikely to have occurred under H0. And I take this level, 5%. So I see my data, and then I say: well, if H0 were true, the probability that I would have seen data of this kind would be less than 5%.
Given that I saw those data, that suggests that H0 is not true, and I end up rejecting H0.

Now of course there's the other type of error probability. If I put my threshold here, and H1 is true but my likelihood ratio falls here, I'm going to make a mistake of the opposite kind. H1 is true, but my likelihood ratio turned out to be small, and I decided in favor of H0. This is an error of the other kind, and this probability of error we call beta. And you can see that there's a trade-off between alpha and beta. If you move your threshold this way, alpha becomes smaller, but beta becomes larger.

And the general picture of your trade-off, depending on where you put your threshold, is as follows. You can make this beta be 0 if you put your threshold out here, but in that case you are certain that you're going to make a mistake of the opposite kind. So beta equals 0, alpha equals 1 is one possibility. Beta equals 1, alpha equals 0 is the other possibility, if you send your threshold completely to the other side. And in general you're going to get a trade-off curve of some sort. And if you want to use a specific value of alpha, for example alpha being 0.05, then that's going to determine for you the probability beta.

Now there's a general, and quite important, theorem in statistics, which we are not proving, and which tells us that when we use likelihood ratio tests we get the best possible trade-off curve. You could think of other ways of making your decisions, other ways of cutting your X-space into a rejection and an acceptance region. But any other way that you do it is going to end up with some probabilities of error that are going to be above this particular curve. So the likelihood ratio test turns out to give you the best possible way of dealing with this trade-off between alpha and beta. We cannot minimize alpha and beta simultaneously; there's a trade-off between them.
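To make this trade-off concrete, here is a minimal numerical sketch (an illustration added here, not part of the lecture), assuming the simplest possible setting: a single observation X, with H0: X ~ Normal(0,1) and H1: X ~ Normal(1,1). In that case the likelihood ratio is monotonic in X, so the test reduces to rejecting H0 when X exceeds a threshold xi, and one can compute alpha and beta for each placement of xi:

```python
# Sweep the threshold xi and watch alpha fall while beta rises.
# Assumed setting: one observation, H0: N(0,1) vs H1: N(1,1),
# where the likelihood ratio test reduces to "reject H0 if X > xi".
import numpy as np
from scipy.stats import norm

for xi in np.linspace(-1.0, 2.0, 7):
    alpha = norm.sf(xi)          # P(X > xi | H0): false rejection probability
    beta = norm.cdf(xi - 1.0)    # P(X <= xi | H1): missed detection probability
    print(f"xi = {xi:5.2f}   alpha = {alpha:.3f}   beta = {beta:.3f}")
```

Sliding xi from one extreme to the other traces out exactly the trade-off curve just described.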
But at least we would like to have a test that deals with this trade-off in the best possible way: for a given value of alpha, we want to have the smallest possible value of beta. And the theorem says that likelihood ratio tests do have this optimality property: for a given value of alpha, they minimize the probability of error of the other kind.

So let's make all this concrete and look at a simple example. We have two normal distributions with different means. Under H0 you have a mean of 0; under H1 you have a mean of 1. You get your data -- you actually get several data points drawn from one of the two distributions -- and you want to make a decision: which one of the two is true?

So what you do is you write down the likelihood ratio: the density for a vector of data if that vector was generated according to H0, which is this one, and the density if it was generated according to H1. Since we have multiple data points, the density of a vector is the product of the densities of the individual elements. Since we're dealing with normals we have those exponential factors, and a product of exponentials gives us an exponential of the sum. I'll spare you the details, but this is the form of the likelihood ratio. The likelihood ratio test tells us that we should calculate this quantity after we get our data, and compare it with a threshold.

Now you can do some algebra here and simplify. By tracing down the inequalities, taking logarithms of both sides, and so on, one comes to the conclusion that using a test that puts a threshold on this ratio is equivalent to calculating this quantity and comparing it with a threshold. Basically, this quantity here is monotonic in that quantity: this being larger than the threshold is equivalent to this being larger than the threshold. So this tells us the general structure of the likelihood ratio test in this particular case.
And it's nice because it tells us that we can make our decisions by looking at this simple summary of the data. This quantity, this summary of the data on the basis of which we make our decision, is called a statistic. So you take your data, which is a multi-dimensional vector, you condense it to a single number, and then you make a decision on the basis of that number.

So this is the structure of the test. If I get a large sum of Xi's, this is evidence in favor of H1, because there the mean is larger. And so I'm going to decide in favor of H1, or reject H0, if the sum is bigger than the threshold. How do I choose my threshold? Well, I would like to choose my threshold so that the probability of an incorrect decision when H0 is true, the probability of a false rejection, equals a certain number alpha, such as, for example, 5%. So you're given here that this is 5%. You know the distribution of this random variable; it's normal. And you want to find the threshold value that makes this true. So this is a type of problem that you have seen several times: you go to the normal tables, and you figure it out. The sum of the Xi's has some distribution, it's normal, so that's the distribution of the sum of the Xi's, and you want this probability here to be alpha. For this to happen, what is the threshold value that makes this true? You know how to solve problems of this kind using the normal tables.
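As a sketch of that last step, here is how the normal-tables lookup for this two-means example might look in code. The sample size, the unit variances, and alpha = 5% are assumptions picked just for illustration:

```python
# Fix the critical value for the sum-of-observations statistic in the
# two-means example: H0 mean 0, H1 mean 1, unit variances, n observations.
import numpy as np
from scipy.stats import norm

n, alpha = 25, 0.05
# Under H0, S = X_1 + ... + X_n is normal with mean 0 and variance n,
# so P(S > xi | H0) = alpha pins down the threshold xi.
xi = np.sqrt(n) * norm.ppf(1 - alpha)
print(f"reject H0 when the sum of the X_i exceeds {xi:.2f}")

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=n)   # hypothetical data drawn under H0
print("reject H0" if x.sum() > xi else "do not reject H0")
```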
A slightly different example is one in which you have two normal distributions that have the same mean -- let's take it to be 0 -- but they have a different variance. So it's sort of natural that here, if the X's that you see are kind of big on either side, you would choose H1; if your X's are near 0, then that's evidence for the smaller variance, and you would choose H0.

To proceed formally, you again write down the form of the likelihood ratio. So again, the density of an X vector under H0 is this one. It's the product of the densities of each one of the Xi's. A product of normal densities gives you a product of exponentials, which is an exponential of the sum, and that's the expression that you get. Under the other hypothesis the only thing that changes is the variance. And the variance, in the normal distribution, shows up here in the denominator of the exponent, so you put it there.

So this is the general structure of the likelihood ratio test, and now you do some algebra. These terms are constants, so comparing this ratio to a constant is the same as just comparing the ratio of the exponentials to a constant. Then you take logarithms: you want to compare the logarithm of this thing to a constant. You do a little bit of algebra, and in the end you find that the structure of the test is to reject H0 if the sum of the squares of the Xi's is bigger than the threshold.

So by committing to a likelihood ratio test, you are told that you should be making your decision according to a rule of this type. This fixes the shape, or the structure, of the decision region, of the rejection region. And the only thing that's left, once more, is to pick this threshold in order to have the property that the probability of a false rejection is equal to, say, 5%. So that's the probability that H0 is true, but the sum of the squares accidentally happens to be bigger than my threshold, in which case I end up deciding H1.

How do I find the value of xi prime? Well, what I need to do is to look at a picture, more or less of this kind, but now I need to look at the distribution of the sum of the squares of the Xi's. Actually the sum of the squares of the Xi's is a non-negative random variable, so it's going to have a distribution that's something like this. I look at that distribution, and once more I want this tail probability to be alpha, and that determines where my threshold is going to be. So that's again a simple exercise, provided that you know the distribution of this quantity.
Do you know it? Well, we don't really know it; we have not dealt with this particular distribution in this class. But in principle you should be able to find what it is. It's a derived distribution problem. You know the distribution of Xi; it's normal. Therefore, by solving a derived distribution problem, you can find the distribution of Xi squared. And the Xi squareds are independent of each other, because the Xi's are independent. So you want to find the distribution of a sum of independent random variables with known distributions, and since they're independent, in principle you can do this using the convolution formula. So in principle, and if you're patient enough, you will be able to find the distribution of this random variable. And then you plot it or tabulate it, find where exactly the 95th percentile of that distribution is, and that determines your threshold.

This distribution actually turns out to have a nice and simple closed-form formula. Because this is a pretty common test, people have tabulated that distribution. It's called the chi-square distribution, and there are tables available for it. You look up in the tables, you find the 95th percentile of the distribution, and this way you determine your threshold.

So what's the moral of the story? The structure of the likelihood ratio test tells you what kind of decision region you're going to have. It tells you that for this particular test you should be using the sum of the squares of the Xi's as your statistic, as the basis for making your decision. And then you need to solve a derived distribution problem to find the probability distribution of your statistic -- find the distribution of this quantity under H0 -- and finally, based on that distribution, after you have derived it, determine your threshold.
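In code, the table lookup for this variance test might look as follows; the sample size, the H0 variance, and alpha are again assumptions made here for illustration:

```python
# Critical value for the variance test: reject H0 when sum(X_i^2) is large.
# Under H0 with X_i ~ N(0, sigma0^2) i.i.d., sum(X_i^2) / sigma0^2 is
# chi-square with n degrees of freedom, so we read off its 95th percentile.
from scipy.stats import chi2

n, alpha, sigma0_sq = 20, 0.05, 1.0
xi_prime = sigma0_sq * chi2.ppf(1 - alpha, df=n)
print(f"reject H0 when the sum of squares exceeds {xi_prime:.2f}")
```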
So now let's move on to a somewhat more complicated situation. You have a coin, and you are told: I tried to make a fair coin. Is it fair? So you have the hypothesis, which is the default -- the null hypothesis -- that the coin is fair. But maybe it isn't. So you have the alternative hypothesis that your coin is not fair.

Now, what's different in this context is that your alternative hypothesis is not just one specific hypothesis. Your alternative hypothesis consists of many alternatives. It includes the hypothesis that p is 0.6. It includes the hypothesis that p is 0.51. It includes the hypothesis that p is 0.48, and so on. So you're testing this hypothesis versus all this family of alternative hypotheses.

What you will end up doing is essentially the following. You get some data; that is, you flip the coin a number of times. Let's say you flip it 1,000 times and you observe some outcome; let's say you saw 472 heads. And you ask the question: if this hypothesis is true, is this value really plausible under that hypothesis? Or would it be very much of an outlier? If it looks like an extreme outlier under this hypothesis, then I reject it, and I accept the alternative. If this number turns out to be something within the range that you would have expected, then you keep, or accept, your null hypothesis.

OK, so what does it mean to be an outlier or not? First you take your data, and you condense them to a single number. Your detailed data actually would have been a sequence of heads/tails, heads/tails, and all that. Any reasonable person would tell you that you shouldn't really care about the exact sequence of heads and tails; let's just base our decision on the number of heads that we have observed. So using some kind of reasoning -- which could be mathematical, or intuitive, or involving artistry -- you pick a one-dimensional, or scalar, summary of the data that you have seen. In this case, the summary of the data is just the number of heads; that's a quite reasonable one. And so you commit yourself to make a decision on the basis of this quantity.
And you ask: the quantity that I'm seeing, does it look like an outlier? Or does it look more or less OK?

OK, what does it mean to be an outlier? You want to choose the shape of this rejection region, but on the basis of that single number s. And again, the reasonable thing to do in this context would be to argue as follows: if my coin is fair, I expect to see n over 2 heads; that's the expected value. If the number of heads I see is far from the expected number of heads, then I consider this to be an outlier. So if this distance is bigger than some threshold xi, I consider it to be an outlier, and then I'm going to reject my hypothesis.

So we picked our statistic. We picked the general form of how we're going to make our decision, and then we pick a certain significance, or confidence, level that we want -- again, this famous 5% number. And we're going to declare something to be an outlier if it lies in the region that has 5% or less probability of occurring. That is, I'm picking my rejection region so that if H0 is true -- under the default, or null, hypothesis -- there's only a 5% chance that by accident I fall there, and the thing makes me think that H1 is going to be true.

So now what's left to do is to pick the value of this threshold. This is a calculation of the usual kind. I want to pick my threshold, my number xi, so that the probability that s is further from the mean than an amount xi is less than 5%. Or, that the probability of being inside the acceptance region -- so that the distance from the default is less than my threshold -- I want that to be 95%. So this is an equality that you can get using the central limit theorem and the normal tables: there's 95% probability that the number of heads is going to be within 31 of the correct mean. The way the exercise is done, of course, is that we start with this number, 5%, which translates to this number, 95%.
And once we have fixed that number, then you ask the question: what number should we have here to make this equality true? It's again a problem of this kind. You have a quantity whose distribution you know. Why do you know it? The number of heads, by the central limit theorem, is approximately normal. So this here talks about the normal distribution. You set your alpha to be 5%, and you ask: where should I put my threshold so that this probability of being out there is only 5%?

Now, in our particular example the threshold turned out to be 31, and this number turned out to be just 28 away from the correct mean. So this distance was less than the threshold, and we end up not rejecting H0.
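Here is a short sketch of this whole coin test in code, using the same numbers as in the lecture (1,000 tosses, 472 heads, alpha = 5%); the normal approximation to the head count is the only ingredient:

```python
# Two-sided fairness test for a coin: reject H0 (p = 1/2) when the head
# count is more than xi away from n/2. Under H0, the CLT gives
# S approximately Normal(n/2, n/4).
import numpy as np
from scipy.stats import norm

n, alpha = 1000, 0.05
xi = np.sqrt(n / 4) * norm.ppf(1 - alpha / 2)   # about 31, as in the lecture
s = 472                                          # observed number of heads

print(f"threshold = {xi:.1f}, observed distance = {abs(s - n / 2):.0f}")
print("reject H0" if abs(s - n / 2) > xi else "do not reject H0")
# |472 - 500| = 28 < 31, so the fair-coin hypothesis is not rejected.
```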
So we have our rejection region. The way we designed it is that when H0 is true, there's only a small chance, 5%, that we get data out there -- data that we would call an outlier. If we see such an outlier, we reject H0. If what we see is not an outlier, as in this case, where that distance turned out to be kind of small, then we do not reject H0.

An interesting little piece of language here: people generally prefer to say that H0 is not rejected by the data, instead of saying that H0 is accepted. In some sense they're both saying the same thing, but the difference is sort of subtle. When I say "not rejected," what I mean is that I got some data that are compatible with my hypothesis. That is, the data that I got do not falsify the hypothesis that I had, my null hypothesis. So my null hypothesis is still alive, and may be true. But from data you can never really prove that the hypothesis is correct. Perhaps my coin is not fair in some other complicated way. Perhaps I was just lucky, and even though my coin is not fair, I ended up with an outcome that suggests that it's fair. Perhaps my coin flips are not independent, as I assumed in my model. So there are many ways that my null hypothesis could be wrong while I still got data telling me that my hypothesis is OK.

This is the general way that things work in science. One comes up with a model or a theory. This is the default theory, and we work with that theory, trying to find whether there are examples that violate it. If you find data and examples that violate the theory, your theory is falsified, and you need to look for a new one. But when you have your theory, really no amount of data can prove that your theory is correct. So we have the default theory that the speed of light is constant, as long as we do not find any data that runs counter to it. We stay with that theory, but there's no way of really proving it, no matter how many experiments we do. But there could be experiments that falsify that theory, in which case we need to look for a new one.

So there's a bit of an asymmetry here in how we treat the alternative hypothesis. H0 is the default, which we accept until we see some evidence to the contrary. And if we see some evidence to the contrary, we reject it. As long as we do not see evidence to the contrary, we keep working with it, but we always take it with a grain of salt. You can never really prove that a coin has a bias exactly equal to 1/2. Maybe the bias is equal to 0.50001, so the bias is not 1/2, but with an experiment with 1,000 coin tosses you wouldn't be able to see this effect.

OK, so that's how you go about testing whether your coin is fair. You can also think about testing whether a die is fair. For a die, the null hypothesis would be that every possible result, when you roll the die, has equal probability, equal to 1/6. And you also make the hypothesis that your die rolls are statistically independent of each other. So I take my die, I roll it a number of times, little n, and I count how many 1's I got, how many 2's I got, how many 3's I got, and these are my data.
I count how many times I observed each specific result i in my die rolls, and now I ask the question: the Ni's that I observed, are they compatible with my hypothesis or not? What does compatible with my hypothesis mean? Under the null hypothesis, Ni should be approximately equal, or is equal in expectation, to n times little pi. And in our example this little pi is of course 1/6. So if my die is fair, the number of 1's I expect to see is equal to the number of rolls times 1/6. The number of 2's I expect to see is again that same number. Of course there's randomness, so I do not expect to get exactly that number. But I can ask how far away from the expected values the Ni's were. If my capital Ni's turn out to be very different from n/6, this is evidence that my die is not fair. If those numbers turn out to be close to n times 1/6, then I'm going to say there's no evidence that would lead me to reject this hypothesis, so this hypothesis remains alive.

Now, someone has come up with the thought that maybe the right statistic to use, or the right way of quantifying how far away the Ni's are from their means, is to look at this quantity. So I'm looking at the expected value of Ni under the null hypothesis, seeing what I got, taking the square of the difference, and adding it over all i's -- but also throwing in these terms in the denominator. And why that term is there, that's a longer story. One can write down certain likelihood ratios, do certain Taylor series approximations, and there's a heuristic argument that justifies why this would be a good form for the test to use. So there's a certain art involved in this step, but some people somehow decided that a reasonable thing to do is the following: once you get your results, calculate this one-dimensional summary of your results -- this is going to be your statistic -- and compare that statistic to a threshold. That's how you make your decision.

So by this point we have fixed the type of the rejection region that we're going to have.
We've chosen the qualitative structure of our test, and the only thing that's now left is to choose the particular threshold we're going to use. And the recipe, once more, is the same. We want to set our threshold so that the probability of a false rejection is 5%; we want the probability that our data fall in here to be only 5% when the null hypothesis is true. That's the same as setting our threshold xi so that the probability that our test statistic is bigger than that threshold is only 0.05.

So to solve a problem of this kind, what is it that you need to do? You need to find the probability distribution of capital T. Once more it's the same picture. You need to do some calculations of some sort and come up with the distribution of the random variable T, where T is defined this way, and you want to find this distribution under hypothesis H0. Once you find what that distribution is, then you can solve the usual problem: I want this probability here to be 5%; what should my threshold be?

So what does this boil down to? Finding the distribution of capital T is, in some sense, a messy, difficult, derived distribution problem. From this model we know the distribution of the capital Ni's, and actually we can even write down the joint distribution of the capital Ni's. In fact, we can make an approximation here. Capital Ni is a binomial random variable -- say, the number of 1's that I got in little n rolls of my die. So that's a binomial random variable, and when little n is big, it is approximately normal. So we have normal, or approximately normal, random variables minus a constant; they're still approximately normal. We take the squares of these and scale them, so you can solve a derived distribution problem to find the distribution of this quantity. You can do more work, more derived distribution work, and find the distribution of capital T.

So this is a tedious matter, but because this test is used quite often, again people have done those calculations. They have found the distribution of capital T, and it's available in tables. You go to those tables, and you find the appropriate threshold for making a decision of this type.
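Here is a sketch of this die test end to end. The counts are made-up illustration data, and the tabulated distribution used for the threshold is the chi-square distribution with 6 - 1 = 5 degrees of freedom, which is the standard limiting distribution for this statistic (a fact the lecture leaves to the tables):

```python
# Pearson-style test for a fair die: T = sum_i (N_i - n p_i)^2 / (n p_i),
# compared against the tabulated 95th percentile of its distribution.
import numpy as np
from scipy.stats import chi2

counts = np.array([161, 172, 158, 179, 166, 164])   # hypothetical N_i's
n = counts.sum()
expected = n * np.full(6, 1 / 6)                     # n * p_i under H0

T = np.sum((counts - expected) ** 2 / expected)
threshold = chi2.ppf(0.95, df=5)                     # 6 categories, 5 dof
print(f"T = {T:.2f}, threshold = {threshold:.2f}")
print("reject H0" if T > threshold else "do not reject H0")
```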
Now, to give you a sense of how complicated the hypotheses one might have to deal with can get, let's make things one level more complicated. Here you can think of X as a discrete random variable; this is the outcome of my roll. And I had a model in which the possible values of my discrete random variable have probabilities all equal to 1/6. So my null hypothesis here was a particular PMF for the random variable capital X. Another way of phrasing what happened in this problem is the question: is my PMF correct? So this is the PMF of the result of one die roll, and you're asking the question: is my PMF correct?

Make it more complicated. How about a question of the type: is my PDF correct, when I have continuous data? So I have hypothesized that the probability distribution that I have is, let's say, a particular normal. I get lots of results from that random variable. Can I tell whether my results look normal or not? What are some ways of going about it?

Well, we saw in the previous slide that there is a methodology for deciding if your PMF is correct. So you could take your normal results, the data that you got from your experiment, and discretize them, so that now you're dealing with discrete data. And then you can use the previous methodology on the discretized data to address the question "is my PDF correct?"

So in practice the way this is done is that you get all your data -- let's say data points of this kind. You split your space into bins, and you count how many you have in each bin. So you get this, and that, and that, and nothing. That's a histogram that you get from the data that you have, like the very familiar histograms that you see after each one of our quizzes. So you look at this histogram, and you ask: does it look normal? OK, we need a systematic way of going about it. If it were normal, you could calculate the probability of falling in this interval, the probability of falling in that interval, the probability of falling into that interval. So you would have expected values for how many results, or data points, you would have in each interval. And you compare these expected values for each interval with the actual ones that you observed, then take the sum of squares, and so on, exactly as in the previous slide. And this gives you a way of going about it.
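A sketch of that discretization idea, reusing the statistic from the die test; the bin edges, the sample size, and the standard normal null are all choices made here for illustration:

```python
# Bin continuous data, compare observed bin counts with the counts expected
# under H0 (standard normal), and form the same sum-of-squares statistic.
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=500)        # stand-in for the observed data

cuts = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])             # interior bin edges
observed = np.bincount(np.digitize(data, cuts), minlength=len(cuts) + 1)
p = np.diff(norm.cdf(np.concatenate(([-np.inf], cuts, [np.inf]))))
expected = len(data) * p                      # expected count per bin under H0

T = np.sum((observed - expected) ** 2 / expected)
threshold = chi2.ppf(0.95, df=len(p) - 1)     # k bins, k - 1 degrees of freedom
print("reject H0" if T > threshold else "do not reject H0")
```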
This is a little messy. It gets hard to do, because you have the difficult decision of how to choose the bin size. If you take your bins to be very narrow, you would get lots of bins with 0's and a few bins that only have one outcome in them; that probably wouldn't feel right. If you choose your bins to be very wide, then you're losing a lot of information. Is there some way of making a test without creating bins?

This is just to illustrate the clever ideas that statisticians have thought about, and here's a really cute way of testing whether my distribution is correct or not. Here we were essentially plotting a PMF, or an approximation of a PDF, and asking: does it look like the PDF we assumed? Instead of working with PDFs, let's work with cumulative distribution functions.

So how does this go? The true normal distribution that I have hypothesized, the density that I'm hypothesizing -- my null hypothesis -- has a certain CDF that I can plot. So suppose that my hypothesis H0 is that the X's are standard normal, and I plot the CDF of the standard normal, which is the sort of continuous-looking curve here. Now I get my data, and I plot the empirical CDF. What's the empirical CDF?
In the empirical CDF, you ask the question: what fraction of the data fell below 0? You get a number. What fraction of my data fell below 1? I get a number. What fraction of my data fell below 2, and so on. So you're talking about the fractions of the data that fell below each particular number. And by plotting those fractions as a function of this number, you get something that looks like a CDF. It's the CDF suggested by the data.

Now, the fraction of the data that fall below 0 in my experiment is, if my hypothesis were true, expected to be 1/2; 1/2 is the value of the true CDF. I look at the fraction that I got; it's expected to be that number, but there's randomness, so it might be a little different from that. For any particular value -- say, the fraction of data that were below 2 -- its expectation is the probability of falling below 2, which is the correct CDF. So if my hypothesis is true, the empirical CDF that I get based on data should, when n is large, be very close to the true CDF.

So a way of judging whether my model is correct or not is to look at the assumed CDF, the CDF under hypothesis H0, look at the CDF that I constructed based on the data, and see whether they're close enough or not. And by close enough, I mean I'm going to look at all the possible X's and look at the maximum distance between those two curves. And I'm going to have a test that decides in favor of H0 if this distance is small, and in favor of H1 if this distance is large.

That still leaves me the problem of coming up with a threshold: where exactly do I put my threshold? Because this test is important enough, and is used frequently, people have made the effort to try to understand the probability distribution of this quite difficult random variable.
776 00:43:25,280 --> 00:43:28,220 One needs to do lots of approximations and clever 777 00:43:28,220 --> 00:43:32,550 calculations, but these have led to tabulated 778 00:43:32,550 --> 00:43:34,570 values for the probability distribution 779 00:43:34,570 --> 00:43:36,210 of this random variable. 780 00:43:36,210 --> 00:43:39,340 And, for example, those tabulated values tell us that 781 00:43:39,340 --> 00:43:45,030 if we want 5% false rejection probability, then our 782 00:43:45,030 --> 00:43:48,860 threshold should be 1.36 divided by the 783 00:43:48,860 --> 00:43:50,570 square root of n. 784 00:43:50,570 --> 00:43:53,870 So we know where to put our threshold for 785 00:43:53,870 --> 00:43:55,280 this particular test, 786 00:43:55,280 --> 00:43:59,680 if we want this particular error 787 00:43:59,680 --> 00:44:02,380 probability to occur. 788 00:44:02,380 --> 00:44:06,320 So that's about as hard and sophisticated as classical 789 00:44:06,320 --> 00:44:08,070 statistics gets. 790 00:44:08,070 --> 00:44:12,920 You want to have tests for hypotheses that are not so 791 00:44:12,920 --> 00:44:15,910 easy to handle. 792 00:44:15,910 --> 00:44:21,260 People somehow think of clever ways of doing 793 00:44:21,260 --> 00:44:22,500 tests of this kind. 794 00:44:22,500 --> 00:44:26,970 How to compare the theoretical predictions 795 00:44:26,970 --> 00:44:29,650 with the observed data. 796 00:44:29,650 --> 00:44:34,430 Come up with some measure of the difference between theory 797 00:44:34,430 --> 00:44:38,270 and data, and if that difference is big, then you 798 00:44:38,270 --> 00:44:39,520 reject your hypothesis. 799 00:44:39,520 --> 00:44:42,340 800 00:44:42,340 --> 00:44:45,640 OK, of course that's not the end of the field of 801 00:44:45,640 --> 00:44:49,000 statistics, there's a lot more. 802 00:44:49,000 --> 00:44:52,000 In some ways, as we kept moving through today's 803 00:44:52,000 --> 00:44:55,240 lecture, the way that we constructed those rejection 804 00:44:55,240 --> 00:44:57,680 regions was more and more ad hoc. 805 00:44:57,680 --> 00:45:02,220 I pulled out of a hat a particular measure of fit 806 00:45:02,220 --> 00:45:04,980 between data and the model. 807 00:45:04,980 --> 00:45:09,470 And I said let's just use a test based on this. 808 00:45:09,470 --> 00:45:13,890 There are attempts at more or less systematic ways of coming 809 00:45:13,890 --> 00:45:17,350 up with the general shape of rejection regions that have at 810 00:45:17,350 --> 00:45:20,540 least some desirable or favorable theoretical 811 00:45:20,540 --> 00:45:21,790 properties. 812 00:45:21,790 --> 00:45:24,620 813 00:45:24,620 --> 00:45:28,300 Some more specific problems that people study-- 814 00:45:28,300 --> 00:45:31,690 instead of having a test of, is this the correct PDF, 815 00:45:31,690 --> 00:45:33,140 yes or no, 816 00:45:33,140 --> 00:45:37,670 I just give you data, and I ask you: tell me, give me a 817 00:45:37,670 --> 00:45:41,270 model or a PDF for those data. 818 00:45:41,270 --> 00:45:45,000 OK, methods of this kind come in many types. 819 00:45:45,000 --> 00:45:50,640 One general method is you form a histogram, and then you take 820 00:45:50,640 --> 00:45:54,570 your histogram and plot a smooth line that kind of fits 821 00:45:54,570 --> 00:45:55,680 the histogram. 822 00:45:55,680 --> 00:45:59,140 This still leaves the question of how you choose the bins, 823 00:45:59,140 --> 00:46:00,780 the bin size in your histogram.
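The lecture doesn't name a specific smoothing method; kernel density estimation is one standard way of fitting such a smooth line, so here is a minimal sketch under that assumption. The kernel bandwidth plays the same role as the bin size, which is exactly the question taken up next.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=300)            # hypothetical data to be smoothed

# Gaussian kernel density estimate: a smooth curve in place of a histogram.
# scipy's default bandwidth rule (Scott's rule) stands in for the bin size.
kde = stats.gaussian_kde(x)

grid = np.linspace(-4.0, 4.0, 81)   # points at which to evaluate the estimate
density = kde(grid)                 # estimated PDF values along the grid
print(grid[np.argmax(density)])     # mode of the smoothed estimate, near 0
```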
824 00:46:00,780 --> 00:46:02,620 How narrow should the bins be? 825 00:46:02,620 --> 00:46:05,920 That depends on how many data points you have, and there's a 826 00:46:05,920 --> 00:46:09,190 lot of theory that tells you about the best way of choosing 827 00:46:09,190 --> 00:46:12,890 the bin sizes, and the best ways of smoothing the data 828 00:46:12,890 --> 00:46:14,640 that you have. 829 00:46:14,640 --> 00:46:18,090 A completely different topic is signal processing, where 830 00:46:18,090 --> 00:46:20,200 you want to do your inference. 831 00:46:20,200 --> 00:46:22,810 Not only do you want it to be good, but you also want it to 832 00:46:22,810 --> 00:46:25,520 be computationally fast. 833 00:46:25,520 --> 00:46:28,010 You get data in real time, lots of data. 834 00:46:28,010 --> 00:46:31,330 You want to keep processing and revising your estimates 835 00:46:31,330 --> 00:46:35,220 and your decisions as the data keep coming in. 836 00:46:35,220 --> 00:46:38,950 Another topic that was briefly touched upon in the last couple 837 00:46:38,950 --> 00:46:43,010 of lectures is that when you set up a model, like a linear 838 00:46:43,010 --> 00:46:46,540 regression model, you choose some explanatory variables, 839 00:46:46,540 --> 00:46:50,230 and you try to predict Y from those X variables. 840 00:46:50,230 --> 00:46:52,720 You have a choice of what to take as 841 00:46:52,720 --> 00:46:55,440 your explanatory variables. 842 00:46:55,440 --> 00:47:02,560 Are there systematic ways of picking the right X variables 843 00:47:02,560 --> 00:47:04,520 to try to estimate a Y? 844 00:47:04,520 --> 00:47:08,360 For example, should I try to estimate Y on the basis of X? 845 00:47:08,360 --> 00:47:10,320 Or on the basis of X-squared? 846 00:47:10,320 --> 00:47:12,960 How do I decide between the two? 847 00:47:12,960 --> 00:47:17,000 Finally, the rage these days has to do with anything big, 848 00:47:17,000 --> 00:47:18,490 high-dimensional. 849 00:47:18,490 --> 00:47:23,410 Complicated models of complicated things, and tons 850 00:47:23,410 --> 00:47:24,650 and tons of data. 851 00:47:24,650 --> 00:47:27,430 So these days data are generated everywhere. 852 00:47:27,430 --> 00:47:30,230 The amounts of data are humongous. 853 00:47:30,230 --> 00:47:33,120 Also, the problems that people are interested in tend to be 854 00:47:33,120 --> 00:47:35,500 very complicated with lots of parameters. 855 00:47:35,500 --> 00:47:39,800 So you need specially tailored methods that can give you good 856 00:47:39,800 --> 00:47:44,220 results, or at least decent results, even in the face of these huge 857 00:47:44,220 --> 00:47:47,290 amounts of data, and possibly with computational 858 00:47:47,290 --> 00:47:48,310 constraints. 859 00:47:48,310 --> 00:47:50,720 So with huge amounts of data you want methods that are 860 00:47:50,720 --> 00:47:56,460 simple, but can still deliver meaningful answers. 861 00:47:56,460 --> 00:48:00,170 Now as I mentioned some time ago, this whole field of 862 00:48:00,170 --> 00:48:03,960 statistics is very different from the field of probability. 863 00:48:03,960 --> 00:48:06,530 In some sense all that we're doing in statistics is 864 00:48:06,530 --> 00:48:08,100 probabilistic calculations. 865 00:48:08,100 --> 00:48:10,360 That's what the theory kind of does. 866 00:48:10,360 --> 00:48:12,870 But there's a big element of art. 867 00:48:12,870 --> 00:48:16,550 You saw that we chose the shape of some decision regions 868 00:48:16,550 --> 00:48:19,840 or rejection regions in a somewhat ad hoc way.
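As an aside on the X versus X-squared question raised a moment ago: one simple systematic recipe, which is my illustration rather than anything from the lecture, is to fit each candidate model on part of the data and keep whichever predicts the held-out rest better.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: Y actually depends on X squared, plus noise.
x = rng.uniform(-2.0, 2.0, size=200)
y = 1.0 + 0.5 * x**2 + rng.normal(scale=0.3, size=200)

# Split into a training half and a held-out validation half.
x_train, x_val = x[:100], x[100:]
y_train, y_val = y[:100], y[100:]

def validation_error(feature):
    """Least-squares fit of y ~ a + b*feature(x) on the training half,
    scored by mean squared error on the held-out half."""
    A_train = np.column_stack([np.ones(100), feature(x_train)])
    A_val = np.column_stack([np.ones(100), feature(x_val)])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return np.mean((A_val @ coef - y_val) ** 2)

# Compare the two candidate models: Y from X, versus Y from X-squared.
print(validation_error(lambda t: t))      # linear model
print(validation_error(lambda t: t**2))   # quadratic model: smaller error wins
```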
869 00:48:19,840 --> 00:48:21,660 There are even more basic things. 870 00:48:21,660 --> 00:48:23,260 How do you organize your data? 871 00:48:23,260 --> 00:48:26,690 How do you think about which hypotheses you would like to 872 00:48:26,690 --> 00:48:28,300 test, and so on. 873 00:48:28,300 --> 00:48:31,710 There's a lot of art that's involved here, and there's a 874 00:48:31,710 --> 00:48:33,510 lot that can go wrong. 875 00:48:33,510 --> 00:48:36,630 So I'm going to close with a note that you can take either 876 00:48:36,630 --> 00:48:39,050 as pessimistic or optimistic. 877 00:48:39,050 --> 00:48:42,880 There is a famous paper that came out a few years ago and 878 00:48:42,880 --> 00:48:46,440 has been cited about 1,000 times or so. 879 00:48:46,440 --> 00:48:50,110 And the title of the paper is Why Most Published Research 880 00:48:50,110 --> 00:48:51,850 Findings Are False. 881 00:48:51,850 --> 00:48:56,080 And it's actually a very good argument why, in fields like 882 00:48:56,080 --> 00:48:59,900 psychology or the medical sciences, a lot of 883 00:48:59,900 --> 00:49:01,160 what you see published-- 884 00:49:01,160 --> 00:49:03,410 that yes, this drug has an effect on 885 00:49:03,410 --> 00:49:05,000 that particular disease-- 886 00:49:05,000 --> 00:49:08,030 is actually false, because people do not do their 887 00:49:08,030 --> 00:49:09,780 statistics correctly. 888 00:49:09,780 --> 00:49:12,130 There are lots of biases in what people do. 889 00:49:12,130 --> 00:49:16,300 I mean, an obvious bias is that you only publish a result 890 00:49:16,300 --> 00:49:19,190 when you see something. 891 00:49:19,190 --> 00:49:22,770 So the null hypothesis is that the drug doesn't work. 892 00:49:22,770 --> 00:49:26,820 You do your tests, the drug didn't work, OK, you just go 893 00:49:26,820 --> 00:49:27,960 home and cry. 894 00:49:27,960 --> 00:49:33,380 But suppose that by accident that 5% event happens: even though the 895 00:49:33,380 --> 00:49:37,320 drug doesn't work, you got some outlier data, and it 896 00:49:37,320 --> 00:49:38,760 seemed to be working. 897 00:49:38,760 --> 00:49:40,990 Then you're excited, you publish it. 898 00:49:40,990 --> 00:49:42,760 So that's clearly a bias. 899 00:49:42,760 --> 00:49:46,980 That gets results published, even though they do 900 00:49:46,980 --> 00:49:50,330 not have a solid foundation behind them. 901 00:49:50,330 --> 00:49:53,050 Then there's another thing, OK? 902 00:49:53,050 --> 00:49:55,440 I'm picking my 5%. 903 00:49:55,440 --> 00:49:59,940 So if H0 is true there's a small probability that the data will 904 00:49:59,940 --> 00:50:04,160 look like an outlier, and in that case I 905 00:50:04,160 --> 00:50:06,270 publish my result. 906 00:50:06,270 --> 00:50:08,160 OK, it's only 5% -- 907 00:50:08,160 --> 00:50:10,300 it's not going to happen too often. 908 00:50:10,300 --> 00:50:15,200 But suppose that I go and do 1,000 different tests? 909 00:50:15,200 --> 00:50:18,540 Test H0 against this hypothesis, test H0 against 910 00:50:18,540 --> 00:50:22,000 that hypothesis, test H0 against that hypothesis. 911 00:50:22,000 --> 00:50:26,230 Some of these tests, just by accident, might turn out to be 912 00:50:26,230 --> 00:50:29,350 in favor of H1, and again these are 913 00:50:29,350 --> 00:50:31,170 selected to be published.
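A quick simulation makes that selection effect concrete. The setup here is my own: 1,000 independent studies, each a two-sided test at the 5% level, and H0 true in every single one, so each rejection is a false discovery.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_tests, n, alpha = 1000, 50, 0.05

false_positives = 0
for _ in range(n_tests):
    sample = rng.normal(size=n)       # H0 is true: the drug does nothing
    z = sample.mean() * np.sqrt(n)    # z-statistic for a standard normal sample
    p = 2 * stats.norm.sf(abs(z))     # two-sided p-value
    if p < alpha:
        false_positives += 1          # an "exciting" result that gets published

print(false_positives)                # around 0.05 * 1000 = 50 spurious findings
```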
914 00:50:31,170 --> 00:50:35,720 So if you do lots and lots of tests and in each one you have 915 00:50:35,720 --> 00:50:38,980 a 5% probability of error, when you consider the 916 00:50:38,980 --> 00:50:41,980 collection of all those tests, actually the probability of 917 00:50:41,980 --> 00:50:46,940 making incorrect inferences is a lot more than 5%. 918 00:50:46,940 --> 00:50:51,400 One basic principle in being systematic about such studies 919 00:50:51,400 --> 00:50:55,950 is that you should first pick the hypothesis that you're 920 00:50:55,950 --> 00:50:59,230 going to test, then get your data, and do 921 00:50:59,230 --> 00:51:00,880 your hypothesis testing. 922 00:51:00,880 --> 00:51:05,640 What would be wrong is to get your data, look at them, and 923 00:51:05,640 --> 00:51:08,890 say, OK, I'm now going to test for these 100 different 924 00:51:08,890 --> 00:51:13,060 hypotheses, and I'm going to choose my hypotheses to be about 925 00:51:13,060 --> 00:51:16,580 features that look abnormal in my data. 926 00:51:16,580 --> 00:51:19,520 Well, given enough data, you can always find some 927 00:51:19,520 --> 00:51:21,650 abnormalities just by chance. 928 00:51:21,650 --> 00:51:24,380 And if you choose to make a statistical test-- 929 00:51:24,380 --> 00:51:26,710 is this abnormality present?-- 930 00:51:26,710 --> 00:51:28,090 yes, it will be present. 931 00:51:28,090 --> 00:51:31,020 Because you first found the abnormality, and then you 932 00:51:31,020 --> 00:51:32,130 tested for it. 933 00:51:32,130 --> 00:51:35,210 So that's another way that things can go wrong. 934 00:51:35,210 --> 00:51:37,520 So the moral of this story is that the world of 935 00:51:37,520 --> 00:51:40,200 probability is really beautiful and solid: you have 936 00:51:40,200 --> 00:51:40,960 your axioms, and 937 00:51:40,960 --> 00:51:44,630 every question has a unique answer that by now all of you 938 00:51:44,630 --> 00:51:48,250 can find in a very reliable way. 939 00:51:48,250 --> 00:51:50,740 Statistics, on the other hand, is a dirty and difficult business. 940 00:51:50,740 --> 00:51:53,010 And that's why the subject is not over. 941 00:51:53,010 --> 00:51:55,430 And if you're interested in it, it's worth taking 942 00:51:55,430 --> 00:51:58,920 follow-on courses in that direction. 943 00:51:58,920 --> 00:52:03,950 OK, so good luck on the final, do well, and have a 944 00:52:03,950 --> 00:52:05,200 nice vacation afterwards. 945 00:52:05,200 --> 00:52:06,260