1 00:00:01,280 --> 00:00:03,320 Now let's work out an example that 2 00:00:03,320 --> 00:00:06,130 shows how to use the pairwise independent sampling 3 00:00:06,130 --> 00:00:10,020 theorem to actually do some sampling and estimation. 4 00:00:10,020 --> 00:00:14,160 So let's remember that our basic theorem says 5 00:00:14,160 --> 00:00:20,000 that if we have n independent random variables-- 6 00:00:20,000 --> 00:00:23,490 pairwise independent, with the same mean and variance-- 7 00:00:23,490 --> 00:00:26,320 and we look at their average, the probability 8 00:00:26,320 --> 00:00:29,210 that their average differs from the mean by more than a 9 00:00:29,210 --> 00:00:32,890 given tolerance delta is less than or equal to this formula 10 00:00:32,890 --> 00:00:34,270 here. 11 00:00:34,270 --> 00:00:39,110 Which is the standard deviation over delta squared times 1 12 00:00:39,110 --> 00:00:39,960 over n. 13 00:00:39,960 --> 00:00:42,520 Now we're just going to be plugging into this formula, 14 00:00:42,520 --> 00:00:44,270 but I want to state it here for the record 15 00:00:44,270 --> 00:00:46,686 to remember that this is the pairwise independent sampling 16 00:00:46,686 --> 00:00:47,890 theorem that we're given. 17 00:00:47,890 --> 00:00:50,320 Which is, what we've seen in general, 18 00:00:50,320 --> 00:00:54,190 allows us to calculate the degree of confidence 19 00:00:54,190 --> 00:00:56,990 we can have, the probability that we have given n 20 00:00:56,990 --> 00:01:02,460 or the n that we need given how confident we want to be. 21 00:01:02,460 --> 00:01:04,590 So let's go ahead and do the example. 22 00:01:04,590 --> 00:01:07,240 And I want to think about what is the possibility of swimming 23 00:01:07,240 --> 00:01:08,460 in the Charles. 24 00:01:08,460 --> 00:01:11,280 Now, Charles has a coliform count. 25 00:01:11,280 --> 00:01:14,850 Coliform are some rather undesirable bacteria 26 00:01:14,850 --> 00:01:18,140 that are associated with fecal matter. 27 00:01:18,140 --> 00:01:22,010 And we want to know whether it's safe to swim in the Charles. 28 00:01:22,010 --> 00:01:26,430 That's a Petri dish showing a kind of sample of bacteria 29 00:01:26,430 --> 00:01:29,060 that you might grow, cultured to see what's going on. 30 00:01:29,060 --> 00:01:30,670 The Environmental Protection Agency 31 00:01:30,670 --> 00:01:35,570 requires that the average CMD, the coliform microbial density, 32 00:01:35,570 --> 00:01:39,920 on the dish is less than 200. 33 00:01:39,920 --> 00:01:42,020 And what we want to do is figure out 34 00:01:42,020 --> 00:01:47,420 whether, when we do a sample of CMDs around the river, 35 00:01:47,420 --> 00:01:51,140 and we get some numbers out, whether, in fact, we 36 00:01:51,140 --> 00:01:55,800 will conclude that the average CMD is less than 200. 37 00:01:55,800 --> 00:01:57,880 We need to convince the EPA of that. 38 00:01:57,880 --> 00:01:59,864 Now, we're never going to be certain. 39 00:01:59,864 --> 00:02:01,280 But what we're going to do is take 40 00:02:01,280 --> 00:02:05,640 32 measurements at random times and locations around the river. 41 00:02:05,640 --> 00:02:09,800 And we're going to collect these 32 measurements of CMD. 42 00:02:09,800 --> 00:02:14,230 And it's going to turn out that, although a few of them 43 00:02:14,230 --> 00:02:18,290 are over 200, the average is well under 200. 44 00:02:18,290 --> 00:02:22,860 The average of the 32 samples that we've taken is 180. 45 00:02:22,860 --> 00:02:27,170 And our task now is to convince the Environmental Protection 46 00:02:27,170 --> 00:02:30,320 Agency that, on the basis of our data, 47 00:02:30,320 --> 00:02:33,741 that the average in the whole river is really less than 200. 48 00:02:33,741 --> 00:02:35,240 Even though where a couple of places 49 00:02:35,240 --> 00:02:38,440 it was over 100, but on average it was 180, 50 00:02:38,440 --> 00:02:42,370 can we convince the EPA that the actual average 51 00:02:42,370 --> 00:02:44,500 is less than 200? 52 00:02:44,500 --> 00:02:47,200 And so we're trying to convince them 53 00:02:47,200 --> 00:02:49,160 that our estimate based on the sample 54 00:02:49,160 --> 00:02:52,460 is within 20 of the actual average. 55 00:02:52,460 --> 00:02:57,530 We got 180, so if our estimate is within 20 of the truth, 56 00:02:57,530 --> 00:03:02,316 then in fact, the average is less than 200. 57 00:03:02,316 --> 00:03:03,690 Well how are we going to do that? 58 00:03:03,690 --> 00:03:05,106 Well, let's look at the parameters 59 00:03:05,106 --> 00:03:08,200 in the same pairwise independent sampling theorem 60 00:03:08,200 --> 00:03:09,400 and see what we have. 61 00:03:09,400 --> 00:03:12,780 So c is the actual average CMD in the river. 62 00:03:12,780 --> 00:03:14,210 That's what we don't know. 63 00:03:14,210 --> 00:03:17,290 We're trying to estimate it. 64 00:03:17,290 --> 00:03:21,260 So our samples correspond to a random variable. 65 00:03:21,260 --> 00:03:23,440 We're taking a measurement of the CMD 66 00:03:23,440 --> 00:03:25,620 at random time and place. 67 00:03:25,620 --> 00:03:29,410 And that defines a random variable whose expectation 68 00:03:29,410 --> 00:03:31,360 is the unknown city. 69 00:03:31,360 --> 00:03:33,600 So we've defined, by our sampling process, 70 00:03:33,600 --> 00:03:36,700 a random variable with mean u. 71 00:03:36,700 --> 00:03:39,190 In fact, we've done it with 32 variables. 72 00:03:39,190 --> 00:03:43,330 So n samples mean n mutually independent random variables. 73 00:03:43,330 --> 00:03:46,900 All with mean equal to the number 74 00:03:46,900 --> 00:03:48,820 that I'm trying to estimate. 75 00:03:48,820 --> 00:03:52,910 And A n is the average of the n CMD samples. 76 00:03:52,910 --> 00:03:57,640 So we have an A32 that we're trying to understand. 77 00:03:57,640 --> 00:03:59,980 So here's the independent sampling theorem formula. 78 00:03:59,980 --> 00:04:01,450 And let's see what I know already. 79 00:04:01,450 --> 00:04:03,350 I'm going to plug in the knowns. 80 00:04:03,350 --> 00:04:07,560 What I know is that n is 32, mu is the unknown, c, 81 00:04:07,560 --> 00:04:09,180 that we're trying to estimate. 82 00:04:09,180 --> 00:04:12,160 And the delta that matters to us is 20. 83 00:04:12,160 --> 00:04:16,419 Because we want to argue that if our average of 180, 84 00:04:16,419 --> 00:04:20,410 our measurement of 180, was within 20 of the truth, 85 00:04:20,410 --> 00:04:25,150 then in fact, we're under the 200 that the EPA specifies. 86 00:04:25,150 --> 00:04:31,225 So let's plug in our known parameters, 32 for n 87 00:04:31,225 --> 00:04:33,700 and 20 for the tolerance. 88 00:04:33,700 --> 00:04:35,720 And they plug in here. 89 00:04:35,720 --> 00:04:40,550 And that leaves me with the standard deviation, 90 00:04:40,550 --> 00:04:42,800 which the formula requires and I have to plug in. 91 00:04:42,800 --> 00:04:45,640 And that is a problem, because we don't know 92 00:04:45,640 --> 00:04:48,270 what the standard deviation is. 93 00:04:48,270 --> 00:04:50,520 Now, sometimes you can kind of argue 94 00:04:50,520 --> 00:04:53,650 that you can figure out what the standard deviation is 95 00:04:53,650 --> 00:04:57,060 because you have a theory of what the random distribution is 96 00:04:57,060 --> 00:04:58,080 of these measurements. 97 00:04:58,080 --> 00:05:02,110 And therefore, you can calculate what its standard deviation 98 00:05:02,110 --> 00:05:03,700 should be. 99 00:05:03,700 --> 00:05:05,450 Other times you can actually take 100 00:05:05,450 --> 00:05:08,720 a sample of the deviation of your sample 101 00:05:08,720 --> 00:05:11,440 and use that as an estimate of the sample 102 00:05:11,440 --> 00:05:13,470 of the actual standard deviation. 103 00:05:13,470 --> 00:05:16,490 But that's kind of circular, and we're not going to go there. 104 00:05:16,490 --> 00:05:18,200 But another way to do it is to say 105 00:05:18,200 --> 00:05:23,160 that if you had some bounds on the maximum possible 106 00:05:23,160 --> 00:05:24,950 discrepancy of your measurements, 107 00:05:24,950 --> 00:05:28,790 if you had done these kinds of sampling testings of CMDs 108 00:05:28,790 --> 00:05:32,020 for a long time and you had never observed two that 109 00:05:32,020 --> 00:05:35,860 were more than 50 apart, then what you could argue 110 00:05:35,860 --> 00:05:40,460 is that the range of measurements 111 00:05:40,460 --> 00:05:42,870 is going to be only 50. 112 00:05:42,870 --> 00:05:45,400 So what we can do is if we say that L 113 00:05:45,400 --> 00:05:48,210 is the maximum possible difference that we'll ever 114 00:05:48,210 --> 00:05:51,570 measure among samples, that in fact, what you can say 115 00:05:51,570 --> 00:05:56,910 is that the worst possible standard deviation when 116 00:05:56,910 --> 00:06:02,310 your random variable is ranging over an interval l is l over 2. 117 00:06:02,310 --> 00:06:05,190 And you can check that algebraically, 118 00:06:05,190 --> 00:06:07,120 but for now let's just take that as a fact. 119 00:06:07,120 --> 00:06:10,170 If you know that your measurements are 120 00:06:10,170 --> 00:06:12,960 going to differ by at most l between max and min, 121 00:06:12,960 --> 00:06:16,280 the standard deviation can't be more than l over 2. 122 00:06:16,280 --> 00:06:19,820 And if we know that l is 50, then I got a number finally 123 00:06:19,820 --> 00:06:20,770 to plug in. 124 00:06:20,770 --> 00:06:23,530 Because now I can plug in 25 for sigma. 125 00:06:23,530 --> 00:06:24,840 So let's do that. 126 00:06:24,840 --> 00:06:27,770 And when I do that, I come out with this calculation that 127 00:06:27,770 --> 00:06:32,070 says that the probability that my average, 128 00:06:32,070 --> 00:06:37,780 minus c, was greater than 20, that my A32, which 129 00:06:37,780 --> 00:06:42,500 we said was 180, was more than 20 away from the truth 130 00:06:42,500 --> 00:06:45,810 is less than 0.05. 131 00:06:45,810 --> 00:06:48,670 Or, flipping it around, the probability 132 00:06:48,670 --> 00:06:52,250 that my average is within 20 of the truth 133 00:06:52,250 --> 00:06:55,700 is greater than 0.095. 134 00:06:55,700 --> 00:06:59,350 And so we would like to be able to say, now, 135 00:06:59,350 --> 00:07:03,090 that the probability that the unknown c is 136 00:07:03,090 --> 00:07:06,410 the 180 that we measured for A32, plus or minus 20, 137 00:07:06,410 --> 00:07:07,970 is at least 95%. 138 00:07:07,970 --> 00:07:10,170 That seems to be what the theorem told us. 139 00:07:10,170 --> 00:07:10,850 Let's go back. 140 00:07:10,850 --> 00:07:13,580 The theorem says that the probability that A32, 141 00:07:13,580 --> 00:07:16,340 which we measured to be 180, minus c is less than 142 00:07:16,340 --> 00:07:19,930 or equal to 20, is greater than 0.95. 143 00:07:19,930 --> 00:07:23,250 So we should go back and tell the EPA 144 00:07:23,250 --> 00:07:27,510 that the probability is that c is less than 200 145 00:07:27,510 --> 00:07:30,030 with probability 0.95. 146 00:07:30,030 --> 00:07:32,070 And we'd be pretty tempted to say that. 147 00:07:32,070 --> 00:07:33,450 But it's not right. 148 00:07:33,450 --> 00:07:36,150 It's technically the wrong thing to say. 149 00:07:36,150 --> 00:07:37,550 And why is that? 150 00:07:37,550 --> 00:07:40,810 Well, it's an important idea, which 151 00:07:40,810 --> 00:07:43,470 is that we're talking about something 152 00:07:43,470 --> 00:07:45,280 other than probability here. 153 00:07:45,280 --> 00:07:47,690 We're talking about confidence, not probability. 154 00:07:47,690 --> 00:07:50,020 And let's explain that a little bit more. 155 00:07:50,020 --> 00:07:51,440 Here's the issue. 156 00:07:51,440 --> 00:07:54,550 The number c is a number in the real world. 157 00:07:54,550 --> 00:07:57,320 It's an actual physical quantity which is 158 00:07:57,320 --> 00:08:00,310 the average CMD in the river. 159 00:08:00,310 --> 00:08:01,590 We don't know what it is. 160 00:08:01,590 --> 00:08:04,230 But that does not make it a random variable. 161 00:08:04,230 --> 00:08:08,080 It is or it isn't within less than 200 or more than 200 162 00:08:08,080 --> 00:08:09,380 and so on. 163 00:08:09,380 --> 00:08:14,430 What's going on is that we have created a probabilistic model 164 00:08:14,430 --> 00:08:21,810 of sampling that is designed to have in our probabilistic model 165 00:08:21,810 --> 00:08:23,340 this unknown constant. 166 00:08:23,340 --> 00:08:26,560 There's nothing probabilistic about the constant. 167 00:08:26,560 --> 00:08:29,700 We've introduced the probability by thinking 168 00:08:29,700 --> 00:08:33,260 of our random sampling as random variables. 169 00:08:33,260 --> 00:08:34,740 We control the randomness. 170 00:08:34,740 --> 00:08:36,650 We can't say that c is random. 171 00:08:36,650 --> 00:08:38,090 Our measurements are random. 172 00:08:38,090 --> 00:08:42,620 So the right thing we can say is that the possible outcomes 173 00:08:42,620 --> 00:08:45,820 of our sampling process can persuasively be 174 00:08:45,820 --> 00:08:47,980 modeled as a random variable. 175 00:08:47,980 --> 00:08:49,730 So what we can say is that the probability 176 00:08:49,730 --> 00:08:52,130 that our sampling process will yield 177 00:08:52,130 --> 00:08:55,810 an average that's within 20 of the true average 178 00:08:55,810 --> 00:08:58,420 is at least 0.95. 179 00:08:58,420 --> 00:09:00,410 So that's a funny thing to say. 180 00:09:00,410 --> 00:09:03,700 What you do is you go tell the EPA that says, look. 181 00:09:03,700 --> 00:09:06,740 We don't know what the real average is. 182 00:09:06,740 --> 00:09:11,090 But we have a process that gets the right answer 95% 183 00:09:11,090 --> 00:09:13,770 of the time to within plus or minus 20. 184 00:09:13,770 --> 00:09:15,000 And we measured it. 185 00:09:15,000 --> 00:09:20,100 And our process that we right 95% of the time 186 00:09:20,100 --> 00:09:24,720 came in with an answer that said it's less than 200. 187 00:09:24,720 --> 00:09:25,320 OK? 188 00:09:25,320 --> 00:09:27,590 Now that's the right thing to say. 189 00:09:27,590 --> 00:09:28,640 That's the truth. 190 00:09:28,640 --> 00:09:30,620 We're making a probabilistic statement 191 00:09:30,620 --> 00:09:35,520 about the general properties of our sampling process, 192 00:09:35,520 --> 00:09:36,350 and saying, OK. 193 00:09:36,350 --> 00:09:38,570 Our sampling process is usually right. 194 00:09:38,570 --> 00:09:41,870 The sampling process said less than 200. 195 00:09:41,870 --> 00:09:43,602 So we think that's probably right. 196 00:09:43,602 --> 00:09:45,560 But we can't say it is right, and we can't even 197 00:09:45,560 --> 00:09:47,220 say it's right with any probability. 198 00:09:47,220 --> 00:09:51,420 It's just the way that our mostly reliable process 199 00:09:51,420 --> 00:09:53,500 yielded an answer. 200 00:09:53,500 --> 00:09:56,730 And that's an important idea to distinguish. 201 00:09:56,730 --> 00:10:05,200 So it's our estimate that that's correct with probability 0.95. 202 00:10:05,200 --> 00:10:08,590 Now this is a long thing to say to the EPA. 203 00:10:08,590 --> 00:10:10,950 And what we'd like to go back is language 204 00:10:10,950 --> 00:10:15,410 that says that we think that the real average, c, is 205 00:10:15,410 --> 00:10:20,650 within 20 of 180 is probably within 20 of 180. 206 00:10:20,650 --> 00:10:22,736 Because that's what our tests seem to say. 207 00:10:22,736 --> 00:10:24,860 But we're not allowed to talk about the probability 208 00:10:24,860 --> 00:10:26,660 that c has some value or other. 209 00:10:26,660 --> 00:10:28,430 So instead we summarize the story 210 00:10:28,430 --> 00:10:32,540 about how we measured c using a probabilistic process that's 211 00:10:32,540 --> 00:10:36,220 right 95% of the time by saying that c 212 00:10:36,220 --> 00:10:42,710 is 180 plus or minus 20 at the 95% confidence level. 213 00:10:42,710 --> 00:10:44,900 And that is that's a shorthand way 214 00:10:44,900 --> 00:10:47,900 of saying we've got this process that we believe 215 00:10:47,900 --> 00:10:50,220 in that measured this unknown quantity 216 00:10:50,220 --> 00:10:51,255 and told us what it was. 217 00:10:54,170 --> 00:10:57,540 So the moral here that we'll wrap up this little video with 218 00:10:57,540 --> 00:11:00,450 is that when you're told that some fact holds 219 00:11:00,450 --> 00:11:02,060 at a high confidence level because 220 00:11:02,060 --> 00:11:07,460 of some tester, or some random experiment, or some pollster, 221 00:11:07,460 --> 00:11:10,630 you have to remember that what that implies 222 00:11:10,630 --> 00:11:14,760 is that somebody designed a random experiment to try 223 00:11:14,760 --> 00:11:18,550 to get an estimate of reality. 224 00:11:18,550 --> 00:11:21,540 And you can always question whether you believe 225 00:11:21,540 --> 00:11:23,006 in that random experiment. 226 00:11:23,006 --> 00:11:24,630 It's important to understand that there 227 00:11:24,630 --> 00:11:26,740 is some random experiment back there. 228 00:11:26,740 --> 00:11:29,220 And you should be wondering about what is it? 229 00:11:29,220 --> 00:11:31,460 And do I believe in it? 230 00:11:31,460 --> 00:11:34,060 And even more important question to ask 231 00:11:34,060 --> 00:11:38,460 is why are you hearing about this particular experiment? 232 00:11:38,460 --> 00:11:41,880 How many other experiments were tried and not reported? 233 00:11:41,880 --> 00:11:44,960 The point is that people can perform 234 00:11:44,960 --> 00:11:50,060 various kinds of tests at the 95% or higher confidence level. 235 00:11:50,060 --> 00:11:53,500 But when the tests don't come up with an interesting result, 236 00:11:53,500 --> 00:11:56,100 they don't bother to publish them or announce them. 237 00:11:56,100 --> 00:11:59,120 And of course, when they come up with a surprising result which 238 00:11:59,120 --> 00:12:01,940 is going to be wrong one out of 20 times, 239 00:12:01,940 --> 00:12:03,750 those are the results that they publish 240 00:12:03,750 --> 00:12:05,680 and submit and advertise. 241 00:12:05,680 --> 00:12:07,720 Because they sound good. 242 00:12:07,720 --> 00:12:11,530 In fact, the major drug company Glaxo SmithKline, 243 00:12:11,530 --> 00:12:15,600 after paying $3 billion as a judgment against them 244 00:12:15,600 --> 00:12:19,200 in 2012 for suppressing the negative results 245 00:12:19,200 --> 00:12:21,380 of clinical trials, just agreed to now 246 00:12:21,380 --> 00:12:25,050 make public in February 2013 all of the clinical trials 247 00:12:25,050 --> 00:12:25,970 that they perform. 248 00:12:25,970 --> 00:12:27,700 So that you're not just learning about 249 00:12:27,700 --> 00:12:29,460 the cherry picked positive results, 250 00:12:29,460 --> 00:12:32,040 but about the negative ones as well. 251 00:12:32,040 --> 00:12:43,530 And in fact, [AUDIO OUT[ to get this point home, 252 00:12:43,530 --> 00:12:47,330 you might want to look at the cartoon at XKCD, 253 00:12:47,330 --> 00:12:50,060 which explains how it is that when there's a problem with 254 00:12:50,060 --> 00:12:55,350 green jelly beans at the 95% confidence level.