1 00:00:00,409 --> 00:00:02,450 PROFESSOR: One of the most important applications 2 00:00:02,450 --> 00:00:05,720 of conditional probability is in analyzing 3 00:00:05,720 --> 00:00:10,260 the results of diagnostic tests of uncertain reliability. 4 00:00:10,260 --> 00:00:14,520 So let's look at a fundamental example. 5 00:00:14,520 --> 00:00:19,082 Suppose that I have a diagnostic test for tuberculosis. 6 00:00:19,082 --> 00:00:20,540 It really sounds great because it's 7 00:00:20,540 --> 00:00:23,710 going to be 99% accurate-- in fact, more than 99% 8 00:00:23,710 --> 00:00:25,250 accurate, really, because here are 9 00:00:25,250 --> 00:00:27,050 the properties this test has. 10 00:00:27,050 --> 00:00:30,710 If you have TB, this test is guaranteed to detect it 11 00:00:30,710 --> 00:00:33,190 and say, yes, you have TB. 12 00:00:33,190 --> 00:00:36,660 If you don't have TB, 99% of the time, 13 00:00:36,660 --> 00:00:40,550 the test says correctly that you don't 14 00:00:40,550 --> 00:00:45,390 have TB, and 1% of the time, it gets it wrong. 15 00:00:45,390 --> 00:00:48,520 Now, suppose the doctor gives you the test and the test 16 00:00:48,520 --> 00:00:50,630 comes up saying that you have TB. 17 00:00:50,630 --> 00:00:54,110 That's kind of scary because TB is, in fact, 18 00:00:54,110 --> 00:00:55,180 quite a serious disease. 19 00:00:55,180 --> 00:00:56,680 It's getting worse because there are 20 00:00:56,680 --> 00:00:59,250 all of these antibiotic-resistant versions 21 00:00:59,250 --> 00:01:00,030 of TB 22 00:01:00,030 --> 00:01:04,110 now in Asia, where all the known antibiotics are not 23 00:01:04,110 --> 00:01:08,410 very effective-- if effective at all-- at curing it. 24 00:01:08,410 --> 00:01:12,370 And when this test that was 99% accurate 25 00:01:12,370 --> 00:01:16,090 says I have this disease, it sounds really worrisome. 26 00:01:16,090 --> 00:01:19,652 But in fact, we can ask more technically, 27 00:01:19,652 --> 00:01:20,860 should you really be worried?
28 00:01:20,860 --> 00:01:24,930 What is the probability, given that this apparently highly 29 00:01:24,930 --> 00:01:26,700 accurate test says you have TB-- 30 00:01:26,700 --> 00:01:28,800 what's the probability that you actually have TB? 31 00:01:28,800 --> 00:01:30,630 That's what we want to calculate. 32 00:01:30,630 --> 00:01:33,220 What's the probability that you have it? 33 00:01:33,220 --> 00:01:36,900 So in other words, I want the conditional probability 34 00:01:36,900 --> 00:01:39,825 that I have TB given that the test comes in positive. 35 00:01:39,825 --> 00:01:44,210 The test says, yes, you have TB. 36 00:01:44,210 --> 00:01:47,114 That "test positive" is a big phrase that I won't have room 37 00:01:47,114 --> 00:01:48,530 for on other slides, so let's just 38 00:01:48,530 --> 00:01:50,380 abbreviate it by plus. Plus means, 39 00:01:50,380 --> 00:01:56,470 in green, that the test said, yes, positive-- you have TB. 40 00:01:56,470 --> 00:01:58,790 OK, so that's the probability that we're 41 00:01:58,790 --> 00:02:01,090 trying to calculate, this conditional probability. 42 00:02:01,090 --> 00:02:02,440 What do we know about the test? 43 00:02:02,440 --> 00:02:04,030 Let's translate the information we 44 00:02:04,030 --> 00:02:05,950 have about the test into the language 45 00:02:05,950 --> 00:02:07,400 of conditional probability. 46 00:02:07,400 --> 00:02:10,210 And the first thing we said was that the test is guaranteed 47 00:02:10,210 --> 00:02:11,980 to get it right if you have TB. 48 00:02:11,980 --> 00:02:15,560 So given that you have TB, the probability 49 00:02:15,560 --> 00:02:18,380 that the test will say so-- it will return a positive result-- 50 00:02:18,380 --> 00:02:21,060 is 1. 51 00:02:21,060 --> 00:02:23,820 Given that you don't have TB, the probability 52 00:02:23,820 --> 00:02:27,370 that the test will say that you do have TB is only 1 in 100.
53 00:02:27,370 --> 00:02:29,920 Because 99% of the time, it correctly 54 00:02:29,920 --> 00:02:32,030 says you don't have TB. 55 00:02:32,030 --> 00:02:35,530 And 1% of the time, it says oops, you do have TB. 56 00:02:35,530 --> 00:02:37,800 So this is what's called a false positive rate. 57 00:02:37,800 --> 00:02:41,510 It's falsely claiming that you have TB when you really don't. 58 00:02:41,510 --> 00:02:46,430 And that rate, we're hypothesizing, is only 1%. 59 00:02:46,430 --> 00:02:48,150 Now, what we're trying to calculate, 60 00:02:48,150 --> 00:02:51,870 again, is the probability that you have TB 61 00:02:51,870 --> 00:02:56,494 given that the test came in positive and said you had TB. 62 00:02:56,494 --> 00:02:57,910 Well, let's look at the definition 63 00:02:57,910 --> 00:02:59,034 of conditional probability. 64 00:02:59,034 --> 00:03:02,260 The probability that you have TB given that the test came 65 00:03:02,260 --> 00:03:07,050 in positive, that said you do, is simply the probability 66 00:03:07,050 --> 00:03:09,220 that both the test comes in positive 67 00:03:09,220 --> 00:03:12,640 and you have TB divided by the probability 68 00:03:12,640 --> 00:03:15,330 that the test comes in positive. 69 00:03:15,330 --> 00:03:20,110 Well, using the definition of conditional probability again, 70 00:03:20,110 --> 00:03:25,580 this intersection, this AND of having TB and the test 71 00:03:25,580 --> 00:03:28,040 coming in positive, is simply the probability 72 00:03:28,040 --> 00:03:30,070 that the test comes in positive given 73 00:03:30,070 --> 00:03:34,680 that you have TB times the probability that you have TB. 74 00:03:34,680 --> 00:03:36,330 Now, this one we know. 75 00:03:36,330 --> 00:03:38,390 It's 1 because the test is perfect. 76 00:03:38,390 --> 00:03:39,980 If you have TB, the test is definitely 77 00:03:39,980 --> 00:03:41,200 going to say positive. 78 00:03:41,200 --> 00:03:43,600 So that lets me simplify things nicely. 
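The two steps just described can be written out symbolically. This is a sketch in notation assumed here rather than taken from the slides: TB stands for the event of having the disease and + for a positive test result.

```latex
\[
\Pr[\mathrm{TB} \mid +] \;=\; \frac{\Pr[\mathrm{TB} \cap +]}{\Pr[+]},
\qquad
\Pr[\mathrm{TB} \cap +] \;=\; \Pr[+ \mid \mathrm{TB}] \cdot \Pr[\mathrm{TB}] \;=\; 1 \cdot \Pr[\mathrm{TB}],
\]
\[
\text{so that}\qquad
\Pr[\mathrm{TB} \mid +] \;=\; \frac{\Pr[\mathrm{TB}]}{\Pr[+]}.
\]
```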
79 00:03:43,600 --> 00:03:46,220 What I've just figured out is the probability 80 00:03:46,220 --> 00:03:49,860 that you have TB given that the test says 81 00:03:49,860 --> 00:03:53,180 you do is simply the quotient of the probability 82 00:03:53,180 --> 00:03:57,570 that you have TB given no other information and the probability 83 00:03:57,570 --> 00:04:01,310 that the test comes in positive. 84 00:04:01,310 --> 00:04:03,490 Well, what is that probability that the test 85 00:04:03,490 --> 00:04:04,240 comes in positive? 86 00:04:04,240 --> 00:04:05,698 How are we going to calculate that? 87 00:04:05,698 --> 00:04:08,197 That's the key unknown here. 88 00:04:08,197 --> 00:04:10,030 And we're going to use the probability rule, 89 00:04:10,030 --> 00:04:11,730 the total probability rule. 90 00:04:11,730 --> 00:04:14,700 Total probability says that you do or you don't have TB. 91 00:04:14,700 --> 00:04:20,240 So that the way to calculate the probability that the test comes 92 00:04:20,240 --> 00:04:22,700 in positive is to look at the probability 93 00:04:22,700 --> 00:04:26,110 that the test comes in positive when you do and don't have TB. 94 00:04:26,110 --> 00:04:28,040 And we know those numbers. 95 00:04:28,040 --> 00:04:31,600 So let's look at the total probability formula. 96 00:04:31,600 --> 00:04:33,900 The probability that the test comes in positive 97 00:04:33,900 --> 00:04:36,710 is simply the probability that it comes in positive 98 00:04:36,710 --> 00:04:40,020 if you have TB times the probability you have TB, 99 00:04:40,020 --> 00:04:42,470 plus the probability it comes in positive 100 00:04:42,470 --> 00:04:45,540 given that you don't have TB times the probability you don't 101 00:04:45,540 --> 00:04:46,690 have TB. 102 00:04:46,690 --> 00:04:48,570 Well, we know a lot of these terms. 103 00:04:48,570 --> 00:04:50,005 Let's work them out. 
104 00:04:50,005 --> 00:04:53,350 Well, the probability the test comes in positive 105 00:04:53,350 --> 00:04:56,110 given that you have TB is 1. 106 00:04:56,110 --> 00:04:57,830 And the probability that the test 107 00:04:57,830 --> 00:05:01,059 comes in positive given that you don't have TB is 1/100. 108 00:05:01,059 --> 00:05:02,350 That's the false positive rate. 109 00:05:02,350 --> 00:05:03,720 We figured that already. 110 00:05:03,720 --> 00:05:06,120 What about the probability that you don't have TB? 111 00:05:06,120 --> 00:05:08,560 Well, that's simply 1 minus the probability 112 00:05:08,560 --> 00:05:10,560 that you do have TB. 113 00:05:10,560 --> 00:05:12,580 Now I have this nice arithmetic formula 114 00:05:12,580 --> 00:05:14,490 in the probability of TB. 115 00:05:14,490 --> 00:05:20,250 So I wind up with the probability 116 00:05:20,250 --> 00:05:25,450 of TB plus 1/100 minus 1/100 117 00:05:25,450 --> 00:05:26,750 of the probability of TB. 118 00:05:26,750 --> 00:05:30,270 It leaves me with 1/100 plus 99/100 of the probability of TB. 119 00:05:30,270 --> 00:05:32,360 So that's what this simplifies to. 120 00:05:32,360 --> 00:05:34,040 The probability that the test comes 121 00:05:34,040 --> 00:05:37,100 in positive given no other information 122 00:05:37,100 --> 00:05:41,410 is 99/100 of the probability that a person 123 00:05:41,410 --> 00:05:44,960 has TB plus 1/100. 124 00:05:44,960 --> 00:05:47,740 We'll come back to this formula. 125 00:05:47,740 --> 00:05:50,820 Well, we were working on the probability 126 00:05:50,820 --> 00:05:52,932 that you have TB given the test came in positive. 127 00:05:52,932 --> 00:05:54,640 We figured out that it was this quotient. 128 00:05:54,640 --> 00:05:56,910 And now, I know what the denominator is. 129 00:05:56,910 --> 00:06:00,850 The denominator is 99/100 times the probability 130 00:06:00,850 --> 00:06:02,070 of TB plus 1/100.
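The total probability calculation for the denominator can be laid out step by step. This is a sketch using the shorthand p for the probability of TB, which is an assumption of this note and not notation from the lecture.

```latex
\begin{align*}
\Pr[+] &= \Pr[+ \mid \mathrm{TB}]\,\Pr[\mathrm{TB}]
        + \Pr[+ \mid \overline{\mathrm{TB}}]\,\Pr[\overline{\mathrm{TB}}] \\
       &= 1 \cdot p + \tfrac{1}{100}\,(1 - p) \\
       &= p + \tfrac{1}{100} - \tfrac{1}{100}\,p \\
       &= \tfrac{99}{100}\,p + \tfrac{1}{100}.
\end{align*}
```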
131 00:06:02,070 --> 00:06:04,780 Multiply numerator and denominator through by 100, 132 00:06:04,780 --> 00:06:07,970 and you get that the probability that you have TB 133 00:06:07,970 --> 00:06:11,730 given that the test says you do is 100 times the probability 134 00:06:11,730 --> 00:06:16,450 that you have TB divided by 99 times the probability that you 135 00:06:16,450 --> 00:06:18,580 have TB plus 1. 136 00:06:18,580 --> 00:06:20,870 So let's hold that formula. 137 00:06:20,870 --> 00:06:24,320 Notice the key unknown here is the probability 138 00:06:24,320 --> 00:06:28,720 that you have TB independent of the test, the probability 139 00:06:28,720 --> 00:06:31,680 that a random person in the population has TB. 140 00:06:31,680 --> 00:06:34,400 If we can figure that out or if we can look that up, 141 00:06:34,400 --> 00:06:36,470 then we know what this unknown is, 142 00:06:36,470 --> 00:06:39,550 the probability you have TB given that the test says you do. 143 00:06:39,550 --> 00:06:45,000 Well, what is the probability that a random person has TB? 144 00:06:45,000 --> 00:06:49,840 Well, there were 11,000 cases of TB reported in 2011, according 145 00:06:49,840 --> 00:06:53,510 to the Centers for Disease Control in the United States. 146 00:06:53,510 --> 00:06:55,170 And you can assume that there's going 147 00:06:55,170 --> 00:06:58,200 to be a lot of unreported cases if there are 11,000 148 00:06:58,200 --> 00:06:59,940 reported ones, because a lot of people 149 00:06:59,940 --> 00:07:02,820 don't even know they have the disease. 150 00:07:02,820 --> 00:07:05,930 So let's estimate, on that basis given that the population 151 00:07:05,930 --> 00:07:13,310 of the US is around 350 million, that the probability of TB is 152 00:07:13,310 --> 00:07:16,000 about 1/10,000. 153 00:07:16,000 --> 00:07:18,490 Let's plug that into our formula.
154 00:07:18,490 --> 00:07:21,510 The probability that you have TB given the test is positive 155 00:07:21,510 --> 00:07:22,500 is this formula. 156 00:07:22,500 --> 00:07:25,510 When I plug in 1/10,000 for TB, 157 00:07:25,510 --> 00:07:31,760 I get 100/10,000 over 99/10,000 plus 1. 158 00:07:31,760 --> 00:07:35,200 Well now, I can see that the denominator is essentially 1. 159 00:07:35,200 --> 00:07:36,800 It's 1.01. 160 00:07:36,800 --> 00:07:38,370 And the numerator is 1/100. 161 00:07:38,370 --> 00:07:41,530 And this is basically about 1/100. 162 00:07:41,530 --> 00:07:46,370 In other words, it's not very likely that you have TB. 163 00:07:46,370 --> 00:07:49,720 Because of the relatively high false positive rate 164 00:07:49,720 --> 00:07:53,930 of 1%, that false positive rate 165 00:07:53,930 --> 00:08:01,180 washed out the actual number of TB cases, since the TB rate was 166 00:08:01,180 --> 00:08:07,590 only 0.01%, so that almost all of the reports of TB 167 00:08:07,590 --> 00:08:10,390 were caused by the high false positive rate. 168 00:08:10,390 --> 00:08:14,350 And that means that when you have a report that you've 169 00:08:14,350 --> 00:08:18,180 got TB, you still only have a 1% chance that you actually 170 00:08:18,180 --> 00:08:20,590 have TB. 171 00:08:20,590 --> 00:08:23,860 So the 99% accurate test was not very useful 172 00:08:23,860 --> 00:08:26,725 here for you to figure out what kind of action to take 173 00:08:26,725 --> 00:08:28,100 and what kind of medicine to take 174 00:08:28,100 --> 00:08:31,350 or treatment to take given that the test came in positive. 175 00:08:31,350 --> 00:08:33,310 With 1 in 100 chance, the odds are 176 00:08:33,310 --> 00:08:34,809 you won't do anything, in which case 177 00:08:34,809 --> 00:08:37,850 you can wonder why your doctor gave you the test. 178 00:08:37,850 --> 00:08:40,330 Well, the 99% test sounds good. 179 00:08:40,330 --> 00:08:41,620 We figured out that it isn't.
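The calculation just carried out can be checked numerically. The following is a minimal sketch, not anything shown in the lecture: the function name `posterior_given_positive` and its parameter names are hypothetical, and exact fractions are used so the result can be compared against the closed form 100p / (99p + 1).

```python
from fractions import Fraction


def posterior_given_positive(prior,
                             sensitivity=Fraction(1),
                             false_positive_rate=Fraction(1, 100)):
    """Return P(TB | +) from the prior P(TB) and the two test properties.

    sensitivity is P(+ | TB); false_positive_rate is P(+ | no TB).
    The names are illustrative assumptions, not the lecture's.
    """
    # Total probability: P(+) = P(+|TB) P(TB) + P(+|no TB) P(no TB)
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    # Definition of conditional probability: P(TB|+) = P(TB and +) / P(+)
    return sensitivity * prior / p_positive


prior = Fraction(1, 10_000)          # the lecture's estimate of P(TB)
posterior = posterior_given_positive(prior)
print(posterior, float(posterior))   # exact fraction and its decimal value
```

With the lecture's numbers the exact answer is 100/10099, just under 1 in 100, which matches the approximation made aloud.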
180 00:08:41,620 --> 00:08:46,330 And a hint about why 99% accurate isn't really so useful 181 00:08:46,330 --> 00:08:50,290 is that there's an obvious test that's 99.99% accurate. 182 00:08:50,290 --> 00:08:51,600 What's the test? 183 00:08:51,600 --> 00:08:52,990 Always say no. 184 00:08:52,990 --> 00:08:56,390 After all, the probability is only 1 in 10,000 185 00:08:56,390 --> 00:08:57,890 that you're going to be wrong. 186 00:08:57,890 --> 00:09:02,150 And that's the 99.99% rate. 187 00:09:02,150 --> 00:09:05,530 So it sounds as though this test is really worthless. 188 00:09:05,530 --> 00:09:06,600 But no, it's not. 189 00:09:06,600 --> 00:09:10,754 If you think about it a little bit, it will be useful. 190 00:09:10,754 --> 00:09:12,170 And I'll explain that in a minute. 191 00:09:12,170 --> 00:09:13,711 I forgot-- I'm getting ahead of myself. 192 00:09:13,711 --> 00:09:17,050 Because the basic formula that we used here was we 193 00:09:17,050 --> 00:09:20,380 figured out the probability of TB 194 00:09:20,380 --> 00:09:24,030 given that the test said you had TB in terms 195 00:09:24,030 --> 00:09:26,490 of the inverse probabilities which we knew-- that is, 196 00:09:26,490 --> 00:09:29,560 the probability that the test came in positive given 197 00:09:29,560 --> 00:09:31,710 that you had TB. 198 00:09:31,710 --> 00:09:34,085 This is an example of what's a famous rule in probability 199 00:09:34,085 --> 00:09:34,585 theory. 200 00:09:34,585 --> 00:09:36,740 It's called Bayes' rule, or Bayes' law. 201 00:09:36,740 --> 00:09:37,590 And this is it. 202 00:09:37,590 --> 00:09:40,270 It's just stated in terms of arbitrary events A and B. 203 00:09:40,270 --> 00:09:42,200 It expresses the probability of B 204 00:09:42,200 --> 00:09:44,920 given A in terms of the probability of A given 205 00:09:44,920 --> 00:09:48,967 B and the probabilities of A and B independently.
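The slide is not reproduced in the transcript, so for reference here is the standard statement of Bayes' rule for events A and B with nonzero probability of A. In the TB example, B plays the role of "has TB" and A the role of "test is positive".

```latex
\[
\Pr[B \mid A] \;=\; \frac{\Pr[A \mid B]\,\Pr[B]}{\Pr[A]}
\]
```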
206 00:09:48,967 --> 00:09:50,800 Now, I can actually never remember this law, 207 00:09:50,800 --> 00:09:52,350 but I re-derive it every time 208 00:09:52,350 --> 00:09:55,910 I need it, as we've done in the previous slides. 209 00:09:55,910 --> 00:09:57,760 It's really a quite straightforward law 210 00:09:57,760 --> 00:09:59,890 to derive and prove. 211 00:09:59,890 --> 00:10:02,784 But let's go back to this 99% accurate test that 212 00:10:02,784 --> 00:10:04,950 seemed worthless since there was a trivial test that 213 00:10:04,950 --> 00:10:07,820 was 99.99% accurate. 214 00:10:07,820 --> 00:10:10,400 But in fact, it's really helpful because it 215 00:10:10,400 --> 00:10:12,590 did increase the probability that you 216 00:10:12,590 --> 00:10:14,770 had TB by a factor of 100. 217 00:10:14,770 --> 00:10:17,429 Before you took the test and before you knew anything, 218 00:10:17,429 --> 00:10:18,970 you thought that your probability was 219 00:10:18,970 --> 00:10:21,806 the same as everybody else's-- about 1 in 10,000. 220 00:10:21,806 --> 00:10:23,180 Now the test says the probability 221 00:10:23,180 --> 00:10:25,566 that you have TB is 1 in 100. 222 00:10:25,566 --> 00:10:27,040 That's a hundred times larger. 223 00:10:27,040 --> 00:10:28,560 What's the value of that? 224 00:10:28,560 --> 00:10:32,990 Well, suppose you only had 5 million doses of medicine 225 00:10:32,990 --> 00:10:36,310 to treat this American population of 350 226 00:10:36,310 --> 00:10:37,990 million people. 227 00:10:37,990 --> 00:10:39,950 Who should you medicate? 228 00:10:39,950 --> 00:10:43,910 Well, if you medicated a random 5 million people 229 00:10:43,910 --> 00:10:46,230 out of 350 million, the likelihood 230 00:10:46,230 --> 00:10:49,840 that you're going to get very many of the real TB cases 231 00:10:49,840 --> 00:10:50,860 is small. 232 00:10:50,860 --> 00:10:53,440 It's only going to be about 1 in 30.
233 00:10:53,440 --> 00:10:56,050 You'll only get about 1/30 of the cases. 234 00:10:56,050 --> 00:11:00,710 But suppose you use your 5 million doses 235 00:11:00,710 --> 00:11:05,620 to medicate the 3.5 million people who would test positive 236 00:11:05,620 --> 00:11:10,660 under this 99% accurate test. 237 00:11:10,660 --> 00:11:12,760 When you test all 350 million people, 238 00:11:12,760 --> 00:11:15,560 you're going to get about 3.5 million who test positive. 239 00:11:15,560 --> 00:11:17,940 You have enough medication to treat all of them. 240 00:11:17,940 --> 00:11:19,470 And if you treat all of them, you're 241 00:11:19,470 --> 00:11:23,720 almost certain to get all of the actual TB cases, all 10,000 242 00:11:23,720 --> 00:11:24,730 of them. 243 00:11:24,730 --> 00:11:28,350 So the 99% accurate test does have an important use 244 00:11:28,350 --> 00:11:30,240 in this final setting, a lot more 245 00:11:30,240 --> 00:11:35,640 so than the 99.99% accurate test that simply always said 246 00:11:35,640 --> 00:11:38,530 no-- food for thought.
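The dose-allocation argument can be worked through with the lecture's own inputs: a population of 350 million, a TB probability of 1/10,000, a 1% false positive rate, and 5 million doses. This is a rough sketch; the function `screening_summary` and its field names are hypothetical, and the numbers are expected values, not data.

```python
from fractions import Fraction


def screening_summary(population, prior, false_positive_rate, doses):
    """Expected outcomes of testing everyone with a test that never misses TB."""
    true_cases = population * prior                       # expected people with TB
    false_positives = (population - true_cases) * false_positive_rate
    positives = true_cases + false_positives              # every true case tests positive
    return {
        "true_cases": true_cases,
        "expected_positives": positives,
        "doses_cover_positives": doses >= positives,
        # Fraction of cases reached if the doses go to random people instead
        "random_coverage": Fraction(doses, population),
    }


summary = screening_summary(
    population=350_000_000,
    prior=Fraction(1, 10_000),
    false_positive_rate=Fraction(1, 100),
    doses=5_000_000,
)
for name, value in summary.items():
    print(name, value)
```

With these inputs the arithmetic gives about 3.53 million expected positives, which the 5 million doses cover. It also gives roughly 35,000 expected cases and a random-treatment coverage of 1/70, somewhat different from the round figures quoted aloud, but the conclusion is the same: treating everyone who tests positive reaches essentially all cases, while treating at random reaches only a small fraction.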