1 00:00:02,530 --> 00:00:07,220 You've tested positive for a rare and deadly cancer that afflicts 1 out of 1000 people, 2 00:00:07,220 --> 00:00:11,980 based on a test that is 99% accurate. What are the chances that you actually have the 3 00:00:11,980 --> 00:00:17,560 cancer? By the end of this video, you'll be able to answer this question! 4 00:00:17,560 --> 00:00:22,510 This video is part of the Probability and Statistics video series. Many natural and 5 00:00:22,510 --> 00:00:28,260 social phenomena are probabilistic in nature. Engineers, scientists, and policymakers often 6 00:00:28,260 --> 00:00:31,930 use probability to model and predict system behavior. 7 00:00:31,930 --> 00:00:38,469 Hi, my name is Sam Watson, and I'm a graduate student in mathematics at MIT. 8 00:00:38,469 --> 00:00:43,679 Before watching this video, you should be familiar with basic probability vocabulary 9 00:00:43,679 --> 00:00:46,989 and the definition of conditional probability. 10 00:00:46,989 --> 00:00:51,219 After watching this video, you'll be able to: Calculate the conditional probability 11 00:00:51,219 --> 00:00:56,269 of a given event using tables and trees; and Understand how conditional probability can 12 00:00:56,269 --> 00:01:03,269 be used to interpret medical diagnoses. 13 00:01:04,479 --> 00:01:11,479 Suppose that in front of you are two bowls, labeled A and B. Each bowl contains five marbles. 14 00:01:11,600 --> 00:01:18,240 Bowl A has 1 blue and 4 yellow marbles. Bowl B has 3 blue and 2 yellow marbles. 15 00:01:18,240 --> 00:01:23,229 Now choose a bowl at random and draw a marble uniformly at random from it. Based on your 16 00:01:23,229 --> 00:01:28,200 existing knowledge of probability, how likely is it that you pick a blue marble? How about 17 00:01:28,200 --> 00:01:33,109 a yellow marble? 18 00:01:33,109 --> 00:01:40,109 Out of the 10 marbles you could choose from, 4 are blue. 
So the probability of choosing a blue 19 00:01:55,130 --> 00:01:58,020 marble is 4 out of 10. 20 00:01:58,020 --> 00:02:03,109 There are 6 yellow marbles out of 10 total, so the probability of choosing yellow is 6 21 00:02:03,109 --> 00:02:03,270 out of 10. 22 00:02:03,270 --> 00:02:04,109 When the number of possible outcomes is finite, and all outcomes are equally likely, the probability 23 00:02:04,109 --> 00:02:05,009 of an event happening is the number of favorable outcomes divided by the total number of possible 24 00:02:05,009 --> 00:02:05,070 outcomes. 25 00:02:05,070 --> 00:02:09,470 What if you must draw from Bowl A? What's the probability of drawing a blue marble, 26 00:02:09,470 --> 00:02:16,470 given that you draw from Bowl A? 27 00:02:18,239 --> 00:02:25,239 Let's go back to the table and consider only Bowl A. Bowl A contains 5 marbles of which 28 00:02:25,599 --> 00:02:31,300 1 is blue, so the probability of picking a blue one is 1 in 5. 29 00:02:31,300 --> 00:02:36,610 Notice the probability has changed. In the first scenario, the sample space consists 30 00:02:36,610 --> 00:02:42,329 of all 10 marbles, because we are free to draw from both bowls. 31 00:02:42,329 --> 00:02:48,220 In the second scenario, we are restricted to Bowl A. Our new sample space consists of 32 00:02:48,220 --> 00:02:55,220 only the five marbles in Bowl A. We ignore the marbles in Bowl B. 33 00:02:55,670 --> 00:03:00,909 Restricting our attention to a specific set of outcomes changes the sample space, and 34 00:03:00,909 --> 00:03:07,730 can also change the probability of an event. This new probability is what we call a conditional probability. 35 00:03:07,730 --> 00:03:08,370 In the previous example, we calculated the conditional probability of drawing a blue 36 00:03:08,370 --> 00:03:09,659 marble, given that we draw from Bowl A. 37 00:03:09,659 --> 00:03:14,510 This is standard notation for conditional probability. 
The vertical bar ( | ) is read 38 00:03:14,510 --> 00:03:21,510 as "given." The probability we are looking for precedes the bar, and the condition follows 39 00:03:25,099 --> 00:03:32,099 the bar. 40 00:03:32,299 --> 00:03:37,439 Now let's flip things around. Suppose someone picks a marble at random from either bowl 41 00:03:37,439 --> 00:03:43,939 A or bowl B and reveals to you that the marble drawn was blue. What is the probability that 42 00:03:43,939 --> 00:03:46,989 the blue marble came from Bowl A? 43 00:03:46,989 --> 00:03:52,159 In other words, what's the conditional probability that the marble was drawn from Bowl A, given 44 00:03:52,159 --> 00:03:59,159 that it is blue? Pause the video and try to work this out. 45 00:04:01,159 --> 00:04:06,400 Going back to the table, because we are dealing with the condition that the marble is blue, 46 00:04:06,400 --> 00:04:11,170 the sample space is restricted to the four blue marbles. 47 00:04:11,170 --> 00:04:17,290 Of these four blue marbles, one is in Bowl A, and each is equally likely to be drawn. 48 00:04:17,290 --> 00:04:22,320 Thus, the conditional probability is 1 out of 4. 49 00:04:22,320 --> 00:04:27,160 Notice that the probability of picking a blue marble given that the marble came from Bowl 50 00:04:27,160 --> 00:04:33,030 A is NOT equal to the probability that the marble came from Bowl A given that the marble 51 00:04:33,030 --> 00:04:40,030 was blue. Each has a different condition, so be careful not to mix them up! 52 00:04:42,919 --> 00:04:47,650 We've seen how tables can help us organize our data and visualize changes in the sample 53 00:04:47,650 --> 00:04:49,060 space. 54 00:04:49,060 --> 00:04:53,180 Let's look at another tool that is useful for understanding conditional probabilities 55 00:04:53,180 --> 00:04:55,870 - a tree diagram. 56 00:04:55,870 --> 00:05:02,870 Suppose we have a jar containing 5 marbles; 2 are blue and 3 are yellow. 
If we draw any 57 00:05:03,220 --> 00:05:08,280 one marble at random, the probability of drawing a blue marble is 2/5. 58 00:05:08,280 --> 00:05:14,280 Now, without replacing the first marble, draw a second marble from the jar. Given that the 59 00:05:14,280 --> 00:05:20,330 first marble is blue, is the probability of drawing a second blue marble still 2/5? 60 00:05:20,330 --> 00:05:27,250 NO, it isn't. Our sample space has changed. If a blue marble is drawn first, you are left 61 00:05:27,250 --> 00:05:31,569 with 4 marbles; 1 blue and 3 yellow. 62 00:05:31,569 --> 00:05:36,130 In other words, if a blue marble is selected first, the probability that you draw blue 63 00:05:36,130 --> 00:05:42,580 second is 1/4. And the probability you draw yellow second is 3/4. 64 00:05:42,580 --> 00:05:49,580 Now pause the video and determine the probabilities if the yellow marble is selected first instead. 65 00:05:54,659 --> 00:06:00,539 If a yellow marble is selected first, you are left with 2 yellow and 2 blue marbles. 66 00:06:00,539 --> 00:06:06,389 There is now a 2/4 chance of drawing a blue marble and a 2/4 chance of drawing a yellow 67 00:06:06,389 --> 00:06:08,060 marble. 68 00:06:08,060 --> 00:06:15,060 What we have drawn here is called a tree diagram. The probability assigned to the second branch 69 00:06:18,550 --> 00:06:21,660 denotes the conditional probability given that the first happened. 70 00:06:21,660 --> 00:06:23,139 Tree diagrams help us to visualize our sample space and reason out probabilities. 71 00:06:23,139 --> 00:06:27,500 We can answer questions like "What is the probability of drawing 2 blue marbles in a 72 00:06:27,500 --> 00:06:32,550 row?" In other words, what is the probability of drawing a blue marble first AND a blue 73 00:06:32,550 --> 00:06:34,479 marble second? 74 00:06:34,479 --> 00:06:39,880 This event is represented by these two branches in the tree diagram. 75 00:06:39,880 --> 00:06:46,880 We have a 2/5 chance followed by a 1/4 chance. 
We multiply these to get 2/20, or 1/10. The 76 00:06:48,849 --> 00:06:53,349 probability of drawing two blue marbles in a row is 1/10. 77 00:06:53,349 --> 00:06:58,729 Now you do it. Use the tree diagram to calculate the probabilities of the other possibilities: 78 00:06:58,729 --> 00:07:05,729 blue, yellow; yellow, blue; and yellow, yellow. 79 00:07:10,050 --> 00:07:16,750 The probabilities each work out to 3/10. The four probabilities add up to a total of 1, 80 00:07:16,750 --> 00:07:18,800 as they should. 81 00:07:18,800 --> 00:07:22,599 What if we don't care about the first marble? We just want to determine the probability 82 00:07:22,599 --> 00:07:26,330 that the second marble is yellow. 83 00:07:26,330 --> 00:07:30,699 Because it does not matter whether the first marble is blue or yellow, we consider both 84 00:07:30,699 --> 00:07:37,699 the blue, yellow, and the yellow, yellow paths. Adding the probabilities gives us 3/10 + 3/10, 85 00:07:38,099 --> 00:07:41,139 which works out to 3/5. 86 00:07:41,139 --> 00:07:45,190 Here's another interesting question. What is the probability that the first marble drawn 87 00:07:45,190 --> 00:07:48,819 is blue, given that the second marble drawn is yellow? 88 00:07:48,819 --> 00:07:54,050 Intuitively, this seems tricky. Pause the video and reason through the probability tree 89 00:07:54,050 --> 00:08:01,050 with a friend. 90 00:08:01,370 --> 00:08:05,680 Because we are conditioning on the event that the second marble drawn is yellow, our sample 91 00:08:05,680 --> 00:08:09,289 space is restricted to these two paths: P(blue, yellow) and P(yellow, yellow). 92 00:08:09,289 --> 00:08:14,690 Of these two paths, only the top one meets our criteria - that the blue marble is drawn 93 00:08:14,690 --> 00:08:16,759 first. 94 00:08:16,759 --> 00:08:21,919 We represent the probability as a fraction of favorable to possible outcomes. 
Hence, 95 00:08:21,919 --> 00:08:26,199 the probability that the first marble drawn is blue, given that the second marble drawn 96 00:08:26,199 --> 00:08:33,198 is yellow is 3/10 divided by (3/10 +3/10), which works out to 1/2. 97 00:08:33,450 --> 00:08:39,000 I hope you appreciate that tree diagrams and tables make these types of probability problems 98 00:08:39,000 --> 00:08:46,000 doable without having to memorize any formulas! 99 00:08:47,210 --> 00:08:51,830 Let's return to our opening question. Recall that you've tested positive for a cancer that 100 00:08:51,830 --> 00:08:56,780 afflicts 1 out of 1000 people, based on a test that is 99% accurate. 101 00:08:56,780 --> 00:09:02,950 More precisely, out of 100 test results, we expect about 99 correct results and only 1 102 00:09:02,950 --> 00:09:05,100 incorrect result. 103 00:09:05,100 --> 00:09:10,290 Since the test is highly accurate, you might conclude that the test is unlikely to be wrong, 104 00:09:10,290 --> 00:09:13,110 and that you most likely have cancer. 105 00:09:13,110 --> 00:09:19,320 But wait! Let's first use conditional probability to make sense of our seemingly gloomy diagnosis. 106 00:09:19,320 --> 00:09:20,850 Now pause the video and determine the probability that you have the cancer, given that you test 107 00:09:20,850 --> 00:09:20,940 positive. 108 00:09:20,940 --> 00:09:24,470 Let's use a tree diagram to help with our calculations. 109 00:09:24,470 --> 00:09:30,650 The first branch of the tree represents the likelihood of cancer in the general population. 110 00:09:30,650 --> 00:09:36,910 The probability of having the rare cancer is 1 in 1000, or 0.001. The probability of 111 00:09:36,910 --> 00:09:41,790 having no cancer is 0.999. 112 00:09:41,790 --> 00:09:46,260 Let's extend the tree diagram to illustrate the possible results of the medical test that 113 00:09:46,260 --> 00:09:49,020 is 99% accurate. 
114 00:09:49,020 --> 00:09:56,020 In the cancer population, 99% will test positive (correctly), but 1% will test negative (incorrectly). 115 00:09:57,150 --> 00:10:01,200 These incorrect results are called false negatives. 116 00:10:01,200 --> 00:10:07,520 In the cancer-free population, 99% will test negative (correctly), but 1% will test positive 117 00:10:07,520 --> 00:10:13,590 (incorrectly). These incorrect results are called false positives. 118 00:10:13,590 --> 00:10:19,270 Given that you test positive, our sample space is now restricted to only the population that 119 00:10:19,270 --> 00:10:24,920 tests positive. This is represented by these two paths. 120 00:10:24,920 --> 00:10:30,760 The top path shows the probability you have the cancer AND test positive. The lower path 121 00:10:30,760 --> 00:10:37,440 shows the probability that you don't have cancer AND still test positive. 122 00:10:37,440 --> 00:10:44,440 The probability that you actually do have the cancer, given that you test positive, is (0.001 * 0.99) / ((0.001 * 0.99) + (0.999 * 0.01)), 123 00:10:55,720 --> 00:11:01,150 which works out to about 0.09 - less than 10%! 124 00:11:01,150 --> 00:11:06,880 The error rate of the test is only 1 percent, but, given a positive result, the chance of a misdiagnosis is more than 125 00:11:06,880 --> 00:11:13,550 90%! Chances are pretty good that you do not actually have cancer, despite the rather accurate 126 00:11:13,550 --> 00:11:16,760 test. Why is this so? 127 00:11:16,760 --> 00:11:22,440 The accuracy of the test actually reflects the conditional probability that one tests 128 00:11:22,440 --> 00:11:25,070 positive, given that one has cancer. 129 00:11:25,070 --> 00:11:30,520 But in practice, what you want to know is the conditional probability that you have 130 00:11:30,520 --> 00:11:37,520 cancer, given that you test positive! These probabilities are NOT the same! 
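The two "test positive" paths of the tree can be checked with a few lines of Python. This is only a sketch of the arithmetic described in the video: the function name and its parameter names are made up for illustration, and it assumes, as the video does, that "99% accurate" means both a 99% chance of a positive result when cancer is present and a 1% chance of a positive result when it is absent.

```python
from fractions import Fraction


def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """Return P(cancer | positive) from the two 'positive' paths of the tree."""
    # Top path: has cancer AND tests positive.
    true_positive = prevalence * sensitivity
    # Lower path: no cancer AND still tests positive.
    false_positive = (1 - prevalence) * false_positive_rate
    # Condition on testing positive: favorable path over both positive paths.
    return true_positive / (true_positive + false_positive)


# Numbers from the video: 1 in 1000 prevalence, 99% accurate test.
p = posterior_given_positive(Fraction(1, 1000), Fraction(99, 100), Fraction(1, 100))
print(p, float(p))  # exact fraction 11/122, roughly 0.09
```

Using exact fractions avoids rounding, so the result can be compared directly with the "about 0.09" figure quoted above.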
131 00:11:37,520 --> 00:11:42,550 Whenever we take medical tests, or perform experiments, it is important to understand 132 00:11:42,550 --> 00:11:47,260 what events our results are conditioned on, and how that might affect the accuracy of 133 00:11:47,260 --> 00:11:53,180 our conclusions. 134 00:11:53,180 --> 00:11:57,180 In this video, you've seen that conditional probability must be used to understand and 135 00:11:57,180 --> 00:12:02,810 predict the outcomes of many events. You've also learned to evaluate and manage conditional 136 00:12:02,810 --> 00:12:06,830 probabilities using tables and trees. 137 00:12:06,830 --> 00:12:11,310 We hope that you will now think more carefully about the probabilities you encounter, and 138 00:12:11,310 --> 00:12:14,200 consider how conditioning affects their interpretation.
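For readers who want to verify the jar example from earlier in the video (2 blue and 3 yellow marbles, two draws without replacement), the tree can also be checked by brute-force enumeration. The sketch below is an illustration, not part of the video: the helper name `prob` and the variable names are hypothetical, and it relies on the fact that every ordered pair of distinct marbles is equally likely.

```python
from fractions import Fraction
from itertools import permutations

jar = ["blue", "blue", "yellow", "yellow", "yellow"]

# All ordered pairs of distinct marbles: 5 * 4 = 20 equally likely outcomes.
pairs = list(permutations(range(len(jar)), 2))


def prob(event):
    """Probability of an event given as a predicate on (first, second) colors."""
    hits = sum(1 for i, j in pairs if event(jar[i], jar[j]))
    return Fraction(hits, len(pairs))


p_blue_blue = prob(lambda first, second: first == "blue" and second == "blue")
p_second_yellow = prob(lambda first, second: second == "yellow")
p_blue_then_yellow = prob(lambda first, second: first == "blue" and second == "yellow")

# Conditioning on "second is yellow" restricts the sample space to those paths.
p_first_blue_given_second_yellow = p_blue_then_yellow / p_second_yellow

print(p_blue_blue)                       # 1/10
print(p_second_yellow)                   # 3/5
print(p_first_blue_given_second_yellow)  # 1/2
```

The three printed values match the ones read off the tree diagram: 1/10 for two blues in a row, 3/5 for a yellow second marble, and 1/2 for a blue first marble given a yellow second marble.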