Let us now discuss in some more detail what it takes to carry out Bayesian inference when both random variables are discrete.

The unknown parameter, Theta, is a random variable that takes values in a discrete set, and we can think of these values as alternative hypotheses. In this case, we know how to do inference: we have in our hands the Bayes rule, and we have seen plenty of examples. So instead of going through one more example in detail, let us assume that we have a model, that we have observed the value of X, and that we have already determined the conditional PMF of the random variable Theta.

As a concrete example, suppose that Theta can take the values 1, 2, or 3. We have obtained our observation, and the conditional PMF takes this form. We could stop at this point, or we could continue by asking for a specific estimate of Theta, our best guess as to what Theta is.

One way of coming up with an estimate is to use the maximum a posteriori probability (MAP) rule, which looks for the value of theta that has the largest posterior, or conditional, probability. In this example, it is this value, so our estimate is going to be equal to 2.

An alternative way of coming up with an estimate is the LMS rule, which calculates an estimate equal to the conditional expectation of the unknown parameter, given the observation that we have made. This is just the mean of this conditional distribution. In this example, it would fall somewhere around here, and the numerical value, as you can check, is equal to 2.2.
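To make the two rules concrete, here is a minimal Python sketch that computes the MAP and LMS estimates from a posterior PMF. The full PMF is not spelled out in the transcript; the value 0.1 for theta equal to 1 is an assumption, chosen to be consistent with the probabilities 0.6 and 0.3 used below and with the estimates 2 and 2.2 quoted in the lecture.

```python
# Posterior PMF p_{Theta|X}(theta | x) for the observed value x.
# 0.6 and 0.3 are the values quoted in the lecture for theta = 2 and theta = 3;
# 0.1 for theta = 1 is an assumed value that makes the PMF sum to 1.
posterior = {1: 0.1, 2: 0.6, 3: 0.3}

# MAP rule: the value of theta with the largest posterior probability.
theta_map = max(posterior, key=posterior.get)

# LMS rule: the conditional expectation E[Theta | X = x].
theta_lms = sum(theta * p for theta, p in posterior.items())

print(theta_map, theta_lms)  # 2 and 2.2 (up to floating-point rounding)
```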
Next, we may be interested in how good a certain estimate is. For the case where we interpret the values of Theta as hypotheses, a relevant criterion is the probability of error. In this case, because we already have some data available in our hands and we are called to make an estimate, what we care about is the conditional probability, given the information that we have, that we are making an error.

Making an error means the following. We have the observation, and the value of the estimate has been determined; it is now a number, and that is why we write it with a lowercase theta hat. But the parameter is still unknown. We don't know what it is; it is described by this distribution, and there is a probability that it is going to be different from our estimate.

What is this probability? It depends on how we construct the estimate. If, in this example, we use the MAP rule and make an estimate of 2, there is probability 0.6 that the true value of Theta is also equal to 2, and we are fine. But there is a remaining probability of 0.4 that the true value of Theta is different from our estimate. So there is probability 0.4 of having made a mistake. If, instead of an estimate equal to 2, we had chosen an estimate equal to 3, then the true parameter would be equal to our estimate with probability 0.3, but we would have made an error with probability 0.7, which is a bigger probability of error.

More generally, the probability of error of a particular estimate is the sum of the probabilities of the other values of Theta. If we want to keep the probability of error small, we want to keep the sum of the probabilities of the other values small, which means we want to pick an estimate whose own probability is large. By that argument, we see that the way to achieve the smallest possible probability of error is to employ the MAP rule. This is a very important property of the MAP rule.
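Using the same assumed posterior as in the earlier sketch, here is a short illustration of the conditional probability of error for each candidate estimate; the MAP choice indeed gives the smallest value.

```python
# Same assumed posterior as above.
posterior = {1: 0.1, 2: 0.6, 3: 0.3}

# If we report the estimate theta_hat, the conditional probability of error is
# P(Theta != theta_hat | X = x) = 1 - p_{Theta|X}(theta_hat | x),
# i.e. the sum of the posterior probabilities of all the other values.
for theta_hat, p in posterior.items():
    print(theta_hat, 1.0 - p)   # prints: 1 0.9, then 2 0.4, then 3 0.7

# The estimate with the smallest conditional probability of error is the one
# with the largest posterior probability, which is exactly the MAP estimate.
```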
Now, this is the conditional probability of error, given that we already have data in our hands. But more generally, we may want to compare estimators, or talk about their performance, in terms of their overall probability of error. We are designing a decision-making system that is going to process data and make decisions. In order to say how good our system is, we want to say that overall, whenever we use the system, there is going to be some random parameter and some value of the estimate, and we want to know the probability that these two will be different.

We can calculate this overall probability of error by using the total probability theorem and the conditional probabilities of error, as follows. We condition on the value of X. For any possible value of X, we have a conditional probability of error, and then we take a weighted average of these conditional probabilities of error. There is also an alternative way of using the total probability theorem, which is to first condition on Theta and calculate the conditional probability of error for a given choice of this unknown parameter. Both of these formulas can be used; which one of the two is more convenient really depends on the specifics of the problem.

Finally, I would like to make an important observation. We argued that for any particular choice of an observation, the MAP rule achieves the smallest possible probability of error. So under the MAP rule, this term is as small as possible for any given value of the random variable, capital X. Since each term of this sum is as small as possible under the MAP rule, the overall sum will also be as small as possible. This means that the overall probability of error is also smallest under the MAP rule. In this sense, the MAP rule is the optimal way of coming up with estimates in the hypothesis-testing context, where we want to minimize the probability of error.
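As a sketch of the two decompositions, here is a small example with a made-up joint model for (Theta, X). The prior and the measurement probabilities below are invented numbers, used only to show that conditioning on X and conditioning on Theta give the same overall probability of error for the MAP estimator.

```python
# Hypothetical prior p_Theta(theta) and measurement model p_{X|Theta}(x | theta).
# All numbers are made up purely for illustration.
prior = {1: 0.3, 2: 0.4, 3: 0.3}
likelihood = {               # likelihood[theta][x]
    1: {0: 0.8, 1: 0.2},
    2: {0: 0.3, 1: 0.7},
    3: {0: 0.5, 1: 0.5},
}
xs = [0, 1]

# Joint PMF and the marginal PMF of X.
joint = {(t, x): prior[t] * likelihood[t][x] for t in prior for x in xs}
p_x = {x: sum(joint[t, x] for t in prior) for x in xs}

# MAP estimator: for each x, pick the theta with the largest posterior,
# which is proportional to the joint probability p_Theta(t) * p_{X|Theta}(x | t).
map_est = {x: max(prior, key=lambda t: joint[t, x]) for x in xs}

# Decomposition 1, conditioning on X:
# P(error) = sum_x p_X(x) * P(Theta != theta_hat(x) | X = x)
p_err_via_x = sum(p_x[x] * (1 - joint[map_est[x], x] / p_x[x]) for x in xs)

# Decomposition 2, conditioning on Theta:
# P(error) = sum_theta p_Theta(theta) * P(theta_hat(X) != theta | Theta = theta)
p_err_via_theta = sum(
    prior[t] * sum(likelihood[t][x] for x in xs if map_est[x] != t)
    for t in prior
)

print(p_err_via_x, p_err_via_theta)  # both come out to 0.48 (up to rounding)
```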