We will now go through an example that involves a continuous unknown parameter, the unknown bias of a coin, and discrete observations, namely, the number of heads that are observed in a sequence of coin flips. This is an example that we will start in some detail now, and we will also revisit later on. And in the process, we will also have the opportunity to introduce a new class of probability distributions.

This example is an extension of an example that we have already seen, when we first introduced the relevant version of the Bayes rule. We have a coin. It has a certain bias between 0 and 1, but the bias is unknown. And consistent with the Bayesian philosophy, we treat this unknown bias as a random variable, Theta, and we assign a prior probability distribution to it. We flip this coin n times independently, where n is some positive integer, and we record the number of heads that are obtained. On the basis of the value of this random variable, we would like to make inferences about Theta.

Now, to make some more concrete progress, let us make a specific assumption.
Let us assume that the prior on Theta is uniform on the unit interval, in some sense reflecting complete ignorance about the true value of Theta. We observe the value of this random variable, some little k; we fix that value, and we are interested in the functional dependence of the posterior on theta, when k is given to us.

How do we do this? We use the appropriate form of the Bayes rule, which in this setting is as follows. It is the usual form, but we have f's indicating densities whenever we're talking about the distribution of Theta, because Theta is continuous. And whenever we talk about the distribution of K, which is discrete, we use the symbol p, because we're dealing with probability mass functions.

As always, the denominator term is such that the integral of the whole expression over theta is equal to 1. This is the necessary normalization property, and because of this, the denominator term has to be equal to the integral of the numerator over all theta, which is what we have here.

So now let us move on and apply this formula. We first have the prior, which is equal to 1.
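Since the formula itself appears on the slide rather than in the narration, here is the version of the Bayes rule being described, reconstructed from the surrounding discussion: densities (f's) for the continuous Theta, mass functions (p's) for the discrete K, and a denominator equal to the integral of the numerator over all theta:

```latex
f_{\Theta \mid K}(\theta \mid k)
  = \frac{f_\Theta(\theta)\, p_{K \mid \Theta}(k \mid \theta)}{p_K(k)},
\qquad
p_K(k) = \int_0^1 f_\Theta(\theta')\, p_{K \mid \Theta}(k \mid \theta')\, d\theta' .
```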
Then we have the probability that K is equal to little k. This is the probability of obtaining exactly k heads, if I tell you the bias of the coin. But if I tell you the bias of the coin, we're dealing with the usual model of independent coin flips, and the probability of k heads is given by the binomial probabilities, which take this form. And finally, we have the denominator term, which we do not need to evaluate at this point.

Now, I said earlier that we're interested in the dependence on theta, which comes through these terms. On the other hand, the remaining terms do not involve any thetas, and so they can be lumped together in just a constant. And so we can write the answer that we have found in this more suggestive form. We have some normalizing constant, and here we keep separately the dependence on theta.

Of course, this answer that we derived is valid for little theta belonging to the unit interval. Outside the unit interval, either the prior density or the posterior density of Theta would be equal to 0.
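The posterior just derived, c times theta to the k times 1 minus theta to the n minus k on the unit interval, can be sanity-checked numerically. The sketch below is not part of the lecture; the function name and grid size are illustrative. It evaluates the unnormalized posterior on a grid and divides by a numerical normalizing constant, playing the role of c:

```python
def posterior_uniform_prior(n, k, m=2000):
    """Grid approximation of the posterior density of Theta given k heads
    in n flips, under a uniform prior on [0, 1]:
        f(theta | k) = c * theta**k * (1 - theta)**(n - k).
    The binomial coefficient and the denominator are absorbed into c."""
    step = 1.0 / m
    thetas = [(i + 0.5) * step for i in range(m)]          # grid midpoints
    unnorm = [t**k * (1 - t)**(n - k) for t in thetas]     # numerator's theta part
    c = 1.0 / (sum(unnorm) * step)                         # makes it integrate to 1
    return thetas, [c * u for u in unnorm]

thetas, density = posterior_uniform_prior(10, 7)
# The density peaks near the observed fraction of heads, k/n = 0.7.
```

As a design note, working on a grid avoids needing the closed-form normalizing constant, exactly mirroring the lecture's point that the denominator need not be evaluated explicitly.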
This particular form of the posterior distribution for Theta is a certain type of density, and it shows up in various contexts. And for this reason, it has a name. It is called a Beta distribution with certain parameters, and the parameters reflect the exponents that we have up here in the two terms. Note that these parameters are the exponents augmented by 1. This is for historical reasons that do not concern us here. It is just a convention.

The important thing is to be able to recognize what it takes for a distribution to be a Beta distribution. That is, the dependence on theta is of the form theta to some power times 1 minus theta to some other power. Any distribution of this form is called a Beta distribution.

So now, let's continue this example by considering a different prior. Suppose that the prior is itself a Beta distribution of this form, where alpha and beta are some non-negative numbers. What is the posterior in this case? We just go through the same calculation as before, but instead of using 1 in the place of the prior, we now use the prior that's given to us.
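For reference, in the standard parametrization (the "exponents augmented by 1" convention mentioned above), a Beta distribution with parameters alpha and beta has density

```latex
f_\Theta(\theta) = \frac{1}{B(\alpha, \beta)}\,
  \theta^{\alpha - 1} (1 - \theta)^{\beta - 1},
\qquad 0 \le \theta \le 1,
```

where B(alpha, beta) is the normalizing constant. In this parametrization, the uniform-prior posterior found above, proportional to theta to the k times 1 minus theta to the n minus k, is a Beta distribution with parameters k + 1 and n − k + 1. Note that in the transcript's own notation, alpha and beta stand for the exponents themselves.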
The probability of k heads in the n tosses, when we know the bias, is exactly as before. It is given by the binomial probabilities. And finally, we need to divide by the denominator term, which is the normalizing constant.

What do we observe here? The dependence on theta comes through these terms. The remaining terms do not involve theta, and they can all be absorbed in a constant. Let's call that constant d, and collect the remaining terms. We have theta to the power of alpha plus k, and then 1 minus theta to the power of beta plus n minus k.

And once more, this is the form of the posterior for thetas belonging to this range. The posterior is 0 outside this range.

So what do we see? We started with a prior that came from the Beta family of this form, and we came up with a posterior that is still a function of theta of this form, but with different values of the parameters alpha and beta. Namely, alpha gets replaced by alpha plus k, and beta gets replaced by beta plus n minus k. So we see that if we start with a prior from the family of Beta distributions, the posterior will also be in that same family.
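This conjugacy can be verified numerically, using the transcript's exponent notation: a prior proportional to theta to the alpha times 1 minus theta to the beta, multiplied by the binomial likelihood and renormalized, should coincide with the density whose exponents are alpha plus k and beta plus n minus k. The grid-based helpers below are illustrative, not from the lecture:

```python
def grid_density(exponents, m=20000):
    """Normalized density proportional to theta**a * (1 - theta)**b,
    evaluated on a grid of midpoints over [0, 1]."""
    a, b = exponents
    step = 1.0 / m
    thetas = [(i + 0.5) * step for i in range(m)]
    vals = [t**a * (1 - t)**b for t in thetas]
    z = sum(vals) * step                      # numerical normalizing constant
    return [v / z for v in vals]

def posterior_exponents(a, b, n, k):
    """Prior exponents (a, b) plus k heads in n flips give
    posterior exponents (a + k, b + n - k)."""
    return (a + k, b + n - k)
```

Multiplying the prior grid by the likelihood's theta-dependence and renormalizing gives the same curve as evaluating the updated exponents directly; the binomial coefficient and the constant d drop out in the normalization, just as in the derivation.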
This is a beautiful property of Beta distributions that can be exploited in various ways. One of them is that it allows for recursive ways of updating the posterior of Theta as we get more and more observations.
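A small sketch of that recursive updating, again in the transcript's exponent notation (the helper name and the example data are illustrative): since the posterior stays in the Beta family, each new batch of observations just increments the two exponents, and folding in the flips one at a time gives the same result as processing them all at once.

```python
def beta_update(exponents, heads, flips):
    """After observing `heads` heads in `flips` independent coin flips,
    a prior with exponents (a, b) -- density proportional to
    theta**a * (1 - theta)**b -- has posterior exponents
    (a + heads, b + flips - heads)."""
    a, b = exponents
    return (a + heads, b + flips - heads)

# Recursive updating: fold in the observations one flip at a time.
exponents = (0, 0)                 # uniform prior: theta**0 * (1 - theta)**0
for outcome in [1, 0, 1, 1, 0]:    # 1 = heads, 0 = tails
    exponents = beta_update(exponents, outcome, 1)

# This matches a single batch update with 3 heads in 5 flips.
assert exponents == beta_update((0, 0), 3, 5)   # both give (3, 2)
```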