We can finally go ahead and introduce the basic elements of the Bayesian inference framework. There is an unknown quantity, which we treat as a random variable, and this is what is special and why we call this the Bayesian inference framework. This is in contrast to other frameworks in which the unknown quantity Theta is just treated as an unknown constant. But here, we treat it as a random variable, and as such, it has a distribution. This is the prior distribution. This is what we believe about Theta before we obtain any data.

And then, we obtain some data, which is an observation. That observation is a random variable, but when the process gets realized, we observe an actual numerical value of this random variable. The observation process is modeled, again, in terms of a probabilistic model. We specify the distribution of X, or, more precisely, the conditional distribution of X: we say how X will behave if Theta happens to take on a specific value. These two pieces, the prior and the model of the observations, are the two components of the model that we will be working with.

Once we have obtained a specific value for the observations, we can use the Bayes rule to calculate the conditional distribution of Theta, either a conditional PMF if Theta is discrete or a conditional PDF if Theta is continuous. And this will be a complete solution, in some sense, of the Bayesian inference problem.

There is one philosophical issue about this framework, which is: where does this prior distribution come from? How do we choose it? Sometimes we can choose it using a symmetry argument. If there is a number of possible choices for Theta and we have no reason to believe that one is more likely than another, then the symmetry consideration gives us a uniform prior.
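To make these two ingredients concrete, here is a minimal Python sketch (not part of the lecture; the coin setting and all numbers are made up for illustration). Theta is an unknown coin bias with a symmetry-based uniform prior over three candidate values, X counts heads in n tosses, and the Bayes rule produces the posterior PMF.

```python
from math import comb

# Hypothetical setup: Theta is a coin bias restricted to three candidate
# values, with a uniform (symmetry-based) prior.
thetas = [0.25, 0.50, 0.75]
prior = {t: 1 / len(thetas) for t in thetas}

def likelihood(x, n, t):
    """Conditional PMF of X given Theta = t: X ~ Binomial(n, t)."""
    return comb(n, x) * t**x * (1 - t)**(n - x)

def posterior(x, n):
    """Bayes rule: p(t | x) is proportional to p(t) * p(x | t)."""
    unnormalized = {t: prior[t] * likelihood(x, n, t) for t in thetas}
    total = sum(unnormalized.values())
    return {t: u / total for t, u in unnormalized.items()}

# After observing 7 heads in 10 tosses, the posterior concentrates on 0.75:
print(posterior(x=7, n=10))  # roughly {0.25: 0.008, 0.50: 0.316, 0.75: 0.675}
```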
Beyond symmetry, we definitely take into account any information we have about the range of the parameter Theta: we use that range and assign zero prior probability to values of Theta outside the range. Sometimes we have some knowledge about Theta from previous studies of a certain problem that tell us a little bit about what Theta might be, and then, when we obtain new observations, we refine the results obtained from those previous studies by applying the Bayes rule. And in some cases, finally, the choice could be arbitrary or subjective, just reflecting our beliefs about Theta, some plausible judgment about the relative likelihoods of different choices of Theta.

Now, as we just discussed, the complete solution, or the complete answer, to a Bayesian inference problem is just the specification of the posterior distribution of Theta given the particular observation that we have obtained. Pictorially, if Theta is discrete, a complete answer might be in the form of a diagram that tells us that certain values of Theta are possible with certain probabilities. Or, if Theta is continuous, a complete solution might be in the form of a conditional PDF that again tells us the conditional distribution of Theta.

To appreciate the idea here, consider the problem of guessing the number of electoral votes that a candidate gets in the presidential election. The electoral votes are votes that the candidate gets from each of the states in the United States, and there is a certain number that the candidate needs in order to be elected president. One possible prediction could be the statement "I predict that candidate A will win," but a more complete presentation of the results of a poll could be a diagram of this kind, which is essentially a PMF. Here, a particular pollster collected all the data and gave the posterior probability distribution for the different possible numbers of electoral votes.
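To see why the full PMF is the more useful report, here is a small illustrative sketch: the vote counts and probabilities below are invented, though 270 is the actual number of electoral votes needed to win. The full posterior answers the yes/no question about winning, and much more besides.

```python
# Hypothetical posterior PMF over a candidate's electoral-vote total;
# the support points and probabilities are made up for illustration.
posterior_votes = {250: 0.05, 260: 0.10, 270: 0.20, 280: 0.30,
                   290: 0.20, 300: 0.10, 310: 0.05}

# A single summary statement ("the candidate will win") reports only one
# functional of this PMF, namely P(total >= 270):
p_win = sum(p for votes, p in posterior_votes.items() if votes >= 270)
print(f"P(at least 270 electoral votes) = {p_win:.2f}")  # 0.85
```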
Such a diagram is a lot more informative than the simple statement that we expect a certain candidate to get more than the required electoral votes.

So what is next? As we just discussed, the complete solution is in terms of a posterior distribution, but sometimes you may want to summarize this posterior distribution in a single number, a single estimate, and this could be a further stage of processing of the results. So let us talk about this.

Once you have in your hands the posterior distribution of Theta, either in a discrete or in a continuous setting, and you are asked to provide a single guess about what Theta is, how might you proceed? In the discrete case, you could argue as follows: these values of Theta all have some chance of occurring, but this value of Theta is the one that is most likely, so I am going to report this value as my best guess of what Theta is. Using a similar philosophy, you could look at the continuous case, find the value of Theta at which the PDF is largest, and report that particular value. This way of estimating Theta is called the maximum a posteriori probability rule. We already have in our hands the specific value of X, and therefore we have determined the conditional distribution of Theta. What we then do is find the value of theta that maximizes, over all possible thetas, the conditional PMF of the random variable capital Theta; similarly, in the continuous case, the value of theta that maximizes the conditional PDF of the random variable Theta.

This is one way of coming up with an estimate, but one can think of other ways. For example, I might want to report instead the mean of the conditional distribution, which in this diagram might be somewhere here, and in this picture might be somewhere here. This way of estimating Theta is the conditional expectation estimator. It just reports the value of the conditional expectation, the mean of this conditional distribution.
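As a small illustrative sketch (not from the lecture), both point estimates can be read off a discrete posterior directly; the PMF below reuses the rounded numbers from the earlier coin example.

```python
# Hypothetical posterior PMF for the coin bias (rounded values from the
# earlier sketch, after observing 7 heads in 10 tosses).
posterior_pmf = {0.25: 0.008, 0.50: 0.316, 0.75: 0.675}

# MAP rule: report the theta maximizing the conditional PMF,
#   theta_MAP = argmax over theta of p_{Theta|X}(theta | x).
theta_map = max(posterior_pmf, key=posterior_pmf.get)

# Conditional expectation estimate: the posterior mean,
#   E[Theta | X = x] = sum over theta of theta * p_{Theta|X}(theta | x).
theta_mean = sum(t * p for t, p in posterior_pmf.items())

print(theta_map, round(theta_mean, 3))  # 0.75 and roughly 0.666
```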
This conditional expectation estimator is also called the least mean squares estimator, because it has a certain useful and important property: it is the estimator that gives you the smallest mean squared error. We will discuss this particular issue in much more depth a little later.

Now, let me make two comments about terminology. What we have produced here is an estimate: I give you the conditional PDF or conditional PMF, and you tell me a number. This number, the estimate, is obtained by starting with the data, doing some processing of the data, and eventually coming up with a numerical value. Now, g is the way that we process the data; it is a certain rule. If we know the value of the data, we know what the estimate is going to be. But if I do not tell you the value of the data and you look at the situation more abstractly, then the only thing you can tell me is that I will be seeing a random variable, capital X, I will do some processing on it, and then I will obtain a certain quantity. Because capital X is random, the quantity that I obtain will also be random; it is a random variable. This random variable, capital Theta hat, we call an estimator. Sometimes we might also use the term estimator to refer to the function g, which is the way that we process the data.

In any case, it is important to keep this distinction in mind. The estimator is the rule that we use to process the data, and it is equivalent to a certain random variable. An estimate is the specific numerical value that we get when the data take a specific numerical value. So if little x is the numerical value of capital X, then little theta hat is the numerical value of the estimator capital Theta hat.

So at this point, we have a complete conceptual framework. We know, abstractly speaking, what it takes to calculate conditional distributions, and we have two specific estimators at hand.
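The estimate/estimator distinction maps directly onto code. In this illustrative sketch (the rule g below, the fraction of heads in n tosses, is just a hypothetical stand-in), g(x) at an observed value is a number, while g(X), viewed before the data arrive, is a random variable whose distribution we can simulate.

```python
import random

n = 10  # number of coin tosses in the (hypothetical) experiment

def g(x):
    """The rule used to process the data: here, the fraction of heads."""
    return x / n

# Estimate: the data have been realized as a specific value, so g(x)
# is a specific numerical value.
x = 7
theta_hat_value = g(x)            # 0.7, a number

# Estimator: before the data are realized, X is a random variable, so
# Theta_hat = g(X) is itself a random variable; we can sample from its
# distribution by simulating X (here under an assumed true bias of 0.5).
true_theta = 0.5
draws = [g(sum(random.random() < true_theta for _ in range(n)))
         for _ in range(1000)]    # samples of the random variable g(X)
```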
All that's left for us to do now is to consider various examples in which we can discuss what it takes to go through these various steps.