1 00:00:00,700 --> 00:00:03,880 We are now ready to move on to a model which 2 00:00:03,880 --> 00:00:07,240 is quite interesting and quite realistic. 3 00:00:07,240 --> 00:00:11,900 This is a model in which we have an unknown parameter modeled 4 00:00:11,900 --> 00:00:14,970 as a random variable that we try to estimate. 5 00:00:14,970 --> 00:00:17,080 This is the random variable, Theta. 6 00:00:17,080 --> 00:00:18,880 And we have multiple observations 7 00:00:18,880 --> 00:00:20,540 of that random variable. 8 00:00:20,540 --> 00:00:22,380 Each one of those observations is 9 00:00:22,380 --> 00:00:24,720 equal to the unknown random variable, 10 00:00:24,720 --> 00:00:27,490 plus some additive noise. 11 00:00:27,490 --> 00:00:30,900 This is a model that appears quite often in practice. 12 00:00:30,900 --> 00:00:32,509 It is often the case that we're trying 13 00:00:32,509 --> 00:00:35,730 to estimate a certain quantity, but we can only 14 00:00:35,730 --> 00:00:39,300 observe values of that quantity in the presence of noise. 15 00:00:39,300 --> 00:00:42,130 And because of the noise, what we want to do 16 00:00:42,130 --> 00:00:45,280 is to try to measure it multiple times. 17 00:00:45,280 --> 00:00:48,470 And so we have multiple such measurement equations. 18 00:00:48,470 --> 00:00:52,020 And then we want to combine all of the observations together 19 00:00:52,020 --> 00:00:56,340 to come up with a good estimate of that parameter. 20 00:00:56,340 --> 00:00:58,440 The assumptions that we will be making 21 00:00:58,440 --> 00:01:01,950 are that Theta is a normal random variable. 22 00:01:01,950 --> 00:01:05,019 It has a certain mean that we denote by x0. 23 00:01:05,019 --> 00:01:08,030 The reason for this strange notation will be seen later. 24 00:01:08,030 --> 00:01:10,800 And it also has a certain variance. 25 00:01:10,800 --> 00:01:14,130 The noise terms are also normal random variables 26 00:01:14,130 --> 00:01:17,360 with 0 mean and a certain variance.
27 00:01:17,360 --> 00:01:21,300 And finally, we assume that these basic random variables 28 00:01:21,300 --> 00:01:26,240 that define our model are all independent. 29 00:01:26,240 --> 00:01:29,680 Based on these assumptions, now we would like to estimate Theta 30 00:01:29,680 --> 00:01:32,840 on the basis of the X's. 31 00:01:32,840 --> 00:01:36,140 And as usual, in the Bayesian setting, 32 00:01:36,140 --> 00:01:38,759 what we want to do is to calculate the posterior 33 00:01:38,759 --> 00:01:42,360 distribution of Theta, given the X's. 34 00:01:42,360 --> 00:01:45,090 The Bayes rule has the usual form 35 00:01:45,090 --> 00:01:47,590 for the case of continuous random variables. 36 00:01:47,590 --> 00:01:51,740 The only remark that needs to be made is that in this case, 37 00:01:51,740 --> 00:01:55,729 there are multiple X's, so X up here stands 38 00:01:55,729 --> 00:01:58,265 for the vector of the observations 39 00:01:58,265 --> 00:01:59,920 that we have obtained. 40 00:01:59,920 --> 00:02:02,160 And similarly, little x will stand 41 00:02:02,160 --> 00:02:06,260 for the vector of the values of these observations. 42 00:02:06,260 --> 00:02:09,070 So we need now to start making some progress 43 00:02:09,070 --> 00:02:11,810 towards calculating this term here. 44 00:02:11,810 --> 00:02:15,080 What is the distribution of the vector of measurements 45 00:02:15,080 --> 00:02:16,600 given theta. 46 00:02:16,600 --> 00:02:19,030 Before we move to the vector case, 47 00:02:19,030 --> 00:02:23,570 let us look at one of the measurements in isolation. 48 00:02:23,570 --> 00:02:26,600 This is something that we have already seen. 
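For reference, the model and the Bayes rule described so far can be written out as below. This is a sketch in LaTeX: the variance symbols \sigma_0^2 (for Theta) and \sigma_i^2 (for W_i) follow the notation spoken later in the lecture, and the exact typesetting is an assumption, since the slide itself is not part of the transcript.

```latex
\begin{gather*}
X_i = \Theta + W_i, \qquad i = 1, \dots, n, \\
\Theta \sim N(x_0, \sigma_0^2), \qquad W_i \sim N(0, \sigma_i^2), \qquad
\Theta, W_1, \dots, W_n \text{ independent}, \\
f_{\Theta \mid X}(\theta \mid x)
  = \frac{f_{\Theta}(\theta)\, f_{X \mid \Theta}(x \mid \theta)}{f_X(x)},
  \qquad x = (x_1, \dots, x_n).
\end{gather*}
```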
49 00:02:26,600 --> 00:02:30,770 If I tell you the value of the random variable, 50 00:02:30,770 --> 00:02:35,210 Theta, which is what happens in this conditional universe 51 00:02:35,210 --> 00:02:38,270 when you condition on the value of Theta, 52 00:02:38,270 --> 00:02:42,300 then in that universe, the random variable, Xi, 53 00:02:42,300 --> 00:02:45,610 is equal to the numerical value that you gave me 54 00:02:45,610 --> 00:02:47,460 for Theta, plus Wi. 55 00:02:50,329 --> 00:02:55,360 And because Wi is independent from the random variable Theta, 56 00:02:55,360 --> 00:02:57,579 knowing the value of the random variable 57 00:02:57,579 --> 00:03:00,300 Theta does not change the distribution of Wi. 58 00:03:00,300 --> 00:03:02,990 It will still have this normal distribution. 59 00:03:02,990 --> 00:03:08,580 So Xi is a normal of this kind plus a constant. 60 00:03:08,580 --> 00:03:12,780 And so Xi is a normal random variable 61 00:03:12,780 --> 00:03:16,470 with mean equal to the constant that we added, 62 00:03:16,470 --> 00:03:19,020 and variance equal to the original variance 63 00:03:19,020 --> 00:03:21,420 of the random variable, Wi. 64 00:03:21,420 --> 00:03:25,750 And so we can now write down, the PDF, the conditional PDF, 65 00:03:25,750 --> 00:03:27,325 of Xi. 66 00:03:27,325 --> 00:03:30,140 There's going to be a normalizing constant. 67 00:03:30,140 --> 00:03:32,700 And then the usual exponential term, 68 00:03:32,700 --> 00:03:39,730 which is going to be xi minus the mean of the distribution, 69 00:03:39,730 --> 00:03:42,704 which is theta. 70 00:03:42,704 --> 00:03:45,620 And then we divide by the usual variance term. 71 00:03:48,280 --> 00:03:53,730 Let us move next to this distribution here. 72 00:03:53,730 --> 00:03:58,940 This is a shorthand notation for the joint PDF 73 00:03:58,940 --> 00:04:03,480 of the random variables X1 up to Xn, 74 00:04:03,480 --> 00:04:07,610 conditional on the random variable Theta. 
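The conditional PDF of a single measurement described in words above can be sketched as follows. The constant c_i is the usual normal normalizing constant; writing it out explicitly is an addition here, since the lecture only refers to it as "a normalizing constant".

```latex
\begin{equation*}
f_{X_i \mid \Theta}(x_i \mid \theta)
  = c_i \exp\!\left\{ -\frac{(x_i - \theta)^2}{2\sigma_i^2} \right\},
  \qquad c_i = \frac{1}{\sigma_i \sqrt{2\pi}} .
\end{equation*}
```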
75 00:04:07,610 --> 00:04:11,446 So it's really a function of multiple variables. 76 00:04:14,990 --> 00:04:17,820 And how do we proceed now? 77 00:04:17,820 --> 00:04:20,899 Here is the crucial observation. 78 00:04:20,899 --> 00:04:26,090 If I tell you the value of the random variable capital 79 00:04:26,090 --> 00:04:32,400 Theta as before, then you argue as follows. 80 00:04:32,400 --> 00:04:35,460 All of these random variables are independent. 81 00:04:35,460 --> 00:04:39,180 So if I tell you the value of the random variable Theta, 82 00:04:39,180 --> 00:04:43,490 this does not change the joint distribution of the Wi's. 83 00:04:43,490 --> 00:04:46,680 The Wi's were independent when we started, 84 00:04:46,680 --> 00:04:50,000 so they remain independent in the conditional universe. 85 00:04:58,630 --> 00:05:01,110 And since the Wi's are independent 86 00:05:01,110 --> 00:05:05,560 and Xi's are obtained from the Wi's by just adding a constant, 87 00:05:05,560 --> 00:05:09,670 this means that the Xi's are also 88 00:05:09,670 --> 00:05:13,420 independent in this conditional universe. 89 00:05:13,420 --> 00:05:16,080 Once I tell you the value of Theta, 90 00:05:16,080 --> 00:05:18,490 then because the noises are independent, 91 00:05:18,490 --> 00:05:21,460 the observations are also independent. 92 00:05:21,460 --> 00:05:27,810 But this means that the joint PDF factors as a product 93 00:05:27,810 --> 00:05:32,330 of the individual marginal PDFs of the Xi's. 94 00:05:37,900 --> 00:05:41,540 And these PDFs, we have already found. 95 00:05:41,540 --> 00:05:44,409 So now, we can put everything together 96 00:05:44,409 --> 00:05:48,560 to write down a formula for the posterior PDF 97 00:05:48,560 --> 00:05:50,400 using the Bayes rule. 98 00:05:50,400 --> 00:05:57,300 We have this denominator term, which I will write first, 99 00:05:57,300 --> 00:06:00,580 and which term we do not need to evaluate. 
100 00:06:00,580 --> 00:06:04,320 Then we have the marginal PDF of Theta. 101 00:06:04,320 --> 00:06:08,440 Now since Theta is normal with these parameters, 102 00:06:08,440 --> 00:06:16,120 this is of the form e to the minus theta minus x0 103 00:06:16,120 --> 00:06:21,690 squared over 2 sigma 0 squared. 104 00:06:21,690 --> 00:06:27,500 And then we have this joint density of the X's conditioned 105 00:06:27,500 --> 00:06:32,840 on Theta, which we have already found, it is this product here. 106 00:06:32,840 --> 00:06:36,710 It's a product of n terms, one for each observation. 107 00:06:36,710 --> 00:06:40,480 And each one of these terms is what we have found earlier, 108 00:06:40,480 --> 00:06:43,780 so I'm just substituting this expression up here. 109 00:06:52,060 --> 00:06:56,390 Now once we have obtained the observations, so the value 110 00:06:56,390 --> 00:07:00,130 of the random variable capital X, that is, the value little x, 111 00:07:00,130 --> 00:07:01,710 is fixed. 112 00:07:01,710 --> 00:07:07,430 Once it is fixed, then the x's that appear here are constant. 113 00:07:07,430 --> 00:07:11,040 So in particular, this term here is a constant. 114 00:07:11,040 --> 00:07:12,840 We do not bother with it. 115 00:07:12,840 --> 00:07:19,030 And what we have is a constant times an exponential in terms 116 00:07:19,030 --> 00:07:21,720 that are quadratic in theta. 117 00:07:21,720 --> 00:07:25,020 So we recognize this kind of expression. 118 00:07:25,020 --> 00:07:29,410 It has to correspond to a normal distribution. 119 00:07:29,410 --> 00:07:34,120 And this is the first conclusion of this exercise. 120 00:07:34,120 --> 00:07:38,770 That is, the posterior PDF of the parameter, Theta, 121 00:07:38,770 --> 00:07:43,560 given our observations, this posterior PDF is normal. 122 00:07:43,560 --> 00:07:47,890 We have e to a quadratic function in theta. 
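The posterior assembled in the steps above can be sketched as follows. The names c_0, c_i, and c(x) for the constants are introduced here for illustration only; c(x) collects everything that does not depend on theta once the observed values x are fixed.

```latex
\begin{align*}
f_{X \mid \Theta}(x \mid \theta)
  &= \prod_{i=1}^{n} f_{X_i \mid \Theta}(x_i \mid \theta), \\
f_{\Theta \mid X}(\theta \mid x)
  &= \frac{1}{f_X(x)} \cdot
     c_0 \exp\!\left\{ -\frac{(\theta - x_0)^2}{2\sigma_0^2} \right\}
     \prod_{i=1}^{n} c_i \exp\!\left\{ -\frac{(x_i - \theta)^2}{2\sigma_i^2} \right\} \\
  &= c(x) \exp\!\left\{ -\frac{(\theta - x_0)^2}{2\sigma_0^2}
       - \sum_{i=1}^{n} \frac{(x_i - \theta)^2}{2\sigma_i^2} \right\}.
\end{align*}
```

The exponent is a quadratic function of theta, which is why the posterior is recognized as normal.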
123 00:07:47,890 --> 00:07:50,080 And that quadratic function also involves 124 00:07:50,080 --> 00:07:53,659 the specific values of the X's that we have obtained. 125 00:07:53,659 --> 00:07:57,710 Let us copy what we have found and rearrange it. 126 00:07:57,710 --> 00:08:02,430 Once more, we have a constant, then the exponential 127 00:08:02,430 --> 00:08:06,170 of the negative of some quadratic function in theta. 128 00:08:06,170 --> 00:08:07,955 And the specific quadratic function 129 00:08:07,955 --> 00:08:15,370 that we calculated just before takes this particular form. 130 00:08:15,370 --> 00:08:19,250 What is the mean of this normal distribution? 131 00:08:19,250 --> 00:08:23,340 The mean is the same as the peak. 132 00:08:23,340 --> 00:08:30,180 And to find the peak, the location at which this PDF is 133 00:08:30,180 --> 00:08:34,940 largest, what we do is we try to find 134 00:08:34,940 --> 00:08:39,190 the place at which this quadratic function is smallest. 135 00:08:39,190 --> 00:08:42,890 So what we do is to take the derivative with respect 136 00:08:42,890 --> 00:08:50,520 to theta of this quadratic, and set it to 0. 137 00:08:50,520 --> 00:08:54,530 This gives us a sum of terms. 138 00:08:54,530 --> 00:08:59,970 The derivative of the typical term 139 00:08:59,970 --> 00:09:14,970 is going to be theta minus xi, divided by sigma i squared. 140 00:09:14,970 --> 00:09:18,340 And this expression must be equal to 0 141 00:09:18,340 --> 00:09:23,440 if theta is at the peak of the posterior distribution. 142 00:09:23,440 --> 00:09:26,930 And so we now rearrange this equation. 143 00:09:26,930 --> 00:09:30,700 We split and take first the term involving theta, 144 00:09:30,700 --> 00:09:33,465 and this gives us a contribution of this kind. 145 00:09:39,190 --> 00:09:43,160 And we move the terms involving x's to the other side. 146 00:09:43,160 --> 00:09:45,365 And this gives us this expression.
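The minimization steps just described can be sketched in symbols as follows. The name quad(theta) is introduced here for illustration. The sum starts at i = 0 so that the prior term, with mean x_0 and variance \sigma_0^2, is treated in the same way as the n observation terms.

```latex
\begin{align*}
\operatorname{quad}(\theta)
  &= \sum_{i=0}^{n} \frac{(\theta - x_i)^2}{2\sigma_i^2}, \\
\frac{d}{d\theta} \operatorname{quad}(\theta)
  &= \sum_{i=0}^{n} \frac{\theta - x_i}{\sigma_i^2} = 0, \\
\theta \sum_{i=0}^{n} \frac{1}{\sigma_i^2}
  &= \sum_{i=0}^{n} \frac{x_i}{\sigma_i^2}.
\end{align*}
```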
147 00:09:50,830 --> 00:09:52,960 And finally, we take this sum here 148 00:09:52,960 --> 00:09:56,630 and send it to the denominator of the other side. 149 00:09:56,630 --> 00:10:00,940 And this gives us the final form of the solution: 150 00:10:00,940 --> 00:10:05,280 the peak of the posterior distribution, which is also 151 00:10:05,280 --> 00:10:08,400 the same as the conditional expectation of the posterior 152 00:10:08,400 --> 00:10:09,240 distribution. 153 00:10:09,240 --> 00:10:12,670 Whenever we have a normal distribution, 154 00:10:12,670 --> 00:10:15,550 the expected value is the same as the place 155 00:10:15,550 --> 00:10:17,390 where the distribution is largest. 156 00:10:21,280 --> 00:10:25,000 Let us now conclude with a few comments and words about how 157 00:10:25,000 --> 00:10:28,120 to interpret the result that we found. 158 00:10:28,120 --> 00:10:32,210 First, let me emphasize that the same conclusions that we have 159 00:10:32,210 --> 00:10:35,680 obtained for the case of a single observation 160 00:10:35,680 --> 00:10:40,110 go through in this case as well. 161 00:10:40,110 --> 00:10:43,250 The posterior distribution of the unknown parameter 162 00:10:43,250 --> 00:10:47,480 is still a normal distribution. 163 00:10:47,480 --> 00:10:49,860 Our state of knowledge about Theta 164 00:10:49,860 --> 00:10:52,000 after we obtain the observations is 165 00:10:52,000 --> 00:10:54,450 described by a normal distribution. 166 00:10:54,450 --> 00:10:56,190 Because it is a normal distribution, 167 00:10:56,190 --> 00:11:00,690 the location of its peak is the same as the expected value. 168 00:11:00,690 --> 00:11:03,070 And for this reason, the conditional expectation 169 00:11:03,070 --> 00:11:05,640 estimate and the maximum a posteriori probability 170 00:11:05,640 --> 00:11:07,490 estimates coincide. 171 00:11:07,490 --> 00:11:11,760 And finally, the form of the estimates that we get is 172 00:11:11,760 --> 00:11:16,700 a linear function in the xi's. 
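The final form of the estimate can be tried out numerically. The following is a minimal Python sketch, not part of the lecture: the function name `posterior_mean` and its parameter names are hypothetical, and it assumes all variances are strictly positive.

```python
def posterior_mean(x0, var0, xs, variances):
    """Posterior mean (which equals the MAP estimate) of Theta.

    x0, var0  : prior mean and prior variance of Theta
    xs        : observed values x_1, ..., x_n
    variances : noise variances sigma_1^2, ..., sigma_n^2

    Computes sum_i (x_i / sigma_i^2) / sum_i (1 / sigma_i^2),
    where i runs from 0 to n and the i = 0 term is the prior.
    """
    if len(xs) != len(variances):
        raise ValueError("xs and variances must have the same length")

    # Treat the prior mean as one more "observation" with index 0.
    values = [x0] + list(xs)
    all_vars = [var0] + list(variances)

    numerator = sum(x / v for x, v in zip(values, all_vars))
    denominator = sum(1.0 / v for v in all_vars)
    return numerator / denominator


if __name__ == "__main__":
    # Prior mean 0 and one observation equal to 2, with equal variances:
    # the estimate lands halfway between the two.
    print(posterior_mean(0.0, 1.0, [2.0], [1.0]))
```

Because the estimate is a ratio of two sums, it is linear in the x_i's for fixed variances, which is the linearity property mentioned above.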
173 00:11:16,700 --> 00:11:20,430 And this linearity is a very convenient property to have, 174 00:11:20,430 --> 00:11:23,180 because it allows further analysis 175 00:11:23,180 --> 00:11:26,630 of these ways of obtaining estimates. 176 00:11:26,630 --> 00:11:30,620 How do we interpret this formula? 177 00:11:30,620 --> 00:11:33,530 What we have here is the following. 178 00:11:33,530 --> 00:11:35,880 Each one of the xi's gets multiplied 179 00:11:35,880 --> 00:11:39,760 by a certain coefficient, which is 1 over the variance. 180 00:11:39,760 --> 00:11:42,750 And in the denominator, we have the sum 181 00:11:42,750 --> 00:11:44,870 of all of those coefficients. 182 00:11:44,870 --> 00:11:50,110 So what we really have here is a weighted average of the xi's. 183 00:11:50,110 --> 00:11:52,720 Now keep in mind that those xi's are not 184 00:11:52,720 --> 00:11:54,520 all of the same kind. 185 00:11:54,520 --> 00:11:58,170 One term is x0, which is the prior mean, 186 00:11:58,170 --> 00:12:02,470 whereas the remaining xi's are the values of the observations. 187 00:12:02,470 --> 00:12:05,080 So there's something interesting happening here. 188 00:12:05,080 --> 00:12:08,350 We combine the values of the observations 189 00:12:08,350 --> 00:12:10,790 with the value of the prior mean. 190 00:12:10,790 --> 00:12:12,900 And in some sense, the prior mean 191 00:12:12,900 --> 00:12:17,200 is treated as just one more piece of information 192 00:12:17,200 --> 00:12:19,010 available to us. 193 00:12:19,010 --> 00:12:23,100 And it is treated in a sort of equal way 194 00:12:23,100 --> 00:12:26,060 as the other observations. 195 00:12:26,060 --> 00:12:29,980 The weights that we have in this weighted average 196 00:12:29,980 --> 00:12:36,410 are such that each xi gets divided by the corresponding variance. 197 00:12:36,410 --> 00:12:38,130 Does this make sense? 198 00:12:38,130 --> 00:12:42,090 Well, suppose that sigma i squared is large.
199 00:12:45,290 --> 00:12:49,300 This means that the noise term Wi tends to be very large. 200 00:12:49,300 --> 00:12:53,190 So Xi is very noisy. 201 00:12:53,190 --> 00:12:56,940 And so it's not a useful observation to have. 202 00:12:56,940 --> 00:13:00,590 And in that case, it gets a small weight. 203 00:13:03,320 --> 00:13:06,510 So the weights are determined by the variances 204 00:13:06,510 --> 00:13:09,480 in a way that is quite sensible. 205 00:13:09,480 --> 00:13:12,350 Those observations that will get the most weight 206 00:13:12,350 --> 00:13:15,730 will be those observations for which the corresponding noise 207 00:13:15,730 --> 00:13:18,760 variance is small. 208 00:13:18,760 --> 00:13:22,280 So the solution to this estimation problem that we just 209 00:13:22,280 --> 00:13:26,350 went through has many nice properties. 210 00:13:26,350 --> 00:13:30,960 First, we stay within the world of normal random variables, 211 00:13:30,960 --> 00:13:33,760 because even the posterior is normal. 212 00:13:33,760 --> 00:13:36,540 We stay within the world of linear functions 213 00:13:36,540 --> 00:13:39,590 of normal random variables, and the form 214 00:13:39,590 --> 00:13:42,670 of the formula that we have, itself, 215 00:13:42,670 --> 00:13:46,120 has an appealing intuitive content.
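To see how the variances determine the weights, here is a small Python sketch. It is an illustration only: the function name `normalized_weights` and the example numbers are hypothetical and do not come from the lecture.

```python
def normalized_weights(var0, variances):
    """Weights of x_0, x_1, ..., x_n in the weighted average.

    var0      : prior variance sigma_0^2 (weight of the prior mean x_0)
    variances : noise variances sigma_1^2, ..., sigma_n^2

    Each weight is (1 / sigma_i^2) divided by the sum of all 1 / sigma_j^2,
    so the weights are positive and add up to 1.
    """
    precisions = [1.0 / v for v in [var0] + list(variances)]
    total = sum(precisions)
    return [p / total for p in precisions]


# Hypothetical example: a prior with variance 1, one observation with
# variance 1, and one very noisy observation with variance 100.
weights = normalized_weights(1.0, [1.0, 100.0])

if __name__ == "__main__":
    for index, weight in enumerate(weights):
        print(f"weight of x{index}: {weight:.4f}")
```

In this example the prior mean and the first observation get equal weight, since their variances are equal, while the noisy third term contributes very little.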