We now continue our discussion of the model in which we obtain several measurements of an unknown random variable Theta in the presence of additive noise, under the same assumptions as before: Theta and the Wi are all independent random variables, and they're also normal. We have seen that in this case, the posterior distribution of Theta is a normal distribution, and it takes this particular form. We found the mean of the posterior distribution, which is also the maximum a posteriori probability estimate, and it is given by this expression.

Now that we have an estimate in our hands, we can ask: how good is this estimate? For this, we need an appropriate performance measure. For estimation problems, a reasonable performance measure is to look at the mean value of the squared error. But given that we have already obtained observations, what we're interested in is the conditional mean squared error. This is the error that remains after we have seen the observations.

Now, let us notice something here. If I tell you the value of the observations, then my estimator is completely determined. The estimator is a random variable that processes the data and comes up with an estimate. So although it is a random variable, once I tell you the value of the observations, the value of this random variable has been completely determined. And so we can replace it with its actual numerical value, which is theta hat, and it is given by this expression.

Remember also that theta hat, the estimate that we're using, is the mean of the posterior distribution. In this conditional universe, where we have conditioned on this information, theta hat is the mean of this random variable. So we're dealing with the squared distance from the mean, and then we take the expected value. But that's nothing but the variance of Theta in this conditional universe. So what we're looking for is the variance of the posterior distribution of Theta, given the observations that we have obtained.
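In symbols, the step just described is the following identity (a brief reconstruction, since the slide itself is not shown; X denotes the vector of observations and theta hat the realized numerical estimate):

\[
\mathbb{E}\big[(\Theta - \hat{\Theta})^2 \mid X = x\big]
= \mathbb{E}\big[(\Theta - \hat{\theta})^2 \mid X = x\big]
= \operatorname{var}(\Theta \mid X = x),
\]

since \(\hat{\theta} = \mathbb{E}[\Theta \mid X = x]\) is the mean of the posterior distribution.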
Can we eyeball the variance by just looking at this formula for the posterior distribution? More or less, we can. Recall this earlier fact: if I give you a density of this form, you recognize that it is normal, and you also recognize that the variance is determined by the coefficient that comes next to the term of the form x squared. Now, that is a PDF involving a variable x. Here, we're talking about a PDF of Theta. So what we're looking for is the constant that sits next to the theta squared term.

There are going to be multiple theta squared terms, so we need to collect all of them. And so we find that the overall coefficient sitting next to the theta squared terms is as follows. From this term, we obtain a contribution of 1 over 2 sigma-zero squared. Similarly, from here, we're going to obtain a coefficient next to theta squared of 1 over 2 sigma-one squared. We continue the same way, and finally, from the last term, we obtain a contribution of this kind.

Now, we take this factor of 2 and move it to the other side, so we know what 2 times alpha is. And then we need to take the inverse of that, so as to obtain 1 over 2 alpha. What we obtain is that 1 over 2 alpha, when alpha is given by this expression, is equal to this expression here. And so we have found the conditional variance: the variance of the posterior distribution of Theta, given the data that we have available in our hands.
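The posterior mean and the conditional variance we just found fit in a few lines of code. Here is a minimal sketch in Python, assuming the lecture's notation of a prior mean x0 and prior standard deviation sigma0; the function and argument names are my own:

```python
def posterior_mean_var(x0, sigma0, xs, sigmas):
    """Posterior mean and variance of Theta, for the model
    Theta ~ N(x0, sigma0^2) and X_i = Theta + W_i with W_i ~ N(0, sigmas[i]^2),
    all independent."""
    # Collect the coefficients of theta squared: one term per source of information.
    precisions = [1 / sigma0**2] + [1 / s**2 for s in sigmas]
    weighted = [x0 / sigma0**2] + [x / s**2 for x, s in zip(xs, sigmas)]
    total_precision = sum(precisions)
    mean = sum(weighted) / total_precision  # the MAP / posterior-mean estimate
    var = 1 / total_precision               # the conditional mean squared error
    return mean, var
```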
Now, this is the mean squared error given that you have seen some particular piece of information. What about the overall mean squared error? This is the quantity that you care about before you go and make the actual measurements. It tells you how well you expect to estimate your random variable Theta.

Well, we can use here the total expectation theorem and write the expected value as a weighted average of the conditional expectations under different scenarios, namely under different measurements of X, averaging those conditional expectations over the possible values of X.

Now, this quantity here is actually a constant; it is this constant here. So we can pull it outside the expectation. What we're left with is the integral of a PDF over all possible values, which has to be equal to 1. So what we're left with is just the value of this constant, which is this particular number. And so we conclude that the overall, unconditional mean squared error is also the same. This makes perfect intuitive sense. Our mean squared error is going to take this value no matter what I observe, so on the average, it will also take that particular value.

Now, this expression that we have derived is also quite intuitive in its content. Let us try to understand some special cases. Suppose that one of the variances of the noise terms is very small. If one variance is small, this means that the corresponding term here is going to be big, so the sum of those terms is going to be big. As long as one term is big, the sum is also big, and 1 over that is going to be small. So in that case, the mean squared error is small. What this is saying is that if just one of the measurements has low noise, then the uncertainty that remains about the random variable that I'm trying to estimate will be small. I'm going to have a small error.

On the other hand, if all of the noise variances are large, then this means that all of these terms here are going to be small. I'm adding small terms, and 1 over something small is something big. And so the mean squared error is going to be large. That is, if all my measurements are very noisy, then I do not expect to estimate my random variable particularly well.
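Here is a quick Monte Carlo sanity check of the claim that the overall mean squared error equals the constant conditional variance; the particular noise levels are illustrative choices of my own:

```python
import random

x0, sigma0 = 0.0, 1.0     # prior: Theta ~ N(0, 1)
sigmas = [0.5, 1.0, 2.0]  # noise standard deviations (illustrative)
trials = 200_000

total_precision = 1 / sigma0**2 + sum(1 / s**2 for s in sigmas)
sq_err = 0.0
for _ in range(trials):
    theta = random.gauss(x0, sigma0)                     # draw Theta from the prior
    xs = [theta + random.gauss(0.0, s) for s in sigmas]  # X_i = Theta + W_i
    estimate = (x0 / sigma0**2
                + sum(x / s**2 for x, s in zip(xs, sigmas))) / total_precision
    sq_err += (theta - estimate) ** 2

print(sq_err / trials)      # empirical overall mean squared error
print(1 / total_precision)  # predicted value; the two should nearly agree
```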
Let us now look at one more special case. Suppose that all of the variances are the same: the noise variances, as well as the variance of the prior distribution. In that case, this expression here becomes 1 over the sum of n plus 1 terms, each of which is 1 over sigma squared. And that is the same as sigma squared over n plus 1. This expression makes quite a lot of sense. It tells us that as we obtain more and more observations, that is, as n increases, we improve our performance. The variance of the posterior distribution, or the mean squared error, goes down in this particular way.

Now, perhaps the most interesting aspect of the facts that we have established is this equation here, which tells us that no matter what the value of little x is, the conditional variance, the variance of the posterior distribution of Theta, is going to be the same. In some sense, it tells us that no particular value of X is more informative or more desirable than any other value. In order to really appreciate what that statement is saying, it's better to look at a very concrete example. So let us revisit the very first example that we studied, in which we only have one observation, and where Theta and W are standard normal random variables. We did go through that example, and we found that the estimator, the maximum a posteriori probability estimator, was 1/2 of the observation.

And now we are in a position to also calculate the conditional mean squared error given any particular observation. We apply this formula. In fact, we are dealing with this special case, with sigma equal to 1. So we use this expression here, and we see that it is 1/2. So we started with a prior variance for Theta, which was 1. After we obtain the observation, our uncertainty gets reduced, and the variance goes down to 1/2. And this is true no matter what little x is.
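To recap the example in symbols (treating the prior mean as a zeroth "observation" with value x0 = 0 and sigma0 = sigma1 = 1; the notation is my reconstruction of the slide):

\[
\hat{\theta} = \frac{x_0/\sigma_0^2 + x/\sigma_1^2}{1/\sigma_0^2 + 1/\sigma_1^2}
= \frac{0 + x}{1 + 1} = \frac{x}{2},
\qquad
\operatorname{var}(\Theta \mid X = x) = \frac{1}{1/\sigma_0^2 + 1/\sigma_1^2} = \frac{1}{2},
\]

which agrees with \(\sigma^2/(n+1)\) for \(\sigma = 1\) and \(n = 1\), and does not depend on \(x\).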
Pictorially, here is what's happening. We start with Theta being a standard normal, so it has a distribution of this form, centered at 0. This is a plot of the density of Theta, the prior density. Suppose that we obtain a measurement, and that measurement happens to be equal to 0. If the measurement is equal to 0, then our estimate will also be equal to 0. The posterior distribution of Theta is going to be a normal distribution whose mean is the estimate and whose variance is the quantity that we have calculated, which is 1/2. It is therefore narrower than the original PDF that we started from. So initially, we had a fair amount of uncertainty about Theta. After we obtain a measurement of 0, this kind of reinforces our belief that Theta is somewhere near 0, and so we obtain a narrower distribution. This is our updated belief about Theta.

But what if I happen to obtain a measurement that's somewhere out here? In this case, my estimate is going to be 1/2 of what I observed; it's here. And the posterior PDF of Theta is going to be a normal PDF that's centered around this point and has the same variance, 1/2.

So in some sense, this particular measurement would be thought of as quite an abnormal one. We're really surprised to obtain an observation which is so far away from 0, because our prior distribution told us that Theta is somewhere here. So we have been surprised. But after the surprise, and after we form our estimate of Theta, our state of knowledge about Theta is that Theta is a random variable with a distribution that's normal, centered around this point, and whose width is the same no matter what particular observation I happen to get. So even though this particular observation value is unusual, after it is obtained, the remaining uncertainty about Theta is the same as if we had obtained any other particular value of X. This is a very remarkable property that's special to this type of estimation problem, involving normal random variables and linear relations.
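As a usage example, the posterior_mean_var sketch from earlier makes this point concrete: a typical observation and a surprising one leave exactly the same posterior width (the values 0 and 4 are illustrative choices of my own):

```python
# Assumes the posterior_mean_var function defined in the earlier sketch.
for x in [0.0, 4.0]:
    mean, var = posterior_mean_var(0.0, 1.0, [x], [1.0])
    print(f"x = {x}: posterior mean {mean}, variance {var}")
# x = 0.0: posterior mean 0.0, variance 0.5
# x = 4.0: posterior mean 2.0, variance 0.5
```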
This property has a very nice side effect. It means that we can report everything there is to be said about the performance of this maximum a posteriori probability estimator by giving just a single number. We can characterize performance in terms of this number alone, as opposed to having to specify what the conditional mean squared error is for each of the different values of X.