The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So today we're going to finish our discussion of Bayesian inference, which we started last time. As you probably saw, there aren't a lot of new concepts being introduced at this point in terms of specific skills for calculating probabilities. Rather, it's more a matter of interpretation and of setting up the framework.

So the framework in Bayesian estimation is that there is some parameter which is not known, but we have a prior distribution on it. These are our beliefs about what this variable might be. Then we obtain some measurements, and the measurements are affected by the value of that unknown parameter. This effect, the fact that X is affected by Theta, is captured by introducing a conditional probability distribution: the distribution of X depends on Theta. So we have formulas for these two densities, the prior density and the conditional density. And given that we have these, if we multiply them we can also get the joint density of X and Theta. So at this point we have everything there is to know.

Now we observe the random variable X. Given this random variable, what can we say about Theta? Well, what we can always do is calculate the conditional distribution of Theta given X. And now that we have the specific value of X, we can plot this as a function of Theta. This is the complete answer to a Bayesian inference problem. The posterior distribution captures everything there is to say about Theta; that's what we know about Theta. Given the X that we have observed, Theta is still random, still unknown. It might be here, there, or there, with various probabilities.
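As a minimal numerical sketch of this framework (not part of the lecture; the flat prior, Gaussian measurement model, grid, and observed value below are illustrative assumptions), the posterior is just the prior times the likelihood, renormalized:

```python
import numpy as np

# Bayesian framework on a discretized grid: prior, measurement model,
# and the posterior obtained by multiplying the two and renormalizing.

thetas = np.linspace(0.0, 1.0, 501)            # grid of candidate Theta values
prior = np.ones_like(thetas)                    # illustrative flat prior
prior /= prior.sum()                            # treat the grid values as a PMF

def likelihood(x, thetas, sigma=0.2):
    """Assumed Gaussian measurement model f_{X|Theta}(x | theta), up to a constant."""
    return np.exp(-(x - thetas) ** 2 / (2 * sigma ** 2))

x_obs = 0.7                                     # the value of X we happened to observe
posterior = prior * likelihood(x_obs, thetas)   # prior times conditional density
posterior /= posterior.sum()                    # Bayes' rule: renormalize

# The posterior, as a function of theta for this particular x, is the
# complete answer to the inference problem.
print("posterior mean:", (thetas * posterior).sum())
```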
On the other hand, if you want to report a single value for Theta, then you do some extra work. You continue from here and do some data processing on X. Doing data processing means that you apply a certain function to the data, and this function is something that you design: it's the so-called estimator. Once that function is applied, it outputs an estimate of Theta, which we call Theta hat. So this is the big picture of what's happening.

One thing to keep in mind is that even though I'm writing single letters here, in general Theta or X could be vector random variables. So think of this as possibly a collection Theta1, Theta2, Theta3. And maybe we obtained several measurements, so this X is really a vector X1, X2, up to Xn.

All right, so now how do we choose a Theta to report? There are various ways of doing it. One is to look at the posterior distribution and report the value of Theta at which the density or the PMF is highest. This is called the maximum a posteriori estimate: we pick a value of Theta for which the posterior is maximum, and we report it.

An alternative is to try to be optimal with respect to the mean squared error. So what is this? If we have a specific estimator, g, this is the estimate it's going to produce. This is the true value of Theta, so this is our estimation error. We look at the square of the estimation error and take its average value. We would like this squared estimation error to be as small as possible. How can we design our estimator g to make that error as small as possible? It turns out that the answer is to produce, as the estimate, the conditional expectation of Theta given X. So the conditional expectation is the best estimate you can produce if your objective is to keep the mean squared error as small as possible. This statement is a statement about what happens on average, over all Thetas and all X's that may occur in our experiment.
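Here is a short sketch of how the two point estimates just mentioned are read off a posterior (the posterior shape below is an illustrative assumption, not the lecture's example): the MAP estimate is where the posterior is largest, and the least mean squares estimate is the posterior mean.

```python
import numpy as np

# Two point estimates from a discretized posterior: MAP and LMS.

thetas = np.linspace(4.0, 10.0, 601)
posterior = np.exp(-0.5 * ((thetas - 6.5) / 0.8) ** 2)    # assumed shape
posterior /= posterior.sum()                               # normalize on the grid

theta_map = thetas[np.argmax(posterior)]                   # maximum a posteriori
theta_lms = (thetas * posterior).sum()                     # E[Theta | X = x]

print("MAP estimate:", theta_map)
print("LMS estimate (posterior mean):", theta_lms)
```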
The conditional expectation as an estimator has an even stronger property. Not only is it optimal on average, it's also optimal given that you have made a specific observation, no matter what you observe. Say you observe a specific value of the random variable X. After that point, if you're asked to produce the best estimate Theta hat that minimizes the mean squared error, your best estimate is the conditional expectation given the specific value that you have observed.

These two statements say almost the same thing, but this one is a bit stronger. This one tells you that no matter what specific X occurs, the conditional expectation is the best estimate. This one tells you that on average, over all X's that may occur, the conditional expectation is the best estimator. The second is really a consequence of the first: if the conditional expectation is best for any specific X, then it's also best when X is left random and you average your error over all possible X's.

OK, so now that we know the optimal way of producing an estimate, let's do a simple example to see how things work out. We start with an unknown random variable Theta, which is uniformly distributed between 4 and 10. Then we have an observation model that tells us that, given the value of Theta, X is going to be a random variable that ranges between Theta - 1 and Theta + 1. So think of X as a noisy measurement of Theta: it's Theta plus some noise, which is between -1 and +1. Really, the model we are using here is that X is equal to Theta plus U, where U is uniform on [-1, +1]. So we have the true value of Theta, but X could be as low as Theta - 1, or all the way up to Theta + 1, and X is uniformly distributed on that interval. That's the same as saying that U is uniformly distributed over this interval.

So now we have all the information that we need; we can construct the joint density. And the joint density is, of course, the prior density times the conditional density.
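Before working out that joint density, here is the model just described, simulated (the seed and sample size are arbitrary choices):

```python
import numpy as np

# Simulating the lecture's example: Theta uniform on [4, 10], and
# X = Theta + U with U uniform on [-1, 1], independent of Theta.

rng = np.random.default_rng(0)
n = 100_000
theta = rng.uniform(4.0, 10.0, size=n)   # prior draws of Theta
u = rng.uniform(-1.0, 1.0, size=n)       # measurement noise
x = theta + u                            # noisy observation

# Sanity check: X always lies within one unit of Theta, and ranges over [3, 11].
print(np.max(np.abs(x - theta)) <= 1.0, x.min(), x.max())
```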
We know both of these densities. Both of these are constants, so the joint density is also going to be a constant: 1/6 times 1/2, which is 1/12. But it is a constant only on the range of possible x's and Thetas, not everywhere. Theta can take any value between 4 and 10, so these are the values of Theta. And for any given value of Theta, x can take values from Theta - 1 up to Theta + 1. So here, imagine a line that goes with slope one; x can then take values within plus or minus one of that value of Theta. This object here is the set of possible (x, Theta) pairs. The density is equal to 1/12 on this set, and it's zero everywhere else. So outside here the density is zero; the density is only nonzero on that set.

All right, so now we're asked to estimate Theta in terms of x. We want to build an estimator, which is going to be a function from the x's to the Thetas. That's why I chose the axes this way, x on this axis and Theta on that axis: the estimator we're building is a function of x. Based on the observation that we obtained, we want to estimate Theta. We know that the optimal estimator is the conditional expectation, given the value of x.

So what is the conditional expectation? Fix a particular value of x, let's say in this range. This is our x; then what do we know about Theta? We know that Theta lies in this range: Theta can only be somewhere between those two values. And what kind of distribution does Theta have? What is the conditional distribution of Theta given x? Well, remember how we build conditional distributions from joint distributions: the conditional distribution is just a section of the joint distribution, taken at the place where we're conditioning. The joint is constant, so the conditional is also going to be a constant density over this interval. So the posterior distribution of Theta is uniform over this interval.
So if the posterior of Theta is uniform over that interval, the expected value of Theta is going to be the midpoint of that interval. So the estimate that you report, if you observe that x, is going to be this particular point here: the midpoint. The same argument goes through even if you obtain an x somewhere over here. Given this x, Theta can take a value between these two values, Theta is going to have a uniform distribution over this interval, and the conditional expectation of Theta given x is going to be the midpoint of that interval.

So now, if we plot our estimator by tracing midpoints in this diagram, what we obtain is a curve that starts like this and then changes slope, so that it keeps tracking the midpoint, and then it goes like that again. This blue curve here is our g of x, which is the conditional expectation of Theta given that X is equal to little x. It's a curve; in our example it consists of three straight segments, but overall it's nonlinear. It's not a single line through this diagram. And that's how things are in general: g of x, our optimal estimator, has no reason to be a linear function of x. In general it's going to be some complicated curve.

So how good is our estimate? You report your estimate of Theta based on x, and your boss asks you what kind of error you expect to get. Having observed the particular value of x, what you can report to your boss is what you think the mean squared error is going to be. We observe the particular value of x, so we're conditioning; we're living in this universe. Given that we have made this observation, this is the true value of Theta, this is the estimate that we have produced, and this is the expected squared error, given that we have made this particular observation. Now, in this conditional universe, this is the expected value of Theta given x. So this is the expected value of this random variable inside the conditional universe.
So when you take the mean square of a random variable minus its expected value, that's the same thing as the variance of that random variable, except that here it's the variance inside the conditional universe. Having observed x, Theta is still a random variable; it's distributed according to the posterior distribution. Since it's a random variable, it has a variance, and that variance is our mean squared error. So this is the variance of the posterior distribution of Theta, given the observation that we have made.

OK, so what is the variance in our example? If X happens to be here, then Theta is uniform over this interval, and this interval has length 2. Theta is uniformly distributed over an interval of length 2; this is the posterior distribution of Theta. What is the variance? You remember the formula for the variance of a uniform random variable: it is the length of the interval squared, divided by 12, so this is 1/3. So the variance of Theta, the mean squared error, is going to be 1/3 whenever this kind of picture applies. This picture applies when X is between 5 and 9.

If X is less than 5, then the picture is a little different, and Theta is going to be uniform over a smaller interval, so the variance of Theta is going to be smaller as well. So let's start plotting our mean squared error. Between 5 and 9, the variance of Theta, the posterior variance, is 1/3. Now, when X falls in here, Theta is uniformly distributed over a smaller interval. The size of this interval changes linearly over that range, and so when we take the squared size of that interval we get a quadratic function of how far we have moved from that corner. So at that corner, what is the variance of Theta? Well, if I observe an X that's equal to 3, then I know with certainty that Theta is equal to 4. Then I'm in very good shape; I know exactly what Theta is going to be. So the variance, in this case, is going to be 0.
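To make this concrete, here is a small sketch (the function names are mine) of the example's optimal estimator and of its conditional mean squared error: given X = x, Theta is uniform on the interval [max(4, x - 1), min(10, x + 1)], so the estimate is the midpoint of that interval and the conditional MSE is its length squared over 12.

```python
import numpy as np

# Optimal estimator and conditional MSE for the uniform example.

def posterior_interval(x):
    return np.maximum(4.0, x - 1.0), np.minimum(10.0, x + 1.0)

def g(x):                        # E[Theta | X = x]: midpoint of the interval
    lo, hi = posterior_interval(x)
    return (lo + hi) / 2.0

def conditional_mse(x):          # var(Theta | X = x) for a uniform posterior
    lo, hi = posterior_interval(x)
    return (hi - lo) ** 2 / 12.0

for x in [3.0, 4.0, 6.0, 9.0, 11.0]:
    print(x, g(x), conditional_mse(x))
# For 5 <= x <= 9 the estimate is x itself and the MSE is 1/3;
# at x = 3 or x = 11 the MSE is 0, and it rises quadratically in between.
```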
If I observe an X that's a little larger, then Theta is now random, takes values in a little interval, and the variance of Theta is going to be proportional to the square of the length of that little interval. So we get a curve that starts rising quadratically from here and goes up toward 1/3. At the other end of the picture the same is true. If you observe an X which is 11, then Theta can only be equal to 10, and so the error in Theta is equal to 0; there's zero error variance. But as we obtain X's that are slightly less than 11, the mean squared error again rises quadratically. So we end up with a plot like this.

What this plot tells us is that certain measurements are better than others. If you see X equal to 3, then you're lucky, because you know exactly what Theta is. If you see an X which is equal to 6, then you're somewhat unlucky, because it doesn't tell you Theta with great precision; Theta could be anywhere on that interval. And so the variance of Theta, even after you have observed X, is a certain number, 1/3 in our case.

So the moral to take from that story is that the error variance, or the mean squared error, depends on what particular observation you happen to obtain. Some observations may be very informative: once you see a specific number, you know exactly what Theta is. Some observations may be less informative: you observe your X, but it could still leave a lot of uncertainty about Theta.

So conditional expectations are really the cornerstone of Bayesian estimation. They're particularly popular in engineering contexts; they're used a lot in signal processing, communications, control theory, and so on. So it's worth playing a little with their theoretical properties and getting some appreciation of a few subtleties involved here. There's no new math, really, in what we're going to do here.
But it's going to be a good opportunity to practice manipulating conditional expectations.

So let's look at the expected value of the estimation error that we obtain. Theta hat is our estimator, the conditional expectation. Theta hat minus Theta: what kind of error do we have? If Theta hat is bigger than Theta, then we have made a positive error. If it's on the other side, we have made a negative error. It turns out that, on average, the errors cancel each other out.

So let's do this calculation. Let's calculate the expected value of the error given X. By definition, the error is Theta hat minus Theta, so this is the expected value of Theta hat minus Theta, given X. We use linearity of expectations to break it up as the expected value of Theta hat given X, minus the expected value of Theta given X. And now what? Our estimate is made on the basis of the data, the X's. If I tell you X, then you know what Theta hat is. Remember that the conditional expectation is a random variable which is a function of the random variable on which you're conditioning. If you know X, then you know the conditional expectation given X; you know what Theta hat is going to be. So Theta hat is a function of X, and once I tell you X, you know what Theta hat is going to be, so this first conditional expectation is just Theta hat itself. And this second term is, by the definition of Theta hat, also Theta hat, so the difference is equal to 0.

So what we have proved is that no matter what I have observed, given that I have observed it, my error on average is going to be 0. This is a statement involving equality of random variables. Remember that conditional expectations are random variables, because they depend on the thing you're conditioning on. Zero is a trivial random variable; this tells you that this random variable is identically equal to the zero random variable.
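As a quick numerical illustration of this property (in the simulated uniform example from before; the seed, sample size, and bin width are arbitrary choices), the error averages to essentially zero, both overall and conditioned on X falling near a particular value:

```python
import numpy as np

# Monte Carlo check that E[Theta_hat - Theta | X] = 0 in the uniform example.

rng = np.random.default_rng(0)
n = 1_000_000
theta = rng.uniform(4.0, 10.0, size=n)          # prior draws of Theta
x = theta + rng.uniform(-1.0, 1.0, size=n)      # noisy observations

theta_hat = (np.maximum(4.0, x - 1.0) + np.minimum(10.0, x + 1.0)) / 2.0
error = theta_hat - theta

print("mean error overall:       ", error.mean())           # approx. 0
near_6 = np.abs(x - 6.0) < 0.05                              # condition on X close to 6
print("mean error given X near 6:", error[near_6].mean())    # also approx. 0
```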
More specifically, it tells you that no matter what value of X you observe, the conditional expectation of the error is going to be 0. And this takes us to this statement here, which is an equality between numbers. No matter what specific value of capital X you have observed, your error, on average, is going to be equal to 0. So this is a less abstract version of the earlier statement; this is an equality between two numbers. It's true for every value of x, so it's also true in the sense of this random variable being equal to that random variable. Because, remember, according to our definition this random variable is the one that takes this specific value when capital X happens to be equal to little x.

Now, this doesn't mean that your error is 0. It only means that your error is, in some sense, as likely to fall on the positive side as on the negative side. Sometimes your error will be positive, sometimes negative, and on average these things cancel out and give you 0, on average. This is a property that's sometimes given a name: we say that Theta hat is unbiased. Theta hat, our estimate, does not have a tendency to be on the high side, and it does not have a tendency to be on the low side. On average it's just right.

So let's do a little more playing here. Let's see how our error is related to an arbitrary function of the data. Let's do this in a conditional universe and look at this quantity. In a conditional universe where X is known, h of X is known, and so you can pull it outside the expectation. In the conditional universe where the value of X is given, this quantity becomes just a constant; there's nothing random about it. So you can pull it out of the expectation and write things this way. And we have just calculated that this quantity is 0, so this number turns out to be 0 as well.

Now, having done this, we can take expectations of both sides.
And now let's use the law of iterated expectations. The expectation of a conditional expectation gives us the unconditional expectation, and this is also going to be 0. So here we use the law of iterated expectations.

OK, why are we doing this? We're doing this because I would like to calculate the covariance between Theta tilde and Theta hat. That is, we ask the question: is there a systematic relation between the error and the estimate? To calculate the covariance we use the property that the covariance is the expected value of the product minus the product of the expected values. And what do we get? This term is 0, because of what we just proved. And this term is 0 because of what we proved earlier, that the expected value of the error is equal to 0. So the covariance between the error and any function of X is equal to 0.

Let's apply that to the case where the function of X we're considering is Theta hat itself. Theta hat is our estimate; it's a function of X. So this zero result still applies, and we get that this covariance is equal to 0. OK, so that's what we proved.

Let's see, what are the morals to take out of all this? First, you should be very comfortable with this type of calculation involving conditional expectations. The two main things that we're using are, first, that when you condition on a random variable, any function of that random variable becomes a constant and can be pulled out of the conditional expectation; and second, the law of iterated expectations. These are the skills involved.

Now, on the substance, why is this result interesting? It tells us that the error is uncorrelated with the estimate. What's a hypothetical situation where this would not be the case? Suppose that whenever Theta hat is big, you say: my estimate is probably too big, maybe the true Theta is on the lower side, so I expect my error to be negative.
That would be a situation that violates this condition. The condition tells you that no matter what Theta hat is, you don't expect your error to be on the positive side or on the negative side; your error will still be 0 on average. So if you obtain a very high estimate, this is no reason for you to suspect that the true Theta is lower than your estimate. If you suspected that the true Theta was lower than your estimate, you should have changed your Theta hat. If, after obtaining your estimate, you say, "I think my estimate is too big, so the error is negative," then that means your estimate was not the optimal one; it should have been corrected to be smaller. That would mean there's a better estimate than the one you used. But the estimate we are using here is the optimal one in terms of mean squared error, and there's no way of improving it. This is really what's captured in that statement: knowing Theta hat doesn't give you a lot of information about the error, and therefore gives you no reason to adjust your estimate from what it was.

Finally, a consequence of all this. This is the definition of the error. Send Theta to this side and Theta tilde to that side, and you get this relation: the true parameter is composed of two quantities, the estimate, and the error taken with a minus sign. These two quantities are uncorrelated with each other. Their covariance is 0, and therefore the variance of Theta is the sum of the variances of these two quantities.

So what's an interpretation of this equality? There is some inherent randomness in the random variable Theta that we're trying to estimate. Theta hat tries to estimate it, tries to get close to it. And if Theta hat always stays close to Theta, then since Theta is random, Theta hat must also be quite random, so it has uncertainty in it. And the more uncertain Theta hat is, the more it moves together with Theta.
So the more uncertainty it removes from Theta. And this is the remaining uncertainty in Theta, the uncertainty that's left after we've done our estimation. Ideally, to have a small error we want this quantity to be small, which is the same as saying that this quantity should be big. In the ideal case, Theta hat is the same as Theta. That's the best we could hope for: it corresponds to zero error, and all the uncertainty in Theta is absorbed by the uncertainty in Theta hat. Interestingly, this relation is just another variation of the law of total variance that we have seen at some point in the past. I will skip that derivation, but it's an interesting fact, and it can give you an alternative interpretation of the law of total variance.

OK, so now let's return to our example. In our example we obtained the optimal estimator, and we saw that it was a nonlinear curve, something like this. I'm exaggerating the corner a little bit to show that it's nonlinear. This is the optimal estimator; it's a nonlinear function of X. Nonlinear generally means complicated. Sometimes the conditional expectation is really hard to compute, because whenever you have to compute expectations you need to do some integrals, and if many random variables are involved it might correspond to a multi-dimensional integration. We don't like this. Can we come up, maybe, with a simpler way of estimating Theta? With a point estimate that still has nice properties and good motivation, but is simpler? What does simpler mean? Perhaps linear. Let's put ourselves in a straitjacket and restrict ourselves to estimators of this form: my estimate is constrained to be a linear function of the X's. So my estimator is going to be a line. It could be this, it could be that, or maybe we would want it to be something like this. I want to choose the best possible linear function.
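Before setting that up, here is a quick numerical check, in the same simulated example, of the two facts just discussed: the error is uncorrelated with the estimate, and the variance of Theta splits into the variance of Theta hat plus the variance of the error (the seed and sample size are arbitrary choices).

```python
import numpy as np

# Check: cov(Theta_hat - Theta, Theta_hat) = 0, and
#        var(Theta) = var(Theta_hat) + var(Theta_hat - Theta).

rng = np.random.default_rng(1)
n = 1_000_000
theta = rng.uniform(4.0, 10.0, size=n)
x = theta + rng.uniform(-1.0, 1.0, size=n)
theta_hat = (np.maximum(4.0, x - 1.0) + np.minimum(10.0, x + 1.0)) / 2.0
error = theta_hat - theta

print("cov(error, estimate):      ", np.cov(error, theta_hat)[0, 1])   # approx. 0
print("var(Theta):                ", theta.var())                      # approx. 3
print("var(estimate) + var(error):", theta_hat.var() + error.var())    # approx. 3
```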
So what does it mean to choose the best possible linear function? It means that I write my Theta hat in this form. If I fix a certain a and b, I have fixed the functional form of my estimator, and this is the corresponding mean squared error. That's the error between the true parameter and the estimate of that parameter, and we take the square of it. The optimal linear estimator is defined as the one for which this mean squared error is as small as possible over all choices of a and b. So we want to minimize this expression over all a's and b's.

How do we do this minimization? Well, this is a square; you can expand it and write down all the terms in the expansion. You're going to get a term expected value of Theta squared, another term a squared times the expected value of X squared, another term b squared, and then various cross terms. What you have here is really a quadratic function of a and b. So think of the quantity we're minimizing as some function h of a and b, and it happens to be quadratic. How do we minimize a quadratic function? We set the derivatives of this function with respect to a and b to 0, and then do the algebra. After you do the algebra you find that the best choice of a is this one; this is the coefficient next to X. And the optimal b comes from the constant terms: this term, together with this times that, gives the optimal choice of b.

The algebra itself is not very interesting. What is really interesting is the nature of the result that we get. If we were to plot the result for this particular example, you would get a curve something like this. It goes through the middle of the diagram and is a little slanted. In this example, X and Theta are positively correlated: bigger values of X generally correspond to bigger values of Theta.
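As a sketch of what that minimization produces in the running example, setting the derivatives of the quadratic to zero gives the standard form a = cov(Theta, X) / var(X) and b = E[Theta] - a E[X]; below, the coefficients are estimated from simulated samples (seed and sample size arbitrary), and the exact values for this model are noted in the comments.

```python
import numpy as np

# Linear least mean squares (LLMS) estimator for the running example:
#   Theta_hat = E[Theta] + (cov(Theta, X) / var(X)) * (X - E[X]).

rng = np.random.default_rng(2)
n = 1_000_000
theta = rng.uniform(4.0, 10.0, size=n)
x = theta + rng.uniform(-1.0, 1.0, size=n)

a = np.cov(theta, x)[0, 1] / x.var()
b = theta.mean() - a * x.mean()
print(f"Theta_hat = {a:.3f} * X + {b:.3f}")
# Exact values here: cov(Theta, X) = var(Theta) = 3 and var(X) = 3 + 1/3,
# so a = 9/10 and b = 7 - (9/10) * 7 = 0.7.
```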
So in this example the covariance between X and Theta is positive, and our estimate can be interpreted in the following way. The expected value of Theta is the estimate you would come up with if you didn't have any information about Theta; if you don't make any observations, this is the best way of estimating Theta. But I have made an observation, X, and I need to take it into account. I look at this difference, which is the piece of news contained in X: this term is what X should be on average, so the difference tells me how the observation compares to what I expected. If I observe an X which is bigger than I expected, then since X and Theta are positively correlated, this tells me that Theta should probably also be bigger than its average value. Whenever I see an X that's larger than its average value, that's an indication that Theta is probably also larger than its average value. And so I take that difference and multiply it by a positive coefficient, and that's what gives me a curve with a positive slope.

So this increment, the new information contained in X compared to the average value we expected a priori, allows us to make a correction to our prior estimate of Theta, and the amount of that correction is guided by the covariance of X with Theta. If the covariance of X with Theta were 0, that would mean there's no systematic relation between the two, and in that case obtaining some information from X doesn't give us any guide as to how to change the estimate of Theta. If that covariance were 0, we would just stay with this particular estimate; we would not be able to make a correction. But when there's a nonzero covariance between X and Theta, that covariance works as a guide for us to obtain a better estimate of Theta.

How about the resulting mean squared error? It turns out that in this context there's a very nice formula for the mean squared error obtained by the best linear estimator. What's the story here?
The mean squared error we get has something to do with the variance of the original random variable: the more uncertain our original random variable is, the more error we're going to make. On the other hand, when the two variables are correlated, we exploit that correlation to improve our estimate. This rho here is the correlation coefficient between the two random variables. When this correlation coefficient is larger, this factor becomes smaller, and our mean squared error becomes smaller.

So think of the two extreme cases. One extreme case is when rho is equal to 1, so X and Theta are perfectly correlated. When they're perfectly correlated, once I know X I also know Theta, because the two random variables are linearly related. In that case my estimate is right on target, and the mean squared error is going to be 0. The other extreme case is when rho is equal to 0: the two random variables are uncorrelated. In that case the measurement does not help me estimate Theta, and the uncertainty that's left, the mean squared error, is just the original variance of Theta. The uncertainty in Theta does not get reduced.

So the moral is that the estimation error is a reduced version of the original amount of uncertainty in the random variable Theta, and the larger the correlation between those two random variables, the better we can remove uncertainty from the original random variable. I didn't derive this formula, but it's just a matter of algebraic manipulation: we have a formula for Theta hat; subtract Theta from that formula, take the square, take expectations, and do a few lines of algebra that you can read in the text, and you end up with this really neat and clean formula.

Now, I mentioned at the beginning of the lecture that we can do inference with Thetas and X's that are not just single numbers but vector random variables. So, for example, we might have multiple data points that give us information about Theta.
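Before getting to the vector case, here is a quick check of the formula being described for the linear estimator's error, mean squared error = (1 - rho squared) times the variance of Theta, on the same simulated example (for this model rho squared is 9/10 and the variance of Theta is 3, so the error should come out near 0.3; the seed and sample size are arbitrary).

```python
import numpy as np

# Check that the linear estimator's MSE matches (1 - rho**2) * var(Theta).

rng = np.random.default_rng(3)
n = 1_000_000
theta = rng.uniform(4.0, 10.0, size=n)
x = theta + rng.uniform(-1.0, 1.0, size=n)

a = np.cov(theta, x)[0, 1] / x.var()
b = theta.mean() - a * x.mean()
rho = np.corrcoef(theta, x)[0, 1]

print("empirical MSE:            ", np.mean((a * x + b - theta) ** 2))
print("(1 - rho**2) * var(Theta):", (1 - rho ** 2) * theta.var())
```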
There are no vectors here, so this discussion was for the case where Theta and X were just scalar, one-dimensional quantities. What do we do if we have multiple data? Suppose that Theta is still a scalar, it's one-dimensional, but we make several observations, and on the basis of these observations we want to estimate Theta. The optimal least mean squares estimator would again be the conditional expectation of Theta given X; that's the optimal one. In this case X is a vector, so the general estimator we would use would be this one. But if we want to keep things simple and want our estimator to have a simple functional form, we might restrict ourselves to estimators that are linear functions of the data.

And then the story is exactly the same as we discussed before. I constrain myself to estimating Theta using a linear function of the data, so my signal processing box just applies a linear function, and I'm looking for the best coefficients, the coefficients that result in the least possible squared error. This is my squared error: it's (my estimate minus the thing I'm trying to estimate) squared, and then we take the average. How do we do this? Same story as before. The X's and the Thetas get averaged out because we have an expectation; whatever is left is just a function of the coefficients, the a's and b. As before, it turns out to be a quadratic function. Then we set the derivatives of this function with respect to the coefficients to 0, and this gives us a system of linear equations satisfied by those coefficients. It's a linear system because we're differentiating a quadratic function of those coefficients. To get closed-form formulas in this case one would need to introduce vectors, matrices, matrix inverses, and so on.
The particular formulas are not so much what interests us here. Rather, the interesting thing is that this is done simply, using straightforward solvers of linear equations. The only thing you need to do is write down the correct coefficients of those linear equations. And what would a typical coefficient look like? Let's take a typical term of this quadratic when you expand it. You're going to get terms such as a1 X1 times a2 X2; when you take expectations, you're left with a1 a2 times the expected value of X1 X2. So this would involve terms such as a1 squared times the expected value of X1 squared, terms such as a1 a2 times the expected value of X1 X2, and a lot of other terms of the same kind. So you get something that's quadratic in your coefficients, and the constants that show up in your system of equations have to do with the expected values of squares of your random variables, or of products of your random variables. To write down numerical values for these, the only thing you need to know are the means and variances of your random variables. If you know the means and variances, then you know what this thing is; and if you know the covariances as well, then you know what this thing is.

So in order to find the optimal linear estimator in the case of multiple data, you do not need to know the entire probability distribution of the random variables involved. You only need to know the means and covariances. These are the only quantities that affect the construction of your optimal estimator. We could already see this in this formula: the form of my optimal estimator is completely determined once I know the means, variances, and covariance of the random variables in my model. I do not need to know the detailed distribution of the random variables involved.
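As a sketch of that point, here is the multiple-observation case set up with nothing but assumed first and second moments: the optimal linear coefficients solve the linear system Cov(X) a = cov(X, Theta), with b = E[Theta] - a . E[X]. The specific numbers (a prior variance of 4 and three noise variances) are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Best linear estimator from multiple observations, using only moments:
# solve Cov(X) a = cov(X, Theta), then b = E[Theta] - a . E[X].
# The moments below correspond to X_i = Theta + W_i with independent noise.

mu_theta, var_theta = 5.0, 4.0                 # assumed prior mean and variance
noise_var = np.array([1.0, 2.0, 0.5])          # assumed variances of W_1, W_2, W_3

mu_x = np.full(3, mu_theta)                    # E[X_i] = E[Theta]
cov_x = np.full((3, 3), var_theta) + np.diag(noise_var)   # Cov(X_i, X_j)
cov_x_theta = np.full(3, var_theta)            # cov(X_i, Theta) = var(Theta)

a = np.linalg.solve(cov_x, cov_x_theta)        # coefficients multiplying X_1..X_n
b = mu_theta - a @ mu_x                        # constant term
print("a =", a, " b =", b)
```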
768 00:44:51,690 --> 00:44:55,110 So as I said, in general you find the form of the optimal 769 00:44:55,110 --> 00:44:59,550 estimator by using a linear equation solver. 770 00:44:59,550 --> 00:45:01,890 There are special examples in which you can 771 00:45:01,890 --> 00:45:05,210 get closed-form solutions. 772 00:45:05,210 --> 00:45:10,090 The nicest, simplest estimation problem one can think of is 773 00:45:10,090 --> 00:45:11,120 the following-- 774 00:45:11,120 --> 00:45:14,870 you have some uncertain parameter, and you make 775 00:45:14,870 --> 00:45:17,790 multiple measurements of that parameter in 776 00:45:17,790 --> 00:45:19,950 the presence of noise. 777 00:45:19,950 --> 00:45:22,520 So the Wi's are noises. 778 00:45:22,520 --> 00:45:25,130 The index i corresponds to your i-th experiment. 779 00:45:25,130 --> 00:45:27,810 So this is the most common situation that you encounter 780 00:45:27,810 --> 00:45:28,490 in the lab. 781 00:45:28,490 --> 00:45:31,240 If you are dealing with some process and you're trying to 782 00:45:31,240 --> 00:45:34,110 measure something, you measure it over and over. 783 00:45:34,110 --> 00:45:37,030 Each time your measurement has some random error. 784 00:45:37,030 --> 00:45:40,360 And then you need to take all your measurements together and 785 00:45:40,360 --> 00:45:43,550 come up with a single estimate. 786 00:45:43,550 --> 00:45:48,320 So the noises are assumed to be independent of each other, 787 00:45:48,320 --> 00:45:50,010 and also to be independent from the 788 00:45:50,010 --> 00:45:52,090 value of the true parameter. 789 00:45:52,090 --> 00:45:55,010 Without loss of generality we can assume that the noises 790 00:45:55,010 --> 00:45:58,890 have 0 mean, and they have some variances that we 791 00:45:58,890 --> 00:46:00,340 assume to be known. 792 00:46:00,340 --> 00:46:03,180 Theta itself has a prior distribution with a certain 793 00:46:03,180 --> 00:46:05,670 mean and a certain variance. 794 00:46:05,670 --> 00:46:07,610 So the form of the optimal linear 795 00:46:07,610 --> 00:46:10,940 estimator is really nice. 796 00:46:10,940 --> 00:46:14,930 Well, maybe you cannot see it right away because this looks 797 00:46:14,930 --> 00:46:18,580 messy, but what is it really? 798 00:46:18,580 --> 00:46:24,590 It's a linear combination of the X's and the prior mean. 799 00:46:24,590 --> 00:46:28,560 And it's actually a weighted average of the X's and the 800 00:46:28,560 --> 00:46:30,250 prior mean. 801 00:46:30,250 --> 00:46:33,570 Here we collect all of the coefficients that 802 00:46:33,570 --> 00:46:35,920 we have at the top. 803 00:46:35,920 --> 00:46:42,060 So the whole thing is basically a weighted average. 804 00:46:46,460 --> 00:46:51,110 1/(sigma_i-squared) is the weight that we give to Xi, and 805 00:46:51,110 --> 00:46:54,710 in the denominator we have the sum of all of the weights. 806 00:46:54,710 --> 00:46:59,260 So in the end we're dealing with a weighted average. 807 00:46:59,260 --> 00:47:03,760 If mu were equal to 1, and all the Xi's were equal to 1, then 808 00:47:03,760 --> 00:47:06,790 our estimate would also be equal to 1. 809 00:47:06,790 --> 00:47:10,670 Now the form of the weights that we have is interesting. 810 00:47:10,670 --> 00:47:16,050 Any given data point is weighted in inverse 811 00:47:16,050 --> 00:47:17,820 proportion to its variance. 812 00:47:17,820 --> 00:47:20,270 What does that say?
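[Editor's note: the slide's formula is not reproduced in the transcript. Reconstructing it from the description, and writing sigma_0 squared for the prior variance of Theta and sigma_i squared for the variance of the noise W_i, the estimator being discussed is the standard one for this measurement model:]

```latex
X_i = \Theta + W_i, \quad i = 1, \ldots, n,
\qquad
\hat{\Theta} =
\frac{ \mu / \sigma_0^2 + \sum_{i=1}^{n} X_i / \sigma_i^2 }
     { 1 / \sigma_0^2 + \sum_{i=1}^{n} 1 / \sigma_i^2 }.
```

[Each Xi enters with weight 1/sigma_i squared, the prior mean mu enters with weight 1/sigma_0 squared, and the denominator is the sum of all the weights.]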
813 00:47:20,270 --> 00:47:26,920 If my i-th data point has a lot of variance, if Wi is very 814 00:47:26,920 --> 00:47:32,900 noisy, then Xi is not very useful, it is not very reliable. 815 00:47:32,900 --> 00:47:36,840 So I'm giving it a small weight. 816 00:47:36,840 --> 00:47:41,870 Large variance, a lot of error in my Xi, means that I should 817 00:47:41,870 --> 00:47:44,200 give it a smaller weight. 818 00:47:44,200 --> 00:47:47,920 If two data points have the same variance, they're of 819 00:47:47,920 --> 00:47:50,140 comparable quality, then I'm going to 820 00:47:50,140 --> 00:47:51,950 give them equal weight. 821 00:47:51,950 --> 00:47:56,200 The other interesting thing is that the prior mean is treated 822 00:47:56,200 --> 00:47:58,300 the same way as the X's. 823 00:47:58,300 --> 00:48:03,050 So it's treated as an additional observation. 824 00:48:03,050 --> 00:48:07,100 So we're taking a weighted average of the prior mean and 825 00:48:07,100 --> 00:48:09,850 of the measurements that we are making. 826 00:48:09,850 --> 00:48:13,380 The formula looks as if the prior mean were just another 827 00:48:13,380 --> 00:48:14,210 data point. 828 00:48:14,210 --> 00:48:17,440 So that's the way of thinking about Bayesian estimation. 829 00:48:17,440 --> 00:48:20,270 You have your real data points, the X's that you 830 00:48:20,270 --> 00:48:23,430 observe, and you also have some prior information. 831 00:48:23,430 --> 00:48:27,470 This plays a role similar to a data point. 832 00:48:27,470 --> 00:48:31,580 It is interesting to note that if all the random variables in 833 00:48:31,580 --> 00:48:35,230 this model are normal, this optimal linear estimator happens to 834 00:48:35,230 --> 00:48:36,950 also be the conditional expectation. 835 00:48:36,950 --> 00:48:40,000 That's the nice thing about normal random variables: their 836 00:48:40,000 --> 00:48:42,770 conditional expectations turn out to be linear. 837 00:48:42,770 --> 00:48:46,920 So the optimal estimate and the optimal linear estimate 838 00:48:46,920 --> 00:48:48,560 turn out to be the same. 839 00:48:48,560 --> 00:48:51,050 And that gives us another interpretation of linear 840 00:48:51,050 --> 00:48:52,100 estimation. 841 00:48:52,100 --> 00:48:54,660 Linear estimation is essentially the same as 842 00:48:54,660 --> 00:48:58,970 pretending that all random variables are normal. 843 00:48:58,970 --> 00:49:02,040 So that's a side point. 844 00:49:02,040 --> 00:49:04,230 Now I'd like to close with a comment. 845 00:49:08,370 --> 00:49:11,760 You do your measurements and you estimate Theta on the 846 00:49:11,760 --> 00:49:17,040 basis of X. Suppose that instead you have a measuring 847 00:49:17,040 --> 00:49:20,970 device that measures X-cubed instead of measuring X, and 848 00:49:20,970 --> 00:49:23,350 you want to estimate Theta. 849 00:49:23,350 --> 00:49:26,760 Are you going to get a different estimate? 850 00:49:26,760 --> 00:49:31,790 Well, X and X-cubed contain the same information. 851 00:49:31,790 --> 00:49:34,730 Telling you X is the same as telling you 852 00:49:34,730 --> 00:49:36,640 the value of X-cubed. 853 00:49:36,640 --> 00:49:40,660 So the posterior distribution of Theta given X is the same 854 00:49:40,660 --> 00:49:44,160 as the posterior distribution of Theta given X-cubed. 855 00:49:44,160 --> 00:49:47,450 And so the means of these posterior distributions are 856 00:49:47,450 --> 00:49:49,390 going to be the same.
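[Editor's note: before the discussion turns to transformations of the data, here is a small numerical sketch of the inverse-variance weighting just described; all the numbers are invented for illustration.]

```python
import numpy as np

# Weighted-average estimate of Theta from noisy measurements Xi = Theta + Wi.
# The prior mean is treated as one more observation, with weight 1/sigma0_sq.
mu, sigma0_sq = 5.0, 4.0                  # prior mean and prior variance of Theta
x = np.array([5.8, 4.9, 7.5])             # observed measurements X1, X2, X3
noise_var = np.array([1.0, 1.0, 25.0])    # noise variances; X3 is very noisy

values = np.concatenate(([mu], x))
weights = np.concatenate(([1.0 / sigma0_sq], 1.0 / noise_var))

theta_hat = np.sum(weights * values) / np.sum(weights)

print("normalized weights:", weights / weights.sum())  # the last one is small
print("estimate:", theta_hat)                          # pulled mostly toward mu, X1, X2
```

[The third measurement, with variance 25, gets a normalized weight of under 2 percent, which is the "unreliable data gets a small weight" point made above.]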
857 00:49:49,390 --> 00:49:52,850 So doing transformations to your data does not 858 00:49:52,850 --> 00:49:57,370 matter if you're doing optimal least squares estimation. 859 00:49:57,370 --> 00:50:00,100 On the other hand, if you restrict yourself to doing 860 00:50:00,100 --> 00:50:05,540 linear estimation, then using a linear function of X is not 861 00:50:05,540 --> 00:50:09,720 the same as using a linear function of X-cubed. 862 00:50:09,720 --> 00:50:14,720 So this is a linear estimator, but where the data are the 863 00:50:14,720 --> 00:50:19,250 X-cubes, and we have a linear function of the data. 864 00:50:19,250 --> 00:50:23,690 So this means that when you're using linear estimation you 865 00:50:23,690 --> 00:50:28,040 have a choice to make: linear in what? 866 00:50:28,040 --> 00:50:32,290 Sometimes you want to plot your data on an ordinary 867 00:50:32,290 --> 00:50:35,090 scale and try to fit a line through them. 868 00:50:35,090 --> 00:50:38,360 Sometimes you plot your data on a logarithmic scale, and 869 00:50:38,360 --> 00:50:40,480 try to fit a line through them. 870 00:50:40,480 --> 00:50:42,390 Which scale is the appropriate one? 871 00:50:42,390 --> 00:50:44,510 Here it would be a cubic scale. 872 00:50:44,510 --> 00:50:46,830 And you have to think about your particular model to 873 00:50:46,830 --> 00:50:51,180 decide which version would be the more appropriate one. 874 00:50:51,180 --> 00:50:55,830 Finally, when we have multiple data, sometimes these multiple 875 00:50:55,830 --> 00:50:59,910 data might contain the same information. 876 00:50:59,910 --> 00:51:02,800 So X is one data point, X-squared is another data 877 00:51:02,800 --> 00:51:05,610 point, X-cubed is another data point. 878 00:51:05,610 --> 00:51:08,540 The three of them contain the same information, but you can 879 00:51:08,540 --> 00:51:11,480 try to form a linear function of them. 880 00:51:11,480 --> 00:51:14,380 And then you obtain a linear estimator that has a more 881 00:51:14,380 --> 00:51:16,930 general form as a function of X. 882 00:51:16,930 --> 00:51:22,130 So if you want to estimate your Theta as a cubic function 883 00:51:22,130 --> 00:51:26,330 of X, for example, you can set up a linear estimation model 884 00:51:26,330 --> 00:51:29,480 of this particular form and find the optimal coefficients, 885 00:51:29,480 --> 00:51:32,900 the a's and the b's (a sketch of this appears after the closing remarks below). 886 00:51:32,900 --> 00:51:35,700 All right, so the last slide just gives you the big picture 887 00:51:35,700 --> 00:51:39,330 of what's happening in Bayesian Inference, and it's for 888 00:51:39,330 --> 00:51:40,330 you to ponder. 889 00:51:40,330 --> 00:51:41,930 Basically we talked about three 890 00:51:41,930 --> 00:51:43,470 possible estimation methods: 891 00:51:43,470 --> 00:51:48,300 maximum a posteriori estimation, mean squared error estimation, and 892 00:51:48,300 --> 00:51:51,070 linear mean squared error estimation, or linear least squares 893 00:51:51,070 --> 00:51:52,290 estimation. 894 00:51:52,290 --> 00:51:54,410 And there's a number of standard examples that you 895 00:51:54,410 --> 00:51:57,130 will be seeing over and over in the recitations, tutorials, 896 00:51:57,130 --> 00:52:00,950 homework, and so on, and perhaps even on exams.
897 00:52:00,950 --> 00:52:05,630 In these, we take some nice priors on some unknown parameter, we 898 00:52:05,630 --> 00:52:09,410 take some nice models for the noise or the observations, and 899 00:52:09,410 --> 00:52:11,880 then you need to work out the posterior distributions and the 900 00:52:11,880 --> 00:52:13,570 various estimates and compare them.
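[Editor's note: as mentioned above, here is a minimal sketch of the cubic example, an estimator that is linear in the "data points" X, X-squared, and X-cubed, so that the estimate is a cubic function of X. It is not from the lecture; the model for Theta and X is invented, and the optimal coefficients are approximated by ordinary least squares on simulated samples rather than by writing out the means and covariances explicitly.]

```python
import numpy as np

# Estimator of the form Theta_hat = b + a1*X + a2*X**2 + a3*X**3.
# Simulated model (for illustration only): Theta uniform on [-1, 1],
# X = Theta**3 plus a little noise.
rng = np.random.default_rng(0)
theta = rng.uniform(-1.0, 1.0, size=10_000)
x = theta**3 + 0.1 * rng.normal(size=theta.size)

# Treat 1, X, X^2, X^3 as the data vector; least squares on the samples
# approximates the linear least mean squares coefficients for these features.
features = np.column_stack([np.ones_like(x), x, x**2, x**3])
coef, *_ = np.linalg.lstsq(features, theta, rcond=None)

def theta_hat(x_new):
    """Cubic-in-X estimate b + a1*x + a2*x^2 + a3*x^3."""
    return coef @ np.array([1.0, x_new, x_new**2, x_new**3])

print("coefficients (b, a1, a2, a3):", coef)
print("estimate at X = 0.5:", theta_hat(0.5))
```

[Because X, X-squared, and X-cubed carry the same information, this is still a linear estimation problem in the sense discussed above, just with a richer set of "data points", and it yields a more flexible function of X than a straight line.]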