The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So we're going to finish today our discussion of Bayesian inference, which we started last time. As you probably saw, there aren't a huge number of concepts that we're introducing at this point in terms of specific skills for calculating probabilities. Rather, it's more a matter of interpretation and of setting up the framework.

So the framework in Bayesian estimation is that there is some parameter which is not known, but we have a prior distribution on it. These are beliefs about what this variable might be, and then we obtain some measurements. And the measurements are affected by the value of that parameter that we don't know. This effect, the fact that X is affected by Theta, is captured by introducing a conditional probability distribution: the distribution of X depends on Theta. It's a conditional probability distribution.

So we have formulas for these two densities, the prior density and the conditional density. And given that we have these, if we multiply them we can also get the joint density of X and Theta. So at this point we have everything there is to know about the model. And now we observe the random variable X. Given this random variable, what can we say about Theta? Well, what we can do is we can always calculate the conditional distribution of Theta given X. And now that we have the specific value of X, we can plot this as a function of Theta.

OK. And this is the complete answer to a Bayesian inference problem. This posterior distribution captures everything there is to say about Theta; that's what we know about Theta. Given the X that we have observed, Theta is still random, it's still unknown.
And it might be here, there, or there, with various probabilities. On the other hand, if you want to report a single value for Theta, then you do some extra work. You continue from here, and you do some data processing on X. Doing data processing means that you apply a certain function to the data, and this function is something that you design. It's the so-called estimator. And once that function is applied, it outputs an estimate of Theta, which we call Theta hat.

So this is sort of the big picture of what's happening. Now one thing to keep in mind is that even though I'm writing single letters here, in general Theta or X could be vector random variables. So think of this: it could be a collection Theta1, Theta2, Theta3. And maybe we obtained several measurements, so this X is really a vector X1, X2, up to Xn.

All right, so now how do we choose a Theta to report? There are various ways of doing it. One is to look at the posterior distribution and report the value of Theta at which the density or the PMF is highest. This is called the maximum a posteriori estimate. So we pick a value of Theta for which the posterior is maximum, and we report it.

An alternative way is to try to be optimal with respect to a mean squared error. So what is this? If we have a specific estimator, g, this is the estimate it's going to produce. This is the true value of Theta, so this is our estimation error. We look at the square of the estimation error, and look at its average value. We would like this squared estimation error to be as small as possible. How can we design our estimator g to make that error as small as possible? It turns out that the answer is to produce, as an estimate, the conditional expectation of Theta given X. So the conditional expectation is the best estimate that you could produce if your objective is to keep the mean squared error as small as possible.
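[In symbols, the two point estimates just described are (this is only a restatement of the statements on the slides, in standard notation):

\[
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} f_{\Theta \mid X}(\theta \mid x),
\qquad
\hat{\theta}_{\mathrm{LMS}} = \mathbb{E}[\Theta \mid X = x],
\]

and the least mean squares estimate is optimal in the sense that

\[
\mathbb{E}\big[(\Theta - \mathbb{E}[\Theta \mid X])^2\big] \le \mathbb{E}\big[(\Theta - g(X))^2\big]
\quad \text{for every estimator } g.]
\]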
So this statement here is a statement about what happens on the average, over all Thetas and all X's that may occur in our experiment. The conditional expectation as an estimator has an even stronger property. Not only is it optimal on the average, but it's also optimal given that you have made a specific observation, no matter what you observe. Let's say you observe a specific value of the random variable X. After that point, if you're asked to produce a best estimate Theta hat that minimizes this mean squared error, your best estimate would be the conditional expectation given the specific value that you have observed.

These two statements say almost the same thing, but this one is a bit stronger. This one tells you that no matter what specific X happens, the conditional expectation is the best estimate. This one tells you that on the average, over all X's that may happen, the conditional expectation is the best estimator. Now, this one is really a consequence of the other. If the conditional expectation is best for any specific X, then it's the best one even when X is left random and you are averaging your error over all possible X's.

OK, so now that we know what is the optimal way of producing an estimate, let's do a simple example to see how things work out. So we start with an unknown random variable, Theta, which is uniformly distributed between 4 and 10. And then we have an observation model that tells us that, given the value of Theta, X is going to be a random variable that ranges between Theta - 1 and Theta + 1. So think of X as a noisy measurement of Theta: Theta plus some noise, which is between -1 and +1. So really the model that we are using here is that X is equal to Theta plus U, where U is uniform on -1 to +1. So we have the true value of Theta, but X could be as low as Theta - 1, or it could be all the way up to Theta + 1. And X is uniformly distributed on that interval. That's the same as saying that U is uniformly distributed over this interval.
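[Written out as densities, the model just described is:

\[
f_\Theta(\theta) = \frac{1}{6}, \quad 4 \le \theta \le 10,
\qquad
f_{X \mid \Theta}(x \mid \theta) = \frac{1}{2}, \quad \theta - 1 \le x \le \theta + 1,
\]

which is the same as saying $X = \Theta + U$, with $U$ uniform on $[-1, 1]$ and independent of $\Theta$.]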
So now we have all the information that we need; we can construct the joint density. And the joint density is, of course, the prior density times the conditional density. We know both of these. Both of these are constants, so the joint density is also going to be a constant: 1/6 times 1/2, which is 1/12. But it is a constant not everywhere, only on the range of possible x's and Thetas. So Theta can take any value between 4 and 10; these are the values of Theta. And for any given value of Theta, X can take values from Theta minus 1 up to Theta plus 1. So here, if you can imagine, there is a line that goes with slope 1, and X can take values within plus or minus 1 of that line. So this object here is the set of possible x and Theta pairs. So the density is equal to 1/12 on this set, and it's zero everywhere else. So outside here the density is zero; the density is nonzero only on that set.

All right, so now we're asked to estimate Theta in terms of X. So we want to build an estimator, which is going to be a function from the x's to the Thetas. That's why I chose the axes this way, x on this axis and Theta on that axis, because the estimator we're building is a function of x. Based on the observation that we obtained, we want to estimate Theta. So we know that the optimal estimator is the conditional expectation, given the value of x.

So what is the conditional expectation? If you fix a particular value of x, let's say in this range, so this is our x, then what do we know about Theta? We know that Theta lies in this range. Theta can only be somewhere between those two values. And what kind of distribution does Theta have? What is the conditional distribution of Theta given x? Well, remember how we built conditional distributions from joint distributions? The conditional distribution is just a section of the joint distribution, taken at the place where we're conditioning. So the joint is constant.
So the conditional is also going to be a constant density over this interval. So the posterior distribution of Theta is uniform over this interval. And if the posterior of Theta is uniform over that interval, the expected value of Theta is going to be the midpoint of that interval. So the estimate which you report, if you observe that x, is going to be this particular point here; it's the midpoint. The same argument goes through even if you obtain an x somewhere here. Given this x, Theta can take a value between these two values. Theta is going to have a uniform distribution over this interval, and the conditional expectation of Theta given x is going to be the midpoint of that interval.

So now if we plot our estimator by tracing midpoints in this diagram, what you're going to obtain is a curve that starts like this, then changes slope, so that it keeps track of the midpoint, and then goes like that again. So this blue curve here is our g of x, which is the conditional expectation of Theta given that X is equal to little x. So it's a curve; in our example it consists of three straight segments. But overall it's nonlinear. It's not a single line through this diagram. And that's how things are in general. g of x, our optimal estimate, has no reason to be a linear function of x. In general it's going to be some complicated curve.

So how good is our estimate? I mean, you reported your estimate of Theta based on x, and your boss asks you what kind of error you expect to get. Having observed the particular value of x, what you can report to your boss is what you think the mean squared error is going to be. We observe the particular value of x, so we're conditioning, and we're living in this universe.
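[The quantity under discussion is the conditional mean squared error, given the particular observation:

\[
\mathbb{E}\big[(\hat{\Theta} - \Theta)^2 \mid X = x\big],
\qquad \text{where } \hat{\Theta} = \mathbb{E}[\Theta \mid X = x].]
\]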
Given that we have made this observation, this is the true value of Theta, this is the estimate that we have produced, and this is the expected squared error, given that we have made the particular observation. Now, in this conditional universe, this is the expected value of Theta given X. So this is the expected value of this random variable inside the conditional universe. And when you take the expected square of a random variable minus its expected value, that is the same thing as the variance of that random variable, except that here it's the variance inside the conditional universe. Having observed x, Theta is still a random variable. It's distributed according to the posterior distribution. Since it's a random variable, it has a variance. And that variance is our mean squared error. So this is the variance of the posterior distribution of Theta, given the observation that we have made.

OK, so what is the variance in our example? If X happens to be here, then Theta is uniform over this interval, and this interval has length 2. Theta is uniformly distributed over an interval of length 2. This is the posterior distribution of Theta. What is the variance? You remember the formula for the variance of a uniform random variable: it is the length of the interval squared, divided by 12, so this is 1/3. So the variance of Theta, the mean squared error, is going to be 1/3 whenever this kind of picture applies. This picture applies when X is between 5 and 9. If X is less than 5, then the picture is a little different, and Theta is going to be uniform over a smaller interval. And so the variance of Theta is going to be smaller as well.

So let's start plotting our mean squared error. Between 5 and 9 the variance of Theta, the posterior variance, is 1/3. Now when X falls in here, Theta is uniformly distributed over a smaller interval. The size of this interval changes linearly over that range.
And so when we take the squared size of that interval, we get a quadratic function of how much we have moved from that corner. So at that corner, what is the variance of Theta? Well, if I observe an X that's equal to 3, then I know with certainty that Theta is equal to 4. Then I'm in very good shape; I know exactly what Theta is going to be. So the variance, in this case, is going to be 0. If I observe an X that's a little larger, then Theta is now random, it takes values in a little interval, and the variance of Theta is going to be proportional to the square of the length of that little interval. So we get a curve that starts rising quadratically from here. It goes up toward 1/3.

At the other end of the picture the same is true. If you observe an X which is 11, then Theta can only be equal to 10. And so the error in Theta is equal to 0; there's 0 error variance. But as we obtain X's that are slightly less than 11, the mean squared error again rises quadratically. So we end up with a plot like this.

What this plot tells us is that certain measurements are better than others. If you see X equal to 3, then you're lucky, because you know exactly what Theta is. If you see an X which is equal to 6, then you're sort of unlucky, because it doesn't tell you Theta with great precision. Theta could be anywhere on that interval. And so the variance of Theta, even after you have observed X, is a certain number, 1/3 in our case.

So the moral to take out of that story is that the error variance, or the mean squared error, depends on what particular observation you happen to obtain. Some observations may be very informative; once you see a specific number, you know exactly what Theta is. Some observations might be less informative. You observe your X, but it could still leave a lot of uncertainty about Theta.
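[Here is a minimal numerical sketch of this example (not from the lecture, just an illustration of the formulas above; function names are made up). For an observed x between 3 and 11, the posterior of Theta is uniform on the interval [max(4, x - 1), min(10, x + 1)], so the LMS estimate is the midpoint of that interval and the conditional mean squared error is the interval length squared over 12.

```python
# Sketch of the uniform example: Theta ~ Uniform(4, 10), X = Theta + U, U ~ Uniform(-1, 1).

def posterior_interval(x):
    """Endpoints of the (uniform) posterior of Theta given X = x."""
    return max(4.0, x - 1.0), min(10.0, x + 1.0)

def lms_estimate(x):
    """E[Theta | X = x]: the midpoint of the posterior interval."""
    lo, hi = posterior_interval(x)
    return (lo + hi) / 2.0

def conditional_mse(x):
    """Var(Theta | X = x): (interval length)^2 / 12 for a uniform posterior."""
    lo, hi = posterior_interval(x)
    return (hi - lo) ** 2 / 12.0

if __name__ == "__main__":
    for x in [3.0, 4.0, 6.0, 9.5, 11.0]:
        print(f"x = {x:4.1f}  estimate = {lms_estimate(x):5.2f}  MSE = {conditional_mse(x):.3f}")
    # For x between 5 and 9 the MSE is 1/3; it drops quadratically to 0 at x = 3 and x = 11.
```

Running this reproduces the piecewise-linear estimator with three straight segments and the mean squared error curve that is flat at 1/3 in the middle and falls to 0 at the two ends.]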
So conditional expectations are really the cornerstone of Bayesian estimation. They're particularly popular in engineering contexts. They're used a lot in signal processing, communications, control theory, and so on. So that makes it worth playing a little bit with their theoretical properties, and getting some appreciation of a few subtleties involved here. There is no new math, really, in what we're going to do. But it's going to be a good opportunity to practice manipulating conditional expectations.

So let's look at the expected value of the estimation error that we obtain. So Theta hat is our estimator; it is the conditional expectation. Theta hat minus Theta is the error we have made. If Theta hat is bigger than Theta, then we have made a positive error. If not, if it's on the other side, we have made a negative error. It turns out that, on the average, the errors cancel each other out.

So let's do this calculation. Let's calculate the expected value of the error given X. Now, by definition, the error is the expected value of Theta hat minus Theta, given X. We use linearity of expectations to break it up as the expected value of Theta hat given X, minus the expected value of Theta given X. And now what? Our estimate is made on the basis of the data, of the X's. If I tell you X, then you know what Theta hat is. Remember that the conditional expectation is a random variable which is a function of the random variable on which you're conditioning. If you know X, then you know the conditional expectation given X; you know what Theta hat is going to be. So Theta hat is a function of X. And if it's a function of X, then once I tell you X, you know what Theta hat is going to be. So this conditional expectation is going to be Theta hat itself. And here, this is, just by definition, Theta hat. And so we get that this equals 0.
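[Writing out the calculation that was just described, with the estimation error denoted by $\tilde{\Theta} = \hat{\Theta} - \Theta$ and $\hat{\Theta} = \mathbb{E}[\Theta \mid X]$:

\[
\mathbb{E}[\tilde{\Theta} \mid X]
= \mathbb{E}[\hat{\Theta} - \Theta \mid X]
= \mathbb{E}[\hat{\Theta} \mid X] - \mathbb{E}[\Theta \mid X]
= \hat{\Theta} - \hat{\Theta} = 0.]
\]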
So what we have proved is that no matter what I have observed, given that I have observed something, on the average my error is going to be 0. This is a statement involving equality of random variables. Remember that conditional expectations are random variables, because they depend on the thing you're conditioning on. 0 is sort of a trivial random variable. This tells you that this random variable is identically equal to the zero random variable. More specifically, it tells you that no matter what value of X you observe, the conditional expectation of the error is going to be 0. And this takes us to this statement here, which is an equality between numbers. No matter what specific value of capital X you have observed, your error, on the average, is going to be equal to 0. So this is a less abstract version of these statements. This is an equality between two numbers. It's true for every value of little x, so it's also true in terms of this random variable being equal to that random variable. Because, remember, according to our definition, this random variable is the random variable that takes this specific value when capital X happens to be equal to little x.

Now, this doesn't mean that your error is 0. It only means that your error is as likely, in some sense, to fall on the positive side as to fall on the negative side. So sometimes your error will be positive, sometimes negative. And on the average these things cancel out and give you 0, on the average. So this is a property that is sometimes given a name: we say that Theta hat is unbiased. So Theta hat, our estimate, does not have a tendency to be on the high side. It does not have a tendency to be on the low side. On the average, it's just right.

So let's do a little more playing here. Let's see how our error is related to an arbitrary function of the data. Let's do this in a conditional universe and look at this quantity.
In a conditional universe where X is known, h of X is known. And so you can pull it outside the expectation. In the conditional universe where the value of X is given, this quantity becomes just a constant. There's nothing random about it. So you can pull it out of the expectation, and write things this way. And we have just calculated that this quantity is 0. So this number turns out to be 0 as well.

Now, having done this, we can take expectations of both sides. And now let's use the law of iterated expectations. The expectation of a conditional expectation gives us the unconditional expectation, and this is also going to be 0. So here we used the law of iterated expectations.

OK, why are we doing this? We're doing this because I would like to calculate the covariance between Theta tilde and Theta hat. That is, I ask the question: is there a systematic relation between the error and the estimate? So to calculate the covariance we use the property that we can calculate covariances as the expected value of the product minus the product of the expected values. And what do we get? This is 0, because of what we just proved. And this is 0, because of what we proved earlier, that the expected value of the error is equal to 0. So the covariance between the error and any function of X is equal to 0. Let's apply that to the case where the function of X we're considering is Theta hat itself. Theta hat is our estimate; it's a function of X. So this zero result still applies, and we get that this covariance is equal to 0. OK, so that's what we proved.

Let's see, what are the morals to take out of all this? First, you should be very comfortable with this type of calculation involving conditional expectations.
The main two things that we're using are, first, that when you condition on a random variable, any function of that random variable becomes a constant and can be pulled out of the conditional expectation. The other thing that we're using is the law of iterated expectations. So these are the skills involved.

Now, on the substance, why is this result interesting? This tells us that the error is uncorrelated with the estimate. What's a hypothetical situation in which this would not happen? Suppose that whenever Theta hat is positive, my error tends to be negative. Suppose that whenever Theta hat is big, you say, oh, my estimate is too big, maybe the true Theta is on the lower side, so I expect my error to be negative. That would be a situation that would violate this condition. This condition tells you that no matter what Theta hat is, you don't expect your error to be on the positive side or on the negative side. Your error will still be 0 on the average. So if you obtain a very high estimate, this is no reason for you to suspect that the true Theta is lower than your estimate.

If you suspected that the true Theta was lower than your estimate, you should have changed your Theta hat. If you make an estimate, and after obtaining that estimate you say, I think my estimate is too big, and so the error is negative — if you thought that way, then that would mean that your estimate is not the optimal one, that your estimate should have been corrected to be smaller. And that would mean that there's a better estimate than the one you used. But the estimate that we are using here is the optimal one in terms of mean squared error; there's no way of improving it. And this is really captured in that statement. That is, knowing Theta hat doesn't give you a lot of information about the error, and it gives you, therefore, no reason to adjust your estimate from what it was.

Finally, a consequence of all this. This is the definition of the error. Send Theta to this side, send Theta tilde to that side, and you get this relation.
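[The relation being referred to, together with the consequence of the zero covariance just proved, is:

\[
\Theta = \hat{\Theta} - \tilde{\Theta},
\qquad
\operatorname{cov}(\hat{\Theta}, \tilde{\Theta}) = 0
\quad\Longrightarrow\quad
\operatorname{var}(\Theta) = \operatorname{var}(\hat{\Theta}) + \operatorname{var}(\tilde{\Theta}).]
\]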
The true parameter is composed of two quantities: the estimate, and the error, taken with a minus sign. These two quantities are uncorrelated with each other. Their covariance is 0, and therefore the variance of this is the sum of the variances of these two quantities.

So what's an interpretation of this equality? There is some inherent randomness in the random variable Theta that we're trying to estimate. Theta hat tries to estimate it, tries to get close to it. And if Theta hat always stays close to Theta, then since Theta is random, Theta hat must also be quite random, so it has uncertainty in it. And the more uncertain Theta hat is, the more it moves together with Theta, and so the more uncertainty it removes from Theta. And this is the remaining uncertainty in Theta, the uncertainty that's left after we've done our estimation. So ideally, to have a small error, we want this quantity to be small, which is the same as saying that this quantity should be big. In the ideal case, Theta hat is the same as Theta. That's the best we could hope for. That corresponds to 0 error, and all the uncertainty in Theta is absorbed by the uncertainty in Theta hat.

Interestingly, this relation here is just another variation of the law of total variance that we have seen at some point in the past. I will skip that derivation, but it's an interesting fact, and it can give you an alternative interpretation of the law of total variance.

OK, so now let's return to our example. In our example we obtained the optimal estimator, and we saw that it was a nonlinear curve, something like this. I'm exaggerating the corners a little bit to show that it's nonlinear. This is the optimal estimator. It's a nonlinear function of X, and nonlinear generally means complicated. Sometimes the conditional expectation is really hard to compute, because whenever you have to compute expectations you need to do some integrals.
And if you have many random variables involved, it might correspond to a multi-dimensional integration. We don't like this. Can we come up, maybe, with a simpler way of estimating Theta? Of coming up with a point estimate which still has some nice properties, has some good motivation, but is simpler. What does simpler mean? Perhaps linear. Let's put ourselves in a straitjacket and restrict ourselves to estimators that are of this form. My estimate is constrained to be a linear function of the X's. So my estimator is going to be a curve, a linear curve. It could be this, it could be that, or maybe it would want to be something like this. I want to choose the best possible linear function.

What does that mean? It means that I write my Theta hat in this form. If I fix a certain a and b, I have fixed the functional form of my estimator, and this is the corresponding mean squared error. That's the error between the true parameter and the estimate of that parameter, and we take the square of this. And now the optimal linear estimator is defined as one for which this mean squared error is smallest possible over all choices of a and b. So we want to minimize this expression over all a's and b's.

How do we do this minimization? Well, this is a square; you can expand it. Write down all the terms in the expansion of the square. So you're going to get the term expected value of Theta squared. You're going to get another term, a squared times the expected value of X squared, another term which is b squared, and then you're going to get various cross terms. What you have here is really a quadratic function of a and b. So think of the quantity that we're minimizing as some function h of a and b, and it happens to be quadratic. How do we minimize a quadratic function? We set the derivatives of this function with respect to a and b to 0, and then do the algebra.
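[For reference, here is that algebra step written out, under the same setup of minimizing over estimators of the form $\hat{\Theta} = aX + b$. Setting the partial derivatives of $h(a,b) = \mathbb{E}[(\Theta - aX - b)^2]$ to zero gives

\[
\frac{\partial h}{\partial b} = 0 \;\Rightarrow\; b = \mathbb{E}[\Theta] - a\,\mathbb{E}[X],
\qquad
\frac{\partial h}{\partial a} = 0 \;\Rightarrow\; a\,\mathbb{E}[X^2] + b\,\mathbb{E}[X] = \mathbb{E}[\Theta X],
\]

and solving the two equations yields

\[
a = \frac{\operatorname{cov}(\Theta, X)}{\operatorname{var}(X)},
\qquad
b = \mathbb{E}[\Theta] - a\,\mathbb{E}[X],
\quad\text{so}\quad
\hat{\Theta}_L = \mathbb{E}[\Theta] + \frac{\operatorname{cov}(\Theta, X)}{\operatorname{var}(X)}\big(X - \mathbb{E}[X]\big).]
\]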
After you do the algebra, you find that the best choice for a is this one; this is the coefficient next to X. This is the optimal a. And the optimal b corresponds to the constant terms. So this term, and this times that, together give the optimal choice of b.

So the algebra itself is not very interesting. What is really interesting is the nature of the result that we get here. If we were to plot the result on this particular example, you would get a curve that's something like this. It goes through the middle of this diagram and is a little slanted. In this example, X and Theta are positively correlated. Bigger values of X generally correspond to bigger values of Theta. So in this example the covariance between X and Theta is positive, and our estimate can be interpreted in the following way.

The expected value of Theta is the estimate that you would come up with if you didn't have any information about Theta. If you don't make any observations, this is the best way of estimating Theta. But I have made an observation, X, and I need to take it into account. I look at this difference, which is the piece of news contained in X. That term is what X should be on the average. If I observe an X which is bigger than what I expected it to be, then since X and Theta are positively correlated, this tells me that Theta should also be bigger than its average value. Whenever I see an X that's larger than its average value, this gives me an indication that Theta should also probably be larger than its average value. And so I'm taking that difference and multiplying it by a positive coefficient. And that's what gives me a curve here that has a positive slope.

So this increment, the new information contained in X as compared to the average value we expected a priori, allows us to make a correction to our prior estimate of Theta, and the amount of that correction is guided by the covariance of X with Theta.
If the covariance of X with Theta were 0, that would mean there's no systematic relation between the two, and in that case obtaining some information from X doesn't give us a guide as to how to change the estimate of Theta. If that covariance were 0, we would just stay with this particular estimate. We're not able to make a correction. But when there's a nonzero covariance between X and Theta, that covariance works as a guide for us to obtain a better estimate of Theta.

How about the resulting mean squared error? In this context it turns out that there's a very nice formula for the mean squared error obtained from the best linear estimator. What's the story here? The mean squared error that we have has something to do with the variance of the original random variable. The more uncertain our original random variable is, the more error we're going to make. On the other hand, when the two variables are correlated, we exploit that correlation to improve our estimate. This rho here is the correlation coefficient between the two random variables. When this correlation coefficient is larger, this factor here becomes smaller, and our mean squared error becomes smaller.

So think of the two extreme cases. One extreme case is when rho is equal to 1, so X and Theta are perfectly correlated. When they're perfectly correlated, once I know X, then I also know Theta, and the two random variables are linearly related. In that case, my estimate is right on the target, and the mean squared error is going to be 0. The other extreme case is when rho is equal to 0. The two random variables are uncorrelated. In that case the measurement does not help me estimate Theta, and the uncertainty that's left, the mean squared error, is just the original variance of Theta. So the uncertainty in Theta does not get reduced.
So, the moral: the estimation error is a reduced version of the original amount of uncertainty in the random variable Theta, and the larger the correlation between those two random variables, the more uncertainty we can remove from the original random variable. I didn't derive this formula, but it's just a matter of algebraic manipulations. We have a formula for Theta hat; subtract Theta from that formula, take the square, take expectations, and do a few lines of algebra that you can read in the text, and you end up with this really neat and clean formula.

Now, I mentioned at the beginning of the lecture that we can do inference with Thetas and X's that are not just single numbers; they could be vector random variables. So, for example, we might have multiple data points that give us information about Theta. There are no vectors here, so this discussion was for the case where Theta and X were just scalar, one-dimensional quantities. What do we do if we have multiple data? Suppose that Theta is still a scalar, it's one-dimensional, but we make several observations. And on the basis of these observations we want to estimate Theta.
The optimal least mean squares estimator would again be the conditional expectation of Theta given X. That's the optimal one. And in this case X is a vector, so the general estimator we would use would be this one. But if we want to keep things simple, and we want our estimator to have a simple functional form, we might restrict ourselves to estimators that are linear functions of the data. And then the story is exactly the same as we discussed before. I constrain myself to estimating Theta using a linear function of the data, so my signal processing box just applies a linear function. And I'm looking for the best coefficients, the coefficients that are going to result in the least possible squared error. This is my squared error: it is (my estimate minus the thing I'm trying to estimate) squared, and then we take the average.

How do we do this? Same story as before. The X's and the Thetas get averaged out because we have an expectation. Whatever is left is just a function of the coefficients, of the a's and of b. As before, it turns out to be a quadratic function. Then we set the derivatives of this function of the a's and b with respect to the coefficients to 0. And this gives us a system of linear equations. It's a system of linear equations that's satisfied by those coefficients. It's a linear system because this is a quadratic function of those coefficients.

To get closed-form formulas in this particular case, one would need to introduce vectors, and matrices, and matrix inverses, and so on. The particular formulas are not so much what interests us here; rather, the interesting thing is that this is simply done using straightforward solvers of linear equations. The only thing you need to do is to write down the correct coefficients of those linear equations. And what would a typical coefficient be? Let's take a typical term of this quadratic, once you expand it. You're going to get terms such as a1 X1 times a2 X2. When you take expectations, you're left with a1 a2 times the expected value of X1 X2. So this would involve terms such as a1 squared times the expected value of X1 squared, terms such as a1 a2 times the expected value of X1 X2, and a lot of other terms of this kind. So you get something that's quadratic in your coefficients. And the constants that show up in your system of equations are things that have to do with the expected values of squares of your random variables, or of products of your random variables.
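[As a concrete illustration (a minimal sketch, not from the lecture; the function and variable names are made up), here is how the linear LMS coefficients for a vector of observations can be found by solving a linear system built only from means and covariances. Writing the estimator as Theta_hat = a^T X + b, the optimality conditions reduce to Cov(X) a = cov(X, Theta) and b = E[Theta] - a^T E[X].

```python
import numpy as np

def linear_lms_coefficients(mean_x, cov_x, mean_theta, cov_x_theta):
    """Coefficients of the linear LMS estimator Theta_hat = a^T X + b.

    mean_x      : (n,) vector of E[X_i]
    cov_x       : (n, n) covariance matrix of X
    mean_theta  : scalar E[Theta]
    cov_x_theta : (n,) vector of cov(X_i, Theta)

    Only first and second moments are needed -- no full distributions.
    """
    a = np.linalg.solve(cov_x, cov_x_theta)   # normal equations: Cov(X) a = cov(X, Theta)
    b = mean_theta - a @ mean_x
    return a, b

if __name__ == "__main__":
    # Hypothetical numbers, just to show the call.
    mean_x = np.array([1.0, 2.0])
    cov_x = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    a, b = linear_lms_coefficients(mean_x, cov_x, mean_theta=3.0,
                                   cov_x_theta=np.array([1.0, 0.3]))
    print("a =", a, "b =", b)
```]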
775 00:43:39,130 --> 00:43:43,060 To write down the numerical values for these, the only 776 00:43:43,060 --> 00:43:46,330 thing you need to know are the means and variances of your 777 00:43:46,330 --> 00:43:47,570 random variables. 778 00:43:47,570 --> 00:43:50,360 If you know the mean and variance then you know what 779 00:43:50,360 --> 00:43:51,760 this thing is. 780 00:43:51,760 --> 00:43:54,950 And if you know the covariances as well then you 781 00:43:54,950 --> 00:43:57,250 know what this thing is. 782 00:43:57,250 --> 00:44:02,080 So in order to find the optimal linear estimator in 783 00:44:02,080 --> 00:44:06,870 the case of multiple data you do not need to know the entire 784 00:44:06,870 --> 00:44:09,230 probability distribution of the random 785 00:44:09,230 --> 00:44:11,050 variables that are involved. 786 00:44:11,050 --> 00:44:14,690 You only need to know your means and covariances. 787 00:44:14,690 --> 00:44:18,670 These are the only quantities that affect the construction 788 00:44:18,670 --> 00:44:20,570 of your optimal estimator. 789 00:44:20,570 --> 00:44:23,840 We could see this already in this formula. 790 00:44:23,840 --> 00:44:29,650 The form of my optimal estimator is completely 791 00:44:29,650 --> 00:44:34,100 determined once I know the means, variances, and 792 00:44:34,100 --> 00:44:37,970 covariances of the random variables in my model. 793 00:44:37,970 --> 00:44:44,410 I do not need to know the detailed distribution of the 794 00:44:44,410 --> 00:44:46,570 random variables that are involved here. 795 00:44:46,570 --> 00:44:51,690 796 00:44:51,690 --> 00:44:55,110 So as I said, in general you find the form of the optimal 797 00:44:55,110 --> 00:44:59,550 estimator by using a linear equation solver. 798 00:44:59,550 --> 00:45:01,890 There are special examples in which you can 799 00:45:01,890 --> 00:45:05,210 get closed-form solutions. 800 00:45:05,210 --> 00:45:10,090 The nicest, simplest estimation problem one can think of is 801 00:45:10,090 --> 00:45:11,120 the following-- 802 00:45:11,120 --> 00:45:14,870 you have some uncertain parameter, and you make 803 00:45:14,870 --> 00:45:17,790 multiple measurements of that parameter in 804 00:45:17,790 --> 00:45:19,950 the presence of noise. 805 00:45:19,950 --> 00:45:22,520 So the Wi's are noises. 806 00:45:22,520 --> 00:45:25,130 i corresponds to your i-th experiment. 807 00:45:25,130 --> 00:45:27,810 So this is the most common situation that you encounter 808 00:45:27,810 --> 00:45:28,490 in the lab. 809 00:45:28,490 --> 00:45:31,240 If you are dealing with some process and you're trying to 810 00:45:31,240 --> 00:45:34,110 measure something, you measure it over and over. 811 00:45:34,110 --> 00:45:37,030 Each time your measurement has some random error. 812 00:45:37,030 --> 00:45:40,360 And then you need to take all your measurements together and 813 00:45:40,360 --> 00:45:43,550 come up with a single estimate. 814 00:45:43,550 --> 00:45:48,320 So the noises are assumed to be independent of each other, 815 00:45:48,320 --> 00:45:50,010 and also to be independent from the 816 00:45:50,010 --> 00:45:52,090 value of the true parameter. 817 00:45:52,090 --> 00:45:55,010 Without loss of generality we can assume that the noises 818 00:45:55,010 --> 00:45:58,890 have 0 mean, and they have some variances that we 819 00:45:58,890 --> 00:46:00,340 assume to be known. 820 00:46:00,340 --> 00:46:03,180 Theta itself has a prior distribution with a certain 821 00:46:03,180 --> 00:46:05,670 mean and a certain variance.
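[Editor's note: the model and formula on the slide are not reproduced in the transcript. A reconstruction in standard notation is given below, where mu and sigma_0^2 denote the prior mean and variance of Theta, and sigma_i^2 the variance of the noise W_i; this is the weighted-average form discussed next.]

\[
X_i = \Theta + W_i, \quad i = 1, \dots, n,
\qquad
\hat{\Theta} \;=\; \frac{\dfrac{\mu}{\sigma_0^{2}} + \displaystyle\sum_{i=1}^{n} \frac{X_i}{\sigma_i^{2}}}
{\dfrac{1}{\sigma_0^{2}} + \displaystyle\sum_{i=1}^{n} \frac{1}{\sigma_i^{2}}}.
\]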
822 00:46:05,670 --> 00:46:07,610 So the form of the optimal linear 823 00:46:07,610 --> 00:46:10,940 estimator is really nice. 824 00:46:10,940 --> 00:46:14,930 Well, maybe you cannot see it right away because this looks 825 00:46:14,930 --> 00:46:18,580 messy, but what is it really? 826 00:46:18,580 --> 00:46:24,590 It's a linear combination of the X's and the prior mean. 827 00:46:24,590 --> 00:46:28,560 And it's actually a weighted average of the X's and the 828 00:46:28,560 --> 00:46:30,250 prior mean. 829 00:46:30,250 --> 00:46:33,570 Here we collect all of the coefficients that 830 00:46:33,570 --> 00:46:35,920 we have at the top. 831 00:46:35,920 --> 00:46:42,060 So the whole thing is basically a weighted average. 832 00:46:42,060 --> 00:46:46,460 833 00:46:46,460 --> 00:46:51,110 1/(sigma_i-squared) is the weight that we give to Xi, and 834 00:46:51,110 --> 00:46:54,710 in the denominator we have the sum of all of the weights. 835 00:46:54,710 --> 00:46:59,260 So in the end we're dealing with a weighted average. 836 00:46:59,260 --> 00:47:03,760 If mu were equal to 1, and all the Xi's were equal to 1, then 837 00:47:03,760 --> 00:47:06,790 our estimate would also be equal to 1. 838 00:47:06,790 --> 00:47:10,670 Now the form of the weights that we have is interesting. 839 00:47:10,670 --> 00:47:16,050 Any given data point is weighted in inverse 840 00:47:16,050 --> 00:47:17,820 proportion to its variance. 841 00:47:17,820 --> 00:47:20,270 What does that say? 842 00:47:20,270 --> 00:47:26,920 If my i-th data point has a lot of variance, if Wi is very 843 00:47:26,920 --> 00:47:32,900 noisy, then Xi is not very useful, is not very reliable. 844 00:47:32,900 --> 00:47:36,840 So I'm giving it a small weight. 845 00:47:36,840 --> 00:47:41,870 Large variance, a lot of error in my Xi, means that I should 846 00:47:41,870 --> 00:47:44,200 give it a smaller weight. 847 00:47:44,200 --> 00:47:47,920 If two data points have the same variance, if they're of 848 00:47:47,920 --> 00:47:50,140 comparable quality, then I'm going to 849 00:47:50,140 --> 00:47:51,950 give them equal weight. 850 00:47:51,950 --> 00:47:56,200 The other interesting thing is that the prior mean is treated 851 00:47:56,200 --> 00:47:58,300 the same way as the X's. 852 00:47:58,300 --> 00:48:03,050 So it's treated as an additional observation. 853 00:48:03,050 --> 00:48:07,100 So we're taking a weighted average of the prior mean and 854 00:48:07,100 --> 00:48:09,850 of the measurements that we are making. 855 00:48:09,850 --> 00:48:13,380 The formula looks as if the prior mean were just another 856 00:48:13,380 --> 00:48:14,210 data point. 857 00:48:14,210 --> 00:48:17,440 So that's the way of thinking about Bayesian estimation. 858 00:48:17,440 --> 00:48:20,270 You have your real data points, the X's that you 859 00:48:20,270 --> 00:48:23,430 observe, and you also have some prior information. 860 00:48:23,430 --> 00:48:27,470 This plays a role similar to a data point. 861 00:48:27,470 --> 00:48:31,580 It is interesting to note that if all random variables in 862 00:48:31,580 --> 00:48:35,230 this model are normal, this optimal linear estimator happens to be 863 00:48:35,230 --> 00:48:36,950 also the conditional expectation. 864 00:48:36,950 --> 00:48:40,000 That's the nice thing about normal random variables: 865 00:48:40,000 --> 00:48:42,770 conditional expectations turn out to be linear. 866 00:48:42,770 --> 00:48:46,920 So the optimal estimate and the optimal linear estimate 867 00:48:46,920 --> 00:48:48,560 turn out to be the same.
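[Editor's note: a short sketch, not from the lecture, of the weighted-average computation just described; the function name and arguments are illustrative.]

    import numpy as np

    def weighted_average_estimate(x, noise_vars, prior_mean, prior_var):
        # Each measurement X_i = Theta + W_i gets weight 1/sigma_i^2;
        # the prior mean is treated as one extra "observation" with weight 1/sigma_0^2.
        x = np.asarray(x, dtype=float)
        w = 1.0 / np.asarray(noise_vars, dtype=float)
        numerator = prior_mean / prior_var + np.sum(w * x)
        denominator = 1.0 / prior_var + np.sum(w)
        return numerator / denominator

As noted above, noisier measurements (larger variance) receive smaller weight, and if the prior mean and all the measurements equal 1, the estimate is 1.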
868 00:48:48,560 --> 00:48:51,050 And that gives us another interpretation of linear 869 00:48:51,050 --> 00:48:52,100 estimation. 870 00:48:52,100 --> 00:48:54,660 Linear estimation is essentially the same as 871 00:48:54,660 --> 00:48:58,970 pretending that all random variables are normal. 872 00:48:58,970 --> 00:49:02,040 So that's a side point. 873 00:49:02,040 --> 00:49:04,230 Now I'd like to close with a comment. 874 00:49:04,230 --> 00:49:08,370 875 00:49:08,370 --> 00:49:11,760 You do your measurements and you estimate Theta on the 876 00:49:11,760 --> 00:49:17,040 basis of X. Suppose that instead you have a measuring 877 00:49:17,040 --> 00:49:20,970 device that measures X-cubed instead of measuring X, and 878 00:49:20,970 --> 00:49:23,350 you want to estimate Theta. 879 00:49:23,350 --> 00:49:26,760 Are you going to get a different estimate? 880 00:49:26,760 --> 00:49:31,790 Well, X and X-cubed contain the same information. 881 00:49:31,790 --> 00:49:34,730 Telling you X is the same as telling you 882 00:49:34,730 --> 00:49:36,640 the value of X-cubed. 883 00:49:36,640 --> 00:49:40,660 So the posterior distribution of Theta given X is the same 884 00:49:40,660 --> 00:49:44,160 as the posterior distribution of Theta given X-cubed. 885 00:49:44,160 --> 00:49:47,450 And so the means of these posterior distributions are 886 00:49:47,450 --> 00:49:49,390 going to be the same. 887 00:49:49,390 --> 00:49:52,850 So doing transformations of your data does not 888 00:49:52,850 --> 00:49:57,370 matter if you're doing optimal least mean squares estimation. 889 00:49:57,370 --> 00:50:00,100 On the other hand, if you restrict yourself to doing 890 00:50:00,100 --> 00:50:05,540 linear estimation, then using a linear function of X is not 891 00:50:05,540 --> 00:50:09,720 the same as using a linear function of X-cubed. 892 00:50:09,720 --> 00:50:14,720 So this is a linear estimator, but where the data are the 893 00:50:14,720 --> 00:50:19,250 X-cubes, and we have a linear function of the data. 894 00:50:19,250 --> 00:50:23,690 So this means that when you're using linear estimation you 895 00:50:23,690 --> 00:50:28,040 have some choices to make: linear in what? 896 00:50:28,040 --> 00:50:32,290 Sometimes you want to plot your data on an ordinary 897 00:50:32,290 --> 00:50:35,090 scale and try to fit a line through them. 898 00:50:35,090 --> 00:50:38,360 Sometimes you plot your data on a logarithmic scale, and 899 00:50:38,360 --> 00:50:40,480 try to fit a line through them. 900 00:50:40,480 --> 00:50:42,390 Which scale is the appropriate one? 901 00:50:42,390 --> 00:50:44,510 Here it would be a cubic scale. 902 00:50:44,510 --> 00:50:46,830 And you have to think about your particular model to 903 00:50:46,830 --> 00:50:51,180 decide which version would be the more appropriate one. 904 00:50:51,180 --> 00:50:55,830 Finally, when we have multiple data, sometimes these multiple 905 00:50:55,830 --> 00:50:59,910 data might contain the same information. 906 00:50:59,910 --> 00:51:02,800 So X is one data point, X-squared is another data 907 00:51:02,800 --> 00:51:05,610 point, X-cubed is another data point. 908 00:51:05,610 --> 00:51:08,540 The three of them contain the same information, but you can 909 00:51:08,540 --> 00:51:11,480 try to form a linear function of them. 910 00:51:11,480 --> 00:51:14,380 And then you obtain a linear estimator that has a more 911 00:51:14,380 --> 00:51:16,930 general form as a function of X.
912 00:51:16,930 --> 00:51:22,130 So if you want to estimate your Theta as a cubic function 913 00:51:22,130 --> 00:51:26,330 of X, for example, you can set up a linear estimation model 914 00:51:26,330 --> 00:51:29,480 of this particular form and find the optimal coefficients, 915 00:51:29,480 --> 00:51:32,900 the a's and the b. 916 00:51:32,900 --> 00:51:35,700 All right, so the last slide just gives you the big picture 917 00:51:35,700 --> 00:51:39,330 of what's happening in Bayesian Inference; it's for 918 00:51:39,330 --> 00:51:40,330 you to ponder. 919 00:51:40,330 --> 00:51:41,930 Basically we talked about three 920 00:51:41,930 --> 00:51:43,470 possible estimation methods. 921 00:51:43,470 --> 00:51:48,300 Maximum a posteriori estimation, mean squared error estimation, and 922 00:51:48,300 --> 00:51:51,070 linear mean squared error estimation, or linear least squares 923 00:51:51,070 --> 00:51:52,290 estimation. 924 00:51:52,290 --> 00:51:54,410 And there are a number of standard examples that you 925 00:51:54,410 --> 00:51:57,130 will be seeing over and over in the recitations, tutorials, 926 00:51:57,130 --> 00:52:00,950 homework, and so on, perhaps even on exams. 927 00:52:00,950 --> 00:52:05,630 Where we take some nice priors on some unknown parameter, we 928 00:52:05,630 --> 00:52:09,410 take some nice models for the noise or the observations, and 929 00:52:09,410 --> 00:52:11,880 then you need to work out the posterior distributions and the 930 00:52:11,880 --> 00:52:13,570 various estimates and compare them. 931 00:52:13,570 --> 00:52:15,040
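[Editor's note: a closing sketch, not from the lecture, of the cubic-in-X linear estimator mentioned above. It assumes samples of (Theta, X) are available to estimate the required means and covariances; if the true moments were known, the sample averages below would simply be replaced by them.]

    import numpy as np

    def cubic_lms_coefficients(theta_samples, x_samples):
        # Treat X, X^2, X^3 as three separate "observations" and find the best
        # estimator of the form a1*X + a2*X^2 + a3*X^3 + b in the least mean
        # squares sense: the weights solve Cov(F) a = Cov(F, Theta),
        # where F = (X, X^2, X^3).
        theta_samples = np.asarray(theta_samples, dtype=float)
        x_samples = np.asarray(x_samples, dtype=float)
        feats = np.column_stack([x_samples, x_samples**2, x_samples**3])
        cov_f = np.cov(feats, rowvar=False)
        cov_f_theta = np.array([np.cov(feats[:, j], theta_samples)[0, 1]
                                for j in range(3)])
        a = np.linalg.solve(cov_f, cov_f_theta)
        b = theta_samples.mean() - a @ feats.mean(axis=0)
        return a, b

This is the same normal-equation machinery as before; only the choice of what counts as "the data" has changed, which is exactly the design decision discussed at the end of the lecture.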