1 00:00:00,500 --> 00:00:04,240 As a preparation for more complex and more difficult 2 00:00:04,240 --> 00:00:08,580 models, we will start by looking at the simplest model 3 00:00:08,580 --> 00:00:12,330 that there is, that involves a linear relation 4 00:00:12,330 --> 00:00:15,060 and normal random variables. 5 00:00:15,060 --> 00:00:17,930 The specifics of the model are as follows. 6 00:00:17,930 --> 00:00:21,490 There's an unknown parameter modeled as a random variable, 7 00:00:21,490 --> 00:00:23,730 Theta, that we wish to estimate. 8 00:00:23,730 --> 00:00:28,250 What we have in our hands is Theta plus some additive noise, 9 00:00:28,250 --> 00:00:31,790 W. And this sum is our observation, 10 00:00:31,790 --> 00:00:34,390 X. The assumptions that we make are 11 00:00:34,390 --> 00:00:37,470 that Theta and W are normal random variables. 12 00:00:37,470 --> 00:00:39,520 And to keep the calculations simple, 13 00:00:39,520 --> 00:00:42,770 we assume that they're standard normal random variables. 14 00:00:42,770 --> 00:00:45,130 Furthermore, we assume that Theta and W 15 00:00:45,130 --> 00:00:48,380 are independent of each other. 16 00:00:48,380 --> 00:00:54,090 According to the Bayesian program, inference about Theta 17 00:00:54,090 --> 00:00:58,450 is essentially the calculation of the posterior distribution 18 00:00:58,450 --> 00:01:01,710 of Theta if I tell you that the observation, capital 19 00:01:01,710 --> 00:01:05,209 X takes on a specific value little x. 20 00:01:05,209 --> 00:01:07,620 To calculate this posterior distribution, 21 00:01:07,620 --> 00:01:11,880 we invoke the appropriate form of the Bayes rule. 22 00:01:11,880 --> 00:01:13,539 We have the prior of Theta. 23 00:01:13,539 --> 00:01:15,170 It's a standard normal. 24 00:01:15,170 --> 00:01:18,280 Now we need to figure out the conditional distribution 25 00:01:18,280 --> 00:01:21,530 of X given Theta. 26 00:01:21,530 --> 00:01:24,010 What is it? 27 00:01:24,010 --> 00:01:28,060 If I tell you that the random variable, capital Theta, 28 00:01:28,060 --> 00:01:32,190 takes on a specific value, little theta, then 29 00:01:32,190 --> 00:01:35,650 in that conditional universe, our observation 30 00:01:35,650 --> 00:01:40,890 is going to be that specific value of Theta, which 31 00:01:40,890 --> 00:01:45,090 is our little theta, plus the noise term, 32 00:01:45,090 --> 00:01:48,100 capital W. This is the relation that's holds 33 00:01:48,100 --> 00:01:49,860 in the conditional universe, where 34 00:01:49,860 --> 00:01:52,970 we are told the value of Theta. 35 00:01:52,970 --> 00:01:55,710 Now W is independent of Theta. 36 00:01:55,710 --> 00:01:59,380 So even though I have told you the value of Theta, 37 00:01:59,380 --> 00:02:01,780 the distribution of W does not change. 38 00:02:01,780 --> 00:02:04,160 It is still a standard normal. 39 00:02:04,160 --> 00:02:09,139 So X is a standard normal plus a constant theta. 40 00:02:09,139 --> 00:02:12,110 What that does is that it changes 41 00:02:12,110 --> 00:02:14,920 the mean of the normal distribution, 42 00:02:14,920 --> 00:02:18,420 but it will still be a normal random variable. 43 00:02:18,420 --> 00:02:21,030 So in this conditional universe, X 44 00:02:21,030 --> 00:02:23,790 is going to be a normal random variable 45 00:02:23,790 --> 00:02:26,980 with mean equal to theta, and with variance 46 00:02:26,980 --> 00:02:31,190 equal to the variance of W, which is equal to 1. 47 00:02:31,190 --> 00:02:34,550 So now, we know what this distribution is, 48 00:02:34,550 --> 00:02:38,530 and we can move with the calculation of the posterior. 49 00:02:38,530 --> 00:02:44,690 So we have the denominator term, which I'm writing here. 50 00:02:44,690 --> 00:02:47,860 And then we have the density of Theta. 51 00:02:47,860 --> 00:02:50,079 Since it is a standard normal, it 52 00:02:50,079 --> 00:02:54,300 takes the form of a constant. 53 00:02:54,300 --> 00:02:58,430 We do not really care about the value of that constant. 54 00:02:58,430 --> 00:03:02,710 What we care really is the term on the exponent. 55 00:03:02,710 --> 00:03:07,010 And then we have the conditional density of X 56 00:03:07,010 --> 00:03:11,180 given Theta, which is a normal with these parameters. 57 00:03:11,180 --> 00:03:17,510 And therefore, it takes the form c e to the minus 1/2. 58 00:03:17,510 --> 00:03:20,079 It's a density in x. 59 00:03:20,079 --> 00:03:24,100 And so, up here, we have x minus the mean of that density. 60 00:03:24,100 --> 00:03:29,370 But the mean is equal to theta, squared. 61 00:03:29,370 --> 00:03:31,120 And this is the final form. 62 00:03:31,120 --> 00:03:36,440 Now what we notice here is that we have a few constant terms. 63 00:03:36,440 --> 00:03:41,310 Another term that depends on x, and then a quadratic in theta. 64 00:03:41,310 --> 00:03:46,360 So we can write all this as some function of x, and then 65 00:03:46,360 --> 00:03:50,930 e to the negative of some quadratic in theta. 66 00:03:54,640 --> 00:03:59,820 Now when we're doing inference, we are given the value of X. 67 00:03:59,820 --> 00:04:04,980 So let us fix a particular value of little x 68 00:04:04,980 --> 00:04:08,770 and concentrate on the dependence on theta. 69 00:04:08,770 --> 00:04:12,500 So with x being fixed, this is just a constant. 70 00:04:12,500 --> 00:04:16,000 And as a function of theta, it's e to the minus something 71 00:04:16,000 --> 00:04:17,670 quadratic in theta. 72 00:04:17,670 --> 00:04:23,130 And we recognize that this is a normal PDF. 73 00:04:23,130 --> 00:04:25,380 So we conclude that the posterior distribution 74 00:04:25,380 --> 00:04:30,690 of Theta, given our observation, is normal. 75 00:04:30,690 --> 00:04:37,260 Since it is normal, the expected value of this conditional PDF 76 00:04:37,260 --> 00:04:40,620 will be the same as the peak of that the PDF. 77 00:04:40,620 --> 00:04:44,110 And this would be our point estimate of Theta 78 00:04:44,110 --> 00:04:45,270 in particular. 79 00:04:45,270 --> 00:04:48,610 If we use either of the MAP-- Maximum A Posterior 80 00:04:48,610 --> 00:04:52,090 Probability-- or the least mean squares estimator, 81 00:04:52,090 --> 00:04:54,240 which is defined as the conditional expectation 82 00:04:54,240 --> 00:04:57,130 of Theta, given the observation that we have made. 83 00:04:57,130 --> 00:04:59,480 So this conditional expectation is just 84 00:04:59,480 --> 00:05:01,680 the mean of this posterior distribution. 85 00:05:01,680 --> 00:05:05,490 It is also the peak of that posterior distribution. 86 00:05:05,490 --> 00:05:08,370 So let us find what the peak is. 87 00:05:08,370 --> 00:05:12,490 To find the peak, we focus on the exponent term, which 88 00:05:12,490 --> 00:05:18,420 is ignoring the minus sign, the exponent term is this one. 89 00:05:22,750 --> 00:05:25,420 And to find the peak of the distribution, 90 00:05:25,420 --> 00:05:31,910 we need to find the place where this exponent term is smallest. 91 00:05:31,910 --> 00:05:33,865 To find out when this term is smallest, 92 00:05:33,865 --> 00:05:37,840 we take its derivative with respect to theta 93 00:05:37,840 --> 00:05:39,390 and set it equal to 0. 94 00:05:39,390 --> 00:05:42,640 The derivative of this term is theta. 95 00:05:42,640 --> 00:05:48,380 The derivative of this term is theta minus x. 96 00:05:48,380 --> 00:05:51,409 We set this to 0. 97 00:05:51,409 --> 00:05:56,930 And when we solve this equation, we find 2 theta equal to x. 98 00:05:56,930 --> 00:06:00,375 Therefore, theta is equal to x/2. 99 00:06:00,375 --> 00:06:02,860 And so, we conclude from here that the peak 100 00:06:02,860 --> 00:06:07,930 of the distribution occurs when theta is equal to x/2. 101 00:06:07,930 --> 00:06:13,010 And this is our estimate of theta. 102 00:06:13,010 --> 00:06:16,220 So our estimate takes into account 103 00:06:16,220 --> 00:06:20,350 that we believe that theta is 0 on the average. 104 00:06:20,350 --> 00:06:24,280 But also takes into account the observation that we have made, 105 00:06:24,280 --> 00:06:28,890 and comes up with a value that's in between our prior mean, 106 00:06:28,890 --> 00:06:33,870 which was 0, and the observation, which is little x. 107 00:06:33,870 --> 00:06:36,800 So this is what the estimates are. 108 00:06:36,800 --> 00:06:40,750 If we want to talk about estimators, which are now 109 00:06:40,750 --> 00:06:43,520 random variables, what would they be? 110 00:06:43,520 --> 00:06:46,030 The estimator is a random variable 111 00:06:46,030 --> 00:06:49,420 that takes this value whenever capital X takes 112 00:06:49,420 --> 00:06:51,540 the value of little x. 113 00:06:51,540 --> 00:06:53,530 Therefore, it's the random variable, 114 00:06:53,530 --> 00:06:56,940 which is equal to capital X/2. 115 00:06:56,940 --> 00:06:59,680 This is a relation between random variables. 116 00:06:59,680 --> 00:07:02,290 This is a corresponding relation between numbers 117 00:07:02,290 --> 00:07:06,590 if you're given a specific value for little x. 118 00:07:06,590 --> 00:07:10,020 How special is this example? 119 00:07:10,020 --> 00:07:14,900 It turns out that the same structure of the solution 120 00:07:14,900 --> 00:07:19,000 shows up even if we assume that Theta and W are 121 00:07:19,000 --> 00:07:21,300 independent normal random variables, 122 00:07:21,300 --> 00:07:24,330 but with some general means and variances. 123 00:07:24,330 --> 00:07:26,480 You should be able to verify on your own 124 00:07:26,480 --> 00:07:30,150 by repeating the calculations that we just carried out 125 00:07:30,150 --> 00:07:35,720 that the posterior distribution of Theta will still be normal. 126 00:07:35,720 --> 00:07:39,390 And since it is normal, the peak of the distribution 127 00:07:39,390 --> 00:07:41,550 is the same as the expected value. 128 00:07:41,550 --> 00:07:45,320 So the expected value, or least mean squares estimator, 129 00:07:45,320 --> 00:07:48,330 coincides with the maximum a-posteriori probability 130 00:07:48,330 --> 00:07:49,550 estimator. 131 00:07:49,550 --> 00:07:52,940 And finally, although this formula will not 132 00:07:52,940 --> 00:07:56,760 be exactly true, there will be a similar formula 133 00:07:56,760 --> 00:07:59,130 for the estimator, namely the estimator 134 00:07:59,130 --> 00:08:04,200 will turn out to be a linear function of the measurements. 135 00:08:04,200 --> 00:08:06,830 We will see that these conclusions are actually 136 00:08:06,830 --> 00:08:09,230 even more general than that. 137 00:08:09,230 --> 00:08:12,730 And this is what makes it very appealing 138 00:08:12,730 --> 00:08:17,840 to work with normal random variables and linear relations.