1 00:00:00,500 --> 00:00:02,940 In this segment, we introduce the subject of least 2 00:00:02,940 --> 00:00:04,690 mean squares estimation. 3 00:00:04,690 --> 00:00:07,370 But as a warm-up, we will start with a very simple 4 00:00:07,370 --> 00:00:08,860 special case. 5 00:00:08,860 --> 00:00:11,330 Suppose that we have some random variable 6 00:00:11,330 --> 00:00:14,800 that we wish to estimate and that we have the probability 7 00:00:14,800 --> 00:00:17,500 distribution of this random variable-- a probability 8 00:00:17,500 --> 00:00:20,440 mass function if it's discrete or a probability density 9 00:00:20,440 --> 00:00:22,310 function if it's continuous. 10 00:00:22,310 --> 00:00:25,060 As a concrete instance, suppose that our random variable 11 00:00:25,060 --> 00:00:29,050 is uniformly distributed over a certain interval. 12 00:00:29,050 --> 00:00:32,320 We would like to estimate this random variable. 13 00:00:32,320 --> 00:00:34,660 We're interested in a point estimate. 14 00:00:34,660 --> 00:00:37,780 However, we look at the very special case 15 00:00:37,780 --> 00:00:40,450 where there are no observations available. 16 00:00:40,450 --> 00:00:43,580 All that we have is this probability distribution. 17 00:00:43,580 --> 00:00:46,150 How do we estimate this random variable? 18 00:00:46,150 --> 00:00:48,340 Well, we can use the rules that we have already 19 00:00:48,340 --> 00:00:50,530 developed, for example, the maximum 20 00:00:50,530 --> 00:00:54,079 a posteriori probability rule, what would it do? 21 00:00:54,079 --> 00:00:55,995 In this case, since there are no observations, 22 00:00:55,995 --> 00:01:00,530 the posterior distribution of Theta is the same as the prior. 23 00:01:00,530 --> 00:01:03,280 There are no observations that could change the prior. 24 00:01:03,280 --> 00:01:06,540 So we need to find a point at which this distribution is 25 00:01:06,540 --> 00:01:07,710 highest. 26 00:01:07,710 --> 00:01:10,170 Well, because this distribution is flat, 27 00:01:10,170 --> 00:01:13,920 the MAP rule does not give us a unique answer. 28 00:01:13,920 --> 00:01:18,840 Any value of theta inside the interval from 4 to 10 29 00:01:18,840 --> 00:01:21,880 would be an acceptable answer. 30 00:01:21,880 --> 00:01:29,100 So any particular estimate in this interval would be fine. 31 00:01:29,100 --> 00:01:31,900 So this rule is not particularly helpful. 32 00:01:31,900 --> 00:01:34,450 How about a different estimator? 33 00:01:34,450 --> 00:01:37,180 We have seen the conditional expectation estimator. 34 00:01:37,180 --> 00:01:40,840 Once more, in our case, since we do not have any observations, 35 00:01:40,840 --> 00:01:43,920 the conditional expectation is the same as the expectation. 36 00:01:43,920 --> 00:01:46,000 And for this particular example, it 37 00:01:46,000 --> 00:01:48,780 would give us the midpoint of the distribution, namely 38 00:01:48,780 --> 00:01:52,270 an estimate equal to 7. 39 00:01:52,270 --> 00:01:55,750 Now, this rule was inconclusive. 40 00:01:55,750 --> 00:01:58,000 This rule gave us a number. 41 00:01:58,000 --> 00:02:02,510 How can we choose and decide that one of these 42 00:02:02,510 --> 00:02:04,560 is the right estimate? 43 00:02:04,560 --> 00:02:08,650 We can do that if we introduce a specific performance criterion. 44 00:02:08,650 --> 00:02:12,100 What is it that we wish from our estimators? 45 00:02:12,100 --> 00:02:14,110 And the particular criterion, the one 46 00:02:14,110 --> 00:02:18,460 that will we be focusing on, is the mean squared error. 47 00:02:18,460 --> 00:02:20,840 If you come up with a certain estimate, 48 00:02:20,840 --> 00:02:24,170 you look at how far is your estimate from the true value 49 00:02:24,170 --> 00:02:27,040 that you're trying to estimate, take the square of that, 50 00:02:27,040 --> 00:02:28,360 and average it. 51 00:02:28,360 --> 00:02:30,850 And this leads us to a formulation 52 00:02:30,850 --> 00:02:36,010 whereby we will try to find an estimate theta hat that 53 00:02:36,010 --> 00:02:38,850 minimizes this mean squared error 54 00:02:38,850 --> 00:02:42,230 over all possible estimates. 55 00:02:42,230 --> 00:02:44,400 Let us now look at this formulation 56 00:02:44,400 --> 00:02:47,300 and see how we can solve it. 57 00:02:47,300 --> 00:02:50,650 This is a function of a single variable theta hat. 58 00:02:50,650 --> 00:02:54,630 And we can try to minimize it using conventional methods. 59 00:02:54,630 --> 00:02:56,820 To carry out this minimization, let 60 00:02:56,820 --> 00:03:01,100 us first expand this expectation into a sum of terms. 61 00:03:01,100 --> 00:03:03,220 We have the expected value of the square 62 00:03:03,220 --> 00:03:06,940 of the random variable, then a cross term minus 2 63 00:03:06,940 --> 00:03:12,360 the expected value of Theta times theta hat. 64 00:03:12,360 --> 00:03:16,010 However, theta hat is a number that we're trying to choose. 65 00:03:16,010 --> 00:03:17,250 It's not random. 66 00:03:17,250 --> 00:03:20,780 Therefore, we can pull it outside the expectation. 67 00:03:20,780 --> 00:03:24,079 And similarly, the last term, the expected value of theta hat 68 00:03:24,079 --> 00:03:28,970 squared is just theta hat squared itself. 69 00:03:28,970 --> 00:03:30,610 This is what we want to minimize. 70 00:03:30,610 --> 00:03:32,280 How do we minimize it? 71 00:03:32,280 --> 00:03:36,340 We take the derivative with respect to theta hat 72 00:03:36,340 --> 00:03:38,710 and set it to 0. 73 00:03:38,710 --> 00:03:47,480 And this gives us minus 2 the expected value of Theta 74 00:03:47,480 --> 00:03:52,450 plus twice theta hat equal to 0. 75 00:03:52,450 --> 00:03:54,770 And when we solve this, we find is 76 00:03:54,770 --> 00:03:58,180 that theta hat, the optimal estimate 77 00:03:58,180 --> 00:04:01,250 is equal to the expected value of Theta. 78 00:04:04,040 --> 00:04:08,250 So this is the answer to our optimization problem. 79 00:04:08,250 --> 00:04:13,460 The optimal estimate, according to the least squares criterion, 80 00:04:13,460 --> 00:04:17,709 is the expected value of the random variable. 81 00:04:17,709 --> 00:04:20,690 Now, it's interesting to derive this result, also, 82 00:04:20,690 --> 00:04:22,390 in a second way. 83 00:04:22,390 --> 00:04:25,530 Since it is a quite important and fundamental result, 84 00:04:25,530 --> 00:04:29,465 let us see whether there is a different way of establishing 85 00:04:29,465 --> 00:04:33,370 it that stays closer to the probabilistic world rather 86 00:04:33,370 --> 00:04:35,700 than the calculus world. 87 00:04:35,700 --> 00:04:39,050 So let us look at this criterion that we have here. 88 00:04:39,050 --> 00:04:43,180 The expected value of the square of a random variable 89 00:04:43,180 --> 00:04:52,540 is always equal to the variance of that random variable 90 00:04:52,540 --> 00:05:00,115 plus the expected value of that random variable squared. 91 00:05:03,480 --> 00:05:06,600 This is what we're trying to minimize. 92 00:05:06,600 --> 00:05:10,190 Now, theta hat is a constant. 93 00:05:10,190 --> 00:05:12,960 The variance of a random variable plus or minus 94 00:05:12,960 --> 00:05:17,380 a constant is the same as the variance 95 00:05:17,380 --> 00:05:20,020 of that random variable. 96 00:05:20,020 --> 00:05:23,160 And there is nothing that we can do about this term. 97 00:05:23,160 --> 00:05:25,680 So what we're trying to do is, essentially, 98 00:05:25,680 --> 00:05:29,900 just try to minimize this term with respect to theta hat. 99 00:05:29,900 --> 00:05:32,980 Now, here we have the square of something. 100 00:05:32,980 --> 00:05:37,810 The way to minimize this is to try to make this quantity as 101 00:05:37,810 --> 00:05:41,490 small as possible, make it 0 if we can. 102 00:05:41,490 --> 00:05:46,430 Well, we can make it 0 if we set theta hat equal to the expected 103 00:05:46,430 --> 00:05:48,400 value of Theta. 104 00:05:48,400 --> 00:05:51,640 So this term here is minimized. 105 00:05:54,700 --> 00:06:00,150 When theta hat is equal to the expected value of Theta, 106 00:06:00,150 --> 00:06:03,480 it is minimized because this is a choice that 107 00:06:03,480 --> 00:06:06,620 will make this term equal to 0. 108 00:06:06,620 --> 00:06:09,080 So this was a second derivation of why 109 00:06:09,080 --> 00:06:13,440 this is the optimal estimate of Theta. 110 00:06:13,440 --> 00:06:16,670 Once we adopt this particular estimate, 111 00:06:16,670 --> 00:06:19,650 the mean squared error is going to be 112 00:06:19,650 --> 00:06:22,680 equal to this, because this is our theta hat. 113 00:06:22,680 --> 00:06:25,980 And, of course, we recognize that this is the variance. 114 00:06:25,980 --> 00:06:29,790 So the variance of Theta is the least possible value 115 00:06:29,790 --> 00:06:32,120 of the mean squared error that we 116 00:06:32,120 --> 00:06:35,500 can obtain using any particular estimate. 117 00:06:35,500 --> 00:06:38,250 And this is our final conclusion. 118 00:06:38,250 --> 00:06:42,230 And in the next segment, we will exploit the conclusions 119 00:06:42,230 --> 00:06:44,720 that we found here and apply them 120 00:06:44,720 --> 00:06:47,150 to a more general situation.