In this segment, we discuss the mean squared error in a little more detail.

Consider some estimator, theta hat. It can be any estimator, not just the sample mean. We can decompose its mean squared error, E[(theta hat - theta)^2], as a sum of two terms. Where does this formula come from? Well, we know that for any random variable Z, the formula E[Z^2] = var(Z) + (E[Z])^2 is valid. And if we let Z be equal to the difference between the estimator and the value that we're trying to estimate, Z = theta hat - theta, then we obtain the decomposition E[(theta hat - theta)^2] = var(theta hat - theta) + (E[theta hat - theta])^2: the expected value of our random variable Z squared is equal to the variance of that random variable plus the square of its mean.

Let us now rewrite these two terms in a more suggestive way. We first notice that theta is a constant. When you add or subtract a constant from a random variable, the variance does not change. So the first term is the same as the variance of theta hat. The quantity E[theta hat] - theta we will call the bias of the estimator. It tells us whether theta hat is systematically above or below the unknown parameter theta that we're trying to estimate. And using this terminology, the second term is just equal to the square of the bias. So the mean squared error consists of two components, and these capture different aspects of an estimator's performance.

Let us see what they are in a concrete setting. Suppose that we're estimating the unknown mean of some distribution, and that our estimator is the sample mean. In this case, the mean squared error is the variance, which we know to be sigma squared over n, plus the bias term. But we know that the sample mean is unbiased: the expected value of the sample mean is equal to the unknown mean. And so the bias contribution is equal to zero.
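To make this concrete, here is a minimal simulation sketch (not from the lecture; it assumes a normal population and illustrative values of theta, sigma, and n) that estimates the mean squared error, the variance, and the squared bias of the sample mean empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

theta, sigma, n = 2.0, 3.0, 25   # illustrative true mean, standard deviation, sample size
trials = 100_000                 # number of simulated datasets

# One sample-mean estimate per simulated dataset of size n.
estimates = rng.normal(theta, sigma, size=(trials, n)).mean(axis=1)

mse      = np.mean((estimates - theta) ** 2)    # E[(theta_hat - theta)^2]
variance = np.var(estimates)                    # var(theta_hat)
bias_sq  = (np.mean(estimates) - theta) ** 2    # (E[theta_hat] - theta)^2

# mse and variance + bias_sq agree up to simulation noise, and both are
# close to sigma^2 / n = 0.36, with bias_sq essentially zero.
print(mse, variance + bias_sq, sigma ** 2 / n)
```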
Now, for the sake of a comparison, let us consider a somewhat silly estimator which ignores the data altogether, and always gives you an estimate of zero. In this case, the mean squared error is as follows. Since our estimator is just a constant, its variance is going to be equal to zero. On the other hand, since theta hat is zero, the bias term is just the constant theta, squared. And this gives us the corresponding mean squared error: theta squared.

Let us now compare the two estimators. We will plot the mean squared error as a function of the unknown parameter, theta. For the sample mean estimator, the mean squared error is constant: it does not depend on theta, and is equal to sigma squared over n. On the other hand, for the zero estimator, the mean squared error is equal to theta squared.

How do they compare? Which one is better? At this point, there is no way to say that one is better than the other. For some theta, the sample mean has a smaller mean squared error. But for other theta, the zero estimator has a smaller mean squared error. And we do not know where the true value of theta is. It could be anything. So we cannot say that one is better than the other.
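The comparison is easy to work out numerically. In this sketch (using the same illustrative sigma and n as above), we tabulate the two mean squared errors for a few values of theta; the two curves cross where theta squared equals sigma squared over n, that is, at |theta| = sigma / sqrt(n):

```python
sigma, n = 3.0, 25                   # illustrative values
mse_sample_mean = sigma ** 2 / n     # constant in theta: 0.36

for theta in [0.0, 0.3, 0.5, 1.0, 2.0]:
    mse_zero = theta ** 2            # MSE of the always-zero estimator
    winner = "zero estimator" if mse_zero < mse_sample_mean else "sample mean"
    print(f"theta={theta:3.1f}: zero={mse_zero:4.2f}, "
          f"sample mean={mse_sample_mean:4.2f} -> {winner}")

# The crossover is at |theta| = sigma / sqrt(n) = 0.6: inside that interval
# the zero estimator wins, outside it the sample mean wins.
```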
Of course, we know that the sample mean is a consistent estimator. As n goes to infinity, it will give you the true value of theta. And this is a very desirable property that the zero estimator does not have. But if n is moderate, the situation is less clear.

If we have some good reason to expect that the true value of theta is somewhere in the vicinity of zero, then the zero estimator might be a better one, because it will then achieve a smaller mean squared error. But in a classical statistical framework, there is no way to express a belief of this kind. In contrast, if we were following a Bayesian approach, we could provide a prior distribution for theta that would be highly concentrated around zero. This would express our beliefs about theta, and would provide us with guidance to choose between the two estimators, or maybe suggest an even better estimator.

In any case, going back to the decomposition, the variance of the estimator plays an important role in the analysis of different estimators. A more intuitive variant of this quantity is its square root, which is the standard deviation of the estimator, and is usually called the standard error of the estimator.

We can interpret the standard error as follows. We have the true value of theta. Then on day one, we collect some data, we perform the estimation procedure, and we come up with an estimate. On day two, we do the same thing, but independently: we collect a new set of data, and we come up with another estimate. And so on. We do this many times. We use different data sets to come up with different estimates. And because of the randomness in the data, these estimates may be all over the place.

Well, the standard error tells us how spread out all these estimates will be. It is the standard deviation of this collection of estimates. Having a large standard error means that our estimation procedure is quite noisy: our estimates have some inherent randomness, and therefore a lack of accuracy. That is, they cannot be trusted too much. That's the case of a large standard error. Conversely, a small standard error tells us that the estimates will tend to be concentrated close to each other.

As such, the standard error is a very useful piece of information to have. Besides designing and implementing an estimator, one usually also tries to find a way to calculate and report the associated standard error.
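The day-by-day thought experiment is easy to simulate. Here is a minimal sketch (again assuming the illustrative normal model from before) in which each "day" produces a fresh dataset and a fresh sample-mean estimate, and the spread of those estimates is compared to the theoretical standard error, sigma / sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(1)

theta, sigma, n = 2.0, 3.0, 25    # same illustrative model as before
days = 10_000                     # independent "days" of data collection

# Each day: collect a fresh dataset of size n and report its sample mean.
estimates = rng.normal(theta, sigma, size=(days, n)).mean(axis=1)

empirical_se   = estimates.std()        # spread of the collection of estimates
theoretical_se = sigma / np.sqrt(n)     # sigma / sqrt(n) = 0.6

print(empirical_se, theoretical_se)     # both close to 0.6
```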