The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So for the last three lectures we're going to talk about classical statistics, the way statistics can be done if you don't want to assume a prior distribution on the unknown parameters. Today we're going to focus mostly on the estimation side and leave hypothesis testing for the next two lectures. There is one generic method that one can use to carry out parameter estimation, and that's the maximum likelihood method. We're going to define what it is. Then we will look at the most common estimation problem there is, which is to estimate the mean of a given distribution. And we're going to talk about confidence intervals, which refers to providing an interval around your estimate that has some property of the kind that the parameter is highly likely to be inside that interval. But we will be careful about how to interpret that particular statement.

OK. So the big framework first. The picture is almost the same as the one that we had in the case of Bayesian statistics. We have some unknown parameter. And we have a measuring device. There is some noise, some randomness. And we get an observation, X, whose distribution depends on the value of the parameter. However, the big change from the Bayesian setting is that here, this parameter is just a number. It's not modeled as a random variable. It does not have a probability distribution. There's nothing random about it. It's a constant. It just happens that we don't know what that constant is. And in particular, this probability distribution here, the distribution of X, depends on Theta. But this is not a conditional distribution in the usual sense of the word.
Conditional distributions were defined when we had two random variables and we conditioned one random variable on the other. And we used the bar to separate the X from the Theta. To make the point that this is not a conditional distribution, we use a different notation. We put a semicolon here. And what this is meant to say is that X has a distribution. That distribution has a certain parameter. And we don't know what that parameter is. So for example, this might be a normal distribution with variance 1 but a mean Theta. We don't know what Theta is. And we want to estimate it.

Now once we have this setting, your job is to design this box, the estimator. The estimator is some data processing box that takes the measurements and produces an estimate of the unknown parameter. Now the notation that's used here is as if X and Theta were one-dimensional quantities. But actually, everything we say remains valid if you interpret X and Theta as vectors of parameters. So for example, you may obtain several measurements, X1 up to Xn. And there may be several unknown parameters in the background.

Once more, we do not have, and we do not want to assume, a prior distribution on Theta. It's a constant. And if you want to think mathematically about this situation, it's as if you have many different probabilistic models. So a normal with this mean, or a normal with that mean, or a normal with that mean: these are alternative candidate probabilistic models. And we want to try to make a decision about which one is the correct model. In some cases, we have to choose just between a small number of models. For example, you have a coin with an unknown bias. The bias is either 1/2 or 3/4. You're going to flip the coin a few times. And you try to decide whether the true bias is this one or is that one. So in this case, we have two specific, alternative probabilistic models between which we want to distinguish. But sometimes things are a little more complicated.
For example, you have a coin. And you have one hypothesis that my coin is unbiased. And the other hypothesis is that my coin is biased. And you do your experiments. And you want to come up with a decision that decides whether this one is true or that one is true. In this case, we're not dealing with just two alternative probabilistic models. This one is a specific model for the coin. But this one actually corresponds to lots of possible, alternative coin models. So this includes the model where Theta is 0.6, the model where Theta is 0.7, Theta is 0.8, and so on. So we're trying to discriminate between one model and lots of alternative models. How does one go about this? Well, there are some systematic ways that one can approach problems of this kind. And we will start talking about these next time.

So today, we're going to focus on estimation problems. In estimation problems, Theta is a quantity which is a real number, a continuous parameter. We're going to design this box, so that what we get out of this box is an estimate. Now notice that this estimate here is a random variable. Even though Theta is deterministic, this is random, because it's a function of the data that we observe. The data are random. We're applying a function to the data to construct our estimate. So, since it's a function of random variables, it's a random variable itself. The distribution of Theta hat depends on the distribution of X. The distribution of X is affected by Theta. So in the end, the distribution of your estimate Theta hat will also be affected by whatever Theta happens to be.

Our general objective, when designing estimators, is that we want to get, in the end, an estimation error which is not too large. But we'll have to make specific, again, exactly what we mean by that. So how do we go about this problem?
One general approach is to pick a Theta under which the data that we observed, that is, the X's, are most likely to have occurred. So I observe X. For any given Theta, I can calculate this quantity, which tells me, under this particular Theta, the X that you observed had this probability of occurring. Under that Theta, the X that you observed had that probability of occurring. You just choose the Theta that makes the data that you observed most likely.

It's interesting to compare this maximum likelihood estimate with the estimate that you would have if you were in a Bayesian setting and you were using maximum a posteriori probability estimation. In the Bayesian setting, what we do is, given the data, we use the prior distribution on Theta. And we calculate the posterior distribution of Theta given X. Notice that this is sort of the opposite of what we have here. This is the probability of X for a particular value of Theta, whereas this is the probability of Theta for a particular X. So it's the opposite type of conditioning. In the Bayesian setting, Theta is a random variable. So we can talk about the probability distribution of Theta.

So how do these two compare, except for this syntactic difference that the order of the X's and Theta's is reversed? Let's write down, in full detail, what this posterior distribution of Theta is. By the Bayes rule, this conditional distribution is obtained from the prior and the model of the measurement process that we have. And we get to this expression. So in Bayesian estimation, we want to find the most likely value of Theta. And we need to maximize this quantity over all possible Theta's. First thing to notice is that the denominator is a constant. It does not involve Theta. So when you maximize this quantity, you don't care about the denominator. You just want to maximize the numerator. Now, here, things start to look a little more similar.
And they would be exactly of the same kind if that term here, the prior, was absent. The two are going to become the same if that prior is just a constant. So if that prior is a constant, then maximum likelihood estimation takes exactly the same form as Bayesian maximum posterior probability estimation. So you can give this particular interpretation to maximum likelihood estimation. Maximum likelihood estimation is essentially what you would have done if you were in a Bayesian world and you had assumed a prior on the Theta's that's uniform, all the Theta's being equally likely.

Okay. So let's look at a simple example. Suppose that the Xi's are independent, identically distributed random variables with a certain parameter Theta. So the distribution of each one of the Xi's is this particular term. So Theta is one-dimensional. It's a one-dimensional parameter. But we have several data. We write down the formula for the probability of a particular X vector, given a particular value of Theta. But again, when I use the word "given" here, it's not in the conditioning sense. It's the value of the density for a particular choice of Theta. Here, I wrote down, I defined maximum likelihood estimation in terms of PMFs. That's what you would do if the X's were discrete random variables. Here, the X's are continuous random variables, so I'm using the PDF instead of the PMF. So the definition here generalizes to the case of continuous random variables. And you use f's instead of p's, our usual recipe. So the maximum likelihood estimate is defined.

Now, since the Xi's are independent, the joint density of all the X's together is the product of the individual densities. So you look at this quantity. This is the density, or sort of probability, of observing a particular sequence of X's. And we ask the question, what's the value of Theta that makes the X's that we observe most likely?
So we want to carry out this maximization. Now this maximization is just a calculational problem. We're going to do this maximization by taking the logarithm of this expression. Maximizing an expression is the same as maximizing its logarithm. So take the logarithm of this expression: the logarithm of a product is the sum of the logarithms. You get contributions from this Theta term. There's n of these, so we get an n log Theta. And then we have the sum of the logarithms of these terms, which gives us minus Theta times the sum of the X's. So we need to maximize this expression with respect to Theta. The way to do this maximization is to take the derivative with respect to Theta and set it to zero. And you get n over Theta equal to the sum of the X's. And then you solve for Theta. And you find that the maximum likelihood estimate is this quantity. Which sort of makes sense, because this is the reciprocal of the sample mean of the X's. And for an exponential distribution, we know that Theta is 1 over the mean of the distribution. So it looks like a reasonable estimate. So in any case, this is the estimate that the maximum likelihood estimation procedure tells us we should report.

This formula here, of course, tells you what to do if you have already observed specific numbers. If you have observed specific numbers, then you report this particular number as your estimate of Theta. If you want to describe your estimation procedure more abstractly, what you have constructed is an estimator, which is a box that takes in the random variables, capital X1 up to capital Xn, and produces your estimate, which is also a random variable, because it's a function of these random variables, and is denoted by an uppercase Theta to indicate that this is now a random variable. So this is an equality about numbers. And this is a description of the general procedure, which is an equality between two random variables.
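As a quick illustration of this exponential example, here is a minimal sketch (my own, not from the lecture) that computes the maximum likelihood estimate n divided by the sum of the x's from simulated data; the parameter values and the use of NumPy are arbitrary choices:

```python
import numpy as np

# Hypothetical setup: draw n i.i.d. samples from an exponential
# distribution with true rate theta = 2.0. NumPy parameterizes the
# exponential by its mean, which is 1/theta.
rng = np.random.default_rng(seed=0)
true_theta = 2.0
n = 1000
x = rng.exponential(scale=1.0 / true_theta, size=n)

# The estimate derived in the lecture: maximize n*log(theta) - theta*sum(x),
# which gives theta_hat = n / sum(x), the reciprocal of the sample mean.
theta_hat = n / np.sum(x)
print(theta_hat)  # should be close to 2.0 for large n
```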
This second description, the one in terms of random variables, gives you the more abstract view of what we're doing here.

All right. So what can we tell about our estimate? Is it good or is it bad? We should look at this particular random variable and talk about the statistical properties that it has. What we would like is for this random variable to be close to the true value of Theta, with high probability, no matter what Theta is, since we don't know what Theta is. Let's make the properties that we want a little more specific.

So we cook up the estimator somehow. This estimator corresponds, again, to a box that takes data in, the capital X's, and produces an estimate Theta hat. This estimate is random. Sometimes it will be above the true value of Theta. Sometimes it will be below. Ideally, we would like it to not have a systematic error, on the positive side or the negative side. So a reasonable wish to have, for a good estimator, is that, on the average, it gives you the correct value.

Now here, let's be a little more specific about what that expectation is. This is an expectation with respect to the probability distribution of Theta hat. The probability distribution of Theta hat is affected by the probability distribution of the X's, because Theta hat is a function of the X's. And the probability distribution of the X's is affected by the true value of Theta. So depending on which one is the true value of Theta, this is going to be a different expectation. So if you were to write this expectation out in more detail, it would look something like this. You need to write down the probability distribution of Theta hat. And this is going to be some function. But this function depends on the true Theta, is affected by the true Theta. And then you integrate this with respect to Theta hat. What's the point here? Again, Theta hat is a function of the X's. So the density of Theta hat is affected by the density of the X's.
The density of the X's is affected by the true value of Theta. So the distribution of Theta hat is affected by the value of Theta. Another way to put it is, as I mentioned a few minutes ago, in this business, it's as if we are considering different possible probabilistic models, one probabilistic model for each choice of Theta. And we're trying to guess which one of these probabilistic models is the true one. One way of emphasizing the fact that this expression depends on the true Theta is to put a little subscript here: expectation under the particular value of the parameter Theta. So depending on what value the true parameter Theta takes, this expectation will have a different value. And what we would like is that, no matter what the true value is, our estimate will not have a bias on the positive or the negative side. So this is a property that's desirable. Is it always going to be true? Not necessarily. It depends on what estimator we construct.

Is it true for our exponential example? Unfortunately not. The estimate that we have in the exponential example turns out to be biased. And one extreme way of seeing this is to consider the case where our sample size is 1. We're trying to estimate Theta. And the estimator from the previous slide, in that case, is just 1/X1. Now X1 has a fair amount of density in the vicinity of 0, which means that 1/X1 has significant probability of being very large. And if you do the calculation, this ultimately makes the expected value of 1/X1 infinite. Now infinity is definitely not the correct value. So our estimate is biased upwards. And it's actually biased a lot upwards. So that's how things are. Maximum likelihood estimates, in general, will be biased. But under some conditions, they will turn out to be asymptotically unbiased.
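As an aside, here is a minimal simulation sketch (my own, with arbitrary parameters, not from the lecture) that makes the upward bias of the exponential maximum likelihood estimate visible for a small sample size:

```python
import numpy as np

# Hypothetical check of the bias of the exponential MLE for small n.
# With X_i ~ Exp(theta), the estimate n / sum(X_i) tends to overshoot.
rng = np.random.default_rng(seed=0)
true_theta = 2.0
n = 5                 # small sample size, where the bias is pronounced
num_trials = 100_000

x = rng.exponential(scale=1.0 / true_theta, size=(num_trials, n))
theta_hat = n / x.sum(axis=1)

# The average estimate exceeds the true value of 2.0; one can show
# E[theta_hat] = n * theta / (n - 1), which is 2.5 here.
print(theta_hat.mean())
```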
Asymptotically unbiased means that, as you get more and more data, as your X vector gets longer and longer, with independent data, the expected value of your estimator is going to get closer and closer to the true value. So you do have some nice asymptotic properties, but we're not going to prove anything like this.

Speaking of asymptotic properties, in general, what we would like to have is that, as you collect more and more data, you get the correct answer, in some sense. And the sense that we're going to use here is the limiting sense of convergence in probability, since this is the only notion of convergence of random variables that we have in our hands. This is similar to what we had in the pollster problem, for example. If we had a bigger and bigger sample size, we could be more and more confident that the estimate that we obtained is close to the unknown true parameter of the distribution that we have. So this is a desirable property: if you have an infinitely large amount of data, you should be able to estimate an unknown parameter more or less exactly. So this is a desirable property of estimators. It turns out that maximum likelihood estimation, given independent data, does have this property, under mild conditions. So maximum likelihood estimation, in this respect, is a good approach.

So let's see, do we have this consistency property in our exponential example? In our exponential example, we used this quantity to estimate the unknown parameter Theta. What properties does this quantity have as n goes to infinity? Well, this quantity is the reciprocal of that quantity up here, which is the sample mean. We know from the weak law of large numbers that the sample mean converges to the expectation. So this property here comes from the weak law of large numbers. In probability, this quantity converges to the expected value, which, for exponential distributions, is 1/Theta.
Now, if something converges to something, then the reciprocal should converge to the reciprocal. That's a property that's certainly correct for numbers. But we're not talking about convergence of numbers. We're talking about convergence in probability, which is a more complicated notion. Fortunately, it turns out that the same thing is true when we deal with convergence in probability. One can show, although we will not bother doing this, that indeed the reciprocal of this, which is our estimate, converges in probability to the reciprocal of that. And that reciprocal is the true parameter Theta. So for this particular exponential example, we do have the desirable property that, as the number of data becomes larger and larger, the estimate that we have constructed will get closer and closer to the true parameter value. And this is true no matter what Theta is. No matter what the true parameter Theta is, we're going to get close to it as we collect more data.

Okay. So these are two rough qualitative properties that would be nice to have. If you want to get a little more quantitative, you can start looking at the mean squared error that your estimator gives. Now, once more, the comment I was making up there applies. Namely, this expectation here is an expectation with respect to the probability distribution of Theta hat that corresponds to a particular value of little theta. So fix a little theta. Write down this expression. Look at the probability distribution of Theta hat under that little theta. And do this calculation. You're going to get some quantity that depends on the little theta. And so all quantities in this equality here should be interpreted as quantities under that particular value of little theta. So if you wanted to make this more explicit, you could start throwing little subscripts everywhere in those expressions. And let's see what those expressions tell us.
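For reference, the identity being developed on the board can presumably be written out as follows (this is the standard bias-variance decomposition of the mean squared error, with the subscript theta making the fixed parameter explicit):

```latex
\mathbb{E}_\theta\!\left[(\hat{\Theta} - \theta)^2\right]
  = \mathrm{var}_\theta(\hat{\Theta} - \theta)
    + \left(\mathbb{E}_\theta[\hat{\Theta} - \theta]\right)^2
  = \mathrm{var}_\theta(\hat{\Theta}) + b(\theta)^2,
\qquad b(\theta) = \mathbb{E}_\theta[\hat{\Theta}] - \theta .
```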
The expected value of the square of a random variable, we know, is always equal to the variance of this random variable plus the square of the expectation of the random variable. This equality here is just our familiar formula, that the expected value of X squared is the variance of X plus the square of the expected value of X. So we apply this formula to X equal to Theta hat minus Theta.

Now, remember that, in this classical setting, Theta is just a constant. We have fixed Theta. We want to calculate the variance of this quantity under that particular Theta. When you add or subtract a constant to a random variable, the variance doesn't change. So this is the same as the variance of our estimator. And what we've got here is the bias of our estimator. It tells us, on the average, whether we fall above or below. And we're taking the bias squared, b squared. If we have an unbiased estimator, the bias term will be 0.

So ideally we want Theta hat to be very close to Theta. And since Theta is a constant, if that happens, the variance of Theta hat will be very small. So Theta is a constant. If Theta hat has a distribution that's concentrated just around little theta, then Theta hat will have a small variance. So this is one desire that we have: we want a small variance. But we also want to have a small bias at the same time. So the general form of the mean squared error has two contributions. One is the variance of our estimator. The other is the bias. And one usually wants to design an estimator that simultaneously keeps both of these terms small.

So here's an estimation method that would do very well with respect to this term, but badly with respect to that term. Suppose that my distribution is, let's say, normal with an unknown mean Theta and variance 1. And I use as my estimator something very dumb.
I always produce an estimate that says: my estimate is 100. So I'm just ignoring the data and reporting 100. What does this do? The variance of my estimator is 0. There's no randomness in the estimate that I report. But the bias is going to be pretty bad. The bias is going to be Theta hat, which is 100, minus the true value of Theta. And for some Theta's, my bias is going to be horrible. If my true Theta happens to be 0, my bias squared is a huge term. And I get a large error. So what's the moral of this example? There are ways of making the variance very small, but, in those cases, you pay a price in the bias. So you want to do something a little more delicate, where you try to keep both terms small at the same time.

These types of considerations become important when you start to try to design sophisticated estimators for more complicated problems. But we will not do this in this class. This belongs to further classes on statistics and inference. For this class, for parameter estimation, we will basically stick to two very simple methods. One is the maximum likelihood method we've just discussed. And the other method is what you would do if you were still in high school and didn't know any probability. You get data. And these data come from some distribution with an unknown mean. And you want to estimate the unknown mean. What would you do? You would just take those data and average them out.

So let's make this a little more specific. We have X's that come from a given distribution. We know the general form of the distribution, perhaps. We do know, perhaps, the variance of that distribution, or, perhaps, we don't know it. But we do not know the mean. And we want to estimate the mean of that distribution. Now, we can represent this situation in a different form.
The Xi's are equal to Theta, the mean, plus a zero-mean random variable that you can think of as noise. So this corresponds to the usual situation you would have in a lab, where you go and try to measure an unknown quantity. You get lots of measurements. But each time you measure, your measurements have some extra noise in there. And you want to get rid of that noise. The way to try to get rid of the measurement noise is to collect lots of data and average them out. This is the sample mean. And this is a very, very reasonable way of trying to estimate the unknown mean of the X's.

So this is the sample mean. It's a reasonable, plausible, and in general pretty good estimator of the unknown mean of a certain distribution. We can apply this estimator without really knowing a lot about the distribution of the X's. Actually, we don't need to know anything about the distribution. We can still apply it, because the variance, for example, does not show up here. We don't need to know the variance to calculate that quantity.

Does this estimator have good properties? Yes, it does. What's the expected value of the sample mean? The expectation of this is the expectation of the sum divided by n. The expected value of each one of the X's is Theta. So the expected value of the sample mean is just Theta itself. So our estimator is unbiased. No matter what Theta is, our estimator does not have a systematic error in either direction. Furthermore, the weak law of large numbers tells us that this quantity converges to the true parameter in probability. So it's a consistent estimator. This is good. And you can also calculate the mean squared error corresponding to this estimator. Remember how we defined the mean squared error? It's this quantity. This is a calculation that we have done a fair number of times by now. The mean squared error is the variance of the distribution of the X's divided by n.
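Here is a minimal sketch (my own, not from the lecture) that checks these two claims numerically: the sample mean is unbiased, and its mean squared error is sigma squared over n. The distribution, true mean, and sample size are arbitrary choices:

```python
import numpy as np

# Hypothetical check of the sample mean's properties.
rng = np.random.default_rng(seed=0)
true_theta = 2.37
sigma = 1.5
n = 50
num_trials = 200_000

# Each row is one experiment: n noisy measurements of true_theta.
x = true_theta + sigma * rng.standard_normal(size=(num_trials, n))
sample_means = x.mean(axis=1)

print(sample_means.mean())            # close to 2.37: unbiased
mse = np.mean((sample_means - true_theta) ** 2)
print(mse, sigma**2 / n)              # both close to sigma^2 / n = 0.045
```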
So as we get more and more data, the mean squared error goes down to 0. In some examples, it turns out that the sample mean is also the same as the maximum likelihood estimate. For example, if the X's are coming from a normal distribution, you can write down the likelihood and do the maximization with respect to Theta, and you'll find that the maximum likelihood estimate is the same as the sample mean. In other cases, the sample mean will be different from the maximum likelihood estimate. And then you have a choice about which one of the two you would use. Probably, in most reasonable situations, you would just use the sample mean, because it's simple, easy to compute, and has nice properties.

All right. So you go to your boss. And you report and say, OK, I did all my experiments in the lab. And the average value that I got is a certain number, 2.37. So is that informative to your boss? Well, your boss would like to know how much they can trust this number, 2.37. I know that the true value is not going to be exactly that. But how close should it be? So give me a range of what you think are possible values of Theta.

So the situation is like this. Suppose that we observe X's that are coming from a certain distribution. And we're trying to estimate the mean. We get our data. Maybe our data look something like this. You calculate the sample mean. So let's suppose that the sample mean is a number, which for some reason we take to be 2.37. But you want to convey something to your boss about how spread out these data were. So the boss asks you to give him or her some kind of interval in which Theta, the true parameter, might lie. So the boss asked you for an interval. So what you do is you end up reporting an interval. And you somehow use the data that you have seen to construct this interval.
And you report to your boss the endpoints of this interval as well. Let's give names to these endpoints, Theta_n- and Theta_n+. The n's here just play the role of keeping track of how many data we're using. So what you report to your boss is this interval as well.

Are these Theta's here, the endpoints of the interval, lowercase or uppercase? What should they be? Well, you construct these intervals after you see your data. You take the data into account to construct your interval. So these definitely should depend on the data. And therefore they are random variables. Same thing with your estimator: in general, it's going to be a random variable. Although, when you go and report numbers to your boss, you give the specific realizations of the random variables, given the data that you got.

So instead of having just a single box that produces estimates (our previous picture was that you have your estimator that takes X's and produces Theta hats), now our box will also be producing a Theta hat minus and a Theta hat plus. It's going to produce an interval as well. The X's are random, therefore these quantities are random. Once you go and do the experiment and obtain your data, then your data will be some lowercase x, specific numbers. And then your estimates and estimator become also lowercase.

What would we like this interval to do? We would like it to be highly likely to contain the true value of the parameter. So we might impose some specs of the following kind. I pick a number, alpha. Think of that alpha as a probability of a large error. A typical value of alpha might be 0.05, in which case this number here is 0.95. And you're given specs that say something like this. I would like, with probability at least 0.95, this to happen, which says that the true parameter lies inside the confidence interval. Now let's try to interpret this statement.
Suppose that you did the experiment, and you ended up reporting to your boss a confidence interval from 1.97 to 2.56. That's what you report to your boss. And suppose that the confidence interval has this property. Can you go to your boss and say, with probability 95%, the true value of Theta is between these two numbers? Is that a meaningful statement?

So the tentative statement is: with probability 95%, the true value of Theta is between 1.97 and 2.56. Well, what is random in that statement? There's nothing random. The true value of theta is a constant. 1.97 is a number. 2.56 is a number. So it doesn't make any sense to talk about the probability that theta is in this interval. Either theta happens to be in that interval, or it happens to not be. But there are no probabilities associated with this, because theta is not random. Syntactically, you can see this, because theta here is lowercase.

So what kind of probabilities are we talking about here? Where's the randomness? Well, the random thing is the interval. It's not theta. So the statement that is being made here is that the interval that's being constructed by our procedure should have the property that, with probability 95%, it's going to fall on top of the true value of theta.

So the right way of interpreting what the 95% confidence interval is, is something like the following. We have the true value of theta, which we don't know. I get data. Based on the data, I construct a confidence interval. I got lucky, and the true value of theta is in here. Next day, I do the same experiment, take my data, construct a confidence interval. And I get this confidence interval, lucky once more. Next day, I get data. I use my data to come up with an estimate of theta and a confidence interval. That day, I was unlucky.
And I got a confidence interval out there. What the requirement here is, is that on 95% of the days on which we use this particular procedure for constructing confidence intervals, 95% of those days, we will be lucky, and we will capture the correct value of theta inside our confidence interval. So it's a statement about the distribution of these random confidence intervals: how likely they are to fall on top of the true theta, as opposed to how likely they are to fall outside. So it's a statement about probabilities associated with the confidence interval. They're not probabilities about theta, because theta, itself, is not random. So this is what the confidence interval is, in general, and how we interpret it.

How do we construct a 95% confidence interval? Let's go through this exercise in a particular example. The calculations are exactly the same as the ones that you did when we talked about laws of large numbers and the central limit theorem. So there's nothing new calculationally, but it's, perhaps, new in terms of the language that we use and the interpretation.

So we got our sample mean from some distribution. And we would like to calculate a 95% confidence interval. We know from the normal tables that the standard normal has 2.5% probability on the tail beyond 1.96. Yes, by this time, the number 1.96 should be pretty familiar. So if this probability here is 2.5%, this number here is 1.96. Now look at this random variable here. This is the sample mean's difference from the true mean, normalized by the usual normalizing factor. By the central limit theorem, this is approximately normal. So it has probability 0.95 of being less than 1.96 in absolute value. Now take this event here and rewrite it. This is the event that Theta hat minus theta is bigger than this number and smaller than that number. This event here is equivalent to that event here. And so this suggests a way of constructing our 95% confidence interval.
788 00:39:52,130 --> 00:39:56,330 I'm going to report the interval, which gives this as 789 00:39:56,330 --> 00:40:00,350 the lower end of the confidence interval, and gives 790 00:40:00,350 --> 00:40:05,720 this as the upper end of the confidence interval. 791 00:40:05,720 --> 00:40:09,180 In other words, at the end of the experiment, we report the 792 00:40:09,180 --> 00:40:12,170 sample mean, which is our estimate. 793 00:40:12,170 --> 00:40:14,230 And we report, also, an interval 794 00:40:14,230 --> 00:40:16,080 around the sample mean. 795 00:40:16,080 --> 00:40:20,510 And this is our 95% confidence interval. 796 00:40:20,510 --> 00:40:22,800 The confidence interval becomes 797 00:40:22,800 --> 00:40:26,050 smaller when n is larger. 798 00:40:26,050 --> 00:40:28,950 In some sense, we're more certain that we're doing a 799 00:40:28,950 --> 00:40:32,390 good estimation job, so we can have a small interval and 800 00:40:32,390 --> 00:40:36,000 still be quite confident that our interval captures the true 801 00:40:36,000 --> 00:40:37,520 value of the parameter. 802 00:40:37,520 --> 00:40:41,890 Also, if our data have very little noise, when you have 803 00:40:41,890 --> 00:40:45,060 more accurate measurements, you're more confident that 804 00:40:45,060 --> 00:40:47,220 your estimate is pretty good. 805 00:40:47,220 --> 00:40:51,120 And that results in a smaller confidence interval, a smaller 806 00:40:51,120 --> 00:40:52,610 length of the confidence interval. 807 00:40:52,610 --> 00:40:56,040 And still you have 95% probability of capturing the 808 00:40:56,040 --> 00:40:57,650 true value of theta. 809 00:40:57,650 --> 00:41:01,660 So we did this exercise by taking 95% confidence 810 00:41:01,660 --> 00:41:04,010 intervals and the corresponding value from the 811 00:41:04,010 --> 00:41:06,670 normal tables, which is 1.96. 812 00:41:06,670 --> 00:41:11,390 Of course, you can do it more generally, if you set your 813 00:41:11,390 --> 00:41:13,730 alpha to be some other number. 814 00:41:13,730 --> 00:41:16,590 Again, you look at the normal tables. 815 00:41:16,590 --> 00:41:20,460 And you find the value here, so that the tail has 816 00:41:20,460 --> 00:41:22,640 probability alpha over 2. 817 00:41:22,640 --> 00:41:26,790 And instead of using this 1.96, you use whatever number 818 00:41:26,790 --> 00:41:31,380 you get from the normal tables. 819 00:41:31,380 --> 00:41:33,520 And this tells you how to construct 820 00:41:33,520 --> 00:41:36,680 a confidence interval. 821 00:41:36,680 --> 00:41:42,060 Well, to be exact, this is not necessarily a 822 00:41:42,060 --> 00:41:44,640 95% confidence interval. 823 00:41:44,640 --> 00:41:47,540 It's approximately a 95% confidence interval. 824 00:41:47,540 --> 00:41:48,950 Why is this? 825 00:41:48,950 --> 00:41:51,060 Because we've done an approximation. 826 00:41:51,060 --> 00:41:53,890 We have used the central limit theorem. 827 00:41:53,890 --> 00:41:59,990 So it might turn out to be a 95.5% confidence interval 828 00:41:59,990 --> 00:42:03,220 instead of 95%, because our calculations are 829 00:42:03,220 --> 00:42:04,740 not entirely accurate. 830 00:42:04,740 --> 00:42:08,230 But for reasonable values of n, using the central limit 831 00:42:08,230 --> 00:42:10,190 theorem is a good approximation. 832 00:42:10,190 --> 00:42:13,330 And that's what people almost always do. 833 00:42:13,330 --> 00:42:17,350 So just take the value from the normal tables. 834 00:42:17,350 --> 00:42:18,600 Okay, except for one catch.
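[Before getting to that catch, here is a minimal simulation sketch of the procedure and of the repeated-experiments interpretation above. It assumes normally distributed data with a known sigma; all names and numbers are illustrative, not from the lecture.]

```python
# Minimal sketch: repeat the experiment many "days", build a 95%
# confidence interval each day, and count how often the interval
# covers the true theta.  Assumes normal data with a KNOWN sigma;
# the case of unknown sigma is discussed next.
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 2.0, 1.0, 100    # true mean, known std, sample size
num_days = 10_000                  # number of repetitions of the experiment
z = 1.96                           # from the standard normal table

covered = 0
for _ in range(num_days):
    x = rng.normal(theta, sigma, size=n)
    theta_hat = x.mean()                    # the estimate: the sample mean
    half_width = z * sigma / np.sqrt(n)     # half-length of the interval
    if theta_hat - half_width <= theta <= theta_hat + half_width:
        covered += 1                        # a "lucky" day

print(covered / num_days)  # should come out close to 0.95
```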
835 00:42:18,600 --> 00:42:22,830 836 00:42:22,830 --> 00:42:24,590 I used the data. 837 00:42:24,590 --> 00:42:26,440 I obtained my estimate. 838 00:42:26,440 --> 00:42:29,830 And I want to go to my boss and report this interval, from 839 00:42:29,830 --> 00:42:33,010 theta hat minus this to theta hat plus this, which is the confidence interval. 840 00:42:33,010 --> 00:42:35,720 What's the difficulty? 841 00:42:35,720 --> 00:42:37,540 I know what n is. 842 00:42:37,540 --> 00:42:40,790 But I don't know what sigma is, in general. 843 00:42:40,790 --> 00:42:44,750 So if I don't know sigma, what am I going to do? 844 00:42:44,750 --> 00:42:48,980 Here, there are a few options for what you can do. 845 00:42:48,980 --> 00:42:52,910 And the first option is familiar from what we did when 846 00:42:52,910 --> 00:42:55,020 we talked about the pollster problem. 847 00:42:55,020 --> 00:42:58,480 We don't know what sigma is, but maybe we have an upper 848 00:42:58,480 --> 00:43:00,030 bound on sigma. 849 00:43:00,030 --> 00:43:03,540 For example, if the Xi's are Bernoulli random variables, we 850 00:43:03,540 --> 00:43:06,910 have seen that the standard deviation is at most 1/2. 851 00:43:06,910 --> 00:43:10,220 So use the most conservative value for sigma. 852 00:43:10,220 --> 00:43:13,520 Using the most conservative value means that you take 853 00:43:13,520 --> 00:43:17,890 bigger confidence intervals than necessary. 854 00:43:17,890 --> 00:43:20,780 So that's one option. 855 00:43:20,780 --> 00:43:25,480 Another option is to try to estimate sigma from the data. 856 00:43:25,480 --> 00:43:27,630 How do you do this estimation? 857 00:43:27,630 --> 00:43:31,140 In special cases, for special types of distributions, you 858 00:43:31,140 --> 00:43:34,180 can think of heuristic ways of doing this estimation. 859 00:43:34,180 --> 00:43:38,390 For example, in the case of Bernoulli random variables, we 860 00:43:38,390 --> 00:43:42,420 know that the true value of sigma, the standard deviation 861 00:43:42,420 --> 00:43:45,120 of a Bernoulli random variable, is the square root 862 00:43:45,120 --> 00:43:47,670 of theta times (1 minus theta), where theta is 863 00:43:47,670 --> 00:43:50,290 the mean of the Bernoulli. 864 00:43:50,290 --> 00:43:51,900 Try to use this formula. 865 00:43:51,900 --> 00:43:54,140 But theta is the thing we're trying to estimate in the 866 00:43:54,140 --> 00:43:54,760 first place. 867 00:43:54,760 --> 00:43:55,880 We don't know it. 868 00:43:55,880 --> 00:43:57,150 What do we do? 869 00:43:57,150 --> 00:44:00,850 Well, we have an estimate for theta, the estimate produced 870 00:44:00,850 --> 00:44:04,195 by our estimation procedure, the sample mean. 871 00:44:04,195 --> 00:44:05,670 So I obtain my data. 872 00:44:05,670 --> 00:44:06,540 873 00:44:06,540 --> 00:44:09,030 I produce the estimate theta hat. 874 00:44:09,030 --> 00:44:10,740 It's an estimate of the mean. 875 00:44:10,740 --> 00:44:14,770 Use that estimate in this formula to come up with an 876 00:44:14,770 --> 00:44:17,290 estimate of my standard deviation. 877 00:44:17,290 --> 00:44:20,210 And then use that standard deviation in the construction 878 00:44:20,210 --> 00:44:22,510 of the confidence interval, pretending 879 00:44:22,510 --> 00:44:24,180 that this is correct. 880 00:44:24,180 --> 00:44:29,050 Well, if the number of data points is large, then we know, from 881 00:44:29,050 --> 00:44:31,870 the law of large numbers, that theta hat is a pretty good 882 00:44:31,870 --> 00:44:33,130 estimate of theta.
883 00:44:33,130 --> 00:44:36,670 So sigma hat is going to be a pretty good estimate of sigma. 884 00:44:36,670 --> 00:44:42,380 So we're not making large errors by using this approach. 885 00:44:42,380 --> 00:44:47,980 So in this scenario here, things were simple, because we 886 00:44:47,980 --> 00:44:49,890 had an analytical formula. 887 00:44:49,890 --> 00:44:52,210 Sigma was determined by theta. 888 00:44:52,210 --> 00:44:54,420 So we could come up with a quick and 889 00:44:54,420 --> 00:44:57,340 dirty estimate of sigma. 890 00:44:57,340 --> 00:45:00,940 In general, if you do not have any nice formulas of this 891 00:45:00,940 --> 00:45:03,000 kind, what could you do? 892 00:45:03,000 --> 00:45:04,920 Well, you still need to come up with an 893 00:45:04,920 --> 00:45:07,110 estimate of sigma somehow. 894 00:45:07,110 --> 00:45:08,950 What is a generic method for 895 00:45:08,950 --> 00:45:11,300 estimating a standard deviation? 896 00:45:11,300 --> 00:45:14,440 Equivalently, what could be a generic method for estimating 897 00:45:14,440 --> 00:45:16,920 a variance? 898 00:45:16,920 --> 00:45:19,360 Well, the variance is an expected value 899 00:45:19,360 --> 00:45:20,940 of some random variable. 900 00:45:20,940 --> 00:45:25,610 The variance is the mean of the random variable inside 901 00:45:25,610 --> 00:45:28,200 those brackets. 902 00:45:28,200 --> 00:45:33,160 How does one estimate the mean of some random variable? 903 00:45:33,160 --> 00:45:36,140 You obtain lots of measurements of that random 904 00:45:36,140 --> 00:45:40,210 variable and average them out. 905 00:45:40,210 --> 00:45:45,170 So this would be a reasonable way of estimating the variance 906 00:45:45,170 --> 00:45:47,310 of a distribution. 907 00:45:47,310 --> 00:45:50,590 And again, the weak law of large numbers tells us that 908 00:45:50,590 --> 00:45:55,370 this average converges to the expected value of this, which 909 00:45:55,370 --> 00:45:58,590 is just the variance of the distribution. 910 00:45:58,590 --> 00:46:01,700 So we got a nice and consistent way 911 00:46:01,700 --> 00:46:03,940 of estimating variances. 912 00:46:03,940 --> 00:46:08,100 But now, we seem to be getting into a vicious circle here, 913 00:46:08,100 --> 00:46:10,580 because to estimate the variance, we 914 00:46:10,580 --> 00:46:12,910 need to know the mean. 915 00:46:12,910 --> 00:46:16,075 And the mean is something we're trying to estimate in 916 00:46:16,075 --> 00:46:18,250 the first place. 917 00:46:18,250 --> 00:46:18,400 Okay. 918 00:46:18,400 --> 00:46:20,880 But we do have an estimate of the mean. 919 00:46:20,880 --> 00:46:24,640 So a reasonable approximation, once more, is to plug in, 920 00:46:24,640 --> 00:46:27,620 here, since we don't know the mean, the 921 00:46:27,620 --> 00:46:29,270 estimate of the mean. 922 00:46:29,270 --> 00:46:32,370 And so you get that expression, but with theta 923 00:46:32,370 --> 00:46:35,130 hat instead of theta itself. 924 00:46:35,130 --> 00:46:37,980 And this is another reasonable way of 925 00:46:37,980 --> 00:46:40,180 estimating the variance. 926 00:46:40,180 --> 00:46:42,940 It does have the same consistency properties. 927 00:46:42,940 --> 00:46:44,050 Why? 928 00:46:44,050 --> 00:46:51,100 When n is large, this is going to behave the same as that, 929 00:46:51,100 --> 00:46:53,640 because theta hat converges to theta. 930 00:46:53,640 --> 00:46:57,890 And when n is large, this is approximately the same as 931 00:46:57,890 --> 00:46:58,820 sigma squared.
932 00:46:58,820 --> 00:47:02,220 So for a large n, this quantity also converges to 933 00:47:02,220 --> 00:47:03,350 sigma squared. 934 00:47:03,350 --> 00:47:05,500 And we have a consistent estimate of 935 00:47:05,500 --> 00:47:07,000 the variance as well. 936 00:47:07,000 --> 00:47:09,490 And we can take that consistent estimate and use it 937 00:47:09,490 --> 00:47:12,360 back in the construction of the confidence interval. 938 00:47:12,360 --> 00:47:16,310 One little detail: here, we're dividing by n. 939 00:47:16,310 --> 00:47:19,590 Here, we're dividing by n-1. 940 00:47:19,590 --> 00:47:21,050 Why do we do this? 941 00:47:21,050 --> 00:47:24,630 Well, it turns out that's what you need to do for this 942 00:47:24,630 --> 00:47:28,590 estimate to be an unbiased estimate of the variance. 943 00:47:28,590 --> 00:47:32,080 One has to do a little bit of a calculation, and one finds 944 00:47:32,080 --> 00:47:36,650 that that's the factor that you need to have here in order 945 00:47:36,650 --> 00:47:37,770 to be unbiased. 946 00:47:37,770 --> 00:47:42,280 Of course, if you get 100 data points, whether you divide by 947 00:47:42,280 --> 00:47:46,070 100 or divide by 99, it's going to make only a tiny 948 00:47:46,070 --> 00:47:48,620 difference in your estimate of the variance. 949 00:47:48,620 --> 00:47:50,740 So it's going to make only a tiny difference in your 950 00:47:50,740 --> 00:47:52,670 estimate of the standard deviation. 951 00:47:52,670 --> 00:47:54,180 It's not a big deal. 952 00:47:54,180 --> 00:47:56,550 And it doesn't really matter. 953 00:47:56,550 --> 00:48:00,720 But if you want to show off about your deeper knowledge of 954 00:48:00,720 --> 00:48:06,810 statistics, you throw in the 1 over n-1 factor in there. 955 00:48:06,810 --> 00:48:11,350 So now one basically needs to put together this story here, 956 00:48:11,350 --> 00:48:15,260 how you estimate the variance. 957 00:48:15,260 --> 00:48:18,370 You first compute the sample mean. 958 00:48:18,370 --> 00:48:21,010 And then you do some extra work to come up with a 959 00:48:21,010 --> 00:48:23,020 reasonable estimate of the variance and 960 00:48:23,020 --> 00:48:24,640 the standard deviation. 961 00:48:24,640 --> 00:48:27,510 And then you use your estimate of the standard 962 00:48:27,510 --> 00:48:32,960 deviation to come up with a confidence interval, which has 963 00:48:32,960 --> 00:48:35,150 these two endpoints. 964 00:48:35,150 --> 00:48:39,130 In doing this procedure, there are basically a number of 965 00:48:39,130 --> 00:48:41,810 approximations involved. 966 00:48:41,810 --> 00:48:43,570 There are two types of approximations. 967 00:48:43,570 --> 00:48:46,170 One approximation is that we're pretending that the 968 00:48:46,170 --> 00:48:48,720 sample mean has a normal distribution. 969 00:48:48,720 --> 00:48:51,080 That's something we're justified in doing, by the 970 00:48:51,080 --> 00:48:52,470 central limit theorem. 971 00:48:52,470 --> 00:48:53,550 But it's not exact. 972 00:48:53,550 --> 00:48:54,910 It's an approximation. 973 00:48:54,910 --> 00:48:58,080 And the second approximation that comes in is that, instead 974 00:48:58,080 --> 00:49:01,260 of using the correct standard deviation, in general, you 975 00:49:01,260 --> 00:49:04,850 will have to use some approximation of 976 00:49:04,850 --> 00:49:06,100 the standard deviation.
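[A sketch of the complete procedure in code, under both approximations just mentioned. The function name and data are illustrative, not from the lecture; ddof=1 is what gives the 1/(n-1) factor discussed above.]

```python
# Sketch of the full pipeline: estimate the mean, estimate the standard
# deviation from the same data (dividing by n-1), and report an
# approximate 95% confidence interval.
import numpy as np

def mean_confidence_interval(x, z=1.96):
    n = len(x)
    theta_hat = x.mean()          # sample mean: the estimate of theta
    sigma_hat = x.std(ddof=1)     # ddof=1 divides by n-1 (unbiased variance)
    half_width = z * sigma_hat / np.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=200)   # made-up data for illustration
print(mean_confidence_interval(data))

# For Bernoulli data, the earlier heuristic instead plugs theta hat into
# the formula sigma = sqrt(theta * (1 - theta)):
coin = rng.integers(0, 2, size=200).astype(float)
p_hat = coin.mean()
sigma_hat_bern = np.sqrt(p_hat * (1 - p_hat))
half = 1.96 * sigma_hat_bern / np.sqrt(len(coin))
print(p_hat - half, p_hat + half)
```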
977 00:49:06,100 --> 00:49:08,390 978 00:49:08,390 --> 00:49:11,200 Okay, so you will be getting a little bit of practice with 979 00:49:11,200 --> 00:49:14,550 these concepts in recitation and tutorial. 980 00:49:14,550 --> 00:49:18,070 And we will move on to new topics next week. 981 00:49:18,070 --> 00:49:20,930 But the material that's going to be covered on the final 982 00:49:20,930 --> 00:49:23,570 exam is only up to this point. 983 00:49:23,570 --> 00:49:28,220 So next week is just general education. 984 00:49:28,220 --> 00:49:30,550 Hopefully useful, but it's not on the exam. 985 00:49:30,550 --> 00:49:31,800