The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So for the last three lectures we're going to talk about classical statistics, the way statistics can be done if you don't want to assume a prior distribution on the unknown parameters. Today we're going to focus mostly on the estimation side and leave hypothesis testing for the next two lectures. There is one generic method that one can use to carry out parameter estimation, and that's the maximum likelihood method. We're going to define what it is. Then we will look at the most common estimation problem there is, which is to estimate the mean of a given distribution. And we're going to talk about confidence intervals, which refers to providing an interval around your estimate with a property of the kind that the parameter is highly likely to be inside that interval, although we will be careful about how to interpret that particular statement.

OK. So, the big framework first. The picture is almost the same as the one that we had in the case of Bayesian statistics. We have some unknown parameter, and we have a measuring device. There is some noise, some randomness, and we get an observation, X, whose distribution depends on the value of the parameter. However, the big change from the Bayesian setting is that here this parameter is just a number. It's not modeled as a random variable. It does not have a probability distribution. There's nothing random about it. It's a constant; it just happens that we don't know what that constant is. And in particular, this probability distribution here, the distribution of X, depends on Theta. But this is not a conditional distribution in the usual sense of the word.
Conditional distributions were defined when we had two random variables and we conditioned one random variable on the other, and we used the bar to separate the X from the Theta. To make the point that this is not a conditional distribution, we use a different notation: we put a semicolon here. What this is meant to say is that X has a distribution, that distribution has a certain parameter, and we don't know what that parameter is. So, for example, this might be a normal distribution with variance 1 but a mean of Theta. We don't know what Theta is, and we want to estimate it.

Now, once we have this setting, your job is to design this box, the estimator. The estimator is some data-processing box that takes the measurements and produces an estimate of the unknown parameter. The notation that's used here is as if X and Theta were one-dimensional quantities, but actually everything we say remains valid if you interpret X and Theta as vectors. So, for example, you may obtain several measurements, X1 up to Xn, and there may be several unknown parameters in the background.

Once more, we do not have, and we do not want to assume, a prior distribution on Theta. It's a constant. And if you want to think mathematically about this situation, it's as if you have many different probabilistic models: a normal with this mean, or a normal with that mean, or a normal with some other mean. These are alternative candidate probabilistic models, and we want to try to make a decision about which one is the correct model. In some cases, we have to choose between just a small number of models. For example, you have a coin with an unknown bias. The bias is either 1/2 or 3/4. You're going to flip the coin a few times, and you try to decide whether the true bias is this one or that one. So in this case, we have two specific alternative probabilistic models between which we want to distinguish. But sometimes things are a little more complicated.
For example, you have a coin, and you have one hypothesis that the coin is unbiased, while the other hypothesis is that the coin is biased. You do your experiments, and you want to come up with a decision about which of the two is true. In this case, we're not dealing with just two alternative probabilistic models. The first is a specific model for the coin, but the second actually corresponds to lots of possible alternative coin models: it includes the model where Theta is 0.6, the model where Theta is 0.7, Theta is 0.8, and so on. So we're trying to discriminate between one model and lots of alternative models. How does one go about this? Well, there are some systematic ways that one can approach problems of this kind, and we will start talking about those next time.

So today, we're going to focus on estimation problems. In estimation problems, Theta is a quantity that is a real number, a continuous parameter. We are to design this box so that what we get out of it is an estimate. Now notice that this estimate is a random variable. Even though Theta is deterministic, the estimate is random, because it's a function of the data that we observe. The data are random, and we're applying a function to the data to construct our estimate. So, since it's a function of random variables, it's a random variable itself. The distribution of Theta hat depends on the distribution of X. The distribution of X is affected by Theta. So in the end, the distribution of your estimate Theta hat will also be affected by whatever Theta happens to be.

Our general objective, when designing estimators, is to end up with an estimation error which is not too large, but we'll have to make specific what exactly we mean by that. So how do we go about this problem? One general approach is to pick a Theta under which the data that we observe, the X's, are most likely to have occurred.
So I observe X. For any given Theta, I can calculate this quantity, which tells me: under this particular Theta, the X that you observed had this probability of occurring; under that Theta, the X that you observed had that probability of occurring. You just choose the Theta which makes the data that you observed most likely.

It's interesting to compare this maximum likelihood estimate with the estimate you would have if you were in a Bayesian setting and were using maximum a posteriori probability estimation. In the Bayesian setting, what we do is, given the data, we use the prior distribution on Theta, and we calculate the posterior distribution of Theta given X. Notice that this is sort of the opposite of what we have here. This is the probability of X for a particular value of Theta, whereas that is the probability of Theta for a particular X. So it's the opposite type of conditioning. In the Bayesian setting, Theta is a random variable, so we can talk about the probability distribution of Theta.

So how do these two compare, apart from the syntactic difference that the order of the X's and Theta's is reversed? Let's write down, in full detail, what this posterior distribution of Theta is. By the Bayes rule, this conditional distribution is obtained from the prior and from the model of the measurement process that we have, and we get to this expression. So in Bayesian estimation, we want to find the most likely value of Theta, and we need to maximize this quantity over all possible Theta's. The first thing to notice is that the denominator is a constant; it does not involve Theta. So when you maximize this quantity, you don't care about the denominator. You just want to maximize the numerator. Now, here, things start to look a little more similar. And they would be exactly of the same kind if that term, the prior, were absent. The two are going to become the same if that prior is just a constant.
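[In symbols, writing the discrete case described above (densities work the same way), the posterior being maximized in MAP estimation is

$$p_{\Theta \mid X}(\theta \mid x) \;=\; \frac{p_{\Theta}(\theta)\, p_{X \mid \Theta}(x \mid \theta)}{p_X(x)},$$

and since the denominator does not involve $\theta$, MAP maximizes the numerator $p_{\Theta}(\theta)\, p_{X \mid \Theta}(x \mid \theta)$. If the prior $p_{\Theta}$ is constant, this reduces to maximizing $p_X(x;\theta)$ over $\theta$, which is exactly the maximum likelihood criterion.]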
So if that prior is a constant, then maximum likelihood estimation takes exactly the same form as Bayesian maximum a posteriori probability estimation. So you can give this particular interpretation of maximum likelihood estimation: it is essentially what you would have done if you were in a Bayesian world and you had assumed a prior on the Theta's that's uniform, with all the Theta's being equally likely.

Okay. So let's look at a simple example. Suppose that the Xi's are independent, identically distributed exponential random variables with a certain parameter Theta. So the distribution of each one of the Xi's is this particular term. Theta is a one-dimensional parameter, but we have several data points. We write down the formula for the probability of a particular X vector, given a particular value of Theta. But again, when I use the word "given" here, it's not in the conditioning sense; it's the value of the density for a particular choice of Theta. Earlier, I defined maximum likelihood estimation in terms of PMFs. That's what you would do if the X's were discrete random variables. Here, the X's are continuous random variables, so I'm using the PDF instead of the PMF. The definition generalizes to the case of continuous random variables; you use f's instead of p's, our usual recipe. So the maximum likelihood estimate is defined.

Now, since the Xi's are independent, the joint density of all the X's together is the product of the individual densities. So you look at this quantity. This is the density, or, loosely speaking, the probability, of observing a particular sequence of X's. And we ask the question: what is the value of Theta that makes the X's that we observed most likely? So we want to carry out this maximization. Now, this maximization is just a computational problem. We're going to do it by taking the logarithm of this expression.
Maximizing an expression is the same as maximizing its logarithm. The logarithm of a product is the sum of the logarithms. You get contributions from this Theta term; there are n of them, so we get an n log Theta. And then we have the sum of the logarithms of the exponential terms, which gives us minus Theta times the sum of the X's. So we need to maximize this expression with respect to Theta. The way to do this maximization is to take the derivative with respect to Theta and set it to zero. You get n over Theta equal to the sum of the X's, and then you solve for Theta. You find that the maximum likelihood estimate is this quantity, n divided by the sum of the X's. Which sort of makes sense, because this is the reciprocal of the sample mean of the X's, and for an exponential distribution we know that Theta is 1 over the mean of the distribution. So it looks like a reasonable estimate.

In any case, this is the estimate that the maximum likelihood estimation procedure tells us we should report. This formula, of course, tells you what to do if you have already observed specific numbers: you report this particular number as your estimate of Theta. If you want to describe your estimation procedure more abstractly, what you have constructed is an estimator, which is a box that takes in the random variables, capital X1 up to capital Xn, and produces your estimate, which is also a random variable, because it's a function of these random variables; it is denoted by an uppercase Theta hat to indicate that it is now a random variable. So this is an equality about numbers, whereas that is a description of the general procedure, an equality between two random variables. And this gives you the more abstract view of what we're doing here.

All right. So what can we tell about our estimate? Is it good or is it bad?
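[As an illustration, here is a minimal sketch of that computation in Python; the variable names and the use of NumPy are my own, not from the lecture:

```python
import numpy as np

# Minimal sketch: the ML estimate for i.i.d. Exponential(theta) samples,
# where the density is f(x; theta) = theta * exp(-theta * x).
rng = np.random.default_rng(0)
true_theta = 2.0
x = rng.exponential(scale=1.0 / true_theta, size=1000)  # mean of the distribution is 1/theta

# The log-likelihood is n*log(theta) - theta*sum(x); setting its derivative
# to zero gives theta_hat = n / sum(x), the reciprocal of the sample mean.
theta_hat = len(x) / x.sum()
print(theta_hat)  # should be close to 2.0
```
]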
So we should look at this particular random variable and talk about its statistical properties. What we would like is for this random variable to be close to the true value of Theta, with high probability, no matter what Theta is, since we don't know what Theta is. Let's make the properties that we want a little more specific.

So we cook up the estimator somehow. This estimator corresponds, again, to a box that takes data in, the capital X's, and produces an estimate, Theta hat. This estimate is random. Sometimes it will be above the true value of Theta; sometimes it will be below. Ideally, we would like it not to have a systematic error on the positive side or the negative side. So a reasonable wish to have, for a good estimator, is that, on average, it gives you the correct value.

Now, let's be a little more specific about what that expectation is. This is an expectation with respect to the probability distribution of Theta hat. The probability distribution of Theta hat is affected by the probability distribution of the X's, because Theta hat is a function of the X's. And the probability distribution of the X's is affected by the true value of Theta. So depending on which value is the true value of Theta, this is going to be a different expectation. If you were to write this expectation out in more detail, it would look something like this: you write down the probability distribution of Theta hat, which is going to be some function that depends on the true Theta, and then you integrate with respect to Theta hat. What's the point here? Again, Theta hat is a function of the X's, so the density of Theta hat is affected by the density of the X's. The density of the X's is affected by the true value of Theta. So the distribution of Theta hat is affected by the value of Theta.
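[Written out, this is just the expression described in words above:

$$\mathbb{E}_{\theta}\big[\hat{\Theta}\big] \;=\; \int \hat{\theta}\, f_{\hat{\Theta}}\big(\hat{\theta};\theta\big)\, d\hat{\theta},$$

and the "no systematic error" requirement, unbiasedness, asks that $\mathbb{E}_{\theta}[\hat{\Theta}] = \theta$ for every possible value of $\theta$.]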
Another way to put it is, as I mentioned a few minutes ago, that in this business it's as if we are considering different possible probabilistic models, one probabilistic model for each choice of Theta, and we're trying to guess which one of these probabilistic models is the true one. One way of emphasizing the fact that this expression depends on the true Theta is to put a little subscript here: the expectation under the particular value of the parameter Theta. So depending on what value the true parameter Theta takes, this expectation will have a different value. And what we would like is that, no matter what the true value is, our estimate will not have a bias to the positive or the negative side. So this is a property that's desirable. Is it always going to be true? Not necessarily; it depends on what estimator we construct.

Is it true for our exponential example? Unfortunately not. The estimate that we have in the exponential example turns out to be biased. One extreme way of seeing this is to consider the case where our sample size is 1. We're trying to estimate Theta, and the estimator from the previous slide, in that case, is just 1/X1. Now, X1 has a fair amount of density in the vicinity of 0, which means that 1/X1 has a significant probability of being very large. And if you do the calculation, this ultimately makes the expected value of 1/X1 infinite. Now, infinity is definitely not the correct value, so our estimate is biased upwards, and it's actually biased a lot upwards. So that's how things are: maximum likelihood estimates, in general, will be biased. But under some conditions, they will turn out to be asymptotically unbiased. That is, as you get more and more data, as your X vector gets longer and longer, with independent data, the expected value of your estimator is going to get closer and closer to the true value. So you do have some nice asymptotic properties, but we're not going to prove anything like this.
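[A quick Monte Carlo sketch of this upward bias, an illustration of my own: for Exponential(theta) data one can compute that the exact mean of the estimator is n*theta/(n-1) for n >= 2, so the bias is large for small n and fades as n grows.

```python
import numpy as np

# Estimate E[theta_hat] by simulation for a small sample size n,
# where theta_hat = n / sum(X) and the X_i are Exponential(true_theta).
rng = np.random.default_rng(1)
true_theta, n, trials = 2.0, 5, 200_000

samples = rng.exponential(scale=1.0 / true_theta, size=(trials, n))
theta_hat = n / samples.sum(axis=1)

# With n = 5 the average lands near n/(n-1) * true_theta = 2.5, not 2.0:
print(theta_hat.mean())  # biased upward; the bias shrinks as n increases
```
]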
Speaking of asymptotic properties, in general what we would like to have is that, as you collect more and more data, you get the correct answer, in some sense. And the sense that we're going to use here is the limiting sense of convergence in probability, since this is the only notion of convergence of random variables that we have in our hands. This is similar to what we had in the pollster problem, for example: if we had a bigger and bigger sample size, we could be more and more confident that the estimate we obtained is close to the unknown true parameter of the distribution. So this is a desirable property. If you have an infinitely large amount of data, you should be able to estimate an unknown parameter more or less exactly. So this is a desirable property of estimators. It turns out that maximum likelihood estimation, given independent data, does have this property under mild conditions. So maximum likelihood estimation, in this respect, is a good approach.

So let's see: do we have this consistency property in our exponential example? In our exponential example, we used this quantity to estimate the unknown parameter Theta. What properties does this quantity have as n goes to infinity? Well, this quantity is the reciprocal of that quantity up here, which is the sample mean. We know from the weak law of large numbers that the sample mean converges to the expectation; so this property comes from the weak law of large numbers. In probability, this quantity converges to the expected value, which, for exponential distributions, is 1/Theta. Now, if something converges to something, then the reciprocal should converge to the reciprocal. That's a property that's certainly correct for numbers. But we're not talking about convergence of numbers; we're talking about convergence in probability, which is a more complicated notion. Fortunately, it turns out that the same thing is true when we deal with convergence in probability.
One can show, although we will not bother doing this, that indeed the reciprocal of this, which is our estimate, converges in probability to the reciprocal of that, and that reciprocal is the true parameter Theta. So for this particular exponential example, we do have the desirable property that, as the number of data points becomes larger and larger, the estimate that we have constructed gets closer and closer to the true parameter value. And this is true no matter what Theta is: no matter what the true parameter Theta is, we're going to get close to it as we collect more data.

Okay. So these are two rough, qualitative properties that would be nice to have. If you want to get a little more quantitative, you can start looking at the mean squared error that your estimator gives. Now, once more, the comment I was making up there applies: namely, that this expectation here is an expectation with respect to the probability distribution of Theta hat that corresponds to a particular value of little theta. So fix a little theta, write down this expression, look at the probability distribution of Theta hat under that little theta, and do the calculation. You're going to get some quantity that depends on the little theta. And so all quantities in this equality here should be interpreted as quantities under that particular value of little theta. If you wanted to make this more explicit, you could start throwing little subscripts everywhere in those expressions.

So let's see what those expressions tell us. The expected value of the square of a random variable is always equal to the variance of that random variable plus the square of its expectation. This equality here is just our familiar formula, that the expected value of X squared is the variance of X plus the square of the expected value of X. So we apply this formula with X equal to Theta hat minus theta.
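[Spelled out, this is the decomposition the lecture is describing. Writing $b(\theta) = \mathbb{E}_\theta[\hat\Theta] - \theta$ for the bias:

$$\mathbb{E}_\theta\big[(\hat\Theta - \theta)^2\big] \;=\; \operatorname{var}_\theta\big(\hat\Theta - \theta\big) + \big(\mathbb{E}_\theta[\hat\Theta - \theta]\big)^2 \;=\; \operatorname{var}_\theta\big(\hat\Theta\big) + b(\theta)^2,$$

where the first step is the familiar formula $\mathbb{E}[Y^2] = \operatorname{var}(Y) + (\mathbb{E}[Y])^2$, and the second uses the fact, explained next, that subtracting the constant $\theta$ does not change the variance.]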
Now, remember that, in this classical setting, theta is just a constant. We have fixed theta, and we want to calculate the variance of this quantity under that particular theta. When you add or subtract a constant from a random variable, the variance doesn't change, so this is the same as the variance of our estimator. And what we've got here is the bias of our estimator: it tells us, on average, whether we fall above or below. And the bias enters as b squared. If we have an unbiased estimator, the bias term will be 0.

So, ideally, we want Theta hat to be very close to theta. And since theta is a constant, if that happens, the variance of Theta hat will be very small: if Theta hat has a distribution that's concentrated around little theta, then Theta hat has a small variance. So this is one desire that we have, a small variance. But we also want to have a small bias at the same time. So the general form of the mean squared error has two contributions: one is the variance of our estimator; the other is the bias. And one usually wants to design an estimator that simultaneously keeps both of these terms small.

Here's an estimation method that would do very well with respect to the variance term, but badly with respect to the bias term. Suppose that my distribution is, let's say, normal with an unknown mean Theta and variance 1, and I use as my estimator something very dumb: I always produce an estimate that says the value is 100. So I'm just ignoring the data and reporting 100. What does this do? The variance of my estimator is 0; there's no randomness in the estimate that I report. But the bias is going to be pretty bad. The bias is going to be Theta hat, which is 100, minus the true value of Theta. And for some Theta's, my bias is going to be horrible. If my true Theta happens to be 0, my bias squared is a huge term, and I get a large error.
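[A minimal simulation of this trade-off; the "dumb" constant estimator and the Normal(theta, 1) setup are exactly the example above, while the names and numbers are illustrative:

```python
import numpy as np

# Compare two estimators of the mean theta of a Normal(theta, 1) sample:
# the sample mean (small bias, some variance) versus the constant 100
# (zero variance, potentially huge bias).
rng = np.random.default_rng(2)
true_theta, n, trials = 0.0, 20, 100_000

data = rng.normal(loc=true_theta, scale=1.0, size=(trials, n))
sample_mean = data.mean(axis=1)
constant = np.full(trials, 100.0)

for name, est in [("sample mean", sample_mean), ("always 100", constant)]:
    mse = np.mean((est - true_theta) ** 2)
    print(name, "variance:", est.var(),
          "bias^2:", (est.mean() - true_theta) ** 2, "MSE:", mse)
# The constant estimator has variance 0 but bias^2 = 10000 when theta = 0.
```
]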
So what's the moral of this example? There are ways of making the variance very small, but, in those cases, you pay a price in the bias. So you want to do something a little more delicate, where you try to keep both terms small at the same time. These types of considerations become important when you start trying to design sophisticated estimators for more complicated problems. But we will not do this in this class; this belongs to further classes on statistics and inference. For this class, for parameter estimation, we will basically stick to two very simple methods. One is the maximum likelihood method we've just discussed. And the other method is what you would do if you were still in high school and didn't know any probability: you get data, these data come from some distribution with an unknown mean, and you want to estimate that unknown mean. What would you do? You would just take those data and average them out.

So let's make this a little more specific. We have X's that come from a given distribution. We know the general form of the distribution, perhaps. We may know the variance of that distribution, or perhaps we don't know it. But we do not know the mean, and we want to estimate the mean of that distribution. Now, we can represent this situation in a different form: each Xi is equal to Theta, which is the mean, plus a zero-mean random variable that you can think of as noise. So this corresponds to the usual situation you would have in a lab, where you go and try to measure an unknown quantity. You get lots of measurements, but each time you measure, your measurement has some extra noise in it. And you want to get rid of that noise. The way to try to get rid of the measurement noise is to collect lots of data and average them out. This is the sample mean.
And this is a very, very reasonable way of trying to estimate the unknown mean of the X's. So this is the sample mean. It's a reasonable, plausible, and, in general, pretty good estimator of the unknown mean of a certain distribution. We can apply this estimator without really knowing a lot about the distribution of the X's. Actually, we don't need to know anything about the distribution. We can still apply it, because the variance, for example, does not show up here; we don't need to know the variance to calculate that quantity.

Does this estimator have good properties? Yes, it does. What's the expected value of the sample mean? It's the expectation of this sum divided by n. The expected value of each one of the X's is Theta, so the expected value of the sample mean is just Theta itself. So our estimator is unbiased: no matter what Theta is, our estimator does not have a systematic error in either direction. Furthermore, the weak law of large numbers tells us that this quantity converges to the true parameter in probability, so it's a consistent estimator. This is good. And suppose you want to calculate the mean squared error corresponding to this estimator. Remember how we defined the mean squared error? It's this quantity. Then it's a calculation that we have done a fair number of times by now: the mean squared error is the variance of the distribution of the X's divided by n. So as we get more and more data, the mean squared error goes down to 0.

In some examples, it turns out that the sample mean is also the same as the maximum likelihood estimate. For example, if the X's are coming from a normal distribution, you can write down the likelihood, do the maximization with respect to Theta, and you'll find that the maximum likelihood estimate is the same as the sample mean. In other cases, the sample mean will be different from the maximum likelihood estimate.
When the two differ, you have a choice about which one of the two you would use. Probably, in most reasonable situations, you would just use the sample mean, because it's simple, easy to compute, and has nice properties.

All right. So you go to your boss, and you report and say: OK, I did all my experiments in the lab, and the average value that I got is a certain number, 2.37. Is that informative to your boss? Well, your boss would like to know how much they can trust this number, 2.37. I know that the true value is not going to be exactly that, but how close should it be? So: give me a range of what you think are possible values of Theta.

So the situation is like this. Suppose that we observe X's that are coming from a certain distribution, and we're trying to estimate the mean. We get our data; maybe our data look something like this. You calculate the sample mean, and let's suppose that the sample mean is a number, for concreteness taken to be 2.37. But you want to convey something to your boss about how spread out these data were. So the boss asks you to give him or her some kind of interval in which Theta, the true parameter, might lie. So the boss asked you for an interval, and what you do is end up reporting an interval, somehow using the data that you have seen to construct this interval. And you report to your boss the endpoints of this interval. Let's give names to these endpoints: Theta hat n-minus and Theta hat n-plus. The subscript n here just plays the role of keeping track of how many data points we're using. So what you report to your boss is this interval as well.

Are these Theta's here, the endpoints of the interval, lowercase or uppercase? What should they be? Well, you construct these intervals after you see your data. You take the data into account to construct your interval.
So these endpoints definitely should depend on the data, and therefore they are random variables. Same thing with your estimator: in general, it's going to be a random variable, although, when you go and report numbers to your boss, you give the specific realizations of the random variables, given the data that you got.

So instead of having just a single box that produces estimates, as in our previous picture, where the estimator takes X's and produces Theta hats, now our box will also be producing a Theta hat minus and a Theta hat plus. It's going to produce an interval as well. The X's are random; therefore these quantities are random. Once you go and do the experiment and obtain your data, your data will be some lowercase x's, specific numbers, and then your estimate and interval endpoints also become lowercase.

What would we like this interval to do? We would like it to be highly likely to contain the true value of the parameter. So we might impose some specs of the following kind. I pick a number, alpha; think of alpha as a probability of a large error. A typical value of alpha might be 0.05, in which case this number here, 1 minus alpha, is 0.95. And you're given specs that say something like this: I would like, with probability at least 0.95, the true parameter to lie inside the confidence interval.

Now let's try to interpret this statement. Suppose that you did the experiment, and that you ended up reporting to your boss a confidence interval from 1.97 to 2.56. That's what you report to your boss. And suppose that the confidence interval has this property. Can you go to your boss and say: with probability 95%, the true value of Theta is between these two numbers? Is that a meaningful statement? So the tentative statement is: with probability 95%, the true value of Theta is between 1.97 and 2.56. Well, what is random in that statement?
There's nothing random. The true value of theta is a constant. 1.97 is a number; 2.56 is a number. So it doesn't make any sense to talk about the probability that theta is in this interval. Either theta happens to be in that interval, or it happens not to be. But there are no probabilities associated with this, because theta is not random. Syntactically, you can see this because theta here is lowercase.

So what kind of probabilities are we talking about here? Where is the randomness? Well, the random thing is the interval; it's not theta. So the statement that is being made here is that the interval being constructed by our procedure should have the property that, with probability 95%, it's going to fall on top of the true value of theta.

So the right way of interpreting what the 95% confidence interval is, is something like the following. We have the true value of theta, which we don't know. I get data. Based on the data, I construct a confidence interval. I got lucky, and the true value of theta is in there. The next day, I do the same experiment, take my data, construct a confidence interval, and I get this confidence interval: lucky once more. The next day I get data, I use my data to come up with an estimate of theta and a confidence interval; that day I was unlucky, and I got a confidence interval out there. What the requirement says is that on 95% of the days where we use this particular procedure for constructing confidence intervals, we will be lucky, and we will capture the correct value of theta with our confidence interval. So it's a statement about the distribution of these random confidence intervals, about how likely they are to fall on top of the true theta, as opposed to how likely they are to fall outside. So it's a statement about probabilities associated with the confidence interval.
These are not probabilities about theta, because theta itself is not random. So this is what the confidence interval is, in general, and how we interpret it.

How do we construct a 95% confidence interval? Let's go through this exercise in a particular example. The calculations are exactly the same as the ones you did when we talked about laws of large numbers and the central limit theorem; there's nothing new computationally, but it's perhaps new in terms of the language that we use and the interpretation. So we got our sample mean from some distribution, and we would like to calculate a 95% confidence interval.

We know from the normal tables that the standard normal has 2.5% probability on the tail to the right of 1.96. Yes, by this time, the number 1.96 should be pretty familiar. So if this tail probability here is 2.5%, this number here is 1.96. Now look at this random variable here. This is the difference of the sample mean from the true mean, normalized by the usual normalizing factor. By the central limit theorem, this is approximately normal, so it has probability 0.95 of being less than 1.96 in magnitude. Now take this event and rewrite it. This is the event that Theta hat minus theta is bigger than this number and smaller than that number; this event here is equivalent to that event there. And so this suggests a way of constructing our 95% confidence interval. I'm going to report the interval which has this as its lower end and this as its upper end. In other words, at the end of the experiment, we report the sample mean, which is our estimate, and we also report an interval around the sample mean. And this is our 95% confidence interval.
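[A minimal sketch of this recipe, plus a Monte Carlo check of the interpretation from a moment ago, that the random interval covers the true theta on about 95% of "days"; the setup and names are illustrative:

```python
import numpy as np

# 95% CI for the mean: sample_mean +/- 1.96 * sigma / sqrt(n).
rng = np.random.default_rng(4)
theta, sigma, n = 2.37, 1.0, 100

def confidence_interval(x):
    m = x.mean()
    half = 1.96 * sigma / np.sqrt(len(x))
    return m - half, m + half

# One experiment: the interval you would report to the boss.
x = rng.normal(loc=theta, scale=sigma, size=n)
print(confidence_interval(x))

# Many repeated experiments: the fraction of intervals covering theta.
trials, covered = 20_000, 0
for _ in range(trials):
    lo, hi = confidence_interval(rng.normal(loc=theta, scale=sigma, size=n))
    covered += (lo <= theta <= hi)
print(covered / trials)  # close to 0.95
```
]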
783 00:40:26,050 --> 00:40:28,950 In some sense, we're more certain that we're doing a 784 00:40:28,950 --> 00:40:32,390 good estimation job, so we can have a small interval and 785 00:40:32,390 --> 00:40:36,000 still be quite confident that our interval captures the true 786 00:40:36,000 --> 00:40:37,520 value of the parameter. 787 00:40:37,520 --> 00:40:41,890 Also, if our data have very little noise, that is, when you have 788 00:40:41,890 --> 00:40:45,060 more accurate measurements, you're more confident that 789 00:40:45,060 --> 00:40:47,220 your estimate is pretty good. 790 00:40:47,220 --> 00:40:51,120 And that results in a smaller confidence interval, a smaller 791 00:40:51,120 --> 00:40:52,610 length of the confidence interval. 792 00:40:52,610 --> 00:40:56,040 And still you have 95% probability of capturing the 793 00:40:56,040 --> 00:40:57,650 true value of theta. 794 00:40:57,650 --> 00:41:01,660 So we did this exercise by taking 95% confidence 795 00:41:01,660 --> 00:41:04,010 intervals and the corresponding value from the 796 00:41:04,010 --> 00:41:06,670 normal tables, which is 1.96. 797 00:41:06,670 --> 00:41:11,390 Of course, you can do it more generally, if you set your 798 00:41:11,390 --> 00:41:13,730 alpha to be some other number. 799 00:41:13,730 --> 00:41:16,590 Again, you look at the normal tables. 800 00:41:16,590 --> 00:41:20,460 And you find the value here, so that the tail has 801 00:41:20,460 --> 00:41:22,640 probability alpha over 2. 802 00:41:22,640 --> 00:41:26,790 And instead of using this 1.96, you use whatever number 803 00:41:26,790 --> 00:41:31,380 you get from the normal tables. 804 00:41:31,380 --> 00:41:33,520 And this tells you how to construct 805 00:41:33,520 --> 00:41:36,680 a confidence interval. 806 00:41:36,680 --> 00:41:42,060 Well, to be exact, this is not necessarily a 807 00:41:42,060 --> 00:41:44,640 95% confidence interval. 808 00:41:44,640 --> 00:41:47,540 It's approximately a 95% confidence interval. 809 00:41:47,540 --> 00:41:48,950 Why is this? 810 00:41:48,950 --> 00:41:51,060 Because we've done an approximation. 811 00:41:51,060 --> 00:41:53,890 We have used the central limit theorem. 812 00:41:53,890 --> 00:41:59,990 So it might turn out to be a 95.5% confidence interval 813 00:41:59,990 --> 00:42:03,220 instead of 95%, because our calculations are 814 00:42:03,220 --> 00:42:04,740 not entirely accurate. 815 00:42:04,740 --> 00:42:08,230 But for reasonable values of n, using the central limit 816 00:42:08,230 --> 00:42:10,190 theorem is a good approximation. 817 00:42:10,190 --> 00:42:13,330 And that's what people almost always do. 818 00:42:13,330 --> 00:42:17,350 So just take the value from the normal tables. 819 00:42:17,350 --> 00:42:18,600 Okay, except for one catch. 820 00:42:22,830 --> 00:42:24,590 I used the data. 821 00:42:24,590 --> 00:42:26,440 I obtained my estimate. 822 00:42:26,440 --> 00:42:29,830 And I want to go to my boss and report this theta hat minus 823 00:42:29,830 --> 00:42:33,010 and this theta hat plus, the two endpoints of the confidence interval. 824 00:42:33,010 --> 00:42:35,720 What's the difficulty? 825 00:42:35,720 --> 00:42:37,540 I know what n is. 826 00:42:37,540 --> 00:42:40,790 But I don't know what sigma is, in general. 827 00:42:40,790 --> 00:42:44,750 So if I don't know sigma, what am I going to do? 828 00:42:44,750 --> 00:42:48,980 Here, there are a few options for what you can do.
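Before going through those options, here is how the general-alpha recipe from a moment ago might look, as a sketch. The standard library's NormalDist.inv_cdf (Python 3.8+) stands in for the normal tables, and sigma is still taken as an input, which is precisely the catch just raised.

```python
import math
from statistics import NormalDist

def confidence_interval(data, sigma, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for the mean,
    via the CLT; inv_cdf plays the role of the normal tables."""
    n = len(data)
    theta_hat = sum(data) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    half_width = z * sigma / math.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width
```

The options that follow are about what to feed in for that sigma argument.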
829 00:42:48,980 --> 00:42:52,910 And the first option is familiar from what we did when 830 00:42:52,910 --> 00:42:55,020 we talked about the pollster problem. 831 00:42:55,020 --> 00:42:58,480 We don't know what sigma is, but maybe we have an upper 832 00:42:58,480 --> 00:43:00,030 bound on sigma. 833 00:43:00,030 --> 00:43:03,540 For example, if the Xi's are Bernoulli random variables, we 834 00:43:03,540 --> 00:43:06,910 have seen that the standard deviation is at most 1/2. 835 00:43:06,910 --> 00:43:10,220 So use the most conservative value for sigma. 836 00:43:10,220 --> 00:43:13,520 Using the most conservative value means that you take 837 00:43:13,520 --> 00:43:17,890 bigger confidence intervals than necessary. 838 00:43:17,890 --> 00:43:20,780 So that's one option. 839 00:43:20,780 --> 00:43:25,480 Another option is to try to estimate sigma from the data. 840 00:43:25,480 --> 00:43:27,630 How do you do this estimation? 841 00:43:27,630 --> 00:43:31,140 In special cases, for special types of distributions, you 842 00:43:31,140 --> 00:43:34,180 can think of heuristic ways of doing this estimation. 843 00:43:34,180 --> 00:43:38,390 For example, in the case of Bernoulli random variables, we 844 00:43:38,390 --> 00:43:42,420 know that the true value of sigma, the standard deviation 845 00:43:42,420 --> 00:43:45,120 of a Bernoulli random variable, is the square root 846 00:43:45,120 --> 00:43:47,670 of theta times (1 minus theta), where theta is 847 00:43:47,670 --> 00:43:50,290 the mean of the Bernoulli. 848 00:43:50,290 --> 00:43:51,900 Try to use this formula. 849 00:43:51,900 --> 00:43:54,140 But theta is the thing we're trying to estimate in the 850 00:43:54,140 --> 00:43:54,760 first place. 851 00:43:54,760 --> 00:43:55,880 We don't know it. 852 00:43:55,880 --> 00:43:57,150 What do we do? 853 00:43:57,150 --> 00:44:00,850 Well, we have an estimate for theta: the estimate produced 854 00:44:00,850 --> 00:44:04,195 by our estimation procedure, the sample mean. 855 00:44:04,195 --> 00:44:05,670 So I obtain my data. 856 00:44:05,670 --> 00:44:06,540 I produce the 857 00:44:06,540 --> 00:44:09,030 estimate theta hat. 858 00:44:09,030 --> 00:44:10,740 It's an estimate of the mean. 859 00:44:10,740 --> 00:44:14,770 Use that estimate in this formula to come up with an 860 00:44:14,770 --> 00:44:17,290 estimate of my standard deviation. 861 00:44:17,290 --> 00:44:20,210 And then use that standard deviation in the construction 862 00:44:20,210 --> 00:44:22,510 of the confidence interval, pretending 863 00:44:22,510 --> 00:44:24,180 that this is correct. 864 00:44:24,180 --> 00:44:29,050 Well, if the number of data points is large, then we know, from 865 00:44:29,050 --> 00:44:31,870 the law of large numbers, that theta hat is a pretty good 866 00:44:31,870 --> 00:44:33,130 estimate of theta. 867 00:44:33,130 --> 00:44:36,670 So sigma hat is going to be a pretty good estimate of sigma. 868 00:44:36,670 --> 00:44:42,380 So we're not making large errors by using this approach. 869 00:44:42,380 --> 00:44:47,980 So in this scenario here, things were simple, because we 870 00:44:47,980 --> 00:44:49,890 had an analytical formula. 871 00:44:49,890 --> 00:44:52,210 Sigma was determined by theta. 872 00:44:52,210 --> 00:44:54,420 So we could come up with a quick and 873 00:44:54,420 --> 00:44:57,340 dirty estimate of sigma. 874 00:44:57,340 --> 00:45:00,940 In general, if you do not have any nice formulas of this 875 00:45:00,940 --> 00:45:03,000 kind, what could you do?
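Before turning to the general case, both Bernoulli options just described fit in one sketch: the conservative bound sigma at most 1/2 from the pollster problem, and the quick-and-dirty plug-in estimate sqrt(theta hat (1 - theta hat)). The function name and flag are ours, for illustration.

```python
import math

def bernoulli_ci_95(data, conservative=False):
    """Approximate 95% CI for a Bernoulli mean; data is a list of 0s and 1s."""
    n = len(data)
    theta_hat = sum(data) / n
    if conservative:
        sigma_hat = 0.5   # option 1: the worst-case upper bound on sigma
    else:
        # option 2: plug the estimate theta hat into sqrt(theta(1 - theta))
        sigma_hat = math.sqrt(theta_hat * (1 - theta_hat))
    half_width = 1.96 * sigma_hat / math.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width
```

With conservative=True the interval is guaranteed to be wide enough; with the plug-in estimate it is tighter, and for large n the law of large numbers makes sigma hat close to the true sigma.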
876 00:45:03,000 --> 00:45:04,920 Well, you still need to come up with an 877 00:45:04,920 --> 00:45:07,110 estimate of sigma somehow. 878 00:45:07,110 --> 00:45:08,950 What is a generic method for 879 00:45:08,950 --> 00:45:11,300 estimating a standard deviation? 880 00:45:11,300 --> 00:45:14,440 Equivalently, what could be a generic method for estimating 881 00:45:14,440 --> 00:45:16,920 a variance? 882 00:45:16,920 --> 00:45:19,360 Well, the variance is the expected value 883 00:45:19,360 --> 00:45:20,940 of some random variable. 884 00:45:20,940 --> 00:45:25,610 The variance is the mean of the random variable inside 885 00:45:25,610 --> 00:45:28,200 those brackets, (Xi minus theta) squared. 886 00:45:28,200 --> 00:45:33,160 How does one estimate the mean of some random variable? 887 00:45:33,160 --> 00:45:36,140 You obtain lots of measurements of that random 888 00:45:36,140 --> 00:45:40,210 variable and average them out. 889 00:45:40,210 --> 00:45:45,170 So this would be a reasonable way of estimating the variance 890 00:45:45,170 --> 00:45:47,310 of a distribution. 891 00:45:47,310 --> 00:45:50,590 And again, the weak law of large numbers tells us that 892 00:45:50,590 --> 00:45:55,370 this average converges to the expected value of this, which 893 00:45:55,370 --> 00:45:58,590 is just the variance of the distribution. 894 00:45:58,590 --> 00:46:01,700 So we got a nice and consistent way 895 00:46:01,700 --> 00:46:03,940 of estimating variances. 896 00:46:03,940 --> 00:46:08,100 But now, we seem to be getting into a vicious circle here, 897 00:46:08,100 --> 00:46:10,580 because to estimate the variance, we 898 00:46:10,580 --> 00:46:12,910 need to know the mean. 899 00:46:12,910 --> 00:46:16,075 And the mean is something we're trying to estimate in 900 00:46:16,075 --> 00:46:18,250 the first place. 901 00:46:18,250 --> 00:46:18,400 Okay. 902 00:46:18,400 --> 00:46:20,880 But we do have an estimate of the mean. 903 00:46:20,880 --> 00:46:24,640 So a reasonable approximation, once more, is to plug in, 904 00:46:24,640 --> 00:46:27,620 here, since we don't know the mean, the 905 00:46:27,620 --> 00:46:29,270 estimate of the mean. 906 00:46:29,270 --> 00:46:32,370 And so you get that expression, but with a theta 907 00:46:32,370 --> 00:46:35,130 hat instead of theta itself. 908 00:46:35,130 --> 00:46:37,980 And this is another reasonable way of 909 00:46:37,980 --> 00:46:40,180 estimating the variance. 910 00:46:40,180 --> 00:46:42,940 It does have the same consistency properties. 911 00:46:42,940 --> 00:46:44,050 Why? 912 00:46:44,050 --> 00:46:51,100 When n is large, this is going to behave the same as that, 913 00:46:51,100 --> 00:46:53,640 because theta hat converges to theta. 914 00:46:53,640 --> 00:46:57,890 And when n is large, this is approximately the same as 915 00:46:57,890 --> 00:46:58,820 sigma squared. 916 00:46:58,820 --> 00:47:02,220 So for large n, this quantity also converges to 917 00:47:02,220 --> 00:47:03,350 sigma squared. 918 00:47:03,350 --> 00:47:05,500 And we have a consistent estimate of 919 00:47:05,500 --> 00:47:07,000 the variance as well. 920 00:47:07,000 --> 00:47:09,490 And we can take that consistent estimate and use it 921 00:47:09,490 --> 00:47:12,360 back in the construction of the confidence interval. 922 00:47:12,360 --> 00:47:16,310 One little detail: here, we're dividing by n. 923 00:47:16,310 --> 00:47:19,590 Here, we're dividing by n-1. 924 00:47:19,590 --> 00:47:21,050 Why do we do this?
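While that question is on the table, here is a sketch of the two averages just described: one usable only if the true mean were known, and the plug-in version one actually computes in practice, with the n - 1 divisor that is about to be explained. The function names are ours, for illustration.

```python
import math

def variance_known_mean(data, theta):
    """Average of (Xi - theta)^2; usable only if the true mean theta is known."""
    return sum((x - theta) ** 2 for x in data) / len(data)

def sample_variance(data):
    """Plug in the sample mean for theta, and divide by n - 1."""
    n = len(data)
    theta_hat = sum(data) / n
    return sum((x - theta_hat) ** 2 for x in data) / (n - 1)

def sigma_hat(data):
    """Standard deviation estimate to feed into the confidence interval."""
    return math.sqrt(sample_variance(data))
```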
925 00:47:21,050 --> 00:47:24,630 Well, it turns out that's what you need to do for this 926 00:47:24,630 --> 00:47:28,590 estimate to be an unbiased estimate of the variance. 927 00:47:28,590 --> 00:47:32,080 One has to do a little bit of a calculation, and one finds 928 00:47:32,080 --> 00:47:36,650 that that's the factor that you need to have here in order 929 00:47:36,650 --> 00:47:37,770 to be unbiased. 930 00:47:37,770 --> 00:47:42,280 Of course, if you get 100 data points, whether you divide by 931 00:47:42,280 --> 00:47:46,070 100 or by 99, it's going to make only a tiny 932 00:47:46,070 --> 00:47:48,620 difference in your estimate of the variance. 933 00:47:48,620 --> 00:47:50,740 So it's going to make only a tiny difference in your 934 00:47:50,740 --> 00:47:52,670 estimate of the standard deviation. 935 00:47:52,670 --> 00:47:54,180 It's not a big deal. 936 00:47:54,180 --> 00:47:56,550 And it doesn't really matter. 937 00:47:56,550 --> 00:48:00,720 But if you want to show off your deeper knowledge of 938 00:48:00,720 --> 00:48:06,810 statistics, you throw in the 1 over n-1 factor in there. 939 00:48:06,810 --> 00:48:11,350 So now one basically needs to put together this whole story of 940 00:48:11,350 --> 00:48:15,260 how you estimate the variance. 941 00:48:15,260 --> 00:48:18,370 You first compute the sample mean. 942 00:48:18,370 --> 00:48:21,010 And then you do some extra work to come up with a 943 00:48:21,010 --> 00:48:23,020 reasonable estimate of the variance and 944 00:48:23,020 --> 00:48:24,640 the standard deviation. 945 00:48:24,640 --> 00:48:27,510 And then you use your estimate of the standard 946 00:48:27,510 --> 00:48:32,960 deviation to come up with a confidence interval, which has 947 00:48:32,960 --> 00:48:35,150 these two endpoints. 948 00:48:35,150 --> 00:48:39,130 In doing this procedure, there's basically a number of 949 00:48:39,130 --> 00:48:41,810 approximations that are involved. 950 00:48:41,810 --> 00:48:43,570 There are two types of approximations. 951 00:48:43,570 --> 00:48:46,170 One approximation is that we're pretending that the 952 00:48:46,170 --> 00:48:48,720 sample mean has a normal distribution. 953 00:48:48,720 --> 00:48:51,080 That's something we're justified in doing, by the 954 00:48:51,080 --> 00:48:52,470 central limit theorem. 955 00:48:52,470 --> 00:48:53,550 But it's not exact. 956 00:48:53,550 --> 00:48:54,910 It's an approximation. 957 00:48:54,910 --> 00:48:58,080 And the second approximation that comes in is that, instead 958 00:48:58,080 --> 00:49:01,260 of using the correct standard deviation, in general, you 959 00:49:01,260 --> 00:49:04,850 will have to use some approximation of 960 00:49:04,850 --> 00:49:06,100 the standard deviation. 961 00:49:08,390 --> 00:49:11,200 Okay, so you will be getting a little bit of practice with 962 00:49:11,200 --> 00:49:14,550 these concepts in recitation and tutorial. 963 00:49:14,550 --> 00:49:18,070 And we will move on to new topics next week. 964 00:49:18,070 --> 00:49:20,930 But the material that's going to be covered in the final 965 00:49:20,930 --> 00:49:23,570 exam is only up to this point. 966 00:49:23,570 --> 00:49:28,220 So next week is just general education. 967 00:49:28,220 --> 00:49:30,550 Hopefully useful, but it's not in the exam.
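As a closing footnote on that 1 over n - 1 factor from earlier: a quick simulation sketch, with the true distribution assumed standard normal so that sigma squared equals 1, makes the bias visible; n is kept deliberately tiny so the effect shows.

```python
import random

# True distribution assumed standard normal, so sigma squared = 1.
n, trials = 5, 200_000
divide_by_n = divide_by_n_minus_1 = 0.0
for _ in range(trials):
    data = [random.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)
    divide_by_n += ss / n
    divide_by_n_minus_1 += ss / (n - 1)

print(divide_by_n / trials)           # near 0.8 = (n-1)/n * sigma^2: biased low
print(divide_by_n_minus_1 / trials)   # near 1.0 = sigma^2: unbiased
```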