In this segment we provide a high-level introduction to the conceptual framework of classical statistics. In order to get there, it is better to start from what we already know and then make a comparison.

We already know how to make inferences by just using the Bayes rule. In this setting, we have an unknown quantity, theta, which we model as a random variable, and so, in particular, it is going to have a probability distribution. Then we make some observations, and those observations are modeled as random variables. Typically we are given the conditional distribution of the observations given the unknown variable. So these two distributions are the starting point; we then do some calculations, use the Bayes rule, and find the posterior distribution of theta given the observations. This posterior tells us all there is to know about the unknown quantity, theta, given the observations that we have made.

What is important in this framework is that theta is treated as a random variable, and so it has a distribution of its own. That is our starting point: these are our prior beliefs about theta before we obtain any observations.

However, one can think of situations where theta perhaps cannot be modeled as a random variable. Suppose that theta is some universal physical constant, for example the mass of the electron. Does it make sense to think of that quantity as random? And how would we come up with a probability distribution for it? One can argue that in certain situations one should not think of unknown quantities as being random; rather, they are just unknown constants. They are absolute constants, and it just happens that we do not know their value. Or there may be other situations in which, even though we may think that there is something random that determines theta, we are reluctant to postulate any prior distribution. We do not want to impose any biases.
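As a reminder, for a continuous unknown Theta and continuous observations X, the calculation just described can be written as follows, where f_Theta is the prior, f_{X|Theta} is the observation model, and f_{Theta|X} is the posterior (this is the standard notation, written out here only for reference):

```latex
f_{\Theta\mid X}(\theta\mid x)
  = \frac{f_\Theta(\theta)\, f_{X\mid\Theta}(x\mid\theta)}{f_X(x)},
\qquad
f_X(x) = \int f_\Theta(\theta')\, f_{X\mid\Theta}(x\mid\theta')\, d\theta' .
```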
And that leads us to the classical statistical framework, in which unknown quantities are treated as constants, not as random variables.

Pictorially, the setting is as follows. There is an unknown quantity that we wish to estimate, and we make some observations, X. Those observations are random, and they are drawn according to a probability distribution. That probability distribution depends on, or rather is affected by, the unknown quantity. So, for example, for one value of theta the distribution of the X's might be this one, and for another value of theta the distribution of the X's could be a different one. We are trying to guess what theta is, which in some ways is the question: do my data come from this distribution, or do they come from that distribution?

In order to make a choice of theta, what we do is take the data and process them. After we process them, we come up with our estimate, or rather our estimator. What is the estimator? We take the data and we calculate a function of the data; that is what it means to process the data. That function is our theta hat. Now this function, our data-processing mechanism, is what we call an estimator. But quite often, or usually, we also use the same terminology and call theta hat itself an estimator.

Notice that theta hat is a function of the random variable X, so theta hat is actually a random variable, and that is why we denote it with an uppercase theta. On the other hand, after we obtain some concrete data, little x, the realized values of the random variable capital X, we can apply our estimator to that particular input and compute a specific value, call it lowercase theta hat. That quantity we call an estimate.

So this is a useful distinction. Always, with random variables, we want to distinguish between the random variable itself, indicated by uppercase letters, and the values of the random variable, which are indicated with lowercase letters. Similarly, the estimator theta hat is a random variable, since it is a function of the data.
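To make the distinction concrete, here is a minimal sketch in Python (my own illustration, assuming a made-up observation model X_i = theta + noise; the function name sample_mean_estimator is hypothetical):

```python
import numpy as np

# Hypothetical example: estimating an unknown constant theta from noisy
# observations X_i = theta + W_i, using the sample mean as the estimator.

def sample_mean_estimator(x):
    """The estimator: a rule (function) that maps data to a value theta hat."""
    return np.mean(x)

# The true value is a fixed, unknown constant (picked here only to simulate).
theta_true = 2.5

# Capital X: the random observations (simulated here).
rng = np.random.default_rng(0)
X = theta_true + rng.normal(0.0, 1.0, size=100)

# Lowercase x: a realized data set; applying the estimator to it gives an estimate.
theta_hat = sample_mean_estimator(X)
print(f"estimate based on this particular data set: {theta_hat:.3f}")
```

The function itself plays the role of the estimator; the number it returns for a particular realized data set is the corresponding estimate.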
The estimator is essentially a description of how we generate estimates, whereas the realized value, once we have some specific observations at hand, is what we call an estimate.

Now let me continue with a few comments. The picture, or the setting, that I have here suggests that X is just one variable and theta is one variable. But we can have the same framework even if X and theta are multi-dimensional. For example, X might consist of several random variables, and theta may be a parameter that consists of multiple components.

You may notice that the notation we are using here is a little different from our traditional notation, which was of this form. In what ways is it different? The main difference is that here theta is not a random variable; theta is just a parameter. So what we are dealing with is not a conditional distribution but an ordinary distribution, one that happens to involve, inside its description, some parameter theta. Just to emphasize that these are not conditional probabilities, because theta is not a random variable, we use a semicolon instead of a bar. And since theta is not a random variable, we do not include it in the subscript down here when we talk about the classical setting.

The best way to think of the situation mathematically is that we are essentially dealing with multiple candidate models, as in this picture. This could be one possible model of X; this could be another possible model of X. We have one such model for each possible value of theta. And if, for example, I were to get data points that sit down here, then a reasonable way to make an inference could be to say: these data are extremely unlikely to have been generated according to this model, and quite likely to have been generated by this model, so I am going to pick this particular model.
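To see the "pick the more likely model" idea in an actual computation, here is a small numerical sketch (my own example; it assumes a normal observation model with unit variance and two candidate values of theta, none of which is specified in the lecture):

```python
import numpy as np

# Illustration of the "multiple candidate models" view.
# Assume each observation is normal with unit variance and unknown mean theta,
# so the model is p_X(x; theta): theta is a fixed parameter, not a random variable.

candidate_thetas = [0.0, 5.0]           # the two candidate models in the picture
x = np.array([4.2, 5.1, 4.8, 5.6])      # some observed data points

for theta in candidate_thetas:
    # Log-likelihood of the whole data set under the model indexed by theta
    # (log of the product of N(theta, 1) densities).
    log_lik = -0.5 * np.sum((x - theta) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)
    print(f"theta = {theta}: log-likelihood = {log_lik:.1f}")

# The data are far more likely under theta = 5.0, so we would pick that model.
```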
So even though we are not treating theta as a random variable, and we do not have the Bayes rule in our hands, we can still see, at least from this simple example, that there should be a reasonable way of making inferences.

Let me close with some comments on the different types of problems that we may encounter in classical statistics. One class of problems are the so-called hypothesis testing problems, in which we are asked to choose between two candidate models. The unknown parameter, as in this example, can take one of two values. Think of a machine that produces coins, where the coins are either fair or have a very specific bias. You flip the coin, maybe multiple times, and then decide whether you are dealing with a coin of this type or of that type.

There is another type of hypothesis testing problem which is a little more complicated, for example this one. We have one hypothesis which says that my coin is fair, versus an alternative hypothesis under which my coin is unfair. But notice that this alternative hypothesis actually includes many possible scenarios: there are many possible values of theta under which it would be true. We will not deal with problems of this kind in this segment, or in this lecture sequence. Instead we will focus exclusively on estimation problems.

In estimation problems, the unknown parameter, theta, is either continuous or can take one of many, many values. What we want to do is to design an estimator, a way of processing the data, that comes up with estimates that are good. What does it mean for an estimate to be good? An estimate is good if the resulting value of the estimation error, that is, the difference between the estimated value and the true value, is small. We want to keep that difference small in some sense, and one then needs a criterion of what it means to be small.
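Before turning to that, here is a minimal simulated sketch of such an estimation problem (my own illustration, not from the lecture, with the fraction of heads as the estimator):

```python
import numpy as np

# The unknown constant theta is the probability of heads of a coin.
# We flip the coin n times and use the fraction of heads as our estimator.

rng = np.random.default_rng(1)

theta_true = 0.6                      # fixed but unknown in practice; set here to simulate
n = 1000                              # number of coin flips
flips = rng.random(n) < theta_true    # X_1, ..., X_n: True for heads, False for tails

theta_hat = flips.mean()              # the estimate computed from this data set
error = theta_hat - theta_true        # the estimation error we would like to be small

print(f"estimate: {theta_hat:.3f}, estimation error: {error:+.3f}")
```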
For instance, we may want the error to be small in expectation, or with high probability, and so on. The statement "keep the estimation error small" can be interpreted in various ways, and for that reason there is no single approach to the problem of designing a good estimator.

This is something that happens more generally in classical statistics. Typically, problems do not admit a single best approach; they do not admit unique answers. Reasonable people can come up with different methodologies for approaching the same problem, and there is a little bit of an element of art involved here. In general, one wants to come up with reasonable methods that have good properties, and we will see some examples of what this may mean. But again, I am emphasizing that there is no single best method.

So whereas the Bayes rule is a completely unambiguous way of making inferences, here, in the context of classical statistics, there will be some freedom as to what approaches one might take.
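As a closing illustration of one such criterion, the "small in expectation" reading mentioned above is often made concrete as the mean squared error; the sketch below approximates it by simulation, continuing the coin example (again my own illustration, not part of the lecture):

```python
import numpy as np

# One concrete criterion: the mean squared error E[(theta_hat - theta)^2],
# approximated here by repeating the coin experiment many times.

rng = np.random.default_rng(2)
theta_true, n, trials = 0.6, 1000, 10_000

# Each row is one run of the experiment; each run yields one estimate.
estimates = (rng.random((trials, n)) < theta_true).mean(axis=1)
mse = np.mean((estimates - theta_true) ** 2)

print(f"simulated mean squared error:       {mse:.6f}")
print(f"theoretical value theta(1-theta)/n: {theta_true * (1 - theta_true) / n:.6f}")
```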