1 00:00:00,000 --> 00:00:00,040 2 00:00:00,040 --> 00:00:02,460 The following content is provided under a Creative 3 00:00:02,460 --> 00:00:03,870 Commons license. 4 00:00:03,870 --> 00:00:06,910 Your support will help MIT OpenCourseWare continue to 5 00:00:06,910 --> 00:00:10,560 offer high quality educational resources for free. 6 00:00:10,560 --> 00:00:13,460 To make a donation or view additional materials from 7 00:00:13,460 --> 00:00:19,290 hundreds of MIT courses, visit MIT OpenCourseWare at 8 00:00:19,290 --> 00:00:21,708 ocw.mit.edu. 9 00:00:21,708 --> 00:00:25,380 PROFESSOR: It involves real phenomena out there. 10 00:00:25,380 --> 00:00:28,960 So we have real stuff that happens. 11 00:00:28,960 --> 00:00:33,630 So it might be an arrival process to a bank that we're 12 00:00:33,630 --> 00:00:35,790 trying to model. 13 00:00:35,790 --> 00:00:38,230 This is a reality, but this is what we have 14 00:00:38,230 --> 00:00:39,660 been doing so far. 15 00:00:39,660 --> 00:00:41,910 We have been playing with models of 16 00:00:41,910 --> 00:00:43,770 probabilistic phenomena. 17 00:00:43,770 --> 00:00:46,730 And somehow we need to tie the two together. 18 00:00:46,730 --> 00:00:50,930 The way these are tied is that we observe the real world and 19 00:00:50,930 --> 00:00:53,530 this gives us data. 20 00:00:53,530 --> 00:00:58,590 And then based on these data, we try to come up with a model 21 00:00:58,590 --> 00:01:01,930 of what exactly is going on. 22 00:01:01,930 --> 00:01:05,290 For example, for an arrival process, you might ask the 23 00:01:05,290 --> 00:01:08,680 model in question, is my arrival process Poisson or is 24 00:01:08,680 --> 00:01:10,300 it something different? 25 00:01:10,300 --> 00:01:14,630 If it is Poisson, what is the rate of the arrival process? 
26 00:01:14,630 --> 00:01:17,460 Once you come up with your model and you come up with the 27 00:01:17,460 --> 00:01:21,710 parameters of the model, then you can use it to make 28 00:01:21,710 --> 00:01:27,520 predictions about reality or to figure out certain hidden 29 00:01:27,520 --> 00:01:31,890 things, certain hidden aspects of reality, that you do not 30 00:01:31,890 --> 00:01:35,560 observe directly, but you try to infer what they are. 31 00:01:35,560 --> 00:01:38,900 So that's where the usefulness of the model comes in. 32 00:01:38,900 --> 00:01:43,330 Now this field is of course tremendously useful. 33 00:01:43,330 --> 00:01:46,650 And it shows up pretty much everywhere. 34 00:01:46,650 --> 00:01:50,000 So we talked about the polling examples in the 35 00:01:50,000 --> 00:01:51,280 last couple of lectures. 36 00:01:51,280 --> 00:01:53,520 This is, of course, a real application. 37 00:01:53,520 --> 00:01:57,525 You sample and on the basis of the sample that you have, you 38 00:01:57,525 --> 00:02:00,400 try to make some inferences about, let's say, the 39 00:02:00,400 --> 00:02:03,060 preferences in a given population. 40 00:02:03,060 --> 00:02:06,230 Let's say in the medical field, you want to try whether 41 00:02:06,230 --> 00:02:08,919 a certain drug makes a difference or not. 42 00:02:08,919 --> 00:02:14,380 So people would do medical trials, get some results, and 43 00:02:14,380 --> 00:02:17,640 then from the data somehow you need to make sense of them and 44 00:02:17,640 --> 00:02:18,530 make a decision. 45 00:02:18,530 --> 00:02:21,360 Is the new drug useful or is it not? 46 00:02:21,360 --> 00:02:23,460 How do we go systematically about the 47 00:02:23,460 --> 00:02:24,710 question of this type? 
48 00:02:24,710 --> 00:02:27,770 49 00:02:27,770 --> 00:02:32,170 A sexier, more recent topic, there's this famous Netflix 50 00:02:32,170 --> 00:02:37,510 competition where Netflix gives you a huge table of 51 00:02:37,510 --> 00:02:41,450 movies and people. 52 00:02:41,450 --> 00:02:45,860 And people have rated the movies, but not everyone has 53 00:02:45,860 --> 00:02:47,850 watched all of the movies in there. 54 00:02:47,850 --> 00:02:49,460 You have some of the ratings. 55 00:02:49,460 --> 00:02:53,250 For example, this person gave a 4 to that particular movie. 56 00:02:53,250 --> 00:02:56,300 So you get a table that's partially filled. 57 00:02:56,300 --> 00:02:58,300 And Netflix asks you to make 58 00:02:58,300 --> 00:02:59,860 recommendations to people. 59 00:02:59,860 --> 00:03:02,410 So this means trying to guess. 60 00:03:02,410 --> 00:03:06,100 This person here, how much would they like this 61 00:03:06,100 --> 00:03:07,610 particular movie? 62 00:03:07,610 --> 00:03:11,130 And you can start thinking, well, maybe this person has 63 00:03:11,130 --> 00:03:14,860 given somewhat similar ratings to another person. 64 00:03:14,860 --> 00:03:18,440 And if that other person has also seen that movie, maybe 65 00:03:18,440 --> 00:03:21,290 the rating of that other person is relevant. 66 00:03:21,290 --> 00:03:24,230 But of course it's a lot more complicated than that. 67 00:03:24,230 --> 00:03:26,650 And this has been a serious competition where people have 68 00:03:26,650 --> 00:03:30,230 been using every piece of heavy machinery that there is in 69 00:03:30,230 --> 00:03:32,540 statistics, trying to come up with good 70 00:03:32,540 --> 00:03:35,140 recommendation systems. 71 00:03:35,140 --> 00:03:37,870 Then other people, of course, are trying to analyze 72 00:03:37,870 --> 00:03:39,010 financial data. 73 00:03:39,010 --> 00:03:43,680 Somebody gives you the sequence of the values, let's 74 00:03:43,680 --> 00:03:45,840 say of the S&P index. 
75 00:03:45,840 --> 00:03:47,850 You look at something like this 76 00:03:47,850 --> 00:03:49,770 and you can ask questions. 77 00:03:49,770 --> 00:03:55,030 How do I model these data using any of the models that 78 00:03:55,030 --> 00:03:57,060 we have in our bag of tools? 79 00:03:57,060 --> 00:04:00,230 How can I make predictions about what's going to happen 80 00:04:00,230 --> 00:04:03,310 afterwards, and so on? 81 00:04:03,310 --> 00:04:09,700 On the engineering side, anywhere you have noise, 82 00:04:09,700 --> 00:04:11,590 inference comes in. 83 00:04:11,590 --> 00:04:13,810 Signal processing, in some sense, is just 84 00:04:13,810 --> 00:04:14,960 an inference problem. 85 00:04:14,960 --> 00:04:18,730 You observe signals that are noisy and you try to figure 86 00:04:18,730 --> 00:04:21,750 out exactly what's happening out there or what kind of 87 00:04:21,750 --> 00:04:24,130 signal has been sent. 88 00:04:24,130 --> 00:04:28,830 Maybe the beginning of the field could be traced back a few 89 00:04:28,830 --> 00:04:32,060 hundred years, to when people would make 90 00:04:32,060 --> 00:04:35,420 astronomical observations of the position of the 91 00:04:35,420 --> 00:04:37,550 planets in the sky. 92 00:04:37,550 --> 00:04:41,130 They would have some belief that perhaps the orbit of 93 00:04:41,130 --> 00:04:44,070 a planet is an ellipse. 94 00:04:44,070 --> 00:04:47,840 Or if it's a comet, maybe it's a parabola, hyperbola, don't 95 00:04:47,840 --> 00:04:48,640 know what it is. 96 00:04:48,640 --> 00:04:51,320 But they would have a model of that. 97 00:04:51,320 --> 00:04:53,840 But, of course, astronomical measurements would not be 98 00:04:53,840 --> 00:04:55,300 perfectly exact. 99 00:04:55,300 --> 00:05:00,690 And they would try to find the curve that fits these data. 
100 00:05:00,690 --> 00:05:05,580 How do you go about choosing this particular curve on the 101 00:05:05,580 --> 00:05:07,960 basis of noisy data and doing it in a 102 00:05:07,960 --> 00:05:11,274 somewhat principled way? 103 00:05:11,274 --> 00:05:13,890 OK, so questions of this type-- clearly the 104 00:05:13,890 --> 00:05:17,100 applications are all over the place. 105 00:05:17,100 --> 00:05:20,830 But how is this related conceptually to what we have 106 00:05:20,830 --> 00:05:22,480 been doing so far? 107 00:05:22,480 --> 00:05:25,960 What's the relation between the field of inference and the 108 00:05:25,960 --> 00:05:28,130 field of probability as we have been 109 00:05:28,130 --> 00:05:30,650 practicing it until now? 110 00:05:30,650 --> 00:05:33,620 Well, mathematically speaking, what's going to happen in the 111 00:05:33,620 --> 00:05:38,780 next few lectures could be just exercises or homework 112 00:05:38,780 --> 00:05:44,880 problems in the class, based on what we have done so far. 113 00:05:44,880 --> 00:05:48,560 That means you're not going to get any new facts about 114 00:05:48,560 --> 00:05:50,200 probability theory. 115 00:05:50,200 --> 00:05:53,930 Everything we're going to do will be simple applications of 116 00:05:53,930 --> 00:05:57,110 things that you already know. 117 00:05:57,110 --> 00:06:00,140 So in some sense, statistics and inference are just an 118 00:06:00,140 --> 00:06:02,780 applied exercise in probability. 119 00:06:02,780 --> 00:06:08,310 But actually, things are not that simple in 120 00:06:08,310 --> 00:06:09,550 the following sense. 121 00:06:09,550 --> 00:06:12,510 If you get a probability problem, 122 00:06:12,510 --> 00:06:14,040 there's a correct answer. 123 00:06:14,040 --> 00:06:15,450 There's a correct solution. 124 00:06:15,450 --> 00:06:18,170 And that correct solution is unique. 125 00:06:18,170 --> 00:06:20,550 There's no ambiguity. 
126 00:06:20,550 --> 00:06:23,380 The theory of probability has clearly defined rules. 127 00:06:23,380 --> 00:06:24,570 These are the axioms. 128 00:06:24,570 --> 00:06:27,550 You're given some information about probability 129 00:06:27,550 --> 00:06:28,280 distributions. 130 00:06:28,280 --> 00:06:31,000 You're asked to calculate certain other things. 131 00:06:31,000 --> 00:06:32,190 There's no ambiguity. 132 00:06:32,190 --> 00:06:34,230 Answers are always unique. 133 00:06:34,230 --> 00:06:39,180 In statistical questions, it's no longer the case that the 134 00:06:39,180 --> 00:06:41,420 question has a unique answer. 135 00:06:41,420 --> 00:06:44,990 If I give you data and I ask you what's the best way of 136 00:06:44,990 --> 00:06:49,710 estimating the motion of that planet, reasonable people can 137 00:06:49,710 --> 00:06:53,370 come up with different methods. 138 00:06:53,370 --> 00:06:56,790 And reasonable people will try to argue that my method has 139 00:06:56,790 --> 00:07:00,140 these desirable properties, but somebody else may say, here's 140 00:07:00,140 --> 00:07:03,740 another method that has certain desirable properties. 141 00:07:03,740 --> 00:07:08,220 And it's not clear what the best method is. 142 00:07:08,220 --> 00:07:11,330 So it's good to have some understanding of what the 143 00:07:11,330 --> 00:07:16,910 issues are and to know at least what the general 144 00:07:16,910 --> 00:07:20,150 class of methods is that one tries to consider, and how 145 00:07:20,150 --> 00:07:22,380 one goes about such problems. 146 00:07:22,380 --> 00:07:24,360 So we're going to see lots and lots of 147 00:07:24,360 --> 00:07:25,880 different inference methods. 148 00:07:25,880 --> 00:07:27,350 We're not going to tell you that one is 149 00:07:27,350 --> 00:07:28,730 better than the other. 150 00:07:28,730 --> 00:07:30,940 But it's important to understand what are the 151 00:07:30,940 --> 00:07:33,980 concepts behind those different methods. 
152 00:07:33,980 --> 00:07:38,710 And finally, statistics can be misused really badly. 153 00:07:38,710 --> 00:07:41,870 That is, one can come up with methods that you think are 154 00:07:41,870 --> 00:07:48,650 sound, but in fact they're not quite that. 155 00:07:48,650 --> 00:07:52,830 I will bring some examples next time and talk a little 156 00:07:52,830 --> 00:07:54,290 more about this. 157 00:07:54,290 --> 00:07:58,540 So, let's say you have some data, you want to make 158 00:07:58,540 --> 00:08:02,590 some inference from them, what many people will do is to go 159 00:08:02,590 --> 00:08:06,340 to Wikipedia, find a statistical test that they 160 00:08:06,340 --> 00:08:08,990 think applies to that situation, plug in numbers, 161 00:08:08,990 --> 00:08:10,880 and present results. 162 00:08:10,880 --> 00:08:14,220 Are the conclusions that they get really justified or are 163 00:08:14,220 --> 00:08:16,400 they misusing statistical methods? 164 00:08:16,400 --> 00:08:20,520 Well, too many people actually do misuse statistics and 165 00:08:20,520 --> 00:08:24,530 conclusions that people get are often false. 166 00:08:24,530 --> 00:08:29,840 So it's important, besides just being able to copy 167 00:08:29,840 --> 00:08:32,600 statistical tests and use them, to understand what are 168 00:08:32,600 --> 00:08:35,860 the assumptions behind the different methods and what 169 00:08:35,860 --> 00:08:40,559 kind of guarantees they have, if any. 170 00:08:40,559 --> 00:08:44,420 All right, so we'll try to do a quick tour through the field 171 00:08:44,420 --> 00:08:47,600 of inference in this lecture and the next few lectures that 172 00:08:47,600 --> 00:08:51,700 we have left this semester and try to highlight at the very 173 00:08:51,700 --> 00:08:53,940 high level the main concepts, skills, and 174 00:08:53,940 --> 00:08:56,990 techniques that come in. 
175 00:08:56,990 --> 00:08:59,840 Let's start with some generalities and some general 176 00:08:59,840 --> 00:09:01,090 statements. 177 00:09:01,090 --> 00:09:03,090 178 00:09:03,090 --> 00:09:07,090 One first statement is that statistics or inference 179 00:09:07,090 --> 00:09:11,800 problems come up in very different guises. 180 00:09:11,800 --> 00:09:16,490 And they may look as if they are of very different forms. 181 00:09:16,490 --> 00:09:20,190 Although, at some fundamental level, the basic issues turn 182 00:09:20,190 --> 00:09:23,320 out to be always pretty much the same. 183 00:09:23,320 --> 00:09:27,880 So let's look at this example. 184 00:09:27,880 --> 00:09:31,420 There's an unknown signal that's being sent. 185 00:09:31,420 --> 00:09:35,840 It's sent through some medium, and that medium just takes the 186 00:09:35,840 --> 00:09:39,180 signal and amplifies it by a certain number. 187 00:09:39,180 --> 00:09:41,340 So you can think of somebody shouting. 188 00:09:41,340 --> 00:09:42,920 There's the air out there. 189 00:09:42,920 --> 00:09:46,420 What you shouted will be attenuated through the air 190 00:09:46,420 --> 00:09:48,040 until it gets to a receiver. 191 00:09:48,040 --> 00:09:51,730 And that receiver then observes this, but together 192 00:09:51,730 --> 00:09:53,110 with some random noise. 193 00:09:53,110 --> 00:09:56,040 194 00:09:56,040 --> 00:10:00,390 Here I meant S. S is the signal that's being sent. 195 00:10:00,390 --> 00:10:06,280 And what you observe is an X. 196 00:10:06,280 --> 00:10:09,240 You observe X, so what kind of inference problems 197 00:10:09,240 --> 00:10:11,240 could we have here? 198 00:10:11,240 --> 00:10:15,400 In some cases, you want to build a model of the physical 199 00:10:15,400 --> 00:10:17,450 phenomenon that you're dealing with. 
200 00:10:17,450 --> 00:10:21,180 So for example, you don't know the attenuation of your signal 201 00:10:21,180 --> 00:10:25,190 and you try to find out what this number is based on the 202 00:10:25,190 --> 00:10:26,980 observations that you have. 203 00:10:26,980 --> 00:10:30,240 So the way this is done in engineering systems is that 204 00:10:30,240 --> 00:10:35,020 you design a certain signal, you know what it is, you shout 205 00:10:35,020 --> 00:10:39,560 a particular word, and then the receiver listens. 206 00:10:39,560 --> 00:10:43,460 And based on the intensity of the signal that they get, they 207 00:10:43,460 --> 00:10:48,380 try to make a guess about A. So you don't know A, but you 208 00:10:48,380 --> 00:10:52,460 know S. And by observing X, you get some information 209 00:10:52,460 --> 00:10:54,270 about what A is. 210 00:10:54,270 --> 00:10:57,810 So in this case, you're trying to build a model of the medium 211 00:10:57,810 --> 00:11:01,170 through which your signal is propagating. 212 00:11:01,170 --> 00:11:04,600 So sometimes one would call problems of this kind, let's 213 00:11:04,600 --> 00:11:07,990 say, system identification. 214 00:11:07,990 --> 00:11:11,980 In a different version of an inference problem that comes 215 00:11:11,980 --> 00:11:15,300 with this picture, you've done your modeling. 216 00:11:15,300 --> 00:11:18,160 You know your A. You know the medium through which the 217 00:11:18,160 --> 00:11:22,330 signal is going, but it's a communication system. 218 00:11:22,330 --> 00:11:24,190 This person is trying to communicate 219 00:11:24,190 --> 00:11:26,140 something to that person. 220 00:11:26,140 --> 00:11:30,250 So you send the signal S, but that person receives a noisy 221 00:11:30,250 --> 00:11:35,430 version of S. So that person tries to reconstruct S based 222 00:11:35,430 --> 00:11:36,930 on X. 
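The two versions of the problem described above can be sketched in a few lines of Python. This is only an illustration, not the method used in the lecture: it assumes the observation has the form X = A*S + W, that the noise W is zero-mean Gaussian, and that a plain average of ratios is an acceptable guess. The names `observe`, `estimate_gain`, `estimate_signal`, and all numerical values are hypothetical.

```python
import random


def observe(a, s, noise_std, rng):
    """Return one noisy observation X = a*s + W (W assumed zero-mean Gaussian)."""
    return a * s + rng.gauss(0.0, noise_std)


def estimate_gain(known_signal, observations):
    """System identification: S is known, so guess A by averaging X / S."""
    return sum(x / known_signal for x in observations) / len(observations)


def estimate_signal(known_gain, observations):
    """Signal reconstruction: A is known, so guess S by averaging X / A."""
    return sum(x / known_gain for x in observations) / len(observations)


# Hypothetical values, chosen only for illustration.
rng = random.Random(0)
true_a, true_s = 0.5, 2.0
xs = [observe(true_a, true_s, 0.1, rng) for _ in range(1000)]

a_hat = estimate_gain(true_s, xs)    # version 1: A unknown, S known
s_hat = estimate_signal(true_a, xs)  # version 2: A known, S unknown
```

The two estimator functions have exactly the same shape with the roles of A and S swapped, which is the sense in which the two problems are mathematically the same.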
223 00:11:36,930 --> 00:11:42,210 So in both cases, we have a linear relation between X and 224 00:11:42,210 --> 00:11:43,490 the unknown quantity. 225 00:11:43,490 --> 00:11:47,360 In one version, A is the unknown and we know S. In the 226 00:11:47,360 --> 00:11:51,670 other version, A is known, and so we try to infer S. 227 00:11:51,670 --> 00:11:54,300 Mathematically, you can see that this is essentially the 228 00:11:54,300 --> 00:11:57,060 same kind of problem in both cases. 229 00:11:57,060 --> 00:12:03,590 Although, the kind of practical problem that you're 230 00:12:03,590 --> 00:12:07,580 trying to solve is a little different. 231 00:12:07,580 --> 00:12:11,880 So we will not be making any distinctions between problems 232 00:12:11,880 --> 00:12:15,940 of the model building type as opposed to problems where you 233 00:12:15,940 --> 00:12:19,260 try to estimate some unknown signal and so on. 234 00:12:19,260 --> 00:12:22,400 Because conceptually, the tools that one uses for both 235 00:12:22,400 --> 00:12:26,850 types of problems are essentially the same. 236 00:12:26,850 --> 00:12:30,430 OK, next a very useful classification 237 00:12:30,430 --> 00:12:31,680 of inference problems-- 238 00:12:31,680 --> 00:12:34,170 239 00:12:34,170 --> 00:12:37,760 the unknown quantity that you're trying to estimate 240 00:12:37,760 --> 00:12:40,770 could be a discrete one that takes a 241 00:12:40,770 --> 00:12:43,040 small number of values. 242 00:12:43,040 --> 00:12:45,605 So this could be discrete problems, such as the airplane 243 00:12:45,605 --> 00:12:48,080 radar problem we encountered back a long 244 00:12:48,080 --> 00:12:50,120 time ago in this class. 245 00:12:50,120 --> 00:12:52,120 So there's two possibilities-- 246 00:12:52,120 --> 00:12:55,450 an airplane is out there or an airplane is not out there. 247 00:12:55,450 --> 00:12:57,050 And you're trying to make a decision 248 00:12:57,050 --> 00:12:58,940 between these two options. 
249 00:12:58,940 --> 00:13:01,570 Or you can have other problems where you have, let's say, 250 00:13:01,570 --> 00:13:03,380 four possible options. 251 00:13:03,380 --> 00:13:05,970 You don't know which one is true, but you get data and you 252 00:13:05,970 --> 00:13:09,040 try to figure out which one is true. 253 00:13:09,040 --> 00:13:12,050 In problems of this kind, usually you want to make a 254 00:13:12,050 --> 00:13:14,050 decision based on your data. 255 00:13:14,050 --> 00:13:17,000 And you're interested in the probability of making a 256 00:13:17,000 --> 00:13:18,040 correct decision. 257 00:13:18,040 --> 00:13:19,430 You would like that probability to 258 00:13:19,430 --> 00:13:21,830 be as high as possible. 259 00:13:21,830 --> 00:13:24,000 Estimation problems are a little different. 260 00:13:24,000 --> 00:13:28,540 Here you have some continuous quantity that's not known. 261 00:13:28,540 --> 00:13:31,860 And you try to make a good guess of that quantity. 262 00:13:31,860 --> 00:13:36,050 And you would like your guess to be as close as possible to 263 00:13:36,050 --> 00:13:37,310 the true quantity. 264 00:13:37,310 --> 00:13:40,270 So the polling problem was of this type. 265 00:13:40,270 --> 00:13:44,720 There was an unknown fraction f of the population that had 266 00:13:44,720 --> 00:13:45,870 some property. 267 00:13:45,870 --> 00:13:50,040 And you try to estimate f as accurately as you can. 268 00:13:50,040 --> 00:13:53,420 So the distinction here is that usually here the unknown 269 00:13:53,420 --> 00:13:56,440 quantity takes on a discrete set of values. 270 00:13:56,440 --> 00:13:57,890 Here the unknown quantity takes a 271 00:13:57,890 --> 00:14:00,030 continuous set of values. 272 00:14:00,030 --> 00:14:02,980 Here we're interested in the probability of error. 273 00:14:02,980 --> 00:14:07,400 Here we're interested in the size of the error. 
274 00:14:07,400 --> 00:14:11,000 Broadly speaking, most inference problems fall either 275 00:14:11,000 --> 00:14:13,940 in this category or in that category. 276 00:14:13,940 --> 00:14:17,230 Although, if you want to complicate life, you can also 277 00:14:17,230 --> 00:14:20,250 think of or construct problems where both of these aspects 278 00:14:20,250 --> 00:14:24,410 are simultaneously present. 279 00:14:24,410 --> 00:14:28,530 OK, finally since we're in classification mode, there is 280 00:14:28,530 --> 00:14:33,670 a very big, important dichotomy in how one goes 281 00:14:33,670 --> 00:14:35,940 about inference problems. 282 00:14:35,940 --> 00:14:39,150 And here there's two fundamentally different 283 00:14:39,150 --> 00:14:46,070 philosophical points of view, which is how do we model the 284 00:14:46,070 --> 00:14:50,270 quantity that is unknown? 285 00:14:50,270 --> 00:14:54,530 In one approach, you say there's a certain quantity 286 00:14:54,530 --> 00:14:57,590 that has a definite value. 287 00:14:57,590 --> 00:15:00,010 It just happens that I don't know it. 288 00:15:00,010 --> 00:15:01,320 But it's a number. 289 00:15:01,320 --> 00:15:03,290 There's nothing random about it. 290 00:15:03,290 --> 00:15:05,945 So think of trying to estimate some physical quantity. 291 00:15:05,945 --> 00:15:10,630 292 00:15:10,630 --> 00:15:13,350 You're making measurements, you try to estimate the mass 293 00:15:13,350 --> 00:15:15,820 of an electron, which is a sort of 294 00:15:15,820 --> 00:15:18,270 universal physical constant. 295 00:15:18,270 --> 00:15:20,320 There's nothing random about it. 296 00:15:20,320 --> 00:15:22,340 It's a fixed number. 297 00:15:22,340 --> 00:15:29,120 You get data, because you have some measuring apparatus. 
298 00:15:29,120 --> 00:15:33,020 And with that measuring apparatus, the results 299 00:15:33,020 --> 00:15:37,160 that you get are affected by the true mass of the electron, 300 00:15:37,160 --> 00:15:39,340 but there's also some noise. 301 00:15:39,340 --> 00:15:42,200 You take the data out of your measuring apparatus and you 302 00:15:42,200 --> 00:15:44,465 try to come up with some estimate of 303 00:15:44,465 --> 00:15:47,220 that quantity theta. 304 00:15:47,220 --> 00:15:49,760 So this is definitely a legitimate picture, but the 305 00:15:49,760 --> 00:15:52,370 important thing in this picture is that this theta is 306 00:15:52,370 --> 00:15:54,570 written as lowercase. 307 00:15:54,570 --> 00:15:58,110 And that's to make the point that it's a real number, not a 308 00:15:58,110 --> 00:16:00,900 random variable. 309 00:16:00,900 --> 00:16:03,230 There's a different philosophical approach which 310 00:16:03,230 --> 00:16:08,180 says, well, anything that I don't know I should model 311 00:16:08,180 --> 00:16:10,190 as a random variable. 312 00:16:10,190 --> 00:16:11,130 Yes, I know. 313 00:16:11,130 --> 00:16:14,500 The mass of the electron is not really random. 314 00:16:14,500 --> 00:16:15,690 It's a constant. 315 00:16:15,690 --> 00:16:17,920 But I don't know what it is. 316 00:16:17,920 --> 00:16:22,510 I have some vague sense of what it is, perhaps 317 00:16:22,510 --> 00:16:24,290 because of the experiments that some other 318 00:16:24,290 --> 00:16:25,940 people carried out. 319 00:16:25,940 --> 00:16:30,560 So perhaps I have a prior distribution on the possible 320 00:16:30,560 --> 00:16:32,160 values of Theta. 
321 00:16:32,160 --> 00:16:34,990 And that prior distribution doesn't mean that nature 322 00:16:34,990 --> 00:16:39,320 is random, but it's more of a description of my 323 00:16:39,320 --> 00:16:44,570 subjective beliefs about where I think this constant number 324 00:16:44,570 --> 00:16:46,200 happens to be. 325 00:16:46,200 --> 00:16:50,140 So even though it's not truly random, I model my initial 326 00:16:50,140 --> 00:16:52,600 beliefs before the experiment starts 327 00:16:52,600 --> 00:16:55,790 in terms of a prior distribution. I view it as a 328 00:16:55,790 --> 00:16:57,470 random variable. 329 00:16:57,470 --> 00:17:01,850 Then I observe another related random variable through some 330 00:17:01,850 --> 00:17:02,930 measuring apparatus. 331 00:17:02,930 --> 00:17:05,920 And then I use this again to create an estimate. 332 00:17:05,920 --> 00:17:08,819 333 00:17:08,819 --> 00:17:12,069 So these two pictures philosophically are very 334 00:17:12,069 --> 00:17:13,589 different from each other. 335 00:17:13,589 --> 00:17:17,130 Here we treat the unknown quantities as unknown numbers. 336 00:17:17,130 --> 00:17:20,589 Here we treat them as random variables. 337 00:17:20,589 --> 00:17:24,829 When we treat them as random variables, then we know pretty 338 00:17:24,829 --> 00:17:27,109 much already what we should be doing. 339 00:17:27,109 --> 00:17:29,470 We should just use the Bayes rule. 340 00:17:29,470 --> 00:17:31,850 Based on X, find the conditional 341 00:17:31,850 --> 00:17:33,670 distribution of Theta. 342 00:17:33,670 --> 00:17:37,520 And that's what we will be doing mostly over this lecture 343 00:17:37,520 --> 00:17:40,010 and the next lecture. 344 00:17:40,010 --> 00:17:44,660 Now in both cases, what you end up getting at the end is 345 00:17:44,660 --> 00:17:47,240 an estimate. 346 00:17:47,240 --> 00:17:52,120 But actually, that estimate, what kind of object is it? 
347 00:17:52,120 --> 00:17:55,170 It's a random variable in both cases. 348 00:17:55,170 --> 00:17:56,000 Why? 349 00:17:56,000 --> 00:17:58,130 Even in this case where theta was a 350 00:17:58,130 --> 00:18:01,060 constant, my data are random. 351 00:18:01,060 --> 00:18:02,860 I do my data processing. 352 00:18:02,860 --> 00:18:06,050 So I calculate a function of the data, the 353 00:18:06,050 --> 00:18:07,580 data are random variables. 354 00:18:07,580 --> 00:18:11,390 So out here we output something which is a function 355 00:18:11,390 --> 00:18:12,770 of a random variable. 356 00:18:12,770 --> 00:18:15,830 So this quantity here will be also random. 357 00:18:15,830 --> 00:18:18,400 It's affected by the noise and the experiment that I have 358 00:18:18,400 --> 00:18:19,650 been doing. 359 00:18:19,650 --> 00:18:22,330 That's why these estimators will be denoted 360 00:18:22,330 --> 00:18:24,920 by uppercase Thetas. 361 00:18:24,920 --> 00:18:26,740 And we will be using hats. 362 00:18:26,740 --> 00:18:29,030 Hat, usually in estimation, means 363 00:18:29,030 --> 00:18:32,990 an estimate of something. 364 00:18:32,990 --> 00:18:35,380 All right, so this is the big picture. 365 00:18:35,380 --> 00:18:38,690 We're going to start with the Bayesian version. 366 00:18:38,690 --> 00:18:42,830 And then the last few lectures we're going to talk about the 367 00:18:42,830 --> 00:18:45,690 non-Bayesian version or the classical one. 368 00:18:45,690 --> 00:18:48,610 By the way, I should say that statisticians have been 369 00:18:48,610 --> 00:18:52,500 debating fiercely for 100 years whether the right way to 370 00:18:52,500 --> 00:18:56,030 approach statistics is to go the classical way or the 371 00:18:56,030 --> 00:18:57,420 Bayesian way. 372 00:18:57,420 --> 00:19:00,530 And there have been tides going back and forth between 373 00:19:00,530 --> 00:19:02,260 the two sides. 
374 00:19:02,260 --> 00:19:05,330 These days, Bayesian methods tend to become a little more 375 00:19:05,330 --> 00:19:07,320 popular for various reasons. 376 00:19:07,320 --> 00:19:11,730 We're going to come back to this later. 377 00:19:11,730 --> 00:19:14,610 All right, so in Bayesian estimation, what we've got in our 378 00:19:14,610 --> 00:19:16,610 hands is the Bayes rule. 379 00:19:16,610 --> 00:19:19,380 And if you have the Bayes rule, there's not a lot 380 00:19:19,380 --> 00:19:21,340 that's left to do. 381 00:19:21,340 --> 00:19:24,190 We have different forms of the Bayes rule, depending on 382 00:19:24,190 --> 00:19:27,920 whether we're dealing with discrete data and discrete 383 00:19:27,920 --> 00:19:32,310 quantities to estimate, or continuous data, and so on. 384 00:19:32,310 --> 00:19:36,020 In the hypothesis testing problem, the unknown quantity 385 00:19:36,020 --> 00:19:38,210 Theta is discrete. 386 00:19:38,210 --> 00:19:42,890 So in both cases here, we have a P of Theta. 387 00:19:42,890 --> 00:19:45,530 We obtain data, the X's. 388 00:19:45,530 --> 00:19:49,040 And on the basis of the X that we observe, we can calculate 389 00:19:49,040 --> 00:19:53,340 the posterior distribution of Theta, given the data. 390 00:19:53,340 --> 00:19:59,840 So to use Bayesian inference, what do we start with? 391 00:19:59,840 --> 00:20:03,160 We start with some priors. 392 00:20:03,160 --> 00:20:05,910 These are our initial beliefs about what 393 00:20:05,910 --> 00:20:07,890 Theta might be. 394 00:20:07,890 --> 00:20:10,440 That's before we do the experiment. 395 00:20:10,440 --> 00:20:13,840 We have a model of the experimental apparatus. 396 00:20:13,840 --> 00:20:17,520 397 00:20:17,520 --> 00:20:21,550 And the model of the experimental apparatus tells 398 00:20:21,550 --> 00:20:28,040 us if this Theta is true, I'm going to see X's of that kind. 
399 00:20:28,040 --> 00:20:31,480 If that other Theta is true, I'm going to see X's that 400 00:20:31,480 --> 00:20:33,130 are somewhere else. 401 00:20:33,130 --> 00:20:35,200 That models my apparatus. 402 00:20:35,200 --> 00:20:39,150 And based on that knowledge, I have these 403 00:20:39,150 --> 00:20:41,975 two functions in my hands. We have already seen that if you 404 00:20:41,975 --> 00:20:44,760 know those two functions, you can also calculate the 405 00:20:44,760 --> 00:20:46,550 denominator here. 406 00:20:46,550 --> 00:20:50,900 So all of these functions are available, so you can compute, 407 00:20:50,900 --> 00:20:54,170 you can find a formula for this function as well. 408 00:20:54,170 --> 00:20:58,780 And as soon as you observe the data, the X's, you plug in 409 00:20:58,780 --> 00:21:02,220 here the numerical value of those X's. 410 00:21:02,220 --> 00:21:04,720 And you get a function of Theta. 411 00:21:04,720 --> 00:21:07,870 And this is the posterior distribution of Theta, given 412 00:21:07,870 --> 00:21:09,680 the data that you have seen. 413 00:21:09,680 --> 00:21:11,930 So you've already done a fair number of 414 00:21:11,930 --> 00:21:13,760 exercises of this kind. 415 00:21:13,760 --> 00:21:17,320 So we will not say more about this. 416 00:21:17,320 --> 00:21:20,470 And there's a similar formula as you know for the case where 417 00:21:20,470 --> 00:21:22,460 we have continuous data. 418 00:21:22,460 --> 00:21:25,140 If the X's are continuous random variables, then the 419 00:21:25,140 --> 00:21:28,620 formula is the same, except that X's are described by 420 00:21:28,620 --> 00:21:31,630 densities instead of being described by probability 421 00:21:31,630 --> 00:21:32,880 mass functions. 422 00:21:32,880 --> 00:21:35,170 423 00:21:35,170 --> 00:21:40,200 OK, now if Theta is continuous, then we're dealing 424 00:21:40,200 --> 00:21:42,160 with estimation problems. 
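For the discrete case just described (discrete Theta, discrete X), the recipe of prior, model of the apparatus, and denominator can be sketched as below. This is a minimal illustration, not code from the course: the function name `posterior`, the hypothesis labels, and every probability value are made-up assumptions, loosely modeled on the airplane radar setting mentioned earlier.

```python
def posterior(prior, likelihood, x):
    """Bayes rule for discrete Theta and discrete X.

    prior:      dict mapping theta -> P(Theta = theta)
    likelihood: dict mapping theta -> dict mapping x -> P(X = x | Theta = theta)
    x:          the observed value of X
    """
    # Numerator of the Bayes rule for each candidate theta.
    joint = {theta: prior[theta] * likelihood[theta][x] for theta in prior}
    # Denominator: P(X = x), obtained by summing the numerator over theta.
    evidence = sum(joint.values())
    return {theta: joint[theta] / evidence for theta in joint}


# Hypothetical numbers, for illustration only.
prior = {"plane": 0.05, "no_plane": 0.95}
likelihood = {
    "plane": {"blip": 0.99, "quiet": 0.01},
    "no_plane": {"blip": 0.10, "quiet": 0.90},
}
post = posterior(prior, likelihood, "blip")
```

Once the observed value is plugged in, `post` is a function of theta only, which is the posterior distribution the lecture refers to.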
425 00:21:42,160 --> 00:21:44,880 But the story is once more the same. 426 00:21:44,880 --> 00:21:47,920 You're going to use the Bayes rule to come up with the 427 00:21:47,920 --> 00:21:51,090 posterior density of Theta, given the data 428 00:21:51,090 --> 00:21:53,300 that you have observed. 429 00:21:53,300 --> 00:21:57,250 Now just for the sake of the example, let's come back to 430 00:21:57,250 --> 00:21:58,900 this picture here. 431 00:21:58,900 --> 00:22:03,490 Suppose that something is flying in the air, and maybe 432 00:22:03,490 --> 00:22:07,800 this is just an object in the air close to the Earth. 433 00:22:07,800 --> 00:22:10,820 So because of gravity, the trajectory that it's going to 434 00:22:10,820 --> 00:22:15,170 follow is going to be a parabola. 435 00:22:15,170 --> 00:22:18,014 So this is the general equation of a parabola. 436 00:22:18,014 --> 00:22:23,450 Zt is the position of my object at time t. 437 00:22:23,450 --> 00:22:26,310 438 00:22:26,310 --> 00:22:29,500 But I don't know exactly which parabola it is. 439 00:22:29,500 --> 00:22:32,690 So the parameters of the parabola are unknown 440 00:22:32,690 --> 00:22:34,040 quantities. 441 00:22:34,040 --> 00:22:37,710 What I can do is to go and measure the position of my 442 00:22:37,710 --> 00:22:41,880 object at different times. 443 00:22:41,880 --> 00:22:44,575 But unfortunately, my measurements are noisy. 444 00:22:44,575 --> 00:22:47,380 445 00:22:47,380 --> 00:22:51,070 What I want to do is to model the motion of my object. 446 00:22:51,070 --> 00:22:56,260 So I guess in the picture, the axis would be t going this way 447 00:22:56,260 --> 00:22:59,980 and Z going this way. 448 00:22:59,980 --> 00:23:02,470 And on the basis of the data that I get, 449 00:23:02,470 --> 00:23:05,020 these are my X's. 450 00:23:05,020 --> 00:23:07,390 I want to figure out the Thetas. 
451 00:23:07,390 --> 00:23:09,570 That is, I want to figure out the exact 452 00:23:09,570 --> 00:23:11,840 equation of this parabola. 453 00:23:11,840 --> 00:23:14,940 Now if somebody gives you probability distributions for 454 00:23:14,940 --> 00:23:18,490 Theta, these would be your priors. 455 00:23:18,490 --> 00:23:19,840 So this is given. 456 00:23:19,840 --> 00:23:23,200 457 00:23:23,200 --> 00:23:26,200 We need the conditional distribution of the X's given 458 00:23:26,200 --> 00:23:27,360 the Thetas. 459 00:23:27,360 --> 00:23:30,870 Well, we have the conditional distribution of Z, given the 460 00:23:30,870 --> 00:23:32,920 Thetas from this equation. 461 00:23:32,920 --> 00:23:36,040 And then by playing with this equation, you can also find 462 00:23:36,040 --> 00:23:42,460 how X is distributed if Theta takes a particular value. 463 00:23:42,460 --> 00:23:46,420 So you do have all of the densities that you might need. 464 00:23:46,420 --> 00:23:48,790 And you can apply the Bayes rule. 465 00:23:48,790 --> 00:23:53,620 And at the end, your end result would be a formula for 466 00:23:53,620 --> 00:23:57,270 the distribution of Theta, given the X 467 00:23:57,270 --> 00:23:59,130 that you have observed-- 468 00:23:59,130 --> 00:24:03,000 except for one sort of complication, or to make things 469 00:24:03,000 --> 00:24:04,470 more interesting. 470 00:24:04,470 --> 00:24:07,680 Instead of these X's and Theta's being single random 471 00:24:07,680 --> 00:24:11,070 variables that we have here, typically those X's and 472 00:24:11,070 --> 00:24:13,400 Theta's will be multi-dimensional random 473 00:24:13,400 --> 00:24:16,490 variables or will correspond to multiple ones. 474 00:24:16,490 --> 00:24:19,920 So this little Theta here actually stands for a triplet 475 00:24:19,920 --> 00:24:22,880 of Theta0, Theta1, and Theta2.
476 00:24:22,880 --> 00:24:26,820 And that X here stands for the entire sequence of X's 477 00:24:26,820 --> 00:24:28,410 that we have observed. 478 00:24:28,410 --> 00:24:31,060 So in reality, the object that you're going to get at the 479 00:24:31,060 --> 00:24:35,900 end after inference is done is a function that you plug in 480 00:24:35,900 --> 00:24:39,430 the values of the data and you get a function of the 481 00:24:39,430 --> 00:24:43,240 Theta's that tells you the relative likelihoods of 482 00:24:43,240 --> 00:24:46,780 different Theta triplets. 483 00:24:46,780 --> 00:24:49,760 So what I'm saying is that this is no harder than the 484 00:24:49,760 --> 00:24:53,720 problems that you have dealt with so far, except perhaps 485 00:24:53,720 --> 00:24:56,020 for the complication that usually, in interesting 486 00:24:56,020 --> 00:24:57,490 inference problems, 487 00:24:57,490 --> 00:25:01,940 your Theta's and X's are often vectors of random 488 00:25:01,940 --> 00:25:05,490 variables instead of individual random variables. 489 00:25:05,490 --> 00:25:09,630 Now if you are to do estimation in a case where you 490 00:25:09,630 --> 00:25:13,520 have discrete data, again the situation is no different. 491 00:25:13,520 --> 00:25:17,020 We still have a Bayes rule of the same kind, except that 492 00:25:17,020 --> 00:25:19,540 densities get replaced by PMF's. 493 00:25:19,540 --> 00:25:23,680 If X is discrete, you put a P here instead of putting an f. 494 00:25:23,680 --> 00:25:27,990 So an example of an estimation problem with discrete data is 495 00:25:27,990 --> 00:25:29,740 similar to the polling problem. 496 00:25:29,740 --> 00:25:31,600 You have a coin. 497 00:25:31,600 --> 00:25:33,500 It has an unknown parameter Theta. 498 00:25:33,500 --> 00:25:35,230 This is the probability of obtaining heads. 499 00:25:35,230 --> 00:25:37,410 You flip the coin many times. 500 00:25:37,410 --> 00:25:41,560 What can you tell me about the true value of Theta?
501 00:25:41,560 --> 00:25:46,200 A classical statistician, at this point, would say, OK, I'm 502 00:25:46,200 --> 00:25:48,900 going to use an estimator, the most reasonable 503 00:25:48,900 --> 00:25:50,950 one, which is this. 504 00:25:50,950 --> 00:25:54,200 How many heads did I obtain in n trials? 505 00:25:54,200 --> 00:25:56,440 Divide by the total number of trials. 506 00:25:56,440 --> 00:26:00,700 This is my estimate of the bias of my coin. 507 00:26:00,700 --> 00:26:02,860 And then the classical statistician would continue 508 00:26:02,860 --> 00:26:07,610 from here and try to prove some properties and argue that 509 00:26:07,610 --> 00:26:10,030 this estimate is a good one. 510 00:26:10,030 --> 00:26:12,850 For example, we have the weak law of large numbers that 511 00:26:12,850 --> 00:26:15,630 tells us that this particular estimate converges in 512 00:26:15,630 --> 00:26:17,990 probability to the true parameter. 513 00:26:17,990 --> 00:26:21,000 This is a kind of guarantee that's useful to have. 514 00:26:21,000 --> 00:26:23,410 And the classical statistician would pretty much close the 515 00:26:23,410 --> 00:26:24,660 subject in this way. 516 00:26:24,660 --> 00:26:27,340 517 00:26:27,340 --> 00:26:30,160 What would the Bayesian person do differently? 518 00:26:30,160 --> 00:26:35,040 The Bayesian person would start by assuming a prior 519 00:26:35,040 --> 00:26:37,100 distribution of Theta. 520 00:26:37,100 --> 00:26:39,820 Instead of treating Theta as an unknown constant, they 521 00:26:39,820 --> 00:26:44,340 would say that Theta was picked randomly or pretend that 522 00:26:44,340 --> 00:26:47,360 it was picked randomly and assume a 523 00:26:47,360 --> 00:26:49,300 distribution on Theta.
524 00:26:49,300 --> 00:26:54,290 So for example, if you don't know anything more, 525 00:26:54,290 --> 00:26:57,510 you might assume that any value for the bias of the coin 526 00:26:57,510 --> 00:27:01,460 is as likely as any other value of the bias of the coin. 527 00:27:01,460 --> 00:27:04,150 And this would be a probability distribution 528 00:27:04,150 --> 00:27:05,720 that's uniform. 529 00:27:05,720 --> 00:27:09,840 Or if you have a little more faith in the manufacturing 530 00:27:09,840 --> 00:27:13,270 processes that created that coin, you might choose your 531 00:27:13,270 --> 00:27:17,660 prior to be a distribution that's centered around 1/2 and 532 00:27:17,660 --> 00:27:21,860 is fairly narrowly concentrated around 1/2. 533 00:27:21,860 --> 00:27:24,500 That would be a prior distribution in which you say, 534 00:27:24,500 --> 00:27:27,920 well, I believe that the manufacturer tried to make my 535 00:27:27,920 --> 00:27:29,410 coin to be fair. 536 00:27:29,410 --> 00:27:33,070 But they often make some mistakes, so it's going to be, 537 00:27:33,070 --> 00:27:36,600 I believe, approximately 1/2 but not quite. 538 00:27:36,600 --> 00:27:40,050 So depending on your beliefs, you would choose an 539 00:27:40,050 --> 00:27:43,630 appropriate prior for the distribution of Theta. 540 00:27:43,630 --> 00:27:48,610 And then you would use the Bayes rule to find the 541 00:27:48,610 --> 00:27:52,270 probabilities of different values of Theta, based on the 542 00:27:52,270 --> 00:27:53,520 data that you have observed. 543 00:27:53,520 --> 00:27:59,620 544 00:27:59,620 --> 00:28:04,640 So no matter which version of the Bayes rule you use, 545 00:28:04,640 --> 00:28:10,540 the end product of the Bayes rule is going to be either a 546 00:28:10,540 --> 00:28:14,400 plot of this kind or a plot of that kind. 547 00:28:14,400 --> 00:28:16,740 So what am I plotting here? 548 00:28:16,740 --> 00:28:19,810 This axis is the Theta axis.
549 00:28:19,810 --> 00:28:23,830 These are the possible values of the unknown quantity that 550 00:28:23,830 --> 00:28:26,670 we're trying to estimate. 551 00:28:26,670 --> 00:28:28,990 In the continuous case, Theta is a 552 00:28:28,990 --> 00:28:30,800 continuous random variable. 553 00:28:30,800 --> 00:28:32,560 I obtain my data. 554 00:28:32,560 --> 00:28:36,430 And I plot the posterior probability distribution after 555 00:28:36,430 --> 00:28:37,940 observing my data. 556 00:28:37,940 --> 00:28:42,220 And I'm plotting here the probability density for Theta. 557 00:28:42,220 --> 00:28:45,500 So this is a plot of that density. 558 00:28:45,500 --> 00:28:49,210 In the discrete case, Theta can take finitely many values 559 00:28:49,210 --> 00:28:51,570 or a discrete set of values. 560 00:28:51,570 --> 00:28:54,470 And for each one of those values, I'm telling you how 561 00:28:54,470 --> 00:28:58,080 likely that value is to be the correct one, given the 562 00:28:58,080 --> 00:29:01,040 data that I have observed. 563 00:29:01,040 --> 00:29:04,990 And in general, what you would go back to your boss and 564 00:29:04,990 --> 00:29:08,520 report after you've done all your inference work would be 565 00:29:08,520 --> 00:29:10,870 either a plot of this kind or of that kind. 566 00:29:10,870 --> 00:29:14,180 So you go to your boss who asks you, what is 567 00:29:14,180 --> 00:29:15,190 the value of Theta? 568 00:29:15,190 --> 00:29:17,490 And you say, well, I only have limited data. 569 00:29:17,490 --> 00:29:19,420 I don't know what it is. 570 00:29:19,420 --> 00:29:22,920 It could be this, with so much probability. 571 00:29:22,920 --> 00:29:24,640 There's probability. 572 00:29:24,640 --> 00:29:27,220 OK, let's throw in some numbers here. 573 00:29:27,220 --> 00:29:32,250 There's probability 0.3 that Theta is this value.
574 00:29:32,250 --> 00:29:36,100 There's probability 0.2 that Theta is this value, 0.1 that 575 00:29:36,100 --> 00:29:39,420 it's this one, 0.1 that it's this one, 0.2 that it's that 576 00:29:39,420 --> 00:29:40,830 one, and so on. 577 00:29:40,830 --> 00:29:44,890 OK, now bosses often want simple answers. 578 00:29:44,890 --> 00:29:48,480 They say, OK, you're talking too much. 579 00:29:48,480 --> 00:29:51,770 What do you think Theta is? 580 00:29:51,770 --> 00:29:55,920 And now you're forced to make a decision. 581 00:29:55,920 --> 00:30:00,680 If that was the situation and you have to make a decision, 582 00:30:00,680 --> 00:30:02,370 how would you make it? 583 00:30:02,370 --> 00:30:06,880 Well, I'm going to make a decision that's most likely to 584 00:30:06,880 --> 00:30:09,120 be correct. 585 00:30:09,120 --> 00:30:13,060 If I make this decision, what's going to happen? 586 00:30:13,060 --> 00:30:17,670 Theta is this value with probability 0.2, which means 587 00:30:17,670 --> 00:30:21,150 there's probability 0.8 that I make an error 588 00:30:21,150 --> 00:30:23,280 if I make that guess. 589 00:30:23,280 --> 00:30:29,370 If I make that decision, this decision has probability 0.3 of 590 00:30:29,370 --> 00:30:30,750 being the correct one. 591 00:30:30,750 --> 00:30:34,530 So I have probability of error 0.7. 592 00:30:34,530 --> 00:30:38,460 So if you want to just maximize the probability of 593 00:30:38,460 --> 00:30:41,730 giving the correct decision, or if you want to minimize the 594 00:30:41,730 --> 00:30:44,780 probability of making an incorrect decision, what 595 00:30:44,780 --> 00:30:48,790 you're going to choose to report is that value of Theta 596 00:30:48,790 --> 00:30:51,450 for which the probability is highest. 597 00:30:51,450 --> 00:30:54,230 So in this case, I would choose to report this 598 00:30:54,230 --> 00:30:58,210 particular value, the most likely value of Theta, given 599 00:30:58,210 --> 00:31:00,120 what I have observed.
600 00:31:00,120 --> 00:31:04,640 And that value is called the maximum a posteriori 601 00:31:04,640 --> 00:31:07,550 probability estimate. 602 00:31:07,550 --> 00:31:11,550 It's going to be this one in our case. 603 00:31:11,550 --> 00:31:16,830 So picking the point in the posterior PMF that has the 604 00:31:16,830 --> 00:31:19,040 highest probability. 605 00:31:19,040 --> 00:31:20,720 That's the reasonable thing to do. 606 00:31:20,720 --> 00:31:23,850 This is the optimal thing to do if you want to minimize the 607 00:31:23,850 --> 00:31:27,340 probability of an incorrect inference. 608 00:31:27,340 --> 00:31:31,400 And that's what people do usually if they need to report 609 00:31:31,400 --> 00:31:35,280 a single answer, if they need to report a single decision. 610 00:31:35,280 --> 00:31:39,530 How about in the estimation context? 611 00:31:39,530 --> 00:31:43,250 If that's what you know about Theta, Theta could be around 612 00:31:43,250 --> 00:31:46,670 here, but there's also some sharp probability that it is 613 00:31:46,670 --> 00:31:48,720 around here. 614 00:31:48,720 --> 00:31:52,380 What's the single answer that you would give to your boss? 615 00:31:52,380 --> 00:31:56,310 One option is to use the same philosophy and say, OK, I'm 616 00:31:56,310 --> 00:32:00,135 going to find the Theta at which this posterior density 617 00:32:00,135 --> 00:32:01,690 is highest. 618 00:32:01,690 --> 00:32:06,010 So I would pick this point here and report this 619 00:32:06,010 --> 00:32:06,920 particular Theta. 620 00:32:06,920 --> 00:32:11,110 So this would be my Theta, again, Theta MAP, the Theta 621 00:32:11,110 --> 00:32:15,290 that has the highest a posteriori probability, just 622 00:32:15,290 --> 00:32:19,100 because it corresponds to the peak of the density.
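The MAP rule for the discrete case can be sketched in a few lines of Python. This is only an illustration: the labels "a" through "f" are hypothetical stand-ins for the candidate values of Theta on the slide, the first five probabilities are the ones quoted in the lecture, and the last one (0.1) is an assumption made so that the PMF sums to 1.

```python
# Hypothetical posterior PMF of Theta given the observed data.
# Labels "a".."f" are placeholders; the final 0.1 is an assumed value.
posterior = {"a": 0.3, "b": 0.2, "c": 0.1, "d": 0.1, "e": 0.2, "f": 0.1}


def map_estimate(pmf):
    """Return the candidate value with the highest posterior probability."""
    return max(pmf, key=pmf.get)


def error_probability(pmf, guess):
    """Probability that reporting `guess` is wrong under the posterior."""
    return 1.0 - pmf[guess]


theta_map = map_estimate(posterior)
print(theta_map, error_probability(posterior, theta_map))
print("b", error_probability(posterior, "b"))
```

Any other guess has a larger error probability than the MAP choice, which is the sense in which the MAP rule is optimal here.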
623 00:32:19,100 --> 00:32:23,810 But in this context, the maximum a posteriori 624 00:32:23,810 --> 00:32:27,120 probability Theta was the one that was most 625 00:32:27,120 --> 00:32:28,600 likely to be true. 626 00:32:28,600 --> 00:32:32,460 In the continuous case, you cannot really say that this is 627 00:32:32,460 --> 00:32:34,940 the most likely value of Theta. 628 00:32:34,940 --> 00:32:38,340 In a continuous setting, any value of Theta has zero 629 00:32:38,340 --> 00:32:41,530 probability, since we talk about densities. 630 00:32:41,530 --> 00:32:43,260 So it's not the most likely. 631 00:32:43,260 --> 00:32:48,240 It's the one for which the density, so the probability 632 00:32:48,240 --> 00:32:51,820 of that neighborhood, is highest. 633 00:32:51,820 --> 00:32:56,390 So the rationale for picking this particular estimate in 634 00:32:56,390 --> 00:33:00,050 the continuous case is much less compelling than the 635 00:33:00,050 --> 00:33:02,210 rationale that we had in here. 636 00:33:02,210 --> 00:33:05,590 So in this case, reasonable people might choose different 637 00:33:05,590 --> 00:33:07,460 quantities to report. 638 00:33:07,460 --> 00:33:11,810 And the very popular one would be to report instead the 639 00:33:11,810 --> 00:33:13,700 conditional expectation. 640 00:33:13,700 --> 00:33:15,990 So I don't know quite what Theta is. 641 00:33:15,990 --> 00:33:19,600 Given the data that I have, Theta has this distribution. 642 00:33:19,600 --> 00:33:23,320 Let me just report the average over that distribution. 643 00:33:23,320 --> 00:33:27,090 Let me report the center of gravity of this figure. 644 00:33:27,090 --> 00:33:30,340 And in this figure, the center of gravity would probably be 645 00:33:30,340 --> 00:33:32,230 somewhere around here. 646 00:33:32,230 --> 00:33:35,690 And that would be a different estimate that you 647 00:33:35,690 --> 00:33:37,520 might choose to report.
648 00:33:37,520 --> 00:33:40,340 So center of gravity is something around here. 649 00:33:40,340 --> 00:33:43,580 And this is a conditional expectation of Theta, given 650 00:33:43,580 --> 00:33:46,010 the data that you have. 651 00:33:46,010 --> 00:33:51,190 So these are two, in some sense, fairly reasonable ways 652 00:33:51,190 --> 00:33:53,850 of choosing what to report to your boss. 653 00:33:53,850 --> 00:33:55,690 Some people might choose to report this. 654 00:33:55,690 --> 00:33:58,630 Some people might choose to report that. 655 00:33:58,630 --> 00:34:03,230 And a priori, there's no compelling reason why one 656 00:34:03,230 --> 00:34:08,639 would be preferable to the other one, unless you set some rules 657 00:34:08,639 --> 00:34:12,350 for the game and you describe a little more precisely what 658 00:34:12,350 --> 00:34:14,090 your objectives are. 659 00:34:14,090 --> 00:34:19,070 But no matter which one you report, a single answer, a 660 00:34:19,070 --> 00:34:24,350 point estimate, doesn't really tell you the whole story. 661 00:34:24,350 --> 00:34:28,159 There's a lot more information conveyed by this posterior 662 00:34:28,159 --> 00:34:31,060 distribution plot than any single number 663 00:34:31,060 --> 00:34:32,159 that you might report. 664 00:34:32,159 --> 00:34:36,510 So in general, you may wish to convince your boss that it's 665 00:34:36,510 --> 00:34:40,310 worth their time to look at the entire plot, because that 666 00:34:40,310 --> 00:34:43,100 plot sort of covers all the possibilities. 667 00:34:43,100 --> 00:34:47,060 It tells your boss most likely we're in that range, but 668 00:34:47,060 --> 00:34:51,620 there's also a distinct chance that our Theta happens to lie 669 00:34:51,620 --> 00:34:54,080 in that range.
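To see that the two point estimates really can differ, here is a minimal Python sketch using the coin example from earlier rather than the density on the slide. It assumes a uniform prior on the bias Theta, approximates the continuous posterior on a grid, and uses hypothetical data (7 heads in 10 flips). The function name `coin_posterior` and the grid size are illustrative choices, not anything from the lecture.

```python
def coin_posterior(heads, flips, grid_size=1001):
    """Grid approximation of the posterior of Theta, assuming a uniform prior on [0, 1]."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # Bayes rule: posterior is proportional to prior (constant here) times likelihood.
    weights = [t ** heads * (1 - t) ** (flips - heads) for t in thetas]
    total = sum(weights)
    return thetas, [w / total for w in weights]


# Hypothetical data: 7 heads in 10 flips.
thetas, post = coin_posterior(heads=7, flips=10)

# MAP estimate: the grid point where the posterior is highest.
theta_map = thetas[max(range(len(post)), key=post.__getitem__)]

# Conditional expectation: the center of gravity of the posterior.
theta_mean = sum(t * p for t, p in zip(thetas, post))

print(theta_map, theta_mean)
```

With a uniform prior the exact posterior is a Beta distribution with parameters heads + 1 and flips - heads + 1, so the peak sits at heads/flips while the mean is (heads + 1)/(flips + 2); the grid values should land close to those two numbers.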
670 00:34:54,080 --> 00:34:58,400 All right, now let us try to perhaps differentiate between 671 00:34:58,400 --> 00:35:02,570 these two and see under what circumstances this one might 672 00:35:02,570 --> 00:35:05,530 be the better estimate to report. 673 00:35:05,530 --> 00:35:07,320 Better with respect to what? 674 00:35:07,320 --> 00:35:08,830 We need some rules. 675 00:35:08,830 --> 00:35:10,730 So we're going to throw in some rules. 676 00:35:10,730 --> 00:35:14,320 677 00:35:14,320 --> 00:35:17,450 As a warm up, we're going to deal with the problem of 678 00:35:17,450 --> 00:35:22,000 making an estimate if you had no information at all, 679 00:35:22,000 --> 00:35:24,670 except for a prior distribution. 680 00:35:24,670 --> 00:35:27,650 So this is a warm up for what's coming next, which 681 00:35:27,650 --> 00:35:32,970 would be estimation that takes into account some information. 682 00:35:32,970 --> 00:35:34,860 So we have a Theta. 683 00:35:34,860 --> 00:35:38,500 And because of your subjective beliefs or models by others, 684 00:35:38,500 --> 00:35:41,780 you believe that Theta is uniformly distributed between, 685 00:35:41,780 --> 00:35:46,250 let's say, 4 and 10. 686 00:35:46,250 --> 00:35:48,120 You want to come up with a point estimate. 687 00:35:48,120 --> 00:35:51,770 688 00:35:51,770 --> 00:35:54,900 Let's try to look for an estimate. 689 00:35:54,900 --> 00:35:57,580 Call it c, in this case. 690 00:35:57,580 --> 00:36:00,090 I want to pick a number with which to estimate 691 00:36:00,090 --> 00:36:01,340 the value of Theta. 692 00:36:01,340 --> 00:36:04,030 693 00:36:04,030 --> 00:36:08,260 I will be interested in the size of the error that I make. 694 00:36:08,260 --> 00:36:12,310 And I really dislike large errors, so I'm going to focus 695 00:36:12,310 --> 00:36:15,500 on the square of the error that I make. 696 00:36:15,500 --> 00:36:19,140 So I pick c. 697 00:36:19,140 --> 00:36:21,340 Theta has a random value that I don't know.
698 00:36:21,340 --> 00:36:25,900 But whatever it is, once it becomes known, it results in 699 00:36:25,900 --> 00:36:28,640 a squared error between what it is and what I 700 00:36:28,640 --> 00:36:30,660 guessed that it was. 701 00:36:30,660 --> 00:36:35,770 And I'm interested in making a small error on the average, 702 00:36:35,770 --> 00:36:38,170 where the average is taken with respect to all the 703 00:36:38,170 --> 00:36:42,350 possible and unknown values of Theta. 704 00:36:42,350 --> 00:36:47,220 So the problem, this is a least squares formulation of 705 00:36:47,220 --> 00:36:49,240 the problem, where we try to minimize the 706 00:36:49,240 --> 00:36:51,150 mean squared error. 707 00:36:51,150 --> 00:36:53,900 How do you find the optimal c? 708 00:36:53,900 --> 00:36:57,200 Well, we take that expression and expand it. 709 00:36:57,200 --> 00:37:00,930 710 00:37:00,930 --> 00:37:05,650 And it is, using linearity of expectations-- 711 00:37:05,650 --> 00:37:11,460 expected value of Theta squared minus 2c expected Theta plus c squared-- 712 00:37:11,460 --> 00:37:13,620 that's the quantity that we want to minimize, 713 00:37:13,620 --> 00:37:16,670 with respect to c. 714 00:37:16,670 --> 00:37:19,670 To do the minimization, take the derivative with respect to 715 00:37:19,670 --> 00:37:21,950 c and set it to 0. 716 00:37:21,950 --> 00:37:27,320 So that differentiation gives us from here minus 2 expected 717 00:37:27,320 --> 00:37:32,420 value of Theta plus 2c is equal to 0. 718 00:37:32,420 --> 00:37:36,550 And the answer that you get by solving this equation is that 719 00:37:36,550 --> 00:37:39,350 c is the expected value of Theta. 720 00:37:39,350 --> 00:37:42,860 So when you do this optimization, you find that 721 00:37:42,860 --> 00:37:45,170 the optimal estimate, the thing you should be 722 00:37:45,170 --> 00:37:47,970 reporting, is the expected value of Theta.
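Written out, the calculation just described looks as follows. The notation (bold E for expectation, c for the candidate estimate) is an assumed rendering of what is on the slide.

```latex
\begin{align*}
\mathbf{E}\big[(\Theta - c)^2\big]
  &= \mathbf{E}[\Theta^2] - 2c\,\mathbf{E}[\Theta] + c^2, \\
\frac{d}{dc}\,\mathbf{E}\big[(\Theta - c)^2\big]
  &= -2\,\mathbf{E}[\Theta] + 2c = 0
  \quad\Longrightarrow\quad c = \mathbf{E}[\Theta].
\end{align*}
```

The second derivative with respect to c is 2, which is positive, so this stationary point is indeed the minimum.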
723 00:37:47,970 --> 00:37:51,630 So in this particular example, you would choose your estimate 724 00:37:51,630 --> 00:37:55,500 c to be just the middle of these values, 725 00:37:55,500 --> 00:37:57,980 which would be 7. 726 00:37:57,980 --> 00:38:02,642 727 00:38:02,642 --> 00:38:06,640 OK, and in case your boss asks you, how 728 00:38:06,640 --> 00:38:08,610 good is your estimate? 729 00:38:08,610 --> 00:38:11,390 How big is your error going to be? 730 00:38:11,390 --> 00:38:14,910 731 00:38:14,910 --> 00:38:19,870 What you could report is the average size of the estimation 732 00:38:19,870 --> 00:38:22,570 error that you are making. 733 00:38:22,570 --> 00:38:26,760 We picked our estimate to be the expected value of Theta. 734 00:38:26,760 --> 00:38:29,450 So for this particular way that I'm choosing to do my 735 00:38:29,450 --> 00:38:33,610 estimation, this is the mean squared error that I get. 736 00:38:33,610 --> 00:38:35,330 And this is a familiar quantity. 737 00:38:35,330 --> 00:38:38,370 It's just the variance of the distribution. 738 00:38:38,370 --> 00:38:41,890 So the expectation is the best way to estimate a 739 00:38:41,890 --> 00:38:45,550 quantity, if you're interested in the mean squared error. 740 00:38:45,550 --> 00:38:50,430 And the resulting mean squared error is the variance itself. 741 00:38:50,430 --> 00:38:56,380 How will this story change if we now have data as well? 742 00:38:56,380 --> 00:39:01,290 Now having data means that we can compute posterior 743 00:39:01,290 --> 00:39:05,150 distributions or conditional distributions. 744 00:39:05,150 --> 00:39:10,400 So we get transported into a new universe where instead of 745 00:39:10,400 --> 00:39:14,740 working with the original distribution of Theta, the 746 00:39:14,740 --> 00:39:18,860 prior distribution, now we work with the conditional 747 00:39:18,860 --> 00:39:22,280 distribution of Theta, given the data 748 00:39:22,280 --> 00:39:24,860 that we have observed.
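The warm-up example with Theta uniform on [4, 10] can be checked numerically. The sketch below is a rough illustration, not part of the lecture: `mean_squared_error` is a hypothetical helper that approximates the expectation with a midpoint rule, and the candidate estimates are an arbitrary half-unit grid.

```python
def mean_squared_error(c, a=4.0, b=10.0, steps=6000):
    """Approximate E[(Theta - c)^2] for Theta uniform on [a, b] using the midpoint rule."""
    h = (b - a) / steps
    total = sum(((a + (i + 0.5) * h) - c) ** 2 for i in range(steps))
    # Each slice has probability h / (b - a) under the uniform density.
    return total * h / (b - a)


# Candidate estimates 4.0, 4.5, ..., 10.0 (an arbitrary grid for illustration).
candidates = [4 + 0.5 * k for k in range(13)]
best_c = min(candidates, key=mean_squared_error)

print(best_c, mean_squared_error(best_c))
```

The minimizer among the candidates is the mean of the distribution, and the error at that point is the variance, which for a uniform distribution on [4, 10] is (10 - 4)^2 / 12.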
749 00:39:24,860 --> 00:39:30,430 Now remember our old slogan that conditional models and 750 00:39:30,430 --> 00:39:33,570 conditional probabilities are no different than ordinary 751 00:39:33,570 --> 00:39:38,880 probabilities, except that we live now in a new universe 752 00:39:38,880 --> 00:39:42,690 where the new information has been taken into account. 753 00:39:42,690 --> 00:39:47,860 So if you use that philosophy and you're asked to minimize 754 00:39:47,860 --> 00:39:53,310 the squared error but now you live in a new universe 755 00:39:53,310 --> 00:39:56,910 where X has been fixed to something, what would the 756 00:39:56,910 --> 00:39:59,210 optimal solution be? 757 00:39:59,210 --> 00:40:03,540 It would again be the expectation of Theta, but 758 00:40:03,540 --> 00:40:04,730 which expectation? 759 00:40:04,730 --> 00:40:08,910 It's the expectation which applies in the new conditional 760 00:40:08,910 --> 00:40:12,350 universe in which we live right now. 761 00:40:12,350 --> 00:40:16,330 So because of what we did before, by the same 762 00:40:16,330 --> 00:40:20,330 calculation, we would find that the optimal estimate is 763 00:40:20,330 --> 00:40:24,970 the expected value of Theta given X, the optimal 764 00:40:24,970 --> 00:40:26,730 estimate that takes into account the 765 00:40:26,730 --> 00:40:29,170 information that we have. 766 00:40:29,170 --> 00:40:33,600 So the conclusion, once you get your data, if you want to 767 00:40:33,600 --> 00:40:40,480 minimize the mean squared error, you should just report 768 00:40:40,480 --> 00:40:43,870 the conditional expectation of this unknown quantity based on 769 00:40:43,870 --> 00:40:46,640 the data that you have. 770 00:40:46,640 --> 00:40:53,050 So the picture here is that Theta is unknown. 771 00:40:53,050 --> 00:41:00,710 You have your apparatus that creates measurements. 772 00:41:00,710 --> 00:41:07,880 So this creates an X.
You take an X, and here you have a box 773 00:41:07,880 --> 00:41:10,203 that does calculations. 774 00:41:10,203 --> 00:41:13,490 775 00:41:13,490 --> 00:41:18,180 It does calculations and it spits out the conditional 776 00:41:18,180 --> 00:41:22,230 expectation of Theta, given the particular data that you 777 00:41:22,230 --> 00:41:24,750 have observed. 778 00:41:24,750 --> 00:41:28,680 And what we have done in this class so far is, to some 779 00:41:28,680 --> 00:41:33,450 extent, developing the computational tools and skills 780 00:41:33,450 --> 00:41:36,020 to do this particular calculation-- 781 00:41:36,020 --> 00:41:39,780 how to calculate the posterior density for Theta and how to 782 00:41:39,780 --> 00:41:42,750 calculate expectations, conditional expectations. 783 00:41:42,750 --> 00:41:45,330 So in principle, we know how to do this. 784 00:41:45,330 --> 00:41:50,040 In principle, we can program a computer to take the data and 785 00:41:50,040 --> 00:41:51,670 to spit out conditional expectations. 786 00:41:51,670 --> 00:41:56,140 787 00:41:56,140 --> 00:42:04,390 Somebody who doesn't think like us might instead design a 788 00:42:04,390 --> 00:42:09,940 calculating machine that does something different and 789 00:42:09,940 --> 00:42:16,490 produces some other estimate. 790 00:42:16,490 --> 00:42:20,000 So we went through this argument and we decided to 791 00:42:20,000 --> 00:42:23,110 program our computer to calculate conditional 792 00:42:23,110 --> 00:42:24,490 expectations. 793 00:42:24,490 --> 00:42:28,460 Somebody else came up with some other crazy idea for how 794 00:42:28,460 --> 00:42:30,590 to estimate the random variable. 795 00:42:30,590 --> 00:42:34,460 They came up with some function g and they programmed 796 00:42:34,460 --> 00:42:38,700 it, and they designed a machine that estimates Theta's 797 00:42:38,700 --> 00:42:43,000 by outputting a certain g of X. 798 00:42:43,000 --> 00:42:47,690 That could be an alternative estimator.
799 00:42:47,690 --> 00:42:50,280 Which one is better? 800 00:42:50,280 --> 00:42:56,350 Well, we convinced ourselves that this is the optimal one 801 00:42:56,350 --> 00:42:59,780 in a universe where we have fixed the particular 802 00:42:59,780 --> 00:43:01,420 value of the data. 803 00:43:01,420 --> 00:43:06,030 So what we have proved so far is a relation of this kind. 804 00:43:06,030 --> 00:43:09,670 In this conditional universe, the mean squared 805 00:43:09,670 --> 00:43:11,920 error that I get-- 806 00:43:11,920 --> 00:43:15,170 I'm the one who's using this estimator-- 807 00:43:15,170 --> 00:43:18,850 is less than or equal to the mean squared error that this 808 00:43:18,850 --> 00:43:23,960 person will get, the person who uses that estimator. 809 00:43:23,960 --> 00:43:28,040 For any particular value of the data, I'm going to do 810 00:43:28,040 --> 00:43:30,190 at least as well as the other person. 811 00:43:30,190 --> 00:43:32,760 Now the data themselves are random. 812 00:43:32,760 --> 00:43:38,050 If I average over all possible values of the data, I should 813 00:43:38,050 --> 00:43:40,240 still be better off. 814 00:43:40,240 --> 00:43:45,120 If I'm better off for any possible value X, then I 815 00:43:45,120 --> 00:43:49,140 should be better off on the average over all possible 816 00:43:49,140 --> 00:43:50,640 values of X. 817 00:43:50,640 --> 00:43:55,670 So let us average both sides of this inequality with respect 818 00:43:55,670 --> 00:43:58,990 to the probability distribution of X. If you want 819 00:43:58,990 --> 00:44:03,350 to do it formally, you can write this inequality between 820 00:44:03,350 --> 00:44:06,520 numbers as an inequality between random variables. 821 00:44:06,520 --> 00:44:10,240 And it tells that no matter what that random variable 822 00:44:10,240 --> 00:44:14,010 turns out to be, this quantity is better than that quantity.
823 00:44:14,010 --> 00:44:17,270 Take expectations of both sides, and you get this 824 00:44:17,270 --> 00:44:21,360 inequality between expectations overall. 825 00:44:21,360 --> 00:44:29,130 And this last inequality tells me that the person who's using 826 00:44:29,130 --> 00:44:34,430 this estimator who produces estimates according to this 827 00:44:34,430 --> 00:44:45,090 machine will have a mean squared estimation error 828 00:44:45,090 --> 00:44:48,580 that's less than or equal to the estimation error that's 829 00:44:48,580 --> 00:44:51,290 produced by the other person. 830 00:44:51,290 --> 00:44:54,710 In a few words, the conditional expectation 831 00:44:54,710 --> 00:44:58,500 estimator is the optimal estimator. 832 00:44:58,500 --> 00:45:01,765 It's the ultimate estimating machine. 833 00:45:01,765 --> 00:45:04,430 834 00:45:04,430 --> 00:45:08,720 That's how you should solve estimation problems and report 835 00:45:08,720 --> 00:45:10,240 a single value. 836 00:45:10,240 --> 00:45:14,510 If you're forced to report a single value and if you're 837 00:45:14,510 --> 00:45:18,060 interested in estimation errors. 838 00:45:18,060 --> 00:45:24,620 OK, while we could have told you that story, of course, a 839 00:45:24,620 --> 00:45:29,500 month or two ago, this is really about interpretation -- 840 00:45:29,500 --> 00:45:32,550 about realizing that conditional expectations have 841 00:45:32,550 --> 00:45:35,160 a very nice property. 842 00:45:35,160 --> 00:45:38,180 But other than that, any probabilistic skills that come 843 00:45:38,180 --> 00:45:41,180 into this business are just the probabilistic skills of 844 00:45:41,180 --> 00:45:44,330 being able to calculate conditional expectations, 845 00:45:44,330 --> 00:45:46,750 which you already know how to do. 846 00:45:46,750 --> 00:45:51,380 So conclusion, all of optimal Bayesian estimation just means 847 00:45:51,380 --> 00:45:54,655 calculating and reporting conditional expectations. 
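The averaged inequality can be checked on a small model. The sketch below is an illustration under assumptions that are not in the lecture: Theta is taken to be uniform over a finite grid of values in [0, 1], X is the number of heads in n coin flips with bias Theta, and the competing estimator g(X) is taken to be X/n. The function name `mse_of_estimators` is hypothetical. All sums are exact, so no simulation is involved.

```python
from math import comb


def mse_of_estimators(n=10, grid_size=101):
    """Exact mean squared errors of E[Theta | X] and of g(X) = X / n.

    Assumes Theta is uniform over a grid in [0, 1] and X ~ Binomial(n, Theta).
    """
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = 1.0 / grid_size
    mse_cond = 0.0  # error of the conditional expectation estimator
    mse_g = 0.0     # error of the alternative estimator g(X) = X / n
    for x in range(n + 1):
        # Joint probabilities P(Theta = t, X = x) for every grid point t.
        joint = [prior * comb(n, x) * t ** x * (1 - t) ** (n - x) for t in thetas]
        p_x = sum(joint)
        cond_mean = sum(t * p for t, p in zip(thetas, joint)) / p_x
        g_x = x / n
        mse_cond += sum(p * (t - cond_mean) ** 2 for t, p in zip(thetas, joint))
        mse_g += sum(p * (t - g_x) ** 2 for t, p in zip(thetas, joint))
    return mse_cond, mse_g


mse_cond, mse_g = mse_of_estimators()
print(mse_cond, mse_g)
```

The first number should never exceed the second, whatever function is substituted for g, because the comparison already holds term by term for each value of x before the averaging.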
848 00:45:54,655 --> 00:45:58,380 Well, if the world were that simple, then statisticians 849 00:45:58,380 --> 00:46:02,670 wouldn't be able to find jobs. 850 00:46:02,670 --> 00:46:05,690 So real life is not that simple. 851 00:46:05,690 --> 00:46:07,540 There are complications. 852 00:46:07,540 --> 00:46:10,050 And that perhaps makes their life a little more 853 00:46:10,050 --> 00:46:11,300 interesting. 854 00:46:11,300 --> 00:46:22,010 855 00:46:22,010 --> 00:46:25,500 OK, one complication is that we would deal with vectors 856 00:46:25,500 --> 00:46:28,580 instead of just single random variables. 857 00:46:28,580 --> 00:46:31,830 I use the notation here as if X was a 858 00:46:31,830 --> 00:46:33,500 single random variable. 859 00:46:33,500 --> 00:46:37,710 In real life, you get several data. 860 00:46:37,710 --> 00:46:39,520 Does our story change? 861 00:46:39,520 --> 00:46:41,950 Not really, same argument-- 862 00:46:41,950 --> 00:46:44,410 given all the data that you have observed, you should 863 00:46:44,410 --> 00:46:47,660 still report the conditional expectation of Theta. 864 00:46:47,660 --> 00:46:51,260 But what kind of work does it take in order to report this 865 00:46:51,260 --> 00:46:53,080 conditional expectation? 866 00:46:53,080 --> 00:46:57,030 One issue is that you need to cook up a plausible prior 867 00:46:57,030 --> 00:46:58,810 distribution for Theta. 868 00:46:58,810 --> 00:46:59,960 How do you do that? 869 00:46:59,960 --> 00:47:03,570 In a given application, this is a bit of a judgment call, 870 00:47:03,570 --> 00:47:05,970 what prior you would be working with. 871 00:47:05,970 --> 00:47:08,840 And there's a certain skill there of not 872 00:47:08,840 --> 00:47:12,100 making silly choices.
873 00:47:12,100 --> 00:47:16,690 A more pragmatic, practical issue is that this is a 874 00:47:16,690 --> 00:47:21,180 formula that's extremely nice and compact and simple that 875 00:47:21,180 --> 00:47:24,560 you can write with minimal ink. 876 00:47:24,560 --> 00:47:29,180 But behind it there could be hidden a huge amount of 877 00:47:29,180 --> 00:47:31,520 calculation. 878 00:47:31,520 --> 00:47:34,820 So doing any sort of calculations that involve 879 00:47:34,820 --> 00:47:39,640 multiple random variables really involves calculating 880 00:47:39,640 --> 00:47:42,240 multi-dimensional integrals. 881 00:47:42,240 --> 00:47:46,230 And multi-dimensional integrals are hard to compute. 882 00:47:46,230 --> 00:47:50,830 So actually implementing this calculating machine here may 883 00:47:50,830 --> 00:47:54,340 not be easy, might be complicated computationally. 884 00:47:54,340 --> 00:47:58,250 It's also complicated in terms of not being able to derive 885 00:47:58,250 --> 00:47:59,890 intuition about it. 886 00:47:59,890 --> 00:48:03,680 So perhaps you might want to have a simpler version, a 887 00:48:03,680 --> 00:48:07,940 simpler alternative to this formula that's easier to work 888 00:48:07,940 --> 00:48:10,950 with and easier to calculate. 889 00:48:10,950 --> 00:48:13,440 We will be talking about one such simpler 890 00:48:13,440 --> 00:48:15,540 alternative next time. 891 00:48:15,540 --> 00:48:18,570 So again, to conclude, at the high level, Bayesian 892 00:48:18,570 --> 00:48:22,330 estimation is very, very simple, given that you have 893 00:48:22,330 --> 00:48:24,180 mastered everything that has happened in 894 00:48:24,180 --> 00:48:26,370 this course so far.
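To make the point about hidden calculation concrete, here is a minimal sketch of the conditional expectation of Theta given several observations, computed by brute-force numerical integration. Everything specific in it is an assumption made for illustration and does not come from the lecture: Theta is given a uniform prior on [0, 1], the observations are independent coin flips with bias Theta, and the names `posterior_mean` and `uniform_prior` are hypothetical. With a scalar Theta the integral is one-dimensional and cheap; with a vector of unknowns the same grid approach would need a grid in every dimension, which is where the cost explodes.

```python
def posterior_mean(observations, prior_density, grid_size=20_000):
    """Approximate E[Theta | X_1, ..., X_n] by a midpoint Riemann sum on [0, 1].

    Assumes each observation is 0 or 1 and, given Theta = theta, the
    observations are independent with P(X_i = 1) = theta.
    """
    step = 1.0 / grid_size
    numerator = 0.0    # approximates the integral of theta * prior * likelihood
    denominator = 0.0  # approximates the integral of prior * likelihood
    for i in range(grid_size):
        theta = (i + 0.5) * step
        likelihood = 1.0
        for x in observations:
            likelihood *= theta if x == 1 else 1.0 - theta
        weight = prior_density(theta) * likelihood
        numerator += theta * weight
        denominator += weight
    return numerator / denominator


def uniform_prior(theta):
    """Hypothetical prior: Theta uniform on [0, 1]."""
    return 1.0


data = [1, 0, 1, 1, 0, 1, 1]  # made-up observations: 5 ones out of 7
estimate = posterior_mean(data, uniform_prior)
print(estimate)
```

For this particular prior and model the answer is known in closed form, (k + 1) / (n + 2) for k ones in n flips, so the numerical result can be compared against 6/9. Swapping in a different `prior_density` shows how much the reported estimate depends on the judgment call about the prior.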
895 00:48:26,370 --> 00:48:29,860 There are certain practical issues and it's also good to 896 00:48:29,860 --> 00:48:33,590 be familiar with the concepts and the issues. In 897 00:48:33,590 --> 00:48:36,620 general, you would prefer to report the complete posterior 898 00:48:36,620 --> 00:48:37,360 distribution. 899 00:48:37,360 --> 00:48:40,890 But if you're forced to report a point estimate, then there's 900 00:48:40,890 --> 00:48:43,130 a number of reasonable ways to do it. 901 00:48:43,130 --> 00:48:45,690 And perhaps the most reasonable one is to just 902 00:48:45,690 --> 00:48:48,220 report the conditional expectation itself. 903 00:48:48,220 --> 00:48:49,470