1 00:00:00,000 --> 00:00:05,115 2 00:00:05,115 --> 00:00:06,210 PROFESSOR: Hi. 3 00:00:06,210 --> 00:00:08,920 Today I'd like to introduce a new module. 4 00:00:08,920 --> 00:00:11,270 In previous modules we've talked about how to model 5 00:00:11,270 --> 00:00:14,870 particular systems, how to make predictions about certain 6 00:00:14,870 --> 00:00:18,010 kinds and classifications of systems, and also how to 7 00:00:18,010 --> 00:00:21,070 design both theoretical and physical systems. 8 00:00:21,070 --> 00:00:24,040 If our systems are operating in a deterministic universe, 9 00:00:24,040 --> 00:00:25,200 then we're all set. 10 00:00:25,200 --> 00:00:27,210 We've accounted for all the things we could possibly 11 00:00:27,210 --> 00:00:28,150 account for. 12 00:00:28,150 --> 00:00:30,110 But if we're making systems that are going to operate in 13 00:00:30,110 --> 00:00:34,540 the real world, then we need to be able to deal with some 14 00:00:34,540 --> 00:00:36,600 level of uncertainty. 15 00:00:36,600 --> 00:00:38,430 Today I'm going to talk about probability, which is the 16 00:00:38,430 --> 00:00:40,320 method by which we're going to model 17 00:00:40,320 --> 00:00:41,930 uncertainty in our world. 18 00:00:41,930 --> 00:00:43,860 And later we're going to talk about different strategies we 19 00:00:43,860 --> 00:00:47,260 can take to deal with that uncertainty to hopefully 20 00:00:47,260 --> 00:00:50,160 increase the level of autonomy of our systems as they operate 21 00:00:50,160 --> 00:00:51,410 in the real world. 22 00:00:51,410 --> 00:00:56,000 23 00:00:56,000 --> 00:00:58,960 The first thing I need to talk about is how to properly model 24 00:00:58,960 --> 00:01:01,800 probability, or how to properly talk about 25 00:01:01,800 --> 00:01:05,239 probability such that we can use it to talk about 26 00:01:05,239 --> 00:01:07,490 uncertainty. 
27 00:01:07,490 --> 00:01:10,410 When you're talking about probability, typically you'll 28 00:01:10,410 --> 00:01:13,180 end up talking about the sample space or U. This refers 29 00:01:13,180 --> 00:01:18,020 to your entire universe, where your universe is the values 30 00:01:18,020 --> 00:01:21,670 that you care about, or the possible assigned values of 31 00:01:21,670 --> 00:01:23,450 the variables that you care about. 32 00:01:23,450 --> 00:01:27,700 33 00:01:27,700 --> 00:01:30,070 I'm already talking about variables, but what I really 34 00:01:30,070 --> 00:01:31,320 mean to say is events. 35 00:01:31,320 --> 00:01:34,280 36 00:01:34,280 --> 00:01:36,550 There are different states that your universe can take, 37 00:01:36,550 --> 00:01:40,740 and those states can be sort of exhaustively enumerated or 38 00:01:40,740 --> 00:01:45,720 be atomic, or they can be factored 39 00:01:45,720 --> 00:01:46,970 into different variables. 40 00:01:46,970 --> 00:01:49,210 41 00:01:49,210 --> 00:01:50,460 Let me elaborate on this a little. 42 00:01:50,460 --> 00:01:54,920 43 00:01:54,920 --> 00:01:59,140 If I were talking about four coin flips, if those events 44 00:01:59,140 --> 00:02:02,850 were specified atomically then I would have to exhaustively 45 00:02:02,850 --> 00:02:07,480 specify all possible sequences of four coin flips. 46 00:02:07,480 --> 00:02:13,450 4 heads, 3 heads and 1 tail, 2 heads and 2 tails, that sort 47 00:02:13,450 --> 00:02:15,080 of exercise. 48 00:02:15,080 --> 00:02:26,810 Or flipping 4 heads would be one event, flipping 3 heads 49 00:02:26,810 --> 00:02:31,160 and then 1 tail would be a separate event, flipping 2 50 00:02:31,160 --> 00:02:35,760 heads 1 tail and another head would be a 51 00:02:35,760 --> 00:02:39,120 third, distinct event. 
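As a rough illustration of the atomic view described above, the following Python sketch enumerates every sequence of four coin flips as its own event. The fair-coin probabilities and the names `atomic_events` and `p_atomic` are assumptions made for this example; they are not part of the lecture.

```python
from itertools import product

# Each atomic event is one complete, ordered sequence of four flips,
# for example ('H', 'H', 'T', 'H').
atomic_events = list(product("HT", repeat=4))

# Assumption: a fair coin, so every one of the sequences is equally likely.
p_atomic = {event: 1 / len(atomic_events) for event in atomic_events}

# ('H', 'H', 'H', 'T') and ('H', 'H', 'T', 'H') are distinct atomic events,
# even though both contain three heads and one tail.
print(len(atomic_events))
print(p_atomic[("H", "H", "T", "H")])
```

The point of the sketch is only that an atomic specification has to list all 2^4 sequences explicitly.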
52 00:02:39,120 --> 00:02:42,270 If we're talking about a factored state space, then 53 00:02:42,270 --> 00:02:44,530 we'll have a different variable representing each one 54 00:02:44,530 --> 00:02:47,110 of these coin flips. 55 00:02:47,110 --> 00:02:52,940 And each one of those variables will say, here is 56 00:02:52,940 --> 00:02:56,230 the value associated with that particular sub-event. 57 00:02:56,230 --> 00:02:58,240 And the accumulation of all those values, or the 58 00:02:58,240 --> 00:03:01,450 specification for all those values, will end up referring 59 00:03:01,450 --> 00:03:07,460 to the same states that were addressed 60 00:03:07,460 --> 00:03:08,710 by the atomic events. 61 00:03:08,710 --> 00:03:12,430 62 00:03:12,430 --> 00:03:15,040 Let me talk about variables, because they're the thing that 63 00:03:15,040 --> 00:03:17,780 allows us to not exhaustively enumerate every possible event 64 00:03:17,780 --> 00:03:24,450 in the universe, and talk about our universe in 65 00:03:24,450 --> 00:03:29,290 meaningful ways, or in ways that they can be effectively 66 00:03:29,290 --> 00:03:30,540 communicated. 67 00:03:30,540 --> 00:03:32,390 68 00:03:32,390 --> 00:03:34,320 And why am I talking about random variables? 69 00:03:34,320 --> 00:03:35,710 Why aren't they just variables? 70 00:03:35,710 --> 00:03:37,985 Well, random variables are the way that you specify the fact 71 00:03:37,985 --> 00:03:40,920 that you're talking about probabilistic variables. 72 00:03:40,920 --> 00:03:43,940 When you're just dealing with regular algebra, variables 73 00:03:43,940 --> 00:03:50,230 have some sort of assigned value, and you're not forced 74 00:03:50,230 --> 00:03:51,720 to be in the space of 0 to 1. 75 00:03:51,720 --> 00:03:54,320 When you're talking about from 0 to 1, when you're talking 76 00:03:54,320 --> 00:03:58,260 about probabilities, you're forced to be in the spaces of 77 00:03:58,260 --> 00:04:01,860 0 to 1 and forced to remain within the reals. 
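The factored description of the four coin flips given earlier in this segment can be sketched as follows. The variable names `flip1` through `flip4` and the helper `to_atomic` are hypothetical and chosen only for illustration; the sketch checks that assigning a value to every variable picks out the same states as the atomic enumeration.

```python
from itertools import product

# Hypothetical names: one random variable per coin flip.
variables = ["flip1", "flip2", "flip3", "flip4"]
values = ["H", "T"]

# A factored state assigns one value to each variable.
factored_states = [
    dict(zip(variables, combo))
    for combo in product(values, repeat=len(variables))
]

def to_atomic(state):
    """Collapse a factored assignment into the equivalent atomic sequence."""
    return tuple(state[name] for name in variables)

# Specifying all four variables refers to the same states as the atomic events.
atomic_from_factored = {to_atomic(state) for state in factored_states}
print(factored_states[0])
print(len(atomic_from_factored))
```

With the factored form, a question such as "what is the value of the third flip?" can be asked without naming an entire sequence.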
78 00:04:01,860 --> 00:04:05,250 79 00:04:05,250 --> 00:04:08,900 If you want to talk about all the possible assigned values 80 00:04:08,900 --> 00:04:11,520 of that random variable, then you're talking about the 81 00:04:11,520 --> 00:04:15,950 distribution over that random variable. 82 00:04:15,950 --> 00:04:22,590 This means that A could be anything that A could be. 83 00:04:22,590 --> 00:04:25,070 You're talking about the function that says give me a 84 00:04:25,070 --> 00:04:27,780 value of A, and I will tell you a probability associated 85 00:04:27,780 --> 00:04:29,990 with that value of A. 86 00:04:29,990 --> 00:04:32,040 If you're talking about the probability of A being 87 00:04:32,040 --> 00:04:35,370 assigned to a particular value, then you'll return out 88 00:04:35,370 --> 00:04:37,040 the probability. 89 00:04:37,040 --> 00:04:46,380 If A represents the color of the shirt I'm wearing, and I 90 00:04:46,380 --> 00:04:48,990 want to know the probability that A is some value, then I 91 00:04:48,990 --> 00:04:50,750 look at the color of the shirt I'm wearing and try to 92 00:04:50,750 --> 00:04:52,300 determine whether or not it's 1 or 0. 93 00:04:52,300 --> 00:04:54,970 Or possibly look at the colors of shirts that I've worn over 94 00:04:54,970 --> 00:05:03,330 the past year and make some sort of ratio of the number of 95 00:05:03,330 --> 00:05:05,320 pink shirts I've worn over the past year. 96 00:05:05,320 --> 00:05:12,100 97 00:05:12,100 --> 00:05:14,660 If I'm dealing with a factored state space, I'm going to end 98 00:05:14,660 --> 00:05:18,410 up talking about more than one random variable. 99 00:05:18,410 --> 00:05:20,440 There are two main ways to talk about more than one 100 00:05:20,440 --> 00:05:24,260 random variable at the same time. 101 00:05:24,260 --> 00:05:27,690 One is joint probabilities, where all the random variables 102 00:05:27,690 --> 00:05:30,650 are collectively specified at the same time. 
103 00:05:30,650 --> 00:05:33,210 And the other is conditional probabilities, where you've 104 00:05:33,210 --> 00:05:37,590 already decided that you specified some value for one 105 00:05:37,590 --> 00:05:38,670 or more random variables. 106 00:05:38,670 --> 00:05:41,520 And then within that scope you're going to talk about the 107 00:05:41,520 --> 00:05:43,635 probabilities associated with other random variables. 108 00:05:43,635 --> 00:05:48,110 109 00:05:48,110 --> 00:05:50,470 I want to demonstrate this graphically, but there are two 110 00:05:50,470 --> 00:05:52,770 more things that I need to mention right now. 111 00:05:52,770 --> 00:05:55,470 112 00:05:55,470 --> 00:05:58,080 One is the difference between frequentist and Bayesian 113 00:05:58,080 --> 00:05:59,330 interpretations of probability. 114 00:05:59,330 --> 00:06:02,760 115 00:06:02,760 --> 00:06:05,800 Right now they don't seem particularly relevant, but 116 00:06:05,800 --> 00:06:08,260 people will use these words, and it's good to know, 117 00:06:08,260 --> 00:06:11,480 approximately, what they're talking about. 118 00:06:11,480 --> 00:06:16,760 The frequentist interpretation of probability is more 119 00:06:16,760 --> 00:06:21,280 relevant when you're talking about actions that happen a 120 00:06:21,280 --> 00:06:22,580 lot of different times. 121 00:06:22,580 --> 00:06:28,930 For instance, how frequently it rains. 122 00:06:28,930 --> 00:06:31,760 If I say that today is a Wednesday, and I want to know 123 00:06:31,760 --> 00:06:36,170 the probability that it rains on Wednesdays, then that 124 00:06:36,170 --> 00:06:38,810 probability is open to frequentist interpretation 125 00:06:38,810 --> 00:06:41,720 because there are a lot of Wednesdays, and it's rained a 126 00:06:41,720 --> 00:06:44,580 lot in the universe of Wednesdays or the space of 127 00:06:44,580 --> 00:06:47,390 possible Wednesdays. 
128 00:06:47,390 --> 00:06:49,750 So thinking about the fact that there's a 70% chance of 129 00:06:49,750 --> 00:06:52,230 rain, or a 30% chance of rain, or whatever probability of 130 00:06:52,230 --> 00:06:56,220 rain there is on Wednesdays, makes sense, or is open to 131 00:06:56,220 --> 00:06:57,470 frequentist interpretation. 132 00:06:57,470 --> 00:07:02,040 133 00:07:02,040 --> 00:07:04,410 The other interpretation that you'll hear talked about with 134 00:07:04,410 --> 00:07:07,120 respect to probabilities is the Bayesian interpretation. 135 00:07:07,120 --> 00:07:10,390 Bayesian interpretation tends to be more relevant when 136 00:07:10,390 --> 00:07:12,960 you're talking about spaces that are more atomic as 137 00:07:12,960 --> 00:07:17,540 opposed to factored, or represent events that are 138 00:07:17,540 --> 00:07:19,910 specified to the point that it does not make sense to talk 139 00:07:19,910 --> 00:07:25,540 about them in the frequentist sense. 140 00:07:25,540 --> 00:07:27,320 When we talk about probabilities in the Bayesian 141 00:07:27,320 --> 00:07:32,100 interpretation, we frequently use the term likelihood. 142 00:07:32,100 --> 00:07:34,610 For instance, if I'm talking about whether or not it's 143 00:07:34,610 --> 00:07:40,980 likely to rain on August 24, 2011 in the afternoon, the 144 00:07:40,980 --> 00:07:45,580 specificity of that event is so high that at that point it 145 00:07:45,580 --> 00:07:48,440 doesn't make sense to talk about the frequency of 146 00:07:48,440 --> 00:07:52,610 Wednesday, August 24, 2011 in the afternoons, 147 00:07:52,610 --> 00:07:55,280 at least for now. 148 00:07:55,280 --> 00:07:57,600 At that point we're talking more about likelihood and less 149 00:07:57,600 --> 00:08:00,280 about frequency. 150 00:08:00,280 --> 00:08:04,185 That event is more conducive to Bayesian interpretation. 
151 00:08:04,185 --> 00:08:09,540 152 00:08:09,540 --> 00:08:11,350 No whirlwind tour of probability would be complete 153 00:08:11,350 --> 00:08:12,950 without a mention of the three axioms of probability. 154 00:08:12,950 --> 00:08:16,200 155 00:08:16,200 --> 00:08:19,610 The first axiom is that the likelihood of the universe 156 00:08:19,610 --> 00:08:23,870 happening is one, or all random variables are going to 157 00:08:23,870 --> 00:08:27,580 be specified at some point. 158 00:08:27,580 --> 00:08:33,320 Relatedly, the likelihood of nothing happening is 0. 159 00:08:33,320 --> 00:08:38,640 What these two really do is establish the boundaries of a 160 00:08:38,640 --> 00:08:40,090 graphical representation up here. 161 00:08:40,090 --> 00:08:42,830 162 00:08:42,830 --> 00:08:45,680 The other axiom of probability is that if you're going to be 163 00:08:45,680 --> 00:08:49,100 talking about the union of two events, or the probability 164 00:08:49,100 --> 00:08:52,150 associated with one or the other event having an 165 00:08:52,150 --> 00:08:58,450 assignment, then you're talking about the probability 166 00:08:58,450 --> 00:09:04,140 of one of those variables and the probability of the other 167 00:09:04,140 --> 00:09:08,880 variable, and then removing the section that you've 168 00:09:08,880 --> 00:09:10,560 double-counted. 169 00:09:10,560 --> 00:09:12,940 So if I were to attempt to demonstrate this on this 170 00:09:12,940 --> 00:09:17,400 graph, I would be talking about the probability of A, 171 00:09:17,400 --> 00:09:20,700 added to the probability of B. And at this point I've 172 00:09:20,700 --> 00:09:22,250 double-counted this section in the middle where 173 00:09:22,250 --> 00:09:23,940 they're both one. 174 00:09:23,940 --> 00:09:25,820 So I would subtract this exactly once. 175 00:09:25,820 --> 00:09:27,840 And then I'm talking about the size of this space. 
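The three axioms above can be checked on a small finite universe. This is a minimal sketch: the twelve equally likely outcomes and the particular sets `A` and `B` are made up for illustration, standing in for the regions drawn on the board.

```python
from fractions import Fraction

# Hypothetical universe of 12 equally likely outcomes.
U = set(range(12))
A = {0, 1, 2, 3, 4, 5}  # outcomes where A is true
B = {4, 5, 6, 7}        # outcomes where B is true

def prob(event):
    """Size of the event relative to the size of the universe."""
    return Fraction(len(event), len(U))

# The universe has probability 1 and the empty event has probability 0.
print(prob(U), prob(set()))

# Union: add P(A) and P(B), then subtract the overlap exactly once.
p_union = prob(A) + prob(B) - prob(A & B)
print(p_union)
```

Counting the outcomes in `A | B` directly gives the same value as `p_union`, which is the double-counting correction the lecture describes.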
176 00:09:27,840 --> 00:09:34,640 177 00:09:34,640 --> 00:09:36,420 If I'm talking about the probability of A being equal 178 00:09:36,420 --> 00:09:42,910 to 1, then I'm talking about the space in which A is 179 00:09:42,910 --> 00:09:47,580 specified to be equal to 1, divided by the area associated 180 00:09:47,580 --> 00:09:48,830 with my universe. 181 00:09:48,830 --> 00:10:02,540 182 00:10:02,540 --> 00:10:10,220 When I'm talking about joint distributions, I'm going to 183 00:10:10,220 --> 00:10:14,780 find the space in which both of these things are true, 184 00:10:14,780 --> 00:10:17,210 scoped to the size of the universe, or the 185 00:10:17,210 --> 00:10:20,600 entire sample space. 186 00:10:20,600 --> 00:10:25,220 So in this space both A and B are true, and I'm looking at 187 00:10:25,220 --> 00:10:26,940 it relative to the size of the universe. 188 00:10:26,940 --> 00:10:43,800 189 00:10:43,800 --> 00:10:47,680 In contrast, when I'm talking about conditional probability, 190 00:10:47,680 --> 00:10:58,090 or the probability of A given B, I'm going to look at where 191 00:10:58,090 --> 00:11:03,940 my specifications are true, scoped to where 192 00:11:03,940 --> 00:11:06,690 my givens are true. 193 00:11:06,690 --> 00:11:13,460 So if I'm already dealing in the space restricted to B, I'm 194 00:11:13,460 --> 00:11:15,955 just looking at this size, or this space. 195 00:11:15,955 --> 00:11:19,660 196 00:11:19,660 --> 00:11:22,140 But because I'm scoped to B, I'm going to end up dividing 197 00:11:22,140 --> 00:11:53,710 by the area of B, instead of the area of U. 198 00:11:53,710 --> 00:11:55,420 This is the main difference between the joint and 199 00:11:55,420 --> 00:11:56,450 conditional distributions. 200 00:11:56,450 --> 00:11:59,470 And a lot of people get hung up on it, which is why I'm 201 00:11:59,470 --> 00:12:01,055 exhaustively walking through it. 
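The difference between dividing by the area of U and dividing by the area of B can be seen on a small finite example. The universe of twelve equally likely outcomes and the sets `A` and `B` below are assumptions chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

# Hypothetical universe of 12 equally likely outcomes.
U = set(range(12))
A = {0, 1, 2, 3, 4, 5}  # outcomes where A is true
B = {4, 5, 6, 7}        # outcomes where B is true

# Joint: the region where both A and B are true, scoped to the whole universe.
p_joint = Fraction(len(A & B), len(U))

# Conditional: the same region, but scoped to where the given (B) is true.
p_A_given_B = Fraction(len(A & B), len(B))

print(p_joint)       # 2 of the 12 outcomes in U
print(p_A_given_B)   # 2 of the 4 outcomes in B
```

The numerator is identical in both cases; only the denominator changes, from the size of U to the size of B.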
202 00:12:01,055 --> 00:12:04,870 203 00:12:04,870 --> 00:12:06,790 A couple of other things before we talk about the first 204 00:12:06,790 --> 00:12:09,930 way in which we can use our models for uncertainty to do 205 00:12:09,930 --> 00:12:12,730 some amount of addressing of the fact that we're going to 206 00:12:12,730 --> 00:12:14,010 have to deal with uncertainty in the future. 207 00:12:14,010 --> 00:12:18,380 208 00:12:18,380 --> 00:12:20,710 If we start off with a joint distribution and we want to 209 00:12:20,710 --> 00:12:22,690 reduce the number of variables that we're actually talking 210 00:12:22,690 --> 00:12:26,340 about, we can do so by exhaustively walking through 211 00:12:26,340 --> 00:12:29,980 all the possible assigned values for the variable that 212 00:12:29,980 --> 00:12:35,600 we want to disregard, and then summing up the values 213 00:12:35,600 --> 00:12:36,850 appropriately. 214 00:12:36,850 --> 00:12:40,670 215 00:12:40,670 --> 00:12:44,600 An easy example for this is if we had the joint probabilities 216 00:12:44,600 --> 00:12:46,980 of all the colors of shirts that I wear and all the colors 217 00:12:46,980 --> 00:12:49,360 of pants that I wear, and we only wanted to talk about all 218 00:12:49,360 --> 00:12:54,130 the colors of shirts that I wear, then we could 219 00:12:54,130 --> 00:12:57,900 exhaustively cover all the different colors of pants that 220 00:12:57,900 --> 00:13:01,870 I wear and accumulate all the different values of shirts 221 00:13:01,870 --> 00:13:11,140 that I wear simultaneously, and then collect that 222 00:13:11,140 --> 00:13:12,390 distribution. 223 00:13:12,390 --> 00:13:15,080 224 00:13:15,080 --> 00:13:19,410 Related is the concept of total probability, which is 225 00:13:19,410 --> 00:13:22,020 the same kind of accumulation. 226 00:13:22,020 --> 00:13:27,480 It just operates in the conditional space, as opposed 227 00:13:27,480 --> 00:13:30,890 to in the joint space. 
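The shirts-and-pants example above can be sketched as a sum over the variable being disregarded. The colors, the numbers in `joint`, and the function name `marginalize_out_pants` are invented for illustration; only the procedure follows the lecture.

```python
# Hypothetical joint distribution P(shirt, pants); the numbers are made up.
joint = {
    ("pink", "black"): 0.10,
    ("pink", "khaki"): 0.20,
    ("blue", "black"): 0.30,
    ("blue", "khaki"): 0.40,
}

def marginalize_out_pants(joint_dist):
    """Sum over every pants color to leave a distribution over shirts only."""
    shirt_dist = {}
    for (shirt, _pants), p in joint_dist.items():
        shirt_dist[shirt] = shirt_dist.get(shirt, 0.0) + p
    return shirt_dist

p_shirt = marginalize_out_pants(joint)
print(p_shirt)
```

Each shirt color accumulates the probability of every pants color it was paired with, so the result still sums to 1 (up to floating-point rounding).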
228 00:13:30,890 --> 00:13:36,770 So in this case, if I'm already operating in A given B 229 00:13:36,770 --> 00:13:45,690 land, I have to scope myself back out to the space of the 230 00:13:45,690 --> 00:13:49,670 universe by then accounting for the fact that I've only 231 00:13:49,670 --> 00:13:58,290 been operating in the scope of B. Then exhaustively enumerate 232 00:13:58,290 --> 00:14:00,740 all the possible values of B, sum the probabilities 233 00:14:00,740 --> 00:14:05,880 associated with those values, and then I've reduced the 234 00:14:05,880 --> 00:14:07,390 number of dimensions that I'm talking about. 235 00:14:07,390 --> 00:14:12,840 236 00:14:12,840 --> 00:14:14,435 The final thing we have to talk about before we move on 237 00:14:14,435 --> 00:14:16,720 to state estimation is Bayes' evidence. 238 00:14:16,720 --> 00:14:18,840 Or you've probably seen this demonstrated as Bayes' rule. 239 00:14:18,840 --> 00:14:25,850 240 00:14:25,850 --> 00:14:30,980 If I want to talk about B scoped to A, and all I have is 241 00:14:30,980 --> 00:14:38,320 A scoped to B, B and A. When we walked through total 242 00:14:38,320 --> 00:14:42,930 probability, we saw that the conditional probability 243 00:14:42,930 --> 00:14:45,570 multiplied by the probability associated with the variable 244 00:14:45,570 --> 00:14:48,310 that we're conditioning on, is equal to the joint 245 00:14:48,310 --> 00:14:50,710 probability. 246 00:14:50,710 --> 00:14:56,640 When we multiply P of A given B by P of B, we're going to 247 00:14:56,640 --> 00:15:00,390 end up with the joint probability associated with 248 00:15:00,390 --> 00:15:01,900 the two variables. 249 00:15:01,900 --> 00:15:05,440 When we then divide back out by A or scope our joint 250 00:15:05,440 --> 00:15:10,510 probability to A, we end up talking about conditional 251 00:15:10,510 --> 00:15:13,790 probabilities again, which is where B given A comes from. 
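The chain described above, from conditional to joint to total probability and back to the other conditional, can be sketched numerically. The scenario (B is whether it rains, A is whether an umbrella is carried) and all the numbers are assumptions made up for this example.

```python
# Hypothetical inputs: P(B) and P(A = true | B), with made-up numbers.
p_B = {True: 0.3, False: 0.7}           # B: it rains
p_A_given_B = {True: 0.9, False: 0.2}   # A: an umbrella is carried

# Joint: P(A, B = b) = P(A | B = b) * P(B = b)
p_A_and_B = {b: p_A_given_B[b] * p_B[b] for b in p_B}

# Total probability: sum the joint over every value of B to get P(A).
p_A = sum(p_A_and_B.values())

# Bayes' rule: scope the joint to A to get P(B = b | A).
p_B_given_A = {b: p_A_and_B[b] / p_A for b in p_B}

print(p_A)
print(p_B_given_A)
```

Here `p_B_given_A` is computed without ever being given directly, which is the sense in which Bayes' rule lets you go from A scoped to B to B scoped to A.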
252 00:15:13,790 --> 00:15:18,280 253 00:15:18,280 --> 00:15:20,160 Bayes' evidence, or Bayes' rule, is the basis for 254 00:15:20,160 --> 00:15:22,810 inference, which is going to be really important for state 255 00:15:22,810 --> 00:15:24,380 estimation, which we'll cover next time. 256 00:15:24,380 --> 00:15:27,232