1 00:00:00,790 --> 00:00:03,130 The following content is provided under a Creative 2 00:00:03,130 --> 00:00:04,550 Commons license. 3 00:00:04,550 --> 00:00:06,760 Your support will help MIT OpenCourseWare 4 00:00:06,760 --> 00:00:10,850 continue to offer high quality educational resources for free. 5 00:00:10,850 --> 00:00:13,390 To make a donation or to view additional materials 6 00:00:13,390 --> 00:00:17,320 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,320 --> 00:00:18,570 at ocw.mit.edu. 8 00:00:28,762 --> 00:00:31,140 JOHN GUTTAG: So today, we're going 9 00:00:31,140 --> 00:00:34,560 to move on to a fairly different world than the world 10 00:00:34,560 --> 00:00:36,090 we've been living in. 11 00:00:36,090 --> 00:00:37,950 And this will be a world we'll be living in 12 00:00:37,950 --> 00:00:40,580 for quite a few lectures. 13 00:00:40,580 --> 00:00:42,990 But before I do that, I want to get back 14 00:00:42,990 --> 00:00:47,170 to just finish up something that Professor Grimson started. 15 00:00:47,170 --> 00:00:50,520 You may recall he talked about family trees 16 00:00:50,520 --> 00:00:52,650 and raised the question, was it actually 17 00:00:52,650 --> 00:00:55,890 possible to represent all ancestral relationships 18 00:00:55,890 --> 00:00:57,560 as a tree? 19 00:00:57,560 --> 00:00:59,890 Well, as a counterexample, I'm sure some of you 20 00:00:59,890 --> 00:01:03,770 are familiar with Oedipus Rex. 21 00:01:03,770 --> 00:01:05,269 For those of you who are not, I'm 22 00:01:05,269 --> 00:01:07,940 happy to give you a plot summary at the end of the lecture. 23 00:01:07,940 --> 00:01:10,880 It's a rather bizarre plot. 24 00:01:10,880 --> 00:01:16,110 But it was captured in a wonderful song by Tom Lehrer. 25 00:01:16,110 --> 00:01:19,160 The short story is Oedipus ended up marrying his mother 26 00:01:19,160 --> 00:01:22,430 and having four children. 
27 00:01:22,430 --> 00:01:25,100 And Tom Lehrer, if you've never heard of Tom Lehrer, 28 00:01:25,100 --> 00:01:29,660 you're missing one of the world's funniest songwriters. 29 00:01:29,660 --> 00:01:32,300 And he had a wonderful song called "Oedipus Rex," 30 00:01:32,300 --> 00:01:38,510 and I recommend this YouTube as a way to go and listen to it. 31 00:01:38,510 --> 00:01:44,870 And you can gather from the quote what the story is about. 32 00:01:44,870 --> 00:01:46,970 I also recommend the play, by the way. 33 00:01:46,970 --> 00:01:50,330 It's really kind of appalling what goes on, 34 00:01:50,330 --> 00:01:53,090 but it's beautiful. 35 00:01:53,090 --> 00:01:57,050 Back to the main topic, here's the relevant reading-- 36 00:01:59,570 --> 00:02:05,250 a small bit from later in the book and then chapter 14. 37 00:02:05,250 --> 00:02:07,260 You may notice that we're not actually going 38 00:02:07,260 --> 00:02:09,264 through the book in order. 39 00:02:09,264 --> 00:02:11,430 And the reason we're not doing that is because we're 40 00:02:11,430 --> 00:02:13,440 trying to get you information you need in time 41 00:02:13,440 --> 00:02:14,395 to do problem sets. 42 00:02:18,810 --> 00:02:24,480 So the topic of today is really uncertainty and the fact 43 00:02:24,480 --> 00:02:29,550 that the world is really annoyingly hard to understand. 44 00:02:32,520 --> 00:02:36,480 This is a signpost related to 6.0002, 45 00:02:36,480 --> 00:02:41,170 but we won't go into too much detail about it. 46 00:02:41,170 --> 00:02:43,050 We'd rather things were certain. 47 00:02:43,050 --> 00:02:47,330 But in fact, they usually are not. 48 00:02:47,330 --> 00:02:51,710 And this is a place where 6.0002 diverges 49 00:02:51,710 --> 00:02:53,930 from the typical introductory computer science 50 00:02:53,930 --> 00:02:58,250 course, which focuses on things that are functional-- 51 00:02:58,250 --> 00:03:02,030 given an input, you always get the same output. 
52 00:03:02,030 --> 00:03:03,950 It's predictable. 53 00:03:03,950 --> 00:03:07,760 And we like to do that, because that's easier to teach. 54 00:03:07,760 --> 00:03:11,300 But in fact, for reasons we'll be talking about, 55 00:03:11,300 --> 00:03:14,210 it's not nearly as useful if you're 56 00:03:14,210 --> 00:03:16,580 trying to actually write computations that 57 00:03:16,580 --> 00:03:18,860 help you understand the world. 58 00:03:18,860 --> 00:03:21,110 You have to face uncertainty head on. 59 00:03:25,030 --> 00:03:27,860 An analogy is that for many years, people believed 60 00:03:27,860 --> 00:03:31,490 in Newtonian mechanics-- 61 00:03:31,490 --> 00:03:34,520 I guess they still do in 8.01 maybe-- 62 00:03:34,520 --> 00:03:38,030 that every effect has a cause. 63 00:03:38,030 --> 00:03:40,430 An apple falls from the tree because of gravity, 64 00:03:40,430 --> 00:03:42,390 and you know where it's going to land. 65 00:03:42,390 --> 00:03:45,080 And the world can be understood causally. 66 00:03:45,080 --> 00:03:50,090 And people believed this really for quite a long time, 67 00:03:50,090 --> 00:03:54,500 most of history, until the early part 68 00:03:54,500 --> 00:03:58,250 of the 20th century, when the so-called Copenhagen 69 00:03:58,250 --> 00:04:00,770 doctrine was put forth. 70 00:04:03,880 --> 00:04:06,670 The doctrine there from Bohr and Heisenberg, 71 00:04:06,670 --> 00:04:09,910 two very famous physicists, was one 72 00:04:09,910 --> 00:04:13,930 of what they called causal nondeterminism. 73 00:04:13,930 --> 00:04:17,709 And their assertion was that the world at its very most 74 00:04:17,709 --> 00:04:24,610 fundamental level behaves in a way that you cannot predict. 75 00:04:24,610 --> 00:04:28,990 It's OK to make a statement that x is highly likely to occur, 76 00:04:28,990 --> 00:04:33,430 almost certain to occur, but for no case can 77 00:04:33,430 --> 00:04:36,310 you make a statement x will occur. 
78 00:04:36,310 --> 00:04:40,360 Nothing has a probability of one. 79 00:04:40,360 --> 00:04:43,720 This is hard for us to imagine today, when we all 80 00:04:43,720 --> 00:04:45,580 know quantum mechanics. 81 00:04:45,580 --> 00:04:50,320 But at the turn of the century, this was a shocking statement. 82 00:04:50,320 --> 00:04:53,230 And two other very well-known physicists, 83 00:04:53,230 --> 00:04:55,900 Albert Einstein and Schrödinger, basically 84 00:04:55,900 --> 00:04:57,460 said, no, this is wrong. 85 00:04:57,460 --> 00:05:00,130 Bohr, Heisenberg, you guys are idiots. 86 00:05:00,130 --> 00:05:01,570 It's just not true. 87 00:05:01,570 --> 00:05:03,670 They probably didn't call them idiots. 88 00:05:03,670 --> 00:05:06,730 And this is most exemplified by Einstein's famous quote 89 00:05:06,730 --> 00:05:11,230 that "God does not play dice," which is indicative of the fact 90 00:05:11,230 --> 00:05:13,990 that this was actually a discussion that permeated 91 00:05:13,990 --> 00:05:19,570 not just the world of physics, but society in general. People 92 00:05:19,570 --> 00:05:22,150 really turned it into literally a religious issue, 93 00:05:22,150 --> 00:05:24,900 as did Einstein. 94 00:05:24,900 --> 00:05:26,940 Well, so now we should ask the question, 95 00:05:26,940 --> 00:05:28,830 does it really matter? 96 00:05:28,830 --> 00:05:31,260 And to illustrate that, I need two coins. 97 00:05:31,260 --> 00:05:33,900 I forgot to bring any coins with me. 98 00:05:33,900 --> 00:05:35,840 Does anyone got a coin they can lend me? 99 00:05:35,840 --> 00:05:37,301 AUDIENCE: I have some coins. 100 00:05:37,301 --> 00:05:39,900 JOHN GUTTAG: All right. 101 00:05:39,900 --> 00:05:42,300 Now, this is where I see how much the students trust me. 102 00:05:42,300 --> 00:05:44,190 Do I get a penny? 103 00:05:44,190 --> 00:05:46,440 Do I get a silver dollar? 104 00:05:46,440 --> 00:05:47,460 So what do we got here? 
105 00:05:50,500 --> 00:05:54,600 This is someone who's entrusting me with quarters, not so bad. 106 00:05:57,500 --> 00:06:00,149 So we'll take these quarters, and we'll shake them up, 107 00:06:00,149 --> 00:06:01,690 and we'll put them down on the table. 108 00:06:04,240 --> 00:06:07,000 And now, we'll ask a question-- 109 00:06:07,000 --> 00:06:13,140 do we have two heads, two tails, or one head and one tail? 110 00:06:13,140 --> 00:06:17,220 So who thinks we have two heads? 111 00:06:17,220 --> 00:06:20,370 Who thinks we have two tails? 112 00:06:20,370 --> 00:06:23,230 Who thinks we have one of each? 113 00:06:23,230 --> 00:06:26,580 Well, clearly, everyone except a few people-- for example, 114 00:06:26,580 --> 00:06:29,730 the Indians fan, who clearly believes in the counterfactual-- 115 00:06:33,030 --> 00:06:37,080 made the most probabilistic decision. 116 00:06:37,080 --> 00:06:40,550 But in fact, there is no nondeterminism here. 117 00:06:40,550 --> 00:06:43,040 I know the answer. 118 00:06:43,040 --> 00:06:47,600 And so in some sense, it doesn't matter 119 00:06:47,600 --> 00:06:49,820 whether it's deterministic, because in fact, it's 120 00:06:49,820 --> 00:06:52,070 not causally nondeterministic. 121 00:06:52,070 --> 00:06:58,120 The answer is quite clear, but you don't know the answer. 122 00:06:58,120 --> 00:07:03,870 And so whether or not the world is inherently unpredictable, 123 00:07:03,870 --> 00:07:08,760 the fact that we never have complete knowledge of the world 124 00:07:08,760 --> 00:07:10,770 suggests that we might as well treat 125 00:07:10,770 --> 00:07:15,130 it as inherently unpredictable. 126 00:07:15,130 --> 00:07:19,060 And so this is called predictive nondeterminism. 127 00:07:19,060 --> 00:07:21,365 And this really is what's going to underlie 128 00:07:21,365 --> 00:07:23,740 pretty much everything else we're going to be doing here. 129 00:07:30,370 --> 00:07:34,000 No comments about that? 
130 00:07:34,000 --> 00:07:37,150 I wouldn't do that to you. 131 00:07:37,150 --> 00:07:39,700 Thank you. 132 00:07:39,700 --> 00:07:42,260 I know you are wishing to get interest on the money, 133 00:07:42,260 --> 00:07:44,140 but you don't get any. 134 00:07:44,140 --> 00:07:46,060 AUDIENCE: Was it heads or tails? 135 00:07:51,376 --> 00:07:52,500 JOHN GUTTAG: What was that? 136 00:07:56,160 --> 00:08:00,660 So when we think about nondeterminism in computation, 137 00:08:00,660 --> 00:08:04,150 we use the word stochastic process. 138 00:08:04,150 --> 00:08:07,020 And that's any process that's ongoing 139 00:08:07,020 --> 00:08:12,180 in which the next state depends upon the previous states 140 00:08:12,180 --> 00:08:14,800 and some random element. 141 00:08:14,800 --> 00:08:18,450 So typically up till now when we've written code, 142 00:08:18,450 --> 00:08:20,890 what one line of code did depended only 143 00:08:20,890 --> 00:08:23,260 on what the previous lines of code did. 144 00:08:23,260 --> 00:08:25,810 There was no randomness. 145 00:08:25,810 --> 00:08:28,282 Here, we're going to have randomness. 146 00:08:28,282 --> 00:08:29,740 And we can see the difference if we 147 00:08:29,740 --> 00:08:34,450 look at these two specifications of rolling a die. 148 00:08:34,450 --> 00:08:38,320 The first one, returns an int between 1 and 6, 149 00:08:38,320 --> 00:08:41,890 is what I'll call underdetermined. 150 00:08:41,890 --> 00:08:45,940 By that I mean you can't tell what it's going to return. 151 00:08:45,940 --> 00:08:49,540 Maybe it will return a different number each time you call it, 152 00:08:49,540 --> 00:08:51,700 but it's not required to. 153 00:08:51,700 --> 00:08:55,120 Maybe it will return three every time you call it. 154 00:08:55,120 --> 00:08:58,690 The second specification requires randomness. 155 00:08:58,690 --> 00:09:01,360 It says it returns a randomly chosen int. 156 00:09:01,360 --> 00:09:06,710 So it requires a stochastic implementation. 
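To make the contrast between the two specifications concrete, here is a minimal sketch. The function name `roll_die_underdetermined` is hypothetical and does not come from the lecture. It returns 3 on every call, which the first specification permits; the second specification, which asks for a randomly chosen int, would rule this implementation out.

```python
def roll_die_underdetermined():
    """Returns an int between 1 and 6."""
    # Legal under this specification: nothing says the result must vary.
    return 3

# Every call satisfies "an int between 1 and 6", yet nothing here is random.
rolls = [roll_die_underdetermined() for _ in range(10)]
print(rolls)
```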
157 00:09:06,710 --> 00:09:11,090 Let's look at how we implement a random process in Python. 158 00:09:11,090 --> 00:09:15,890 We start by importing the library random. 159 00:09:15,890 --> 00:09:17,520 This is not to say you can import 160 00:09:17,520 --> 00:09:19,770 any random library you want. 161 00:09:19,770 --> 00:09:22,530 It's to say you import the library called random. 162 00:09:22,530 --> 00:09:23,810 Let me get my pen out of here. 163 00:09:27,310 --> 00:09:29,230 And we'll use that a lot. 164 00:09:29,230 --> 00:09:32,590 And then we're going to use the function in random called 165 00:09:32,590 --> 00:09:34,940 random.choice. 166 00:09:34,940 --> 00:09:39,530 It takes as an argument a sequence, in this case a list, 167 00:09:39,530 --> 00:09:43,850 and randomly chooses one member of the list. 168 00:09:43,850 --> 00:09:46,160 And it chooses it uniformly. 169 00:09:49,010 --> 00:09:52,930 It's a uniform distribution. 170 00:09:52,930 --> 00:09:56,860 And what that means is that it's equally probable 171 00:09:56,860 --> 00:09:59,650 that it will choose any number in that list each time 172 00:09:59,650 --> 00:10:01,690 you call it. 173 00:10:01,690 --> 00:10:03,820 We'll later look at distributions 174 00:10:03,820 --> 00:10:06,700 that are not uniform, not equally probable, 175 00:10:06,700 --> 00:10:08,320 where things are weighted. 176 00:10:08,320 --> 00:10:10,375 But here, it's quite simple, it's just uniform. 177 00:10:13,470 --> 00:10:16,980 And then we can test it using testRoll-- 178 00:10:16,980 --> 00:10:21,930 it takes some number n, rolls the die that many times, 179 00:10:21,930 --> 00:10:24,970 and creates a string telling us what we got. 180 00:10:29,750 --> 00:10:36,300 So let's consider running this on, say, testRoll of five. 
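The code on the slide is not reproduced in the transcript, so the following is only a sketch consistent with the description above. `testRoll` is named in the lecture; the name `rollDie` and the default of `n=10` are assumptions. This version returns the string rather than printing it, so the result is easy to inspect.

```python
import random

def rollDie():
    """Returns a randomly chosen int between 1 and 6."""
    # random.choice picks one member of the sequence, uniformly.
    return random.choice([1, 2, 3, 4, 5, 6])

def testRoll(n=10):
    """Rolls the die n times and returns the results as one string."""
    result = ''
    for _ in range(n):
        result += str(rollDie())
    return result

print(testRoll(5))
```

Each call to `testRoll(5)` yields a five-character string of digits between 1 and 6, and the digits will usually differ from one run to the next.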
181 00:10:36,300 --> 00:10:38,680 And we'll ask the question, if we run it, 182 00:10:38,680 --> 00:10:43,180 how probable is it that it's going to return a string 183 00:10:43,180 --> 00:10:43,960 of five 1's? 184 00:10:50,100 --> 00:10:51,120 How do we do that? 185 00:10:51,120 --> 00:10:54,420 Now, how many people here are either in 6.041 186 00:10:54,420 --> 00:10:56,670 or have taken 6.041? 187 00:10:56,670 --> 00:10:59,280 Raise your hand. 188 00:10:59,280 --> 00:10:59,850 Oh, good. 189 00:10:59,850 --> 00:11:02,830 So very few of you know probability. 190 00:11:02,830 --> 00:11:03,330 That helps. 191 00:11:06,450 --> 00:11:09,170 So how do we think about that question? 192 00:11:09,170 --> 00:11:14,480 Well, probability, to me at least, is all about counting, 193 00:11:14,480 --> 00:11:16,740 especially discrete probability, which 194 00:11:16,740 --> 00:11:19,900 is what we're looking at here. 195 00:11:19,900 --> 00:11:23,830 What you do is you start by counting the number of events 196 00:11:23,830 --> 00:11:29,710 that have the property of interest 197 00:11:29,710 --> 00:11:31,480 and the number of possible events 198 00:11:31,480 --> 00:11:32,940 and divide one by the other. 199 00:11:35,580 --> 00:11:41,430 So if we think about rolling a die five times, 200 00:11:41,430 --> 00:11:44,070 we can enumerate all of the possible outcomes 201 00:11:44,070 --> 00:11:44,885 of five rolls. 202 00:11:47,390 --> 00:11:50,870 So if we look at that, what are the outcomes? 203 00:11:50,870 --> 00:11:54,150 Well, I could get five 1's. 204 00:11:54,150 --> 00:12:00,720 I could get four 1's and a 2 or four 1's and a 3, skip a few. 205 00:12:00,720 --> 00:12:05,040 The next one would be three 1's, a 2 and a 1, then a 2 and 2, 206 00:12:05,040 --> 00:12:08,850 and finally, at the end, all 6's. 207 00:12:08,850 --> 00:12:13,320 So remember, we looked before, when 208 00:12:13,320 --> 00:12:17,160 we were looking at optimization problems, at binary numbers. 
209 00:12:17,160 --> 00:12:20,670 And we said we can look at all the possible choices of items 210 00:12:20,670 --> 00:12:24,460 in the knapsack by a vector of 0's and 1's. 211 00:12:24,460 --> 00:12:27,340 We said, how many possible choices are there? 212 00:12:27,340 --> 00:12:30,200 Well, it depended on how many binary numbers you could 213 00:12:30,200 --> 00:12:32,910 get in that number of digits. 214 00:12:32,910 --> 00:12:36,410 Well, here we're doing the same thing, but instead of base 2, 215 00:12:36,410 --> 00:12:37,710 it's base 6. 216 00:12:40,590 --> 00:12:45,150 And so the number of possible outcomes of five rolls 217 00:12:45,150 --> 00:12:45,990 is quite high. 218 00:12:48,760 --> 00:12:50,860 How many of those are five 1's? 219 00:12:50,860 --> 00:12:54,180 Only one of them, right? 220 00:12:54,180 --> 00:12:58,300 So in order to get the probability of five 1's, I 221 00:12:58,300 --> 00:13:00,370 divide 1 by 6 to the fifth. 222 00:13:03,232 --> 00:13:06,720 Does that make sense to everybody? 223 00:13:06,720 --> 00:13:10,565 So in fact, we see it's highly unlikely. 224 00:13:10,565 --> 00:13:15,460 The probability of five 1's is quite small. 225 00:13:15,460 --> 00:13:17,770 Now, suppose we were to ask about the probability 226 00:13:17,770 --> 00:13:19,870 of something else-- 227 00:13:19,870 --> 00:13:27,120 instead of five 1's, say 53421. 228 00:13:27,120 --> 00:13:31,230 It kind of looks more likely than five 1's 229 00:13:31,230 --> 00:13:33,630 in a row, but of course, it isn't, right? 230 00:13:33,630 --> 00:13:37,620 Any specific combination is equally probable. 231 00:13:37,620 --> 00:13:40,420 And there are a lot of them. 
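The counting argument above can be checked by brute force. This is an illustrative sketch, not code from the lecture: it enumerates every outcome of five rolls with `itertools.product`, then counts how many match five 1's and how many match 53421. The variable names are assumptions.

```python
from itertools import product

faces = [1, 2, 3, 4, 5, 6]

# Every possible outcome of five rolls, as a tuple of five faces.
outcomes = list(product(faces, repeat=5))

# Count the outcomes with the property of interest.
all_ones = [o for o in outcomes if o == (1, 1, 1, 1, 1)]
specific = [o for o in outcomes if o == (5, 3, 4, 2, 1)]

# Probability = matching outcomes / possible outcomes.
p_all_ones = len(all_ones) / len(outcomes)
p_specific = len(specific) / len(outcomes)

print(len(outcomes))   # 6**5 = 7776
print(p_all_ones, p_specific)
```

Both probabilities come out to 1/7776, about 0.0001286, which is the sense in which any specific combination is equally probable.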
232 00:13:40,420 --> 00:13:44,920 So all the probability we're going to think about, we 233 00:13:44,920 --> 00:13:48,550 could think about this way, as simply a matter of counting-- 234 00:13:48,550 --> 00:13:51,640 the number of possible events, the number of events that have 235 00:13:51,640 --> 00:13:54,970 the property of interest-- in this case being all 1's-- 236 00:13:54,970 --> 00:13:56,680 and then simple division. 237 00:13:59,530 --> 00:14:03,010 Given that framework, there are three basic facts 238 00:14:03,010 --> 00:14:07,870 about probability we're going to be using a lot. 239 00:14:07,870 --> 00:14:15,980 So one, probabilities always range from 0 to 1. 240 00:14:15,980 --> 00:14:17,460 How do we know that? 241 00:14:17,460 --> 00:14:19,930 Well, we've got a fraction, right? 242 00:14:19,930 --> 00:14:25,190 And the denominator is all possible events. 243 00:14:25,190 --> 00:14:29,840 The numerator is the subset of that that's of interest. 244 00:14:29,840 --> 00:14:35,680 So it has to range from 0 to the denominator. 245 00:14:35,680 --> 00:14:37,330 And that tells us that the fraction 246 00:14:37,330 --> 00:14:40,250 has to range from 0 to 1. 247 00:14:40,250 --> 00:14:43,430 So 1 says it's always going to happen, 0 never. 248 00:14:46,870 --> 00:14:50,290 So if the probability of an event occurring is p, 249 00:14:50,290 --> 00:14:54,250 what's the probability of it not occurring? 250 00:14:54,250 --> 00:14:57,060 This follows from the first bullet. 251 00:14:57,060 --> 00:15:04,050 It's simply going to be 1 minus p. 252 00:15:04,050 --> 00:15:07,650 This is a trick that we'll find we'll use a lot. 253 00:15:07,650 --> 00:15:09,660 Because it's often the case when you 254 00:15:09,660 --> 00:15:13,080 want to compute the probability of something happening, 255 00:15:13,080 --> 00:15:16,680 it's easier to compute the probability of it not happening 256 00:15:16,680 --> 00:15:18,980 and subtract it from 1. 
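As a small illustration of the 1 minus p rule, the sketch below asks a question that is not from the lecture: what is the probability of at least one 1 in five rolls? It computes the answer twice by counting, once directly and once as 1 minus the probability of no 1's, using exact fractions. The example and its variable names are assumptions.

```python
from fractions import Fraction
from itertools import product

# Every possible outcome of five rolls.
outcomes = list(product(range(1, 7), repeat=5))

# Direct count: outcomes that contain at least one 1.
direct = Fraction(sum(1 for o in outcomes if 1 in o), len(outcomes))

# Complement: probability of no 1's at all, subtracted from 1.
no_ones = Fraction(sum(1 for o in outcomes if 1 not in o), len(outcomes))
via_complement = 1 - no_ones

print(direct, via_complement)
```

The two results agree. Counting the outcomes with no 1's is the simpler count, since each roll just has five allowed faces.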
257 00:15:18,980 --> 00:15:21,560 And we'll see an example of that later today. 258 00:15:24,550 --> 00:15:27,940 Now, here's the biggie. 259 00:15:27,940 --> 00:15:31,650 When events are independent of each other, 260 00:15:31,650 --> 00:15:35,000 the probability of all of the events occurring 261 00:15:35,000 --> 00:15:39,380 is equal to the product of the probabilities of each 262 00:15:39,380 --> 00:15:40,885 of the events occurring. 263 00:15:44,280 --> 00:15:53,890 So if the probability of A is 0.5 and the probability of B 264 00:15:53,890 --> 00:16:01,150 is 0.4, the probability of A and B is what? 265 00:16:06,110 --> 00:16:07,670 0.5 times 0.4. 266 00:16:07,670 --> 00:16:10,680 You guys can figure that out. 267 00:16:10,680 --> 00:16:14,330 I think that's 0.2. 268 00:16:14,330 --> 00:16:16,100 So you'd expect that, that it should 269 00:16:16,100 --> 00:16:20,390 be much smaller than either of the first two probabilities. 270 00:16:20,390 --> 00:16:22,060 This is the most common rule, it's 271 00:16:22,060 --> 00:16:24,460 something we use all the time in probabilities, 272 00:16:24,460 --> 00:16:28,360 the so-called multiplicative law. 273 00:16:28,360 --> 00:16:33,120 We have to be careful about it, however, 274 00:16:33,120 --> 00:16:37,470 in that it only holds if the events are actually 275 00:16:37,470 --> 00:16:40,570 independent. 276 00:16:40,570 --> 00:16:44,920 Two events are independent if the outcome of one 277 00:16:44,920 --> 00:16:47,110 has no influence on the outcome of the other. 278 00:16:50,010 --> 00:16:52,370 So when we roll the die, we assume 279 00:16:52,370 --> 00:16:54,350 that the first roll, the outcome, 280 00:16:54,350 --> 00:16:55,870 was independent of the-- 281 00:16:55,870 --> 00:16:58,370 or the second roll was independent of the first roll, 282 00:16:58,370 --> 00:17:00,910 independent of the fourth roll. 
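The multiplicative law can be sketched in a few lines. The first part reproduces the 0.5 times 0.4 arithmetic with exact fractions. The second part is an added illustration, not from the lecture: it checks by counting that, for two rolls of a die, the probability of both rolls being 1 equals the product of the individual probabilities.

```python
from fractions import Fraction
from itertools import product

# The lecture's numbers: P(A) = 0.5, P(B) = 0.4, A and B independent.
p_a = Fraction(1, 2)
p_b = Fraction(2, 5)
p_a_and_b = p_a * p_b
print(float(p_a_and_b))

# Check by counting over all outcomes of two rolls.
outcomes = list(product(range(1, 7), repeat=2))
total = len(outcomes)
p_first_is_1 = Fraction(sum(1 for a, b in outcomes if a == 1), total)
p_second_is_1 = Fraction(sum(1 for a, b in outcomes if b == 1), total)
p_both_are_1 = Fraction(sum(1 for a, b in outcomes if a == 1 and b == 1), total)
print(p_both_are_1 == p_first_is_1 * p_second_is_1)
```

The check holds because every pair of faces is equally likely, so what the first roll shows has no influence on the second.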
283 00:17:00,910 --> 00:17:02,560 When we looked at the two coins, we 284 00:17:02,560 --> 00:17:05,410 assume that heads and tails of each coin 285 00:17:05,410 --> 00:17:08,460 was independent of the other coin. 286 00:17:08,460 --> 00:17:10,200 I didn't, for example, look at one coin 287 00:17:10,200 --> 00:17:12,304 and make sure that the other one was different. 288 00:17:15,700 --> 00:17:19,079 The danger here is that people often 289 00:17:19,079 --> 00:17:22,950 compute probabilities assuming independence when you don't 290 00:17:22,950 --> 00:17:26,099 actually have independence. 291 00:17:26,099 --> 00:17:29,470 So let's look at an example. 292 00:17:29,470 --> 00:17:32,980 For those of you familiar with American football, 293 00:17:32,980 --> 00:17:35,800 the New England Patriots and the Denver Broncos 294 00:17:35,800 --> 00:17:38,380 are two prominent teams. 295 00:17:38,380 --> 00:17:40,660 And let's look at computing the probability 296 00:17:40,660 --> 00:17:45,690 of whether one of them will lose on a given Sunday. 297 00:17:45,690 --> 00:17:48,840 So the Patriots have a winning percentage of 7 of 8-- 298 00:17:48,840 --> 00:17:51,420 they've won 7 of their 8 games so far-- 299 00:17:51,420 --> 00:17:54,590 and the Broncos 6 of 8. 300 00:17:54,590 --> 00:17:57,560 The probability of both winning next Sunday, 301 00:17:57,560 --> 00:18:00,860 assuming that this is indicative of how good they are, 302 00:18:00,860 --> 00:18:03,470 we can get with the multiplicative rule. 303 00:18:03,470 --> 00:18:08,750 So it's 7/8 times 6/8, or 42/64. 304 00:18:08,750 --> 00:18:12,060 We could simplify that fraction, I suppose. 305 00:18:12,060 --> 00:18:14,370 Does that make sense? 306 00:18:14,370 --> 00:18:17,840 So this is probably a pretty good estimate of both of them 307 00:18:17,840 --> 00:18:20,600 winning next Sunday. 308 00:18:20,600 --> 00:18:24,380 So the probability of at least one of them losing 309 00:18:24,380 --> 00:18:27,740 is 1 minus that. 
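The football arithmetic can be written out with exact fractions. This sketch assumes, as the lecture does at this point, that the two games are independent and that each team's record is its true probability of winning. The variable names are illustrative.

```python
from fractions import Fraction

p_patriots_win = Fraction(7, 8)
p_broncos_win = Fraction(6, 8)

# Multiplicative law, valid only if the two outcomes are independent.
p_both_win = p_patriots_win * p_broncos_win

# The 1 minus rule: at least one loses unless both win.
p_at_least_one_loses = 1 - p_both_win

print(p_both_win, p_at_least_one_loses)
```

`Fraction` reduces automatically, so the results print as 21/32 and 11/32, which are the same values as 42/64 and 22/64.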
310 00:18:27,740 --> 00:18:30,430 So here's an example of why we often use 311 00:18:30,430 --> 00:18:34,120 the 1 minus rule, because we could 312 00:18:34,120 --> 00:18:38,020 compute the probability of both of them 313 00:18:38,020 --> 00:18:41,440 winning by simply multiplying. 314 00:18:41,440 --> 00:18:44,130 And we subtract that from 1. 315 00:18:44,130 --> 00:18:47,220 However, what about Sunday, December 18? 316 00:18:47,220 --> 00:18:50,440 What's the probability? 317 00:18:50,440 --> 00:18:53,920 Well, as it happens, that day the Patriots 318 00:18:53,920 --> 00:18:55,025 are playing the Broncos. 319 00:18:58,380 --> 00:19:02,230 So now suddenly, the outcomes are not independent. 320 00:19:02,230 --> 00:19:05,550 The probability of one of them losing 321 00:19:05,550 --> 00:19:10,470 is influenced by the probability of the other winning. 322 00:19:10,470 --> 00:19:13,540 So you would expect the probability of one 323 00:19:13,540 --> 00:19:17,989 of them losing is much closer to 1 than 22/64, 324 00:19:17,989 --> 00:19:18,780 which is about 1/3. 325 00:19:21,780 --> 00:19:25,490 So in this case, it's easy. 326 00:19:25,490 --> 00:19:28,430 But as we'll see, as we get through the term, 327 00:19:28,430 --> 00:19:30,560 there are lots of cases where you 328 00:19:30,560 --> 00:19:33,950 have to work pretty hard to understand whether or not two 329 00:19:33,950 --> 00:19:36,350 events really are independent. 330 00:19:36,350 --> 00:19:40,410 And if you get it wrong, you get a totally bogus answer. 331 00:19:40,410 --> 00:19:45,530 1/3 versus 1 is a pretty big difference. 332 00:19:45,530 --> 00:19:49,010 By the way, as it happens, the probability of the Broncos 333 00:19:49,010 --> 00:19:50,070 losing is about 1. 334 00:19:56,190 --> 00:19:58,400 Let's go look at some code. 
335 00:20:01,040 --> 00:20:03,260 And we'll go back to our dice, because it's 336 00:20:03,260 --> 00:20:05,420 much easier to simulate dice games 337 00:20:05,420 --> 00:20:08,300 than it is to simulate football games. 338 00:20:11,510 --> 00:20:13,340 So here it is. 339 00:20:13,340 --> 00:20:17,030 And we're going to talk a lot about simulations. 340 00:20:17,030 --> 00:20:18,980 So here, rather than rolling the die, 341 00:20:18,980 --> 00:20:20,480 I've written a program to do it. 342 00:20:23,980 --> 00:20:27,660 We've already seen the code for rolling a die. 343 00:20:27,660 --> 00:20:32,990 And so to run this simulation, typically what we're doing here 344 00:20:32,990 --> 00:20:35,480 is I'm giving you the goal-- 345 00:20:35,480 --> 00:20:38,510 for example, are we going to get five 1's-- 346 00:20:38,510 --> 00:20:41,800 the number of trials-- 347 00:20:41,800 --> 00:20:47,060 each trial, in this case, will be say of length 5-- 348 00:20:47,060 --> 00:20:48,770 so I'm going to roll the same die 349 00:20:48,770 --> 00:20:55,130 five times say 1,000 different times, and then just some text 350 00:20:55,130 --> 00:20:57,910 as to what I'm going to print. 351 00:20:57,910 --> 00:21:01,090 Almost all the simulations we look at 352 00:21:01,090 --> 00:21:05,630 are going to start with lines that look a lot like that. 353 00:21:05,630 --> 00:21:08,650 We're going to initialize some variable. 354 00:21:08,650 --> 00:21:11,755 And then we're going to run some number of trials. 355 00:21:16,160 --> 00:21:19,860 So in this case, we're going to get 356 00:21:19,860 --> 00:21:21,340 from the length of the goal-- 357 00:21:21,340 --> 00:21:23,790 so if the goal is five 1's, then we're 358 00:21:23,790 --> 00:21:26,490 going to roll the dice five times; if it's 10 1's, 359 00:21:26,490 --> 00:21:29,830 we'll roll it 10 times. 360 00:21:29,830 --> 00:21:35,310 So this is essentially one trial, one attempt. 
361 00:21:38,850 --> 00:21:41,850 And then we'll check the result. And if it 362 00:21:41,850 --> 00:21:43,720 has the property we want-- 363 00:21:43,720 --> 00:21:47,460 in this case, it's equal to the goal-- 364 00:21:47,460 --> 00:21:50,040 then we're going to increment the total, which 365 00:21:50,040 --> 00:21:54,380 we initialized up here by 1. 366 00:21:54,380 --> 00:21:57,170 So we'll keep track with just the counting-- 367 00:21:57,170 --> 00:22:01,610 the number of trials that actually meet the goal. 368 00:22:01,610 --> 00:22:04,990 And then when we're done, what we're going to do 369 00:22:04,990 --> 00:22:08,560 is divide the number that met the goal 370 00:22:08,560 --> 00:22:10,870 by the number of trials-- 371 00:22:10,870 --> 00:22:14,170 exactly the counting argument we just looked at. 372 00:22:14,170 --> 00:22:19,700 And then we'll print the result. 373 00:22:19,700 --> 00:22:22,220 Almost every simulation we look at 374 00:22:22,220 --> 00:22:24,360 is going to have this structure. 375 00:22:24,360 --> 00:22:27,680 There'll be an outer loop, which is the number of trials. 376 00:22:27,680 --> 00:22:29,870 And then inside-- maybe it'll have a loop, 377 00:22:29,870 --> 00:22:32,600 or maybe it won't-- will be a single trial. 378 00:22:32,600 --> 00:22:33,770 We'll sum up the results. 379 00:22:33,770 --> 00:22:36,920 And then we'll divide by the number of trials. 380 00:22:36,920 --> 00:22:37,490 Let's run it. 381 00:22:45,300 --> 00:22:49,650 So a couple of things are going to go on here. 382 00:22:49,650 --> 00:22:59,570 If you look at the code as we've looked at it before, 383 00:22:59,570 --> 00:23:02,780 what you're seeing is I'm computing the estimated 384 00:23:02,780 --> 00:23:05,180 probability by the simulation. 385 00:23:05,180 --> 00:23:08,270 And I'm comparing it to the actual probability, which we've 386 00:23:08,270 --> 00:23:09,590 already seen how to compute. 
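The simulation on the slide is not included in the transcript. The sketch below follows the structure described above: an outer loop over trials, an inner loop that performs one trial, a running count of trials that met the goal, and a final division. The names `roll_die` and `run_sim` are assumptions, and this version returns the two probabilities instead of printing a caller-supplied text.

```python
import random

def roll_die():
    """Returns a randomly chosen int between 1 and 6."""
    return random.choice([1, 2, 3, 4, 5, 6])

def run_sim(goal, num_trials):
    """Estimates the probability of rolling the digits in goal, in order.

    goal is a string such as '11111'. Returns a pair
    (estimated probability, actual probability).
    """
    total = 0  # number of trials that met the goal
    for _trial in range(num_trials):
        # One trial: roll the die len(goal) times.
        result = ''
        for _roll in range(len(goal)):
            result += str(roll_die())
        if result == goal:
            total += 1
    estimated = total / num_trials
    actual = 1 / 6 ** len(goal)
    return estimated, actual

estimated, actual = run_sim('11111', 1000)
print('Estimated probability of five 1\'s =', round(estimated, 8))
print('Actual probability of five 1\'s =', round(actual, 8))
```

The estimate varies from run to run unless the random number generator is seeded. The actual probability does not, since it is just 1 divided by 6 to the length of the goal.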
387 00:23:12,117 --> 00:23:14,700 So if you look at it, there are a couple of things to look at. 388 00:23:17,370 --> 00:23:19,260 The estimated probability is pretty 389 00:23:19,260 --> 00:23:24,704 close to the actual probability but not the same. 390 00:23:24,704 --> 00:23:26,590 So let's go back to the PowerPoint. 391 00:23:31,860 --> 00:23:34,240 Here are the results. 392 00:23:34,240 --> 00:23:37,680 And there are at least two questions raised 393 00:23:37,680 --> 00:23:40,050 by this result. First of all, how 394 00:23:40,050 --> 00:23:43,290 did I know that this is what would get printed? 395 00:23:43,290 --> 00:23:45,610 Remember, this is random. 396 00:23:45,610 --> 00:23:48,520 How did I know that the estimate-- well, there's 397 00:23:48,520 --> 00:23:51,790 nothing random about the actual probability. 398 00:23:51,790 --> 00:23:55,390 But how did I know that the estimated probability 399 00:23:55,390 --> 00:23:57,180 would be 0? 400 00:23:57,180 --> 00:23:58,470 And why did it print it twice? 401 00:23:58,470 --> 00:24:00,330 Because I messed up the PowerPoint. 402 00:24:00,330 --> 00:24:04,140 Any rate, so how do I know what would get printed? 403 00:24:04,140 --> 00:24:12,610 Well a confession-- random.choice 404 00:24:12,610 --> 00:24:14,920 is not actually random. 405 00:24:14,920 --> 00:24:20,140 In fact, nothing we can do in a computer is actually random. 406 00:24:20,140 --> 00:24:23,650 You can prove that it's impossible to build 407 00:24:23,650 --> 00:24:28,950 a computer that actually generates truly random numbers. 408 00:24:28,950 --> 00:24:32,520 What they do instead is generate numbers 409 00:24:32,520 --> 00:24:34,050 that are called pseudorandom. 410 00:24:42,120 --> 00:24:44,740 How do they do that? 411 00:24:44,740 --> 00:24:48,930 They have an algorithm that given one number generates 412 00:24:48,930 --> 00:24:52,700 the next number in a sequence. 413 00:24:52,700 --> 00:24:56,375 And they start that algorithm with a seed. 
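To illustrate the idea of an algorithm that turns one number into the next, starting from a seed, here is a toy linear congruential generator. This is not how Python's `random` module works (CPython uses the Mersenne Twister), and the class name `ToyGenerator` and its methods are hypothetical. The constants are a commonly published choice for a 32-bit generator of this kind. It is a sketch for intuition only and should not be used for real simulations.

```python
class ToyGenerator:
    """A toy pseudorandom generator, for illustration only."""

    def __init__(self, seed):
        # The seed is simply the first number in the sequence.
        self.state = seed

    def next_int(self):
        # Each new number is computed from the previous one.
        self.state = (1664525 * self.state + 1013904223) % 2 ** 32
        return self.state

    def roll(self):
        # Use the higher bits to pick a die face between 1 and 6.
        return (self.next_int() >> 16) % 6 + 1

gen = ToyGenerator(seed=0)
print([gen.roll() for _ in range(10)])
```

Nothing here is random: given the same seed, the sequence is fully determined. It only looks unpredictable to someone who does not know the seed.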
414 00:25:00,050 --> 00:25:02,630 Now, typically, they get that seed 415 00:25:02,630 --> 00:25:05,930 by reading the clock of the computer. 416 00:25:05,930 --> 00:25:08,090 So most computers have a clock that, say, 417 00:25:08,090 --> 00:25:12,080 keeps track of the number of microseconds since January 1, 418 00:25:12,080 --> 00:25:14,174 1978. 419 00:25:14,174 --> 00:25:15,590 I don't know if that's still true. 420 00:25:15,590 --> 00:25:18,590 That's what Unix used to do. 421 00:25:18,590 --> 00:25:22,070 So the notion is, you start your program, 422 00:25:22,070 --> 00:25:26,420 there's no way of knowing how many microseconds have elapsed. 423 00:25:26,420 --> 00:25:29,395 And so you're getting a random number to start the process. 424 00:25:32,040 --> 00:25:33,660 Since you don't know where it starts, 425 00:25:33,660 --> 00:25:34,800 you don't know what the second number 426 00:25:34,800 --> 00:25:37,050 is, you don't know what the third number is, you don't 427 00:25:37,050 --> 00:25:38,580 know what the fourth number is. 428 00:25:38,580 --> 00:25:42,570 And so it's predictively nondeterministic, 429 00:25:42,570 --> 00:25:46,600 because you don't know what the seed is going to be. 430 00:25:46,600 --> 00:25:49,180 Now, you can imagine that this makes 431 00:25:49,180 --> 00:25:52,460 programs really hard to debug. 432 00:25:52,460 --> 00:25:55,850 Every time you run it, something different could happen. 433 00:25:55,850 --> 00:25:59,220 Now, we'll see often you want them to be unpredictable. 434 00:25:59,220 --> 00:26:02,300 But for now, we want them to be predictable, makes it easier 435 00:26:02,300 --> 00:26:04,130 to prepare the PowerPoint. 436 00:26:04,130 --> 00:26:08,635 So what you have is a command. 437 00:26:13,040 --> 00:26:19,190 You can call random.seed and give it a value 438 00:26:19,190 --> 00:26:21,800 and say, I don't want you to just choose some random seed, 439 00:26:21,800 --> 00:26:24,890 I want you to use 0 as the seed. 
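The effect of `random.seed` can be seen with a short sketch. The helper `five_rolls` is a hypothetical name used only for this illustration. Seeding with the same value before each batch of rolls replays the same pseudorandom sequence.

```python
import random

def five_rolls():
    """Returns a list of five die rolls."""
    return [random.choice([1, 2, 3, 4, 5, 6]) for _ in range(5)]

random.seed(0)
first = five_rolls()

random.seed(0)           # reset the generator to the same starting point
second = five_rolls()

print(first == second)   # True: same seed, same sequence
```

Without the second call to `random.seed(0)`, the second batch would continue the sequence and would usually differ from the first.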
440 00:26:24,890 --> 00:26:27,530 For the same seed, you always get the same sequence 441 00:26:27,530 --> 00:26:30,120 of random values. 442 00:26:30,120 --> 00:26:33,410 And so what I've done is I set the seed to be, I think, 0 443 00:26:33,410 --> 00:26:36,620 in this case, not because there's anything magic about 0, 444 00:26:36,620 --> 00:26:38,780 it's just sort of habit. 445 00:26:38,780 --> 00:26:41,540 But it made it predictable. 446 00:26:41,540 --> 00:26:43,640 As you write programs with randomness 447 00:26:43,640 --> 00:26:45,980 in them, and when you're debugging them, you will almost surely 448 00:26:45,980 --> 00:26:49,550 want to start by setting random.seed to a value 449 00:26:49,550 --> 00:26:51,590 so you get the same answer. 450 00:26:51,590 --> 00:26:54,950 But make sure you debug it with more than one value of this, 451 00:26:54,950 --> 00:26:58,320 so you didn't just get lucky with your seed. 452 00:26:58,320 --> 00:27:01,460 So that's how I knew what would get printed. 453 00:27:01,460 --> 00:27:06,480 The next question is, why did the simulation 454 00:27:06,480 --> 00:27:09,670 give me the wrong answer? 455 00:27:09,670 --> 00:27:14,530 The actual probability is three 0's and 1286. 456 00:27:14,530 --> 00:27:16,630 But it estimated a probability of 0. 457 00:27:19,150 --> 00:27:20,140 Why is it wrong? 458 00:27:24,200 --> 00:27:27,100 Well, let's think about this. 459 00:27:27,100 --> 00:27:30,020 I ran 1,000 trials. 460 00:27:30,020 --> 00:27:32,430 What does it mean to say the probability is zero? 461 00:27:32,430 --> 00:27:36,670 It means that I tried it 1,000 times and didn't ever get 462 00:27:36,670 --> 00:27:39,380 a sequence of five 1's. 463 00:27:39,380 --> 00:27:44,500 So the numerator of the division at the bottom was 0. 464 00:27:44,500 --> 00:27:46,150 Hence, the answer is 0. 465 00:27:46,150 --> 00:27:47,890 Is this surprising? 466 00:27:47,890 --> 00:27:49,440 Well, no.
467 00:27:49,440 --> 00:27:54,200 Because if that's the actual probability of getting five 468 00:27:54,200 --> 00:27:58,075 1's, it's not very shocking that in 1,000 trials 469 00:27:58,075 --> 00:27:58,825 it never happened. 470 00:28:02,260 --> 00:28:06,140 It's not a surprising result. And so we 471 00:28:06,140 --> 00:28:09,230 have to be careful when we run these things to understand 472 00:28:09,230 --> 00:28:14,250 the difference between what's in this case an actual probability 473 00:28:14,250 --> 00:28:17,510 and what statisticians call a sample probability. 474 00:28:25,530 --> 00:28:28,970 So what we got with the sample was 0. 475 00:28:28,970 --> 00:28:32,740 So what's the obvious thing to do? 476 00:28:32,740 --> 00:28:35,590 If you're doing a simulation of an event 477 00:28:35,590 --> 00:28:39,020 and the event is pretty rare, you 478 00:28:39,020 --> 00:28:43,520 want to try it on a very large number of trials. 479 00:28:43,520 --> 00:28:45,050 So let's go back to our code. 480 00:28:51,350 --> 00:28:58,720 And we'll change it to instead of 1,000, 1,000,000. 481 00:28:58,720 --> 00:29:01,572 You can see up here, by the way, where I set the seed. 482 00:29:01,572 --> 00:29:02,840 And now, let's run it. 483 00:29:17,760 --> 00:29:19,650 We did a lot better. 484 00:29:19,650 --> 00:29:22,470 If we look at here our estimated probability, 485 00:29:22,470 --> 00:29:25,980 it's three 0's 128, still not quite 486 00:29:25,980 --> 00:29:30,142 the actual probability but darn close. 487 00:29:30,142 --> 00:29:31,600 And maybe if I had done 10 million, 488 00:29:31,600 --> 00:29:32,891 it would have been even closer. 
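The experiment just described can be sketched as follows. This is a reconstruction under assumptions, not the code shown in lecture: the names roll_die and estimate_five_ones are hypothetical, and the loop stops rolling as soon as a non-1 appears, so the printed estimates will not necessarily match the ones quoted above even with the same seed.

```python
import random


def roll_die():
    """Roll one fair six-sided die."""
    return random.choice([1, 2, 3, 4, 5, 6])


def estimate_five_ones(num_trials):
    """Estimate the probability that five rolls in a row all come up 1."""
    hits = 0
    for _ in range(num_trials):
        # all() stops at the first roll that is not a 1.
        if all(roll_die() == 1 for _ in range(5)):
            hits += 1
    return hits / num_trials


random.seed(0)
actual = (1 / 6) ** 5  # closed-form answer, about 0.0001286
for num_trials in (1000, 1000000):
    print(num_trials, estimate_five_ones(num_trials), actual)
```

With 1,000 trials the expected number of hits is only about 0.13, so an estimate of exactly 0 is the most likely outcome; with 1,000,000 trials the expected count is about 129, which is why the sample probability lands close to the actual one.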
489 00:29:35,610 --> 00:29:38,040 So if you're writing a simulation 490 00:29:38,040 --> 00:29:41,130 to compute the probability of an event 491 00:29:41,130 --> 00:29:44,040 and the event is moderately rare, 492 00:29:44,040 --> 00:29:47,310 then you better run a lot of trials 493 00:29:47,310 --> 00:29:51,750 before you believe your estimated probability. 494 00:29:51,750 --> 00:29:55,440 In a week or so, we'll actually look at that more 495 00:29:55,440 --> 00:29:57,810 mathematically and say, what is a lot, 496 00:29:57,810 --> 00:29:59,130 how do we know what is enough. 497 00:30:12,110 --> 00:30:13,550 What are the morals here? 498 00:30:13,550 --> 00:30:15,430 Moral one, I've just told you-- 499 00:30:15,430 --> 00:30:18,950 takes a lot of trials to get a good estimate of the frequency 500 00:30:18,950 --> 00:30:21,510 of a rare event. 501 00:30:21,510 --> 00:30:26,470 Moral two, we should always, if we're getting an estimated 502 00:30:26,470 --> 00:30:29,290 probability, know that, and probably 503 00:30:29,290 --> 00:30:33,570 say that, and not confuse it with the actual probability. 504 00:30:33,570 --> 00:30:36,400 The third moral here is, it was kind of 505 00:30:36,400 --> 00:30:38,830 stupid to do a simulation. 506 00:30:38,830 --> 00:30:42,430 Since it was a very simple closed-form answer 507 00:30:42,430 --> 00:30:45,550 that we could compute that would really tell us 508 00:30:45,550 --> 00:30:48,220 what the actual probability is, why even 509 00:30:48,220 --> 00:30:51,550 bother with the simulation? 510 00:30:51,550 --> 00:30:53,880 Well, we're going to see why now, 511 00:30:53,880 --> 00:30:57,340 because simulations can be very useful. 512 00:30:57,340 --> 00:31:00,390 Let's look at another problem. 513 00:31:00,390 --> 00:31:02,070 This is the famous birthday problem. 514 00:31:02,070 --> 00:31:03,660 Some of you have seen it. 
515 00:31:03,660 --> 00:31:06,240 What's the probability of at least two people in a group 516 00:31:06,240 --> 00:31:08,770 having the same birthday? 517 00:31:08,770 --> 00:31:10,600 There's a URL at the bottom. 518 00:31:10,600 --> 00:31:12,760 That's pointing to a Google form. 519 00:31:12,760 --> 00:31:15,940 I'd like please all of you who have a computing device 520 00:31:15,940 --> 00:31:20,100 to go to it and fill out your birthday. 521 00:31:20,100 --> 00:31:22,942 It's anonymous, so we won't know how old you are, don't worry. 522 00:31:22,942 --> 00:31:24,150 Actually, it's only the date. 523 00:31:24,150 --> 00:31:25,290 It's not the year. 524 00:31:27,880 --> 00:31:33,870 So suppose there were 367 people in the group, roughly 525 00:31:33,870 --> 00:31:40,680 the number of people who took the 6.0001 600 midterm. 526 00:31:40,680 --> 00:31:44,070 If there are 367 people, what's the probability of at least two 527 00:31:44,070 --> 00:31:45,230 of them sharing a birthday? 528 00:31:49,790 --> 00:31:54,110 One, by something called the pigeonhole principle. 529 00:31:54,110 --> 00:31:56,000 You've got some number of holes. 530 00:31:56,000 --> 00:31:57,800 And if you have more pigeons than holes, 531 00:31:57,800 --> 00:32:01,430 two pigeons have to share a hole. 532 00:32:01,430 --> 00:32:04,040 What about smaller numbers? 533 00:32:04,040 --> 00:32:07,430 Well, if we make a simplifying assumption 534 00:32:07,430 --> 00:32:10,650 that each birthdate is equally likely, 535 00:32:10,650 --> 00:32:13,970 then there's actually a nice closed-form solution for it. 536 00:32:17,760 --> 00:32:20,730 Again, this is a question where it's easier 537 00:32:20,730 --> 00:32:24,210 to compute the opposite of what you're trying 538 00:32:24,210 --> 00:32:26,670 to do and subtract it from 1. 539 00:32:26,670 --> 00:32:32,160 And so this fraction is giving the probability of two people 540 00:32:32,160 --> 00:32:35,190 not sharing a birthday.
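The formula on the slide is not captured in the transcript. A standard closed form under the stated assumption of 366 equally likely dates is 1 - 366! / (366^N * (366 - N)!), and the sketch below assumes that is the one meant. The function name actual_shared_prob and its num_days parameter are hypothetical.

```python
import math


def actual_shared_prob(num_people, num_days=366):
    """Probability that at least two of num_people share a birthday,
    assuming every one of num_days dates is equally likely."""
    if num_people > num_days:
        return 1.0  # pigeonhole principle: more people than dates
    # Probability that all birthdays are distinct.
    all_distinct = math.factorial(num_days) / (
        num_days ** num_people * math.factorial(num_days - num_people)
    )
    return 1 - all_distinct


for n in (10, 20, 40, 100):
    print(n, actual_shared_prob(n))
```

The intermediate integers are enormous, but Python integers have arbitrary precision, so the division is carried out exactly before being rounded to a float.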
541 00:32:35,190 --> 00:32:38,560 The proof that this is right, it's a little bit elaborate. 542 00:32:38,560 --> 00:32:42,450 But you can trust me, it's accurate. 543 00:32:42,450 --> 00:32:46,150 But it's a formula, and it's not that complicated a formula. 544 00:32:46,150 --> 00:32:49,800 So numbers like 366 factorial are big. 545 00:32:55,240 --> 00:32:57,460 So let's approximate a solution. 546 00:32:57,460 --> 00:33:00,940 We'll write a simulation and see if we get the same answer 547 00:33:00,940 --> 00:33:03,920 that that formula gave us. 548 00:33:03,920 --> 00:33:05,200 So here's the code for that-- 549 00:33:07,810 --> 00:33:09,550 two arguments-- the number of people 550 00:33:09,550 --> 00:33:14,780 in the group and the number that we're asking, do 551 00:33:14,780 --> 00:33:17,520 they have the same birthday. 552 00:33:17,520 --> 00:33:21,120 So since I'm assuming for now that every birthday is equally 553 00:33:21,120 --> 00:33:26,100 likely, the possible dates range from 1 to 366, 554 00:33:26,100 --> 00:33:28,005 because some years have a February 29. 555 00:33:31,200 --> 00:33:35,490 I'll keep track of the number of people born on each date 556 00:33:35,490 --> 00:33:38,640 by starting with none. 557 00:33:38,640 --> 00:33:41,470 And then for p in the range of number of people, 558 00:33:41,470 --> 00:33:45,240 I'll make a random choice of the possible dates 559 00:33:45,240 --> 00:33:49,999 and increment that element of the list by 1. 560 00:33:49,999 --> 00:33:51,540 And then at the end, we can say, look 561 00:33:51,540 --> 00:33:54,330 at the maximum number of birthdays 562 00:33:54,330 --> 00:33:59,560 and see if it's greater than or equal to the number of same. 563 00:33:59,560 --> 00:34:01,240 So that tells us that. 564 00:34:04,490 --> 00:34:07,220 And then we can actually look at the birthday problem-- 565 00:34:07,220 --> 00:34:09,640 number of people, the number of same, and, as usual, 566 00:34:09,640 --> 00:34:10,514 the number of trials.
567 00:34:13,750 --> 00:34:17,840 So the number of hits is 0 for t in range number of trials. 568 00:34:17,840 --> 00:34:21,940 If sameDate is true, then we'll increment the number 569 00:34:21,940 --> 00:34:28,590 of hits by 1 and then as usual divide by the number of trials. 570 00:34:28,590 --> 00:34:34,739 And we'll try it for 10, 20, 40, and 100 people. 571 00:34:37,310 --> 00:34:41,480 And then just, we'll print the estimated probability 572 00:34:41,480 --> 00:34:46,429 and the actual probability computed using 573 00:34:46,429 --> 00:34:48,320 that formula I showed you. 574 00:34:48,320 --> 00:34:50,600 I have not shown you, but I've imported 575 00:34:50,600 --> 00:34:53,480 a library called math, because it 576 00:34:53,480 --> 00:34:55,040 has a factorial implementation. 577 00:34:55,040 --> 00:34:56,900 It's way faster than the recursive one 578 00:34:56,900 --> 00:35:00,270 that we've seen before. 579 00:35:00,270 --> 00:35:00,880 Let's run it. 580 00:35:23,920 --> 00:35:25,040 And we'll see what we get. 581 00:35:25,040 --> 00:35:30,580 So for 10, the estimated probability is 0.11 now. 582 00:35:30,580 --> 00:35:36,720 So you can see, the estimates are really pretty good. 583 00:35:36,720 --> 00:35:39,450 Once again, we have this business that for 100, 584 00:35:39,450 --> 00:35:43,450 we're estimating 1, when the real answer is point many, 585 00:35:43,450 --> 00:35:45,150 many 9's. 586 00:35:45,150 --> 00:35:47,400 But again, this is sample probability. 587 00:35:47,400 --> 00:35:53,250 It just means in the number of trials we did, every one 588 00:35:53,250 --> 00:35:56,190 for 100 people, there was a shared birthday. 589 00:35:56,190 --> 00:35:59,010 This is a number that usually surprises people, 590 00:35:59,010 --> 00:36:03,690 as to why with 100 people the probability is so high. 591 00:36:03,690 --> 00:36:06,990 But we could work out the formula and see it.
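The simulation described in words above can be sketched like this. The name sameDate comes from the lecture; birthdayProb, the parameter names, and the trial count of 2,000 are assumptions made for this reconstruction, and dates are indexed 0 to 365 rather than 1 to 366.

```python
import random


def sameDate(numPeople, numSame):
    """Return True if at least numSame of numPeople share a birthday,
    assuming all 366 dates are equally likely."""
    possibleDates = range(366)
    birthdays = [0] * 366  # count of people born on each date
    for p in range(numPeople):
        birthDate = random.choice(possibleDates)
        birthdays[birthDate] += 1
    return max(birthdays) >= numSame


def birthdayProb(numPeople, numSame, numTrials):
    """Estimate the probability as the fraction of trials with a hit."""
    numHits = 0
    for t in range(numTrials):
        if sameDate(numPeople, numSame):
            numHits += 1
    return numHits / numTrials


random.seed(0)
for numPeople in [10, 20, 40, 100]:
    print('For', numPeople, 'est. prob. of a shared birthday is',
          birthdayProb(numPeople, 2, 2000))
```

Because numSame is a parameter, asking about three people sharing a birthday is just a matter of passing 3 instead of 2.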
592 00:36:06,990 --> 00:36:08,460 And as you can see, the estimates 593 00:36:08,460 --> 00:36:10,930 are pretty good from my simulation. 594 00:36:20,252 --> 00:36:22,210 Now, we're going to see why we did a simulation 595 00:36:22,210 --> 00:36:23,720 in the first place. 596 00:36:23,720 --> 00:36:27,970 Suppose we want the probability of three people sharing 597 00:36:27,970 --> 00:36:29,260 a birthday instead of two. 598 00:36:34,030 --> 00:36:37,240 It's pretty easy to see how we change the simulation. 599 00:36:37,240 --> 00:36:38,980 I even made a parameter. 600 00:36:38,980 --> 00:36:42,190 I just changed the number 2 to number 3. 601 00:36:42,190 --> 00:36:45,200 The math, on the other hand, is ugly. 602 00:36:48,030 --> 00:36:52,190 Why is the math so much uglier for 3 than for 2? 603 00:36:52,190 --> 00:36:55,400 Because for 2, the complementary problem-- 604 00:36:55,400 --> 00:36:58,040 the number we're subtracting from 1-- 605 00:36:58,040 --> 00:37:03,640 is simply the question of, are all birthdays different? 606 00:37:03,640 --> 00:37:08,170 So did two people share a birthday is 1 minus 607 00:37:08,170 --> 00:37:11,570 does everybody have a different birthday. 608 00:37:11,570 --> 00:37:16,250 On the other hand, for 3 people, the complementary problem is 609 00:37:16,250 --> 00:37:19,490 a complicated disjunct-- a bunch of ors-- 610 00:37:19,490 --> 00:37:22,190 either all birthdays are distinct, 611 00:37:22,190 --> 00:37:26,240 or two people share a birthday and the rest are distinct, 612 00:37:26,240 --> 00:37:30,140 or there are two groups of two people sharing a birthday 613 00:37:30,140 --> 00:37:31,970 and everything else is distinct. 614 00:37:31,970 --> 00:37:36,450 So you can see here, there's a lot of possibilities. 615 00:37:36,450 --> 00:37:40,800 And so it's 1 minus now a very complicated formula.
616 00:37:40,800 --> 00:37:42,840 And in fact, if you try and look how to do this, 617 00:37:42,840 --> 00:37:45,450 most people will tell you don't bother. 618 00:37:45,450 --> 00:37:48,490 Here's kind of a good approximation. 619 00:37:48,490 --> 00:37:50,320 But the math gets very hairy. 620 00:37:53,040 --> 00:37:57,160 In contrast, changing the simulation is dead easy. 621 00:37:57,160 --> 00:37:57,880 We can do that. 622 00:38:03,808 --> 00:38:06,280 Whoops. 623 00:38:06,280 --> 00:38:13,650 So if we come over here for the code, all I have to do 624 00:38:13,650 --> 00:38:15,075 is change this to 2 or 3. 625 00:38:25,090 --> 00:38:27,190 And I'm going to leave in this code, which 626 00:38:27,190 --> 00:38:31,180 is the wrong code, computing the actual probability now 627 00:38:31,180 --> 00:38:35,110 for 2 people sharing rather than 3, because I want 628 00:38:35,110 --> 00:38:37,660 to make it easy for you to see the difference between what 629 00:38:37,660 --> 00:38:41,260 happens when we look at 3 shared rather than 2 shared. 630 00:38:53,140 --> 00:38:55,980 And I get invalid syntax. 631 00:38:55,980 --> 00:38:58,766 That's not good. 632 00:38:58,766 --> 00:39:00,640 That's what happens when I type in real time. 633 00:39:07,970 --> 00:39:10,010 Why do I have invalid syntax? 634 00:39:10,010 --> 00:39:11,337 AUDIENCE: Line 56. 635 00:39:11,337 --> 00:39:12,170 JOHN GUTTAG: Pardon. 636 00:39:12,170 --> 00:39:13,631 AUDIENCE: Line 56. 637 00:39:13,631 --> 00:39:15,170 JOHN GUTTAG: One person, Anna. 638 00:39:15,170 --> 00:39:17,660 AUDIENCE: Line 56, there's a comma. 639 00:39:17,660 --> 00:39:20,532 JOHN GUTTAG: Oh. 640 00:39:20,532 --> 00:39:21,490 That's not a good line. 641 00:39:32,960 --> 00:39:40,410 So now, we see that if we get, say, to n equals 100, for 2, 642 00:39:40,410 --> 00:39:42,530 you'll remember, it was 0.99. 643 00:39:42,530 --> 00:39:46,000 But for 3, it's only 0.63. 
644 00:39:46,000 --> 00:39:49,590 So we see going from two sharing to three sharing 645 00:39:49,590 --> 00:39:54,930 gets us a radically different answer, not surprisingly. 646 00:39:54,930 --> 00:39:57,240 But we also-- and the real thing I wanted you to see-- 647 00:39:57,240 --> 00:39:59,310 is how easy it was to answer this question 648 00:39:59,310 --> 00:40:01,810 with the simulation. 649 00:40:01,810 --> 00:40:05,940 And that's a primary reason we use simulations 650 00:40:05,940 --> 00:40:09,000 to answer probabilistic questions rather 651 00:40:09,000 --> 00:40:11,190 than sitting down with pencil and paper 652 00:40:11,190 --> 00:40:14,460 and doing fancy probability calculations, 653 00:40:14,460 --> 00:40:19,300 because it's often way easier to do a simulation. 654 00:40:19,300 --> 00:40:22,220 We can see that in spades if we look at the next question. 655 00:40:26,680 --> 00:40:28,210 Let's think about this assumption 656 00:40:28,210 --> 00:40:31,270 that all birthdays are equally likely. 657 00:40:31,270 --> 00:40:33,370 Well, as you can see, this is a chart 658 00:40:33,370 --> 00:40:38,440 of how common birthdates are in the US, a heat map. 659 00:40:38,440 --> 00:40:44,820 And you'll see, for example, that February 29 660 00:40:44,820 --> 00:40:47,930 is quite an uncommon birthday. 661 00:40:47,930 --> 00:40:52,010 So we should probably treat that differently. 662 00:40:52,010 --> 00:40:53,480 Somewhat surprisingly, you'll see 663 00:40:53,480 --> 00:40:57,160 that July 4 is a very uncommon birthday as well. 664 00:40:57,160 --> 00:41:00,410 It's easy to understand why February 29. 665 00:41:00,410 --> 00:41:02,570 The only thing I can figure out for July 4 666 00:41:02,570 --> 00:41:06,230 is obstetricians don't like working on holidays.
667 00:41:06,230 --> 00:41:08,300 And so they induce labor sometime 668 00:41:08,300 --> 00:41:10,790 around the 2nd or the 3rd, so they 669 00:41:10,790 --> 00:41:14,420 don't have to come to work on the 4th or the 5th. 670 00:41:14,420 --> 00:41:15,680 Sounds like a horrible thought. 671 00:41:15,680 --> 00:41:19,952 But I can't think of any other explanation for this anomaly. 672 00:41:19,952 --> 00:41:21,410 You'll probably, if you look at it, 673 00:41:21,410 --> 00:41:25,580 see Christmas day is not so common either. 674 00:41:25,580 --> 00:41:27,170 So now, the question, which we can 675 00:41:27,170 --> 00:41:29,120 answer, since you've all filled out this form, 676 00:41:29,120 --> 00:41:32,810 is how exceptional are MIT students? 677 00:41:32,810 --> 00:41:35,930 We like to think that you're different in every respect. 678 00:41:35,930 --> 00:41:38,960 So are your birthdays distributed differently 679 00:41:38,960 --> 00:41:40,830 than other dates? 680 00:41:40,830 --> 00:41:43,220 Have we got that data? 681 00:41:43,220 --> 00:41:44,890 So now we'll go look at that. 682 00:41:49,180 --> 00:41:50,920 We should have a heat map for you guys. 683 00:41:53,900 --> 00:41:54,400 This one? 684 00:41:54,400 --> 00:41:56,850 AUDIENCE: Yep. 685 00:41:56,850 --> 00:41:59,300 I removed all the February 31. 686 00:41:59,300 --> 00:42:02,240 Thank you for those submissions. 687 00:42:02,240 --> 00:42:05,910 [LAUGHTER] 688 00:42:06,525 --> 00:42:08,750 JOHN GUTTAG: So here it is. 689 00:42:08,750 --> 00:42:13,310 And we can see that, well, they don't 690 00:42:13,310 --> 00:42:17,790 seem to be banded quite as much in the summer months, 691 00:42:17,790 --> 00:42:20,370 probably says more about your parents than it does about you. 692 00:42:23,030 --> 00:42:26,090 But you can see that, indeed, we do have-- 693 00:42:26,090 --> 00:42:28,280 wow, we have a day where there are 694 00:42:28,280 --> 00:42:30,110 five birthdays, that look like?
695 00:42:30,110 --> 00:42:30,620 Or no? 696 00:42:30,620 --> 00:42:32,556 AUDIENCE: February 12. 697 00:42:32,556 --> 00:42:33,355 JOHN GUTTAG: Wow. 698 00:42:36,654 --> 00:42:39,070 You want to raise your hand if you're born on February 12? 699 00:42:42,388 --> 00:42:45,800 [LAUGHTER] 700 00:42:46,670 --> 00:42:51,902 So you are exceptional in that you lie about when you're born. 701 00:42:51,902 --> 00:42:57,470 But if you hadn't lied, I think we would have still seen 702 00:42:57,470 --> 00:42:59,450 the probabilities would hold. 703 00:42:59,450 --> 00:43:03,155 How many people were there, do we know? 704 00:43:03,155 --> 00:43:07,865 AUDIENCE: 146 with 112 unique birthdays. 705 00:43:07,865 --> 00:43:12,190 JOHN GUTTAG: 146 people, 112 unique birthdays. 706 00:43:12,190 --> 00:43:16,220 So indeed, the probability does work. 707 00:43:26,470 --> 00:43:28,990 So we know you're exceptional in a funny way. 708 00:43:28,990 --> 00:43:32,240 Well, you can imagine how hard it 709 00:43:32,240 --> 00:43:36,080 would be to adjust the analytic model to account 710 00:43:36,080 --> 00:43:40,370 for a weird distribution of birthdates. 711 00:43:40,370 --> 00:43:44,900 But again, adjusting the simulation model is easy. 712 00:43:44,900 --> 00:43:46,700 I could have gone back to that heat 713 00:43:46,700 --> 00:43:49,670 map I showed you of birthdays in the US 714 00:43:49,670 --> 00:43:52,550 and gotten a separate probability for each day, 715 00:43:52,550 --> 00:43:55,130 but I was too lazy. 716 00:43:55,130 --> 00:44:01,220 And instead, what I observed was that we had a few days, 717 00:44:01,220 --> 00:44:06,950 like February 29, highly unlikely, and this band 718 00:44:06,950 --> 00:44:10,040 in the middle of people who were conceived 719 00:44:10,040 --> 00:44:13,670 in the late fall and early winter. 720 00:44:13,670 --> 00:44:19,950 So what I did is I duplicated some dates. 
721 00:44:19,950 --> 00:44:25,565 So the 58th day of the year, February 29, occurs only once. 722 00:44:28,750 --> 00:44:30,590 The dates before that, I said, let's 723 00:44:30,590 --> 00:44:32,730 pretend they occur four times. 724 00:44:32,730 --> 00:44:34,590 What matters here is not how often 725 00:44:34,590 --> 00:44:36,450 they occur but the relative frequency. 726 00:44:40,700 --> 00:44:46,000 And then the dates after that occur four times 727 00:44:46,000 --> 00:44:49,480 except for the dates in that band, which are going 728 00:44:49,480 --> 00:44:52,180 to occur yet more often. 729 00:44:52,180 --> 00:44:56,000 So now-- and don't worry about the exact details here-- 730 00:44:56,000 --> 00:44:58,840 but what I'm doing is simply adjusting the simulation 731 00:44:58,840 --> 00:45:02,140 to change the probability of each date getting 732 00:45:02,140 --> 00:45:04,378 chosen by sameDate. 733 00:45:07,190 --> 00:45:09,170 And then I can run the simulation model. 734 00:45:09,170 --> 00:45:13,450 And, again, with a very small change to code, 735 00:45:13,450 --> 00:45:15,900 I've modeled something that's mathematically 736 00:45:15,900 --> 00:45:18,360 enormously complex. 737 00:45:18,360 --> 00:45:22,050 I have no idea how to actually do this probability 738 00:45:22,050 --> 00:45:23,670 mathematically. 739 00:45:23,670 --> 00:45:27,046 But the code is, as you can see, quite straightforward. 740 00:45:33,850 --> 00:45:35,460 So let's go to that here. 741 00:45:39,090 --> 00:45:45,450 So what I'm going to do is comment this one out 742 00:45:45,450 --> 00:46:02,660 and uncomment this more complicated set of dates 743 00:46:02,660 --> 00:46:03,500 and see what we get. 744 00:46:14,020 --> 00:46:16,240 And again, it changes quite dramatically. 745 00:46:16,240 --> 00:46:18,240 You might remember, before it was around I think 746 00:46:18,240 --> 00:46:23,460 0.6-something for 100, and now, it's 0.75.
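The "duplicated dates" trick can be sketched as below. The exact lists used in lecture are not in the transcript, so everything specific here is an assumption made for illustration: the zero-based index 59 for February 29, the band of indices 180 to 269, the helper name makePossibleDates, and the extra possibleDates parameter on sameDate and birthdayProb.

```python
import random

FEB_29 = 59  # assumed zero-based index of February 29 in a 366-day year


def makePossibleDates():
    """Build a list in which more common dates appear more often."""
    dates = 4 * list(range(0, FEB_29))          # ordinary dates: 4 copies each
    dates += [FEB_29]                           # February 29: a single copy
    dates += 4 * list(range(FEB_29 + 1, 366))   # ordinary dates: 4 copies each
    dates += 4 * list(range(180, 270))          # assumed band: 4 extra copies
    return dates


def sameDate(numPeople, numSame, possibleDates):
    """True if at least numSame people share a date drawn from possibleDates."""
    birthdays = [0] * 366
    for p in range(numPeople):
        birthdays[random.choice(possibleDates)] += 1
    return max(birthdays) >= numSame


def birthdayProb(numPeople, numSame, numTrials, possibleDates):
    """Fraction of trials in which a shared birthday occurred."""
    numHits = 0
    for t in range(numTrials):
        if sameDate(numPeople, numSame, possibleDates):
            numHits += 1
    return numHits / numTrials


random.seed(0)
print(birthdayProb(100, 3, 2000, makePossibleDates()))
```

random.choice picks uniformly from the list, so a date that appears eight times is eight times as likely as February 29; only the relative counts matter, not the absolute ones.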
747 00:46:23,460 --> 00:46:26,460 So getting away from the notion that birthdays are uniformly 748 00:46:26,460 --> 00:46:28,710 distributed to saying some birthdays are 749 00:46:28,710 --> 00:46:32,010 more common than others, again, dramatically changes 750 00:46:32,010 --> 00:46:34,570 the answer. 751 00:46:34,570 --> 00:46:36,589 And we can easily look at that. 752 00:46:43,080 --> 00:46:49,730 So that gets us to the big topic of simulation models. 753 00:46:49,730 --> 00:46:52,820 It's a program that describes a computation that 754 00:46:52,820 --> 00:46:57,830 provides information about the possible behaviors of a system. 755 00:46:57,830 --> 00:47:00,050 I say possible behaviors, because I'm 756 00:47:00,050 --> 00:47:02,835 particularly interested in stochastic systems. 757 00:47:05,720 --> 00:47:10,350 They're descriptive not prescriptive in the sense 758 00:47:10,350 --> 00:47:13,740 that they describe the possible outcomes. 759 00:47:13,740 --> 00:47:18,800 They don't tell you how to achieve possible outcomes. 760 00:47:18,800 --> 00:47:20,720 This is different from what we've 761 00:47:20,720 --> 00:47:22,550 looked at earlier in the course, where we 762 00:47:22,550 --> 00:47:25,700 looked at optimization models. 763 00:47:25,700 --> 00:47:30,440 So an optimization model is prescriptive. 764 00:47:30,440 --> 00:47:33,800 It tells you how to achieve an effect, 765 00:47:33,800 --> 00:47:38,000 how to get the most value out of your knapsack, 766 00:47:38,000 --> 00:47:42,350 how to find the shortest path from A to B in a graph. 767 00:47:42,350 --> 00:47:44,750 In contrast, a simulation model says, 768 00:47:44,750 --> 00:47:48,170 if I do this, here's what happens. 769 00:47:48,170 --> 00:47:52,290 It doesn't tell you how to make something happen.
770 00:47:52,290 --> 00:47:53,970 So it's very different, and it's why 771 00:47:53,970 --> 00:47:57,390 we need both, why we need optimization models 772 00:47:57,390 --> 00:48:00,570 and we need simulation models. 773 00:48:00,570 --> 00:48:03,750 We have to remember that a simulation model is only 774 00:48:03,750 --> 00:48:06,570 an approximation to reality. 775 00:48:06,570 --> 00:48:10,110 I put in an approximation to the distribution of birthdates, 776 00:48:10,110 --> 00:48:12,910 but it wasn't quite right. 777 00:48:12,910 --> 00:48:16,770 And as the very famous statistician George Box said, 778 00:48:16,770 --> 00:48:22,320 "all models are wrong, but some are actually very useful." 779 00:48:22,320 --> 00:48:27,930 In the next lecture, we'll look at a useful class of models. 780 00:48:27,930 --> 00:48:30,610 When do we use simulations? 781 00:48:30,610 --> 00:48:33,310 Typically, as we've just shown, to model systems that 782 00:48:33,310 --> 00:48:37,180 are mathematically intractable, like the birthday problem 783 00:48:37,180 --> 00:48:39,740 we just looked at. 784 00:48:39,740 --> 00:48:43,130 In other situations, to extract intermediate results-- 785 00:48:43,130 --> 00:48:47,660 something happens along the way to the answer. 786 00:48:47,660 --> 00:48:50,410 And as I hope you've seen, simulations 787 00:48:50,410 --> 00:48:55,480 are used because we can play what-if games by successively 788 00:48:55,480 --> 00:48:57,340 refining them. 789 00:48:57,340 --> 00:48:59,230 We started with a simple simulation 790 00:48:59,230 --> 00:49:01,960 that assumed that we only asked the question of, do 791 00:49:01,960 --> 00:49:04,540 two people share a birthday. 792 00:49:04,540 --> 00:49:08,080 We showed how we could change it to ask do three people share 793 00:49:08,080 --> 00:49:10,020 a birthday.
794 00:49:10,020 --> 00:49:11,910 We then saw that we could change it 795 00:49:11,910 --> 00:49:16,260 to assume a different distribution of birthdates 796 00:49:16,260 --> 00:49:18,620 in the group. 797 00:49:18,620 --> 00:49:20,520 And so we can start with something simple. 798 00:49:20,520 --> 00:49:23,310 And we make it ever more complex 799 00:49:23,310 --> 00:49:25,415 to answer what-if questions. 800 00:49:29,510 --> 00:49:32,030 We're going to start in the next lecture 801 00:49:32,030 --> 00:49:36,680 by producing a simulation of a random walk. 802 00:49:36,680 --> 00:49:38,120 And with that, I'll stop. 803 00:49:38,120 --> 00:49:40,840 And see you guys soon.