JOHN GUTTAG: Hi, everybody. Welcome back to class. I have a lot to cover today, so I'm just going to get right to it.

You may remember that at the end of the last lecture I talked about the empirical rule and said that there were a couple of assumptions underlying it: first, that the mean estimation error is zero, and second, that the errors are normally distributed. I probably didn't mention at the time that we often call this distribution Gaussian, after the mathematician Carl Friedrich Gauss. And it looks like that.

Normal distributions are very easy to generate in Python. I have a little example here of generating, not a true normal distribution, but a discrete approximation of one. The thing to really notice here is the line random.gauss. That's a built-in function of the random library. The first argument is the mean and the second argument is the standard deviation, or mu and sigma as they're usually called. Every time I call it, I get a different--or usually a different--random value drawn from a Gaussian with, in this case, a mean of 0 and a standard deviation of 100.
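Here is a minimal sketch of that call; the surrounding names are mine, not necessarily the lecture's:

```python
import random

# random.gauss(mu, sigma) draws one value from a normal
# distribution with mean mu and standard deviation sigma.
for i in range(3):
    print(random.gauss(0, 100))  # usually different each call
```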
I'm then going to produce a plot of those values so you can see what the distribution looks like. And I'm going to do that using some things we haven't seen before.

First of all, we've seen histograms: pylab.hist produces a histogram. bins tells it how many bins we want in the histogram; I said we want 100, and the default is 10. dist is the list of values it will use. And then this is something we haven't seen before: weights, a keyword argument.

Normally when we produce a histogram, we take all of the values, from the minimum to the maximum, and divide them into bins--in this case into 100 bins, because I said bins equals 100. So, for example, if we only had values ranging from 0 to 100, the first bin would hold all the 0's, the next all the 1's, and so on up to all the 99's. And each value in a bin is weighted by 1. So if a bin had 10 values falling into it, the y-axis would read 10; if the bin had 50 values falling into it, the y-axis would go up to 50.

But you can tell it how much you want each of the elements in the bins to weigh. You can say, no, I don't want each of them to count as 1; I want them to count as, say, 2 or a half, and that will change the y-axis.

That's what I've done here. I've created a list that gives a weight for each value. In this case I'm going to weight each of them the same way: the weight is 1 over the number of samples, and I replicate it len(dist) times, so there's one weight for each item I have. So, for example, if I had 1,000 items, I could instead give 1,000 different values and say: I want this item weighted by 1, and I want that item over there weighted by 2, or by a half. We rarely do that. Usually what we want is to give every item the same weight. So why would I want the weight to be something other than 1? Because I want my y-axis to be more easily interpreted: with these weights it essentially gives me the fraction of the values that fell into each bin. And that's what I'm doing here.

The other new thing I'm doing is this: the plotting commands, including pylab.hist, many of them return values. Usually we just ignore the returned value when we say pylab.plot or pylab.hist. Here, I am keeping the value.
In the case of a histogram, the returned value is a tuple. One element is a list or an array giving how many items--or, with weights, how much weight--landed in each bin; it also gives you the bin edges and the patches used to produce the beautiful pictures we're used to seeing. So here what I'm going to do is capture that value so that I can use it at the end.

Now, why would I want to look at the fraction of values within approximately 200 of the mean? What does that correspond to in this case? Well, if I divide 200 by 2, I get 100, which happens to be the standard deviation. So what I'm going to be looking at is: what fraction of the values fall within two standard deviations of the mean? It's a check on the empirical rule, right?

All right, when I run the code, I get this. It's a discrete approximation to the probability density function. You'll notice that, unlike the previous picture I showed you, which was nice and smooth, this one is jagged. You would expect that. And you can see--it's very nice--that the peak is where we said the mean should be, at 0, and then it falls off. And indeed, slightly more than 95% of the values fall within two standard deviations of the mean. I'm not even surprised that it's a little more than 95%, because remember, the magic number is 1.96, not 2. But since this is only a finite sample, I only want it to be around 95%; I'm not going to worry too much about whether it's a bit bigger or smaller.

So random.gauss does a nice job of giving us Gaussian values. We plotted them, and you can see that I've got relative frequencies--that's why the y-axis shows fractions rather than counts--and that is because of the way I used the weights argument.
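Here is a small, self-contained sketch of that experiment; it assumes matplotlib's pylab module, and the variable names and bin bookkeeping are mine rather than the lecture's:

```python
import random
import pylab

numSamples = 100000
dist = [random.gauss(0, 100) for i in range(numSamples)]

# Weight each sample by 1/numSamples so the y-axis shows the
# fraction of the values in each bin rather than a raw count.
weights = [1/numSamples]*len(dist)

# pylab.hist returns the bin heights, the bin edges, and the
# patches used to draw the bars.
heights, edges, patches = pylab.hist(dist, bins=100,
                                     weights=weights)
pylab.title('Discrete approximation of a Gaussian')

# Add up the (weighted) heights of the bins lying within
# roughly 200, i.e. two standard deviations, of the mean.
width = edges[1] - edges[0]
frac = sum(h for h, left in zip(heights, edges)
           if left >= -200 and left + width <= 200)
print('Fraction within ~200 of mean =', round(frac, 4))
pylab.show()
```

With these numbers, the printed fraction should come out close to 0.95, which is the empirical rule at work.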
All right, let's return to PDFs. As we said last time, distributions can be defined by probability density functions, which give us the probability of some random variable lying between two values. A PDF defines a curve whose x-axis values lie between the minimum and maximum values of the variable, and it's the area under the curve between two points--we'll come back to this--that gives us the probability of an example falling in that range.

So now let's look at this for a normal distribution. You may recall from the last lecture the rather exotic-looking formula that defines the PDF of a normal distribution. And here's some code that is as straightforward an implementation of that formula as one could imagine. So gaussian is a function that, given an x, a mu, and a sigma, gives me the value of the density at that x. And you'll notice there's nothing random about this; it just computes a value.

Now, let's go down here. For a set of x's, I'm going to get the set of y's corresponding to them and then plot it. I'm going to set mu to 0 and sigma to 1, the so-called standard normal distribution, and I'm going to look at the distribution from minus 4 to 4. There's nothing magic about that, other than, as you'll see, it's roughly where the curve asymptotes toward 0. So while x is less than 4, I record the x-value, get the y-value corresponding to that x by calling gaussian, and increment x by 0.05, until I'm done. Then I simply plot the x-values against the y-values and throw a title on it using pylab.title. All right, does the code make sense?
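Here is a sketch along the lines just described; it's my reconstruction, not necessarily the lecture's exact code. The formula being implemented is the normal density f(x) = (1/(σ√(2π))) · e^(−(x−μ)²/(2σ²)).

```python
import pylab

def gaussian(x, mu, sigma):
    # Straightforward implementation of the normal PDF.
    factor1 = 1.0/(sigma*((2*pylab.pi)**0.5))
    factor2 = pylab.e**-(((x - mu)**2)/(2*sigma**2))
    return factor1*factor2

xVals, yVals = [], []
mu, sigma = 0, 1              # the standard normal
x = -4
while x <= 4:
    xVals.append(x)
    yVals.append(gaussian(x, mu, sigma))
    x += 0.05                 # points close enough to look smooth
pylab.plot(xVals, yVals)
pylab.title('Normal Distribution, mu = ' + str(mu)
            + ', sigma = ' + str(sigma))
pylab.show()
```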
Well, this is where I got that beautiful picture we've looked at before. When I plot it here, it looks quite smooth. It isn't, really: it's a bunch of tiny line segments connected together, but I made the points close enough together that, at this resolution at least, it looks smooth.

So we know what the values on the x-axis are; those are the values I happened to want to look at, from minus 4 to 4. But what are the values on the y-axis? We'd kind of like to interpret them as probabilities, right? But we should be pretty suspicious about that, because if we take this one point up at the top and say the probability of that single point is 0.4, that doesn't make any sense. In fact, we know that the probability of any particular point is, in some sense, 0.

Furthermore, if I chose a different value for sigma, I could actually get the curve to go above 1 on the y-axis. If you take sigma to be, say, 0.1, the y-axis goes up to about 4. So we know we don't have probabilities here. And if these aren't probabilities, what are the y-values?

Well, not too surprisingly, since I claimed this was a probability density function, they're densities. What's a density? I'll say it, and then I'll try to explain it: it's the derivative of the cumulative distribution function. Now, why are we talking about derivatives in the first place? Remember what we're trying to say. If we want to ask what the probability is of a value falling between here and here, we claim that it's the area under the curve between those two points--the integral. Well, as you know from 18.01, there's a very clear relationship between derivatives and integrals. And so if we interpret each of these points as a derivative--in some sense, the slope of the cumulative distribution--then we can get the probability as the area, just by integrating under the curve.

So when we interpret a PDF--we always do it mathematically; actually, I mostly do it just by looking at it--the only really interesting mathematical questions to ask have to do with area. Once we have the area, we can talk about the probability of a value falling within some region of the curve. So what's interesting here is not the numbers on the y-axis per se, but the shape of the curve, because those numbers have to be related to the numbers on the x-axis--dy relative to dx, right? We're looking at derivatives.
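As a quick check of that claim about sigma: the density's peak, at x = mu, is 1/(σ√(2π)), so shrinking sigma pushes the peak well above 1. A standalone sketch:

```python
import math

def gaussian(x, mu, sigma):
    return (1/(sigma*math.sqrt(2*math.pi)) *
            math.e**(-((x - mu)**2)/(2*sigma**2)))

# A density, unlike a probability, can exceed 1:
for sigma in (1, 0.5, 0.1):
    print(sigma, round(gaussian(0, 0, sigma), 3))
# prints roughly 0.399, 0.798, and 3.989
```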
All right, so now we have to talk about integration. I promise you'll only hear about it for another few minutes, and then we'll leave the topic. I mentioned before that SciPy is a library containing a lot of useful mathematical functions. One of them is scipy.integrate.quad. The integrate part is obvious: it means integration. Quad tells you the algorithm being used to do the integration. All of these integrals are going to be approximations to the real integral: SciPy is not doing some clever mathematics to get an analytical solution; it's using a numerical technique to approximate the integral. The one here happens to be called quadrature, but the details don't matter.

You can pass quad up to four arguments. You must pass it the function to be integrated--that makes sense--a number representing the lower limit of the integration, and a number representing the upper limit. And then the optional fourth argument is a tuple supplying values for all of the arguments, except the first, of the function you're integrating. I'll show you an example of that on the next slide. And we'll see that it returns a tuple: an approximation to the result--what it thinks the integral is--and an estimate of how much error there might be in that answer. We'll ignore the error for the moment.

All right, let's look at the code. We start by importing scipy.integrate, which gives us the different integration methods. I'm not going to show you the code for gaussian again, since I showed it to you a couple of minutes ago, but remember that it takes three arguments: x, mu, and sigma. That matters because when we get down here to the integration, we pass quad the function gaussian and then the limits we want to integrate over--the values that x can take on. Those limits run from mu minus the number of standard deviations times sigma, up to mu plus the number of standard deviations times sigma. And then there's the optional fourth argument, the tuple (mu, sigma). Why do I need to pass that in? Because gaussian is a ternary function--a function that takes three arguments.
Since I'm integrating over values of x, I have to fix mu and sigma to constants, and that's what the tuple does down here. Then I take the zeroth element of what quad returns, which is its estimate of the integral.

All right, so that's the new thing; the rest of the code is all stuff you've seen. For t in range(numTrials), I choose a random mu between minus 10 and 10 and a random sigma between 1 and 10--it doesn't matter what those constants are. And then, for the number of standard deviations taking the values 1, 1.96, and 3, I integrate gaussian over the corresponding range and see what fraction of the area falls within it. In some sense, what we're doing is checking the empirical rule. We're saying: take a Gaussian--I don't care what mu and sigma are; it shouldn't matter, the empirical rule should still hold, I think--and let's just check it.
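Here is a sketch of that check; it mirrors the structure just described, though the names are mine. scipy.integrate.quad returns a (value, error-estimate) pair, and we keep element 0:

```python
import math
import random
import scipy.integrate

def gaussian(x, mu, sigma):
    return (1/(sigma*math.sqrt(2*math.pi)) *
            math.e**(-((x - mu)**2)/(2*sigma**2)))

def checkEmpirical(numTrials):
    for t in range(numTrials):
        mu = random.randint(-10, 10)
        sigma = random.randint(1, 10)
        print('For mu =', mu, 'and sigma =', sigma)
        for numStd in (1, 1.96, 3):
            # Element 0 of quad's result is the estimate of
            # the integral; element 1 is an error estimate.
            area = scipy.integrate.quad(gaussian,
                                        mu - numStd*sigma,
                                        mu + numStd*sigma,
                                        (mu, sigma))[0]
            print('  Fraction within', numStd,
                  'std =', round(area, 4))

checkEmpirical(3)
```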
Well, here are the results. For mu equals 9 and sigma equals 6--those are values I happened to draw--we see the fraction within 1, within 1.96, and within 3 standard deviations. And for all of these random mus and sigmas--you can set them to whatever you want when you get your hands on the code--essentially what we see is that the empirical rule actually works. It's one of those beautiful cases where you can test the theory and see that the theory really is sound. So there we go.

So why am I making such a big deal of normal distributions? They have lots of nice mathematical properties, some of which we've already talked about. But all of that would be irrelevant if we never saw them. The good news is that they're all over the place; I've just taken a few examples here. Up here we see SAT scores--I would never show that to high school students, or GREs to you guys--but you can see that they fit a normal distribution amazingly well. Down here is a plot of the percent change in oil prices, and again we see something very close to a normal distribution. And here are the heights of men and women; again, they clearly look very normal. So it's really quite impressive how often normal distributions occur.

But not everything is normal. We saw that the empirical rule works for normal distributions--I won't say I proved it for you; I illustrated it with a bunch of examples. But are the outcomes of the spins of a roulette wheel normal? No. They're totally uniform, right? Everything is equally probable: a 4, a 6, an 11, a 13, the double-0 if you're in Las Vegas. So if I plotted those, I'd basically just get a flat line, with everything at 1 over however many pockets there are.

So in that case, why did the empirical rule work? We did some estimates about returns, we used the empirical rule, we checked it, and, by George, it was telling us the truth. The reason is that we were not reasoning about a single spin of the wheel, but about the mean of a set of spins. If you think about it, what we were reasoning about was the return of betting. Look at one spin--say we bet $1. The return is either minus 1, because we've lost our dollar, or, if we get lucky and our pocket happens to come up, plus 36, I think, or 35--I forget which. But that's all. So if we plotted a histogram of single-spin returns, we would see a huge peak at minus 1, a little bump over at 36, and nothing in the middle. Clearly not a normal distribution.

But what we're reasoning about is not the return of a single spin; it's the return of many spins. If we play 1,000 spins, what is our expected return? As soon as we end up reasoning not about a single event but about the mean of something, we can apply something called the Central Limit Theorem. And here it is.
For something so important, it's actually very simple. It says that, given a sufficiently large sample--and I love terms like "sufficiently large," but we'll put a little meat on that later--the following three things are true.

First, the means of the samples in a set of samples--the so-called sample means--will be approximately normally distributed. So that says: if I take a sample--and remember, a sample contains multiple examples; just to remind people, a population is a set of examples, and a sample is a subset of the population, so it too is, typically, a set of examples--and if that sample is sufficiently large (certainly a sample of 1 is not sufficiently large), then if I take the mean of each sample and plot all of those means, they will be approximately normally distributed.

Second, this distribution of sample means will have a mean that is close to the mean of the population. The mean of the means will be close to the mean of the population.

And third, the variance of the sample means will be close to the variance of the population divided by the sample size.

It's really amazing that this is true, and it's dramatically useful.
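Stated compactly--this is a standard formulation, not the lecture slide's exact wording--if X1, ..., Xn are independent draws from any distribution with mean μ and variance σ², then the sample mean X̄ = (X1 + ... + Xn)/n satisfies

```latex
\mathbb{E}[\bar{X}] = \mu, \qquad
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}, \qquad
\bar{X} \ \text{is approximately} \
\mathcal{N}\!\left(\mu,\, \frac{\sigma^2}{n}\right)
\ \text{for large } n.
```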
So, to get some insight, let's check it. To do that, postulate that we have a kind of miraculous die. Instead of a die that gives you 1, 2, 3, 4, 5, or 6 when you roll it, this particular die is continuous: it gives you a real number between 0 and 5 (or maybe it should be between 1 and 6, but here it's 0 to 5). So it's a continuous die, and we're going to roll it a lot of times.

We ask: how many dice? And then: how many times are we going to roll that number of dice? The number of dice will be the sample size, and then we'll take a bunch of samples, which I'm calling the number of rolls. Then we'll plot it; I'm choosing some bins and some colors and some styles and various other things, just to show you how to use the keyword arguments.

Actually, I said the number of rolls is the number of trials, but it isn't quite that, because I get the number of trials by dividing the number of rolls by the number of dice. So if I have more dice, I get fewer samples--more dice per sample. Then we just do it. Each roll will be between 0 and 5, because random.random() returns a number between 0 and 1 and I'm multiplying it by 5. Then we look at the means and plot it all. Again, we're playing games with weights, just to make the plot a little easier to read.
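A minimal sketch of that experiment (the names are mine; the lecture's plotting is fancier):

```python
import random
import pylab

def plotMeans(numDice, numRolls, numBins, legend):
    means = []
    # One sample is the mean of numDice rolls, so a fixed
    # number of rolls yields numRolls//numDice samples.
    for i in range(numRolls//numDice):
        total = 0.0
        for j in range(numDice):
            total += 5*random.random()  # continuous die, 0 to 5
        means.append(total/numDice)
    # Weights turn bin counts into fractions on the y-axis.
    pylab.hist(means, numBins,
               weights=[1/len(means)]*len(means), label=legend)
    mean = sum(means)/len(means)
    std = (sum((m - mean)**2 for m in means)/len(means))**0.5
    return mean, std

mean, std = plotMeans(1, 100000, 19, '1 die')
print('Mean of rolling 1 die =', round(mean, 4),
      'Std =', round(std, 4))
mean, std = plotMeans(50, 100000, 19, 'Mean of 50 dice')
print('Mean of rolling 50 dice =', round(mean, 4),
      'Std =', round(std, 4))
pylab.legend()
pylab.show()
```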
And here's what we get. If we roll one die, the mean is very close to 2.5. That's certainly what you'd expect: it's a random number between 0 and 5, and 2.5 is a pretty good guess at the average. And it has a standard deviation of about 1.44. That's a little harder to guess, but you could figure it out with a little math (for a uniform distribution on [0, 5] it's 5 divided by the square root of 12, about 1.44), or, as I did here, with the simulation.

But now, if I roll 50 dice, the mean is again close to 2.5--that's what you'd expect; I roll 50 dice and get the mean value of those 50--but look how much smaller the standard deviation is. More importantly, while the probability is flat across all possible values between 0 and 5 for a single die, the distribution of the means is, if not quite Gaussian, pretty close. Why is it not quite Gaussian? Well, I didn't run it an infinite number of times. I ran it quite a few times--enough that you wouldn't have wanted to sit here while it ran--but not an infinite number. Still, you can see the amazing thing here: when I go from looking at one die to looking at the mean of 50, suddenly I have a normal distribution. And that means I can bring the Central Limit Theorem to bear on the problem.

We can try the same thing for roulette. Again, I'm not going to make you sit through a million trials of 200 spins each, and I'll do it only for fair roulette. It's a very simple simulation, and we'll see what we get.
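Here is a sketch of that simulation. I've stripped the wheel down to its essentials and assumed the usual payoff convention for a fair 36-pocket wheel: a winning $1 bet on a single pocket nets +35, a losing one nets -1, so the expected return is exactly 0:

```python
import random
import pylab

def meanReturn(numSpins):
    # Bet $1 on pocket 1 each spin of a fair 36-pocket wheel.
    tot = 0
    for spin in range(numSpins):
        if random.randint(1, 36) == 1:
            tot += 35      # win: net +35
        else:
            tot -= 1       # lose: net -1
    return tot/numSpins    # mean return per dollar bet

# Distribution of the mean return over many trials of 200
# spins each; by the CLT it should look roughly normal.
means = [meanReturn(200) for t in range(10000)]
pylab.hist(means, bins=19,
           weights=[1/len(means)]*len(means))
pylab.title('Mean return per bet, 200 spins per trial')
pylab.show()
```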
What we see is that it's not quite normal, again, but it definitely has that shape. Now, it's always going to be a little bit strange, because if I'm betting 1, I can't lose more than 1 per spin. So it will never be quite normal: the distribution is truncated on the left side, whereas the right tail can be arbitrarily long. Mathematically it can't be normal, but it's close enough in the main region, where most of the values lie, that we can get away with applying the empirical rule. And indeed, as we saw, it does work.

So what's the moral here? It doesn't matter what the shape of the distribution of the original values happens to be. If we're trying to estimate the mean using samples that are sufficiently large, the Central Limit Theorem allows us to use the empirical rule when computing confidence intervals.

Even that anomaly over on the left shrinks: what do you think would happen if, instead of 200 spins per trial, I used, say, 1,000? What's the probability of the average return of 1,000 bets being minus 1? Much smaller than for 200 bets. To lose 1,000 times in a row is pretty unlikely, so getting all the way out to the left becomes less likely, and the distribution will look more and more normal as the samples get bigger. So we can use the CLT to justify using the empirical rule when we compute confidence intervals.

All right, I want to look at one more example in detail, and it's kind of an interesting one. You might think that randomness is of no use for, say, finding the value of pi, because there's nothing random about pi. Similarly, you might think that randomness is of no use in integrating a function. But in fact, the way those numerical algorithms work, they use randomness. What you're about to see is that randomness--and indeed, things related to Monte Carlo simulations--can be enormously useful even when you're computing something that is inherently not random, like the value of pi here. And we won't ask you to remember that many digits on the quiz or the exam.

All right, so what's pi? People have known about pi for thousands and thousands of years. What they knew was that there was some constant--we'll call it pi; it wasn't always called that--such that the circumference of a circle equals pi times the diameter, and furthermore the area equals pi r squared. People knew that as far back as the Babylonians and the Egyptians. What they didn't know was the value of that constant.

The earliest known estimate of pi was by the Egyptians, on something called the Rhind Papyrus, pictured here. They estimated pi to be 4 times (8/9) squared, or about 3.16. Not so bad.

About 1,100 years later, an estimate of pi appeared in the Bible, in 1 Kings 7:23. It didn't say "pi is," but it described--I think this was the construction of King Solomon's Temple--"He made a molten sea, ten cubits from the one brim to the other"--it was round, a circle--"his height was five cubits, and a line of thirty cubits did compass it round about." Well, if you do the arithmetic, what does this imply the value of pi to be? 3--and I'm sure that's what Mike Pence thinks it is.

About 300 years later, Archimedes did a better job of it. He estimated pi by constructing a 96-sided polygon. There's a picture of one: it looks a lot like a circle, and on the left is one with fewer sides.
Since it was a polygon, he knew how to compute its area and its perimeter, and by working the numbers he concluded that pi is somewhere between 223/71 and 22/7. It turns out that if you take the average of those two bounds, you get a really good estimate.

About 2,000 years after Archimedes, the French mathematicians Buffon and Laplace proposed finding the value of pi using what we would today call a Monte Carlo simulation. They did not, of course, have a computer to execute it on. What they proposed started with some mathematics. Take a circle and inscribe it in a square that is two units--of whatever unit it is--on a side. We therefore know that the area of the square is 2 times 2, or 4. We know that the area of the circle is pi r squared, and since r is 1--because of the square it's inscribed in--the area of the circle is exactly pi.

What Buffon then proposed was dropping a bunch of needles at random--kind of like when Professor Grimson was sorting things--and seeing where they land. Some would land in the square but not in the circle; some would land in the circle; and any that landed outside the square you would ignore. Then, they said, since the needles are falling at random, the ratio of the needles in the circle to the needles in the square should equal the ratio of the area of the circle to the area of the square--exactly so, if you dropped an infinite number of needles. Does that make sense?

Given that, you can do some algebra and solve for the area of the circle: it has to equal the area of the square times the number of needles in the circle, divided by the number of needles in the square. And since we know that the area of the circle is pi, that tells us that pi is going to equal 4--the area of the square--times the needles in the circle, divided by the number of needles in the square.
And so the argument was: you can just drop a bunch of these needles, see where they land, add them up, and from that you would magically know the actual value of pi.

Well, we tried a simulation one year in class, but rather than using needles we had an archer, and we blindfolded him so he would shoot arrows at random, and we would see where they ended up. There's a video of this here if you want to see it; I'm going to play only the very end of the video, which describes the results: "Maybe we should just use Python to build a Monte Carlo simulation instead." So that's what happened. After it was all over, Anna came up and gave me some sound advice. And yes, I was very proud of that particular shot with the apple. I had hoped a student would put it on his or her head, but no one volunteered. That one was not done blindfolded.

At any rate, here is the simulation. First we have throwing the needles. For each needle from 1 to the number of needles, we choose a random x and a random y, and we see whether or not the point is in the circle--and there we use the Pythagorean theorem to tell us that. We just do that for every needle, and then we return exactly the formula we saw on the previous slide.

Now comes an interesting part: we need to get an estimate. We start with a number of needles and a number of trials. For each trial in range(numTrials), we get a guess at pi by throwing that number of needles, and we append that guess to our list of estimates. We then compute the standard deviation of the estimates, and we get the current estimate, which is the sum of the estimates divided by the number of estimates--just the mean. Then we print it. So, given a number of needles and a number of trials, we estimate pi, and we report the standard deviation of that estimate.
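Here are sketches of those two functions, close in spirit to what's being described, though the names are my reconstruction:

```python
import random

def throwNeedles(numNeedles):
    """One guess at pi from numNeedles random drops."""
    inCircle = 0
    for needle in range(1, numNeedles + 1):
        # A random point in one quadrant of the square;
        # by symmetry, one quadrant is enough.
        x = random.random()
        y = random.random()
        # Pythagorean theorem: is the point inside the circle?
        if (x*x + y*y)**0.5 <= 1:
            inCircle += 1
    # pi = area of square (4) * needles in circle
    #      / needles in square
    return 4*(inCircle/numNeedles)

def getEst(numNeedles, numTrials):
    estimates = []
    for t in range(numTrials):
        estimates.append(throwNeedles(numNeedles))
    curEst = sum(estimates)/len(estimates)   # mean of guesses
    sDev = (sum((e - curEst)**2 for e in estimates)
            / len(estimates))**0.5
    print('Est. =', round(curEst, 5),
          'Std. dev. =', round(sDev, 5),
          'Needles =', numNeedles)
    return curEst, sDev
```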
However, to do that to within a certain precision, I'm going to have yet another loop around it. And I'm showing you this because we often structure simulations this way. What I have here is the number of trials, which is not so interesting, and the precision. I'm saying: I would like to know the value of pi, and I would like good reason to believe that the value you give me is within, in this case, 0.005 of the true value. Or, maybe more precisely, I would like to know, with a confidence of 95%, that the true value lies within that range. So what this does--and I should probably use 1.96 instead of 2, but oh well--is keep increasing the number of needles, doubling it each time, until it's confident enough about the estimate. This is a very common pattern: we don't know how many needles we should need, so we just start with some small number and keep going until we're good.
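A sketch of that outer loop, using getEst from the sketch above; the 2 stands in for 1.96, as just mentioned:

```python
def estPi(precision, numTrials):
    numNeedles = 1000
    sDev = precision          # anything that enters the loop
    # Keep doubling the number of needles until two standard
    # deviations (roughly a 95% confidence interval on either
    # side of the estimate) fit within the requested precision.
    while sDev >= precision/2:
        curEst, sDev = getEst(numNeedles, numTrials)
        numNeedles *= 2
    return curEst

estPi(0.005, 100)
```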
All right, what happens when we run it? Well, we start with some estimates: with 1,000 needles, it told us that pi was 3.148, with a standard deviation of 0.047, et cetera. So a couple of things to notice. One: are my estimates of pi getting monotonically better? You'd like to think that as I add more needles, my estimates get more accurate. So that's question one: are they getting more accurate? Not monotonically, right? If we look at where they get worse: 3.148, and then 3.13 is already worse. I doubled the number of needles and got a worse estimate. Now, the trend is good; by the time I get down here, I've got a pretty good estimate. So overall, as I look at larger samples--a bigger subset of the population--my estimates are trending toward better, but not at every step.

On the other hand, let's look at the standard deviations. What we see is that the standard deviations are, indeed, getting monotonically better. Now, there's nothing mathematical guaranteeing that, because there is randomness here. But I'm increasing the number of needles by so much that it pretty much overrides the bad luck of drawing an unlucky random set. And we see that they keep getting better.

So the important thing to see here is not that I happened to get a better estimate, but that I know more about the estimate. I can have more confidence in the estimate, because the interval is closing in on it. The moral is that it's not sufficient to produce a good answer--I've said this before--we need to have reason to believe that the answer is close to the right answer. In this case I'm using the standard deviation: given that it's gotten quite small--1.96 times 0.002 is indeed a small number, and I could make it smaller if I wanted--I have good reason to believe I'm close to the right value of pi.

Does everyone agree with that? Is that right? Not quite, actually. It would be nice if it were true, but it isn't. So let's look at some statements. Is it correct to state that 95% of the time we run this simulation, we'll get an estimate of pi between these two values? I don't expect you to do the arithmetic in your head, but the answer is yes. That is something we believe is true by the math we've been looking at for the last two lectures.

Next statement: with a probability of 0.95, the actual value of pi is between these two values. Is that true? In fact, if I were to say that with a probability of 1 the actual value of pi is between those two values, would that be true? Yes. So they are both true facts. However, only the first of them can be inferred from our simulation. While the second fact is true, we cannot infer it from the simulation.

And to show you that statistically valid is not the same as true, we'll look at this. I've introduced a bug in my simulation: I've replaced the 4 that we saw we needed by a 2--an easy kind of mistake to make. Now, what do you think will happen if we go to the code and run it? We'll try it. We'll go down to the code and make that 4 a 2. And what you'll see as it runs is that, once again, we're getting very nice, tight confidence intervals--but totally bogus values of pi. So the statistics can tell us something about how reproducible our simulation is, but not about whether the simulation is an accurate model of reality. So what do you need to do? Something like a sanity check. Here, for example, you might compare against a crude polygon-based estimate like Archimedes' and say: well, clearly that's a totally wrong number; something is wrong with my code.

OK, so, just to wrap up: what we've shown is a way to find pi, and it's a generally useful technique. To estimate the area of any region R, you pick an enclosing region, call it E, such that it's easy to compute the area of E and R lies within it. Pick a random set of points that lie within E, let F be the fraction of those points that fall within R, multiply the area of E by F, and you're done.

This, for example, is a very common way to do integration--I promised you we'd talk about integration. So here's the sine of x. If I want to integrate sin(x) over some region, as done here, all I need to do is pick a bunch of random points--red and black in this case--and look at the ratio of one kind to the other, as in the sketch below.
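Here is a sketch of that idea applied to the integral of sin(x) from 0 to pi, whose true value is 2. The enclosing region is my choice--the rectangle from 0 to pi in x and 0 to 1 in y--and this simple version assumes the function is non-negative on the interval:

```python
import math
import random

def integrate(f, a, b, fMax, numPoints):
    # Enclose the curve in the rectangle [a, b] x [0, fMax]
    # and count the fraction of random points under the curve.
    under = 0
    for i in range(numPoints):
        x = random.uniform(a, b)
        y = random.uniform(0, fMax)
        if y <= f(x):
            under += 1
    rectArea = (b - a)*fMax
    return rectArea*(under/numPoints)

print(integrate(math.sin, 0, math.pi, 1, 1000000))
# prints something close to 2.0
```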
So again, we're using randomness to compute something that is not inherently random. This is a trick people use over and over and over again when confronted with situations where it's not easy to solve for things mathematically: you just do a simulation, and if you do it right, you get a very good answer. All right, we will move on to a different topic on Wednesday.