The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Good afternoon, everybody. Welcome to Lecture 8. We're now more than halfway through the lectures. All right, the topic of today is sampling.

I want to start by reminding you about this whole business of inferential statistics. We make inferences about populations by examining one or more random samples drawn from that population. We used Monte Carlo simulation over the last two lectures. And the key idea there, as we saw in trying to find the value of pi, was that we can generate lots of random samples, and then use them to compute confidence intervals. And then we use the empirical rule to say, all right, we really have good reason to believe that 95% of the time we run this simulation, our answer will be between here and here.

Well, that's all well and good when we're doing simulations. But what happens when you have to actually sample something real? For example, you run an experiment, and you get some data points. And it's too hard to do it over and over again. Think about political polls. Here was an interesting poll. How were these numbers created? Not by simulation. They didn't run 1,000 polls and then compute the confidence interval. They ran one poll-- of 835 people, in this case. And yet they claim to have a confidence interval. That's what that margin of error is. Obviously they needed that large confidence interval. So how is this done?

Backing up for a minute, let's talk about how sampling is done when you are not running a simulation. You want to do what's called probability sampling, in which each member of the population has a non-zero probability of being included in a sample. There are, roughly speaking, two kinds.
We'll spend, really, all of our time on something called simple random sampling. And the key idea here is that each member of the population has an equal probability of being chosen for the sample, so there's no bias.

Now, that's not always appropriate. I do want to take a minute to talk about why. So suppose we wanted to survey MIT students to find out what fraction of them are nerds-- which, by the way, I consider a compliment. Suppose we wanted to consider a random sample of 100 students. We could walk around campus and choose 100 people at random. And if 12% of them were nerds, we would say 12% of the MIT undergraduates are nerds-- if 98%, et cetera.

Well, the problem with that is, let's look at the majors by school. This is actually the distribution of majors at MIT by school. And you can see that they're not exactly evenly distributed. And so if you went around and just sampled 100 students at random, there'd be a reasonably high probability that they would all be from engineering and science. And that might give you a misleading notion of the fraction of MIT students that were nerds, or it might not.

In such situations we do something called stratified sampling, where we partition the population into subgroups, and then take a simple random sample from each subgroup, proportional to the size of the subgroups. So we would certainly want to take more students from engineering than from architecture. But we probably want to make sure we got somebody from architecture in our sample.

This, by the way, is the way most political polls are done. They're stratified. They say, we want to get so many rural people, so many city people, so many minorities-- things like that. And in fact, that's probably where the recent election polls all got messed up. They did-- retrospectively, at least-- a bad job of stratifying.

So we use stratified sampling when there are small subgroups that we want to make sure are represented, and we want to represent them proportional to their size in the population.
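To make that concrete, here is a minimal sketch of proportional stratified sampling. It is not code from the lecture, and the subgroups and their sizes are made up for illustration:

    import random

    # Hypothetical subgroups; in the MIT example these would be the
    # students in each school.
    strata = {'engineering': list(range(5000)),
              'science': list(range(3000)),
              'architecture': list(range(200))}

    total = sum(len(members) for members in strata.values())
    sample_size = 100

    sample = []
    for name, members in strata.items():
        # Take a simple random sample from each subgroup, proportional
        # to the subgroup's share of the population (at least 1 each).
        k = max(1, round(sample_size * len(members) / total))
        sample.extend(random.sample(members, k))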
Stratifying can also be used to reduce the needed size of the sample. If we wanted to make sure we got some architecture students in a simple random sample, we'd need to start with more than 100 people. But if we stratify, we can take fewer samples. It works well when you do it properly. But it can be tricky to do it properly. And we are going to stick to simple random samples here.

All right, let's look at an example. This is a map of temperatures in the United States. And so our running example today will be sampling to get information about average temperatures. And of course, as you can see, they're highly variable. And we live in one of the cooler areas.

The data we're going to use is real data-- it's in the zip file that I put up for the class-- from the US Centers for Environmental Information. And it's got the daily high and low temperatures for 21 different American cities, every day from 1961 through 2015-- a total of about 422,000 examples. So it's a fairly good-sized dataset, and it's fun to play with.

All right, so we're now in the part of the course where the next series of lectures, including today, is going to be about data science-- how to analyze data. I always like to start by actually looking at the data-- not looking at all 422,000 examples, but producing a plot to give me a sense of what the data looks like. I'm not going to walk you through the code that does this plot. I do want to point out that there are two things in it that we may not have seen before.

Simply enough, I'm going to use numpy.std to get standard deviations instead of my own code for it, and random.sample to take simple random samples from the population. random.sample takes two arguments. The first is some sort of a sequence of values. And the second is an integer telling you how many samples you want.
And it returns a list containing that many randomly chosen, distinct elements. Distinct elements is important, because there are two ways that people do sampling. You can do sampling without replacement, which is what's done here. You take a sample, and then it's out of the population, so you won't draw it the next time. Or you can do sampling with replacement, which allows you to draw the same example multiple times. We'll see later in the term that there are good reasons that we sometimes prefer sampling with replacement. But usually we're doing sampling without replacement. And that's what we'll do here. So we won't get Boston on April 3rd of the same year multiple times.
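Here is a minimal sketch of the difference-- not from the lecture-- using a toy population of integers; random.choices is one standard way to sample with replacement:

    import random

    population = list(range(100))  # a toy population

    # Without replacement: random.sample returns a new list of
    # distinct elements, so nothing can be drawn twice.
    without_replacement = random.sample(population, 10)

    # With replacement: random.choices can return the same element
    # more than once.
    with_replacement = random.choices(population, k=10)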
All right. So here's the histogram the code produces. You can run it yourself now, if you want, or you can run it later. And here's what it looks like. For the daily high temperatures, the mean is 16.3 degrees Celsius. I sort of vaguely know what that feels like. And as you can see, it's kind of an interesting distribution. It's not normal. But it's not that far off, right? We have a little tail of cold temperatures on the left. And it is what it is. It's not a normal distribution. And we'll later see that doesn't really matter.

OK, so this gives me a sense. The next thing I'll get is some statistics. So we know the mean is 16.3 and the standard deviation is approximately 9.4 degrees. If you look at it, you can believe that.

Well, here's a histogram of one random sample of size 100. Looks pretty different, as you might expect. Its standard deviation is 10.4, its mean 17.7. So even though the figures look a little different, in fact, the means and standard deviations are pretty similar. If we look at the population mean and the sample mean-- and I'll try to be careful to use those terms-- they're not the same. But they're in the same ballpark. And the same is true of the two standard deviations.

Well, that raises the question: did we get lucky, or is this something we should expect? If we draw 100 random examples, should we expect them to correspond to the population as a whole? And the answer is sometimes yes and sometimes no. And that's one of the issues I want to explore today.

So one way to see whether it's a happy accident is to try it 1,000 times. We can draw 1,000 samples of size 100 and plot the results. Again, I'm not going to go over the code. There's something in that code, as well, that we haven't seen before. And that's the axvline plotting command. V for vertical. It just, in this case, will draw a red line-- because I've said the color is r-- at the population mean on the x-axis. So just a vertical line that shows us where the mean is. If we wanted to draw a horizontal line, we'd use axhline. Just showing you a couple of useful functions.
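A sketch of that experiment-- not the lecture's actual code, and using a synthetic stand-in population since the temperature file isn't reproduced here:

    import random
    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in population; the lecture uses about 422,000 real daily highs.
    population = [random.gauss(16.3, 9.4) for _ in range(422000)]
    pop_mean = np.mean(population)

    # Draw 1,000 samples of size 100 and record each sample's mean.
    sample_means = [np.mean(random.sample(population, 100))
                    for _ in range(1000)]

    plt.hist(sample_means, bins=20)
    plt.axvline(x=pop_mean, color='r')  # vertical red line at the mean
    plt.xlabel('Mean Daily High (C)')
    plt.ylabel('Number of Samples')
    plt.show()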
When we try it 1,000 times, here's what it looks like. So here we see what we had originally, the same picture I showed you before. And here's what we get when we look at the means of the 1,000 samples of size 100. The plot of the means looks a lot more like a normal distribution than the plot of the raw data. Should that surprise us, or is there a reason we should have expected that to happen?

Well, what's the answer? Someone tell me why we should have expected it. It's because of the central limit theorem, right? That's exactly what the central limit theorem promised us would happen. And, sure enough, it's pretty close to normal. So that's a good thing.

And now if we look at it, we can see that the mean of the sample means is 16.3, and the standard deviation of the sample means is 0.94. So if we go back to what we saw before, we see that, actually, when we run it 1,000 times and look at the means, we get very close to what we had initially. So, indeed, it's not a happy accident. It's something we can in general expect.

All right, what's the 95% confidence interval here? Well, it's going to be 16.28 plus or minus 1.96 times 0.94, the standard deviation of the sample means. And so it tells us that the mean high temperature is somewhere between 14.5 and 18.1. Well, that's actually a pretty big range, right? It's roughly the difference between weather where you wear a sweater and weather where you don't.

So the good news is it includes the population mean. That's nice. But the bad news is it's pretty wide. Suppose we wanted a tighter bound. I said, all right, sure enough, the central limit theorem tells me the mean of the means is going to give me a good estimate of the actual population mean. But I want a tighter bound. What can I do? Well, let's think about a couple of things we could try.

One thing we could think about is drawing more samples. Suppose instead of 1,000 samples, I'd taken 2,000 or 3,000 samples. We can ask, would that have given me a smaller standard deviation? For those of you who have not looked ahead, what do you think? Who thinks it will give you a smaller standard deviation? Who thinks it won't? And the rest of you have either looked ahead or refused to think. I prefer to believe you looked ahead.

Well, we can run the experiment. You can go to the code, and you'll see that there is a constant of 1,000, which you can easily change to 2,000. And lo and behold, the standard deviation barely budges. It got a little bit bigger, as it happens, but that's kind of an accident. It just, more or less, doesn't change. And it won't change if I go to 3,000 or 4,000 or 5,000. It'll wiggle around. But it won't help much. What we can see is that drawing more samples is not going to help.

Suppose we take larger samples? Is that going to help? Who thinks that will help? And who thinks it won't? OK.
Well, we can again run the experiment. I did run the experiment. I changed the sample size from 100 to 200. And, again, you can run this if you want. And if you run it, you'll get a result-- maybe not exactly this, but something very similar-- that, indeed, as I increase the size of the samples rather than the number of samples, the standard deviation drops fairly dramatically-- in this case from 0.94 to 0.66. That's consistent with the standard deviation of the sample means scaling as one over the square root of the sample size: doubling the sample size divides it by the square root of 2, and 0.94 divided by the square root of 2 is about 0.66. So that's a good thing.

I now want to digress a little bit before we come back to this, because this is a technique you'll want to use as you write papers and things like that: how do we visualize the variability of the data? It's usually done with something called an error bar. You've all seen these things. This is one I took from the literature. It plots resting pulse rate against how frequently you exercise. And what you can see here is there's definitely a downward trend, suggesting that the more you exercise, the lower your average resting pulse. That's probably worth knowing. And these error bars give us the 95% confidence intervals for different subpopulations.

And what we can see here is that some of them overlap. So, yes, once a fortnight-- two weeks, for those of you who don't speak British-- does get a little bit lower than rarely or never. But the confidence intervals are very big. And so maybe we really shouldn't feel very comfortable that it would actually help.

The thing we can say is that if the confidence intervals don't overlap, we can conclude that the means are actually statistically significantly different-- in this case, at the 95% level. So here we see that more than weekly does not overlap with rarely or never. And from that, we can conclude that the difference is statistically significant-- that if you exercise more than weekly, your pulse is likely to be lower than if you don't.
If confidence intervals do overlap, you cannot conclude that there is no statistically significant difference. There might be one, and you can use other tests to find out. When they don't overlap, it's a good thing-- we can conclude something strong. When they do overlap, we need to investigate further.

All right, let's look at the error bars for our temperatures. We can plot those using something called pylab.errorbar. What it takes is two sequences of values-- the usual x values and y values-- and then another sequence of the same length, which gives the y errors. And here I'm just going to use 1.96 times the standard deviations. Where these variables come from, you can tell by looking at the code. And then I can give the format-- fmt stands for format-- saying I want an o to show each mean, and then a label. errorbar has different keyword arguments than plot. You'll find that as you look at different kinds of plots-- histograms, bar plots, scatterplots-- they all have different available keyword arguments, so you have to look up each one individually. But other than this, everything in the code should look very familiar to you.
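A sketch of the kind of call involved-- again with a synthetic stand-in population, since the lecture's code reads the temperature file; the bars span 1.96 standard deviations of the sample means on either side:

    import random
    import numpy as np
    import matplotlib.pyplot as plt

    population = [random.gauss(16.3, 9.4) for _ in range(422000)]

    x_vals, means, y_errs = [], [], []
    for sample_size in range(50, 650, 50):
        # 100 trials at each sample size
        sample_means = [np.mean(random.sample(population, sample_size))
                        for _ in range(100)]
        x_vals.append(sample_size)
        means.append(np.mean(sample_means))
        y_errs.append(1.96 * np.std(sample_means))

    # fmt='o' marks each mean with a dot rather than a connecting line.
    plt.errorbar(x_vals, means, yerr=y_errs, fmt='o',
                 label='Mean and 1.96*std')
    plt.xlabel('Sample Size')
    plt.ylabel('Mean Daily High (C)')
    plt.legend()
    plt.show()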
And when I run the code, I get this. So what I've plotted here is the mean against the sample size, with error bars-- 100 trials at each size, in this case. What you can see is that, as the sample size gets bigger, the error bars get smaller.

The estimates of the mean don't necessarily get any better. In fact, we can look here, and this is actually a worse estimate, relative to the true mean, than the previous two estimates. But we can have more confidence in it. It's the same thing we saw on Monday when we looked at estimating pi: dropping more needles didn't necessarily give us a more accurate estimate, but it gave us more confidence in our estimate. And the same thing is happening here. And we can see that, steadily, we get more and more confidence.

So larger samples seem to be better. That's a good thing. Going from a sample size of 50 to a sample size of 600 reduced the spread, as you can see, from a fairly large one, running from just below 14 to almost 19, to one running from about 15 and a half to 17. I said confidence interval here. I should not have. I should have said standard deviations. That's an error on the slides.

OK, what's the catch? Well, we're now looking at 100 samples, each of size 600. Adding up everything that went into that plot, we've looked at several hundred thousand examples. What has this bought us? Absolutely nothing. The entire population only contained about 422,000 examples. We might as well have looked at the whole thing, rather than sample it. It's like holding an election versus asking 800 people a million times who they're going to vote for. Sure, the answer is good. But the sampling bought us nothing.

Suppose we did it only once. Suppose we took only one sample, as we see in political polls. What can we conclude from that? And the answer is actually kind of surprising-- how much we can conclude, in a real mathematical sense, from one sample. And, again, this is thanks to our old friend, the central limit theorem.

So if you recall the theorem, it had three parts. Up till now, we've exploited the first two. We've used the fact that the means will be normally distributed, so that we could use the empirical rule to get confidence intervals, and the fact that the mean of the sample means will be close to the mean of the population. Now I want to use the third piece of it, which is that the variance of the sample means will be close to the variance of the population divided by the sample size. And we're going to use that to compute something called the standard error-- formally, the standard error of the mean.
People often just call it the standard error. And I will be, alas, inconsistent-- I sometimes call it one, sometimes the other. It's an incredibly simple formula. It says the standard error is equal to sigma divided by the square root of n-- SE = sigma/sqrt(n)-- where sigma is the population standard deviation and n is the size of the sample. And then there's just this very small function that implements it. So we can compute this thing called the standard error of the mean in a very straightforward way.

We can compute it. But does it work? What do I mean by work? I mean, what's the relationship of the standard error to the standard deviation? Because, remember, our goal was to understand the standard deviation so we could use the empirical rule.

Well, let's test the standard error of the mean. So here's a slightly longer piece of code. I'm going to look at a bunch of different sample sizes, from 25 to 600, with 50 trials each. getHighs is just a function that returns the temperatures. I'm going to get the standard deviation of the whole population, and then both the standard errors of the mean and the sample standard deviations. And then I'm just going to go through and run it. So for each size in the sample sizes, I'm going to append the standard error of the mean. And remember, that uses the population standard deviation and the size of the sample. So I'll compute all the SEMs. And then I'm going to compute all the actual standard deviations, as well. And then we'll produce a plot.

All right, so let's see what that plot looks like. Pretty striking. The blue solid line is the standard deviation of the 50 means. And the red dotted line is the standard error of the mean. So we can see, quite strikingly, that they really track each other very well.
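A sketch of that test, with a stand-in population. Note the sanity check against the numbers from earlier in the lecture: with sigma of about 9.4 and n = 100, sigma over the square root of n is about 0.94, which matches the standard deviation of the sample means we observed; with n = 200, it's about 0.66.

    import math
    import random
    import numpy as np

    def sem(pop_sd, sample_size):
        # Standard error of the mean: sigma / sqrt(n)
        return pop_sd / math.sqrt(sample_size)

    # Stand-in population; the lecture uses getHighs() on the real data.
    population = [random.gauss(16.3, 9.4) for _ in range(422000)]
    pop_sd = np.std(population)

    for size in (25, 50, 100, 200, 400, 600):
        # Standard deviation actually observed across 50 sample means...
        means = [np.mean(random.sample(population, size))
                 for _ in range(50)]
        # ...tracks the standard error predicted by the formula.
        print(size, round(np.std(means), 3), round(sem(pop_sd, size), 3))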
And this is saying that I can anticipate what the standard deviation would be by computing the standard error. Which is really useful, because now I can take one sample, compute its standard error, and get something very similar to the standard deviation I would see if I took 50 samples and looked at the standard deviation of those 50 sample means. So it's not obvious that this would be true, right? That I could use this simple formula, and the two things would track each other so well. And it's not a coincidence, by the way, that as I get out near the end, they're really lying on top of each other. As the sample size gets much larger, they really will coincide.

So, first: does everyone understand the difference between the standard deviation and the standard error? No. OK. How do we compute a standard deviation? To do that, we have to look at many samples-- in this case 50-- and we compute how much variation there is across those 50 sample means. For the standard error, we look at one sample, and we compute this thing called the standard error. And we argue that we get more or less the same number that we would have gotten had we taken 50 samples or 100 samples and computed the standard deviation. So if my only reason for taking all 50 samples was to get the standard deviation, I can skip them, take one sample instead, and use the standard error of the mean.

So going back to my temperatures-- instead of having to look at lots of samples, I only have to look at one. And I can get a confidence interval. Does that make sense? OK. There's a catch.

Notice that the formula for the standard error includes the standard deviation of the population-- not the standard deviation of the sample. Well, that's kind of a bummer. Because how can I get the standard deviation of the population without looking at the whole population?
And if we're going to look at the whole population, then what's the point of sampling in the first place? So we have a catch: we've got something that's a really good approximation, but it uses a value we don't know. So what should we do about that?

Well, what would be, really, the only obvious thing to try? What's our best guess at the standard deviation of the population if we have only one sample to look at? What would you use? Somebody? I know I forgot to bring the candy today, so no one wants to answer any questions.

AUDIENCE: The standard deviation of the sample?

PROFESSOR: The standard deviation of the sample. It's all I've got. So let's ask the question, how good is that? Shockingly good.

So I looked at our example here for the temperatures. And I'm plotting the sample standard deviation versus the population standard deviation for different sample sizes, ranging from 0 to 600 in steps of one, I think. What you can see here is that when the sample size is small, I'm pretty far off-- I'm off by 14% here, and I think that's at a sample size of 25. But when the sample size is larger, say 600, I'm off by about 2%. So what we see, at least for this data set of temperatures, is that if the sample size is large enough, the sample standard deviation is a pretty good approximation of the population standard deviation.
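A sketch of that comparison, with the usual stand-in population; the percentages it prints won't exactly match the lecture's, which come from the real temperature data:

    import random
    import numpy as np

    population = [random.gauss(16.3, 9.4) for _ in range(422000)]
    pop_sd = np.std(population)

    for size in (25, 100, 300, 600):
        # Average, over 100 trials, how far the sample standard
        # deviation is from the population standard deviation.
        diffs = [abs(pop_sd - np.std(random.sample(population, size)))
                 for _ in range(100)]
        pct_off = 100 * (sum(diffs) / len(diffs)) / pop_sd
        print(size, round(pct_off, 1), '% off on average')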
Well, now we should ask the question, what good is this? As I said, once the sample reaches a reasonable size-- and we see here, reasonable is probably somewhere around 500-- it becomes a good approximation. But is it true only for this example? The fact that it happened to work for high temperatures in the US doesn't mean that it will always be true.

So there are at least two things we should consider in asking when this will be true and when it won't. One is, does the distribution of the population matter? Here we saw, in our very first plot, the distribution of the high temperatures. And it was kind of symmetric around a point-- not perfectly. But not everything looks that way, right? So we should say, well, suppose we have a different distribution. Would that change this conclusion? And the other thing we should ask is, suppose we had a different sized population. Suppose instead of roughly 400,000 temperatures I had 20 million temperatures. Would I need more than 600 samples for the two things to be about the same?

Well, let's explore both of those questions. First, let's look at the distributions. We'll look at three common distributions-- a uniform distribution, a normal distribution, and an exponential distribution. And we'll look at each of them for, what is this, 100,000 points.

So we know we can generate a uniform distribution by calling random.random, which gives me uniformly distributed real numbers between 0 and 1. We know that we can generate our normal distribution by calling random.gauss-- in this case with a mean of 0 and a standard deviation of 1, but as we saw in the last lecture, the shape will be the same independent of those values. And, finally, an exponential distribution, which we get by calling random.expovariate. This number, 0.5, is something called lambda, which has to do with how quickly the exponential decays. I'm not going to give you the formula for it at the moment. But we'll look at the pictures, and we'll plot discrete approximations to each of these distributions.

So here's what they look like. Quite different, right? We've looked at uniform and we've looked at Gaussian before. And here we see an exponential, which basically decays and asymptotes towards zero, never quite getting there. But as you can see, it is certainly not symmetric around the mean.
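For reference, a minimal sketch of generating and plotting those three distributions; the figure layout here is an assumption, not the lecture's code:

    import random
    import matplotlib.pyplot as plt

    n = 100000
    uniform = [random.random() for _ in range(n)]        # uniform on [0, 1)
    normal = [random.gauss(0, 1) for _ in range(n)]      # mean 0, std dev 1
    exponential = [random.expovariate(0.5) for _ in range(n)]  # lambda = 0.5

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, data, title in zip(axes,
                               (uniform, normal, exponential),
                               ('Uniform', 'Normal', 'Exponential')):
        ax.hist(data, bins=100)  # discrete approximation of the density
        ax.set_title(title)
    plt.show()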
All right, so let's see what happens. If we run the experiment on these three distributions, each with 100,000 points, and look at different sample sizes, we actually see that the gap between the sample standard deviation and the population standard deviation is not the same across distributions. Down here, the uniform and the normal look kind of like what we saw before. But the exponential one is really quite different. Its worst case is up here, around 25%, where the normal's is about 14%. That's not too surprising, since our temperatures were kind of normally distributed when we looked at them. And the uniform is a much better approximation, even initially.

The reason for this has to do with a fundamental difference in these distributions, something called skew. Skew is a measure of the asymmetry of a probability distribution. And what we can see here is that skew actually matters. The more skew you have, the more samples you're going to need to get a good approximation. So if the population is very skewed, very asymmetric in its distribution, you need a lot of samples to figure out what's going on. If it's very uniform, as in, for example, the uniform population, you need many fewer samples.

OK, so that's an important thing. When we go about deciding how many samples we need, we need to have some estimate of the skew in our population.
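The lecture doesn't show code for measuring skew, but as an aside, scipy.stats.skew is one standard way to compute it; a sketch, with approximate expected values in the comments:

    import random
    from scipy.stats import skew

    n = 100000
    print(skew([random.random() for _ in range(n)]))          # ~0, symmetric
    print(skew([random.gauss(0, 1) for _ in range(n)]))       # ~0, symmetric
    print(skew([random.expovariate(0.5) for _ in range(n)]))  # ~2, skewed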
All right, how about size? Does size matter? Shockingly-- at least it was shocking to me the first time I looked at this-- the answer is no. If we look at this-- and I'm showing just the uniform distribution, but we'd see the same thing for all three-- it more or less doesn't matter. Quite amazing, right? If you have a bigger population, you don't need more samples. And it's really almost counterintuitive to think that you don't need any more samples whether you have a million people or 100 million people. And that's why, when we look at, say, political polls, they're amazingly small. They poll 1,000 people and claim they're representative of Massachusetts.

This is good news. So to estimate the mean of a population, given a single sample, we choose a sample size based upon some estimate of the skew in the population. This is important, because if we get that wrong, we might choose a sample size that is too small. And in some sense, you always want to choose the smallest sample size that will give you an accurate answer, because it's more economical to take small samples than big ones. I've been talking about polls, but the same is true in an experiment: how many pieces of data do you need to collect when you run an experiment in a lab? How many will depend, again, on the skew of the data, and that will help you decide.

When you know the size, you choose a random sample from the population. Then you compute the mean and the standard deviation of that sample. And then you use the standard deviation of that sample to estimate the standard error. And I want to emphasize that what you're getting here is an estimate of the standard error, not the standard error itself, which would require you to know the population standard deviation. But if you've chosen the sample size appropriately, this will turn out to be a good estimate. And then, once we've done that, we use the estimated standard error to generate confidence intervals around the sample mean. And we're done.

Now, this works great when we choose independent random samples. As we've seen before, if you don't choose independent samples, it doesn't work so well. And, again, this is an issue where, if you assume that in an election each state is independent of every other state, you'll get the wrong answer, because they're not.

All right, let's go back to our temperature example and pose a simple question. Are 200 samples enough? I don't know why I chose 200. I did.
So we'll do an experiment here. This is similar to an experiment we saw on Monday. I start by initializing the number of mistakes I make. Then, for t in range(numTrials), sample is random.sample of the temperatures, with the given sample size.

This is a key step. The first time I did this, I messed it up. Instead of doing this very simple thing, I did a more complicated thing: I chose some point in my list of temperatures and took the next 200 temperatures. Why did that give me the wrong answer? Because the data is organized by city. So if I happened to choose the first day of Phoenix, all 200 temperatures were from Phoenix-- which is not a very good approximation of the temperature in the country as a whole. But this will work, because I'm using random.sample.

I'll then get the sample mean. Then I'll compute my estimate of the standard error, by taking that as seen here. And then, if the absolute value of the population mean minus the sample mean is more than 1.96 estimated standard errors, I'm going to say I messed up-- it's outside. And then at the end, I'm going to look at the fraction outside the 95% confidence intervals.

And what do I hope it should print? What would be the perfect answer when I run this? What fraction should lie outside? It's a pretty simple calculation. Five percent, right? Because if they all were inside, then I'm being too conservative in my interval. I want 5% of the tests to fall outside the 95% confidence interval. If I wanted fewer to fall outside, I would look at three standard deviations instead of 1.96; then I would expect less than 1% to fall outside.

So this is something we have to always keep in mind when we do this kind of thing. If your answer is too good, you've messed up. It shouldn't be too bad, but it shouldn't be too good, either. That's what probabilities are all about. If you called every election correctly, then your math is wrong.
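Putting that together, a sketch of the experiment-- again with a stand-in population where the lecture uses the real temperatures; it should print a fraction near 0.05:

    import math
    import random
    import numpy as np

    temps = [random.gauss(16.3, 9.4) for _ in range(422000)]  # stand-in
    pop_mean = np.mean(temps)

    num_bad, num_trials, sample_size = 0, 10000, 200
    for t in range(num_trials):
        sample = random.sample(temps, sample_size)  # simple random sample
        sample_mean = np.mean(sample)
        # Estimate the standard error from the sample's own std deviation.
        est_se = np.std(sample) / math.sqrt(sample_size)
        if abs(pop_mean - sample_mean) > 1.96 * est_se:
            num_bad += 1

    print('Fraction outside 95% confidence interval:',
          num_bad / num_trials)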
Well, when we run this, we get this lovely answer: the fraction outside the 95% confidence interval is 0.0511. That's close to exactly what you want-- almost exactly 5%. And if I run it multiple times, I get slightly different numbers, but they're all in that range, showing that, here, it really does work.

So that's what I wanted to say, and it's really important, this notion of the standard error. When I talk to other departments about what we should cover in 6.0002, about the only thing everybody agreed on was that we should talk about standard error. So now I hope I have made everyone happy. And we will talk about fitting curves to experimental data starting next week. All right, thanks a lot.