The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN GUTTAG: Hello, everybody. Well, here we are at the last lecture. We're going to finish talking about statistical sins and then do a little bit of a wrap-up.

Let's look at a hot topic--global warming, fact or fiction. You've done a problem set related to temperatures in the US. Here is a generally accepted plot of the change in temperatures on the planet between 1880 and 2014.

Now, if we look at this plot, we can see that it commits one of the statistical sins I complained about on Monday. Look where the y-axis starts--way down here at 55. And you remember, I told you to beware of charts where the y-axis doesn't start at 0. So maybe the people making claims about global warming are just deceiving us with this trick of the axis.

So here's what happens when you put it at 0. As you can see--or barely see--this axis runs from 0 up to 110 as the average temperature. And as you can see quite clearly, it's hardly changed at all. So what's the deal here? Well, which is a more accurate presentation of the facts? Which conveys the accurate impression?

Let's look at another example, maybe a little less controversial than climate change--fever and flu. It's generally accepted that when you get the flu, you might run a fever. So here is someone who had the flu. And this is a plot of their fever from its beginning to its peak. And it does appear that, if we were to fit a curve to this, it would look pretty much like that. On the other hand, if we assume that somebody's temperature could range between 0 and 200, we can see that, in fact, your temperature doesn't move at all when you get the flu.
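To make the axis trick concrete, here is a minimal matplotlib sketch (the hourly fever readings are made up for illustration) that plots the same data twice: once with the y-axis fit to the data, and once running from 0 to 200. Nothing differs between the panels except the axis limits.

```python
import matplotlib.pyplot as plt

# Hypothetical hourly temperature readings as a fever develops
hours = list(range(12))
temps = [98.6, 98.8, 99.1, 99.6, 100.2, 100.9,
         101.5, 102.0, 102.4, 102.6, 102.7, 102.8]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: y-axis fit to the data -- the fever looks dramatic
ax1.plot(hours, temps)
ax1.set_title('Axis fit to data')
ax1.set_xlabel('Hours')
ax1.set_ylabel('Temperature (F)')

# Right: y-axis from 0 to 200 -- the very same fever looks flat
ax2.plot(hours, temps)
ax2.set_ylim(0, 200)
ax2.set_title('Axis from 0 to 200')
ax2.set_xlabel('Hours')

plt.show()
```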
So the moral is pretty clear, I think. Even though on Monday I talked about being suspicious when people start the y-axis too far from 0, sometimes you should truncate it to eliminate totally preposterous values. No living person has a temperature of 0 degrees Fahrenheit. So again, don't truncate the axis just to make something look like what it isn't, but don't expand it to deceive either.

Let's return to global warming. This is a chart that was actually shown on the floor of the US Senate by a senator from Texas, who I shall not name. And obviously, the argument here was that, well, sure, global temperature bounces up and down. But if we go back--we can see here, the date is 19--can we see it? I can see it. Maybe 1986, I think. You can see that the argument here is that, in fact, if you fit a trend line to this, as he's done, it hasn't changed at all. And so even though we've had a lot of carbon emissions during this period, maybe global warming is not actually happening. This is in contradiction to the trend I showed before.

Well, what's going on here? This is a very common way that people use statistics poorly. They confuse fluctuations with trends. In any series of data--time series, or other series--you always have fluctuations. And that's not to be confused with the trend. In particular, when you're looking at a phenomenon, you need to choose an interval consistent with the thing that's being considered.

So we believe that climate change is something that happens over very long periods of time. And it's a little bit silly to look at it over a short period of time. Some of you may remember two years ago, we had a very cold winter here. And there were people who were saying, well, that shows we don't have global warming. Well, you can't really conclude anything about climate change by looking at a year, or probably not even by looking at 10 years or 20 years. It's a very slow phenomenon.
On the other hand, if you're looking at the change in somebody's heart rate, to see if they have a heart condition, you probably don't want to look at it over a 10-year period. So you have to decide what you're doing and find an interval that lets you look at the trends rather than the fluctuations.

At any rate, maybe even if we're having global warming, at least the Arctic ice isn't melting--though apparently, I read in the paper this morning that they found a huge crack in it. So this was reported in the Financial Post on April 15, 2013. You can read it yourself. But the basic import of it is that they took the period from April 14, 1989 to April 15, 2013 and said, look, it's not changing. In fact, the amount of Arctic ice is unchanged.

Well, what's the financial--not the financial--what's the statistical sin being committed here? If we look at this data, this is an anomaly chart. I think you saw one of these in one of the problem sets, where you fix something at 0 and then you show fluctuations relative to that. So here, it's the Arctic ice relative to a reference point. And what we see here is that if you go and choose the right date--say this one in 1989--and you come over here and choose the right date in 2013--say this one--you can then draw a line and say, oh, look, it hasn't changed.

This is something people frequently do. They take a whole set of data, and they find two points that are consistent with something they believe. And they draw a line between those two points, fit a curve to those two points, and draw some conclusion. This is what we call cherry picking, I guess from the notion that when you go to pick cherries you only want to pick the right ones, and leave the others to ripen for a bit on the tree. It's really bad. And it's something that, unfortunately, the scientific literature is replete with--people look at a lot of data, and they pick the points that match what they want to prove.
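Here is a small sketch of how endpoint cherry picking works in code. The data is synthetic--a steady downward trend plus noise, standing in for the ice anomaly series. A line drawn through two hand-picked points can have almost any slope you like, while a least-squares fit to all of the data recovers the trend.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic anomaly series: a steady downward trend plus noise
years = np.arange(1979, 2014)
anomaly = -0.05 * (years - 1979) + rng.normal(0, 0.4, len(years))

# Honest approach: least-squares line through all of the data
slope_all = np.polyfit(years, anomaly, 1)[0]

# Cherry picking: search for the pair of years whose connecting
# line is flattest, then report only that line
pairs = ((i, j) for i in range(len(years)) for j in range(i + 1, len(years)))
i, j = min(pairs, key=lambda p: abs((anomaly[p[1]] - anomaly[p[0]])
                                    / (years[p[1]] - years[p[0]])))
slope_picked = (anomaly[j] - anomaly[i]) / (years[j] - years[i])

print(f'Trend from all data:      {slope_all:.3f} per year')
print(f'Trend from picked points: {slope_picked:.3f} per year '
      f'({years[i]} to {years[j]})')
```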
And so, while the trend is quite clear, you could prove almost anything you wanted by selecting two points very carefully. I could also show that the ice is crashing much faster than people think it is by picking these two points. If I wanted to argue that it's catastrophic, I'd pick those two points and say, look at that, it's disappearing at an incredible rate. So you can lie in either direction with this data by careful cherry picking.

As a service to you--I know the holidays are coming and many of you have not bought presents for your parents--here's a modest gift suggestion: the family that shoots together, something or other. Well, all right, so we can ask, is this a good gift? Well, probably. We can look at this statistic. It's not dangerous, at least. We see that 99.8% of the firearms in the US will not be used to commit a violent crime. So guns apparently are not actually dangerous, or at least not in the hands of criminals.

Well, let's look at this. How many privately owned firearms are there in the US? Anyone want to guess who hasn't looked ahead? Yeah.

AUDIENCE: 400 million.

JOHN GUTTAG: 400 million. 340 million people and 400 million guns is the guess--more than one per person. You're certainly in the right order of magnitude. I think it's about 300 million, but it's hard to count them. Maybe this doesn't count water pistols. So if you assume there are 300 million firearms, and 0.2% of them are used to commit a violent crime every year, how many crimes is that? 600,000. So in fact, it's not necessarily very meaningful to say that most of them are not used to commit a crime.

Well, let's look at another place where we look at a statistic. Probably most of you don't even remember the scary swine flu epidemic. This was a big headline.
And people got so scared of the swine flu that they were doing things like closing schools to try to limit the spread of the flu. New York City closed some schools because of it, for example. So is this a scary statistic? Well, maybe, but here's an interesting statistic. How many deaths per year are from the seasonal flu in the US--the one we try to prevent with a flu shot? 36,000. So what we see is that it doesn't make a lot of sense to panic over 159 deaths in light of this number.

So the point here, for both this and the issue about the firearms, is that context matters. Yeah, I love this cartoon. A number without context is just a number. And numbers by themselves don't mean anything. So to say that there were 159 deaths from the swine flu is not very meaningful without some context. To say that only 0.2% of firearms are used to commit a violent crime is not very meaningful without context. Whenever you're presenting a statistic, or reading about a statistic, and you see a number that seems comforting or terrifying, try to put some context around it.

A related question is, relative to what? Suppose I told you that skipping lectures increases your probability of failing this course by 50%. Well, you would all feel great, because you're here. And you would be laughing at your friends who are not here, figuring that will leave much better grades for you. What does this mean, though? Well, if I told you that it changed the probability of failing from 0.5 to 0.75, you would be very tempted to come to lectures. On the other hand, if I told you that it changed the probability from 0.005 to 0.0075, you might say, the heck with it, I'd rather go to the gym.

Again, this is an issue. And this is something that we see all the time when people talk about percentage change. It's particularly prominent in the pharmaceutical field. You will read a headline saying that drug x for arthritis increases the probability of a heart attack by 1% or 5%.
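Here is a tiny sketch of the distinction, using the lecture-skipping numbers above (the helper function is just for illustration). The same 50% relative increase corresponds to wildly different absolute changes in risk depending on the baseline.

```python
def describe_change(baseline, new):
    """Report a change in probability in both relative and absolute terms."""
    relative = (new - baseline) / baseline * 100
    absolute = new - baseline
    print(f'{baseline} -> {new}: +{relative:.0f}% relative, '
          f'+{absolute:.4g} absolute')

describe_change(0.5, 0.75)      # +50% relative, +0.25 absolute
describe_change(0.005, 0.0075)  # +50% relative, +0.0025 absolute
```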
Well, what does that mean? If the probability was already very low, then even after increasing it by 5%, it's still very low. And maybe it's worth it not to be in pain from arthritis. So talking in percentages is, again, one of these things that doesn't make sense without context. In order to know what this means, I need to know what regime I'm in, in order to make an intelligent decision about whether to attend lecture or not. It goes without saying, you have all made the right decision.

So beware of percentage change when you don't know the denominator. You get a percentage by dividing by something. And if you don't know what you're dividing by, then the percentage is itself a meaningless number.

While we're sort of talking about medical things, let's look at cancer clusters to illustrate another statistical question. So this is the CDC's definition of a cancer cluster--"a greater-than-expected number of cancer cases that occurs in a group of people in a geographic area over a period of time." And the key part of this definition is greater-than-expected.

About 1,000 cancer clusters per year are reported in the US, mostly to the Centers for Disease Control, but also to other health agencies. Upon analysis, almost none of them pass this test. So the vast majority--some years, all of them--are deemed actually not to be cancer clusters.

So I don't know if--has anyone here seen the movie Erin Brockovich? Subsequent analysis showed that was actually not a cancer cluster. It's a good movie, but it turns out to be statistically wrong. This, by the way, is not a cancer cluster. This is a constellation.

So let's look at a hypothetical example. By the way, the other movie about cancer clusters was the one set in Massachusetts. What was the name? A Civil Action. Anyone see that? No. That one was a cancer cluster. Massachusetts is about 10,000 square miles.
And there are about 36,000 cancer cases per year reported in Massachusetts. Those two numbers are accurate. The rest of this is pure fiction. So let's assume that we had some ambitious attorney who partitioned the state into 1,000 regions of 10 square miles each and looked at the distribution of cancer cases in these regions, trying to find cancer clusters that he or she could file a lawsuit about.

Well, you can do some arithmetic. If there are 36,000 new cancer cases a year and we have 1,000 regions, that says we should expect about 36 cancer cases per year per region. Well, when the attorney looked at the data--this mythical attorney--he discovered that region number 111 had 143 new cancer cases over a three-year period. He compared that to 3 times 36, which is 108, and said, wow, that's 32% more than expected. I've got a lawsuit. So he went to tell all these people that they lived in a cancer cluster. And the question is, should they be worried?

Well, another way to look at the question is, how likely is it that it was just bad luck? That's the question we always ask when we do statistical analysis--is this result meaningful, or is it just the random variation that you would expect to see?

So I wrote some code to simulate it, to see what happens--number of cases, 36,000; number of years, 3. All of this is just the numbers I had on the slide. We'll do a simulation with 100 trials. Then, for t in range number of trials, I'll initialize each of the locations--the regions, if you will--to 0; 1,000 of them, in this case. And then for i in range number of years times number of cases per year--that is, 3 times 36,000--I will assign the case at random to one of these regions. Nothing to do with cancer clusters; just at random, each case gets assigned to one of the 1,000 regions.
And then I'm going to check whether region number 111 had greater than or equal to 143 cases, the number we assumed it had. If so, we'll increment the variable numGreater by 1, saying that in this trial it indeed had that many. And then we'll see how often that happens out of the 100 trials. That will tell us how improbable it is that region 111 actually had that many cases. And then we'll print it.

Does that make sense to everyone--that here I am doing my simulation to see how probable it is that region 111 would have had this many cases? Any questions? Let's run it.

So here's the code we just looked at. It takes just a second. That's why I did only 100 trials instead of 1,000. I know the suspense is killing you. It's killing me. I don't know why it's taking so long. We'll finish. I wish I had the Jeopardy music or something to play while we waited for this. Anna, can you hum some music or something to keep people amused? She will not. Wow.

So here it is. The estimated probability of region 111 having at least 143 cases--easier to read if I spread this out--is 0.01. So it seems, in fact, that it's pretty surprising--unlikely to have happened at random. Do you buy it? Or is there a flaw here? Getting back to this whole question. Yes.

AUDIENCE: I think it's flawed because, first off, you have to look at the population. That is more important.

JOHN GUTTAG: You have to look at what?

AUDIENCE: The population, as opposed to the number of areas, because when you get past the Boston area, you'd expect a--

JOHN GUTTAG: Let's assume that, in fact, instead of partitioning by square miles, the populations were balanced.

AUDIENCE: Then I also think it's flawed, because I don't think what's important is block 111 specifically having 143.
What's important is just any one area having a higher--

JOHN GUTTAG: Exactly right. Exactly right. I'm sorry, I forgot my candy bag today. That just means there'll be more candy for the final.

What we have here is a variant of cherry picking. In my simulation, I distributed cases among 1,000 different regions. What the attorney did--not in a simulation--is he looked at 1,000 different regions, found the one with the most cancer cases, and said, aha, there are too many here. And that's not what I did in my simulation. My simulation didn't ask the question, how likely is it that there is at least one region with that many cases? It asked the question, how likely is it that this specific region has that many cases?

Now, if the attorney had reason in advance to be suspicious of region 111, then maybe it would have been OK to just go check that. But having looked at 1,000 and then cherry-picked the best is not right.

So this is a simulation that does the right thing. I've left out the initialization. But what you can see I'm doing here is looking at the probability of there being any region that has at least 143 cases.

What the attorney did is what the technical literature calls multiple hypothesis checking. Rather than having a single hypothesis--that region 111 is bad--he checked 1,000 different hypotheses, and then chose the one that met what he wanted. Now, there are good statistical techniques that exist for dealing with multiple hypotheses, things like the Bonferroni correction. I love to say that name. But you have to worry about it.

And in fact, if we go back to the code, comment out this one, and run this one, we'll see we get a very different answer. The answer we get is--let's see. Oh, I see. All right, let me just comment this out. Yeah, this should work, right? Well, maybe you don't want to wait for it.
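For reference, here is a minimal, self-contained reconstruction of both versions of the simulation. The names and structure are assumptions based on the description above, not the actual course code. The first test checks region 111 specifically (the wrong question); the second asks whether any of the 1,000 regions reached 143 cases, which matches what the attorney actually did.

```python
import random

NUM_CASES_PER_YEAR = 36000
NUM_YEARS = 3
NUM_REGIONS = 1000
NUM_TRIALS = 100  # kept small because each trial assigns 108,000 cases

region_111_hits, any_region_hits = 0, 0
for t in range(NUM_TRIALS):
    # Assign every case to a region uniformly at random
    locs = [0] * NUM_REGIONS
    for _ in range(NUM_YEARS * NUM_CASES_PER_YEAR):
        locs[random.randrange(NUM_REGIONS)] += 1
    # Wrong question: did region 111 in particular get >= 143 cases?
    if locs[111] >= 143:
        region_111_hits += 1
    # Right question: did ANY region get >= 143 cases?
    # (This is the multiple-hypotheses version.)
    if max(locs) >= 143:
        any_region_hits += 1

print('P(region 111 >= 143 cases) =', region_111_hits / NUM_TRIALS)
print('P(some region >= 143 cases) =', any_region_hits / NUM_TRIALS)
```

Run as written, this should roughly reproduce the two figures quoted in the lecture: about 0.01 for the specific region, and about 0.6 for some region.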
But the answer you'll get is that it's actually very probable. My recollection is that it's about a 0.6 probability that at least one region has that many cases. And that's really what's going on with this whole business of people reporting cancer clusters. Just by accident, by pure randomness, some region has more than its share.

This particular form of cherry picking also goes by the name of the Texas sharpshooter fallacy. I don't know why people pick on Texas for this, but they seem to. The notion is, you're driving down a road in Texas and you see a barn with a bunch of bullet holes in the wall, right in the middle of a target. But what actually happened was you had a barn, the farmer shot at it at random, and then he got out his paint brush and painted a target right around where the bullets happened to land. And that's what happens when you cherry-pick hypotheses.

What's the bottom line of all these statistical fallacies? When drawing inferences from data, skepticism is merited. There are, unfortunately, more ways to go wrong than to go right. And there is literature telling you that more than half of the papers in the scientific literature were later shown to be wrong.

You do need to remember that skepticism and denial are different. It's good to be skeptical. And I love Ambrose Bierce's description of the difference. If you have never read Ambrose Bierce, he's well worth reading. He wrote something called The Devil's Dictionary, among other things, in which he gives his own definitions of a lot of words. And he went by the nickname Bitter Bierce. If you read The Devil's Dictionary, you'll see why. But this, I think, has a lot of wisdom in it.

Let's, in the remaining few minutes, wrap up the course. So what did we cover in 6.0002? A lot of things. If you look at the technical content, there were three major units--optimization problems, stochastic thinking, and modeling aspects of the world.
But there was a big subtext among all of it, which was this. There was a reason our problem sets were not pencil-and-paper probability problems, but all coding. And that's because an important part of the course is to make you a better programmer. We introduced a few extra features of Python. But more importantly, we emphasized the use of libraries, because in the real world, when you're trying to build things, you rarely start from scratch. And if you do start from scratch, you're probably making a mistake. So we wanted to get you used to the idea of finding and using libraries. We looked at plotting libraries and machine learning libraries and numeric libraries. And hopefully, you got enough practice that you're a way better programmer than you were six weeks ago.

In a little more detail--for the optimization problems, probably the most important takeaway is that many important problems can be formulated in terms of an objective function that you either maximize or minimize, plus some set of constraints. Once you've done that, there are lots of toolboxes, lots of libraries, that you can use to solve the problem. You wrote some optimization code yourself. But most of the time, we don't solve these problems ourselves. We just call a built-in function that does it. So the hard part is not writing the code, but doing the formulation.

We talked about different algorithms--greedy algorithms, which are very often useful but often don't find the optimal solution. So for example, we looked at k-means clustering. It was a very efficient way to find clusters, but it did not necessarily find the optimal set of clusters. We then observed that many optimization problems are inherently exponential. But even so, dynamic programming often works and gives us a really fast solution. And the notion here is that this is not an approximate solution. It's not like using a greedy algorithm.
It gives you an exact solution, and in many circumstances gives it to you quickly. And the other thing I want you to take away is that, outside the context of dynamic programming, memoization is a generally useful technique. What we've done there is trade space for time: we compute something, we save it, and when we need it, we look it up. That's a very common programming technique.

And we looked at a lot of different examples of optimization--knapsack problems, several graph problems, curve fitting, clustering, logistic regression. Those can all be formulated as optimization problems. So it's a very powerful framework and fits lots of needs.

The next unit--and, of course, I'm speaking as if these things were discrete in time, but they're not. We talked about optimization at the beginning, and I talked about optimization last week. So these things were spread out over the term. We talked about stochastic thinking. The basic notion here is that the world is nondeterministic, or at least apparently nondeterministic. And therefore, we need to think about things in terms of probabilities most of the time, or at least frequently. And randomness is a powerful tool for building computations that model the world. If you think the world is stochastic, then you need ways to write programs that are stochastic, if you're trying to model the world itself.

The other point we made is that random computations--randomness as a computational technique--are useful even for problems that don't appear to involve any randomness. So we used it to find the value of pi. We showed you can use it to do integration. There's nothing random about the value of the integral of a function. Yet the easiest way to compute it in a program is to use randomness. So randomness is a very powerful tool.
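As a reminder of the flavor, here is a minimal Monte Carlo sketch of the pi estimate mentioned above--a standard dart-throwing version, not the exact course code. Throw random points into the unit square and count the fraction that land inside the quarter circle.

```python
import random

def estimate_pi(num_points):
    """Estimate pi by throwing random points at the unit square.

    The fraction landing inside the quarter circle of radius 1
    approximates pi/4, so we multiply by 4.
    """
    in_circle = 0
    for _ in range(num_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            in_circle += 1
    return 4 * in_circle / num_points

print(estimate_pi(1_000_000))  # typically prints something close to 3.14
```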
And there's this whole area of randomized algorithms--a research area and a practical area--that's used to solve non-probabilistic problems.

Modeling the world--well, we just talked about part of it. Models are always inaccurate. They're providing some abstraction of reality. We looked at deterministic models--the graph-theory models. There was nothing nondeterministic about the graphs we looked at. And then we spent more time on statistical models. We looked at simulation models--in particular, we spent quite a bit of time on Monte Carlo simulation. We looked at models based on sampling.

And there--and also when we talked about simulation--I really hope I emphasized enough the notion that we need to be able to characterize how believable the results are. It's not good enough to just run a program and say, oh, it has an answer. You need to know whether to believe the answer. And the point we made is that it's not a binary question. It's not yes, it's right, or no, it's wrong. Typically, what we do is make some statement involving confidence intervals and confidence levels. We use those two values to describe how believable the answer is. And that's an important thing.

Then we looked at the tools we use for doing that. We looked at the central limit theorem. We looked at the empirical rule. We talked about different distributions. And especially, we spent a fair amount of time on the normal, or Gaussian, distribution.

And then finally, we looked at statistical models based upon machine learning. We looked at unsupervised learning--basically just clustering--and looked at two algorithms, hierarchical and k-means. And we looked at supervised learning. There, we essentially focused mostly on classification. And we looked at two ways of doing that--k-nearest neighbors and logistic regression.
Finally, we talked about presentation of data--how to build plots, the utility of plots, and recently, over the last two lectures, good and bad practices in presenting results about data.

So my summary is, I hope that you think you've come a long way--particularly those of you--how many of you were here in September when we started 6.0001? All right, most of you. Yeah, this, by the way, was a very popular ad for a long time, saying that finally women are allowed to smoke, isn't this great. And Virginia Slims sponsored tennis--the women's tennis tour--to show how good it was that women were now able to smoke. But anyway, I know not everyone in this class is a woman. So just for the men in the room, you too could have come a long way.

I hope that, if you look back at how you struggled in those early problem sets, you really feel that you've learned a lot about how to build programs. And if you spend enough time in front of a terminal, this is what you get to look like.

What might be next? I should start by saying this is a hard course. We know that many of you worked hard. And the staff and I really do appreciate it. You know your return on investment. I'd like you to remember that you can now write programs to do useful things. So if you're doing a UROP, you're sitting in a lab, and you get a bunch of data from some experiments, don't just stare at it. Sit down and write some code to plot it, to do something useful with it. Don't be afraid to write programs to help you out.

There are some courses that I think you're now well-prepared to take. I've listed the ones I know best--the courses in Course 6. 6.009 is a sort of introduction to computer science. I think many of you will find it too easy after taking this course. But maybe that's not a downside. 6.005 is a software engineering course, where they'll switch programming languages on you. You get to program in Java.
6.006 is an algorithms course in Python, and I think it's actually quite interesting. Students seem to like it a lot, and they learn about algorithms and implementing them. And 6.034 is an introduction to artificial intelligence, also in Python. I should also have listed 6.036, another introduction to machine learning in Python.

You should go look for an interesting UROP. A lot of students come out of this course and do UROPs where they use what they've learned here. And many of them have a very positive experience. So if you were worried that you're not ready for a UROP, you probably are ready--for a UROP using what's been done here. You can minor in computer science; this is available for the first time this year. But really, if you have time, you should major in computer science, because it is really the best major on campus--not even close, as somebody I know would say.

Finally, sometimes people ask me where I think computing is headed. And I'll quote one of my favorite baseball players: "It's tough to make predictions, especially about the future." So instead of my predictions, let me show you the predictions of some famous people.

Thomas Watson, who was the chairman of IBM--a company you've probably heard of--said, "I think there is a world market for maybe five computers." This was in response to the question of whether IBM should become a computer company, which it was not at the time. He was off by a little bit.

A few years later, there was an article in Popular Mechanics saying, computers are amazing, and they're going to change enormously--someday, they may weigh no more than 1 and 1/2 tons. You might someday get a computer that's no more than 3,000 pounds. So we're still waiting for that, I guess.

I like this one. This is--having written a book recently--from the editor in charge of books for Prentice Hall: "I traveled the length and breadth of this country and talked with the best people.
And I can assure you that data processing is a fad that won't last out the year."

MIT had that attitude for a while. For about 35 years, computer science was in a building off campus, because they weren't sure we were here to stay. Maybe that's not why, but that's how I interpret it.

Ken Olsen, an MIT graduate--I should say, a Course 6 graduate--was the founder and president and chairman of Digital Equipment Corporation, which in 1977 was the second largest computer manufacturer in the world, based in Maynard, Massachusetts. None of you have ever heard of it. They disappeared. And this is in part why, because Ken said, "there is no reason anyone would want a computer in their home," and totally missed that part of computation.

Finally, since this is the end, some famous last words. Douglas Fairbanks, Sr., a famous actor--this is true--the last thing he said before he died was, "never felt better." Amazing. That was from the movie The Mark of Zorro. Scientists are better. Luther Burbank's last words were, "I don't feel so good." And, well, I guess not.

[LAUGHTER]

And this is the last one. John Sedgwick was a Union general in the Civil War. This is a true story. He was riding behind the lines, trying to rally his men not to hide behind the stone walls but to stand up and shoot at the enemy. And he said, "they couldn't hit an elephant at this distance." Moments later, he was shot in the face and died.

[LAUGHTER]

I thought this was an apocryphal story. But in fact, there's a plaque at the battlefield where this happened, documenting the story. And apparently, it's quite true.

So with that, I'll say my last words for the course, which are that I appreciate your all coming. And I guess you are the survivors. So thank you for being here.

[APPLAUSE]