1 00:00:00,790 --> 00:00:03,130 The following content is provided under a Creative 2 00:00:03,130 --> 00:00:04,550 Commons license. 3 00:00:04,550 --> 00:00:06,760 Your support will help MIT OpenCourseWare 4 00:00:06,760 --> 00:00:10,850 continue to offer high quality educational resources for free. 5 00:00:10,850 --> 00:00:13,390 To make a donation, or to view additional materials 6 00:00:13,390 --> 00:00:17,320 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,320 --> 00:00:18,270 at ocw.mit.edu. 8 00:00:28,431 --> 00:00:30,380 PROFESSOR: Hello, everybody. 9 00:00:30,380 --> 00:00:35,060 Before we start the material, a couple of announcements. 10 00:00:35,060 --> 00:00:37,280 As usual, there's some reading assignments, 11 00:00:37,280 --> 00:00:40,940 and you might be surprised to see something from Chapter 12 00:00:40,940 --> 00:00:43,370 5 suddenly popping up. 13 00:00:43,370 --> 00:00:45,380 But this is my relentless attempt 14 00:00:45,380 --> 00:00:47,050 to introduce more Python. 15 00:00:47,050 --> 00:00:51,690 We'll see one new concept later today, list comprehension. 16 00:00:51,690 --> 00:00:55,650 Today we're going to look at classification. 17 00:00:55,650 --> 00:00:58,580 And you remember last, on Monday, 18 00:00:58,580 --> 00:01:01,640 we looked at unsupervised learning. 19 00:01:01,640 --> 00:01:04,489 Today we're looking at supervised learning. 20 00:01:04,489 --> 00:01:08,660 It can usually be divided into two categories. 21 00:01:08,660 --> 00:01:11,940 Regression, where you try and predict 22 00:01:11,940 --> 00:01:15,420 some real number associated with the feature vector, 23 00:01:15,420 --> 00:01:18,540 and this is something we've already done really, 24 00:01:18,540 --> 00:01:22,980 back when we looked at curve fitting, linear regression 25 00:01:22,980 --> 00:01:24,150 in particular. 26 00:01:24,150 --> 00:01:28,080 It was exactly building a model that, given some features, 27 00:01:28,080 --> 00:01:30,232 would predict a point. 28 00:01:30,232 --> 00:01:31,690 In this case, it was pretty simple. 29 00:01:31,690 --> 00:01:33,810 It was given x predict y. 30 00:01:33,810 --> 00:01:38,640 You can imagine generalizing that to multi dimensions. 31 00:01:38,640 --> 00:01:42,660 Today I'm going to talk about classification, 32 00:01:42,660 --> 00:01:45,840 which is very common, in many ways more 33 00:01:45,840 --> 00:01:48,390 common than regression for-- 34 00:01:48,390 --> 00:01:50,550 in the machine learning world. 35 00:01:50,550 --> 00:01:55,170 And here the goal is to predict a discrete value, often called 36 00:01:55,170 --> 00:02:00,420 a label, associated with some feature vector. 37 00:02:00,420 --> 00:02:04,400 So this is the sort of thing where you try and, for example, 38 00:02:04,400 --> 00:02:07,550 predict whether a person will have 39 00:02:07,550 --> 00:02:10,340 an adverse reaction to a drug. 40 00:02:10,340 --> 00:02:12,290 You're not looking for a real number, 41 00:02:12,290 --> 00:02:17,720 you're looking for will they get sick, will they not get sick. 42 00:02:17,720 --> 00:02:21,230 Maybe you're trying to predict the grade in a course A, B, C, 43 00:02:21,230 --> 00:02:25,350 D, and other grades we won't mention. 44 00:02:25,350 --> 00:02:27,020 Again, those are labels, so it doesn't 45 00:02:27,020 --> 00:02:32,860 have to be a binary label but it's a finite number of labels. 46 00:02:32,860 --> 00:02:34,720 So here's an example to start with. 47 00:02:34,720 --> 00:02:37,580 We won't linger on it too long. 
48 00:02:37,580 --> 00:02:40,660 This is basically something you saw 49 00:02:40,660 --> 00:02:44,470 in an earlier lecture, where we had a bunch of animals 50 00:02:44,470 --> 00:02:48,070 and a bunch of properties, and a label identifying 51 00:02:48,070 --> 00:02:49,885 whether or not they were a reptile. 52 00:02:55,810 --> 00:03:01,640 So we start by building a distance matrix. 53 00:03:01,640 --> 00:03:07,270 How far apart they are, and in fact, in this case, 54 00:03:07,270 --> 00:03:11,020 I'm not using the representation you just saw. 55 00:03:11,020 --> 00:03:15,010 I'm going to use the binary representation, 56 00:03:15,010 --> 00:03:17,667 as Professor Grimson showed you, and for the reasons 57 00:03:17,667 --> 00:03:18,250 he showed you. 58 00:03:21,240 --> 00:03:25,320 If you're interested, I didn't produce this table by hand, 59 00:03:25,320 --> 00:03:28,500 I wrote some Python code to produce it, 60 00:03:28,500 --> 00:03:30,420 not only to compute the distances, 61 00:03:30,420 --> 00:03:36,030 but more delicately to produce the actual table. 62 00:03:36,030 --> 00:03:39,030 And you'll probably find it instructive at some point 63 00:03:39,030 --> 00:03:41,700 to at least remember that that code is there, 64 00:03:41,700 --> 00:03:45,910 in case you need to ever produce a table for some paper. 65 00:03:45,910 --> 00:03:51,100 In general, you probably noticed I spent relatively little time 66 00:03:51,100 --> 00:03:53,560 going over the actual vast amounts of code 67 00:03:53,560 --> 00:03:55,930 we've been posting. 68 00:03:55,930 --> 00:03:58,930 That doesn't mean you shouldn't look at it. 69 00:03:58,930 --> 00:04:02,380 In part, a lot of it's there because I'm 70 00:04:02,380 --> 00:04:04,510 hoping at some point in the future it will be handy 71 00:04:04,510 --> 00:04:08,680 for you to have a model on how to do something. 72 00:04:08,680 --> 00:04:09,610 All right. 73 00:04:09,610 --> 00:04:12,640 So we have all these distances. 74 00:04:12,640 --> 00:04:18,070 And we can tell how far apart one animal is from another. 75 00:04:18,070 --> 00:04:22,320 Now how do we use those to classify animals? 76 00:04:22,320 --> 00:04:25,020 And the simplest approach to classification, 77 00:04:25,020 --> 00:04:28,320 and it's actually one that's used a fair amount in practice 78 00:04:28,320 --> 00:04:31,750 is called nearest neighbor. 79 00:04:31,750 --> 00:04:35,140 So the learning part is trivial. 80 00:04:35,140 --> 00:04:39,010 We don't actually learn anything other than we just remember. 81 00:04:39,010 --> 00:04:42,010 So we remember the training data. 82 00:04:42,010 --> 00:04:45,640 And when we want to predict the label of a new example, 83 00:04:45,640 --> 00:04:48,240 we find the nearest example in the training data, 84 00:04:48,240 --> 00:04:53,030 and just choose the label associated with that example. 85 00:04:53,030 --> 00:04:55,570 So here I'm just drawing a cloud 86 00:04:55,570 --> 00:04:59,060 of red dots and black dots. 87 00:04:59,060 --> 00:05:02,060 I have a fuchsia colored X. And if I 88 00:05:02,060 --> 00:05:05,230 want to classify X as black or red, 89 00:05:05,230 --> 00:05:08,100 I'd say well its nearest neighbor is red. 90 00:05:08,100 --> 00:05:10,010 So we'll call X red. 91 00:05:12,650 --> 00:05:14,210 Doesn't get much simpler than that.
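A minimal sketch of the nearest-neighbor idea just described, assuming the animals are encoded as binary feature vectors like the ones on the slide. The distance function, the particular bit patterns, and the helper names here are illustrative only; they are not the course's posted code.

    # Nearest-neighbor classification over binary feature vectors.
    # The bit patterns below are made up for illustration.

    def euclidean(v1, v2):
        """Euclidean distance between two equal-length feature vectors."""
        return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

    # label True means "reptile" in this toy training set
    training = {
        'cobra':     ((1, 1, 1, 1, 0), True),
        'boa':       ((0, 1, 0, 1, 0), True),
        'chicken':   ((1, 1, 0, 1, 1), False),
        'guppy':     ((0, 1, 0, 0, 0), False),
        'dart frog': ((1, 0, 1, 0, 1), False),
    }

    def nearest_neighbor_label(features):
        """Return the label of the single closest training example."""
        closest = min(training,
                      key=lambda name: euclidean(features, training[name][0]))
        return training[closest][1]

    # classify a new, unlabeled animal from its (made-up) feature vector
    print(nearest_neighbor_label((1, 1, 0, 1, 0)))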
92 00:05:18,781 --> 00:05:19,280 All right. 93 00:05:19,280 --> 00:05:22,970 Let's try and do it now for our animals. 94 00:05:22,970 --> 00:05:26,250 I've blocked out this lower right hand corner, 95 00:05:26,250 --> 00:05:30,260 because I want to classify these three animals that are in gray. 96 00:05:30,260 --> 00:05:34,310 So my training data, very small, are these animals. 97 00:05:34,310 --> 00:05:37,170 And these are my test set here. 98 00:05:37,170 --> 00:05:40,980 So let's first try and classify the zebra. 99 00:05:40,980 --> 00:05:43,710 We look at the zebra's nearest neighbor. 100 00:05:43,710 --> 00:05:48,230 Well it's either a guppy or a dart frog. 101 00:05:48,230 --> 00:05:49,430 Well, let's just choose one. 102 00:05:49,430 --> 00:05:51,050 Let's choose the guppy. 103 00:05:51,050 --> 00:05:54,650 And if we look at the guppy, it's not a reptile, 104 00:05:54,650 --> 00:05:57,780 so we say the zebra is not a reptile. 105 00:05:57,780 --> 00:05:59,300 So got one right. 106 00:06:02,760 --> 00:06:05,130 Look at the python, choose its nearest neighbor, 107 00:06:05,130 --> 00:06:06,920 say it's a cobra. 108 00:06:06,920 --> 00:06:09,990 The label associated with cobra is reptile, 109 00:06:09,990 --> 00:06:12,535 so we win again on the python. 110 00:06:16,030 --> 00:06:22,170 Alligator, its nearest neighbor is clearly a chicken. 111 00:06:22,170 --> 00:06:27,770 And so we classify the alligator as not a reptile. 112 00:06:31,450 --> 00:06:33,400 Oh, dear. 113 00:06:33,400 --> 00:06:34,840 Clearly the wrong answer. 114 00:06:38,720 --> 00:06:39,890 All right. 115 00:06:39,890 --> 00:06:43,540 What might have gone wrong? 116 00:06:43,540 --> 00:06:49,040 Well, the problem with nearest neighbor, 117 00:06:49,040 --> 00:06:52,340 we can illustrate it by looking at this example. 118 00:06:52,340 --> 00:06:55,310 So one of the things people do with classifiers these days is 119 00:06:55,310 --> 00:06:57,750 handwriting recognition. 120 00:06:57,750 --> 00:07:01,800 So I just copied from a website a bunch of numbers, 121 00:07:01,800 --> 00:07:06,810 then I wrote the number 40 in my own inimitable handwriting. 122 00:07:06,810 --> 00:07:09,300 So if we go and we look for, say, the nearest neighbor 123 00:07:09,300 --> 00:07:10,500 of four-- 124 00:07:10,500 --> 00:07:13,020 or sorry, of whatever that digit is. 125 00:07:17,530 --> 00:07:20,080 It is, I believe, this one. 126 00:07:20,080 --> 00:07:23,640 And sure enough that's the row of fours. 127 00:07:23,640 --> 00:07:24,810 We're OK on this. 128 00:07:27,570 --> 00:07:32,910 Now if we want to classify my zero, 129 00:07:32,910 --> 00:07:35,610 the actual nearest neighbor, in terms 130 00:07:35,610 --> 00:07:39,770 of the bitmaps if you will, turns out to be this guy. 131 00:07:39,770 --> 00:07:42,240 A very poorly written nine. 132 00:07:42,240 --> 00:07:45,930 I didn't make up this nine, it was already there. 133 00:07:45,930 --> 00:07:50,670 And the problem we see here when we use nearest neighbor is 134 00:07:50,670 --> 00:07:55,540 if something is noisy, if you have one noisy piece of data, 135 00:07:55,540 --> 00:07:59,040 in this case, it's a rather ugly looking version of a nine, 136 00:07:59,040 --> 00:08:01,170 you can get the wrong answer because you match it. 137 00:08:03,830 --> 00:08:07,490 And indeed, in this case, you would get the wrong answer. 138 00:08:07,490 --> 00:08:10,960 What is usually done to avoid that is something 139 00:08:10,960 --> 00:08:12,940 called K nearest neighbors.
140 00:08:16,300 --> 00:08:19,930 And the basic idea here is that we don't just 141 00:08:19,930 --> 00:08:22,600 take the nearest neighbors, we take 142 00:08:22,600 --> 00:08:26,440 some number of nearest neighbors, usually 143 00:08:26,440 --> 00:08:30,730 an odd number, and we just let them vote. 144 00:08:30,730 --> 00:08:36,900 So now if we want to classify this fuchsia X, 145 00:08:36,900 --> 00:08:39,630 and we set K equal to three, we say well these 146 00:08:39,630 --> 00:08:42,600 are its three nearest neighbors. 147 00:08:42,600 --> 00:08:45,570 One is red, two are black, so we're 148 00:08:45,570 --> 00:08:49,540 going to call X black as our better guess. 149 00:08:49,540 --> 00:08:51,670 And maybe that actually is a better guess, 150 00:08:51,670 --> 00:08:54,310 because it looks like this red point here is really 151 00:08:54,310 --> 00:08:59,320 an outlier, and we don't want to let the outliers dominate 152 00:08:59,320 --> 00:09:01,450 our classification. 153 00:09:01,450 --> 00:09:05,560 And this is why people almost always use K nearest neighbors 154 00:09:05,560 --> 00:09:09,410 rather than just nearest neighbor. 155 00:09:09,410 --> 00:09:14,520 Now if we look at this, and we use K nearest neighbors, 156 00:09:14,520 --> 00:09:18,270 those are the three nearest to the first numeral, 157 00:09:18,270 --> 00:09:21,132 and they are all fours. 158 00:09:21,132 --> 00:09:22,840 And if we look at the K nearest neighbors 159 00:09:22,840 --> 00:09:25,840 for the second numeral, we still have this nine 160 00:09:25,840 --> 00:09:28,600 but now we have two zeros. 161 00:09:28,600 --> 00:09:32,150 And so we vote and we decide it's a zero. 162 00:09:32,150 --> 00:09:33,290 Is it infallible? 163 00:09:33,290 --> 00:09:34,130 No. 164 00:09:34,130 --> 00:09:37,130 But it's typically much more reliable 165 00:09:37,130 --> 00:09:41,620 than just nearest neighbors, hence used much more often. 166 00:09:45,880 --> 00:09:49,120 And that was our problem, by the way, with the alligator. 167 00:09:49,120 --> 00:09:51,830 The nearest neighbor was the chicken, 168 00:09:51,830 --> 00:09:54,170 but if we went back and looked at it-- 169 00:09:54,170 --> 00:09:55,470 maybe we should go do that. 170 00:10:01,950 --> 00:10:04,930 And we take the alligator's three nearest neighbors, 171 00:10:04,930 --> 00:10:09,870 it would be the chicken, a cobra, and the rattlesnake-- 172 00:10:09,870 --> 00:10:12,120 or the boa, we don't care, and we 173 00:10:12,120 --> 00:10:15,180 would end up correctly classifying it now 174 00:10:15,180 --> 00:10:17,070 as a reptile. 175 00:10:17,070 --> 00:10:18,304 Yes? 176 00:10:18,304 --> 00:10:22,222 AUDIENCE: Is there like a limit to how many [INAUDIBLE]? 177 00:10:22,222 --> 00:10:23,680 PROFESSOR: The question is, is there 178 00:10:23,680 --> 00:10:26,980 a limit to how many nearest neighbors you'd want? 179 00:10:26,980 --> 00:10:29,560 Absolutely. 180 00:10:29,560 --> 00:10:33,850 Most obviously, there's no point in setting K equal to-- whoops. 181 00:10:33,850 --> 00:10:36,090 Ooh, on the rebound-- 182 00:10:36,090 --> 00:10:40,270 to the size of the training set. 183 00:10:40,270 --> 00:10:42,940 So one of the problems with K nearest neighbors 184 00:10:42,940 --> 00:10:44,980 is efficiency. 185 00:10:44,980 --> 00:10:47,680 If you're trying to find the K nearest neighbors 186 00:10:47,680 --> 00:10:51,530 and K is bigger, it takes longer. 187 00:10:51,530 --> 00:10:55,460 So we worry about how big K should be.
188 00:10:55,460 --> 00:10:58,400 And if we make it too big-- 189 00:10:58,400 --> 00:11:00,650 and this is a crucial thing-- 190 00:11:00,650 --> 00:11:07,240 we end up getting dominated by the size of the class. 191 00:11:07,240 --> 00:11:10,650 So let's look at this picture we had before. 192 00:11:10,650 --> 00:11:14,650 There happen to be more red dots than black dots. 193 00:11:14,650 --> 00:11:20,440 If I make K 10 or 15, I'm going to classify a lot of things 194 00:11:20,440 --> 00:11:26,230 as red, just because red is so much more prevalent than black. 195 00:11:26,230 --> 00:11:29,140 And so when you have an imbalance, which you usually 196 00:11:29,140 --> 00:11:34,250 do, you have to be very careful about K. Does that make sense? 197 00:11:34,250 --> 00:11:36,525 AUDIENCE: [INAUDIBLE] choose K? 198 00:11:36,525 --> 00:11:38,710 PROFESSOR: So how do you choose K? 199 00:11:38,710 --> 00:11:43,780 Remember back on Monday when we talked about choosing K for K- 200 00:11:43,780 --> 00:11:45,900 means clustering? 201 00:11:45,900 --> 00:11:49,740 We typically do a very similar kind of thing. 202 00:11:49,740 --> 00:11:56,230 We take our training data and we split it into two parts. 203 00:11:56,230 --> 00:11:58,550 So we have training and testing, but now 204 00:11:58,550 --> 00:12:01,260 we just take the training, and we split that 205 00:12:01,260 --> 00:12:05,270 into training and testing multiple times. 206 00:12:05,270 --> 00:12:08,540 And we experiment with different K's, and we 207 00:12:08,540 --> 00:12:13,110 see which K gives us the best result on the training data. 208 00:12:13,110 --> 00:12:19,190 And then that becomes our K. And that's a very common method. 209 00:12:19,190 --> 00:12:22,570 It's called cross-validation, and it's-- 210 00:12:22,570 --> 00:12:26,760 for almost all of machine learning, the algorithms 211 00:12:26,760 --> 00:12:30,960 have parameters. In this case, it's just one parameter, K. 212 00:12:30,960 --> 00:12:34,170 And the way we typically choose the parameter values 213 00:12:34,170 --> 00:12:37,110 is by searching through the space using 214 00:12:37,110 --> 00:12:40,720 this cross-validation in the training data. 215 00:12:40,720 --> 00:12:43,350 Does that make sense to everybody? 216 00:12:43,350 --> 00:12:44,481 Great question. 217 00:12:44,481 --> 00:12:46,230 And there was someone else who had a question, 218 00:12:46,230 --> 00:12:47,362 but maybe it was the same. 219 00:12:47,362 --> 00:12:48,570 Do you still have a question? 220 00:12:48,570 --> 00:12:52,310 AUDIENCE: Well, just that you were using like K nearest 221 00:12:52,310 --> 00:12:54,351 and you get, like if my K is three 222 00:12:54,351 --> 00:12:56,684 and I get three different clusters for the K [INAUDIBLE] 223 00:12:56,684 --> 00:12:58,183 PROFESSOR: Three different clusters? 224 00:12:58,183 --> 00:12:59,126 AUDIENCE: [INAUDIBLE] 225 00:12:59,126 --> 00:13:00,370 PROFESSOR: Well, right. 226 00:13:00,370 --> 00:13:05,250 So if K is 3, and I had red, black, and purple 227 00:13:05,250 --> 00:13:08,190 and I get one of each, then what do I do? 228 00:13:08,190 --> 00:13:10,120 And then I'm kind of stuck. 229 00:13:10,120 --> 00:13:13,260 So you need to typically choose K in such a way 230 00:13:13,260 --> 00:13:16,140 that when you vote you get a winner. 231 00:13:16,140 --> 00:13:16,670 Nice. 232 00:13:16,670 --> 00:13:19,880 So if there's two, any odd number will do.
233 00:13:19,880 --> 00:13:22,070 If it's three, well then you need another number 234 00:13:22,070 --> 00:13:25,410 so that there's some-- so there's always a majority. 235 00:13:25,410 --> 00:13:27,070 Right? 236 00:13:27,070 --> 00:13:30,920 You want to make sure that there is a winner. 237 00:13:30,920 --> 00:13:31,940 Also a good question. 238 00:13:36,900 --> 00:13:39,210 Let's see if I get this to you directly. 239 00:13:41,870 --> 00:13:45,560 I'm much better at throwing overhand, I guess. 240 00:13:45,560 --> 00:13:46,430 Wow. 241 00:13:46,430 --> 00:13:48,140 Finally got applause for something. 242 00:13:48,140 --> 00:13:52,770 All right, advantages and disadvantages of KNN? 243 00:13:52,770 --> 00:13:54,930 The learning is really fast, right? 244 00:13:54,930 --> 00:13:57,120 I just remember everything. 245 00:13:57,120 --> 00:13:59,472 No math is required. 246 00:13:59,472 --> 00:14:00,930 Didn't have to show you any theory. 247 00:14:00,930 --> 00:14:03,660 Was obviously an idea. 248 00:14:03,660 --> 00:14:06,900 It's easy to explain the method to somebody, and the results. 249 00:14:06,900 --> 00:14:08,430 Why did I label it black? 250 00:14:08,430 --> 00:14:12,210 Because that's who it was closest to. 251 00:14:12,210 --> 00:14:15,730 The disadvantage is it's memory intensive. 252 00:14:15,730 --> 00:14:19,740 If I've got a million examples, I have to store them all. 253 00:14:19,740 --> 00:14:23,840 And the predictions can take a long time. 254 00:14:23,840 --> 00:14:27,650 If I have an example and I want to find its K nearest 255 00:14:27,650 --> 00:14:30,480 neighbors, I'm doing a lot of comparisons. 256 00:14:30,480 --> 00:14:30,980 Right? 257 00:14:30,980 --> 00:14:33,710 If I have a million training points 258 00:14:33,710 --> 00:14:37,550 I have to compare my example to all a million. 259 00:14:37,550 --> 00:14:41,100 So I have no real pre-processing overhead. 260 00:14:41,100 --> 00:14:43,460 But each time I need to do a classification, 261 00:14:43,460 --> 00:14:46,030 it takes a long time. 262 00:14:46,030 --> 00:14:48,700 Now there are better algorithms than brute force 263 00:14:48,700 --> 00:14:53,230 that give you approximate K nearest neighbors. 264 00:14:53,230 --> 00:14:56,760 But on the whole, it's still not fast. 265 00:14:56,760 --> 00:15:02,920 And we're not getting any information about what process 266 00:15:02,920 --> 00:15:06,210 might have generated the data. 267 00:15:06,210 --> 00:15:10,290 We don't have a model of the data in the way that, say, when 268 00:15:10,290 --> 00:15:13,680 we did our linear regression for curve fitting, 269 00:15:13,680 --> 00:15:18,280 we had a model for the data that sort of described the pattern. 270 00:15:18,280 --> 00:15:23,240 We don't get that out of K nearest neighbors. 271 00:15:23,240 --> 00:15:25,340 I'm going to show you a different approach where 272 00:15:25,340 --> 00:15:27,180 we do get that. 273 00:15:27,180 --> 00:15:29,540 And I'm going to do it on a more interesting example 274 00:15:29,540 --> 00:15:32,030 than reptiles. 275 00:15:32,030 --> 00:15:36,230 I apologize to those of you who are reptologists.
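A minimal sketch of the K-nearest-neighbors voting just discussed, plus the idea from the Q&A of picking K by holding out part of the training data. The distance function, the candidate K's, and the single 80/20 split are illustrative assumptions; the lecture suggests repeating such splits several times, and this is not the course's posted implementation.

    from collections import Counter

    def euclidean(v1, v2):
        """Euclidean distance between two equal-length feature vectors."""
        return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

    def k_nearest_label(features, training, k=3):
        """Majority vote among the k closest training examples.

        training is a list of (feature_vector, label) pairs; using an
        odd k helps avoid ties when there are two classes."""
        by_distance = sorted(training, key=lambda ex: euclidean(features, ex[0]))
        votes = Counter(label for _, label in by_distance[:k])
        return votes.most_common(1)[0][0]

    def choose_k(training, candidate_ks=(1, 3, 5, 7)):
        """Pick K by splitting the training data itself and scoring each
        candidate on the held-out part (a single split, for brevity)."""
        split = int(0.8 * len(training))
        train_part, held_out = training[:split], training[split:]
        def accuracy(k):
            correct = sum(k_nearest_label(f, train_part, k) == lab
                          for f, lab in held_out)
            return correct / len(held_out)
        return max(candidate_ks, key=accuracy)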
276 00:15:36,230 --> 00:15:40,160 So you probably all heard of the Titanic. 277 00:15:40,160 --> 00:15:43,670 There was a movie about it, I'm told. 278 00:15:43,670 --> 00:15:47,610 It was one of the great sea disasters of all time, 279 00:15:47,610 --> 00:15:50,300 a so-called unsinkable ship-- 280 00:15:50,300 --> 00:15:53,060 they had advertised it as unsinkable-- 281 00:15:53,060 --> 00:15:55,025 hit an iceberg and went down. 282 00:15:55,025 --> 00:15:58,760 Of the 1,300 passengers, 812 died. 283 00:15:58,760 --> 00:16:00,829 The crew did way worse. 284 00:16:00,829 --> 00:16:02,870 So at least it looks as if the crew was actually 285 00:16:02,870 --> 00:16:04,070 pretty heroic. 286 00:16:04,070 --> 00:16:06,530 They had a higher death rate. 287 00:16:06,530 --> 00:16:08,870 So we're going to use machine learning 288 00:16:08,870 --> 00:16:12,530 to see if we can predict which passengers survived. 289 00:16:15,940 --> 00:16:17,960 There's an online database I'm using. 290 00:16:17,960 --> 00:16:20,280 It doesn't have all 1,200 passengers, 291 00:16:20,280 --> 00:16:24,790 but it has information about 1,046 of them. 292 00:16:24,790 --> 00:16:27,220 For some of them they couldn't get the information. 293 00:16:27,220 --> 00:16:29,830 It says what cabin class they were in, first, second, 294 00:16:29,830 --> 00:16:33,760 or third, how old they were, and their gender. 295 00:16:33,760 --> 00:16:36,100 Also has their name and their home 296 00:16:36,100 --> 00:16:39,450 address and things, which I'm not using. 297 00:16:39,450 --> 00:16:42,990 We want to use these features to see 298 00:16:42,990 --> 00:16:46,020 if we can predict which passengers were 299 00:16:46,020 --> 00:16:50,030 going to survive the disaster. 300 00:16:50,030 --> 00:16:52,870 Well, the first question is something 301 00:16:52,870 --> 00:16:57,530 that Professor Grimson alluded to: is it OK 302 00:16:57,530 --> 00:16:58,940 just to look at accuracy? 303 00:16:58,940 --> 00:17:03,560 How are we going to evaluate our machine learning? 304 00:17:03,560 --> 00:17:04,329 And it's not. 305 00:17:04,329 --> 00:17:08,290 If we just predict died for everybody, well then 306 00:17:08,290 --> 00:17:14,319 we'll be 62% accurate for the passengers and 76% accurate 307 00:17:14,319 --> 00:17:16,270 for the crew members. 308 00:17:16,270 --> 00:17:18,760 Usually in machine learning, if you're 76% 309 00:17:18,760 --> 00:17:20,710 accurate you say that's not bad. 310 00:17:20,710 --> 00:17:25,329 Well, here I can get that just by predicting died. 311 00:17:25,329 --> 00:17:30,490 So whenever you have a class imbalance, that much more of one 312 00:17:30,490 --> 00:17:33,960 than the other, accuracy isn't a particularly meaningful 313 00:17:33,960 --> 00:17:34,460 measure. 314 00:17:37,340 --> 00:17:41,460 I discovered this early on in my work in the medical area. 315 00:17:41,460 --> 00:17:43,550 There are a lot of diseases that rarely occur, 316 00:17:43,550 --> 00:17:46,970 they occur in say 0.1% of the population. 317 00:17:46,970 --> 00:17:49,280 And I can build a great model for predicting it 318 00:17:49,280 --> 00:17:51,500 by just saying, no, you don't have 319 00:17:51,500 --> 00:17:57,170 it, which will be 99.9% accurate, but totally useless. 320 00:18:00,650 --> 00:18:02,810 Unfortunately, you do see people doing that sort 321 00:18:02,810 --> 00:18:04,200 of thing in the literature. 322 00:18:06,750 --> 00:18:10,710 You saw these in an earlier lecture, just to remind you, 323 00:18:10,710 --> 00:18:15,110 we're going to be looking at other metrics.
324 00:18:15,110 --> 00:18:18,870 Sensitivity, think of that as how good 325 00:18:18,870 --> 00:18:22,260 is it at identifying the positive cases. 326 00:18:22,260 --> 00:18:26,980 In this case, positive is going to be dead. 327 00:18:26,980 --> 00:18:33,110 How specific is it, and the positive predictive value. 328 00:18:33,110 --> 00:18:35,820 If we say somebody died, what's the probability 329 00:18:35,820 --> 00:18:38,172 that they really did? 330 00:18:38,172 --> 00:18:40,130 And then there's the negative predictive value. 331 00:18:40,130 --> 00:18:41,900 If we say they didn't die, what's 332 00:18:41,900 --> 00:18:43,430 the probability they didn't die? 333 00:18:46,380 --> 00:18:50,040 So these are four very common metrics. 334 00:18:50,040 --> 00:18:54,660 There is something called an F score that combines them, 335 00:18:54,660 --> 00:18:58,500 but I'm not going to be showing you that today. 336 00:18:58,500 --> 00:19:00,810 I will mention that in the literature, 337 00:19:00,810 --> 00:19:04,170 people often use the word recall to mean sensitivity, 338 00:19:04,170 --> 00:19:09,480 or sensitivity to mean recall, and specificity and precision 339 00:19:09,480 --> 00:19:12,160 are used pretty much interchangeably. 340 00:19:12,160 --> 00:19:16,080 So you might see various combinations of these words. 341 00:19:16,080 --> 00:19:18,840 Typically, people talk about recall and precision 342 00:19:18,840 --> 00:19:22,730 or sensitivity and specificity. 343 00:19:22,730 --> 00:19:24,400 Does that make sense, why we want 344 00:19:24,400 --> 00:19:27,010 to look at the measures other than accuracy? 345 00:19:27,010 --> 00:19:31,330 We will look at accuracy, too, and how they all tell us 346 00:19:31,330 --> 00:19:34,510 kind of different things, and how you might 347 00:19:34,510 --> 00:19:37,840 choose a different balance. 348 00:19:37,840 --> 00:19:42,550 For example, if I'm running a screening test, say 349 00:19:42,550 --> 00:19:47,600 for breast cancer, a mammogram, and trying 350 00:19:47,600 --> 00:19:49,310 to find the people who should go on 351 00:19:49,310 --> 00:19:52,580 for a more extensive examination, 352 00:19:52,580 --> 00:19:55,990 what do I want to emphasize here? 353 00:19:55,990 --> 00:19:58,610 Which of these is likely to be the most important? 354 00:20:02,050 --> 00:20:04,750 Or what would you care about most? 355 00:20:08,190 --> 00:20:10,830 Well, maybe I want sensitivity. 356 00:20:10,830 --> 00:20:15,390 Since I'm going to send this person on for future tests, 357 00:20:15,390 --> 00:20:19,760 I really don't want to miss somebody who has cancer, 358 00:20:19,760 --> 00:20:22,580 and so I might think sensitivity is 359 00:20:22,580 --> 00:20:27,460 more important than specificity in that particular case. 360 00:20:27,460 --> 00:20:30,720 On the other hand, if I'm deciding 361 00:20:30,720 --> 00:20:36,710 who is so sick I should do open heart surgery on them, 362 00:20:36,710 --> 00:20:39,860 maybe I want to be pretty specific. 363 00:20:39,860 --> 00:20:43,190 Because the risks of the surgery itself are very high. 364 00:20:43,190 --> 00:20:47,060 I don't want to do it on people who don't need it. 365 00:20:47,060 --> 00:20:51,530 So we end up having to choose a balance between these things, 366 00:20:51,530 --> 00:20:53,210 depending upon our application.
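A small sketch of the four metrics just described, computed from counts of true/false positives and negatives, with accuracy alongside for comparison. These are the standard textbook definitions; the function name is made up and this is not the course's getStats code, which also prints its results.

    def classifier_stats(true_pos, false_pos, true_neg, false_neg):
        """Standard definitions of the metrics discussed above.
        (Assumes each denominator is nonzero; a real version would guard that.)"""
        total       = true_pos + false_pos + true_neg + false_neg
        accuracy    = (true_pos + true_neg) / total
        sensitivity = true_pos / (true_pos + false_neg)   # a.k.a. recall
        specificity = true_neg / (true_neg + false_pos)
        pos_pred_val = true_pos / (true_pos + false_pos)  # a.k.a. precision
        neg_pred_val = true_neg / (true_neg + false_neg)
        return accuracy, sensitivity, specificity, pos_pred_val, neg_pred_val

    # Note how the imbalance problem shows up: a classifier that calls
    # everyone "died" can have decent accuracy and perfect sensitivity,
    # but its specificity is 0.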
367 00:20:57,160 --> 00:21:01,050 The other thing I want to talk about before actually building 368 00:21:01,050 --> 00:21:07,760 a classifier is how we test our classifier, 369 00:21:07,760 --> 00:21:09,870 because this is very important. 370 00:21:09,870 --> 00:21:13,190 I'm going to talk about two different methods, 371 00:21:13,190 --> 00:21:17,150 leave-one-out testing and repeated 372 00:21:17,150 --> 00:21:21,730 random subsampling. 373 00:21:21,730 --> 00:21:24,780 For leave one out, it's typically 374 00:21:24,780 --> 00:21:31,140 used when you have a small number of examples, 375 00:21:31,140 --> 00:21:34,200 so you want as much training data as possible 376 00:21:34,200 --> 00:21:36,730 as you build your model. 377 00:21:36,730 --> 00:21:41,680 So you take all of your n examples, remove one of them, 378 00:21:41,680 --> 00:21:45,850 train on n minus 1, test on the 1. 379 00:21:45,850 --> 00:21:49,450 Then you put that 1 back and remove another 1. 380 00:21:49,450 --> 00:21:53,110 Train on n minus 1, test on 1. 381 00:21:53,110 --> 00:21:56,284 And you do this for each element of the data, 382 00:21:56,284 --> 00:21:57,700 and then you average your results. 383 00:22:02,670 --> 00:22:05,490 Repeated random subsampling is done 384 00:22:05,490 --> 00:22:10,860 when you have a larger set of data, and there you might say 385 00:22:10,860 --> 00:22:13,730 split your data 80/20. 386 00:22:13,730 --> 00:22:20,130 Take 80% of the data to train on, test on the other 20. 387 00:22:20,130 --> 00:22:23,910 So this is very similar to what I talked about earlier, 388 00:22:23,910 --> 00:22:26,310 and answered the question about how 389 00:22:26,310 --> 00:22:32,340 to choose K. I haven't seen the future examples, 390 00:22:32,340 --> 00:22:35,930 but in order to believe in my model 391 00:22:35,930 --> 00:22:38,930 and, say, my parameter settings, I do this repeated 392 00:22:38,930 --> 00:22:44,090 random subsampling or leave one out, either one. 393 00:22:44,090 --> 00:22:45,680 There's the code for leave one out. 394 00:22:48,790 --> 00:22:51,340 Absolutely nothing interesting about it, 395 00:22:51,340 --> 00:22:54,670 so I'm not going to waste your time looking at it. 396 00:22:57,430 --> 00:23:04,110 Repeated random subsampling is a little more interesting. 397 00:23:04,110 --> 00:23:10,640 What I've done here is I first sample-- 398 00:23:10,640 --> 00:23:13,600 this one just splits it 80/20. 399 00:23:13,600 --> 00:23:15,810 It's not doing anything repeated, 400 00:23:15,810 --> 00:23:27,445 and I start by sampling 20% of the indices, not the samples. 401 00:23:30,040 --> 00:23:31,690 And I want to do that at random. 402 00:23:31,690 --> 00:23:33,655 I don't want to, say, get consecutive ones. 403 00:23:37,840 --> 00:23:42,050 So we do that, and then once I've got the indices, 404 00:23:42,050 --> 00:23:44,550 I just go through and assign each example 405 00:23:44,550 --> 00:23:50,560 to either test or training, and then return the two sets. 406 00:23:50,560 --> 00:23:54,500 But if I just sort of sampled one, 407 00:23:54,500 --> 00:23:56,480 then I'd have to do a more complicated thing 408 00:23:56,480 --> 00:23:57,740 to subtract it from the other. 409 00:23:57,740 --> 00:23:59,780 This is just efficiency. 410 00:23:59,780 --> 00:24:02,370 And then here's the-- 411 00:24:02,370 --> 00:24:04,640 sorry about the yellow there-- 412 00:24:04,640 --> 00:24:05,520 the random splits.
413 00:24:09,110 --> 00:24:10,820 Obviously, I was searching for results 414 00:24:10,820 --> 00:24:12,440 when I did my screen capture. 415 00:24:15,579 --> 00:24:17,620 I'm just going to, for i in range number of splits, 416 00:24:17,620 --> 00:24:19,617 I'm going to split it 80/20. 417 00:24:22,240 --> 00:24:26,550 It takes a parameter method, and that's interesting, 418 00:24:26,550 --> 00:24:29,800 and we'll see the ramifications of that later. 419 00:24:29,800 --> 00:24:32,520 That's going to be the machine learning method. 420 00:24:32,520 --> 00:24:35,850 We're going to compare KNN to another method called 421 00:24:35,850 --> 00:24:37,620 logistic regression. 422 00:24:37,620 --> 00:24:41,160 I didn't want to have to do this code 423 00:24:41,160 --> 00:24:45,260 twice, so I made the method itself a parameter. 424 00:24:45,260 --> 00:24:47,870 We'll see that introduces a slight complication, 425 00:24:47,870 --> 00:24:51,140 but we'll get to it when we get to it. 426 00:24:51,140 --> 00:24:54,090 So I split it, I apply whatever that method is 427 00:24:54,090 --> 00:25:01,040 to the training and the test set, I get the results, 428 00:25:01,040 --> 00:25:05,330 true positives, false positives, true negatives, false negatives. 429 00:25:05,330 --> 00:25:08,210 And then I call this thing get stats, 430 00:25:08,210 --> 00:25:11,300 but I'm dividing it by the number of splits, 431 00:25:11,300 --> 00:25:13,580 so that will give me the average number 432 00:25:13,580 --> 00:25:18,320 of true positives, the average number of false positives, etc. 433 00:25:18,320 --> 00:25:22,340 And then I'm just going to return the average. 434 00:25:22,340 --> 00:25:27,770 Get stats actually just prints a bunch of statistics for us. 435 00:25:27,770 --> 00:25:29,840 Any questions about the two methods, 436 00:25:29,840 --> 00:25:32,300 leave one out versus repeated random sampling? 437 00:25:38,690 --> 00:25:41,870 Let's try it for KNN on the Titanic. 438 00:25:45,120 --> 00:25:50,400 So I'm not going to show you the code for K nearest classify. 439 00:25:50,400 --> 00:25:53,160 It's in the code we uploaded. 440 00:25:53,160 --> 00:25:56,520 It takes four arguments: the training set, 441 00:25:56,520 --> 00:26:01,620 the test set, the label that we're trying to classify. 442 00:26:01,620 --> 00:26:03,270 Are we looking for the people who died? 443 00:26:03,270 --> 00:26:04,478 Or the people who didn't die? 444 00:26:04,478 --> 00:26:07,410 Are we looking for reptiles or not reptiles? 445 00:26:07,410 --> 00:26:09,240 Or, in the case where there were six labels, 446 00:26:09,240 --> 00:26:11,910 which one are we trying to detect? 447 00:26:11,910 --> 00:26:16,470 And K as in how many nearest neighbors? 448 00:26:16,470 --> 00:26:18,990 And then it returns the true positives, the false positives, 449 00:26:18,990 --> 00:26:20,970 the true negatives, and the false negatives. 450 00:26:26,440 --> 00:26:30,820 Then you'll recall we'd already looked at lambda 451 00:26:30,820 --> 00:26:32,950 in a different context. 452 00:26:32,950 --> 00:26:41,250 The issue here is K nearest classify takes four arguments, 453 00:26:41,250 --> 00:26:47,180 yet if we go back here, for example, to random splits, 454 00:26:47,180 --> 00:26:51,320 what we're seeing is I'm calling the method with only two 455 00:26:51,320 --> 00:26:53,640 arguments. 456 00:26:53,640 --> 00:26:56,910 Because after all, if I'm not doing K nearest neighbors, 457 00:26:56,910 --> 00:27:02,120 maybe I don't need to pass in K. I'm sure I don't.
458 00:27:02,120 --> 00:27:04,070 Different methods will take different numbers 459 00:27:04,070 --> 00:27:09,920 of parameters, and yet I want to use the same function here 460 00:27:09,920 --> 00:27:12,630 method. 461 00:27:12,630 --> 00:27:14,760 So the trick I use to get around that-- 462 00:27:14,760 --> 00:27:17,900 and this is a very common programming trick-- 463 00:27:17,900 --> 00:27:18,550 in math. 464 00:27:18,550 --> 00:27:22,380 It's called currying, after the mathematician Curry, 465 00:27:22,380 --> 00:27:25,990 not the Indian dish. 466 00:27:25,990 --> 00:27:30,520 I'm creating a function a new function called KNN. 467 00:27:30,520 --> 00:27:33,580 This will be a function of two arguments, the training 468 00:27:33,580 --> 00:27:36,070 set and the test set, and it will 469 00:27:36,070 --> 00:27:40,240 be K nearest classifier with training set and test 470 00:27:40,240 --> 00:27:46,970 set as variables, and two constants, survived-- 471 00:27:46,970 --> 00:27:48,890 so I'm going to predict who survived-- 472 00:27:48,890 --> 00:27:53,420 and 3, the K. 473 00:27:53,420 --> 00:27:56,450 I've been able to turn a function of four arguments, 474 00:27:56,450 --> 00:28:00,140 K nearest classify, into a function of two arguments 475 00:28:00,140 --> 00:28:05,570 KNN by using lambda abstraction. 476 00:28:05,570 --> 00:28:09,000 This is something that people do fairly frequently, 477 00:28:09,000 --> 00:28:12,690 because it lets you build much more general programs when 478 00:28:12,690 --> 00:28:16,030 you don't have to worry about the number of arguments. 479 00:28:16,030 --> 00:28:19,500 So it's a good trick to keeping your bag of tricks. 480 00:28:19,500 --> 00:28:23,110 Again, it's a trick we've used before. 481 00:28:23,110 --> 00:28:26,740 Then I've just chosen 10 for the number of splits, 482 00:28:26,740 --> 00:28:36,850 and we'll try it, and we'll try it for both methods of testing. 483 00:28:36,850 --> 00:28:38,990 Any questions before I run this code? 484 00:28:52,720 --> 00:28:53,309 So here it is. 485 00:28:53,309 --> 00:28:53,850 We'll run it. 486 00:28:59,470 --> 00:29:02,020 Well, I should learn how to spell finished, shouldn't I? 487 00:29:02,020 --> 00:29:03,050 But that's OK. 488 00:29:11,220 --> 00:29:16,680 Here we have the results, and they're-- 489 00:29:16,680 --> 00:29:18,780 well, what can we say about them? 490 00:29:18,780 --> 00:29:21,750 They're not much different to start with, 491 00:29:21,750 --> 00:29:24,630 so it doesn't appear that our testing methodology had 492 00:29:24,630 --> 00:29:29,640 much of a difference on how well the KNN worked, 493 00:29:29,640 --> 00:29:33,060 and that's actually kind of comforting. 494 00:29:33,060 --> 00:29:36,480 The accurate-- none of the evaluation criteria 495 00:29:36,480 --> 00:29:39,660 are radically different, so that's kind of good. 496 00:29:39,660 --> 00:29:42,880 We hoped that was true. 497 00:29:42,880 --> 00:29:45,390 The other thing to notice is that we're actually 498 00:29:45,390 --> 00:29:50,040 doing considerably better than just always predicting, say, 499 00:29:50,040 --> 00:29:50,745 didn't survive. 500 00:29:56,070 --> 00:29:59,750 We're doing better than a random prediction. 501 00:29:59,750 --> 00:30:01,617 Let's go back now to the Power Point. 502 00:30:08,075 --> 00:30:08,950 Here are the results. 503 00:30:08,950 --> 00:30:10,738 We don't need to study them anymore. 
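Before moving on from the KNN experiment, here is a sketch of the testing harness just described: sampling 20% of the indices for the test set, averaging a method's results over several random 80/20 splits, and currying the four-argument classifier down to the two arguments the harness expects. The structure follows what the slides describe, but the details are a reconstruction, not the posted code; kNearestClassify stands for the course's function of that name and is not defined here.

    import random

    def split_80_20(examples):
        """Randomly pick 20% of the indices for the test set."""
        test_indices = set(random.sample(range(len(examples)), len(examples) // 5))
        training_set, test_set = [], []
        for i, e in enumerate(examples):
            (test_set if i in test_indices else training_set).append(e)
        return training_set, test_set

    def random_splits(examples, method, num_splits):
        """Average true/false positives/negatives of `method` over random splits.

        `method` is any function of just (training_set, test_set) that returns
        (true_pos, false_pos, true_neg, false_neg)."""
        totals = [0, 0, 0, 0]
        for _ in range(num_splits):
            training_set, test_set = split_80_20(examples)
            for i, count in enumerate(method(training_set, test_set)):
                totals[i] += count
        return [t / num_splits for t in totals]

    # Currying with a lambda: fix the label ('Survived') and K (3) as constants,
    # leaving a two-argument function that random_splits can call.
    knn = lambda training_set, test_set: kNearestClassify(training_set, test_set,
                                                          'Survived', 3)
    # average_counts = random_splits(examples, knn, 10)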
504 00:30:14,020 --> 00:30:18,550 Better than 62% accuracy, but not much difference 505 00:30:18,550 --> 00:30:21,490 between the experiments. 506 00:30:21,490 --> 00:30:23,770 So that's one method. 507 00:30:23,770 --> 00:30:26,240 Now let's look at a different method, 508 00:30:26,240 --> 00:30:28,340 and this is probably the most common method 509 00:30:28,340 --> 00:30:30,290 used in machine learning. 510 00:30:30,290 --> 00:30:34,830 It's called logistic regression. 511 00:30:34,830 --> 00:30:37,800 It's, in some ways, if you look at it, similar 512 00:30:37,800 --> 00:30:40,200 to a linear regression, but different 513 00:30:40,200 --> 00:30:41,475 in some important ways. 514 00:30:44,490 --> 00:30:49,900 Linear regression, you will I'm sure recall, 515 00:30:49,900 --> 00:30:51,870 is designed to predict a real number. 516 00:30:54,920 --> 00:31:02,220 Now what we want here is a probability, so 517 00:31:02,220 --> 00:31:04,770 the probability of some event. 518 00:31:04,770 --> 00:31:07,140 We know that the dependent variable can only 519 00:31:07,140 --> 00:31:17,020 take on a finite set of values, so we want to predict survived 520 00:31:17,020 --> 00:31:18,820 or didn't survive. 521 00:31:18,820 --> 00:31:23,310 It's no good to say we predict this person half survived, 522 00:31:23,310 --> 00:31:25,480 you know survived, but is brain dead or something. 523 00:31:25,480 --> 00:31:27,040 I don't know. 524 00:31:27,040 --> 00:31:29,500 That's not what we're trying to do. 525 00:31:29,500 --> 00:31:33,370 The problem with just using regular linear regression 526 00:31:33,370 --> 00:31:37,240 is a lot of the time you get nonsense predictions. 527 00:31:37,240 --> 00:31:41,050 Now you can claim, OK 0.5 is there, 528 00:31:41,050 --> 00:31:44,860 and it means they have a half probability of dying, 529 00:31:44,860 --> 00:31:47,320 not that half died. 530 00:31:47,320 --> 00:31:49,900 But in fact, if you look at what goes on, 531 00:31:49,900 --> 00:31:54,740 you could get more than 1 or less than 0 532 00:31:54,740 --> 00:31:57,670 out of linear regression, and that's 533 00:31:57,670 --> 00:32:01,130 nonsense when we're talking about probabilities. 534 00:32:01,130 --> 00:32:06,520 So we need a different method, and that's logistic regression. 535 00:32:06,520 --> 00:32:10,420 What logistic regression does is it 536 00:32:10,420 --> 00:32:14,330 finds what are called the weights for each feature. 537 00:32:14,330 --> 00:32:17,710 You may recall I complained when Professor Grimson used 538 00:32:17,710 --> 00:32:22,450 the word weights to mean something somewhat different. 539 00:32:22,450 --> 00:32:27,640 We take each feature, for example the gender, the cabin 540 00:32:27,640 --> 00:32:37,114 class, the age, and compute for that feature a weight 541 00:32:37,114 --> 00:32:39,030 that we're going to use in making predictions. 542 00:32:39,030 --> 00:32:42,380 So think of the weights as corresponding 543 00:32:42,380 --> 00:32:46,410 to the coefficients we get when we do a linear regression. 544 00:32:46,410 --> 00:32:51,400 So we have now a coefficient associated with each variable. 545 00:32:51,400 --> 00:32:53,710 We're going to take those coefficients, 546 00:32:53,710 --> 00:32:56,710 add them up, multiply them by something, 547 00:32:56,710 --> 00:32:59,200 and make a prediction.
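The lecture deliberately skips the math, but as a sketch of what "take the coefficients and make a prediction" amounts to, here is the standard logistic (sigmoid) formula that squashes a weighted sum of the features into a number strictly between 0 and 1. This is the textbook form, not the course's code, and the names here are illustrative.

    import math

    def logistic_probability(weights, feature_vector, intercept=0.0):
        """Turn a weighted sum of features into a probability in (0, 1)
        using the standard logistic function 1 / (1 + e^(-z))."""
        z = intercept + sum(w * x for w, x in zip(weights, feature_vector))
        return 1 / (1 + math.exp(-z))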
548 00:32:59,200 --> 00:33:02,820 A positive weight implies-- 549 00:33:02,820 --> 00:33:04,870 and I'll come back to this later-- 550 00:33:04,870 --> 00:33:08,530 it almost implies that the variable is positively 551 00:33:08,530 --> 00:33:11,840 correlated with the outcome. 552 00:33:11,840 --> 00:33:18,030 So we would, for example, say that 553 00:33:18,030 --> 00:33:20,670 having scales is positively correlated 554 00:33:20,670 --> 00:33:24,100 with being a reptile. 555 00:33:24,100 --> 00:33:27,820 A negative weight implies that the variable is negatively 556 00:33:27,820 --> 00:33:32,650 correlated with the outcome, so number of legs 557 00:33:32,650 --> 00:33:34,840 might have a negative weight. 558 00:33:34,840 --> 00:33:37,330 The more legs an animal has, the less likely 559 00:33:37,330 --> 00:33:40,150 it is to be a reptile. 560 00:33:40,150 --> 00:33:47,020 It's not absolute, it's just a correlation. 561 00:33:47,020 --> 00:33:49,390 The absolute magnitude is related 562 00:33:49,390 --> 00:33:52,230 to the strength of the correlation, 563 00:33:52,230 --> 00:33:54,230 so if it's big positive it means 564 00:33:54,230 --> 00:33:55,970 it's a really strong indicator. 565 00:33:55,970 --> 00:33:58,460 If it's big negative, it's a really strong 566 00:33:58,460 --> 00:33:59,893 negative indicator. 567 00:34:04,150 --> 00:34:07,960 And then we use an optimization process 568 00:34:07,960 --> 00:34:11,949 to compute these weights from the training data. 569 00:34:11,949 --> 00:34:13,659 It's a little bit complex. 570 00:34:13,659 --> 00:34:17,110 Its key is the way it uses the log function, hence 571 00:34:17,110 --> 00:34:21,610 the name logistic, but I'm not going to make you look at it. 572 00:34:24,270 --> 00:34:28,090 But I will show you how to use it. 573 00:34:28,090 --> 00:34:31,805 You start by importing something called sklearn.linear_model. 574 00:34:35,139 --> 00:34:42,300 Sklearn is a Python library, and in that is a class 575 00:34:42,300 --> 00:34:44,440 called logistic regression. 576 00:34:44,440 --> 00:34:47,330 It's the name of a class, and here are 577 00:34:47,330 --> 00:34:50,760 three methods of that class. 578 00:34:50,760 --> 00:34:56,610 Fit, which takes a sequence of feature vectors 579 00:34:56,610 --> 00:34:59,640 and a sequence of labels and returns 580 00:34:59,640 --> 00:35:05,180 an object of type logistic regression. 581 00:35:05,180 --> 00:35:09,960 So this is the place where the optimization is done. 582 00:35:09,960 --> 00:35:13,230 Now all the examples I'm going to show you, 583 00:35:13,230 --> 00:35:17,500 these two sequences will be-- 584 00:35:17,500 --> 00:35:18,220 well all right. 585 00:35:18,220 --> 00:35:20,860 So think of this as the sequence of feature vectors, 586 00:35:20,860 --> 00:35:25,870 one per passenger, and the labels associated with those. 587 00:35:25,870 --> 00:35:28,050 So this and this have to be the same length. 588 00:35:33,280 --> 00:35:37,470 That produces an object of this type, 589 00:35:37,470 --> 00:35:41,990 and then I can ask for the coefficients, which 590 00:35:41,990 --> 00:35:47,350 will return the weight of each variable, each feature. 591 00:35:47,350 --> 00:35:51,320 And then I can make a prediction, 592 00:35:51,320 --> 00:35:55,040 given a feature vector, return the probabilities 593 00:35:55,040 --> 00:35:59,120 of different labels. 594 00:35:59,120 --> 00:36:02,550 Let's look at it as an example. 595 00:36:02,550 --> 00:36:03,980 So first let's build the model.
596 00:36:06,870 --> 00:36:09,690 To build the model, we'll take the examples, the training 597 00:36:09,690 --> 00:36:13,410 data, and a flag that says whether we're going to print something. 598 00:36:13,410 --> 00:36:15,600 You'll notice from this slide I've 599 00:36:15,600 --> 00:36:18,020 elided the printed stuff. 600 00:36:18,020 --> 00:36:22,090 We'll come back in a later slide and look at what's in there. 601 00:36:22,090 --> 00:36:24,980 But for now I want to focus on actually building the model. 602 00:36:28,160 --> 00:36:32,270 I need to create two vectors, two lists in this case, 603 00:36:32,270 --> 00:36:34,940 the feature vectors and the labels. 604 00:36:34,940 --> 00:36:36,695 For e in examples, featureVecs.append(e.getFeatures()), 605 00:36:36,695 --> 00:36:40,870 labels.append(e.getLabel()). 606 00:36:40,870 --> 00:36:45,830 Couldn't be much simpler than that. 607 00:36:45,830 --> 00:36:50,360 Then, just because it wouldn't fit on a line on my slide, 608 00:36:50,360 --> 00:36:52,700 I've created this identifier called 609 00:36:52,700 --> 00:36:56,495 logistic regression, which is sklearn.linear_model 610 00:36:56,495 --> 00:37:00,010 .LogisticRegression. 611 00:37:00,010 --> 00:37:04,340 So this is the thing I imported, and this is a class, 612 00:37:04,340 --> 00:37:06,890 and now I'll get a model by first 613 00:37:06,890 --> 00:37:10,670 creating an instance of the class, logistic regression. 614 00:37:10,670 --> 00:37:13,070 Here I'm getting an instance, and then I'll 615 00:37:13,070 --> 00:37:16,730 call dot fit with that instance, passing 616 00:37:16,730 --> 00:37:19,410 it feature vecs and labels. 617 00:37:19,410 --> 00:37:21,660 I now have built a logistic regression 618 00:37:21,660 --> 00:37:25,260 model, which is simply a set of weights 619 00:37:25,260 --> 00:37:27,507 for each of the variables. 620 00:37:27,507 --> 00:37:28,215 This makes sense? 621 00:37:32,770 --> 00:37:35,590 Now we're going to apply the model, 622 00:37:35,590 --> 00:37:39,040 and I think this is the last piece of Python 623 00:37:39,040 --> 00:37:42,130 I'm going to introduce this semester, in case you're 624 00:37:42,130 --> 00:37:44,620 tired of learning about Python. 625 00:37:44,620 --> 00:37:48,050 And this is, at last, list comprehension. 626 00:37:48,050 --> 00:37:53,140 This is how I'm going to build my set of test feature vectors. 627 00:37:53,140 --> 00:37:56,470 So before we go and look at the code, 628 00:37:56,470 --> 00:38:00,690 let's look at how list comprehension works. 629 00:38:00,690 --> 00:38:04,380 In its simplest form, it says some expression 630 00:38:04,380 --> 00:38:06,840 for some identifier in some list, 631 00:38:06,840 --> 00:38:14,235 L. It creates a new list by evaluating this expression len(L) 632 00:38:14,235 --> 00:38:19,860 times with the ID in the expression replaced 633 00:38:19,860 --> 00:38:23,400 by each element of the list L. So let's 634 00:38:23,400 --> 00:38:25,500 look at a simple example. 635 00:38:25,500 --> 00:38:32,150 Here I'm saying L equals x times x for x in range 10. 636 00:38:32,150 --> 00:38:34,020 What's that going to do? 637 00:38:34,020 --> 00:38:37,654 It's going to, essentially, create a list. 638 00:38:37,654 --> 00:38:39,070 Think of it as a list, or at least 639 00:38:39,070 --> 00:38:43,620 a sequence of values, a range type actually in Python 3-- 640 00:38:43,620 --> 00:38:47,200 of values 0 to 9.
641 00:38:47,200 --> 00:38:51,080 It will then create a list of length 10, where 642 00:38:51,080 --> 00:38:54,260 the first element is going to be 0 times 0. 643 00:38:54,260 --> 00:38:58,630 The second element 1 times 1, etc. 644 00:38:58,630 --> 00:38:59,820 OK? 645 00:38:59,820 --> 00:39:01,560 So it's a simple way for me to create 646 00:39:01,560 --> 00:39:05,030 a list that looks like that. 647 00:39:05,030 --> 00:39:12,800 I can be fancier and say L equals x times x for x 648 00:39:12,800 --> 00:39:15,810 in range 10, and I add an if: 649 00:39:15,810 --> 00:39:20,080 if x mod 2 is equal to 0. 650 00:39:20,080 --> 00:39:22,540 Now instead of returning all-- 651 00:39:22,540 --> 00:39:25,880 building a list using each value in range 10, 652 00:39:25,880 --> 00:39:29,754 it will use only those values that satisfy that test. 653 00:39:34,880 --> 00:39:37,220 We can go look at what happens when we run that code. 654 00:39:51,700 --> 00:39:54,610 You can see the first list is 1 times 1, 2 times 655 00:39:54,610 --> 00:39:57,100 2, et cetera, and the second list 656 00:39:57,100 --> 00:40:00,820 is much shorter, because I'm only squaring even numbers. 657 00:40:07,060 --> 00:40:09,280 Well, you can see that list comprehension gives us 658 00:40:09,280 --> 00:40:13,940 a convenient compact way to do certain kinds of things. 659 00:40:13,940 --> 00:40:19,460 Like lambda expressions, they're easy to misuse. 660 00:40:19,460 --> 00:40:22,220 I hate reading code where I have list comprehensions that 661 00:40:22,220 --> 00:40:26,060 go over multiple lines on my screen, for example. 662 00:40:26,060 --> 00:40:29,750 So I use it quite a lot for small things like this. 663 00:40:29,750 --> 00:40:33,110 If it's very large, I find another way to do it. 664 00:40:48,410 --> 00:40:49,720 Now we can move forward. 665 00:40:58,790 --> 00:41:03,480 In applying the model, I first build my testing feature 666 00:41:03,480 --> 00:41:07,160 vecs, my e.getFeatures for e in test set, 667 00:41:07,160 --> 00:41:09,290 so that will give me the features associated 668 00:41:09,290 --> 00:41:11,570 with each element in the test set. 669 00:41:11,570 --> 00:41:14,930 I could obviously have written a for loop to do the same thing, 670 00:41:14,930 --> 00:41:18,250 but this was just a little cooler. 671 00:41:18,250 --> 00:41:22,690 Then we get model.predict for each of these. 672 00:41:22,690 --> 00:41:28,120 Model.predict_proba is nice in that I don't have to predict it 673 00:41:28,120 --> 00:41:30,340 for one example at a time. 674 00:41:30,340 --> 00:41:33,880 I can pass it a set of examples, and what I get back 675 00:41:33,880 --> 00:41:42,890 is a list of predictions, so that's just convenient. 676 00:41:42,890 --> 00:41:50,420 And then setting these to 0, and for i in range len of probs, 677 00:41:50,420 --> 00:41:53,280 with, here, a probability of 0.5. 678 00:41:53,280 --> 00:42:00,200 What that's saying is what I get out of logistic regression 679 00:42:00,200 --> 00:42:04,570 is a probability of something having a label. 680 00:42:04,570 --> 00:42:08,950 I then have to build a classifier by giving a threshold. 681 00:42:08,950 --> 00:42:11,650 And here what I've said, if the probability of it being true 682 00:42:11,650 --> 00:42:14,890 is over 0.5, call it true. 683 00:42:14,890 --> 00:42:17,650 So if the probability of survival is over 0.5, 684 00:42:17,650 --> 00:42:19,030 call it survived. 685 00:42:19,030 --> 00:42:22,600 If it's below, call it not survived.
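Putting the last few slides together, here is a sketch of the apply-model step just described: a list comprehension to gather the test feature vectors, predict_proba to get the probabilities, and a 0.5 cutoff to turn them into labels and tally the four counts. The getFeatures/getLabel method names follow the lecture's examples; the rest is a reconstruction, not the posted code.

    def apply_model(model, test_set, label, prob=0.5):
        """Count true/false positives/negatives for `model` on `test_set`."""
        # list comprehension: one feature vector per test example
        test_feature_vecs = [e.getFeatures() for e in test_set]
        # predict_proba returns one row of class probabilities per example
        probs = model.predict_proba(test_feature_vecs)
        true_pos, false_pos, true_neg, false_neg = 0, 0, 0, 0
        for i in range(len(probs)):
            # column 1 is assumed to be the probability of `label`, i.e. that
            # `label` is the second entry of model.classes_
            if probs[i][1] > prob:
                if test_set[i].getLabel() == label:
                    true_pos += 1
                else:
                    false_pos += 1
            else:
                if test_set[i].getLabel() != label:
                    true_neg += 1
                else:
                    false_neg += 1
        return true_pos, false_pos, true_neg, false_neg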
686 00:42:22,600 --> 00:42:27,430 We'll later see that, again, setting that probability 687 00:42:27,430 --> 00:42:31,630 is itself an interesting thing, but the default in most systems 688 00:42:31,630 --> 00:42:34,390 is half, for obvious reasons. 689 00:42:38,280 --> 00:42:41,970 I get my probabilities for each feature vector, 690 00:42:41,970 --> 00:42:44,820 and then for i in range len of probabilities, 691 00:42:44,820 --> 00:42:48,840 I'm just testing whether the predicted label is 692 00:42:48,840 --> 00:42:54,000 the same as the actual label, and updating true positives, 693 00:42:54,000 --> 00:42:56,940 false positives, true negatives, and false negatives 694 00:42:56,940 --> 00:42:59,518 accordingly. 695 00:42:59,518 --> 00:43:00,492 So far, so good? 696 00:43:05,860 --> 00:43:09,200 All right, let's put it all together. 697 00:43:09,200 --> 00:43:13,225 I'm defining something called LR, for logistic regression. 698 00:43:13,225 --> 00:43:17,720 It takes the training data, the test data, the probability, 699 00:43:17,720 --> 00:43:21,810 it builds a model, and then it gets the results 700 00:43:21,810 --> 00:43:24,520 by calling apply model with the label survived 701 00:43:24,520 --> 00:43:27,840 and whatever this prob was. 702 00:43:27,840 --> 00:43:30,430 Again, we'll do it for both leave one out 703 00:43:30,430 --> 00:43:34,950 and random splits, and again for 10 random splits. 704 00:44:03,790 --> 00:44:05,820 You'll notice it actually runs-- 705 00:44:05,820 --> 00:44:10,700 maybe you won't notice, but it does run faster than KNN. 706 00:44:10,700 --> 00:44:13,460 One of the nice things about logistic regression 707 00:44:13,460 --> 00:44:16,010 is building the model takes a while, 708 00:44:16,010 --> 00:44:18,590 but once you've got the model, applying it 709 00:44:18,590 --> 00:44:23,660 to a large number of variables-- feature vectors is fast. 710 00:44:23,660 --> 00:44:25,940 It's independent of the number of training examples, 711 00:44:25,940 --> 00:44:29,000 because we've got our weights. 712 00:44:29,000 --> 00:44:32,450 So solving the optimization problem, getting the weights, 713 00:44:32,450 --> 00:44:35,180 depends upon the number of training examples. 714 00:44:35,180 --> 00:44:39,350 Once we've got the weights, it's just evaluating a polynomial. 715 00:44:39,350 --> 00:44:42,986 It's very fast, so that's a nice advantage. 716 00:44:46,720 --> 00:44:47,595 If we look at those-- 717 00:44:55,170 --> 00:44:59,290 and we should probably compare them to our earlier KNN 718 00:44:59,290 --> 00:45:04,560 results, so KNN on the left, logistic regression 719 00:45:04,560 --> 00:45:06,290 on the right. 720 00:45:06,290 --> 00:45:12,000 And I guess if I look at it, it looks like logistic regression 721 00:45:12,000 --> 00:45:13,100 did a little bit better. 722 00:45:18,100 --> 00:45:20,580 That's not guaranteed, but it often 723 00:45:20,580 --> 00:45:25,172 does outperform because it's more subtle in what it does, 724 00:45:25,172 --> 00:45:26,880 in being able to assign different weights 725 00:45:26,880 --> 00:45:30,330 to different variables. 726 00:45:30,330 --> 00:45:31,400 It's a little bit better. 727 00:45:31,400 --> 00:45:36,800 That's probably a good thing, but there's 728 00:45:36,800 --> 00:45:40,040 another reason that's really important that people prefer 729 00:45:40,040 --> 00:45:42,680 logistic regression: it provides 730 00:45:42,680 --> 00:45:46,570 insights about the variables.
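To make the sklearn calls concrete before looking at the Titanic weights, here is a tiny self-contained example with made-up feature vectors and labels. Only fit, classes_, coef_, and predict_proba are the real library API; the data and its three features are invented toy values, not the Titanic encoding.

    import sklearn.linear_model

    # Toy data: three made-up features per example, two labels.
    feature_vecs = [[1, 0, 25], [0, 1, 60], [1, 1, 30],
                    [0, 0, 50], [1, 0, 18], [0, 1, 45]]
    labels = ['Survived', 'Died', 'Survived', 'Died', 'Survived', 'Died']

    # Build the model: create an instance of the class, then fit it.
    model = sklearn.linear_model.LogisticRegression()
    model = model.fit(feature_vecs, labels)

    print(model.classes_)                      # the labels the model learned
    print(model.coef_)                         # one weight per feature (one row for the binary case)
    print(model.predict_proba([[1, 0, 40]]))   # class probabilities for a new feature vector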
731 00:45:46,570 --> 00:45:48,245 We can look at the feature weights. 732 00:45:51,100 --> 00:45:56,130 This code does that, so remember we looked at build model 733 00:45:56,130 --> 00:45:58,390 and I left out the printing? 734 00:45:58,390 --> 00:46:01,630 Well here I'm leaving out everything except the printing. 735 00:46:01,630 --> 00:46:04,900 Same function, but leaving out everything except the printing. 736 00:46:07,410 --> 00:46:10,250 We can do model dot classes underbar, 737 00:46:10,250 --> 00:46:16,110 so model.classes_ gives you the classes. 738 00:46:16,110 --> 00:46:19,707 In this case, the classes are survived, didn't survive. 739 00:46:19,707 --> 00:46:20,790 I forget what I called it. 740 00:46:20,790 --> 00:46:22,200 We'll see. 741 00:46:22,200 --> 00:46:24,270 So I can see what the classes it's using 742 00:46:24,270 --> 00:46:30,510 are, and then for i in range len model dot coef underbar, 743 00:46:30,510 --> 00:46:32,910 these are giving the weights of each variable. 744 00:46:32,910 --> 00:46:36,656 The coefficients, I can print what they are. 745 00:46:39,530 --> 00:46:41,450 So let's run that and see what we get. 746 00:46:47,890 --> 00:46:50,940 We get a syntax error because I turned a comment 747 00:46:50,940 --> 00:46:51,940 into a line of code. 748 00:47:03,320 --> 00:47:08,650 Our model classes are died and survived, 749 00:47:08,650 --> 00:47:12,460 and for label survived-- 750 00:47:12,460 --> 00:47:15,100 what I've done, by the way, in the representation 751 00:47:15,100 --> 00:47:18,820 is I represented the cabin class as a binary variable. 752 00:47:18,820 --> 00:47:22,600 It's either 0 or 1, because it doesn't make sense 753 00:47:22,600 --> 00:47:26,896 to treat them as if they were really numbers because we don't 754 00:47:26,896 --> 00:47:28,270 know, for example, that the difference 755 00:47:28,270 --> 00:47:31,030 between first and second is the same as the difference 756 00:47:31,030 --> 00:47:33,050 between second and third. 757 00:47:33,050 --> 00:47:35,570 If we treated the class as a number-- we just said cabin class 758 00:47:35,570 --> 00:47:39,610 and used an integer-- implicitly the learning algorithm 759 00:47:39,610 --> 00:47:42,250 is going to assume that the difference between 1 and 2 760 00:47:42,250 --> 00:47:44,770 is the same as between 2 and 3. 761 00:47:44,770 --> 00:47:47,320 If you, for example, look at the prices of these cabins, 762 00:47:47,320 --> 00:47:50,690 you'll see that that's not true. 763 00:47:50,690 --> 00:47:53,120 The difference in an airplane between economy plus 764 00:47:53,120 --> 00:47:58,040 and economy is way smaller than between economy plus and first. 765 00:47:58,040 --> 00:48:00,840 Same thing on the Titanic. 766 00:48:00,840 --> 00:48:06,060 But what we see here is that for the label survived, 767 00:48:06,060 --> 00:48:08,340 pretty good sized positive weight 768 00:48:08,340 --> 00:48:10,320 for being in a first class cabin. 769 00:48:13,000 --> 00:48:14,560 Moderate for being in the second, 770 00:48:14,560 --> 00:48:18,130 and if you're in the third class, well, tough luck. 771 00:48:18,130 --> 00:48:20,590 So what we see here is that rich people did better 772 00:48:20,590 --> 00:48:22,180 than the poor people. 773 00:48:22,180 --> 00:48:25,135 Shocking. 774 00:48:25,135 --> 00:48:29,820 If we look at age, we'll see it's negatively correlated. 775 00:48:29,820 --> 00:48:32,010 What does this mean?
776 00:48:32,010 --> 00:48:34,110 It's not a huge weight, but it basically 777 00:48:34,110 --> 00:48:39,780 says that if you're older, the bigger your age, 778 00:48:39,780 --> 00:48:44,770 the less likely you are to have survived the disaster. 779 00:48:44,770 --> 00:48:47,860 And finally, it says it's really bad 780 00:48:47,860 --> 00:48:52,330 to be a male, that the men-- 781 00:48:52,330 --> 00:48:57,040 being a male was very negatively correlated with surviving. 782 00:48:57,040 --> 00:49:01,060 A nice thing we see here is we get these labels, which 783 00:49:01,060 --> 00:49:03,040 we can make sense of. 784 00:49:03,040 --> 00:49:05,080 One more slide and then I'm done. 785 00:49:09,890 --> 00:49:11,910 These values are slightly different, 786 00:49:11,910 --> 00:49:15,270 because of different randomization, a different example, 787 00:49:15,270 --> 00:49:17,820 but the main point I want to say is 788 00:49:17,820 --> 00:49:19,830 you have to be a little bit wary of reading 789 00:49:19,830 --> 00:49:22,290 too much into these weights. 790 00:49:22,290 --> 00:49:26,220 Because, not in this example, but in other examples-- 791 00:49:26,220 --> 00:49:30,580 well, also in these-- features are often correlated, 792 00:49:30,580 --> 00:49:36,210 and if they're correlated, you run-- 793 00:49:36,210 --> 00:49:37,620 actually it's 3:56. 794 00:49:37,620 --> 00:49:40,590 I'm going to explain the problem with this on Monday 795 00:49:40,590 --> 00:49:42,900 when I have time to do it properly. 796 00:49:42,900 --> 00:49:45,440 So I'll see you then.