The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIC GRIMSON: OK. Welcome back. You know, it's that time of term when we're all kind of doing this. So let me see if I can get a few smiles by simply noting to you that two weeks from today is the last class. Should be worth at least a little bit of a smile, right? Professor Guttag is smiling. He likes that idea. You're almost there.

What have we been doing for the last couple of lectures? We've been talking about linear regression. And I just want to remind you, this was the idea that I have some experimental data, the case of a spring where I put different weights on and measure the displacements. And regression was giving us a way of deducing a model to fit that data. In some cases it was easy. We knew, for example, it was going to be a linear model, and we found the best line that would fit that data. In some cases, we said we could use validation to actually let us explore to find the best model that would fit it, whether linear, quadratic, cubic, or some higher-order thing. So we were using that to deduce something about a model.

That's a nice segue into the topic for the next three lectures, the last big topic of the class, which is machine learning. And I'm going to argue, you can debate whether that's actually an example of learning, but it has many of the elements that we want to talk about when we talk about machine learning.

So as always, there's a reading assignment. Chapter 22 of the book gives you a good start on this, and we'll follow up with other pieces. And I want to start by basically outlining what we're going to do. And I'm going to begin by saying, as I'm sure you're aware, this is a huge topic.
I've listed just five subjects in Course 6 that all focus on machine learning. And that doesn't include other subjects where learning is a central part. So natural language processing, computational biology, computer vision, and robotics all rely heavily today on machine learning, and you'll see it in those subjects as well.

So we're not going to compress five subjects into three lectures. But what we are going to do is give you the introduction. We're going to start by talking about the basic concepts of machine learning: the idea of having examples, how you talk about features representing those examples, how you measure distances between them, and how you use the notion of distance to try and group similar things together as a way of doing machine learning.

And we're going to look, as a consequence, at two different standard ways of doing learning. One we call classification methods; an example we're going to see is something called "k nearest neighbor." The second class is called clustering methods. Classification works well when I have what we would call labeled data. I know labels on my examples, and I'm going to use that to try and define classes that I can learn. Clustering works well when I don't have labeled data. We'll see what that means in a couple of minutes. But we're going to give you an early view of this.

Unless Professor Guttag changes his mind, we're probably not going to show you the current really sophisticated machine learning methods like convolutional neural nets or deep learning, things you'll read about in the news. But you're going to get a sense of what's behind those by looking at what we do when we talk about learning algorithms.

Before I do that, I want to point out to you just how prevalent this is. And I'm going to admit, with my gray hair, I started working in AI in 1975, when machine learning was a pretty simple thing to do.
And it's been fascinating to watch the change over 40 years. And if you think about it, just think about where you see it.

AlphaGo, a machine learning-based system from Google that beat a world-class Go player. Chess had already been conquered by computers for a while; Go now belongs to computers too. The best Go players in the world are computers.

I'm sure many of you use Netflix. Any recommendation system, Netflix, Amazon, pick your favorite, uses a machine learning algorithm to suggest things for you. And in fact, you've probably seen it on Google, right? The ads that pop up on Google are coming from a machine learning algorithm that's looking at your preferences. Scary thought.

Drug discovery. Character recognition: the post office does character recognition of handwritten characters using a machine learning algorithm with a computer vision system behind it.

You probably don't know this company. It's actually an MIT spin-off called Two Sigma; it's a hedge fund in New York. They heavily use AI and machine learning techniques, and two years ago their fund returned 56%. I wish I'd invested in the fund. I don't have the kinds of millions you need, but that's an impressive return, 56% on your money in one year. Last year they didn't do quite as well, but they do extremely well using machine learning techniques.

Siri. Another great MIT company called Mobileye that does computer vision systems with a heavy machine learning component, used in assistive driving and eventually in completely autonomous driving. It will do things like kick in your brakes if you're closing too fast on the car in front of you, which is going to be really bad for me because I drive like a Bostonian, and it would be kicking in constantly.

Face recognition. Facebook uses this, as do many other systems, to both detect and recognize faces. IBM Watson, cancer diagnosis.
These are all just examples of machine learning being used everywhere. And it really is. I've only picked nine.

So what is it? I'm going to make an obnoxious statement; you're now used to that. I'm going to claim that you could argue that almost every computer program learns something. But the level of learning really varies a lot.

If you think back to the first lecture in 6.0001, we showed you Newton's method for computing square roots. And you could argue, you'd have to stretch it, but you could argue that that method learns something about how to compute square roots. In fact, you could generalize it to roots of any power. But it really didn't learn. I really had to program it.

All right, think about last week when we talked about linear regression. Now it starts to feel a little bit more like a learning algorithm. Because what did we do? We gave you a set of data points, mass-displacement data points. And then we showed you how the computer could essentially fit a curve to those data points. And it was, in some sense, learning a model for that data that it could then use to predict behavior in other situations. And that's getting closer to what we would like when we think about a machine learning algorithm. We'd like to have a program that can learn from experience, something that it can then use to deduce new facts.

Now this has been a problem in AI for a very long time. And I love this quote. It's from a gentleman named Art Samuel, from 1959, in which he gives his definition of machine learning: the field of study that gives computers the ability to learn without being explicitly programmed. And I think many people would argue he wrote the first such program. It learned from experience. In his case, it played checkers. It kind of shows you how the field has progressed: we started with checkers, we got to chess, we now do Go.
But it played checkers. It beat national-level players. Most importantly, it learned to improve its methods by watching how it did in games and then inferring something to change what it thought about as it did that. Samuel did a bunch of other things; I've just highlighted one. You may see in a follow-on course that he invented what's called alpha-beta pruning, which is a really useful technique for doing search.

But the idea is: how can we have the computer learn without being explicitly programmed? And one way to think about this is to think about the difference between how we would normally program and what we would like from a machine learning algorithm.

Normal programming, I know you're not convinced there's such a thing as normal programming, but if you think of traditional programming, what's the process? I write a program that I input to the computer so that it can then take data and produce some appropriate output. And the square root finder really fits there, right? I wrote code using Newton's method to find a square root, and then it gave me a process that, given any number, would give me its square root.

But if you think about what we did last time, it was a little different. In fact, in a machine learning approach, the idea is that I'm going to give the computer output. I'm going to give it examples of what I want the program to do, labels on data, characterizations of different classes of things. And what I want the computer to do is, given that characterization of output and data, I want the machine learning algorithm to actually produce for me a program, a program that I can then use to infer new information about things. And that creates, if you like, a really nice loop where I can have the machine learning algorithm learn a program, which I can then use to solve some other problem. That would be really great if we could do it.
And as I suggested, that curve-fitting algorithm is a simple version of that. It learned a model for the data, which I could then use to label other instances of the data or to predict what I would see in terms of spring displacement as I changed the masses. So that's the kind of idea we're going to explore.

If we want to learn things, we could also ask, how do you learn? And how should a computer learn? Well, for you as a human, there are a couple of possibilities. This is the boring one. This is the old-style way of doing it, right? Memorize facts. Memorize as many facts as you can and hope that we ask you on the final exam about instances of those facts, as opposed to some other facts you haven't memorized. This is, if you think way back to the first lecture, an example of declarative knowledge, statements of truth. Memorize as many as you can. Have Wikipedia in your back pocket.

A better way to learn is to be able to infer, to deduce new information from old. And if you think about this, this gets closer to what we called imperative knowledge, ways to deduce new things. Now, in the first case, we built that in when we wrote that program to do square roots. But what we'd like in a learning algorithm is something much more like that generalization idea. We're interested in extending our capabilities to write programs that can infer useful information from implicit patterns in the data. So not something explicitly built in, like that comparison of weights and displacements, but actually implicit patterns in the data, and have the algorithm figure out what those patterns are and use them to generate a program you can use to infer new data about objects, about spring displacements, whatever it is you're trying to do.

OK. So the idea then, the basic paradigm that we're going to see, is that we're going to give the system some training data, some observations. We did that last time with just the spring displacements.
We're then going to try and have a way to figure out how we write code, how we write a program, a system, that will infer something about the process that generated the data. And then, from that, we want to be able to use it to make predictions about things we haven't seen before.

So again, I want to drive home this point. If you think about it, the spring example fit that model. I gave you a set of data, spring displacements relative to the masses. For different masses, how far did the spring move? I then inferred something about the underlying process. In the first case, I said I know it's linear, but let me figure out what the actual linear equation is, what the spring constant associated with it is. And based on that result, I got a piece of code I could use to predict new displacements. So it's got all of those elements: training data, an inference engine, and then the ability to use that to make new predictions.

But that's a very simple kind of learning setting. The more common one is the one I'm going to use as an example, which is when I give you a set of examples, and those examples have some data associated with them, some features, and some labels. For each example, I might say this one is a particular kind of thing, and this other one is another kind of thing. And what I want to do is figure out how to do inference on labeling new things. So it's not just, what's the displacement of the mass; it's actually a label.

And I'm going to use one of my favorite examples. I'm a big New England Patriots fan; if you're not, my apologies. But I'm going to use football players. So I'm going to show you in a second, I'm going to give you a set of examples of football players. The label is the position they play. And the data, well, it could be lots of things. We're going to use height and weight.
But what we want to do is then see how we would come up with a way of characterizing the implicit pattern of how weight and height predict the kind of position this player could play, and then come up with an algorithm that will predict the position of new players. We'll do the draft for next year. Where do we want them to play?

That's the paradigm. A set of observations, potentially labeled, potentially not. Think about how we do inference to find a model. And then think about how we use that model to make predictions.

What we're going to see, and we're going to see multiple examples today, is that that learning can be done in one of two very broad ways. The first one is called supervised learning. In that case, for every example I give you as part of the training data, I have a label on it. I know the kind of thing it is. And what I'm going to do is look for how I find a rule that would predict the label associated with unseen input, based on those examples. It's supervised because I know what the labeling is.

The second kind, if this is supervised, the obvious other one is called unsupervised. In that case, I'm just going to give you a bunch of examples, but I don't know the labels associated with them. I'm going to just try and find what the natural ways are to group those examples together into different models. In some cases, I may know how many models there are; in some cases, I may want to just say what's the best grouping I can find.

OK. What I'm going to do today is not a lot of code. I was expecting cheers for that, John, but I didn't get them. Not a lot of code. What I'm going to do is show you, basically, the intuitions behind doing this learning. And I'm going to start with my New England Patriots example. So here are some data points about current Patriots players. And I've got two kinds of positions: I've got receivers, and I have linemen.
Each one is just labeled by the name, the height in inches, and the weight in pounds. OK? Five of each.

If I plot those on a two-dimensional plot, this is what I get. OK? No big deal. What am I trying to do? I'm trying to learn, are there characteristics that distinguish the two classes from one another? And in the unlabeled case, all I have is just a set of examples. So what I want to do is decide what makes two players similar, with the goal of seeing, can I separate this distribution into two or more natural groups?

Similar is a distance measure. It says, how do I take two examples with values or features associated with them and decide how far apart they are? And in the unlabeled case, the simple way to do it is to say, if I know that there are at least k groups there, in this case, I'm going to tell you there are two different groups there, how could I decide how best to cluster things together so that all the examples in one group are close to each other, all the examples in the other group are close to each other, and the two groups are reasonably far apart?

There are many ways to do it. I'm going to show you one. It's a very standard way, and it works, basically, as follows. If all I know is that there are two groups there, I'm going to start by just picking two examples as my exemplars. Pick them at random. Actually, at random is not great; I don't want to pick two that are too close to each other, so I'm going to try and pick them far apart. But I pick two examples as my exemplars. And for all the other examples in the training data, I say which one it is closest to. What I'm going to try and do is create clusters with the property that the distances between all of the examples of each cluster are small, the average distance is small, and see if I can find clusters that get the average distance for both clusters as small as possible.
This algorithm works by picking two examples, then clustering all the other examples by simply saying, put each one in the group whose exemplar it's closest to. Once I've got those clusters, I'm going to find the median element of each group, not the mean, but the median, the one closest to the center, treat those as the new exemplars, and repeat the process. And I'll just do that either some number of times or until I don't get any change in the process. So it's clustering based on distance. And we'll come back to distance in a second.

So here's what that would do with my football players. If I just did this based on weight, there's the natural dividing line. And it kind of makes sense, right? These three are obviously clustered; again, it's just on this axis, they're all down here. These seven are at a different place. There's a natural dividing line there.

If I were to do it based on height, it's not as clean. This is what my algorithm came up with as the best dividing line here, meaning that these four, again just based on this axis, are close together, and these six are close together. But it's not nearly as clean. And that's part of the issue we'll look at: how do I find the best clusters?

If I use both height and weight, I get that, which is actually kind of nice, right? Those three cluster together; they're near each other in terms of just distance in the plane. Those seven are near each other. There's a nice, natural dividing line through here. And in fact, that gives me a classifier. This line is the equidistant line between the centers of those two clusters, meaning any point along this line is the same distance to the center of that group as it is to that group. And so for any new example, if it's above the line, I would say it gets that label; if it's below the line, it gets that label. In a second, we'll come back to look at how we measure the distances, but the idea here is pretty simple.
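To make that loop concrete, here is a minimal sketch of the exemplar-based clustering just described. It is not the course's actual code: the (height, weight) tuples are made-up stand-ins for the roster numbers, and it assumes plain Euclidean distance, which is only one of the distance choices we'll come back to.

    import math

    def euclidean(p, q):
        # Straight-line distance between two equal-length feature tuples.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def cluster(examples, exemplars, num_iterations=100):
        # Assign every example to its nearest exemplar, replace each exemplar
        # with the median element of its group (the member closest on average
        # to the rest of the group), and repeat until nothing changes.
        for _ in range(num_iterations):
            groups = [[] for _ in exemplars]
            for e in examples:
                dists = [euclidean(e, ex) for ex in exemplars]
                groups[dists.index(min(dists))].append(e)
            new_exemplars = [min(g, key=lambda c: sum(euclidean(c, o) for o in g))
                             for g in groups]
            if new_exemplars == exemplars:   # no change, so stop early
                return groups
            exemplars = new_exemplars
        return groups

    # Hypothetical (height in inches, weight in pounds) tuples, not the real roster
    players = [(70, 180), (72, 190), (73, 205), (75, 250),
               (76, 305), (77, 315), (78, 300)]
    print(cluster(players, exemplars=[players[0], players[-1]]))

The version we'll actually build later worries much more about how the initial exemplars get picked and about what "distance" should mean; this is just the skeleton of assign, re-center, repeat.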
I want to find groupings whose members are near each other and far apart from the other group.

Now suppose I actually knew the labels on these players. These are the receivers; those are the linemen. And for those of you who are football fans, you can figure it out, right? Those are the two tight ends; they are much bigger. I think that's Bennett and that's Gronk, if you're really a big Patriots fan. But those are tight ends, those are wide receivers, and that's going to come back in a second, but there are the labels.

Now what I want to do is say, if I could take advantage of knowing the labels, how would I divide these groups up? And that's kind of easy to see. The basic idea, in this case, is that if I've got labeled groups in that feature space, what I want to do is find a subsurface that naturally divides that space. Now subsurface is a fancy word. It says, in the two-dimensional case, I want to know what's the best line, if I can find a single line, that separates all the examples with one label from all the examples with the second label. We'll see that if the examples are well separated, this is easy to do, and it's great. But in some cases it's going to be more complicated, because some of the examples may be very close to one another. And that's going to raise a problem that you saw last lecture: I want to avoid overfitting. I don't want to create a really complicated surface to separate things. And so we may have to tolerate a few incorrectly labeled things if we can't pull them apart.

And as you've already figured out, in this case, with the labeled data, there's the best-fitting line right there. Anybody over 280 pounds is going to be a great lineman. Anybody under 280 pounds is more likely to be a receiver.

OK. So I've got two different ways of trying to think about doing this labeling. I'm going to come back to both of them in a second. Now suppose I add in some new data.
I want to label new instances. Now these are actually players at a different position, these are running backs, but I say all I know about is receivers and linemen. I get these two new data points, and I'd like to know, are they more likely to be a receiver or a lineman? And there's the data for these two gentlemen.

So if I now go back to plotting them, you notice one of the issues. There are my linemen, the red ones are my receivers, and the two black dots are the two running backs. And notice, right here, it's going to be really hard to separate those two examples from one another. They are so close to each other. And that's going to be one of the things we have to trade off.

But if I think about using what I learned as a classifier with unlabeled data, there were my two clusters. Now you see, oh, I've got an interesting example. This new example I would say is clearly more like a receiver than a lineman. But that one there is unclear. It lies almost exactly along that dividing line between those two clusters. And I would either say I want to rethink the clustering, or I want to say, you know what, maybe there aren't two clusters here. Maybe there are three. And I want to classify them a little differently. So I'll come back to that.

On the other hand, if I had used the labeled data, there was my dividing line. This is really easy. Both of those new examples are clearly below the dividing line. They are clearly examples that I would categorize as being more like receivers than they are like linemen. And I know it's a football example. If you don't like football, pick another example. But you get the sense of why I can use the data in the labeled case and the unlabeled case to come up with different ways of building the clusters.

So what we're going to do over the next two and a half lectures is look at how we can write code to learn that way of separating things out.
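Here is one hedged way to turn the labeled picture into code, again with made-up heights and weights. It labels a new player by whichever class center it is closer to, which induces a straight dividing line; it is a stand-in for the weight-threshold rule on the slide, not the exact classifier we'll build in the coming lectures.

    import math

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def class_means(labeled):
        # labeled maps a position name to a list of (height, weight) tuples;
        # return the mean feature vector for each position.
        return {label: tuple(sum(col) / len(pts) for col in zip(*pts))
                for label, pts in labeled.items()}

    def classify(example, means):
        # Give the new example the label of the class center it is closest to.
        return min(means, key=lambda label: euclidean(example, means[label]))

    # Hypothetical training data standing in for the labeled roster
    training = {'receiver': [(70, 180), (72, 190), (73, 200), (74, 210), (75, 205)],
                'lineman':  [(76, 305), (77, 315), (78, 300), (76, 310), (77, 320)]}
    means = class_means(training)
    for new_player in [(71, 205), (73, 215)]:   # e.g., the two running backs
        print(new_player, '->', classify(new_player, means))

With these invented numbers, both running backs land on the receiver side of the boundary, which matches the picture above.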
We're going to learn models based on unlabeled data, the case where I don't know what the labels are, by simply trying to find ways to cluster things together that are nearby, and then use the clusters to assign labels to new data. And we're going to learn models by looking at labeled data and seeing how we best come up with a way of separating, with a line or a plane or a collection of lines, examples from one group from examples of the other group. With the acknowledgment that we want to avoid overfitting, we don't want to create a really complicated system, and as a consequence we're going to have to make some trade-offs between what we call false positives and false negatives. But the resulting classifier can then label any new data by just deciding where it sits with respect to that separating line.

So here's what you're going to see over the next two and a half lectures. Every machine learning method has five essential components. We need to decide what the training data is, and how we're going to evaluate the success of that system. We've already seen some examples of that. We need to decide how we're going to represent each instance that we're giving it. I happened to choose height and weight for football players, but I might have been better off picking average speed or, I don't know, arm length, something else. How do I figure out what the right features are? And associated with that, how do I measure distances between those features? How do I decide what's close and what's not close? Maybe it should be different in terms of weight versus height, for example. I need to make that decision. Those are the two things we're going to show you examples of today, how to go through that.

Starting next week, Professor Guttag is going to show you how you take those and actually start building more detailed versions: measuring clustering, measuring similarities, to find an objective function that you want to minimize to decide what the best cluster is to use.
And then, what is the best optimization method you want to use to learn that model?

So let's start talking about features. I've got a set of examples, labeled or not. I need to decide what it is about those examples that's useful to use when I want to decide what's close to another thing or not. And one of the problems is, if it were really easy, it would be really easy. Features don't always capture what you want. I'm going to belabor that football analogy, but why did I pick height and weight? Because they were easy to find. You know, if you work for the New England Patriots, what is the thing that you really look for when you're asking what the right feature is? It's probably some other combination of things. So you, as a designer, have to say what features I want to use. That quote, by the way, is from one of the great statisticians of the 20th century, and I think it captures it well.

So feature engineering, for you as a programmer, comes down to deciding both what features I want to measure in that vector that I'm going to put together, and how I decide the relative ways to weight them.

So John and Ana and I could have made our job this term really easy if we had sat down at the beginning of the term and said, you know, we've taught this course many times. We've got data from, I don't know, John, thousands of students, probably, over this time. Let's just build a little learning algorithm that takes a set of data and predicts your final grade. You don't have to come to class, you don't have to go through all the problems, because we'll just predict your final grade. Wouldn't that be nice? It would make our job a little easier, and you may or may not like that idea. But I could think about predicting that grade. Now, why am I telling you this example? I was trying to see if I could get a few smiles. I saw a couple of them there. But think about the features. What would I measure? Actually, I'll put this on John because it's his idea.
What would he measure? Well, GPA is probably not a bad predictor of performance. If you do well in other classes, you're likely to do well in this class. I'm going to use this one very carefully: prior programming experience is at least a predictor, but it is not a perfect predictor. Those of you who haven't programmed before this class can still do really well in this class. But it's an indication that you've seen other programming languages. On the other hand, I don't believe in astrology, so I don't think the month in which you were born, the astrological sign under which you were born, has anything to do with how well you'd program. I doubt that eye color has anything to do with how well you'd program. You get the idea. Some features matter, others don't.

Now, I could just throw all the features in and hope that the machine learning algorithm sorts out those it wants to keep from those it doesn't. But I remind you of that idea of overfitting. If I do that, there is the danger that it will find some correlation between birth month, eye color, and GPA. And that's going to lead to a conclusion that we really don't like. By the way, in case you're worried, I can assure you that Stu Schmill, the dean of admissions, does not use machine learning to pick you. He actually looks at a whole bunch of things, because it's not easy to replace him with a machine, yet.

All right. So what this says is we need to think about how we pick the features. And mostly, what we're trying to do is maximize something called the signal-to-noise ratio: keep the features that carry the most information, and remove the ones that don't.

So I want to show you an example of how you might think about this. I want to label reptiles. I want to come up with a way of labeling animals as, are they a reptile or not? And I give you a single example.
With a single example, you can't really do much. But from this example, I know that a cobra lays eggs, has scales, is poisonous, is cold-blooded, has no legs, and is a reptile. So I could say my model of a reptile is, well, I'm not certain. I don't have enough data yet.

But if I give you a second example, and it also happens to be egg-laying, with scales, poisonous, cold-blooded, and no legs, there is my model, right? A perfectly reasonable model, whether I design it or a machine learning algorithm comes up with it, that says: if all of these are true, label it as a reptile. OK?

And now I give you a boa constrictor. Ah. It's a reptile, but it doesn't fit the model. In particular, it's not egg-laying, and it's not poisonous. So I've got to refine the model, or the algorithm has got to refine the model. And this, I want to remind you, is looking at the features. So I started out with five features, and this doesn't fit. So probably what I should do is reduce it: I'm going to look at scales, I'm going to look at cold-blooded, I'm going to look at legs. That captures all three examples. Again, if you think about this in terms of clustering, all three of them would fit with that.

OK. Now I give you another example: chicken. I don't think it's a reptile. In fact, I'm pretty sure it's not a reptile. And it nicely still fits this model, right? Because, while it has scales, which you may or may not realize, it's not cold-blooded, and it has legs. So it is a negative example that reinforces the model. Sounds good.

And now I'll give you an alligator. It's a reptile. And oh fudge, right? It doesn't satisfy the model. Because while it does have scales and it is cold-blooded, it has legs. I'm almost done with the example, but you see the point.
Again, I've got to think about how I refine this. And I could, by saying, all right, let's make it a little more complicated: has scales, cold-blooded, zero or four legs, and I'm going to say it's a reptile.

I'll give you the dart frog. Not a reptile; it's an amphibian. And that's nice, because it still satisfies this. It's an example outside of the cluster: no scales, not cold-blooded, but it happens to have four legs. It's not a reptile. That's good.

And then I give you-- I have to give you a python, right? I mean, there has to be a python in here. Oh, come on. At least groan at me when I say that. There has to be a python here. And I give you that and a salmon. And now I am in trouble. Because look at scales, look at cold-blooded, look at legs: I can't separate them. On those features, there's no way to come up with a rule that will correctly say that the python is a reptile and the salmon is not. And so there's no easy way to add in that rule.

And probably my best move is to simply go back to just two features, scales and cold-blooded, and basically say, if something has scales and it's cold-blooded, I'm going to call it a reptile; if it doesn't have both of those, I'm going to say it's not a reptile. It won't be perfect. It's going to incorrectly label the salmon. But I've made a design choice here that's important. And the design choice is that I will have no false negatives. What that means is that there's not going to be any instance of something that is a reptile that I'm going to label as not a reptile. I may have some false positives, in that I may have a few things that I will incorrectly label as a reptile. And in particular, the salmon is going to be an instance of that.
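As a tiny, hedged sketch of that design choice, here is the final two-feature rule applied to the animals from the example. It never misses an actual reptile, so there are no false negatives, but it does produce the salmon as a false positive, exactly the trade-off just described.

    def is_reptile(has_scales, is_cold_blooded):
        # Final, deliberately simple rule: scales AND cold-blooded -> reptile.
        return has_scales and is_cold_blooded

    # (has scales, cold-blooded, actually a reptile), as given in the example
    animals = {'cobra':     (True,  True,  True),
               'boa':       (True,  True,  True),
               'alligator': (True,  True,  True),
               'chicken':   (True,  False, False),
               'dart frog': (False, False, False),
               'salmon':    (True,  True,  False)}   # the tolerated false positive
    for name, (scales, cold, truth) in animals.items():
        print(name, 'predicted:', is_reptile(scales, cold), 'actual:', truth)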
798 00:32:19,620 --> 00:32:22,099 This trade off of false positives and false negatives 799 00:32:22,099 --> 00:32:24,390 is something that we worry about, as we think about it. 800 00:32:24,390 --> 00:32:26,690 Because there's no perfect way, in many cases, 801 00:32:26,690 --> 00:32:28,391 to separate out the data. 802 00:32:28,391 --> 00:32:30,640 And if you think back to my example of the New England 803 00:32:30,640 --> 00:32:33,876 Patriots, that running back and that wide receiver were 804 00:32:33,876 --> 00:32:35,500 so close together in height and weight, 805 00:32:35,500 --> 00:32:38,256 there was no way I'm going to be able to separate them apart. 806 00:32:38,256 --> 00:32:39,880 And I just have to be willing to decide 807 00:32:39,880 --> 00:32:42,320 how many false positives or false negatives 808 00:32:42,320 --> 00:32:45,370 do I want to tolerate. 809 00:32:45,370 --> 00:32:49,980 Once I've figured out what features to use, which is good, 810 00:32:49,980 --> 00:32:52,210 then I have to decide about distance. 811 00:32:52,210 --> 00:32:53,960 How do I compare two feature vectors? 812 00:32:53,960 --> 00:32:54,960 I'm going to say vector because there could 813 00:32:54,960 --> 00:32:56,640 be multiple dimensions to it. 814 00:32:56,640 --> 00:32:58,260 How do I decide how to compare them? 815 00:32:58,260 --> 00:33:01,350 Because I want to use the distances to figure out either 816 00:33:01,350 --> 00:33:03,960 how to group things together or how to find a dividing line 817 00:33:03,960 --> 00:33:05,940 that separates things apart. 818 00:33:05,940 --> 00:33:09,470 So one of the things I have to decide is which features. 819 00:33:09,470 --> 00:33:10,970 I also have to decide the distance. 820 00:33:10,970 --> 00:33:12,710 And finally, I may want to decide 821 00:33:12,710 --> 00:33:16,660 how to weigh relative importance of different dimensions 822 00:33:16,660 --> 00:33:17,990 in the feature vector. 823 00:33:17,990 --> 00:33:21,400 Some may be more valuable than others in making that decision. 824 00:33:21,400 --> 00:33:24,570 And I want to show you an example of that. 825 00:33:24,570 --> 00:33:27,909 So let's go back to my animals. 826 00:33:27,909 --> 00:33:29,950 I started off with a feature vector that actually 827 00:33:29,950 --> 00:33:31,390 had five dimensions to it. 828 00:33:31,390 --> 00:33:36,305 It was egg-laying, cold blooded, has scales, 829 00:33:36,305 --> 00:33:39,700 I forget what the other one was, and number of legs. 830 00:33:39,700 --> 00:33:41,560 So one of the ways I could think about this 831 00:33:41,560 --> 00:33:46,180 is saying I've got four binary features and one integer 832 00:33:46,180 --> 00:33:48,910 feature associated with each animal. 833 00:33:48,910 --> 00:33:52,000 And one way to learn to separate out reptiles from non reptiles 834 00:33:52,000 --> 00:33:56,591 is to measure the distance between pairs of examples 835 00:33:56,591 --> 00:33:58,840 and use that distance to decide what's near each other 836 00:33:58,840 --> 00:33:59,664 and what's not. 837 00:33:59,664 --> 00:34:01,330 And as we've said before, it will either 838 00:34:01,330 --> 00:34:04,210 be used to cluster things or to find a classifier surface that 839 00:34:04,210 --> 00:34:06,620 separates them. 840 00:34:06,620 --> 00:34:09,070 So here's a simple way to do it. 841 00:34:09,070 --> 00:34:11,470 For each of these examples, I'm going to just let true 842 00:34:11,470 --> 00:34:13,060 be 1, false be 0. 
843 00:34:13,060 --> 00:34:15,310 So the first four are either 0s or 1s. 844 00:34:15,310 --> 00:34:17,709 And the last one is the number of legs. 845 00:34:17,709 --> 00:34:19,000 And now I could say, all right. 846 00:34:19,000 --> 00:34:22,540 How do I measure distances between animals, 847 00:34:22,540 --> 00:34:25,884 or anything else, with these kinds of feature vectors? 848 00:34:25,884 --> 00:34:27,300 Here, we're going to use something 849 00:34:27,300 --> 00:34:30,750 called the Minkowski metric or the Minkowski difference. 850 00:34:30,750 --> 00:34:34,080 Given two vectors and a power, p, 851 00:34:34,080 --> 00:34:36,300 we basically take the absolute value 852 00:34:36,300 --> 00:34:38,429 of the difference between each of the components 853 00:34:38,429 --> 00:34:43,969 of the vector, raise it to the p-th power, take the sum, 854 00:34:43,969 --> 00:34:46,840 and take the p-th root of that. 855 00:34:46,840 --> 00:34:48,460 So let's do the two obvious examples. 856 00:34:48,460 --> 00:34:51,699 If p is equal to 1, I just measure the absolute distance 857 00:34:51,699 --> 00:34:56,469 between each component, add them up, and that's my distance. 858 00:34:56,469 --> 00:34:58,715 It's called the Manhattan metric. 859 00:34:58,715 --> 00:35:00,840 The one you've seen more, the one we saw last time, 860 00:35:00,840 --> 00:35:03,900 if p is equal to 2, this is Euclidean distance, right? 861 00:35:03,900 --> 00:35:05,990 It's the sum of the squares of the differences 862 00:35:05,990 --> 00:35:07,050 of the components. 863 00:35:07,050 --> 00:35:08,149 Take the square root. 864 00:35:08,149 --> 00:35:09,690 Take the square root because it makes 865 00:35:09,690 --> 00:35:12,420 it have certain properties of a distance. 866 00:35:12,420 --> 00:35:16,540 That's the Euclidean distance. 867 00:35:16,540 --> 00:35:20,240 So now if I want to measure the difference between these two, 868 00:35:20,240 --> 00:35:22,750 here's the question. 869 00:35:22,750 --> 00:35:27,780 Is this circle closer to the star or closer to the cross? 870 00:35:27,780 --> 00:35:30,310 Unfortunately, I put the answer up here. 871 00:35:30,310 --> 00:35:33,260 But it differs, depending on the metric I use. 872 00:35:33,260 --> 00:35:33,760 Right? 873 00:35:33,760 --> 00:35:37,000 Euclidean distance, well, that's the square root of 2 squared plus 2 squared, 874 00:35:37,000 --> 00:35:38,692 so it's about 2.8. 875 00:35:38,692 --> 00:35:39,400 And that's three. 876 00:35:39,400 --> 00:35:42,580 So in terms of just standard distance in the plane, 877 00:35:42,580 --> 00:35:46,680 we would say that these two are closer than those two are. 878 00:35:46,680 --> 00:35:48,430 Manhattan distance, why is it called that? 879 00:35:48,430 --> 00:35:52,040 Because you can only walk along the avenues and the streets. 880 00:35:52,040 --> 00:35:53,500 Manhattan distance would basically 881 00:35:53,500 --> 00:35:56,500 say this is one, two, three, four units away. 882 00:35:56,500 --> 00:35:59,170 This is one, two, three units away. 883 00:35:59,170 --> 00:36:02,020 And under Manhattan distance, this is closer, 884 00:36:02,020 --> 00:36:05,847 this pairing is closer than that pairing is. 885 00:36:05,847 --> 00:36:07,430 Now you're used to thinking Euclidean. 886 00:36:07,430 --> 00:36:08,220 We're going to use that. 887 00:36:08,220 --> 00:36:09,595 But this is going to be important 888 00:36:09,595 --> 00:36:12,080 when we think about how we are comparing distances 889 00:36:12,080 --> 00:36:15,360 between these different pieces.
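As a minimal sketch of that metric (a hypothetical Python helper, not the handout code; the coordinates are made up to match the picture, with one pair offset by (2, 2) and the other by 3 along a single axis):

def minkowski_dist(v1, v2, p):
    # Minkowski distance: p = 1 gives Manhattan, p = 2 gives Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1.0 / p)

# Hypothetical coordinates standing in for the circle, star, and cross.
circle, star, cross = (0, 0), (2, 2), (0, 3)

print(minkowski_dist(circle, star, 2))   # ~2.83: under Euclidean, the star is closer
print(minkowski_dist(circle, cross, 2))  # 3.0
print(minkowski_dist(circle, star, 1))   # 4: under Manhattan, the cross is closer
print(minkowski_dist(circle, cross, 1))  # 3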
890 00:36:15,360 --> 00:36:16,912 So typically, we'll use Euclidean. 891 00:36:16,912 --> 00:36:19,120 We're going to see Manhattan actually has some value. 892 00:36:19,120 --> 00:36:20,960 So if I go back to my three examples-- boy, that's 893 00:36:20,960 --> 00:36:21,960 a gross slide, isn't it? 894 00:36:21,960 --> 00:36:22,790 But there we go-- 895 00:36:22,790 --> 00:36:25,570 rattlesnake, boa constrictor, and dart frog. 896 00:36:25,570 --> 00:36:26,870 There is the representation. 897 00:36:26,870 --> 00:36:29,457 I can ask, what's the distance between them? 898 00:36:29,457 --> 00:36:31,790 In the handout for today, we've given you a little piece 899 00:36:31,790 --> 00:36:32,914 of code that would do that. 900 00:36:32,914 --> 00:36:36,050 And if I actually run through it, I get, 901 00:36:36,050 --> 00:36:38,510 actually, a nice little result. Here 902 00:36:38,510 --> 00:36:43,199 are the distances between those vectors using the Euclidean metric. 903 00:36:43,199 --> 00:36:44,490 I'm going to come back to them. 904 00:36:44,490 --> 00:36:48,350 But you can see the two snakes, nicely, are 905 00:36:48,350 --> 00:36:50,030 reasonably close to each other. 906 00:36:50,030 --> 00:36:54,220 Whereas, the dart frog is a fair distance away from that. 907 00:36:54,220 --> 00:36:54,910 Nice, right? 908 00:36:54,910 --> 00:36:56,740 That's a nice separation that says there's 909 00:36:56,740 --> 00:36:58,480 a difference between these two. 910 00:36:58,480 --> 00:37:00,220 OK. 911 00:37:00,220 --> 00:37:03,160 Now I throw in the alligator. 912 00:37:03,160 --> 00:37:04,810 Sounds like a Dungeons & Dragons game. 913 00:37:04,810 --> 00:37:09,480 I throw in the alligator, and I want to do the same comparison. 914 00:37:09,480 --> 00:37:14,720 And I don't get nearly as nice a result. Because now it says, 915 00:37:14,720 --> 00:37:19,320 as before, the two snakes are close to each other. 916 00:37:19,320 --> 00:37:21,700 But it says that the dart frog and the alligator 917 00:37:21,700 --> 00:37:24,640 are much closer to each other, under this measurement, 918 00:37:24,640 --> 00:37:27,185 than either of them is to the snakes. 919 00:37:27,185 --> 00:37:30,220 And to remind you, right, the alligator and the two 920 00:37:30,220 --> 00:37:33,250 snakes I would like to be close to one another and a distance 921 00:37:33,250 --> 00:37:34,640 away from the frog. 922 00:37:34,640 --> 00:37:38,470 Because I'm trying to classify reptiles versus not. 923 00:37:38,470 --> 00:37:41,015 So what happened here? 924 00:37:41,015 --> 00:37:43,140 Well, this is a place where the feature engineering 925 00:37:43,140 --> 00:37:44,640 is going to be important. 926 00:37:44,640 --> 00:37:47,820 Because in fact, the alligator differs from the frog 927 00:37:47,820 --> 00:37:49,120 in three features. 928 00:37:51,810 --> 00:37:55,300 And only in two features from, say, the boa constrictor. 929 00:37:55,300 --> 00:37:57,590 But one of those features is the number of legs. 930 00:37:57,590 --> 00:37:59,650 And there, while on the binary axes, 931 00:37:59,650 --> 00:38:01,540 the difference is between a 0 and 1, 932 00:38:01,540 --> 00:38:05,620 here it can be between 0 and 4. 933 00:38:05,620 --> 00:38:09,100 So that is weighing the distance a lot more than we would like. 934 00:38:09,100 --> 00:38:13,520 The legs dimension is too large, if you like. 935 00:38:13,520 --> 00:38:15,416 How would I fix this?
936 00:38:15,416 --> 00:38:18,020 This is actually, I would argue, a natural place 937 00:38:18,020 --> 00:38:20,690 to use Manhattan distance. 938 00:38:20,690 --> 00:38:22,520 Why should I think that the difference 939 00:38:22,520 --> 00:38:26,160 in the number of legs 940 00:38:26,160 --> 00:38:30,400 is more important than whether it has scales or not? 941 00:38:30,400 --> 00:38:32,620 Why should I think that measuring that distance 942 00:38:32,620 --> 00:38:34,300 Euclidean-wise makes sense? 943 00:38:34,300 --> 00:38:36,590 They are really completely different measurements. 944 00:38:36,590 --> 00:38:38,090 And in fact, I'm not going to do it, 945 00:38:38,090 --> 00:38:39,880 but if I ran the Manhattan metric on this, 946 00:38:39,880 --> 00:38:43,160 it would get the alligator much closer to the snakes, 947 00:38:43,160 --> 00:38:48,160 exactly because it differs only in two features, not three. 948 00:38:48,160 --> 00:38:49,900 The other way I could fix it would 949 00:38:49,900 --> 00:38:52,510 be to say I'm letting too much weight be associated 950 00:38:52,510 --> 00:38:54,430 with the difference in the number of legs. 951 00:38:54,430 --> 00:38:56,800 So let's just make it a binary feature. 952 00:38:56,800 --> 00:39:00,310 Either it doesn't have legs or it does have legs. 953 00:39:00,310 --> 00:39:03,040 Run the same classification. 954 00:39:03,040 --> 00:39:07,450 And now you see the snakes and the alligator 955 00:39:07,450 --> 00:39:09,510 are all close to each other. 956 00:39:09,510 --> 00:39:13,290 Whereas the dart frog is not as far away as it was before, 957 00:39:13,290 --> 00:39:15,480 but there's still a pretty natural separation 958 00:39:15,480 --> 00:39:18,450 between them. 959 00:39:18,450 --> 00:39:20,180 What's my point? 960 00:39:20,180 --> 00:39:22,610 Choice of features matters. 961 00:39:22,610 --> 00:39:24,710 Throwing too many features in may, in fact, 962 00:39:24,710 --> 00:39:27,450 give us some overfitting. 963 00:39:27,450 --> 00:39:29,300 And in particular, deciding the weights 964 00:39:29,300 --> 00:39:32,090 that I want on those features has a real impact. 965 00:39:32,090 --> 00:39:33,830 And you, as a designer or a programmer, 966 00:39:33,830 --> 00:39:37,340 have a lot of influence in how you think about using those. 967 00:39:37,340 --> 00:39:38,930 So feature engineering really matters. 968 00:39:38,930 --> 00:39:40,610 How you pick the features, what you use 969 00:39:40,610 --> 00:39:43,580 is going to be important. 970 00:39:43,580 --> 00:39:44,880 OK. 971 00:39:44,880 --> 00:39:47,740 The last piece of this then is we're 972 00:39:47,740 --> 00:39:51,370 going to look at some examples where we give you data with 973 00:39:51,370 --> 00:39:53,180 features associated with them. 974 00:39:53,180 --> 00:39:55,180 We're going to, in some cases, have them labeled, 975 00:39:55,180 --> 00:39:56,120 in other cases not. 976 00:39:56,120 --> 00:39:57,970 And we now know how to think about how we 977 00:39:57,970 --> 00:39:59,261 measure distances between them. 978 00:39:59,261 --> 00:40:00,425 John. 979 00:40:00,425 --> 00:40:02,050 JOHN GUTTAG: You probably didn't intend 980 00:40:02,050 --> 00:40:03,460 to say weights of features. 981 00:40:03,460 --> 00:40:04,780 You intended to say how they're scaled. 982 00:40:04,780 --> 00:40:04,990 ERIC GRIMSON: Sorry. 983 00:40:04,990 --> 00:40:06,530 The scales and not the-- thank you, John. 984 00:40:06,530 --> 00:40:07,029 No, I did.
985 00:40:07,029 --> 00:40:07,850 I take that back. 986 00:40:07,850 --> 00:40:09,600 I did not mean to say weights of features. 987 00:40:09,600 --> 00:40:11,650 I meant to say the scale of the dimension 988 00:40:11,650 --> 00:40:12,900 is going to be important here. 989 00:40:12,900 --> 00:40:15,210 Thank you for the amplification and correction. 990 00:40:15,210 --> 00:40:16,210 You're absolutely right. 991 00:40:16,210 --> 00:40:18,082 JOHN GUTTAG: Weights, we use in a different way, 992 00:40:18,082 --> 00:40:19,020 as we'll see next time. 993 00:40:19,020 --> 00:40:19,590 ERIC GRIMSON: And we're going to see 994 00:40:19,590 --> 00:40:21,450 next time why we're going to use weights in different ways. 995 00:40:21,450 --> 00:40:22,404 So rephrase it. 996 00:40:22,404 --> 00:40:23,570 Block that out of your mind. 997 00:40:23,570 --> 00:40:26,070 We're going to talk about scales and the scale on the axes 998 00:40:26,070 --> 00:40:27,536 as being important here. 999 00:40:27,536 --> 00:40:29,160 And we already said we're going to look 1000 00:40:29,160 --> 00:40:31,740 at two different kinds of learning, 1001 00:40:31,740 --> 00:40:34,920 labeled and unlabeled, clustering and classifying. 1002 00:40:34,920 --> 00:40:37,530 And I want to just finish up by showing you 1003 00:40:37,530 --> 00:40:38,940 two examples of that. 1004 00:40:38,940 --> 00:40:41,310 How we would think about them algorithmically, 1005 00:40:41,310 --> 00:40:44,004 and we'll look at them in more detail next time. 1006 00:40:44,004 --> 00:40:45,420 As we look at it, I want to remind 1007 00:40:45,420 --> 00:40:48,530 you of the things that are going to be important to you. 1008 00:40:48,530 --> 00:40:50,930 How do I measure distance between examples? 1009 00:40:50,930 --> 00:40:53,060 What's the right way to design that? 1010 00:40:53,060 --> 00:40:57,060 What is the right set of features to use in that vector? 1011 00:40:57,060 --> 00:41:01,520 And then, what constraints do I want to put on the model? 1012 00:41:01,520 --> 00:41:03,020 In the case of unlabeled data, how 1013 00:41:03,020 --> 00:41:06,424 do I decide how many clusters I want to have? 1014 00:41:06,424 --> 00:41:08,840 Because I can give you a really easy way to do clustering. 1015 00:41:08,840 --> 00:41:12,110 If I give you 100 examples, I say build 100 clusters. 1016 00:41:12,110 --> 00:41:14,250 Every example is its own cluster. 1017 00:41:14,250 --> 00:41:15,470 Distance is really good. 1018 00:41:15,470 --> 00:41:18,110 It's really close to itself, but it does a lousy job 1019 00:41:18,110 --> 00:41:19,240 of labeling things on it. 1020 00:41:19,240 --> 00:41:20,656 So I have to think about, how do I 1021 00:41:20,656 --> 00:41:23,526 decide how many clusters, what's the complexity 1022 00:41:23,526 --> 00:41:24,650 of that separating surface? 1023 00:41:24,650 --> 00:41:27,710 How do I basically avoid the overfitting problem, 1024 00:41:27,710 --> 00:41:30,840 which I don't want to have? 1025 00:41:30,840 --> 00:41:32,850 So just to remind you, we've already 1026 00:41:32,850 --> 00:41:36,240 seen a little version of this, the clustering method. 1027 00:41:36,240 --> 00:41:39,276 This is a standard way to do it, simply repeating what 1028 00:41:39,276 --> 00:41:40,400 we had on an earlier slide. 1029 00:41:40,400 --> 00:41:42,420 If I want to cluster it into groups, 1030 00:41:42,420 --> 00:41:45,410 I start by saying how many clusters am I looking for? 1031 00:41:45,410 --> 00:41:48,590 Pick an example that I take as my initial representation.
1032 00:41:48,590 --> 00:41:50,640 For every other example in the training data, 1033 00:41:50,640 --> 00:41:53,210 put it in the closest cluster. 1034 00:41:53,210 --> 00:41:57,080 Once I've got those, find the median, repeat the process. 1035 00:41:57,080 --> 00:42:01,820 And that led to that separation. 1036 00:42:01,820 --> 00:42:03,930 Now once I've got it, I like to validate it. 1037 00:42:03,930 --> 00:42:05,780 And in fact, I should have said this better. 1038 00:42:05,780 --> 00:42:09,980 Those two clusters came without looking at the two black dots. 1039 00:42:09,980 --> 00:42:11,630 Once I put the black dots in, I'd 1040 00:42:11,630 --> 00:42:14,510 like to validate, how well does this really work? 1041 00:42:14,510 --> 00:42:17,780 And that example there is really not very encouraging. 1042 00:42:17,780 --> 00:42:19,590 It's too close. 1043 00:42:19,590 --> 00:42:22,020 So that's a natural place to say, OK, what if I did this 1044 00:42:22,020 --> 00:42:25,360 with three clusters? 1045 00:42:25,360 --> 00:42:27,970 That's what I get. 1046 00:42:27,970 --> 00:42:29,240 I like that. 1047 00:42:29,240 --> 00:42:29,860 All right? 1048 00:42:29,860 --> 00:42:33,460 That has a really nice cluster up here. 1049 00:42:33,460 --> 00:42:35,630 The fact that the algorithm didn't know the labeling 1050 00:42:35,630 --> 00:42:36,213 is irrelevant. 1051 00:42:36,213 --> 00:42:37,720 There's a nice grouping of five. 1052 00:42:37,720 --> 00:42:39,710 There's a nice grouping of four. 1053 00:42:39,710 --> 00:42:42,620 And there's a nice grouping of three in between. 1054 00:42:42,620 --> 00:42:45,980 And in fact, if I looked at the average distance 1055 00:42:45,980 --> 00:42:48,200 between examples in each of these clusters, 1056 00:42:48,200 --> 00:42:52,440 it is much tighter than in that example. 1057 00:42:52,440 --> 00:42:56,550 And so that leads to, then, the question of should I 1058 00:42:56,550 --> 00:42:57,642 look for four clusters? 1059 00:42:57,642 --> 00:42:58,350 Question, please. 1060 00:42:58,350 --> 00:43:01,020 AUDIENCE: Is that overlap between the two clusters 1061 00:43:01,020 --> 00:43:01,690 not an issue? 1062 00:43:01,690 --> 00:43:02,440 ERIC GRIMSON: Yes. 1063 00:43:02,440 --> 00:43:04,600 The question is, is the overlap between the two clusters 1064 00:43:04,600 --> 00:43:05,099 a problem? 1065 00:43:05,099 --> 00:43:05,824 No. 1066 00:43:05,824 --> 00:43:07,240 I just drew it here so I could let 1067 00:43:07,240 --> 00:43:09,010 you see where those pieces are. 1068 00:43:09,010 --> 00:43:13,090 But in fact, if you like, the center is there. 1069 00:43:13,090 --> 00:43:15,550 Those three points are all closer to that center 1070 00:43:15,550 --> 00:43:16,780 than they are to that center. 1071 00:43:16,780 --> 00:43:18,260 So the fact that they overlap, a good question, 1072 00:43:18,260 --> 00:43:20,020 is just the way I happened to draw them. 1073 00:43:20,020 --> 00:43:21,490 I should really draw these, not as 1074 00:43:21,490 --> 00:43:25,104 circles, but as somewhat more convoluted surfaces. 1075 00:43:25,104 --> 00:43:26,050 OK? 1076 00:43:26,050 --> 00:43:28,900 Having done three, I could say should I look for four? 1077 00:43:28,900 --> 00:43:31,919 Well, those points down there, as I've already said, 1078 00:43:31,919 --> 00:43:33,460 are an example where it's going to be 1079 00:43:33,460 --> 00:43:34,750 hard to separate them out. 1080 00:43:34,750 --> 00:43:35,920 And I don't want to overfit.
1081 00:43:35,920 --> 00:43:37,720 Because the only way to separate those out 1082 00:43:37,720 --> 00:43:40,900 is going to be to come up with a really convoluted cluster, 1083 00:43:40,900 --> 00:43:41,950 which I don't like. 1084 00:43:41,950 --> 00:43:43,580 All right? 1085 00:43:43,580 --> 00:43:46,480 Let me finish by showing you one other example 1086 00:43:46,480 --> 00:43:47,650 from the other direction. 1087 00:43:47,650 --> 00:43:52,010 Which is, suppose I give you labeled examples. 1088 00:43:52,010 --> 00:43:54,200 So again, the goal is I've got features 1089 00:43:54,200 --> 00:43:55,470 associated with each example. 1090 00:43:55,470 --> 00:43:57,470 They're going to have multiple dimensions to them. 1091 00:43:57,470 --> 00:43:59,450 But I also know the label associated with them. 1092 00:43:59,450 --> 00:44:01,880 And I want to learn what is the best 1093 00:44:01,880 --> 00:44:04,760 way to come up with a rule that will let me take new examples 1094 00:44:04,760 --> 00:44:07,301 and assign them to the right group. 1095 00:44:07,301 --> 00:44:08,910 A number of ways to do this. 1096 00:44:08,910 --> 00:44:12,020 You can simply say I'm looking for the simplest surface that 1097 00:44:12,020 --> 00:44:13,927 will separate those examples. 1098 00:44:13,927 --> 00:44:16,010 In my football case, where they were in the plane, what's 1099 00:44:16,010 --> 00:44:17,660 the best line that separates them? 1100 00:44:17,660 --> 00:44:19,610 That turns out to be easy. 1101 00:44:19,610 --> 00:44:21,725 I might look for a more complicated surface. 1102 00:44:21,725 --> 00:44:23,600 And we're going to see an example in a second 1103 00:44:23,600 --> 00:44:26,261 where maybe it's a sequence of line segments 1104 00:44:26,261 --> 00:44:27,260 that separates them out. 1105 00:44:27,260 --> 00:44:30,920 Because there's not just one line that does the separation. 1106 00:44:30,920 --> 00:44:32,790 As before, I want to be careful. 1107 00:44:32,790 --> 00:44:34,370 If I make it too complicated, I may 1108 00:44:34,370 --> 00:44:38,054 get a really good separator, but I overfit to the data. 1109 00:44:38,054 --> 00:44:39,470 And you're going to see that next time. 1110 00:44:39,470 --> 00:44:40,520 I'm going to just highlight it here. 1111 00:44:40,520 --> 00:44:42,019 There's a third way, which will lead 1112 00:44:42,019 --> 00:44:43,910 to almost the same kind of result, 1113 00:44:43,910 --> 00:44:46,160 called k nearest neighbors. 1114 00:44:46,160 --> 00:44:49,550 And the idea here is I've got a set of labeled data. 1115 00:44:49,550 --> 00:44:52,100 And what I'm going to do is, for every new example, 1116 00:44:52,100 --> 00:44:57,250 find the k, say the five, closest labeled examples. 1117 00:44:57,250 --> 00:44:58,600 And take a vote. 1118 00:44:58,600 --> 00:45:01,870 If 3 out of 5 or 4 out of 5 or 5 out of 5 of those labels 1119 00:45:01,870 --> 00:45:04,605 are the same, I'm going to say it's part of that group. 1120 00:45:04,605 --> 00:45:05,980 And if I have less than that, I'm 1121 00:45:05,980 --> 00:45:07,510 going to leave it as unclassified. 1122 00:45:07,510 --> 00:45:09,259 And that's a nice way of actually thinking 1123 00:45:09,259 --> 00:45:10,870 about how to learn them. 1124 00:45:10,870 --> 00:45:12,940 And let me just finish by showing you an example. 1125 00:45:12,940 --> 00:45:14,814 Now I won't use football players on this one. 1126 00:45:14,814 --> 00:45:17,380 I'll use a different example. 1127 00:45:17,380 --> 00:45:20,020 I'm going to give you some voting data.
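Before turning to that data, here is a minimal sketch of the k nearest neighbors vote just described (a hypothetical Python illustration; the helper names and the points are made up, not the course code): take the k closest labeled examples, count their labels, and only assign a label if enough of them agree.

from collections import Counter

def euclidean(v1, v2):
    return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

def knn_label(new_point, labeled_points, k=5, votes_needed=3):
    # Sort the labeled examples by distance, keep the k nearest, and take a vote.
    nearest = sorted(labeled_points, key=lambda pair: euclidean(new_point, pair[0]))[:k]
    counts = Counter(label for _, label in nearest)
    label, count = counts.most_common(1)[0]
    return label if count >= votes_needed else None   # None means leave it unclassified

# Hypothetical (age, distance from Boston) points with party labels.
training = [((25, 3), "D"), ((30, 5), "D"), ((38, 10), "D"),
            ((45, 40), "R"), ((50, 60), "R"), ((60, 80), "R")]

print(knn_label((28, 8), training))   # "D" under these made-up points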
1128 00:45:20,020 --> 00:45:21,800 I think this is actually simulated data. 1129 00:45:21,800 --> 00:45:25,974 But these are a set of voters in the United States 1130 00:45:25,974 --> 00:45:26,890 with their preference. 1131 00:45:26,890 --> 00:45:28,136 Some tend to vote Republican. 1132 00:45:28,136 --> 00:45:29,260 Some tend to vote Democrat. 1133 00:45:29,260 --> 00:45:32,800 And the two features are their age and how far away 1134 00:45:32,800 --> 00:45:34,441 they live from Boston. 1135 00:45:34,441 --> 00:45:36,440 Whether those are relevant or not, I don't know, 1136 00:45:36,440 --> 00:45:39,064 but they are just two things I'm going to use to classify them. 1137 00:45:39,064 --> 00:45:41,750 And I'd like to say, how would I fit a curve 1138 00:45:41,750 --> 00:45:46,110 to separate those two classes? 1139 00:45:46,110 --> 00:45:48,690 I'm going to keep half the data to test. 1140 00:45:48,690 --> 00:45:50,910 I'm going to use half the data to train. 1141 00:45:50,910 --> 00:45:52,590 So if this is my training data, I 1142 00:45:52,590 --> 00:45:57,040 can say what's the best line that separates these? 1143 00:45:57,040 --> 00:46:00,200 I don't know about best, but here are two examples. 1144 00:46:00,200 --> 00:46:03,880 This solid line has the property that all the Democrats 1145 00:46:03,880 --> 00:46:05,620 are on one side. 1146 00:46:05,620 --> 00:46:07,540 Everything on the other side is a Republican, 1147 00:46:07,540 --> 00:46:10,000 but there are some Republicans on this side of the line. 1148 00:46:10,000 --> 00:46:12,310 I can't find a line that completely separates these, 1149 00:46:12,310 --> 00:46:14,260 as I did with the football players. 1150 00:46:14,260 --> 00:46:17,659 But there is a decent line to separate them. 1151 00:46:17,659 --> 00:46:18,700 Here's another candidate. 1152 00:46:18,700 --> 00:46:22,695 That dashed line has the property that on the right side 1153 00:46:22,695 --> 00:46:24,820 you've got-- boy, I don't think this is deliberate, 1154 00:46:24,820 --> 00:46:26,480 John, right-- but on the right side, 1155 00:46:26,480 --> 00:46:28,195 you've got almost all Republicans. 1156 00:46:28,195 --> 00:46:30,730 It seems perfectly appropriate. 1157 00:46:30,730 --> 00:46:34,000 One Democrat, but there's a pretty good separation there. 1158 00:46:34,000 --> 00:46:36,130 And on the left side, you've got a mix of things. 1159 00:46:36,130 --> 00:46:39,980 But most of the Democrats are on the left side of that line. 1160 00:46:39,980 --> 00:46:40,480 All right? 1161 00:46:40,480 --> 00:46:42,104 The fact that left and right correlates 1162 00:46:42,104 --> 00:46:44,470 with distance from Boston is completely irrelevant here. 1163 00:46:44,470 --> 00:46:46,620 But it has a nice punch to it. 1164 00:46:46,620 --> 00:46:48,370 JOHN GUTTAG: Relevant, but not accidental. 1165 00:46:48,370 --> 00:46:49,745 ERIC GRIMSON: But not accidental. 1166 00:46:49,745 --> 00:46:50,570 Thank you. 1167 00:46:50,570 --> 00:46:51,070 All right. 1168 00:46:51,070 --> 00:46:53,194 So now the question is, how would I evaluate these? 1169 00:46:53,194 --> 00:46:55,306 How do I decide which one is better? 1170 00:46:55,306 --> 00:46:56,680 And I'm simply going to show you, 1171 00:46:56,680 --> 00:46:58,880 very quickly, some examples. 1172 00:46:58,880 --> 00:47:02,747 First one is to look at what's called the confusion matrix. 1173 00:47:02,747 --> 00:47:03,580 What does that mean?
1174 00:47:03,580 --> 00:47:07,090 It says, for one of these classifiers, for example 1175 00:47:07,090 --> 00:47:07,760 the solid line, 1176 00:47:07,760 --> 00:47:10,260 here are the predictions, based on that line, 1177 00:47:10,260 --> 00:47:12,010 of whether they would be more likely to be 1178 00:47:12,010 --> 00:47:13,540 Democrat or Republican. 1179 00:47:13,540 --> 00:47:16,090 And here is the actual label. 1180 00:47:16,090 --> 00:47:17,410 Same thing for the dashed line. 1181 00:47:17,410 --> 00:47:21,280 And that diagonal is important because those are 1182 00:47:21,280 --> 00:47:23,740 the correctly labeled results. 1183 00:47:23,740 --> 00:47:24,540 Right? 1184 00:47:24,540 --> 00:47:27,460 In the solid line case, it correctly 1185 00:47:27,460 --> 00:47:30,400 gets all of the labelings of the Democrats. 1186 00:47:30,400 --> 00:47:32,080 It gets half of the Republicans right. 1187 00:47:32,080 --> 00:47:35,080 But it has some where it's actually Republican, 1188 00:47:35,080 --> 00:47:37,700 but it labels it as a Democrat. 1189 00:47:37,700 --> 00:47:40,580 That diagonal, we'd like to be really large. 1190 00:47:40,580 --> 00:47:43,070 And in fact, it leads to a natural measure 1191 00:47:43,070 --> 00:47:44,880 called the accuracy. 1192 00:47:44,880 --> 00:47:46,520 Which is, just to go back to that, 1193 00:47:46,520 --> 00:47:48,650 we say that these are true positives. 1194 00:47:48,650 --> 00:47:52,070 Meaning, I labeled it as being an instance, and it really is. 1195 00:47:52,070 --> 00:47:53,330 These are true negatives. 1196 00:47:53,330 --> 00:47:56,330 I label it as not being an instance, and it really isn't. 1197 00:47:56,330 --> 00:47:59,450 And then these are the false positives. 1198 00:47:59,450 --> 00:48:01,424 I labeled it as being an instance and it's not, 1199 00:48:01,424 --> 00:48:02,840 and these are the false negatives. 1200 00:48:02,840 --> 00:48:05,680 I labeled it as not being an instance, and it is. 1201 00:48:05,680 --> 00:48:09,620 And an easy way to measure it is to look at the correct labels 1202 00:48:09,620 --> 00:48:11,290 over all of the labels. 1203 00:48:11,290 --> 00:48:13,040 The true positives and the true negatives, 1204 00:48:13,040 --> 00:48:14,820 the ones I got right. 1205 00:48:14,820 --> 00:48:19,862 And in that case, both models come up with a value of 0.7. 1206 00:48:19,862 --> 00:48:20,820 So which one is better? 1207 00:48:20,820 --> 00:48:21,900 Well, I should validate that. 1208 00:48:21,900 --> 00:48:23,399 And I'm going to do that in a second 1209 00:48:23,399 --> 00:48:25,511 by looking at other data. 1210 00:48:25,511 --> 00:48:27,260 We could also ask, could we find something 1211 00:48:27,260 --> 00:48:28,660 with less training error? 1212 00:48:28,660 --> 00:48:31,310 This is only getting 70% right. 1213 00:48:31,310 --> 00:48:33,250 Not great. 1214 00:48:33,250 --> 00:48:35,692 Well, here is a more complicated model. 1215 00:48:35,692 --> 00:48:37,150 And this is where you start getting 1216 00:48:37,150 --> 00:48:38,140 worried about overfitting. 1217 00:48:38,140 --> 00:48:39,598 Now what I've done is I've come up 1218 00:48:39,598 --> 00:48:42,430 with a sequence of lines that separate them. 1219 00:48:42,430 --> 00:48:45,260 So everything above this line, I'm going to say 1220 00:48:45,260 --> 00:48:46,080 is a Republican. 1221 00:48:46,080 --> 00:48:48,910 Everything below this line, I'm going to say is a Democrat. 1222 00:48:48,910 --> 00:48:50,350 So I'm avoiding that one.
1223 00:48:50,350 --> 00:48:51,310 I'm avoiding that one. 1224 00:48:51,310 --> 00:48:54,340 I'm still capturing many of the same things. 1225 00:48:54,340 --> 00:48:59,140 And in this case, I get 12 true positives, 13 true negatives, 1226 00:48:59,140 --> 00:49:02,001 and only 5 false positives. 1227 00:49:02,001 --> 00:49:03,000 And that's kind of nice. 1228 00:49:03,000 --> 00:49:03,790 You can see the 5. 1229 00:49:03,790 --> 00:49:06,040 It's those five red ones down there. 1230 00:49:06,040 --> 00:49:09,360 Its accuracy is 0.833. 1231 00:49:09,360 --> 00:49:15,596 And now, if I apply that to the test data, I get an OK result. 1232 00:49:15,596 --> 00:49:19,440 It has an accuracy of about 0.6. 1233 00:49:19,440 --> 00:49:21,930 I could use this idea to try and generalize to say could I 1234 00:49:21,930 --> 00:49:23,100 come up with a better model. 1235 00:49:23,100 --> 00:49:25,860 And you're going to see that next time. 1236 00:49:25,860 --> 00:49:28,050 There could be other ways in which I measure this. 1237 00:49:28,050 --> 00:49:29,860 And I want to use this as the last example. 1238 00:49:29,860 --> 00:49:34,740 Another good measure we use is called PPV, or Positive Predictive 1239 00:49:34,740 --> 00:49:39,290 Value, which is how many true positives I come up with out 1240 00:49:39,290 --> 00:49:42,580 of all the things I labeled positively. 1241 00:49:42,580 --> 00:49:45,630 And with this solid line model and the dashed line model, 1242 00:49:45,630 --> 00:49:48,120 I can get values of about 0.57. 1243 00:49:48,120 --> 00:49:50,500 The complex model on the training data is better. 1244 00:49:50,500 --> 00:49:53,960 And then the testing data is even stronger. 1245 00:49:53,960 --> 00:49:55,820 And finally, two other examples are called 1246 00:49:55,820 --> 00:49:58,850 sensitivity and specificity. 1247 00:49:58,850 --> 00:50:01,490 Sensitivity basically tells you what percentage 1248 00:50:01,490 --> 00:50:03,610 did I correctly find. 1249 00:50:03,610 --> 00:50:05,510 And specificity says what percentage 1250 00:50:05,510 --> 00:50:07,700 did I correctly reject. 1251 00:50:07,700 --> 00:50:09,290 And I show you this because this is 1252 00:50:09,290 --> 00:50:12,240 where the trade-off comes in. 1253 00:50:12,240 --> 00:50:14,210 Sensitivity is, out of those that I correctly 1254 00:50:14,210 --> 00:50:16,400 labeled as positive plus those that I incorrectly 1255 00:50:16,400 --> 00:50:20,382 labeled as negative, 1256 00:50:20,382 --> 00:50:21,840 how many of them did I correctly label 1257 00:50:21,840 --> 00:50:23,980 as being the kind that I want? 1258 00:50:23,980 --> 00:50:27,310 I can make sensitivity 1. 1259 00:50:27,310 --> 00:50:30,090 Label everything as the thing I'm looking for. 1260 00:50:30,090 --> 00:50:30,710 Great. 1261 00:50:30,710 --> 00:50:32,140 Everything I'm looking for gets found. 1262 00:50:32,140 --> 00:50:35,740 But the specificity will be 0. 1263 00:50:35,740 --> 00:50:39,020 Because I'll have a bunch of things incorrectly labeled. 1264 00:50:39,020 --> 00:50:43,460 I could make the specificity 1, reject everything. 1265 00:50:43,460 --> 00:50:45,410 Say nothing is an instance. 1266 00:50:45,410 --> 00:50:52,070 The true negative rate goes to 1, and I'm in a great place there, 1267 00:50:52,070 --> 00:50:55,170 but my sensitivity goes to 0. 1268 00:50:55,170 --> 00:50:56,130 I've got a trade-off.
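As a minimal sketch of those four measures (hypothetical Python helpers, plugging in the counts quoted above for the more complicated model on the training data: 12 true positives, 13 true negatives, 5 false positives, and, as implied by the quoted 0.833 accuracy, no false negatives):

def accuracy(tp, tn, fp, fn):
    # Fraction of all labels that were correct.
    return (tp + tn) / (tp + tn + fp + fn)

def ppv(tp, fp):
    # Positive predictive value: of everything labeled positive, how much really is positive.
    return tp / (tp + fp)

def sensitivity(tp, fn):
    # Of the things that really are positive, what fraction did I correctly find.
    return tp / (tp + fn)

def specificity(tn, fp):
    # Of the things that really are negative, what fraction did I correctly reject.
    return tn / (tn + fp)

tp, tn, fp, fn = 12, 13, 5, 0
print(accuracy(tp, tn, fp, fn))    # 0.833...
print(ppv(tp, fp))                 # ~0.71 with these counts
print(sensitivity(tp, fn))         # 1.0 -- no false negatives here
print(specificity(tn, fp))         # ~0.72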
1269 00:50:56,130 --> 00:50:58,680 As I think about the machine learning algorithm I'm using 1270 00:50:58,680 --> 00:51:01,164 and my choice of that classifier, 1271 00:51:01,164 --> 00:51:02,580 I'm going to see a trade-off where 1272 00:51:02,580 --> 00:51:07,260 I can increase specificity at the cost of sensitivity or vice 1273 00:51:07,260 --> 00:51:08,310 versa. 1274 00:51:08,310 --> 00:51:11,430 And you'll see a nice technique called ROC, or the Receiver Operating 1275 00:51:11,430 --> 00:51:14,576 Characteristic curve, that gives you a sense of how you want to deal with that. 1276 00:51:14,576 --> 00:51:16,200 And with that, we'll see you next time. 1277 00:51:16,200 --> 00:51:17,180 We'll take your question offline 1278 00:51:17,180 --> 00:51:18,680 if you don't mind, because I've run over time. 1279 00:51:18,680 --> 00:51:20,763 But we'll see you next time, when Professor Guttag 1280 00:51:20,763 --> 00:51:22,930 will show you examples of this.