1 00:00:09,310 --> 00:00:11,300 PATRICK WINSTON: Ladies and gentlemen, the Romanian 2 00:00:11,300 --> 00:00:12,960 national anthem. 3 00:00:12,960 --> 00:00:15,930 I did not ask you to stand, because I didn't play it as a 4 00:00:15,930 --> 00:00:18,250 symbol of Romanian national identity. 5 00:00:18,250 --> 00:00:21,580 But rather, to celebrate the end of the Cold War, which 6 00:00:21,580 --> 00:00:24,740 occurred about the time that you were born. 7 00:00:24,740 --> 00:00:28,980 Before that, no one came to MIT from Eastern Europe. 8 00:00:28,980 --> 00:00:33,070 But since that time, we've been blessed by having in our 9 00:00:33,070 --> 00:00:40,850 midst Lithuanians, Estonians, Poles, Czechs, Slovaks, 10 00:00:40,850 --> 00:00:47,400 Bulgarians, Romanians, Slovenians, Serbs, and all 11 00:00:47,400 --> 00:00:50,510 sorts of people from regions of the world 12 00:00:50,510 --> 00:00:53,580 formerly excluded to us. 13 00:00:53,580 --> 00:00:57,900 Believe me, you are all welcome in our house. 14 00:00:57,900 --> 00:01:00,110 Almost all, that is to say. 15 00:01:00,110 --> 00:01:04,530 Because you may recall that Romania is the traditional 16 00:01:04,530 --> 00:01:07,430 home of vampires. 17 00:01:07,430 --> 00:01:10,180 And since the end of the Cold War, vampires have had new 18 00:01:10,180 --> 00:01:14,360 vectors for emerging from their traditional places and 19 00:01:14,360 --> 00:01:16,010 penetrating into the world at large. 20 00:01:16,010 --> 00:01:22,560 You may have a vampire in your suite, or on your floor. 21 00:01:22,560 --> 00:01:26,740 And it's important to know how to recognize them, and take 22 00:01:26,740 --> 00:01:30,420 the necessary precautions. 23 00:01:30,420 --> 00:01:36,860 So if you have this concern, I would expect that the first 24 00:01:36,860 --> 00:01:40,009 thing you would do would be to look at some data concerning 25 00:01:40,009 --> 00:01:41,890 the characteristics of vampires. 
26 00:01:55,550 --> 00:02:00,460 So there's a little database of samples of individuals who 27 00:02:00,460 --> 00:02:06,560 have been determined to be vampires and not vampires. 28 00:02:06,560 --> 00:02:08,020 And our task today-- 29 00:02:08,020 --> 00:02:10,919 and what you'll understand how to do by the end of the hour-- 30 00:02:10,919 --> 00:02:14,640 is to use data like this to build a recognition mechanism 31 00:02:14,640 --> 00:02:17,740 that would help you to identify whether someone is a 32 00:02:17,740 --> 00:02:21,000 vampire or an ordinary person. 33 00:02:21,000 --> 00:02:24,079 So this is a little different from the kind of problem we 34 00:02:24,079 --> 00:02:27,360 worked with neural nets. 35 00:02:27,360 --> 00:02:27,700 Right? 36 00:02:27,700 --> 00:02:32,579 So what's the most conspicuous difference between this data 37 00:02:32,579 --> 00:02:38,590 set and anything you could think to work on with nearest 38 00:02:38,590 --> 00:02:40,079 neighbors, which we studied last time. 39 00:02:40,079 --> 00:02:42,772 Katie, do you have any thoughts about why it would be 40 00:02:42,772 --> 00:02:45,020 difficult to use nearest neighbors with data like this? 41 00:02:48,260 --> 00:02:51,690 The question mark is there because this is MIT, and a lot 42 00:02:51,690 --> 00:02:53,170 of people are completely nocturnal. 43 00:02:53,170 --> 00:02:57,630 So you can't tell whether they cast a shadow or not. 44 00:02:57,630 --> 00:03:01,220 We want to take that into account. 45 00:03:01,220 --> 00:03:04,150 So what's different about this from the 46 00:03:04,150 --> 00:03:06,851 electrical cover data set? 47 00:03:06,851 --> 00:03:08,101 STUDENT: [INAUDIBLE] 48 00:03:12,810 --> 00:03:14,550 PATRICK WINSTON: Could you use the nearest neighbor technique 49 00:03:14,550 --> 00:03:18,180 to identify vampires with this data? 
50 00:03:18,180 --> 00:03:19,430 STUDENT: [INAUDIBLE] 51 00:03:25,620 --> 00:03:27,120 PATRICK WINSTON: So obviously-- 52 00:03:27,120 --> 00:03:29,831 Yes, Lana? 53 00:03:29,831 --> 00:03:31,081 STUDENT: [INAUDIBLE] 54 00:03:33,807 --> 00:03:35,800 STUDENT: You cannot really quantify-- 55 00:03:35,800 --> 00:03:37,110 PATRICK WINSTON: Oh, that's the problem. 56 00:03:37,110 --> 00:03:38,970 This is not numerical data. 57 00:03:38,970 --> 00:03:41,138 This is symbolic. 58 00:03:41,138 --> 00:03:44,579 So we're not saying that your ability to 59 00:03:44,579 --> 00:03:47,320 cast a shadow is 0.7. 60 00:03:47,320 --> 00:03:49,970 You either cast a shadow, don't cast a 61 00:03:49,970 --> 00:03:51,200 shadow, or we can't tell. 62 00:03:51,200 --> 00:03:53,770 It's a symbolic result. 63 00:03:53,770 --> 00:03:57,810 So problem number one we have to face with data of this kind 64 00:03:57,810 --> 00:03:59,060 is that it's not numeric. 65 00:04:05,480 --> 00:04:07,950 And there are other characteristics, as well. 66 00:04:07,950 --> 00:04:11,180 For example, it's not clear that all of these 67 00:04:11,180 --> 00:04:13,820 characteristics actually matter. 68 00:04:13,820 --> 00:04:15,490 So some characteristics don't matter. 69 00:04:26,460 --> 00:04:29,990 And a corollary to that is that some characteristics do 70 00:04:29,990 --> 00:04:33,460 matter, but they only matter part of the time. 71 00:04:52,240 --> 00:04:56,180 And finally, there's the matter of cost. 72 00:04:58,880 --> 00:05:01,400 Some of these tests may be more expensive to 73 00:05:01,400 --> 00:05:03,350 perform than others. 74 00:05:03,350 --> 00:05:05,740 For example, if you wanted to determine whether someone 75 00:05:05,740 --> 00:05:07,930 casts a shadow, you'd have to go to the trouble of getting 76 00:05:07,930 --> 00:05:09,970 up during daylight. 77 00:05:09,970 --> 00:05:12,000 That might be an expensive operation for you. 
78 00:05:15,100 --> 00:05:17,570 You'd have to go find some garlic and ask them to eat it. 79 00:05:17,570 --> 00:05:19,070 That might be expensive. 80 00:05:19,070 --> 00:05:21,140 So some of these tests might be expensive 81 00:05:21,140 --> 00:05:23,560 relative to other tests. 82 00:05:23,560 --> 00:05:28,110 But once you realize that we are talking in terms of tests, 83 00:05:28,110 --> 00:05:32,530 and not a vector of real values, then 84 00:05:32,530 --> 00:05:34,920 what you do is clear. 85 00:05:34,920 --> 00:05:37,305 You build yourself a little tree of tests. 86 00:05:40,680 --> 00:05:43,810 So who knows how this problem will turn out? 87 00:05:43,810 --> 00:05:47,210 But you can imagine a situation where you have one 88 00:05:47,210 --> 00:05:52,740 test up here which might have three outcomes. 89 00:05:52,740 --> 00:05:57,520 And one but only one of those outcomes might require you to 90 00:05:57,520 --> 00:06:01,020 perform another test. 91 00:06:01,020 --> 00:06:05,940 And only when you've created the tree of tests that look 92 00:06:05,940 --> 00:06:07,835 like this are you finished. 93 00:06:10,540 --> 00:06:15,060 So given this set of tests and a set of samples, the question 94 00:06:15,060 --> 00:06:18,640 becomes, how do you arrange the tests in a tree like that 95 00:06:18,640 --> 00:06:22,250 so as to do the identification that you want to do? 96 00:06:22,250 --> 00:06:25,030 So since we're talking about identification, it's not 97 00:06:25,030 --> 00:06:28,250 surprising that this kind of tree is called an 98 00:06:28,250 --> 00:06:29,500 identification tree. 99 00:06:41,560 --> 00:06:44,420 And there's a tendency-- and I may slip into it myself-- to 100 00:06:44,420 --> 00:06:46,210 call this a decision tree. 101 00:06:46,210 --> 00:06:48,970 But a decision tree is a label for something else. 102 00:06:48,970 --> 00:06:51,060 This is an identification tree. 103 00:06:51,060 --> 00:06:55,170 And the task is to create a good one. 
104 00:06:55,170 --> 00:06:59,330 So what is a good one versus a not so good one? 105 00:06:59,330 --> 00:07:02,070 What characteristic would you like for a decision tree-- 106 00:07:05,010 --> 00:07:08,250 for an identification tree to have, if you're going to call 107 00:07:08,250 --> 00:07:12,170 it a good identification tree? 108 00:07:12,170 --> 00:07:13,265 What do you think, Krishna? 109 00:07:13,265 --> 00:07:15,964 What would be a good characteristic? 110 00:07:15,964 --> 00:07:18,399 STUDENT: Maybe the minimum number of levels? 111 00:07:18,399 --> 00:07:18,886 PATRICK WINSTON: Yeah. 112 00:07:18,886 --> 00:07:22,110 He said minimum number of levels. 113 00:07:22,110 --> 00:07:26,960 What's another way you could say what a good one is? 114 00:07:26,960 --> 00:07:28,680 Each test costs something, right? 115 00:07:28,680 --> 00:07:32,050 So what's another way of thinking about what a good 116 00:07:32,050 --> 00:07:33,864 tree would look like? 117 00:07:33,864 --> 00:07:34,820 STUDENT: Minimum cost. 118 00:07:34,820 --> 00:07:36,280 PATRICK WINSTON: The minimum cost. 119 00:07:36,280 --> 00:07:39,810 And if they all have the same cost, then it's 120 00:07:39,810 --> 00:07:41,770 the number of tests. 121 00:07:41,770 --> 00:07:45,530 So overall, what you like is a small tree 122 00:07:45,530 --> 00:07:47,400 rather than a big one. 123 00:07:47,400 --> 00:07:49,730 So you might be able to take your sample data and divide it 124 00:07:49,730 --> 00:07:53,190 up, so that at the bottom of the tree, at the leaves, all 125 00:07:53,190 --> 00:07:57,380 of the sets that are produced by the tests are uniform, 126 00:07:57,380 --> 00:07:59,070 homogeneous. 127 00:07:59,070 --> 00:08:01,905 We'd like that tree to be the simplest possible tree you can 128 00:08:01,905 --> 00:08:05,460 find, not some big complicated one that also divides up all 129 00:08:05,460 --> 00:08:08,920 the data into uniform subsets. 
130 00:08:08,920 --> 00:08:10,070 By uniform subset-- 131 00:08:10,070 --> 00:08:13,100 at the bottom of the tree, you have all of the vampires 132 00:08:13,100 --> 00:08:16,980 together, and all the non-vampires together. 133 00:08:16,980 --> 00:08:17,840 So you'd like a small tree. 134 00:08:17,840 --> 00:08:23,590 So why not just go all the way and do British Museum, and 135 00:08:23,590 --> 00:08:25,946 calculate all possible trees? 136 00:08:25,946 --> 00:08:28,730 Well, you can do that, but it's one of those NP problems. 137 00:08:28,730 --> 00:08:32,620 And as you know, NP problems suck in general. 138 00:08:32,620 --> 00:08:33,929 And so you don't want to do that. 139 00:08:33,929 --> 00:08:37,820 You want to have some kind of heuristic mechanism for 140 00:08:37,820 --> 00:08:39,640 building a small tree. 141 00:08:39,640 --> 00:08:42,590 And we want a small tree because-- 142 00:08:42,590 --> 00:08:43,900 Why do we want a small tree? 143 00:08:43,900 --> 00:08:45,040 Because of the cost. 144 00:08:45,040 --> 00:08:46,870 But there's another, more important reason why we want a 145 00:08:46,870 --> 00:08:48,025 small tree. 146 00:08:48,025 --> 00:08:50,250 Let me give you a hint. 147 00:08:50,250 --> 00:08:52,090 It's Occam's Razor. 148 00:08:52,090 --> 00:08:57,110 The simplest explanation is often the best explanation. 149 00:08:57,110 --> 00:08:58,650 So if you have a big, complicated explanation, 150 00:08:58,650 --> 00:09:06,150 that's probably less good than a simple, small explanation. 151 00:09:06,150 --> 00:09:07,400 Occam's Razor. 152 00:09:07,400 --> 00:09:10,290 Spelled so many ways it doesn't matter how I spell it. 153 00:09:10,290 --> 00:09:12,090 And that's good, because I can't spell. 154 00:09:15,260 --> 00:09:20,040 So how are we going to go about finding the best 155 00:09:20,040 --> 00:09:24,180 possible arrangement of those four tests 156 00:09:24,180 --> 00:09:26,470 in a tree like that? 
157 00:09:26,470 --> 00:09:29,630 Well, step one will be to see what each test 158 00:09:29,630 --> 00:09:30,340 does with the data. 159 00:09:30,340 --> 00:09:36,070 And by the way, before I go a step further, you know and I 160 00:09:36,070 --> 00:09:40,740 know that this is a sample data set that's very small, 161 00:09:40,740 --> 00:09:43,090 suitable for classroom manipulation. 162 00:09:43,090 --> 00:09:47,090 You'd never bet your life on a data set this small. 163 00:09:47,090 --> 00:09:49,020 We use it only for classroom illustration. 164 00:09:49,020 --> 00:09:51,970 But imagine that these rows are multiplied by 10. 165 00:09:51,970 --> 00:09:55,500 So instead of eight samples, you've got 80. 166 00:09:55,500 --> 00:09:56,670 Then you might begin to believe the 167 00:09:56,670 --> 00:09:58,220 results that are produced. 168 00:09:58,220 --> 00:10:00,700 So I'm just going to pretend that each one of those 169 00:10:00,700 --> 00:10:04,330 represents 10 other samples that I 170 00:10:04,330 --> 00:10:07,530 haven't bothered to show. 171 00:10:07,530 --> 00:10:10,290 But we can work with this one in the classroom, because it's 172 00:10:10,290 --> 00:10:10,990 pretty small. 173 00:10:10,990 --> 00:10:13,070 And we can say, well, what does this shadow test do? 174 00:10:16,034 --> 00:10:20,310 Well, the shadow test divides the sample population into 175 00:10:20,310 --> 00:10:21,780 three groups. 176 00:10:21,780 --> 00:10:24,790 There's the I Don't Know group of people who are nocturnal. 177 00:10:24,790 --> 00:10:26,110 There are the people who do cast the 178 00:10:26,110 --> 00:10:28,360 shadow, the Yes people. 179 00:10:28,360 --> 00:10:32,220 And the people who do not cast a shadow, the No people. 
180 00:10:32,220 --> 00:10:37,430 So if I look at those rows up there and see which ones are 181 00:10:37,430 --> 00:10:44,440 vampires, it looks to me that if there's no shadow cast-- 182 00:10:44,440 --> 00:10:46,140 there's only one that doesn't cast a shadow-- 183 00:10:46,140 --> 00:10:47,615 and that is a vampire. 184 00:10:51,670 --> 00:10:53,950 So that's a plus over there. 185 00:10:53,950 --> 00:10:56,400 Vampire. 186 00:10:56,400 --> 00:10:59,680 Now, if we look at the ones who do cast a shadow, all 187 00:10:59,680 --> 00:11:00,715 those are not vampires. 188 00:11:00,715 --> 00:11:01,965 They're all OK. 189 00:11:05,870 --> 00:11:07,520 And now there're 8. 190 00:11:07,520 --> 00:11:09,450 Three are vampires. 191 00:11:09,450 --> 00:11:13,040 So that means that two of these must be vampires. 192 00:11:13,040 --> 00:11:15,420 And I've got three, four, five, six so far. 193 00:11:15,420 --> 00:11:17,340 So there must be two left. 194 00:11:17,340 --> 00:11:20,730 So that's the way the shadow test divides up the data. 195 00:11:20,730 --> 00:11:21,980 Now let's do garlic. 196 00:11:25,620 --> 00:11:28,040 Vampires traditionally don't eat garlic. 197 00:11:28,040 --> 00:11:29,290 I don't know why. 198 00:11:31,710 --> 00:11:33,480 So we look at the garlic test, and we see 199 00:11:33,480 --> 00:11:37,010 that all of the Nos-- 200 00:11:37,010 --> 00:11:39,880 well, there're three Yeses, and they all 201 00:11:39,880 --> 00:11:41,920 produce a No answer. 202 00:11:41,920 --> 00:11:46,590 So if somebody eats garlic, they're not vampires. 203 00:11:46,590 --> 00:11:49,650 That means the three vampires must be over here. 204 00:11:49,650 --> 00:11:50,580 Then there are two left. 205 00:11:50,580 --> 00:11:53,040 So that's what the garlic test does. 206 00:11:53,040 --> 00:11:53,960 See what we're trying to do? 
207 00:11:53,960 --> 00:11:56,530 We're trying to look at all these tests to see which one 208 00:11:56,530 --> 00:12:01,220 we like best on the basis of how it divides up the data. 209 00:12:01,220 --> 00:12:06,940 So now we've got complexion. 210 00:12:13,210 --> 00:12:15,970 And there are three choices for this. 211 00:12:15,970 --> 00:12:17,530 You can have an average complexion. 212 00:12:17,530 --> 00:12:21,490 But a lot of vampires, in my experience, are rather pale. 213 00:12:21,490 --> 00:12:23,650 So pale is a possibility. 214 00:12:23,650 --> 00:12:26,210 And then the other option is that just after gorging 215 00:12:26,210 --> 00:12:27,610 themselves with blood, they tend to get a 216 00:12:27,610 --> 00:12:29,300 little red in the face. 217 00:12:29,300 --> 00:12:32,640 So we'll have a ruddy over here. 218 00:12:32,640 --> 00:12:35,650 Once again, we have to go back to our data set to see how 219 00:12:35,650 --> 00:12:37,640 this test divides things up. 220 00:12:37,640 --> 00:12:43,980 So there are three ruddies, and one's a No, one's a No, 221 00:12:43,980 --> 00:12:44,670 and one's a Yes. 222 00:12:44,670 --> 00:12:45,920 So two Nos and a Yes. 223 00:12:49,650 --> 00:12:52,680 Two Nos and a Yes. 224 00:12:52,680 --> 00:12:55,300 Now we can try for pale complexion people. 225 00:12:55,300 --> 00:12:57,100 There are only two of those. 226 00:12:57,100 --> 00:12:58,405 A No and a No. 227 00:13:03,310 --> 00:13:06,310 That must mean that there are two pluses over here, because 228 00:13:06,310 --> 00:13:08,370 there are three vampires altogether. 229 00:13:08,370 --> 00:13:13,681 Two, four, six, seven, eight, nine. 230 00:13:13,681 --> 00:13:14,848 Eight, sorry. 231 00:13:14,848 --> 00:13:15,724 Eight. 232 00:13:15,724 --> 00:13:17,480 Only eight. 233 00:13:17,480 --> 00:13:19,150 Just one more to go, and that's the accent. 
234 00:13:22,520 --> 00:13:27,560 Historically, vampires go to great lengths to protect their 235 00:13:27,560 --> 00:13:29,600 accent and not betray their origins. 236 00:13:29,600 --> 00:13:32,900 But nevertheless, we can expect that if 237 00:13:32,900 --> 00:13:33,850 they've just arrived-- 238 00:13:33,850 --> 00:13:35,450 if they're just in from 239 00:13:35,450 --> 00:13:37,470 Transylvania, part of Romania-- 240 00:13:37,470 --> 00:13:38,590 they may still have an accent. 241 00:13:38,590 --> 00:13:41,950 So there's a normal, some still have a heavy accent, and 242 00:13:41,950 --> 00:13:44,080 some persist in having odd accents. 243 00:13:49,940 --> 00:13:51,090 So let's see. 244 00:13:51,090 --> 00:13:51,660 Accent. 245 00:13:51,660 --> 00:13:54,630 Three of them, right at the top, have no accent. 246 00:13:54,630 --> 00:13:55,880 Two Nos and a Yes. 247 00:14:03,050 --> 00:14:04,930 Heavy accent. 248 00:14:04,930 --> 00:14:05,910 Three of those. 249 00:14:05,910 --> 00:14:08,390 A Yes and two Nos. 250 00:14:13,190 --> 00:14:15,710 That means we must have a plus here. 251 00:14:15,710 --> 00:14:18,290 3, 6, plus and a minus. 252 00:14:20,790 --> 00:14:24,400 So we can look at this data and say, well, what will be 253 00:14:24,400 --> 00:14:25,610 the best test to use? 254 00:14:25,610 --> 00:14:30,860 And the best test to use would surely be the one that 255 00:14:30,860 --> 00:14:36,710 produces sets here, at the bottom of the branches, that 256 00:14:36,710 --> 00:14:38,950 correspond to the outcomes of the test. 257 00:14:38,950 --> 00:14:44,750 We're looking for a test that produces homogeneous groups. 258 00:14:44,750 --> 00:14:48,120 So just for the sake of illustration, I'm going to 259 00:14:48,120 --> 00:14:51,200 suppose that we're going to judge the quality of the test 260 00:14:51,200 --> 00:14:56,010 by how many sample individuals it put into a homogeneous set. 
261 00:14:56,010 --> 00:15:00,910 So ideally, we'd like a test that will put all the vampires 262 00:15:00,910 --> 00:15:03,670 in one group and all the ordinary people in another 263 00:15:03,670 --> 00:15:05,530 group right off the bat. 264 00:15:05,530 --> 00:15:07,260 But there are no such tests. 265 00:15:07,260 --> 00:15:11,400 But we can add up the number of sample individuals who are 266 00:15:11,400 --> 00:15:14,460 at least put into homogeneous sets. 267 00:15:14,460 --> 00:15:18,140 So when we do that, this guy has 3 in a 268 00:15:18,140 --> 00:15:19,470 homogeneous set here. 269 00:15:19,470 --> 00:15:20,480 A fourth. 270 00:15:20,480 --> 00:15:22,770 But these are not a homogeneous set. 271 00:15:22,770 --> 00:15:25,725 So the overall score for this guy will be 4. 272 00:15:28,740 --> 00:15:31,690 This one, well, not quite as good. 273 00:15:31,690 --> 00:15:34,130 It only puts 3 individuals in a homogeneous set. 274 00:15:38,110 --> 00:15:42,870 This one here, 2 individuals into a homogeneous set. 275 00:15:42,870 --> 00:15:45,150 Everybody else is all mixed up with some 276 00:15:45,150 --> 00:15:46,950 other kind of person. 277 00:15:46,950 --> 00:15:49,920 And over here, how many samples are in 278 00:15:49,920 --> 00:15:51,670 a homogeneous set? 279 00:15:51,670 --> 00:15:52,920 0. 280 00:15:55,260 --> 00:15:59,370 So on the basis of this analysis, you would conclude 281 00:15:59,370 --> 00:16:02,470 that the ordering of the tests with respect to their quality 282 00:16:02,470 --> 00:16:04,250 is left to right. 283 00:16:04,250 --> 00:16:07,210 So the best test must be the shadow test. 284 00:16:07,210 --> 00:16:11,840 So let's pick the shadow test first, see what 285 00:16:11,840 --> 00:16:13,330 we can do with that. 286 00:16:13,330 --> 00:16:16,850 If we pick the shadow test first, then we have this 287 00:16:16,850 --> 00:16:18,100 arrangement. 
288 00:16:20,080 --> 00:16:25,200 We have question mark, and we have Yes, casts a shadow, and 289 00:16:25,200 --> 00:16:26,610 No, doesn't. 290 00:16:26,610 --> 00:16:28,800 We have 3 minuses here. 291 00:16:28,800 --> 00:16:30,290 We have a plus here. 292 00:16:30,290 --> 00:16:32,890 And unfortunately, over here, we have plus, 293 00:16:32,890 --> 00:16:34,060 plus, minus, minus. 294 00:16:34,060 --> 00:16:37,757 So we need another test to divide that group up. 295 00:16:37,757 --> 00:16:38,731 Yes. 296 00:16:38,731 --> 00:16:41,653 STUDENT: How did you get the 4 on the shadow test again? 297 00:16:41,653 --> 00:16:44,100 Why was it 4? 298 00:16:44,100 --> 00:16:44,760 PATRICK WINSTON: Well, if I look at the 299 00:16:44,760 --> 00:16:48,870 data and I see who-- 300 00:16:48,870 --> 00:16:51,100 the question is, what about that shadow test? 301 00:16:51,100 --> 00:16:52,990 If you look at the shadow test, and you say, well, there 302 00:16:52,990 --> 00:16:55,270 are 4 question marks. 303 00:16:55,270 --> 00:16:57,210 And if we look and see what kind of people belong to those 304 00:16:57,210 --> 00:17:00,990 4 question marks, there are 2 vampires and 2 non-vampires. 305 00:17:00,990 --> 00:17:02,760 That's why it's 2 pluses and 2 minuses. 306 00:17:02,760 --> 00:17:04,148 STUDENT: No, I understand that. 307 00:17:04,148 --> 00:17:07,853 The question is, how did you get to the score of 4? 308 00:17:07,853 --> 00:17:10,450 PATRICK WINSTON: Oh, yeah. 309 00:17:10,450 --> 00:17:12,780 The question is how did I get this number 4? 310 00:17:12,780 --> 00:17:14,829 It has nothing to do with this, because this is a mixed set. 311 00:17:14,829 --> 00:17:17,810 In fact, I've got three guys in a homogeneous set here, and 312 00:17:17,810 --> 00:17:19,240 one guy in a homogeneous set here, and I'm 313 00:17:19,240 --> 00:17:20,398 just adding them up. 314 00:17:20,398 --> 00:17:21,079 STUDENT: OK. 
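The adding-up just described can be sketched in a few lines of Python. This is only an illustration: the per-branch tallies are the ones read off the board in the lecture (vampires are the pluses), and the names `splits` and `homogeneity_score` are made up for this sketch.

```python
# Per-branch tallies from the lecture: (vampires, non-vampires) for each outcome.
splits = {
    "shadow":     {"?": (2, 2), "yes": (0, 3), "no": (1, 0)},
    "garlic":     {"yes": (0, 3), "no": (3, 2)},
    "complexion": {"average": (2, 1), "pale": (0, 2), "ruddy": (1, 2)},
    "accent":     {"none": (1, 2), "heavy": (1, 2), "odd": (1, 1)},
}


def homogeneity_score(branches):
    """Count the samples that land in branches containing only one class."""
    return sum(p + n for p, n in branches.values() if p == 0 or n == 0)


scores = {test: homogeneity_score(branches) for test, branches in splits.items()}
print(scores)  # {'shadow': 4, 'garlic': 3, 'complexion': 2, 'accent': 0}
```

The mixed question-mark branch contributes nothing to the shadow score; only the three Yes samples and the one No sample are counted, which is where the 4 comes from.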
315 00:17:21,079 --> 00:17:23,190 PATRICK WINSTON: So very simple classroom illustration. 316 00:17:23,190 --> 00:17:24,790 Wouldn't work in practice. 317 00:17:24,790 --> 00:17:25,098 Yes. 318 00:17:25,098 --> 00:17:27,969 STUDENT: How do you adjust this for larger data sets 319 00:17:27,969 --> 00:17:30,390 where it's unlikely you're going to have any [INAUDIBLE]? 320 00:17:30,390 --> 00:17:31,770 PATRICK WINSTON: The question is, how do I adjust this for 321 00:17:31,770 --> 00:17:32,580 larger data sets? 322 00:17:32,580 --> 00:17:33,830 You're one step ahead. 323 00:17:38,540 --> 00:17:40,540 Trust me, I'll be doing large data sets in a moment. 324 00:17:40,540 --> 00:17:43,550 I just want to get the idea across. 325 00:17:43,550 --> 00:17:46,210 And I don't want there to be any thought that the method we 326 00:17:46,210 --> 00:17:50,720 use for larger data sets has got anything magic about it. 327 00:17:50,720 --> 00:17:52,450 OK, so we're off and running. 328 00:17:52,450 --> 00:17:56,710 And now we have to pick a test that will divide 329 00:17:56,710 --> 00:17:58,590 those four guys up. 330 00:17:58,590 --> 00:18:02,060 So we're going to have to work this a little harder, and 331 00:18:02,060 --> 00:18:03,610 repeat the analysis we did there. 332 00:18:03,610 --> 00:18:05,350 But at least it'll be simpler, because now we're only 333 00:18:05,350 --> 00:18:07,220 considering 4 samples, not 8. 334 00:18:07,220 --> 00:18:09,930 Just the 4 samples that we still have to divide up that 335 00:18:09,930 --> 00:18:12,240 have come down that left branch. 336 00:18:12,240 --> 00:18:13,590 So I have the shadow test. 337 00:18:18,190 --> 00:18:19,580 It has 3 outcomes. 338 00:18:19,580 --> 00:18:21,380 We have the garlic test. 339 00:18:24,000 --> 00:18:26,080 It has 2 outcomes. 340 00:18:26,080 --> 00:18:27,560 Yes and No. 341 00:18:27,560 --> 00:18:29,430 We have the complexion test. 342 00:18:34,290 --> 00:18:37,100 There's 3 outcomes. 
343 00:18:37,100 --> 00:18:40,230 Average, pale, and ruddy. 344 00:18:40,230 --> 00:18:42,300 And we have finally the accent test. 345 00:18:44,860 --> 00:18:50,610 And that comes out to be either normal, heavy, or odd. 346 00:18:50,610 --> 00:18:53,340 And now, it's a little awkward to figure out what the results 347 00:18:53,340 --> 00:18:56,500 are for this data set as shown. 348 00:18:56,500 --> 00:19:00,910 So let me just strike out 349 00:19:00,910 --> 00:19:04,000 the ones that we're no longer concerned with, and limit our 350 00:19:04,000 --> 00:19:07,240 analysis to the samples for which the outcome of the 351 00:19:07,240 --> 00:19:08,790 shadow test is a question mark. 352 00:19:08,790 --> 00:19:10,740 This is exactly the four people we still need to 353 00:19:10,740 --> 00:19:13,240 separate, right? 354 00:19:13,240 --> 00:19:18,170 So switching colors, keeping the color the same. 355 00:19:18,170 --> 00:19:20,460 We actually don't want to do the shadow 356 00:19:20,460 --> 00:19:21,340 test anymore, right? 357 00:19:21,340 --> 00:19:22,810 Because we've already done that. 358 00:19:22,810 --> 00:19:24,100 There's no point in doing that again. 359 00:19:24,100 --> 00:19:27,170 We don't have to look at that. 360 00:19:27,170 --> 00:19:30,890 It's already done all the division of data that it can. 361 00:19:30,890 --> 00:19:32,260 So the garlic test. 362 00:19:32,260 --> 00:19:33,050 Well, let's see. 363 00:19:33,050 --> 00:19:33,920 Garlic. 364 00:19:33,920 --> 00:19:35,480 2 Yeses, 2 Nos. 365 00:19:35,480 --> 00:19:39,040 The Yeses produce Nos and the Nos produce Yeses. 366 00:19:39,040 --> 00:19:44,210 So if the person does eat garlic, they're OK. 367 00:19:44,210 --> 00:19:48,686 And if they don't eat garlic, bad news-- they're vampires. 368 00:19:48,686 --> 00:19:50,320 Well, that looks like a pretty good test. 369 00:19:50,320 --> 00:19:52,420 But just for the sake of working it all out, let's try 370 00:19:52,420 --> 00:19:53,820 the others. 
371 00:19:53,820 --> 00:19:55,070 Complexion. 372 00:19:56,920 --> 00:19:58,575 2 Ruddies, a Yes, and a No. 373 00:20:05,190 --> 00:20:09,540 1 pale, and that's a No. 374 00:20:09,540 --> 00:20:12,370 1 pale, and that's a No. 375 00:20:12,370 --> 00:20:17,590 And we must have 1 average, and sure enough, that's a Yes. 376 00:20:17,590 --> 00:20:19,790 Now we can do accent, the one on the far right, and look at 377 00:20:19,790 --> 00:20:23,820 how that measures up against the people who are still under 378 00:20:23,820 --> 00:20:25,860 consideration as samples. 379 00:20:25,860 --> 00:20:26,550 Accent. 380 00:20:26,550 --> 00:20:26,990 Let's see. 381 00:20:26,990 --> 00:20:29,010 2 Nones, a Yes and a No. 382 00:20:33,930 --> 00:20:34,840 No Heavies. 383 00:20:34,840 --> 00:20:39,396 2 Odds, a Yes and a No. 384 00:20:39,396 --> 00:20:39,850 All right. 385 00:20:39,850 --> 00:20:42,060 So now we can do the same thing we did before, and just 386 00:20:42,060 --> 00:20:45,400 say, for sake of classroom illustration, how many 387 00:20:45,400 --> 00:20:48,230 individuals are put into homogeneous sets. 388 00:20:48,230 --> 00:20:51,120 And here we have 4. 389 00:20:51,120 --> 00:20:54,830 And here we have 2. 390 00:20:54,830 --> 00:20:58,420 And here we have 0. 391 00:20:58,420 --> 00:21:02,000 So plainly, the garlic test is the test of choice. 392 00:21:02,000 --> 00:21:04,810 So we go back over here, and we've completed the work that 393 00:21:04,810 --> 00:21:06,860 we needed to do. 394 00:21:06,860 --> 00:21:09,970 So that's the garlic test. 395 00:21:09,970 --> 00:21:13,070 And that produces 2 pluses. 396 00:21:13,070 --> 00:21:14,630 Let's see. 397 00:21:14,630 --> 00:21:16,340 Eats garlic, Yes. 398 00:21:16,340 --> 00:21:18,480 Eats garlic, No. 399 00:21:18,480 --> 00:21:22,565 I guess the pluses go over here like so. 400 00:21:22,565 --> 00:21:24,860 And these are the two ordinary people. 401 00:21:24,860 --> 00:21:26,710 And we're done with our task. 
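The finished tree has only two tests in it, so it can be written down as a small function. The sketch below is one possible encoding, not anything shown in the lecture: the function name `identify` and the string values `"yes"`, `"no"`, and `"?"` are assumptions chosen for illustration.

```python
def identify(shadow, garlic=None):
    """Walk the two-level identification tree; True means 'vampire'.

    shadow: "yes", "no", or "?" (can't tell).
    garlic: "yes" or "no"; only consulted when the shadow test is "?".
    """
    if shadow == "yes":       # casts a shadow: all three samples are ordinary
        return False
    if shadow == "no":        # no shadow: the one sample is a vampire
        return True
    if shadow == "?":         # inconclusive: fall through to the garlic test
        if garlic == "yes":   # eats garlic: ordinary person
            return False
        if garlic == "no":    # refuses garlic: vampire
            return True
    raise ValueError(f"unexpected test outcome: shadow={shadow!r}, garlic={garlic!r}")


print(identify("?", "no"))   # True
print(identify("yes"))       # False
```

Note that the complexion and accent tests never appear: on this sample data they turned out not to matter, and the garlic test is only needed on one branch.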
402 00:21:26,710 --> 00:21:30,080 And now you can quickly run off and put this into your 403 00:21:30,080 --> 00:21:33,360 PDA, and forever be protected against the possibility that 404 00:21:33,360 --> 00:21:36,135 one of those vampires got out in the flood of people that 405 00:21:36,135 --> 00:21:38,380 came in from Eastern Europe. 406 00:21:38,380 --> 00:21:42,220 Except what do we do with a large data set? 407 00:21:42,220 --> 00:21:44,280 Well, the trouble is, a large data set's 408 00:21:44,280 --> 00:21:45,530 not likely to produce-- 409 00:21:52,130 --> 00:21:55,320 if you have a large data set, no test is likely to put 410 00:21:55,320 --> 00:21:58,310 together any homogeneous set right off. 411 00:21:58,310 --> 00:22:00,300 So you never get started. 412 00:22:00,300 --> 00:22:02,210 Everything would be 0. 413 00:22:02,210 --> 00:22:05,110 Every test would say, oh it doesn't put anybody into 414 00:22:05,110 --> 00:22:06,280 homogeneous sets. 415 00:22:06,280 --> 00:22:07,530 So you're screwed. 416 00:22:10,910 --> 00:22:16,000 You need some other, more sophisticated way of measuring 417 00:22:16,000 --> 00:22:17,580 how disordered this data is. 418 00:22:17,580 --> 00:22:22,040 Or how disordered these sets are that you find at the 419 00:22:22,040 --> 00:22:24,740 bottom of the tree branches. 420 00:22:24,740 --> 00:22:25,500 That's what you need. 421 00:22:25,500 --> 00:22:31,130 You need a way of measuring disorder of these sets that 422 00:22:31,130 --> 00:22:34,740 you find at the bottom of these branches, so you can 423 00:22:34,740 --> 00:22:37,750 find a kind of overall quality to the test based on your 424 00:22:37,750 --> 00:22:40,250 measurement of disorder. 425 00:22:40,250 --> 00:22:44,010 Now, the first heuristic of a good life is, when you have a 426 00:22:44,010 --> 00:22:47,120 problem to solve, ask somebody who knows the answer. 427 00:22:47,120 --> 00:22:48,180 It's the least amount of work. 
428 00:22:48,180 --> 00:22:51,060 It's not even as hard as going to Google. 429 00:22:51,060 --> 00:22:55,010 So who would you ask about ways of 430 00:22:55,010 --> 00:22:58,260 measuring disorder in sets? 431 00:22:58,260 --> 00:22:59,510 There are two possible answers. 432 00:23:05,050 --> 00:23:07,060 STUDENT: You could just do entropy. 433 00:23:07,060 --> 00:23:07,325 PATRICK WINSTON: What? 434 00:23:07,325 --> 00:23:09,840 STUDENT: Find the entropy of the set. 435 00:23:09,840 --> 00:23:11,345 PATRICK WINSTON: Who studies entropy? 436 00:23:11,345 --> 00:23:13,350 STUDENT: Probability. 437 00:23:13,350 --> 00:23:15,402 PATRICK WINSTON: What kind of classes? 438 00:23:15,402 --> 00:23:15,898 STUDENT: Physics. 439 00:23:15,898 --> 00:23:16,890 STUDENT: Thermodynamics. 440 00:23:16,890 --> 00:23:18,415 PATRICK WINSTON: Thermodynamics! 441 00:23:18,415 --> 00:23:21,125 The thermodynamicists are good at measuring disorder, because 442 00:23:21,125 --> 00:23:22,900 that's what thermodynamics is all about. 443 00:23:22,900 --> 00:23:25,640 Entropy increasing over time, and all that sort of stuff. 444 00:23:25,640 --> 00:23:28,834 There's another equally good answer. 445 00:23:28,834 --> 00:23:30,720 STUDENT: Statisticians? 446 00:23:30,720 --> 00:23:33,400 PATRICK WINSTON: Statisticians. 447 00:23:33,400 --> 00:23:37,680 Perhaps, but it's not the second best answer. 448 00:23:37,680 --> 00:23:39,190 It's actually not even the best answer. 449 00:23:39,190 --> 00:23:39,930 That's the best answer. 450 00:23:39,930 --> 00:23:40,643 What's your name? 451 00:23:40,643 --> 00:23:41,510 STUDENT: Leo. 452 00:23:41,510 --> 00:23:42,760 PATRICK WINSTON: Oh, yeah. 453 00:23:45,322 --> 00:23:49,150 [LAUGHTER] 454 00:23:49,150 --> 00:23:50,870 PATRICK WINSTON: Leonardo has got his finger on it. 455 00:23:50,870 --> 00:23:53,420 The information theorists are pretty good at measuring 456 00:23:53,420 --> 00:23:57,440 disorder, because that's what information is all about, too. 
457 00:23:57,440 --> 00:24:00,130 So we might as well borrow a mechanism for measuring the 458 00:24:00,130 --> 00:24:03,970 disorder of a set from those information theory guys. 459 00:24:03,970 --> 00:24:05,710 So what we're going to do is exactly that. 460 00:24:08,610 --> 00:24:11,760 Let's put it over here, so we'll have it handy when we 461 00:24:11,760 --> 00:24:13,060 want to try to measure those things. 462 00:24:16,910 --> 00:24:21,800 The gospel according to information theorists is that 463 00:24:21,800 --> 00:24:28,910 the disorder, D, of some set is equal to-- now let's 464 00:24:28,910 --> 00:24:32,500 suppose that this is a set of binary values. 465 00:24:32,500 --> 00:24:34,720 So we have positives and then we have negatives. 466 00:24:34,720 --> 00:24:36,490 Pluses and minuses. 467 00:24:36,490 --> 00:24:39,640 But pluses, they don't go very well in an algebraic equation, 468 00:24:39,640 --> 00:24:41,790 because they might be confused with adding. 469 00:24:41,790 --> 00:24:45,900 So I'm going to say P and N. And then it'll be the total, 470 00:24:45,900 --> 00:24:48,030 which is P plus N. We only have two choices, 471 00:24:48,030 --> 00:24:50,090 positive and negative. 472 00:24:50,090 --> 00:24:53,850 So the disorder of a set, according to those guys, is equal 473 00:24:53,850 --> 00:24:59,400 to minus the number of positives over the total 474 00:24:59,400 --> 00:25:05,440 number, times the log to the base 2 of the positives over 475 00:25:05,440 --> 00:25:13,260 the total, minus the negatives over the total, times the log 476 00:25:13,260 --> 00:25:16,270 2 of the negatives over the total. 477 00:25:16,270 --> 00:25:18,780 Those negatives look a little worrisome, because you think, 478 00:25:18,780 --> 00:25:20,210 well, maybe this thing can go negative. 479 00:25:20,210 --> 00:25:21,390 But that's not going to be true, right?
480 00:25:21,390 --> 00:25:26,170 Because these ratios are all less than 1, and the logarithm 481 00:25:26,170 --> 00:25:29,830 of something that's less than 1 is negative. 482 00:25:29,830 --> 00:25:32,770 So we're OK. 483 00:25:32,770 --> 00:25:37,040 So that's a lovely way of measuring disorder. 484 00:25:37,040 --> 00:25:39,020 And then we ought to draw a graph of what 485 00:25:39,020 --> 00:25:40,270 that curve looks like. 486 00:25:44,370 --> 00:25:47,720 And what we're going to graph it against is the ratio of 487 00:25:47,720 --> 00:25:52,500 positives to the total number. 488 00:25:52,500 --> 00:25:56,730 So that's going to be an axis where we go from 0 to 1. 489 00:26:01,140 --> 00:26:05,750 So let's just find a couple of useful values. 490 00:26:05,750 --> 00:26:08,920 And by the way, it pays to pay attention to these curves, 491 00:26:08,920 --> 00:26:11,890 because if you pay attention to this stuff, you can work 492 00:26:11,890 --> 00:26:14,980 the quiz questions on this very rapidly. 493 00:26:14,980 --> 00:26:19,460 Otherwise, we see people getting out their calculators 494 00:26:19,460 --> 00:26:24,020 and quickly becoming both lost and screwed. 495 00:26:24,020 --> 00:26:26,160 OK so let's see. 496 00:26:26,160 --> 00:26:29,100 Let's suppose that the number of positives is equal to the 497 00:26:29,100 --> 00:26:29,810 number of negatives. 498 00:26:29,810 --> 00:26:32,260 So we've got a completely mixed-up set. 499 00:26:32,260 --> 00:26:34,470 It has no bias in either direction. 500 00:26:34,470 --> 00:26:41,310 So in that case, if P over T is equal to 1/2, then this is 501 00:26:41,310 --> 00:26:51,230 equal to minus 1/2, times the logarithm of 1/2. 502 00:26:51,230 --> 00:26:53,360 And I guess, since they're both the same, we 503 00:26:53,360 --> 00:26:55,400 can multiply by two. 504 00:26:55,400 --> 00:26:56,650 And what's that value? 505 00:27:00,596 --> 00:27:04,768 [INAUDIBLE], what does that calculate out to? 
506 00:27:04,768 --> 00:27:06,716 STUDENT: Minus [INAUDIBLE] 507 00:27:06,716 --> 00:27:09,650 PATRICK WINSTON: Minus [INAUDIBLE]. 508 00:27:09,650 --> 00:27:11,790 Well, with a minus sign, you just turn the argument upside 509 00:27:11,790 --> 00:27:12,950 down, so it's log(2). 510 00:27:12,950 --> 00:27:14,890 So what's log(2)? 511 00:27:14,890 --> 00:27:18,920 Logarithm of base 2 of 2? 512 00:27:18,920 --> 00:27:19,870 1! 513 00:27:19,870 --> 00:27:22,720 So this whole thing is-- 514 00:27:22,720 --> 00:27:23,020 STUDENT: 1. 515 00:27:23,020 --> 00:27:24,000 PATRICK WINSTON: 1. 516 00:27:24,000 --> 00:27:27,650 So [INAUDIBLE], in her soft way, says, well, let's see. 517 00:27:27,650 --> 00:27:28,800 2 times 1/2. 518 00:27:28,800 --> 00:27:29,710 That cancels out. 519 00:27:29,710 --> 00:27:32,680 The minus, that flips the arguments so it's log to the 520 00:27:32,680 --> 00:27:35,640 base 2 of 2, and that's 1. 521 00:27:35,640 --> 00:27:38,785 So this whole thing, you work out the algebra, 522 00:27:38,785 --> 00:27:40,710 it gives you 1. 523 00:27:40,710 --> 00:27:43,740 So that's cool. 524 00:27:43,740 --> 00:27:47,460 So right here in the middle where they're equal, we get a 525 00:27:47,460 --> 00:27:48,710 value of 1. 526 00:27:51,800 --> 00:27:54,610 Next thing we need to do is let's calculate what happens 527 00:27:54,610 --> 00:27:59,610 if P over T is equal to 1. 528 00:27:59,610 --> 00:28:01,450 That is to say, everything is a positive. 529 00:28:01,450 --> 00:28:03,100 Any guesses? 530 00:28:03,100 --> 00:28:06,410 Maybe 10, 20, minus 15? 531 00:28:06,410 --> 00:28:08,650 Let's work it out. 532 00:28:08,650 --> 00:28:16,946 So if P over T equals 1, that would be minus 1 times the log 533 00:28:16,946 --> 00:28:21,120 to the base 2 of 1. 534 00:28:21,120 --> 00:28:22,642 What's that? 535 00:28:22,642 --> 00:28:24,040 STUDENT: [INAUDIBLE] 536 00:28:24,040 --> 00:28:25,570 PATRICK WINSTON: A 0? 537 00:28:25,570 --> 00:28:25,910 Oh, yeah.
538 00:28:25,910 --> 00:28:30,430 Because 2 raised to the 0 is one. 539 00:28:30,430 --> 00:28:31,715 So this part is 0. 540 00:28:34,310 --> 00:28:35,560 Now, what about this other part? 541 00:28:38,810 --> 00:28:41,190 If everything's a P, then nothing's an N. 542 00:28:41,190 --> 00:28:42,320 So we've got 0. 543 00:28:42,320 --> 00:28:44,820 And we can quit already. 544 00:28:44,820 --> 00:28:45,590 Well, not quite. 545 00:28:45,590 --> 00:28:46,710 We ought to work it out. 546 00:28:46,710 --> 00:28:50,290 Log to the base 2 of 0. 547 00:28:50,290 --> 00:28:51,344 What's that? 548 00:28:51,344 --> 00:28:53,320 STUDENT: [INAUDIBLE] 549 00:28:53,320 --> 00:28:54,710 PATRICK WINSTON: Who? 550 00:28:54,710 --> 00:28:56,110 Minus infinity? 551 00:28:56,110 --> 00:28:57,100 Uh oh. 552 00:28:57,100 --> 00:29:00,140 0 times minus infinity is-- what? I didn't get that when I was 553 00:29:00,140 --> 00:29:02,470 in high school. 554 00:29:02,470 --> 00:29:06,550 Finally, 18.01 makes a difference. 555 00:29:06,550 --> 00:29:07,390 Finally. 556 00:29:07,390 --> 00:29:09,630 What's the answer? 557 00:29:09,630 --> 00:29:16,280 We're interested in the limit as N over T goes to 0, right? 558 00:29:16,280 --> 00:29:20,392 And when you have a deal like this, what do you do? 559 00:29:20,392 --> 00:29:25,330 You use that famous rule, that we all mispronounce when we 560 00:29:25,330 --> 00:29:27,600 see it written, right? 561 00:29:27,600 --> 00:29:31,240 We use the good old El Hospital's rule. 562 00:29:31,240 --> 00:29:33,210 OK, it's L'Hopital. 563 00:29:33,210 --> 00:29:34,610 L'Hopital's Rule. 564 00:29:34,610 --> 00:29:36,720 You have to differentiate the-- 565 00:29:36,720 --> 00:29:40,880 I guess we differentiate this guy as a ratio or something, 566 00:29:40,880 --> 00:29:42,610 and see what happens when it goes to 0. 567 00:29:42,610 --> 00:29:46,130 And what we get when we use L'Hopital's Rule is that, oh 568 00:29:46,130 --> 00:29:50,100 thank God, this is still zero.
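The disorder formula and the two values just worked out at the board can be checked with a short Python sketch. This is an illustration under stated assumptions, not code from the lecture: the function name `disorder` and its arguments are made up here, and a zero count is skipped because, as the L'Hopital argument shows, that term contributes 0 in the limit.

```python
import math


def disorder(positives: int, negatives: int) -> float:
    """Disorder of a two-class set: -(P/T) log2(P/T) - (N/T) log2(N/T)."""
    total = positives + negatives
    if total == 0:
        raise ValueError("disorder is undefined for an empty set")
    result = 0.0
    for count in (positives, negatives):
        if count == 0:
            # Limiting case: x * log2(x) goes to 0 as x goes to 0,
            # so an absent class contributes nothing.
            continue
        ratio = count / total
        result -= ratio * math.log2(ratio)
    return result


print(disorder(1, 1))  # evenly mixed set -> 1.0
print(disorder(4, 0))  # homogeneous set -> 0.0
```

Skipping the zero-count term also avoids asking `math.log2` for the logarithm of 0, which raises an error rather than returning minus infinity.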
569 00:29:50,100 --> 00:29:52,980 So now we know that we have a point up there and a point 570 00:29:52,980 --> 00:29:55,740 down there. 571 00:29:55,740 --> 00:29:57,410 So now we've got three points on the curve, 572 00:29:57,410 --> 00:29:58,660 and we can draw it. 573 00:30:01,310 --> 00:30:02,560 It goes like that. 574 00:30:05,140 --> 00:30:06,260 No, it doesn't go like that. 575 00:30:06,260 --> 00:30:08,400 It's obviously a Gaussian, right? 576 00:30:08,400 --> 00:30:10,240 Because everything in nature is a Gaussian. 577 00:30:10,240 --> 00:30:11,640 Can you put that laptop away, please? 578 00:30:11,640 --> 00:30:13,260 Everything in nature is a Gaussian, so 579 00:30:13,260 --> 00:30:14,510 it looks like this. 580 00:30:18,158 --> 00:30:21,020 That right? 581 00:30:21,020 --> 00:30:23,840 No, actually, not everything in nature is a Gaussian. 582 00:30:23,840 --> 00:30:26,880 And in particular, this one isn't a Gaussian either. 583 00:30:26,880 --> 00:30:30,400 It looks more like one of those metal things they used 584 00:30:30,400 --> 00:30:32,660 to call quonset huts. 585 00:30:32,660 --> 00:30:34,612 That's what it looks like. 586 00:30:34,612 --> 00:30:37,100 Boom, like so. 587 00:30:37,100 --> 00:30:40,270 So that is the curve of interest. 588 00:30:40,270 --> 00:30:43,540 Now, did God say that using this way of measuring disorder 589 00:30:43,540 --> 00:30:45,950 was the best way? 590 00:30:45,950 --> 00:30:51,930 No, God has not indicated any choice here. 591 00:30:51,930 --> 00:30:55,570 We use this because it's a convenient mechanism, it seems 592 00:30:55,570 --> 00:30:58,440 to make sense, but in contrast to the reason it's used 593 00:30:58,440 --> 00:31:01,690 in information theory, it's not the result of some elegant 594 00:31:01,690 --> 00:31:02,270 mathematics. 595 00:31:02,270 --> 00:31:04,870 It's just a borrowing of something that seems to work 596 00:31:04,870 --> 00:31:06,870 pretty well.
597 00:31:06,870 --> 00:31:09,560 Any of those curves would work just about the same, because 598 00:31:09,560 --> 00:31:11,240 all we're doing with it is measuring how 599 00:31:11,240 --> 00:31:13,780 disordered a set is. 600 00:31:13,780 --> 00:31:18,950 So one thing to note here is that in this situation, where 601 00:31:18,950 --> 00:31:20,350 we're dealing with two choices-- 602 00:31:20,350 --> 00:31:23,340 P and N, positives and negatives-- 603 00:31:23,340 --> 00:31:26,450 we get a curve that maxes out at one. 604 00:31:26,450 --> 00:31:28,830 And notice that it kind of gets up there pretty fast. 605 00:31:28,830 --> 00:31:33,890 In fact, if you're down here at 2/3, you're up here, 606 00:31:33,890 --> 00:31:39,090 this is about 0.9. 607 00:31:39,090 --> 00:31:43,770 So it gives you a large number for quite a bit of that area 608 00:31:43,770 --> 00:31:45,940 in the middle. 609 00:31:45,940 --> 00:31:49,210 So that, unfortunately, still doesn't tell us everything we 610 00:31:49,210 --> 00:31:49,550 need to know. 611 00:31:49,550 --> 00:31:53,700 That tells us how to measure disorder in one of these sets. 612 00:31:53,700 --> 00:31:55,680 But we want to know how to measure the quality of the 613 00:31:55,680 --> 00:31:57,750 test overall. 614 00:31:57,750 --> 00:32:00,950 So we need some mechanism that says, OK, given that this test 615 00:32:00,950 --> 00:32:04,370 produces three different sets, and we now have a measure of 616 00:32:04,370 --> 00:32:08,100 the disorder in each of these sets, how do we measure the 617 00:32:08,100 --> 00:32:11,496 overall quality of the test? 618 00:32:11,496 --> 00:32:14,360 Well, you could just add up the disorder. 619 00:32:14,360 --> 00:32:16,630 Let's write that down, because that sounds good. 620 00:32:23,960 --> 00:32:33,200 So you can say that the quality of a test is equal to 621 00:32:33,200 --> 00:32:36,160 some sum over the sets produced.
622 00:32:41,280 --> 00:32:42,640 And what we're going to do is we're going to add up the 623 00:32:42,640 --> 00:32:45,620 disorder of each of those sets. 624 00:32:49,380 --> 00:32:53,220 I'm almost home, except that this means we're going to give 625 00:32:53,220 --> 00:33:00,210 equal weight to a branch that has almost nothing down it-- 626 00:33:00,210 --> 00:33:03,190 we're going to give the same weight to that as a branch 627 00:33:03,190 --> 00:33:05,920 that has almost everything going down it. 628 00:33:05,920 --> 00:33:07,150 So that doesn't seem to make sense. 629 00:33:07,150 --> 00:33:12,050 So one final flourish is we're going to weight this sum 630 00:33:12,050 --> 00:33:16,360 according to the fraction of the samples that end up down 631 00:33:16,360 --> 00:33:18,200 that branch. 632 00:33:18,200 --> 00:33:21,840 So it's, as usual, easier to write it down than to say it. 633 00:33:21,840 --> 00:33:27,370 So we're going to multiply that times the number of 634 00:33:27,370 --> 00:33:41,530 samples in the set, divided by the number of 635 00:33:41,530 --> 00:33:51,390 samples handled by test. 636 00:33:54,610 --> 00:33:57,570 So if half the samples go down a branch, and if that branch 637 00:33:57,570 --> 00:34:01,500 has a certain disorder, then we're going to multiply that 638 00:34:01,500 --> 00:34:04,090 disorder times 1/2. 639 00:34:04,090 --> 00:34:04,570 All right. 640 00:34:04,570 --> 00:34:08,610 So now let's see how it works with our sample problem. 641 00:34:08,610 --> 00:34:11,139 Well, here is our sample data. 642 00:34:11,139 --> 00:34:12,780 And we didn't need anything fancy for it. 643 00:34:12,780 --> 00:34:16,270 But let's pretend it was a large data set. 644 00:34:16,270 --> 00:34:16,790 Well, let's see. 645 00:34:16,790 --> 00:34:17,590 What would we do? 646 00:34:17,590 --> 00:34:23,020 Well, go down this way, there are 4 647 00:34:23,020 --> 00:34:24,639 samples down that direction.
648 00:34:24,639 --> 00:34:26,790 That's half of the total number of samples. 649 00:34:26,790 --> 00:34:28,139 So whatever we find down there, we're going 650 00:34:28,139 --> 00:34:29,820 to multiply by 1/2. 651 00:34:29,820 --> 00:34:32,150 This one we're going to multiply by 3/8. 652 00:34:32,150 --> 00:34:35,889 And this one we're going to multiply by 1/8. 653 00:34:35,889 --> 00:34:37,969 Now, what do we actually find at the bottom of these things? 654 00:34:37,969 --> 00:34:40,170 Well, here's a homogeneous set. 655 00:34:40,170 --> 00:34:41,770 Everything's the same. 656 00:34:41,770 --> 00:34:44,560 So we go to that curve and say, what is the disorder of a 657 00:34:44,560 --> 00:34:45,949 homogeneous set? 658 00:34:45,949 --> 00:34:47,199 It's zero. 659 00:34:50,380 --> 00:34:52,090 Let's see, they're all the same. 660 00:34:52,090 --> 00:34:57,640 I guess that means it's 0 over there. 661 00:34:57,640 --> 00:35:04,470 So the disorder of this set of three samples is zero. 662 00:35:04,470 --> 00:35:07,260 The disorder of this set of one sample, all 663 00:35:07,260 --> 00:35:10,110 the same, is zero. 664 00:35:10,110 --> 00:35:13,720 The disorder of this set-- well, let's see. 665 00:35:13,720 --> 00:35:16,780 Half of the samples there are plus, and half are minus, so 666 00:35:16,780 --> 00:35:20,830 we go over to our curve, and we say, what's the disorder of 667 00:35:20,830 --> 00:35:23,640 something with equal mixture of pluses and minuses? 668 00:35:23,640 --> 00:35:25,500 And that's one. 669 00:35:25,500 --> 00:35:28,560 So the disorder of this guy is one. 670 00:35:28,560 --> 00:35:33,590 So now we've got 1/2 times 1, and 3/8 times 0, 1/8 times 0. 671 00:35:33,590 --> 00:35:38,660 So the quality of this particular test, as determined 672 00:35:38,660 --> 00:35:43,770 by the disorder of the sets it produces, is 1/2. 673 00:35:43,770 --> 00:35:45,020 0.5. 674 00:35:48,420 --> 00:35:50,420 Let's do this one.
675 00:35:50,420 --> 00:35:53,910 So we have 3/8 coming down this way, 5/8 676 00:35:53,910 --> 00:35:55,270 coming down this way. 677 00:35:55,270 --> 00:35:57,610 3/8 is multiplied by the disorder of a 678 00:35:57,610 --> 00:35:58,730 set of uniform things. 679 00:35:58,730 --> 00:36:01,370 That's disorder 0. 680 00:36:01,370 --> 00:36:04,540 So this guy over here, let's see. 681 00:36:04,540 --> 00:36:09,160 That's 2/5 and 3/5 multiplied-- 682 00:36:09,160 --> 00:36:11,010 You know, this is one of those deals where if you look at the 683 00:36:11,010 --> 00:36:14,840 curve, you're pretty close to the middle. 684 00:36:14,840 --> 00:36:18,470 And that curve goes all the way up to about 0.9 there. 685 00:36:18,470 --> 00:36:20,560 So you can kind of just look at this, and eyeball it, and 686 00:36:20,560 --> 00:36:27,010 say, well, whatever it is, the overall, this is going to be 687 00:36:27,010 --> 00:36:29,060 something multiplied times 5/8. 688 00:36:29,060 --> 00:36:31,680 Something like 0.9 times 5/8. 689 00:36:31,680 --> 00:36:35,440 So let's just say, for the sake of discussion, that 690 00:36:35,440 --> 00:36:39,550 that's going to be about 0.6, which is within a hundredth, I 691 00:36:39,550 --> 00:36:41,340 think, of being right. 692 00:36:41,340 --> 00:36:44,220 Just kind of guessing. 693 00:36:44,220 --> 00:36:45,440 OK, well now we're on a roll. 694 00:36:45,440 --> 00:36:49,040 Here, we have 3/8 coming down this branch, 3/8 coming down 695 00:36:49,040 --> 00:36:52,640 this branch, 1/4 coming down this branch. 696 00:36:52,640 --> 00:36:54,510 This is 0. 697 00:36:54,510 --> 00:36:58,910 And this is one of those deals where these two are about 0.9. 698 00:36:58,910 --> 00:37:05,680 So it looks like it's going to be 3/8 plus 3/8 is 3/4. 699 00:37:05,680 --> 00:37:07,640 Times about 0.9. 700 00:37:07,640 --> 00:37:09,602 So that's going to turn out to be about 0.7. 701 00:37:17,710 --> 00:37:19,230 So one last go here. 
702 00:37:19,230 --> 00:37:24,850 3/8, 3/8, and 1/4. 703 00:37:24,850 --> 00:37:26,280 Oh, that's interesting. 704 00:37:26,280 --> 00:37:29,890 Because these two are what we got 705 00:37:29,890 --> 00:37:32,490 contributed up to that 0.7. 706 00:37:32,490 --> 00:37:34,390 This one is 1/4 times-- 707 00:37:34,390 --> 00:37:37,090 this is evenly divided, so that's going to 708 00:37:37,090 --> 00:37:40,380 have disorder of 1. 709 00:37:40,380 --> 00:37:43,910 So that's going to be 0.25 bigger than the 710 00:37:43,910 --> 00:37:45,610 number we got over here. 711 00:37:45,610 --> 00:37:51,410 So that's going to end up being about 0.95. 712 00:37:51,410 --> 00:37:53,980 So thank God our answer is the same as we got with our 713 00:37:53,980 --> 00:37:57,130 simple classroom measurement of disorder. 714 00:37:57,130 --> 00:37:59,800 Except this is measuring how disordered stuff is, we want 715 00:37:59,800 --> 00:38:02,520 the small number, not the big number. 716 00:38:02,520 --> 00:38:06,110 So once again, based on this analysis, you'll be sure to 717 00:38:06,110 --> 00:38:10,730 pick the shadow cast, because 0.5 is less than 0.6, which is 718 00:38:10,730 --> 00:38:13,640 less than 0.7, which is less than 0.95. 719 00:38:13,640 --> 00:38:16,310 So that accent test is really horrible. 720 00:38:16,310 --> 00:38:18,380 Don't use it. 721 00:38:18,380 --> 00:38:20,160 Just because somebody has a heavy accent doesn't mean 722 00:38:20,160 --> 00:38:21,430 they're a vampire. 723 00:38:21,430 --> 00:38:24,240 In fact, most vampires have worked very hard on their 724 00:38:24,240 --> 00:38:26,450 accent, as I mentioned before. 725 00:38:26,450 --> 00:38:28,470 All right, so now we know that we're still going to pick the 726 00:38:28,470 --> 00:38:32,440 shadow test as our first go. 727 00:38:32,440 --> 00:38:34,040 So that's good.
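The four eyeballed scores can be reproduced with a small Python sketch of the weighted sum: each branch's disorder times the fraction of samples that go down that branch. The branch sizes and mixtures below are the ones read off the board in the lecture; which class counts as positive inside each mixed branch is an assumption (the disorder is symmetric, so it does not affect the result), and the names `disorder`, `quality_of_test`, `second_test`, and `third_test` are made up for illustration.

```python
import math


def disorder(positives, negatives):
    """Two-class disorder in bits; an absent class contributes 0."""
    total = positives + negatives
    return sum(-(c / total) * math.log2(c / total)
               for c in (positives, negatives) if c > 0)


def quality_of_test(branches):
    """Sum of branch disorders, each weighted by its share of the samples."""
    handled = sum(p + n for p, n in branches)
    return sum((p + n) / handled * disorder(p, n) for p, n in branches)


# Each tuple is (positives, negatives) for one branch of a test on 8 samples.
tests = {
    "shadow": [(2, 2), (0, 3), (1, 0)],       # weights 1/2, 3/8, 1/8
    "second_test": [(0, 3), (3, 2)],          # weights 3/8, 5/8
    "third_test": [(2, 1), (1, 2), (0, 2)],   # weights 3/8, 3/8, 1/4
    "accent": [(1, 2), (1, 2), (1, 1)],       # weights 3/8, 3/8, 1/4
}

for name, branches in tests.items():
    print(name, round(quality_of_test(branches), 2))
# shadow 0.5, second_test 0.61, third_test 0.69, accent 0.94

best = min(tests, key=lambda name: quality_of_test(tests[name]))
print(best)  # shadow
```

The computed values sit close to the board estimates of 0.5, 0.6, 0.7, and 0.95, and the ordering is the same, so the shadow test is still the first choice.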
728 00:38:34,040 --> 00:38:36,460 Now, let's see if we can repeat the exercise with our 729 00:38:36,460 --> 00:38:39,290 second selection, the one we have to have to pick those 730 00:38:39,290 --> 00:38:40,910 guys apart. 731 00:38:40,910 --> 00:38:42,760 And this is going to be easier, because there are 732 00:38:42,760 --> 00:38:44,280 fewer things to work with. 733 00:38:44,280 --> 00:38:45,700 Ooh, wow, look. 734 00:38:45,700 --> 00:38:47,930 That's 0. 735 00:38:47,930 --> 00:38:49,030 That's 0. 736 00:38:49,030 --> 00:38:50,530 That's 1/2. 737 00:38:50,530 --> 00:38:53,000 That's 1/2. 738 00:38:53,000 --> 00:38:58,030 So the disorder of this guy is 0.0. 739 00:38:58,030 --> 00:39:05,630 So this is 1/4, 1/4, 1/2, 0, 0. 740 00:39:05,630 --> 00:39:07,140 1/2 times 1. 741 00:39:07,140 --> 00:39:09,590 Ooh, that's 0.5. 742 00:39:09,590 --> 00:39:10,500 That was easy. 743 00:39:10,500 --> 00:39:11,740 How about this one? 744 00:39:11,740 --> 00:39:13,300 Oh, he says 1. 745 00:39:13,300 --> 00:39:13,500 Let's see. 746 00:39:13,500 --> 00:39:14,210 That's 1. 747 00:39:14,210 --> 00:39:15,170 That's 1. 748 00:39:15,170 --> 00:39:16,130 That's 1/2. 749 00:39:16,130 --> 00:39:16,890 That's 1/2. 750 00:39:16,890 --> 00:39:20,130 Yeah, it is one. 751 00:39:20,130 --> 00:39:22,480 So sure enough, the answer also comes out to be the same 752 00:39:22,480 --> 00:39:24,450 as before, when we did our just 753 00:39:24,450 --> 00:39:27,160 simple intuition exercise. 754 00:39:27,160 --> 00:39:28,710 So I don't know. 755 00:39:28,710 --> 00:39:33,895 Christopher, is this all about using information theory? 756 00:39:33,895 --> 00:39:34,320 STUDENT: No. 757 00:39:34,320 --> 00:39:35,570 PATRICK WINSTON: No, no, no. 758 00:39:38,230 --> 00:39:40,010 See, it's not about the math. 759 00:39:40,010 --> 00:39:41,000 It's about the intuition. 
760 00:39:41,000 --> 00:39:43,630 And the intuition is that you want to build a tree that's as 761 00:39:43,630 --> 00:39:44,750 simple as possible. 762 00:39:44,750 --> 00:39:47,500 And you can build a tree that's as simple as possible 763 00:39:47,500 --> 00:39:50,610 if you look at the data, and say, well, which test does the 764 00:39:50,610 --> 00:39:52,640 best job of splitting things up? 765 00:39:52,640 --> 00:39:56,150 Which test does the best job of building subsets underneath 766 00:39:56,150 --> 00:39:59,310 it that are as homogeneous as possible? 767 00:39:59,310 --> 00:40:03,330 So all this information theory, all this entropy 768 00:40:03,330 --> 00:40:07,290 stuff, is just a convenient mechanism for doing something 769 00:40:07,290 --> 00:40:09,440 that is intuitionally sound. 770 00:40:09,440 --> 00:40:10,175 OK? 771 00:40:10,175 --> 00:40:11,840 It's not about information theory. 772 00:40:11,840 --> 00:40:15,990 It's about a sound intuition. 773 00:40:15,990 --> 00:40:16,400 Oh, by the way. 774 00:40:16,400 --> 00:40:19,440 Does this kind of stuff ever get used in practice? 775 00:40:19,440 --> 00:40:21,780 10s of thousands of times. 776 00:40:21,780 --> 00:40:25,400 This is a winning mechanism that's used over and over 777 00:40:25,400 --> 00:40:30,320 again, even when the data is numeric. 778 00:40:30,320 --> 00:40:32,170 How would it work if it's numeric data? 779 00:40:32,170 --> 00:40:33,660 Well, let's think about that for a little bit. 780 00:40:41,820 --> 00:40:45,620 So let's suppose that we have an opportunity. 781 00:40:45,620 --> 00:40:48,430 We're an EMT or something, we work in the infirmary. 782 00:40:48,430 --> 00:40:49,360 What do they call it these days? 783 00:40:49,360 --> 00:40:49,940 Something else. 784 00:40:49,940 --> 00:40:52,500 But anyhow, you work in that kind of area, and you have the 785 00:40:52,500 --> 00:40:55,420 opportunity to take people's temperature. 
786 00:40:55,420 --> 00:41:00,230 And so over time, you've accumulated some data on the 787 00:41:00,230 --> 00:41:02,320 temperature of people. 788 00:41:02,320 --> 00:41:04,020 And maybe you've found that there's a vampire 789 00:41:04,020 --> 00:41:07,140 here at about 102. 790 00:41:07,140 --> 00:41:09,980 There's a normal person here, about 98.6. 791 00:41:09,980 --> 00:41:12,000 But then they're scattered around. 792 00:41:14,590 --> 00:41:16,960 Some people have fevers when they come in. 793 00:41:16,960 --> 00:41:19,950 So the question is, is there a way of using numerical data-- 794 00:41:19,950 --> 00:41:22,750 things that you can put real numbers-- 795 00:41:22,750 --> 00:41:25,180 is there a way of using that with this mechanism? 796 00:41:25,180 --> 00:41:26,320 And the answer is yes. 797 00:41:26,320 --> 00:41:29,180 You just say, is the temperature greater than or 798 00:41:29,180 --> 00:41:30,910 less than some threshold? 799 00:41:30,910 --> 00:41:33,620 And that gives you a test, a binary test, just like any of 800 00:41:33,620 --> 00:41:36,460 these other tests. 801 00:41:36,460 --> 00:41:37,000 [? Krishna? ?] 802 00:41:37,000 --> 00:41:38,930 Right? 803 00:41:38,930 --> 00:41:40,180 But where would I put the threshold? 804 00:41:44,660 --> 00:41:48,480 I suppose I could just put it at the average value. 805 00:41:48,480 --> 00:41:51,100 But that might not be the place that does the best job 806 00:41:51,100 --> 00:41:57,400 of splitting the samples into homogeneous groups. 807 00:41:57,400 --> 00:41:57,690 Christopher? 808 00:41:57,690 --> 00:41:59,788 STUDENT: So you run this numerical analysis on 809 00:41:59,788 --> 00:42:01,150 different places with different thresholds. 810 00:42:01,150 --> 00:42:04,070 PATRICK WINSTON: So you try different places, he says. 811 00:42:04,070 --> 00:42:05,340 And he's right. 812 00:42:05,340 --> 00:42:07,980 Because this is a computer, this is our slave. 
813 00:42:07,980 --> 00:42:09,930 We don't care how much it works to figure 814 00:42:09,930 --> 00:42:11,660 out the right threshold. 815 00:42:11,660 --> 00:42:15,610 So what we do is we say, well, maybe the threshold's halfway 816 00:42:15,610 --> 00:42:17,835 between those two guys, or halfway between those two 817 00:42:17,835 --> 00:42:19,450 guys, or those two guys, or those two guys, 818 00:42:19,450 --> 00:42:20,940 or those two guys. 819 00:42:20,940 --> 00:42:22,540 So we can try one less threshold 820 00:42:22,540 --> 00:42:23,600 than we have samples. 821 00:42:23,600 --> 00:42:25,600 And we don't care if there are 10,000 samples, because this 822 00:42:25,600 --> 00:42:28,720 is a computer, and we don't care if it works all night. 823 00:42:28,720 --> 00:42:32,750 So that's how you find the threshold for a numeric test. 824 00:42:32,750 --> 00:42:34,520 By the way, I assured you earlier on you would never use 825 00:42:34,520 --> 00:42:35,360 the same test twice. 826 00:42:35,360 --> 00:42:38,190 Is that true for this? 827 00:42:38,190 --> 00:42:39,910 Yes, you would still never use the same test twice. 828 00:42:39,910 --> 00:42:41,320 But what you might do is you might use a different 829 00:42:41,320 --> 00:42:44,860 threshold on the same measurement 830 00:42:44,860 --> 00:42:46,940 the next time around. 831 00:42:46,940 --> 00:42:49,480 So when you start having numerical data, you may find 832 00:42:49,480 --> 00:42:57,730 yourself using the same test with the same axis but with a 833 00:42:57,730 --> 00:43:00,000 different value. 834 00:43:00,000 --> 00:43:00,330 All right. 835 00:43:00,330 --> 00:43:05,500 So now that we have this, then we can go back and compare how 836 00:43:05,500 --> 00:43:10,660 this method would look when we put it up against the sort of 837 00:43:10,660 --> 00:43:15,290 stuff we were talking about last time, with 838 00:43:15,290 --> 00:43:16,700 the electrical covers. 
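The threshold search described above can be sketched in Python: sort the samples, try the midpoint between each adjacent pair (one fewer threshold than samples), and keep the one whose two sides have the lowest weighted disorder. The temperatures and labels below are invented for illustration, as are the names `disorder` and `best_threshold`; the lecture gives only the idea, not data or code.

```python
import math


def disorder(labels):
    """Two-class disorder in bits for a list of True/False labels."""
    total = len(labels)
    positives = sum(labels)
    result = 0.0
    for count in (positives, total - positives):
        if count:
            ratio = count / total
            result -= ratio * math.log2(ratio)
    return result


def best_threshold(samples):
    """Return (threshold, weighted disorder) for the best binary split."""
    ordered = sorted(samples)
    best = None
    for (low, _), (high, _) in zip(ordered, ordered[1:]):
        if low == high:
            continue  # no room for a threshold between equal values
        threshold = (low + high) / 2
        below = [label for value, label in ordered if value < threshold]
        above = [label for value, label in ordered if value >= threshold]
        quality = (len(below) * disorder(below)
                   + len(above) * disorder(above)) / len(ordered)
        if best is None or quality < best[1]:
            best = (threshold, quality)
    return best


# Hypothetical (temperature, is_vampire) samples; one ordinary person has a fever.
samples = [(98.6, False), (98.2, False), (100.4, False),
           (101.6, True), (102.0, True), (103.1, True)]

threshold, quality = best_threshold(samples)
print(threshold, quality)
```

With this made-up data, the only perfectly clean split falls between 100.4 and 101.6, which is not where the average temperature lies. This sketch recomputes both sides for every candidate, which is the "let the computer work all night" approach from the lecture rather than an efficient one.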
839 00:43:20,980 --> 00:43:25,020 So with the electrical covers, we had a situation like this. 840 00:43:25,020 --> 00:43:25,470 I don't know. 841 00:43:25,470 --> 00:43:31,290 We had samples that were places like this, and we had a 842 00:43:31,290 --> 00:43:34,510 division of the space that looked pretty much like that. 843 00:43:37,220 --> 00:43:42,000 Not quite exactly in the right spots, but pretty close. 844 00:43:42,000 --> 00:43:45,660 So these are the decision boundaries for the situation 845 00:43:45,660 --> 00:43:47,540 where we are using nearest neighbors to 846 00:43:47,540 --> 00:43:50,608 divide up the data. 847 00:43:50,608 --> 00:43:53,310 What would the decision boundaries look like if these 848 00:43:53,310 --> 00:43:57,940 were four different kinds of things, and we were using this 849 00:43:57,940 --> 00:43:59,190 kind of mechanism? 850 00:44:01,560 --> 00:44:04,640 And maybe there's a lot of samples all clustered around 851 00:44:04,640 --> 00:44:07,170 places like that. 852 00:44:07,170 --> 00:44:09,280 What would the decision boundaries look like? 853 00:44:09,280 --> 00:44:11,830 Would they be the same as this? 854 00:44:11,830 --> 00:44:12,470 God, I hope not. 855 00:44:12,470 --> 00:44:14,400 Why? 856 00:44:14,400 --> 00:44:16,700 Because what we're going to do is we're going to use a 857 00:44:16,700 --> 00:44:20,320 threshold on each axis. 858 00:44:20,320 --> 00:44:22,990 So therefore, the decision boundaries are going to be 859 00:44:22,990 --> 00:44:26,100 parallel to one axis or the other. 860 00:44:26,100 --> 00:44:29,460 So we might decide, for example-- 861 00:44:29,460 --> 00:44:30,220 Oh, shoot. 862 00:44:30,220 --> 00:44:32,480 I think I'll draw it again, because it'll get confused if 863 00:44:32,480 --> 00:44:34,530 I draw it over the other one. 864 00:44:34,530 --> 00:44:37,720 So it looks like this. 865 00:44:37,720 --> 00:44:40,270 And that's how nearest neighbors does it.
866 00:44:40,270 --> 00:44:44,540 But an identification tree approach will pick a threshold 867 00:44:44,540 --> 00:44:45,870 along one axis or the other. 868 00:44:45,870 --> 00:44:48,340 Let's say it's this axis. 869 00:44:48,340 --> 00:44:49,580 It's only got one choice there. 870 00:44:49,580 --> 00:44:53,370 So it's going to put a line there. 871 00:44:53,370 --> 00:44:55,820 And now, what's the next thing it does? 872 00:44:55,820 --> 00:44:59,370 Well, it still has these two different kinds 873 00:44:59,370 --> 00:45:00,250 of things to separate. 874 00:45:00,250 --> 00:45:01,300 We're going to assume we've got four 875 00:45:01,300 --> 00:45:03,380 different kinds of things. 876 00:45:03,380 --> 00:45:06,220 So it's going to say, oh! 877 00:45:06,220 --> 00:45:12,650 I've come down the negative side, so I need a threshold on 878 00:45:12,650 --> 00:45:14,440 the remaining data. 879 00:45:14,440 --> 00:45:17,260 And these are the only two things that are now remaining. 880 00:45:17,260 --> 00:45:23,030 So my only choice is to put a threshold in there. 881 00:45:23,030 --> 00:45:26,370 Now I guarantee this, absolutely guaranteed-- 882 00:45:26,370 --> 00:45:28,570 on the quiz, somebody-- 883 00:45:28,570 --> 00:45:31,090 presumably somebody who doesn't go to lectures-- 884 00:45:31,090 --> 00:45:32,890 will draw that line all the way across. 885 00:45:32,890 --> 00:45:35,730 And that's desperately wrong. 886 00:45:35,730 --> 00:45:39,780 Because we've already divided this data set in half. 887 00:45:39,780 --> 00:45:43,170 Now the choice of what we do over here is governed only by 888 00:45:43,170 --> 00:45:46,270 the remaining samples that we see, these two. 889 00:45:46,270 --> 00:45:49,420 And so the threshold is going to go in there like that. 890 00:45:52,150 --> 00:45:55,050 So that's what happens when you go back. 891 00:45:55,050 --> 00:45:59,270 This is used 10s of thousands of times. 892 00:45:59,270 --> 00:46:00,250 Always used.
893 00:46:00,250 --> 00:46:01,460 What are the virtues of it? 894 00:46:01,460 --> 00:46:05,360 Number one, you don't use all the tests. 895 00:46:05,360 --> 00:46:07,630 You use only the tests that seem to be doing some useful 896 00:46:07,630 --> 00:46:09,250 work for you. 897 00:46:09,250 --> 00:46:11,750 So that means that you do a better job, because your 898 00:46:11,750 --> 00:46:13,580 measurement technique is simpler. 899 00:46:13,580 --> 00:46:17,220 And it costs less, because you're not going to the 900 00:46:17,220 --> 00:46:21,140 expense of doing all of the testing. 901 00:46:21,140 --> 00:46:22,910 So it's a real winner. 902 00:46:22,910 --> 00:46:24,230 But you know what? 903 00:46:24,230 --> 00:46:26,200 Some classes of people-- 904 00:46:26,200 --> 00:46:30,420 not scientists, but I mean people like doctors and stuff. 905 00:46:30,420 --> 00:46:33,750 They don't like to look at these trees. 906 00:46:33,750 --> 00:46:36,170 They're kind of rule-oriented. 907 00:46:36,170 --> 00:46:40,370 So they look at a tree like this for determining what kind of 908 00:46:40,370 --> 00:46:45,420 thyroid disease you have, and it would have maybe 20 or so 909 00:46:45,420 --> 00:46:48,670 tests in it of various kinds of hormones, like thyroxine 910 00:46:48,670 --> 00:46:50,200 and this and that. 911 00:46:50,200 --> 00:46:52,560 And they say, ah, we can't deal with that. 912 00:46:52,560 --> 00:46:56,400 So we have to work with them. 913 00:46:56,400 --> 00:47:01,230 So what we do is we convert the tree into a set of rules. 914 00:47:01,230 --> 00:47:03,030 How do we convert the tree into a set of rules? 915 00:47:09,720 --> 00:47:12,560 Oops, wrong one. 916 00:47:12,560 --> 00:47:13,810 Go away, go away. 917 00:47:16,310 --> 00:47:17,030 Here's what I want. 918 00:47:17,030 --> 00:47:18,070 Yeah, good. 919 00:47:18,070 --> 00:47:21,480 How would we convert this tree into a set of rules? 920 00:47:21,480 --> 00:47:22,470 It's straightforward.
921 00:47:22,470 --> 00:47:23,680 [INAUDIBLE], what do we do? 922 00:47:23,680 --> 00:47:25,390 STUDENT: You'd basically just look down each branch-- 923 00:47:25,390 --> 00:47:26,730 PATRICK WINSTON: You'd basically just go down each 924 00:47:26,730 --> 00:47:28,510 branch to a leaf. 925 00:47:28,510 --> 00:47:31,990 So you say, for example, here's one rule. 926 00:47:31,990 --> 00:47:44,970 If shadow equals question mark, and garlic equals oh, 927 00:47:44,970 --> 00:47:45,770 [INAUDIBLE] 928 00:47:45,770 --> 00:47:47,020 want to choose No. 929 00:47:50,012 --> 00:47:51,880 Doesn't eat garlic. 930 00:47:51,880 --> 00:47:52,350 No. 931 00:47:52,350 --> 00:47:54,130 I think I'll say Yes. 932 00:47:54,130 --> 00:47:54,810 Yes. 933 00:47:54,810 --> 00:47:56,060 That changes the answer. 934 00:48:03,740 --> 00:48:07,430 Then if it eats garlic, it's not a vampire, right? 935 00:48:07,430 --> 00:48:09,700 That's one of four possible rules, because there are four 936 00:48:09,700 --> 00:48:12,440 leaf nodes. 937 00:48:12,440 --> 00:48:15,900 Now, almost done. 938 00:48:15,900 --> 00:48:17,170 We are done, except for one thing. 939 00:48:17,170 --> 00:48:20,160 We can actually take these four rules, and start thinking 940 00:48:20,160 --> 00:48:21,950 about how to simplify them. 941 00:48:21,950 --> 00:48:26,140 You can ask questions like, if I have a rule that tests both 942 00:48:26,140 --> 00:48:30,280 the shadow and the garlic, do I actually need both of those 943 00:48:30,280 --> 00:48:30,550 antecedents? 944 00:48:30,550 --> 00:48:32,950 And the answer is, in many cases, no. 945 00:48:32,950 --> 00:48:35,890 And in particular, in this case, no. 946 00:48:35,890 --> 00:48:40,700 Because if we look at our data set, what we discover is that 947 00:48:40,700 --> 00:48:45,180 in the event that we're talking about a shadow 948 00:48:45,180 --> 00:48:46,285 question mark-- 949 00:48:46,285 --> 00:48:49,240 oh, I guess I had a better choice the other way. 
950 00:48:49,240 --> 00:48:49,880 Oh, no. 951 00:48:49,880 --> 00:48:53,200 If you look at the garlic, all the garlics-- 952 00:48:53,200 --> 00:48:55,300 Yes, Yes, and Yes-- 953 00:48:55,300 --> 00:48:57,600 it turns out that the answer is no, independent of what the 954 00:48:57,600 --> 00:48:59,630 shadow condition is. 955 00:48:59,630 --> 00:49:01,680 So we can look at the rules, and in some cases, we'll 956 00:49:01,680 --> 00:49:04,050 discover that our tree is a little bit more complicated 957 00:49:04,050 --> 00:49:04,730 than it needs to be. 958 00:49:04,730 --> 00:49:06,800 We can actually get rid of some of the clauses. 959 00:49:06,800 --> 00:49:10,340 So in the end, we can develop a very simple mechanism based 960 00:49:10,340 --> 00:49:12,760 on good old-fashioned rule-based behavior, like you 961 00:49:12,760 --> 00:49:15,820 saw almost in the beginning of the subject, 962 00:49:15,820 --> 00:49:16,940 that does the job. 963 00:49:16,940 --> 00:49:22,710 And now, without any royalty, you're all free to put this in 964 00:49:22,710 --> 00:49:25,610 your PDA and use it to protect yourself in the days to come, 965 00:49:25,610 --> 00:49:27,430 especially since Halloween's just around the corner.
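[EDITOR'S SKETCH] The two steps described above, reading one rule off each path to a leaf and then dropping antecedents the data shows to be unnecessary, might look like the following. The tree shape is assumed from the four-leaf description in the lecture, the sample table is made up for illustration (it is not the table on the board), and all function names are hypothetical.

```python
# Assumed tree: an inner node is (feature, {value: subtree}); a leaf is a label.
TREE = ("shadow", {
    "yes": "not vampire",
    "no": "vampire",
    "?": ("garlic", {"yes": "not vampire", "no": "vampire"}),
})

# Made-up samples for illustration only.
SAMPLES = [
    ({"shadow": "?", "garlic": "yes"}, "not vampire"),
    ({"shadow": "yes", "garlic": "yes"}, "not vampire"),
    ({"shadow": "?", "garlic": "no"}, "vampire"),
    ({"shadow": "no", "garlic": "no"}, "vampire"),
    ({"shadow": "yes", "garlic": "no"}, "not vampire"),
    ({"shadow": "?", "garlic": "yes"}, "not vampire"),
]

def tree_to_rules(node, path=()):
    """One rule per leaf: the tests along the path become the antecedents."""
    if isinstance(node, str):
        return [(list(path), node)]
    feature, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(tree_to_rules(child, path + ((feature, value),)))
    return rules

def consistent(antecedents, label, samples):
    """True if at least one sample matches and every match has this label."""
    matches = [lab for feats, lab in samples
               if all(feats[f] == v for f, v in antecedents)]
    return bool(matches) and all(lab == label for lab in matches)

def simplify(rule, samples):
    """Drop each antecedent whose removal leaves the rule right on the data."""
    antecedents, label = rule
    kept = list(antecedents)
    for test in antecedents:
        trial = [t for t in kept if t != test]
        if trial and consistent(trial, label, samples):
            kept = trial
    return kept, label

rules = tree_to_rules(TREE)
simplified = [simplify(r, SAMPLES) for r in rules]
for antecedents, label in simplified:
    tests = " and ".join(f"{f} = {v}" for f, v in antecedents)
    print(f"if {tests} then {label}")
```

On this made-up table, the rule that tested both shadow and garlic before concluding "not vampire" loses its shadow test, because every garlic-eating sample is a non-vampire regardless of shadow. Simplified rules can overlap, so a practical rule set would also need an ordering or a default, which this sketch does not address.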