1 00:00:00,080 --> 00:00:01,800 The following content is provided 2 00:00:01,800 --> 00:00:04,030 under a Creative Commons license. 3 00:00:04,030 --> 00:00:06,880 Your support will help MIT OpenCourseWare continue 4 00:00:06,880 --> 00:00:10,740 to offer high quality educational resources for free. 5 00:00:10,740 --> 00:00:13,360 To make a donation, or view additional materials 6 00:00:13,360 --> 00:00:17,256 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,256 --> 00:00:17,881 at ocw.mit.edu. 8 00:00:21,184 --> 00:00:23,600 PROFESSOR: So you guys know the quiz is cumulative, right? 9 00:00:23,600 --> 00:00:25,590 Everything all the way back from lecture one, 10 00:00:25,590 --> 00:00:28,620 so I would look at all the lectures and all the P sets, 11 00:00:28,620 --> 00:00:31,390 and look at all the stuff that we taught you, 12 00:00:31,390 --> 00:00:33,590 so data structures, algorithms, everything. 13 00:00:33,590 --> 00:00:37,420 And at least be able to know, for every one of them, 14 00:00:37,420 --> 00:00:40,910 what's the name, what it does, and what's the running time. 15 00:00:40,910 --> 00:00:43,360 Proofs and how it does it might be harder, 16 00:00:43,360 --> 00:00:46,410 but at least be able to call it as a black box 17 00:00:46,410 --> 00:00:49,800 and argue about the running times. 18 00:00:49,800 --> 00:00:54,680 So I have a dp problem, and I have a non-dp problem. 19 00:00:54,680 --> 00:00:58,320 Which problem would you like me to start with? 20 00:00:58,320 --> 00:00:58,820 OK. 21 00:01:03,910 --> 00:01:06,630 Do you guys know the saying, if a woodchucker would chuck wood, 22 00:01:06,630 --> 00:01:09,620 how much wood would a woodchucker chuck? 23 00:01:09,620 --> 00:01:12,640 Today we're going to chuck wood. 24 00:01:12,640 --> 00:01:19,200 So you have a piece of wood that is l meters long, 25 00:01:19,200 --> 00:01:20,390 and it has n markings. 
26 00:01:27,360 --> 00:01:32,560 So say the first mark is at 3 meters, 27 00:01:32,560 --> 00:01:37,620 the second mark is at 5 meters, so on, so forth. 28 00:01:37,620 --> 00:01:44,400 And m3, and m4, all the way up to mn. 29 00:01:44,400 --> 00:01:50,870 So we want to cut this piece of wood at all the markings. 30 00:01:50,870 --> 00:01:55,150 The thing is the woodchucker doesn't work for free. 31 00:01:55,150 --> 00:01:59,140 If you give it a piece of wood of length l, 32 00:01:59,140 --> 00:02:01,710 and you ask it to cut it at some marking, 33 00:02:01,710 --> 00:02:07,930 you're going to get two pieces of wood, length l1 and l2. 34 00:02:07,930 --> 00:02:15,730 The price for this is l1 times l2. 35 00:02:15,730 --> 00:02:18,200 So we like woodchuckers, but woodchuckers would also 36 00:02:18,200 --> 00:02:19,020 like our wallets. 37 00:02:19,020 --> 00:02:22,590 So we want to cut this up by paying 38 00:02:22,590 --> 00:02:23,820 the minimum amount of money. 39 00:02:27,820 --> 00:02:30,270 Rings a bell? 40 00:02:30,270 --> 00:02:32,540 So I'll let you guys think for a minute, 41 00:02:32,540 --> 00:02:35,150 then I'll give you the running time, then we'll start talking. 42 00:02:38,351 --> 00:02:40,350 So we usually give you running times on quizzes. 43 00:02:40,350 --> 00:02:42,935 The running time is why you should know all the problems 44 00:02:42,935 --> 00:02:45,560 and their matching running times, because the moment we give you 45 00:02:45,560 --> 00:02:48,450 a running time you can automatically eliminate all 46 00:02:48,450 --> 00:02:51,450 the things that don't match, and just focus on a few things. 47 00:02:57,870 --> 00:03:00,930 So you're going to have to cut it at all the markings, 48 00:03:00,930 --> 00:03:05,120 eventually, but the order in which you cut is important. 
49 00:03:05,120 --> 00:03:08,550 So if I cut here first, then I'm going to pay 3 times l 50 00:03:08,550 --> 00:03:11,310 minus 3, whereas if I cut in the middle first, 51 00:03:11,310 --> 00:03:19,120 I'm going to pay whatever this is, m3 times l minus m3. 52 00:03:23,780 --> 00:03:27,400 So we're trying to decide the order. 53 00:03:27,400 --> 00:03:29,220 Does this look like any familiar problem? 54 00:03:33,300 --> 00:03:37,260 AUDIENCE: [INAUDIBLE] using dp, right? 55 00:03:37,260 --> 00:03:38,960 PROFESSOR: dp, that is good. 56 00:03:38,960 --> 00:03:41,490 I did say that we're going to start with a dp problem, 57 00:03:41,490 --> 00:03:44,371 so this is dp. 58 00:03:44,371 --> 00:03:45,120 It's a good start. 59 00:03:54,468 --> 00:03:55,452 AUDIENCE: [INAUDIBLE] 60 00:03:55,452 --> 00:03:58,404 PROFESSOR: What? 61 00:03:58,404 --> 00:03:59,880 Not exactly. 62 00:03:59,880 --> 00:04:01,860 AUDIENCE: Yeah. 63 00:04:01,860 --> 00:04:05,880 PROFESSOR: So, it is not like any problems 64 00:04:05,880 --> 00:04:07,275 on the recitations. 65 00:04:11,420 --> 00:04:13,900 So far recitations did prefixes and suffixes. 66 00:04:13,900 --> 00:04:17,209 We're going to solve this using a running time of n 67 00:04:17,209 --> 00:04:20,700 cubed, which is like the parenthesis problem. 68 00:04:23,740 --> 00:04:26,430 It should be what you said, but I don't know how to spell that, 69 00:04:26,430 --> 00:04:28,013 so we're going to go for this instead. 70 00:04:31,720 --> 00:04:33,604 So running time n cubed-- the moment I 71 00:04:33,604 --> 00:04:35,270 said this you guys should know that this 72 00:04:35,270 --> 00:04:37,940 is the n cubed problem that we have in lecture notes. 73 00:04:40,712 --> 00:04:42,670 So make sure to have those on the cheat sheets, 74 00:04:42,670 --> 00:04:46,010 and try to understand them, right? 75 00:04:46,010 --> 00:04:50,040 OK, so given that I've said this, 76 00:04:50,040 --> 00:04:54,020 you should know the solution now. 
77 00:04:54,020 --> 00:04:56,330 To make sure everyone is with me, 78 00:04:56,330 --> 00:04:58,280 we're going to go through the solution, whole. 79 00:04:58,280 --> 00:04:59,280 So what is a subproblem? 80 00:05:03,903 --> 00:05:06,300 AUDIENCE:Smaller piece of wood. 81 00:05:06,300 --> 00:05:07,738 PROFESSOR: OK. 82 00:05:07,738 --> 00:05:09,960 AUDIENCE: Like how to cut it up. 83 00:05:09,960 --> 00:05:10,680 PROFESSOR: OK. 84 00:05:10,680 --> 00:05:12,700 So this is how you think of it informally. 85 00:05:12,700 --> 00:05:14,930 When you write it up, I want to see this. 86 00:05:14,930 --> 00:05:21,650 I want to see dp of something means something. 87 00:05:21,650 --> 00:05:23,240 So how you fill out your dp table. 88 00:05:25,920 --> 00:05:30,040 It's really useful to write this up on your exam 89 00:05:30,040 --> 00:05:33,140 before, because one, this will help you write the recursion 90 00:05:33,140 --> 00:05:36,240 correctly, and two, if the grader sees this 91 00:05:36,240 --> 00:05:38,410 they might skim over the recursion completely. 92 00:05:38,410 --> 00:05:40,040 And then you might have bugs there. 93 00:05:40,040 --> 00:05:41,150 We might not see them. 94 00:05:41,150 --> 00:05:43,250 Good for you. 95 00:05:43,250 --> 00:05:45,630 So this says how you're going to fill out the table. 96 00:05:45,630 --> 00:05:48,810 Right? dp of something equals something. 97 00:05:48,810 --> 00:05:50,150 What's in a dp table? 98 00:05:50,150 --> 00:05:51,460 Numbers. 99 00:05:51,460 --> 00:05:53,140 It's never how to do something. 100 00:05:53,140 --> 00:05:55,570 It's always the numbers, so it's always 101 00:05:55,570 --> 00:05:58,680 the maximum profit, or the minimum cost, 102 00:05:58,680 --> 00:06:01,290 or the shortest distance, or the longest something. 103 00:06:01,290 --> 00:06:03,050 So it's always a number. 104 00:06:03,050 --> 00:06:04,760 So what we do here? 
105 00:06:04,760 --> 00:06:10,037 AUDIENCE: Start and dp, start location to the end location is 106 00:06:10,037 --> 00:06:12,620 PROFESSOR: OK, so we're going to get the mean distance, right? 107 00:06:12,620 --> 00:06:16,390 We usually do i j k and whatever else it takes. 108 00:06:16,390 --> 00:06:18,620 So start to end is? 109 00:06:18,620 --> 00:06:22,030 AUDIENCE: The minimum cost of cutting that up. 110 00:06:22,030 --> 00:06:35,020 PROFESSOR: Minimum cost of cutting up the wood board 111 00:06:35,020 --> 00:06:40,460 from marking i, all the way to marking j. 112 00:06:43,250 --> 00:06:46,560 There's a tiny problem here, that the initial-- there's 113 00:06:46,560 --> 00:06:52,990 no subproblem for this big piece of wood, right? 114 00:06:52,990 --> 00:06:56,380 If I can only consider the board from i to j, 115 00:06:56,380 --> 00:07:00,980 so if I can only consider the board from marking 1 116 00:07:00,980 --> 00:07:03,580 to marking n, then I get to this. 117 00:07:03,580 --> 00:07:05,720 So this part and this part get left out. 118 00:07:11,110 --> 00:07:13,070 AUDIENCE: [INAUDIBLE] 119 00:07:13,070 --> 00:07:14,420 PROFESSOR: Exactly. 120 00:07:14,420 --> 00:07:15,700 We add fake markings. 121 00:07:15,700 --> 00:07:21,170 Then m0 is 0, and mn plus 1 equals l. 122 00:07:21,170 --> 00:07:21,670 Very good. 123 00:07:21,670 --> 00:07:23,770 AUDIENCE: [INAUDIBLE] equally spaced? 124 00:07:23,770 --> 00:07:24,450 PROFESSOR: No. 125 00:07:24,450 --> 00:07:27,100 So these are numbers. 126 00:07:27,100 --> 00:07:30,250 If they were evenly spaced, I think there's an algorithm. 127 00:07:30,250 --> 00:07:33,180 You might come up with some math and say, you always 128 00:07:33,180 --> 00:07:34,030 cut it up like this. 129 00:07:40,380 --> 00:07:42,970 So while we solve this, you guys have candy, right? 130 00:07:42,970 --> 00:07:46,390 You should eat the candy and be energetic and everything. 
131 00:07:49,460 --> 00:07:51,980 So min cost of cutting up the board 132 00:07:51,980 --> 00:07:53,360 from marking i to marking j. 133 00:07:53,360 --> 00:07:54,280 I like this. 134 00:07:54,280 --> 00:07:55,966 Have this on your exam if possible, 135 00:07:55,966 --> 00:07:57,590 because this will make our life easier, 136 00:07:57,590 --> 00:07:59,214 and it's going to make your life easier 137 00:07:59,214 --> 00:08:00,720 when you get to the next step, which 138 00:08:00,720 --> 00:08:03,590 is how do we compute dp of i j? 139 00:08:06,490 --> 00:08:11,930 So suppose I'm looking at the subboard from m1 to m4 140 00:08:11,930 --> 00:08:16,650 so I'm looking at only this. 141 00:08:16,650 --> 00:08:20,820 How do I compute the best way to cut the board from m1 to m4? 142 00:08:33,147 --> 00:08:33,980 What are my options? 143 00:08:36,698 --> 00:08:38,510 AUDIENCE: The locations you can cut it. 144 00:08:38,510 --> 00:08:39,630 PROFESSOR: Exactly. 145 00:08:39,630 --> 00:08:42,110 So in order to cut this up, I can either 146 00:08:42,110 --> 00:08:45,230 make a first cut at m2. 147 00:08:45,230 --> 00:08:48,220 So say I make my first cut here, and then I 148 00:08:48,220 --> 00:08:52,880 recursively cut this, and cut this. 149 00:08:52,880 --> 00:08:59,750 Or the other alternative is take the same guy-- m1, 150 00:08:59,750 --> 00:09:07,110 m2, m3, m4-- cut it at m3, and then recursively cut this, 151 00:09:07,110 --> 00:09:10,730 and recursively cut this. 152 00:09:10,730 --> 00:09:15,790 So I'm iterating over all the markings inside the board. 153 00:09:15,790 --> 00:09:17,505 Now suppose I'm cutting it-- yes? 154 00:09:17,505 --> 00:09:21,176 AUDIENCE: [INAUDIBLE] cutting both, or actually, never mind. 155 00:09:21,176 --> 00:09:22,550 PROFESSOR: Yeah, when I recursed, 156 00:09:22,550 --> 00:09:24,470 that takes care of it. 157 00:09:24,470 --> 00:09:27,900 So suppose I'm looking at m1 through m4, 158 00:09:27,900 --> 00:09:35,470 and I'm cutting it at m2. 
159 00:09:35,470 --> 00:09:38,220 What's the total cost? 160 00:09:38,220 --> 00:09:41,670 So what's the best way to cut, given that then I 161 00:09:41,670 --> 00:09:43,562 know I'm going to cut there? 162 00:09:43,562 --> 00:09:46,470 AUDIENCE: The sum of the dp's. 163 00:09:46,470 --> 00:09:56,580 PROFESSOR: OK, so it's the best way to cut m1 through m2, 164 00:09:56,580 --> 00:10:07,050 plus best way to cut m2 through m4, 165 00:10:07,050 --> 00:10:10,520 plus the price I'm paying for this cut, right? 166 00:10:10,520 --> 00:10:11,910 Not just the sum of the dp's. 167 00:10:11,910 --> 00:10:12,890 One more term. 168 00:10:12,890 --> 00:10:16,105 What's this term? 169 00:10:16,105 --> 00:10:17,560 AUDIENCE: 4 minus 1? 170 00:10:17,560 --> 00:10:20,634 Or the location of 4 minus the location of 1. 171 00:10:23,390 --> 00:10:27,300 PROFESSOR: So, not quite, almost. 172 00:10:27,300 --> 00:10:31,340 So if I'm cutting a board into two pieces, 173 00:10:31,340 --> 00:10:33,840 the cost is the product of the length of the two pieces. 174 00:10:38,710 --> 00:10:46,050 m2 minus m1, times yes. 175 00:10:46,050 --> 00:10:48,580 OK, why did I bother doing this? 176 00:10:48,580 --> 00:10:51,410 Some people think better with concrete numbers. 177 00:10:51,410 --> 00:10:56,050 If that's the case, then give yourself an example. 178 00:10:56,050 --> 00:10:59,310 Write some numbers on your sheet of paper, 179 00:10:59,310 --> 00:11:01,680 then see what letters match to what 180 00:11:01,680 --> 00:11:03,940 numbers, and copy it up using letters. 181 00:11:03,940 --> 00:11:07,220 And there you go, you've solved the problem. 182 00:11:07,220 --> 00:11:09,680 So where are i and j here? 183 00:11:12,630 --> 00:11:14,200 AUDIENCE: i would be 1. 184 00:11:14,200 --> 00:11:16,438 PROFESSOR: OK, so this is i. 185 00:11:16,438 --> 00:11:19,130 AUDIENCE: That's j. 186 00:11:19,130 --> 00:11:24,400 PROFESSOR: Cool, so let's try to write it up, now. 
187 00:11:24,400 --> 00:11:31,040 So in order to cut the board from i to j, what am I doing? 188 00:11:31,040 --> 00:11:33,010 So what am I computing? 189 00:11:33,010 --> 00:11:35,670 Usually the first word in your subproblem definition 190 00:11:35,670 --> 00:11:38,080 is the function that you're going to use. 191 00:11:38,080 --> 00:11:43,008 So it's minimum, and I'm going to iterate over something. 192 00:11:43,008 --> 00:11:50,770 AUDIENCE: dp of i to-- it has to be all of j. 193 00:11:50,770 --> 00:11:52,260 dp of i, j, and you're looking to-- 194 00:11:52,260 --> 00:11:54,380 PROFESSOR: So I'm computing dp of i j. 195 00:11:54,380 --> 00:11:57,380 AUDIENCE: I know, of j minus [INAUDIBLE]. 196 00:11:57,380 --> 00:12:00,620 AUDIENCE: j minus i, then k j minus [INAUDIBLE]. 197 00:12:00,620 --> 00:12:01,980 PROFESSOR: There's a k, right? 198 00:12:01,980 --> 00:12:05,980 I need a new variable for where I'm going to cut up, right? 199 00:12:05,980 --> 00:12:08,680 So fortunately, we have a lot of letters in the alphabet, 200 00:12:08,680 --> 00:12:11,030 i, j, k, so on and so forth, l, m. 201 00:12:13,622 --> 00:12:14,930 AUDIENCE: i plus k. 202 00:12:17,209 --> 00:12:19,250 PROFESSOR: So let's say that k is the place where 203 00:12:19,250 --> 00:12:21,410 we cut, to make our life easy. 204 00:12:21,410 --> 00:12:26,270 So I'm going to have dp of 205 00:12:26,270 --> 00:12:28,269 AUDIENCE: Well i is the starting point. 206 00:12:28,269 --> 00:12:28,810 PROFESSOR: OK 207 00:12:28,810 --> 00:12:34,280 AUDIENCE: And then, the endpoint is i plus k, right? 208 00:12:34,280 --> 00:12:36,267 PROFESSOR: So what's k here? 209 00:12:36,267 --> 00:12:37,600 AUDIENCE: k is an actual number. 210 00:12:37,600 --> 00:12:40,350 It's not the offset, it's the actual number, 211 00:12:40,350 --> 00:12:41,840 so it should be i to k. 212 00:12:41,840 --> 00:12:43,484 It depends how you define k. 
213 00:12:43,484 --> 00:12:45,900 PROFESSOR: So I'm going to make my life easy, and define k 214 00:12:45,900 --> 00:12:47,735 as exactly the marking at which I cut. 215 00:12:50,940 --> 00:12:51,680 k is this 2 here. 216 00:12:55,800 --> 00:12:58,380 And this is easier, trust me. 217 00:12:58,380 --> 00:13:01,042 OK, plus? 218 00:13:01,042 --> 00:13:01,625 AUDIENCE: k j? 219 00:13:05,030 --> 00:13:08,866 PROFESSOR: OK, and? 220 00:13:08,866 --> 00:13:16,156 AUDIENCE: Cost of m-- 221 00:13:16,156 --> 00:13:23,910 AUDIENCE: j minus i, m of k minus m of i times m of j 222 00:13:23,910 --> 00:13:25,470 minus m of k. 223 00:13:25,470 --> 00:13:26,394 PROFESSOR: Cool. 224 00:13:26,394 --> 00:13:28,060 Yeah, other way around-- doesn't matter. 225 00:13:30,750 --> 00:13:35,430 So now where does k go? 226 00:13:35,430 --> 00:13:38,374 We have to come up with numbers for the loop, right? 227 00:13:40,912 --> 00:13:42,160 AUDIENCE: Between i and j. 228 00:13:42,160 --> 00:13:44,522 AUDIENCE: j minus i. 229 00:13:44,522 --> 00:13:48,490 AUDIENCE: Just for k in i to j. 230 00:13:48,490 --> 00:13:52,624 PROFESSOR: So if I have the board from 1 to 4, 231 00:13:52,624 --> 00:13:54,400 do I cut at 1? 232 00:13:54,400 --> 00:13:56,854 I can, but that's kind of weird. 233 00:13:56,854 --> 00:13:59,580 Because I'm recursing on the same subproblem. 234 00:13:59,580 --> 00:14:01,880 By the way, if you recurse to the same subproblem, 235 00:14:01,880 --> 00:14:05,150 what are you going to get as your running time? 236 00:14:05,150 --> 00:14:07,450 Infinite. 237 00:14:07,450 --> 00:14:09,430 So let's not do that. 238 00:14:09,430 --> 00:14:11,470 So we're going to go from? 239 00:14:11,470 --> 00:14:14,014 AUDIENCE: [INAUDIBLE] 240 00:14:14,014 --> 00:14:15,680 PROFESSOR: So going from i would be bad. 241 00:14:15,680 --> 00:14:16,991 So i plus 1. 242 00:14:16,991 --> 00:14:17,490 2? 243 00:14:17,490 --> 00:14:19,430 AUDIENCE: j minus 1. 
244 00:14:19,430 --> 00:14:20,750 PROFESSOR: Very good. 245 00:14:20,750 --> 00:14:26,480 AUDIENCE: Would it be m over i plus 1, because [INAUDIBLE]. 246 00:14:26,480 --> 00:14:28,650 PROFESSOR: So k is which marking I'm cutting at. 247 00:14:28,650 --> 00:14:32,020 I never want to cut inside a marking. 248 00:14:32,020 --> 00:14:34,760 However, I don't even know these are integers. 249 00:14:34,760 --> 00:14:37,890 AUDIENCE: They wouldn't be called [INAUDIBLE]. 250 00:14:37,890 --> 00:14:39,975 PROFESSOR: So k is which marking, i, j, 251 00:14:39,975 --> 00:14:41,975 and k are which marking I'm cutting at. 252 00:14:44,580 --> 00:14:46,330 These are the only discrete things I have. 253 00:14:46,330 --> 00:14:51,132 This board is all filled with real numbers. 254 00:14:51,132 --> 00:14:52,590 So if I want to cut somewhere here, 255 00:14:52,590 --> 00:14:54,480 that's a real number-- I don't like that. 256 00:14:54,480 --> 00:14:55,900 I want to have integers. 257 00:14:55,900 --> 00:15:00,040 So my markings help me get integers. 258 00:15:00,040 --> 00:15:01,530 I only want to cut at the marking, 259 00:15:01,530 --> 00:15:05,370 so I always look at my problem in terms of which marking I'm 260 00:15:05,370 --> 00:15:06,080 cutting it. 261 00:15:09,110 --> 00:15:11,650 So this always iterates over markings. 262 00:15:11,650 --> 00:15:16,790 So this looks very much like the parentheses problem, right? 263 00:15:16,790 --> 00:15:20,740 Same subproblems, roughly the same recursion. 264 00:15:20,740 --> 00:15:22,500 Turns out that these problems, where 265 00:15:22,500 --> 00:15:24,550 you're not considering suffixes or prefixes, 266 00:15:24,550 --> 00:15:27,100 but rather you're considering substrings, 267 00:15:27,100 --> 00:15:30,600 are reasonably hard to come by, and reasonably hard to solve. 
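[Editor's sketch] The recurrence built up above, dp(i, j) = min over k from i+1 to j-1 of dp(i, k) + dp(k, j) + (m_k - m_i) * (m_j - m_k), with the fake markings m_0 = 0 and m_{n+1} = l, could be written roughly as follows. This is a minimal memoized sketch, not the course's reference solution; the function name `min_cut_cost` and the use of `functools.lru_cache` are assumptions made for illustration.

```python
from functools import lru_cache


def min_cut_cost(length, markings):
    """Minimum total cost of cutting a board at every marking.

    Cutting a piece into parts of length l1 and l2 costs l1 * l2.
    (Hypothetical helper name; not from the lecture.)
    """
    # Add the two fake markings: m_0 = 0 and m_{n+1} = length.
    m = [0] + sorted(markings) + [length]

    @lru_cache(maxsize=None)
    def dp(i, j):
        # dp(i, j): minimum cost of fully cutting the board
        # between marking i and marking j.
        if j - i < 2:
            return 0  # no marking strictly inside, nothing to cut
        # Try every marking k strictly between i and j as the first cut.
        return min(
            dp(i, k) + dp(k, j) + (m[k] - m[i]) * (m[j] - m[k])
            for k in range(i + 1, j)
        )

    return dp(0, len(m) - 1)
```

There are O(n^2) subproblems (pairs i, j) and each one tries O(n) values of k, which gives the n cubed running time mentioned earlier.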
268 00:15:30,600 --> 00:15:32,840 So if we give these to you, chances 269 00:15:32,840 --> 00:15:35,890 are they're going to be exactly like the parentheses problem, 270 00:15:35,890 --> 00:15:37,990 except for the cost function. 271 00:15:37,990 --> 00:15:42,050 This isn't what we had in the parentheses problem, right? 272 00:15:42,050 --> 00:15:44,220 So you should be prepared to solve problems 273 00:15:44,220 --> 00:15:46,070 that look exactly like the paren problem, 274 00:15:46,070 --> 00:15:49,780 but might have a different cost function. 275 00:15:49,780 --> 00:15:52,960 And this is how we solve it. 276 00:15:52,960 --> 00:15:53,460 OK. 277 00:15:53,460 --> 00:15:55,668 AUDIENCE: When you say that the complexity determines 278 00:15:55,668 --> 00:15:58,245 which type of dp example we use, does 279 00:15:58,245 --> 00:16:03,260 that mean that a problem can be solved 280 00:16:03,260 --> 00:16:08,514 using any of dp examples? 281 00:16:08,514 --> 00:16:13,760 It's just that the only thing that changes is the complexity. 282 00:16:13,760 --> 00:16:15,310 PROFESSOR: I don't think you can map 283 00:16:15,310 --> 00:16:16,840 every approach onto every problem. 284 00:16:16,840 --> 00:16:20,450 For example, if you tried to map prefixes onto this, 285 00:16:20,450 --> 00:16:23,050 you'd come up with a solution that 286 00:16:23,050 --> 00:16:25,370 doesn't look at all the possible choices, 287 00:16:25,370 --> 00:16:27,540 so your answer would be sub-optimal. 288 00:16:27,540 --> 00:16:30,330 So you'd come up with a fast, but incorrect algorithm. 289 00:16:30,330 --> 00:16:35,750 However, if you take the problem of find the longest 290 00:16:35,750 --> 00:16:37,900 increasing sub-sequence, you can definitely 291 00:16:37,900 --> 00:16:39,230 apply this technique to it. 292 00:16:39,230 --> 00:16:41,290 It's more general than suffixes or prefixes. 293 00:16:41,290 --> 00:16:44,200 So it's going to work, but it's going to be slower. 
294 00:16:44,200 --> 00:16:46,610 So in theory, what you should do is, 295 00:16:46,610 --> 00:16:48,510 you have all these techniques. 296 00:16:48,510 --> 00:16:51,450 Given a problem, you try all the techniques. 297 00:16:51,450 --> 00:16:53,720 You see which ones apply, and out of those, you 298 00:16:53,720 --> 00:16:56,460 see which one gives you the best running time. 299 00:16:56,460 --> 00:16:59,620 In practice, if we give you the running time, 300 00:16:59,620 --> 00:17:03,360 you match it to the techniques that match the running time. 301 00:17:03,360 --> 00:17:05,700 You start backwards from the stuff that you know. 302 00:17:10,630 --> 00:17:12,869 OK. 303 00:17:12,869 --> 00:17:15,730 Does this problem make sense? 304 00:17:15,730 --> 00:17:18,180 Sweet. 305 00:17:18,180 --> 00:17:19,680 Now let's do a hard problem. 306 00:17:19,680 --> 00:17:23,819 Do people remember hashing? 307 00:17:23,819 --> 00:17:25,589 You have one minute to remember hashing 308 00:17:25,589 --> 00:17:26,640 while I erase the board. 309 00:17:26,640 --> 00:17:28,530 [LAUGHING] 310 00:17:28,530 --> 00:17:31,940 So suppose we want to implement the set. 311 00:17:31,940 --> 00:17:36,590 The way we're going to implement the set is, we have n elements. 312 00:17:39,810 --> 00:17:41,680 We're going to put them into the set, 313 00:17:41,680 --> 00:17:50,020 so for i goes from 1 through n, we're 314 00:17:50,020 --> 00:17:54,700 going to insert element i, so first we're 315 00:17:54,700 --> 00:17:57,730 going to insert all the elements into the set. 316 00:17:57,730 --> 00:18:02,140 And then after that, given a random number, we want to see 317 00:18:02,140 --> 00:18:03,810 is it in the set, or not. 318 00:18:03,810 --> 00:18:09,190 So for some other number-- I used n before, so let's 319 00:18:09,190 --> 00:18:15,450 use-- for some other number f, we want to see is f in the set, 320 00:18:15,450 --> 00:18:19,068 or is f not in the set? 
321 00:18:22,140 --> 00:18:24,400 What data structure would you use normally for this? 322 00:18:27,490 --> 00:18:28,560 A hash table, right? 323 00:18:28,560 --> 00:18:31,080 You stick everything into a hash table, 324 00:18:31,080 --> 00:18:32,820 then you try to find the elements. 325 00:18:32,820 --> 00:18:34,880 If you find them, then you say yes. 326 00:18:34,880 --> 00:18:36,980 If not, then you say no. 327 00:18:36,980 --> 00:18:38,710 Well, it turns out that this would 328 00:18:38,710 --> 00:18:41,160 take more memory than what we have. 329 00:18:41,160 --> 00:18:44,720 So instead, we're going to do this. 330 00:18:44,720 --> 00:18:47,850 We're going to have a hash table of m bits. 331 00:18:53,660 --> 00:18:54,840 So these are m bits. 332 00:18:54,840 --> 00:18:58,670 And say we have a hash function that 333 00:18:58,670 --> 00:19:02,410 satisfies simple uniform hashing, so given any element, 334 00:19:02,410 --> 00:19:07,784 the value is anywhere from 0 to m minus 1, 335 00:19:07,784 --> 00:19:08,950 and they're all independent. 336 00:19:12,490 --> 00:19:14,600 So the way we're going to insert an element 337 00:19:14,600 --> 00:19:23,040 is-- this table is T-- we're going to say that T of h of ai 338 00:19:23,040 --> 00:19:24,690 equals 1. 339 00:19:24,690 --> 00:19:26,600 So this is a table of bits. 340 00:19:26,600 --> 00:19:28,570 For every element we hash the element, 341 00:19:28,570 --> 00:19:32,610 and we set the corresponding bit to 1. 342 00:19:32,610 --> 00:19:38,230 So we're going to have some 1s, and some zeros in the table. 343 00:19:38,230 --> 00:19:43,680 Say if this is ai, it hashes somewhere here. 344 00:19:43,680 --> 00:19:46,530 OK so the question is, we inserted 345 00:19:46,530 --> 00:19:49,560 n elements into a table of size m. 346 00:19:49,560 --> 00:19:55,110 Given a new element, f, where f stands for false positive-- f 347 00:19:55,110 --> 00:19:58,395 is not one of the elements that we inserted. 
348 00:20:02,040 --> 00:20:04,490 I want to know what's the probability that the set will 349 00:20:04,490 --> 00:20:08,700 say that the element is in the set, so basically, 350 00:20:08,700 --> 00:20:11,674 the probability of a false positive. 351 00:20:11,674 --> 00:20:14,660 AUDIENCE: So what are we doing about [INAUDIBLE]? 352 00:20:14,660 --> 00:20:15,595 PROFESSOR: Nothing. 353 00:20:15,595 --> 00:20:18,612 AUDIENCE: Is it chaining, is it open addressing? 354 00:20:18,612 --> 00:20:21,604 Does it even matter? 355 00:20:21,604 --> 00:20:23,520 PROFESSOR: So we're not inserting the elements 356 00:20:23,520 --> 00:20:25,200 into the table. 357 00:20:25,200 --> 00:20:26,890 This table only has bits. 358 00:20:26,890 --> 00:20:31,800 The elements are lost completely after we insert them. 359 00:20:31,800 --> 00:20:34,200 So the tradeoff is, it uses a lot less memory. 360 00:20:34,200 --> 00:20:36,050 Instead of having to store entire elements, 361 00:20:36,050 --> 00:20:38,050 you just store bits. 362 00:20:38,050 --> 00:20:40,927 On the downside you're going to have false positives. 363 00:20:40,927 --> 00:20:42,510 Because if I have a different element, 364 00:20:42,510 --> 00:20:47,090 say f, if it hashes to the same location, 365 00:20:47,090 --> 00:20:51,360 then the set is going to say, yeah, it's in the set. 366 00:20:51,360 --> 00:20:53,041 So you get false positives. 367 00:20:53,041 --> 00:20:54,290 Would you get false negatives? 368 00:20:59,230 --> 00:20:59,730 No, right? 369 00:21:02,530 --> 00:21:05,790 Because you start out with a table of 0's, 370 00:21:05,790 --> 00:21:07,680 and you only set the table to ones 371 00:21:07,680 --> 00:21:11,570 for the numbers that match to hashes 372 00:21:11,570 --> 00:21:13,340 of elements that are in the set. 373 00:21:13,340 --> 00:21:14,954 Did you have a question? 374 00:21:14,954 --> 00:21:15,453 OK. 
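[Editor's sketch] The bit-table set described above (insert sets T[h(a_i)] = 1, a lookup checks T[h(f)]) can be illustrated with a short class. The class name `BitTableSet`, the method names, and the SHA-256 based hash are all assumptions for illustration; the lecture only assumes some hash function that behaves uniformly over 0 to m minus 1.

```python
import hashlib


class BitTableSet:
    """Approximate set backed by a table of m bits (illustrative sketch)."""

    def __init__(self, m):
        self.m = m
        self.table = [0] * m  # starts as all 0's

    def _h(self, element):
        # Stand-in for the uniform hash function h: maps any element
        # to a slot in 0 .. m - 1. (Assumed choice, not from the lecture.)
        digest = hashlib.sha256(repr(element).encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.m

    def insert(self, element):
        # The element itself is not stored, only one bit is set.
        self.table[self._h(element)] = 1

    def might_contain(self, element):
        # True may be a false positive; False is always correct,
        # because bits are only ever set by inserted elements.
        return self.table[self._h(element)] == 1
```

Because several elements can hash to the same slot, the number of 1's in `table` after n inserts can be smaller than n.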
375 00:21:19,910 --> 00:21:21,680 OK, do we understand the problem, 376 00:21:21,680 --> 00:21:23,584 before we attempt to solve it? 377 00:21:23,584 --> 00:21:27,250 AUDIENCE: Is it probably 1/m? 378 00:21:27,250 --> 00:21:29,880 PROFESSOR: You'd wish, but no. 379 00:21:32,860 --> 00:21:34,245 AUDIENCE: It's less than n/m. 380 00:21:37,100 --> 00:21:38,560 PROFESSOR: OK, I like that. 381 00:21:38,560 --> 00:21:41,410 So what are you thinking? 382 00:21:41,410 --> 00:21:43,410 AUDIENCE: If there are no collisions previously, 383 00:21:43,410 --> 00:21:48,730 then it would equal to n/m, but there are collisions, probably 384 00:21:48,730 --> 00:21:50,566 collisions. 385 00:21:50,566 --> 00:21:52,940 PROFESSOR: OK, I'm going to open up a window in your head 386 00:21:52,940 --> 00:21:56,770 and tell everyone else the small steps you took to get here. 387 00:21:56,770 --> 00:21:58,940 So we have this new number f. 388 00:21:58,940 --> 00:22:01,200 How are we going to check if it's in the set or not? 389 00:22:01,200 --> 00:22:04,300 We're going to compute h of f, and we're 390 00:22:04,300 --> 00:22:09,580 going to check if t of h of f is 0 or 1. 391 00:22:13,370 --> 00:22:15,870 f is different from all the other elements. 392 00:22:15,870 --> 00:22:19,920 So its hash value is independent from all the other hash values 393 00:22:19,920 --> 00:22:20,640 we had before. 394 00:22:24,300 --> 00:22:26,480 We don't really care about this anymore, 395 00:22:26,480 --> 00:22:29,900 after we have the independence assumption. 396 00:22:29,900 --> 00:22:34,400 So h of f is just some random position in the table. 397 00:22:34,400 --> 00:22:38,080 So the question is, given some random position in the table, 398 00:22:38,080 --> 00:22:41,090 will that be a 0 or a 1? 399 00:22:41,090 --> 00:22:42,570 How do you know? 
400 00:22:42,570 --> 00:22:44,910 If I knew how many 1's I have in the table-- 401 00:22:44,910 --> 00:22:49,530 if I have k 1's in the table, and automatically this means m 402 00:22:49,530 --> 00:22:56,190 minus k 0's-- then what's the probability that h of f will 403 00:22:56,190 --> 00:22:57,060 point to a 1? 404 00:23:06,332 --> 00:23:08,265 AUDIENCE: k/m. 405 00:23:08,265 --> 00:23:08,890 PROFESSOR: Yes. 406 00:23:08,890 --> 00:23:11,580 So the hash takes m possible values. 407 00:23:11,580 --> 00:23:13,090 k of them are 1's. 408 00:23:13,090 --> 00:23:18,750 So the probability that the hash is going to guess a 1 is k/m. 409 00:23:18,750 --> 00:23:23,820 So if we knew how many 1's we have, then this is the answer. 410 00:23:23,820 --> 00:23:26,520 We know that we're going to have at most n 1's-- that's what 411 00:23:26,520 --> 00:23:28,200 you're thinking, right? 412 00:23:28,200 --> 00:23:31,950 So k is definitely smaller than or equal to n, 413 00:23:31,950 --> 00:23:39,900 so the answer definitely has to be smaller than or equal to n/m. 414 00:23:39,900 --> 00:23:41,930 Now if you're in a rush, you might say, 415 00:23:41,930 --> 00:23:44,960 well, we inserted n elements, so we're definitely 416 00:23:44,960 --> 00:23:48,030 going to have n 1's here. 417 00:23:48,030 --> 00:23:49,320 That is not true. 418 00:23:49,320 --> 00:23:53,060 The hashes of all the elements are independent. 419 00:23:53,060 --> 00:23:55,440 So there is some probability that two elements will 420 00:23:55,440 --> 00:23:59,210 hash to the same value, and as the number of elements grows, 421 00:23:59,210 --> 00:24:00,785 that probability also grows. 422 00:24:03,970 --> 00:24:06,510 OK, so now by looking at this, we 423 00:24:06,510 --> 00:24:08,650 got rid of this part of the problem. 424 00:24:08,650 --> 00:24:10,372 We don't care that there's a new element. 425 00:24:10,372 --> 00:24:12,080 We don't care that it's a false positive. 
426 00:24:12,080 --> 00:24:14,270 All that we care about is how many 427 00:24:14,270 --> 00:24:18,080 1's do we have in the table after inserting n values. 428 00:24:23,010 --> 00:24:23,760 Well, what's that? 429 00:24:23,760 --> 00:24:41,810 That's m times the probability that a slot in the table is 1. 430 00:24:41,810 --> 00:24:46,820 Right, the probability that the slot in the table is 1 is k/m. 431 00:24:46,820 --> 00:24:49,390 So if we know this probability, and we multiply it by m, 432 00:24:49,390 --> 00:24:50,130 then we get k. 433 00:24:57,450 --> 00:25:00,240 People still with me? 434 00:25:00,240 --> 00:25:03,470 AUDIENCE: And what does that variable represent, h? 435 00:25:03,470 --> 00:25:05,050 PROFESSOR: This is k. 436 00:25:05,050 --> 00:25:07,050 Represents that my handwriting sucks, basically. 437 00:25:07,050 --> 00:25:10,530 AUDIENCE: I mean, why do we do m times the probability. 438 00:25:10,530 --> 00:25:15,208 That's the expected number of 1's in the table? 439 00:25:15,208 --> 00:25:15,874 PROFESSOR: Yeah. 440 00:25:19,710 --> 00:25:24,330 Yeah, this is E of k, I guess. 441 00:25:24,330 --> 00:25:30,194 So then our final answer is this thing divided by m. 442 00:25:34,940 --> 00:25:40,390 So the answer is the expected value of k, 443 00:25:40,390 --> 00:25:43,830 or you can just think of it as the average value of k, divided 444 00:25:43,830 --> 00:25:44,396 by m. 445 00:25:44,396 --> 00:25:53,300 So this is m times this probability, divided by m. 446 00:25:53,300 --> 00:25:56,450 So it is exactly this probability. 447 00:25:56,450 --> 00:25:59,920 So the thing that we want to focus on 448 00:25:59,920 --> 00:26:07,710 is, what's the probability that a random slot in the table 449 00:26:07,710 --> 00:26:08,210 is a 1? 450 00:26:16,900 --> 00:26:18,858 AUDIENCE: It's equal to 1 minus the probability 451 00:26:18,858 --> 00:26:21,290 that it was never fixed. 452 00:26:21,290 --> 00:26:26,090 PROFESSOR: Exactly, the first thing we do. 
453 00:26:26,090 --> 00:26:30,570 1 minus the probability that a slot is 0. 454 00:26:33,270 --> 00:26:35,560 This is easy, right, like it looks easy. 455 00:26:35,560 --> 00:26:38,210 But this makes a huge difference, 456 00:26:38,210 --> 00:26:42,660 because once we're here, well, a slot is zero 457 00:26:42,660 --> 00:26:46,246 if none of the insertions made it a one. 458 00:26:46,246 --> 00:26:47,870 And the insertions are all independent. 459 00:26:50,710 --> 00:26:54,030 So this is like, you're flipping a coin. 460 00:26:54,030 --> 00:26:56,380 What's the probability that after you flip it n times, 461 00:26:56,380 --> 00:26:57,470 you never get a head? 462 00:27:01,450 --> 00:27:04,384 So this is 1 minus 463 00:27:04,384 --> 00:27:07,950 AUDIENCE: 1 over m to the something. 464 00:27:07,950 --> 00:27:11,420 PROFESSOR: That-- So a slot is 0 means 465 00:27:11,420 --> 00:27:14,260 that no number was inserted in it. 466 00:27:14,260 --> 00:27:17,990 We're inserting n numbers, so it's the probability 467 00:27:17,990 --> 00:27:27,940 that a single number was not inserted 468 00:27:27,940 --> 00:27:36,350 in the slot, raised to the power of n. 469 00:27:36,350 --> 00:27:38,720 So we have n independent experiments, right? 470 00:27:38,720 --> 00:27:43,390 Every time you insert a number into the hash function, 471 00:27:43,390 --> 00:27:45,380 that's one experiment. 472 00:27:45,380 --> 00:27:47,790 The hash function gives you independent values 473 00:27:47,790 --> 00:27:50,560 for all the elements. 474 00:27:50,560 --> 00:27:53,530 So all the insertions are independent of each other. 475 00:27:53,530 --> 00:27:58,140 If, in a single insertion, you've hit that slot, 476 00:27:58,140 --> 00:28:00,460 then you've made it a 1-- game over. 477 00:28:00,460 --> 00:28:03,670 So the slot is only a zero if none of the insertions 478 00:28:03,670 --> 00:28:05,420 make it a 1.
479 00:28:05,420 --> 00:28:07,670 So you take the probability that the insertion doesn't 480 00:28:07,670 --> 00:28:09,810 make it a one, and you raise it to the power n, 481 00:28:09,810 --> 00:28:12,129 because that has to happen n times in order 482 00:28:12,129 --> 00:28:13,670 for the whole thing to be successful. 483 00:28:24,320 --> 00:28:26,200 And the probability that the number was not 484 00:28:26,200 --> 00:28:29,520 inserted in a slot is 1 minus the probability 485 00:28:29,520 --> 00:28:31,440 that it was inserted. 486 00:28:31,440 --> 00:28:33,820 Right, we're doing this again. 487 00:28:33,820 --> 00:28:39,710 1 minus probability that a number hit. 488 00:28:44,081 --> 00:28:45,205 Well, what's this probability? 489 00:28:47,740 --> 00:28:48,760 Uniform hashing. 490 00:28:48,760 --> 00:28:49,865 AUDIENCE: 1/m 491 00:28:49,865 --> 00:28:50,490 PROFESSOR: 1/m. 492 00:28:53,800 --> 00:29:01,690 So this whole thing is 1 minus (1 minus 1/m) to the power n. 493 00:29:01,690 --> 00:29:07,470 1 minus ((m minus 1)/m) to the power n. 494 00:29:13,398 --> 00:29:15,868 AUDIENCE: Can we go through this again. 495 00:29:15,868 --> 00:29:20,067 From 1 minus probability of a slot is 0, to 1 496 00:29:20,067 --> 00:29:25,260 minus probability of a number was not inserted in a slot? 497 00:29:25,260 --> 00:29:26,200 PROFESSOR: OK. 498 00:29:26,200 --> 00:29:28,120 So first off, the point of the problem. 499 00:29:28,120 --> 00:29:29,570 It's our problem, right? 500 00:29:29,570 --> 00:29:31,284 Don't panic, don't be angry. 501 00:29:31,284 --> 00:29:33,450 You're not going to have something this hard on the exam. 502 00:29:33,450 --> 00:29:35,984 The point of this is, I want to go through probabilities 503 00:29:35,984 --> 00:29:37,900 a little bit, and I want to go through hashing 504 00:29:37,900 --> 00:29:39,220 and the math behind hashing. 505 00:29:39,220 --> 00:29:43,140 Because remembering that will be useful.
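The single-hash answer from this part of the recitation, 1 minus ((m minus 1)/m) to the power n, can be evaluated directly. The following is a minimal sketch in Python, not something shown in the recitation; the function name `single_hash_false_positive` is a made-up label, and it assumes one uniform, independent hash value per inserted element.

```python
def single_hash_false_positive(n: int, m: int) -> float:
    """Chance that a new element hashes to a slot that is already 1.

    Assumes n insertions into a table of m bits, each insertion hitting
    a uniformly random slot independently of the others.
    """
    if m <= 0 or n < 0:
        raise ValueError("need m > 0 and n >= 0")
    # P(slot is 0) = ((m - 1) / m) ** n, so P(slot is 1) is its complement.
    return 1.0 - ((m - 1) / m) ** n


if __name__ == "__main__":
    # Compare against the rough upper bound n/m mentioned earlier.
    for n, m in [(10, 100), (100, 100), (1000, 100)]:
        print(n, m, single_hash_false_positive(n, m), min(1.0, n / m))
```

The printed comparison shows the exact value staying at or below n/m, which matches the "at most n 1's" argument.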
506 00:29:43,140 --> 00:29:48,970 OK, so now you said you're having trouble with this step? 507 00:29:53,474 --> 00:29:55,470 OK, so let's see. 508 00:29:55,470 --> 00:29:59,100 Let's do this here. 509 00:29:59,100 --> 00:30:01,540 So we have this table here, right? 510 00:30:01,540 --> 00:30:08,550 And we have n elements-- e1, e2, e3, all the way through en. 511 00:30:08,550 --> 00:30:10,140 How do we put them in the table? 512 00:30:10,140 --> 00:30:13,720 We hash each of them, and each of them maps 513 00:30:13,720 --> 00:30:16,570 to a random slot in the table. 514 00:30:16,570 --> 00:30:22,060 If these are the slots, then e1 might map here, 515 00:30:22,060 --> 00:30:26,260 e2 might map here, e3 might map here, 516 00:30:26,260 --> 00:30:30,220 e4 might map here, so on and so forth. 517 00:30:30,220 --> 00:30:32,460 So I have arrows, right? 518 00:30:32,460 --> 00:30:38,162 Every time I do a hash, that's going to set something to a 1. 519 00:30:38,162 --> 00:30:40,370 The numbers don't necessarily map to different slots, 520 00:30:40,370 --> 00:30:44,820 because each number, on its own, maps to a random slot. 521 00:30:44,820 --> 00:30:48,240 So these are all going to be ones. 522 00:30:48,240 --> 00:30:50,480 And everything else becomes zero. 523 00:30:50,480 --> 00:30:54,840 If no number maps to a slot, it is 0. 524 00:30:54,840 --> 00:30:58,490 OK, let's look at one slot, any slot. 525 00:30:58,490 --> 00:31:01,590 So let's say I'm looking at this slot over here. 526 00:31:01,590 --> 00:31:03,870 Can you guys see, by the way? 527 00:31:03,870 --> 00:31:06,420 OK, so let's look at this guy here. 528 00:31:06,420 --> 00:31:09,940 What's the probability that it's a 0? 529 00:31:09,940 --> 00:31:14,935 So the probability that the slot is 530 00:31:14,935 --> 00:31:21,180 a 0 is the probability that the first number didn't 531 00:31:21,180 --> 00:31:27,560 map to it-- otherwise it would be a 1-- e1 532 00:31:27,560 --> 00:31:30,490 didn't hash to that slot. 
533 00:31:33,410 --> 00:31:37,950 e2 also couldn't map to that slot, right? 534 00:31:37,950 --> 00:31:43,180 So it's the probability that e1 didn't hash to the slot, 535 00:31:43,180 --> 00:31:54,940 and e2 didn't hash into the slot, and e3 536 00:31:54,940 --> 00:31:59,100 didn't hash into the slot, so on so forth, right? 537 00:31:59,100 --> 00:32:04,600 All the way up until en didn't hash to the slot. 538 00:32:04,600 --> 00:32:06,510 This makes sense? 539 00:32:06,510 --> 00:32:08,240 Now these are all independent events, 540 00:32:08,240 --> 00:32:10,440 because all the hashes are independent, 541 00:32:10,440 --> 00:32:13,100 by the uniform hashing assumption. 542 00:32:13,100 --> 00:32:16,860 So then I can turn ands into products. 543 00:32:16,860 --> 00:32:20,880 So I can say that this equals the probability 544 00:32:20,880 --> 00:32:28,500 that e1 didn't hash into the slot, times the probability 545 00:32:28,500 --> 00:32:34,990 that e2 didn't hash into the slot, times the probability 546 00:32:34,990 --> 00:32:39,240 that e3 didn't hash into the slot, so on and so forth, 547 00:32:39,240 --> 00:32:43,915 all the way to the probability that en didn't hash. 548 00:32:51,800 --> 00:32:53,970 So since I'm dealing with the same hash function, 549 00:32:53,970 --> 00:32:56,920 turns out that all the probabilities are the same. 550 00:32:56,920 --> 00:33:01,810 So that's the probability that some fixed number 551 00:33:01,810 --> 00:33:06,700 didn't hash, to the power n. 552 00:33:11,700 --> 00:33:15,300 So this is how I got from here to here. 553 00:33:15,300 --> 00:33:18,900 Probabilities and the properties of hashes and hashing 554 00:33:18,900 --> 00:33:19,930 assumptions. 555 00:33:19,930 --> 00:33:22,700 So you guys should have those on your cheat sheet, 556 00:33:22,700 --> 00:33:25,521 and maybe if you have time, review probabilities a bit.
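Written out, the board argument for one fixed slot looks roughly like the following. This is a reconstruction from the spoken steps, assuming each of e1 through en hashes uniformly and independently into one of m slots; "misses" is shorthand introduced here for "didn't hash to the slot".

```latex
\begin{align*}
\Pr[\text{slot} = 0]
  &= \Pr[e_1 \text{ misses} \wedge e_2 \text{ misses} \wedge \dots \wedge e_n \text{ misses}] \\
  &= \prod_{i=1}^{n} \Pr[e_i \text{ misses}]
     && \text{(independence turns ands into products)} \\
  &= \left(1 - \frac{1}{m}\right)^{n}
     && \text{(same hash function, so every factor is equal)} \\
\Pr[\text{slot} = 1]
  &= 1 - \left(\frac{m-1}{m}\right)^{n}
\end{align*}
```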
557 00:33:25,521 --> 00:33:26,896 AUDIENCE: What is the probability 558 00:33:26,896 --> 00:33:29,540 that any given one doesn't hash, 1/m? 559 00:33:32,697 --> 00:33:34,985 So if e1 doesn't hash in that spot, 560 00:33:34,985 --> 00:33:37,690 isn't that probability 1/m? 561 00:33:37,690 --> 00:33:39,590 PROFESSOR: Not quite. 562 00:33:39,590 --> 00:33:41,520 You're close, but not quite. 563 00:33:41,520 --> 00:33:44,550 So you're saying that the probability that e1 564 00:33:44,550 --> 00:33:47,722 doesn't hash to this slot is 1/m? 565 00:33:47,722 --> 00:33:49,180 AUDIENCE: I guess it's 1 minus 1/m. 566 00:33:49,180 --> 00:33:50,750 PROFESSOR: Exactly. 567 00:33:50,750 --> 00:33:52,590 The probability that it would hash here 568 00:33:52,590 --> 00:33:55,480 is 1/m, because it has to pick that one slot out 569 00:33:55,480 --> 00:33:57,630 of m possible slots. 570 00:33:57,630 --> 00:33:59,470 But if you're just saying, all I want 571 00:33:59,470 --> 00:34:02,920 is that it doesn't hash here, well, it 572 00:34:02,920 --> 00:34:05,290 means it can hash anywhere else. 573 00:34:05,290 --> 00:34:07,680 So it has m minus 1 options. 574 00:34:07,680 --> 00:34:09,850 It can go to any of those m minus 1 places, 575 00:34:09,850 --> 00:34:11,520 just not to that one place. 576 00:34:11,520 --> 00:34:13,111 So m minus 1 over m. 577 00:34:19,845 --> 00:34:22,250 AUDIENCE: It's interesting it went the other direction. 578 00:34:22,250 --> 00:34:25,050 Instead of saying, it's 1, it's 1 minus it. 579 00:34:27,387 --> 00:34:29,053 Wouldn't it have been just as easy to go 580 00:34:29,053 --> 00:34:30,409 the other direction, or no? 581 00:34:30,409 --> 00:34:32,510 PROFESSOR: No. 582 00:34:32,510 --> 00:34:34,460 Not doing this makes the problem hard, 583 00:34:34,460 --> 00:34:36,489 so that's why we're doing it.
584 00:34:36,489 --> 00:34:40,100 This kind of flipping is easy to do conceptually, 585 00:34:40,100 --> 00:34:43,152 but it might make a hard problem into a really easy problem, 586 00:34:43,152 --> 00:34:44,610 or at least into a do-able problem. 587 00:34:49,923 --> 00:34:52,338 AUDIENCE: Isn't it this the same thing? 588 00:34:52,338 --> 00:34:55,474 I guess maybe not totally. 589 00:34:55,474 --> 00:34:57,890 PROFESSOR: So it is exactly the same in terms of the math, 590 00:34:57,890 --> 00:35:01,875 but computing this without turning it into this 591 00:35:01,875 --> 00:35:05,240 is really hard. 592 00:35:05,240 --> 00:35:07,070 AUDIENCE: Any given slot is 1, isn't it 593 00:35:07,070 --> 00:35:10,125 kind of like what we just said, except if the probability 594 00:35:10,125 --> 00:35:17,020 of any one mapping is 1/m, mapping to a 1, right? 595 00:35:17,020 --> 00:35:18,550 And then you take 1 over m raised 596 00:35:18,550 --> 00:35:23,636 to the n, that's the probability of it being a 1 597 00:35:23,636 --> 00:35:27,479 at that one place, right? 598 00:35:27,479 --> 00:35:28,520 PROFESSOR: No, not quite. 599 00:35:37,304 --> 00:35:40,250 Yeah. 600 00:35:40,250 --> 00:35:43,610 OK, so are we getting this? 601 00:35:43,610 --> 00:35:46,000 Somewhat? 602 00:35:46,000 --> 00:35:46,555 Yes? 603 00:35:46,555 --> 00:35:48,500 AUDIENCE: So the probability of a false positive, 604 00:35:48,500 --> 00:35:50,666 you're saying that's what's the probability that you 605 00:35:50,666 --> 00:35:54,004 get the 1, if you actually should [INAUDIBLE] the 0. 606 00:35:54,004 --> 00:35:57,479 It's because multiple things mapped to that one slot, right? 607 00:35:57,479 --> 00:35:59,520 PROFESSOR: So the probability of a false positive 608 00:35:59,520 --> 00:36:04,420 is the probability that, given a new element, when we hash it 609 00:36:04,420 --> 00:36:07,260 we get the 1. 
610 00:36:07,260 --> 00:36:09,927 The hash of that new element is independent of all 611 00:36:09,927 --> 00:36:10,635 the other hashes. 612 00:36:12,898 --> 00:36:14,814 AUDIENCE: Then why is it simple in probability 613 00:36:14,814 --> 00:36:17,877 that you get the 1? 614 00:36:17,877 --> 00:36:19,460 PROFESSOR: So if I have a new element, 615 00:36:19,460 --> 00:36:21,350 I'm going to compute its hash, and I'm 616 00:36:21,350 --> 00:36:22,970 going to look in the table. 617 00:36:22,970 --> 00:36:24,801 If I see a 1, I'm going to say, oh. 618 00:36:24,801 --> 00:36:26,176 AUDIENCE: Oh, it's a new element. 619 00:36:26,176 --> 00:36:26,240 OK. 620 00:36:26,240 --> 00:36:27,656 PROFESSOR: Yeah, so it's something 621 00:36:27,656 --> 00:36:29,230 that was not in the set. 622 00:36:29,230 --> 00:36:30,714 AUDIENCE: OK. 623 00:36:30,714 --> 00:36:31,630 PROFESSOR: Okay, cool. 624 00:36:36,940 --> 00:36:40,080 OK, so let's see if we can do a harder version of this. 625 00:36:43,000 --> 00:36:48,030 So this probability isn't great, but if we do one trick, 626 00:36:48,030 --> 00:36:49,390 we can make this really nice. 627 00:36:49,390 --> 00:36:52,430 And this puts together a trick called bloom filters that 628 00:36:52,430 --> 00:36:54,490 is used in all sorts of situations. 629 00:37:01,510 --> 00:37:07,220 So for Bloom filters, we still have n elements, 630 00:37:07,220 --> 00:37:11,175 and we still have a table of m bits. 631 00:37:16,310 --> 00:37:19,230 What changes this time is instead of having one function, 632 00:37:19,230 --> 00:37:21,510 we have k hash functions. 633 00:37:27,570 --> 00:37:31,120 So when we take an element and insert it, 634 00:37:31,120 --> 00:37:32,930 we're taking element i. 635 00:37:32,930 --> 00:37:34,640 The way to insert it is we're going 636 00:37:34,640 --> 00:37:43,980 to compute its hash value using all the hash functions, 637 00:37:43,980 --> 00:37:46,500 and set all the corresponding bits to 1. 
638 00:37:54,010 --> 00:38:06,350 So insert ei becomes, for j in 1 through k, 639 00:38:06,350 --> 00:38:09,530 the table bit corresponding to the hash function, j, 640 00:38:09,530 --> 00:38:13,970 of the element is 1. 641 00:38:13,970 --> 00:38:16,030 So each element sets k bits to 1. 642 00:38:19,050 --> 00:38:21,330 Now how do we check if an element is in the table? 643 00:38:26,395 --> 00:38:27,270 AUDIENCE: [INAUDIBLE] 644 00:38:34,680 --> 00:38:36,180 PROFESSOR: Since, for every element, 645 00:38:36,180 --> 00:38:38,980 we set all the corresponding k bits to 1, now when 646 00:38:38,980 --> 00:38:40,850 we have a new element, we're going 647 00:38:40,850 --> 00:38:44,900 to compute to the k positions, and if any of them is a 0, 648 00:38:44,900 --> 00:38:47,540 then we couldn't have possibly put that in the table. 649 00:38:52,610 --> 00:39:03,830 So all T of h j of f have to be 1. 650 00:39:06,530 --> 00:39:09,070 So for every element, we hashed it k times, 651 00:39:09,070 --> 00:39:10,990 and set the corresponding bits. 652 00:39:10,990 --> 00:39:17,730 If we have a new element, and by hashing we get here and here, 653 00:39:17,730 --> 00:39:21,370 but we also get here, and this guy was a zero, 654 00:39:21,370 --> 00:39:23,137 we know we definitely didn't put this in. 655 00:39:25,940 --> 00:39:28,480 So now what's the probability of a false positive? 656 00:39:33,420 --> 00:39:36,384 AUDIENCE: My first intuition is just raising that to a power. 657 00:39:41,324 --> 00:39:44,222 AUDIENCE: The probability that when you check-- 658 00:39:44,222 --> 00:39:46,430 PROFESSOR: Oh, I forgot to say something, by the way. 659 00:39:46,430 --> 00:39:50,320 The k hash functions-- I think they satisfy simple uniform 660 00:39:50,320 --> 00:39:51,285 hashing. 661 00:39:51,285 --> 00:39:52,910 I'm not sure if that's the right thing, 662 00:39:52,910 --> 00:39:55,345 but they all have independent values from each other. 
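The insert and check rules described here can be sketched as a small Python class. This is an illustration under stated assumptions, not the code from the course: the class name `BloomFilter`, the method names, and the use of SHA-256 with a per-function prefix as a stand-in for k independent hash functions are all choices made for this sketch.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Table T of m bits with k hash functions h_1 .. h_k (illustrative)."""

    def __init__(self, m: int, k: int) -> None:
        if m <= 0 or k <= 0:
            raise ValueError("need m > 0 and k > 0")
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item: str) -> Iterator[int]:
        # Assumption: prefixing the function index j gives k hash values
        # that behave as if independent; each lands in 0 .. m - 1.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        # For j in 1 .. k: T[h_j(e)] = 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Any 0 among the k bits means the item was never inserted.
        # All 1's means "probably inserted" (could be a false positive).
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    bf = BloomFilter(m=64, k=3)
    bf.insert("e1")
    print(bf.might_contain("e1"))  # inserted items are always reported
```

A `False` from `might_contain` is definitive; a `True` is only a hint, which is why the false positive probability is the quantity worth computing.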
663 00:39:55,345 --> 00:39:56,470 So they're all independent. 664 00:40:02,080 --> 00:40:05,170 So for any number you give, any hash function 665 00:40:05,170 --> 00:40:07,000 returns a value that's independent of all 666 00:40:07,000 --> 00:40:10,551 the other hash functions, and they're all 0 667 00:40:10,551 --> 00:40:11,300 through m minus 1. 668 00:40:18,500 --> 00:40:20,780 AUDIENCE: Why is that not just raised to something? 669 00:40:20,780 --> 00:40:22,321 Because we know the probability-- OK, 670 00:40:22,321 --> 00:40:25,290 actually we need to recalculate that. 671 00:40:25,290 --> 00:40:27,581 AUDIENCE: Because it's the probability that all of them 672 00:40:27,581 --> 00:40:30,606 are 1, even though you haven't hashed yet. 673 00:40:35,930 --> 00:40:38,130 PROFESSOR: So the false positive, the probability 674 00:40:38,130 --> 00:40:40,280 of false positives is the probability 675 00:40:40,280 --> 00:40:46,995 that all the k slots that correspond to f are 1's, right? 676 00:40:54,760 --> 00:41:01,620 So, since the hash functions are all independent, 677 00:41:01,620 --> 00:41:03,730 this is the probability that one slot 678 00:41:03,730 --> 00:41:05,810 is a 1, raised to the power k. 679 00:41:05,810 --> 00:41:08,940 Right, because they're all independent slots. 680 00:41:08,940 --> 00:41:14,160 So it's the probability that one slot 681 00:41:14,160 --> 00:41:18,740 is a 1, raised to the power k. 682 00:41:18,740 --> 00:41:20,510 OK, so now what's the probability 683 00:41:20,510 --> 00:41:22,500 that one slot is a 1? 684 00:41:22,500 --> 00:41:26,170 It looks a lot like this problem, right? 685 00:41:26,170 --> 00:41:27,850 Except there's a tweak. 686 00:41:27,850 --> 00:41:30,294 How many times did we put a 1 in the table? 687 00:41:33,430 --> 00:41:38,010 So here, we put a 1 in the table for every element. 688 00:41:38,010 --> 00:41:42,920 So we have n sets, right?
689 00:41:42,920 --> 00:41:49,220 So n times we're going to set T of something to 1. 690 00:41:52,220 --> 00:41:53,160 Right? 691 00:41:53,160 --> 00:41:55,690 For every element, we have one set. 692 00:41:55,690 --> 00:41:57,049 We set one bit to 1. 693 00:41:57,049 --> 00:41:59,340 It might have been set before-- that's something else. 694 00:41:59,340 --> 00:41:59,840 Yes? 695 00:41:59,840 --> 00:42:03,570 AUDIENCE: So here it's raised to the n k? 696 00:42:03,570 --> 00:42:05,790 PROFESSOR: Yeah, pretty much. 697 00:42:05,790 --> 00:42:08,440 So here, for every element we hash it through all the k 698 00:42:08,440 --> 00:42:12,210 functions, and set the corresponding bits to 1. 699 00:42:12,210 --> 00:42:20,370 So one element generates k set operations, 700 00:42:20,370 --> 00:42:25,230 and we have n elements, so we set n k bits to 1. 701 00:42:35,060 --> 00:42:36,274 Does this make sense? 702 00:42:36,274 --> 00:42:39,754 AUDIENCE: Can two hash functions point to the same slot? 703 00:42:39,754 --> 00:42:40,420 PROFESSOR: Sure. 704 00:42:43,216 --> 00:42:44,840 But they're all independent, and that's 705 00:42:44,840 --> 00:42:46,840 the only thing that matters. 706 00:42:46,840 --> 00:42:50,000 So every time we set the bit, which bit was set 707 00:42:50,000 --> 00:42:54,380 is independent of all the other bits we set, 708 00:42:54,380 --> 00:42:56,500 because all the hash functions are independent, 709 00:42:56,500 --> 00:42:58,560 and all the values are independent of each other. 710 00:43:01,830 --> 00:43:05,410 So this time, the table size is still m, so that didn't change. 711 00:43:05,410 --> 00:43:08,750 Last time we set n bits to 1, this time we set n k bits to 1. 712 00:43:08,750 --> 00:43:12,420 So then the right thing to do is copy this answer, 713 00:43:12,420 --> 00:43:14,790 and replace n with n k.
714 00:43:14,790 --> 00:43:17,330 And if you have to write the proof, 715 00:43:17,330 --> 00:43:19,969 you'd copy-paste the proof and replace n with n k. 716 00:43:26,460 --> 00:43:32,900 So this is 1 minus ((m minus 1)/m) to the power n k. 717 00:43:35,815 --> 00:43:37,940 And of course you should go through the whole thing 718 00:43:37,940 --> 00:43:40,324 in your head and convince yourselves that this is true. 719 00:43:40,324 --> 00:43:42,490 AUDIENCE: Does that say one of the elements is what? 720 00:43:42,490 --> 00:43:44,460 k, something? 721 00:43:44,460 --> 00:43:45,460 AUDIENCE: Sets. 722 00:43:45,460 --> 00:43:47,270 PROFESSOR: Bit sets. 723 00:43:47,270 --> 00:43:51,920 So one element sets k bits in the table, not necessarily 724 00:43:51,920 --> 00:43:54,400 different bits, just independent bits. 725 00:43:54,400 --> 00:43:56,350 So if you have n elements altogether, 726 00:43:56,350 --> 00:43:57,913 they set n times k bits. 727 00:44:09,120 --> 00:44:14,880 This thing gets run n times k times, whereas here, 728 00:44:14,880 --> 00:44:21,180 the set operation gets run n times in total. 729 00:44:21,180 --> 00:44:22,930 That's the difference in the two problems. 730 00:44:30,360 --> 00:44:32,560 Right here you have one function for each element, 731 00:44:32,560 --> 00:44:34,055 here you have k hash functions. 732 00:44:44,240 --> 00:44:46,500 This is hard, right? 733 00:44:46,500 --> 00:44:50,060 Well, it's the hardest hashing problem 734 00:44:50,060 --> 00:44:51,530 that I could think about and that 735 00:44:51,530 --> 00:44:54,430 makes us go through probabilities and through all 736 00:44:54,430 --> 00:44:55,990 the hash stuff. 737 00:44:55,990 --> 00:44:59,200 The problems on the exam will be easier, so one, don't panic. 738 00:44:59,200 --> 00:45:02,720 Two, review hashing, review probabilities.
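Putting the two pieces together (n k bit sets in total, and all k probed bits must be 1), the Bloom filter answer under the recitation's independence assumptions is (1 minus ((m minus 1)/m) to the power n k), all raised to the power k. A minimal Python sketch follows; the name `bloom_false_positive` is a label chosen here, not something from the course.

```python
def bloom_false_positive(n: int, m: int, k: int) -> float:
    """False positive probability for n elements, m bits, k hash functions.

    Assumes every one of the n * k bit sets lands on a uniformly random
    slot, independently of all the others.
    """
    if m <= 0 or n < 0 or k <= 0:
        raise ValueError("need m > 0, n >= 0, k > 0")
    # Probability that one fixed slot is 1 after n * k independent sets.
    p_one = 1.0 - ((m - 1) / m) ** (n * k)
    # A new element is a false positive only if all k probed bits are 1.
    return p_one ** k


if __name__ == "__main__":
    # With k = 1 this reduces to the single-hash-function answer.
    print(bloom_false_positive(n=100, m=1000, k=1))
    print(bloom_false_positive(n=100, m=1000, k=5))
```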
739 00:45:02,720 --> 00:45:06,160 When I said, from the theory, this is what you get, 740 00:45:06,160 --> 00:45:09,540 if you didn't understand that then please review the theory. 741 00:45:09,540 --> 00:45:11,550 AUDIENCE: Why is it raised to the k? 742 00:45:11,550 --> 00:45:14,590 Because we did down there, if we replace n with n k, 743 00:45:14,590 --> 00:45:18,310 then we'd just get everything except. 744 00:45:18,310 --> 00:45:22,510 PROFESSOR: So this thing in here is the answer 745 00:45:22,510 --> 00:45:28,500 to the previous problem, except you take an n 746 00:45:28,500 --> 00:45:31,640 and you replace it with an n k. 747 00:45:31,640 --> 00:45:36,560 So this is the probability that one bit is set to 1. 748 00:45:36,560 --> 00:45:38,430 But here, when you're given an element, 749 00:45:38,430 --> 00:45:41,450 you're going to hash it through the k functions-- you take 750 00:45:41,450 --> 00:45:44,550 this guy-- you're going to hash it through the k functions, 751 00:45:44,550 --> 00:45:46,580 and you're going to check all the bits. 752 00:45:46,580 --> 00:45:50,450 So you're going to check k bits. 753 00:45:50,450 --> 00:45:53,130 So as long as any of the k bits is a zero, not 754 00:45:53,130 --> 00:45:55,140 a false positive. 755 00:45:55,140 --> 00:45:58,812 So we need all the k bits to be a 1. 756 00:45:58,812 --> 00:45:59,645 AUDIENCE: Oh, I see. 757 00:46:02,579 --> 00:46:04,977 AUDIENCE: What if the hash functions are dependent? 758 00:46:04,977 --> 00:46:06,435 PROFESSOR: Then become intractable. 759 00:46:09,140 --> 00:46:11,530 AUDIENCE: And what if they are? 
760 00:46:15,370 --> 00:46:19,770 I think the in this problem, the way they are being hashed, 761 00:46:19,770 --> 00:46:21,566 that becomes dependent, because I 762 00:46:21,566 --> 00:46:24,670 think there were some problems where, if something is being 763 00:46:24,670 --> 00:46:27,694 hashed somewhere, then the probability-- 764 00:46:27,694 --> 00:46:29,110 there could be hash functions that 765 00:46:29,110 --> 00:46:33,980 would put the other thing in the next slot. 766 00:46:33,980 --> 00:46:36,770 PROFESSOR: Yes, so you want to reduce these problems 767 00:46:36,770 --> 00:46:37,870 to independent hashing. 768 00:46:37,870 --> 00:46:40,050 If you look at the proofs, all the proofs 769 00:46:40,050 --> 00:46:42,890 assume uniform hashing, simple uniform, 770 00:46:42,890 --> 00:46:45,722 whatever it takes to get the math down to independence. 771 00:46:45,722 --> 00:46:47,305 Because this is the only thing that we 772 00:46:47,305 --> 00:46:49,310 know how to solve with probabilities. 773 00:46:49,310 --> 00:46:51,090 If everything is independent, then things 774 00:46:51,090 --> 00:46:54,700 multiply and add up in the right places, and everything is easy. 775 00:46:54,700 --> 00:46:56,720 If things are dependent, then proofs 776 00:46:56,720 --> 00:46:57,890 become really, really hard. 777 00:46:57,890 --> 00:46:59,750 So whenever you have dependent things, 778 00:46:59,750 --> 00:47:02,397 you want to find a way to reduce that to independent things. 779 00:47:15,560 --> 00:47:17,860 Is everyone tired, or do you guys really 780 00:47:17,860 --> 00:47:19,830 not like this problem? 781 00:47:19,830 --> 00:47:23,180 By the way, really cool trick-- so this 782 00:47:23,180 --> 00:47:25,260 turns out to be a lot better than that, 783 00:47:25,260 --> 00:47:27,280 and I think the optimal value of k 784 00:47:27,280 --> 00:47:30,150 is around square roots of log n. 
785 00:47:30,150 --> 00:47:34,220 And that gives you some filters with a really low 786 00:47:34,220 --> 00:47:36,244 false positive rate. 787 00:47:36,244 --> 00:47:38,680 AUDIENCE: What do you mean by optimal? 788 00:47:38,680 --> 00:47:41,080 PROFESSOR: Minimize the false positives. 789 00:47:41,080 --> 00:47:47,840 So given n and m, pick a k so that this thing is minimized. 790 00:47:47,840 --> 00:47:50,370 AUDIENCE: What was the answer again? 791 00:47:50,370 --> 00:47:52,812 Or actually, regardless of that, what's 792 00:47:52,812 --> 00:47:55,540 the percentage of false positives? 793 00:47:55,540 --> 00:47:58,270 PROFESSOR: It depends on what your n and m are, right? 794 00:47:58,270 --> 00:48:00,040 The more bits you can afford 795 00:48:00,040 --> 00:48:01,510 AUDIENCE: But if you maximize your k, 796 00:48:01,510 --> 00:48:06,100 you said you came up with some k that's maximized 797 00:48:06,100 --> 00:48:07,079 PROFESSOR: I think k is 798 00:48:07,079 --> 00:48:08,370 AUDIENCE: Square root of log n. 799 00:48:10,964 --> 00:48:12,660 AUDIENCE: So then if you use that. 800 00:48:12,660 --> 00:48:14,206 PROFESSOR: Let's not do the math. 801 00:48:14,206 --> 00:48:15,595 [LAUGHTER] 802 00:48:15,595 --> 00:48:16,900 It's really, really good. 803 00:48:16,900 --> 00:48:21,540 So these are used for all sorts of practical problems, all 804 00:48:21,540 --> 00:48:25,268 the way from branch predictors in processors, to databases. 805 00:48:25,268 --> 00:48:26,836 AUDIENCE: So is it better than 1%? 806 00:48:26,836 --> 00:48:27,960 Do you know that, at least? 807 00:48:27,960 --> 00:48:33,210 PROFESSOR: Oh yeah, for practical uses, this gets you, 808 00:48:33,210 --> 00:48:36,030 I think to 1% of 1% of 1%. 809 00:48:41,680 --> 00:48:45,110 So usually, you put a Bloom filter before a really expensive 810 00:48:45,110 --> 00:48:48,300 check, and the Bloom filter gets rid of most 811 00:48:48,300 --> 00:48:50,264 of the false positives.
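"Given n and m, pick a k so that this thing is minimized" can be tried by brute force rather than by doing the math. The sketch below is an illustration only: the function names are invented here, the formula is the one derived in this recitation under full independence, and the search limit `max_k` is an arbitrary assumption. For reference, the closed-form estimate usually quoted for Bloom filters is k of about (m/n) times ln 2, which the search can be compared against.

```python
import math


def bloom_false_positive(n: int, m: int, k: int) -> float:
    """(1 - ((m - 1) / m) ** (n * k)) ** k, assuming independent hashes."""
    return (1.0 - ((m - 1) / m) ** (n * k)) ** k


def best_k(n: int, m: int, max_k: int = 64) -> int:
    """Smallest k in 1 .. max_k that minimizes the false positive rate."""
    return min(range(1, max_k + 1),
               key=lambda k: bloom_false_positive(n, m, k))


if __name__ == "__main__":
    n, m = 1000, 10000  # 10 bits of table per stored element
    k = best_k(n, m)
    print("best k:", k)
    print("false positive rate:", bloom_false_positive(n, m, k))
    print("(m/n) * ln 2 estimate:", (m / n) * math.log(2))
```

Too few hash functions leave the check weak, and too many fill the table with 1's, so the rate first falls and then rises as k grows.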
812 00:48:50,264 --> 00:48:51,680 And then you have a few more where 813 00:48:51,680 --> 00:48:53,206 you do the more expensive check. 814 00:49:05,620 --> 00:49:06,952 Okay, does this make sense? 815 00:49:11,240 --> 00:49:11,910 Any questions? 816 00:49:15,290 --> 00:49:19,812 AUDIENCE: Do you more optimal if you repeated this Bloom 817 00:49:19,812 --> 00:49:22,040 filter independently of the other one, 818 00:49:22,040 --> 00:49:25,330 with more hash functions in that memory structure? 819 00:49:25,330 --> 00:49:31,090 PROFESSOR: I think doubling the memory size is better. 820 00:49:31,090 --> 00:49:33,600 So two filters is the same as having two n bits. 821 00:49:33,600 --> 00:49:36,150 I think doubling gives you better results, always. 822 00:49:47,500 --> 00:49:48,650 OK, so general stuff. 823 00:49:48,650 --> 00:49:51,610 We're going to have a lot of conceptual questions, 824 00:49:51,610 --> 00:49:55,300 so please make sure, again, make sure that for everything 825 00:49:55,300 --> 00:49:57,130 that we did, go through the problem. 826 00:49:57,130 --> 00:50:00,760 Understand the problem, know that there is a solution. 827 00:50:00,760 --> 00:50:02,430 Know the running time, maybe know 828 00:50:02,430 --> 00:50:03,880 how to implement the solution. 829 00:50:03,880 --> 00:50:05,905 Don't worry so much about the proof. 830 00:50:05,905 --> 00:50:07,530 We're going to have some problems where 831 00:50:07,530 --> 00:50:10,150 you have to come up with new things on your own, 832 00:50:10,150 --> 00:50:12,950 so get a good night's sleep before the exam. 833 00:50:12,950 --> 00:50:14,495 Really, if you have five hours left, 834 00:50:14,495 --> 00:50:16,620 then you have to choose between sleeping five hours 835 00:50:16,620 --> 00:50:18,985 or reading notes for five hours-- 836 00:50:18,985 --> 00:50:21,060 AUDIENCE: Drink caffeine. 
837 00:50:21,060 --> 00:50:22,480 PROFESSOR: It's not going to help, 838 00:50:22,480 --> 00:50:24,300 so caffeine actually helps you stay up, 839 00:50:24,300 --> 00:50:26,280 but it decreases your performance. 840 00:50:26,280 --> 00:50:30,030 And so if you're on caffeine, you're not going to think. 841 00:50:30,030 --> 00:50:33,530 You can regurgitate stuff, but you can't think. 842 00:50:33,530 --> 00:50:36,556 So caffeinating yourself is a-- 843 00:50:36,556 --> 00:50:39,670 AUDIENCE: I thought it was like it gives you concentration. 844 00:50:39,670 --> 00:50:42,460 PROFESSOR: So there's an optimum amount of sleep and caffeine 845 00:50:42,460 --> 00:50:43,080 combination. 846 00:50:43,080 --> 00:50:45,240 If you don't sleep and caffeinate yourself, 847 00:50:45,240 --> 00:50:46,720 I guarantee that you will not solve 848 00:50:46,720 --> 00:50:49,246 any of the problems that require new algorithms. 849 00:50:49,246 --> 00:50:51,620 AUDIENCE: Caffeine just squirts adrenaline in your brain. 850 00:50:51,620 --> 00:50:54,870 It doesn't do anything else. 851 00:50:54,870 --> 00:50:57,370 PROFESSOR: So the thing is the memory is going to be better. 852 00:50:57,370 --> 00:50:59,270 If all you're doing is memorization stuff, 853 00:50:59,270 --> 00:51:01,230 then it's going to be better. 854 00:51:01,230 --> 00:51:03,850 So you're going to do well on the pattern matching stuff. 855 00:51:03,850 --> 00:51:05,290 But when your brain is panicking, 856 00:51:05,290 --> 00:51:07,540 you're not going to come up with new solutions, right? 857 00:51:07,540 --> 00:51:10,030 Usually, you have a problem, a hard problem. 858 00:51:10,030 --> 00:51:12,370 You're thinking about it, and then at some point 859 00:51:12,370 --> 00:51:14,495 when you're relaxed, like when you're in the shower 860 00:51:14,495 --> 00:51:18,380 or when you wake up you're like, crap, I found a solution. 
861 00:51:18,380 --> 00:51:20,470 So the brain finds solutions when it's relaxed, 862 00:51:20,470 --> 00:51:23,160 not when it's like, holy shit, holy shit, holy shit. 863 00:51:23,160 --> 00:51:26,570 And adrenaline gets it in that mood. 864 00:51:26,570 --> 00:51:27,700 That's what it does. 865 00:51:27,700 --> 00:51:29,920 And that's what caffeine does in the end. 866 00:51:29,920 --> 00:51:32,970 So a little bit of caffeine might help you get up 867 00:51:32,970 --> 00:51:35,250 and get you running, but don't caffeinate 868 00:51:35,250 --> 00:51:37,990 yourself to not sleep the entire night. 869 00:51:37,990 --> 00:51:40,520 That's probably going to make you bomb the hard questions. 870 00:51:40,520 --> 00:51:41,550 Good luck on Friday. 871 00:51:41,550 --> 00:51:43,140 Eat candy.