1 00:00:00,050 --> 00:00:01,770 The following content is provided 2 00:00:01,770 --> 00:00:04,000 under a Creative Commons license. 3 00:00:04,000 --> 00:00:06,850 Your support will help MIT OpenCourseWare continue 4 00:00:06,850 --> 00:00:10,710 to offer high quality educational resources for free. 5 00:00:10,710 --> 00:00:13,320 To make a donation or view additional materials 6 00:00:13,320 --> 00:00:17,187 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,187 --> 00:00:17,812 at ocw.mit.edu. 8 00:00:22,300 --> 00:00:27,160 PROFESSOR: One more exciting lecture on hashing. 9 00:00:27,160 --> 00:00:29,010 And a couple reminders. 10 00:00:29,010 --> 00:00:32,960 I don't want to start out saying unpopular things, 11 00:00:32,960 --> 00:00:37,410 but we do have a quiz coming up next week on Tuesday. 12 00:00:37,410 --> 00:00:41,430 There will not be a lecture next Tuesday, 13 00:00:41,430 --> 00:00:43,110 but there will be a quiz. 14 00:00:43,110 --> 00:00:47,690 7:30 to 9:30 Tuesday evening. 15 00:00:47,690 --> 00:00:49,140 I will send an announcement. 16 00:00:49,140 --> 00:00:51,480 There's going to be a couple rooms. 17 00:00:51,480 --> 00:00:52,930 Some of you will be in this room. 18 00:00:52,930 --> 00:00:54,929 Some of you will have to go to a different room, 19 00:00:54,929 --> 00:00:56,890 since this room really can't hold 20 00:00:56,890 --> 00:01:00,070 180 students taking a quiz. 21 00:01:00,070 --> 00:01:01,400 All right? 22 00:01:01,400 --> 00:01:04,047 So hashing. 23 00:01:04,047 --> 00:01:05,630 I'm pretty excited about this lecture, 24 00:01:05,630 --> 00:01:09,280 because I think as I was talking with Victor just 25 00:01:09,280 --> 00:01:12,660 before this, if there's one thing you want to remember 26 00:01:12,660 --> 00:01:16,700 about hashing and you want to go implement a hash table, 27 00:01:16,700 --> 00:01:18,410 it's open addressing. 
28 00:01:18,410 --> 00:01:20,680 It's the simplest way that you can possibly 29 00:01:20,680 --> 00:01:22,720 implement a hash table. 30 00:01:22,720 --> 00:01:26,030 You can implement a hash table using an array. 31 00:01:26,030 --> 00:01:30,000 We've obviously talked about linked lists 32 00:01:30,000 --> 00:01:35,360 and chaining to implement hash tables in previous lectures, 33 00:01:35,360 --> 00:01:38,930 but we're going to actually get rid of pointers and linked lists, 34 00:01:38,930 --> 00:01:42,890 and implement a hash table using a single array data structure, 35 00:01:42,890 --> 00:01:46,830 and that's the notion of open addressing. 36 00:01:46,830 --> 00:01:49,360 Now in order to get open addressing to work, 37 00:01:49,360 --> 00:01:50,660 there's no free lunch, right? 38 00:01:50,660 --> 00:01:52,560 So you have a simple implementation. 39 00:01:52,560 --> 00:01:56,240 It turns out that in order to make open addressing efficient, 40 00:01:56,240 --> 00:01:58,800 you have to be a little more careful than if you're 41 00:01:58,800 --> 00:02:02,390 using the hash tables with chaining. 42 00:02:02,390 --> 00:02:05,070 And we're going to have to make an assumption 43 00:02:05,070 --> 00:02:06,706 about uniform hashing. 44 00:02:06,706 --> 00:02:08,289 I'll say a little bit more about that. 45 00:02:08,289 --> 00:02:11,780 But it's a different assumption from simple uniform hashing 46 00:02:11,780 --> 00:02:13,480 that Eric talked about. 47 00:02:13,480 --> 00:02:16,160 And we'll state this uniform hashing assumption. 48 00:02:16,160 --> 00:02:20,990 And we look at what the performance is of open 49 00:02:20,990 --> 00:02:23,390 addressing under this assumption. 50 00:02:23,390 --> 00:02:26,330 And this assumption is going to give us 51 00:02:26,330 --> 00:02:30,470 a sense of what good hash functions are 52 00:02:30,470 --> 00:02:33,350 for open addressing applications or for open addressing 53 00:02:33,350 --> 00:02:34,770 hash tables. 
54 00:02:34,770 --> 00:02:39,330 And finally we'll talk about cryptographic hashing. 55 00:02:39,330 --> 00:02:42,100 This is not really 6.006 material, 56 00:02:42,100 --> 00:02:44,290 but it's kind of cool material. 57 00:02:44,290 --> 00:02:47,890 It has a lot of applications in computer security 58 00:02:47,890 --> 00:02:49,100 and cryptography. 59 00:02:49,100 --> 00:02:53,710 And so we'll describe the notion of a cryptographic hash, 60 00:02:53,710 --> 00:02:57,970 and we'll talk about a couple of real simple and pervasive 61 00:02:57,970 --> 00:03:00,560 applications like password storage 62 00:03:00,560 --> 00:03:05,240 and file corruption detectors that you can implement 63 00:03:05,240 --> 00:03:07,450 using cryptographic hash functions, which 64 00:03:07,450 --> 00:03:10,440 are quite different from the regular hash functions 65 00:03:10,440 --> 00:03:13,060 that we're using in hash tables. 66 00:03:13,060 --> 00:03:18,460 Be it chaining hash tables or open addressing hash tables. 67 00:03:18,460 --> 00:03:19,690 All right? 68 00:03:19,690 --> 00:03:23,120 So let's get started and talk about open addressing. 69 00:03:30,080 --> 00:03:33,909 This is another approach to dealing with collisions. 70 00:03:33,909 --> 00:03:35,950 If you didn't have collisions, obviously an array 71 00:03:35,950 --> 00:03:37,190 would work, right? 72 00:03:37,190 --> 00:03:39,734 If you could somehow guarantee that there were no collisions. 73 00:03:39,734 --> 00:03:41,150 When you have collisions, you have 74 00:03:41,150 --> 00:03:44,450 to worry about the chaining and ensuring 75 00:03:44,450 --> 00:03:46,690 that you can still find the keys even though you 76 00:03:46,690 --> 00:03:50,800 had two keys that collided into the same slot. 77 00:03:50,800 --> 00:03:54,090 And we don't want to use chaining. 78 00:03:56,820 --> 00:03:59,910 The simplest data structure that we can possibly use are arrays. 
79 00:03:59,910 --> 00:04:04,430 Back when I was a grad student, I went through and got a PhD 80 00:04:04,430 --> 00:04:08,940 writing programs in C, never using any other structure 81 00:04:08,940 --> 00:04:12,370 than arrays, because I didn't like pointers. 82 00:04:12,370 --> 00:04:15,530 And so open addressing is a way that you 83 00:04:15,530 --> 00:04:18,810 can implement hash tables doing exactly this. 84 00:04:18,810 --> 00:04:22,300 And in particular, what we're going to do 85 00:04:22,300 --> 00:04:25,055 is assume an array structure with items. 86 00:04:31,810 --> 00:04:37,390 And we're going to assume that this one item-- at most 87 00:04:37,390 --> 00:04:38,630 one item per slot. 88 00:04:41,580 --> 00:04:44,970 So m has to be greater than or equal to n, right? 89 00:04:44,970 --> 00:04:48,240 So this is important because we don't have linked lists. 90 00:04:48,240 --> 00:04:51,960 We can't arbitrarily increase the storage 91 00:04:51,960 --> 00:04:56,860 of a slot using a chain, and have 92 00:04:56,860 --> 00:04:59,060 n, which is the number of elements, 93 00:04:59,060 --> 00:05:01,040 be greater than m, right? 94 00:05:01,040 --> 00:05:06,290 Which you could in the linked list table with chaining. 95 00:05:06,290 --> 00:05:09,100 But here you only have these array locations, 96 00:05:09,100 --> 00:05:11,510 these indices that you can put items into. 97 00:05:11,510 --> 00:05:16,620 So it's pretty much guaranteed that if you want a working open 98 00:05:16,620 --> 00:05:23,990 addressing hash table that m, which is the number of slots 99 00:05:23,990 --> 00:05:29,080 in the table, should be greater than or equal to the number 100 00:05:29,080 --> 00:05:31,660 of elements, all right? 101 00:05:31,660 --> 00:05:34,580 That's important. 102 00:05:34,580 --> 00:05:36,510 Now how does this work. 103 00:05:36,510 --> 00:05:38,990 Well, we're going to have this notion of probing. 
104 00:05:44,250 --> 00:05:48,160 And the notion of probing is that we're 105 00:05:48,160 --> 00:05:53,160 going to try to see if we can insert something 106 00:05:53,160 --> 00:05:56,016 into this hash table, and if you fail 107 00:05:56,016 --> 00:05:57,390 we're actually going to recompute 108 00:05:57,390 --> 00:06:00,410 a slightly different hash for the key 109 00:06:00,410 --> 00:06:02,160 that we're trying to insert, the key value 110 00:06:02,160 --> 00:06:03,535 pair that we're trying to insert. 111 00:06:03,535 --> 00:06:04,034 All right? 112 00:06:04,034 --> 00:06:05,960 So this is an iterative process, and we're 113 00:06:05,960 --> 00:06:09,920 going to continually probe until we find an empty slot 114 00:06:09,920 --> 00:06:13,560 into which we can insert this key value pair. 115 00:06:13,560 --> 00:06:15,750 The key should index into it. 116 00:06:15,750 --> 00:06:19,570 So you do have different hashes that 117 00:06:19,570 --> 00:06:22,190 are going to be computed based on this probing 118 00:06:22,190 --> 00:06:24,480 notion for a given key. 119 00:06:24,480 --> 00:06:27,050 All right? 120 00:06:27,050 --> 00:06:31,390 And so what we need now is a hash function 121 00:06:31,390 --> 00:06:35,370 that's different from the standard hash functions 122 00:06:35,370 --> 00:06:42,470 that we've talked about so far, which specifies 123 00:06:42,470 --> 00:06:51,210 the order of slots to probe, which is basically 124 00:06:51,210 --> 00:06:52,350 to try for a key. 125 00:06:58,570 --> 00:07:06,080 And this is going to be true for insert, search, and delete, 126 00:07:06,080 --> 00:07:08,190 which are three basic operations. 127 00:07:08,190 --> 00:07:10,820 And they're a little bit different, all right? 128 00:07:10,820 --> 00:07:14,224 Just like they were different for the chaining hash table, 129 00:07:14,224 --> 00:07:16,640 they're different here, but they're kind of more different 130 00:07:16,640 --> 00:07:17,190 here. 
131 00:07:17,190 --> 00:07:19,620 And you'll see what I mean when we go through this. 132 00:07:22,660 --> 00:07:25,270 And this is not just for one slot. 133 00:07:25,270 --> 00:07:28,190 It's going to specify an order of slots. 134 00:07:28,190 --> 00:07:32,180 And so our hash function h is going 135 00:07:32,180 --> 00:07:49,890 to take the universe of keys and also take 136 00:07:49,890 --> 00:07:53,120 what we're going to call the trial count. 137 00:07:53,120 --> 00:07:57,660 So if you're lucky-- well, you get lucky in your first trial. 138 00:07:57,660 --> 00:08:01,320 And if you're not, you hope to get lucky in your second trial, 139 00:08:01,320 --> 00:08:02,710 and so on and so forth. 140 00:08:02,710 --> 00:08:08,090 But the hash function is going to take two arguments. 141 00:08:08,090 --> 00:08:12,580 It's going to take the key as an argument, 142 00:08:12,580 --> 00:08:17,110 and it's going to take a trial, which is an integer between 0 143 00:08:17,110 --> 00:08:19,440 and m minus 1, all right? 144 00:08:19,440 --> 00:08:23,960 And it's going to produce-- just like the chaining hash function 145 00:08:23,960 --> 00:08:31,660 it's going to produce a number between 0 and m minus 1, right? 146 00:08:31,660 --> 00:08:34,030 Where m is the number of slots in the table. 147 00:08:34,030 --> 00:08:35,919 All right. 148 00:08:35,919 --> 00:08:39,150 So that's the story. 149 00:08:39,150 --> 00:08:48,360 In order to ensure that you are using the hash table 150 00:08:48,360 --> 00:08:54,770 corresponding to open addressing properly, what you want 151 00:08:54,770 --> 00:09:01,680 is-- and this is an important property-- that h k 1, 152 00:09:01,680 --> 00:09:03,640 so that's a key that you're given. 153 00:09:03,640 --> 00:09:08,360 And this could be an arbitrary key, mind you. 154 00:09:08,360 --> 00:09:17,770 So arbitrary key k. 
155 00:09:17,770 --> 00:09:20,430 And what you have in terms of the slots that 156 00:09:20,430 --> 00:09:27,880 are being computed is this, h k 1, h k 2, 157 00:09:27,880 --> 00:09:33,520 and so on and so forth to h k m minus 1. 158 00:09:33,520 --> 00:09:40,990 And what you want is for this vector 159 00:09:40,990 --> 00:09:54,510 to be a permutation of 0 1 and so on to m minus 1. 160 00:09:54,510 --> 00:09:57,320 And the reason for this hopefully is clear. 161 00:09:57,320 --> 00:10:01,900 It's because you want to be able to use 162 00:10:01,900 --> 00:10:04,660 the entirety of your hash table. 163 00:10:04,660 --> 00:10:07,810 You don't want particular slots to go unused. 164 00:10:07,810 --> 00:10:13,920 And when you get to the point where the number of elements n 165 00:10:13,920 --> 00:10:20,030 is pretty close to m, and maybe there's just one slot left, OK? 166 00:10:20,030 --> 00:10:25,280 And you want to fill up this last slot with this key k 167 00:10:25,280 --> 00:10:27,740 that you want to put in there, and what 168 00:10:27,740 --> 00:10:30,460 you want to be able to say is that for this arbitrary key k 169 00:10:30,460 --> 00:10:34,260 that you want to put in there that the one slot that's free-- 170 00:10:34,260 --> 00:10:35,730 and it could be that first slot. 171 00:10:35,730 --> 00:10:37,400 It could be the 17th slot. 172 00:10:37,400 --> 00:10:40,340 Whatever-- That eventually the sequence of probes 173 00:10:40,340 --> 00:10:43,920 is going to be able to allow you to insert into that slot. 174 00:10:43,920 --> 00:10:45,230 All right? 
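The permutation property described above can be checked mechanically. The following is a minimal sketch, not code from the lecture: `probe`, `bad_probe`, `probe_sequence`, and `is_permutation` are hypothetical names, trial counts are assumed to run from 0 through m - 1, and the probe functions are simple stand-ins chosen only so that one of them satisfies the property and the other does not.

```python
def probe(key, trial, m):
    """Hypothetical h(k, i): step one slot at a time (an assumed stand-in)."""
    return (key + trial) % m


def bad_probe(key, trial, m):
    """Hypothetical h(k, i) that steps by 2; it skips slots when m is even."""
    return (key + 2 * trial) % m


def probe_sequence(h, key, m):
    """List the slots h visits for `key` over trials 0 .. m - 1."""
    return [h(key, trial, m) for trial in range(m)]


def is_permutation(slots, m):
    """True if `slots` contains every index 0 .. m - 1 exactly once."""
    return sorted(slots) == list(range(m))


good = probe_sequence(probe, 496, 7)      # visits all 7 slots
bad = probe_sequence(bad_probe, 496, 8)   # only ever visits even slots
print(is_permutation(good, 7), is_permutation(bad, 8))
```

With the second function, a key can never reach an odd-numbered slot, so a single free odd slot would go unused no matter how long the key keeps probing.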
175 00:10:45,230 --> 00:10:47,450 And we'll generalize this notion into 176 00:10:47,450 --> 00:10:51,180 the uniform hashing assumption in a few minutes, 177 00:10:51,180 --> 00:10:53,650 but hopefully this makes sense from a standpoint 178 00:10:53,650 --> 00:10:57,150 of really load balancing the table 179 00:10:57,150 --> 00:11:00,590 and ensuring that all slots in the table 180 00:11:00,590 --> 00:11:02,600 are sort of equal opportunity slots. 181 00:11:02,600 --> 00:11:08,270 That you're going to be able to put keys in them as long as you 182 00:11:08,270 --> 00:11:11,670 probe long enough that you're going to be able to get there. 183 00:11:11,670 --> 00:11:14,150 Now of course the fact that you're 184 00:11:14,150 --> 00:11:16,650 using one particular slot for one particular key 185 00:11:16,650 --> 00:11:18,580 depends on the order of keys that you're 186 00:11:18,580 --> 00:11:20,300 inserting into this table. 187 00:11:20,300 --> 00:11:24,140 Again, you'll see that as we go through an example, all right? 188 00:11:24,140 --> 00:11:25,260 So that's the set up. 189 00:11:25,260 --> 00:11:27,800 That's the open addressing notion. 190 00:11:27,800 --> 00:11:30,670 And that as you can see, we're just 191 00:11:30,670 --> 00:11:34,080 going to go through a sequence of probes 192 00:11:34,080 --> 00:11:36,200 and our hash function is going to tell us 193 00:11:36,200 --> 00:11:38,950 what the sequence is, and so we don't need any pointers 194 00:11:38,950 --> 00:11:41,350 or anything like that. 195 00:11:41,350 --> 00:11:50,340 So let's take a look at how this might work in practice. 196 00:11:50,340 --> 00:11:55,890 So maybe the easiest thing to do is to run through an example, 197 00:11:55,890 --> 00:11:57,870 and then I'll show you some pseudocode. 198 00:11:57,870 --> 00:12:01,800 But let's say that I have a table here, 199 00:12:01,800 --> 00:12:07,060 and I'm going to concentrate on the insert operation. 
200 00:12:07,060 --> 00:12:10,530 And I'm going to start inserting things into this table. 201 00:12:17,130 --> 00:12:19,690 And right here I have seven slots up there. 202 00:12:19,690 --> 00:12:27,840 So let's say that I want to insert 586 into the table, 203 00:12:27,840 --> 00:12:35,020 and I compute h of 586 comma 1, and that gives me 1. 204 00:12:35,020 --> 00:12:35,760 OK? 205 00:12:35,760 --> 00:12:37,220 This is the first insert. 206 00:12:37,220 --> 00:12:42,650 So I'm going to go ahead and stick 586 in here, all right? 207 00:12:42,650 --> 00:12:47,730 And then I insert, for argument's sake, 133. 208 00:12:47,730 --> 00:12:50,600 I insert 204 out here. 209 00:12:50,600 --> 00:12:54,490 And these are all things because the hash table is empty. 210 00:12:54,490 --> 00:12:57,900 481 out here and so on. 211 00:12:57,900 --> 00:12:59,800 And because the hash table is empty, 212 00:12:59,800 --> 00:13:03,800 my very first trial is successful, all right? 213 00:13:03,800 --> 00:13:11,190 So h of 481-- I'm not going to write this all out, but h 481 1 214 00:13:11,190 --> 00:13:15,280 happens to be 6 and so on. 215 00:13:15,280 --> 00:13:15,820 All right? 216 00:13:15,820 --> 00:13:24,910 Now I get to the point where I want to insert 496. 217 00:13:24,910 --> 00:13:40,700 And when I try to insert 496, I have h 496 1. 218 00:13:40,700 --> 00:13:43,490 It happens to be 4. 219 00:13:43,490 --> 00:13:44,450 OK? 220 00:13:44,450 --> 00:13:48,230 So the first thing that happens is I go in here, 221 00:13:48,230 --> 00:13:50,070 and I say oops. 222 00:13:50,070 --> 00:13:54,990 This slot is occupied, because this-- I'm 223 00:13:54,990 --> 00:14:00,470 going to have a special flag associated with an empty slot, 224 00:14:00,470 --> 00:14:03,830 and we can say it's none. 225 00:14:03,830 --> 00:14:06,020 And if it's not none, then it's occupied. 226 00:14:06,020 --> 00:14:08,080 And 204 is not equal to none. 
227 00:14:08,080 --> 00:14:14,510 So I look at this, and I say the first probe actually failed. 228 00:14:14,510 --> 00:14:15,160 OK? 229 00:14:15,160 --> 00:14:30,150 And so h 496 1 equals 4 fails, so I need to go do h 496 2. 230 00:14:30,150 --> 00:14:36,180 And h 496 2 may also fail. 231 00:14:36,180 --> 00:14:45,910 You might be in a situation where h 496 2 gives you 586. 232 00:14:45,910 --> 00:14:56,850 So this was h 496 1 h 496 2 might give you 586. 233 00:14:56,850 --> 00:15:03,650 And finally it may be that h 496 3, which is your third attempt, 234 00:15:03,650 --> 00:15:05,130 equals 3. 235 00:15:05,130 --> 00:15:07,560 So you go in, and you say great. 236 00:15:07,560 --> 00:15:10,210 I can insert 496. 237 00:15:10,210 --> 00:15:11,770 And let me write that in bold here. 238 00:15:14,830 --> 00:15:16,190 Out there. 239 00:15:16,190 --> 00:15:16,780 All right? 240 00:15:16,780 --> 00:15:18,750 So pretty straightforward. 241 00:15:18,750 --> 00:15:23,620 In this case, you've gone through three trials in order 242 00:15:23,620 --> 00:15:25,490 to find an empty slot. 243 00:15:25,490 --> 00:15:28,560 And so the big question really here is 244 00:15:28,560 --> 00:15:32,580 other than taking care of search and delete, how long is 245 00:15:32,580 --> 00:15:34,060 this process going to take? 246 00:15:34,060 --> 00:15:34,990 All right? 247 00:15:34,990 --> 00:15:37,800 And I'm talking about that in a few minutes, 248 00:15:37,800 --> 00:15:41,350 but let me explain, now that you've 249 00:15:41,350 --> 00:15:45,820 seen insert, how search would work, right? 250 00:15:45,820 --> 00:15:50,510 Or maybe I get one of you guys to explain to me 251 00:15:50,510 --> 00:15:55,230 once you have insert, how would search work? 252 00:15:55,230 --> 00:15:55,730 Someone? 253 00:15:58,550 --> 00:15:59,938 Someone from the back? 254 00:16:03,290 --> 00:16:04,800 No one. 255 00:16:04,800 --> 00:16:08,450 You guys are always answering questions. 
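The board example above can be replayed in a few lines. This is a sketch under stated assumptions: the probe orders are hard-coded in a hypothetical `PROBES` table rather than computed by a real hash function, and the slot for 133 (slot 2) is inferred from the table layout read out later in the lecture, not stated at this point.

```python
EMPTY = None  # the "none" flag that marks an empty slot

# Hypothetical probe orders h(k, 1), h(k, 2), ... mirroring the board example.
PROBES = {
    586: [1],
    133: [2],        # assumed slot, inferred from the later table listing
    204: [4],
    481: [6],
    496: [4, 1, 3],  # first two probes collide with 204 and 586
}


def insert(table, key):
    """Probe until an empty slot is found; return the number of trials used."""
    for trial, slot in enumerate(PROBES[key], start=1):
        if table[slot] is EMPTY:
            table[slot] = key
            return trial
    raise RuntimeError("no empty slot among the listed probes")


table = [EMPTY] * 7  # seven slots, as on the board
trials = {key: insert(table, key) for key in (586, 133, 204, 481, 496)}
print(table)
print(trials[496])
```

Every key except 496 lands on its first trial because the table starts out empty; 496 needs three trials, matching the walk-through.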
256 00:16:08,450 --> 00:16:09,800 Yeah, all the way in the back. 257 00:16:09,800 --> 00:16:11,960 AUDIENCE: Would you just do the same kind 258 00:16:11,960 --> 00:16:18,022 of probing [INAUDIBLE] where you find it or you don't find it? 259 00:16:18,022 --> 00:16:18,730 PROFESSOR: Right. 260 00:16:18,730 --> 00:16:19,560 So you do exactly. 261 00:16:19,560 --> 00:16:20,790 It's very similar to insert. 262 00:16:23,810 --> 00:16:26,380 You have a situation where you're 263 00:16:26,380 --> 00:16:33,840 going to none would indicate an empty slot. 264 00:16:33,840 --> 00:16:37,660 And you can think of this as being a flag. 265 00:16:37,660 --> 00:16:45,450 And in the case of insert, what you did was you-- 266 00:16:45,450 --> 00:16:53,120 insert k v would say keep probing. 267 00:16:53,120 --> 00:16:56,020 I'm not going to write the pseudocode for it. 268 00:16:56,020 --> 00:17:03,980 Keep probing until an empty slot is found. 269 00:17:06,630 --> 00:17:08,480 And then when it's found, insert item. 270 00:17:16,560 --> 00:17:19,930 And as long as you have the permutation property 271 00:17:19,930 --> 00:17:23,150 that we have up there, and given that m is greater than 272 00:17:23,150 --> 00:17:26,260 or equal to n, you're guaranteed that insert 273 00:17:26,260 --> 00:17:28,060 is going to find a slot. 274 00:17:28,060 --> 00:17:28,560 OK? 275 00:17:28,560 --> 00:17:29,870 That's the good news. 276 00:17:29,870 --> 00:17:31,420 Now it might take awhile, and so we 277 00:17:31,420 --> 00:17:35,970 have to talk about performance a bit later, but it'll work. 278 00:17:35,970 --> 00:17:36,800 OK? 279 00:17:36,800 --> 00:17:39,110 Now search is a little bit different. 280 00:17:42,490 --> 00:17:50,706 You're searching for a key k, and you essentially 281 00:17:50,706 --> 00:17:52,080 say you're going to keep probing. 282 00:17:52,080 --> 00:18:04,290 And you say as long as the slots encountered 283 00:18:04,290 --> 00:18:14,160 are occupied by keys not equal to k. 
284 00:18:14,160 --> 00:18:16,440 So every time you probe, you go in there 285 00:18:16,440 --> 00:18:18,440 and you say I got a key. 286 00:18:18,440 --> 00:18:20,830 I found a hash for it. 287 00:18:20,830 --> 00:18:22,500 I go to this particular slot. 288 00:18:22,500 --> 00:18:25,270 I look inside of it, and I check to see 289 00:18:25,270 --> 00:18:28,000 whether the key that's stored inside of it 290 00:18:28,000 --> 00:18:31,170 is the same as the key I'm searching for. 291 00:18:31,170 --> 00:18:34,990 If not, I go to the next trial. 292 00:18:34,990 --> 00:18:37,130 If it is, then I return it. 293 00:18:37,130 --> 00:18:37,630 Right? 294 00:18:37,630 --> 00:18:41,440 So that's pretty much it. 295 00:18:41,440 --> 00:19:00,690 And we keep probing until you either encounter k or find 296 00:19:00,690 --> 00:19:01,420 an empty slot. 297 00:19:04,930 --> 00:19:05,920 And this is the key. 298 00:19:08,714 --> 00:19:09,380 No pun intended. 299 00:19:12,230 --> 00:19:16,680 A notion which is that when you find an empty slot, 300 00:19:16,680 --> 00:19:21,840 it means that you have failed to discover this key. 301 00:19:21,840 --> 00:19:24,272 You fail to-- yeah, question back there? 302 00:19:24,272 --> 00:19:27,170 AUDIENCE: What happens if you were to delete a key though? 303 00:19:27,170 --> 00:19:29,670 PROFESSOR: I'll make you answer that question for a cushion. 304 00:19:32,200 --> 00:19:34,744 So we'll get to delete in a minute. 305 00:19:34,744 --> 00:19:36,160 But I want to make sure you're all 306 00:19:36,160 --> 00:19:39,170 on board with insert and search. 307 00:19:39,170 --> 00:19:39,920 OK? 308 00:19:39,920 --> 00:19:43,280 So these are actually fairly straightforward in comparison 309 00:19:43,280 --> 00:19:43,780 to delete. 310 00:19:43,780 --> 00:19:45,850 It's not like delete is much more complicated, 311 00:19:45,850 --> 00:19:48,854 but there is a subtlety there. 312 00:19:48,854 --> 00:19:50,270 And so that's kind of neat, right? 
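The insert and search rules just described (keep probing until an empty slot; stop a search at either the key or an empty slot) might be sketched as follows. This is an illustration only: `probe` is a hypothetical stand-in for h(k, i), trials are assumed to run from 0 through m - 1, keys are assumed to be non-negative integers, and deletion is deliberately left out.

```python
EMPTY = None  # flag for an empty slot


def probe(key, trial, m):
    """Hypothetical h(k, i); any function with the permutation property works."""
    return (key + trial) % m


def insert(table, key, value):
    """Keep probing until an empty slot is found, then store the item there."""
    m = len(table)
    for trial in range(m):
        slot = probe(key, trial, m)
        if table[slot] is EMPTY:
            table[slot] = (key, value)
            return slot
    raise RuntimeError("table is full")


def search(table, key):
    """Follow the same probe order; an empty slot means the key is absent."""
    m = len(table)
    for trial in range(m):
        slot = probe(key, trial, m)
        if table[slot] is EMPTY:
            return None           # insert would have stopped here
        if table[slot][0] == key:
            return table[slot][1]
    return None                   # probed every slot without finding the key


table = [EMPTY] * 7
slots = [insert(table, k, v) for k, v in ((1, "a"), (8, "b"), (15, "c"))]
print(slots, search(table, 15), search(table, 22))
```

The keys 1, 8, and 15 all start at slot 1, so each later insert probes past its predecessors. Searching for 22 follows the same path and stops at the first empty slot, which is why the search can safely report a miss there. Note that this sketch does not check for an existing copy of the key before inserting.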
313 00:19:50,270 --> 00:19:52,630 I mean this actually works. 314 00:19:52,630 --> 00:19:58,700 So if you had a situation where you were just accumulating 315 00:19:58,700 --> 00:20:02,920 keys, and you're looking for the number of distinct elements 316 00:20:02,920 --> 00:20:05,360 in the stream of data that was coming in, 317 00:20:05,360 --> 00:20:08,090 and that was pretty much it with respect to your program, 318 00:20:08,090 --> 00:20:11,940 you'd never have to delete keys, and this would be all 319 00:20:11,940 --> 00:20:13,410 that you'd have to implement. 320 00:20:13,410 --> 00:20:14,280 Right? 321 00:20:14,280 --> 00:20:17,690 But let's talk about delete. 322 00:20:17,690 --> 00:20:19,870 Every once in awhile we'd want to delete a key? 323 00:20:19,870 --> 00:20:20,570 Yeah, you had a question? 324 00:20:20,570 --> 00:20:22,278 AUDIENCE: I have a question about search. 325 00:20:22,278 --> 00:20:25,350 Why do you stop searching once you find an empty slot? 326 00:20:25,350 --> 00:20:27,070 PROFESSOR: Because you're searching. 327 00:20:27,070 --> 00:20:30,010 So what that means is that you're 328 00:20:30,010 --> 00:20:34,120 looking to see if this key were already in the table. 329 00:20:34,120 --> 00:20:37,150 And if key were already in the table, 330 00:20:37,150 --> 00:20:39,870 you want to return the value associated with that key. 331 00:20:39,870 --> 00:20:42,260 If you find an empty slot, since you're 332 00:20:42,260 --> 00:20:47,540 using the same deterministic sequence of probes 333 00:20:47,540 --> 00:20:50,220 that you would have if you had inserted it, 334 00:20:50,220 --> 00:20:52,210 then-- that make sense? 335 00:20:52,210 --> 00:20:53,320 Good. 336 00:20:53,320 --> 00:20:54,080 All right. 337 00:20:54,080 --> 00:20:56,500 So so far so good? 338 00:20:56,500 --> 00:21:00,550 That's what works for insert and search. 339 00:21:00,550 --> 00:21:01,530 Let's talk delete. 340 00:21:01,530 --> 00:21:04,216 So back there. 
341 00:21:04,216 --> 00:21:05,210 How does delete work? 342 00:21:09,070 --> 00:21:12,428 AUDIENCE: Well [INAUDIBLE] if you 343 00:21:12,428 --> 00:21:16,412 search until you find the none and assume 344 00:21:16,412 --> 00:21:20,396 that the key you're searching for was not put in there. 345 00:21:20,396 --> 00:21:25,210 But let's say you had one that was in that slot before 346 00:21:25,210 --> 00:21:26,710 and it got put back in, but then you 347 00:21:26,710 --> 00:21:28,501 delete the one that was in the slot before. 348 00:21:28,501 --> 00:21:29,747 PROFESSOR: Great, great. 349 00:21:29,747 --> 00:21:31,330 You haven't told me how to fix it yet, 350 00:21:31,330 --> 00:21:35,340 but do you have the guts for this? 351 00:21:35,340 --> 00:21:37,040 No. 352 00:21:37,040 --> 00:21:39,460 OK, I think this veers to the right. 353 00:21:39,460 --> 00:21:41,906 I always wanted to do this to somebody in the back. 354 00:21:41,906 --> 00:21:44,236 All right. 355 00:21:44,236 --> 00:21:45,170 Whoa. 356 00:21:45,170 --> 00:21:48,580 All right, good catch. 357 00:21:48,580 --> 00:21:49,230 All right. 358 00:21:49,230 --> 00:21:49,820 OK. 359 00:21:49,820 --> 00:21:51,830 So you pointed out the problem, and I'm 360 00:21:51,830 --> 00:21:53,800 going to ask somebody else for a solution. 361 00:21:53,800 --> 00:21:55,800 All right? 362 00:21:55,800 --> 00:21:57,570 But here's the problem. 363 00:21:57,570 --> 00:21:59,100 Here's the problem, and we can look 364 00:21:59,100 --> 00:22:04,560 at it from a standpoint of that example right there. 365 00:22:04,560 --> 00:22:08,700 Let's say for argument's sake that I'm searching-- now 366 00:22:08,700 --> 00:22:11,840 I've done all of the inserts that I have up there, OK? 367 00:22:11,840 --> 00:22:14,200 So I've inserted 496. 368 00:22:14,200 --> 00:22:14,860 All right? 369 00:22:14,860 --> 00:22:21,840 Then I delete 586 from the table, OK? 370 00:22:21,840 --> 00:22:24,500 I delete 586 from the table. 
371 00:22:24,500 --> 00:22:30,080 So let's just say that what I end up 372 00:22:30,080 --> 00:22:38,910 doing-- I have 586, 133, 496, and then 373 00:22:38,910 --> 00:22:42,780 I have 204, and then a 481. 374 00:22:42,780 --> 00:22:47,770 And this is 0, 1, 2, et cetera. 375 00:22:47,770 --> 00:22:52,270 So I'm deleting 586, and let's say I replace it with none. 376 00:22:52,270 --> 00:22:53,300 OK? 377 00:22:53,300 --> 00:22:55,130 Let's just say I replace it with none. 378 00:22:55,130 --> 00:23:03,670 Now what happens is that when I search for 496, according 379 00:23:03,670 --> 00:23:09,940 to this search algorithm what am I going to get? 380 00:23:09,940 --> 00:23:12,040 AUDIENCE: None. 381 00:23:12,040 --> 00:23:15,690 PROFESSOR: Well the first slot I'm going to look at is 1, 382 00:23:15,690 --> 00:23:18,340 and according to this search algorithm, 383 00:23:18,340 --> 00:23:21,030 I find an empty slot, right? 384 00:23:21,030 --> 00:23:23,270 And when I find an empty slot, I'm 385 00:23:23,270 --> 00:23:26,700 going to say I failed in the search. 386 00:23:26,700 --> 00:23:33,820 If you encounter k, you succeed and return the key value pair, 387 00:23:33,820 --> 00:23:34,320 right? 388 00:23:34,320 --> 00:23:36,510 Success means you return the value. 389 00:23:36,510 --> 00:23:38,790 And if you encounter an empty slot, 390 00:23:38,790 --> 00:23:41,690 it means that you've decided that this key is not 391 00:23:41,690 --> 00:23:43,630 in the table. 392 00:23:43,630 --> 00:23:46,510 And you say couldn't find it, right? 393 00:23:46,510 --> 00:23:47,980 That make sense? 394 00:23:47,980 --> 00:23:49,970 So this is obviously wrong, right? 395 00:23:49,970 --> 00:23:54,200 Because I just inserted 496 into the table. 396 00:23:54,200 --> 00:23:56,520 So this would fail incorrectly. 397 00:24:00,560 --> 00:24:02,990 So failed to find the key, which is OK. 398 00:24:02,990 --> 00:24:05,200 I mean failure is OK if the key isn't there. 
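The failure described above is easy to reproduce. The sketch below reuses the board example with hard-coded, hypothetical probe orders (`PROBES`, with the slot for 133 assumed), and a `naive_delete` that writes the empty flag back into the slot, which is exactly the approach the lecture shows to be wrong.

```python
EMPTY = None

# Hypothetical probe orders mirroring the board example.
PROBES = {586: [1], 133: [2], 204: [4], 481: [6], 496: [4, 1, 3]}


def insert(table, key):
    """Store the key in the first empty slot along its probe order."""
    for slot in PROBES[key]:
        if table[slot] is EMPTY:
            table[slot] = key
            return slot
    raise RuntimeError("no empty slot among the listed probes")


def search(table, key):
    """Return the slot holding the key, or None on reaching an empty slot."""
    for slot in PROBES[key]:
        if table[slot] is EMPTY:
            return None
        if table[slot] == key:
            return slot
    return None


def naive_delete(table, key):
    """Incorrect on purpose: marks the freed slot as empty."""
    slot = search(table, key)
    if slot is not None:
        table[slot] = EMPTY


table = [EMPTY] * 7
for key in (586, 133, 204, 481, 496):
    insert(table, key)

found_before = search(table, 496)  # probes slots 4, 1, then 3
naive_delete(table, 586)           # slot 1 becomes empty again
found_after = search(table, 496)   # stops early at slot 1
print(found_before, found_after, 496 in table)
```

After the deletion, 496 is still stored in the array, but the search gives up at the hole left in slot 1 and reports it missing.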
399 00:24:05,200 --> 00:24:07,151 But you don't want to fail incorrectly. 400 00:24:07,151 --> 00:24:07,650 Right? 401 00:24:07,650 --> 00:24:09,590 Everyone buy that? 402 00:24:09,590 --> 00:24:10,650 Everyone buy that? 403 00:24:10,650 --> 00:24:11,500 Good. 404 00:24:11,500 --> 00:24:12,000 All right. 405 00:24:12,000 --> 00:24:14,170 So how do I fix it. 406 00:24:14,170 --> 00:24:15,460 Someone else? 407 00:24:15,460 --> 00:24:16,960 How do I fix this? 408 00:24:16,960 --> 00:24:18,563 Someone who doesn't have a cushion. 409 00:24:18,563 --> 00:24:20,686 All right, you. 410 00:24:20,686 --> 00:24:30,110 AUDIENCE: [INAUDIBLE] you can mark that spot by a, and when 411 00:24:30,110 --> 00:24:34,580 search comes across a, you just [INAUDIBLE]. 412 00:24:34,580 --> 00:24:38,020 PROFESSOR: Right, great answer. 413 00:24:38,020 --> 00:24:40,480 We're now going to have to do a couple of different things 414 00:24:40,480 --> 00:24:42,340 for insert and search, OK? 415 00:24:42,340 --> 00:24:44,019 It's going to be subtly different, 416 00:24:44,019 --> 00:24:45,560 but the first thing we're going to do 417 00:24:45,560 --> 00:24:46,934 is we're going to have this flag, 418 00:24:46,934 --> 00:24:48,920 and I'll just call it delete me flag. 419 00:24:48,920 --> 00:24:50,620 OK? 420 00:24:50,620 --> 00:25:00,350 And we're going to say that when I delete something, 421 00:25:00,350 --> 00:25:09,960 replace deleted item with not the none flag, 422 00:25:09,960 --> 00:25:15,200 but a different flag that we'll call delete me. 423 00:25:15,200 --> 00:25:20,140 Is different from none. 424 00:25:24,230 --> 00:25:26,040 And that's going to be important, 425 00:25:26,040 --> 00:25:28,600 because now that you have a different flag, 426 00:25:28,600 --> 00:25:35,530 and you replace 586 with delete me, 427 00:25:35,530 --> 00:25:40,900 you can now do different things in insert versus search, right? 
428 00:25:40,900 --> 00:25:43,900 So in particular, what you would do 429 00:25:43,900 --> 00:25:51,122 is you'd have to modify this slightly, 430 00:25:51,122 --> 00:25:52,580 because the notion of an empty slot 431 00:25:52,580 --> 00:25:55,380 means that you're looking for none, right? 432 00:25:55,380 --> 00:26:00,650 And all it means is that-- well actually in some sense, 433 00:26:00,650 --> 00:26:02,500 the pseudo code doesn't really change 434 00:26:02,500 --> 00:26:08,160 because if you say you either encounter k 435 00:26:08,160 --> 00:26:14,510 or you would-- even if you encounter a delete me, 436 00:26:14,510 --> 00:26:15,720 you keep going. 437 00:26:15,720 --> 00:26:16,220 All right? 438 00:26:16,220 --> 00:26:18,650 That's the important thing. 439 00:26:18,650 --> 00:26:20,570 So I guess it does change, because I assume 440 00:26:20,570 --> 00:26:23,170 that you have only two cases here, 441 00:26:23,170 --> 00:26:26,075 but what you really have now are three cases. 442 00:26:26,075 --> 00:26:28,150 The three cases are when you're doing 443 00:26:28,150 --> 00:26:30,860 the search is that you encounter the key, which 444 00:26:30,860 --> 00:26:31,960 is the easy case. 445 00:26:31,960 --> 00:26:32,690 You return it. 446 00:26:32,690 --> 00:26:34,440 You return the value. 447 00:26:34,440 --> 00:26:38,530 Or you can encounter a delete me flag, in which case 448 00:26:38,530 --> 00:26:40,240 you keep going. 449 00:26:40,240 --> 00:26:42,140 OK? 450 00:26:42,140 --> 00:26:44,930 And if you encounter an empty slot, which 451 00:26:44,930 --> 00:26:47,012 corresponds to none, at that point you know 452 00:26:47,012 --> 00:26:49,630 you failed and the key doesn't exist in the table. 453 00:26:49,630 --> 00:26:50,570 All right? 454 00:26:50,570 --> 00:26:54,310 So let me just write that out. 455 00:26:54,310 --> 00:27:03,040 Insert treats delete me the same as none. 456 00:27:07,250 --> 00:27:21,070 But search keeps going and treats it differently. 
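The three search cases and the insert rule stated above can be put together in a short sketch. As before, this is an illustration under assumptions rather than the lecture's own code: probe orders are hard-coded in a hypothetical `PROBES` table, `DELETE_ME` is an assumed sentinel object, and 999 is an arbitrary new key given a made-up probe order that starts at slot 1.

```python
EMPTY = None
DELETE_ME = object()  # assumed sentinel, distinct from None and from any key

# Hypothetical probe orders; 999 is an arbitrary key that probes slot 1 first.
PROBES = {586: [1], 133: [2], 204: [4], 481: [6], 496: [4, 1, 3], 999: [1]}


def insert(table, key):
    """Insert treats DELETE_ME the same as an empty slot and reuses it."""
    for slot in PROBES[key]:
        if table[slot] is EMPTY or table[slot] is DELETE_ME:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot among the listed probes")


def search(table, key):
    """Three cases: key found, DELETE_ME (keep going), empty (give up)."""
    for slot in PROBES[key]:
        entry = table[slot]
        if entry is EMPTY:
            return None
        if entry is DELETE_ME:
            continue
        if entry == key:
            return slot
    return None


def delete(table, key):
    """Replace the item with the DELETE_ME flag instead of None."""
    slot = search(table, key)
    if slot is not None:
        table[slot] = DELETE_ME


table = [EMPTY] * 7
for key in (586, 133, 204, 481, 496):
    insert(table, key)

delete(table, 586)
slot_of_496 = search(table, 496)   # skips the flag in slot 1, finds slot 3
slot_of_586 = search(table, 586)   # only the flag is left, so not found
slot_of_999 = insert(table, 999)   # overwrites the flag in slot 1
print(slot_of_496, slot_of_586, slot_of_999)
```

One design caveat worth hedging: because this insert stops at the first reusable slot, it does not look further along the probe order for an existing copy of the same key, so a fuller implementation would need to handle duplicates separately.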
457 00:27:32,117 --> 00:27:33,200 And that's pretty much it. 458 00:27:33,200 --> 00:27:35,260 So what would happen in our example? 459 00:27:35,260 --> 00:27:39,840 Well, going through exactly the same example, 460 00:27:39,840 --> 00:27:43,750 we started from here, and then we decided to delete 586. 461 00:27:43,750 --> 00:27:51,580 And so if we replaced 586 not with none, but with delete me, 462 00:27:51,580 --> 00:27:55,260 and the next time around when you search for 496, 463 00:27:55,260 --> 00:27:57,360 you're searching for 496. 464 00:27:57,360 --> 00:27:58,870 And what would happen is that you 465 00:27:58,870 --> 00:28:04,010 would go look at 586-- the slot that contained 586, 466 00:28:04,010 --> 00:28:06,360 and you see that there's a delete me flag in there. 467 00:28:06,360 --> 00:28:08,400 And so you go to the next trial. 468 00:28:08,400 --> 00:28:14,800 And then in the next trial, you discover that, in this case, 469 00:28:14,800 --> 00:28:19,210 you have-- I'm sorry. 470 00:28:19,210 --> 00:28:22,330 I had 204 first as the first trial, 471 00:28:22,330 --> 00:28:26,110 and then in the second trial I had 586. 472 00:28:26,110 --> 00:28:28,790 And I would continue beyond the second trial 473 00:28:28,790 --> 00:28:36,080 and get to third trial, and in fact return 496 in this case. 474 00:28:36,080 --> 00:28:39,752 I would get to returning 496 in my third trial, which 475 00:28:39,752 --> 00:28:40,710 is exactly what I want. 476 00:28:43,780 --> 00:28:46,810 The interesting thing here is that you can reuse storage. 477 00:28:46,810 --> 00:28:48,850 I mean the whole point of deleting 478 00:28:48,850 --> 00:28:53,880 is that you can take the storage and insert other keys in there. 479 00:28:53,880 --> 00:28:56,140 Once you've freed up the storage. 480 00:28:56,140 --> 00:29:01,780 And you can do that by making insert treat delete me 481 00:29:01,780 --> 00:29:03,565 the same as the none. 
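The delete me scheme described above can be sketched in Python. This is only an illustration under stated assumptions, not the code from the lecture notes: the names `OpenAddressingTable`, `EMPTY` and `DELETE_ME` are hypothetical, keys are assumed to be distinct non-negative integers, and the probe function is a simple stand-in, `(key + i) % m`.

```python
# Hypothetical sketch of open addressing with a "delete me" flag.
# Assumptions (not from the lecture): distinct non-negative integer keys,
# and a stand-in probe function (key + i) % m.

EMPTY = None          # the "none" flag: slot has never held a key
DELETE_ME = object()  # a flag that is different from None


class OpenAddressingTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key, i):
        # Slot examined on trial i for this key (stand-in probe function).
        return (key + i) % self.m

    def insert(self, key):
        # Insert treats DELETE_ME the same as None: either can be overwritten.
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is EMPTY or self.slots[j] is DELETE_ME:
                self.slots[j] = key
                return j
        raise RuntimeError("table is full")

    def search(self, key):
        # Three cases: found the key, hit DELETE_ME (keep going), hit None (fail).
        for i in range(self.m):
            j = self._probe(key, i)
            slot = self.slots[j]
            if slot is EMPTY:
                return None
            if slot is not DELETE_ME and slot == key:
                return j
        return None

    def delete(self, key):
        # Replace the key with DELETE_ME rather than None.
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETE_ME


if __name__ == "__main__":
    t = OpenAddressingTable(m=8)
    for key in (4, 12, 20):   # all three keys start probing at slot 4
        t.insert(key)
    t.delete(12)              # slot 5 now holds DELETE_ME
    print(t.search(20))       # search walks past the flag and finds slot 6
    print(t.insert(28))       # insert reuses the flagged slot 5
```

The keys 4, 12, 20 and 28 are made up for this sketch so that they collide; they play the same role as 204, 586 and 496 in the board example.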
482 00:29:03,565 --> 00:29:05,190 So the next time you want to insert you 483 00:29:05,190 --> 00:29:09,620 could-- if you happen to index into the index corresponding 484 00:29:09,620 --> 00:29:12,650 to 586, you can override that. 485 00:29:12,650 --> 00:29:15,920 The delete me flag goes away, and some other key-- 486 00:29:15,920 --> 00:29:20,740 call it 999 or something-- would get in there. 487 00:29:20,740 --> 00:29:23,700 And you're all set with that. 488 00:29:23,700 --> 00:29:24,540 OK? 489 00:29:24,540 --> 00:29:26,380 Any questions? 490 00:29:26,380 --> 00:29:28,530 This all makes sense? 491 00:29:28,530 --> 00:29:33,050 So you could imagine coding this up with an array structure 492 00:29:33,050 --> 00:29:35,100 is fairly straightforward. 493 00:29:35,100 --> 00:29:38,890 What remains here to be discussed 494 00:29:38,890 --> 00:29:42,170 is how well does this work, right? 495 00:29:42,170 --> 00:29:46,270 You have this extra requirement on the hash function 496 00:29:46,270 --> 00:29:50,930 corresponding to creating an extra argument 497 00:29:50,930 --> 00:29:53,950 as an input to it, which is this trial count. 498 00:29:53,950 --> 00:29:57,200 And you'd like to have this nice property of corresponding 499 00:29:57,200 --> 00:29:58,340 to a permutation. 500 00:29:58,340 --> 00:30:01,150 Can we actually design hash functions like this? 501 00:30:01,150 --> 00:30:03,380 And we'll take a look at a bad hash function, 502 00:30:03,380 --> 00:30:05,600 and then at a better one. 503 00:30:05,600 --> 00:30:08,260 So let's talk about probing strategies, which 504 00:30:08,260 --> 00:30:15,910 is essentially the same as taking a hash function 505 00:30:15,910 --> 00:30:18,570 and changing it so it is actually 506 00:30:18,570 --> 00:30:21,240 applicable to open addressing. 
507 00:30:21,240 --> 00:30:30,480 So the notion of linear probing is 508 00:30:30,480 --> 00:30:40,920 that you do h k i equals h prime k, which 509 00:30:40,920 --> 00:30:43,220 is some hash function that you've chosen, 510 00:30:43,220 --> 00:30:49,585 plus i mod m, where this is an ordinary hash function. 511 00:30:54,620 --> 00:30:55,460 OK? 512 00:30:55,460 --> 00:30:57,001 So that looks pretty straightforward. 513 00:31:01,280 --> 00:31:02,100 What happens here? 514 00:31:02,100 --> 00:31:05,220 Does this satisfy the permutation argument? 515 00:31:08,785 --> 00:31:10,500 Before I forget. 516 00:31:10,500 --> 00:31:13,680 Does it satisfy the permutation property 517 00:31:13,680 --> 00:31:19,800 that I want h k 1, h k 2, h k m minus 1 to be a permutation? 518 00:31:19,800 --> 00:31:20,580 That make sense? 519 00:31:20,580 --> 00:31:21,380 Yep, yep. 520 00:31:21,380 --> 00:31:23,240 Because then I start adding. 521 00:31:23,240 --> 00:31:26,780 The mod is precisely kind of this round robin cycle, 522 00:31:26,780 --> 00:31:28,780 so it's going to satisfy the permutation. 523 00:31:28,780 --> 00:31:29,320 That's good. 524 00:31:34,120 --> 00:31:37,170 What's wrong with this? 525 00:31:37,170 --> 00:31:39,620 What's wrong with this? 526 00:31:39,620 --> 00:31:40,120 Someone? 527 00:31:43,120 --> 00:31:47,620 AUDIENCE: The fact that [INAUDIBLE] keys, which they're 528 00:31:47,620 --> 00:31:50,620 all filled, then if you hit anywhere in here [INAUDIBLE] 529 00:31:50,620 --> 00:31:51,974 list of consecutive keys. 530 00:31:51,974 --> 00:31:52,640 PROFESSOR: Right. 531 00:31:52,640 --> 00:31:53,390 That's excellent. 532 00:31:53,390 --> 00:31:54,740 Excellent, excellent answer. 533 00:31:54,740 --> 00:31:59,390 So this notion of clustering is basically 534 00:31:59,390 --> 00:32:01,370 what's wrong with this probing strategy.
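As a rough illustration of the linear probing formula just stated, the sketch below lists the full probe sequence for one key and checks that it visits every slot once. The function names are hypothetical, and `h_prime` is an identity stand-in for whatever ordinary hash function is actually chosen.

```python
# Hypothetical sketch of linear probing: h(k, i) = (h'(k) + i) mod m.
# h_prime is a stand-in (identity on integers), not a recommended hash.

def h_prime(k):
    return k


def linear_probe_sequence(k, m):
    # Slots tried on trials i = 0, 1, ..., m - 1.
    return [(h_prime(k) + i) % m for i in range(m)]


if __name__ == "__main__":
    seq = linear_probe_sequence(42, 8)
    print(seq)                             # [2, 3, 4, 5, 6, 7, 0, 1]
    print(sorted(seq) == list(range(8)))   # True: the sequence is a permutation
```

Every key whose first probe lands in the same run of occupied slots walks down the same consecutive slots, which is the clustering problem the audience answer points at.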
535 00:32:01,370 --> 00:32:05,430 And in fact, I'm not going to do this particular analysis, 536 00:32:05,430 --> 00:32:10,820 but I'll give you a sense of why the statement I'm going to make 537 00:32:10,820 --> 00:32:11,760 is true. 538 00:32:11,760 --> 00:32:13,840 But the notion of clustering is that you 539 00:32:13,840 --> 00:32:18,530 start getting consecutive groups of occupied slots, OK? 540 00:32:27,850 --> 00:32:28,780 Which keep growing. 541 00:32:32,820 --> 00:32:36,780 And so these clusters get longer and longer. 542 00:32:36,780 --> 00:32:38,950 And if you have a big cluster, it's 543 00:32:38,950 --> 00:32:41,020 more likely to grow bigger, right? 544 00:32:41,020 --> 00:32:41,840 Which is bad. 545 00:32:41,840 --> 00:32:44,879 This is exactly the wrong thing for load balancing, right? 546 00:32:44,879 --> 00:32:47,170 And clustering is the reverse of load balancing, right? 547 00:32:47,170 --> 00:32:48,970 If you have a bunch of clumps and you 548 00:32:48,970 --> 00:32:52,101 have a bunch of empty space in your table, that's bad. 549 00:32:52,101 --> 00:32:52,600 Right? 550 00:32:52,600 --> 00:32:54,100 The problem with linear probing is 551 00:32:54,100 --> 00:32:57,940 that once you start getting a cluster, given the, let's say, 552 00:32:57,940 --> 00:33:00,110 the randomness in the hash function, and h prime k 553 00:33:00,110 --> 00:33:03,470 is a pretty good hash function and can randomly go anywhere. 554 00:33:03,470 --> 00:33:07,140 Well, if you have 100 slots and you have a cluster of size 4, 555 00:33:07,140 --> 00:33:10,900 well there's a 4/100 chance, which is obviously 556 00:33:10,900 --> 00:33:15,050 four times greater than 1/100, even I can do that, 557 00:33:15,050 --> 00:33:17,760 to go into those four slots.
558 00:33:17,760 --> 00:33:19,480 And if you go into those four slots 559 00:33:19,480 --> 00:33:22,440 you're going to keep going down to the bottom, 560 00:33:22,440 --> 00:33:27,500 and you're going to make that a cluster of size five, right? 561 00:33:27,500 --> 00:33:30,520 So that's the problem with linear probing, 562 00:33:30,520 --> 00:33:34,290 and you can essentially argue through making 563 00:33:34,290 --> 00:33:40,250 some probabilistic assumptions that if, in fact, you 564 00:33:40,250 --> 00:33:47,040 use linear probing that you lose your average constant time 565 00:33:47,040 --> 00:33:51,760 lookup in your hash table for most load factors. 566 00:33:51,760 --> 00:33:54,900 So what's happening out here pictorially really 567 00:33:54,900 --> 00:33:57,870 is that you have a table and let's say you have a cluster. 568 00:34:02,060 --> 00:34:03,460 And this is your cluster. 569 00:34:06,220 --> 00:34:10,440 So if your h k 1-- it doesn't really 570 00:34:10,440 --> 00:34:15,679 matter what it is-- but h k i maps to this cluster, 571 00:34:15,679 --> 00:34:18,679 then you're going to-- linear probing 572 00:34:18,679 --> 00:34:21,239 says that the next thing you're going to try 573 00:34:21,239 --> 00:34:24,544 is if you map to 42 in the cluster, 574 00:34:24,544 --> 00:34:25,960 the next thing you're going to try 575 00:34:25,960 --> 00:34:32,370 is 43, 44, until you get maybe to this slot here, which is 57, 576 00:34:32,370 --> 00:34:34,020 for argument's sake. 577 00:34:34,020 --> 00:34:34,520 Right? 578 00:34:34,520 --> 00:34:36,228 So you're going to keep going, and you're 579 00:34:36,228 --> 00:34:41,300 going to try 15 times in this relatively dumb fashion 580 00:34:41,300 --> 00:34:45,730 to go down to get to the open slot, which is 57. 581 00:34:45,730 --> 00:34:47,840 And oh, by the way, at the end of this you 582 00:34:47,840 --> 00:34:51,159 just increased your cluster length by one. 583 00:34:51,159 --> 00:34:51,969 All right?
584 00:34:51,969 --> 00:34:53,820 So it doesn't really work. 585 00:34:53,820 --> 00:34:58,790 And in fact, under reasonable probabilistic assumptions 586 00:34:58,790 --> 00:35:01,780 in terms of what your hash functions are, 587 00:35:01,780 --> 00:35:07,850 you can say that when you have alpha, which is essentially 588 00:35:07,850 --> 00:35:15,613 your load factor, which is n over m less than 0.99, 589 00:35:15,613 --> 00:35:24,840 you see clusters of size log n, OK? 590 00:35:24,840 --> 00:35:25,470 Right. 591 00:35:25,470 --> 00:35:28,520 So this is a probabilistic argument, 592 00:35:28,520 --> 00:35:30,879 and you're assuming that you have a hash function that's 593 00:35:30,879 --> 00:35:32,045 a pretty good hash function. 594 00:35:32,045 --> 00:35:36,680 So h prime k can be this perfect hash function, all right? 595 00:35:36,680 --> 00:35:39,060 So there's a problem here beyond the choice of h 596 00:35:39,060 --> 00:35:42,010 prime k, which is this hash function that worked really 597 00:35:42,010 --> 00:35:44,080 well for chaining. 598 00:35:44,080 --> 00:35:44,630 All right? 599 00:35:44,630 --> 00:35:49,410 And the problem here is the linear probing aspect of it. 600 00:35:49,410 --> 00:35:50,570 So what does that mean? 601 00:35:50,570 --> 00:35:53,590 If you have clusters of theta log n, 602 00:35:53,590 --> 00:35:56,830 then your search and your insert are not 603 00:35:56,830 --> 00:35:58,350 going to be constant time anymore. 604 00:35:58,350 --> 00:35:58,850 Right? 605 00:35:58,850 --> 00:36:02,180 Which is bad in a probabilistic sense. 606 00:36:02,180 --> 00:36:04,080 OK? 607 00:36:04,080 --> 00:36:06,010 So how do we fix that? 608 00:36:06,010 --> 00:36:14,590 Well, one strategy that works reasonably well 609 00:36:14,590 --> 00:36:15,660 is called double hashing. 610 00:36:18,590 --> 00:36:23,120 And it literally means what it says. 611 00:36:23,120 --> 00:36:26,970 You have to run a couple of hashes. 
612 00:36:26,970 --> 00:36:37,270 And so the notion of double hashing is that you have h k i 613 00:36:37,270 --> 00:36:47,910 equals h1 k plus i h2 k mod m. 614 00:36:47,910 --> 00:36:51,310 And h1 and h2 are just ordinary hash functions. 615 00:36:51,310 --> 00:36:53,140 OK? 616 00:36:53,140 --> 00:36:56,000 Now the first thing that we need to do 617 00:36:56,000 --> 00:37:01,886 is figure out how we can guarantee a permutation, right? 618 00:37:01,886 --> 00:37:03,510 Because we still have that requirement, 619 00:37:03,510 --> 00:37:05,570 and it was OK for the linear probing part, 620 00:37:05,570 --> 00:37:07,270 but you still have this requirement 621 00:37:07,270 --> 00:37:09,770 that you need a permutation. 622 00:37:09,770 --> 00:37:15,770 And so those of you who are into number theory, 623 00:37:15,770 --> 00:37:24,560 can you tell me what property, what neat property of h2 and m 624 00:37:24,560 --> 00:37:28,150 can we ask for to guarantee a permutation? 625 00:37:28,150 --> 00:37:30,124 Do you have a question? 626 00:37:30,124 --> 00:37:31,310 You already do. 627 00:37:31,310 --> 00:37:34,520 Do you have a question? 628 00:37:34,520 --> 00:37:35,980 AUDIENCE: [INAUDIBLE]. 629 00:37:35,980 --> 00:37:36,720 PROFESSOR: [INAUDIBLE] relatively prime. 630 00:37:36,720 --> 00:37:37,460 OK, good. 631 00:37:37,460 --> 00:37:39,320 So I figured some of you knew the answer, 632 00:37:39,320 --> 00:37:42,010 but I've seen you before. 633 00:37:42,010 --> 00:37:42,710 Right. 634 00:37:42,710 --> 00:37:43,300 Exactly right. 635 00:37:43,300 --> 00:37:45,300 Relatively prime. 636 00:37:45,300 --> 00:37:47,950 Just hand it to Victor. 637 00:37:47,950 --> 00:37:52,600 So h2 k and m being relatively prime, 638 00:37:52,600 --> 00:38:05,715 if that implies a permutation. 639 00:38:08,592 --> 00:38:10,050 It's similar to what we had before. 640 00:38:10,050 --> 00:38:13,217 You're multiplying this by i. 
i keeps increasing, 641 00:38:13,217 --> 00:38:14,550 and you're going to roll around. 642 00:38:14,550 --> 00:38:14,900 All right? 643 00:38:14,900 --> 00:38:16,316 I mean you could do a proof of it, 644 00:38:16,316 --> 00:38:18,220 but I'm not going to bother. 645 00:38:18,220 --> 00:38:20,720 The important thing here is that you can now 646 00:38:20,720 --> 00:38:24,760 do something as simple as m equals 2 raised to r, 647 00:38:24,760 --> 00:38:33,620 and h2 k for all k is odd, and now you're in great shape. 648 00:38:33,620 --> 00:38:36,250 You can have your array to be 2 raised 649 00:38:36,250 --> 00:38:39,090 to something, which is what you really want. 650 00:38:39,090 --> 00:38:41,360 And you just use h2 k. 651 00:38:41,360 --> 00:38:43,390 You could even take a regular hash function 652 00:38:43,390 --> 00:38:48,800 and truncate it to make sure it's odd. 653 00:38:48,800 --> 00:38:50,140 You can do a bunch of things. 654 00:38:50,140 --> 00:38:52,980 There's hash functions that produce odd values, 655 00:38:52,980 --> 00:38:54,380 and you can use that. 656 00:38:54,380 --> 00:38:55,180 All right? 657 00:38:55,180 --> 00:38:58,560 And so double hashing works fairly well in practice. 658 00:38:58,560 --> 00:39:05,290 It's a good way of getting open addressing to work. 659 00:39:05,290 --> 00:39:08,810 And in order to prove that open addressing actually 660 00:39:08,810 --> 00:39:14,200 works to the level at which chaining works, 661 00:39:14,200 --> 00:39:18,380 we have to make an assumption corresponding 662 00:39:18,380 --> 00:39:20,960 to uniform hashing. 663 00:39:20,960 --> 00:39:25,390 And I'm not going to actually do a proof, 664 00:39:25,390 --> 00:39:27,320 but it'll be in the notes. 665 00:39:27,320 --> 00:39:33,720 But I do want to talk about the theorem and the result 666 00:39:33,720 --> 00:39:38,320 that the theorem implies, assuming 667 00:39:38,320 --> 00:39:40,700 you have the uniform hashing assumption. 
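A small sketch of the double hashing recipe above, with m equal to 2 raised to r and the second hash forced to be odd. The names are hypothetical, `h1` and `h2` are toy stand-ins for real hash functions, and setting the low bit with `| 1` is just one assumed way of making a value odd.

```python
# Hypothetical sketch of double hashing: h(k, i) = (h1(k) + i * h2(k)) mod m,
# with m = 2**r and h2(k) forced odd so that h2(k) and m are relatively prime.
from math import gcd


def h1(k):
    return k        # toy stand-in for an ordinary hash function


def h2(k):
    return 7 * k    # toy stand-in; may be even before the fix-up below


def double_hash_sequence(k, r):
    m = 1 << r               # table size is a power of two
    step = h2(k) | 1         # assumed fix-up: set the low bit to make the step odd
    assert gcd(step, m) == 1
    return [(h1(k) + i * step) % m for i in range(m)]


if __name__ == "__main__":
    seq = double_hash_sequence(5, 3)
    print(seq)
    print(sorted(seq) == list(range(8)))   # True: every slot appears exactly once

    # For contrast, an even step with m = 8 only ever reaches half the table.
    print(sorted({(0 + i * 2) % 8 for i in range(8)}))   # [0, 2, 4, 6]
```

Unlike linear probing, two keys that collide on their first probe generally use different step sizes, so they do not walk down the same run of slots.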
668 00:39:40,700 --> 00:39:43,580 And let me first say that this is not 669 00:39:43,580 --> 00:39:49,920 the same as simple uniform hashing, which 670 00:39:49,920 --> 00:39:54,410 talks about the independence of keys in terms of their mapping 671 00:39:54,410 --> 00:39:55,650 to slots. 672 00:39:55,650 --> 00:39:57,980 The uniform hashing assumption says 673 00:39:57,980 --> 00:40:11,230 that each key is equally likely to have 674 00:40:11,230 --> 00:40:19,250 any one of the m factorial permutations-- 675 00:40:19,250 --> 00:40:21,020 so we're talking about random permutations 676 00:40:21,020 --> 00:40:24,780 here-- as its probe sequence. 677 00:40:31,080 --> 00:40:31,650 All right? 678 00:40:31,650 --> 00:40:33,930 This is very hard to get in practice. 679 00:40:33,930 --> 00:40:38,110 You can get pretty close using double hashing. 680 00:40:38,110 --> 00:40:41,120 But nobody's discovered a perfect hash function, 681 00:40:41,120 --> 00:40:44,572 deterministic hash function that satisfies this property. 682 00:40:44,572 --> 00:40:45,780 At least not that I know of. 683 00:40:48,290 --> 00:40:49,380 So what does this imply? 684 00:40:49,380 --> 00:40:53,340 Assuming that you have this and double hashing 685 00:40:53,340 --> 00:40:59,180 gives you this property, to a large extent what this means is 686 00:40:59,180 --> 00:41:03,170 that if alpha is n over m, you can 687 00:41:03,170 --> 00:41:18,280 show that the cost of operations such as search, insert, delete, 688 00:41:18,280 --> 00:41:19,690 et cetera. 689 00:41:19,690 --> 00:41:22,740 And in particular we talk about insert 690 00:41:22,740 --> 00:41:27,210 is less than or equal to 1 divided by 1 minus alpha. 691 00:41:27,210 --> 00:41:29,150 OK? 692 00:41:29,150 --> 00:41:33,650 So obviously this goes as alpha tends to 1.
693 00:41:33,650 --> 00:41:40,990 As alpha tends to 1, the load factor in the table gets large, 694 00:41:40,990 --> 00:41:44,180 and the number of expected probes 695 00:41:44,180 --> 00:41:47,920 that you need to do when you get an insert grows. 696 00:41:47,920 --> 00:41:52,130 And if alpha is 0.99, you're going to, on average, 697 00:41:52,130 --> 00:41:54,200 require 100 probes. 698 00:41:54,200 --> 00:41:56,960 It's a constant number, but it's a pretty bad constant. 699 00:41:56,960 --> 00:41:57,460 Right? 700 00:41:57,460 --> 00:42:01,050 So you really want alpha to be fairly small. 701 00:42:01,050 --> 00:42:03,130 And in practice it turns out that you 702 00:42:03,130 --> 00:42:05,720 have to re-size your open addressing table 703 00:42:05,720 --> 00:42:10,190 when alpha gets beyond about 0.5, 0.6 or so, 704 00:42:10,190 --> 00:42:13,132 because by then you're really in trouble. 705 00:42:13,132 --> 00:42:15,340 Remember this is an average case we're talking about. 706 00:42:15,340 --> 00:42:18,250 All of this is using a probabilistic assumption. 707 00:42:18,250 --> 00:42:21,780 But as you get to high alphas, suddenly 708 00:42:21,780 --> 00:42:24,720 by the time you get to 0.7, open addressing 709 00:42:24,720 --> 00:42:28,930 doesn't work well in relation to an equivalent table 710 00:42:28,930 --> 00:42:32,460 with the overall number of slots that 711 00:42:32,460 --> 00:42:35,190 correspond to a chaining table, OK? 712 00:42:35,190 --> 00:42:39,020 So open addressing is easy to implement. 713 00:42:39,020 --> 00:42:42,170 It uses less memory because you don't need pointers. 714 00:42:42,170 --> 00:42:47,370 But you better be careful that your alpha stays around 0.5 715 00:42:47,370 --> 00:42:48,480 and no more. 716 00:42:48,480 --> 00:42:50,880 So all that means is you can still use it. 717 00:42:50,880 --> 00:42:52,547 You just have to re-size your table.
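To get a feel for the numbers, the bound 1 divided by 1 minus alpha from the theorem can be tabulated for a few load factors. This is a minimal sketch; the function name `expected_probes_bound` is hypothetical, and the values are the upper bound under the uniform hashing assumption, not measurements of a real table.

```python
# Hypothetical helper: upper bound on expected probes for an insert,
# 1 / (1 - alpha), under the uniform hashing assumption.

def expected_probes_bound(alpha):
    if not 0 <= alpha < 1:
        raise ValueError("load factor must satisfy 0 <= alpha < 1")
    return 1.0 / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.6, 0.7, 0.9, 0.99):
        print(f"alpha = {alpha:.2f}  bound = {expected_probes_bound(alpha):.1f}")
```

At alpha equal to 0.5 the bound is 2 probes, and at 0.99 it is about 100, which matches the figure quoted above.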
718 00:42:52,547 --> 00:42:54,130 You have slightly different strategies 719 00:42:54,130 --> 00:42:56,430 for resizing your table when you use open 720 00:42:56,430 --> 00:43:03,580 addressing as opposed to chaining hash tables. 721 00:43:03,580 --> 00:43:04,350 All right? 722 00:43:04,350 --> 00:43:06,130 So that's a summary of open addressing. 723 00:43:06,130 --> 00:43:09,392 I want to spend some time on cryptographic hashes 724 00:43:09,392 --> 00:43:10,600 in the time that I have left. 725 00:43:10,600 --> 00:43:12,380 I guess I have a few minutes left. 726 00:43:12,380 --> 00:43:15,940 But any questions about open addressing? 727 00:43:15,940 --> 00:43:17,114 Yep? 728 00:43:17,114 --> 00:43:18,875 AUDIENCE: On this delete part, what's 729 00:43:18,875 --> 00:43:21,570 going to happen if, say, you fill the table up and then 730 00:43:21,570 --> 00:43:24,020 delete everything, and then you start searching. 731 00:43:24,020 --> 00:43:26,143 Isn't that going to be bad because it's 732 00:43:26,143 --> 00:43:27,601 going to search through everything? 733 00:43:27,601 --> 00:43:29,680 PROFESSOR: So that's right. 734 00:43:29,680 --> 00:43:31,210 The bad thing about open addressing 735 00:43:31,210 --> 00:43:34,990 is that delete isn't instantaneous, right? 736 00:43:34,990 --> 00:43:37,890 In the sense that if you deleted something from the link list 737 00:43:37,890 --> 00:43:40,000 in your chaining table, then even 738 00:43:40,000 --> 00:43:43,470 if you went to that same thing, the chain got smaller, 739 00:43:43,470 --> 00:43:46,850 and that helps you, because your table now has lower load. 740 00:43:46,850 --> 00:43:49,990 But there's a delay associated with load 741 00:43:49,990 --> 00:43:52,130 when you have the delete me flag. 742 00:43:52,130 --> 00:43:52,630 OK? 743 00:43:52,630 --> 00:43:56,610 So in some sense the alpha that you want to think about, 744 00:43:56,610 --> 00:43:59,816 you should be careful as to how you define alpha. 
745 00:43:59,816 --> 00:44:01,190 And that's one of the reasons why 746 00:44:01,190 --> 00:44:03,874 when you get alpha being 0.5, 0.6 747 00:44:03,874 --> 00:44:06,290 you get into trouble, because if you have all these delete 748 00:44:06,290 --> 00:44:09,080 me flags, they're still hurting you. 749 00:44:09,080 --> 00:44:10,699 AUDIENCE: And when you resize do those 750 00:44:10,699 --> 00:44:12,669 delete me flags get deleted? 751 00:44:12,669 --> 00:44:14,210 PROFESSOR: When you completely resize 752 00:44:14,210 --> 00:44:15,720 and you redo the whole thing, then you 753 00:44:15,720 --> 00:44:17,928 can clean up the delete me's and turn them into nones 754 00:44:17,928 --> 00:44:22,210 because you're rehashing it. 755 00:44:22,210 --> 00:44:22,850 All right. 756 00:44:22,850 --> 00:44:24,340 So yeah, back there. 757 00:44:24,340 --> 00:44:24,840 Question? 758 00:44:24,840 --> 00:44:26,530 AUDIENCE: Yes, can you explain how you got the equation 759 00:44:26,530 --> 00:44:28,747 that the cost of operation insert is less than 760 00:44:28,747 --> 00:44:30,994 or equal to 1 over [INAUDIBLE]. 761 00:44:30,994 --> 00:44:32,410 PROFESSOR: That's a longish proof, 762 00:44:32,410 --> 00:44:36,630 but let me explain to you how that comes out. 763 00:44:36,630 --> 00:44:39,370 Basically the intuition behind the proof 764 00:44:39,370 --> 00:44:45,080 is that we're going to assume some probability p. 765 00:44:45,080 --> 00:44:48,410 And initially you're going to say something 766 00:44:48,410 --> 00:44:58,080 like if the table, your p-- I'll just write this out here-- 767 00:44:58,080 --> 00:45:02,300 is m minus n divided by m. 768 00:45:02,300 --> 00:45:03,350 So what is that? 769 00:45:03,350 --> 00:45:06,620 Right now I have n elements in the table, 770 00:45:06,620 --> 00:45:12,390 and I have m slots, OK? 
771 00:45:12,390 --> 00:45:17,530 So the probability that my very first trial is going to succeed 772 00:45:17,530 --> 00:45:22,360 is going to be m minus n divided by m, because these 773 00:45:22,360 --> 00:45:24,250 are the number of empty slots. 774 00:45:24,250 --> 00:45:26,580 And assuming my permutation argument, 775 00:45:26,580 --> 00:45:28,240 I could go into one of them. 776 00:45:28,240 --> 00:45:30,260 And so that's what I have here. 777 00:45:30,260 --> 00:45:36,010 And if you look at what this is, this is 1 minus alpha, OK? 778 00:45:36,010 --> 00:45:38,470 And so then you run off and you remember 779 00:45:38,470 --> 00:45:41,165 6041 or the high school probability course 780 00:45:41,165 --> 00:45:44,380 that you take, and you say generally speaking, 781 00:45:44,380 --> 00:45:47,470 you're going to be no worse than p for every trial. 782 00:45:47,470 --> 00:45:49,840 And so if you assume the worst and say 783 00:45:49,840 --> 00:45:52,390 every trial has a probability of success of p, 784 00:45:52,390 --> 00:45:56,040 the expected number of trials is 1/p, OK? 785 00:45:56,040 --> 00:46:00,080 And that's how you got the 1 over 1 minus alpha. 786 00:46:00,080 --> 00:46:04,030 So you'll see that written in gory detail in the notes. 787 00:46:04,030 --> 00:46:05,030 All right? 788 00:46:05,030 --> 00:46:06,270 OK. 789 00:46:06,270 --> 00:46:08,370 Expected to have a little more time 790 00:46:08,370 --> 00:46:11,380 in terms of talking about cryptographic hashes, 791 00:46:11,380 --> 00:46:15,040 but cryptographic hashes are not going to be on the quiz. 792 00:46:15,040 --> 00:46:19,920 This is purely material FYI. 793 00:46:19,920 --> 00:46:22,160 For your interest only. 794 00:46:22,160 --> 00:46:24,580 And again I have some notes on it, 795 00:46:24,580 --> 00:46:28,390 but I want to give you a sense of the other kinds of hashes 796 00:46:28,390 --> 00:46:34,370 that exist in the world, I guess. 
797 00:46:34,370 --> 00:46:39,850 And hashes that are used for many different applications. 798 00:46:39,850 --> 00:46:42,070 So maybe the best way of motivating this 799 00:46:42,070 --> 00:46:43,990 is through an example. 800 00:46:43,990 --> 00:46:46,880 So let's talk about an example that 801 00:46:46,880 --> 00:46:51,280 is near and dear to every security person's heart 802 00:46:51,280 --> 00:46:55,050 and probably to people who aren't interested in security 803 00:46:55,050 --> 00:46:58,620 as well, which is password storage. 804 00:46:58,620 --> 00:47:01,750 So think about how, let's say, Unix systems 805 00:47:01,750 --> 00:47:04,650 work when you type in your password. 806 00:47:04,650 --> 00:47:06,650 You're typing in your password [INAUDIBLE], 807 00:47:06,650 --> 00:47:09,460 and this is true for other systems as well, 808 00:47:09,460 --> 00:47:11,650 but you have a password. 809 00:47:11,650 --> 00:47:16,470 And my password is a permutation of my first daughter's 810 00:47:16,470 --> 00:47:18,910 first name. 811 00:47:18,910 --> 00:47:21,040 [LAUGHTER] 812 00:47:21,040 --> 00:47:24,880 Yeah, but haven't given it away, right? 813 00:47:24,880 --> 00:47:27,290 Haven't given it away. 814 00:47:27,290 --> 00:47:29,510 And so this password is something 815 00:47:29,510 --> 00:47:33,430 that I'm typing in every day, right? 816 00:47:33,430 --> 00:47:36,760 Now there's some check that needs to happen 817 00:47:36,760 --> 00:47:40,660 to ensure that I'm typing in the right password. 818 00:47:40,660 --> 00:47:43,610 So what is a dumb way of doing things? 819 00:47:43,610 --> 00:47:46,210 What's a dumb way of building systems? 820 00:47:46,210 --> 00:47:49,510 AUDIENCE: Storing [INAUDIBLE]. 821 00:47:49,510 --> 00:47:52,522 PROFESSOR: This is kind of a freebie. 822 00:47:52,522 --> 00:47:54,235 AUDIENCE: [INAUDIBLE]. 823 00:47:54,235 --> 00:47:55,360 PROFESSOR: In situ hashing. 824 00:47:55,360 --> 00:47:58,710 That's better.
825 00:47:58,710 --> 00:48:00,010 So you'd store it. 826 00:48:00,010 --> 00:48:01,070 I offered the dumb way. 827 00:48:01,070 --> 00:48:03,230 So there's a perfectly valid answer. 828 00:48:03,230 --> 00:48:06,450 So you could clearly store this in plain text in some file 829 00:48:06,450 --> 00:48:09,720 and you could call it slash etc slash password. 830 00:48:09,720 --> 00:48:14,200 And you could make it readable for the world, right? 831 00:48:14,200 --> 00:48:17,290 And that'd be great, and people do that, right? 832 00:48:17,290 --> 00:48:19,770 But what you would rather do is you 833 00:48:19,770 --> 00:48:24,580 want to make sure that even the sysadmin doesn't know 834 00:48:24,580 --> 00:48:27,630 my password or your password, right? 835 00:48:27,630 --> 00:48:29,140 So how do you do that? 836 00:48:29,140 --> 00:48:32,110 Well you do that using a cryptographic hash that 837 00:48:32,110 --> 00:48:36,400 has this interesting property that is one way, OK? 838 00:48:36,400 --> 00:48:42,370 And what that means is that given h of x-- OK, 839 00:48:42,370 --> 00:48:45,460 this is the value of the hash-- it 840 00:48:45,460 --> 00:48:55,620 is very hard to find the x such that x basically 841 00:48:55,620 --> 00:48:56,790 hashes to this value. 842 00:48:56,790 --> 00:49:02,380 So if h of x equals let's call it q, 843 00:49:02,380 --> 00:49:08,910 then you're only given h of x. 844 00:49:08,910 --> 00:49:11,750 And so what do you do now? 845 00:49:11,750 --> 00:49:13,360 Well, it's beautiful. 846 00:49:13,360 --> 00:49:16,710 Assuming you have this one way hash, this cryptographic hash, 847 00:49:16,710 --> 00:49:23,110 in your etc slash password file, you 848 00:49:23,110 --> 00:49:31,780 have something like login name, [INAUDIBLE], 849 00:49:31,780 --> 00:49:35,450 which happens to be the hash of my daughter's first name, 850 00:49:35,450 --> 00:49:36,530 or something.
851 00:49:36,530 --> 00:49:41,000 But this is what's stored in there and the same thing 852 00:49:41,000 --> 00:49:43,140 for a bunch of different users, right? 853 00:49:43,140 --> 00:49:46,970 So when I log in and I type in the actual password, 854 00:49:46,970 --> 00:49:48,670 what does the system do? 855 00:49:48,670 --> 00:49:51,120 What does the system do? 856 00:49:51,120 --> 00:49:52,130 It hashes it. 857 00:49:52,130 --> 00:50:00,300 It takes x prime, which is the typed in password, which 858 00:50:00,300 --> 00:50:04,307 may or may not be equal to my password, 859 00:50:04,307 --> 00:50:06,390 because somebody else might be trying to break in, 860 00:50:06,390 --> 00:50:11,520 or I just mistyped, or forgot my daughter's first name, 861 00:50:11,520 --> 00:50:13,250 which would be bad. 862 00:50:13,250 --> 00:50:18,700 And it will just check to see-- it doesn't need x, because it's 863 00:50:18,700 --> 00:50:23,650 stored h of x in the system, so it doesn't need x. 864 00:50:23,650 --> 00:50:27,300 So if we just compare against what I typed in, 865 00:50:27,300 --> 00:50:28,830 it would compute the hash again. 866 00:50:28,830 --> 00:50:33,700 And then would let me in assuming that these things 867 00:50:33,700 --> 00:50:36,530 matched and would not let me in if it didn't. 868 00:50:36,530 --> 00:50:39,060 So now we can talk about-- and I don't have time for this, 869 00:50:39,060 --> 00:50:41,835 but you can certainly read up on it on Wikipedia 870 00:50:41,835 --> 00:50:43,344 and a bunch in the notes. 871 00:50:43,344 --> 00:50:44,760 You can talk about what properties 872 00:50:44,760 --> 00:50:48,240 should this hash function have, namely one way collision 873 00:50:48,240 --> 00:50:50,950 resistance, in order to solve these problems 874 00:50:50,950 --> 00:50:52,020 and other problems. 875 00:50:52,020 --> 00:50:54,770 I'm happy to stick around and answer questions.
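The login check described above can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 from Python's `hashlib` stands in for the one way hash h, and the names `stored_hashes` and `check_login` and the sample user and password are hypothetical. Production systems typically also add a per-user salt and a deliberately slow hash, which the lecture does not go into.

```python
# Hypothetical sketch of checking a typed password against a stored hash.
# SHA-256 is an assumed stand-in for the one way hash h in the lecture.
import hashlib
import hmac


def h(password):
    # One way hash of the password; only this value is ever stored.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Plays the role of the password file: login name -> h(x), never x itself.
stored_hashes = {"alice": h("correct horse")}


def check_login(login_name, typed_password):
    expected = stored_hashes.get(login_name)
    if expected is None:
        return False
    # Hash the typed password x' and compare h(x') with the stored h(x).
    return hmac.compare_digest(h(typed_password), expected)


if __name__ == "__main__":
    print(check_login("alice", "correct horse"))   # True
    print(check_login("alice", "wrong guess"))     # False
```

The system never needs x itself: it only recomputes the hash of what was typed and compares the two hash values.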