The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIK DEMAINE: All right, today we're going to do an exciting topic, which is hashing. Do it all in one lecture, that's the plan. See if we make it. You've probably heard about hashing. It's probably the most common data structure in computer science. It's covered in pretty much every algorithms class. But there's a lot to say about it. And I want to quickly review things you might know and then quickly get to things you shouldn't know.

We're going to talk, on the one hand, about different kinds of hash functions: fancy stuff like k-wise independence, and a technique that's been analyzed a lot lately, just in the last year, called simple tabulation hashing. And then we'll look at different ways to use a hash function to actually build data structures: chaining is the obvious one; perfect hashing you may have seen; linear probing is another obvious one, but it has only been analyzed recently; and cuckoo hashing is a newer one that has its own fun features. So that's where we're going to go today.

Remember, the basic idea of hashing is that you want to reduce a giant universe to a reasonably small table. So I'm going to call our hash function h, and the universe the integers 0 up to u minus 1; I'll denote the universe by capital U. And we have a table that we'd like to store. I'm not going to draw the table yet, because what the table actually is is the second half of the lecture. We'll just think of it as indices 0 through m minus 1, where m is the table size. And probably we want m to be about n, where n is the number of keys we're actually storing in the table; but that's not necessarily seen at this level. So that's a hash function: h maps the universe {0, ..., u-1} to the slots {0, ..., m-1}.
m is going to be much smaller than u. We're just hashing integers here; if you don't have integers, you map whatever space of things you have to integers. That's pretty much always possible.

Now, the best-case scenario would be to use a totally random hash function. What does totally random mean? The probability, over the choice of hash function, that any key x maps to any particular slot (these table entries are called slots) is 1/m, and this is independent for all x. In symbols: Pr[h(x) = t] = 1/m for every key x and slot t, independently over distinct keys.

So this would be ideal: you choose each h(x), for every possible key, randomly and independently. Then that gives you perfect hashing; not perfect in the technical sense, sorry, it gives you ideal hashing. Perfect means no collisions, and this actually might have collisions; there's some chance that two keys hash to the same value. We call this totally random. This is the ideal thing that we're trying to approximate with reasonable hash functions.

Why is this bad? Because it's big. If you actually could flip all these coins, you'd need to write down u log m bits of information, which is generally way too big. We can't afford anything proportional to u; the whole point is we want to store n items, much smaller than u. Surprisingly, this concept will still be useful, so we'll get there.

Another system you've probably seen is universal hashing. This is a constraint on a family of hash functions. Ideally you'd choose h uniformly at random from all hash functions; that would give you the probability above. Instead, we're going to choose from a much smaller set of hash functions, so you can encode the hash function in many fewer bits. And the property we want from that hash family is that if you look at the probability that two keys collide, you get roughly what you'd expect from totally random: you would hope for 1/m.
Once you pick one key, the probability that another key hits the same slot would be 1/m. But we'll allow a constant factor, and also allow it to be smaller; that gives us some slop. So universal means Pr[h(x) = h(y)] = O(1/m) for all distinct keys x and y. You don't have to allow the slop; if you insist on exactly 1/m, it's called strongly universal.

That's universal. And universal is enough for a lot of things that you've probably seen, but not enough for other things. So here are some examples of hash functions that are universal, which again, you may have seen.

You can take a random integer a and multiply it by x, integer multiplication (you could also do this as a vector dot product, but here I'm doing it as multiplication), modulo a prime p, where the prime has to be bigger than u (bigger or equal is fine), and then take the whole thing modulo m: h(x) = ((a x) mod p) mod m. Now, this is universal, but it loses a factor of 2 here, I believe, in general, because you take things modulo a prime and then you take things modulo whatever your table size is. If you set your table size to p, that's great; I think you get a factor of 1. If you don't, you're essentially losing possibly half the slots, depending on how m and p are related to each other. So it's OK, but not great.

It's also considered expensive, because you have to do all this division, which people don't like to do. So there's a fancier method: h(x) = (a x) >> (log u - log m), that is, a times x shifted right by log u minus log m, keeping only the low-order log u bits of the product, which is exactly what overflowing machine multiplication does. This is for when m and u are powers of 2, which is the case we kind of care about. Usually your universe is of size 2 to the word size of your machine, 2^32 or 2^64, however big your integers are, so it's usually a power of 2. And it's fine to make your table a power of 2; we're probably going to use table doubling. So you just multiply and then take the high-order bits, that's what this is saying. This is a more recent method, from 1997, whereas the mod-p one goes back to 1979. So '79, '97. And it's also universal. There's a lot of universal hash functions.
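As a quick aside, here is a sketch of the two families just described, in Python. The concrete parameters (a 64-bit universe and this particular Mersenne prime) are my illustrative assumptions, not from the lecture.

```python
import random

P = (1 << 61) - 1   # a prime >= u; fine for any universe up to 2**61

def make_mod_prime_hash(m):
    """h(x) = ((a*x) mod p) mod m for random a: the 1979-style family."""
    a = random.randrange(1, P)
    return lambda x: ((a * x) % P) % m

def make_multiply_shift_hash(log_u, log_m):
    """h(x) = (a*x mod 2**log_u) >> (log_u - log_m) for random odd a:
    the 1997 multiply-shift family; u and m must be powers of 2."""
    a = random.randrange(1 << log_u) | 1          # random odd multiplier
    mask = (1 << log_u) - 1                       # emulate overflowing multiply
    return lambda x: ((a * x) & mask) >> (log_u - log_m)

h = make_multiply_shift_hash(64, 10)   # hashes 64-bit keys into 2**10 slots
```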
I'm not going to list them all; I'd rather get to stronger properties than universality. So the next one is called k-wise independence. This is harder to obtain, and it implies universality. We want a family of hash functions such that Pr[h(x_1) = t_1 and ... and h(x_k) = t_k] = O(1/m^k) for any k distinct keys x_1, ..., x_k and any slots t_1, ..., t_k.

Maybe let's start with just pairwise independence, k = 2. Then what this is saying is the probability, over your choice of hash function, that the first key maps to this slot t_1 and the second key maps to some other slot t_2, for any two keys x_1 and x_2. If your function were totally random, each of those events happens with probability 1/m and they're independent, so you get 1/m^k, or 1/m^2 for k = 2.

Even that is different from saying the probability of two keys colliding is 1/m. This would imply that. But with mere universality there could still be some co-dependence between x and y; here there essentially can't be, other than the constant factor. Pairwise independence means every two keys are independent; k-wise means every k keys are independent, up to the constant factor. And this is for distinct x_i's: obviously, if two of them are equal, they're very likely to hash to the same slot, so you've got to forbid that.

OK, so an example of such a hash function. Before, we just took a product. In general, you can take a polynomial of degree k - 1: h(x) = ((a_{k-1} x^{k-1} + ... + a_1 x + a_0) mod p) mod m; evaluate that mod p, and then, if you want, reduce modulo your table size. In particular, for k = 2 we actually have to do some work: the function a x is not pairwise independent, it is universal. If you make it a x + b for random a and b, then this becomes pairwise independent. If you want three-wise independent, you need a x^2 + b x + c for random a, b, and c. These coefficients are arbitrary numbers between 0 and p - 1, I guess. OK. This is also old, 1981.
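Here is a sketch of that polynomial construction in Python (the prime, again, is my illustrative assumption; this gives k-wise independence up to the constant-factor slop from the final mod m):

```python
import random

P = (1 << 61) - 1   # a prime >= u

def make_kwise_hash(k, m):
    """h(x) = ((a_{k-1} x^{k-1} + ... + a_1 x + a_0) mod p) mod m
    with random coefficients; k = 2 gives the pairwise a*x + b family."""
    coeffs = [random.randrange(P) for _ in range(k)]   # a_0, ..., a_{k-1}
    def h(x):
        v = 0
        for a in reversed(coeffs):    # Horner's rule, everything mod p
            v = (v * x + a) % P
        return v % m
    return h

h = make_kwise_hash(2, 1024)   # pairwise-independent hashing into 1024 slots
```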
Wegman and Carter introduced these two notions in a couple of different papers, so this is an old idea. It is, of course, expensive, in that we pay order k time to evaluate the polynomial; also, there are a lot of multiplications, and you have to do everything modulo p. So a lot of people have worked on more efficient ways to do k-wise independence, and there are two main results on this. Both of them use u^epsilon space, which is not great. One of them has query time depending on k; it's reasonably practical, with experiments to back it up. The other gets constant query time for logarithmic independence.

The first one is actually the later one; it's by Thorup and Zhang. The other is by Siegel. Both papers are from 2004, so fairly recent. Siegel's takes a fair amount of space, and the paper proves that to get constant query time for logarithmic independence, you need quite a bit of space to store your hash function. Keep in mind, the polynomial hash functions only take about k log u bits to store, which is k words of space; very small. Here you need space depending on n, which is kind of annoying, especially if you want to be dynamic. But statically you can get constant query time with logarithmic-wise independence, if you pay a lot in space.

There are more practical methods too; the Thorup-Zhang one is especially practical for k = 5, which is a case that we'll see is of interest. Cool. And this much space is necessary if you want constant query time. We'll see that log n wise independence is the most we'll ever require in this class, and as far as I know, in hashing in general; so you don't need to worry about more than log-wise independence.

All right, one more hashing scheme. It's called simple tabulation hashing. This is a simple idea. It also goes back to '81, but it's just been analyzed last year, so there are a lot of results to report on it.
The idea is: take your integer and split it up, in some base, so that there are exactly c characters, where c is a constant. Then build a totally random lookup table on each character. A totally random hash function over the whole universe is the thing we couldn't afford, but we're just going to do it per character. So there are going to be c of these tables, and each of them has size u^(1/c). Essentially we're getting u^epsilon space, which is similar to those space bounds; again, not great, but it's a really simple hash function.

The hash function is just: take your first table applied to the first character, XOR that with the second table applied to the second character, and so on through all the characters: h(x) = T_1(x_1) XOR T_2(x_2) XOR ... XOR T_c(x_c).

The nice thing about this is it's super simple. You can imagine it being done in one instruction on a fancy CPU, if you convince people this is a cool enough instruction to have; it's very simple to implement circuit-wise. But in our model you have to do all these operations separately, so it takes order c time to compute.

One thing that's known about it is that it's three-wise independent, so it does kind of fit in this framework. But three-wise independence is not very impressive; a lot of the results we'll see require log n wise independence. The cool thing is that, roughly speaking, simple tabulation hashing is almost as good as log n wise independence in all the hashing schemes that we care about. And we'll get there, exactly what that means.
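Here is a minimal sketch of simple tabulation hashing, under the assumption (mine, for concreteness) of 64-bit keys split into c = 8 one-byte characters, so each table has u^(1/c) = 256 entries:

```python
import random

C, BITS = 8, 8   # c = 8 characters of 8 bits each, for a 64-bit universe

# c totally random tables, each of size 2**BITS, filled once up front
TABLES = [[random.getrandbits(64) for _ in range(1 << BITS)] for _ in range(C)]

def tabulation_hash(x):
    h = 0
    for i in range(C):
        h ^= TABLES[i][(x >> (BITS * i)) & ((1 << BITS) - 1)]  # XOR of c lookups
    return h
```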
So that was my overview of some hash functions, these two boards. Next we're going to look at basic chaining, then perfect hashing. How many people have seen perfect hashing, just to get a sense? More than half, maybe 2/3. All right, I should do this really fast.

Chaining is the first kind of hashing you usually see. You have your hash function, which is mapping keys into slots. If you have two keys that go to the same slot, you store them as a linked list; if you don't have anything in a slot, it's blank. This is very easy.

If you look at a particular slot t and call the length of the chain you get there C_t, you can look at the expected length of that chain. In general, it's just going to be the sum, over all keys y, of the probability that y maps to that slot: E[C_t] = sum over keys y of Pr[h(y) = t]. This is just writing C_t as a sum of indicator random variables and applying linearity of expectation; the expectation of each indicator variable is its probability.

So here we just need to compute the probability that each key goes to each slot, as long as your hash function is uniform, meaning each slot is equally likely to be hashed to. Well, actually, we're looking at a particular slot, so we're essentially using universality here: once we fix the one slot we care about, say t = h(y) for the key y we're searching for, then this is universality. By universality, each term is O(1/m), and so the expected chain length is O(n/m), usually called the load factor. What we care about is that this is constant for m = Theta(n), and you use table doubling to keep m = Theta(n). Boom, you've got expected chain length constant.
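For concreteness, here is a minimal chaining dictionary in Python (the hash function is passed in, and table doubling is omitted for brevity; the names here are mine):

```python
class ChainedHashTable:
    def __init__(self, m, h):
        self.m, self.h = m, h
        self.slots = [[] for _ in range(m)]      # one chain per slot

    def insert(self, key, value):
        chain = self.slots[self.h(key) % self.m]
        for i, (k, _) in enumerate(chain):
            if k == key:                         # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))               # otherwise extend the chain

    def search(self, key):
        for k, v in self.slots[self.h(key) % self.m]:
            if k == key:
                return v
        return None                              # not in the dictionary
```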
But in the theory world, expected is a very weak bound. What we want are high-probability bounds. So let me tell you a little bit about high-probability bounds; this you may not have seen as much.

Let's start with: if your hash function is totally random, then your chain lengths will be O(log n / log log n), with high probability. They are not constant. In fact, you expect the maximum chain to be at least log n / log log n; I won't prove that here. Instead, I'll prove the upper bound. So the claim is that, while in expectation each chain is constant, the maximum is essentially high.

Actually, let's talk about variance a little bit; sorry, I'm getting distracted. You might say, oh OK, expectation is nice, let's look at variance. It turns out the variance is constant for these chains. There are various formulas for variance, but in particular the one I want to use is Var[C_t] = E[C_t^2] - E[C_t]^2, which writes it in terms of some expectations. Now, the expected chain length we know is constant; you square it, it's still constant, so that part is sort of irrelevant. The interesting part is: what is the expected squared chain length? That is going to depend on exactly which hash function you use; let's analyze it for totally random. In general, we just need a certain kind of symmetry here.

Instead of one expected squared chain length, what I'd like to do is sum over all of them; this is going to be easier to analyze: E[C_t^2] = (1/m) times the sum over slots t of E[C_t^2]. If I sum over all chains and then divide, taking the average, I get the expected squared chain length of any individual slot, as long as your hash function is symmetric, with all slots equally likely; and you could basically apply a random permutation to your keys to make this true if it isn't already.

Now, the sum of C_t^2 over all slots is just, up to constant factors, the number of pairs of keys that collide. So you can forget about slots; this is just the sum, over all pairs of keys i and j, of the probability that x_i hashes to the same slot as x_j. And each term is something we know by universality: it's O(1/m). The number of pairs is n^2, so we get n^2 times 1/m, times the 1/m out front; for m = Theta(n), this is constant.

So the variance is actually small. But it's not a good indicator of how big our chains can get, because still, with high probability, one of the chains will have length log n / log log n; it's just that a typical one won't. Let's prove the upper bound. This uses Chernoff bounds, which are tail bounds, essentially. I haven't properly defined "with high probability"; it's probably good to review that.
"With high probability" means with probability at least 1 - 1/n^c, where I get to choose any constant c. So high probability means polynomially small failure probability. This is good because if you do this polynomially many times, the property remains true; you just increase your constant by however many times you're going to use it.

So we prove these kinds of bounds using Chernoff, which looks something like this: Pr[C_t >= c mu] <= e^{(c-1) mu} / (c mu)^{c mu}. Here mu is the mean, and the mean we've already computed is constant: the expectation of C_t is constant. So we want C_t to be not much larger than that. This says the probability that C_t is at least some factor c times the mean (c doesn't have to be constant here; sorry, maybe not great terminology) is at most this exponential. Which is a bit annoying, or a bit ugly.

But in particular, if we plug in c = log n / log log n, using that as our factor, which is what we're concerned about here, we get that this probability is essentially dominated by the bottom term, which becomes (log n / log log n)^{log n / log log n}; so essentially we get 1 over that. And if you take that bottom part and put it into the exponent, you pick up a log log n: this is something like 1 / 2^{(log n / log log n) log log n}. The log log n's cancel, and so this is basically 1/n. If you put a constant factor in here, you get a constant in the exponent there, so you can get failure probability 1/n^c.

So you get this with-high-probability bound as long as you go up to a chain length of log n / log log n; it's not true otherwise. This is kind of depressing, and it's one reason we will turn to perfect hashing: some of the chains are long. But there is a sense in which this is not so bad, so let me go to that. I kind of want all these boards.
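If you want to see this max-chain behavior concretely, here is a quick empirical check (my addition, not from the lecture): throw n keys into m = n slots totally at random and report the longest chain, which grows slowly, roughly like log n / log log n, rather than staying constant.

```python
import random
from collections import Counter

for n in (10**3, 10**4, 10**5, 10**6):
    # totally random hashing of n keys into n slots
    longest = max(Counter(random.randrange(n) for _ in range(n)).values())
    print(n, longest)
```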
429 00:26:29,432 --> 00:26:32,841 AUDIENCE: [INAUDIBLE] 430 00:26:33,611 --> 00:26:34,694 ERIK DEMAINE: What's that? 431 00:26:34,694 --> 00:26:36,880 AUDIENCE: [INAUDIBLE] 432 00:26:36,880 --> 00:26:38,590 ERIK DEMAINE: Since when is log n long. 433 00:26:38,590 --> 00:26:39,345 Well-- 434 00:26:39,345 --> 00:26:40,960 AUDIENCE: [INAUDIBLE] 435 00:26:40,960 --> 00:26:43,100 ERIK DEMAINE: Right, so I mean, in some sense 436 00:26:43,100 --> 00:26:44,599 the name of the game here is we want 437 00:26:44,599 --> 00:26:46,300 to beat binary search trees. 438 00:26:46,300 --> 00:26:48,349 I didn't even mention what problem we're solving. 439 00:26:48,349 --> 00:26:50,140 We're solving the dictionary problem, which 440 00:26:50,140 --> 00:26:52,840 is sort of bunch of keys, insert delete, 441 00:26:52,840 --> 00:26:54,520 and search is now just exact search. 442 00:26:54,520 --> 00:26:56,020 I want to know is this key in there? 443 00:26:56,020 --> 00:26:59,660 If so, find some data associated with it. 444 00:26:59,660 --> 00:27:02,920 Which is something binary search trees could do, n log n time. 445 00:27:02,920 --> 00:27:04,750 And we've seen various fancy ways 446 00:27:04,750 --> 00:27:06,420 to try to make that better. 447 00:27:06,420 --> 00:27:08,170 But in the worst case, you need log n time 448 00:27:08,170 --> 00:27:09,253 to do binary search trees. 449 00:27:09,253 --> 00:27:12,940 We want to get to constant as much as possible. 450 00:27:12,940 --> 00:27:17,470 We want the hash function to be evaluatable in constant time. 451 00:27:17,470 --> 00:27:21,280 We want the queries to be done in constant time. 452 00:27:21,280 --> 00:27:24,550 If you have a long chain, you've got to search the whole chain 453 00:27:24,550 --> 00:27:29,290 and I don't want to spend log n over log log n. 454 00:27:29,290 --> 00:27:31,250 Because I said so. 455 00:27:31,250 --> 00:27:34,220 Admittedly, log n over log log n is not that big. 456 00:27:34,220 --> 00:27:36,470 And furthermore, the following holds. 457 00:27:36,470 --> 00:27:39,230 This is a sense in which it's not really 458 00:27:39,230 --> 00:27:41,450 log n over log log n. 459 00:27:41,450 --> 00:27:46,230 If we change the model briefly and say, 460 00:27:46,230 --> 00:27:50,300 well, suppose I have a cache of the last log n items 461 00:27:50,300 --> 00:27:53,420 that I searched for in the hash table. 462 00:27:56,630 --> 00:28:00,039 Then if you're totally random, which 463 00:28:00,039 --> 00:28:01,580 is something we assumed here in order 464 00:28:01,580 --> 00:28:04,320 to apply the Chernoff bound, we needed that everything 465 00:28:04,320 --> 00:28:06,750 was completely random. 466 00:28:06,750 --> 00:28:12,050 Then you get a constant amortized bound per operation. 467 00:28:15,110 --> 00:28:16,610 So this is kind of funny. 468 00:28:16,610 --> 00:28:20,550 In fact, all it's saying this is easy to prove. 469 00:28:20,550 --> 00:28:22,260 And it's not yet in any paper. 470 00:28:22,260 --> 00:28:30,440 It's on Mihai Petrescu's blog from 2011. 471 00:28:30,440 --> 00:28:31,514 All right, we're here. 472 00:28:31,514 --> 00:28:32,930 We're looking at different chains. 473 00:28:32,930 --> 00:28:35,420 So you access some chain, then you access another chain, 474 00:28:35,420 --> 00:28:37,640 then you access another chain. 475 00:28:37,640 --> 00:28:39,740 If you're unlucky, you'll hit the big chain 476 00:28:39,740 --> 00:28:41,930 which cost log n over over log log n 477 00:28:41,930 --> 00:28:46,190 to touch, which is expensive. 
But you could then put all those items into the cache, and if you happen to keep probing there, it will be fast. In general, you do a bunch of searches: first I search for x_1, then I search for x_2, x_3, and so on. Cluster those into groups of Theta(log n): look at the first log n searches, then the next log n searches, and analyze each group separately. We're going to amortize over each window of log n operations.

So look at a batch of Theta(log n) operations, say just log n of them. I claim that the number of keys that collide with them is Theta(log n), with high probability. If this is true, then it's constant each: if I can do log n operations by visiting O(log n) total chain items, with high probability, then I just charge one each; and so, amortized over this little log n window, sort of smoothing the cost, I get constant amortized per operation, with high probability now, not just in expectation.

Why is this true? It's essentially the same argument. This is normally called a balls-in-bins argument: you're throwing balls, which are your keys, randomly into bins, which are your slots. The expectation is constant for any one bin, and any one of them could go up to log n / log log n with high probability. Here, we're looking at log n different slots and taking the sum of the balls that fall into each of those slots. In expectation that's log n, because it's constant each, and expectation is linear when you take the sum over these log n bins. So the expectation is log n. Then you apply Chernoff again, except now the mean is log n, and it suffices to take c = 2. We can run through this: the mean mu is Theta(log n); we expect there to be log n items that fall into these log n bins.
And so you just plug c = 2 into the Chernoff bound, and you get e^{log n} (which is kind of weird) over (2 log n)^{2 log n}. That bottom term is like n^{2 log log n}. So it's big, way bigger than the top, so the top essentially disappears. And in particular, the bound is smaller than 1/n^c for any constant c, because 2 log log n is bigger than any constant. So you're done. This is just saying that the probability of being more than twice the mean is very, very small. So, with high probability, only O(log n) items fall in these log n bins; you just amortize, boom, constant.

This is kind of a weird notion; I've never actually seen "amortized with high probability" in a paper before. This is the first time it seems like a useful concept. So if you think log n / log log n is bad, this is a sense in which it's OK; don't worry about it.

All right, but if you did worry about it, the next thing you'd do is perfect hashing. Perfect hashing is really just an embellishment of chaining. It's also called FKS hashing, after the authors Fredman, Komlós, and Szemerédi; it's from 1984, so it's an old idea. You just take chaining, but instead of storing your chains as linked lists, you store them in hash tables. Simple idea, with one clever trick: you store each chain in a big hash table, of size Theta(C_t^2).

Now, this looks like a problem, because that's going to be quadratic space in the worst case, if everybody hashes to the same chain. But we know that chains are pretty small with high probability, so it turns out this is OK. The space is the sum over t of Theta(C_t^2), and that's something we actually computed already, except I erased it (how convenient of me; it was right here, I can still barely read it) when we computed the variance. We can do it again, it's not really that hard: this is the number of pairs of keys that collide.
There are n^2 pairs, and each of them has probability O(1/m) of colliding if you have a universal hash function. So this is O(n^2 / m), which, if m is within a constant factor of n, is linear. So, linear space in expectation: the expected amount of space is linear. I won't try to do a with-high-probability bound here.

What else can I say? You have to play a similar trick when you're actually building these inner hash tables. All right, so why do we use size C_t^2? Because of the birthday paradox. If you have a hash table of size Theta(C_t^2), then with constant probability you don't get any collisions. Why? Consider the expected number of collisions among the C_t keys. There are about C_t^2 pairs, and each pair, if you're using universal hashing, collides with probability O(1 / C_t^2), because the collision probability is 1 over the table size, and the table size is Theta(C_t^2). So the expected number of collisions is constant. And if we set the constants right (I get to set this Theta to be whatever I want), I can make it less than 1/2.

If the expected number of collisions is less than 1/2, then the probability that the number of collisions is 0 is at least 1/2. This is Markov's inequality, in particular: the probability that the number of collisions is at least 1 is at most the expectation divided by 1, which is 1/2.

So you try to build the table. If you have 0 collisions, you're happy, and you go on to the next one. If you don't, just try again with a new hash function. You're essentially flipping a coin each time, so after an expected constant number of trials you get heads, and then you've built the table with 0 collisions. We always want the inner tables to be collision-free.

So in expected linear time you can build this structure, and it has expected linear space. In fact, if it doesn't come out to linear space, you can just try the whole thing over again.
So in expected linear time, you'll build a guaranteed linear-space structure. The nice thing about perfect hashing is that a query does two hash dereferences, and that's it. So the query is constant deterministic: queries are deterministic, and only the construction is randomized. I didn't talk about updates, I talked about building; the construction here is randomized, queries are constant deterministic.

Now, you can make this dynamic in pretty much the obvious way. Say I want to insert. It's essentially two-level hashing: first you figure out where the key goes in the big hash table, then you find the corresponding chain, which is now a hash table, and you insert into that inner hash table. So it's the obvious thing. The trouble is you might get a collision in that inner hash table. If you get a collision, you rebuild that inner hash table. The probability of a collision happening is small, and it remains small, by this argument, because the expected number of collisions stays small unless the chain gets really big.

So if the chain grows by a factor of 2, you have to rebuild the table; and because the table size is C_t^2, when the chain length grows by a factor of 2, you rebuild the table to be a factor of 4 larger. In general, you maintain that each table is sized for roughly the correct chain length, within a constant factor, and you do doubling and halving in the usual way, like a B-tree; or I guess it's table doubling, really. And it will be constant amortized expected per operation.

There's also a fancy way to make this constant with high probability per insert and delete, which I have not read, by Dietzfelbinger and Meyer auf der Heide, 1990. So, it's easy to make this expected amortized; with more effort you can make it with high probability per operation, but that is trickier. Cool.
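Here is a sketch of the static FKS construction in Python, as just described: a universal outer hash into n chains, then a collision-free table of size Theta(C_t^2) per chain, re-picking the inner hash until it is collision-free (expected O(1) tries, by the Markov argument above). The prime and the constant 4 are my illustrative assumptions.

```python
import random

P = (1 << 61) - 1   # a prime >= u

def universal(m):
    """A simple universal hash into m slots."""
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x: ((a * x + b) % P) % m

def build_fks(keys):
    n = len(keys)
    outer = universal(n)
    chains = [[] for _ in range(n)]
    for x in keys:                               # first level: plain chaining
        chains[outer(x)].append(x)
    inner_tables = []
    for chain in chains:
        size = max(1, 4 * len(chain) ** 2)       # Theta(ct^2) slots per chain
        while True:                              # retry until collision-free
            inner = universal(size)
            slots = [None] * size
            for x in chain:
                if slots[inner(x)] is not None:  # collision: pick a new inner hash
                    break
                slots[inner(x)] = x
            else:
                break                            # no collisions, keep this table
        inner_tables.append((inner, slots))
    return outer, inner_tables

def query(fks, x):
    outer, inner_tables = fks
    inner, slots = inner_tables[outer(x)]
    return slots[inner(x)] == x                  # two probes, deterministic
```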
I actually skipped one thing with chaining, which I wanted to talk about. The expected-chain-length analysis was fine; it just used universality. But the cache analysis assumed totally random, and the high-probability chain-length analysis assumed totally random. What about real hash functions? We can't use totally random. What about universal, k-wise independent, or simple tabulation hashing, just for chaining?

And similar things hold for perfect hashing, I think; I'm not sure they're all known. Oh, sorry: for perfect hashing, in expectation everything is fine with just universal; we've already done that with universality. What about chaining? How big can the chains get? I said log n / log log n with high probability, but our analysis used a Chernoff bound, which holds for Bernoulli trials; it was only valid for totally random hash functions. It turns out the same bound holds if you have a (log n / log log n)-wise independent hash function.

So this is kind of annoying: for this to be true, you need a lot of independence, and it's hard to get log n wise independence. There is Siegel's way to get constant evaluation time, but it needs a lot of space, which is not so thrilling, and it's also kind of complicated. If you don't mind the space but you just want something simpler, you can use simple tabulation hashing. For both of these, the same chain analysis turns out to work. The independence result is fairly old, from 1995; the tabulation result is from last year. So if you just use simple tabulation hashing, which still takes a lot of space, u^epsilon, but is very simple to implement, then the chain lengths are as you expect them to be. And I believe that carries over to the caching argument, but I haven't checked it.

All right. Great, I think we're now happy: we've talked about real hash functions for chaining and perfect hashing. The next thing we're going to talk about is linear probing. I mean, in some sense, we have good theoretical answers now.
We can do constant expected amortized updates, even with constant deterministic queries. But we're greedy, and people like to implement all sorts of different hashing schemes. Perfect hashing is pretty rare in practice. Why? I guess because you have to hash twice instead of once, and that's just more expensive. So what about the simpler hashing schemes? Simple tabulation hashing is nice and simple, but what about linear probing? That's really simple.

Linear probing is either the first or the second hashing scheme you learn. You store things in a table, and the hash function tells you where to go. If that's full, you just go to the next spot; if that's full, you go to the next spot; and so on until you find an empty slot, and then you put x there. So if some y and z are sitting there, x ends up a couple of slots over.

Everyone knows linear probing is bad, because the rich get richer. It's like the parking lot problem: big runs of elements are more likely to get hit, so they grow even faster and get worse. So you should never use linear probing. Has everyone learned that? It's all false, however. Linear probing is actually really good.

The first indication is that it's really good in practice. There's a small experiment by Mihai Pătrașcu, who was an undergrad and PhD student here; he's at AT&T now. He was doing some experiments and found that, in practice, on a network router, linear probing costs 10% more time than a memory access. So, basically free. Why? You just set m to be 2 times n, or (1 + epsilon) times n, whatever. It actually works really well, and I'd like to convince you that it works really well.
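For reference, here is a minimal sketch of linear probing insert and search (no deletion and no resizing, and the hash function is passed in; keep m comfortably above n, say m = 2n):

```python
class LinearProbingTable:
    def __init__(self, m, h):
        self.m, self.h = m, h
        self.slots = [None] * m

    def insert(self, key):
        t = self.h(key) % self.m
        while self.slots[t] is not None and self.slots[t] != key:
            t = (t + 1) % self.m        # scan right until a free slot
        self.slots[t] = key

    def search(self, key):
        t = self.h(key) % self.m
        while self.slots[t] is not None:
            if self.slots[t] == key:
                return True
            t = (t + 1) % self.m        # a run ends at the first empty slot
        return False
```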
Now, first let me tell you some things. The idea that linear probing works really well is old: for a totally random hash function, you get constant time per operation, and Knuth actually showed this first, in 1962, in a technical report. The answer ends up being about 1/epsilon^2 for m = (1 + epsilon) n. Now you might say 1/epsilon^2 is really bad, and there are other schemes that achieve 1/epsilon, which is better. But what's a little bit of space, right? Just set epsilon to 1 and you're done. So linear probing was bad back when we were really tight on space; when you can afford a factor of 2, linear probing is great. That's the bottom line.

Now, that's for totally random, which is not so useful. What about all these other hash functions? Like universal: it turns out that with merely a universal hash function, linear probing can be really, really bad, and that's why it gets a bad rap. But some good news. The first result was for log n wise independence; that is extremely strong, but it implies constant expected time per operation. Not very exciting. The big breakthrough was in 2007: five-wise independence is enough. And this is why that Thorup-Zhang paper was focusing in particular on the case of k = 4; actually, they were doing k = 4, but they solved 5 at the same time. So theirs was a very highly optimized, practical way to get five-wise independence, admittedly with some space, but it's pretty cool. So five-wise independence is enough to get constant expected time. I shouldn't write O(1), because I'm not writing the dependence on epsilon here; I don't know exactly what it is, but it's some constant depending on epsilon.

And this turns out to be tight: there are four-wise independent hash functions, including, I think, the polynomial ones that we did, that are really bad, as bad as binary search trees; you don't get constant expected time. So you really need five-wise independence. It's kind of weird, but it's true. And the other fun fact is that simple tabulation hashing also achieves constant.
808 00:47:30,940 --> 00:47:34,200 And here it's known that it's also 1 over epsilon squared. 809 00:47:34,200 --> 00:47:37,180 So simple tabulation hashing is just as good 810 00:47:37,180 --> 00:47:37,680 as totally random. 811 00:47:37,680 --> 00:47:39,930 Which is nice because again, this is simple. 812 00:47:39,930 --> 00:47:46,350 Takes a bit of space but both of these have that property. 813 00:47:46,350 --> 00:47:48,600 And so these are good ways to use 814 00:47:48,600 --> 00:47:50,026 linear probing in particular. 815 00:47:50,026 --> 00:47:51,650 So you really need a good hash function 816 00:47:51,650 --> 00:47:53,100 for linear probing to work out. 817 00:47:53,100 --> 00:47:56,100 If you use a universal hash function like a times 818 00:47:56,100 --> 00:48:00,270 x mod p mod m it will fail. 819 00:48:00,270 --> 00:48:03,014 But if you use a good hash function, which we're now 820 00:48:03,014 --> 00:48:03,930 getting to the point-- 821 00:48:03,930 --> 00:48:07,101 I mean, this is super simple to implement. 822 00:48:07,101 --> 00:48:08,130 It should work fine. 823 00:48:08,130 --> 00:48:09,755 I think it would be a neat project to take 824 00:48:09,755 --> 00:48:12,930 a Python or something that has hash tables deep inside it, 825 00:48:12,930 --> 00:48:13,580 and replace-- 826 00:48:13,580 --> 00:48:17,250 I think they use quadratic probing and universal hash 827 00:48:17,250 --> 00:48:18,750 functions. 828 00:48:18,750 --> 00:48:21,810 If you instead use linear probing and simple tabulation 829 00:48:21,810 --> 00:48:25,812 hashing, it might do the same, might do better, I don't know. 830 00:48:25,812 --> 00:48:26,520 It's interesting. 831 00:48:26,520 --> 00:48:28,240 It would be a project to try out. 832 00:48:31,690 --> 00:48:33,930 Cool. 833 00:48:33,930 --> 00:48:35,110 Well, I just quoted results. 834 00:48:35,110 --> 00:48:39,790 What I'd like to do is prove something like this to you. 835 00:48:39,790 --> 00:48:43,000 Totally random hash functions imply some constant expected. 836 00:48:43,000 --> 00:48:45,800 I won't try to work out the dependence on epsilon 837 00:48:45,800 --> 00:48:48,790 because it's actually a pretty clean proof, it looks nice. 838 00:48:51,750 --> 00:48:53,030 Very data structures-y. 839 00:48:59,810 --> 00:49:01,440 I'm not going to cover Knuth's proof. 840 00:49:01,440 --> 00:49:07,650 I'm essentially covering this proof. 841 00:49:07,650 --> 00:49:09,360 In this paper five-wise independence 842 00:49:09,360 --> 00:49:10,830 implies constant expected. 843 00:49:10,830 --> 00:49:13,320 They re-prove the totally random case 844 00:49:13,320 --> 00:49:15,880 and strengthen it, and analyze the independence they need. 845 00:49:18,690 --> 00:49:26,400 Let's just do totally random implies constant 846 00:49:26,400 --> 00:49:29,940 expected for linear probing. 847 00:49:29,940 --> 00:49:32,520 We obviously know how to do constant expected already 848 00:49:32,520 --> 00:49:33,840 with other fancy techniques. 849 00:49:33,840 --> 00:49:37,230 But linear probing seems really bad. 850 00:49:37,230 --> 00:49:41,580 Yet I claim, not so much. 851 00:49:41,580 --> 00:49:47,822 And we're going to assume m is at least 3 times n. 852 00:49:47,822 --> 00:49:49,530 That will just make the analysis cleaner. 853 00:49:49,530 --> 00:49:52,440 But it does hold for 1 plus epsilon. 854 00:49:52,440 --> 00:49:53,760 OK, so here's the idea. 855 00:49:53,760 --> 00:49:58,890 We're going to take our array, our hash table, it's an array.
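(Since the lecture calls simple tabulation hashing super simple to implement, here is a hedged sketch in Python: split the key into c characters, index one table of random values per character position, and XOR the results. The function and parameter names are illustrative assumptions, not from the lecture.)

import random

def make_simple_tabulation(c=4, char_bits=8, out_bits=32):
    # One lookup table of random values per character position;
    # this is the "bit of space" the lecture mentions.
    tables = [[random.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1

    def h(x):
        # Split the (c * char_bits)-bit key x into c characters,
        # look each character up in its own table, and XOR everything.
        out = 0
        for i in range(c):
            out ^= tables[i][(x >> (i * char_bits)) & mask]
        return out

    return h

(To use it with a table of size m, take h(x) % m, or make m a power of 2 and mask off the low bits.)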
856 00:50:02,700 --> 00:50:04,740 And build a binary tree on it because that's 857 00:50:04,740 --> 00:50:06,390 what we like to do. 858 00:50:06,390 --> 00:50:09,900 We do this every lecture pretty much. 859 00:50:09,900 --> 00:50:12,300 This is kind of like ordered file maintenance, I guess. 860 00:50:12,300 --> 00:50:14,400 This is just a conceptual tree. 861 00:50:14,400 --> 00:50:16,380 I mean, you're not even defining an algorithm 862 00:50:16,380 --> 00:50:18,630 based on this because the algorithm is linear probing. 863 00:50:18,630 --> 00:50:19,850 You go in somewhere. 864 00:50:19,850 --> 00:50:22,140 You hop, hop, hop, hop until you find a blank space. 865 00:50:22,140 --> 00:50:23,710 You put your item there. 866 00:50:23,710 --> 00:50:25,770 OK, but each of these nodes defines an interval 867 00:50:25,770 --> 00:50:28,320 in the array, as we know. 868 00:50:28,320 --> 00:50:38,880 So I'm going to call a node dangerous, 869 00:50:38,880 --> 00:50:43,350 essentially if its density is at least 2/3. 870 00:50:43,350 --> 00:50:46,920 But not in the literal sense because there's a little bit 871 00:50:46,920 --> 00:50:48,240 of a subtlety here. 872 00:50:48,240 --> 00:50:50,970 There's the location where a key wants to live, 873 00:50:50,970 --> 00:50:52,740 which is h of that key. 874 00:50:52,740 --> 00:50:56,790 And there's the location where it ended up living. 875 00:50:56,790 --> 00:50:59,400 I care more about the first one because that's 876 00:50:59,400 --> 00:51:00,690 what I understand. 877 00:51:00,690 --> 00:51:03,820 h of x, that's going to be nice. 878 00:51:03,820 --> 00:51:04,770 It's totally random. 879 00:51:04,770 --> 00:51:07,860 So h of x is random, independent of everything else. 880 00:51:07,860 --> 00:51:09,210 Great. 881 00:51:09,210 --> 00:51:12,190 Where x ends up being, that depends on other keys 882 00:51:12,190 --> 00:51:13,830 and it depends on this linear thing 883 00:51:13,830 --> 00:51:15,510 which I'm trying to understand. 884 00:51:15,510 --> 00:51:22,140 So I just want to talk about the number of keys 885 00:51:22,140 --> 00:51:34,935 that hash via h to the interval: a node is dangerous if that is at least 2/3 times 886 00:51:34,935 --> 00:51:36,060 the length of the interval. 887 00:51:38,544 --> 00:51:40,710 This is the number of slots that are actually there. 888 00:51:44,520 --> 00:51:47,100 We expect the number of keys that hash via h to the interval 889 00:51:47,100 --> 00:51:48,090 to be 1/3 of its length. 890 00:51:48,090 --> 00:51:51,240 So the expectation would be 1/3 times the length of the interval. 891 00:51:51,240 --> 00:51:53,850 It could happen to be 2/3; that happens 892 00:51:53,850 --> 00:51:55,770 with some probability, whatever. 893 00:51:55,770 --> 00:51:56,770 That's a dangerous node. 894 00:51:56,770 --> 00:51:59,100 That's the definition. 895 00:51:59,100 --> 00:52:01,752 Those ones we worry will be very expensive. 896 00:52:01,752 --> 00:52:03,960 And we worry that we're going to get super clustering 897 00:52:03,960 --> 00:52:06,560 and then get these giant runs, and so on. 898 00:52:30,100 --> 00:52:33,050 So, one thing I want to compute is 899 00:52:33,050 --> 00:52:37,010 what's the probability of this happening. 900 00:52:37,010 --> 00:52:40,550 The probability of a node being dangerous. 901 00:52:40,550 --> 00:52:42,800 Well, we can again use Chernoff bounds here 902 00:52:42,800 --> 00:52:45,170 because we're in a totally random situation.
903 00:52:45,170 --> 00:52:47,420 So this is the probability that the number 904 00:52:47,420 --> 00:52:49,220 of things that went there was bigger 905 00:52:49,220 --> 00:52:51,520 than twice the expectation. 906 00:52:51,520 --> 00:52:55,370 The expectation is 1/3 of the length; 2/3 is twice 1/3. 907 00:52:55,370 --> 00:52:59,580 So this is the probability that you're at least twice the mean, 908 00:52:59,580 --> 00:53:03,830 which by Chernoff is small. 909 00:53:03,830 --> 00:53:10,825 It comes out to e to the mu over 2 to the 2 mu. 910 00:53:16,460 --> 00:53:19,550 So this is e over 4 to the mu. 911 00:53:19,550 --> 00:53:20,820 You can check e. 912 00:53:20,820 --> 00:53:24,230 It's 2.71828. 913 00:53:24,230 --> 00:53:31,580 So this is less than 1, kind of roughly a half-ish. 914 00:53:31,580 --> 00:53:33,350 So this is good. 915 00:53:33,350 --> 00:53:36,635 This is something like 1 over 2 to the mu. 916 00:53:36,635 --> 00:53:37,340 What's mu? 917 00:53:45,260 --> 00:53:53,340 mu is 1/3 times 2 to the h for a height-h node. 918 00:53:53,340 --> 00:53:55,130 It depends on how high you are. 919 00:53:55,130 --> 00:53:58,980 If you're at a leaf h is 0, so you expect 1/3 of an element 920 00:53:58,980 --> 00:54:00,300 there. 921 00:54:00,300 --> 00:54:01,860 As you go up you expect more elements 922 00:54:01,860 --> 00:54:04,380 to hash there, of course. 923 00:54:04,380 --> 00:54:06,450 OK, so this gives us some measure 924 00:54:06,450 --> 00:54:09,180 in terms of this h of what's going on. 925 00:54:09,180 --> 00:54:11,550 But it's actually doubly exponential in h. 926 00:54:11,550 --> 00:54:13,340 So this is a very small probability. 927 00:54:13,340 --> 00:54:15,030 You go up a few levels. 928 00:54:15,030 --> 00:54:16,670 Like, after log log n levels it's 929 00:54:16,670 --> 00:54:20,040 a polynomially small probability of happening. 930 00:54:20,040 --> 00:54:22,700 Because then 2 to the log log n is log n. 931 00:54:22,700 --> 00:54:26,670 And then e over 4 to the log n is about 1 over n. 932 00:54:26,670 --> 00:54:27,170 OK. 933 00:54:29,930 --> 00:54:35,100 But at small levels this may happen, near the leaves. 934 00:54:35,100 --> 00:54:41,260 All right, so now I want to look at a run in the table. 935 00:54:41,260 --> 00:54:45,270 These are the things I have trouble thinking about 936 00:54:45,270 --> 00:54:49,500 because runs tend to get bigger, and we worry about them. 937 00:54:49,500 --> 00:54:52,620 This is now as items are actually stored in the table: 938 00:54:52,620 --> 00:54:55,950 when do I have a bunch of consecutive items in there 939 00:54:55,950 --> 00:54:59,745 that happen to end up in consecutive slots? 940 00:55:02,280 --> 00:55:05,640 So I'm worried about how long that run is. 941 00:55:05,640 --> 00:55:12,300 So let's look at its logarithm and round 942 00:55:12,300 --> 00:55:13,530 to the nearest power of 2. 943 00:55:13,530 --> 00:55:15,860 So let's say it has length about 2 to the l. 944 00:55:15,860 --> 00:55:17,740 Sorry, plus 1. 945 00:55:20,270 --> 00:55:23,890 All right, between 2 to the l and 2 to the l plus 1. 946 00:55:23,890 --> 00:55:28,830 OK, look at that. 947 00:55:28,830 --> 00:55:44,160 And it's spanned by some number of nodes of height h 948 00:55:44,160 --> 00:55:48,100 equals l minus 3. 949 00:55:48,100 --> 00:55:51,240 OK, so there's some interval that happens to be a run, 950 00:55:51,240 --> 00:55:54,810 meaning all of these slots are occupied.
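(Restating the board computation in LaTeX, using the Chernoff bound from last lecture with deviation factor 2:)

\[
\Pr[X \ge 2\mu] \;\le\; \frac{e^{\mu}}{2^{2\mu}} \;=\; \left(\frac{e}{4}\right)^{\mu},
\qquad
\mu = \frac{1}{3}\,2^{h} \text{ for a height-$h$ node,}
\]

so a node of height $h$ is dangerous with probability at most $(e/4)^{2^{h}/3}$, which is doubly exponentially small in $h$.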
951 00:55:54,810 --> 00:55:58,830 And that's 2 to the 2, I guess, since I 952 00:55:58,830 --> 00:56:00,550 got to level negative 1. 953 00:56:00,550 --> 00:56:02,910 A little hard to do in a small picture. 954 00:56:02,910 --> 00:56:04,800 But we're worried about when this is really 955 00:56:04,800 --> 00:56:08,020 big, more than some constant. 956 00:56:08,020 --> 00:56:11,460 OK, so let's suppose I was looking at this level. 957 00:56:11,460 --> 00:56:14,520 Then this interval is spanned, in particular, 958 00:56:14,520 --> 00:56:15,870 by these two nodes. 959 00:56:15,870 --> 00:56:18,210 Now it's a little sloppy because this node 960 00:56:18,210 --> 00:56:20,700 contains some non-interval, non-run stuff, 961 00:56:20,700 --> 00:56:22,860 and so does this one. 962 00:56:22,860 --> 00:56:26,750 At the next level down it would be this one, this one, 963 00:56:26,750 --> 00:56:28,730 and this one, which is a little more precise. 964 00:56:28,730 --> 00:56:31,680 But it's never going to be quite perfect. 965 00:56:31,680 --> 00:56:34,350 But just take all the nodes you need 966 00:56:34,350 --> 00:56:37,050 to completely cover the run. 967 00:56:37,050 --> 00:56:43,630 Then this will be at least eight nodes because the length is 968 00:56:43,630 --> 00:56:44,190 at least 2 to the l. 969 00:56:44,190 --> 00:56:47,640 We went three levels down, and 2 to the 3 is 8. 970 00:56:47,640 --> 00:56:51,840 So if it's perfectly aligned it will be exactly 8 nodes. 971 00:56:51,840 --> 00:56:56,700 In the worst case, it could be as much as 17. 972 00:56:56,700 --> 00:57:00,750 Because potentially, we're 2 to the l plus 1, 973 00:57:00,750 --> 00:57:03,170 which means we have 16 nodes if we're perfectly aligned. 974 00:57:03,170 --> 00:57:05,100 But then if you shift it over it might be 975 00:57:05,100 --> 00:57:07,750 one more because of the misalignment. 976 00:57:07,750 --> 00:57:10,630 OK, but some constant number of nodes. 977 00:57:10,630 --> 00:57:13,780 It's important that it's at least eight. 978 00:57:13,780 --> 00:57:15,592 That's what we need. 979 00:57:15,592 --> 00:57:17,550 Actually, we just need that it's at least five, 980 00:57:17,550 --> 00:57:22,860 but eight is the nearest power of two rounding up. 981 00:57:22,860 --> 00:57:23,700 Cool. 982 00:57:23,700 --> 00:57:25,500 So, there they are. 983 00:57:25,500 --> 00:57:30,750 Now, I want to look at the first four nodes 984 00:57:30,750 --> 00:57:32,850 of these eight to 17 nodes. 985 00:57:32,850 --> 00:57:35,220 So first meaning leftmost. 986 00:57:35,220 --> 00:57:38,107 Earliest in the run. 987 00:57:38,107 --> 00:57:39,690 So if you think about them, there's 988 00:57:39,690 --> 00:57:43,325 some four nodes, each of them spans some-- 989 00:57:43,325 --> 00:57:45,645 I should draw these properly. 990 00:57:51,090 --> 00:57:56,630 What we know is that these guys are entirely filled with items. 991 00:57:56,630 --> 00:57:58,410 The run occupies here. 992 00:57:58,410 --> 00:58:01,740 It's got to go at least one item into here, but the rest of this 993 00:58:01,740 --> 00:58:02,682 could be empty. 994 00:58:02,682 --> 00:58:04,390 And the interval keeps going to the right, 995 00:58:04,390 --> 00:58:06,600 so we know that all of these are completely 996 00:58:06,600 --> 00:58:09,360 filled with items somehow. 997 00:58:09,360 --> 00:58:13,500 So let's start with how many there are, I guess. 998 00:58:13,500 --> 00:58:26,940 They span more than three times 2 to the h slots of the run.
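(In symbols: a run of length between $2^{l}$ and $2^{l+1}$, covered by height-$(l-3)$ nodes each spanning $2^{l-3}$ slots, needs)

\[
\frac{2^{l}}{2^{l-3}} = 8 \;\le\; \#\text{nodes} \;\le\; \frac{2^{l+1}}{2^{l-3}} + 1 = 17,
\]

(where the extra 1 accounts for the run not being aligned to node boundaries.)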
999 00:58:26,940 --> 00:58:29,815 So somehow 3 times 2 to the h-- 1000 00:58:29,815 --> 00:58:32,190 because there's three of them that are completely filled; 1001 00:58:32,190 --> 00:58:35,490 if all four were, it would be 4 times 2 to the h. 1002 00:58:35,490 --> 00:58:38,490 Somehow 3 times 2 to the h items ended up here. 1003 00:58:38,490 --> 00:58:40,030 Now, how did they end up here? 1004 00:58:40,030 --> 00:58:44,910 Notice there's a blank space right here. 1005 00:58:44,910 --> 00:58:46,860 By definition this was the beginning of a run. 1006 00:58:46,860 --> 00:58:50,380 Meaning the previous slot is empty. 1007 00:58:50,380 --> 00:58:53,580 Which means all of the keys that wanted to live 1008 00:58:53,580 --> 00:58:57,480 from here to the left got to. 1009 00:58:57,480 --> 00:58:59,730 So if we're just thinking about the keys that ended up 1010 00:58:59,730 --> 00:59:03,840 in this interval, they had to initially hash to somewhere 1011 00:59:03,840 --> 00:59:05,160 in here. 1012 00:59:05,160 --> 00:59:08,010 h put them somewhere in this interval 1013 00:59:08,010 --> 00:59:10,290 and then they may have moved to the right, 1014 00:59:10,290 --> 00:59:12,820 but they never move to the left in linear probing 1015 00:59:12,820 --> 00:59:14,490 if you're not completely full. 1016 00:59:14,490 --> 00:59:17,370 So because there was a blank spot here none of these keys 1017 00:59:17,370 --> 00:59:23,310 could have fallen over into here, assuming no deletions. 1018 00:59:23,310 --> 00:59:25,925 So you're only doing insertions. 1019 00:59:25,925 --> 00:59:27,300 They may have just spread out, 1020 00:59:27,300 --> 00:59:28,950 and some of them may have gone farther to the right, 1021 00:59:28,950 --> 00:59:31,170 or they may have filled in gaps, whatever, but h 1022 00:59:31,170 --> 00:59:33,780 put them in this interval. 1023 00:59:33,780 --> 00:59:40,350 Now, I claim that in fact, at least one of these nodes 1024 00:59:40,350 --> 00:59:42,470 must be dangerous. 1025 00:59:42,470 --> 00:59:45,360 Now dangerous is tricky, because dangerous is talking 1026 00:59:45,360 --> 00:59:46,980 about where h puts keys. 1027 00:59:46,980 --> 00:59:51,180 But we just said there have got to be at least 3 times 2 to the h 1028 00:59:51,180 --> 00:59:55,620 keys where h put them within these four nodes, 1029 00:59:55,620 --> 00:59:58,170 otherwise they wouldn't have filled in here. 1030 00:59:58,170 --> 01:00:11,520 Now, if none of those nodes were dangerous, 1031 01:00:11,520 --> 01:00:15,120 then we'll get a contradiction. 1032 01:00:15,120 --> 01:00:17,610 Because if none of them were dangerous, 1033 01:00:17,610 --> 01:00:25,460 this means at most 4 times 2/3 times 2 1034 01:00:25,460 --> 01:00:32,430 to the h keys hash via h to them. 1035 01:00:37,991 --> 01:00:38,490 Why? 1036 01:00:38,490 --> 01:00:40,350 Because there's four of the nodes. 1037 01:00:40,350 --> 01:00:45,210 Each of them, if it's not dangerous, has at most 2/3 1038 01:00:45,210 --> 01:00:50,130 times its size keys hashing there. 1039 01:00:50,130 --> 01:00:58,670 4 times 2/3 is 8/3, which is less than 9/3, which is 3. 1040 01:00:58,670 --> 01:01:01,620 OK, so this would be a contradiction 1041 01:01:01,620 --> 01:01:05,070 because we just argued that at least 3 times 2 to the h keys 1042 01:01:05,070 --> 01:01:10,350 have to hash via h to somewhere in these nodes. 1043 01:01:10,350 --> 01:01:12,434 They might hash here and then have fallen over to here.
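(The arithmetic of the contradiction, written out:)

\[
4 \cdot \frac{2}{3} \cdot 2^{h} \;=\; \frac{8}{3}\,2^{h} \;<\; 3 \cdot 2^{h},
\]

(so if no node were dangerous, fewer keys would hash into the four nodes than the $3 \cdot 2^{h}$ needed to fill them.)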
1044 01:01:12,434 --> 01:01:14,724 So there is this issue that things can move to the right, 1045 01:01:14,724 --> 01:01:15,900 and we've got to worry about it. 1046 01:01:15,900 --> 01:01:19,650 But just look three levels up and it's OK. 1047 01:01:22,146 --> 01:01:24,270 So one of these nodes, not necessarily all of them, 1048 01:01:24,270 --> 01:01:25,314 is dangerous. 1049 01:01:33,380 --> 01:01:36,570 And we can use that to finish our analysis. 1050 01:01:54,690 --> 01:01:58,230 This is good news because it says 1051 01:01:58,230 --> 01:02:00,230 that if we have a run, which is something that's 1052 01:02:00,230 --> 01:02:02,500 hard to think about because 1053 01:02:02,500 --> 01:02:05,540 keys are moving around to form a run, 1054 01:02:05,540 --> 01:02:08,101 we can charge it to a dangerous node. 1055 01:02:08,101 --> 01:02:10,100 Which is easy to think about because that's just 1056 01:02:10,100 --> 01:02:16,190 talking about where keys hash via h, and h is totally random. 1057 01:02:16,190 --> 01:02:19,760 There's a loss of a factor of 17, potentially. 1058 01:02:19,760 --> 01:02:23,350 But it's a constant factor, no big deal. 1059 01:02:23,350 --> 01:02:28,520 If we look at the probability that a run, 1060 01:02:28,520 --> 01:02:37,130 say containing some key x, has length between 2 to the l 1061 01:02:37,130 --> 01:02:41,420 and 2 to the l plus 1, this is going 1062 01:02:41,420 --> 01:02:51,335 to be at most 17 times the probability that a node at height 1063 01:02:51,335 --> 01:02:54,650 l minus 3 is dangerous. 1064 01:02:58,970 --> 01:03:02,790 Because we know one of them is, and so just to be sloppy 1065 01:03:02,790 --> 01:03:05,881 it's at most the sum of the probabilities that any of them 1066 01:03:05,881 --> 01:03:06,380 is. 1067 01:03:06,380 --> 01:03:08,930 Then potentially there's a run of that length. 1068 01:03:08,930 --> 01:03:12,020 And so by a union bound it's at most 17 times the probability 1069 01:03:12,020 --> 01:03:12,830 of this happening. 1070 01:03:12,830 --> 01:03:14,750 Now all nodes look the same because we have 1071 01:03:14,750 --> 01:03:16,940 a totally random hash function. 1072 01:03:16,940 --> 01:03:20,260 So we just say any node at height l minus 3. 1073 01:03:20,260 --> 01:03:22,820 We already computed that probability. 1074 01:03:22,820 --> 01:03:23,630 That was this. 1075 01:03:23,630 --> 01:03:27,980 The probability of being dangerous was e over 4 to the 1/3 times 2 1076 01:03:27,980 --> 01:03:29,810 to the h. 1077 01:03:29,810 --> 01:03:39,250 So this is going to be at most 17 times e over 4 to the 1/3 times 2 1078 01:03:39,250 --> 01:03:42,150 to the l minus 3 power. 1079 01:03:42,150 --> 01:03:48,410 Again, doubly exponential in l. 1080 01:03:48,410 --> 01:03:57,890 So if we want to compute the expected run length 1081 01:03:57,890 --> 01:04:01,280 we can just expand out the definition. 1082 01:04:01,280 --> 01:04:05,765 Well, let's round it to powers of 2. 1083 01:04:05,765 --> 01:04:07,850 It could be the run length is about 2 1084 01:04:07,850 --> 01:04:11,360 to the l, within a constant factor of 2 to the l. 1085 01:04:11,360 --> 01:04:16,474 So it's going to be that times this probability, summed over l. 1086 01:04:21,420 --> 01:04:25,640 But this thing is basically 1 over 2 to the 2 to the l. 1087 01:04:25,640 --> 01:04:27,425 And so the whole thing is constant. 1088 01:04:31,640 --> 01:04:32,579 This is l. 1089 01:04:32,579 --> 01:04:33,870 I mean, l could go to infinity. 1090 01:04:33,870 --> 01:04:34,703 I don't really care.
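(The sum, written out; the constant 17 and the exponent come from the bound just derived:)

\[
\mathbb{E}[\text{run length containing } x]
\;\le\; \sum_{l \ge 0} 2^{l+1} \cdot 17 \left(\frac{e}{4}\right)^{\frac{1}{3}\,2^{l-3}}
\;=\; O(1),
\]

(since the doubly exponential decay dwarfs the $2^{l}$ factor.)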
1091 01:04:37,320 --> 01:04:40,210 I mean, this gets dwarfed by the double exponential. 1092 01:04:40,210 --> 01:04:42,090 This is super geometric. 1093 01:04:42,090 --> 01:04:46,470 So a very low probability of getting long runs. 1094 01:04:46,470 --> 01:04:51,210 As we said, after a log log n size-- 1095 01:04:51,210 --> 01:04:54,690 yeah, it's very unlikely to have a run longer than log n. 1096 01:04:54,690 --> 01:04:57,619 We proved that in particular. 1097 01:04:57,619 --> 01:04:59,910 But in particular, if you compute the expected run length, 1098 01:04:59,910 --> 01:05:02,800 it's constant. 1099 01:05:02,800 --> 01:05:07,010 OK, now this of course assumed totally random. 1100 01:05:07,010 --> 01:05:09,300 It's harder to prove-- 1101 01:05:09,300 --> 01:05:11,790 where were we. 1102 01:05:11,790 --> 01:05:12,660 Somewhere. 1103 01:05:12,660 --> 01:05:14,420 Linear probing. 1104 01:05:14,420 --> 01:05:16,680 It's harder to prove five-wise independence is enough, 1105 01:05:16,680 --> 01:05:17,810 but it's true. 1106 01:05:17,810 --> 01:05:21,090 And it's much harder to prove simple tabulation 1107 01:05:21,090 --> 01:05:22,840 hashing works, but it's true. 1108 01:05:22,840 --> 01:05:24,210 So we can use them. 1109 01:05:24,210 --> 01:05:26,610 This gives you some intuition for why it's really not 1110 01:05:26,610 --> 01:05:28,129 that bad. 1111 01:05:28,129 --> 01:05:29,670 And similar proof techniques are used 1112 01:05:29,670 --> 01:05:33,570 for the five-wise independence. 1113 01:05:33,570 --> 01:05:34,680 Other fun facts. 1114 01:05:34,680 --> 01:05:40,540 You can do a similar caching trick to what we did before. 1115 01:05:40,540 --> 01:05:45,510 Again, the worst run is going to be log, or log over log log. 1116 01:05:45,510 --> 01:05:48,030 I don't have it written here. 1117 01:05:48,030 --> 01:05:52,680 But if you cache the last-- 1118 01:05:52,680 --> 01:05:55,320 it's not quite enough to cache the last log n. 1119 01:05:55,320 --> 01:06:06,852 But if you cache the last log to the 1 plus epsilon n queries, 1120 01:06:06,852 --> 01:06:09,390 which is a little bit more, 1121 01:06:09,390 --> 01:06:11,440 then you can generalize this argument. 1122 01:06:11,440 --> 01:06:16,230 And so at least for totally random hash functions 1123 01:06:16,230 --> 01:06:20,230 you get constant amortized with high probability. 1124 01:06:26,120 --> 01:06:29,690 It's this weird bound that I've never seen before. 1125 01:06:29,690 --> 01:06:34,280 But it's comforting because the expected bounds are not 1126 01:06:34,280 --> 01:06:36,590 so great, but you get a with-high-probability bound 1127 01:06:36,590 --> 01:06:40,040 as long as you're willing to average over log to the 1 1128 01:06:40,040 --> 01:06:41,785 plus epsilon n different queries. 1129 01:06:41,785 --> 01:06:43,160 As long as you can remember them. 1130 01:06:46,880 --> 01:06:49,730 And the proof is basically the same. 1131 01:06:49,730 --> 01:06:52,610 Except now instead of looking at the length of a run 1132 01:06:52,610 --> 01:06:55,220 containing x, you're looking at the length of the run 1133 01:06:55,220 --> 01:06:59,810 containing one of these log to the 1 plus epsilon n keys. 1134 01:06:59,810 --> 01:07:01,520 That's your batch. 1135 01:07:01,520 --> 01:07:03,890 And you do the same thing. 1136 01:07:03,890 --> 01:07:07,140 But now do it with high probability analysis.
1137 01:07:07,140 --> 01:07:09,860 But again, because the expectation is now 1138 01:07:09,860 --> 01:07:13,250 bigger than log, you expect there to be 1139 01:07:13,250 --> 01:07:15,470 a lot of fairly long runs here. 1140 01:07:15,470 --> 01:07:18,440 But that's OK, because the average is good. 1141 01:07:21,696 --> 01:07:23,820 You expect to pay log to the 1 plus epsilon n for log 1142 01:07:23,820 --> 01:07:25,790 to the 1 plus epsilon n queries. 1143 01:07:25,790 --> 01:07:30,860 And so then you divide and amortize and you're done. 1144 01:07:30,860 --> 01:07:33,490 There are a few more details in the notes about 1145 01:07:33,490 --> 01:07:36,020 that if you want to read them. 1146 01:07:36,020 --> 01:07:40,390 I want to do one more topic, unless there are 1147 01:07:40,390 --> 01:07:43,700 questions about linear probing. 1148 01:07:43,700 --> 01:07:44,957 So, yeah? 1149 01:07:44,957 --> 01:07:49,727 AUDIENCE: So, could you motivate why the [INAUDIBLE] value of mu 1150 01:07:49,727 --> 01:07:52,600 is the mean for whatever quantity? 1151 01:07:52,600 --> 01:07:55,250 ERIK DEMAINE: So mu is defined to be the mean of whatever 1152 01:07:55,250 --> 01:07:56,780 quantity we're analyzing. 1153 01:07:56,780 --> 01:08:00,410 And the Chernoff bound says the probability 1154 01:08:00,410 --> 01:08:03,320 that you're at least something times the mean is 1155 01:08:03,320 --> 01:08:05,250 the formula we wrote last time. 1156 01:08:05,250 --> 01:08:09,270 Now here, we're measuring-- 1157 01:08:09,270 --> 01:08:11,652 I didn't write what the left-hand side was. 1158 01:08:11,652 --> 01:08:13,610 But here we're measuring what's the probability 1159 01:08:13,610 --> 01:08:16,370 that the number of keys that hash via h to the interval 1160 01:08:16,370 --> 01:08:19,050 is at least 2/3 the length of the interval. 1161 01:08:19,050 --> 01:08:25,760 Now, let's say m equals 3n. Then the expected number of keys 1162 01:08:25,760 --> 01:08:28,590 that hash via h to the interval is 1/3 times the length 1163 01:08:28,590 --> 01:08:29,899 of the interval. 1164 01:08:29,899 --> 01:08:32,149 Because we have a totally random thing, 1165 01:08:32,149 --> 01:08:35,420 and we have a density of 1/3 overall. 1166 01:08:35,420 --> 01:08:38,270 So you expect there to be 1/3 and so 1167 01:08:38,270 --> 01:08:42,380 dangerous is when you're at least twice that. 1168 01:08:42,380 --> 01:08:44,000 And so it's twice mu. 1169 01:08:44,000 --> 01:08:46,470 Mu is, in this case, 1/3 the length of the interval. 1170 01:08:46,470 --> 01:08:48,482 And that's why I wrote that. 1171 01:08:48,482 --> 01:08:50,294 AUDIENCE: So this comes from the m squared. 1172 01:08:50,294 --> 01:08:50,750 [INAUDIBLE] 1173 01:08:50,750 --> 01:08:52,208 ERIK DEMAINE: Yeah, it comes from m 1174 01:08:52,208 --> 01:08:54,482 equals 3n and totally random. 1175 01:08:54,482 --> 01:08:58,180 AUDIENCE: [INAUDIBLE] 1176 01:08:58,180 --> 01:09:00,720 ERIK DEMAINE: Yeah, OK, let's make this equal. 1177 01:09:00,720 --> 01:09:03,600 Make this more formal. 1178 01:09:03,600 --> 01:09:07,359 It's an assumption, anyway, to simplify the proof. 1179 01:09:07,359 --> 01:09:08,481 Good. 1180 01:09:08,481 --> 01:09:09,689 I'll change that in the notes too. 1181 01:09:16,090 --> 01:09:16,590 Cool. 1182 01:09:16,590 --> 01:09:18,300 So then the expectation is exactly 1183 01:09:18,300 --> 01:09:19,979 1/3 instead of at most 1/3. 1184 01:09:19,979 --> 01:09:21,330 So it's all a little cleaner.
1185 01:09:21,330 --> 01:09:24,420 Of course, this all works when m is at least 1 1186 01:09:24,420 --> 01:09:26,580 plus epsilon times n, but then you 1187 01:09:26,580 --> 01:09:28,590 get a dependence on epsilon. 1188 01:09:28,590 --> 01:09:32,189 Other questions? 1189 01:09:32,189 --> 01:09:36,806 So the bottom line is linear probing is actually good. 1190 01:09:36,806 --> 01:09:39,180 Quadratic probing, double hashing, all those fancy things 1191 01:09:39,180 --> 01:09:40,859 are also good. 1192 01:09:40,859 --> 01:09:42,990 But they're really tuned for the case 1193 01:09:42,990 --> 01:09:44,399 when your table is almost full. 1194 01:09:44,399 --> 01:09:46,210 They get a better dependence on epsilon, 1195 01:09:46,210 --> 01:09:49,800 which is how close to full you are. 1196 01:09:49,800 --> 01:09:53,040 And so if you're a constant factor away from the space bound, 1197 01:09:53,040 --> 01:09:54,660 linear probing is just fine. 1198 01:09:54,660 --> 01:09:57,600 As long as you have enough independence, admittedly. 1199 01:09:57,600 --> 01:10:01,080 Double hashing, I believe, gets around that. 1200 01:10:01,080 --> 01:10:07,030 It does not need so much independence. 1201 01:10:07,030 --> 01:10:08,640 OK. 1202 01:10:08,640 --> 01:10:10,481 Instead of going to double hashing, 1203 01:10:10,481 --> 01:10:12,980 I'm going to go to something kind of related to double hashing, 1204 01:10:12,980 --> 01:10:13,980 which is cuckoo hashing. 1205 01:10:25,340 --> 01:10:29,070 Cuckoo hashing is a weird idea. 1206 01:10:29,070 --> 01:10:32,870 It's kind of a more extreme form of perfect hashing. 1207 01:10:32,870 --> 01:10:41,420 It says, look, perfect hashing did two hash queries. 1208 01:10:41,420 --> 01:10:45,620 So I did one hash evaluation and another hash evaluation 1209 01:10:45,620 --> 01:10:48,680 followed it, which is OK. 1210 01:10:51,770 --> 01:10:57,560 But again, I want my queries to only do two things, two probes. 1211 01:10:57,560 --> 01:11:07,090 So it's going to take that concept of just two 1212 01:11:07,090 --> 01:11:09,460 and actually use two hash tables. 1213 01:11:09,460 --> 01:11:14,080 So you've got B over here, I've got A over here. 1214 01:11:19,270 --> 01:11:23,740 And if you have a key x, you hash it 1215 01:11:23,740 --> 01:11:27,300 to a particular spot in A via g, and you hash it 1216 01:11:27,300 --> 01:11:31,630 to a particular spot in B via h. So you have two hash tables, 1217 01:11:31,630 --> 01:11:32,680 two hash functions. 1218 01:11:42,990 --> 01:11:50,880 To do a query you look at A of g of x, 1219 01:11:50,880 --> 01:11:55,890 and you look at B of h of x. 1220 01:11:55,890 --> 01:11:57,860 Oh sorry, I forgot to mention. 1221 01:11:57,860 --> 01:11:59,610 The other great thing about linear probing 1222 01:11:59,610 --> 01:12:01,740 is that its cache performance is so great. 1223 01:12:01,740 --> 01:12:04,440 This is why it runs so fast in practice. 1224 01:12:04,440 --> 01:12:06,510 Why it's only 10% slower than a memory access. 1225 01:12:06,510 --> 01:12:09,810 Because once you access a single slot, 1226 01:12:09,810 --> 01:12:13,230 you get a whole block of B slots in cache, with block size B. 1227 01:12:13,230 --> 01:12:17,990 So most of the time, because your runs are very short, 1228 01:12:17,990 --> 01:12:20,310 you will find your answer immediately. 1229 01:12:20,310 --> 01:12:22,300 So that's why we kind of prefer linear probing 1230 01:12:22,300 --> 01:12:24,050 in practice over all the other schemes I'm 1231 01:12:24,050 --> 01:12:26,010 going to talk about.
1232 01:12:26,010 --> 01:12:28,230 Well, cuckoo hashing is all right 1233 01:12:28,230 --> 01:12:31,740 because it's only going to look at two places and that's it. 1234 01:12:31,740 --> 01:12:33,330 Doesn't go anywhere else. 1235 01:12:36,980 --> 01:12:41,172 I guess with perfect hashing the thing is you have 1236 01:12:41,172 --> 01:12:42,380 more than two hash functions. 1237 01:12:42,380 --> 01:12:43,730 You have the first hash function which 1238 01:12:43,730 --> 01:12:44,938 sends you to the first table. 1239 01:12:44,938 --> 01:12:46,940 Then you look up a second hash function. 1240 01:12:46,940 --> 01:12:51,012 Using that hash function you rehash your value x. 1241 01:12:51,012 --> 01:12:52,970 The downside of that is you can't compute those two 1242 01:12:52,970 --> 01:12:54,810 hash functions in parallel. 1243 01:12:54,810 --> 01:12:57,020 Here, if you have, like, two cores, you could 1244 01:12:57,020 --> 01:12:59,030 compute these two in parallel, look them 1245 01:12:59,030 --> 01:13:00,500 both up simultaneously. 1246 01:13:00,500 --> 01:13:02,720 So in that sense you save a factor of 2 1247 01:13:02,720 --> 01:13:03,931 with some parallelism. 1248 01:13:07,160 --> 01:13:12,380 Now, the weird thing is the way we do an insertion. 1249 01:13:12,380 --> 01:13:22,950 You try to put it in the A slot, or the B slot. 1250 01:13:22,950 --> 01:13:26,510 If either of them is empty you're golden. 1251 01:13:26,510 --> 01:13:28,010 If neither of them is empty, you've 1252 01:13:28,010 --> 01:13:31,400 got to kick out whoever's there. 1253 01:13:31,400 --> 01:13:41,030 So let's say you kicked out y from its A slot. 1254 01:13:44,360 --> 01:13:47,660 So we ended up putting x in this one, 1255 01:13:47,660 --> 01:13:52,160 so we end up kicking y from wherever it belonged. 1256 01:13:52,160 --> 01:13:59,750 Then you move it to B of h of y. 1257 01:13:59,750 --> 01:14:02,210 There's only one other place that that item can go, 1258 01:14:02,210 --> 01:14:05,060 so you put it there instead. 1259 01:14:05,060 --> 01:14:11,900 In general, think about a key: it has two places it can go. 1260 01:14:11,900 --> 01:14:13,670 There's some slot in A, some slot in B. 1261 01:14:13,670 --> 01:14:17,040 You can think of this as an edge in a bipartite graph. 1262 01:14:17,040 --> 01:14:19,760 So make vertices for the A slots, 1263 01:14:19,760 --> 01:14:22,190 vertices for the B slots. 1264 01:14:22,190 --> 01:14:25,732 Each edge is an item, a key. 1265 01:14:25,732 --> 01:14:28,880 A key can only live in one spot in A and one spot in B 1266 01:14:28,880 --> 01:14:31,890 for this query to work. 1267 01:14:31,890 --> 01:14:34,820 So what's happening is if both of these 1268 01:14:34,820 --> 01:14:37,880 are full you take whoever is currently here 1269 01:14:37,880 --> 01:14:41,510 and put them over in their corresponding slot 1270 01:14:41,510 --> 01:14:43,360 over in B. Now, that one might be full, 1271 01:14:43,360 --> 01:14:45,484 which means you've got to kick that guy to wherever 1272 01:14:45,484 --> 01:14:47,930 he belongs in A, and so on. 1273 01:14:47,930 --> 01:14:51,380 If eventually you find an empty slot, great, you're done. 1274 01:14:51,380 --> 01:14:55,040 It's just a chain reaction of cuckoo steps 1275 01:14:55,040 --> 01:14:57,680 where the bird keeps getting kicked out, going from A 1276 01:14:57,680 --> 01:15:01,010 to B and vice versa. 1277 01:15:01,010 --> 01:15:03,020 If it terminates, you're happy.
1278 01:15:03,020 --> 01:15:05,570 If it doesn't terminate, you're in trouble 1279 01:15:05,570 --> 01:15:09,890 because you might get a cycle, or a few failure situations. 1280 01:15:09,890 --> 01:15:11,240 In that case you're screwed. 1281 01:15:11,240 --> 01:15:13,100 There is no cuckoo hash table that 1282 01:15:13,100 --> 01:15:14,690 works for your set of keys. 1283 01:15:14,690 --> 01:15:16,820 In that case, you pick another hash function, 1284 01:15:16,820 --> 01:15:19,160 rebuild from scratch. 1285 01:15:19,160 --> 01:15:20,840 So it's kind of a weird hashing scheme 1286 01:15:20,840 --> 01:15:24,470 because it can fail catastrophically. 1287 01:15:24,470 --> 01:15:26,510 Fortunately, it doesn't happen too often. 1288 01:15:33,640 --> 01:15:35,810 It still rubs me in a funny way. 1289 01:15:35,810 --> 01:15:37,910 I don't know what to say about it. 1290 01:15:41,450 --> 01:15:44,390 OK, so you lose a factor of 2 in space. 1291 01:15:47,910 --> 01:15:50,360 2 deterministic probes for a query. 1292 01:15:50,360 --> 01:15:53,225 That's good news. 1293 01:15:58,850 --> 01:16:01,760 All right, now we get to, what about updates? 1294 01:16:01,760 --> 01:16:15,510 So if it's fully random or log n-wise independent, 1295 01:16:15,510 --> 01:16:22,110 then you get a constant expected update, which is what we want. 1296 01:16:22,110 --> 01:16:23,520 Even with the rebuilding cost. 1297 01:16:23,520 --> 01:16:28,290 So you'll have to rebuild about every n squared insertions 1298 01:16:28,290 --> 01:16:30,150 you do. 1299 01:16:30,150 --> 01:16:35,670 The way they say this is there's a 1 over n 1300 01:16:35,670 --> 01:16:37,883 build failure probability. 1301 01:16:42,050 --> 01:16:44,530 There's a 1 over n chance that your key set will 1302 01:16:44,530 --> 01:16:47,856 be completely unsustainable. 1303 01:16:47,856 --> 01:16:50,230 If you want to put all n keys into this table there's a 1 1304 01:16:50,230 --> 01:16:52,750 over n chance that it will be impossible 1305 01:16:52,750 --> 01:16:54,430 and then you have to start over. 1306 01:16:54,430 --> 01:16:58,360 So amortized per insertion, that's about 1 over n squared: 1307 01:16:58,360 --> 01:17:00,960 about n squared insertions you can do before the whole thing falls apart 1308 01:17:00,960 --> 01:17:03,130 and you have to rebuild. 1309 01:17:03,130 --> 01:17:04,990 So this should definitely 1310 01:17:04,990 --> 01:17:08,120 be amortized expected, I guess. 1311 01:17:08,120 --> 01:17:10,780 However you want to think about it. 1312 01:17:10,780 --> 01:17:14,920 But it's another way to do constant amortized expected. 1313 01:17:14,920 --> 01:17:15,940 Cool. 1314 01:17:15,940 --> 01:17:22,000 The other thing that's known is that six-wise independence 1315 01:17:22,000 --> 01:17:24,190 is not enough. 1316 01:17:24,190 --> 01:17:26,650 This was actually a project in this class, 1317 01:17:26,650 --> 01:17:30,270 I believe the first time it was offered in 2003. 1318 01:17:30,270 --> 01:17:33,850 Six-wise independence is not sufficient to get 1319 01:17:33,850 --> 01:17:35,560 a constant expected bound. 1320 01:17:38,110 --> 01:17:41,770 It will actually fail with high probability 1321 01:17:41,770 --> 01:17:43,510 if you only have six-wise independence. 1322 01:17:43,510 --> 01:17:46,780 What's not known is, do you need constant independence? 1323 01:17:46,780 --> 01:17:47,850 Or log n independence? 1324 01:17:47,850 --> 01:17:50,830 With log n, very low failure probability. 1325 01:17:50,830 --> 01:17:54,250 With six-wise, high probability you fail.
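(Here is a minimal sketch of cuckoo hashing in Python under those rules. The seeded hashes, the roughly O(log n) kick cutoff before declaring failure, and all names are illustrative assumptions, not the lecture's exact choices; it also simplifies by always trying the A slot first rather than checking both for an empty one.)

import random

class CuckooTable:
    def __init__(self, n):
        self.size = 2 * n                              # factor-2 space per table
        self.max_kicks = 6 * max(1, n.bit_length())    # ~c log n cutoff (assumed)
        self.A = [None] * self.size
        self.B = [None] * self.size
        self.keys = []
        self._new_functions()

    def _new_functions(self):
        # Seeded hashes as stand-ins for the strong hash functions required.
        self.sa = random.getrandbits(64)
        self.sb = random.getrandbits(64)

    def _g(self, x):                                   # slot in A
        return hash((self.sa, x)) % self.size

    def _h(self, x):                                   # slot in B
        return hash((self.sb, x)) % self.size

    def contains(self, x):
        # Exactly two deterministic probes, as in the lecture.
        return self.A[self._g(x)] == x or self.B[self._h(x)] == x

    def insert(self, x):
        if self.contains(x):
            return
        self.keys.append(x)
        cur, in_a = x, True
        for _ in range(self.max_kicks):
            if in_a:
                i = self._g(cur)
                cur, self.A[i] = self.A[i], cur        # place cur, evict occupant
            else:
                i = self._h(cur)
                cur, self.B[i] = self.B[i], cur
            if cur is None:
                return                                 # the cuckoo chain terminated
            in_a = not in_a                            # evicted key tries its other table
        # Chain too long: treat it as a failure, pick new functions,
        # and rebuild from scratch (could in principle recurse again).
        self._rebuild()

    def _rebuild(self):
        self._new_functions()
        self.A = [None] * self.size
        self.B = [None] * self.size
        keys, self.keys = self.keys, []
        for k in keys:
            self.insert(k)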
1326 01:17:54,250 --> 01:17:58,711 Like, you fail with probability 1 minus 1 over n. 1327 01:17:58,711 --> 01:17:59,210 Not so good. 1328 01:18:04,530 --> 01:18:07,610 Some good news is simple tabulation hashing. 1329 01:18:13,230 --> 01:18:23,060 It means you will fail to build with probability not 1 over n, 1330 01:18:23,060 --> 01:18:26,000 but 1 over n to the 1/3 power. 1331 01:18:30,340 --> 01:18:31,190 And this is theta. 1332 01:18:31,190 --> 01:18:33,470 This is tight. 1333 01:18:33,470 --> 01:18:34,850 It's almost as good as this. 1334 01:18:34,850 --> 01:18:36,470 We really only need constant here. 1335 01:18:36,470 --> 01:18:38,940 This is to build the entire table. 1336 01:18:38,940 --> 01:18:41,400 So in this case you can insert like n to the 4/3 1337 01:18:41,400 --> 01:18:43,910 items before your table self-destructs. 1338 01:18:43,910 --> 01:18:47,940 So simple tabulation hashing is, again, pretty good. 1339 01:18:47,940 --> 01:18:51,310 That's, I think, the hardest result 1340 01:18:51,310 --> 01:18:53,000 in this paper from last year. 1341 01:18:59,930 --> 01:19:03,170 So I do have a proof of this one. 1342 01:19:07,107 --> 01:19:07,940 Something like that. 1343 01:19:07,940 --> 01:19:09,710 Or part of a proof. 1344 01:19:09,710 --> 01:19:13,130 So let me give you a rough idea of how this works. 1345 01:19:13,130 --> 01:19:18,620 So suppose you have a fully random hash function. 1346 01:19:18,620 --> 01:19:23,630 The main concern is, what if this path is really long? 1347 01:19:23,630 --> 01:19:36,760 I claim that if an insert follows a path of length k, 1348 01:19:36,760 --> 01:19:41,840 the probability of this happening 1349 01:19:41,840 --> 01:19:44,320 is actually at most 1 over 2 to the k. 1350 01:19:44,320 --> 01:19:45,400 It's very small. 1351 01:19:45,400 --> 01:19:47,130 Exponentially small in k. 1352 01:19:50,590 --> 01:19:53,170 I just want to sketch how this works because it's 1353 01:19:53,170 --> 01:19:58,630 a cool argument that's actually in this simple tabulation 1354 01:19:58,630 --> 01:19:59,802 paper. 1355 01:19:59,802 --> 01:20:01,010 So the idea is the following. 1356 01:20:01,010 --> 01:20:04,760 You have some really long path. 1357 01:20:04,760 --> 01:20:07,870 What I'm going to give you is a way 1358 01:20:07,870 --> 01:20:13,450 to encode the hash functions. 1359 01:20:13,450 --> 01:20:16,300 There are hash functions g and h. 1360 01:20:16,300 --> 01:20:20,380 Each of them has n values. 1361 01:20:20,380 --> 01:20:24,730 Each of those values is log m bits. 1362 01:20:24,730 --> 01:20:27,300 So if I just wrote them down the obvious way, 1363 01:20:27,300 --> 01:20:31,840 it's 2 n log m bits to write down those hash functions. 1364 01:20:31,840 --> 01:20:34,300 Now we're assuming these are totally random hash 1365 01:20:34,300 --> 01:20:36,955 functions, which means you need this many bits. 1366 01:20:36,955 --> 01:20:41,020 But I claim that if you follow a path of length k, 1367 01:20:41,020 --> 01:20:43,690 I can find a new encoding scheme, a way 1368 01:20:43,690 --> 01:20:48,950 to write down g and h, that is basically minus k. 1369 01:20:48,950 --> 01:20:50,920 This many bits minus k. 1370 01:20:50,920 --> 01:20:52,750 I get to save k bits. 1371 01:20:52,750 --> 01:20:54,730 Now, it turns out that can happen 1372 01:20:54,730 --> 01:20:57,480 but it happens only with probability 1 over 2 to the k. 1373 01:20:57,480 --> 01:20:59,920 This is an information theoretic argument. 1374 01:20:59,920 --> 01:21:01,480 You might get lucky.
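(The information-theoretic step can be stated as follows; the variable $b$ here is my shorthand for the total number of random bits, an assumption for exposition. If $(g,h)$ is uniform over $2^{b}$ equally likely outcomes, a fixed injective encoding using $b - k$ bits can cover at most $2^{b-k}$ of them, so)

\[
\Pr\big[(g,h) \text{ is encodable in } b - k \text{ bits}\big]
\;\le\; \frac{2^{\,b-k}}{2^{\,b}} \;=\; 2^{-k}.
\]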
1375 01:21:01,480 --> 01:21:04,690 And the g's and h's you're trying to encode 1376 01:21:04,690 --> 01:21:07,851 can be done with fewer bits, k fewer bits. 1377 01:21:07,851 --> 01:21:10,350 But that will only happen with probability 1 over 2 to the k 1378 01:21:10,350 --> 01:21:12,580 if g and h are totally random. 1379 01:21:12,580 --> 01:21:14,040 So how do you do it? 1380 01:21:14,040 --> 01:21:17,740 Basically, I want to encode the things 1381 01:21:17,740 --> 01:21:20,200 on the path slightly cheaper. 1382 01:21:20,200 --> 01:21:24,350 I'm going to save one bit per node on the path. 1383 01:21:24,350 --> 01:21:26,450 So what do I need to do? 1384 01:21:26,450 --> 01:21:33,220 Well, the idea is, OK, I will start by writing down 1385 01:21:33,220 --> 01:21:34,730 this hash value. 1386 01:21:34,730 --> 01:21:39,549 This takes log m bits to write down that hash value. 1387 01:21:39,549 --> 01:21:41,090 Then I'll write down this hash value. 1388 01:21:41,090 --> 01:21:42,690 That takes another log m bits. 1389 01:21:42,690 --> 01:21:46,150 Generally it's going to be roughly k log m bits to write down 1390 01:21:46,150 --> 01:21:48,520 all of the node hash values. 1391 01:21:48,520 --> 01:21:50,490 Then I need to say that it's actually 1392 01:21:50,490 --> 01:21:54,940 x, this particular key, that corresponds to this edge. 1393 01:21:54,940 --> 01:21:56,680 So I've got to write that down. 1394 01:21:56,680 --> 01:21:59,170 That's going to take log n bits 1395 01:21:59,170 --> 01:22:01,390 to say that x is the guy for the first edge, 1396 01:22:01,390 --> 01:22:04,215 then y is the key that corresponds to the second edge 1397 01:22:04,215 --> 01:22:06,190 of the path, then z, then w. 1398 01:22:06,190 --> 01:22:08,680 But nicely, things are ordered here. 1399 01:22:08,680 --> 01:22:11,800 So it only takes me log n each, k log 1400 01:22:11,800 --> 01:22:14,420 n total, to write down all these guys. 1401 01:22:14,420 --> 01:22:19,540 So I get k times log m plus log n. 1402 01:22:19,540 --> 01:22:37,525 Now if m is 2 times n, this is k times 2 log m minus 1. 1403 01:22:37,525 --> 01:22:43,130 So I get one bit of savings per thing in the path. 1404 01:22:43,130 --> 01:22:46,460 Essentially because it's easier for me 1405 01:22:46,460 --> 01:22:48,090 to write down these labels to say, 1406 01:22:48,090 --> 01:22:50,777 oh, it's the key x that's going here. 1407 01:22:50,777 --> 01:22:53,360 Instead of having to write down slot names all the time, which 1408 01:22:53,360 --> 01:22:56,360 costs log m bits, writing down key names only 1409 01:22:56,360 --> 01:22:58,310 takes log n bits, which is a savings 1410 01:22:58,310 --> 01:23:02,000 of 1 bit per thing on the path. 1411 01:23:02,000 --> 01:23:05,120 And so that was a quick sketch of how this proof goes. 1412 01:23:05,120 --> 01:23:07,600 It's kind of neat, an information theoretic argument 1413 01:23:07,600 --> 01:23:08,960 for why the paths can't get long. 1414 01:23:08,960 --> 01:23:12,350 You then have to worry about cycles and things that 1415 01:23:12,350 --> 01:23:14,170 look like this. 1416 01:23:14,170 --> 01:23:15,500 That's kind of messy. 1417 01:23:15,500 --> 01:23:18,170 But the same kind of argument generalizes. 1418 01:23:18,170 --> 01:23:23,110 So that was your quick overview of lots of hashing stuff.