The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIK DEMAINE: All right, let's get started. Today we're going to continue the theme of randomization and data structures. Last time we saw skip lists. Skip lists solve the predecessor-successor problem: you can search for an item, and if it's not there, you get the closest item on either side, in log n time with high probability. But we already knew how to do that deterministically. Today we're going to solve a slightly different problem, the dictionary problem, with hash tables. Something you already think you know. But we're going to show you how much you didn't know. After today, though, you will know. And we're going to get constant time, not with high probability (that's hard), but constant expected time. So in some sense that's better. It's going to solve a weaker problem.
But we're going to get a tighter bound: constant instead of logarithmic. So for starters, let me remind you what problem we're solving and the basics of hashing, which you learned in 6.006. I'm going to give this problem a name, because it's important, and we often forget to distinguish between two types of things. This is kind of an old term, but I would call this an abstract data type. This is just the problem specification of what you're trying to do; you might call it an interface or something. It's the problem statement, versus the data structure, which is how you actually solve it. Hash tables are the data structure; the dictionary is the problem, or the abstract data type. So what we're trying to do today, as in most data structures, is maintain a dynamic set of items. And here I'm going to distinguish between the items and their keys. Each item has a key. And normally you'd also think of there being a value, like in Python. But we're just worrying about the keys, and moving the items around. And we want to support three operations.
We want to be able to insert an item, delete an item, and search for an item. But search is going to be different from what we know from AVL trees or skip lists or even van Emde Boas. That was a predecessor-successor search. Here we just want to know... sorry, you're not searching for an item; usually you're searching for just a key. Here you just want to know: is there any item with that key? And return it. This is often called an exact search, because if the key is not in there, you learn absolutely nothing. You can't find the nearest key. And for whatever reason this is called the dictionary problem, though it's unlike a real dictionary: usually when you search for a word, you do find its neighbors. Here, if the key's there, we find it; otherwise not. And this is exactly what a Python dictionary implements. So I guess that's why Python dictionaries are called dicts. So today I'm going to assume all items have distinct keys. So in the insertion, I will assume the key is not already in the table.
With a little bit of work, you can allow inserting an item with an existing key, and you just overwrite that existing item. But I don't want to worry about that here. So we could, of course, solve this using an AVL tree in log n time. But our goal is to do better, because it's an easier problem. And I'm going to remind you of the simplest way you learned to do this, which was hashing with chaining, in 6.006. And the catch is, you didn't really analyze this in 6.006. So we're going to aim for constant time per operation (expected, or something like that) and linear space. And remember the variables we care about: there's u, n, and m. u is the size of the universe; that's the space of all possible keys. n is the size of the set you're currently storing, so that's the number of items or keys currently in the data structure. And m is the size of your table, say the number of slots in the table. So you remember the picture. You have a table of slots, let's say 0 to m minus 1. Each of them is a pointer to a linked list.
And if, let's say over here, is your universe of all possible keys, then we have a hash function which maps each universe item into one of these slots. And the linked list at each slot stores all of the items that hash to that slot. So we have a hash function which maps the universe (I'm going to assume the universe has already been mapped into integers 0 to u minus 1) to slots. And when we do hashing with chaining, I think I mentioned this last week, we achieve a bound of 1 plus alpha, where alpha is the load factor n/m. The average number of items you'd expect to hash to a slot is the number of items divided by the number of slots. OK. And you proved this in 6.006, but you assumed something called simple uniform hashing. Simple uniform hashing is an assumption, I think invented for CLRS. It makes the analysis very simple, but it's also basically cheating. So today our goal is to not cheat. It's nice as a warm-up, but we don't like cheating. So you may recall the assumption is about the hash function.
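The chaining scheme just described can be written out as a minimal Python sketch. The hash function passed in here is a placeholder assumption, not any particular construction from the lecture:

```python
class ChainedHashTable:
    """Hashing with chaining: a table of m slots, each a list ("chain")
    of the keys that hash there."""

    def __init__(self, m, h):
        self.m = m                # number of slots
        self.h = h                # hash function: key -> slot in 0..m-1
        self.slots = [[] for _ in range(m)]

    def insert(self, key):
        # assumes key is not already present (distinct keys, as in lecture)
        self.slots[self.h(key)].append(key)

    def delete(self, key):
        self.slots[self.h(key)].remove(key)

    def search(self, key):
        # exact search: return the key if present, else None
        chain = self.slots[self.h(key)]
        return key if key in chain else None
```

Each operation touches a single chain, so the cost per operation is 1 (to hash and index) plus the chain length, whose expectation is what the 1 + alpha analysis is about.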
You want a good hash function, and good means this: I want the probability of two distinct keys mapping to the same slot to be 1/m, if there are m slots. If everything were completely random, if h were basically choosing a random number for every key, then that's what we would expect to happen. So this is the idealized scenario. Now, we can't have a hash function that chooses a random number for every key, because it has to produce the same value if you give it the same key. So it has to be some kind of deterministic strategy, or at least a repeatable strategy, where if you plug in the same key you get the same thing. So really what this assumption is saying is that the keys that you give are, in some sense, random. If I give you random keys and a not-too-crazy hash function, then this will be true. But I don't like assuming anything about the keys; I want my keys to be worst case, maybe. There are lots of examples in the real world where you apply some hash function and it turns out your data has some very particular structure.
And if you choose a bad hash function, then your hash table gets really, really slow. Maybe everything hashes to the same slot. There are lots of examples of that. We want to avoid that. After today you will know how to achieve constant expected time no matter what your keys are, for worst-case keys. But it's going to take some work to do that. So this assumption requires assuming that the keys are random. And this is what we would call an average-case analysis. You might think that average-case analysis is necessary for randomized algorithms, but that's not true. And we saw that last week with quicksort. With quicksort, if you say "I will always choose A[1] to be my partition element," which is what the textbook calls basic quicksort, then for an average input that will do really well. If you have a uniform random permutation of items and you sort by always choosing the first item as your partition, then that will be n log n on average, if your data is average. But we saw we could avoid that assumption by choosing a random pivot.
If you choose a random pivot, then you don't need to assume anything about the input; you just need to assume that the pivots are random. So there's a big difference between assuming your inputs are random versus assuming your coin flips are random. It's pretty reasonable to assume you can flip coins. If you've got enough dexterity in your thumb, then you can do it. But it's not so reasonable to assume that your input is random. So we'd like to avoid average-case analysis whenever we can, and that's the goal of today. What you saw in 6.006 was essentially assuming the inputs are random. We're going to get rid of that unreasonable assumption today.

So that's, in some sense, review from 6.006. I'm going to take a brief pause and tell you about the etymology of the word "hash," in case you're curious. Hash has been an English word since the 1650s, so it's pretty old. It means literally "cut into small pieces." It's usually used in a culinary sense; these days you have corned beef hash or something. I'll put the definition over here. It comes from French, hacher, which means to chop up. You know it in English from the word hatchet.
So it's the same derivation. And it comes from an Old French word (I don't actually know whether that's pronounced "hash-ay" or "hash") which means axe. So you can see the derivation. If you look this up in the OED, or pick your favorite dictionary, or even Google, that's what you find. But in fact there's a new prevailing theory that hash comes from another language, which is Vulcan: la'ash. I mean, you can see the derivation, right? It actually means axe. So maybe French got it from Vulcan, or vice versa, but I think that's pretty clear. Live long and prosper, and farewell to Spock. Sad news of last week.

So, enough about hashing. We'll come back to that in a little bit. But hash functions essentially take up this idea of taking your key, chopping it up into pieces, and mixing it, like in a good dish. All right, so we're going to cover two ways to get strong constant-time bounds. Probably the most useful one is called universal hashing; we'll spend most of our time on that. But the theoretically cooler one is called perfect hashing. With universal hashing, we're going to guarantee there are very few conflicts in expectation.
With perfect hashing, we're going to guarantee there are zero conflicts. The catch is, at least in its obvious form, it only works for static sets. If you forbid insert and delete and just want to do search, then perfect hashing is a good method. So if you're actually storing a dictionary, like the OED: English doesn't change that quickly, so you can afford to recompute your data structure whenever you release a new edition.

But let's start with universal hashing. This is a nice, powerful technique. It works for dynamic data: insert, delete, and search will be constant expected time, with no assumptions about the input. So it will not be average case. It's in some sense worst case, but randomized. So the idea is, we need to do something random. If you just say, "well, I choose one hash function once and for all, and I use that for my table" (OK, maybe my table doubles in size and I change the hash function), there's no randomness there. We need to introduce randomness somehow into this data structure.
And the way we're going to do that is in how we choose the hash function. We're going to choose our hash function randomly from some set of hash functions; call it H. This is going to be a universal hash family. We're going to imagine there are many possible hash functions we could choose. If we choose one of them uniformly at random, that's a random choice. And that randomness is going to be enough that we no longer need to assume anything about the keys. So for that to work, we need some assumption about H. Maybe it's just a set of one hash function; that wouldn't add much randomness. Two also would not add much randomness. We need a lot of them. And so we're going to require H to have this property, and we're going to call the property universality. Generally you would call H a universal hash family: just a set of hash functions. What we want is that, when we choose our hash function h from H, among those choices the probability that two keys hash to the same value is small.
I'll say: for every pair of distinct keys k and k', the probability over the choice of h from H that h(k) = h(k') is at most 1/m. And this is very similar-looking to simple uniform hashing. It looks almost the same, except I switched from k1 and k2 to k and k'; same thing. But what we're taking the probability over, what we're assuming is random, is different. Before, we were assuming k1 and k2 are random, because h was fixed. That was an assumption about the inputs. Over here, we're thinking of k and k' as being fixed; this has to work for every pair of distinct keys. And the probability we're considering is over the distribution of h. So we're trying all the different h's, or rather, choosing little h uniformly at random. We want the probability that a random h makes k and k' collide to be at most 1/m. The other difference is we switched from "equals" to "at most." I mean, less would be better, and there are ways to make it less for a couple of pairs, but it doesn't really matter; anything less than or equal to 1/m will be just as good. So this is an assumption about H. We'll see how to achieve this assumption in a little bit. Let me first prove to you that this is enough.
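To make the definition concrete, here is a sketch of one standard universal family, h_ab(k) = ((a*k + b) mod p) mod m for a prime p at least the universe size (this is the construction from CLRS; the lecture gets to constructions later, so treat this as a preview, not the lecture's own example). For a small prime we can exhaustively verify the universality bound:

```python
def make_family(p, m):
    """All h_ab(k) = ((a*k + b) % p) % m for a in 1..p-1, b in 0..p-1.
    p must be a prime >= the universe size u (keys live in 0..p-1)."""
    return [(lambda k, a=a, b=b: ((a * k + b) % p) % m)
            for a in range(1, p) for b in range(p)]

def collision_probability(family, k1, k2):
    # fraction of functions in the family that make k1 and k2 collide
    hits = sum(1 for h in family if h(k1) == h(k2))
    return hits / len(family)

p, m = 11, 5
family = make_family(p, m)
# universality: for EVERY pair of distinct keys, a uniformly random h
# from the family collides with probability at most 1/m
assert all(collision_probability(family, k1, k2) <= 1 / m
           for k1 in range(p) for k2 in range(p) if k1 != k2)
```

Note how this matches the definition: k1 and k2 are fixed and worst case; the probability is over which h we drew from the family.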
It's going to be basically the same as the 6.006 analysis. But it's worth repeating, just so we're sure everything's OK, and so I can be more precise about what we're assuming. The key difference between this theorem and the 6.006 theorem is that we get to make no assumptions about the keys. They are arbitrary; you get to choose them however you want. But then I choose a random hash function. The hash function cannot depend on these keys, but it's going to be random. And I choose the hash function after you choose the keys. That's important. So we're going to choose a random h in H, and we're assuming H is universal. Then the expected number of keys in a slot, among those n keys, is at most 1 plus alpha, where alpha is n/m. So this is exactly what we had over here. There we were talking about a time bound, but the time bound followed because the length of each chain was expected to be 1 plus alpha. And here the expectation is over the choice of h, not assuming anything about the keys. So let's prove this theorem. It's pretty easy.
But I'm going to introduce some analysis techniques that we will use for more interesting things. So let's give the keys a name. I'll be lazy and use k1 up to kn. And I just want to compute that expectation. So I want to compute, let's say, the expected number of keys colliding with one of those keys, say ki. This is, of course, the size of the slot that ki happens to go to. This is going to work for all i, and so if I can say that this is at most 1 plus alpha for each i, then I have my theorem. It's just another way to talk about it. Now, for the number of keys colliding with ki, here's a general trick: whenever you want to count something in expectation, a very helpful tool is indicator random variables. Let's name all of the different events that we want to count; then we're basically summing those variables. So I'm going to use I_ij as an indicator random variable. It's going to be 1 or 0: 1 if h(ki) equals h(kj), so there's a collision between ki and kj, and 0 if they hash to different slots.
Now, this is a random variable because it depends on h, and h is a random thing. ki and kj are not random; they're given to you. And then I want to know when h maps those two keys to the same slot. And so this count is really just the sum of I_ij over all j. The number of keys colliding with ki is the sum, for j not equal to i, of I_ij, because we get a 1 every time they collide and a 0 otherwise. So that counts how many collide. Once we have it in this notation, we can use all the great lemmas and theorems about, in this case, E, expectation. What should I use here?

STUDENT: What?

ERIK DEMAINE: What's a good... how can I simplify this formula?

STUDENT: Linearity of expectation.

ERIK DEMAINE: Linearity of expectation. Thank you. If you don't know all these things, read the probability appendix in the textbook. So we want to talk about the expectation of the simplest thing possible. Linearity lets us put the E inside the sum without losing anything.
Now, the expectation of an indicator random variable is pretty simple, because the zeros don't contribute to the expectation and the 1's contribute 1. So this is the same thing as just the probability of it being 1. So we get the sum, over j not equal to i, of the probability that I_ij equals 1. And the probability that I_ij equals 1, well, that's the probability that this collision happens. And what's the probability of that? At most 1/m, by universality. I'll write it out: this is the sum, over j not equal to i, of the probability that h maps ki and kj to the same slot. That's the definition of I_ij. And this is at most the sum, over j not equal to i, of 1/m, by universality. So here's where we're using it. And the sum over j not equal to i, well, that's basically n. But I made a mistake here. Slightly off. So this line is wrong. Sorry, let me fix it. Because this assumption only works when the keys are distinct. So in fact... how did I get j... yeah. Yeah, sorry.
Actually, everything I said is true, but I really wanted to count the total number of keys that hash to the same place as ki. So there's one more, which is ki itself: ki always hashes to wherever ki hashes. So I did a summation over j not equal to i, but I should also have a plus I_ii (aye aye, captain). So there's the case of ki hashing to the same place as itself, which of course is always going to happen, so you get a plus 1 everywhere. So that makes me happier, because then I actually get what the theorem said, which is 1 plus alpha. There's always going to be the one guy hashing there, namely ki itself, wherever it goes. So this tells you that if we can find a universal hash family, then we're guaranteed that insert, delete, and search cost order 1 plus alpha in expectation. And the expectation is only over the choice of h, not over the inputs. I think I've stressed that enough times. But the remaining question is, can we actually design a universal hash family? Are there any universal hash families?
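The 1 plus alpha bound can be sanity-checked numerically. The sketch below draws h uniformly from the set of all functions (which is certainly universal, as discussed next) and averages the number of keys landing in k1's slot, including k1 itself; the exact expectation under this family is 1 + (n-1)/m, which is at most 1 + alpha:

```python
import random

def avg_keys_colliding(n, m, trials=20000, seed=1):
    """Average, over random h, of |{j : h(kj) == h(k1)}|, counting k1 itself.
    h maps each of the n keys independently to a uniform slot in 0..m-1."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        slots = [rng.randrange(m) for _ in range(n)]   # h(k1)..h(kn)
        total += sum(1 for s in slots if s == slots[0])  # share k1's slot
    return total / trials

n, m = 20, 10
estimate = avg_keys_colliding(n, m)
# E[count] = 1 + (n-1)/m = 2.9 here, and 1 + alpha = 1 + n/m = 3.0
assert abs(estimate - (1 + (n - 1) / m)) < 0.1
assert estimate <= 1 + n / m + 0.1
```

The "+1" term in the expectation is exactly the I_ii correction from the proof: k1 always collides with itself.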
440 00:27:56,672 --> 00:27:58,505 Otherwise this wouldn't be very interesting. 441 00:28:07,140 --> 00:28:12,990 Let me give you an example of a bad universal hash family. 442 00:28:12,990 --> 00:28:15,505 Sort of an oxymoron but it's possible. 443 00:28:24,190 --> 00:28:25,380 Bad. 444 00:28:25,380 --> 00:28:27,600 Here's a hash family that's universal. 445 00:28:27,600 --> 00:28:32,360 H is the set of all hash functions 446 00:28:32,360 --> 00:28:36,370 h from 0, 1, up to u minus 1, to 0, 1, up to m minus 1. 447 00:28:44,010 --> 00:28:46,790 This is what's normally called uniform hashing. 448 00:28:46,790 --> 00:28:50,350 It makes analysis really easy because you 449 00:28:50,350 --> 00:28:53,240 get to assume-- I mean this says ahead 450 00:28:53,240 --> 00:28:55,500 of time for every universe item, I'm 451 00:28:55,500 --> 00:28:59,300 going to choose a random slot to put it. 452 00:28:59,300 --> 00:29:01,510 And then I'll just remember that. 453 00:29:01,510 --> 00:29:06,030 And so whenever you give me the key, I'll just map it by h. 454 00:29:06,030 --> 00:29:10,420 And I get a consistent slot and definitely it's universal. 455 00:29:10,420 --> 00:29:13,460 What's bad about this hash function? 456 00:29:13,460 --> 00:29:14,541 Many things but-- 457 00:29:17,427 --> 00:29:22,520 STUDENT: [INAUDIBLE] That's just as hard as the problem I'm 458 00:29:22,520 --> 00:29:23,020 solving. 459 00:29:23,020 --> 00:29:23,820 ERIK DEMAINE: Sort of. 460 00:29:23,820 --> 00:29:25,460 I'm begging the question that it's just 461 00:29:25,460 --> 00:29:27,240 as hard as the problem I'm solving. 462 00:29:27,240 --> 00:29:31,022 And what, algorithmically, what goes wrong here? 463 00:29:31,022 --> 00:29:32,230 There are two things I guess. 464 00:29:38,304 --> 00:29:38,804 Yeah? 465 00:29:38,804 --> 00:29:40,730 STUDENT: It's not deterministic? 466 00:29:40,730 --> 00:29:42,650 ERIK DEMAINE: It's not deterministic.
467 00:29:42,650 --> 00:29:45,275 That's OK because we're allowing randomization 468 00:29:45,275 --> 00:29:46,960 in this algorithm. 469 00:29:46,960 --> 00:29:49,100 So I mean how I would compute this 470 00:29:49,100 --> 00:29:52,610 is I would do a for loop over all universe items. 471 00:29:52,610 --> 00:29:54,980 And I assume I have a way to generate a random number 472 00:29:54,980 --> 00:29:56,840 between 0 and m minus 1. 473 00:29:56,840 --> 00:29:58,570 That's legitimate. 474 00:29:58,570 --> 00:30:01,342 But there's something bad about that algorithm. 475 00:30:01,342 --> 00:30:02,550 STUDENT: It's not consistent. 476 00:30:02,550 --> 00:30:03,758 ERIK DEMAINE: Not consistent? 477 00:30:03,758 --> 00:30:06,470 It is consistent if I precompute for every universe item 478 00:30:06,470 --> 00:30:07,700 where to map it. 479 00:30:07,700 --> 00:30:08,720 That's good. 480 00:30:08,720 --> 00:30:10,670 So all these things are actually OK. 481 00:30:10,670 --> 00:30:12,540 STUDENT: It takes too much time and space. 482 00:30:12,540 --> 00:30:14,498 ERIK DEMAINE: It takes too much time and space. 483 00:30:14,498 --> 00:30:16,460 Yeah. 484 00:30:16,460 --> 00:30:19,380 That's the bad thing. 485 00:30:19,380 --> 00:30:22,640 It's hard to isolate in a bad thing what is so bad about it. 486 00:30:22,640 --> 00:30:29,710 But we need u time to compute all those random numbers. 487 00:30:29,710 --> 00:30:32,540 And we need u space to store that hash function. 488 00:30:32,540 --> 00:30:37,270 In order to get to the consistency we have to-- Oops. 489 00:30:37,270 --> 00:30:38,850 Good catch. 490 00:30:38,850 --> 00:30:40,350 In order to get consistency, we need 491 00:30:40,350 --> 00:30:43,840 to keep track of all those hash function values. 492 00:30:43,840 --> 00:30:47,524 And that's not good. 493 00:30:47,524 --> 00:30:49,440 You could try to not store them all, you know, 494 00:30:49,440 --> 00:30:50,400 use a hash table.
495 00:30:50,400 --> 00:30:53,620 But you can't use a hash table to store a hash function. 496 00:30:53,620 --> 00:30:58,180 That would be-- that would be infinite recursion. 497 00:30:58,180 --> 00:31:00,100 So but at least they're out there. 498 00:31:00,100 --> 00:31:03,510 So the challenge is to find an efficient hash family that 499 00:31:03,510 --> 00:31:05,690 doesn't take much space to store and doesn't 500 00:31:05,690 --> 00:31:07,850 take much time to compute. 501 00:31:07,850 --> 00:31:09,786 OK, we're allowing randomness. 502 00:31:19,720 --> 00:31:21,280 But we don't want too much randomness. 503 00:31:21,280 --> 00:31:23,620 We can't afford u units of time of randomness. 504 00:31:23,620 --> 00:31:25,630 I mean u could be huge. 505 00:31:25,630 --> 00:31:28,800 We're only doing n operations probably on this hash table. 506 00:31:28,800 --> 00:31:31,030 u could be way bigger than n. 507 00:31:31,030 --> 00:31:33,400 We don't want to have to precompute this giant table 508 00:31:33,400 --> 00:31:35,170 and then use it for like five steps. 509 00:31:35,170 --> 00:31:38,220 It would be really, really slow even amortized. 510 00:31:38,220 --> 00:31:42,542 So here's one that I will analyze. 511 00:31:42,542 --> 00:31:45,000 And there's another one in the textbook which I'll mention. 512 00:31:49,800 --> 00:31:53,359 This one's a little bit simpler to analyze. 513 00:31:53,359 --> 00:31:55,650 We're going to need a little bit of number theory, just 514 00:31:55,650 --> 00:31:57,610 prime numbers. 515 00:31:57,610 --> 00:32:02,240 And you've probably heard of the idea of your hash table size 516 00:32:02,240 --> 00:32:03,400 being prime. 517 00:32:03,400 --> 00:32:05,729 Here you'll see why that's useful, 518 00:32:05,729 --> 00:32:06,770 at least for this family. 519 00:32:06,770 --> 00:32:08,860 You don't always need primality, but it's 520 00:32:08,860 --> 00:32:11,320 going to make this family work.
521 00:32:11,320 --> 00:32:14,430 So I'm going to assume that my table size is prime. 522 00:32:14,430 --> 00:32:17,716 Now really my table size is doubling, 523 00:32:17,716 --> 00:32:18,840 so that's a little awkward. 524 00:32:18,840 --> 00:32:21,550 But luckily there are algorithms given a number 525 00:32:21,550 --> 00:32:23,170 to find a nearby prime number. 526 00:32:23,170 --> 00:32:25,150 We're not going to cover that here, 527 00:32:25,150 --> 00:32:27,500 but that's an algorithmic number theory thing. 528 00:32:27,500 --> 00:32:29,860 And in polylogarithmic time, I guess 529 00:32:29,860 --> 00:32:33,340 you can find a nearby prime number. 530 00:32:33,340 --> 00:32:35,220 So you want it to be a power of 2. 531 00:32:35,220 --> 00:32:38,390 And you'll just look around for nearby prime numbers. 532 00:32:38,390 --> 00:32:41,090 And then we have a prime that's about the same size so that 533 00:32:41,090 --> 00:32:45,550 will work just as well from a table doubling perspective. 534 00:32:45,550 --> 00:32:49,810 Then furthermore, for convenience, 535 00:32:49,810 --> 00:32:53,740 I'm going to assume that u is an integer power of m. 536 00:33:01,404 --> 00:33:06,489 I want my universe to be a power of that prime. 537 00:33:06,489 --> 00:33:08,530 I mean, if it isn't, just make u a little bigger. 538 00:33:08,530 --> 00:33:10,113 It's OK if u gets bigger as long as it 539 00:33:10,113 --> 00:33:13,450 covers all of the same items. 540 00:33:13,450 --> 00:33:19,340 Now once I view my universe as a power of the table size, 541 00:33:19,340 --> 00:33:23,140 a natural thing to do is take my universe items, 542 00:33:23,140 --> 00:33:27,530 to take my input integers, and think of them in base m. 543 00:33:27,530 --> 00:33:29,730 So that's what I'm going to do. 544 00:33:29,730 --> 00:33:37,880 I'm going to view a key k in base m. 
545 00:33:37,880 --> 00:33:41,640 Whenever I have a key, I can think of it 546 00:33:41,640 --> 00:33:51,850 as a vector of subkeys, k0 up to kr minus 1. 547 00:33:51,850 --> 00:33:57,024 These are the r digits in base m because of this relation. 548 00:33:57,024 --> 00:33:59,190 And I don't even care which is the least significant 549 00:33:59,190 --> 00:34:00,606 and which is the most significant. 550 00:34:00,606 --> 00:34:02,630 That won't matter so whatever, whichever order 551 00:34:02,630 --> 00:34:05,130 you want to think of it. 552 00:34:05,130 --> 00:34:08,830 And each of the ki's here I guess 553 00:34:08,830 --> 00:34:11,550 is between 0 and m minus 1. 554 00:34:17,480 --> 00:34:18,680 So far so good. 555 00:34:37,760 --> 00:34:40,670 So with this perspective, the base m perspective, 556 00:34:40,670 --> 00:34:45,469 I can define a dot product hash function as follows. 557 00:34:45,469 --> 00:34:48,520 It's going to be parametrized by another key, 558 00:34:48,520 --> 00:34:52,865 I'll call it a, which we can think of again as a vector. 559 00:34:57,380 --> 00:35:03,040 I want to define h sub a of k. 560 00:35:03,040 --> 00:35:04,790 So this is parametrized by a, but it's 561 00:35:04,790 --> 00:35:10,910 a function of a given key k as the dot product 562 00:35:10,910 --> 00:35:13,135 of those two vectors mod m. 563 00:35:16,390 --> 00:35:19,930 So remember dot products are just the sum from i 564 00:35:19,930 --> 00:35:26,800 equals 0 to r minus 1 of ai times ki. 565 00:35:26,800 --> 00:35:31,230 I want to do all of that modulo m. 566 00:35:31,230 --> 00:35:33,992 We'll worry about how long this takes 567 00:35:33,992 --> 00:35:37,710 to compute in a moment I guess. 568 00:35:37,710 --> 00:35:40,680 Maybe very soon. 569 00:35:40,680 --> 00:35:45,690 But the hash family h is just all of these ha's 570 00:35:45,690 --> 00:35:48,560 for all possible choices of a. 571 00:35:52,276 --> 00:35:56,860 a was a key so it comes from the universe u.
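As a concrete sketch of the dot-product family just described (the code and names here are mine, not from the lecture): split each key into its r base-m digits, dot it with the digits of one uniformly random key a, and reduce mod m. This assumes m prime and u = m^r as on the board.

```python
import random

def base_m_digits(k, m, r):
    """The r base-m digits of key k; their order doesn't matter for the analysis."""
    out = []
    for _ in range(r):
        out.append(k % m)
        k //= m
    return out

def make_dot_product_hash(m, r, seed=None):
    """Draw h_a from the family: h_a(k) = (sum_i a_i * k_i) mod m,
    where a is one uniformly random key from the universe u = m^r."""
    rng = random.Random(seed)
    a_digits = base_m_digits(rng.randrange(m ** r), m, r)
    def h(k):
        return sum(ai * ki
                   for ai, ki in zip(a_digits, base_m_digits(k, m, r))) % m
    return h

h = make_dot_product_hash(m=7, r=3, seed=0)  # universe u = 343, table size 7
print(h(5) == h(5))                          # consistent: same key, same slot
```

Note that storing h is just storing the r digits of a, one key's worth of space, in contrast to the u-sized table that uniform hashing would need.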
572 00:36:01,770 --> 00:36:04,530 And so what that means is to do universal hashing, 573 00:36:04,530 --> 00:36:07,650 I want to choose one of these ha's uniformly at random. 574 00:36:07,650 --> 00:36:08,700 How do I do that? 575 00:36:08,700 --> 00:36:11,410 I just choose a uniformly at random. 576 00:36:11,410 --> 00:36:12,390 Pretty easy. 577 00:36:12,390 --> 00:36:16,230 It's one random value, one random key. 578 00:36:16,230 --> 00:36:19,500 So that should take constant time and constant space 579 00:36:19,500 --> 00:36:22,660 to store one number. 580 00:36:22,660 --> 00:36:28,100 In general we're in a world called the Word RAM model. 581 00:36:28,100 --> 00:36:31,800 This is actually-- I guess m stands for model 582 00:36:31,800 --> 00:36:33,610 so I shouldn't write model. 583 00:36:33,610 --> 00:36:38,240 Random access machine, which you may have heard of. 584 00:36:38,240 --> 00:36:42,980 The word RAM assumes that in general we're 585 00:36:42,980 --> 00:36:45,180 manipulating integers. 586 00:36:45,180 --> 00:36:49,550 And the integers fit in a word. 587 00:36:49,550 --> 00:36:51,430 And the computational assumption is 588 00:36:51,430 --> 00:36:55,176 that manipulating a constant number of words 589 00:36:55,176 --> 00:36:56,800 and doing essentially any operation you 590 00:36:56,800 --> 00:37:01,060 want on a constant number of words takes constant time. 591 00:37:06,530 --> 00:37:08,430 And the other part of the word RAM model 592 00:37:08,430 --> 00:37:11,770 is to assume that the things you care about fit in a word. 593 00:37:16,950 --> 00:37:24,950 Say individual data values, here we're talking about keys, 594 00:37:24,950 --> 00:37:28,190 fit in a word. 595 00:37:28,190 --> 00:37:30,590 This is what you need to assume in [INAUDIBLE] 596 00:37:30,590 --> 00:37:33,830 that you can compute high of x in constant time or low 597 00:37:33,830 --> 00:37:35,470 of x in constant time.
598 00:37:35,470 --> 00:37:38,250 Here I'm going to use it to assume that we can compute 599 00:37:38,250 --> 00:37:41,790 h sub a of k in constant time. 600 00:37:41,790 --> 00:37:44,010 In practice this would be done by implementing 601 00:37:44,010 --> 00:37:46,870 this computation, this dot product computation, 602 00:37:46,870 --> 00:37:48,540 in hardware. 603 00:37:48,540 --> 00:37:53,420 And the reason a 64-bit addition on a modern processor 604 00:37:53,420 --> 00:37:56,359 or a 32-bit one on most phones takes constant time 605 00:37:56,359 --> 00:37:58,150 is because there's hardware that's designed 606 00:37:58,150 --> 00:38:00,050 to do that really fast. 607 00:38:00,050 --> 00:38:03,990 And in general we're assuming that the things we care about 608 00:38:03,990 --> 00:38:06,137 fit in a single word. 609 00:38:06,137 --> 00:38:08,720 And we're assuming random access and that we can have arrays. 610 00:38:08,720 --> 00:38:10,720 That's what we need in order to store a table. 611 00:38:10,720 --> 00:38:12,930 And same thing in [INAUDIBLE], we needed to assume we 612 00:38:12,930 --> 00:38:13,430 had arrays. 613 00:38:16,772 --> 00:38:18,400 And I think this operation is actually 614 00:38:18,400 --> 00:38:22,540 pretty-- exists in Intel architectures in some form. 615 00:38:22,540 --> 00:38:25,117 But it's certainly not a normal operation. 616 00:38:25,117 --> 00:38:26,700 If you're going to do this explicitly, 617 00:38:26,700 --> 00:38:28,340 adding up and multiplying things, this 618 00:38:28,340 --> 00:38:34,900 would take r terms, where r is the log base m of u, so it's kind of logish time. 619 00:38:34,900 --> 00:38:39,970 Maybe I'll mention another hash family that's 620 00:38:39,970 --> 00:38:41,985 more obviously computable. 621 00:38:45,499 --> 00:38:46,540 But I won't analyze it here. 622 00:38:46,540 --> 00:38:48,000 It's analyzed in the textbook. 623 00:38:48,000 --> 00:38:52,450 So if you're curious you can check it out there.
624 00:38:52,450 --> 00:38:56,600 Let's call this just another. 625 00:39:15,620 --> 00:39:17,520 It's a bit weird because it has two mods. 626 00:39:17,520 --> 00:39:19,100 You take mod p and then mod m. 627 00:39:19,100 --> 00:39:22,010 But the main computation is very simple. 628 00:39:22,010 --> 00:39:24,390 You choose a uniformly random value a. 629 00:39:24,390 --> 00:39:29,640 You multiply it by your key in usual binary multiplication 630 00:39:29,640 --> 00:39:30,900 instead of dot product. 631 00:39:30,900 --> 00:39:34,070 And then you add another uniformly random key. 632 00:39:34,070 --> 00:39:36,360 This is also universal. 633 00:39:36,360 --> 00:39:44,660 So H is hab for all a and b that are keys. 634 00:39:48,629 --> 00:39:50,420 So if you're not happy with this assumption 635 00:39:50,420 --> 00:39:52,272 that you can compute this in constant time, 636 00:39:52,272 --> 00:39:53,980 you should be happy with this assumption. 637 00:39:53,980 --> 00:39:56,396 If you believe in addition and multiplication and division 638 00:39:56,396 --> 00:39:58,690 being constant time, then this will be constant time. 639 00:40:01,860 --> 00:40:03,640 So both of these families are universal. 640 00:40:03,640 --> 00:40:06,150 I'm going to prove that this one is universal because it's 641 00:40:06,150 --> 00:40:06,790 a little bit easier. 642 00:40:06,790 --> 00:40:07,290 Yeah? 643 00:40:07,290 --> 00:40:09,640 STUDENT: Is this p a choice that you made? 644 00:40:09,640 --> 00:40:10,640 ERIK DEMAINE: OK, right. 645 00:40:10,640 --> 00:40:11,704 What is p? 646 00:40:11,704 --> 00:40:19,030 p just has to be bigger than u, and it should be prime. 647 00:40:19,030 --> 00:40:20,940 It's not random. 648 00:40:20,940 --> 00:40:24,550 You can just choose one prime that's bigger than your universe 649 00:40:24,550 --> 00:40:26,120 size, and this will work.
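For reference, this "other" family is the one in CLRS: h_ab(k) = ((a*k + b) mod p) mod m, with p a fixed prime at least as large as the universe, a drawn from {1, ..., p-1}, and b from {0, ..., p-1}. Here's a minimal sketch of mine, not from the lecture; the naive trial-division prime search is just for illustration (the real algorithmic number theory mentioned above is faster):

```python
import random

def make_mult_hash(m, u, seed=None):
    """Draw h_ab(k) = ((a*k + b) % p) % m from the CLRS universal family.
    p is a fixed prime >= u; only a and b are random."""
    def is_prime(x):
        if x < 2:
            return False
        d = 2
        while d * d <= x:
            if x % d == 0:
                return False
            d += 1
        return True

    p = max(u, 2)
    while not is_prime(p):      # naive search for a prime >= u (sketch only)
        p += 1
    rng = random.Random(seed)
    a = rng.randrange(1, p)     # a in {1, ..., p-1} (nonzero, per CLRS)
    b = rng.randrange(p)        # b in {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % m

h = make_mult_hash(m=8, u=1000, seed=0)   # here p works out to 1009
print(all(0 <= h(k) < 8 for k in range(1000)))
```

Note the small difference from the spoken description: in the CLRS version a is restricted to be nonzero, since a = 0 would hash every key to the same slot.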
650 00:40:26,120 --> 00:40:29,340 STUDENT: [INAUDIBLE] 651 00:40:32,390 --> 00:40:33,860 ERIK DEMAINE: I forget whether you 652 00:40:33,860 --> 00:40:35,160 have to assume that m is prime. 653 00:40:35,160 --> 00:40:37,630 I'd have to check. 654 00:40:37,630 --> 00:40:43,250 I'm guessing not, but don't quote me on that. 655 00:40:43,250 --> 00:40:46,760 Check the section in the textbook. 656 00:40:46,760 --> 00:40:47,920 So good. 657 00:40:47,920 --> 00:40:50,112 Easy to compute. 658 00:40:50,112 --> 00:40:52,570 The analysis is similar, but it's a little bit easier here. 659 00:40:52,570 --> 00:40:56,070 Essentially this is very much like a product, 660 00:40:56,070 --> 00:40:59,720 but there are no carries here. 661 00:40:59,720 --> 00:41:02,140 When we do a dot product instead of just multiplying-- 662 00:41:02,140 --> 00:41:04,500 multiplying in base m 663 00:41:04,500 --> 00:41:07,150 would give the same thing as multiplying in base 2, 664 00:41:07,150 --> 00:41:10,330 but with carries from one m-sized digit to the next one. 665 00:41:10,330 --> 00:41:12,280 And that's just more annoying to think about. 666 00:41:12,280 --> 00:41:14,535 So here we're essentially getting rid of carries. 667 00:41:14,535 --> 00:41:17,170 So it's in some sense even easier to compute. 668 00:41:17,170 --> 00:41:20,305 And in both cases, it's universal. 669 00:41:24,370 --> 00:41:34,140 So we want to prove this property. 670 00:41:34,140 --> 00:41:39,200 That if we choose a random a then the probability 671 00:41:39,200 --> 00:41:42,170 of two keys, k and k' which are distinct mapping 672 00:41:42,170 --> 00:41:49,450 via h to the same value is at most 1/m. So let's prove that. 673 00:42:06,450 --> 00:42:12,422 So we're given two keys. 674 00:42:12,422 --> 00:42:14,130 We have no control over them because this 675 00:42:14,130 --> 00:42:16,645 has to work for all keys that are distinct.
676 00:42:22,430 --> 00:42:24,550 The only thing we know is that they're distinct. 677 00:42:24,550 --> 00:42:27,267 Now if two keys are distinct, then their vectors 678 00:42:27,267 --> 00:42:27,975 must be distinct. 679 00:42:27,975 --> 00:42:29,360 If two vectors are distinct, that 680 00:42:29,360 --> 00:42:32,269 means at least one item must be different. 681 00:42:32,269 --> 00:42:33,185 Should sound familiar. 682 00:42:39,870 --> 00:42:43,240 So this was like in the matrix multiplication verification 683 00:42:43,240 --> 00:42:46,420 algorithm that [INAUDIBLE] taught. 684 00:42:46,420 --> 00:42:54,855 So k and k' differ in some digit. 685 00:42:58,190 --> 00:42:59,440 Let's call that digit d. 686 00:43:02,902 --> 00:43:06,830 So k sub d is different from k sub d'. 687 00:43:09,370 --> 00:43:14,590 And I want to compute this probability. 688 00:43:14,590 --> 00:43:15,530 We'll rewrite it. 689 00:43:33,970 --> 00:43:36,450 The probability is over a. 690 00:43:36,450 --> 00:43:38,400 I'm choosing a uniformly at random. 691 00:43:38,400 --> 00:43:39,900 I want another probability that that 692 00:43:39,900 --> 00:43:43,520 maps k and k' to the same slot. 693 00:43:43,520 --> 00:43:47,210 So let me just write out the definition. 694 00:43:47,210 --> 00:43:58,750 It's probability over a that the dot product of a and k 695 00:43:58,750 --> 00:44:12,180 is the same thing as when I do the dot product with k' mod m. 696 00:44:12,180 --> 00:44:15,620 These two, that sum should come out the same, mod m. 697 00:44:19,570 --> 00:44:25,210 So let me move this part over to this side because in both cases 698 00:44:25,210 --> 00:44:26,490 we have the same ai. 699 00:44:26,490 --> 00:44:28,920 So I can group terms and say this 700 00:44:28,920 --> 00:44:45,640 is the probability-- probability sum over i 701 00:44:45,640 --> 00:44:50,900 equals 0 to r minus 1 of ai times ki minus 702 00:44:50,900 --> 00:44:54,420 ki prime equals 0. 703 00:44:57,660 --> 00:44:58,160 Mod m. 
704 00:45:12,380 --> 00:45:14,750 OK, no pun intended. 705 00:45:14,750 --> 00:45:19,430 Now we care about this digit d. 706 00:45:19,430 --> 00:45:22,210 d is a place where we know that this is non-zero. 707 00:45:22,210 --> 00:45:28,270 So let me separate out the terms for d and everything but d. 708 00:45:28,270 --> 00:45:34,630 So this is the same as the probability of-- let's do the d term first, 709 00:45:34,630 --> 00:45:41,920 so we have ad times kd minus kd prime. 710 00:45:41,920 --> 00:45:43,240 That's one term. 711 00:45:43,240 --> 00:45:46,860 I'm going to write the summation of i 712 00:45:46,860 --> 00:45:56,485 not equal to d of ai ki minus ki prime. 713 00:45:56,485 --> 00:45:58,110 These ones, some of them might be zero. 714 00:45:58,110 --> 00:45:58,990 Some are not. 715 00:45:58,990 --> 00:46:00,850 We're not going to worry about it. 716 00:46:00,850 --> 00:46:03,105 It's enough to just isolate one term that is non-zero. 717 00:46:08,550 --> 00:46:11,520 So this thing we know does not equal zero. 718 00:46:14,370 --> 00:46:15,350 Cool. 719 00:46:15,350 --> 00:46:17,850 Here's where I'm going to use a little bit of number theory. 720 00:46:17,850 --> 00:46:20,360 I haven't yet used that m is prime. 721 00:46:20,360 --> 00:46:27,120 I required m is prime because when you're working modulo m, 722 00:46:27,120 --> 00:46:30,470 you have multiplicative inverses. 723 00:46:30,470 --> 00:46:32,430 Because this is not zero, there is 724 00:46:32,430 --> 00:46:35,190 something I can multiply on both sides 725 00:46:35,190 --> 00:46:40,800 and get this to cancel out and become one. 726 00:46:40,800 --> 00:46:43,650 For every value x there is a value y 727 00:46:43,650 --> 00:46:46,170 so that x times y equals 1 modulo m. 728 00:46:46,170 --> 00:46:48,070 And you can even compute it in constant time 729 00:46:48,070 --> 00:46:50,410 in a reasonable model.
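The multiplicative inverse mod a prime m that this step relies on is easy to compute in practice. One standard way (a sketch of mine, not something from the board) is Fermat's little theorem: for x not divisible by m, x^(m-1) = 1 (mod m), so x^(m-2) mod m is the inverse of x.

```python
def inverse_mod_prime(x, m):
    """Multiplicative inverse of x modulo a prime m, for x not = 0 mod m.
    Fermat: x^(m-1) = 1 (mod m), so x^(m-2) * x = 1 (mod m)."""
    return pow(x, m - 2, m)        # fast built-in modular exponentiation

m = 7
print(inverse_mod_prime(3, m))     # 3 * 5 = 15 = 1 (mod 7), so this prints 5
assert all((x * inverse_mod_prime(x, m)) % m == 1 for x in range(1, m))
```

The extended Euclidean algorithm is the other standard route and works for any modulus coprime to x, not just primes.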
730 00:46:50,410 --> 00:47:08,290 So then I can say I want the probability that ad is minus 731 00:47:08,290 --> 00:47:12,030 kd minus kd prime inverse. 732 00:47:12,030 --> 00:47:14,520 This is the multiplicative inverse I was talking about. 733 00:47:14,520 --> 00:47:20,680 And then the sum i not equal to d whatever, I don't actually 734 00:47:20,680 --> 00:47:27,264 care what this is too much, I've already done the equals part. 735 00:47:27,264 --> 00:47:28,430 I still need to write mod m. 736 00:47:31,240 --> 00:47:36,520 The point is this is all about ad. 737 00:47:36,520 --> 00:47:38,850 Remember we're choosing a uniformly at random. 738 00:47:38,850 --> 00:47:40,700 That's the same thing as choosing 739 00:47:40,700 --> 00:47:45,896 each of the ai's independently uniformly at random. 740 00:47:45,896 --> 00:47:47,372 Yeah? 741 00:47:47,372 --> 00:47:53,276 STUDENT: Is the second line over there isolating d [INAUDIBLE]? 742 00:47:53,276 --> 00:47:54,667 Second from the top. 743 00:47:54,667 --> 00:47:55,500 ERIK DEMAINE: Which? 744 00:47:55,500 --> 00:47:56,020 This one? 745 00:47:56,020 --> 00:47:56,520 STUDENT: No up. 746 00:47:56,520 --> 00:47:57,311 ERIK DEMAINE: This? 747 00:47:57,311 --> 00:47:58,398 STUDENT: Down. 748 00:47:58,398 --> 00:47:58,898 That one. 749 00:47:58,898 --> 00:47:59,380 No. 750 00:47:59,380 --> 00:48:00,171 The one below that. 751 00:48:00,171 --> 00:48:01,790 ERIK DEMAINE: Yes. 752 00:48:01,790 --> 00:48:03,730 STUDENT: Is that line isolating d or is that-- 753 00:48:03,730 --> 00:48:04,438 ERIK DEMAINE: No. 754 00:48:04,438 --> 00:48:05,490 I haven't isolated d yet. 755 00:48:05,490 --> 00:48:06,970 This is all the terms. 756 00:48:06,970 --> 00:48:08,970 And then going from this line to this one, 757 00:48:08,970 --> 00:48:12,910 I'm just pulling out the i equals d term. 758 00:48:12,910 --> 00:48:13,650 That's this term. 759 00:48:13,650 --> 00:48:16,324 And then separating out the i not equal to d. 
760 00:48:16,324 --> 00:48:17,090 STUDENT: I get it. 761 00:48:17,090 --> 00:48:17,380 ERIK DEMAINE: Right? 762 00:48:17,380 --> 00:48:18,980 This sum is just the same as that sum. 763 00:48:18,980 --> 00:48:20,480 But I've done the d term explicitly. 764 00:48:20,480 --> 00:48:21,063 STUDENT: Sure. 765 00:48:21,063 --> 00:48:21,780 I get it. 766 00:48:24,310 --> 00:48:27,150 ERIK DEMAINE: So I've done all this rewriting 767 00:48:27,150 --> 00:48:29,780 because I know that ad is chosen uniformly at random. 768 00:48:29,780 --> 00:48:34,340 Here we have this thing, this monstrosity, 769 00:48:34,340 --> 00:48:36,890 but it does not depend on ad. 770 00:48:36,890 --> 00:48:39,300 In fact it is independent of ad. 771 00:48:39,300 --> 00:48:44,570 I'm going to write this as a function of k and k' 772 00:48:44,570 --> 00:48:46,850 because those are given to us and fixed. 773 00:48:46,850 --> 00:48:50,310 And then it's also a function of a0 and a1. 774 00:48:50,310 --> 00:48:53,520 Everything except d. 775 00:48:53,520 --> 00:49:01,920 So ad minus 1, ad plus 1, and so on up to ar minus 1. 776 00:49:01,920 --> 00:49:03,700 This is awkward to write. 777 00:49:03,700 --> 00:49:06,230 But everything except ad appears here 778 00:49:06,230 --> 00:49:09,230 because we have i not equal to d. 779 00:49:09,230 --> 00:49:13,040 And these ai's are random variables. 780 00:49:13,040 --> 00:49:16,460 But we're assuming that they're all chosen independently 781 00:49:16,460 --> 00:49:17,910 from each other. 782 00:49:17,910 --> 00:49:21,720 So I don't really care what's going on in this function. 783 00:49:21,720 --> 00:49:22,660 It's something. 784 00:49:22,660 --> 00:49:24,390 And if I rewrite this probability, 785 00:49:24,390 --> 00:49:27,636 it's the probability over the choice of a. 786 00:49:27,636 --> 00:49:31,720 I can separate out the choice of all these things 787 00:49:31,720 --> 00:49:35,320 from the choice of ad. 
788 00:49:35,320 --> 00:49:39,560 And this is just a useful formula. 789 00:49:39,560 --> 00:49:43,500 I'm going to write a not equal to d. 790 00:49:43,500 --> 00:49:48,400 All the other-- maybe I'll write a sub i not equal to d. 791 00:49:48,400 --> 00:49:51,080 All the choices of those guys separately 792 00:49:51,080 --> 00:49:59,700 from the probability, over the choice of ad, of ad 793 00:49:59,700 --> 00:50:00,895 equaling this function. 794 00:50:05,090 --> 00:50:08,200 If you just think about the definition of expectation, 795 00:50:08,200 --> 00:50:09,560 this is doing the same thing. 796 00:50:09,560 --> 00:50:12,780 We're thinking of first choosing the ai's where 797 00:50:12,780 --> 00:50:14,370 i is not equal to d. 798 00:50:14,370 --> 00:50:15,970 And then we choose ad. 799 00:50:15,970 --> 00:50:19,470 And this computation will come out the same as that. 800 00:50:25,110 --> 00:50:28,720 But this is the probability of a uniformly random number 801 00:50:28,720 --> 00:50:31,680 equaling something. 802 00:50:31,680 --> 00:50:35,950 So we just need to think about-- sorry. 803 00:50:35,950 --> 00:50:37,470 Important. 804 00:50:37,470 --> 00:50:39,810 That would be pretty unlikely-- that would be 1/u-- 805 00:50:39,810 --> 00:50:42,970 but this is all working modulo m. 806 00:50:42,970 --> 00:50:45,760 So if I just take a uniformly random integer, 807 00:50:45,760 --> 00:50:49,530 the chance of it hitting any particular value mod m is 1/m. 808 00:50:53,011 --> 00:50:54,010 And that's universality. 809 00:50:57,430 --> 00:51:02,500 So in this case, you get exactly 1/m, no less than or equal to. 810 00:51:02,500 --> 00:51:06,440 Sorry, I should have written it's the expectation of 1/m, 811 00:51:06,440 --> 00:51:12,540 but that's 1/m because 1/m has no random parts in it. 812 00:51:12,540 --> 00:51:13,412 Yeah?
813 00:51:13,412 --> 00:51:15,220 STUDENT: How do we know that the, 814 00:51:15,220 --> 00:51:19,735 that this expression doesn't have any biases in the sense 815 00:51:19,735 --> 00:51:23,832 that it doesn't give more, more, like if you give it 816 00:51:23,832 --> 00:51:26,718 the uniform distribution of numbers, 817 00:51:26,718 --> 00:51:28,642 it doesn't spit out more numbers than others 818 00:51:28,642 --> 00:51:30,514 and that could potentially-- 819 00:51:30,514 --> 00:51:31,930 ERIK DEMAINE: Oh, so you're asking 820 00:51:31,930 --> 00:51:35,360 how do we know that this hash family doesn't 821 00:51:35,360 --> 00:51:38,085 prefer some slots over others, I guess. 822 00:51:38,085 --> 00:51:41,378 STUDENT: Of course like after the equals sign, 823 00:51:41,378 --> 00:51:46,219 like in this middle line in the middle. 824 00:51:46,219 --> 00:51:46,760 Middle board. 825 00:51:46,760 --> 00:51:47,718 ERIK DEMAINE: This one? 826 00:51:47,718 --> 00:51:49,304 Oh, this one. 827 00:51:49,304 --> 00:51:50,220 STUDENT: Middle board. 828 00:51:50,220 --> 00:51:51,531 ERIK DEMAINE: Middle board. 829 00:51:51,531 --> 00:51:52,030 Here. 830 00:51:52,030 --> 00:51:53,056 STUDENT: Yes. 831 00:51:53,056 --> 00:51:54,680 So how do we know that if you give it-- 832 00:51:54,680 --> 00:51:55,846 ERIK DEMAINE: This function. 833 00:51:55,846 --> 00:51:59,760 STUDENT: --random variables, it won't prefer certain numbers 834 00:51:59,760 --> 00:52:00,480 over others? 835 00:52:00,480 --> 00:52:03,560 ERIK DEMAINE: So this function may prefer some numbers 836 00:52:03,560 --> 00:52:04,940 over others. 837 00:52:04,940 --> 00:52:06,300 But it doesn't matter. 838 00:52:06,300 --> 00:52:08,310 All we need is that this function 839 00:52:08,310 --> 00:52:10,285 is independent of our choice of ad. 
840 00:52:10,285 --> 00:52:12,500 So you can think of this function, 841 00:52:12,500 --> 00:52:15,010 you choose all of these random-- actually k and k' 842 00:52:15,010 --> 00:52:18,179 are not random-- but you choose all these random numbers. 843 00:52:18,179 --> 00:52:19,220 Then you evaluate your f. 844 00:52:19,220 --> 00:52:20,970 Maybe it always comes out to 5. 845 00:52:20,970 --> 00:52:21,470 Who knows. 846 00:52:21,470 --> 00:52:23,090 It could be super biased. 847 00:52:23,090 --> 00:52:26,430 But then you choose ad uniformly at random. 848 00:52:26,430 --> 00:52:29,350 So the chance of ad equaling 5 is the same 849 00:52:29,350 --> 00:52:31,730 as the chance of ad equaling 3. 850 00:52:31,730 --> 00:52:34,410 So in all cases, you get the probability is 1/m. 851 00:52:34,410 --> 00:52:36,020 What we need is independence. 852 00:52:36,020 --> 00:52:39,220 We need that the ad is chosen independently from the other 853 00:52:39,220 --> 00:52:39,810 ai's. 854 00:52:39,810 --> 00:52:42,210 But we don't need to know anything about f other 855 00:52:42,210 --> 00:52:44,640 than it doesn't depend on ad. 856 00:52:44,640 --> 00:52:48,690 And we made it not depend on ad because I isolated ad 857 00:52:48,690 --> 00:52:50,600 by pulling it out of that summation. 858 00:52:50,600 --> 00:52:53,370 So we know there's no ad's over here. 859 00:52:53,370 --> 00:52:56,110 Good question. 860 00:52:56,110 --> 00:52:58,825 You get a bonus Frisbee for your question. 861 00:53:01,500 --> 00:53:02,890 All right. 862 00:53:02,890 --> 00:53:06,520 That ends universal hashing. 863 00:53:06,520 --> 00:53:08,010 Any more questions? 864 00:53:08,010 --> 00:53:10,390 So at this point we have at least one 865 00:53:10,390 --> 00:53:12,350 universal hash family. 866 00:53:12,350 --> 00:53:15,790 So we're just choosing, in this case, a uniformly at random. 867 00:53:15,790 --> 00:53:19,400 In the other method, we choose a and b uniformly at random.
868 00:53:19,400 --> 00:53:23,040 And then we build our hash table. 869 00:53:23,040 --> 00:53:25,667 And the hash function depends on m. 870 00:53:25,667 --> 00:53:27,500 So also every time we double our table size, 871 00:53:27,500 --> 00:53:29,570 we're going to have to choose a new hash function 872 00:53:29,570 --> 00:53:32,340 for the new value of m. 873 00:53:32,340 --> 00:53:34,440 And that's about it. 874 00:53:34,440 --> 00:53:38,870 So this will give us constant expected time-- or in general 1 875 00:53:38,870 --> 00:53:42,480 plus alpha if you're not doing table doubling-- for insert, 876 00:53:42,480 --> 00:53:45,230 delete, and exact search. 877 00:53:45,230 --> 00:53:49,230 Just building on the hashing with chaining. 878 00:53:49,230 --> 00:53:50,760 And so this is a good method. 879 00:53:50,760 --> 00:53:51,551 Question? 880 00:53:51,551 --> 00:53:54,497 STUDENT: Why do you say expected value of the probability? 881 00:53:54,497 --> 00:53:58,430 Isn't it sufficient to just say the probability of [INAUDIBLE]? 882 00:53:58,430 --> 00:54:02,210 ERIK DEMAINE: Uh, yeah, I wanted to isolate-- 883 00:54:02,210 --> 00:54:05,400 it is the overall probability of this happening. 884 00:54:05,400 --> 00:54:07,140 I rewrote it this way because I wanted 885 00:54:07,140 --> 00:54:09,640 to think about first choosing the ai's where i does not 886 00:54:09,640 --> 00:54:12,225 equal d and then choosing ad. 887 00:54:12,225 --> 00:54:14,180 So this probability was supposed to be only 888 00:54:14,180 --> 00:54:15,626 over the choice of ad. 889 00:54:15,626 --> 00:54:17,700 And you have to do something with the other ai's 890 00:54:17,700 --> 00:54:18,470 because they're random. 891 00:54:18,470 --> 00:54:20,345 You can't just say, what's the probability ad 892 00:54:20,345 --> 00:54:21,800 equaling a random variable? 893 00:54:21,800 --> 00:54:23,300 That's a little sketchy. 894 00:54:23,300 --> 00:54:25,255 I wanted to have no random variables over all. 
895 00:54:25,255 --> 00:54:28,460 So I have to kind of bind those variables with something. 896 00:54:28,460 --> 00:54:32,480 And I just want to see what the-- This doesn't really 897 00:54:32,480 --> 00:54:35,860 affect very much, but to make this algebraically 898 00:54:35,860 --> 00:54:38,610 correct I need to say what the a_i's, i not 899 00:54:38,610 --> 00:54:41,490 equal to d, are doing. 900 00:54:41,490 --> 00:54:43,214 Other questions? 901 00:54:43,214 --> 00:54:43,714 Yeah. 902 00:54:43,714 --> 00:54:45,922 STUDENT: Um, I'm a bit confused about your definition 903 00:54:45,922 --> 00:54:50,546 of the collision in the lower left board. 904 00:54:50,546 --> 00:54:53,357 Why are you adding i's [INAUDIBLE]? 905 00:54:53,357 --> 00:54:54,440 ERIK DEMAINE: Yeah, sorry. 906 00:54:54,440 --> 00:54:56,320 This is a funny notion of colliding. 907 00:54:56,320 --> 00:54:58,560 I just mean I want to count the number of keys that 908 00:54:58,560 --> 00:55:00,350 hash to the same slot as ki. 909 00:55:00,350 --> 00:55:04,106 STUDENT: So it's not necessarily like a collision [INAUDIBLE]. 910 00:55:04,106 --> 00:55:05,480 ERIK DEMAINE: You may not call it 911 00:55:05,480 --> 00:55:08,250 a collision when it collides with itself, yeah. 912 00:55:08,250 --> 00:55:11,050 Whatever you want to call it. 913 00:55:11,050 --> 00:55:14,920 But I just mean hashing to the same slot as ki. 914 00:55:14,920 --> 00:55:15,470 Yeah. 915 00:55:15,470 --> 00:55:17,960 Just because I want to count the total length of the chain. 916 00:55:17,960 --> 00:55:21,210 I don't want to count the number of collisions in the chain. 917 00:55:21,210 --> 00:55:21,710 Sorry. 918 00:55:21,710 --> 00:55:23,220 Probably a poor choice of word. 
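For concreteness, the kind of family we have been choosing from can be sketched in a few lines of Python. This is a minimal sketch assuming the multiplication-style construction where a and b are chosen uniformly at random, h(k) = ((a*k + b) mod p) mod m with p prime; the particular prime and the function names are illustrative, not the lecture's exact choices.

```python
import random

# One random member of a universal hash family:
# h(k) = ((a*k + b) mod p) mod m, with p a prime larger than any key.
# p = 2^61 - 1 is a Mersenne prime, so it covers keys up to 61 bits.
def make_universal_hash(m, p=(1 << 61) - 1):
    a = random.randrange(1, p)   # a must be nonzero
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(8)
assert all(0 <= h(k) < 8 for k in range(1000))
# Because h depends on m, doubling the table size means
# drawing a fresh hash function for the new m.
```

Note that the random draw happens once, when the function is built; after that h is a fixed, deterministic function, which is exactly why it must be redrawn whenever m changes.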
919 00:55:27,720 --> 00:55:31,179 We're hashing because we're taking our key, 920 00:55:31,179 --> 00:55:32,720 we're cutting it up into little bits, 921 00:55:32,720 --> 00:55:35,480 and then we're mixing them up just like a good corned beef 922 00:55:35,480 --> 00:55:38,550 hash or something. 923 00:55:38,550 --> 00:55:41,140 All right let's move on to perfect hashing. 924 00:55:41,140 --> 00:55:44,950 This is more exciting I would say. 925 00:55:44,950 --> 00:55:48,690 Even cooler-- this was cool from a probability perspective, 926 00:55:48,690 --> 00:55:50,590 depending on your notion of cool. 927 00:55:50,590 --> 00:55:53,280 This method will be cool from a data structures perspective 928 00:55:53,280 --> 00:55:54,610 and a probability perspective. 929 00:55:57,630 --> 00:56:01,600 But so far data structures are what we know from 006. 930 00:56:01,600 --> 00:56:06,080 Now we're going to go up a level, literally. 931 00:56:06,080 --> 00:56:08,930 We're going to have two levels. 932 00:56:08,930 --> 00:56:12,410 So here we're solving-- you can actually make this data 933 00:56:12,410 --> 00:56:13,420 structure dynamic. 934 00:56:13,420 --> 00:56:15,800 But we're going to solve the static dictionary 935 00:56:15,800 --> 00:56:24,530 problem which is when you have no inserts and deletes. 936 00:56:24,530 --> 00:56:26,195 You're given the keys up front. 937 00:56:29,760 --> 00:56:30,900 You're given n keys. 938 00:56:30,900 --> 00:56:34,560 You want to build a table that supports search. 939 00:56:38,424 --> 00:56:39,590 And that's it. 940 00:56:39,590 --> 00:56:42,550 You want search to be constant time 941 00:56:42,550 --> 00:56:51,820 and perfect hashing, also known as FKS hashing 942 00:56:51,820 --> 00:56:55,390 because it was invented by Fredman, Komlos, and Szemeredi 943 00:56:55,390 --> 00:56:59,270 in 1984. 944 00:56:59,270 --> 00:57:09,520 What we will achieve is constant time worst case for search. 
945 00:57:16,670 --> 00:57:19,010 So that's a little better because here we're 946 00:57:19,010 --> 00:57:22,000 just doing constant expected time for search. 947 00:57:22,000 --> 00:57:26,260 But it's worse in that we have to know the keys in advance. 948 00:57:26,260 --> 00:57:30,620 We're going to get linear space in the worst case. 949 00:57:40,250 --> 00:57:41,750 And then the remaining question is 950 00:57:41,750 --> 00:57:44,870 how long does it take you to build this data structure? 951 00:57:44,870 --> 00:57:47,570 And for now I'll just say it's polynomial time. 952 00:57:47,570 --> 00:57:49,820 It's actually going to be nearly linear. 953 00:57:56,750 --> 00:58:00,150 And this is also an expected bound. 954 00:58:00,150 --> 00:58:05,111 Actually, with high probability would be a little stronger 955 00:58:05,111 --> 00:58:05,610 here. 956 00:58:07,745 --> 00:58:09,620 So it's going to take us a little bit of time 957 00:58:09,620 --> 00:58:11,536 to build this structure, but once you have it, 958 00:58:11,536 --> 00:58:12,925 you have the perfect scenario. 959 00:58:12,925 --> 00:58:14,300 There's going to be in some sense 960 00:58:14,300 --> 00:58:16,591 no collisions in our hash table, so it will be constant 961 00:58:16,591 --> 00:58:19,936 time for search and linear space. 962 00:58:19,936 --> 00:58:20,810 So that part's great. 963 00:58:20,810 --> 00:58:24,710 The only catch is it's static. 964 00:58:24,710 --> 00:58:30,500 But beggars can't be choosers I guess. 965 00:58:30,500 --> 00:58:31,040 All right. 966 00:58:34,102 --> 00:58:36,060 I'm not sure who's begging in that analogy but. 967 00:58:40,370 --> 00:58:41,900 The keys who want to be stored. 968 00:58:41,900 --> 00:58:43,580 I don't know. 969 00:58:43,580 --> 00:58:48,170 All right, so the big idea for perfect hashing 970 00:58:48,170 --> 00:58:49,350 is to use two levels. 971 00:58:55,710 --> 00:58:57,980 So let me draw a picture. 
972 00:58:57,980 --> 00:59:04,240 We have our universe, and we're mapping that via hash function 973 00:59:04,240 --> 00:59:07,180 h1 into a table. 974 00:59:07,180 --> 00:59:08,640 Look familiar? 975 00:59:08,640 --> 00:59:11,540 Exactly the diagram I drew before. 976 00:59:11,540 --> 00:59:14,970 It's going to have some table size m. 977 00:59:14,970 --> 00:59:22,090 And we're going to set m to be within a constant factor of n. 978 00:59:22,090 --> 00:59:25,410 So right now it looks exactly like regular-- 979 00:59:25,410 --> 00:59:27,380 and it's going to be a universal, 980 00:59:27,380 --> 00:59:30,160 h1 is chosen from a universal hash family, 981 00:59:30,160 --> 00:59:34,080 so universal hashing applies. 982 00:59:34,080 --> 00:59:38,760 The trouble is we're going to get some lists here. 983 00:59:38,760 --> 00:59:44,755 And we don't want to store the set of colliding elements, 984 00:59:44,755 --> 00:59:47,380 the set of elements that hash to that place, with a linked list 985 00:59:47,380 --> 00:59:50,550 because linked lists are slow. 986 00:59:50,550 --> 00:59:53,692 Instead we're going to store them using a hash table. 987 00:59:53,692 --> 00:59:56,000 It sounds crazy. 988 00:59:56,000 --> 01:00:01,340 But we're going to have-- so this is position 1. 989 01:00:01,340 --> 01:00:04,470 This is going to be h2,1. 990 01:00:04,470 --> 01:00:10,500 There's going to be another hash function h2,0 that maps to some 991 01:00:10,500 --> 01:00:11,230 other hash table. 992 01:00:11,230 --> 01:00:14,180 These hash tables are going to be of varying sizes. 993 01:00:14,180 --> 01:00:19,300 Some of them will be of size 0 because nothing hashes there. 994 01:00:19,300 --> 01:00:21,420 But in general each of these slots 995 01:00:21,420 --> 01:00:25,570 is going to map instead of to a linked list to a hash table. 996 01:00:25,570 --> 01:00:31,260 So this would be h2, m minus 1. 
997 01:00:31,260 --> 01:00:33,840 I'm going to guarantee in the second level of hashing 998 01:00:33,840 --> 01:00:34,960 there are zero collisions. 999 01:00:50,590 --> 01:00:53,130 Let that sink in a little bit. 1000 01:00:53,130 --> 01:00:56,100 Let me write down a little more carefully what I'm doing. 1001 01:01:09,050 --> 01:01:12,330 So h1 is picked from a universal hash family, 1002 01:01:20,220 --> 01:01:25,420 where m is theta n. 1003 01:01:25,420 --> 01:01:27,680 I want to put a theta-- I mean I could say m equals n, 1004 01:01:27,680 --> 01:01:29,810 but sometimes we require m to be a prime. 1005 01:01:29,810 --> 01:01:32,164 So I'm going to give you some slop in how you choose m. 1006 01:01:32,164 --> 01:01:33,830 So it can be prime or whatever you want. 1007 01:01:36,370 --> 01:01:37,880 And then at the first level we're 1008 01:01:37,880 --> 01:01:40,810 basically doing hashing with chaining. 1009 01:01:40,810 --> 01:01:50,580 And now I want to look at each slot in that hash table. 1010 01:01:50,580 --> 01:01:51,650 So between 0 and m-1. 1011 01:01:55,520 --> 01:02:02,150 I'm going to let lj be the number of keys that hash there-- 1012 01:02:02,150 --> 01:02:05,520 it's the length of the list that would go there. 1013 01:02:05,520 --> 01:02:07,130 It's going to be the number of keys, 1014 01:02:07,130 --> 01:02:23,730 among just the n keys, hashing to slot j. 1015 01:02:26,760 --> 01:02:30,200 So now the big question is, if I have lj keys here, 1016 01:02:30,200 --> 01:02:31,830 how big do I make that table? 1017 01:02:31,830 --> 01:02:33,750 You might say, well, I make it theta lj. 1018 01:02:33,750 --> 01:02:34,750 That's what I always do. 1019 01:02:34,750 --> 01:02:36,640 But that's not what I'm going to do. 1020 01:02:36,640 --> 01:02:38,380 That wouldn't help. 1021 01:02:38,380 --> 01:02:40,490 We'd get exactly, I think, the same number 1022 01:02:40,490 --> 01:02:44,450 of collisions if we did that, more or less, in expectation. 
1023 01:02:44,450 --> 01:02:49,340 So we're going to do something else. 1024 01:02:49,340 --> 01:02:54,150 We're going to pick a hash function from a universal 1025 01:02:54,150 --> 01:02:55,300 family, h2,j. 1026 01:02:58,720 --> 01:03:00,170 It again maps from the same universe. 1027 01:03:05,510 --> 01:03:08,200 The key thing is the size of the hash table 1028 01:03:08,200 --> 01:03:13,227 I'm going to choose, which is lj squared. 1029 01:03:32,510 --> 01:03:37,890 So if there are 3 elements that happen to hash to this slot, 1030 01:03:37,890 --> 01:03:42,730 this table will have size 9. 1031 01:03:42,730 --> 01:03:43,920 So it's mostly empty. 1032 01:03:43,920 --> 01:03:47,180 Only a square root fraction-- if that's a word, if that's 1033 01:03:47,180 --> 01:03:48,930 a phrase-- will be full. 1034 01:03:48,930 --> 01:03:50,050 Most of it's empty. 1035 01:03:50,050 --> 01:03:50,735 Why squared? 1036 01:03:53,370 --> 01:03:55,820 Any ideas? 1037 01:03:55,820 --> 01:03:59,460 I claim this will guarantee zero collisions with decent chance. 1038 01:03:59,460 --> 01:03:59,960 Yeah. 1039 01:03:59,960 --> 01:04:01,860 STUDENT: With 1/2 probability you're 1040 01:04:01,860 --> 01:04:03,474 going to end up with no collisions. 1041 01:04:03,474 --> 01:04:04,890 ERIK DEMAINE: With 1/2 probability 1042 01:04:04,890 --> 01:04:05,880 I'm going to end up with no collisions. 1043 01:04:05,880 --> 01:04:06,290 Why? 1044 01:04:06,290 --> 01:04:06,998 What's it called? 1045 01:04:09,516 --> 01:04:11,219 STUDENT: Markov [INAUDIBLE] 1046 01:04:11,219 --> 01:04:13,260 ERIK DEMAINE: Markov's inequality would prove it. 1047 01:04:13,260 --> 01:04:17,970 But it's more commonly known as the, whoa, 1048 01:04:17,970 --> 01:04:21,020 as the birthday paradox. 1049 01:04:21,020 --> 01:04:25,280 So the whole name of the game here is the birthday paradox. 
1050 01:04:25,280 --> 01:04:29,315 If I have, how's it go, if I have n 1051 01:04:29,315 --> 01:04:33,450 squared people with n possible birthdays then-- 1052 01:04:33,450 --> 01:04:35,430 is that the right way? 1053 01:04:35,430 --> 01:04:36,240 No, less. 1054 01:04:36,240 --> 01:04:40,280 If I have n people and n squared possible birthdays, 1055 01:04:40,280 --> 01:04:42,700 the probability of getting a collision, a shared birthday, 1056 01:04:42,700 --> 01:04:44,390 is 1/2. 1057 01:04:44,390 --> 01:04:46,740 Normally we think of that as a funny thing. 1058 01:04:46,740 --> 01:04:48,860 You know, if I choose a fair number of people, 1059 01:04:48,860 --> 01:04:51,330 then I get immediately a collision. 1060 01:04:51,330 --> 01:04:52,830 I'm going to do it the opposite way. 1061 01:04:52,830 --> 01:04:56,130 I'm going to guarantee that there's so many birthdays 1062 01:04:56,130 --> 01:04:59,560 that no 2 of them will collide with probability 1/2. 1063 01:04:59,560 --> 01:05:00,430 No, 1/2 is not great. 1064 01:05:00,430 --> 01:05:01,430 We're going to fix that. 1065 01:05:08,230 --> 01:05:11,880 So actually I haven't given you the whole algorithm yet. 1066 01:05:11,880 --> 01:05:14,050 There are two steps, 1 and 2. 1067 01:05:14,050 --> 01:05:19,920 But there are also two other steps 1.5 and 2.5. 1068 01:05:19,920 --> 01:05:22,290 But this is the right idea and this will make 1069 01:05:22,290 --> 01:05:23,660 things work in expectation. 1070 01:05:23,660 --> 01:05:26,020 But I'm going to tweak it a little bit. 1071 01:05:28,810 --> 01:05:30,650 So first let me tell you step 1.5. 1072 01:05:30,650 --> 01:05:33,170 It fits in between the two. 1073 01:05:33,170 --> 01:05:38,100 I want that the space of this data structure is linear. 1074 01:05:38,100 --> 01:05:40,050 So I need to make sure it is. 
1075 01:05:40,050 --> 01:05:48,840 If the sum j equals 0 to m minus 1 of lj squared 1076 01:05:48,840 --> 01:05:50,570 is bigger than some constant times 1077 01:05:50,570 --> 01:05:55,250 n-- we'll figure out what the constant is later-- then redo 1078 01:05:55,250 --> 01:05:58,020 step 1. 1079 01:05:58,020 --> 01:06:01,360 So after I do step 1, I know how big all these tables 1080 01:06:01,360 --> 01:06:02,240 are going to be. 1081 01:06:02,240 --> 01:06:07,140 If the sum of those squares is bigger than linear, start over. 1082 01:06:07,140 --> 01:06:09,180 I need to prove that this will only 1083 01:06:09,180 --> 01:06:12,180 have to take-- this will happen an expected 1084 01:06:12,180 --> 01:06:13,690 constant number of times. 1085 01:06:13,690 --> 01:06:16,120 log n times with high probability. 1086 01:06:16,120 --> 01:06:21,290 In fact why don't we-- yeah, let's worry about that later. 1087 01:06:24,150 --> 01:06:27,690 Let me first tell you step 2.5 which 1088 01:06:27,690 --> 01:06:30,650 is I want there to be zero collisions in each 1089 01:06:30,650 --> 01:06:31,720 of these tables. 1090 01:06:31,720 --> 01:06:34,170 It's only going to happen with probability of 1/2 1091 01:06:34,170 --> 01:06:37,900 So if it doesn't happen, just try again. 1092 01:06:37,900 --> 01:06:50,160 So 2.5 is while there's some hash function h2,j that maps 2 1093 01:06:50,160 --> 01:07:02,310 keys that we're given to the same slot at the second level, 1094 01:07:02,310 --> 01:07:17,290 this is for some j and let's say ki different from ki prime. 1095 01:07:17,290 --> 01:07:20,310 But they map to the same place by the first hash function. 1096 01:07:26,350 --> 01:07:29,680 So if two keys map to the same secondary table 1097 01:07:29,680 --> 01:07:32,100 and there's a conflict, then I'm just 1098 01:07:32,100 --> 01:07:36,020 going to redo that construction. 1099 01:07:36,020 --> 01:07:40,420 So I'm going to repick h2,j. 1100 01:07:40,420 --> 01:07:42,020 h2,j was a random choice. 
1101 01:07:42,020 --> 01:07:47,230 So if I get a bad choice, I'll just try another one. 1102 01:07:47,230 --> 01:07:50,045 Just keep randomly choosing the a 1103 01:07:50,045 --> 01:07:51,910 or randomly choosing this hash function 1104 01:07:51,910 --> 01:07:55,780 until there are zero collisions in that secondary table. 1105 01:07:55,780 --> 01:07:57,920 And I'm going to do this for each table. 1106 01:07:57,920 --> 01:08:00,600 So we'll worry about how long these will take, 1107 01:08:00,600 --> 01:08:02,745 but I claim expected constant number of trials. 1108 01:08:05,560 --> 01:08:07,250 So let's do the second one first. 1109 01:08:13,040 --> 01:08:16,870 After we do this while loop there are no collisions 1110 01:08:16,870 --> 01:08:19,050 with the proper notion of the word collisions, which 1111 01:08:19,050 --> 01:08:21,750 is two different keys mapping to the same value. 1112 01:08:35,970 --> 01:08:41,470 So at this point we have guaranteed 1113 01:08:41,470 --> 01:08:43,220 that searches are constant time worst 1114 01:08:43,220 --> 01:08:48,740 case after we do all these 4 steps because we apply h1, 1115 01:08:48,740 --> 01:08:51,029 we figure out which slot we fit in. 1116 01:08:51,029 --> 01:08:53,930 Say it's slot j, then we apply h2,j 1117 01:08:53,930 --> 01:08:56,689 and if your item's in the overall table, 1118 01:08:56,689 --> 01:08:58,410 it should be in that secondary table. 1119 01:08:58,410 --> 01:09:00,243 Because there are no collisions, you can just check: 1120 01:09:00,243 --> 01:09:02,130 is that one item the one I'm looking for? 1121 01:09:02,130 --> 01:09:02,920 If so, return it. 1122 01:09:02,920 --> 01:09:04,699 If not, it's not anywhere. 1123 01:09:04,699 --> 01:09:07,829 If there are no collisions then I 1124 01:09:07,829 --> 01:09:10,120 don't need chains coming out of here because it is just 1125 01:09:10,120 --> 01:09:10,800 a single item. 
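The whole construction-- steps 1, 1.5, 2, and 2.5, plus search-- can be sketched in Python. This is a minimal sketch under assumptions, not the lecture's exact construction: it assumes distinct integer keys, stands in a multiplication-style universal family, and picks the space constant c = 4 arbitrarily.

```python
import random

def _universal(m, p=(1 << 61) - 1):
    # One random member of a universal family (illustrative choice:
    # h(k) = ((a*k + b) mod p) mod m, p prime and larger than any key).
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

def build_perfect(keys, c=4):
    """Static perfect hash table for distinct integer keys."""
    n = len(keys)
    m = max(1, n)                        # first-level size, m = Theta(n)
    # Steps 1 and 1.5: pick h1; redo until the sum of lj^2 is linear.
    while True:
        h1 = _universal(m)
        buckets = [[] for _ in range(m)]
        for k in keys:
            buckets[h1(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) <= c * n:
            break
    # Steps 2 and 2.5: per slot, a table of size lj^2 with no collisions.
    tables = []
    for bucket in buckets:
        size = max(1, len(bucket) ** 2)
        while True:
            h2 = _universal(size)
            slots = [None] * size
            ok = True
            for k in bucket:
                s = h2(k)
                if slots[s] is not None:   # collision: repick h2,j
                    ok = False
                    break
                slots[s] = k
            if ok:
                break
        tables.append((h2, slots))

    def search(k):
        # Worst-case constant time: two hash evaluations, one comparison.
        h2, slots = tables[h1(k)]
        return slots[h2(k)] == k
    return search

keys = list(range(0, 4000, 7))   # 572 distinct keys
search = build_perfect(keys)
assert search(7) and not search(8)
```

With c = 4 the step 1.5 retry succeeds with probability at least 1/2 per attempt by Markov's inequality, so both retry loops terminate quickly, matching the coin-flipping analysis in the lecture.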
1126 01:09:13,750 --> 01:09:16,760 The big question-- so constant worst case space 1127 01:09:16,760 --> 01:09:19,130 because 1.5 guarantees that. 1128 01:09:19,130 --> 01:09:20,964 Constant worst case time for search. 1129 01:09:20,964 --> 01:09:23,130 The big question is, how long does it take to build? 1130 01:09:23,130 --> 01:09:25,330 How many times do we have to redo 1131 01:09:25,330 --> 01:09:28,890 steps 1 and 2 before we get a decent-- before we 1132 01:09:28,890 --> 01:09:30,130 get a perfect hash table. 1133 01:09:32,979 --> 01:09:35,880 So let me remind you of the birthday 1134 01:09:35,880 --> 01:09:37,750 paradox, why it works here. 1135 01:09:54,530 --> 01:09:59,847 As mentioned earlier this is going to be a union bound. 1136 01:09:59,847 --> 01:10:01,680 We want to know the probability of collision 1137 01:10:01,680 --> 01:10:02,930 at that second level. 1138 01:10:02,930 --> 01:10:06,754 Well that's at most the sum of all possible collisions, 1139 01:10:06,754 --> 01:10:07,920 probabilities of collisions. 1140 01:10:07,920 --> 01:10:09,910 So I'm going to say the sum over all i 1141 01:10:09,910 --> 01:10:14,340 not equal to i prime of the probability. 1142 01:10:14,340 --> 01:10:16,800 Now this is over our choice of the hash function h2,j. 1143 01:10:19,848 --> 01:10:29,120 Of h2,j of ki equaling h2,j of ki prime. 1144 01:10:29,120 --> 01:10:30,970 So the union bound says, of course. 1145 01:10:30,970 --> 01:10:33,080 The probability of any of them happening-- 1146 01:10:33,080 --> 01:10:35,380 we don't know about interdependence or whatnot-- 1147 01:10:35,380 --> 01:10:39,730 but it's certainly at most the sum of each of these possible events. 1148 01:10:39,730 --> 01:10:42,150 There are a lot of possible events. 1149 01:10:42,150 --> 01:10:43,620 If there are li things, there 1150 01:10:43,620 --> 01:10:47,462 are going to be li choose 2 possible collisions 1151 01:10:47,462 --> 01:10:48,420 we have to worry about. 
1152 01:10:48,420 --> 01:10:49,836 We know i is not equal to i prime. 1153 01:10:53,360 --> 01:10:57,710 So the number of terms here is li choose 2. 1154 01:11:00,890 --> 01:11:02,115 And what's this probability? 1155 01:11:06,120 --> 01:11:07,990 STUDENT: [INAUDIBLE] 1156 01:11:07,990 --> 01:11:14,420 ERIK DEMAINE: 1/li at most because we're assuming h2,j is 1157 01:11:14,420 --> 01:11:17,880 a universal hash function so the probability of choosing-- 1158 01:11:17,880 --> 01:11:18,780 sorry? 1159 01:11:18,780 --> 01:11:19,530 li squared. 1160 01:11:19,530 --> 01:11:20,640 Thank you. 1161 01:11:20,640 --> 01:11:23,020 The size of the table. 1162 01:11:23,020 --> 01:11:27,230 1/m but m in this case, the size of our table, is li squared. 1163 01:11:27,230 --> 01:11:30,455 So the probability that these two 1164 01:11:30,455 --> 01:11:32,790 particular keys hit the same slot 1165 01:11:32,790 --> 01:11:34,740 is at most 1/li squared. 1166 01:11:34,740 --> 01:11:37,740 This is basically li squared / 2. 1167 01:11:37,740 --> 01:11:40,690 And so this is at most 1/2. 1168 01:11:40,690 --> 01:11:43,030 It's actually slightly less than li squared / 2. 1169 01:11:43,030 --> 01:11:45,274 So this is at most 1/2. 1170 01:11:45,274 --> 01:11:46,940 And this is basically a birthday paradox 1171 01:11:46,940 --> 01:11:48,375 in this particular case. 1172 01:11:48,375 --> 01:11:49,750 That means there is a probability 1173 01:11:49,750 --> 01:11:53,610 of at least a half that there are zero collisions in one 1174 01:11:53,610 --> 01:11:54,620 of these tables. 1175 01:11:54,620 --> 01:11:57,160 So that means I'm basically flipping a fair coin. 1176 01:11:57,160 --> 01:11:58,922 If I ever get a heads I'm happy. 1177 01:11:58,922 --> 01:12:00,630 Each time I get a tails I have to reflip. 1178 01:12:00,630 --> 01:12:03,060 This should sound familiar from last time. 1179 01:12:03,060 --> 01:12:14,045 So this is 2 expected trials or log n with high probability. 
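That union-bound arithmetic is easy to check directly, and a quick simulation shows the no-collision probability really is above 1/2. The simulation uses uniform random slots as a stand-in for a universal family-- an assumption for illustration, not the lecture's exact family.

```python
import random
from math import comb

# Union bound: with l keys in a table of size l^2,
# Pr[some pair collides] <= C(l, 2) * (1 / l^2) < 1/2.
for l in range(2, 100):
    assert comb(l, 2) / (l * l) < 0.5

def no_collision_rate(l, trials=3000):
    # Throw l keys into l^2 slots uniformly at random and
    # measure how often all the chosen slots are distinct.
    ok = 0
    for _ in range(trials):
        slots = [random.randrange(l * l) for _ in range(l)]
        ok += (len(set(slots)) == l)
    return ok / trials

rate = no_collision_rate(6)
assert rate > 0.5   # heads with probability at least 1/2
```

For l = 6 the exact no-collision probability works out to about 0.64, comfortably above the 1/2 the union bound promises.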
1180 01:12:20,710 --> 01:12:23,600 We've proved log n with high probability. 1181 01:12:23,600 --> 01:12:26,360 That's the same as saying the number of levels in a skip list 1182 01:12:26,360 --> 01:12:28,330 is log n with high probability. 1183 01:12:28,330 --> 01:12:30,959 How many times do I have to flip a coin before I get a heads? 1184 01:12:30,959 --> 01:12:32,000 Definitely at most log n. 1185 01:12:35,620 --> 01:12:38,530 Now we have to do this for each secondary table. 1186 01:12:38,530 --> 01:12:41,700 There are m = theta(n) secondary tables. 1187 01:12:50,110 --> 01:12:53,490 There's a slight question of how big are the secondary tables. 1188 01:12:53,490 --> 01:12:56,770 If one of these tables is like linear size, 1189 01:12:56,770 --> 01:12:59,600 then I have to spend linear time for a trial. 1190 01:12:59,600 --> 01:13:02,450 And then I multiply that by the number of trials 1191 01:13:02,450 --> 01:13:05,050 and also the number of different tables-- that would be like n 1192 01:13:05,050 --> 01:13:06,670 squared log n. 1193 01:13:06,670 --> 01:13:11,460 But you know a secondary table better not have linear size. 1194 01:13:11,460 --> 01:13:14,450 I mean an li equal to n. 1195 01:13:14,450 --> 01:13:16,850 That would be bad because then li squared is n squared 1196 01:13:16,850 --> 01:13:20,540 and we guaranteed that we had linear space. 1197 01:13:20,540 --> 01:13:25,790 So in fact you can prove with another Chernoff bound. 1198 01:13:25,790 --> 01:13:27,450 Let me put this over here. 1199 01:13:34,330 --> 01:13:36,940 That all the li's are pretty small. 1200 01:13:36,940 --> 01:13:42,730 Not constant but logarithmic. 1201 01:13:42,730 --> 01:13:50,400 So li is order log n with high probability for each i 1202 01:13:50,400 --> 01:13:51,850 and therefore for all i. 1203 01:13:51,850 --> 01:13:56,550 So I can just increase the alpha by 1 in the n to the minus alpha 1204 01:13:56,550 --> 01:14:00,654 and get that for all i this happens. 
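This is the classic balls-in-bins bound, and it's easy to see empirically. Again the simulation uses uniform random slots as a stand-in for a universal family, and the cap of 30 below is just a loose illustrative threshold, far above the true maximum load for this n.

```python
import random

def max_bucket(n):
    # Hash n keys into n slots uniformly; return the largest lj.
    counts = [0] * n
    for _ in range(n):
        counts[random.randrange(n)] += 1
    return max(counts)

n = 10000
load = max(max_bucket(n) for _ in range(5))
# Typical values are in the single digits for n = 10000,
# consistent with the Theta(log n / log log n) truth.
assert load < 30
```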
1205 01:14:00,654 --> 01:14:02,570 In fact, the right answer is log over log log, 1206 01:14:02,570 --> 01:14:04,620 if you want to do some really messy analysis. 1207 01:14:04,620 --> 01:14:08,430 But logarithmic is fine for us. 1208 01:14:08,430 --> 01:14:10,920 So what this means is we're doing 1209 01:14:10,920 --> 01:14:14,010 n different things; for each of them, 1210 01:14:14,010 --> 01:14:17,960 with high probability, li is of size log n. 1211 01:14:17,960 --> 01:14:20,470 And then maybe we'll have to do like log n trials 1212 01:14:20,470 --> 01:14:23,200 repeating until we get a good hash function there. 1213 01:14:23,200 --> 01:14:29,320 And so the total build time for steps 2 and 2.5 1214 01:14:29,320 --> 01:14:34,240 is going to be at most n times log squared n. 1215 01:14:34,240 --> 01:14:37,420 You can prove a tighter bound but it's polynomial. 1216 01:14:37,420 --> 01:14:41,200 That's all I wanted to go for and it's almost linear. 1217 01:14:41,200 --> 01:14:46,855 So I'm left with one thing to analyze which is step 1.5. 1218 01:14:46,855 --> 01:14:48,730 This to me is maybe the most surprising thing 1219 01:14:48,730 --> 01:14:50,120 that it works out. 1220 01:14:50,120 --> 01:14:53,490 I mean here we designed-- we did this li to li 1221 01:14:53,490 --> 01:14:55,480 squared thing so the birthday paradox would happen. 1222 01:14:55,480 --> 01:14:56,854 This is not surprising. 1223 01:14:56,854 --> 01:14:59,020 I mean it's a cool idea, but once you have the idea, 1224 01:14:59,020 --> 01:15:01,370 it's not surprising that it works. 1225 01:15:01,370 --> 01:15:03,370 What's a little more surprising is that squaring 1226 01:15:03,370 --> 01:15:05,670 is OK from a space perspective. 1227 01:15:05,670 --> 01:15:07,310 1.5 says we're going to have to rebuild 1228 01:15:07,310 --> 01:15:10,380 that first table until the sum of these squared lengths 1229 01:15:10,380 --> 01:15:11,470 is at most linear. 
1230 01:15:11,470 --> 01:15:13,540 I can guarantee that each of these 1231 01:15:13,540 --> 01:15:16,840 is logarithmic so the sum of the squares is at most like n log 1232 01:15:16,840 --> 01:15:17,950 squared n. 1233 01:15:17,950 --> 01:15:19,315 But I claim I can get linear. 1234 01:15:22,360 --> 01:15:25,580 Let's do that. 1235 01:15:25,580 --> 01:15:29,880 So for step 1.5 we're looking at what 1236 01:15:29,880 --> 01:15:35,410 is the probability of the sum of the lj squareds being 1237 01:15:35,410 --> 01:15:37,640 more than linear. 1238 01:15:37,640 --> 01:15:38,750 Sorry. 1239 01:15:38,750 --> 01:15:39,852 Expectation. 1240 01:15:39,852 --> 01:15:41,310 Let's first compute the expectation 1241 01:15:41,310 --> 01:15:43,710 and then we'll talk about a tail bound 1242 01:15:43,710 --> 01:15:45,420 which is the probability that we're much 1243 01:15:45,420 --> 01:15:46,980 bigger than the expectation. 1244 01:15:46,980 --> 01:15:50,280 First thing is I claim the expectation is linear. 1245 01:15:50,280 --> 01:15:56,000 So again whenever we're counting something-- 1246 01:15:56,000 --> 01:15:59,370 I mean this is basically the total number of pairs 1247 01:15:59,370 --> 01:16:02,580 of items that collide at the first level 1248 01:16:02,580 --> 01:16:05,060 with double counting. 1249 01:16:05,060 --> 01:16:08,940 So I mean if you think of lj and then I make a complete graph 1250 01:16:08,940 --> 01:16:11,910 on those lj items, that's going to have 1251 01:16:11,910 --> 01:16:14,400 like the squared number of edges, 1252 01:16:14,400 --> 01:16:16,700 if I also multiply by 2. 1253 01:16:16,700 --> 01:16:19,060 So this is the same thing as counting 1254 01:16:19,060 --> 01:16:25,180 how many pairs of items map to the same spot, the same slot. 
1255 01:16:25,180 --> 01:16:28,890 So this is going to-- and that I can write as an indicator 1256 01:16:28,890 --> 01:16:30,820 random variable which lets me use linearity 1257 01:16:30,820 --> 01:16:33,830 of expectation which makes me happy 1258 01:16:33,830 --> 01:16:36,940 because then everything is simple. 1259 01:16:36,940 --> 01:16:38,090 So I'm going to write Ii,j. 1260 01:16:41,210 --> 01:16:54,070 This is going to be 1 if h1 of ki equals h1 of kj 1261 01:16:54,070 --> 01:16:59,055 and it's going to be 0 otherwise. 1262 01:17:05,080 --> 01:17:07,080 This is the total number of pairwise colliding 1263 01:17:07,080 --> 01:17:10,520 items including i versus i. 1264 01:17:10,520 --> 01:17:14,210 And so like if li equals 1, li squared is also 1. 1265 01:17:14,210 --> 01:17:15,820 There's 1 item colliding with itself. 1266 01:17:15,820 --> 01:17:19,490 So this actually works exactly. 1267 01:17:19,490 --> 01:17:21,840 All right, with the wrong definition of colliding. 1268 01:17:21,840 --> 01:17:24,140 If you bear with me. 1269 01:17:24,140 --> 01:17:26,735 So now we can use linearity of expectation 1270 01:17:26,735 --> 01:17:28,890 and put the E in here. 1271 01:17:28,890 --> 01:17:35,400 So this is sum i equals 1 to n sum j equals 1 to n 1272 01:17:35,400 --> 01:17:39,590 of the expectation of Ii,j. 1273 01:17:39,590 --> 01:17:42,982 But we know the expectation of the Ii,j is the probability 1274 01:17:42,982 --> 01:17:45,440 of it equaling 1 because it's an indicator random variable. 1275 01:17:45,440 --> 01:17:48,582 The probability of this happening over our choice of h1 1276 01:17:48,582 --> 01:17:51,170 is at most 1/m by universality. 1277 01:17:51,170 --> 01:17:53,580 Here it actually is 1/m because we're at the first level. 1278 01:17:53,580 --> 01:17:59,520 So this is at most 1/m, where m is theta n. 1279 01:18:02,420 --> 01:18:10,930 That's when i does not equal j. So it's a little bit annoying. 
1280 01:18:10,930 --> 01:18:16,570 I do have to separate out the i equals j terms from the different 1281 01:18:16,570 --> 01:18:17,994 i not equal to j terms. 1282 01:18:17,994 --> 01:18:19,660 But there's only-- I mean it's basically 1283 01:18:19,660 --> 01:18:21,170 the diagonal of this matrix. 1284 01:18:21,170 --> 01:18:24,510 There's n things that will always collide with themselves. 1285 01:18:24,510 --> 01:18:30,690 So we're going to get like n plus the number of i 1286 01:18:30,690 --> 01:18:32,430 not equal to j pairs, double counted. 1287 01:18:32,430 --> 01:18:35,520 So it's like 2 times n choose 2. 1288 01:18:35,520 --> 01:18:38,180 But we get to divide by m. 1289 01:18:38,180 --> 01:18:40,930 So this is like n squared / n. 1290 01:18:40,930 --> 01:18:46,070 So we get order n. 1291 01:18:46,070 --> 01:18:49,130 So that's not-- well, that's cool. 1292 01:18:49,130 --> 01:18:51,073 Expected space is linear. 1293 01:18:51,073 --> 01:18:52,531 This is what makes everything work. 1294 01:18:59,410 --> 01:19:01,840 Last class was about getting with high probability bounds 1295 01:19:01,840 --> 01:19:03,070 when we're working with logs. 1296 01:19:05,610 --> 01:19:07,570 When you want to get that something 1297 01:19:07,570 --> 01:19:09,280 is log with high probability, you 1298 01:19:09,280 --> 01:19:11,690 have to use, with respect to n, you 1299 01:19:11,690 --> 01:19:13,670 have to use a Chernoff bound. 1300 01:19:13,670 --> 01:19:17,290 But this is about-- now I want to show that the space is 1301 01:19:17,290 --> 01:19:19,090 linear with high probability. 1302 01:19:19,090 --> 01:19:20,630 Linear is actually really easy. 1303 01:19:20,630 --> 01:19:24,560 You can use a much weaker bound called Markov's inequality. 
1304 01:19:24,560 --> 01:19:36,200 So I want to claim that the probability, over the choice of h1, of this thing-- 1305 01:19:36,200 --> 01:19:39,980 the sum of the lj squareds-- being bigger than some constant times 1306 01:19:39,980 --> 01:19:49,624 n is at most the expectation of that thing divided by cn. 1307 01:19:49,624 --> 01:19:50,790 This is Markov's inequality. 1308 01:19:50,790 --> 01:19:52,710 It holds for anything here. 1309 01:19:52,710 --> 01:19:54,320 So I'm just repeating it over here. 1310 01:19:58,550 --> 01:20:05,170 So this is nice because we know that this expectation is 1311 01:20:05,170 --> 01:20:06,970 linear. 1312 01:20:06,970 --> 01:20:12,000 So we're getting like a linear function divided by cn. 1313 01:20:12,000 --> 01:20:14,100 Remember we get to choose c. 1314 01:20:14,100 --> 01:20:16,500 The step said if it's bigger than some constant times n 1315 01:20:16,500 --> 01:20:18,050 then we're redoing the thing. 1316 01:20:18,050 --> 01:20:20,270 So I can choose c to be 100, whatever. 1317 01:20:20,270 --> 01:20:23,870 I'm going to choose it to be twice this constant. 1318 01:20:23,870 --> 01:20:27,460 And then this is at most half. 1319 01:20:27,460 --> 01:20:29,600 So the probability of my space being too big 1320 01:20:29,600 --> 01:20:30,670 is at most a half. 1321 01:20:30,670 --> 01:20:31,980 We're back to coin flipping. 1322 01:20:31,980 --> 01:20:33,590 Every time I flip a coin, if I get 1323 01:20:33,590 --> 01:20:40,550 heads I have the right amount of space at less than c times n 1324 01:20:40,550 --> 01:20:42,100 space. 1325 01:20:42,100 --> 01:20:43,650 If I get a tails I try again. 1326 01:20:43,650 --> 01:20:50,000 So the expected number of trials is at most 2-- 1327 01:20:50,000 --> 01:20:53,710 not trails, trials. 1328 01:20:53,710 --> 01:20:57,605 And it's also log n trials with high probability. 1329 01:21:01,510 --> 01:21:03,480 How much time do I spend for each trial? 1330 01:21:03,480 --> 01:21:04,186 Linear time. 
1331 01:21:04,186 --> 01:21:05,310 I choose one hash function. 1332 01:21:05,310 --> 01:21:06,850 I hash all the items. 1333 01:21:06,850 --> 01:21:10,120 I count the collisions, or the sum of the lj 1334 01:21:10,120 --> 01:21:10,620 squareds. 1335 01:21:10,620 --> 01:21:12,420 That takes linear time to do. 1336 01:21:12,420 --> 01:21:16,150 And so the total work I'm doing for these steps is n log n. 1337 01:21:20,920 --> 01:21:23,710 So n log n to do steps 1 and 1.5, 1338 01:21:23,710 --> 01:21:27,000 and n log squared n to do steps 2 and 2.5. 1339 01:21:27,000 --> 01:21:30,940 Overall n polylog, or polynomial time. 1340 01:21:30,940 --> 01:21:35,190 And we get guaranteed no collisions for static data. 1341 01:21:35,190 --> 01:21:39,714 Constant worst case search and linear worst case space. 1342 01:21:39,714 --> 01:21:41,630 This is kind of surprising that this works out 1343 01:21:41,630 --> 01:21:43,049 but everything's nice. 1344 01:21:47,780 --> 01:21:49,910 Now you know hashing.
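The expected-space calculation that makes step 1.5 work can be sanity checked numerically. The formula below, n + n(n-1)/m, is exactly the n diagonal terms plus the 2 times (n choose 2) off-diagonal terms, each contributing 1/m; the simulation uses uniform hashing as a stand-in for a universal family, which is an assumption for illustration only.

```python
import random

def sum_sq(n, m):
    # Sum over slots j of lj^2, for n keys hashed uniformly into m slots.
    counts = [0] * m
    for _ in range(n):
        counts[random.randrange(m)] += 1
    return sum(c * c for c in counts)

n = 2000
trials = 50
avg = sum(sum_sq(n, n) for _ in range(trials)) / trials
expected = n + n * (n - 1) / n   # equals 2n - 1 when m = n
# The empirical average lands close to 2n, i.e. order n, never n^2.
assert abs(avg - expected) < 0.25 * n
```

So with m = Theta(n) the squared bucket sizes sum to about 2n in expectation, which is exactly why Markov's inequality with c around 4 makes the step 1.5 retry succeed with constant probability.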