1 00:00:09,000 --> 00:00:10,000 Hashing. 2 00:00:15,000 --> 00:00:19,000 Today we're going to do some amazing stuff with hashing. 3 00:00:19,000 --> 00:00:21,000 And, really, this is such neat stuff, 4 00:00:21,000 --> 00:00:24,000 it's amazing. We're going to start by 5 00:00:24,000 --> 00:00:28,000 addressing a fundamental weakness of hashing. 6 00:00:34,000 --> 00:00:37,000 And that is that for any choice of hash function 7 00:00:49,000 --> 00:01:04,000 There exists a bad set of keys that all hash to the same slot. 8 00:01:09,000 --> 00:01:11,000 OK. So you pick a hash function. 9 00:01:11,000 --> 00:01:15,000 We looked at some that seem to work well in practice, 10 00:01:15,000 --> 00:01:18,000 that are easy to put into your code. 11 00:01:18,000 --> 00:01:23,000 But whichever one you pick, there's always some bad set of 12 00:01:23,000 --> 00:01:25,000 keys. So you can imagine, 13 00:01:25,000 --> 00:01:30,000 just to drive this point home a little bit. 14 00:01:30,000 --> 00:01:35,000 Imagine that you're building a compiler for a customer and you 15 00:01:35,000 --> 00:01:40,000 have a symbol table in your compiler and one of the things 16 00:01:40,000 --> 00:01:46,000 that the customer is demanding is that compilations go fast. 17 00:01:46,000 --> 00:01:50,000 They don't want to sit around waiting for compilations. 18 00:01:50,000 --> 00:01:56,000 And you have a competitor who's also building a compiler and 19 00:01:56,000 --> 00:02:01,000 they're going to test the compiler, both of your compilers 20 00:02:01,000 --> 00:02:07,000 and sort of have a run-off. And one of the things in the 21 00:02:07,000 --> 00:02:12,000 test that they're going to allow you to do is not only will the 22 00:02:12,000 --> 00:02:16,000 customer run his own benchmarks, but he'll let you make up 23 00:02:16,000 --> 00:02:20,000 benchmarks for the other program, for your competitor. 24 00:02:20,000 --> 00:02:24,000 And your competitor gets to make up benchmarks for you. 
25 00:02:24,000 --> 00:02:28,000 So and not only that, but you're actually sharing 26 00:02:28,000 --> 00:02:32,000 code. So you get to look at what the 27 00:02:32,000 --> 00:02:37,000 competitor is actually doing and what hash function they're 28 00:02:37,000 --> 00:02:40,000 actually using. So it's pretty clear that in 29 00:02:40,000 --> 00:02:44,000 this circumstance, you have an adversary who is 30 00:02:44,000 --> 00:02:49,000 going to look at whatever hash function you have and figure out 31 00:02:49,000 --> 00:02:53,000 OK, what's a set of variable names and so forth that are 32 00:02:53,000 --> 00:02:58,000 going to all hash to the same slot so that essentially you're 33 00:02:58,000 --> 00:03:03,000 just chasing through a linked list whenever it comes to 34 00:03:03,000 --> 00:03:07,000 looking something up. Slowing down your program 35 00:03:07,000 --> 00:03:12,000 enormously compared to if in fact they got distributed nicely 36 00:03:12,000 --> 00:03:15,000 across the hash table which is, what after all, 37 00:03:15,000 --> 00:03:19,000 you have a hash table in there to do in the first place. 38 00:03:19,000 --> 00:03:22,000 And so the question is, how do you defeat this 39 00:03:22,000 --> 00:03:26,000 adversary? And the answer is one word. 40 00:03:31,000 --> 00:03:33,000 One word. How do you achieve? 41 00:03:33,000 --> 00:03:37,000 How do you defeat any adversary in this class? 42 00:03:37,000 --> 00:03:38,000 Randomness. OK. 43 00:03:38,000 --> 00:03:39,000 Randomness. OK. 44 00:03:39,000 --> 00:03:42,000 You make it so that he can't guess. 45 00:03:42,000 --> 00:03:47,000 And the idea is that you choose a hash function at random. 46 00:03:47,000 --> 00:03:50,000 Independent. 
So he can look at the code, 47 00:03:50,000 --> 00:03:55,000 but when it actually runs, it's going to use a random hash 48 00:03:55,000 --> 00:04:00,000 function that he has no way of predicting what the hash 49 00:04:00,000 --> 00:04:05,000 function is that will actually be used. 50 00:04:05,000 --> 00:04:07,000 OK. So that's the game and that way 51 00:04:07,000 --> 00:04:11,000 he can provide an input, but he can't provide an input 52 00:04:11,000 --> 00:04:15,000 that's guaranteed to force you to run slowly. 53 00:04:15,000 --> 00:04:19,000 You might get unlucky in your choice of hash function, 54 00:04:19,000 --> 00:04:23,000 but it's not going to be because of the adversary. 55 00:04:23,000 --> 00:04:28,000 So the idea is to choose a hash function -- 56 00:04:34,000 --> 00:04:38,000 -- at random, independently from the keys 57 00:04:38,000 --> 00:04:42,000 that you're, that are going to be fed to it. 58 00:04:42,000 --> 00:04:47,000 So even if your adversary can see your code, 59 00:04:47,000 --> 00:04:53,000 he can't tell which hash function is going to be actually 60 00:04:53,000 --> 00:04:58,000 used at run time. Doesn't get to see the output 61 00:04:58,000 --> 00:05:04,000 of the random numbers. And so it turns out you can 62 00:05:04,000 --> 00:05:11,000 make this scheme work and the name of the scheme is universal 63 00:05:11,000 --> 00:05:17,000 hashing, OK, is one way of making this scheme work. 64 00:05:22,000 --> 00:05:34,000 So let's do some math. So let U be a universe of keys. 65 00:05:34,000 --> 00:05:41,000 And let H be a finite collection -- 66 00:05:48,000 --> 00:05:49,000 -- of hash functions -- 67 00:05:56,000 --> 00:06:04,000 -- mapping U to what are going to be the slots in our hash 68 00:06:04,000 --> 00:06:06,000 table. OK. 69 00:06:06,000 --> 00:06:11,000 So we just have H as some finite collection. 
70 00:06:11,000 --> 00:06:15,000 We say that H is universal -- 71 00:06:22,000 --> 00:06:30,000 -- if for all pairs of the keys, distinct keys -- 72 00:06:36,000 --> 00:06:41,000 -- so the keys are distinct, the following is true. 73 00:07:03,000 --> 00:07:08,000 So if the set of keys, if for any pair of keys I pick, 74 00:07:08,000 --> 00:07:15,000 the number of hash functions that hash those two keys to the 75 00:07:15,000 --> 00:07:21,000 same place is a one over m fraction of the total set of 76 00:07:21,000 --> 00:07:23,000 hash functions. So let m just, 77 00:07:23,000 --> 00:07:28,000 so to view that, another way of viewing that is 78 00:07:28,000 --> 00:07:33,000 if h is chosen randomly -- 79 00:07:39,000 --> 00:07:51,000 -- from the set of hash functions H, the probability of collision 80 00:07:51,000 --> 00:07:58,000 between x and y is what? 81 00:08:12,000 --> 00:08:17,000 What's the probability if the fraction of hash functions, 82 00:08:17,000 --> 00:08:22,000 OK, if the number of colliding hash functions is the size of H over m, 83 00:08:22,000 --> 00:08:27,000 what's the probability of a collision between x and y? 84 00:08:27,000 --> 00:08:32,000 If I pick a hash function at random. 85 00:08:32,000 --> 00:08:39,000 So I pick a hash function at random, what's the odds they 86 00:08:39,000 --> 00:08:42,000 collide? One over m. 87 00:08:42,000 --> 00:08:49,000 Now let's draw a picture for that, help people see that 88 00:08:49,000 --> 00:08:56,000 that's in fact the case. So imagine this is our set of 89 00:08:56,000 --> 00:09:00,000 all hash functions. OK. 90 00:09:00,000 --> 00:09:08,000 And then if I pick a particular x and y, let's say that this is 91 00:09:08,000 --> 00:09:16,000 the set of hash functions such that h of x is equal to h of y. 92 00:09:16,000 --> 00:09:23,000 And so what we're saying is that the cardinality of that set 93 00:09:23,000 --> 00:09:30,000 is one over m times the cardinality of H. 
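Written out, the definition described here can be stated as follows. This is a sketch reconstructed from the spoken description, using the lecture's symbols U, H, and m; the set-builder notation is an assumption about the board, not a quotation of it.

```latex
\mathcal{H} \text{ is universal if, for all } x, y \in U \text{ with } x \neq y:
\qquad
\bigl|\{\, h \in \mathcal{H} : h(x) = h(y) \,\}\bigr| = \frac{|\mathcal{H}|}{m}.

% Equivalently, when h is drawn uniformly at random from \mathcal{H}:
\Pr_{h \in \mathcal{H}}\bigl[\, h(x) = h(y) \,\bigr] = \frac{1}{m}.
```

The probability is taken over the choice of h, with x and y held fixed.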
94 00:09:30,000 --> 00:09:33,000 So if I throw a dart and pick one hash function at random, 95 00:09:33,000 --> 00:09:37,000 the odds are one in m that the hash function falls into this 96 00:09:37,000 --> 00:09:39,000 particular set. And of course, 97 00:09:39,000 --> 00:09:43,000 this has to be true of every x and y that I can pick. 98 00:09:43,000 --> 00:09:45,000 Of course, it will be a different set, 99 00:09:45,000 --> 00:09:49,000 a different x and y will somehow map the hash functions 100 00:09:49,000 --> 00:09:52,000 differently, but the odds that for any x and y that I pick, 101 00:09:52,000 --> 00:09:55,000 the odds that if I have a random hash function, 102 00:09:55,000 --> 00:10:00,000 it hashes it to the same place, is one over m. 103 00:10:00,000 --> 00:10:03,000 Now this is a little bit hard sometimes for people to get 104 00:10:03,000 --> 00:10:07,000 their head around because we're used to thinking of perhaps 105 00:10:07,000 --> 00:10:09,000 picking keys at random or something. 106 00:10:09,000 --> 00:10:11,000 OK, that's not what's going on here. 107 00:10:11,000 --> 00:10:14,000 We're picking hash functions at random. 108 00:10:14,000 --> 00:10:18,000 So our probability space is defined over the hash functions, 109 00:10:18,000 --> 00:10:21,000 not over the keys. And this has to be true now for 110 00:10:21,000 --> 00:10:24,000 any particular two keys that I pick that are distinct. 111 00:10:24,000 --> 00:10:28,000 That the places that they hash, this set of hash functions, 112 00:10:28,000 --> 00:10:34,000 I mean this is like a marvelous property if you think about it. 113 00:10:34,000 --> 00:10:39,000 OK, that you can actually find ones where no matter what two 114 00:10:39,000 --> 00:10:43,000 elements I pick, the odds are exactly one in m 115 00:10:43,000 --> 00:10:48,000 that a random hash function from this set is going to hash them 116 00:10:48,000 --> 00:10:51,000 to the same place. So very neat. 
117 00:10:51,000 --> 00:10:56,000 Very, very neat property and we'll see the mathematics 118 00:10:56,000 --> 00:11:00,000 associated with this is very cool. 119 00:11:00,000 --> 00:11:14,000 So our theorem is that if we choose h randomly from the set 120 00:11:14,000 --> 00:11:25,000 of hash functions H, and then we suppose we're 121 00:11:25,000 --> 00:11:37,000 hashing n keys into m slots in Table T -- 122 00:11:44,000 --> 00:11:46,000 -- then for given key x -- 123 00:11:52,000 --> 00:11:56,000 -- the expected number of collisions with x -- 124 00:12:03,000 --> 00:12:12,000 -- is less than n over m. And who remembers what we call 125 00:12:12,000 --> 00:12:16,000 n over m? Alpha, which is the, 126 00:12:16,000 --> 00:12:22,000 what's the term that we use there? 127 00:12:22,000 --> 00:12:30,000 Load factor. The load factor of the table. 128 00:12:30,000 --> 00:12:36,000 OK, load factor alpha. So the average number of keys 129 00:12:36,000 --> 00:12:42,000 per slot is the load factor of the table. 130 00:12:42,000 --> 00:12:48,000 So we're saying, so what is this theorem saying? 131 00:12:48,000 --> 00:12:55,000 It's saying that in fact, if we have one of these 132 00:12:55,000 --> 00:13:02,000 universal sets of hash functions, then things perform 133 00:13:02,000 --> 00:13:10,000 exactly the way we want them to. Things get distributed evenly. 134 00:13:10,000 --> 00:13:15,000 The number of things that are going to collide with any 135 00:13:15,000 --> 00:13:19,000 particular key that I pick is going to be n over m. 136 00:13:19,000 --> 00:13:22,000 So that's a really good property to have. 137 00:13:22,000 --> 00:13:27,000 Now I haven't shown you, the construction of U is going, 138 00:13:27,000 --> 00:13:31,000 sorry, of the set of hash functions H, that that 139 00:13:31,000 --> 00:13:36,000 construction will take us a little bit of effort. 
140 00:13:36,000 --> 00:13:39,000 But first I want to show you why this is such a great 141 00:13:39,000 --> 00:13:42,000 property. Basically it's this theorem. 142 00:13:42,000 --> 00:13:46,000 So let's prove this theorem. So any questions about what the 143 00:13:46,000 --> 00:13:50,000 statement of the theorem is? So we're going to go actually 144 00:13:50,000 --> 00:13:54,000 kind of fast today. We've got a lot of good stuff 145 00:13:54,000 --> 00:13:57,000 today. So I want to make sure people 146 00:13:57,000 --> 00:14:03,000 are onboard as we go through. So if there are any questions, 147 00:14:03,000 --> 00:14:07,000 make sure, you know, statement of theorem or 148 00:14:07,000 --> 00:14:13,000 whatever, best to get them out early so that you're not 149 00:14:13,000 --> 00:14:19,000 confused later on when the going gets a little more exciting. 150 00:14:19,000 --> 00:14:21,000 OK? OK, good. 151 00:14:21,000 --> 00:14:26,000 So to prove this, let's let C sub x be the random 152 00:14:26,000 --> 00:14:33,000 variable denoting the total number of collisions -- 153 00:14:38,000 --> 00:14:44,000 -- of keys in T with x. So this is a total number and 154 00:14:44,000 --> 00:14:51,000 one of the techniques that you use a lot in probabilistic 155 00:14:51,000 --> 00:14:57,000 analysis of randomized algorithms is recognizing that C 156 00:14:57,000 --> 00:15:05,000 sub x is in fact a sum of indicator random variables. 157 00:15:05,000 --> 00:15:11,000 If you can decompose things into indicator random variables, 158 00:15:11,000 --> 00:15:17,000 the analysis goes much more easily than if you're left with 159 00:15:17,000 --> 00:15:22,000 aggregate variables. So here we're going to let our 160 00:15:22,000 --> 00:15:27,000 indicator random variable be little c sub xy, 161 00:15:27,000 --> 00:15:32,000 which is going to be one if h of x equals h of y and 0 162 00:15:32,000 --> 00:15:35,000 otherwise. 163 00:15:40,000 --> 00:15:49,000 And so we can note two things. 
First, what is the expectation 164 00:15:49,000 --> 00:15:52,000 of c sub xy. 165 00:15:57,000 --> 00:16:00,000 OK, if I have a process which is picking a hash function at 166 00:16:00,000 --> 00:16:04,000 random, what's the expectation of c sub xy? 167 00:16:04,000 --> 00:16:07,000 One over m. Because that's basically this 168 00:16:07,000 --> 00:16:11,000 definition here. Now in other words I pick a 169 00:16:11,000 --> 00:16:16,000 hash function at random, what's the odds that the hash 170 00:16:16,000 --> 00:16:19,000 is the same? It's one over m. 171 00:16:19,000 --> 00:16:24,000 And then the other thing is, and the reason we pick this 172 00:16:24,000 --> 00:16:28,000 thing is that I can express capital C sub x, 173 00:16:28,000 --> 00:16:33,000 the random variable denoting the total number of collisions 174 00:16:33,000 --> 00:16:39,000 as being just the sum over all the keys y in the table except x 175 00:16:39,000 --> 00:16:46,000 of c sub xy. So for each one that would 176 00:16:46,000 --> 00:16:53,000 cause me a collision, with x, I add one and if it 177 00:16:53,000 --> 00:17:00,000 wouldn't cause me a collision, I add 0. 178 00:17:00,000 --> 00:17:06,000 And that adds up all of the collisions that I would have in 179 00:17:06,000 --> 00:17:09,000 the table with x. 180 00:17:17,000 --> 00:17:20,000 Are there any questions so far? Because this is the set-up. 181 00:17:20,000 --> 00:17:24,000 The set-up in most of these things, the set-up is where most 182 00:17:24,000 --> 00:17:27,000 students make mistakes and most practicing researchers make 183 00:17:27,000 --> 00:17:30,000 mistakes as well, let me tell you. 184 00:17:30,000 --> 00:17:32,000 And then once you get the set-up right, 185 00:17:32,000 --> 00:17:36,000 then working out the math is fine, but it's often that set-up 186 00:17:36,000 --> 00:17:40,000 of how do you actually translate the situation into the math. 187 00:17:40,000 --> 00:17:43,000 That's the hard part. 
Once you get that right, 188 00:17:43,000 --> 00:17:46,000 well, then, algebra, we can all do algebra. 189 00:17:46,000 --> 00:17:49,000 Of course, we can also all make mistakes doing algebra, 190 00:17:49,000 --> 00:17:53,000 but at least those mistakes are much easier to check than the 191 00:17:53,000 --> 00:17:57,000 one that does the translation. So I want to make sure people 192 00:17:57,000 --> 00:18:00,000 are sort of understanding of how that's set up. 193 00:18:00,000 --> 00:18:05,000 So now we just have to use our math skills. 194 00:18:05,000 --> 00:18:12,000 So the expectation then of the number of collisions is the 195 00:18:12,000 --> 00:18:18,000 expectation of C sub x and that's just the expectation of 196 00:18:18,000 --> 00:18:26,000 just plugging the sum over y in T minus the element x of c_xy. 197 00:18:26,000 --> 00:18:33,000 So that's just definition. And that's equal to the sum over 198 00:18:33,000 --> 00:18:39,000 y in T minus x of the expectation of c_xy. 199 00:18:39,000 --> 00:18:44,000 So why is that? Yeah, that's linearity. 200 00:18:52,000 --> 00:18:56,000 Linearity of expectation, doesn't require independence. 201 00:18:56,000 --> 00:19:00,000 It's true of all random variables. 202 00:19:00,000 --> 00:19:07,000 And that's equal to, and now the math gets easier. 203 00:19:07,000 --> 00:19:10,000 So what is that? One over m. 204 00:19:10,000 --> 00:19:16,000 That makes the summation easy to evaluate. 205 00:19:16,000 --> 00:19:22,000 That's just n minus one over m. 206 00:19:30,000 --> 00:19:35,000 So fairly simple analysis and shows you why we would love to 207 00:19:35,000 --> 00:19:41,000 have one of these sets of universal hash functions because 208 00:19:41,000 --> 00:19:45,000 if you have them, then they behave exactly the 209 00:19:45,000 --> 00:19:51,000 way you would want it to behave. And you defeat your adversary 210 00:19:51,000 --> 00:19:55,000 by just picking the hash function at random. 
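The chain of equalities spoken in this proof can be collected in one place. The following is a sketch assembled from the spoken steps, with C_x, c_{xy}, T, n, and m as defined in the lecture; writing the index set as T \setminus \{x\} is an assumed notation.

```latex
\begin{aligned}
E[C_x] &= E\Bigl[\,\sum_{y \in T \setminus \{x\}} c_{xy}\Bigr]
          && \text{(definition of } C_x\text{)} \\
       &= \sum_{y \in T \setminus \{x\}} E[c_{xy}]
          && \text{(linearity of expectation)} \\
       &= \sum_{y \in T \setminus \{x\}} \frac{1}{m}
          && \text{(universality of } \mathcal{H}\text{)} \\
       &= \frac{n-1}{m} \;<\; \frac{n}{m} = \alpha .
\end{aligned}
```

The sum has n minus one terms because x itself is excluded, which is why the bound comes out strictly below the load factor.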
211 00:19:55,000 --> 00:20:00,000 There's nothing he can do. Or she. 212 00:20:00,000 --> 00:20:02,000 OK, any questions about that proof? 213 00:20:02,000 --> 00:20:04,000 OK, now we get into the fun math. 214 00:20:04,000 --> 00:20:07,000 Constructing one of these babies. 215 00:20:07,000 --> 00:20:08,000 OK. 216 00:20:20,000 --> 00:20:23,000 This is not the only construction. 217 00:20:23,000 --> 00:20:31,000 This is a construction of a classic universal hash function. 218 00:20:31,000 --> 00:20:37,000 And there are other constructions in the literature 219 00:20:37,000 --> 00:20:42,000 and I think there's one on the practice quiz. 220 00:20:42,000 --> 00:20:47,000 So let's see. So this one works when m is 221 00:20:47,000 --> 00:20:51,000 prime. So it works when the set of 222 00:20:51,000 --> 00:20:57,000 slots is a prime number. Number of slots is a prime 223 00:20:57,000 --> 00:21:05,000 number. So the idea here is we're going 224 00:21:05,000 --> 00:21:16,000 to decompose any key k in our universe into r plus 1 digits. 225 00:21:16,000 --> 00:21:25,000 So k, we're going to look at as being k_0, k_1, up to 226 00:21:25,000 --> 00:21:33,000 k_r where 0 is less than or equal to k sub i, 227 00:21:33,000 --> 00:21:41,000 is less than or equal to m minus one. 228 00:21:41,000 --> 00:21:47,000 So the idea is in some sense we're looking at what the 229 00:21:47,000 --> 00:21:52,000 representation would be of k base m. 230 00:21:52,000 --> 00:21:58,000 So if it were base two, it would be just one bit at a 231 00:21:58,000 --> 00:22:01,000 time. These would just be the bits. 232 00:22:01,000 --> 00:22:05,000 I'm not going to do base two. We're going to do base m in 233 00:22:05,000 --> 00:22:09,000 general and so each of these represents one of the digits. 234 00:22:09,000 --> 00:22:13,000 And the way I've done it is I've done low order digit first. 235 00:22:13,000 --> 00:22:16,000 It actually doesn't matter. 
We're not actually going to 236 00:22:16,000 --> 00:22:20,000 care really about what the order is, but basically we're just 237 00:22:20,000 --> 00:22:24,000 looking at busting it into a tuple represented by each of 238 00:22:24,000 --> 00:22:27,000 those digits. So one algorithm for computing 239 00:22:27,000 --> 00:22:31,000 this out of k is take the remainder mod m. 240 00:22:31,000 --> 00:22:34,000 That's the low order one. OK, take what's left. 241 00:22:34,000 --> 00:22:37,000 Take the remainder of that mod m. 242 00:22:37,000 --> 00:22:39,000 Take whatever's left, etc. 243 00:22:39,000 --> 00:22:42,000 So you're familiar with the conversion to a base 244 00:22:42,000 --> 00:22:46,000 representation. That's exactly how we're 245 00:22:46,000 --> 00:22:49,000 getting this representation. So we treat, 246 00:22:49,000 --> 00:22:53,000 this is just a question of taking the data that we've got 247 00:22:53,000 --> 00:22:57,000 and treating it as an r plus one digit base m number. 248 00:22:57,000 --> 00:23:02,000 And now we invoke our randomized strategy. 249 00:23:02,000 --> 00:23:05,000 The randomized strategy is going to be able to have a class 250 00:23:05,000 --> 00:23:09,000 of hash functions that's dependent essentially on random 251 00:23:09,000 --> 00:23:11,000 numbers. And the random numbers we're 252 00:23:11,000 --> 00:23:15,000 going to pick is we're going to pick an a at random -- 253 00:23:28,000 --> 00:23:33,000 -- which we're also going to look at as a base m number. 254 00:23:33,000 --> 00:23:38,000 Each a_i is chosen randomly -- 255 00:23:49,000 --> 00:23:50,000 -- from -- 256 00:23:55,000 --> 00:23:58,000 -- 0 to m minus one. So one of our, 257 00:23:58,000 --> 00:24:03,000 it's a random if you will, it's a random base m digit. 258 00:24:03,000 --> 00:24:06,000 Random base m digit. So each one of these is picked 259 00:24:06,000 --> 00:24:09,000 at random. 
And for each one we, 260 00:24:09,000 --> 00:24:13,000 possible value of A, we're going to get a different 261 00:24:13,000 --> 00:24:16,000 hash function. So we're going to index our 262 00:24:16,000 --> 00:24:19,000 hash functions by this random number. 263 00:24:19,000 --> 00:24:23,000 So this is where the randomness is going to come in. 264 00:24:23,000 --> 00:24:28,000 Everybody with me? And here's the hash function. 265 00:24:56,000 --> 00:25:06,000 So what we do is we dot product this vector with this vector and 266 00:25:06,000 --> 00:25:11,000 take the result, mod m. 267 00:25:11,000 --> 00:25:18,000 So each digit of k of our key gets multiplied by a random 268 00:25:18,000 --> 00:25:25,000 other digit. We add all those up and we take 269 00:25:25,000 --> 00:25:29,000 that mod m. So that's a dot product 270 00:25:29,000 --> 00:25:34,000 operator. And this is what we're going to 271 00:25:34,000 --> 00:25:37,000 show is universal, that this set of h sub a, 272 00:25:37,000 --> 00:25:39,000 where I look over that whole set. 273 00:25:39,000 --> 00:25:44,000 So one of the things we need to know is how big is the set of 274 00:25:44,000 --> 00:25:46,000 hash functions here. 275 00:25:59,000 --> 00:26:01,000 So how big is this set of hash functions? 276 00:26:01,000 --> 00:26:07,000 How many different hash functions do I have in this set? 277 00:26:24,000 --> 00:26:31,000 It's basic 6.042 material. It's basically how many vectors 278 00:26:31,000 --> 00:26:38,000 of length r plus one where each element of the vector is a 279 00:26:38,000 --> 00:26:45,000 number of 0 to m minus one, has m different values. 280 00:26:45,000 --> 00:26:50,000 So how many? m minus one to the r. 281 00:26:50,000 --> 00:26:51,000 No. Close. 282 00:26:51,000 --> 00:26:56,000 It's up there. It's a big number. 283 00:26:56,000 --> 00:27:01,000 m to the r plus one. Good. 284 00:27:01,000 --> 00:27:06,000 It's m, so the size of H is equal to m to the r plus one. 
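The construction described so far can be sketched in code. This is a minimal illustration under stated assumptions, not the lecture's own implementation: the function names `digits_base_m`, `make_hash`, and `colliding_functions` are hypothetical, keys are assumed to be non-negative integers smaller than m to the r plus one, and m is assumed prime. The last function counts, by brute force, how many vectors a make two keys collide, which is the quantity the universality claim is about.

```python
import random
from itertools import product


def digits_base_m(k, m, r):
    """Return the r+1 base-m digits of k, low-order digit first."""
    ds = []
    for _ in range(r + 1):
        ds.append(k % m)   # remainder mod m is the next digit
        k //= m            # keep what's left
    return ds


def make_hash(m, r, rng=random):
    """Pick a = (a_0, ..., a_r) uniformly at random; return (a, h_a)."""
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h(k):
        # Dot product of a with the digits of k, taken mod m.
        return sum(ai * ki for ai, ki in zip(a, digits_base_m(k, m, r))) % m

    return a, h


def colliding_functions(x, y, m, r):
    """Count the vectors a with h_a(x) == h_a(y), by exhaustive enumeration."""
    dx, dy = digits_base_m(x, m, r), digits_base_m(y, m, r)
    count = 0
    for a in product(range(m), repeat=r + 1):
        hx = sum(ai * xi for ai, xi in zip(a, dx)) % m
        hy = sum(ai * yi for ai, yi in zip(a, dy)) % m
        if hx == hy:
            count += 1
    return count
```

With m = 5 and r = 1 there are 5 squared, or 25, functions in the family, so for a universal family each pair of distinct keys below 25 should collide under exactly 25 / 5 = 5 of them; `colliding_functions(x, y, 5, 1)` can be used to check that for small cases.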
285 00:27:06,000 --> 00:27:10,000 So we're going to want to remember that. 286 00:27:10,000 --> 00:27:13,000 OK, so let's just understand why that is. 287 00:27:13,000 --> 00:27:17,000 I have m choices for the first value of a. 288 00:27:17,000 --> 00:27:19,000 m for the second, etc. 289 00:27:19,000 --> 00:27:23,000 m for the r th. And since there are r plus one 290 00:27:23,000 --> 00:27:28,000 things here, for each choice here, I have this many same 291 00:27:28,000 --> 00:27:34,000 number of choices here, so it's a product. 292 00:27:34,000 --> 00:27:39,000 OK, so this is the product rule in counting. 293 00:27:39,000 --> 00:27:45,000 So if you haven't reviewed your 6.042 notes for counting, 294 00:27:45,000 --> 00:27:52,000 this is going to be a good idea to go back and review that 295 00:27:52,000 --> 00:27:57,000 because we're doing stuff of that nature. 296 00:27:57,000 --> 00:28:01,000 This is just the product rule. Good. 297 00:28:01,000 --> 00:28:10,000 So then the theorem we want to prove is that H is universal. 298 00:28:10,000 --> 00:28:14,000 And this is going to involve a little bit of number theory, 299 00:28:14,000 --> 00:28:19,000 so it gets kind of interesting. And it's a non-trivial proof, 300 00:28:19,000 --> 00:28:23,000 so this is where if there's any questions as I'm going along, 301 00:28:23,000 --> 00:28:28,000 please ask because the argument is not as simple as other 302 00:28:28,000 --> 00:28:33,000 arguments we've seen so far. OK, not that the ones we've seen so 303 00:28:33,000 --> 00:28:38,000 far have been simple, but this is definitely a more 304 00:28:38,000 --> 00:28:43,000 involved mathematical argument. So here's a proof. 305 00:28:43,000 --> 00:28:46,000 So let's let, so we have two keys. 
306 00:28:46,000 --> 00:28:50,000 What are we trying to show if it's universal, 307 00:28:50,000 --> 00:28:55,000 that if I pick any two keys, the number of hash functions 308 00:28:55,000 --> 00:29:01,000 for which they hash to the same thing is the size of set of hash 309 00:29:01,000 --> 00:29:08,000 functions divided by m. OK, so I'm going to look at two 310 00:29:08,000 --> 00:29:11,000 keys. So let's pick two keys 311 00:29:11,000 --> 00:29:16,000 arbitrarily. So x, and we'll decompose it 312 00:29:16,000 --> 00:29:23,000 into our base m representation and y, y_0, y_1 -- 313 00:29:33,000 --> 00:29:39,000 So these are two distinct keys. So if these are two distinct 314 00:29:39,000 --> 00:29:45,000 keys, so they're different, then this base representation 315 00:29:45,000 --> 00:29:50,000 has the property that they've got to differ somewhere. 316 00:29:50,000 --> 00:29:54,000 Right? OK, they differ in at least one 317 00:29:54,000 --> 00:29:56,000 digit. 318 00:30:08,000 --> 00:30:12,000 OK, and this is where most people get lost because I'm 319 00:30:12,000 --> 00:30:16,000 going to make a simplification. They could differ in any one of 320 00:30:16,000 --> 00:30:20,000 these digits. I'm going to say they differ in 321 00:30:20,000 --> 00:30:24,000 position 0 because it doesn't matter which one I do, 322 00:30:24,000 --> 00:30:28,000 the math is the same, but it'll make it so that if I 323 00:30:28,000 --> 00:30:31,000 said they differ in some position i, 324 00:30:31,000 --> 00:30:35,000 I would have to be taking summations as you'll see over 325 00:30:35,000 --> 00:30:41,000 the elements that are not i, and that's complicated. 326 00:30:41,000 --> 00:30:44,000 If I do it in position 0, then I can just sum for the 327 00:30:44,000 --> 00:30:46,000 rest of them. So the math is going to be 328 00:30:46,000 --> 00:30:50,000 identical if I were to do it for any position because it's 329 00:30:50,000 --> 00:30:52,000 symmetric. 
All the digits are symmetric. 330 00:30:52,000 --> 00:30:56,000 So let's say they differ in position 0, but the same 331 00:30:56,000 --> 00:30:59,000 argument is going to be true if they differed in some other 332 00:30:59,000 --> 00:31:02,000 position. So let's say, 333 00:31:02,000 --> 00:31:05,000 so we're saying without loss of generality. 334 00:31:05,000 --> 00:31:08,000 So that's without loss of generality. 335 00:31:08,000 --> 00:31:12,000 Position 0. Because all the positions are 336 00:31:12,000 --> 00:31:16,000 symmetric here. And so, now we need to ask the 337 00:31:16,000 --> 00:31:19,000 question for how many -- 338 00:31:24,000 --> 00:31:30,000 -- hash functions in our universal, purportedly universal 339 00:31:30,000 --> 00:31:34,000 set do x and y collide? 340 00:31:39,000 --> 00:31:42,000 OK, we've got to count them up. So how often do they collide? 341 00:31:42,000 --> 00:31:46,000 This is where we're going to pull out some heavy duty number 342 00:31:46,000 --> 00:31:48,000 theory. So we must have, 343 00:31:48,000 --> 00:31:50,000 if they collide -- 344 00:31:56,000 --> 00:32:03,000 -- that h sub a of x is equal to h sub a of y. 345 00:32:03,000 --> 00:32:09,000 That's what it means for them to collide. 346 00:32:09,000 --> 00:32:20,000 So that implies that the sum of i equals 0 to r of a sub i x sub 347 00:32:20,000 --> 00:32:30,000 i is equal to the sum of i equals 0 to r of a sub i y sub i 348 00:32:30,000 --> 00:32:35,000 mod m. Actually this is congruent mod 349 00:32:35,000 --> 00:32:38,000 m. So congruence for those people 350 00:32:38,000 --> 00:32:43,000 who haven't seen much number theory, is basically the way of 351 00:32:43,000 --> 00:32:48,000 essentially, rather than having to say mod everywhere in here 352 00:32:48,000 --> 00:32:52,000 and mod everywhere in here, we just at the end say OK, 353 00:32:52,000 --> 00:32:56,000 do a mod at the end. Everything is being done mod, 354 00:32:56,000 --> 00:32:59,000 modulo m. 
And then typically we use a 355 00:32:59,000 --> 00:33:06,000 congruence sign. OK, there's a more mathematical 356 00:33:06,000 --> 00:33:13,000 definition but this will work for us engineers. 357 00:33:13,000 --> 00:33:18,000 OK, so everybody with me so far? 358 00:33:18,000 --> 00:33:23,000 This is just applying the definition. 359 00:33:23,000 --> 00:33:32,000 So that implies that the sum of i equals 0 to r of a i x i minus 360 00:33:32,000 --> 00:33:41,000 y i is congruent to zeros mod m. OK, just threw it on the other 361 00:33:41,000 --> 00:33:45,000 side and applied the distributive law. 362 00:33:45,000 --> 00:33:49,000 Now what I'm going to do is pull out the 0-th position 363 00:33:49,000 --> 00:33:53,000 because that's the one that I care about. 364 00:33:53,000 --> 00:33:58,000 And this is where it saves me on the math, compared to if I 365 00:33:58,000 --> 00:34:03,000 didn't say that it was 0. I'd have to pull out x_i. 366 00:34:03,000 --> 00:34:05,000 It wouldn't matter, but it just would make the math 367 00:34:05,000 --> 00:34:06,000 a little bit cruftier 368 00:34:23,000 --> 00:34:30,000 OK, so now we've just pulled out one term. 369 00:34:30,000 --> 00:34:41,000 That implies that a_0 x_0 minus y_0 is congruent to minus -- 370 00:34:54,000 --> 00:34:58,000 -- mod m. Now remember that when I have a 371 00:34:58,000 --> 00:35:02,000 minus number mod m, I just map it into whatever, 372 00:35:02,000 --> 00:35:07,000 into that range from 0 to m minus one. 373 00:35:07,000 --> 00:35:12,000 So for example, minus five mod seven is two. 374 00:35:12,000 --> 00:35:19,000 So if any of these things are negative, we simply translate 375 00:35:19,000 --> 00:35:27,000 them into by adding multiples of mbecause adding multiples of m 376 00:35:27,000 --> 00:35:32,000 doesn't affect the congruence. 377 00:35:39,000 --> 00:35:41,000 OK. And now for the next step, 378 00:35:41,000 --> 00:35:44,000 we need to use a number theory fact. 
379 00:35:44,000 --> 00:35:48,000 So let's pull out our number theory -- 380 00:35:57,000 --> 00:36:05,000 -- textbook and take a little digression. 381 00:36:10,000 --> 00:36:14,000 So this comes from the theory of finite fields. 382 00:36:14,000 --> 00:36:17,000 So for people who are knowledgeable, 383 00:36:17,000 --> 00:36:21,000 that's where you're plugging your knowledge in. 384 00:36:21,000 --> 00:36:26,000 If you're not knowledgeable, this is a great area of math to 385 00:36:26,000 --> 00:36:30,000 learn about. So here's the fact. 386 00:36:30,000 --> 00:36:34,000 So let m be prime. Then for any z, 387 00:36:34,000 --> 00:36:41,000 little z element of Z sub m, and Z sub m is the integers mod 388 00:36:41,000 --> 00:36:46,000 m. So this is essentially numbers 389 00:36:46,000 --> 00:36:51,000 from 0 to m minus one with all the operations, 390 00:36:51,000 --> 00:36:57,000 times, minus, plus, etc., defined on that 391 00:36:57,000 --> 00:37:04,000 such that if you end up outside of the range of 0 to m minus 392 00:37:04,000 --> 00:37:11,000 one, you re-normalize by subtracting or adding multiples 393 00:37:11,000 --> 00:37:21,000 of m to get back within the range from 0 to m minus one. 394 00:37:21,000 --> 00:37:30,000 So it's the standard thing of just doing things modulo m. 395 00:37:30,000 --> 00:37:38,000 So for any z such that z is not congruent to 0, 396 00:37:38,000 --> 00:37:47,000 there exists a unique z inverse in Z sub m, such that if I 397 00:37:47,000 --> 00:37:57,000 multiply z times the inverse, it produces something congruent 398 00:37:57,000 --> 00:38:04,000 to one mod m. So for any number it says, 399 00:38:04,000 --> 00:38:11,000 I can find another number that when multiplied by it gives me 400 00:38:11,000 --> 00:38:15,000 one. So let's just do an example for 401 00:38:15,000 --> 00:38:18,000 m equals seven. So here we have, 402 00:38:18,000 --> 00:38:24,000 we'll make a little table. 
So z is not equal to 0, 403 00:38:24,000 --> 00:38:29,000 so I just write down the other numbers. 404 00:38:29,000 --> 00:38:35,000 And let's figure out what z inverse is. 405 00:38:35,000 --> 00:38:41,000 So what's the inverse of one? What number when multiplied by 406 00:38:41,000 --> 00:38:43,000 one gives me one? One. 407 00:38:43,000 --> 00:38:45,000 Good. How about two? 408 00:38:45,000 --> 00:38:51,000 What number when I multiply it by two gives me one? 409 00:38:51,000 --> 00:38:55,000 Four. Because two times four is eight 410 00:38:55,000 --> 00:39:01,000 and eight is congruent to one mod seven. 411 00:39:01,000 --> 00:39:04,000 So I've re-normalized it. What about three? 412 00:39:12,000 --> 00:39:13,000 Five. Good. 413 00:39:13,000 --> 00:39:16,000 Five. Three times five is 15. 414 00:39:16,000 --> 00:39:22,000 That's congruent to one mod seven because 15 divided by 415 00:39:22,000 --> 00:39:28,000 seven is two remainder of one. So that's the key thing. 416 00:39:28,000 --> 00:39:32,000 What about four? Two. 417 00:39:32,000 --> 00:39:36,000 Five? Three. And six. 418 00:39:43,000 --> 00:39:43,000 Yeah. Six. Yeah, six it turns out. OK, six times six is 36. 419 00:39:48,000 --> 00:39:52,000 OK, mod seven. Basically subtract off the 35, 420 00:39:52,000 --> 00:39:56,000 gives me one. So people have observed some 421 00:39:56,000 --> 00:40:02,000 interesting facts that if one number's an inverse of another, 422 00:40:02,000 --> 00:40:08,000 then that other is an inverse of the one. 423 00:40:08,000 --> 00:40:12,000 So that's actually one of these things that you prove when you 424 00:40:12,000 --> 00:40:16,000 do group theory and field theory and so forth. 425 00:40:16,000 --> 00:40:21,000 There are all sorts of other great properties of this kind of 426 00:40:21,000 --> 00:40:23,000 math. But the main thing is, 427 00:40:23,000 --> 00:40:27,000 and this turns out not to be true if m is not a prime.
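The inverse table for m equals seven can be reproduced mechanically. The following is a minimal sketch, not part of the lecture; the helper name `inverse_mod` is made up for illustration, and it simply tries every candidate rather than using a fast algorithm.

```python
# Multiplicative inverses modulo m, found by brute force.
# `inverse_mod` is a hypothetical helper name, not something from the lecture.

def inverse_mod(z, m):
    """Return the w in 1..m-1 with z * w congruent to 1 mod m, or None if none exists."""
    for w in range(1, m):
        if (z * w) % m == 1:
            return w
    return None

m = 7
inverse_table = {z: inverse_mod(z, m) for z in range(1, m)}
print(inverse_table)
```

The table pairs 2 with 4 and 3 with 5, while 1 and 6 are their own inverses, matching the board. The helper returns `None` when no inverse exists, which can only happen when the modulus is not prime.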
428 00:40:27,000 --> 00:40:31,000 So can somebody think of, imagine we're doing something 429 00:40:31,000 --> 00:40:36,000 mod 10. Can somebody think of a number 430 00:40:36,000 --> 00:40:39,000 that doesn't have an inverse mod 10? 431 00:40:39,000 --> 00:40:40,000 Yeah. Two. 432 00:40:40,000 --> 00:40:45,000 Another one is five. OK, it turns out the divisors 433 00:40:45,000 --> 00:40:49,000 in fact actually, more generally, 434 00:40:49,000 --> 00:40:53,000 something that is not relatively prime, 435 00:40:53,000 --> 00:40:58,000 meaning that it has no common factors, the GCD is not one 436 00:40:58,000 --> 00:41:04,000 between that number and the modulus. 437 00:41:04,000 --> 00:41:08,000 OK, those numbers do not have an inverse mod m. 438 00:41:08,000 --> 00:41:13,000 OK, but if it's prime, every number is relatively 439 00:41:13,000 --> 00:41:17,000 prime to the modulus. And that's the property that 440 00:41:17,000 --> 00:41:22,000 we're taking advantage of. So this is our fact and so, 441 00:41:22,000 --> 00:41:28,000 in this case what I'm after is I want to divide by x_0 minus 442 00:41:28,000 --> 00:41:31,000 y_0. That's what I want to do at 443 00:41:31,000 --> 00:41:34,000 this point. But I can't do that if x_0, 444 00:41:34,000 --> 00:41:36,000 first of all, if m isn't prime, 445 00:41:36,000 --> 00:41:40,000 I can't necessarily do that. I might be able to, 446 00:41:40,000 --> 00:41:43,000 but I can't necessarily. But if m is prime, 447 00:41:43,000 --> 00:41:46,000 I can definitely divide by x_0 minus y_0. 448 00:41:46,000 --> 00:41:49,000 I can find that inverse. And the other thing I have to 449 00:41:49,000 --> 00:41:52,000 do is make sure x_0 minus y_0 is not 0. 450 00:41:52,000 --> 00:41:57,000 OK, it would be 0 if these two were equal, but our supposition 451 00:41:57,000 --> 00:42:01,000 was they weren't equal. 
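The mod 10 question above can be checked with the GCD criterion just stated: a number has an inverse mod m exactly when its GCD with m is one. This is a small sketch under that criterion; the function name `has_inverse` is an illustrative assumption.

```python
from math import gcd

def has_inverse(z, m):
    """z is invertible mod m exactly when z and m are relatively prime."""
    return gcd(z, m) == 1

# Nonzero residues with no inverse, for a composite and a prime modulus.
no_inverse_mod_10 = [z for z in range(1, 10) if not has_inverse(z, 10)]
no_inverse_mod_7 = [z for z in range(1, 7) if not has_inverse(z, 7)]
print(no_inverse_mod_10)
print(no_inverse_mod_7)
```

For modulus 10 the list contains two and five, the answers given in class, along with the other even numbers. For the prime modulus 7 the list is empty, which is the property the proof relies on.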
And once again, 452 00:42:01,000 --> 00:42:05,000 just bringing it back to the without loss of generality, 453 00:42:05,000 --> 00:42:08,000 if it were some other position that we were off, 454 00:42:08,000 --> 00:42:13,000 I would be doing exactly the same thing with that position. 455 00:42:13,000 --> 00:42:16,000 So now we're going to be able to divide. 456 00:42:16,000 --> 00:42:19,000 So we continue with our -- 457 00:42:24,000 --> 00:42:33,000 -- continue with our proof. So since x_0 is not equal to 458 00:42:33,000 --> 00:42:42,000 y_0, there exists an inverse for x_0 minus y_0. 459 00:42:42,000 --> 00:42:48,000 And that implies, just continue on from over 460 00:42:48,000 --> 00:42:56,000 there, that a_0 is congruent therefore to minus the sum of i 461 00:42:56,000 --> 00:43:04,000 equal one to r of a_i, x_i minus y_i times x_0 minus 462 00:43:04,000 --> 00:43:10,000 y_0 inverse. So let's just go back to the 463 00:43:10,000 --> 00:43:15,000 beginning of our proof and see what we've derived. 464 00:43:15,000 --> 00:43:19,000 If we're saying we have two distinct keys, 465 00:43:19,000 --> 00:43:24,000 and we've picked all of these a_i randomly, 466 00:43:24,000 --> 00:43:30,000 and we're saying that these two distinct keys hash to the same 467 00:43:30,000 --> 00:43:34,000 place. If they hash to the same place, 468 00:43:34,000 --> 00:43:41,000 it says that a_0 essentially had to have a particular value 469 00:43:41,000 --> 00:43:47,000 as a function of the other a_i. Because in other words, 470 00:43:47,000 --> 00:43:51,000 once I've picked each of these a_i from one to r, 471 00:43:51,000 --> 00:43:54,000 if I did them in that order, for example, 472 00:43:54,000 --> 00:43:58,000 then I don't have a choice for how I pick a_0 to make it 473 00:43:58,000 --> 00:44:00,000 collide. Exactly one value allows it to 474 00:44:00,000 --> 00:44:05,000 collide, namely the value of a_0 given by this. 
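Written out, the chain of steps so far looks like this. It is a restatement of the spoken derivation in symbols, assuming as above that the keys x and y differ in position 0 and that m is prime.

```latex
\begin{align*}
\sum_{i=0}^{r} a_i x_i &\equiv \sum_{i=0}^{r} a_i y_i \pmod{m} \\
\sum_{i=0}^{r} a_i (x_i - y_i) &\equiv 0 \pmod{m} \\
a_0 (x_0 - y_0) &\equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m} \\
a_0 &\equiv \Bigl(-\sum_{i=1}^{r} a_i (x_i - y_i)\Bigr) \cdot (x_0 - y_0)^{-1} \pmod{m}
\end{align*}
```

The last step is the only one that needs the number theory fact: the inverse of x_0 minus y_0 exists because m is prime and x_0 is not equal to y_0.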
475 00:44:05,000 --> 00:44:10,000 If I picked a different value of a_0, they wouldn't collide. 476 00:44:10,000 --> 00:44:16,000 So let me write that down. Thus, while you think about it 477 00:45:12,000 --> 00:45:18,000 So for any choice of these a_i, there's exactly one of the 478 00:45:18,000 --> 00:45:24,000 m possible choices of a_0 that cause a collision. 479 00:45:24,000 --> 00:45:29,000 And for all the other choices I might make of a_0, 480 00:45:29,000 --> 00:45:36,000 there's no collision. So essentially I don't have, 481 00:45:36,000 --> 00:45:42,000 if they're going to collide, I've reduced essentially the 482 00:45:42,000 --> 00:45:49,000 number of degrees of freedom of my randomness by a factor of m. 483 00:45:49,000 --> 00:45:55,000 So if I count up the number of h_a's that cause x and y to 484 00:45:55,000 --> 00:46:01,000 collide, that's equal to, well, there's m choices, 485 00:46:01,000 --> 00:46:06,000 just using the product rule again. 486 00:46:06,000 --> 00:46:13,000 There's m choices for a_1 times m choices for a_2, 487 00:46:13,000 --> 00:46:21,000 up to m choices for a_r and then only one choice for a_0. 488 00:46:21,000 --> 00:46:28,000 So this is choices for a_1, a_2, a_r and only one choice 489 00:46:28,000 --> 00:46:35,000 for a_0 if they're going to collide. 490 00:46:35,000 --> 00:46:40,000 If they're not going to collide, I've got more choices 491 00:46:40,000 --> 00:46:43,000 for a_0. But if I want them to collide, 492 00:46:43,000 --> 00:46:48,000 there's only one value I can pick, namely this value. 493 00:46:48,000 --> 00:46:53,000 That's the only value which I can pick. 494 00:46:53,000 --> 00:46:58,000 And that's equal to m to the r, which is just the size of H 495 00:46:58,000 --> 00:47:03,000 divided by m. And that completes the proof. 496 00:47:11,000 --> 00:47:14,000 So there are other universal constructions, 497 00:47:14,000 --> 00:47:18,000 but this is a particularly elegant one.
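The counting argument can be checked by brute force for a small case. The sketch below assumes the dot-product family from the proof (h_a of x is the sum of a_i times x_i, mod m) and picks m equals 7 with three-digit keys, so r equals 2; the specific keys are arbitrary illustrative choices.

```python
from itertools import product

def h(a, x, m):
    """Dot-product hash: sum of a_i * x_i, reduced mod m."""
    return sum(ai * xi for ai, xi in zip(a, x)) % m

def count_colliding(x, y, m):
    """Count the vectors a in {0..m-1}^(r+1) for which h_a(x) == h_a(y)."""
    return sum(
        1
        for a in product(range(m), repeat=len(x))
        if h(a, x, m) == h(a, y, m)
    )

m = 7
x = (3, 1, 4)  # two distinct keys, written as base-m digits (arbitrary example)
y = (5, 1, 6)
colliding = count_colliding(x, y, m)
total = m ** len(x)
print(colliding, total)
```

Out of the 343 functions in this small family, 49 make the two keys collide. That is m to the r, or the size of the family divided by m, as the proof predicts.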
498 00:47:18,000 --> 00:47:22,000 So the point is that I have m plus one, sorry, 499 00:47:22,000 --> 00:47:27,000 r plus one degrees of freedom where each degree of freedom I 500 00:47:27,000 --> 00:47:33,000 have m choices. But if I want them to collide, 501 00:47:33,000 --> 00:47:40,000 once I've picked any of the, once I've picked r of those 502 00:47:40,000 --> 00:47:45,000 possible choices, the last one is forced if I 503 00:47:45,000 --> 00:47:48,000 want it to collide. So therefore, 504 00:47:48,000 --> 00:47:55,000 the set of functions for which it collides is only one in m. 505 00:47:55,000 --> 00:48:01,000 A very slick construction. Very slick. 506 00:48:01,000 --> 00:48:03,000 OK. Everybody with me here? 507 00:48:03,000 --> 00:48:07,000 Didn't lose too many people? Yeah, question. 508 00:48:07,000 --> 00:48:12,000 Well, part of it is, actually this is a quite common 509 00:48:12,000 --> 00:48:15,000 type of thing to be doing actually. 510 00:48:15,000 --> 00:48:19,000 If you take a class, so we have follow on classes in 511 00:48:19,000 --> 00:48:24,000 cryptography and so forth, and this kind of thing of 512 00:48:24,000 --> 00:48:29,000 taking dot products, modulo m and also Galois fields 513 00:48:29,000 --> 00:48:34,000 which are particularly simple finite fields and things like 514 00:48:34,000 --> 00:48:40,000 that, people play with these all the time. 515 00:48:40,000 --> 00:48:43,000 So Galois fields are like using exor's as your, 516 00:48:43,000 --> 00:48:46,000 same sort of thing as this except base two. 517 00:48:46,000 --> 00:48:49,000 And so there's a lot of study of this sort of thing. 518 00:48:49,000 --> 00:48:53,000 So people understand these kind of properties. 519 00:48:53,000 --> 00:48:57,000 But yeah, it's like what's the algorithm for having a brilliant 520 00:48:57,000 --> 00:49:01,000 insight into algorithms? It's like OK. 521 00:49:01,000 --> 00:49:05,000 Wish I knew. Then I'd just turn the crank. 
522 00:49:05,000 --> 00:49:11,000 [LAUGHTER] But if it were that easy, I wouldn't be standing up 523 00:49:11,000 --> 00:49:13,000 here today. [LAUGHTER] Good. 524 00:49:13,000 --> 00:49:19,000 OK, so now I want to take on another topic which is also I 525 00:49:19,000 --> 00:49:22,000 find, I think this is astounding. 526 00:49:22,000 --> 00:49:27,000 It's just beautiful, beautiful mathematics and a big 527 00:49:27,000 --> 00:49:34,000 impact on your ability to build good hash functions. 528 00:49:34,000 --> 00:49:37,000 Now I want to talk about another one topic, 529 00:49:37,000 --> 00:49:41,000 which is related, which is the topic of perfect 530 00:49:41,000 --> 00:49:42,000 hashing. 531 00:49:54,000 --> 00:49:59,000 So everything we've done so far does expected time performance. 532 00:49:59,000 --> 00:50:03,000 Hashing is good in the expected sense. 533 00:50:03,000 --> 00:50:08,000 A perfect hashing addresses the following questions. 534 00:50:08,000 --> 00:50:14,000 Suppose that I gave you a set of keys, and I said just build 535 00:50:14,000 --> 00:50:20,000 me a static table so I can look up whether the key is in the 536 00:50:20,000 --> 00:50:25,000 table with worst case time. Good worst case time. 537 00:50:25,000 --> 00:50:31,000 So I have a fixed set of keys. They might be something like 538 00:50:31,000 --> 00:50:37,000 for example, the hundred most common or thousand most common 539 00:50:37,000 --> 00:50:42,000 words in English. And when I get a word I want to 540 00:50:42,000 --> 00:50:47,000 check quickly in this table, is the word that I've got one 541 00:50:47,000 --> 00:50:49,000 of the most common words in English. 542 00:50:49,000 --> 00:50:54,000 I would like to do that not with expected performance, 543 00:50:54,000 --> 00:50:57,000 but guaranteed worst case performance. 544 00:50:57,000 --> 00:51:03,000 Is there a way of building it so that I can find this quickly? 
545 00:51:03,000 --> 00:51:06,000 So the problem is given n keys -- 546 00:51:12,000 --> 00:51:14,000 -- construct a static hash table. 547 00:51:14,000 --> 00:51:17,000 In other words, no insertion and deletion. 548 00:51:17,000 --> 00:51:20,000 We're just going to put the elements in there. 549 00:51:20,000 --> 00:51:22,000 A size -- 550 00:51:30,000 --> 00:51:37,000 -- m equal Order n. So I don't want it to be a huge 551 00:51:37,000 --> 00:51:42,000 table. I want it to be a table that is 552 00:51:42,000 --> 00:51:50,000 the size of my keys. Table of size m equals Order n, 553 00:51:50,000 --> 00:51:59,000 such that search takes O(1) time in the worst case. 554 00:52:06,000 --> 00:52:10,000 So there's no place in the table where I'm going to have, 555 00:52:10,000 --> 00:52:14,000 I know in the average case, that's not hard to do. 556 00:52:14,000 --> 00:52:18,000 But in the worst case, I want to make sure that 557 00:52:18,000 --> 00:52:22,000 there's no particular spot where the number of keys piles up to 558 00:52:22,000 --> 00:52:26,000 be a large number. OK, in no spot should that 559 00:52:26,000 --> 00:52:29,000 happen. Every single search I do should 560 00:52:29,000 --> 00:52:33,000 take Order one time. There shouldn't be any 561 00:52:33,000 --> 00:52:37,000 statistical variation in terms of how long it takes me to get 562 00:52:37,000 --> 00:52:39,000 something. Does everybody understand what 563 00:52:39,000 --> 00:52:42,000 the puzzle is? So this is a great, 564 00:52:42,000 --> 00:52:45,000 because this actually ends up having a lot of uses. 565 00:52:45,000 --> 00:52:49,000 You know, you want to build a table for something and you know 566 00:52:49,000 --> 00:52:52,000 what the values are that you're going look up in it. 567 00:52:52,000 --> 00:52:56,000 But you don't want to spend a lot of space on it and so forth. 
568 00:52:56,000 --> 00:53:00,000 So the idea here is actually going to be to use a two-level 569 00:53:00,000 --> 00:53:02,000 scheme. 570 00:53:09,000 --> 00:53:22,000 So the idea is we're going to use a two-level scheme with 571 00:53:22,000 --> 00:53:31,000 universal hashing at both levels. 572 00:53:31,000 --> 00:53:36,000 So the idea is we're going to hash, we're going to have a hash 573 00:53:36,000 --> 00:53:41,000 table, we're going to hash into slots, but rather than using 574 00:53:41,000 --> 00:53:46,000 chaining, we're going to have another hash table there. 575 00:53:46,000 --> 00:53:51,000 We're going to do a second hash into the second hash table. 576 00:53:51,000 --> 00:53:56,000 And the idea is that we're going to do it in such a way 577 00:53:56,000 --> 00:54:01,000 that we have no collisions at level two. 578 00:54:01,000 --> 00:54:03,000 So we may have collisions at level one. 579 00:54:03,000 --> 00:54:08,000 We'll take anything that collides at level one and put 580 00:54:08,000 --> 00:54:12,000 them into a hash table and then our second level hash table, 581 00:54:12,000 --> 00:54:15,000 but that hash table, no collisions. 582 00:54:15,000 --> 00:54:17,000 Boom. We're just going to hash right 583 00:54:17,000 --> 00:54:20,000 in there. And it'll just go boom to its 584 00:54:20,000 --> 00:54:23,000 thing. So let's draw a picture of this 585 00:54:23,000 --> 00:54:28,000 to illustrate the scheme. OK, so we have -- 586 00:54:34,000 --> 00:54:37,000 -- 0 one, let's say six, m minus one. 587 00:54:37,000 --> 00:54:42,000 So here's our hash table. And what we're going to do is 588 00:54:42,000 --> 00:54:47,000 we're going to use universal hashing at the first level, 589 00:54:47,000 --> 00:54:49,000 OK. So we find a universal hash 590 00:54:49,000 --> 00:54:52,000 function. We pick a hash function at 591 00:54:52,000 --> 00:54:56,000 random. And what we'll do is we'll hash 592 00:54:56,000 --> 00:55:00,000 into that level. 
And then what we'll do is we'll 593 00:55:00,000 --> 00:55:05,000 keep track of two things. One is what the size of the 594 00:55:05,000 --> 00:55:09,000 hash table is at the next level. So in this case, 595 00:55:09,000 --> 00:55:13,000 the size of the hash table, the number of slots it uses, 596 00:55:13,000 --> 00:55:17,000 is going to be four. And we're also going to keep a 597 00:55:17,000 --> 00:55:19,000 separate hash key for the second level. 598 00:55:19,000 --> 00:55:23,000 So each slot will have its own hash function for the second 599 00:55:23,000 --> 00:55:25,000 level. So for example, 600 00:55:25,000 --> 00:55:30,000 this one might have a key of 31 that is a random number. 601 00:55:30,000 --> 00:55:32,000 The a's here. a's up there. 602 00:55:32,000 --> 00:55:34,000 There we go, a's up there. 603 00:55:34,000 --> 00:55:39,000 So that's going to be the basis of my hash function, 604 00:55:39,000 --> 00:55:42,000 the key with which I'm going to hash. 605 00:55:42,000 --> 00:55:46,000 This one say has 86. And let's say that this, 606 00:55:46,000 --> 00:55:50,000 and then we have a pointer to the hash table. 607 00:55:50,000 --> 00:55:55,000 This is say S_1. And it's got four slots and we 608 00:55:55,000 --> 00:56:01,000 stored 14 and 27. And these two slots are empty. 609 00:56:01,000 --> 00:56:09,000 And this one for example, had what? 610 00:56:09,000 --> 00:56:12,000 Two nines. 611 00:56:28,000 --> 00:56:34,000 So the idea here is that in this case, if we look at 612 00:56:34,000 --> 00:56:40,000 our top-level hash function, which I'll just call H, 613 00:56:40,000 --> 00:56:47,000 it has that H of 14 is equal to H of 27 is equal to one. 614 00:56:47,000 --> 00:56:53,000 Because we're in slot one. OK, so these two both hash to 615 00:56:53,000 --> 00:56:57,000 the same slot in the level one hash table. 616 00:56:57,000 --> 00:57:02,000 This is level one. And this is level two over 617 00:57:02,000 --> 00:57:06,000 here.
So level one hashing, 618 00:57:06,000 --> 00:57:11,000 14 and 27 collided. They went into the same slot 619 00:57:11,000 --> 00:57:13,000 here. But at level two, 620 00:57:13,000 --> 00:57:20,000 they got hashed to different places and the hash function I 621 00:57:20,000 --> 00:57:26,000 use is going to be indexed by whatever the random numbers are 622 00:57:26,000 --> 00:57:33,000 that I chose and found for those and I'll show you how we find 623 00:57:33,000 --> 00:57:36,000 those. We have then h of 31 of 14 is 624 00:57:36,000 --> 00:57:43,000 equal to one h of 31 of 27 is equal to two. 625 00:57:43,000 --> 00:57:46,000 For level two. So I go, hash in here, 626 00:57:46,000 --> 00:57:51,000 find the, use this as the basis of my hash function to hash into 627 00:57:51,000 --> 00:57:55,000 whatever table I've got here. And so, if there are no, 628 00:57:55,000 --> 00:58:00,000 if I can guarantee that there are no collisions at level two, 629 00:58:00,000 --> 00:58:05,000 this is going to cost me Order one time in the worst case to 630 00:58:05,000 --> 00:58:09,000 look something up. How do I look it up? 631 00:58:09,000 --> 00:58:12,000 Take the value. I apply h to it. 632 00:58:12,000 --> 00:58:16,000 That takes me to some slot. Then I look to see what the key 633 00:58:16,000 --> 00:58:21,000 is for this hash function. I apply that hash function and 634 00:58:21,000 --> 00:58:24,000 that takes me to another slot. Then I go there. 635 00:58:24,000 --> 00:58:29,000 And that took me basically two applications of hash functions 636 00:58:29,000 --> 00:58:33,000 plus some look-up, plus who knows what minor 637 00:58:33,000 --> 00:58:41,000 amount of bookkeeping. So the reason we're going to 638 00:58:41,000 --> 00:58:50,000 have no collisions at this level is the following. 
639 00:58:50,000 --> 00:59:01,000 If there are n sub i items that hash to a level one slot i, 640 00:59:01,000 --> 00:59:11,000 then we're going to use m sub i, which is equal to n sub i 641 00:59:11,000 --> 00:59:21,000 squared slots in the level two hash table. 642 00:59:29,000 --> 00:59:33,000 OK, so I should have mentioned here this is going to be m sub 643 00:59:33,000 --> 00:59:37,000 i, the size of the hash table and this is going to be my a sub 644 00:59:37,000 --> 00:59:39,000 i essentially. 645 00:59:45,000 --> 00:59:50,000 So I'm going to use, so basically I'm going to hash 646 00:59:50,000 --> 00:59:55,000 n sub i things into n sub i squared locations here. 647 00:59:55,000 --> 01:00:00,000 So this is going to be incredibly sparse. 648 01:00:00,000 --> 01:00:02,480 OK, it's going to be quadratic in size. 649 01:00:02,480 --> 01:00:05,612 And so what I'm going to show is that under those 650 01:00:05,612 --> 01:00:08,418 circumstances, it's easy for me to find hash 651 01:00:08,418 --> 01:00:11,159 functions such that there are no collisions. 652 01:00:11,159 --> 01:00:15,010 That's the name of the game. Figure out how can I make these 653 01:00:15,010 --> 01:00:18,012 hash functions so that there are no collisions. 654 01:00:18,012 --> 01:00:21,341 So that's why I draw this with so few elements here. 655 01:00:21,341 --> 01:00:24,604 So here for example, I have two elements and I have 656 01:00:24,604 --> 01:00:27,867 a hash table size four here. I have three elements. 657 01:00:27,867 --> 01:00:32,520 I need a hash table size nine. OK, if there are a hundred 658 01:00:32,520 --> 01:00:34,918 elements, I need a hash table size 10,000. 659 01:00:34,918 --> 01:00:38,485 I'm going to pick things so that it's unlikely that there's 660 01:00:38,485 --> 01:00:41,350 anything of that size.
And then the fact that this 661 01:00:41,350 --> 01:00:44,801 actually works and gives us all the properties that we want, 662 01:00:44,801 --> 01:00:48,251 that's part of the analysis. So does everybody see that this 663 01:00:48,251 --> 01:00:51,877 takes Order one worst case time and what the basic structure of 664 01:00:51,877 --> 01:00:52,988 it is? These things, 665 01:00:52,988 --> 01:00:55,210 by the way, are not in this case prime. 666 01:00:55,210 --> 01:00:58,134 I could always pick primes that were close to this. 667 01:00:58,134 --> 01:01:03,730 I didn't do that in this case. Or I could use a universal hash 668 01:01:03,730 --> 01:01:09,103 function that in fact would work for things other than primes. 669 01:01:09,103 --> 01:01:12,362 But I didn't do that for this example. 670 01:01:12,362 --> 01:01:16,943 We all ready for analysis? OK, let's do some analysis 671 01:01:16,943 --> 01:01:18,000 then. 672 01:01:29,000 --> 01:01:31,000 And this is really pretty analysis. 673 01:01:31,000 --> 01:01:33,528 Partly as you'll see because we've already done some of this 674 01:01:33,528 --> 01:01:34,000 analysis. 675 01:01:50,000 --> 01:01:53,238 So the trick is analyzing level two. 676 01:01:53,238 --> 01:01:57,309 That's the main thing that I want to analyze, 677 01:01:57,309 --> 01:02:02,583 to show that I can find hash functions here that are going 678 01:02:02,583 --> 01:02:06,192 to, when I map them into, very sparsely, 679 01:02:06,192 --> 01:02:09,523 into these arrays here, that in fact, 680 01:02:09,523 --> 01:02:16,000 such hash functions exist and I can compute them in advance. 681 01:02:16,000 --> 01:02:23,344 So that I have a good way of storing those. 682 01:02:23,344 --> 01:02:30,338 So here's the theorem we're going to use. 683 01:02:30,338 --> 01:02:40,830 Suppose I hash n keys into m equals n squared slots using a random 684 01:02:40,830 --> 01:02:48,000 hash function in a universal set H.
685 01:02:48,000 --> 01:03:00,393 Then the expected number of collisions is less than one 686 01:03:00,393 --> 01:03:02,502 half. OK. 687 01:03:02,502 --> 01:03:11,372 The expected number of collisions: I don't expect there 688 01:03:11,372 --> 01:03:20,577 to be even one collision. I expect there to be less than 689 01:03:20,577 --> 01:03:29,447 half a collision on average. And so, let's prove this, 690 01:03:29,447 --> 01:03:39,154 so that the probability that two given keys collide under h 691 01:03:39,154 --> 01:03:45,216 is what? What's the probability that two 692 01:03:45,216 --> 01:03:51,443 given keys collide under h when h is chosen randomly from the 693 01:03:51,443 --> 01:03:54,037 universal set? One over m. 694 01:03:54,037 --> 01:03:56,943 Right? That's the definition, 695 01:03:56,943 --> 01:04:02,235 right, of, which is in this case equal to one over n 696 01:04:02,235 --> 01:04:06,210 squared. So now how many keys, 697 01:04:06,210 --> 01:04:11,052 how many pairs of keys do I have in this table? 698 01:04:11,052 --> 01:04:16,526 How many keys could possibly collide with each other? 699 01:04:16,526 --> 01:04:19,368 OK. So that's basically just 700 01:04:19,368 --> 01:04:25,157 looking at how many different pairs of keys do I have to 701 01:04:25,157 --> 01:04:30,315 evaluate this for. So that's n choose two pairs of 702 01:04:30,315 --> 01:04:36,654 keys. n choose two pairs of keys. 703 01:04:36,654 --> 01:04:42,689 So therefore, the expected number of 704 01:04:42,689 --> 01:04:52,172 collisions is, well, for each of these n, not n over two, 705 01:04:52,172 --> 01:05:00,793 n choose two pairs of keys, the probability that it 706 01:05:00,793 --> 01:05:08,923 collides is one in n squared. So that's equal to n times n 707 01:05:08,923 --> 01:05:12,221 minus one over two, if you remember your formula, 708 01:05:12,221 --> 01:05:16,000 times one in n squared. And that's less than a half.
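In symbols, the calculation just described is the following. It uses the collision probability of one over n squared stated above for a random function from the universal set, together with linearity of expectation over the pairs.

```latex
\begin{align*}
E[\text{number of collisions}]
  &= \binom{n}{2} \cdot \frac{1}{n^2} \\
  &= \frac{n(n-1)}{2} \cdot \frac{1}{n^2} \\
  &= \frac{n-1}{2n} \;<\; \frac{1}{2}
\end{align*}
```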
709 01:05:24,000 --> 01:05:28,183 So for every pair of keys, so those of you who remember 710 01:05:28,183 --> 01:05:33,063 from 6.042 the birthday paradox, this is related to the birthday 711 01:05:33,063 --> 01:05:36,800 paradox a little bit. But here I basically have a 712 01:05:36,800 --> 01:05:40,333 large set, and I'm looking at all pairs, but my set is 713 01:05:40,333 --> 01:05:44,000 sufficiently big that the odds that I get a collision are 714 01:05:44,000 --> 01:05:47,199 relatively small. If I start increasing it beyond 715 01:05:47,199 --> 01:05:50,400 the square root of m, OK, the number of elements, 716 01:05:50,400 --> 01:05:54,466 if it starts getting bigger than the square root of m, then the odds 717 01:05:54,466 --> 01:05:57,733 of a collision go up dramatically as you know from 718 01:05:57,733 --> 01:06:01,532 the birthday paradox. But if I'm less than, 719 01:06:01,532 --> 01:06:05,401 if I'm really sparse in there, I don't get collisions. 720 01:06:05,401 --> 01:06:09,197 Or at least I get a relatively small number expected. 721 01:06:09,197 --> 01:06:13,430 Now I want to remind you of something which actually in the 722 01:06:13,430 --> 01:06:17,080 past I have just assumed, but I want to actually go 723 01:06:17,080 --> 01:06:20,291 through it briefly. It's Markov's inequality. 724 01:06:20,291 --> 01:06:22,919 So who remembers Markov's inequality? 725 01:06:22,919 --> 01:06:25,839 Don't everybody raise their hand at once. 726 01:06:25,839 --> 01:06:30,000 So Markov's inequality says the following. 727 01:06:30,000 --> 01:06:34,145 This is one of these great probability facts. 728 01:06:34,145 --> 01:06:38,762 For random variable x which is bounded below by 0, 729 01:06:38,762 --> 01:06:44,227 says the probability that x is bigger than, greater than or 730 01:06:44,227 --> 01:06:49,316 equal to any given value T is less than or equal to the 731 01:06:49,316 --> 01:06:53,838 expectation of x divided by T. It's a great fact.
732 01:06:53,838 --> 01:06:57,796 Doesn't hold if x isn't bounded below by 0. 733 01:06:57,796 --> 01:07:03,230 But it's a great fact. It allows me to relate the 734 01:07:03,230 --> 01:07:06,833 probability of an event to its expectation. 735 01:07:06,833 --> 01:07:12,066 And the idea is in general that if the expectation is going to 736 01:07:12,066 --> 01:07:17,213 be small, then I can't have a high probability that the value 737 01:07:17,213 --> 01:07:21,845 of the random variable is large. It doesn't make sense. 738 01:07:21,845 --> 01:07:26,649 How could you have a high probability that it's a million 739 01:07:26,649 --> 01:07:31,968 when my expectation is one or in this case we're going to apply 740 01:07:31,968 --> 01:07:36,000 it when the expectation is a half? 741 01:07:36,000 --> 01:07:39,676 Couldn't happen. And the proof follows just 742 01:07:39,676 --> 01:07:44,666 directly from the definition of expectation, and so I'm doing 743 01:07:44,666 --> 01:07:47,730 this for a discrete random variable. 744 01:07:47,730 --> 01:07:52,282 So the expectation by definition is just the sum for 745 01:07:52,282 --> 01:07:57,622 little x going from 0 to infinity of x times the probability that 746 01:07:57,622 --> 01:08:02,000 my random variable takes on the value x. 747 01:08:02,000 --> 01:08:06,560 That's the definition. And now it's just a question of 748 01:08:06,560 --> 01:08:11,120 doing like the coarsest approximation you can imagine. 749 01:08:11,120 --> 01:08:14,734 First of all, let me just simply throw away 750 01:08:14,734 --> 01:08:19,725 all the small terms, so that's greater than or equal to the sum from x equals 751 01:08:19,725 --> 01:08:24,716 T to infinity of x times the probability that x is equal to 752 01:08:24,716 --> 01:08:28,072 little x. So just throw away all the low 753 01:08:28,072 --> 01:08:31,426 order terms.
Now what I'm going to do is 754 01:08:31,426 --> 01:08:36,848 use the fact that every one of these terms is lower bounded by its value at x 755 01:08:36,848 --> 01:08:42,875 equals T. So that's at least the summation from 756 01:08:42,875 --> 01:08:49,750 x equals T to infinity of T times the probability that x 757 01:08:49,750 --> 01:08:51,250 equals x. OK. 758 01:08:51,250 --> 01:08:58,250 Over x going from T and larger. Because these are only bigger 759 01:08:58,250 --> 01:09:02,009 values. And that's just equal then to 760 01:09:02,009 --> 01:09:06,306 T, because I can pull that out, times the summation from x equals T 761 01:09:06,306 --> 01:09:10,256 to infinity of the probability that x equals x, which is just the 762 01:09:10,256 --> 01:09:14,000 probability that x is greater than or equal to T. 763 01:09:20,000 --> 01:09:26,000 And that's done because I just divide by T. 764 01:09:31,000 --> 01:09:34,379 So that's Markov's inequality. Really dumb. 765 01:09:34,379 --> 01:09:37,919 Really simple. There are much stronger things 766 01:09:37,919 --> 01:09:42,264 like Chebyshev bounds and Chernoff bounds and things of 767 01:09:42,264 --> 01:09:44,839 that nature. But Markov's is like 768 01:09:44,839 --> 01:09:49,586 unbelievably simple and useful. So we're going to just apply 769 01:09:49,586 --> 01:09:52,000 that as a corollary. 770 01:10:06,000 --> 01:10:13,059 So the probability now of no collisions, when I hash n keys 771 01:10:13,059 --> 01:10:19,391 into n squared slots using a universal hash function, 772 01:10:19,391 --> 01:10:26,817 I claim is the probability of no collisions is greater than or 773 01:10:26,817 --> 01:10:32,173 equal to a half. So I pick a hash function at 774 01:10:32,173 --> 01:10:36,409 random. What are the odds that I got no 775 01:10:36,409 --> 01:10:40,917 collisions when I hashed those n keys into n squared slots? 776 01:10:40,917 --> 01:10:43,326 Answer. Probability that I have no 777 01:10:43,326 --> 01:10:47,834 collisions is at least a half.
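The proof of Markov's inequality, as spoken, is the following chain for a discrete random variable X that takes nonnegative integer values and a threshold t greater than 0. The notation is a transcription of the board work, not an addition to it.

```latex
\begin{align*}
E[X] &= \sum_{x=0}^{\infty} x \cdot \Pr\{X = x\} \\
     &\ge \sum_{x=t}^{\infty} x \cdot \Pr\{X = x\} \\
     &\ge \sum_{x=t}^{\infty} t \cdot \Pr\{X = x\} \\
     &= t \cdot \Pr\{X \ge t\}
\end{align*}
```

Dividing by t gives the probability that X is at least t as at most E[X] over t. With X the number of collisions and t equal to one, the probability of at least one collision is at most the expected number of collisions.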
Half the time I'm guaranteed 778 01:10:47,834 --> 01:10:51,409 that there won't be a collision. And the proof, 779 01:10:51,409 --> 01:10:54,129 pretty simple. The probability of no 780 01:10:54,129 --> 01:10:57,549 collisions is the same as the probability as, 781 01:10:57,549 --> 01:11:01,746 sorry, is one minus the probability that I have at least 782 01:11:01,746 --> 01:11:05,850 one collision. So the odds that I have at 783 01:11:05,850 --> 01:11:09,337 least one collision, the odds that I have at least 784 01:11:09,337 --> 01:11:12,254 one collision, probability greater than or 785 01:11:12,254 --> 01:11:15,599 equal to one collision is less than or equal to, 786 01:11:15,599 --> 01:11:18,872 now I just apply Markov's inequality with this. 787 01:11:18,872 --> 01:11:23,000 So it's just the expected number of collisions -- 788 01:11:29,000 --> 01:11:33,090 -- divided by one. And that is by Markov's 789 01:11:33,090 --> 01:11:36,272 inequality less than, by definition, 790 01:11:36,272 --> 01:11:40,181 excuse me, of expected number of collisions, 791 01:11:40,181 --> 01:11:44,363 which we've already shown, is less than a half. 792 01:11:44,363 --> 01:11:49,636 So the probability of at least one collision is less than a 793 01:11:49,636 --> 01:11:52,909 half. The probability of 0 collisions 794 01:11:52,909 --> 01:11:56,363 is at least a half. So we're done here. 795 01:11:56,363 --> 01:12:02,000 So to find a good level-two hash function is easy. 796 01:12:02,000 --> 01:12:06,562 I just test a few at random. Most of them out there, 797 01:12:06,562 --> 01:12:10,856 OK, half of them, at least half of them are going 798 01:12:10,856 --> 01:12:13,808 to work. So this is in some sense, 799 01:12:13,808 --> 01:12:18,102 if you think about it, a randomized construction, 800 01:12:18,102 --> 01:12:22,664 because I can't tell you which one it's going to be.
801 01:12:22,664 --> 01:12:27,763 It's non-constructive in that sense, but it's a randomized 802 01:12:27,763 --> 01:12:32,485 construction. But they have to exist because 803 01:12:32,485 --> 01:12:36,297 most of them out there have this good property. 804 01:12:36,297 --> 01:12:40,605 So I'm going to be able to find for each one of these, 805 01:12:40,605 --> 01:12:44,168 I just test a few at random, and I find one. 806 01:12:44,168 --> 01:12:47,068 Test a few at random, find one, etc. 807 01:12:47,068 --> 01:12:50,548 Fill in my table there. Because all that is 808 01:12:50,548 --> 01:12:53,945 pre-computation. And I'm going to find them 809 01:12:53,945 --> 01:12:57,342 because the odds are good that one exists. 810 01:12:57,342 --> 01:12:59,000 So -- 811 01:13:13,000 --> 01:13:14,000 -- we just test a few at random. 812 01:13:24,000 --> 01:13:25,000 And we'll find one quickly -- 813 01:13:32,000 --> 01:13:34,300 -- since at least half will work. 814 01:13:34,300 --> 01:13:37,679 I just want to show that there exist good ones. 815 01:13:37,679 --> 01:13:41,777 All I have to prove is that at least one works for each of 816 01:13:41,777 --> 01:13:44,366 these cases. In fact, I've shown that 817 01:13:44,366 --> 01:13:46,954 there's a huge number that will work. 818 01:13:46,954 --> 01:13:50,189 Half of them will work. But to show it exists, 819 01:13:50,189 --> 01:13:54,647 I would just have to show that the probability was greater than 820 00:00:00,000 --> 01:13:55,941 zero. So to finish up, 821 01:13:55,941 --> 01:14:00,254 we need to still analyze the storage because I promised in my 822 01:14:00,254 --> 01:14:05,000 theorem that the table would be of size order n. 823 01:14:05,000 --> 01:14:12,702 And yet now I've said there's all of these quadratic-sized 824 01:14:12,702 --> 01:14:18,378 slots here. So I'm going to show that that's 825 01:14:18,378 --> 01:14:20,000 order n. 826 01:14:31,000 --> 01:14:35,605 So for level one, that's easy.
827 01:14:35,605 --> 01:14:45,450 We'll just choose the number of slots to be equal to the number 828 01:14:45,450 --> 01:14:51,008 of keys. And that way the storage at 829 01:14:51,008 --> 01:14:59,583 level one is just order n. And now let's let n sub i be 830 01:14:59,583 --> 01:15:08,000 the random variable for the number of keys -- 831 01:15:13,000 --> 01:15:21,712 -- that hash to slot i in T. OK, so n sub i is just what 832 01:15:21,712 --> 01:15:28,683 we've called it. Number of elements in that slot 833 01:15:28,683 --> 01:15:34,386 there. And we're going to use m sub i 834 01:15:34,386 --> 01:15:45,000 equals n sub i squared slots in each level two table S sub i. 835 01:15:45,000 --> 01:15:47,000 So the expected total storage -- 836 01:15:54,000 --> 01:16:01,085 -- is just n for level one, order n if you want, 837 01:16:01,085 --> 01:16:09,979 but basically n slots for level one plus the expected value, 838 01:16:09,979 --> 01:16:19,326 whatever I expect the sum of i equals 0 to m minus one of theta 839 01:16:19,326 --> 01:16:24,000 of n sub i squared to be. 840 01:16:30,000 --> 01:16:36,048 Because I basically have to add up the square for every element 841 01:16:36,048 --> 01:16:40,731 that applies here, the square of what's in there. 842 01:16:40,731 --> 01:16:46,682 Who recognizes this summation? Where have we seen that before? 843 01:16:46,682 --> 01:16:51,951 Who attends recitation? Where have we seen this before? 844 01:16:51,951 --> 01:16:54,000 What's the -- 845 01:17:03,000 --> 01:17:06,000 We're summing the expected value of a bunch of -- 846 01:17:11,000 --> 01:17:14,959 Yeah, what was that algorithm? We did the sorting algorithm, 847 01:17:14,959 --> 01:17:17,375 right? What was the sorting algorithm 848 01:17:17,375 --> 01:17:21,000 for which this was an important thing to evaluate? 849 01:17:26,000 --> 01:17:29,272 Don't everybody shout it out at once. 850 01:17:29,272 --> 01:17:33,000 What was that sorting algorithm called?
851 01:17:33,000 --> 01:17:35,397 Bucket sort. Good. 852 01:17:35,397 --> 01:17:37,794 Bucket sort. Yeah. 853 01:17:37,794 --> 01:17:46,397 We showed that the sum of the squares of random variables when 854 01:17:46,397 --> 01:17:53,025 they're falling randomly into n bins is order n. 855 01:17:53,025 --> 01:17:55,000 Right? 856 01:18:16,000 --> 01:18:20,105 And you can also out of this get a, as we did before, 857 01:18:20,105 --> 01:18:24,131 get a probability bound. What's the probability that 858 01:18:24,131 --> 01:18:28,315 it's more than a certain amount times n using Markov's 859 01:18:28,315 --> 01:18:31,394 inequality. But the key thing is 860 01:18:31,394 --> 01:18:36,109 we've seen this analysis. OK, we used it there in time, 861 01:18:36,109 --> 01:18:39,963 so there's a little bit, but that's one of the reasons 862 01:18:39,963 --> 01:18:43,963 we study sorting at the beginning of the term is because 863 01:18:43,963 --> 01:18:47,890 the techniques of sorting, they just propagate into all 864 01:18:47,890 --> 01:18:52,327 these other areas of analysis. You see a lot of the same kinds 865 01:18:52,327 --> 01:18:55,309 of things. And so now that you know bucket 866 01:18:55,309 --> 01:18:59,018 sort clearly so well, now you know this without 867 01:18:59,018 --> 01:19:04,610 having to do any extra work. So you might want to go back 868 01:19:04,610 --> 01:19:09,925 and review your bucket sort analysis, because it's applied 869 01:19:09,925 --> 01:19:11,604 now. Same analysis. 870 01:19:11,604 --> 01:19:12,909 Two places. OK. 871 01:19:12,909 --> 01:19:18,411 Good. Recitation this Friday, which will be a quiz review and 872 01:19:18,411 --> 01:19:22,794 we have a quiz next, there's no class on Monday, 873 01:19:22,794 --> 01:19:26,151 but we have a quiz next Wednesday. 874 01:19:26,151 --> 01:19:31,000 OK, so good luck everybody on the quiz. 875 01:19:31,000 --> 01:19:34,000 Make sure you get plenty of sleep.
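For review, the construction described in this lecture segment (level one with n slots, then for each slot a level-two table with n sub i squared slots whose hash function is found by testing a few at random) can be sketched in Python as below. This is only an illustrative sketch under stated assumptions: keys are distinct integers smaller than the prime P, and the family h(k) = ((a*k + b) mod P) mod m is a standard universal family swapped in for illustration, not necessarily the one used in the lecture. All function names here are hypothetical.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; assumption: keys are distinct ints in [0, P)

def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m at random (assumed universal family)."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m

def find_collision_free(keys, rng):
    """Test hash functions at random until one maps keys into len(keys)**2
    slots with no collisions. Each trial succeeds with probability >= 1/2."""
    m = len(keys) ** 2
    trials = 0
    while True:
        trials += 1
        h = random_hash(m, rng)  # never called when keys is empty
        if len({h(k) for k in keys}) == len(keys):
            return h, m, trials

def build_two_level(keys, rng):
    """Level one: n slots. Level two: slot i gets a table of n_i**2 slots."""
    n = len(keys)
    h1 = random_hash(n, rng)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h1(k)].append(k)
    level2 = []
    for bucket in buckets:
        h2, m_i, _ = find_collision_free(bucket, rng)
        table = [None] * m_i
        for k in bucket:
            table[h2(k)] = k
        level2.append((h2, table))
    # Total storage: n level-one slots plus the sum of n_i squared.
    storage = n + sum(len(table) for _, table in level2)
    return h1, level2, storage

def lookup(structure, key):
    """Two hash evaluations and one comparison, with no chains to walk."""
    h1, level2, _ = structure
    h2, table = level2[h1(key)]
    return bool(table) and table[h2(key)] == key

if __name__ == "__main__":
    rng = random.Random(0)
    keys = rng.sample(range(10**6), 200)
    structure = build_two_level(keys, rng)
    print("all keys found:", all(lookup(structure, k) for k in keys))
    print("storage / n:", structure[2] / len(keys))
```

Running the script prints the ratio of total slots to n, which is a quick way to see the order-n storage claim on one random draw; it is an empirical check, not a proof.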