The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIK DEMAINE: All right. Today is all about the predecessor problem, which is a problem we've certainly talked about implicitly with, say, binary search trees. You want to be able to insert and delete into a set, and compute the predecessor and successor of any given key. So maybe define that formally.

And this is not really our first, but it is an example of an integer data structure. And for whatever reason, I don't brand hashing as an integer data structure, just because it's its own beast. But in particular, today, I need to be a little more formal about the models of computation we're allowing-- or I want to be. In particular, because, in the predecessor problem-- which is insert, delete, predecessor, successor-- there are actually lower bounds that say you cannot do better than such and such.
With hashing, there aren't really any lower bounds, because you can do everything in constant time with high probability. I mean, there are maybe some lower bounds on deterministic hashing. That's harder. But if you allow randomization, there are no real lower bounds, whereas for predecessor, there are.

And in general, for the predecessor problem, the key thing I want to highlight is that we're maintaining here a set-- the set is called S-- of n elements, which live in some universe, U-- just like last time. When you insert, you can insert an arbitrary element of the universe. It probably shouldn't already be in S, or it will get thrown away. But the key thing is that predecessor and successor operate not just on the keys in S-- you can give it any key. It doesn't have to be in there. And it will find the previous key that is in S, or the next key that is in S. So the predecessor of x is the largest key in your set that is less than or equal to x. And the successor is the smallest that is larger-- of course, if there is one.
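As a point of reference, those two query operations can be sketched on a plain sorted list (a minimal sketch, not the fast structures this lecture builds; the function names are mine):

```python
import bisect

def predecessor(S, x):
    """Largest key in sorted list S that is <= x, or None if none exists."""
    i = bisect.bisect_right(S, x)
    return S[i - 1] if i > 0 else None

def successor(S, x):
    """Smallest key in sorted list S strictly larger than x, or None."""
    i = bisect.bisect_right(S, x)
    return S[i] if i < len(S) else None

S = [3, 7, 10, 42]
predecessor(S, 9)   # 7 -- note 9 itself need not be in S
successor(S, 9)     # 10
```

With a balanced binary search tree in place of the list, each of these is the familiar O(log n) in the comparison model, which is exactly the bound the integer models below will beat.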
So those are the kinds of operations we want to do. Now, we know how to do all of this in log n time, no problem, with binary search trees, in the comparison model. But I want to introduce two more, say, realistic models of computers, that ignore the memory hierarchy, but think about regular RAM machines-- random access machines-- and what they can really do. And it's a model we're going to be working with for the next, I think, five lectures. So it's important to set the stage right.

So these are models for integer data structures. In general, we have a unifying concept, which is a word of information, a word of data, a word of memory. It's used all over the place-- a word of input. A word here is in the machine-theoretic sense, not the linguistic sense. It's going to be a w-bit integer. And so this defines the universe, which is-- I'm going to assume they're all unsigned integers-- 0 up to 2 to the w minus 1. Those are all the unsigned integers you can represent with w bits. We'll also call this number, 2 to the w, little u. That is the size of the universe, which is capital U. So this matches notation from last time.
But I'm really highlighting how many bits we have, which is w. Now, here's where things get interesting. I'm going to get to a model called a word RAM, which is what you might expect, more or less. But before I get there, I want to define something called the transdichotomous RAM-- tough word to spell. It just means bridging a dichotomy-- bridging two worlds, if you will. RAM is a random access machine. I've certainly mentioned the word RAM before. But now we're going to get a little more precise about it.

So in general, in the RAM, memory is an array, and you can do random access into the array. But now, we're going to say the cells of the memory-- each slot in that array-- are words. Everything is going to be a word. Every input-- all these x's-- is going to be a word. Everything will be a word. And in particular, the things in your memory are words. Let's say you have s of them. That's your space bound. In general, in the transdichotomous RAM, you can do any operation that reads and writes a constant number of words in memory. And in particular, you can do random access to that memory.
But in particular, we use words to serve as pointers. Here's my memory of words. Each of them is w bits-- s of them, from, I guess, 0 to s minus 1. And if you have, like, the number 3 here, that can be used as a pointer to the third slot of memory. One, two, three. You can use numbers as indexes into memory. So that's what I mean by "words serve as pointers." So in particular, you can implement a pointer machine-- no surprise. But for this to work, we need a lower bound on w. This implies w has to be at least log of the space bound. Otherwise, you just can't index your whole memory. If you've got slots 0 through s minus 1, then 2 to the w minus 1 had better be at least s minus 1. So we get this lower bound. And in particular, presumably, s is at least your problem size, n. If you're trying to maintain n items, you've got to store them. So w is at least log n.

Now, this relation is essentially a statement bridging two worlds. Namely, you have, on the one hand, your model of computation, which has a particular word size.
And in reality, we think of that as being 32 or 64, or maybe 128-- some fancy operations on Intel machines let you do 128 bits or so. And then there's your problem size, which we think of as an input. Now, this is relating the two. It's a little weird. I guess you could say it's just a limitation that, for a given CPU, there are only certain problems you can solve. But theoretically, it makes a lot of sense to relate these two. Because if you're in a RAM, and you've got to be able to index your data, you need at least that many bits just to be able to talk about all those things. And so the claim is, basically, machines will grow to accommodate memory size. As memory size grows, you'll need more bits.

Now, in reality, there are only about 2 to the 256-- what do you call them-- particles in the known universe. So word size probably won't get that much bigger. Beyond 256 should be OK. But theoretically, this is a nice way to formalize this claim that word sizes don't need to get too big unless memories get gigantic. So it may seem weird at first, but it's very natural.
And all real-world machines have big enough words to accommodate that. Word size could be bigger, and that will give you, essentially, more parallelism. But it should be at least that big. All right. Enough proselytizing. That's the transdichotomous RAM. The end.

And the word RAM is a specific version of the transdichotomous RAM, where you restrict the operations to C-like operations. These are sort of the standard instructions on basically all computers, except a few RISC architectures don't have multiplication and division. But everything else is on everything. So these are the operators, unless I missed one, in C. They're all in Python, and-- pick your language-- most languages. You've got integer arithmetic, including mod. You've got bitwise AND, bitwise OR, bitwise XOR, bitwise negation, and shift left and shift right. These we all view as taking constant time. They take one or two words as inputs, they compute an answer, and they write out another word.
Of course, there's also random access-- array dereference, I guess. So that's the word RAM. You restrict to these operations. Whereas in the transdichotomous RAM, you can do weird things, as long as they only involve a constant number of words, the word RAM is the regular thing. So this is basically the standard model that all integer data structures use, pretty much. If they don't use this model, they have to say so. Otherwise, this model has become accepted as the normal one. It took several years before people realized that's a good model-- good enough to capture pretty much everything we want.

The cool thing about the word RAM is, it lets you do things on w bits in parallel. You can take the AND of w bits, pairwise, all at once. So you get some speedup. But it's a natural generalization of something like the comparison model. The comparison model-- I guess I didn't write those-- has more operations: less than, greater than, and so on. You can compare two numbers in constant time and get a Boolean output via, say, subtraction, and computing the sign.
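A toy sketch of that word RAM instruction set (my own illustration; the word size w = 8 and the explicit masking are assumptions, needed to emulate fixed-width unsigned words, since Python integers are unbounded):

```python
w = 8                  # toy word size; real machines use 32 or 64
MASK = (1 << w) - 1    # keep every result to w bits (unsigned wraparound)

a, b = 0b11001010, 0b10100110

add = (a + b) & MASK   # integer arithmetic, truncated to a word
band = a & b           # bitwise AND -- w bit-positions "in parallel"
bxor = a ^ b           # bitwise XOR
bnot = ~a & MASK       # bitwise negation
shl = (a << 3) & MASK  # shift left
shr = a >> 3           # shift right

def less_than(a, b):
    # unsigned compare via subtraction plus sign, as described above:
    # do the subtraction in a (w+1)-bit register and read the top (borrow) bit
    return ((a - b) & ((1 << (w + 1)) - 1)) >> w == 1

less_than(0b10100110, 0b11001010)   # True  (166 < 202)
```

Each line touches a constant number of words, which is why the model charges constant time for it.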
And you think of comparisons as taking constant time-- so why not all of these things? Cool.

One more model-- this is kind of a weird one. It's called the cell-probe model, which is: we just count the number of memory reads and writes that we need to do to perform a data structure operation or query. Like, you're looking at predecessor, and you just want to know, how much of the data structure do I have to read in order to be able to answer the predecessor problem? How much do I have to write out to do an insertion, or whatever? And so in this model, computation is free.

And this is kind of like the external memory model and the cache-oblivious models. There, we were measuring how many block reads and writes there are. Here, our blocks are actually our words. So there is a bit of a relation, except there's no real-- you can either think of there being no cache here, because you're just reading in a constant number of words, doing something, spitting stuff out.
Or, in the cell-probe model, you could imagine there being an infinite cache for this operation, but no cache from operation to operation. It's just, how much do I have to read, information-theoretically, to solve a particular predecessor problem? We'll deal with this a lot in a couple of lectures-- not quite yet. This model is just used for lower bounds. It's not a realistic model, because you have to pay for computation in the real world. But if you can prove that you need to read at least a certain number of words, then, of course, you have to do at least that many operations. So it's nice for lower bounds.

In general, we have this sort of hierarchy of models, where cell probe is the most powerful, strongest. Below cell probe, we have transdichotomous RAM, then word RAM, then-- just to fit it in context with what we've been doing-- below that is pointer machine, and below that would be binary search tree. I've mentioned before, pointer machines are more powerful than binary search trees. And of course, we can implement a pointer machine on a word RAM. So we have these relations.
There are, of course, other models. But this is a quick picture of the models we've seen so far.

So now, we have this notion of a word. In the predecessor problem, these elements are words. They're w-bit integers, from the universe we defined. And we want to be able to insert, delete, predecessor, and successor over words. So that's our challenge.

In the binary search tree model, we know the answer to this problem is Theta(log n). In general, with any comparison-based data structure, you need Theta(log n) in the worst case. It's an easy lower bound. But we're going to do better in these other models-- in the word RAM. So here are some results.

The first data structure is called Van Emde Boas. You might guess it is by van Emde Boas-- Peter. It actually has a couple of other authors in some versions of the papers, which makes it a little bit confusing. But for whatever reason, the data structure is just named Van Emde Boas. And it achieves log w per operation. I think I'll rewrite this: this is log log u per operation. But it requires u space.
So think of u space as being, like, for every item in the universe, I store: yes or no, is it in the set? So that's a lot of space, unless n and u are not too different. But we can do better. The cool thing, though, is the running time. This is really fast-- log log u. If you think about, for example, the universe being polynomial in n-- polynomial in n is the same as 2 to the c log n-- or you can even go crazy and raise log to a power, like 2 to the log to the fifth of n. For all those things, you take log twice, and log log u becomes Theta(log log n). So as long as your word size is not insanely large, you're getting log log n performance. So in general, when, let's say, w is polylog n, then we're getting this kind of performance. And I think on most computers, w is polylogarithmic. We said it has to be at least log. It's also, generally, not so much bigger than log. So log squared is probably fine most of the time, unless you have a really small problem. OK, so cool. But the space is giant.
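That "take log twice" claim is easy to check numerically (my own illustration; base-2 logs assumed throughout):

```python
import math

def loglog(u):
    return math.log2(math.log2(u))

n = 2 ** 32   # so log log n = log2(32) = 5

# if u is polynomial in n, say u = n^c = 2^(c log n), then
# log log u = log2(c) + log log n -- only an additive constant away:
for c in (1, 2, 8):
    assert abs(loglog(n ** c) - (math.log2(c) + loglog(n))) < 1e-9

loglog(n), loglog(n ** 8)   # (5.0, 8.0)
```

So even for universes as large as n to the eighth power, log log u is only a constant additive term above log log n.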
So how do we do better than that? Well, there are a couple of answers. One is that you can achieve log w with high probability, and order n space. With a slight tweak, basically, you combine Van Emde Boas plus hashing, and you get that. I don't actually know what the reference is for this result. It's been an exercise in various courses, and so on. I can talk more about that later.

Then, alternatively, there's another data structure which, in many ways, is simpler. It really embraces hashing. It's called y-fast trees. It achieves the same bounds-- so, log w with high probability and linear space. It's basically just a hash table with some cleverness. So we'll get there. Even though it's simpler, we're going to start with this structure. Historically, this is the way it happened-- Van Emde Boas, then y-fast trees, which are by Willard. And it'll be kind of a nice finale.

There's another data structure I want to talk about, which is designed for the case when w is very large-- much bigger than polylog n. In that case, there's something called fusion trees.
And you can achieve log base w of n-- and, I guess, with high probability and linear space. The original fusion trees are static, and you can do log base w of n deterministic queries. But there's a later version that's dynamic; it achieves this using hashing for updates-- insertions and deletions. Cool.

So this is an almost upside-down bound. It's obviously always an improvement over just log base 2 of n. But it's sometimes better and sometimes worse than log w. In fact, it kind of makes sense to take the min of them. When w is small, you want to use log w. When w is big, you want to use log base w of n. They're going to balance out when w is 2 to the root log n-- something like that. The easy way to see when these balance out is to set them equal. That will be when log w equals log n divided by log w-- let me do that over here. log w equals log n over log w. Then this is like saying log squared w equals log n, or log w equals root log n. So I was right: w is 2 to the root log n, which is a weird quantity.
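A quick numerical sanity check of that balance point (my own illustration; base-2 logs assumed):

```python
import math

n = 2 ** 64                      # log n = 64

def veb_cost(w):                 # Van Emde Boas: O(log w)
    return math.log2(w)

def fusion_cost(w):              # fusion trees: O(log base w of n)
    return math.log2(n) / math.log2(w)

# taking the min of the two structures, for various word sizes:
for w in (2 ** 4, 2 ** 8, 2 ** 16):
    print(w, min(veb_cost(w), fusion_cost(w)))
# w = 2^4  -> min is 4  (Van Emde Boas wins: small w)
# w = 2^8  -> min is 8  (balance: w = 2^sqrt(log n), both cost sqrt(log n))
# w = 2^16 -> min is 4  (fusion trees win: large w)
```

The min is worst exactly at w = 2 to the root log n, where both structures cost root log n; for any other w, one of the two does strictly better.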
But the easy thing to think about is this one-- log w is root log n. And in that case, the running time you get is root log n. So it's always, at most, this. And the worst case is when these things are balanced-- when these two are the same, and they both achieve root log n. But if w is smaller or larger than this threshold, these structures will be even better than root log n. But in particular, it's a nice way to think about it: we're doing sort of a square-factor improvement over binary search trees. And we can do this with high probability in linear space. So that's cool.

It turns out it's also pretty much optimal. And that's not at all obvious, and wasn't known for many years. So there's a cell-probe lower bound. So these are all in the word RAM model-- all these results. The first one actually kind of works in the pointer machine; I'll talk about that later. This lower bound is a little bit messy to state. The bound is slightly more complicated than what we've seen. But I'm going to restrict to a special situation, which is: if you have n polylog n space.
So this is a lower bound on static predecessor. All you need to do is solve predecessor and successor, or even just predecessor. There are no inserts and deletes. In that case, if you use lots of space, like u space, of course, you can do constant time for everything-- you just store all the answers. But if you want space that's not much bigger than n-- in particular, if you want to be able to do updates in polylog, this is the most space you could ever hope to achieve. So assuming that, which is pretty reasonable, there's a lower bound of the min of two things-- log base w of n, which is fusion trees, and, roughly, log w, which is Van Emde Boas. But it's slightly smaller than that.

Yeah-- pretty weird. Let me tell you the consequences; they're a little easier to think about. Van Emde Boas is going to be optimal for the kinds of cases we care about, which is when w is polylog n. And fusion trees are optimal when w is big. The balanced bound is square root of log n over log log n. OK-- a little messy. So there's this divided by log of log w over log n. If w is polylog n, then this is just order log log n.
And so this cancels; this becomes constant. So in these situations-- which are the ones I mentioned over here, w is polylog n, which is when we get log log n performance, and that's kind of the case we care about-- Van Emde Boas is the best thing to do. It turns out this is actually the right answer. You can do slightly better-- it's almost an exercise. You can tweak Van Emde Boas and get this slight improvement. But for most word sizes, it really doesn't matter; you're not saving much. Cool.

So other than that little factor, these are the right answers. You have to know about Van Emde Boas. You have to know about fusion trees. And so this lecture is about Van Emde Boas; next lecture is about fusion trees. This result is from 2006 and 2007, so it's pretty recent.

So let's start on Van Emde Boas. Yeah, let's dive into it. I'll talk about history a little later.
449 00:24:44,160 --> 00:24:45,800 The central idea, I guess, if you 450 00:24:45,800 --> 00:24:49,530 wanted to sum up Van Emde Boas in an equation, which 451 00:24:49,530 --> 00:24:52,980 is something we very rarely get to do in algorithms, 452 00:24:52,980 --> 00:24:55,115 is to think about this recurrence-- 453 00:24:55,115 --> 00:25:00,430 T of u is T of square root of u plus order 1. 454 00:25:00,430 --> 00:25:02,310 What does this solve to? 455 00:25:02,310 --> 00:25:05,470 log log u. 456 00:25:05,470 --> 00:25:10,080 All right, just think of taking logs. 457 00:25:10,080 --> 00:25:13,470 This is the same as T of w equals T of w 458 00:25:13,470 --> 00:25:15,930 over 2 plus order 1. 459 00:25:15,930 --> 00:25:17,600 w is the word size. 460 00:25:17,600 --> 00:25:19,470 And so this is log w. 461 00:25:19,470 --> 00:25:22,188 It's the same thing. 462 00:25:22,188 --> 00:25:25,410 If we could achieve this recurrence, then-- 463 00:25:25,410 --> 00:25:28,090 boom-- we get our bound of log w. 464 00:25:30,810 --> 00:25:32,800 So how do we do it. 465 00:25:32,800 --> 00:25:45,480 We split the universe into root u clusters, 466 00:25:45,480 --> 00:25:48,290 each of size root u. 467 00:25:51,980 --> 00:25:59,940 OK, so, if here is our universe, then I just 468 00:25:59,940 --> 00:26:03,090 split every square root of u items. 469 00:26:03,090 --> 00:26:06,870 So each of these is root u long. 470 00:26:06,870 --> 00:26:09,132 The number of them is square root of u. 471 00:26:09,132 --> 00:26:10,590 And then somehow, I want to recurse 472 00:26:10,590 --> 00:26:14,460 on each of these clusters. 473 00:26:14,460 --> 00:26:16,400 And I only get to recurse on one of them-- 474 00:26:16,400 --> 00:26:17,400 so a pretty simple idea. 475 00:26:34,550 --> 00:26:35,080 Yeah. 476 00:26:35,080 --> 00:26:36,621 So I'll talk about how to actually do 477 00:26:36,621 --> 00:26:37,810 that recursion in a moment. 
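A quick sanity check of that recurrence (a sketch in Python, not from the lecture): taking square roots of u repeatedly is the same as halving the bit-width w, so the recursion bottoms out after about log w = log log u levels.

```python
import math

def depth(u):
    # How many times we can take sqrt(u) before the universe shrinks to 2,
    # i.e. how deep the T(u) = T(sqrt(u)) + O(1) recursion goes.
    count = 0
    while u > 2:
        u = math.isqrt(u)
        count += 1
    return count

# u = 2^64: the bit-width halves 64 -> 32 -> 16 -> 8 -> 4 -> 2,
# so the depth is log2(64) = 6 = log log u.
print(depth(2**64))
```

For u = 2^64 this prints 6, matching log2(64) = log2(w) = log log u.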
478 00:26:37,810 --> 00:26:39,309 Before I get there, I want to define 479 00:26:39,309 --> 00:26:43,790 a sort of hierarchical coordinate system. 480 00:26:43,790 --> 00:26:46,730 This is a new way of phrasing it for me. 481 00:26:46,730 --> 00:26:48,790 So I hope you like it. 482 00:26:48,790 --> 00:26:52,510 If we have a word x, I want to write it as two 483 00:26:52,510 --> 00:26:55,440 coordinates-- c and i. 484 00:26:55,440 --> 00:26:57,470 I'm going to use angle brackets, so it 485 00:26:57,470 --> 00:27:00,700 doesn't get too confusing. c is which cluster you're in. 486 00:27:00,700 --> 00:27:03,970 So this is cluster 0, cluster 1, cluster 2, cluster 3. 487 00:27:03,970 --> 00:27:05,820 i is your index within the cluster. 488 00:27:05,820 --> 00:27:09,437 So this is 0, 1, 2, 3, 4, 5-- up to root u minus 1 489 00:27:09,437 --> 00:27:10,270 within this cluster. 490 00:27:10,270 --> 00:27:12,670 Then 0, 1, 2, 3, 4, 5 up to root u minus 1 491 00:27:12,670 --> 00:27:15,220 within this cluster-- so the i is 492 00:27:15,220 --> 00:27:19,720 your index within the cluster, like this, 493 00:27:19,720 --> 00:27:23,750 and c is which cluster you are in. 494 00:27:23,750 --> 00:27:24,250 OK. 495 00:27:24,250 --> 00:27:25,540 Pretty simple. 496 00:27:25,540 --> 00:27:29,260 And there's easy arithmetic to do this. 497 00:27:29,260 --> 00:27:33,430 c is x integer divide root u. 498 00:27:33,430 --> 00:27:38,200 And i is x integer mod root u. 499 00:27:38,200 --> 00:27:41,560 I used Python notation here. 500 00:27:41,560 --> 00:27:44,517 So fine, I think you all know this-- 501 00:27:44,517 --> 00:27:45,100 pretty simple. 502 00:27:45,100 --> 00:27:47,230 And if I gave you c and i, you could 503 00:27:47,230 --> 00:27:49,070 reconstruct x by just saying, oh, well, 504 00:27:49,070 --> 00:27:52,600 that's c times root u plus i. 505 00:27:52,600 --> 00:27:55,690 So in constant time, you can decompose a number 506 00:27:55,690 --> 00:27:56,950 into its two coordinates. 
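In the Python notation he mentions, the divide/mod decomposition and its inverse can be sketched like this (a toy example with u = 16, not the board's exact code):

```python
u = 16        # toy universe; root u = 4
root = 4

def coords(x):
    # c = which cluster, i = index within the cluster -- just divmod(x, root)
    return x // root, x % root

def rebuild(c, i):
    # inverse: x = c * root + i
    return c * root + i

c, i = coords(9)
print(c, i, rebuild(c, i))   # 9 lives in cluster 2, at index 1
```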
507 00:27:56,950 --> 00:27:59,581 That's the point. 508 00:27:59,581 --> 00:28:01,080 In fact, it's much easier than this. 509 00:28:01,080 --> 00:28:02,950 You don't even have to do division 510 00:28:02,950 --> 00:28:04,910 if you think of everything in binary, 511 00:28:04,910 --> 00:28:06,580 which computers tend to do. 512 00:28:06,580 --> 00:28:16,560 So the binary perspective is that x is a word. 513 00:28:16,560 --> 00:28:18,410 So it's a bunch of bits. 514 00:28:18,410 --> 00:28:25,000 0, 1, 1, 0, 1, 0, 0, 1-- whatever. 515 00:28:25,000 --> 00:28:29,440 Divide that bit sequence in half, and then this part 516 00:28:29,440 --> 00:28:32,920 is c, this part is i. 517 00:28:32,920 --> 00:28:35,740 And if you assume that w is a power of 2, 518 00:28:35,740 --> 00:28:37,012 these two are identical. 519 00:28:37,012 --> 00:28:38,470 If they're not a power of 2, you've 520 00:28:38,470 --> 00:28:40,600 got to round a little bit here. 521 00:28:40,600 --> 00:28:42,470 It doesn't matter. 522 00:28:42,470 --> 00:28:46,190 But you can use this definition instead of this one either way. 523 00:28:46,190 --> 00:28:48,430 So in this case, c is-- 524 00:28:48,430 --> 00:28:54,100 ooh, boy-- x shifted right, w over 2, basically. 525 00:28:54,100 --> 00:28:58,100 So this w over 2-- 526 00:28:58,100 --> 00:29:00,220 w over 2. 527 00:29:00,220 --> 00:29:04,220 The whole thing is w bits. 528 00:29:04,220 --> 00:29:07,150 So if I shift right, I get rid of the low order bits, 529 00:29:07,150 --> 00:29:08,240 if I want. 530 00:29:08,240 --> 00:29:10,120 i is slightly more annoying. 531 00:29:10,120 --> 00:29:18,070 But I can do it as an AND with 1 532 00:29:18,070 --> 00:29:24,690 shifted left w over 2, minus 1. 533 00:29:24,690 --> 00:29:26,290 That's probably how you do it in C. 534 00:29:26,290 --> 00:29:27,190 I don't know if you're used to this. 
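The binary version being described here can be sketched as follows (my Python, assuming w is a power of 2; the bit pattern is the one from the board):

```python
w = 8                      # word size in bits; assume a power of 2

def high(x):
    # c: shift right by w/2 to drop the low-order bits
    return x >> (w // 2)

def low(x):
    # i: AND with (1 << w/2) - 1, a mask of w/2 one bits
    return x & ((1 << (w // 2)) - 1)

def rebuild(c, i):
    # inverse: shift c back up and OR in i
    return (c << (w // 2)) | i

x = 0b01101001             # the example word from the board
print(bin(high(x)), bin(low(x)), rebuild(high(x), low(x)) == x)
```

For x = 01101001, the top half c is 0110 and the bottom half i is 1001, and reassembling them gives x back.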
535 00:29:27,190 --> 00:29:29,470 But if I take a 1 bit, I shift it over to here, 536 00:29:29,470 --> 00:29:30,310 and I subtract 1. 537 00:29:30,310 --> 00:29:31,900 Then I get a whole bunch of 1 bits. 538 00:29:31,900 --> 00:29:34,660 And then you mask with that bit pattern. 539 00:29:34,660 --> 00:29:36,850 So I'm masking with 1, 1, 1, 1. 540 00:29:36,850 --> 00:29:38,890 Then I'll just get the low order bits. 541 00:29:38,890 --> 00:29:40,900 Computers do this super fast-- way 542 00:29:40,900 --> 00:29:42,610 faster than integer division. 543 00:29:42,610 --> 00:29:44,830 Because this is just like routing bits around. 544 00:29:44,830 --> 00:29:47,770 So this is easy to do on a typical CPU. 545 00:29:47,770 --> 00:29:49,720 And this will be much faster than this code, 546 00:29:49,720 --> 00:29:53,440 even though it looks like more operations, typically. 547 00:29:53,440 --> 00:29:54,010 All right. 548 00:29:54,010 --> 00:29:54,940 So fine. 549 00:29:54,940 --> 00:29:58,090 The point is, I can decompose x into c and i. 550 00:29:58,090 --> 00:30:01,410 Of course, I can also do the reverse. 551 00:30:01,410 --> 00:30:06,355 This would be c shifted left w over 2, OR'd with i. 552 00:30:10,160 --> 00:30:11,920 It's a slight diversion. 553 00:30:11,920 --> 00:30:15,400 Now, I can tell you the actual recursion, 554 00:30:15,400 --> 00:30:19,240 and then talk about how to maintain it. 555 00:30:19,240 --> 00:30:24,580 So we're going to define a recursive Van Emde Boas 556 00:30:24,580 --> 00:30:32,280 structure of size u and word size w. 557 00:30:37,660 --> 00:30:39,670 And what it's going to look like is, 558 00:30:39,670 --> 00:30:48,830 we have a bunch of clusters, each of size square root of u. 559 00:30:54,820 --> 00:30:56,747 So this represents the first root u items. 560 00:30:56,747 --> 00:30:58,330 This represents the next root u items. 561 00:30:58,330 --> 00:31:01,100 This represents the last root u items, and so on. 
562 00:31:01,100 --> 00:31:03,710 So that's the obvious recursion from this. 563 00:31:03,710 --> 00:31:05,540 So this is going to be a Van Emde Boas 564 00:31:05,540 --> 00:31:07,850 structure of size root u. 565 00:31:07,850 --> 00:31:09,980 And then we also have a structure 566 00:31:09,980 --> 00:31:14,930 up top, which is called the summary structure. 567 00:31:14,930 --> 00:31:19,250 And the idea is, it represents, for each of these clusters, 568 00:31:19,250 --> 00:31:21,620 is the cluster empty or not? 569 00:31:21,620 --> 00:31:24,620 Does this cluster have any items in it? 570 00:31:24,620 --> 00:31:25,770 Yes or no. 571 00:31:25,770 --> 00:31:28,940 If yes, then the name of this cluster 572 00:31:28,940 --> 00:31:31,340 is in the summary structure. 573 00:31:31,340 --> 00:31:33,950 So notice, by this hierarchical decomposition, 574 00:31:33,950 --> 00:31:37,340 the cluster number and the index are 575 00:31:37,340 --> 00:31:40,020 valid names of items within these substructures. 576 00:31:40,020 --> 00:31:43,977 And basically we're going to use the i part to talk about things 577 00:31:43,977 --> 00:31:44,810 within the clusters. 578 00:31:44,810 --> 00:31:46,640 And we're going to use the c part to talk about things 579 00:31:46,640 --> 00:31:47,848 within the summary structure. 580 00:31:47,848 --> 00:31:50,280 They're both numbers between 0 and root u minus 1. 581 00:31:50,280 --> 00:31:54,170 And so we get this perspective. 582 00:31:54,170 --> 00:31:54,830 All right. 583 00:31:54,830 --> 00:32:01,730 So formally, or some notation, cluster i-- 584 00:32:01,730 --> 00:32:05,300 so we're going to have an array of clusters. 585 00:32:05,300 --> 00:32:10,470 It is Van Emde Boas thing of size square root u, 586 00:32:10,470 --> 00:32:15,620 and word size w over 2. 587 00:32:15,620 --> 00:32:19,100 This is slightly weird, because the machine, of course, 588 00:32:19,100 --> 00:32:20,630 its word size remains w. 
589 00:32:20,630 --> 00:32:22,890 It doesn't get smaller as you recurse. 590 00:32:22,890 --> 00:32:24,890 We're not going to try to spread the parallelism 591 00:32:24,890 --> 00:32:26,950 around or whatever. 592 00:32:26,950 --> 00:32:28,700 But this is just a notational convenience. 593 00:32:28,700 --> 00:32:31,340 I want to say the word size conceptually 594 00:32:31,340 --> 00:32:34,040 goes down to w over 2, so that this definition still 595 00:32:34,040 --> 00:32:35,270 makes sense. 596 00:32:35,270 --> 00:32:38,480 Because as I look at a smaller part of the word, 597 00:32:38,480 --> 00:32:41,690 in order to divide it in half, I have to shift right 598 00:32:41,690 --> 00:32:42,950 by a smaller amount. 599 00:32:42,950 --> 00:32:47,330 So that's the w that I'm passing into the structure. 600 00:32:47,330 --> 00:32:52,710 OK, and then v dot summary is the same thing. 601 00:32:52,710 --> 00:32:58,110 It's also a Van Emde Boas thing of size root u. 602 00:32:58,110 --> 00:33:01,550 Then the one other clever idea, which makes all of this work, 603 00:33:01,550 --> 00:33:05,040 is that we store the minimum element in v dot min. 604 00:33:10,490 --> 00:33:13,144 And we do not store it recursively. 605 00:33:20,070 --> 00:33:27,080 So there's also one item here, size 1, which is the min. 606 00:33:27,080 --> 00:33:28,530 It's just stored off to the side. 607 00:33:28,530 --> 00:33:30,590 It doesn't live in these structures. 608 00:33:30,590 --> 00:33:33,204 Every other item lives down here. 609 00:33:33,204 --> 00:33:35,120 And furthermore, if one of these is not empty, 610 00:33:35,120 --> 00:33:38,750 there's also a corresponding item up here. 611 00:33:38,750 --> 00:33:43,880 This turns out to be crucial to make a Van Emde Boas work. 612 00:33:43,880 --> 00:33:46,850 And then v dot max, we also need-- 613 00:33:46,850 --> 00:33:48,382 but it can be stored recursively. 
614 00:33:48,382 --> 00:33:50,090 So just think of it as a copy of whatever 615 00:33:50,090 --> 00:33:52,632 the maximum element is. 616 00:33:52,632 --> 00:33:54,590 OK, so in constant time, we can compute the min 617 00:33:54,590 --> 00:33:55,520 and compute the max. 618 00:33:55,520 --> 00:33:56,150 That's good. 619 00:33:56,150 --> 00:33:59,840 But then I claim also in log w time-- log log u time-- 620 00:33:59,840 --> 00:34:02,492 we can do insert, delete, predecessor, successor. 621 00:34:08,889 --> 00:34:09,770 So let's do that. 622 00:34:22,380 --> 00:34:24,040 This data structure-- the solution 623 00:34:24,040 --> 00:34:26,364 is both simple and a little bit subtle. 624 00:34:26,364 --> 00:34:28,030 And so this will be one of the few times 625 00:34:28,030 --> 00:34:30,250 I'm going to write explicit pseudocode-- say 626 00:34:30,250 --> 00:34:33,400 exactly how to maintain this data structure. 627 00:34:33,400 --> 00:34:35,320 It's short code, which is good. 628 00:34:35,320 --> 00:34:38,739 Each algorithm is only a few lines. 629 00:34:38,739 --> 00:34:40,301 But every line matters. 630 00:34:40,301 --> 00:34:42,550 So I want to write them down so I can talk about them. 631 00:34:46,040 --> 00:34:49,030 And with this new hierarchical notation, 632 00:34:49,030 --> 00:34:52,460 I think it's even easier to write these down. 633 00:34:52,460 --> 00:34:54,690 Let's see how I do. 634 00:36:08,510 --> 00:36:10,934 OK, so we'll start with the successor code. 635 00:36:10,934 --> 00:36:12,350 Predecessor is, of course, symmetric. 636 00:36:28,000 --> 00:36:31,624 And it basically has two cases. 637 00:36:31,624 --> 00:36:33,540 There's a special case in the beginning, which 638 00:36:33,540 --> 00:36:36,270 is, if the thing you're querying happens to be less 639 00:36:36,270 --> 00:36:38,670 than the minimum of the whole thing, then of course, 640 00:36:38,670 --> 00:36:40,572 the minimum is the successor. 
641 00:36:40,572 --> 00:36:42,780 This has to be done specially, because the min is not 642 00:36:42,780 --> 00:36:43,987 stored recursively. 643 00:36:43,987 --> 00:36:45,570 And so you've got to check for the min 644 00:36:45,570 --> 00:36:48,251 every single level of the recursion. 645 00:36:48,251 --> 00:36:49,500 But that's just constant time. 646 00:36:49,500 --> 00:36:50,334 No big deal. 647 00:36:50,334 --> 00:36:51,750 Then the interesting thing is, we 648 00:36:51,750 --> 00:36:54,420 have recursions on both sides-- 649 00:36:54,420 --> 00:36:58,150 in both cases-- but only one. 650 00:36:58,150 --> 00:37:00,380 The key is, we want this recurrence-- 651 00:37:00,380 --> 00:37:05,820 T of u is 1 times T of root u plus order 1. 652 00:37:05,820 --> 00:37:07,460 That gives us log log u. 653 00:37:07,460 --> 00:37:12,760 If there was a 2 here, we would get log u, which is no good. 654 00:37:12,760 --> 00:37:13,560 We want the one. 655 00:37:13,560 --> 00:37:16,810 So in one case, we call successor on a cluster. 656 00:37:16,810 --> 00:37:18,630 In the other case, we call successor 657 00:37:18,630 --> 00:37:22,230 on the summary structure. 658 00:37:22,230 --> 00:37:24,840 But we don't want to do both. 659 00:37:24,840 --> 00:37:27,900 So let's just think about, intuitively, what's going on. 660 00:37:27,900 --> 00:37:29,420 We've got this-- 661 00:37:29,420 --> 00:37:31,200 I guess I can do it in the same picture. 662 00:37:31,200 --> 00:37:34,710 We've got this summary and a bunch of clusters. 663 00:37:34,710 --> 00:37:36,870 And let's say you want to compute, what's 664 00:37:36,870 --> 00:37:39,040 the successor of this item? 665 00:37:39,040 --> 00:37:40,830 So via this transformation, we compute 666 00:37:40,830 --> 00:37:44,100 which cluster it lives in and where it is within the cluster. 667 00:37:44,100 --> 00:37:45,040 That's i. 668 00:37:45,040 --> 00:37:46,560 So it's some item here. 
669 00:37:46,560 --> 00:37:49,650 Now, it could be the successor is inside the same cluster. 670 00:37:49,650 --> 00:37:51,870 Maybe there's an item right there. 671 00:37:51,870 --> 00:37:54,330 Then want to recurse in here. 672 00:37:54,330 --> 00:37:57,090 Or it could be, it's in some future cluster. 673 00:38:00,570 --> 00:38:02,910 Let's do the first case. 674 00:38:02,910 --> 00:38:08,190 If, basically, we are less than the max of our own cluster, 675 00:38:08,190 --> 00:38:12,064 that means that the answer is in there. 676 00:38:12,064 --> 00:38:13,980 Figure out what the max is in this structure-- 677 00:38:13,980 --> 00:38:18,780 the rightmost item in s that's inside this cluster c. 678 00:38:18,780 --> 00:38:21,300 This is c. 679 00:38:21,300 --> 00:38:25,845 If our index is less than the max's index, then if we recurse 680 00:38:25,845 --> 00:38:28,219 in here, we will find an answer. 681 00:38:28,219 --> 00:38:29,760 If we're bigger than the max, then we 682 00:38:29,760 --> 00:38:31,051 won't find an answer down here. 683 00:38:31,051 --> 00:38:32,770 We have to recurse somewhere else. 684 00:38:32,770 --> 00:38:34,890 So that's what we do. 685 00:38:34,890 --> 00:38:37,500 If we're less than the max, then we just 686 00:38:37,500 --> 00:38:42,090 recursively find the successor of our index within cluster c. 687 00:38:42,090 --> 00:38:45,630 And we have to add on the c in front. 688 00:38:45,630 --> 00:38:47,460 Because successor within this cluster 689 00:38:47,460 --> 00:38:50,370 will only give an index within the cluster. 690 00:38:50,370 --> 00:38:54,620 And we have to prepend this c part to give a global name. 691 00:38:54,620 --> 00:38:56,070 OK, so that's case 1. 692 00:38:56,070 --> 00:38:57,520 Very easy. 693 00:38:57,520 --> 00:39:01,590 The other case is where we're slightly clever, in some sense. 
694 00:39:01,590 --> 00:39:06,630 We say, OK, well, if there's no successor within the cluster, 695 00:39:06,630 --> 00:39:08,040 maybe it's in the next cluster. 696 00:39:08,040 --> 00:39:09,660 Of course, that one might be empty, in which case, 697 00:39:09,660 --> 00:39:10,480 it's in the next cluster. 698 00:39:10,480 --> 00:39:13,050 But that one might be empty, so look at the next cluster. 699 00:39:13,050 --> 00:39:15,630 We need to find, what is the next non-empty cluster? 700 00:39:15,630 --> 00:39:19,020 For that, we use the summary structure. 701 00:39:19,020 --> 00:39:22,230 So we go up to position c here. 702 00:39:22,230 --> 00:39:25,400 We say, OK, what is the next non-empty structure after c? 703 00:39:25,400 --> 00:39:27,950 Because we know that's going to be where 704 00:39:27,950 --> 00:39:30,187 our answer lives for successor. 705 00:39:30,187 --> 00:39:31,770 So that's going to give us, basically, 706 00:39:31,770 --> 00:39:36,750 a pointer to one of these structures-- c prime, which-- 707 00:39:36,750 --> 00:39:38,249 all these guys are empty. 708 00:39:38,249 --> 00:39:39,790 And so there's no successor in there. 709 00:39:39,790 --> 00:39:43,150 The successor is then the min in this structure. 710 00:39:43,150 --> 00:39:44,160 So that's all we do. 711 00:39:44,160 --> 00:39:48,130 Compute the successor of c in the summary structure. 712 00:39:48,130 --> 00:39:51,900 And then, in that cluster, c prime, 713 00:39:51,900 --> 00:39:54,240 find the min, which takes constant time, 714 00:39:54,240 --> 00:39:59,060 and then prepend c prime to that to get a global name. 715 00:39:59,060 --> 00:40:01,320 And that's our successor. 716 00:40:01,320 --> 00:40:01,970 Yeah, question. 717 00:40:01,970 --> 00:40:05,864 AUDIENCE: Could you repeat why min is not recursive? 
718 00:40:05,864 --> 00:40:07,238 Because looking at this, it looks 719 00:40:07,238 --> 00:40:10,368 like all these smaller [INAUDIBLE] trees have 720 00:40:10,368 --> 00:40:12,715 [INAUDIBLE] 721 00:40:12,715 --> 00:40:13,590 ERIK DEMAINE: Ah, OK. 722 00:40:13,590 --> 00:40:14,295 Sorry. 723 00:40:14,295 --> 00:40:16,505 The question is, why is the minimum not recursive? 724 00:40:16,505 --> 00:40:18,380 The answer to that question is not yet clear. 725 00:40:18,380 --> 00:40:19,890 It will have to do with insertion. 726 00:40:19,890 --> 00:40:22,060 But I think what exactly this means, 727 00:40:22,060 --> 00:40:25,440 I maybe didn't state carefully enough. 728 00:40:25,440 --> 00:40:28,020 Every Van Emde Boas structure has a min-- 729 00:40:28,020 --> 00:40:29,460 stores a min. 730 00:40:29,460 --> 00:40:32,080 In that sense, this is done-- 731 00:40:32,080 --> 00:40:34,320 that's funny-- not so recursively. 732 00:40:34,320 --> 00:40:36,180 But every one stores it. 733 00:40:36,180 --> 00:40:38,850 The point is that this item doesn't 734 00:40:38,850 --> 00:40:40,740 get put into one of these clusters 735 00:40:40,740 --> 00:40:42,670 recursively-- just the item. 736 00:40:42,670 --> 00:40:44,310 But each of these has its own min, 737 00:40:44,310 --> 00:40:46,620 which is then not stored at the next level down. 738 00:40:46,620 --> 00:40:48,720 And each of those has its own min, which is not 739 00:40:48,720 --> 00:40:50,190 stored at the next level down. 740 00:40:50,190 --> 00:40:52,444 Think of this as kind of like a little buffer. 741 00:40:52,444 --> 00:40:54,360 The first time I insert it into the structure, 742 00:40:54,360 --> 00:40:55,568 I just stick it into the min. 743 00:40:55,568 --> 00:40:57,787 I don't touch anything else. 744 00:40:57,787 --> 00:40:59,870 You'll see when we get to the insertion algorithm. 745 00:40:59,870 --> 00:41:02,430 But it sort of slows things down from trickling. 
746 00:41:02,430 --> 00:41:07,126 AUDIENCE: So putting that min, is that what prevents from-- 747 00:41:07,126 --> 00:41:09,000 ERIK DEMAINE: That will prevent the insertion 748 00:41:09,000 --> 00:41:11,051 from doing two recursions instead of one. 749 00:41:11,051 --> 00:41:12,300 So we'll see that in a moment. 750 00:41:12,300 --> 00:41:15,379 At this point, just successor is very clear. 751 00:41:15,379 --> 00:41:17,920 This would work whether the min is stored recursively or not. 752 00:41:17,920 --> 00:41:20,440 But we need to know what the min is of every structure, 753 00:41:20,440 --> 00:41:23,382 and we need to know the max of every structure. 754 00:41:23,382 --> 00:41:25,840 At this point, you could just say that min and max could be 755 00:41:25,840 --> 00:41:27,610 copies-- no big deal-- 756 00:41:27,610 --> 00:41:28,550 and we'd be happy. 757 00:41:28,550 --> 00:41:31,294 And of course, predecessor does the same thing. 758 00:41:31,294 --> 00:41:33,710 So the slight cleverness here is that we use the min here. 759 00:41:33,710 --> 00:41:36,640 This could have been a successor operation with minus infinity 760 00:41:36,640 --> 00:41:37,840 as the query. 761 00:41:37,840 --> 00:41:40,120 But that would be two recursions. 762 00:41:40,120 --> 00:41:41,137 We can only afford one. 763 00:41:41,137 --> 00:41:42,970 Fortunately, it's the min item that we need. 764 00:41:42,970 --> 00:41:45,740 So we're done with successor. 765 00:41:45,740 --> 00:41:46,870 That was the easy case-- 766 00:41:46,870 --> 00:41:47,800 or the easy one. 767 00:41:47,800 --> 00:41:50,710 Insert is slightly harder. 768 00:41:50,710 --> 00:41:53,065 Delete is just slightly messier. 769 00:41:53,065 --> 00:41:54,570 It's basically the same as insert. 770 00:41:59,610 --> 00:42:03,790 So insert-- let me write the code again. 771 00:43:17,170 --> 00:43:20,340 Insertion also has two main cases. 772 00:43:20,340 --> 00:43:22,620 There's this case, and the other case. 
773 00:43:22,620 --> 00:43:23,850 But there's no else here. 774 00:43:23,850 --> 00:43:25,650 This happens in both cases. 775 00:43:25,650 --> 00:43:27,900 And then there's some just annoying little details 776 00:43:27,900 --> 00:43:28,800 at the beginning. 777 00:43:28,800 --> 00:43:31,410 Just like over here, we had to check for the min specially, 778 00:43:31,410 --> 00:43:34,170 here, we've got to update the min and max. 779 00:43:34,170 --> 00:43:37,836 And there's a special case, which I haven't mentioned yet. 780 00:43:37,836 --> 00:43:44,700 v dot min-- special case is, it will be this value, none, 781 00:43:44,700 --> 00:43:48,480 if the whole structure is empty. 782 00:43:48,480 --> 00:43:52,740 So this is the obvious way to tell whether a structure is 783 00:43:52,740 --> 00:43:54,247 empty and has no min. 784 00:43:54,247 --> 00:43:55,830 Because if there's any items in there, 785 00:43:55,830 --> 00:43:57,810 there's going to be one in the min slot. 786 00:43:57,810 --> 00:44:00,410 So first thing we do is check, is our structure empty? 787 00:44:00,410 --> 00:44:04,710 If it's empty, the min and the max become the inserted item. 788 00:44:04,710 --> 00:44:06,050 We're done. 789 00:44:06,050 --> 00:44:07,410 So that's the easy case. 790 00:44:07,410 --> 00:44:11,820 We do not store it recursively in here. 791 00:44:11,820 --> 00:44:14,580 That's what this means. 792 00:44:14,580 --> 00:44:17,894 This element does not get stored in any of the clusters. 793 00:44:17,894 --> 00:44:20,310 If it's not the very first item, or it's not the min item, 794 00:44:20,310 --> 00:44:24,520 then we're going to recursively insert it into a cluster. 795 00:44:24,520 --> 00:44:29,130 So if we have x in cluster c, we always 796 00:44:29,130 --> 00:44:36,840 insert index i into cluster c, except if it's the min. 797 00:44:36,840 --> 00:44:39,480 Now, it could be where a structure is non-empty. 798 00:44:39,480 --> 00:44:40,612 There is a min item there. 
799 00:44:40,612 --> 00:44:41,820 But we are less than the min. 800 00:44:41,820 --> 00:44:43,650 In that case, we're the new min, and we just swap those. 801 00:44:43,650 --> 00:44:45,733 And now, we have to recursively insert the old min 802 00:44:45,733 --> 00:44:47,680 into the rest of the structure. 803 00:44:47,680 --> 00:44:49,290 So that's a simple case. 804 00:44:49,290 --> 00:44:50,930 Then we also have to update v dot max, 805 00:44:50,930 --> 00:44:51,930 just in the obvious way. 806 00:44:51,930 --> 00:44:55,869 This is the easy way to maintain the v dot max invariant, 807 00:44:55,869 --> 00:44:56,910 that it is the maximum item. 808 00:44:56,910 --> 00:45:00,240 OK, now we have the two cases. 809 00:45:00,240 --> 00:45:02,100 I mean, this is really the obvious thing 810 00:45:02,100 --> 00:45:03,870 to do to get insertion. 811 00:45:03,870 --> 00:45:06,900 We have to update the summary structure, meaning, 812 00:45:06,900 --> 00:45:10,020 if the cluster that we are inserting into-- cluster c-- 813 00:45:10,020 --> 00:45:13,330 is empty, that means it was not yet in the summary structure. 814 00:45:13,330 --> 00:45:14,500 We need to put it in there. 815 00:45:14,500 --> 00:45:17,190 So we just insert c into v dot summary-- 816 00:45:17,190 --> 00:45:18,370 pretty obvious. 817 00:45:18,370 --> 00:45:24,044 And in all cases, we insert our item into cluster c. 818 00:45:24,044 --> 00:45:25,710 This looks bad, however, because there's 819 00:45:25,710 --> 00:45:27,820 two recursions in some cases. 820 00:45:27,820 --> 00:45:29,880 If this if doesn't hold, it's one recursion. 821 00:45:29,880 --> 00:45:30,930 Everything's fine. 822 00:45:30,930 --> 00:45:34,320 So if the cluster was already in use, great. 823 00:45:34,320 --> 00:45:35,770 This is one recursion. 824 00:45:35,770 --> 00:45:37,370 This is constant work. 825 00:45:37,370 --> 00:45:38,550 We're done. 
826 00:45:38,550 --> 00:45:40,800 The worry is, if the cluster was empty 827 00:45:40,800 --> 00:45:44,670 before, then this insertion is a whole recursion. 828 00:45:44,670 --> 00:45:48,010 That's scary, because we can't afford a second recursion. 829 00:45:48,010 --> 00:45:50,310 But it's all OK. 830 00:45:50,310 --> 00:45:53,160 Because if we do this recursion, that 831 00:45:53,160 --> 00:45:56,250 means that this cluster was empty, which means, 832 00:45:56,250 --> 00:45:59,910 in this recursion, we fall into this very first case. 833 00:45:59,910 --> 00:46:01,950 That structure, its min is none. 834 00:46:01,950 --> 00:46:03,750 That's what we just checked for. 835 00:46:03,750 --> 00:46:06,572 If it's none, we do constant work and stop. 836 00:46:06,572 --> 00:46:10,250 So everything's OK. 837 00:46:10,250 --> 00:46:13,170 If we recursed in the summary structure, 838 00:46:13,170 --> 00:46:15,060 this recursion will be a shallow recursion. 839 00:46:15,060 --> 00:46:16,290 It just does one thing. 840 00:46:16,290 --> 00:46:23,340 You could actually put this code into this if case, 841 00:46:23,340 --> 00:46:25,050 and make this an else case. 842 00:46:25,050 --> 00:46:26,814 That's another way to write the code. 843 00:46:26,814 --> 00:46:28,480 But this will be a very short recursion. 844 00:46:28,480 --> 00:46:30,580 So either you just do this recursion, 845 00:46:30,580 --> 00:46:32,160 which could be expensive, or you just 846 00:46:32,160 --> 00:46:34,470 do this one, in which case, we know this one was cheap. 847 00:46:34,470 --> 00:46:36,790 If this happens, we know this will take constant time. 848 00:46:36,790 --> 00:46:39,660 So in both cases, we get this recursion-- 849 00:46:39,660 --> 00:46:43,200 square root of u plus constant. 850 00:46:43,200 --> 00:46:45,086 And so we get log log u insertion. 851 00:46:48,507 --> 00:46:49,590 Do you want to see delete? 852 00:46:49,590 --> 00:46:51,410 I mean, it's basically the same thing. 
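Putting the pieces described so far into one runnable sketch (my own Python, not the lecture's board pseudocode; it assumes distinct keys, creates clusters lazily to stay short, and covers insert and successor as just explained):

```python
class VEB:
    """Sketch of a van Emde Boas structure over w-bit keys (u = 2**w).

    As in the lecture: v.min is stored off to the side, never recursively;
    v.max is just a copy; and each insert/successor makes only one deep
    recursive call, giving T(u) = T(sqrt(u)) + O(1) = O(log log u).
    """

    def __init__(self, w):
        self.w = w            # this level conceptually handles w-bit keys
        self.min = None       # None means this structure is empty
        self.max = None
        self.clusters = {}    # cluster number -> VEB(w // 2), made lazily
        self.summary = None   # VEB(w // 2) over the non-empty cluster numbers

    def _high(self, x):                       # c: which cluster
        return x >> (self.w // 2)

    def _low(self, x):                        # i: index within the cluster
        return x & ((1 << (self.w // 2)) - 1)

    def _index(self, c, i):                   # rebuild x from <c, i>
        return (c << (self.w // 2)) | i

    def insert(self, x):
        if self.min is None:                  # empty: x becomes min and max, O(1)
            self.min = self.max = x
            return
        if x < self.min:                      # x is the new min; the old min
            x, self.min = self.min, x         # gets inserted recursively instead
        if self.w > 1:
            c, i = self._high(x), self._low(x)
            if c not in self.clusters:
                self.clusters[c] = VEB(self.w // 2)
            if self.clusters[c].min is None:  # cluster was empty: update summary
                if self.summary is None:      # (then the cluster insert below
                    self.summary = VEB(self.w // 2)     # hits the O(1) case)
                self.summary.insert(c)
            self.clusters[c].insert(i)
        if x > self.max:
            self.max = x

    def successor(self, x):
        if self.min is not None and x < self.min:
            return self.min                   # min is off to the side: check it
        if self.w == 1:                       # base case: universe is {0, 1}
            return 1 if x == 0 and self.max == 1 else None
        c, i = self._high(x), self._low(x)
        cl = self.clusters.get(c)
        if cl is not None and cl.max is not None and i < cl.max:
            return self._index(c, cl.successor(i))     # answer is in cluster c
        if self.summary is not None:
            cp = self.summary.successor(c)              # next non-empty cluster
            if cp is not None:
                return self._index(cp, self.clusters[cp].min)
        return None
```

With w = 4 (so u = 16) and the set {1, 9, 10, 15}, successor(4) returns 9 and successor(15) returns None. Delete, covered next in the lecture, follows the same one-deep-recursion pattern.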
853 00:46:51,410 --> 00:46:53,995 It's in the notes. 854 00:46:53,995 --> 00:46:55,620 I mean, you do the obvious thing, which 855 00:46:55,620 --> 00:46:57,919 is, you delete in the cluster. 856 00:46:57,919 --> 00:46:59,460 And then if it became empty, you also 857 00:46:59,460 --> 00:47:02,550 have to delete in the summary structure. 858 00:47:02,550 --> 00:47:05,510 So there's, again, a chance that you do two recursions. 859 00:47:05,510 --> 00:47:08,130 But-- OK, I'm talking about it. 860 00:47:08,130 --> 00:47:10,920 Maybe I'll write a little bit of the code. 861 00:47:17,672 --> 00:47:19,130 I think I won't write all the code, 862 00:47:19,130 --> 00:47:20,450 though-- just the main stuff. 863 00:47:24,600 --> 00:47:31,130 So if we want to delete, then basically, 864 00:47:31,130 --> 00:47:36,800 we delete in cluster c, index i. 865 00:47:39,920 --> 00:47:44,510 And then if the cluster has become empty 866 00:47:44,510 --> 00:47:49,970 as a result of that, then we have 867 00:47:49,970 --> 00:47:53,870 to delete cluster c from the summary structure, 868 00:47:53,870 --> 00:47:56,240 so that our predecessor and successor queries actually 869 00:47:56,240 --> 00:47:56,930 still work. 870 00:48:04,132 --> 00:48:05,590 OK, so that's the bulk of the code. 871 00:48:05,590 --> 00:48:07,256 I mean, that's where the action happens. 872 00:48:07,256 --> 00:48:09,190 And the worry would be, in this if case, we're 873 00:48:09,190 --> 00:48:12,400 doing two recursive deletes. 874 00:48:12,400 --> 00:48:16,300 The claim is, if we do this second delete, 875 00:48:16,300 --> 00:48:19,930 which is potentially expensive-- this one was really cheap-- 876 00:48:19,930 --> 00:48:23,429 the claim is that emptying a Van Emde Boas structure 877 00:48:23,429 --> 00:48:24,970 takes constant time-- like, if you're 878 00:48:24,970 --> 00:48:26,841 deleting the last element. 879 00:48:26,841 --> 00:48:27,340 Why? 
880 00:48:27,340 --> 00:48:29,760 Because when you're deleting the last element, 881 00:48:29,760 --> 00:48:32,260 it's in the min right here. 882 00:48:32,260 --> 00:48:35,020 Everything below it-- all the recursive structures-- 883 00:48:35,020 --> 00:48:36,670 will be empty if there's only one item, 884 00:48:36,670 --> 00:48:37,919 because it will be right here. 885 00:48:37,919 --> 00:48:39,970 And you can check that from the insertion. 886 00:48:39,970 --> 00:48:43,390 If it was empty, all we did was change v dot min and v dot max. 887 00:48:43,390 --> 00:48:45,690 So the inverse, which I want right here, 888 00:48:45,690 --> 00:48:48,350 is just to clear out v dot min and v dot max. 889 00:48:48,350 --> 00:48:52,630 So if this ends up happening, this only took constant time. 890 00:48:52,630 --> 00:48:55,570 You don't have to recurse when you're deleting the last item. 891 00:48:55,570 --> 00:48:59,000 So in either case, you're really only doing one deep recursion. 892 00:48:59,000 --> 00:49:01,870 So you get the same recurrence, and you get log log u. 893 00:49:01,870 --> 00:49:04,390 So for the details, check out the notes. 894 00:49:04,390 --> 00:49:09,250 I want to go to other perspectives of Van Emde Boas. 895 00:49:09,250 --> 00:49:11,110 This is one way to think about it. 896 00:49:11,110 --> 00:49:14,260 And amusingly, and this is probably the most taught way 897 00:49:14,260 --> 00:49:16,540 to do Van Emde Boas. 898 00:49:16,540 --> 00:49:19,120 It's, in CLRS, described this way, 899 00:49:19,120 --> 00:49:21,967 because in 2001, when I first came here, 900 00:49:21,967 --> 00:49:24,550 I presented Van Emde Boas like this in an undergrad algorithms 901 00:49:24,550 --> 00:49:26,930 class with more details. 902 00:49:26,930 --> 00:49:29,500 You guys are grads, so I did it like three times faster 903 00:49:29,500 --> 00:49:34,497 than I would in 6046. 904 00:49:34,497 --> 00:49:36,080 So now, it's in textbooks and whatnot. 
905 00:49:36,080 --> 00:49:37,640 But this is not how Van Emde Boas 906 00:49:37,640 --> 00:49:39,832 presented this data structure-- just out 907 00:49:39,832 --> 00:49:40,790 of historical interest. 908 00:49:40,790 --> 00:49:44,401 This is a way that I believe was invented by Michael Bender 909 00:49:44,401 --> 00:49:46,400 and Martin Farach-Colton, who are the co-authors 910 00:49:46,400 --> 00:49:48,080 on "Cache-oblivious B-trees." 911 00:49:48,080 --> 00:49:49,730 And around 2001, they were looking 912 00:49:49,730 --> 00:49:52,680 at lots of old data structures and simplifying them. 913 00:49:52,680 --> 00:49:54,800 And I think this is a very clean, simple way 914 00:49:54,800 --> 00:49:56,429 to think about Van Emde Boas. 915 00:49:56,429 --> 00:49:58,220 But I want to tell you the other way, which 916 00:49:58,220 --> 00:50:02,600 is the way it originally appeared in the papers. 917 00:50:02,600 --> 00:50:05,840 There are actually three papers by van Emde 918 00:50:05,840 --> 00:50:09,260 Boas about this structure. 919 00:50:09,260 --> 00:50:10,640 Many papers appear twice-- 920 00:50:10,640 --> 00:50:12,770 once in a conference, once in a journal-- 921 00:50:12,770 --> 00:50:15,350 for this one, there are three relevant papers. 922 00:50:15,350 --> 00:50:18,105 There's a conference version and a journal version. 923 00:50:18,105 --> 00:50:20,480 The only weird thing there is that the conference version 924 00:50:20,480 --> 00:50:22,150 has one author-- van Emde Boas. 925 00:50:22,150 --> 00:50:24,130 The journal version has three authors-- 926 00:50:24,130 --> 00:50:28,135 van Emde Boas, Kaas, and Zijlstra. 927 00:50:28,135 --> 00:50:30,260 And they're acknowledged in the conference version, 928 00:50:30,260 --> 00:50:32,990 so I guess they helped even more. 929 00:50:32,990 --> 00:50:35,540 In particular, they, I think, implemented this data structure 930 00:50:35,540 --> 00:50:36,180 for the first time.
931 00:50:36,180 --> 00:50:37,554 It's a really easy data structure 932 00:50:37,554 --> 00:50:38,860 to implement, and very fast. 933 00:50:41,370 --> 00:50:43,400 Then there's a third paper by van Emde Boas 934 00:50:43,400 --> 00:50:47,010 only in a journal which improves the space a little bit. 935 00:50:47,010 --> 00:50:50,860 So we'll see a little bit what that's about. 936 00:50:50,860 --> 00:50:52,610 But what I like about both of these papers 937 00:50:52,610 --> 00:51:00,140 is they offer a simpler way to get log log u, successor, 938 00:51:00,140 --> 00:51:01,490 predecessor. 939 00:51:01,490 --> 00:51:04,490 Let's not worry about insertions and deletions for a little bit, 940 00:51:04,490 --> 00:51:08,990 and take what I'll call the simple tree view. 941 00:51:14,660 --> 00:51:18,150 So I'm going to draw a picture-- 942 00:51:18,150 --> 00:51:21,760 0, 1, 0, 0, 0, 0, 0-- 943 00:51:27,780 --> 00:51:29,720 OK. 944 00:51:29,720 --> 00:51:36,200 This is what we call a bit vector, meaning, here's 945 00:51:36,200 --> 00:51:38,510 item zero, item one, item two. 946 00:51:38,510 --> 00:51:42,195 And here is u minus 1. 947 00:51:42,195 --> 00:51:45,470 And I'll put a 1 if that element is in my set, and a 0 948 00:51:45,470 --> 00:51:47,280 otherwise. 949 00:51:47,280 --> 00:51:51,680 OK, so one is in the set, nine-- 950 00:51:51,680 --> 00:51:55,190 I think-- is in the set, 10, and 15 are in the set. 951 00:51:58,597 --> 00:51:59,930 I kind of want to maintain this. 952 00:51:59,930 --> 00:52:00,980 This is, of course, easy to maintain 953 00:52:00,980 --> 00:52:02,146 by insertions and deletions. 954 00:52:02,146 --> 00:52:03,740 I just flip a bit on or off. 955 00:52:03,740 --> 00:52:05,760 But I want to be able to do successor queries. 956 00:52:05,760 --> 00:52:07,850 And if I want the successor of, say, this 0, 957 00:52:07,850 --> 00:52:08,744 finding the next 1-- 958 00:52:08,744 --> 00:52:10,160 I don't want to have to walk down. 
959 00:52:10,160 --> 00:52:12,890 That would take order u time-- very bad. 960 00:52:12,890 --> 00:52:15,290 So the obvious thing to do is build a tree on this thing. 961 00:52:20,990 --> 00:52:25,265 And I'm going to put in here the or of the two children. 962 00:52:25,265 --> 00:52:27,140 Every node will store the or of its children. 963 00:52:31,160 --> 00:52:32,990 And then keep building the tree. 964 00:52:44,630 --> 00:52:49,400 Now we have a binary tree, with bits on the vertices. 965 00:52:49,400 --> 00:52:51,500 And I claim, if I want to compute 966 00:52:51,500 --> 00:52:54,020 the successor of this item, I can do it 967 00:52:54,020 --> 00:52:58,290 in a pretty natural way in log log u time. 968 00:52:58,290 --> 00:53:03,850 So keep in mind, this height here is w-- 969 00:53:03,850 --> 00:53:04,350 log u. 970 00:53:07,610 --> 00:53:09,270 So I need to achieve log w. 971 00:53:09,270 --> 00:53:12,740 So of course, you could try just walking down this tree, 972 00:53:12,740 --> 00:53:14,660 or walking up and then back down. 973 00:53:14,660 --> 00:53:17,510 That would take order w time. 974 00:53:17,510 --> 00:53:19,340 That's the obvious BST approach. 975 00:53:19,340 --> 00:53:21,600 I want to do log w. 976 00:53:21,600 --> 00:53:22,360 So how do I do it? 977 00:53:22,360 --> 00:53:27,626 I'm going to binary search on the height. 978 00:53:27,626 --> 00:53:29,920 How could I binary search on the height? 979 00:53:29,920 --> 00:53:33,340 Well, what I'd really like to do, in some sense-- 980 00:53:33,340 --> 00:53:37,570 if I look at the path of this node to the root-- 981 00:53:37,570 --> 00:53:40,940 where is my red chalk? 982 00:53:40,940 --> 00:53:43,710 So here's the path to the root. 983 00:53:46,840 --> 00:53:50,540 These bits are saying, is there anybody down here? 984 00:53:50,540 --> 00:53:52,870 That's what the or gives you. 985 00:53:52,870 --> 00:53:55,540 So it's like the summary structure.
986 00:53:55,540 --> 00:53:59,590 If I want to search for this guy-- well, if I walked up, 987 00:53:59,590 --> 00:54:01,660 eventually, I find a 1. 988 00:54:01,660 --> 00:54:04,180 And that's when I find the first nearby element. 989 00:54:04,180 --> 00:54:06,220 Now, in this case it's not the successor I find. 990 00:54:06,220 --> 00:54:08,110 It's really the predecessor I found. 991 00:54:08,110 --> 00:54:11,320 When you get to the first one-- the transition from 0 to 1-- 992 00:54:11,320 --> 00:54:12,730 you look at your sibling-- 993 00:54:12,730 --> 00:54:15,250 the other child of that one. 994 00:54:15,250 --> 00:54:19,210 And down in this subtree, there will be either the predecessor 995 00:54:19,210 --> 00:54:20,274 or the successor. 996 00:54:20,274 --> 00:54:21,940 In this case, we've got the predecessor, 997 00:54:21,940 --> 00:54:23,460 because it was to the left. 998 00:54:23,460 --> 00:54:25,140 We take the max element in there, 999 00:54:25,140 --> 00:54:27,081 and that's the predecessor of this item. 1000 00:54:27,081 --> 00:54:29,080 If instead, we had found this was our first one, 1001 00:54:29,080 --> 00:54:30,856 then we look over here, take the min-- 1002 00:54:30,856 --> 00:54:32,230 there's, of course, nothing here. 1003 00:54:32,230 --> 00:54:35,110 But in that situation, the min over there 1004 00:54:35,110 --> 00:54:36,670 would be our successor. 1005 00:54:36,670 --> 00:54:39,220 So we can't guarantee which one we find. 1006 00:54:39,220 --> 00:54:42,130 But we will find either the predecessor or the successor 1007 00:54:42,130 --> 00:54:45,410 if we could find the first transition from 0 to 1. 1008 00:54:45,410 --> 00:54:47,470 And we can do that via binary search, 1009 00:54:47,470 --> 00:54:49,596 because this string is monotone. 
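Here is one way the search just described might look in code. This is an illustrative sketch, not the lecture's code: the OR-tree sits heap-style in an array (leaf x at index u + x), and `subtree_min`/`subtree_max` walk down for simplicity, whereas the real structure stores the min and the max at every node so that step is constant time.

```python
u = 16
H = u.bit_length() - 1          # height of the tree: log2(u) levels of ORs
tree = [0] * (2 * u)            # heap-style: root at index 1, leaf x at u + x

def insert(x):
    j = u + x
    while j >= 1 and tree[j] == 0:   # set the leaf, propagate the OR upward
        tree[j] = 1
        j //= 2

def subtree_min(j):             # walk down to the smallest 1-leaf; the real
    while j < u:                # structure stores min/max per node instead,
        j = 2 * j if tree[2 * j] else 2 * j + 1     # making this O(1)
    return j - u

def subtree_max(j):
    while j < u:
        j = 2 * j + 1 if tree[2 * j + 1] else 2 * j
    return j - u

def pred_or_succ(x):
    """Return ('pred', p) or ('succ', s) for a key x not in the set, by
    binary searching x's root-to-leaf path for the 0 -> 1 transition."""
    if tree[u + x]:
        return ('member', x)    # the special case set aside in the lecture
    if not tree[1]:
        return None             # empty set: no 1 anywhere on the path
    lo, hi = 1, H               # heights above the leaf
    while lo < hi:              # O(log log u) probes of the path
        mid = (lo + hi) // 2
        if tree[(u + x) >> mid]:
            hi = mid
        else:
            lo = mid + 1
    child = (u + x) >> (lo - 1)  # the 0-node just below the first 1-ancestor
    sib = child ^ 1              # its sibling is guaranteed to be a 1
    if sib < child:              # sibling to the left: it holds the predecessor
        return ('pred', subtree_max(sib))
    return ('succ', subtree_min(sib))   # sibling to the right: the successor

for key in [1, 9, 10, 15]:       # the example set on the board
    insert(key)
```

Only the `while lo < hi` loop does the real work: log log u probes of the path, instead of walking all log u levels.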
1010 00:54:49,596 --> 00:54:51,220 It's a whole bunch of zeros for a while, 1011 00:54:51,220 --> 00:54:52,761 and then once you get a 1, it's going 1012 00:54:52,761 --> 00:54:54,730 to continue to be 1, because those are ors. 1013 00:54:54,730 --> 00:54:55,880 That one will propagate up. 1014 00:55:18,090 --> 00:55:21,210 So this is the new idea to get log log u, predecessor, 1015 00:55:21,210 --> 00:55:25,376 successor is to-- 1016 00:55:25,376 --> 00:55:34,555 let's say-- any root-to-leaf path is monotone. 1017 00:55:34,555 --> 00:55:37,090 It's 0 for a while, and then it becomes 1 forever. 1018 00:55:40,550 --> 00:55:44,890 So we should be able to binary search for the 0 1019 00:55:44,890 --> 00:55:46,236 to 1 transition. 1020 00:55:51,200 --> 00:55:57,470 And it either looks like this, or it looks like this. 1021 00:55:57,470 --> 00:56:04,745 So our query was somewhere down here in the 0 part. 1022 00:56:04,745 --> 00:56:06,370 I'm assuming that our query is not a 1. 1023 00:56:06,370 --> 00:56:08,990 Otherwise, it's an immediate 0 to 1 transition. 1024 00:56:08,990 --> 00:56:10,410 And that's a special case. 1025 00:56:10,410 --> 00:56:11,770 It's easy to deal with. 1026 00:56:11,770 --> 00:56:17,190 And then there's the other tree-- 1027 00:56:17,190 --> 00:56:19,450 the sibling of x-- 1028 00:56:19,450 --> 00:56:22,810 the other child of the 1. 1029 00:56:22,810 --> 00:56:25,870 And in this case, we want to take the min. 1030 00:56:25,870 --> 00:56:28,240 And that will give us our successor of x. 1031 00:56:31,540 --> 00:56:34,219 And in this case, we want to take the max over here, 1032 00:56:34,219 --> 00:56:36,010 and that will give us the predecessor of x. 1033 00:56:41,110 --> 00:56:42,860 So as long as we have the min and max of subtrees, 1034 00:56:42,860 --> 00:56:44,690 this is constant time. 1035 00:56:44,690 --> 00:56:47,480 We find either the predecessor or the successor. 1036 00:56:47,480 --> 00:56:49,400 Now, how do we get the other one?
1037 00:56:49,400 --> 00:56:50,330 Pretty easy. 1038 00:56:50,330 --> 00:56:54,140 Just store a linked list of all the items, in order. 1039 00:56:54,140 --> 00:56:57,980 So I'm going to store a pointer from this one to this one, 1040 00:56:57,980 --> 00:56:59,390 and vice versa-- 1041 00:56:59,390 --> 00:57:01,020 and this one to this one. 1042 00:57:01,020 --> 00:57:04,210 This is actually really easy to maintain. 1043 00:57:04,210 --> 00:57:07,394 Because when you insert, if you can compute 1044 00:57:07,394 --> 00:57:08,810 the predecessor and the successor, 1045 00:57:08,810 --> 00:57:10,280 you can just stick it in the linked list. 1046 00:57:10,280 --> 00:57:11,100 That's really easy. 1047 00:57:11,100 --> 00:57:13,260 We know how to do that in constant time. 1048 00:57:13,260 --> 00:57:15,770 So once you do this, it's enough to find one of them, 1049 00:57:15,770 --> 00:57:17,270 as long as you know which one it is. 1050 00:57:17,270 --> 00:57:18,830 Because then you just follow a pointer-- 1051 00:57:18,830 --> 00:57:19,990 either a forward or a backward pointer-- 1052 00:57:19,990 --> 00:57:21,060 and you get the other one. 1053 00:57:21,060 --> 00:57:22,268 So whichever one you wanted-- 1054 00:57:22,268 --> 00:57:24,350 you find both the predecessor and successor 1055 00:57:24,350 --> 00:57:26,690 at the cost of finding either one. 1056 00:57:26,690 --> 00:57:30,170 So that's a cute little trick. 1057 00:57:30,170 --> 00:57:34,610 This is hard to maintain, dynamically, at the moment. 1058 00:57:34,610 --> 00:57:37,670 But this is, I think, where the Van Emde Boas 1059 00:57:37,670 --> 00:57:39,080 structure came from. 1060 00:57:39,080 --> 00:57:42,830 It's nice to think about it in the tree view. 1061 00:57:42,830 --> 00:57:51,320 So we get log log u predecessor and successor. 1062 00:57:54,260 --> 00:57:58,040 I should say what this relies on is the ability to binary search 1063 00:57:58,040 --> 00:57:59,750 on any root-to-node path.
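A sketch of the linked-list trick, with dicts standing in for the forward and backward pointers (the names here are illustrative, not the lecture's):

```python
# dicts stand in for the forward/backward pointers of a sorted linked list
pred_ptr, succ_ptr = {}, {}

def splice_in(x, pred, succ):
    # insertion is easy once the neighbors are known: just relink them
    pred_ptr[x], succ_ptr[x] = pred, succ
    if pred is not None:
        succ_ptr[pred] = x
    if succ is not None:
        pred_ptr[succ] = x

def both_neighbors(kind, y):
    # the tree search returned ('pred', y) or ('succ', y); one pointer hop
    # recovers the other neighbor, so we always get the pair
    return (y, succ_ptr[y]) if kind == 'pred' else (pred_ptr[y], y)

# build 1 <-> 9 <-> 10 <-> 15, the example set on the board
for x, p in [(1, None), (9, 1), (10, 9), (15, 10)]:
    splice_in(x, p, None)
```

So whichever of the two the tree search happens to return, one pointer hop recovers the pair.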
1064 00:57:59,750 --> 00:58:03,740 Now, there aren't enough pointers to do that. 1065 00:58:03,740 --> 00:58:04,680 So you have a choice. 1066 00:58:04,680 --> 00:58:07,520 Either you realize, oh, this is a bunch 1067 00:58:07,520 --> 00:58:09,990 of bits in a complete binary tree, 1068 00:58:09,990 --> 00:58:14,090 so I can store them sequentially in an array. 1069 00:58:14,090 --> 00:58:18,050 And given a particular node position in that array, 1070 00:58:18,050 --> 00:58:20,660 I can compute, what is the second ancestor, 1071 00:58:20,660 --> 00:58:23,970 or the fourth ancestor or whatever, in constant time. 1072 00:58:23,970 --> 00:58:26,330 I just do some arithmetic and I can compute from here 1073 00:58:26,330 --> 00:58:27,205 where to go to there. 1074 00:58:27,205 --> 00:58:29,630 It's like the regular old heaps, but a little bit 1075 00:58:29,630 --> 00:58:31,255 embellished, because you have to divide 1076 00:58:31,255 --> 00:58:33,540 by a larger power of two, not just one of them. 1077 00:58:33,540 --> 00:58:36,000 So that's one way to do it. 1078 00:58:36,000 --> 00:58:39,310 So in a RAM, that all works fine. 1079 00:58:39,310 --> 00:58:42,145 When van Emde Boas wrote this paper, though, the RAM didn't-- 1080 00:58:42,145 --> 00:58:43,490 it kind of existed. 1081 00:58:43,490 --> 00:58:45,590 It just wasn't as well-developed then. 1082 00:58:45,590 --> 00:58:49,520 And the hot thing at the time was the pointer machine, 1083 00:58:49,520 --> 00:58:52,310 or I guess at that point, they called it the Pascal machine, 1084 00:58:52,310 --> 00:58:53,690 more or less. 1085 00:58:53,690 --> 00:58:55,280 Pascal does have arrays. 1086 00:58:55,280 --> 00:58:59,660 And the funny thing is, Van Emde Boas does use arrays, 1087 00:58:59,660 --> 00:59:01,220 but mostly it's pointers. 1088 00:59:01,220 --> 00:59:03,840 And you can get rid of the arrays from their structure.
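The arithmetic being referred to is just heap indexing. With the root stored at index 1, dividing by 2**k -- a right shift -- jumps k levels up:

```python
def ancestor(j, k):
    # k-th ancestor of heap index j: divide by 2**k, i.e. shift right by k
    return j >> k
```

For example, node 13's parent is 6, its grandparent is 3, and its great-grandparent is the root, 1.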
1089 00:59:03,840 --> 00:59:07,040 And essentially, in the end, Van Emde Boas, 1090 00:59:07,040 --> 00:59:10,744 as presented like this, is in a pointer machine. 1091 00:59:10,744 --> 00:59:12,410 Let me tell you a little bit about that. 1092 00:59:16,440 --> 00:59:28,040 So original Van Emde Boas, which I'll call stratified trees-- 1093 00:59:28,040 --> 00:59:30,980 that's what he called it-- 1094 00:59:30,980 --> 00:59:35,520 is basically this tree structure with a lot more pointers. 1095 00:59:35,520 --> 00:59:39,500 So in particular, each leaf-- 1096 00:59:39,500 --> 00:59:42,000 or every node, actually, let's say-- 1097 00:59:42,000 --> 00:59:54,080 stores a pointer to 2 to the ith ancestor, 1098 00:59:54,080 --> 01:00:02,123 where i is 0, 1, up to log w. 1099 01:00:02,123 --> 01:00:04,590 Because it was the 2 to the-- here. 1100 01:00:04,590 --> 01:00:08,047 So once you get the ancestor immediately 1101 01:00:08,047 --> 01:00:10,130 above me, two steps above me, four steps above me, 1102 01:00:10,130 --> 01:00:11,880 eight steps above me, that's what I really 1103 01:00:11,880 --> 01:00:13,700 need to do this binary search. 1104 01:00:13,700 --> 01:00:16,094 The first thing I need is halfway up. 1105 01:00:16,094 --> 01:00:17,510 And then if I have to go down, I'm 1106 01:00:17,510 --> 01:00:19,310 going to need a quarter of the way up. 1107 01:00:19,310 --> 01:00:21,800 And if I have to go down, I want an eighth of the way up. 1108 01:00:21,800 --> 01:00:25,110 Whenever I go up, from-- if I decide, oh, this is a 0. 1109 01:00:25,110 --> 01:00:26,432 I've got to go above here. 1110 01:00:26,432 --> 01:00:27,890 Then I do the same thing from here. 1111 01:00:27,890 --> 01:00:29,970 I want to go halfway up from here-- 1112 01:00:29,970 --> 01:00:30,830 from this node. 1113 01:00:30,830 --> 01:00:36,140 So as long as every node knows how to go up by any power of 2, 1114 01:00:36,140 --> 01:00:36,800 we're golden. 
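This pointer table is, in modern terms, exactly the bookkeeping known as binary lifting. A small illustrative sketch -- the dict-of-lists representation is a choice made here; in the paper these pointers hang off each node:

```python
def build_up(parent, height):
    """up[v][i] = the 2**i-th ancestor of v (None once you pass the root).
    parent maps each node to its parent (the root maps to None)."""
    levels = height.bit_length()              # enough i's to span the height
    up = {v: [p] for v, p in parent.items()}  # i = 0: the plain parent
    for i in range(1, levels + 1):
        for v in up:
            half = up[v][i - 1]               # go up 2**(i-1) steps...
            up[v].append(up[half][i - 1] if half in up else None)  # ...twice
    return up
```

With these pointers in hand, every "go halfway up from here" step of the binary search is a single pointer follow.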
1115 01:00:36,800 --> 01:00:39,396 We can do a binary search. 1116 01:00:39,396 --> 01:00:41,270 The trouble with this is, it increases space. 1117 01:00:41,270 --> 01:00:47,560 This is u log w space, which is a little bit bigger than u. 1118 01:00:47,560 --> 01:00:50,060 And the original van Emde Boas paper, conference and journal 1119 01:00:50,060 --> 01:00:51,830 version, achieves this bound-- 1120 01:00:51,830 --> 01:00:53,210 not u. 1121 01:00:53,210 --> 01:00:55,790 Little historical fun fact-- 1122 01:00:55,790 --> 01:00:58,250 not terribly well known. 1123 01:00:58,250 --> 01:00:58,790 Cool. 1124 01:00:58,790 --> 01:01:01,235 So that's stratified trees. 1125 01:01:05,620 --> 01:01:08,100 Anything else? 1126 01:01:08,100 --> 01:01:08,600 All right. 1127 01:01:08,600 --> 01:01:09,680 Stratified tree. 1128 01:01:09,680 --> 01:01:10,730 Right. 1129 01:01:10,730 --> 01:01:13,547 At this point, we have fast search, but slow updates. 1130 01:01:13,547 --> 01:01:15,130 Let me tell you about updates in a second. 1131 01:01:15,130 --> 01:01:16,080 Yeah, question. 1132 01:01:16,080 --> 01:01:19,002 AUDIENCE: So once you do binary search to find the first 1, 1133 01:01:19,002 --> 01:01:22,910 how do you walk back down the tree-- 1134 01:01:22,910 --> 01:01:24,400 ERIK DEMAINE: Oh, I didn't mention, 1135 01:01:24,400 --> 01:01:26,360 but also, every node stores min and max. 1136 01:01:33,500 --> 01:01:36,950 So that lets me do the teleportation back down. 1137 01:01:36,950 --> 01:01:39,570 Every node knows the min and the max of its subtree. 1138 01:01:39,570 --> 01:01:40,070 Right. 1139 01:01:40,070 --> 01:01:42,917 One more thing I was forgetting here-- 1140 01:01:42,917 --> 01:01:44,750 when I say, this is a lot of pointers to store. 1141 01:01:44,750 --> 01:01:47,330 You can't store them all in one node. 1142 01:01:47,330 --> 01:01:50,220 And in the van Emde Boas paper, it's stored in an array.
1143 01:01:50,220 --> 01:01:51,470 But it doesn't really need to be an array. 1144 01:01:51,470 --> 01:01:53,120 It could just as well be a linked list. 1145 01:01:53,120 --> 01:01:56,600 And that's how you get a pointer machine. 1146 01:01:56,600 --> 01:01:59,486 So this could be a linked list. 1147 01:01:59,486 --> 01:02:01,610 And then this whole thing works in a pointer machine, 1148 01:02:01,610 --> 01:02:03,230 which is kind of neat. 1149 01:02:03,230 --> 01:02:07,479 And it's a little weird, because if you used a comparison 1150 01:02:07,479 --> 01:02:09,770 pointer machine, where all you can do is compare items, 1151 01:02:09,770 --> 01:02:12,110 there's a lower bound of log n, because you only 1152 01:02:12,110 --> 01:02:14,780 have branching factor constant. 1153 01:02:14,780 --> 01:02:18,620 But here, the formulation of the problem is, when I say, 1154 01:02:18,620 --> 01:02:20,480 give me the successor of this, I actually 1155 01:02:20,480 --> 01:02:23,967 give you a pointer to this item. 1156 01:02:23,967 --> 01:02:26,300 And then from there, you can do all this jumping around, 1157 01:02:26,300 --> 01:02:28,520 and find your predecessor or successor. 1158 01:02:28,520 --> 01:02:30,830 So in this world, you need at least u space, 1159 01:02:30,830 --> 01:02:32,555 even to be able to specify the input. 1160 01:02:35,290 --> 01:02:37,750 So that's kind of a limitation of the pointer machine. 1161 01:02:37,750 --> 01:02:39,958 And you can actually show in the pointer machine that log 1162 01:02:39,958 --> 01:02:46,070 log u is optimal for any predecessor data structure 1163 01:02:46,070 --> 01:02:47,510 in the pointer machine. 1164 01:02:47,510 --> 01:02:53,567 So there's a matching lower bound of log log u in this model. 1165 01:02:53,567 --> 01:02:54,650 And you need u space. 1166 01:02:54,650 --> 01:02:56,020 So it's not very exciting. 1167 01:02:56,020 --> 01:02:58,120 What we like is the word RAM.
1168 01:02:58,120 --> 01:03:00,070 There, we can reduce space to n. 1169 01:03:00,070 --> 01:03:03,400 And that's what I want to do next, I believe-- 1170 01:03:03,400 --> 01:03:04,780 almost next. 1171 01:03:04,780 --> 01:03:08,380 One more mention-- actual stratified trees-- 1172 01:03:08,380 --> 01:03:11,110 here, we got query fast, update slow. 1173 01:03:11,110 --> 01:03:13,990 Stratified trees actually do update fast, as well. 1174 01:03:13,990 --> 01:03:17,710 Essentially, it's this idea, plus you don't recursively 1175 01:03:17,710 --> 01:03:20,290 store the min, which, of course, makes 1176 01:03:20,290 --> 01:03:21,900 all these bits no longer accurate, 1177 01:03:21,900 --> 01:03:23,740 so it gets much messier. 1178 01:03:23,740 --> 01:03:26,800 But in the end, it's doing exactly the same thing 1179 01:03:26,800 --> 01:03:28,175 as this recursion. 1180 01:03:28,175 --> 01:03:30,100 In fact, you can draw the picture. 1181 01:03:30,100 --> 01:03:35,870 It is this part up here-- 1182 01:03:35,870 --> 01:03:37,330 the top half of the tree-- 1183 01:03:37,330 --> 01:03:38,130 this is summary. 1184 01:03:41,110 --> 01:03:46,408 And each of these bottom halves is a cluster. 1185 01:03:46,408 --> 01:03:51,820 And there are root u clusters down here. 1186 01:03:51,820 --> 01:03:53,500 So those are smaller structures. 1187 01:03:53,500 --> 01:03:57,610 And there's one root u sized Van Emde Boas structure, which 1188 01:03:57,610 --> 01:03:58,900 is a summary structure. 1189 01:03:58,900 --> 01:04:02,617 These bits here are the bit vector representation 1190 01:04:02,617 --> 01:04:03,700 of the summary structure. 1191 01:04:03,700 --> 01:04:05,283 It's, is there anyone in this cluster? 1192 01:04:05,283 --> 01:04:08,597 Is there anyone in this cluster, and so on? 1193 01:04:08,597 --> 01:04:10,930 This, of course, also looks a lot like the Van Emde Boas 1194 01:04:10,930 --> 01:04:11,650 layout.
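In arithmetic, the correspondence in this picture is just splitting a key with divmod. Taking u = 16 as in the earlier example:

```python
import math

u = 16
r = math.isqrt(u)        # sqrt(u) clusters, each covering sqrt(u) keys

def split(x):
    # (which cluster x falls in, where it sits inside that cluster)
    return divmod(x, r)
```

So key 9 is entry 1 of cluster 2, and key 15 is the last entry of the last cluster.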
1195 01:04:11,650 --> 01:04:14,137 Take a binary tree, cut it in half, do the top, 1196 01:04:14,137 --> 01:04:15,220 recursively do the bottom. 1197 01:04:15,220 --> 01:04:17,428 So that's why it was called the Van Emde Boas layout, 1198 01:04:17,428 --> 01:04:18,791 is this picture. 1199 01:04:18,791 --> 01:04:20,290 But if you take this tree structure, 1200 01:04:20,290 --> 01:04:22,039 and then you don't recursively store mins, 1201 01:04:22,039 --> 01:04:24,880 and then the bits are not quite accurate, it's messy. 1202 01:04:24,880 --> 01:04:26,769 And so stratified trees-- you should 1203 01:04:26,769 --> 01:04:28,060 try to read the original paper. 1204 01:04:28,060 --> 01:04:28,870 It's a mess. 1205 01:04:28,870 --> 01:04:31,640 Whereas this code-- pretty clean. 1206 01:04:31,640 --> 01:04:33,280 And so once you say, oh, I'm just 1207 01:04:33,280 --> 01:04:35,440 going to store all these clusters as an array 1208 01:04:35,440 --> 01:04:37,540 and not worry about keeping track of the tree, 1209 01:04:37,540 --> 01:04:39,620 it actually gets a lot easier. 1210 01:04:39,620 --> 01:04:43,090 And that was the Bender/Farach-Colton cleaning 1211 01:04:43,090 --> 01:04:44,920 up, which never appeared in print. 1212 01:04:44,920 --> 01:04:48,380 But it's appeared in the lecture notes all over the place-- 1213 01:04:48,380 --> 01:04:50,700 and now CLRS. 1214 01:04:50,700 --> 01:04:51,940 Cool. 1215 01:04:51,940 --> 01:04:53,656 I want to tell you about two more things. 1216 01:04:53,656 --> 01:04:55,030 It's actually going to get easier 1217 01:04:55,030 --> 01:04:57,989 the more time we spend with this data structure. 1218 01:05:21,970 --> 01:05:24,790 All right. 1219 01:05:24,790 --> 01:05:27,970 Let me draw a box. 1220 01:05:27,970 --> 01:05:31,930 At this point, we've seen a clean way to get Van Emde Boas. 1221 01:05:31,930 --> 01:05:34,570 And we've seen a cute way in a tree 1222 01:05:34,570 --> 01:05:37,240 to get search fast, but update slow. 
1223 01:05:37,240 --> 01:05:39,280 I want to talk a little more about that. 1224 01:05:39,280 --> 01:05:41,200 Let's suppose I have this data structure. 1225 01:05:41,200 --> 01:05:46,120 It achieves log w query, which is fast, 1226 01:05:46,120 --> 01:05:50,590 but it only achieves w update, which is slow. 1227 01:05:50,590 --> 01:05:52,270 How do you update the structure? 1228 01:05:52,270 --> 01:05:54,010 You update one bit at the bottom, 1229 01:05:54,010 --> 01:05:56,980 and then you've got to update all the bits up the path. 1230 01:05:56,980 --> 01:05:59,590 So you spend w time to do an update over here. 1231 01:06:02,410 --> 01:06:05,680 If updates are slow, I just want to do fewer updates. 1232 01:06:05,680 --> 01:06:07,570 We have a trick for doing this, which 1233 01:06:07,570 --> 01:06:10,630 is, you put little things down here of size theta w. 1234 01:06:16,810 --> 01:06:19,810 And then only one item from here gets promoted 1235 01:06:19,810 --> 01:06:21,490 into the top structure. 1236 01:06:21,490 --> 01:06:26,860 We only end up having n over w items up here, and about 1 1237 01:06:26,860 --> 01:06:29,050 over w as many updates. 1238 01:06:29,050 --> 01:06:31,930 If I want to do an insertion, I do a search here 1239 01:06:31,930 --> 01:06:33,880 to figure out which of these little-- 1240 01:06:33,880 --> 01:06:39,370 I'll call these "chunks--" which little chunk it belongs in. 1241 01:06:39,370 --> 01:06:41,800 I do an insert there. 1242 01:06:41,800 --> 01:06:43,360 If that structure gets too big-- it's 1243 01:06:43,360 --> 01:06:45,730 bigger than, say, 2 times w, or 4 times w, 1244 01:06:45,730 --> 01:06:48,516 whatever-- then I'll split it. 1245 01:06:48,516 --> 01:06:50,765 And if I delete from something, and it gets too small, 1246 01:06:50,765 --> 01:06:53,330 I'll merge with the neighbor, or maybe re-split-- 1247 01:06:53,330 --> 01:06:55,580 just like B-trees. 1248 01:06:55,580 --> 01:06:58,820 We've done this many times, by now.
1249 01:06:58,820 --> 01:07:01,490 But only when it splits, or I do a merge, 1250 01:07:01,490 --> 01:07:03,050 do I have to do an update up here. 1251 01:07:03,050 --> 01:07:05,540 Only when the set of chunks changes do 1252 01:07:05,540 --> 01:07:07,790 I need to do a single insertion or deletion 1253 01:07:07,790 --> 01:07:10,100 up here-- or a constant number. 1254 01:07:10,100 --> 01:07:15,860 So this update time goes down by a factor of w. 1255 01:07:15,860 --> 01:07:18,567 But I have to pay whatever the update cost is here. 1256 01:07:18,567 --> 01:07:20,150 So what do I do with this data structure? 1257 01:07:20,150 --> 01:07:21,530 I don't want to use Van Emde Boas, because this 1258 01:07:21,530 --> 01:07:22,580 could be a very big universe. 1259 01:07:22,580 --> 01:07:23,205 Who knows what? 1260 01:07:23,205 --> 01:07:26,360 I use a binary search tree. 1261 01:07:26,360 --> 01:07:28,250 Here, I can afford a binary search tree, 1262 01:07:28,250 --> 01:07:30,610 because then it's only log w. 1263 01:07:30,610 --> 01:07:32,914 log w is the bound we're trying to get. 1264 01:07:32,914 --> 01:07:34,580 So you can do these binary search trees. 1265 01:07:34,580 --> 01:07:35,490 It's trivial. 1266 01:07:35,490 --> 01:07:37,970 Just do insert, delete, search. 1267 01:07:37,970 --> 01:07:39,994 Everything will be log w. 1268 01:07:39,994 --> 01:07:42,410 So if I want to do a search, I search through here, which, 1269 01:07:42,410 --> 01:07:44,476 conveniently, is already fast-- log w-- 1270 01:07:44,476 --> 01:07:46,850 and then I do a search through here, which is also log w. 1271 01:07:46,850 --> 01:07:47,445 So it's nice and balanced. 1272 01:07:47,445 --> 01:07:48,530 Everything's log w. 1273 01:07:51,632 --> 01:07:53,840 If I want to do an insertion, I do an insertion here. 1274 01:07:53,840 --> 01:07:56,090 If it splits, I do an insertion here.
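Here is an illustrative sketch of this indirection scheme, with plain sorted lists standing in for both the top structure and the little BSTs, and with deletions and merges omitted. The linear scan over chunk minima is purely for readability -- it stands in for the fast top-structure search:

```python
import bisect

W = 4               # stand-in for the word size; chunks hold Theta(W) keys
chunks = []         # sorted list of sorted lists; chunks[j][0] is the key
                    # "promoted" into the top structure

def insert(x):
    if not chunks:
        chunks.append([x])
        return
    reps = [c[0] for c in chunks]          # the promoted representatives
    j = max(0, bisect.bisect_right(reps, x) - 1)
    bisect.insort(chunks[j], x)            # insert into the little chunk
    if len(chunks[j]) > 2 * W:             # chunk too big: split it in two.
        half = len(chunks[j]) // 2         # Only now does the top structure
        chunks[j:j + 1] = [chunks[j][:half], chunks[j][half:]]  # change.

def successor(x):
    if not chunks:
        return None
    reps = [c[0] for c in chunks]
    j = max(0, bisect.bisect_right(reps, x) - 1)
    k = bisect.bisect_right(chunks[j], x)  # search inside the chunk: log W
    if k < len(chunks[j]):
        return chunks[j][k]
    return chunks[j + 1][0] if j + 1 < len(chunks) else None
```

A split only happens after Theta(W) insertions have landed in one chunk, which is exactly what pays for the occasional expensive top-structure update.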
1275 01:07:56,090 --> 01:08:00,470 But that order w update cost, I charge to the order 1276 01:08:00,470 --> 01:08:02,660 w updates I would have had to do in this chunk 1277 01:08:02,660 --> 01:08:04,700 before it got split. 1278 01:08:04,700 --> 01:08:08,060 So this our good friend indirection, a technique we 1279 01:08:08,060 --> 01:08:09,800 will use over and over in this class. 1280 01:08:09,800 --> 01:08:14,450 It's very helpful when you're almost at the right bound. 1281 01:08:14,450 --> 01:08:17,479 And that's actually in the follow-up van Emde Boas paper. 1282 01:08:17,479 --> 01:08:20,520 A similar indirection trick is in there. 1283 01:08:20,520 --> 01:08:31,370 So we can charge the order w update in top to-- 1284 01:08:31,370 --> 01:08:33,020 that's the cost of the update-- 1285 01:08:33,020 --> 01:08:38,180 to the order w updates that have actually 1286 01:08:38,180 --> 01:08:42,740 been performed in the bottom. 1287 01:08:42,740 --> 01:08:44,600 Because when somebody gets split, 1288 01:08:44,600 --> 01:08:47,330 it's nice in its average state-- or when it gets merged, 1289 01:08:47,330 --> 01:08:48,640 it's going to be close to its average state. 1290 01:08:48,640 --> 01:08:50,598 You have to do a lot of insertions or deletions 1291 01:08:50,598 --> 01:08:54,630 to get it out of whack, and cause a split or a merge. 1292 01:08:54,630 --> 01:08:56,000 So-- boom. 1293 01:08:56,000 --> 01:09:01,700 This means the updates become log w. 1294 01:09:01,700 --> 01:09:04,130 Searches are also log w. 1295 01:09:04,130 --> 01:09:07,550 So we've got Van Emde Boas again, in a new way. 1296 01:09:07,550 --> 01:09:11,127 Bonus points-- if you take this structure-- 1297 01:09:14,330 --> 01:09:16,950 even this structure, if we did it in the array form-- 1298 01:09:16,950 --> 01:09:17,450 great. 1299 01:09:17,450 --> 01:09:18,890 It was order u space. 
1300 01:09:18,890 --> 01:09:20,750 If we did it with all these pointers, 1301 01:09:20,750 --> 01:09:22,708 and we wanted a pointer machine data structure, 1302 01:09:22,708 --> 01:09:25,589 we needed u log w space. 1303 01:09:25,589 --> 01:09:28,130 But with this indirection trick, you can also get rid of the log w 1304 01:09:28,130 --> 01:09:30,160 factor in space. 1305 01:09:30,160 --> 01:09:31,340 It's a little less obvious. 1306 01:09:31,340 --> 01:09:33,020 But you take this-- 1307 01:09:33,020 --> 01:09:34,859 here, we reduced n by a factor of w. 1308 01:09:34,859 --> 01:09:37,420 You can also reduce u by a factor of w. 1309 01:09:37,420 --> 01:09:38,420 I'll just wave my hands. 1310 01:09:38,420 --> 01:09:39,500 That's possible. 1311 01:09:39,500 --> 01:09:41,510 So u gets a little bit smaller. 1312 01:09:41,510 --> 01:09:44,120 And so when we pay u log w space, 1313 01:09:44,120 --> 01:09:46,040 if u got smaller by a factor of w, 1314 01:09:46,040 --> 01:09:48,689 this basically disappears. 1315 01:09:48,689 --> 01:09:50,580 So you get, at most, order u space. 1316 01:09:53,210 --> 01:09:54,390 But order u is not order n. 1317 01:09:54,390 --> 01:09:56,550 I want order n space, darn it. 1318 01:09:56,550 --> 01:10:01,400 So let's reduce space. 1319 01:10:01,400 --> 01:10:04,015 As I said, this is going to get easier and easier. 1320 01:10:04,015 --> 01:10:06,390 By the end, we will have very little of a data structure. 1321 01:10:06,390 --> 01:10:10,150 But still, we'll have log log u. 1322 01:10:10,150 --> 01:10:13,860 And you thought this was easy, but wait, there's more. 1323 01:10:16,680 --> 01:10:19,410 Right now, we have two ways to get log log u-- 1324 01:10:19,410 --> 01:10:22,950 query and order u space. 1325 01:10:22,950 --> 01:10:25,170 There's the one I'm erasing, and there's 1326 01:10:25,170 --> 01:10:28,410 this-- take this tree structure with the very simple pointers. 1327 01:10:28,410 --> 01:10:29,867 Add indirection.
1328 01:10:29,867 --> 01:10:31,950 So admittedly, it's more complicated to implement. 1329 01:10:31,950 --> 01:10:33,526 But conceptually, it's super simple. 1330 01:10:33,526 --> 01:10:35,400 It's like, do this obvious tree binary search 1331 01:10:35,400 --> 01:10:36,930 on the level thing. 1332 01:10:36,930 --> 01:10:40,080 And then add indirection, and it fixes all your bounds, 1333 01:10:40,080 --> 01:10:41,670 magically. 1334 01:10:41,670 --> 01:10:43,600 So conceptually, very simple-- 1335 01:10:43,600 --> 01:10:47,850 practically, you definitely want to do this-- much simpler. 1336 01:10:47,850 --> 01:10:51,225 Now, what about saving space? 1337 01:10:54,420 --> 01:10:56,820 Very simple idea-- which, I think, 1338 01:10:56,820 --> 01:11:01,560 again, comes from Michael Bender and Martin Farach-Colton. 1339 01:11:01,560 --> 01:11:05,490 Don't store empty structures. 1340 01:11:05,490 --> 01:11:09,620 So in this picture, we had an array of all the clusters. 1341 01:11:09,620 --> 01:11:13,200 But a cluster could be entirely empty, like this one-- 1342 01:11:13,200 --> 01:11:15,470 this entirely empty cluster. 1343 01:11:15,470 --> 01:11:16,440 Don't store it. 1344 01:11:16,440 --> 01:11:18,110 It's a waste. 1345 01:11:18,110 --> 01:11:20,710 If you store them all, you're going to spend order u space. 1346 01:11:20,710 --> 01:11:22,130 If you don't store them all-- 1347 01:11:22,130 --> 01:11:23,780 just don't store the empty ones-- 1348 01:11:23,780 --> 01:11:25,210 I claim you get order n space. 1349 01:11:25,210 --> 01:11:27,910 Done. 1350 01:11:27,910 --> 01:11:30,350 So I'm going back to the structure I erased. 1351 01:11:30,350 --> 01:11:33,110 Ignore the tree perspective for a while. 1352 01:11:33,110 --> 01:11:39,440 Don't store empty clusters. 1353 01:11:39,440 --> 01:11:41,960 OK, now, this sounds easy. 1354 01:11:41,960 --> 01:11:44,090 But in reality, it's a little bit more annoying.
1355 01:11:44,090 --> 01:11:47,640 Because we wanted to have an array of clusters. 1356 01:11:47,640 --> 01:11:51,260 So we could quickly find the cluster. 1357 01:11:51,260 --> 01:11:52,810 If you store an array, you're going 1358 01:11:52,810 --> 01:11:54,650 to spend at least square root of u space. 1359 01:11:54,650 --> 01:11:56,720 Because at the very beginning, you say, 1360 01:11:56,720 --> 01:11:58,040 here are my root u clusters. 1361 01:11:58,040 --> 01:11:59,748 Now, some of them might be null pointers. 1362 01:11:59,748 --> 01:12:03,980 But I can't afford to store that entire array of clusters. 1363 01:12:03,980 --> 01:12:05,410 So don't use an array. 1364 01:12:05,410 --> 01:12:06,980 Use a perfect hash table. 1365 01:12:10,730 --> 01:12:13,990 So v dot cluster, instead of being an array, 1366 01:12:13,990 --> 01:12:18,650 is now, let's say, a dynamic perfect hashing. 1367 01:12:18,650 --> 01:12:21,250 And I'm going to use the version which I did not present. 1368 01:12:21,250 --> 01:12:23,750 The version I presented, which used universal hashing, 1369 01:12:23,750 --> 01:12:26,270 was order 1 expected. 1370 01:12:26,270 --> 01:12:30,410 But I said that it can be constant with high probability 1371 01:12:30,410 --> 01:12:31,040 per operation. 1372 01:12:31,040 --> 01:12:34,070 It's a little bit stronger. 1373 01:12:34,070 --> 01:12:35,540 So now, everything's fine. 1374 01:12:35,540 --> 01:12:37,880 If I do an index v dot cluster c, 1375 01:12:37,880 --> 01:12:41,120 that's still constant time, with high probability now. 1376 01:12:41,120 --> 01:12:45,940 And I claim this structure is now order n space. 1377 01:12:45,940 --> 01:12:47,540 Why is it order n space? 1378 01:12:47,540 --> 01:12:56,510 By simple amortization-- charge each table entry in that 1379 01:12:56,510 --> 01:13:01,250 hash table to the min of the cluster. 1380 01:13:06,924 --> 01:13:08,340 We're only storing non-empty ones. 
1381 01:13:08,340 --> 01:13:11,840 So if one of these guys exists in the hash table-- 1382 01:13:11,840 --> 01:13:13,670 we had to store a pointer to it-- then 1383 01:13:13,670 --> 01:13:16,100 that means the summary structure is non-zero. 1384 01:13:16,100 --> 01:13:17,990 It means this guy is not empty. 1385 01:13:17,990 --> 01:13:19,810 So it has an item in its min. 1386 01:13:19,810 --> 01:13:22,820 Charge the space up here to store the pointer to that min 1387 01:13:22,820 --> 01:13:23,810 guy. 1388 01:13:23,810 --> 01:13:27,200 Then each item-- each min item-- 1389 01:13:27,200 --> 01:13:28,880 only gets charged once. 1390 01:13:28,880 --> 01:13:32,380 Because it only has one parent that has a pointer to it. 1391 01:13:32,380 --> 01:13:33,980 So you only charge once. 1392 01:13:33,980 --> 01:13:38,290 And therefore-- charge each table entry-- 1393 01:13:43,010 --> 01:13:45,050 only charge each element once. 1394 01:13:49,800 --> 01:13:50,910 And that's all your space. 1395 01:13:50,910 --> 01:13:53,310 So it's order n space. 1396 01:13:53,310 --> 01:13:56,250 Done. 1397 01:13:56,250 --> 01:13:57,050 Kind of crazy. 1398 01:13:57,050 --> 01:14:00,180 I guess, if you want, there's also the pointer to the summary 1399 01:14:00,180 --> 01:14:00,680 structure. 1400 01:14:00,680 --> 01:14:02,290 You could charge that to your own min. 1401 01:14:02,290 --> 01:14:03,581 And then you're charging twice. 1402 01:14:03,581 --> 01:14:07,290 But it's constant per item. 1403 01:14:07,290 --> 01:14:08,517 So this is kind of funny. 1404 01:14:08,517 --> 01:14:10,850 Again, it doesn't appear in print anywhere, except maybe 1405 01:14:10,850 --> 01:14:12,920 as an exercise in CLRS now. 1406 01:14:12,920 --> 01:14:16,760 But you get linear order n space, 1407 01:14:16,760 --> 01:14:18,890 just by adding hashing in the obvious way. 
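[The trick just described-- replace the cluster array with a hash table and only ever create non-empty clusters-- can be sketched in code. This is a minimal illustration, not the lecture's full van Emde Boas structure (no summary structure, no predecessor query): Python's built-in dict stands in for the dynamic perfect hash table, and the Node class, ubits field, and member method are illustrative names, not from the lecture.]

```python
# Sketch: a recursive cluster node whose sub-clusters live in a dict
# (standing in for the dynamic perfect hash table), so empty clusters
# are never allocated -- this is what gives order n total space.

class Node:
    def __init__(self, ubits):
        self.ubits = ubits      # keys are ubits-bit integers
        self.min = None         # min lives here; it is NOT stored recursively
        self.max = None
        self.clusters = {}      # hash table: only non-empty clusters appear

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x
            return
        if x < self.min:
            self.min, x = x, self.min   # new min displaces old min downward
        if x > self.max:
            self.max = x
        if self.ubits > 1:
            half = self.ubits // 2
            hi, lo = x >> half, x & ((1 << half) - 1)
            if hi not in self.clusters:         # create lazily, never up front
                self.clusters[hi] = Node(half)
            self.clusters[hi].insert(lo)

    def member(self, x):
        if x == self.min or x == self.max:
            return True
        if self.ubits <= 1:
            return False
        half = self.ubits // 2
        hi, lo = x >> half, x & ((1 << half) - 1)
        return hi in self.clusters and self.clusters[hi].member(lo)
```

[The amortization from the board argument is visible here: each dict entry is paid for by the min of the cluster it points to, so total space is proportional to the number of inserted elements.]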
1408 01:14:18,890 --> 01:14:23,180 Now, for whatever reason, Willard didn't see this, 1409 01:14:23,180 --> 01:14:25,160 or wanted to do his own thing, and so he 1410 01:14:25,160 --> 01:14:30,409 found another way to do order n space log log u query 1411 01:14:30,409 --> 01:14:30,950 with hashing. 1412 01:14:34,000 --> 01:14:35,570 Well, I guess, also, you had to think 1413 01:14:35,570 --> 01:14:36,780 of it in this simple form. 1414 01:14:36,780 --> 01:14:38,510 It's harder to do this in the tree. 1415 01:14:38,510 --> 01:14:39,800 It can be done, I think. 1416 01:14:39,800 --> 01:14:42,530 But this is a simpler view than the tree, I think. 1417 01:14:42,530 --> 01:14:45,320 And then boom-- order n space. 1418 01:14:45,320 --> 01:14:48,140 But it turns out there's another way to do it. 1419 01:14:48,140 --> 01:14:49,820 This is a completely different way 1420 01:14:49,820 --> 01:14:52,940 to do Van Emde Boas-- actually, not that completely different. 1421 01:14:52,940 --> 01:14:58,613 It's another way to do this with hashing. 1422 01:15:02,690 --> 01:15:06,140 And we're going to start with what's called x-fast trees, 1423 01:15:06,140 --> 01:15:08,510 and then we will modify it to get y-fast trees. 1424 01:15:08,510 --> 01:15:11,882 That's Willard's terminology. 1425 01:15:11,882 --> 01:15:15,560 OK, so x-fast trees is, store this tree, 1426 01:15:15,560 --> 01:15:18,060 but don't store the zeros. 1427 01:15:18,060 --> 01:15:21,590 So don't store zeros. 1428 01:15:21,590 --> 01:15:27,530 Only store the ones in the-- we call this the simple tree view. 1429 01:15:27,530 --> 01:15:29,030 This is why I, in particular, wanted 1430 01:15:29,030 --> 01:15:30,655 to tell you about the simple tree view, 1431 01:15:30,655 --> 01:15:33,320 because it is really what x-fast trees do. 1432 01:15:33,320 --> 01:15:35,330 So what do I mean by only store the ones? 1433 01:15:35,330 --> 01:15:41,280 Well, each of these ones has sort of a name. 
1434 01:15:41,280 --> 01:15:42,590 What is the name of this item? 1435 01:15:42,590 --> 01:15:43,580 Its name is one-- 1436 01:15:43,580 --> 01:15:46,227 or in other words, 0, 0, 0, 1. 1437 01:15:46,227 --> 01:15:47,810 Each of these nodes, you can think of, 1438 01:15:47,810 --> 01:15:49,580 what is the path to get here? 1439 01:15:49,580 --> 01:15:52,730 Like, the path to get to this one is 1, 0, 0. 1440 01:15:52,730 --> 01:15:53,450 1 means right. 1441 01:15:53,450 --> 01:15:54,800 0 means left. 1442 01:15:54,800 --> 01:15:56,960 Those names give you the binary indicator 1443 01:15:56,960 --> 01:16:00,860 of where that node is in the tree, in some sense. 1444 01:16:00,860 --> 01:16:13,316 So store the ones as binary strings in a hash table-- 1445 01:16:17,260 --> 01:16:19,270 again, a dynamic perfect hash table. 1446 01:16:19,270 --> 01:16:22,150 Let's say I can get constant with high probability. 1447 01:16:22,150 --> 01:16:23,860 OK. 1448 01:16:23,860 --> 01:16:26,280 And if you're a little concerned-- 1449 01:16:26,280 --> 01:16:29,290 so what this means-- the ones are exactly 1450 01:16:29,290 --> 01:16:32,050 the prefixes of the paths to each of the items. 1451 01:16:32,050 --> 01:16:33,280 This was item one. 1452 01:16:33,280 --> 01:16:37,420 And so I want to store this one, which is empty string, 1453 01:16:37,420 --> 01:16:40,120 this one, which is 0, this one, which is 00, this one, 1454 01:16:40,120 --> 01:16:44,060 which is 000, this one, which is 0001. 1455 01:16:44,060 --> 01:16:49,600 So I take 0001, which is the item I want to store. 1456 01:16:49,600 --> 01:16:51,970 And there's all these prefixes, which 1457 01:16:51,970 --> 01:16:54,310 are the items I want to store. 1458 01:16:54,310 --> 01:16:56,290 And for this really to make sense, 1459 01:16:56,290 --> 01:16:58,180 you also need the length of the string. 1460 01:16:58,180 --> 01:17:01,810 Strings of different lengths should be in different worlds. 
1461 01:17:01,810 --> 01:17:03,790 So the way, actually, x-fast trees originally 1462 01:17:03,790 --> 01:17:05,500 did it in the paper is, have a different hash 1463 01:17:05,500 --> 01:17:07,124 table for strings of different lengths. 1464 01:17:07,124 --> 01:17:09,310 So that's probably an easier way to think about it. 1465 01:17:09,310 --> 01:17:11,770 You store all the items themselves in a hash table. 1466 01:17:11,770 --> 01:17:13,330 You store all the prefixes of all 1467 01:17:13,330 --> 01:17:16,240 but the last bit in a separate hash table, 1468 01:17:16,240 --> 01:17:20,120 all but the last two bits in a separate hash table, and so on. 1469 01:17:20,120 --> 01:17:22,560 Now, what does this let you do? 1470 01:17:22,560 --> 01:17:24,220 It lets you do this-- 1471 01:17:24,220 --> 01:17:28,162 binary search for the 0 to 1 transition. 1472 01:17:30,760 --> 01:17:32,440 What we did here was-- 1473 01:17:32,440 --> 01:17:35,470 I look at the bit, is it 0 or 1? 1474 01:17:35,470 --> 01:17:38,200 Instead of doing that, you do a query into the hash table, 1475 01:17:38,200 --> 01:17:39,760 and say, is it in the hash table? 1476 01:17:39,760 --> 01:17:42,900 It's in the hash table if and only if it is one. 1477 01:17:42,900 --> 01:17:44,980 So looking at a bit in this conceptual tree 1478 01:17:44,980 --> 01:17:47,350 is the same thing as checking for containment 1479 01:17:47,350 --> 01:17:48,760 in this hash table. 1480 01:17:48,760 --> 01:17:52,630 But now, we don't have to store the zeros, which is cool. 1481 01:17:55,930 --> 01:18:04,600 We can now do search, predecessor or successor, fast, 1482 01:18:04,600 --> 01:18:11,650 in log w time, via this old thing. 1483 01:18:11,650 --> 01:18:15,080 Again, you have to have min and max pointers, as well. 1484 01:18:15,080 --> 01:18:17,039 So in this hash table, you store the min 1485 01:18:17,039 --> 01:18:18,205 and the max of your subtree. 
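[The binary search for the 0-to-1 transition can be sketched as follows. As in Willard's paper, there is one hash table per prefix length (Python sets stand in for the dynamic perfect hash tables); the word size W, function names, and the choice to return just the length of the deepest 1-node are illustrative assumptions.]

```python
# Sketch of the x-fast tree query idea: store every prefix of every
# key's bit string, one set per prefix length, then binary-search over
# prefix LENGTHS -- log w hash-table probes instead of w bit lookups.

W = 8  # word size in bits (an assumption for the example)

def build_prefix_tables(keys):
    # tables[length] holds the length-bit prefixes of all stored keys,
    # i.e. exactly the 1-nodes at that depth of the conceptual tree.
    tables = [set() for _ in range(W + 1)]
    for x in keys:
        for length in range(W + 1):
            tables[length].add(x >> (W - length))
    return tables

def longest_prefix_len(tables, x):
    # Invariant: the length-lo prefix of x is present (tables[0] always
    # holds the empty prefix); lengths above hi are known absent.
    lo, hi = 0, W
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if (x >> (W - mid)) in tables[mid]:
            lo = mid          # prefix is a 1-node: transition is deeper
        else:
            hi = mid - 1      # prefix is a 0-node: transition is shallower
    return lo
```

[In the real structure, the deepest 1-node found this way carries the min/max pointers used to finish the predecessor or successor query. Note also that each key contributes W + 1 prefixes, which is exactly the n w space overhead mentioned next.]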
1486 01:18:21,030 --> 01:18:22,960 Or actually, from a 1, you actually 1487 01:18:22,960 --> 01:18:25,270 need the max of the left subtree, 1488 01:18:25,270 --> 01:18:27,170 and you need the min of the right subtree. 1489 01:18:27,170 --> 01:18:30,190 But it's a constant amount of information per thing. 1490 01:18:30,190 --> 01:18:36,190 This is not perfect, however, in that it uses nw space. 1491 01:18:38,920 --> 01:18:40,840 And also, updates are slow. 1492 01:18:40,840 --> 01:18:42,775 It's order w updates. 1493 01:19:01,540 --> 01:19:02,650 But we're almost there. 1494 01:19:02,650 --> 01:19:06,520 Because we have fast queries, slow updates, not 1495 01:19:06,520 --> 01:19:07,990 optimal space. 1496 01:19:07,990 --> 01:19:09,340 Take this. 1497 01:19:09,340 --> 01:19:11,955 Add indirection-- done. 1498 01:19:11,955 --> 01:19:12,955 And that's y-fast trees. 1499 01:19:17,950 --> 01:19:20,660 y-fast trees-- you take x-fast trees, 1500 01:19:20,660 --> 01:19:25,470 you add this indirection right here, 1501 01:19:25,470 --> 01:19:33,266 and you get log w per operation, order n space. 1502 01:19:33,266 --> 01:19:34,950 Of course, this is with high probability 1503 01:19:34,950 --> 01:19:37,740 because we're using hashing. 1504 01:19:37,740 --> 01:19:40,290 Because we have a factor w bad here, 1505 01:19:40,290 --> 01:19:41,400 we have factor w bad here. 1506 01:19:41,400 --> 01:19:42,390 You divide by w. 1507 01:19:42,390 --> 01:19:43,770 You're done. 1508 01:19:43,770 --> 01:19:48,360 Up here, you have n over w space. n over w times w is n. 1509 01:19:48,360 --> 01:19:51,360 Queries, just like before, remain log w. 1510 01:19:51,360 --> 01:19:53,040 But now-- boom-- 1511 01:19:53,040 --> 01:19:56,160 updates, we pay log w because of the binary search 1512 01:19:56,160 --> 01:19:59,035 trees at the bottom, but pretty cool. 1513 01:19:59,035 --> 01:20:00,840 Isn't that neat? 1514 01:20:00,840 --> 01:20:02,190 I've never seen this before. 
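[A static sketch of the y-fast indirection just described: split the sorted keys into groups of about w, keep each group in its own small balanced structure, and put only one representative per group into the expensive x-fast layer. Here a sorted list searched with bisect stands in for both the x-fast layer and the bottom balanced search trees; W, build, and predecessor are illustrative names, and the real structure additionally splits and merges groups under updates to keep them near size w.]

```python
# Sketch of y-fast indirection: only n/w representatives enter the
# x-fast layer, so its n*w-per-key space cost collapses to order n.

from bisect import bisect_right

W = 8  # word size in bits (an assumption for the example)

def build(keys):
    keys = sorted(keys)
    groups = [keys[i:i + W] for i in range(0, len(keys), W)]
    reps = [g[0] for g in groups]   # one representative per group
    return reps, groups

def predecessor(reps, groups, x):
    # Step 1: representative search -- log w time in the real x-fast layer.
    i = bisect_right(reps, x) - 1
    if i < 0:
        return None                 # no key <= x anywhere
    # Step 2: search inside one group of size about w -- another log w.
    g = groups[i]
    j = bisect_right(g, x) - 1
    return g[j]                     # g[0] = reps[i] <= x, so j >= 0
```

[The two log w searches are why the overall bound is log w per operation, and the n/w representatives times w cost per representative is the order n space calculation from the board.]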
1515 01:20:02,190 --> 01:20:04,690 OK, I've seen x-fast trees and y-fast trees. 1516 01:20:04,690 --> 01:20:06,690 But it's really just the same-- 1517 01:20:06,690 --> 01:20:09,990 we're taking Van Emde Boas, looking at it in the tree view. 1518 01:20:09,990 --> 01:20:11,699 You can see where Willard got this stuff. 1519 01:20:11,699 --> 01:20:14,073 It's like, oh, man, I really want to store all these bits, 1520 01:20:14,073 --> 01:20:15,300 but hey, it's way too big. 1521 01:20:15,300 --> 01:20:17,280 Just don't store the zeros. 1522 01:20:17,280 --> 01:20:19,260 That means we should use a hash table. 1523 01:20:19,260 --> 01:20:22,770 Ah, hash table just gives you whether the bit is in or out. 1524 01:20:22,770 --> 01:20:24,480 Great. 1525 01:20:24,480 --> 01:20:25,749 Now use indirection. 1526 01:20:25,749 --> 01:20:27,540 And indirection was already floating around 1527 01:20:27,540 --> 01:20:29,220 as a concept at the time-- 1528 01:20:29,220 --> 01:20:30,560 slightly different parameters. 1529 01:20:30,560 --> 01:20:32,970 Van Emde Boas had his own indirection 1530 01:20:32,970 --> 01:20:38,187 to reduce the space from u times log w to u. 1531 01:20:38,187 --> 01:20:40,020 But Willard did it, and-- boom-- it got down 1532 01:20:40,020 --> 01:20:43,140 to n space in this way. 1533 01:20:43,140 --> 01:20:45,700 But as you saw, you can also do it directly to Van Emde Boas. 1534 01:20:45,700 --> 01:20:47,304 All these ideas can be interchanged. 1535 01:20:47,304 --> 01:20:48,720 You can combine any data structure 1536 01:20:48,720 --> 01:20:50,220 you want with any space saving trick 1537 01:20:50,220 --> 01:20:51,810 you want, with indirection, if you 1538 01:20:51,810 --> 01:20:55,290 need to, to speed things up and reduce space a little bit. 1539 01:20:55,290 --> 01:20:57,190 So there's many, many ways to do this. 1540 01:20:57,190 --> 01:20:59,230 But in the end, you get log w per operation, 1541 01:20:59,230 --> 01:21:00,750 and order n space. 
1542 01:21:00,750 --> 01:21:02,070 And that's sort of result one. 1543 01:21:02,070 --> 01:21:04,530 And it's probably the most useful predecessor data 1544 01:21:04,530 --> 01:21:05,487 structure, in general. 1545 01:21:05,487 --> 01:21:07,320 But next time, we'll see fusion trees, which 1546 01:21:07,320 --> 01:21:10,730 are good for when w is huge.