1 00:00:00,090 --> 00:00:02,490 The following content is provided under a Creative 2 00:00:02,490 --> 00:00:04,030 Commons license. 3 00:00:04,030 --> 00:00:06,360 Your support will help MIT OpenCourseWare 4 00:00:06,360 --> 00:00:10,720 continue to offer high-quality educational resources for free. 5 00:00:10,720 --> 00:00:13,320 To make a donation or view additional materials 6 00:00:13,320 --> 00:00:17,280 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,280 --> 00:00:18,450 at ocw.mit.edu. 8 00:00:20,930 --> 00:00:21,930 ERIK DEMAINE: All right. 9 00:00:21,930 --> 00:00:24,740 Today we start a new section of data structures. 10 00:00:24,740 --> 00:00:26,500 This is going to be two lectures long, 11 00:00:26,500 --> 00:00:30,310 so this week succinct data structures, where the goal is 12 00:00:30,310 --> 00:00:32,200 to get really small space. 13 00:00:37,810 --> 00:00:43,750 And first thing to do is to define what "small" means. 14 00:00:43,750 --> 00:00:46,270 Most succinct data structures are also static, 15 00:00:46,270 --> 00:00:48,150 although there are a few that are dynamic. 16 00:00:48,150 --> 00:00:53,500 We'll be focusing here on static data structures. 17 00:00:57,400 --> 00:00:59,380 And so in general, the name of the game 18 00:00:59,380 --> 00:01:01,510 is taking a data structure that you're 19 00:01:01,510 --> 00:01:04,239 familiar with-- we're going to talk about essentially two 20 00:01:04,239 --> 00:01:05,150 today. 21 00:01:05,150 --> 00:01:08,320 One is binary tries, which has the killer 22 00:01:08,320 --> 00:01:12,280 application of doing binary suffix trees in particular. 23 00:01:12,280 --> 00:01:14,230 So it could be a compressed trie, whatever. 24 00:01:14,230 --> 00:01:16,174 But we'll assume here the alphabet is binary. 25 00:01:16,174 --> 00:01:17,590 A lot of this has been generalized 26 00:01:17,590 --> 00:01:18,424 to larger alphabets. 27 00:01:18,424 --> 00:01:20,006 I'll tell you a little bit about that. 
28 00:01:20,006 --> 00:01:21,430 But to keep it simple in lecture, 29 00:01:21,430 --> 00:01:23,349 I'm going to stick to a binary alphabet. 30 00:01:23,349 --> 00:01:25,390 And another data structure we're going to look at 31 00:01:25,390 --> 00:01:29,610 is a bit vector, so n bits in a row. 32 00:01:29,610 --> 00:01:32,260 And you want to do interesting operations, 33 00:01:32,260 --> 00:01:36,760 like find the i-th 1 bit in constant time. 34 00:01:36,760 --> 00:01:38,260 These are things that are easy to do 35 00:01:38,260 --> 00:01:43,810 in linear space, where linear space means order n words. 36 00:01:43,810 --> 00:01:45,310 Like if you want to implement a trie, 37 00:01:45,310 --> 00:01:47,380 you have a pointer at every node, two pointers 38 00:01:47,380 --> 00:01:50,050 at every node, that's easy. 39 00:01:50,050 --> 00:01:54,160 Bit vectors you could store an array of where all the 1s are. 40 00:01:54,160 --> 00:01:56,830 And so it's easy to do linear space, 41 00:01:56,830 --> 00:01:59,450 but linear space is not optimal. 42 00:01:59,450 --> 00:02:01,870 And there are three senses of small space 43 00:02:01,870 --> 00:02:04,540 that we'll be going for. 44 00:02:04,540 --> 00:02:08,330 The best version is implicit. 45 00:02:08,330 --> 00:02:12,280 An implicit data structure means you use the very optimum number 46 00:02:12,280 --> 00:02:14,245 of bits, plus a constant. 47 00:02:17,440 --> 00:02:20,440 But here, we're focusing on bits, not words. 48 00:02:20,440 --> 00:02:23,830 And if you translate a linear word data structure, 49 00:02:23,830 --> 00:02:28,480 n words is order n times w bits, 50 00:02:28,480 --> 00:02:30,310 to store something like an n-bit vector, 51 00:02:30,310 --> 00:02:36,154 you should really only use order n bits, not order nw bits. 52 00:02:36,154 --> 00:02:37,570 Now, what would be ideal is if you 53 00:02:37,570 --> 00:02:40,210 used n bits plus a constant. 
54 00:02:40,210 --> 00:02:42,280 The plus a constant is essentially 55 00:02:42,280 --> 00:02:45,044 because sometimes the optimum number is not an integer. 56 00:02:45,044 --> 00:02:47,210 And it's hard to store a non-integer number of bits, 57 00:02:47,210 --> 00:02:48,950 so you add a constant. 58 00:02:48,950 --> 00:02:52,270 I mean, I wouldn't mind too much if you added order log n bits 59 00:02:52,270 --> 00:02:53,170 or something. 60 00:02:53,170 --> 00:02:56,620 But typically, the goal here is constant. 61 00:02:56,620 --> 00:02:58,720 Sometimes you can really get zero. 62 00:02:58,720 --> 00:03:06,460 But the next best thing would be a succinct data structure, 63 00:03:06,460 --> 00:03:14,570 where the goal is to get opt plus little o of opt bits. 64 00:03:14,570 --> 00:03:20,260 So the key here is to get a constant factor of 1 in front. 65 00:03:20,260 --> 00:03:27,100 And then the worst thing in this regime is still pretty good. 66 00:03:27,100 --> 00:03:28,765 It's order opt bits. 67 00:03:32,300 --> 00:03:35,865 So for example, if you're storing an n-bit vector, 68 00:03:35,865 --> 00:03:38,740 if you use order n bits, that's still better 69 00:03:38,740 --> 00:03:39,940 than order nw bits. 70 00:03:39,940 --> 00:03:44,350 Usually, compact is a savings of at least a factor 71 00:03:44,350 --> 00:03:47,140 of w over what you'd normally call a linear space data 72 00:03:47,140 --> 00:03:47,690 structure. 73 00:03:47,690 --> 00:03:49,874 So this is the big savings, factor w. 74 00:03:49,874 --> 00:03:51,040 But there's a constant here. 75 00:03:51,040 --> 00:03:54,460 Sometimes you like to get rid of that constant, make it 1. 76 00:03:54,460 --> 00:03:56,770 And it's not so bad if you have, say, 77 00:03:56,770 --> 00:04:00,650 another square root of n or something little o 78 00:04:00,650 --> 00:04:04,205 of n bits of extra space. 
79 00:04:04,205 --> 00:04:06,580 But of course, the ideal would be to have no extra space, 80 00:04:06,580 --> 00:04:08,560 and that's implicit. 81 00:04:08,560 --> 00:04:09,830 Now, it's a little confusing. 82 00:04:09,830 --> 00:04:12,950 I usually call this area succinct data structures 83 00:04:12,950 --> 00:04:14,140 to mean all three. 84 00:04:14,140 --> 00:04:17,140 But the middle one is called succinct. 85 00:04:17,140 --> 00:04:19,329 That's typically the goal is to go for succinct, 86 00:04:19,329 --> 00:04:22,270 because implicit is very hard, and compact 87 00:04:22,270 --> 00:04:25,060 is kind of like a warm-up towards succinct. 88 00:04:25,060 --> 00:04:30,100 So this is the usual goal in the middle. 89 00:04:30,100 --> 00:04:32,140 And we're going to do this for binary tries, 90 00:04:32,140 --> 00:04:33,730 and rank and select. 91 00:04:33,730 --> 00:04:37,690 So let me tell you a little bit about what's known. 92 00:04:37,690 --> 00:04:38,710 Oh, sorry. 93 00:04:38,710 --> 00:04:40,460 One quick example. 94 00:04:40,460 --> 00:04:42,100 You've seen implicit data structures. 95 00:04:42,100 --> 00:04:46,130 You may have even heard this term in the context of heaps. 96 00:04:46,130 --> 00:04:49,010 Binary heaps are an example of a dynamic implicit data 97 00:04:49,010 --> 00:04:49,510 structure. 98 00:04:49,510 --> 00:04:52,450 They achieve this bound. 99 00:04:52,450 --> 00:04:57,130 They have no extra space in the appropriate model. 100 00:04:57,130 --> 00:04:59,530 And another one would be a sorted array. 101 00:04:59,530 --> 00:05:01,105 Sorted array supports binary search. 102 00:05:01,105 --> 00:05:04,090 You can't really update it, so it's a static search structure. 103 00:05:04,090 --> 00:05:06,040 It achieves the optimal number of bits. 104 00:05:06,040 --> 00:05:07,310 You're just storing the data. 
105 00:05:07,310 --> 00:05:10,902 And usually, implicit data structures just store the data, 106 00:05:10,902 --> 00:05:13,360 though sometimes they reorder the data in interesting ways, 107 00:05:13,360 --> 00:05:14,230 like sorting. 108 00:05:14,230 --> 00:05:20,870 A sorted array is one way to reorder your data items. 109 00:05:20,870 --> 00:05:23,490 So here's a short survey. 110 00:05:23,490 --> 00:05:26,010 This is definitely not exhaustive, 111 00:05:26,010 --> 00:05:30,300 but it covers a bunch of the main results. 112 00:05:33,750 --> 00:05:38,035 So one place where this area really got started-- 113 00:05:38,035 --> 00:05:39,660 well, there are a few places, actually. 114 00:05:44,910 --> 00:05:49,410 My academic father, one of my PhD advisors, 115 00:05:49,410 --> 00:05:53,190 Ian Munro is sort of one of the fathers of this field. 116 00:05:53,190 --> 00:05:55,850 And he started looking at succinct data structures 117 00:05:55,850 --> 00:05:57,150 at the very early days. 118 00:05:57,150 --> 00:06:00,240 And one of the problems he worked on 119 00:06:00,240 --> 00:06:02,642 was dynamic search trees. 120 00:06:02,642 --> 00:06:04,350 So if you want to do static search trees, 121 00:06:04,350 --> 00:06:08,160 you can just store the items in a sorted array, easy log n 122 00:06:08,160 --> 00:06:11,350 search, no extra space. 123 00:06:11,350 --> 00:06:13,620 What if you want to do inserts and deletes also 124 00:06:13,620 --> 00:06:16,930 in log n time per operation, just like a regular search 125 00:06:16,930 --> 00:06:20,390 tree, but you want to do it implicitly? 126 00:06:20,390 --> 00:06:21,330 Now, this is tricky. 127 00:06:21,330 --> 00:06:23,160 And there's an old result that would 128 00:06:23,160 --> 00:06:27,230 let you get log squared per update and query, which 129 00:06:27,230 --> 00:06:29,130 essentially encoded-- 130 00:06:29,130 --> 00:06:31,740 you can't afford pointers at all here. 
131 00:06:31,740 --> 00:06:33,450 But the idea was to encode the pointers 132 00:06:33,450 --> 00:06:36,890 by permuting enough items. 133 00:06:36,890 --> 00:06:39,450 If you take, say, log n items, then the permutations 134 00:06:39,450 --> 00:06:42,600 among them is roughly log n, log log n bits. 135 00:06:42,600 --> 00:06:47,270 And so you can encode bits by just permuting pairs of items. 136 00:06:47,270 --> 00:06:50,160 And so you could read a pointer in like log n operations, 137 00:06:50,160 --> 00:06:52,200 you end up with log squared. 138 00:06:52,200 --> 00:06:55,530 And then Ian Munro got down to a little bit less 139 00:06:55,530 --> 00:06:57,200 than log squared. 140 00:06:57,200 --> 00:06:59,940 And then there's a series of improvements 141 00:06:59,940 --> 00:07:02,160 over the last several years. 142 00:07:02,160 --> 00:07:12,800 And the final result is log n worst case insert, delete, 143 00:07:12,800 --> 00:07:15,690 and predecessor. 144 00:07:15,690 --> 00:07:19,932 And this is by Franceschini and Grossi. 145 00:07:19,932 --> 00:07:21,515 And furthermore, it's cache oblivious, 146 00:07:21,515 --> 00:07:27,780 so you can get log base b of n cache oblivious. 147 00:07:27,780 --> 00:07:30,870 So this has been pretty much completely solved. 148 00:07:30,870 --> 00:07:33,060 Implicitly, you can do all the good things 149 00:07:33,060 --> 00:07:35,757 we know how to do with search trees. 150 00:07:35,757 --> 00:07:38,340 Now, this is not trying to solve the predecessor problem using 151 00:07:38,340 --> 00:07:40,050 Van Emde Boas and such tricks. 152 00:07:40,050 --> 00:07:42,320 That's, I believe, open. 153 00:07:42,320 --> 00:07:47,750 But for a basic log n performance, it's solved. 154 00:07:47,750 --> 00:07:52,010 Before this got solved, another important problem 155 00:07:52,010 --> 00:07:54,770 is essentially the equivalent of hashing. 156 00:07:54,770 --> 00:07:57,095 So you want a succinct dictionary. 
157 00:07:57,095 --> 00:07:59,290 You want to be able to do-- 158 00:07:59,290 --> 00:08:01,940 now, this is going to be static, so there's 159 00:08:01,940 --> 00:08:04,100 no insert and delete. 160 00:08:04,100 --> 00:08:07,770 It's just, is this item in the dictionary? 161 00:08:07,770 --> 00:08:09,080 So I have a universe of size u. 162 00:08:09,080 --> 00:08:11,430 I have n items in the dictionary. 163 00:08:11,430 --> 00:08:14,677 So the first question is, what is the optimal number of bits? 164 00:08:14,677 --> 00:08:16,760 And this is actually usually very easy to compute. 165 00:08:16,760 --> 00:08:19,855 You just take, what are the set of possible structures 166 00:08:19,855 --> 00:08:21,230 you're trying to represent, which 167 00:08:21,230 --> 00:08:25,490 is n items out of a universe of size u? 168 00:08:25,490 --> 00:08:27,110 How many different ways are there 169 00:08:27,110 --> 00:08:29,720 to have n items in a universe of size u? 170 00:08:35,409 --> 00:08:36,249 Come on. 171 00:08:36,249 --> 00:08:37,720 It's easy combinatorics. 172 00:08:43,504 --> 00:08:46,396 Somebody? 173 00:08:46,396 --> 00:08:48,340 AUDIENCE: Log u of n. 174 00:08:48,340 --> 00:08:49,454 ERIK DEMAINE: Log u-- 175 00:08:49,454 --> 00:08:50,120 AUDIENCE: Sorry. 176 00:08:52,920 --> 00:08:54,410 That's how many bits you'll need. 177 00:08:54,410 --> 00:08:55,201 ERIK DEMAINE: Yeah. 178 00:08:55,201 --> 00:08:57,360 Log u choose n is the number of bits you'll need. 179 00:08:57,360 --> 00:09:00,157 The number of different possibilities is u choose n. 180 00:09:00,157 --> 00:09:02,490 You take log base 2 of that, that's how many bits you'll 181 00:09:02,490 --> 00:09:03,600 need to represent this. 182 00:09:03,600 --> 00:09:06,450 Now, this is not necessarily an integer, 183 00:09:06,450 --> 00:09:07,830 because it's got a log. 184 00:09:07,830 --> 00:09:10,804 That's why we would have plus order 185 00:09:10,804 --> 00:09:11,970 1 if we were doing implicit. 
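A small sketch of the counting argument above, for readers who want to see the numbers. It computes the information-theoretic minimum, log base 2 of u choose n rounded up to a whole bit, and compares it with the two simple encodings that come up in this discussion (n items of log u bits each, or a u-bit vector). The function names are illustrative assumptions, not anything from the lecture.

```python
import math


def dictionary_lower_bound_bits(u: int, n: int) -> int:
    """Whole bits needed to tell apart all (u choose n) possible dictionaries.

    (x - 1).bit_length() equals ceil(log2(x)) for every integer x >= 1,
    which avoids floating-point rounding on large binomials.
    """
    return (math.comb(u, n) - 1).bit_length()


def simple_encoding_bits(u: int, n: int) -> int:
    """Better of the two simple encodings: n * ceil(log2 u) bits, or a u-bit vector."""
    bits_per_item = (u - 1).bit_length()
    return min(n * bits_per_item, u)


if __name__ == "__main__":
    for u, n in [(16, 4), (16, 8), (1 << 20, 1000)]:
        print(u, n, dictionary_lower_bound_bits(u, n), simple_encoding_bits(u, n))
```

The lower bound is never larger than either simple encoding, which is the gap a succinct dictionary tries to close.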
186 00:09:11,970 --> 00:09:13,620 It's not known how to do implicit. 187 00:09:13,620 --> 00:09:15,690 It's known how to do succinct, so this 188 00:09:15,690 --> 00:09:20,880 is going to be plus little o of that thing. 189 00:09:20,880 --> 00:09:26,149 And actually, I have the explicit bound. 190 00:09:26,149 --> 00:09:26,649 Eraser. 191 00:09:31,644 --> 00:09:33,060 I don't know how exciting this is, 192 00:09:33,060 --> 00:09:37,950 but you get to know how little o it is. 193 00:09:37,950 --> 00:09:46,800 Log log n squared over log n. 194 00:09:46,800 --> 00:09:49,710 So this is slightly smaller, so this is 195 00:09:49,710 --> 00:09:53,725 roughly something like u log n. 196 00:09:53,725 --> 00:09:54,680 It depends. 197 00:09:54,680 --> 00:09:57,480 If n is small, this is like u log n. 198 00:09:57,480 --> 00:10:01,080 If n is big, this is like-- 199 00:10:01,080 --> 00:10:03,480 sorry, n log u, I should say. 200 00:10:03,480 --> 00:10:06,060 You can always encode a dictionary using n log u bits. 201 00:10:06,060 --> 00:10:08,550 Just for every item, you specify the log u bits. 202 00:10:08,550 --> 00:10:11,100 But when n is big, close to u, then you 203 00:10:11,100 --> 00:10:15,150 can just use a bit vector and use u bits. 204 00:10:15,150 --> 00:10:19,320 And so this is little o of n, so it's always 205 00:10:19,320 --> 00:10:22,950 smaller than whatever you're encoding over here. 206 00:10:22,950 --> 00:10:26,280 It's only slightly smaller than n, log log n squared over log 207 00:10:26,280 --> 00:10:27,930 n, but it's a little o of 1. 208 00:10:30,650 --> 00:10:34,470 And that's I believe the best known. 209 00:10:34,470 --> 00:10:38,640 So this is Brodnik and Munro, and then improved by Pagh. 210 00:10:38,640 --> 00:10:42,220 And the point is you get constant time membership query. 
211 00:10:45,552 --> 00:10:47,010 In general, the name of the game is 212 00:10:47,010 --> 00:10:48,384 you want to do the queries you're 213 00:10:48,384 --> 00:10:50,790 used to doing in something like a dictionary, same amount 214 00:10:50,790 --> 00:10:55,100 of time, but with less space. 215 00:10:55,100 --> 00:10:56,140 OK. 216 00:10:56,140 --> 00:11:01,320 Next one is-- maybe I'll go over here-- 217 00:11:05,010 --> 00:11:12,790 binary trie, which is what we're going to work on today. 218 00:11:12,790 --> 00:11:15,670 There's various results on this, but sort 219 00:11:15,670 --> 00:11:18,690 of one of the main ones is by Munro and Raman. 220 00:11:18,690 --> 00:11:21,500 Again, now this is a little harder of a question. 221 00:11:21,500 --> 00:11:25,720 How many binary tries on n nodes are there? 222 00:11:25,720 --> 00:11:30,610 The answer, I will tell you, is the nth Catalan number, 223 00:11:30,610 --> 00:11:34,990 which is a quantity we've seen before. 224 00:11:34,990 --> 00:11:37,940 2n choose n over n plus 1. 225 00:11:37,940 --> 00:11:40,930 As we mentioned last time, this is roughly 4 to the n. 226 00:11:45,620 --> 00:11:47,559 This is kind of interesting. 227 00:11:47,559 --> 00:11:49,600 We saw the Catalan number in a different context, 228 00:11:49,600 --> 00:11:51,910 which is we were doing indirection and using 229 00:11:51,910 --> 00:11:58,940 lookup tables at the bottom on all rooted trees on n nodes. 230 00:11:58,940 --> 00:12:01,930 Number of rooted trees on n nodes was also Catalan number. 231 00:12:01,930 --> 00:12:04,780 This is a different concept, binary tries. 232 00:12:04,780 --> 00:12:06,640 If you've ever taken a combinatorics class, 233 00:12:06,640 --> 00:12:08,690 you see Catalan numbers all over the place. 234 00:12:08,690 --> 00:12:11,620 There's a zillion different things, all of them 235 00:12:11,620 --> 00:12:13,727 the number of them is Catalan number. 
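To make the "roughly 4 to the n" claim concrete, here is a minimal sketch that evaluates the Catalan formula just quoted, 2n choose n over n plus 1, and takes its base-2 logarithm. The function names are assumptions made for illustration. Since 4 to the n is 2 to the 2n, the bit count should sit a little below 2n.

```python
import math


def catalan(n: int) -> int:
    """n-th Catalan number, (2n choose n) / (n + 1); the division is always exact."""
    return math.comb(2 * n, n) // (n + 1)


def trie_lower_bound_bits(n: int) -> int:
    """Whole bits needed to distinguish all binary tries on n nodes: ceil(log2(catalan(n)))."""
    return (catalan(n) - 1).bit_length()


if __name__ == "__main__":
    for n in (3, 7, 100, 1000):
        print(n, trie_lower_bound_bits(n), 2 * n)
```

The shortfall below 2n grows only logarithmically in n, so it is absorbed by a little-o of n term.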
236 00:12:13,727 --> 00:12:16,060 And we will actually use that equivalence between binary 237 00:12:16,060 --> 00:12:19,760 tries and rooted trees at the end of today's lecture. 238 00:12:19,760 --> 00:12:23,010 So you'll see why they're the same number. 239 00:12:23,010 --> 00:12:23,510 OK. 240 00:12:23,510 --> 00:12:26,780 So we take log of that, that's 2n bits. 241 00:12:26,780 --> 00:12:31,960 So you need 2n bits to represent a binary trie. 242 00:12:31,960 --> 00:12:38,630 And indeed, you can achieve 2n plus little o of n bits. 243 00:12:38,630 --> 00:12:40,780 So that's a succinct data structure. 244 00:12:40,780 --> 00:12:45,760 And our goal will be to be able to do constant time 245 00:12:45,760 --> 00:12:50,320 traversal of the tree, so left child, right child, parent. 246 00:12:52,840 --> 00:12:55,420 And for fun, another operation we might want to do 247 00:12:55,420 --> 00:12:59,710 is compute the size of the current subtree. 248 00:12:59,710 --> 00:13:02,860 Again, think of a suffix tree. 249 00:13:02,860 --> 00:13:04,750 You start at the root and you want 250 00:13:04,750 --> 00:13:06,880 to be able to go along the left child 251 00:13:06,880 --> 00:13:09,940 every time you have a 0 bit in your query string. 252 00:13:09,940 --> 00:13:11,440 You want to go to the right child 253 00:13:11,440 --> 00:13:13,360 every time you have a 1 bit. 254 00:13:13,360 --> 00:13:15,700 Parent we don't really need, but why not? 255 00:13:15,700 --> 00:13:17,230 And then subtree size would tell us, 256 00:13:17,230 --> 00:13:18,970 how many matches are there below us? 257 00:13:18,970 --> 00:13:20,800 You could count either the number of nodes below you 258 00:13:20,800 --> 00:13:22,216 or the number of leaves below you. 259 00:13:22,216 --> 00:13:25,210 It's roughly the same for a compact trie. 260 00:13:25,210 --> 00:13:28,642 And so this lets you do substring searches. 
261 00:13:28,642 --> 00:13:30,850 And we'll actually talk more about that next lecture, 262 00:13:30,850 --> 00:13:33,140 how to actually do a full suffix tree. 263 00:13:33,140 --> 00:13:35,860 But this is a component of a binary suffix 264 00:13:35,860 --> 00:13:38,650 tree that has the same performance 265 00:13:38,650 --> 00:13:41,830 but uses optimal amount of space. 266 00:13:41,830 --> 00:13:47,590 And this is a big motivator originally for doing compact 267 00:13:47,590 --> 00:13:50,080 or succinct data structures. 268 00:13:50,080 --> 00:13:52,900 At University of Waterloo, they were doing this new OED 269 00:13:52,900 --> 00:13:55,180 project, where Oxford English Dictionary was 270 00:13:55,180 --> 00:13:57,820 trying to go online or digital. 271 00:13:57,820 --> 00:14:00,100 And this concept of having a CD-ROM that 272 00:14:00,100 --> 00:14:04,750 could have an entire dictionary on it was crazy. 273 00:14:04,750 --> 00:14:06,850 And CD-ROMs were really slow, so you 274 00:14:06,850 --> 00:14:10,330 don't want to just scan the entire CD to do a search. 275 00:14:10,330 --> 00:14:12,340 You want to be able to do a search for arbitrary 276 00:14:12,340 --> 00:14:12,910 substrings. 277 00:14:12,910 --> 00:14:14,410 That's what suffix trees let you do. 278 00:14:14,410 --> 00:14:16,470 But you really can't afford much space. 279 00:14:16,470 --> 00:14:19,990 And so that was the motivation for developing these data 280 00:14:19,990 --> 00:14:22,394 structures back in the day. 281 00:14:22,394 --> 00:14:23,560 Things are a lot easier now. 282 00:14:23,560 --> 00:14:24,535 Space is cheaper. 283 00:14:24,535 --> 00:14:27,160 But still, there's always going to be some giant thing that you 284 00:14:27,160 --> 00:14:30,640 need to store that if you want to store a data structure, 285 00:14:30,640 --> 00:14:34,060 you really can't afford much space. 286 00:14:34,060 --> 00:14:34,780 Cool. 287 00:14:34,780 --> 00:14:37,480 So that's static. 
288 00:14:37,480 --> 00:14:39,540 There is a dynamic version. 289 00:14:39,540 --> 00:14:42,340 It's a more recent result from just a few years ago. 290 00:14:42,340 --> 00:14:49,240 You can do constant time insert and delete of a leaf. 291 00:14:49,240 --> 00:14:52,210 And you can do a subdivision of an edge. 292 00:14:56,860 --> 00:14:59,140 These are operations we saw for dynamic LCA. 293 00:15:02,010 --> 00:15:04,150 So same operations as dynamic LCA, 294 00:15:04,150 --> 00:15:05,890 you can do these in constant time 295 00:15:05,890 --> 00:15:08,740 and still maintain the succinct binary 296 00:15:08,740 --> 00:15:10,240 trie representation, where you can do 297 00:15:10,240 --> 00:15:11,410 all these traversal operations. 298 00:15:11,410 --> 00:15:12,430 So we won't cover that. 299 00:15:12,430 --> 00:15:16,360 We will cover the static version today. 300 00:15:16,360 --> 00:15:18,640 Then that's for binary alphabet. 301 00:15:18,640 --> 00:15:22,720 Let me tell you what's known about larger alphabet. 302 00:15:22,720 --> 00:15:24,820 This is a problem I worked on a while ago, 303 00:15:24,820 --> 00:15:28,690 though our result is no longer the best. 304 00:15:28,690 --> 00:15:32,350 So here, it's a little complicated. 305 00:15:32,350 --> 00:15:36,720 But the number of tries, number of k-ary tries, 306 00:15:36,720 --> 00:15:39,760 so k is now the size of the alphabet, 307 00:15:39,760 --> 00:15:47,400 is kn plus 1 choose n over kn plus 1. 308 00:15:52,120 --> 00:15:54,730 And so it's succinct, meaning we can 309 00:15:54,730 --> 00:15:59,790 achieve log of that plus little o of that bits. 310 00:15:59,790 --> 00:16:01,590 And the queries we can achieve are 311 00:16:01,590 --> 00:16:11,670 constant time, child with label i, 312 00:16:11,670 --> 00:16:13,875 and parent, and subtree size. 313 00:16:21,430 --> 00:16:24,680 So we can do all the things we were able to do before. 
314 00:16:24,680 --> 00:16:26,620 The analog of left child and right child 315 00:16:26,620 --> 00:16:29,770 now is I have a character in my pattern, p, 316 00:16:29,770 --> 00:16:32,120 and I want to know which child has that label. 317 00:16:32,120 --> 00:16:37,630 So it's not the same as finding the i-th child of a node. 318 00:16:37,630 --> 00:16:41,080 The edges are labeled by their letter in the alphabet, 319 00:16:41,080 --> 00:16:43,030 and you can achieve that. 320 00:16:43,030 --> 00:16:45,427 We had an earlier result that achieved like log log 321 00:16:45,427 --> 00:16:48,010 or something, and then finally it was brought down to constant 322 00:16:48,010 --> 00:16:53,020 by Farzan and Munro. 323 00:16:53,020 --> 00:16:59,070 So couple more. 324 00:16:59,070 --> 00:17:01,080 Why don't I just mention what they are? 325 00:17:01,080 --> 00:17:04,490 It's getting a little tedious. 326 00:17:04,490 --> 00:17:07,589 There's a lot of other things you might want to store. 327 00:17:07,589 --> 00:17:11,000 You can store succinct permutations, 328 00:17:11,000 --> 00:17:12,839 although there are some open problems here. 329 00:17:12,839 --> 00:17:14,670 So you want to store a permutation using 330 00:17:14,670 --> 00:17:18,359 log n factorial bits, plus little o of n. 331 00:17:18,359 --> 00:17:21,869 If you want to achieve succinct, the best known-- 332 00:17:21,869 --> 00:17:23,940 oh, and the interesting query is you 333 00:17:23,940 --> 00:17:27,790 want to be able to do the k-th power of your permutation, 334 00:17:27,790 --> 00:17:31,620 so see where an item goes after k steps. 335 00:17:31,620 --> 00:17:33,840 Best query known for that is log n over log log 336 00:17:33,840 --> 00:17:35,400 n if you want succinct. 337 00:17:35,400 --> 00:17:38,400 If you only want compact, then it's 338 00:17:38,400 --> 00:17:39,905 known how to do constant time. 
339 00:17:39,905 --> 00:17:42,030 So an interesting open question is, can you achieve 340 00:17:42,030 --> 00:17:45,210 succinct constant time queries? 341 00:17:45,210 --> 00:17:48,850 If you relax either, then it's known how to do it. 342 00:17:48,850 --> 00:17:52,290 There's a generalization of this to functions, 343 00:17:52,290 --> 00:17:57,000 where one of those results is known, the other one isn't. 344 00:17:57,000 --> 00:17:59,115 You can try to do Abelian groups. 345 00:18:06,906 --> 00:18:08,464 There are finite Abelian groups. 346 00:18:08,464 --> 00:18:10,005 There aren't too many different ones, 347 00:18:10,005 --> 00:18:12,510 and you can represent an entire Abelian group 348 00:18:12,510 --> 00:18:15,780 on n items using log n bits, which is pretty crazy, 349 00:18:15,780 --> 00:18:17,430 order log n bits. 350 00:18:17,430 --> 00:18:20,010 And you can represent an item in that group in log n bits 351 00:18:20,010 --> 00:18:24,480 and do multiplication, inverse, and equality testing. 352 00:18:24,480 --> 00:18:27,270 There's other results on graphs, which I won't get into. 353 00:18:27,270 --> 00:18:30,210 Those are a little harder to state. 354 00:18:30,210 --> 00:18:34,890 And then another interesting case is integers. 355 00:18:34,890 --> 00:18:36,840 So you want to store an integer and you 356 00:18:36,840 --> 00:18:39,840 want to be able to increment and decrement the integer. 357 00:18:39,840 --> 00:18:43,470 And you want to do as few bit operations as possible. 358 00:18:43,470 --> 00:18:46,570 Worst case, for example, if you have 359 00:18:46,570 --> 00:18:49,320 lots of 1s in your bit string, you do an increment, 360 00:18:49,320 --> 00:18:52,860 you don't want to pay linear costs, linear number of bit 361 00:18:52,860 --> 00:18:55,020 updates to do it. 362 00:18:55,020 --> 00:18:59,640 And so you can achieve implicit, so just 363 00:18:59,640 --> 00:19:01,365 a constant number of extra bits of space. 
364 00:19:03,960 --> 00:19:06,750 If I have an n-bit integer, then I 365 00:19:06,750 --> 00:19:12,210 can do an increment or a decrement 366 00:19:12,210 --> 00:19:21,510 in order log n bit reads, and constant bit writes. 367 00:19:25,520 --> 00:19:29,642 And this is Raman and Munro from a couple years ago. 368 00:19:29,642 --> 00:19:30,600 So this is pretty good. 369 00:19:30,600 --> 00:19:34,024 Of course, ideal would be to do a constant number of-- 370 00:19:34,024 --> 00:19:35,940 well, a constant number of bit reads or writes 371 00:19:35,940 --> 00:19:38,064 would be optimal, I guess. 372 00:19:38,064 --> 00:19:39,855 I personally would be interested in getting 373 00:19:39,855 --> 00:19:43,710 a constant number of word reads and writes. 374 00:19:43,710 --> 00:19:46,530 But that's an open problem, I believe. 375 00:19:46,530 --> 00:19:49,614 So there's only order log n bit reads. 376 00:19:49,614 --> 00:19:51,030 If they were all consecutive, that 377 00:19:51,030 --> 00:19:53,100 would be a constant number of word reads, 378 00:19:53,100 --> 00:19:54,675 but they're kind of spread out. 379 00:19:54,675 --> 00:19:59,850 It'd be nice to get constant number of word operations. 380 00:20:03,740 --> 00:20:07,300 So that's a quick survey of what's known. 381 00:20:07,300 --> 00:20:09,200 Let's do some actual data structures now. 382 00:20:09,200 --> 00:20:16,930 So we're going to be focusing on succinct binary tries. 383 00:20:16,930 --> 00:20:18,700 And we're going to do two versions of it. 384 00:20:18,700 --> 00:20:21,580 One of them is level order. 385 00:20:32,920 --> 00:20:35,110 And the other will use a balanced parenthesis 386 00:20:35,110 --> 00:20:38,140 representation. 387 00:20:38,140 --> 00:20:40,960 So let's start with the level order one. 388 00:20:40,960 --> 00:20:41,920 This is very easy. 
389 00:21:25,990 --> 00:21:30,030 So I'm going to just loop over the nodes in my trie in level 390 00:21:30,030 --> 00:21:32,872 order, so level by level, and write one bit 391 00:21:32,872 --> 00:21:34,830 to say whether there's a left child and one bit 392 00:21:34,830 --> 00:21:36,642 to say whether there's a right child. 393 00:21:36,642 --> 00:21:37,725 Let's do a little example. 394 00:22:01,000 --> 00:22:03,480 So here is a binary trie. 395 00:22:06,110 --> 00:22:10,680 And I'm going to write a bit string for it. 396 00:22:10,680 --> 00:22:13,140 So first, I'm going to look at the top level. 397 00:22:13,140 --> 00:22:17,340 I have the node A. It has a left child of B, right child of C, 398 00:22:17,340 --> 00:22:20,270 so I write 1, 1. 399 00:22:20,270 --> 00:22:24,930 And this corresponds to B. This corresponds to C. 400 00:22:24,930 --> 00:22:26,340 OK. 401 00:22:26,340 --> 00:22:30,330 Then next level is B. It has no left child 402 00:22:30,330 --> 00:22:33,900 and it has a right child, which is D. I'm just writing down 403 00:22:33,900 --> 00:22:35,970 the labels so I don't get lost. 404 00:22:35,970 --> 00:22:44,740 Then we have node C, which has a left child and a right child, 405 00:22:44,740 --> 00:22:51,365 E and F. Then we have node D, which has no left child. 406 00:22:51,365 --> 00:22:56,481 It has a right child, which is G. 407 00:22:56,481 --> 00:22:58,770 Node E has no children. 408 00:22:58,770 --> 00:23:01,230 Node F has no children. 409 00:23:01,230 --> 00:23:04,980 Node G has no children. 410 00:23:04,980 --> 00:23:05,480 OK. 411 00:23:05,480 --> 00:23:07,380 So there is a 2n-bit string. 412 00:23:07,380 --> 00:23:14,850 This is obviously 2n bits for n nodes, 413 00:23:14,850 --> 00:23:17,470 so this is one way to prove there's at most four to the n 414 00:23:17,470 --> 00:23:19,140 tries. 415 00:23:19,140 --> 00:23:24,052 And well, we'll talk about how useful it is. 
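The example just walked through can be reproduced with a short sketch. The dictionary below is an assumed in-memory form of the seven-node trie described above (A with children B and C, B with only a right child D, C with children E and F, D with only a right child G); the encoding loop is a plain breadth-first traversal and is not meant as an efficient construction.

```python
from collections import deque

# Assumed representation: label -> (left child, right child), None where a child is absent.
EXAMPLE_TRIE = {
    "A": ("B", "C"),
    "B": (None, "D"),
    "C": ("E", "F"),
    "D": (None, "G"),
    "E": (None, None),
    "F": (None, None),
    "G": (None, None),
}


def level_order_bits(trie, root):
    """Visit nodes level by level; emit one bit per child slot (1 = present, 0 = absent)."""
    bits = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in trie[node]:
            bits.append("1" if child is not None else "0")
            if child is not None:
                queue.append(child)
    return "".join(bits)


if __name__ == "__main__":
    encoded = level_order_bits(EXAMPLE_TRIE, "A")
    print(encoded, len(encoded))
```

Each node contributes exactly two bits, so the string has length 2n, here 14 for n = 7.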
416 00:23:24,052 --> 00:23:25,760 I want to give you another representation 417 00:23:25,760 --> 00:23:30,800 of the same thing, which is if we take these nodes 418 00:23:30,800 --> 00:23:32,290 and add on-- 419 00:23:32,290 --> 00:23:33,880 wherever there's an absent leaf, I'm 420 00:23:33,880 --> 00:23:37,580 going to add what we call an external node, as you sometimes 421 00:23:37,580 --> 00:23:40,310 see in data structures books. 422 00:23:40,310 --> 00:23:42,260 One way to represent a null pointer, say, oh, 423 00:23:42,260 --> 00:23:46,170 there's a node there that has no children. 424 00:23:46,170 --> 00:23:47,605 This unifies things a little bit, 425 00:23:47,605 --> 00:23:48,980 because now every node either has 426 00:23:48,980 --> 00:23:51,530 two children or no children. 427 00:23:51,530 --> 00:23:54,650 Another way to think about the same thing. 428 00:23:54,650 --> 00:24:00,230 And it turns out if you look at this bit string and add a 1 429 00:24:00,230 --> 00:24:01,810 in front-- 430 00:24:01,810 --> 00:24:04,760 so I'll put this one in parentheses-- 431 00:24:04,760 --> 00:24:07,940 then what this is encoding is just 432 00:24:07,940 --> 00:24:11,660 for every node in level order, are you a real node 433 00:24:11,660 --> 00:24:13,220 or are you an external node? 434 00:24:13,220 --> 00:24:16,060 Are you an internal node or an external node? 435 00:24:16,060 --> 00:24:18,770 A here is internal. 436 00:24:18,770 --> 00:24:19,790 B is internal. 437 00:24:19,790 --> 00:24:21,290 C is internal. 438 00:24:21,290 --> 00:24:24,110 This is an external node. 439 00:24:24,110 --> 00:24:27,590 Then this is internal, internal, internal. 440 00:24:27,590 --> 00:24:33,610 Then external, G, external, external, external, external, 441 00:24:33,610 --> 00:24:35,337 external, external. 442 00:24:35,337 --> 00:24:36,920 The zeros correspond to external nodes 443 00:24:36,920 --> 00:24:38,490 because those are absent children. 
444 00:24:38,490 --> 00:24:40,216 So same thing. 445 00:24:40,216 --> 00:24:59,430 So I'll write equivalently, 1 equals an internal node and 0 446 00:24:59,430 --> 00:25:00,480 equals an external node. 447 00:25:12,530 --> 00:25:14,390 Of course, to do this, we need one more bit. 448 00:25:18,720 --> 00:25:20,960 And I'm going to take this view primarily, 449 00:25:20,960 --> 00:25:22,550 because it's a little easier to work with. 450 00:25:22,550 --> 00:25:24,383 It doesn't really make much of a difference, 451 00:25:24,383 --> 00:25:27,350 just shifts everything over by 1. 452 00:25:27,350 --> 00:25:30,800 And I'd like to write down the indices here, 453 00:25:30,800 --> 00:25:46,340 so we have 1, 2, 3, 4, 5, 6, 7 into this array. 454 00:25:46,340 --> 00:25:48,350 Because now our challenge is all right, great. 455 00:25:48,350 --> 00:25:50,090 We've represented a binary trie. 456 00:25:50,090 --> 00:25:51,980 But we want to be able to do constant time, 457 00:25:51,980 --> 00:25:53,894 left child, right child, parent. 458 00:25:53,894 --> 00:25:55,810 We're not going to be able to do subtree size. 459 00:25:55,810 --> 00:25:58,520 Level order is really not good for subtree size. 460 00:25:58,520 --> 00:26:00,440 But left child, right child, and parent 461 00:26:00,440 --> 00:26:01,850 we can do in constant time. 462 00:26:04,410 --> 00:26:06,920 And the reason that we can do it in constant time 463 00:26:06,920 --> 00:26:08,630 is because there's a nice lemma, kind 464 00:26:08,630 --> 00:26:11,955 of analogous to binary heaps. 465 00:26:55,800 --> 00:26:59,990 So the claim is if we look at the i-th internal node, 466 00:26:59,990 --> 00:27:04,410 so for example C is the third internal node, so 467 00:27:04,410 --> 00:27:09,240 internal node number 3, then we look at positions 2 times 3, 468 00:27:09,240 --> 00:27:12,960 and 2 times 3 plus 1, so 6 and 7 in this array. 469 00:27:12,960 --> 00:27:17,040 And we get the two children E and F. So that worked.
470 00:27:17,040 --> 00:27:19,800 6 and 7 are E and F. 471 00:27:19,800 --> 00:27:22,740 Or this one would be the fourth internal node. 472 00:27:22,740 --> 00:27:25,560 D is the fourth internal node. 473 00:27:25,560 --> 00:27:29,100 And so at positions 8 and 9 should be this external node 474 00:27:29,100 --> 00:27:31,410 and G. 8 is an external node. 475 00:27:31,410 --> 00:27:33,310 9 is G. 476 00:27:33,310 --> 00:27:34,870 So it works. 477 00:27:34,870 --> 00:27:36,520 This is a lemma. 478 00:27:36,520 --> 00:27:42,610 You can prove it pretty easily by induction on i. 479 00:27:46,966 --> 00:27:50,260 So the idea is, well, if you look 480 00:27:50,260 --> 00:27:54,520 at let's say the i-th internal node 481 00:27:54,520 --> 00:27:58,120 and the i minus first internal node, this one 482 00:27:58,120 --> 00:27:59,570 has two children. 483 00:27:59,570 --> 00:28:01,570 Don't know whether they're internal or external. 484 00:28:01,570 --> 00:28:03,530 Between them-- they're on the same level 485 00:28:03,530 --> 00:28:06,850 and we're in level order, so anything in between here 486 00:28:06,850 --> 00:28:08,380 is an external node. 487 00:28:08,380 --> 00:28:10,840 So they have no children, which means 488 00:28:10,840 --> 00:28:12,652 if you look at the children of i, 489 00:28:12,652 --> 00:28:14,860 they're going to appear right after the children of i 490 00:28:14,860 --> 00:28:20,080 minus 1, because we're in level order. 491 00:28:20,080 --> 00:28:21,930 So we have these two guys. 492 00:28:21,930 --> 00:28:26,800 The next nodes at this level are going to be these two guys. 493 00:28:26,800 --> 00:28:31,860 So this one appeared at 2i minus 2, and 2i minus 1. 494 00:28:31,860 --> 00:28:36,630 This one will appear at 2i and 2i plus 1. 495 00:28:36,630 --> 00:28:39,029 This is if i and i minus 1 are on the same level, 496 00:28:39,029 --> 00:28:41,070 but it also works if they're in different levels.
497 00:28:50,760 --> 00:28:53,630 So if i minus 1 is the last node on its level, 498 00:28:53,630 --> 00:28:57,170 again it's going to have two children here, which will be 499 00:28:57,170 --> 00:29:00,050 the last nodes on this level. 500 00:29:00,050 --> 00:29:03,320 And then right after that will come the children of i, 501 00:29:03,320 --> 00:29:06,680 again at position 2i and 2i plus 1. 502 00:29:10,800 --> 00:29:11,300 OK. 503 00:29:11,300 --> 00:29:14,646 So that's essentially the proof that this works out. 504 00:29:14,646 --> 00:29:16,520 There's lots of ways to see why this is true, 505 00:29:16,520 --> 00:29:20,475 but I think I'll leave it at that. 506 00:29:20,475 --> 00:29:21,890 OK. 507 00:29:21,890 --> 00:29:22,880 So this is good news. 508 00:29:22,880 --> 00:29:26,707 It says if we have the i-th internal node, 509 00:29:26,707 --> 00:29:28,540 we can find the left and the right children. 510 00:29:28,540 --> 00:29:31,000 But these are in different namespaces, right? 511 00:29:31,000 --> 00:29:33,085 On the one hand, we're counting by internal nodes. 512 00:29:33,085 --> 00:29:34,960 On the other hand, we're counting by position 513 00:29:34,960 --> 00:29:37,690 in the array, which is counting position 514 00:29:37,690 --> 00:29:41,120 by internal and external nodes. 515 00:29:41,120 --> 00:29:43,180 This counts both 0's and 1's. 516 00:29:43,180 --> 00:29:44,175 This only counts 1's. 517 00:30:03,410 --> 00:30:06,350 So we need a mechanism for translating between those two 518 00:30:06,350 --> 00:30:10,250 worlds, translating between indices that only count 519 00:30:10,250 --> 00:30:13,610 1's and indices that count 0's and 1's. 520 00:30:13,610 --> 00:30:15,515 And this is the idea of rank and select.
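One way to convince yourself of the lemma is to decode the bit string the same way it was written: scan positions in level order, and every time you meet an internal node, hand it the next two unused positions. Since position 1 is the root and positions are handed out two at a time starting from 2, the i-th internal node necessarily gets 2i and 2i+1. The sketch below checks this on the example; the function name and the `LABELS` helper are assumptions made for illustration.

```python
def children_by_decoding(bits):
    """Map i (1-based index among internal nodes) to the positions of its two children.

    bits is 1-indexed: bits[0] is a dummy entry, bits[1] is the root.
    """
    children = {}
    next_free = 2       # position 1 is the root, so children start at position 2
    internal_seen = 0
    for pos in range(1, len(bits)):
        if bits[pos] == 1:                  # internal node: it owns two child slots
            internal_seen += 1
            children[internal_seen] = (next_free, next_free + 1)
            next_free += 2
    return children


# The example bit string with the extra leading 1, positions 1..15.
BITS = [None, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
# Node labels per position; "." marks an external node.
LABELS = [None] + list("ABC.DEF.G......")

children = children_by_decoding(BITS)
print(children[3], LABELS[6], LABELS[7])  # (6, 7) E F   (C is internal node 3)
print(children[4], LABELS[8], LABELS[9])  # (8, 9) . G   (D is internal node 4)
```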
521 00:30:26,360 --> 00:30:30,050 So in general, if I have a string of n bits, 522 00:30:30,050 --> 00:30:43,650 I want to be able to compute the rank of a bit, which 523 00:30:43,650 --> 00:30:48,650 is the number of 1's at or before position i. 524 00:30:48,650 --> 00:30:50,317 So I'm given a position like 6 and I 525 00:30:50,317 --> 00:30:51,900 want to know how many 1's are there up 526 00:30:51,900 --> 00:30:57,360 to 6, which would be 5 in the full array here. 527 00:30:57,360 --> 00:31:01,830 Or I'm given a query like 8, number of 1's is 6 up 528 00:31:01,830 --> 00:31:04,470 to position 8. 529 00:31:04,470 --> 00:31:10,410 And then the inverse of rank is select, 530 00:31:10,410 --> 00:31:15,690 which gives you the position of the j-th 1 bit. 531 00:31:19,470 --> 00:31:23,070 So this lets you translate between these two worlds 532 00:31:23,070 --> 00:31:29,670 of counting just the 1's, which is rank, or going to the j-th 1 533 00:31:29,670 --> 00:31:30,510 bit, that's select. 534 00:31:37,860 --> 00:31:41,820 So this lets you compute the left child as just 535 00:31:41,820 --> 00:31:45,510 being at position twice the rank, 536 00:31:45,510 --> 00:31:49,560 because the rank tells you this value i, which 537 00:31:49,560 --> 00:31:52,340 is which internal node are you. 538 00:31:52,340 --> 00:31:53,250 That's your rank. 539 00:31:53,250 --> 00:31:57,240 You multiply by 2 and that was the position of the left child. 540 00:31:57,240 --> 00:32:02,400 Right child is going to be that plus 1. 541 00:32:02,400 --> 00:32:06,000 Parent is going to use select. 542 00:32:06,000 --> 00:32:16,410 So if we want the parent of i, this is going to be select of i 543 00:32:16,410 --> 00:32:17,880 over 2 with a floor. 544 00:32:21,102 --> 00:32:23,480 So that's just the inverse of left and right child. 545 00:32:23,480 --> 00:32:25,730 If I divide by 2 and take the floor, I get rid of that 546 00:32:25,730 --> 00:32:26,840 plus 1, I get the rank. 
547 00:32:26,840 --> 00:32:33,170 And then I do select, sub 1, and that's the inverse of rank. 548 00:32:33,170 --> 00:32:35,690 So that lets me implement. 549 00:32:35,690 --> 00:32:38,440 If I have rank and select in constant time, now I can do 550 00:32:38,440 --> 00:32:41,010 left child, right child, parent in constant time. 551 00:32:41,010 --> 00:32:43,940 The remaining challenge is, how do I do rank and select? 552 00:32:43,940 --> 00:32:46,150 And that's what we're going to do next. 553 00:32:46,150 --> 00:32:47,870 Any questions about that before we go on? 554 00:32:50,756 --> 00:32:52,680 All right. 555 00:32:52,680 --> 00:32:55,530 So now we do real data structures. 556 00:32:55,530 --> 00:32:58,820 This is going to be some integer data structures, some fun 557 00:32:58,820 --> 00:33:01,520 stuff. 558 00:33:01,520 --> 00:33:04,210 It's going to use some techniques we know, 559 00:33:04,210 --> 00:33:07,090 but in a different setting, because now our goal is 560 00:33:07,090 --> 00:33:10,021 to really minimize space. 561 00:33:10,021 --> 00:33:13,240 We're going to use indirection and lookup 562 00:33:13,240 --> 00:33:14,650 tables in a new kind of way. 563 00:33:20,380 --> 00:33:22,960 These are going to be word RAM data structures. 564 00:33:22,960 --> 00:33:27,160 I want to do both rank and select in constant time. 565 00:33:27,160 --> 00:33:30,630 And the amount of space I have is little o of n. 566 00:33:30,630 --> 00:33:31,410 I want succinct. 567 00:33:31,410 --> 00:33:34,720 And I'm going to store the bit vector. 568 00:33:34,720 --> 00:33:36,555 So then in addition to the bit vector, all 569 00:33:36,555 --> 00:33:39,665 I'm allowed for rank and select is little o of n space. 570 00:33:39,665 --> 00:33:40,540 That's the cool part. 571 00:33:44,200 --> 00:33:47,990 So rank is one of the first succinct data structures. 572 00:33:47,990 --> 00:33:56,740 It's by Jacobson, 1989. 
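To make the navigation formulas concrete, here is a small sketch that uses deliberately naive, linear-time rank and select as stand-ins for the constant-time structures developed next. The function names and the 1-indexed `BITS` list with a dummy entry at index 0 are assumptions of this sketch.

```python
# Example bit string with the leading 1, positions 1..15 (index 0 is a dummy).
BITS = [None, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]


def rank1(bits, i):
    """Number of 1s at or before position i (naive O(n) stand-in)."""
    return sum(bits[1:i + 1])


def select1(bits, j):
    """Position of the j-th 1 bit (naive O(n) stand-in)."""
    count = 0
    for pos in range(1, len(bits)):
        count += bits[pos]
        if bits[pos] == 1 and count == j:
            return pos
    raise ValueError("fewer than j one bits")


def left_child(bits, i):
    """Position of the left child of the internal node at position i."""
    return 2 * rank1(bits, i)


def right_child(bits, i):
    """Position of the right child of the internal node at position i."""
    return 2 * rank1(bits, i) + 1


def parent(bits, i):
    """Position of the parent of the node at position i (i >= 2)."""
    return select1(bits, i // 2)


print(rank1(BITS, 6), rank1(BITS, 8))              # 5 6
print(left_child(BITS, 3), right_child(BITS, 3))   # 6 7  (children of C)
print(parent(BITS, 9))                             # 5    (G's parent is D)
```

A child position whose bit is 0 denotes an external node, that is, an absent child.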
573 00:33:56,740 --> 00:34:02,770 So a first observation is, what can we do with a lookup table? 574 00:34:02,770 --> 00:34:05,980 Suppose I wanted to store all the answers, 575 00:34:05,980 --> 00:34:09,880 but I can't afford much space. 576 00:34:23,489 --> 00:34:27,030 Well, let's do sort of a worksheet. 577 00:34:27,030 --> 00:34:31,230 If I had space x, or if I looked at all bit strings of length 578 00:34:31,230 --> 00:34:36,179 x and then I wanted to store them, that's going to cost-- 579 00:34:36,179 --> 00:34:39,384 or store a lookup table for each of them, 580 00:34:39,384 --> 00:34:40,800 it's going to cost-- well, there's 581 00:34:40,800 --> 00:34:45,440 2 to the x different bit strings of that length. 582 00:34:45,440 --> 00:34:49,010 Then for each of them, I have to store all possible answers 583 00:34:49,010 --> 00:34:51,449 to rank and select queries. 584 00:34:51,449 --> 00:34:55,190 So there's order x different queries. 585 00:34:55,190 --> 00:34:58,024 You could query every bit. 586 00:34:58,024 --> 00:35:00,440 And then for each of them, I have to write down an answer. 587 00:35:00,440 --> 00:35:05,030 So this is going to be log x bits, because the answer is 588 00:35:05,030 --> 00:35:07,320 a value between 0 and x minus 1, so it takes 589 00:35:07,320 --> 00:35:09,381 log x bits to write it down. 590 00:35:09,381 --> 00:35:10,880 So this is how much space it's going 591 00:35:10,880 --> 00:35:15,500 to be to store the answer for all bit strings of length x. 592 00:35:15,500 --> 00:35:18,260 So what should I set x to? 593 00:35:18,260 --> 00:35:20,550 I'd like this to be little o of n, 594 00:35:20,550 --> 00:35:24,140 so anything that's a little bit less than log n 595 00:35:24,140 --> 00:35:25,460 is going to be OK. 596 00:35:25,460 --> 00:35:30,540 And in particular, 1/2 log n is a good enough choice. 
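Written out, the worksheet above is a product of three factors, and substituting x = 1/2 log n gives the bound. This is just the arithmetic from the board in LaTeX form, with logs taken base 2:

```latex
\underbrace{2^{x}}_{\text{bit strings}} \cdot
\underbrace{x}_{\text{queries}} \cdot
\underbrace{\log x}_{\text{bits per answer}}
\;\Big|_{\,x = \frac12 \log n}
= \sqrt{n}\cdot \tfrac12 \log n \cdot \log\!\big(\tfrac12 \log n\big)
= O\!\big(\sqrt{n}\,\log n\,\log\log n\big)
= o(n).
```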
597 00:35:30,540 --> 00:35:33,050 If we use 1/2 log n bits, this is 598 00:35:33,050 --> 00:35:36,960 going to be root n, log n log log 599 00:35:36,960 --> 00:35:44,227 n, which is little o of n, quite small as succinct data 600 00:35:44,227 --> 00:35:44,810 structures go. 601 00:35:44,810 --> 00:35:46,934 We're going to use more space than root n. 602 00:35:46,934 --> 00:35:48,350 The point is, if we could get down 603 00:35:48,350 --> 00:35:51,410 to bit strings of logarithmic size, we'd be done. 604 00:35:51,410 --> 00:35:54,660 But we have a bit string of linear size. 605 00:35:54,660 --> 00:35:56,450 So how do we reduce it? 606 00:35:56,450 --> 00:35:58,532 Something like indirection. 607 00:36:10,360 --> 00:36:17,580 So the funny thing here is we're going to do indirection twice, 608 00:36:17,580 --> 00:36:20,890 kind of recursively, but stopping after two levels. 609 00:36:20,890 --> 00:36:24,120 The first level of indirection is 610 00:36:24,120 --> 00:36:27,710 going to reduce things down to size log squared. 611 00:36:36,640 --> 00:36:38,490 So we're going to take our n bit string, 612 00:36:38,490 --> 00:36:41,130 divide into chunks of size log squared n, 613 00:36:41,130 --> 00:36:43,830 so there's n over log squared n chunks. 614 00:36:43,830 --> 00:36:51,470 They look something like this, log squared n. 615 00:36:57,650 --> 00:37:01,250 And the idea is right now we're trying to just do rank, 616 00:37:01,250 --> 00:37:05,580 so rank number of 1's at or before a given position. 617 00:37:05,580 --> 00:37:08,750 So what I'm going to do is that each of these vertical bars, 618 00:37:08,750 --> 00:37:11,710 I'm going to store the cumulative rank so far. 619 00:37:20,075 --> 00:37:21,090 Why log squared? 620 00:37:21,090 --> 00:37:23,190 Basically because this is what I can afford. 621 00:37:23,190 --> 00:37:30,129 To store that cumulative rank is log n bits. 622 00:37:30,129 --> 00:37:31,920 I mean, this rank is going to get very big. 
623 00:37:31,920 --> 00:37:33,753 By the end, it will have most of the 1 bits, 624 00:37:33,753 --> 00:37:35,580 so it could be potentially linear. 625 00:37:35,580 --> 00:37:39,180 So I'm going to need log n bits to write that down. 626 00:37:39,180 --> 00:37:41,430 But how many of these vertical bars are there? 627 00:37:41,430 --> 00:37:44,415 Well, only n over log squared n of them. 628 00:37:44,415 --> 00:37:50,670 So I have n over log squared n things I need to write down. 629 00:37:50,670 --> 00:37:52,250 Each of them is log n bits. 630 00:38:00,218 --> 00:38:06,065 So do some cancellation. 631 00:38:06,065 --> 00:38:08,860 This cancels with that. 632 00:38:08,860 --> 00:38:12,970 We have n over log n bits overall, which 633 00:38:12,970 --> 00:38:14,690 is slightly little o of n. 634 00:38:14,690 --> 00:38:17,730 And that's the bound we're going to achieve. 635 00:38:17,730 --> 00:38:18,230 OK. 636 00:38:18,230 --> 00:38:21,901 Of course, now we have to solve the problem within a chunk. 637 00:38:21,901 --> 00:38:24,400 But we've at least reduced to something of size log squared. 638 00:38:24,400 --> 00:38:26,500 Unfortunately, we need something of size 1/2 log n 639 00:38:26,500 --> 00:38:27,950 before we can use a lookup table. 640 00:38:27,950 --> 00:38:29,530 So there's a bit of a gap here, so we're 641 00:38:29,530 --> 00:38:31,154 going to use indirection a second time. 642 00:38:41,320 --> 00:38:44,810 This time, we can go all the way to 1/2 log n. 643 00:39:03,230 --> 00:39:06,440 So I'll use red vertical bars to denote the subchunks. 644 00:39:06,440 --> 00:39:08,207 Each of these is 1/2 log n. 645 00:39:10,900 --> 00:39:13,340 Overall size of a chunk here is log squared n. 646 00:39:15,910 --> 00:39:19,150 So every one of these chunks gets further divided. 647 00:39:19,150 --> 00:39:20,710 Now, how could this help? 648 00:39:20,710 --> 00:39:23,590 Why didn't I just subdivide into chunks of size 1/2 log n 649 00:39:23,590 --> 00:39:24,377 before? 
650 00:39:24,377 --> 00:39:25,960 I mean, why I couldn't do it is clear. 651 00:39:25,960 --> 00:39:29,590 If I did n over log n of them, each of them stores log n bits, 652 00:39:29,590 --> 00:39:31,866 I'd have a linear number of bits. 653 00:39:31,866 --> 00:39:33,490 I can't afford a linear number of bits. 654 00:39:33,490 --> 00:39:36,610 That would only be compact, not succinct. 655 00:39:36,610 --> 00:39:39,880 How does it help me to first reduce to this 656 00:39:39,880 --> 00:39:42,260 and then reduce to this? 657 00:39:42,260 --> 00:39:48,310 Well, what I want to do at each of these red vertical bars 658 00:39:48,310 --> 00:39:51,550 is store the cumulative rank, but not 659 00:39:51,550 --> 00:39:53,050 the overall cumulative rank. 660 00:39:53,050 --> 00:40:01,900 I only need the cumulative rank within the overall chunk, 661 00:40:01,900 --> 00:40:05,774 not relative to the entire array. 662 00:40:05,774 --> 00:40:06,690 Why does that help me? 663 00:40:11,520 --> 00:40:12,980 AUDIENCE: Need less bits. 664 00:40:12,980 --> 00:40:14,480 ERIK DEMAINE: Need less bits. 665 00:40:14,480 --> 00:40:17,030 These ranks can't get too big, because the overall size 666 00:40:17,030 --> 00:40:19,520 of a chunk is just log squared. 667 00:40:19,520 --> 00:40:21,940 Log of log squared is log log n. 668 00:40:21,940 --> 00:40:26,960 So I only need log log n bits to write down 669 00:40:26,960 --> 00:40:28,400 those cumulative ranks. 670 00:40:31,200 --> 00:40:35,330 And so total size here is going to be 671 00:40:35,330 --> 00:40:43,670 n over log n times log log n bits, 672 00:40:43,670 --> 00:40:46,760 because there's n over log n of these red vertical bars. 673 00:40:46,760 --> 00:40:48,720 Each one I only need to write log log n bits. 674 00:40:48,720 --> 00:40:50,450 And this is slightly little o of n. 675 00:40:50,450 --> 00:40:52,283 It's actually a little bit bigger than this, 676 00:40:52,283 --> 00:40:54,260 but still little o of n.
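The two space bounds for the rank structure, side by side. The constant factors (the 2 from log of log squared, and the 2 from subchunks of size 1/2 log n) are spelled out here but disappear into the big-O:

```latex
\text{chunks:}\qquad
\frac{n}{\log^{2} n}\cdot \log n \;=\; \frac{n}{\log n}\ \text{bits},
\\[6pt]
\text{subchunks:}\qquad
\frac{n}{\tfrac12 \log n}\cdot \log\!\big(\log^{2} n\big)
\;=\; \frac{2n}{\log n}\cdot 2\log\log n
\;=\; O\!\left(\frac{n \log\log n}{\log n}\right)\ \text{bits}.
```

Both are o(n), and the subchunk term dominates.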
677 00:40:54,260 --> 00:40:57,050 So we can still afford this. 678 00:40:57,050 --> 00:41:01,260 And now we're done, because these subchunks are of size 1/2 679 00:41:01,260 --> 00:41:05,600 log n, so I can use this lookup table and solve my problem. 680 00:41:05,600 --> 00:41:11,585 So let me step forward, just putting everything together. 681 00:41:15,200 --> 00:41:17,656 To compute the rank of a query, first thing you do 682 00:41:17,656 --> 00:41:20,030 is figure out which chunk you fall into, which you can do 683 00:41:20,030 --> 00:41:22,760 by division, integer division. 684 00:41:22,760 --> 00:41:25,280 These things are stored in an array, so you just compute, 685 00:41:25,280 --> 00:41:26,630 what is that cumulative rank? 686 00:41:32,300 --> 00:41:35,300 So you take the rank of that chunk, 687 00:41:35,300 --> 00:41:43,490 you add on the rank of the subchunk within the chunk, 688 00:41:43,490 --> 00:41:48,530 and then you add on the rank of the element in the subchunk. 689 00:41:53,000 --> 00:41:57,560 So rank of the chunk is stored in the array known as 2. 690 00:41:57,560 --> 00:41:59,300 The rank of the subchunk within the chunk 691 00:41:59,300 --> 00:42:02,270 is stored in the array, in the array known as 3. 692 00:42:02,270 --> 00:42:05,180 And then to compute the rank of the element in the subchunk, 693 00:42:05,180 --> 00:42:08,180 you use the lookup table, which is essentially 694 00:42:08,180 --> 00:42:11,990 telling you for every possible subchunk what the answers are. 695 00:42:11,990 --> 00:42:15,800 So 3 times a constant is a constant. 696 00:42:15,800 --> 00:42:19,400 And we get rank, constant time, and n log log 697 00:42:19,400 --> 00:42:20,750 n over log n space. 698 00:42:25,700 --> 00:42:28,940 If you're concerned that n times log log n over log n 699 00:42:28,940 --> 00:42:35,800 is not very sublinear, you can do a little bit better 700 00:42:35,800 --> 00:42:36,910 using fancier tricks. 
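Putting the three pieces of the rank structure together, here is a sketch of how a query could look. The class name, the 0-indexed positions, and the exact rounding of the chunk and subchunk sizes are choices made for this illustration. It also stores the bit vector as a Python list and rebuilds the subchunk pattern by string conversion, where a real word-RAM implementation would read the subchunk as a single machine word.

```python
import math


class RankStructure:
    """Two levels of indirection plus a lookup table, following the scheme above."""

    def __init__(self, bits):
        self.bits = list(bits)
        log_n = math.log2(max(len(self.bits), 4))
        self.sub = max(1, int(log_n / 2))            # subchunk size, about 1/2 log n
        self.chunk = self.sub * max(1, int(2 * log_n))  # chunk size, about log^2 n

        self.chunk_rank = []   # step 2: overall rank at each chunk boundary
        self.sub_rank = []     # step 3: rank within the chunk at each subchunk boundary
        total = within = 0
        for start in range(0, len(self.bits), self.sub):
            if start % self.chunk == 0:
                self.chunk_rank.append(total)
                within = 0
            self.sub_rank.append(within)
            ones = sum(self.bits[start:start + self.sub])
            total += ones
            within += ones

        # step 1: for every possible subchunk pattern, the rank at every offset
        self.table = []
        for pattern in range(1 << self.sub):
            prefix = [0]
            for k in range(self.sub):
                prefix.append(prefix[-1] + ((pattern >> (self.sub - 1 - k)) & 1))
            self.table.append(prefix)

    def rank(self, i):
        """Number of 1s in bits[0..i], inclusive, with 0-indexed positions."""
        s = i // self.sub                       # which subchunk
        c = i // self.chunk                     # which chunk
        start = s * self.sub
        block = self.bits[start:start + self.sub]
        block += [0] * (self.sub - len(block))  # pad a short final subchunk
        pattern = int("".join(str(b) for b in block), 2)
        return self.chunk_rank[c] + self.sub_rank[s] + self.table[pattern][i - start + 1]


example = [1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
rs = RankStructure(example)
print(rs.rank(5), rs.rank(7))  # 5 6  (positions 6 and 8 in the lecture's 1-indexed view)
```

Each query is three array reads and two additions, which is the "3 times a constant" mentioned above.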
701 00:42:49,620 --> 00:42:55,720 Namely, you can achieve n over log to the k of n space. 702 00:43:06,590 --> 00:43:09,830 This is the result of Patrascu from 2008. 703 00:43:09,830 --> 00:43:11,810 I'm not going to go into how it's done. 704 00:43:11,810 --> 00:43:14,370 But if you're interested, it's a little bit less. 705 00:43:14,370 --> 00:43:15,620 It would be nice to do better. 706 00:43:15,620 --> 00:43:17,510 But my guess is there should be a lower 707 00:43:17,510 --> 00:43:21,180 bound, that with constant-- so this is for any constant k. 708 00:43:25,280 --> 00:43:27,579 It would be nice to do better, like square root of n 709 00:43:27,579 --> 00:43:28,120 or something. 710 00:43:28,120 --> 00:43:30,530 But my guess is there's a matching lower bound. 711 00:43:30,530 --> 00:43:33,271 I don't think that's known. 712 00:43:33,271 --> 00:43:33,770 OK. 713 00:43:33,770 --> 00:43:36,020 So that was rank. 714 00:43:36,020 --> 00:43:38,810 Our next challenge is to do select, the inverse. 715 00:43:38,810 --> 00:43:43,320 And select is a little bit harder, I would say. 716 00:43:43,320 --> 00:43:44,794 Don't have a great intuition why. 717 00:43:48,746 --> 00:43:49,734 But it is. 718 00:43:52,710 --> 00:43:56,100 And we're going to be able to use the same kind of technique. 719 00:43:56,100 --> 00:44:01,842 So again, we can use a lookup table and-- 720 00:44:01,842 --> 00:44:02,880 I'll do that first. 721 00:44:17,710 --> 00:44:21,750 So just like before, if we have bit strings of length at most 722 00:44:21,750 --> 00:44:26,280 1/2 log n, then we're only going to need something 723 00:44:26,280 --> 00:44:27,420 like root n space. 724 00:44:27,420 --> 00:44:32,910 It's root n again times log n log log n space, just 725 00:44:32,910 --> 00:44:33,690 like rank. 726 00:44:33,690 --> 00:44:36,504 There are at most n possible queries, actually fewer, 727 00:44:36,504 --> 00:44:37,920 because there may be fewer 1 bits.
728 00:44:37,920 --> 00:44:40,350 But at most, there are n 1 bits to query. 729 00:44:40,350 --> 00:44:43,870 An answer is now an index, which is within a thing of size 1/2 730 00:44:43,870 --> 00:44:44,370 log n. 731 00:44:44,370 --> 00:44:46,860 So I just have to write down an index of that size, 732 00:44:46,860 --> 00:44:50,730 so it's log log n bits to write it down. 733 00:44:50,730 --> 00:44:51,570 Cool. 734 00:44:51,570 --> 00:44:53,190 So that's the same. 735 00:44:53,190 --> 00:44:56,839 Now the challenge is about getting down to 1/2 log n bits. 736 00:44:56,839 --> 00:44:58,380 We're going to use the same technique 737 00:44:58,380 --> 00:45:00,990 of two levels of indirection. 738 00:45:00,990 --> 00:45:03,000 But they work differently. 739 00:45:03,000 --> 00:45:07,680 There's an extra thing we need to deal with in select. 740 00:45:10,650 --> 00:45:14,070 There will be two cases, depending on whether your array 741 00:45:14,070 --> 00:45:18,030 has lots of 1's or not so many 1's. 742 00:45:18,030 --> 00:45:21,120 And those two cases can vary throughout the string. 743 00:45:24,850 --> 00:45:31,335 So what we do, first of all, is-- 744 00:45:36,356 --> 00:45:38,050 actually, maybe I'll go over here. 745 00:45:40,580 --> 00:45:42,160 I'll stick here. 746 00:45:42,160 --> 00:45:42,660 Whatever. 747 00:45:48,310 --> 00:45:49,840 So we're back to an n-bit string. 748 00:46:07,100 --> 00:46:11,990 So we're looking at we want the analog of this structure, 749 00:46:11,990 --> 00:46:13,380 this structure of chunks. 750 00:46:13,380 --> 00:46:15,560 Now, we can't just say, oh, take the bit string, 751 00:46:15,560 --> 00:46:17,930 divide it into chunks of equal size, 752 00:46:17,930 --> 00:46:22,460 because then given a query, we want to do select of j, 753 00:46:22,460 --> 00:46:27,969 we need to know which of these chunks j belongs to.
754 00:46:27,969 --> 00:46:29,510 So instead of making them equal size, 755 00:46:29,510 --> 00:46:32,160 we're going to make them have an equal number of 1 bits. 756 00:46:32,160 --> 00:46:34,310 So then we can just take j, divide 757 00:46:34,310 --> 00:46:38,771 by the size of these chunks, which is log n log log n. 758 00:46:38,771 --> 00:46:40,520 You could probably do log squared as well, 759 00:46:40,520 --> 00:46:44,450 but log n log log n is a slightly better choice. 760 00:46:47,150 --> 00:46:51,740 And so we just divide every log n log log n 761 00:46:51,740 --> 00:46:53,849 1 bit, put a vertical bar. 762 00:46:53,849 --> 00:46:55,640 That way, given j, we divide by this thing, 763 00:46:55,640 --> 00:46:58,940 take the floor, that tells us which chunk we belong to. 764 00:46:58,940 --> 00:47:00,020 So it's different. 765 00:47:00,020 --> 00:47:04,490 Decomposing by 1 space instead of 01 space. 766 00:47:04,490 --> 00:47:07,160 And so for those guys, we just store an array. 767 00:47:07,160 --> 00:47:10,700 If your query happens to have a 0 mod this, 768 00:47:10,700 --> 00:47:12,020 then you have your answer. 769 00:47:12,020 --> 00:47:14,810 Otherwise, you still need to query within the chunk. 770 00:47:14,810 --> 00:47:19,490 In some sense, the array has gotten divided something like 771 00:47:19,490 --> 00:47:25,410 this, so the number of 1 bits in here is always the same, 772 00:47:25,410 --> 00:47:30,840 log n log log n 1's. 773 00:47:30,840 --> 00:47:32,965 So you can now teleport to the appropriate chunk. 774 00:47:32,965 --> 00:47:34,850 And the issue is, how do I solve a chunk? 775 00:47:34,850 --> 00:47:36,720 But now chunks have different sizes, 776 00:47:36,720 --> 00:47:38,050 which is kind of annoying. 777 00:47:38,050 --> 00:47:41,810 That's why we need this extra step, which 778 00:47:41,810 --> 00:47:53,040 is within a group of log n log log n 1 bits-- 779 00:47:53,040 --> 00:47:57,350 I'm calling them groups now, instead of chunks. 
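Here is a sketch of just this first step of select: cutting the string into groups that each contain the same number of 1 bits, so that a query j can jump straight to its group by division. The function names are assumptions, positions are 0-indexed, and the group size is left as a parameter (the lecture uses log n log log n). This sketch records the first 1 bit of each group rather than the last, and the linear scan inside the group is only a placeholder for the sparse and dense cases handled next.

```python
def build_group_index(bits, ones_per_group):
    """Record the position of the 1st, (k+1)-th, (2k+1)-th, ... 1 bit, k = ones_per_group."""
    starts = []
    seen = 0
    for pos, bit in enumerate(bits):
        if bit:
            if seen % ones_per_group == 0:
                starts.append(pos)
            seen += 1
    return starts


def select(bits, starts, ones_per_group, j):
    """0-indexed position of the j-th 1 bit, with j counted from 1."""
    if j < 1:
        raise ValueError("j must be at least 1")
    group, offset = divmod(j - 1, ones_per_group)   # which group, and how far into it
    if group >= len(starts):
        raise ValueError("fewer than j one bits")
    pos = starts[group]                             # teleport to the group
    while pos < len(bits):                          # placeholder: scan inside the group
        if bits[pos]:
            if offset == 0:
                return pos
            offset -= 1
        pos += 1
    raise ValueError("fewer than j one bits")


example = [1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
starts = build_group_index(example, 3)
print(starts)                          # [0, 4, 8]
print(select(example, starts, 3, 5))   # 5
```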
780 00:47:57,350 --> 00:48:00,830 So each of these groups has different size. 781 00:48:00,830 --> 00:48:09,630 Let's suppose it has size r, so say it's r bits long. 782 00:48:14,170 --> 00:48:15,920 R is going to be different for each chunk, 783 00:48:15,920 --> 00:48:21,040 but we'll do this for every chunk, every group. 784 00:48:21,040 --> 00:48:22,750 Then there's two cases. 785 00:48:22,750 --> 00:48:28,080 If r is big, we're done. 786 00:48:28,080 --> 00:48:29,360 How big? 787 00:48:29,360 --> 00:48:32,910 Well, if it's at least the square of the number of 1 bits, 788 00:48:32,910 --> 00:48:35,090 that means it's very sparse. 789 00:48:35,090 --> 00:48:37,560 Only square root of the bits are 1's. 790 00:48:37,560 --> 00:48:40,270 The rest are all 0's. 791 00:48:40,270 --> 00:48:42,600 But then, I can afford to just store all the answers. 792 00:48:59,575 --> 00:49:01,950 I'm just going to store a lookup table of all the answers 793 00:49:01,950 --> 00:49:30,560 if it's very sparse, because then I 794 00:49:30,560 --> 00:49:33,280 claim I only need this many bits in order 795 00:49:33,280 --> 00:49:36,300 to store all of these answers. 796 00:49:36,300 --> 00:49:38,950 So if I do this for all groups that 797 00:49:38,950 --> 00:49:42,550 have a large number of bits, I store this lookup array, 798 00:49:42,550 --> 00:49:44,950 how many-- 799 00:49:44,950 --> 00:49:46,840 if I sum up the size of all of these arrays, 800 00:49:46,840 --> 00:49:48,670 how much do I pay? 801 00:49:48,670 --> 00:49:53,470 Well, the lookup array has this kind of size. 802 00:49:53,470 --> 00:49:56,260 There are log n log log n 1 bits. 803 00:49:56,260 --> 00:49:59,500 And each of them I need to store an index for them. 804 00:49:59,500 --> 00:50:02,830 Now, this could cost log n bits, because potentially one 805 00:50:02,830 --> 00:50:05,369 of these groups is very large. 806 00:50:05,369 --> 00:50:06,660 It could be almost linear size. 
807 00:50:06,660 --> 00:50:09,850 So I need log n bits to write down a position in there. 808 00:50:09,850 --> 00:50:12,670 There's log n log log n 1 bits to write down a position for. 809 00:50:12,670 --> 00:50:19,240 So this is the size of one of these arrays. 810 00:50:19,240 --> 00:50:22,650 Now, how many of these could I possibly need to store? 811 00:50:22,650 --> 00:50:25,390 Well, I know that this group has log n log log 812 00:50:25,390 --> 00:50:29,500 n squared bits in it, so the maximum number of such groups 813 00:50:29,500 --> 00:50:34,110 is n over that, n over log n log log n squared. 814 00:50:34,110 --> 00:50:38,250 And now we get to do some cancellation. 815 00:50:38,250 --> 00:50:42,010 So this 2 cancels with this log n log log n. 816 00:50:42,010 --> 00:50:44,830 And then this log n cancels with this log n. 817 00:50:44,830 --> 00:50:53,080 And so we get n over log log n bits, which 818 00:50:53,080 --> 00:50:56,470 is slightly little o of n. 819 00:50:56,470 --> 00:50:57,750 OK. 820 00:50:57,750 --> 00:51:01,750 Again, it is possible to get n over log to the k space. 821 00:51:01,750 --> 00:51:03,170 But we won't do that here. 822 00:51:03,170 --> 00:51:06,890 We'll be happy enough with n over log log n. 823 00:51:06,890 --> 00:51:07,390 OK. 824 00:51:07,390 --> 00:51:09,040 But we're not done, unfortunately. 825 00:51:09,040 --> 00:51:12,340 So we've now reduced into groups, 826 00:51:12,340 --> 00:51:13,760 and I've only given you one case. 827 00:51:13,760 --> 00:51:16,165 This is when r is large. 828 00:51:16,165 --> 00:51:18,100 The other case is r is small, meaning 829 00:51:18,100 --> 00:51:21,310 the number of bits in the group is at most log n log log n 830 00:51:21,310 --> 00:51:22,272 squared. 831 00:51:22,272 --> 00:51:23,980 That's a good case for us, because that's 832 00:51:23,980 --> 00:51:25,810 pretty similar to rank. 833 00:51:25,810 --> 00:51:27,500 Here we got chunks of size log squared.
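The cancellation for the sparse case, written out. A sparse group has at least (log n log log n) squared bits, so there are at most n over that many of them, and each one stores one log n-bit position per 1 bit:

```latex
\underbrace{\frac{n}{(\log n \,\log\log n)^{2}}}_{\text{max number of sparse groups}}
\cdot
\underbrace{\log n \,\log\log n}_{\text{1 bits per group}}
\cdot
\underbrace{\log n}_{\text{bits per position}}
\;=\; \frac{n \log n}{\log n \,\log\log n}
\;=\; \frac{n}{\log\log n}
\;=\; o(n).
```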
834 00:51:27,500 --> 00:51:29,320 Here it's slightly larger than log squared, 835 00:51:29,320 --> 00:51:32,770 but only by a poly log log factor. 836 00:51:32,770 --> 00:51:37,210 And that would correspond to this step 2 in rank. 837 00:51:37,210 --> 00:51:39,440 You do step 2 and step 3 here. 838 00:51:39,440 --> 00:51:41,110 Then we get step 2 of rank. 839 00:51:41,110 --> 00:51:45,850 We've reduced to poly log size chunks 840 00:51:45,850 --> 00:51:47,810 by getting rid of this case. 841 00:51:47,810 --> 00:51:50,880 And so we have to do it again, because we have poly log 842 00:51:50,880 --> 00:51:52,000 size groups. 843 00:51:52,000 --> 00:51:57,160 But we need to get down to groups of size log, 1/2 log. 844 00:51:57,160 --> 00:52:00,258 So we need to do another layer of indirection. 845 00:52:25,740 --> 00:52:29,310 So we get to do steps 2 and 3 again. 846 00:52:29,310 --> 00:52:30,980 This is what I'll call step 4. 847 00:52:36,330 --> 00:52:42,880 Repeat steps 2 and 3 on-- 848 00:52:42,880 --> 00:52:43,410 oh, sorry. 849 00:52:43,410 --> 00:52:48,150 I didn't say I need an else clause. 850 00:52:48,150 --> 00:52:53,010 Else I'm going to call this bit vector a reduced bit vector. 851 00:52:56,640 --> 00:53:03,345 So I've reduced to order log n log log n squared bits. 852 00:53:09,510 --> 00:53:15,200 And so step 4 is on all reduced strings, 853 00:53:15,200 --> 00:53:16,275 all reduced bit strings. 854 00:53:22,310 --> 00:53:25,390 I want to do steps 2 and 3 again. 855 00:53:25,390 --> 00:53:26,265 Let me do it quickly. 856 00:53:29,250 --> 00:53:45,760 My goal is to further reduce to poly log log n. 857 00:53:45,760 --> 00:53:49,510 I took n bit strings and I got down to log poly log bits. 858 00:53:49,510 --> 00:53:52,240 I do it again, I should get down to poly log log bits. 859 00:53:52,240 --> 00:53:53,050 And indeed, I can. 860 00:53:53,050 --> 00:53:54,670 And this is plenty small. 
861 00:53:54,670 --> 00:53:57,910 Poly log log is way smaller than 1/2 log, 862 00:53:57,910 --> 00:54:02,710 so we don't even need that much of the lookup table. 863 00:54:02,710 --> 00:54:03,340 Fine. 864 00:54:03,340 --> 00:54:05,430 So I'll call this 2 prime. 865 00:54:11,110 --> 00:54:12,940 I want to make this explicit, because they 866 00:54:12,940 --> 00:54:15,310 are slightly different, because now everything's 867 00:54:15,310 --> 00:54:19,825 relative to the reduced string, which is poly log. 868 00:54:37,710 --> 00:54:41,910 This gets hard to pronounce, but every log log n square-th 1 869 00:54:41,910 --> 00:54:44,710 bit, we're going to write down the relative index 870 00:54:44,710 --> 00:54:48,090 within the reduced string of size log n. 871 00:54:48,090 --> 00:54:49,890 So writing down the relative index 872 00:54:49,890 --> 00:54:55,290 only costs log log n bits, because we're 873 00:54:55,290 --> 00:54:57,330 in something of size log n, so writing down 874 00:54:57,330 --> 00:54:59,340 that index is short. 875 00:54:59,340 --> 00:55:01,150 We write it down for all of these, 876 00:55:01,150 --> 00:55:08,910 so we end up paying n over log log n squared. 877 00:55:08,910 --> 00:55:10,920 That's the maximum number of these indices 878 00:55:10,920 --> 00:55:12,240 that we need to store. 879 00:55:12,240 --> 00:55:14,340 Each of them we pay log log n. 880 00:55:14,340 --> 00:55:17,400 So here I'm summing over all the reduced bit strings. 881 00:55:17,400 --> 00:55:19,140 This is an overall size. 882 00:55:19,140 --> 00:55:21,810 It's at most n over log log n squared that we need to store. 883 00:55:21,810 --> 00:55:24,200 Could be fewer if there aren't many reduced bit strings. 884 00:55:24,200 --> 00:55:27,880 But worst case, everything ends up being reduced, 885 00:55:27,880 --> 00:55:31,270 so we have this many times that many times that many bits. 886 00:55:31,270 --> 00:55:33,180 And we get n over log log n bits. 
887 00:55:37,620 --> 00:55:41,490 This is roughly following the pattern of step 2 over here. 888 00:55:41,490 --> 00:55:44,060 Step 2 over here didn't just have the log term. 889 00:55:44,060 --> 00:55:47,020 It also had an auxiliary log log term. 890 00:55:47,020 --> 00:55:49,140 So if you felt like it, you could make this log 891 00:55:49,140 --> 00:55:51,471 log n times log log log n. 892 00:55:51,471 --> 00:55:53,470 But it will actually give you worse space bound, 893 00:55:53,470 --> 00:55:54,594 so this is slightly better. 894 00:55:56,950 --> 00:55:57,450 OK. 895 00:55:57,450 --> 00:56:00,750 Then we apply step 3 prime, which 896 00:56:00,750 --> 00:56:04,440 is we look at each of the groups that we've identified. 897 00:56:04,440 --> 00:56:08,100 And either it's big and it has lots of 0 bits, 898 00:56:08,100 --> 00:56:09,737 or it's not big. 899 00:56:09,737 --> 00:56:11,570 And in either case, we're going to be happy. 900 00:56:15,220 --> 00:56:29,480 So if a group of log log n squared 1 bits has r bits, 901 00:56:29,480 --> 00:56:32,020 we look at each of them individually. 902 00:56:32,020 --> 00:56:35,490 And if r is at least the square of that, 903 00:56:35,490 --> 00:56:39,320 so log log n to the fourth power-- 904 00:56:39,320 --> 00:56:41,280 so we're losing constants in the exponents, 905 00:56:41,280 --> 00:56:43,650 but it's not a big deal-- 906 00:56:43,650 --> 00:56:52,830 then store relative-- 907 00:56:52,830 --> 00:56:55,290 I mean, store all the answers, but now as relative indices. 908 00:57:01,530 --> 00:57:02,030 OK. 909 00:57:02,030 --> 00:57:03,000 Let's go over here. 910 00:57:13,770 --> 00:57:16,320 So how much do these relative indices cost? 911 00:57:16,320 --> 00:57:27,060 Again, it's at most order log log n bits to write them down. 912 00:57:27,060 --> 00:57:29,880 We don't know that a group is any smaller than log n, 913 00:57:29,880 --> 00:57:32,140 but it's at most the original size of log n. 
914 00:57:32,140 --> 00:57:34,950 It's only log log n bits to write each of them down. 915 00:57:34,950 --> 00:57:36,600 And now we get to say, oh, well, we 916 00:57:36,600 --> 00:57:38,700 had to write down log log n squared 1 bits. 917 00:57:38,700 --> 00:57:40,690 But this can only happen n over log log n 918 00:57:40,690 --> 00:57:43,110 to the fourth many times. 919 00:57:43,110 --> 00:57:49,350 So the space is n over log log n to the fourth. 920 00:57:49,350 --> 00:57:52,500 That's the maximum number of these 921 00:57:52,500 --> 00:57:56,610 I guess you call them sparse bit vectors, sparse groups there 922 00:57:56,610 --> 00:57:59,340 could be, because each of them is at least this big. 923 00:57:59,340 --> 00:58:01,790 The total number of them is at most n divided by that. 924 00:58:01,790 --> 00:58:03,420 For each of them, we have to write down 925 00:58:03,420 --> 00:58:08,370 log log n squared different indices for our array. 926 00:58:08,370 --> 00:58:12,720 And each of those indices cost log log n bits to write down. 927 00:58:12,720 --> 00:58:15,720 So this is log log n to the third power. 928 00:58:15,720 --> 00:58:17,680 This is log log n to the fourth power. 929 00:58:17,680 --> 00:58:19,950 So again, this is n over log log n. 930 00:58:19,950 --> 00:58:22,230 You can tell I've tuned all of these numbers 931 00:58:22,230 --> 00:58:24,990 to come out to n over log log n bits. 932 00:58:28,700 --> 00:58:31,050 OK. 933 00:58:31,050 --> 00:58:32,230 That was the if case. 934 00:58:32,230 --> 00:58:35,280 There's the else case, which is that you have reduced 935 00:58:35,280 --> 00:58:40,050 to poly log log size, namely then in the dense case, 936 00:58:40,050 --> 00:58:44,070 you have r is at most log log n to the fourth. 937 00:58:44,070 --> 00:58:49,440 So at this point, else you are further reduced. 938 00:58:55,170 --> 00:59:01,330 When you're further reduced, you have at most log log n 939 00:59:01,330 --> 00:59:04,230 to the fourth bits. 
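The space accounting for steps 2 prime and 3 prime can be written out as below. This is only a sketch of the arithmetic stated in lecture: n is the length of the original bit vector, constants are hidden in the O-notation, and the worst case (every string ends up reduced) is assumed.

```latex
\begin{align*}
\text{Step } 2' &:\quad
  \frac{n}{(\log\log n)^2}\ \text{sampled 1 bits}
  \times O(\log\log n)\ \text{bits each}
  = O\!\left(\frac{n}{\log\log n}\right)\ \text{bits},\\[4pt]
\text{Step } 3' &:\quad
  \frac{n}{(\log\log n)^4}\ \text{sparse groups}
  \times (\log\log n)^2\ \text{indices}
  \times O(\log\log n)\ \text{bits each}
  = O\!\left(\frac{n}{\log\log n}\right)\ \text{bits}.
\end{align*}
```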
940 00:59:04,230 --> 00:59:07,200 And at most, log log n squared of them are 1 bits. 941 00:59:07,200 --> 00:59:08,700 But we don't really care about that. 942 00:59:08,700 --> 00:59:11,640 Once we're down to a bit vector of poly log log size, 943 00:59:11,640 --> 00:59:14,640 we can use our lookup table and we're done. 944 00:59:21,140 --> 00:59:22,170 So that's select. 945 00:59:22,170 --> 00:59:25,420 If you want to do a select on an index, 946 00:59:25,420 --> 00:59:28,420 first you figure out which group it's 947 00:59:28,420 --> 00:59:32,710 in by dividing by log n log log n, taking the floor. 948 00:59:32,710 --> 00:59:36,100 You teleport to the appropriate group using this array. 949 00:59:36,100 --> 00:59:39,520 Then within that group, there's a bit saying 950 00:59:39,520 --> 00:59:42,430 whether it was sparse or dense. 951 00:59:42,430 --> 00:59:46,270 If it was sparse, so lots of 0's in it, 952 00:59:46,270 --> 00:59:49,690 then you have a lookup table that gives you all your answers 953 00:59:49,690 --> 00:59:52,610 for the remainder of your query. 954 00:59:52,610 --> 00:59:56,050 If it's dense, then you go over here. 955 00:59:56,050 --> 00:59:58,210 You know that this thing will be stored, 956 00:59:58,210 --> 01:00:00,730 and so you figure out which subgroup 957 01:00:00,730 --> 01:00:03,760 you belong to by dividing by log log n 958 01:00:03,760 --> 01:00:06,670 squared, taking the floor. 959 01:00:06,670 --> 01:00:09,760 There's an array, this thing, that 960 01:00:09,760 --> 01:00:13,390 teleports you to that group-- 961 01:00:13,390 --> 01:00:15,340 sorry, to that subgroup. 962 01:00:15,340 --> 01:00:18,125 And then you apply-- 963 01:00:18,125 --> 01:00:20,750 then there's a bit there saying whether it was sparse or dense. 964 01:00:20,750 --> 01:00:22,875 If it was sparse, there's a lookup table giving you 965 01:00:22,875 --> 01:00:23,590 the answer. 
966 01:00:23,590 --> 01:00:27,970 If it was dense, there's an index into the number 1 lookup 967 01:00:27,970 --> 01:00:30,340 table that tells you what this bit 968 01:00:30,340 --> 01:00:35,300 string is, because there are only log log n to the fourth bits. 969 01:00:35,300 --> 01:00:37,000 In fact, that is the index. 970 01:00:37,000 --> 01:00:40,120 Just what those bits are lets you look up into this table 971 01:00:40,120 --> 01:00:42,490 and solve your query in constant time, in all cases 972 01:00:42,490 --> 01:00:43,320 constant time. 973 01:00:43,320 --> 01:00:46,120 But here's a little bit more branching, 974 01:00:46,120 --> 01:00:47,935 depending on your situation. 975 01:00:47,935 --> 01:00:50,310 As I said, select is a little more complicated than rank. 976 01:00:50,310 --> 01:00:52,900 But in the end, constant time, little 977 01:00:52,900 --> 01:00:55,690 o of n space, n over log log n, which can again 978 01:00:55,690 --> 01:00:59,160 be improved by Patrascu, n over log 979 01:00:59,160 --> 01:01:01,160 to the k for any constant k. 980 01:01:01,160 --> 01:01:01,660 Question? 981 01:01:01,660 --> 01:01:05,428 AUDIENCE: Can you just quickly remind us how the 2 and 3 982 01:01:05,428 --> 01:01:06,352 changed [INAUDIBLE]? 983 01:01:06,352 --> 01:01:07,060 ERIK DEMAINE: OK. 984 01:01:07,060 --> 01:01:08,330 How did 2 and 3 change? 985 01:01:08,330 --> 01:01:10,300 You don't actually really need to change them. 986 01:01:10,300 --> 01:01:12,910 The big change is that you're storing only relative indices, 987 01:01:12,910 --> 01:01:15,250 not indices. 988 01:01:15,250 --> 01:01:18,820 So before, we were storing an array of indices 989 01:01:18,820 --> 01:01:21,160 of every log log nth 1 bit. 990 01:01:21,160 --> 01:01:24,130 These were global pointers. 991 01:01:24,130 --> 01:01:27,520 But now after 2 and 3, we've reduced to something 992 01:01:27,520 --> 01:01:29,920 of size log n or poly log n.
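The select query just walked through can be sketched in a simplified, one-level form. This is not the full structure from lecture: the names `build_select`, `select`, `k`, and `sparse_span` are made up for illustration, and the dense case scans the short group directly where the lecture recurses once more and then uses the lookup table. It only shows the branching: teleport to the group, check sparse or dense, then answer with a relative index.

```python
def build_select(bits, k, sparse_span):
    """Group the 1 bits k at a time; record each group's start and whether it is sparse."""
    ones = [i for i, b in enumerate(bits) if b == 1]
    groups = []
    for g in range(0, len(ones), k):
        chunk = ones[g:g + k]
        start = chunk[0]
        span = chunk[-1] - start + 1          # number of bits this group covers
        if span >= sparse_span:
            # Sparse: few groups can be this long, so store every answer
            # as an offset relative to the group start.
            groups.append((start, [pos - start for pos in chunk]))
        else:
            # Dense: the group is short, so answer by looking inside it.
            groups.append((start, None))
    return groups


def select(bits, groups, k, j):
    """Return the position of the j-th 1 bit (j counts from 0)."""
    start, offsets = groups[j // k]           # teleport to the right group
    r = j % k                                 # which 1 bit inside the group
    if offsets is not None:
        return start + offsets[r]             # sparse: stored relative index
    # Dense: stand-in for the second level plus lookup table.
    pos = start
    while True:
        if bits[pos] == 1:
            if r == 0:
                return pos
            r -= 1
        pos += 1


# Hypothetical example: one long sparse group followed by two short dense ones.
bits = [1] + [0] * 20 + [1] + [0] * 20 + [1, 1, 0, 1, 1, 1, 0, 1, 1]
groups = build_select(bits, k=3, sparse_span=9)
print([select(bits, groups, 3, j) for j in range(9)])
```

In the lecture's parameters, `k` would be log n log log n and `sparse_span` its square; the small values here are only to make the example readable.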
993 01:01:29,920 --> 01:01:32,470 We need to exploit that here, so that we were only 994 01:01:32,470 --> 01:01:33,470 storing log log n bits. 995 01:01:33,470 --> 01:01:35,470 If we didn't do that, this would be order n bits 996 01:01:35,470 --> 01:01:36,980 and it would be too big. 997 01:01:36,980 --> 01:01:38,980 That's really the only thing you need to change. 998 01:01:38,980 --> 01:01:41,050 The other thing I changed was this value. 999 01:01:41,050 --> 01:01:43,210 If you follow that plan, it would be-- 1000 01:01:47,690 --> 01:01:50,130 it was also a square. 1001 01:01:50,130 --> 01:01:51,690 So we did have to add a square here. 1002 01:01:51,690 --> 01:01:54,847 You could also add a log log log n term here, 1003 01:01:54,847 --> 01:01:55,680 but it won't matter. 1004 01:01:55,680 --> 01:01:59,010 Basically you do something that works. 1005 01:01:59,010 --> 01:02:01,380 The square was necessary to cancel out this guy, 1006 01:02:01,380 --> 01:02:02,040 for example. 1007 01:02:02,040 --> 01:02:03,250 If you didn't do the square-- 1008 01:02:03,250 --> 01:02:05,010 well, so instead of this, you could 1009 01:02:05,010 --> 01:02:08,610 have done log log n times log log log n without the square. 1010 01:02:08,610 --> 01:02:11,967 Then here you would have gotten n over log log log n. 1011 01:02:11,967 --> 01:02:13,800 So you could have followed the same pattern. 1012 01:02:13,800 --> 01:02:15,633 You'd just get a slightly worse space bound. 1013 01:02:15,633 --> 01:02:16,500 I tuned it here. 1014 01:02:16,500 --> 01:02:18,600 Here we needed-- here we could not 1015 01:02:18,600 --> 01:02:20,100 have afforded to go to log squared, 1016 01:02:20,100 --> 01:02:24,090 if I recall correctly, though you can check that. 1017 01:02:24,090 --> 01:02:27,540 Maybe it's a good pset question. 1018 01:02:27,540 --> 01:02:29,337 There's lots of choices that work here. 
1019 01:02:29,337 --> 01:02:31,170 But this is the one I find the cleanest that 1020 01:02:31,170 --> 01:02:35,220 gets a decent bound, not the best bound, but reasonable. 1021 01:02:35,220 --> 01:02:37,140 Other questions? 1022 01:02:37,140 --> 01:02:38,770 I think that's all that I changed. 1023 01:02:38,770 --> 01:02:42,780 The sparsity definition was still a squared thing, 1024 01:02:42,780 --> 01:02:46,111 so it was squared over here and it was squared over here. 1025 01:02:46,111 --> 01:02:47,610 It's just the thing we were squaring 1026 01:02:47,610 --> 01:02:48,720 was a little different. 1027 01:02:51,540 --> 01:02:54,370 OK. 1028 01:02:54,370 --> 01:02:56,900 One more thing I want to talk about. 1029 01:02:56,900 --> 01:03:04,690 So at this point, we just finished this level order representation 1030 01:03:04,690 --> 01:03:07,150 of binary tries, because we already 1031 01:03:07,150 --> 01:03:09,460 saw left child, right child, and parent, 1032 01:03:09,460 --> 01:03:11,250 reduced to rank and select. 1033 01:03:11,250 --> 01:03:13,670 We just solved rank and select in little o of n bits, 1034 01:03:13,670 --> 01:03:19,090 so at least statically we can do left child, 1035 01:03:19,090 --> 01:03:21,310 right child, parent in a binary trie now 1036 01:03:21,310 --> 01:03:24,660 in constant time per operation, 2n plus little 1037 01:03:24,660 --> 01:03:25,660 o of n bits of space. 1038 01:03:25,660 --> 01:03:27,730 The 2n bits are to store those 2n bits 1039 01:03:27,730 --> 01:03:29,410 that we wrote down before. 1040 01:03:29,410 --> 01:03:33,220 So that's succinct binary tries. 1041 01:03:33,220 --> 01:03:34,960 Done. 1042 01:03:34,960 --> 01:03:37,900 One mention, there are some dynamic versions. 1043 01:03:37,900 --> 01:03:43,180 In particular, there are dynamic versions of rank and select.
1044 01:03:43,180 --> 01:03:46,200 But the best versions that are known to do dynamic rank 1045 01:03:46,200 --> 01:03:50,350 and select achieve something like log over log 1046 01:03:50,350 --> 01:03:56,050 log time per operation, if you're interested in dynamic. 1047 01:04:02,355 --> 01:04:03,830 So this is kind of annoying. 1048 01:04:03,830 --> 01:04:06,196 If you want to go to dynamic, either you pay more time 1049 01:04:06,196 --> 01:04:07,570 or you don't use rank and select. 1050 01:04:07,570 --> 01:04:10,019 But I'm not going to worry too much about dynamic. 1051 01:04:10,019 --> 01:04:11,060 Stick to rank and select. 1052 01:04:11,060 --> 01:04:12,726 But there's one more thing on this list, 1053 01:04:12,726 --> 01:04:16,820 which is a different way to do succinct binary tries. 1054 01:04:16,820 --> 01:04:19,250 And this different way is going to be more powerful, more 1055 01:04:19,250 --> 01:04:22,850 useful for things like suffix trees, which is what 1056 01:04:22,850 --> 01:04:24,180 we're going to do next class. 1057 01:04:24,180 --> 01:04:28,250 So I want to tell you a little bit about this. 1058 01:04:28,250 --> 01:04:31,040 The level order representation is kind of like a warm-up. 1059 01:04:31,040 --> 01:04:33,770 It motivates rank and select. 1060 01:04:33,770 --> 01:04:36,740 But it does not let us do subtree size. 1061 01:04:36,740 --> 01:04:38,900 Subtree size would be nice to do, 1062 01:04:38,900 --> 01:04:41,150 because you care about how many matches you have 1063 01:04:41,150 --> 01:04:43,790 after you do a search down a suffix tree. 1064 01:04:43,790 --> 01:04:46,400 Level order just ain't going to cut it for that, 1065 01:04:46,400 --> 01:04:49,452 so we're going to use different representation. 1066 01:04:49,452 --> 01:04:51,410 We're still going to use rank and select a lot. 1067 01:04:54,050 --> 01:04:56,550 And I'll generalize forms of rank and select. 
1068 01:05:05,586 --> 01:05:07,460 But it's going to be a little bit more handy. 1069 01:05:07,460 --> 01:05:10,670 Essentially I want to do more like a depth-first search 1070 01:05:10,670 --> 01:05:13,477 of the trie, less like a-- 1071 01:05:19,330 --> 01:05:23,430 less level order, so more depth-first. 1072 01:05:23,430 --> 01:05:24,100 OK. 1073 01:05:24,100 --> 01:05:33,795 Here's our friend the binary trie, same one as before. 1074 01:05:37,130 --> 01:05:38,770 We had our binary representation of it. 1075 01:05:38,770 --> 01:05:41,320 I'm not going to draw that here. 1076 01:05:41,320 --> 01:05:43,620 First thing I want to do is say, hey, look, 1077 01:05:43,620 --> 01:05:47,940 this is the same thing as a rooted ordered tree. 1078 01:05:52,170 --> 01:05:54,840 I already mentioned that there's the same number of them. 1079 01:05:54,840 --> 01:05:58,430 There's Catalan of these and there's Catalan of these. 1080 01:05:58,430 --> 01:05:59,955 So a rooted ordered tree has a node, 1081 01:05:59,955 --> 01:06:03,360 it has some number of children, then more nodes. 1082 01:06:03,360 --> 01:06:06,000 The children are ordered, but they don't have labels on them. 1083 01:06:06,000 --> 01:06:07,320 So it's a tree, not a trie. 1084 01:06:11,500 --> 01:06:14,530 So I claim these two things are equivalent. 1085 01:06:14,530 --> 01:06:17,540 And there's a nice combinatorial bijection between them, 1086 01:06:17,540 --> 01:06:19,380 which you may have seen before. 1087 01:06:19,380 --> 01:06:20,890 It's kind of a classic. 1088 01:06:20,890 --> 01:06:24,500 But here we're going to use it for handy stuff. 1089 01:06:24,500 --> 01:06:30,850 Basically so binary tries distinguish 1090 01:06:30,850 --> 01:06:32,180 between left and right. 1091 01:06:32,180 --> 01:06:34,080 Rooted ordered trees do not. 1092 01:06:34,080 --> 01:06:36,160 They just have order.
1093 01:06:36,160 --> 01:06:38,380 So to clean that up, I'm going to look 1094 01:06:38,380 --> 01:06:42,850 at the right spine of the trie, distinguish that, because right 1095 01:06:42,850 --> 01:06:46,010 and left make the difference here, and then recurse. 1096 01:06:46,010 --> 01:06:48,640 So this is the right spine of down here. 1097 01:06:48,640 --> 01:06:52,300 This is the right spine of this subtree. 1098 01:06:52,300 --> 01:06:55,030 Now every node lives in some right spine. 1099 01:06:55,030 --> 01:06:57,220 And then I'm just going to rotate 1100 01:06:57,220 --> 01:07:01,390 45 degrees counterclockwise. 1101 01:07:01,390 --> 01:07:06,280 So I have A, E, G. That's my first right spine. 1102 01:07:06,280 --> 01:07:11,710 I'm going to think of them as children of a new root node. 1103 01:07:11,710 --> 01:07:14,380 And then they have children below that 1104 01:07:14,380 --> 01:07:17,110 which correspond to the right spines that hang below. 1105 01:07:17,110 --> 01:07:23,044 So A, for example, has this right spine, B, C, D. 1106 01:07:23,044 --> 01:07:28,090 So we have B, C, D here. 1107 01:07:28,090 --> 01:07:32,880 E has a right spine of F hanging off of it. 1108 01:07:32,880 --> 01:07:35,590 G has no right spine hanging off of it. 1109 01:07:35,590 --> 01:07:37,900 So you need to prove that this is a real bijection. 1110 01:07:37,900 --> 01:07:40,830 Every binary trie can be so converted 1111 01:07:40,830 --> 01:07:42,320 into a rooted ordered tree. 1112 01:07:42,320 --> 01:07:45,159 And it's unique, so if it's different over here, 1113 01:07:45,159 --> 01:07:46,450 it will be different over here. 1114 01:07:46,450 --> 01:07:48,033 And you can convert backwards as well, 1115 01:07:48,033 --> 01:07:49,630 if you just delete the super-root 1116 01:07:49,630 --> 01:07:52,150 and turn all the children into a right spine and recurse. 1117 01:07:52,150 --> 01:07:53,767 They're really the same thing.
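The right-spine bijection can be sketched in a few lines. This is a minimal illustration, not code from the course: the class names `BinNode` and `OrdNode` and the `"*"` label for the added super-root are assumptions, and the example tree is the A through G trie as described on the board (A, E, G on the root's right spine; B, C, D hanging off A; F hanging off E).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BinNode:
    """Node of a binary trie (hypothetical helper type)."""
    label: str
    left: Optional["BinNode"] = None
    right: Optional["BinNode"] = None


@dataclass
class OrdNode:
    """Node of a rooted ordered tree (hypothetical helper type)."""
    label: str
    children: List["OrdNode"] = field(default_factory=list)


def spine_to_children(top):
    """Walk the right spine from `top`; each spine node becomes one sibling."""
    siblings = []
    node = top
    while node is not None:
        # The spine hanging off the left child becomes this node's children.
        siblings.append(OrdNode(node.label, spine_to_children(node.left)))
        node = node.right
    return siblings


def to_ordered_tree(root):
    """Binary trie -> rooted ordered tree under a new super-root."""
    return OrdNode("*", spine_to_children(root))


def children_to_spine(children):
    """Inverse direction: a list of siblings becomes a right spine."""
    top = None
    for child in reversed(children):
        top = BinNode(child.label, left=children_to_spine(child.children), right=top)
    return top


def to_binary_trie(star):
    """Delete the super-root and rebuild the binary trie."""
    return children_to_spine(star.children)


# The example trie from the board.
D = BinNode("D")
C = BinNode("C", right=D)
B = BinNode("B", right=C)
F = BinNode("F")
G = BinNode("G")
E = BinNode("E", left=F, right=G)
A = BinNode("A", left=B, right=E)

star = to_ordered_tree(A)
print([c.label for c in star.children])
print([c.label for c in star.children[0].children])
```

Converting back with `to_binary_trie(star)` gives a trie equal to the original, which is the round trip the bijection claim relies on.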
1118 01:07:53,767 --> 01:07:55,600 This is why there's Catalan of each of them. 1119 01:07:58,120 --> 01:07:58,620 OK. 1120 01:08:01,390 --> 01:08:04,740 Now what I'd really like to get to is balanced parentheses. 1121 01:08:12,260 --> 01:08:14,300 And while it's a little unclear how 1122 01:08:14,300 --> 01:08:17,330 to represent a binary trie with balanced parentheses, 1123 01:08:17,330 --> 01:08:20,330 these things it's really clear how to represent a binary-- 1124 01:08:20,330 --> 01:08:22,250 represent with balanced parentheses. 1125 01:08:22,250 --> 01:08:25,140 Here I just do an Euler tour, which 1126 01:08:25,140 --> 01:08:28,040 is a depth-first search visiting these things. 1127 01:08:28,040 --> 01:08:30,590 And every time I start a node, I'll write an open paren. 1128 01:08:30,590 --> 01:08:33,020 Every time I finish a node, I write a close paren. 1129 01:08:33,020 --> 01:08:36,330 Similar to the representation we talked about before. 1130 01:08:36,330 --> 01:08:38,029 So this would be-- 1131 01:08:38,029 --> 01:08:39,470 I'm going to need more space. 1132 01:08:42,020 --> 01:08:44,960 This is going to be an open paren for star. 1133 01:08:44,960 --> 01:08:47,420 Why don't I make that one really big? 1134 01:08:47,420 --> 01:08:48,470 We start here. 1135 01:08:48,470 --> 01:08:50,899 Then we open the A chunk. 1136 01:08:50,899 --> 01:08:54,170 Then we do B, which has no children. 1137 01:08:54,170 --> 01:08:56,779 Then we do C, which has no children. 1138 01:08:56,779 --> 01:08:59,569 Then we do D, which has no children. 1139 01:08:59,569 --> 01:09:02,850 And that finishes A. 1140 01:09:02,850 --> 01:09:03,350 OK. 1141 01:09:03,350 --> 01:09:10,430 Then we start E. Then we do F. Then 1142 01:09:10,430 --> 01:09:18,290 we finish F. We finish E. Then we do G. 1143 01:09:18,290 --> 01:09:19,720 And then we're done with star. 1144 01:09:23,340 --> 01:09:25,040 So that's a very easy transformation.
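The Euler tour encoding can be sketched as follows. As an assumption for illustration, a rooted ordered tree is written as a `(label, children)` tuple; the labels are only there to follow along and are not stored in the output. `to_bits` uses the convention from lecture that open parens are 0's and close parens are 1's.

```python
def to_parens(tree):
    """Euler tour: write '(' when a node starts and ')' when it finishes."""
    _label, children = tree
    return "(" + "".join(to_parens(child) for child in children) + ")"


def to_bits(parens):
    """Open parens are 0's, close parens are 1's."""
    return [0 if ch == "(" else 1 for ch in parens]


# The rooted ordered tree from the board: star has children A, E, G.
example = ("*", [
    ("A", [("B", []), ("C", []), ("D", [])]),
    ("E", [("F", [])]),
    ("G", []),
])

parens = to_parens(example)
bit_string = to_bits(parens)
print(parens)
print(len(bit_string))
```

For the 7 labeled nodes plus the star this produces 16 characters, matching the 2n plus 2 count.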
1145 01:09:25,040 --> 01:09:28,220 Again, there are Catalan many of these balanced parens. 1146 01:09:28,220 --> 01:09:32,399 You think, oh, there's 2 to the n of them, 1147 01:09:32,399 --> 01:09:34,519 because each paren could be open or closed. 1148 01:09:34,519 --> 01:09:36,810 But they have to be balanced, so it's a little bit more 1149 01:09:36,810 --> 01:09:39,810 constrained than that. 1150 01:09:39,810 --> 01:09:42,010 And so it ends up being Catalan of n over 2 1151 01:09:42,010 --> 01:09:47,010 if there's n parens, because there's 2 parens here 1152 01:09:47,010 --> 01:09:48,430 for every node over here. 1153 01:09:48,430 --> 01:09:50,310 This is going to be our bit string. 1154 01:09:50,310 --> 01:09:51,450 Open parens are 0's. 1155 01:09:51,450 --> 01:09:53,729 Close parens are 1's. 1156 01:09:53,729 --> 01:09:59,190 This has roughly 2n bits, 2n plus 2, I guess, for the star, 1157 01:09:59,190 --> 01:10:01,880 relative to this n. 1158 01:10:01,880 --> 01:10:05,060 So basically, nodes here correspond 1159 01:10:05,060 --> 01:10:10,380 to nodes here, which correspond to an open paren, close paren 1160 01:10:10,380 --> 01:10:12,300 pair over here. 1161 01:10:12,300 --> 01:10:14,190 Now, we can't afford to store these labels. 1162 01:10:14,190 --> 01:10:18,402 Those are just guidelines to think about what you need. 1163 01:10:18,402 --> 01:10:20,610 So let's think about there are three things we really 1164 01:10:20,610 --> 01:10:24,610 want here, left child, right child, and parent. 1165 01:10:31,270 --> 01:10:33,940 This is the thing that we care about, 1166 01:10:33,940 --> 01:10:36,190 but this is what we're going to store. 1167 01:10:36,190 --> 01:10:39,610 So I want to translate from here to here to here. 1168 01:10:39,610 --> 01:10:42,920 This is an exercise in translation. 1169 01:10:42,920 --> 01:10:44,830 So what does a left child mean here? 
1170 01:10:44,830 --> 01:10:49,045 Left child over here corresponds to-- 1171 01:10:51,670 --> 01:10:52,780 well, I guess it goes-- 1172 01:10:52,780 --> 01:10:55,390 in general, the left child goes to this branch, which 1173 01:10:55,390 --> 01:10:58,180 is like all of these children pointers from A. 1174 01:10:58,180 --> 01:11:00,410 But really, if you follow the left child, 1175 01:11:00,410 --> 01:11:02,139 you get to B, not any of the other things 1176 01:11:02,139 --> 01:11:02,930 on the right spine. 1177 01:11:02,930 --> 01:11:04,804 You always get to the top of the right spine. 1178 01:11:04,804 --> 01:11:07,840 Top of the right spine is the left-most node in the spine 1179 01:11:07,840 --> 01:11:08,570 here. 1180 01:11:08,570 --> 01:11:13,390 In other words, it is the first child of a node. 1181 01:11:13,390 --> 01:11:15,420 First child of a node, if there is one, 1182 01:11:15,420 --> 01:11:18,580 is going to be the left child over here. 1183 01:11:18,580 --> 01:11:21,040 Right child is like following the spine. 1184 01:11:21,040 --> 01:11:23,590 That's like going this way. 1185 01:11:23,590 --> 01:11:26,935 So right child is what I would call next sibling. 1186 01:11:29,920 --> 01:11:32,264 The next sibling to the right, if there is one, 1187 01:11:32,264 --> 01:11:34,180 that's going to correspond to the right child, 1188 01:11:34,180 --> 01:11:36,681 because we're just following a right spine. 1189 01:11:36,681 --> 01:11:37,180 OK. 1190 01:11:37,180 --> 01:11:40,750 Parent is a little trickier. 1191 01:11:40,750 --> 01:11:43,810 Parent is the reverse of these, so either you 1192 01:11:43,810 --> 01:11:46,110 take your previous sibling-- 1193 01:11:46,110 --> 01:11:48,610 but if you're here and there is no previous sibling, 1194 01:11:48,610 --> 01:11:51,260 then you take your actual parent, 1195 01:11:51,260 --> 01:11:53,120 because parent should walk up here. 
1196 01:11:53,120 --> 01:11:55,090 This was like going left, previous sibling, 1197 01:11:55,090 --> 01:11:56,110 previous sibling. 1198 01:11:56,110 --> 01:12:00,410 Parent of this guy, though, is the actual parent over here. 1199 01:12:00,410 --> 01:12:05,710 So this is going to be previous sibling if there is one, 1200 01:12:05,710 --> 01:12:08,580 or if there isn't one, you go to the parent. 1201 01:12:14,030 --> 01:12:14,530 OK. 1202 01:12:14,530 --> 01:12:16,390 So that's easy translation. 1203 01:12:16,390 --> 01:12:18,910 Now we need to convert these pictures 1204 01:12:18,910 --> 01:12:21,056 into balanced parentheses pictures, which is also 1205 01:12:21,056 --> 01:12:22,180 going to be easy in itself. 1206 01:12:22,180 --> 01:12:24,760 But to jump all the way from binary tries to balanced parens 1207 01:12:24,760 --> 01:12:27,340 would be pretty confusing, so that's why 1208 01:12:27,340 --> 01:12:29,480 we have this intermediate step. 1209 01:12:29,480 --> 01:12:33,760 So we want first child here. 1210 01:12:33,760 --> 01:12:37,870 If I have a paren-- like I'm looking at A. So A corresponds 1211 01:12:37,870 --> 01:12:40,440 to this paren and this paren. 1212 01:12:40,440 --> 01:12:45,130 I'm going to represent the node let's say by the first paren. 1213 01:12:45,130 --> 01:12:48,490 Then the first child is just the very next character. 1214 01:12:48,490 --> 01:12:51,650 You put the first child right after that open paren. 1215 01:12:51,650 --> 01:12:54,410 So this is really the next character 1216 01:12:54,410 --> 01:12:56,740 if we want to find the first child. 1217 01:12:56,740 --> 01:13:00,945 This is if it's an open paren. 1218 01:13:00,945 --> 01:13:03,070 It could be the very next-- like if you're doing B, 1219 01:13:03,070 --> 01:13:04,986 the very next character is a close paren, that 1220 01:13:04,986 --> 01:13:06,284 means there are no children. 1221 01:13:06,284 --> 01:13:08,450 But that's how you can tell whether there's a child. 
1222 01:13:08,450 --> 01:13:10,810 If there's an open paren right after your open paren, 1223 01:13:10,810 --> 01:13:13,150 that's your next child. 1224 01:13:13,150 --> 01:13:15,550 That's your first child, I should say. 1225 01:13:15,550 --> 01:13:17,770 Now what about next sibling? 1226 01:13:17,770 --> 01:13:22,060 So let's say again I'm at A. And I 1227 01:13:22,060 --> 01:13:23,530 want to know the next sibling. 1228 01:13:23,530 --> 01:13:26,710 Next sibling is E. So that's like I 1229 01:13:26,710 --> 01:13:28,660 go to the close paren for A, and then I 1230 01:13:28,660 --> 01:13:30,910 go to the next character. 1231 01:13:30,910 --> 01:13:38,740 So this would be go to the close paren for where you are right 1232 01:13:38,740 --> 01:13:42,220 now, and then go to the next character. 1233 01:13:44,940 --> 01:13:46,990 This is, again, if it's an open paren. 1234 01:13:46,990 --> 01:13:49,630 If it's a close paren, then you have no next sibling. 1235 01:13:49,630 --> 01:13:52,960 So again, you can tell whether this operation fails. 1236 01:13:52,960 --> 01:13:57,460 What we need is an operation given a bit string representing 1237 01:13:57,460 --> 01:14:01,660 balanced parentheses and given a query position of a left paren, 1238 01:14:01,660 --> 01:14:04,630 I need to know what is the matching right paren. 1239 01:14:04,630 --> 01:14:06,520 And I'll just wave my hands and claim 1240 01:14:06,520 --> 01:14:10,090 that can be done with the same techniques as rank and select. 1241 01:14:10,090 --> 01:14:10,741 It's not easy. 1242 01:14:10,741 --> 01:14:11,740 It's quite a bit harder. 1243 01:14:11,740 --> 01:14:14,200 But you do enough of these recursions, 1244 01:14:14,200 --> 01:14:16,510 eventually you can solve it. 1245 01:14:18,950 --> 01:14:19,450 OK. 
1246 01:14:19,450 --> 01:14:21,880 Last operation is parent over here, 1247 01:14:21,880 --> 01:14:24,100 which corresponds to previous sibling, 1248 01:14:24,100 --> 01:14:27,430 or parent over here, which corresponds to-- 1249 01:14:30,730 --> 01:14:32,282 there are two cases. 1250 01:14:32,282 --> 01:14:34,240 We want to move backwards, so here we're always 1251 01:14:34,240 --> 01:14:36,040 ending with next character. 1252 01:14:36,040 --> 01:14:38,439 So first thing we do is go to the previous character. 1253 01:14:42,840 --> 01:14:45,920 And there are two cases. 1254 01:14:45,920 --> 01:14:47,640 If it's a close paren-- 1255 01:14:47,640 --> 01:14:51,190 so let's say we're here at E, we go to the previous character. 1256 01:14:51,190 --> 01:14:54,675 If it's a close paren, then A is our previous sibling, 1257 01:14:54,675 --> 01:14:57,340 and so we want to do the previous sibling situation. 1258 01:14:57,340 --> 01:14:58,570 We again find the match. 1259 01:14:58,570 --> 01:15:00,444 We hit percent in vi and-- 1260 01:15:00,444 --> 01:15:02,110 what's the corresponding thing in Emacs? 1261 01:15:02,110 --> 01:15:03,580 I forget. 1262 01:15:03,580 --> 01:15:05,690 You go to the matching close paren-- 1263 01:15:05,690 --> 01:15:07,174 sorry, open paren. 1264 01:15:07,174 --> 01:15:08,090 And then there you go. 1265 01:15:08,090 --> 01:15:09,464 You've got your previous sibling. 1266 01:15:09,464 --> 01:15:12,550 So if it's a close paren, then you 1267 01:15:12,550 --> 01:15:15,280 go to the corresponding open paren. 1268 01:15:15,280 --> 01:15:20,140 If the previous character is an open paren, then you're done. 1269 01:15:20,140 --> 01:15:21,160 That's your parent. 1270 01:15:21,160 --> 01:15:23,890 So like here if you're at A, you go to the previous character 1271 01:15:23,890 --> 01:15:26,950 and it's an open paren, then you've just found the parent of A. 1272 01:15:26,950 --> 01:15:29,680 There was no previous sibling.
1273 01:15:29,680 --> 01:15:31,990 So in either case you end up with an open paren, 1274 01:15:31,990 --> 01:15:34,090 corresponding to either your previous sibling 1275 01:15:34,090 --> 01:15:36,400 or your parent. 1276 01:15:36,400 --> 01:15:38,620 So that's left child, right child, parent. 1277 01:15:38,620 --> 01:15:41,770 If you have this matching paren operation, 1278 01:15:41,770 --> 01:15:44,170 you can do all of these in constant time, 1279 01:15:44,170 --> 01:15:48,550 and little o of n space beyond the 2n bits 1280 01:15:48,550 --> 01:15:52,300 to write down that bit string. 1281 01:15:52,300 --> 01:15:54,190 That's not so exciting, because we just 1282 01:15:54,190 --> 01:15:57,170 reinvented the same results we had before 1283 01:15:57,170 --> 01:16:01,000 of doing left, right, and parent in constant time. 1284 01:16:01,000 --> 01:16:03,100 But what we buy out of this representation 1285 01:16:03,100 --> 01:16:06,010 is we can now do subtree size. 1286 01:16:06,010 --> 01:16:13,000 So this is a little bit trickier. 1287 01:16:13,000 --> 01:16:14,240 Let me go to another board. 1288 01:16:18,407 --> 01:16:20,510 But whereas with the level order representation 1289 01:16:20,510 --> 01:16:22,835 it was impossible, now it is possible. 1290 01:16:32,650 --> 01:16:35,590 And we're going to use subtree size I think next class, when 1291 01:16:35,590 --> 01:16:38,350 we do compact suffix trees. 1292 01:16:41,150 --> 01:16:47,310 So subtree size is what we want in the binary trie. 1293 01:16:49,880 --> 01:16:52,170 In the rooted ordered tree it's a little tricky, 1294 01:16:52,170 --> 01:16:54,420 because subtrees no longer correspond to subtrees. 1295 01:16:54,420 --> 01:16:58,050 For example, the subtree of C consists 1296 01:16:58,050 --> 01:17:03,450 of this subtree and this subtree, so it's really C-- 1297 01:17:03,450 --> 01:17:07,400 over here, it's C and all of its right siblings.
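The three navigation operations on the paren string can be sketched like this. A node is represented by the position of its open paren. The helpers `find_close` and `find_open` are plain linear scans, used here only as a stand-in for the constant-time matching-paren structure, which is not shown; all function names are made up for illustration. Returning `None` when `parent` would land on the added star node is also an assumption of this sketch.

```python
def find_close(p, i):
    """Position of the close paren matching the open paren at i (linear scan)."""
    depth = 0
    for j in range(i, len(p)):
        depth += 1 if p[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced parentheses")


def find_open(p, j):
    """Position of the open paren matching the close paren at j (linear scan)."""
    depth = 0
    for i in range(j, -1, -1):
        depth += 1 if p[i] == ")" else -1
        if depth == 0:
            return i
    raise ValueError("unbalanced parentheses")


def left_child(p, i):
    """Left child = first child = the very next character, if it is an open paren."""
    return i + 1 if p[i + 1] == "(" else None


def right_child(p, i):
    """Right child = next sibling = the character after our matching close paren."""
    j = find_close(p, i) + 1
    return j if j < len(p) and p[j] == "(" else None


def parent(p, i):
    """Parent = previous sibling if there is one, otherwise the actual parent."""
    prev = i - 1
    if p[prev] == ")":
        return find_open(p, prev)        # previous sibling
    return prev if prev > 0 else None    # position 0 is the added star node


# Paren string for the example tree; open paren positions:
# A=1, B=2, C=4, D=6, E=9, F=10, G=13.
p = "((()()())(())())"
print(left_child(p, 1), right_child(p, 1), parent(p, 9))
```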
1298 01:17:07,400 --> 01:17:20,400 So this is size of the node plus size of right siblings, 1299 01:17:20,400 --> 01:17:22,930 however many you have. 1300 01:17:22,930 --> 01:17:23,430 OK. 1301 01:17:23,430 --> 01:17:26,730 So in the rooted ordered tree, it's actually kind of messy. 1302 01:17:26,730 --> 01:17:28,800 Turns out in the balanced parenthesis 1303 01:17:28,800 --> 01:17:32,580 it's pretty clean, because all your right 1304 01:17:32,580 --> 01:17:36,000 siblings correspond to paren groups 1305 01:17:36,000 --> 01:17:37,230 that just follow each other. 1306 01:17:37,230 --> 01:17:38,550 And you want to know-- 1307 01:17:38,550 --> 01:17:43,812 so these are a bunch of siblings here of varying size. 1308 01:17:43,812 --> 01:17:45,270 And we're given, say, this sibling. 1309 01:17:45,270 --> 01:17:46,645 We want to know for this sibling, 1310 01:17:46,645 --> 01:17:49,190 up to all the ones to the right-- so there's 1311 01:17:49,190 --> 01:17:52,710 an enclosing parenthesis here for our parent 1312 01:17:52,710 --> 01:17:55,320 in this representation. 1313 01:17:55,320 --> 01:17:58,605 We want to know the length of these, so it's just-- 1314 01:18:04,130 --> 01:18:07,400 we want to take the-- here we are at this left paren. 1315 01:18:07,400 --> 01:18:16,590 We want to compute the distance to the enclosing close paren. 1316 01:18:16,590 --> 01:18:19,735 So that's here. 1317 01:18:19,735 --> 01:18:21,110 That's our enclosing close paren. 1318 01:18:21,110 --> 01:18:22,670 So here's a new operation. 1319 01:18:22,670 --> 01:18:26,585 Given a paren pair, I want to compute the enclosing paren 1320 01:18:26,585 --> 01:18:29,080 pair, these guys. 1321 01:18:29,080 --> 01:18:31,040 That can also be done in constant time with 1322 01:18:31,040 --> 01:18:33,110 rank-and-select-like techniques. 1323 01:18:33,110 --> 01:18:36,620 And then you just measure this distance and you divide by 2. 
1324 01:18:36,620 --> 01:18:39,690 That will give you the number of nodes in here. 1325 01:18:39,690 --> 01:18:41,810 It's half the number of parens. 1326 01:18:41,810 --> 01:18:43,580 That will give you subtree size. 1327 01:18:43,580 --> 01:18:45,140 And we have a couple extra seconds, 1328 01:18:45,140 --> 01:18:47,660 so another bonus is suppose you want 1329 01:18:47,660 --> 01:18:49,430 to know the number of leaves in a subtree. 1330 01:18:53,390 --> 01:18:57,350 If I recall correctly, that's something like-- 1331 01:18:57,350 --> 01:19:00,500 instead of doing this distance to the enclosing paren, 1332 01:19:00,500 --> 01:19:05,300 you do something like rank of-- 1333 01:19:05,300 --> 01:19:06,470 just rank of that. 1334 01:19:06,470 --> 01:19:18,920 Those are the number of leaves of the enclosing close paren-- 1335 01:19:18,920 --> 01:19:21,210 this is getting notationally confusing-- 1336 01:19:21,210 --> 01:19:27,700 minus the rank of here. 1337 01:19:27,700 --> 01:19:28,200 OK. 1338 01:19:28,200 --> 01:19:31,050 So I just want to compute how many "open paren, close paren" pairs 1339 01:19:31,050 --> 01:19:32,790 are there from here to here. 1340 01:19:32,790 --> 01:19:35,250 And so I just take the rank here, subtract the rank 1341 01:19:35,250 --> 01:19:36,030 here. 1342 01:19:36,030 --> 01:19:38,621 That gives me the number of leaves in that range. 1343 01:19:38,621 --> 01:19:40,120 So this is a generalization of rank. 1344 01:19:40,120 --> 01:19:42,360 Before we did rank of just a single bit. 1345 01:19:42,360 --> 01:19:44,610 This is rank of a two-bit pattern. 1346 01:19:44,610 --> 01:19:46,920 But two bits is not much harder than one bit. 1347 01:19:46,920 --> 01:19:49,260 You can very easily adapt the rank structure 1348 01:19:49,260 --> 01:19:54,870 we saw to do any two-bit pattern instead of just the one bit.
1349 01:19:54,870 --> 01:19:56,580 So that gives you the number of leaves 1350 01:19:56,580 --> 01:19:57,690 in the subtree, which corresponds 1351 01:19:57,690 --> 01:19:58,840 to the number of matches. 1352 01:19:58,840 --> 01:20:00,975 So you can do lots of fun things like this. 1353 01:20:00,975 --> 01:20:03,260 This representation is super powerful 1354 01:20:03,260 --> 01:20:05,940 and we'll use it next time.
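[Editor's sketch, not part of the lecture.] The subtree-size and leaf-count computations described above can be illustrated with a small Python example. It is only a sketch under stated assumptions: the balanced-parentheses string includes the pair for the artificial root of the rooted ordered tree, a node is identified by the index of its open paren, and the helper names (enclosing_close, rank_leaf_pattern, subtree_size, leaf_count) are hypothetical. The linear scans stand in for the constant-time, o(n)-extra-space structures mentioned in the lecture. The leaf count here counts "()" patterns, that is, childless nodes of the rooted ordered tree in the range.

```python
def enclosing_close(parens: str, i: int) -> int:
    """Index of the close paren of the pair enclosing the open paren at i.

    Naive linear scan (assumption: stands in for the constant-time
    enclosing-pair operation described in the lecture).
    """
    depth = 0
    for j in range(i, len(parens)):
        depth += 1 if parens[j] == "(" else -1
        if depth < 0:  # we just closed the parent of the node at i
            return j
    raise ValueError("open paren at i has no enclosing pair")


def rank_leaf_pattern(parens: str, k: int) -> int:
    """Number of "()" patterns that start at an index strictly less than k."""
    return sum(parens[p:p + 2] == "()" for p in range(k))


def subtree_size(parens: str, i: int) -> int:
    """Binary-trie subtree size of the node whose open paren is at i.

    The node and all of its right siblings occupy parens[i:j], where j is
    the enclosing close paren; each node contributes two parens.
    """
    j = enclosing_close(parens, i)
    return (j - i) // 2


def leaf_count(parens: str, i: int) -> int:
    """Number of "()" patterns between the open paren at i and the
    enclosing close paren, computed as a difference of two ranks."""
    j = enclosing_close(parens, i)
    return rank_leaf_pattern(parens, j) - rank_leaf_pattern(parens, i)


# Rooted ordered tree: artificial root R with children A and B; B has child C.
# Depth-first traversal gives R( A() B( C() ) ) = "(()(()))".
# Open paren indices: R = 0, A = 1, B = 3, C = 4.
PARENS = "(()(()))"
print(subtree_size(PARENS, 1))  # A plus right sibling B and B's child C
print(leaf_count(PARENS, 1))    # "()" patterns for A and C
```

In this example the binary trie is A with right child B, and B with left child C, so the binary-trie subtree of A has three nodes, which matches half the distance from A's open paren to the root's close paren.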