1 00:00:00,090 --> 00:00:02,490 The following content is provided under a Creative 2 00:00:02,490 --> 00:00:04,030 Commons license. 3 00:00:04,030 --> 00:00:06,360 Your support will help MIT OpenCourseWare 4 00:00:06,360 --> 00:00:10,720 continue to offer high quality educational resources for free. 5 00:00:10,720 --> 00:00:13,320 To make a donation, or view additional materials 6 00:00:13,320 --> 00:00:17,280 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,280 --> 00:00:18,450 at ocw.mit.edu. 8 00:00:21,139 --> 00:00:22,680 ERIK DEMAINE: All right, welcome back 9 00:00:22,680 --> 00:00:26,580 to Succinct Data Structures, part two of two. 10 00:00:26,580 --> 00:00:29,610 Today we're going to take all the stuff we know about tries 11 00:00:29,610 --> 00:00:32,910 and apply them to the main motivating application, which 12 00:00:32,910 --> 00:00:35,190 is suffix trees. 13 00:00:35,190 --> 00:00:37,320 And as we know, suffix trees and suffix arrays 14 00:00:37,320 --> 00:00:39,510 are more or less equivalent. 15 00:00:39,510 --> 00:00:43,524 But if you build one, you can build the other. 16 00:00:43,524 --> 00:00:44,940 But what we're going to show today 17 00:00:44,940 --> 00:00:47,500 is they're equivalent also from a space perspective. 18 00:00:47,500 --> 00:00:49,130 That will be the last topic. 19 00:00:49,130 --> 00:00:52,980 If you can succinctly represent a suffix array, 20 00:00:52,980 --> 00:00:56,700 then you can transform-- with a little o of n extra space, 21 00:00:56,700 --> 00:00:59,430 you can make a suffix tree as well 22 00:00:59,430 --> 00:01:02,920 and do searches in roughly the time we're used to, 23 00:01:02,920 --> 00:01:05,844 which is p plus size of output. 24 00:01:05,844 --> 00:01:07,260 It's not going to be exactly that. 25 00:01:07,260 --> 00:01:09,660 We're going to lose like log to the epsilons and such. 
26 00:01:09,660 --> 00:01:12,630 But that's mostly caused-- this transformation only 27 00:01:12,630 --> 00:01:17,240 incurs like an additive log, log, log, log, log, log n. 28 00:01:17,240 --> 00:01:19,800 You could have as many logs as you want. 29 00:01:19,800 --> 00:01:23,700 Take any arbitrarily slowly growing function, 30 00:01:23,700 --> 00:01:24,940 that it will-- 31 00:01:24,940 --> 00:01:27,610 your space bound gets closer and closer to linear. 32 00:01:27,610 --> 00:01:29,580 Anyway, that's what we'll get to at the end. 33 00:01:29,580 --> 00:01:32,360 The bulk of the lecture will be on building suffix arrays. 34 00:01:32,360 --> 00:01:34,290 Here we're going to lose a log to the epsilon 35 00:01:34,290 --> 00:01:36,810 time in the query. 36 00:01:36,810 --> 00:01:40,901 And we're going to start out improving down to T log log T 37 00:01:40,901 --> 00:01:41,400 space. 38 00:01:41,400 --> 00:01:43,500 Our baseline is T log T space. 39 00:01:43,500 --> 00:01:46,020 That's a normal-- if you just stored a suffix array 40 00:01:46,020 --> 00:01:47,850 as a bunch of numbers. 41 00:01:47,850 --> 00:01:51,600 First we'll add another log, then we'll get down to linear. 42 00:01:51,600 --> 00:01:56,230 That gives us a compact suffix tree, or sorry, suffix array. 43 00:01:56,230 --> 00:01:59,490 It's also known how to do succinct suffix arrays. 44 00:01:59,490 --> 00:02:03,270 But there are dozens of papers on this topic, it's kind 45 00:02:03,270 --> 00:02:05,970 of a big field all to itself. 46 00:02:05,970 --> 00:02:09,210 And a lot of the techniques are pretty complicated. 47 00:02:09,210 --> 00:02:13,140 So I'm going to try to keep it to the bare minimum we can do, 48 00:02:13,140 --> 00:02:15,222 that will give us linear-- 49 00:02:15,222 --> 00:02:17,670 linear number of bits of space, giving us a compact data 50 00:02:17,670 --> 00:02:18,720 structure. 
51 00:02:18,720 --> 00:02:21,390 But before we go to those data structures, 52 00:02:21,390 --> 00:02:23,700 I want to give you a little survey of what's known. 53 00:02:26,680 --> 00:02:34,589 So compact suffix arrays, and trees, start out with-- 54 00:02:34,589 --> 00:02:36,630 I'm going to start out with the original results. 55 00:02:36,630 --> 00:02:39,088 And then I'll jump to sort of the latest results, which are 56 00:02:39,088 --> 00:02:40,515 getting things to be succinct. 57 00:02:43,810 --> 00:02:48,660 So the first result on this topic that got a compact suffix 58 00:02:48,660 --> 00:02:52,620 array, is by Grossi and Vitter. 59 00:02:52,620 --> 00:02:55,710 This was in 2000, spring of 2000. 60 00:02:55,710 --> 00:02:58,530 And let me tell you the bounds that they achieve. 61 00:02:58,530 --> 00:03:01,560 This is actually the solution that we're going to look at. 62 00:03:26,340 --> 00:03:28,830 So this is the first space bound. 63 00:03:31,670 --> 00:03:35,000 I guess the big term here is T log sigma. 64 00:03:35,000 --> 00:03:38,030 That's how many bits it takes just to write down the text. 65 00:03:38,030 --> 00:03:41,480 So this is what you might call optimal, in this world. 66 00:03:41,480 --> 00:03:42,980 I mean, if you have random text, you 67 00:03:42,980 --> 00:03:44,720 need that many bits to write it down. 68 00:03:44,720 --> 00:03:46,610 So there's 1 times that. 69 00:03:46,610 --> 00:03:49,040 We're also going to have this 1 over epsilon times that. 70 00:03:49,040 --> 00:03:50,750 And this is actually the data structure. 71 00:03:50,750 --> 00:03:52,541 It's going to store the text, and then it's 72 00:03:52,541 --> 00:03:55,730 going to add on a data structure of 1 over epsilon times that. 73 00:03:55,730 --> 00:04:01,326 So it's order-- order OPT. There are some lower-order terms. 
74 00:04:01,326 --> 00:04:03,200 We won't actually have this lower-order term, 75 00:04:03,200 --> 00:04:06,860 because I'm going to focus on binary alphabet here. 76 00:04:06,860 --> 00:04:07,887 Keep it simple. 77 00:04:07,887 --> 00:04:09,470 But if you have a non-binary alphabet, 78 00:04:09,470 --> 00:04:12,410 they have another order T bits, and so on. 79 00:04:12,410 --> 00:04:14,210 But you get to control this constant. 80 00:04:14,210 --> 00:04:17,060 This will work for any epsilon between 0 and 1. 81 00:04:19,850 --> 00:04:22,910 And why are you interested in a small epsilon? 82 00:04:22,910 --> 00:04:27,060 Because if epsilon is small, this space bound goes up. 83 00:04:27,060 --> 00:04:29,295 Well, that happens in the query bound. 84 00:04:48,380 --> 00:04:51,590 So in the query bound, there's this multiplicative log 85 00:04:51,590 --> 00:04:55,130 to the epsilon of T. So if you really want queries to go fast, 86 00:04:55,130 --> 00:04:57,290 you don't want to pay a big polylog here, 87 00:04:57,290 --> 00:04:59,490 then you're going to have to pay for it in space. 88 00:04:59,490 --> 00:05:00,781 So those are the same epsilons. 89 00:05:05,150 --> 00:05:08,690 In the Grossi-Vitter paper, they only multiply this 90 00:05:08,690 --> 00:05:09,920 by the size of the output. 91 00:05:09,920 --> 00:05:11,597 So if you want to just output one guy, 92 00:05:11,597 --> 00:05:13,430 you only pay an additive log to the epsilon. 93 00:05:13,430 --> 00:05:14,900 If you want to output all the matches 94 00:05:14,900 --> 00:05:16,775 you have to pay a number of matches times log 95 00:05:16,775 --> 00:05:17,690 to the epsilon. 96 00:05:17,690 --> 00:05:20,330 They achieve the P bound. 97 00:05:20,330 --> 00:05:23,050 In fact, they do a little bit better than order P query. 
98 00:05:23,050 --> 00:05:26,150 On a RAM, you can hope to do-- 99 00:05:26,150 --> 00:05:33,170 save a log factor by reading log base sigma of T, of the letters 100 00:05:33,170 --> 00:05:34,950 in one word operation. 101 00:05:34,950 --> 00:05:36,900 So I'm not going to go into how to do this-- 102 00:05:36,900 --> 00:05:41,257 I'm going to cover this paper today, or a simplification 103 00:05:41,257 --> 00:05:41,840 of this paper. 104 00:05:41,840 --> 00:05:43,560 You might say, throw away. 105 00:05:43,560 --> 00:05:46,250 I'm going to get a slightly worse bounds than this. 106 00:05:46,250 --> 00:05:51,170 Space bound will be the same, but I'm not going to-- 107 00:05:51,170 --> 00:05:53,300 I'm not going to worry about this log factor. 108 00:05:53,300 --> 00:05:55,490 And in fact, both P and output are 109 00:05:55,490 --> 00:05:59,540 going to be multiplied by log to the epsilon. 110 00:05:59,540 --> 00:06:01,700 So I won't achieve quite the best query bound, 111 00:06:01,700 --> 00:06:03,590 but same space bound, just to give you 112 00:06:03,590 --> 00:06:06,800 an idea of how it works. 113 00:06:06,800 --> 00:06:12,170 The next result-- yeah, I'll go to another board. 114 00:06:12,170 --> 00:06:14,480 These bounds are a bit big, as you see. 115 00:06:17,090 --> 00:06:19,860 The next result, which was done later in the same year. 116 00:06:19,860 --> 00:06:21,830 So these are probably discovered, 117 00:06:21,830 --> 00:06:25,690 basically at the same time. 118 00:06:25,690 --> 00:06:29,840 Because writing a paper takes probably a year or so. 119 00:06:29,840 --> 00:06:31,550 So they were being done in parallel, 120 00:06:31,550 --> 00:06:34,260 and then this was published in the spring of 2000. 121 00:06:34,260 --> 00:06:36,899 This was published in the fall of 2000. 122 00:06:36,899 --> 00:06:37,940 It's called the FM-index. 
123 00:06:40,640 --> 00:06:43,754 And it achieves this bound, which 124 00:06:43,754 --> 00:06:45,545 is going to take a little while to explain. 125 00:07:09,270 --> 00:07:11,450 OK. 126 00:07:11,450 --> 00:07:16,160 Think of this right now, as this is T log sigma. 127 00:07:16,160 --> 00:07:20,254 Ignore this H. This is entropy stuff. 128 00:07:20,254 --> 00:07:21,920 But if you think of this as T log sigma, 129 00:07:21,920 --> 00:07:24,470 we're getting 5 times T log sigma, 130 00:07:24,470 --> 00:07:27,800 plus some lower-order term. 131 00:07:27,800 --> 00:07:30,046 So it's a little less flexible over here. 132 00:07:30,046 --> 00:07:31,670 We kind of got to control the constant. 133 00:07:31,670 --> 00:07:35,600 Anything greater or equal to 2 would be all right. 134 00:07:35,600 --> 00:07:37,580 Over here, it's always at least 5. 135 00:07:37,580 --> 00:07:39,170 This has since been improved. 136 00:07:39,170 --> 00:07:41,750 I'm just telling you the historical-- 137 00:07:41,750 --> 00:07:45,410 these days people can get down to at least 4 or so. 138 00:07:45,410 --> 00:07:46,470 Actually, get down to 1. 139 00:07:46,470 --> 00:07:49,730 We'll talk about it in a moment. 140 00:07:49,730 --> 00:07:51,380 Before I get to the Hk part, I want 141 00:07:51,380 --> 00:07:53,180 to talk about the lower-order term. 142 00:07:53,180 --> 00:07:55,340 There's some scary parts like this. 143 00:07:55,340 --> 00:07:58,430 If sigma is at all large, this is big trouble. 144 00:07:58,430 --> 00:08:00,320 Or even sigma log n-- 145 00:08:00,320 --> 00:08:03,110 this is super polynomial. 146 00:08:03,110 --> 00:08:05,960 So this cannot handle very large sigma, 147 00:08:05,960 --> 00:08:07,580 whereas this solution can. 148 00:08:07,580 --> 00:08:11,930 And other structures can, but this is an early result. 149 00:08:11,930 --> 00:08:14,990 This also gets bad when sigma's very large. 
150 00:08:14,990 --> 00:08:17,540 Even bigger-- when sigma's bigger than log log T, 151 00:08:17,540 --> 00:08:20,660 then this starts to dominate. 152 00:08:20,660 --> 00:08:23,779 OK, but for sigma small, think binary alphabets, whatever. 153 00:08:23,779 --> 00:08:25,820 This is good, and in many ways is actually better 154 00:08:25,820 --> 00:08:26,750 than T log sigma. 155 00:08:26,750 --> 00:08:31,750 So let me tell you about this Hk of T thing. 156 00:08:31,750 --> 00:08:34,985 This is what's called k-th order empirical entropy. 157 00:08:45,770 --> 00:08:49,645 Maybe I should start with an aside of 0-th order entropy, 158 00:08:49,645 --> 00:08:51,020 because we haven't talked about-- 159 00:08:51,020 --> 00:08:53,450 I guess we talked about entropy in the context of binary search 160 00:08:53,450 --> 00:08:53,949 trees. 161 00:08:53,949 --> 00:08:55,820 We said, oh, if you've got-- 162 00:08:55,820 --> 00:08:59,300 if you access item i with probability P i, 163 00:08:59,300 --> 00:09:07,450 then there's this entropy bound, which is sum of P log 1/P. 164 00:09:07,450 --> 00:09:11,630 So I don't know, let's call this character x. 165 00:09:11,630 --> 00:09:23,420 So if you-- let's see, you have H0 substring s. 166 00:09:23,420 --> 00:09:30,030 You sum over all characters in the alphabet, 167 00:09:30,030 --> 00:09:33,380 of the probability-- this is not really a probability. 168 00:09:33,380 --> 00:09:40,490 This is going to be the number of x's in s, divided 169 00:09:40,490 --> 00:09:41,900 by the length of s. 170 00:09:41,900 --> 00:09:43,810 This is what's called empirical probability. 171 00:09:43,810 --> 00:09:46,167 It's what you observe from this string. 172 00:09:46,167 --> 00:09:48,000 There's this many occurrences in the string. 173 00:09:48,000 --> 00:09:49,220 You divide by the length of the string. 174 00:09:49,220 --> 00:09:50,660 That's kind of like a probability. 175 00:09:50,660 --> 00:09:52,390 It's scaled to be like a probability. 
176 00:09:52,390 --> 00:09:53,760 It's between 0 and 1. 177 00:09:53,760 --> 00:09:57,410 And if you take sum of P log 1/P, that gives you a bound. 178 00:09:57,410 --> 00:10:01,950 And this is the bound achieved by say, Huffman coding, 179 00:10:01,950 --> 00:10:04,542 or the optimal code. 180 00:10:04,542 --> 00:10:06,500 If all you're allowed to do is give a code word 181 00:10:06,500 --> 00:10:08,660 for each letter of the alphabet, and then you 182 00:10:08,660 --> 00:10:10,820 write down a binary code word for each letter 183 00:10:10,820 --> 00:10:11,750 of the alphabet. 184 00:10:11,750 --> 00:10:14,870 And you write that down for each letter in s, then you achieve-- 185 00:10:14,870 --> 00:10:17,480 I guess Huffman codes achieve ceiling of this. 186 00:10:17,480 --> 00:10:19,610 If you want to achieve exactly that bound, 187 00:10:19,610 --> 00:10:23,660 you can use arithmetic coding, but we're not 188 00:10:23,660 --> 00:10:25,850 going to get into those kinds of details. 189 00:10:25,850 --> 00:10:30,740 So if you used what's called a 0-th order code, where you just 190 00:10:30,740 --> 00:10:34,220 have a code for each character of the alphabet, 191 00:10:34,220 --> 00:10:37,400 then the space bound you would achieve is H0 of s, 192 00:10:37,400 --> 00:10:41,460 times the number of characters in s. 193 00:10:41,460 --> 00:10:45,800 So that would be if you substituted k equals 0 here. 194 00:10:45,800 --> 00:10:46,800 So that's kind of neat. 195 00:10:46,800 --> 00:10:50,030 This is a compressed representation of the string. 196 00:10:50,030 --> 00:10:53,150 Over here, we just wrote down the string. 197 00:10:53,150 --> 00:10:54,950 And if the string is incompressible, 198 00:10:54,950 --> 00:10:56,954 yeah, T log sigma is optimal. 
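The zeroth-order empirical entropy just described can be sketched in a few lines of Python. This is only an illustration of the spoken definition (sum over characters x of the empirical probability times log of its reciprocal); the function name h0 and the sample strings are assumptions made for this example, not anything from the lecture.

```python
from collections import Counter
from math import log2


def h0(s):
    """Zeroth-order empirical entropy of s, in bits per character.

    For each distinct character x, the empirical probability is
    (number of x's in s) / (length of s); we sum p * log2(1 / p).
    """
    n = len(s)
    if n == 0:
        return 0.0
    return sum((c / n) * log2(n / c) for c in Counter(s).values())


# A string with a single repeated character carries no information,
# while two equally frequent characters need one bit per character.
print(h0("aaaa"))
print(h0("abab"))

# Space bound of a zeroth-order code: H0(s) times the length of s, in bits.
print(len("abab") * h0("abab"))
```

Multiplying h0(s) by len(s) gives the number of bits a zeroth-order code would need for the whole string, which is the quantity compared against T log sigma above.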
199 00:10:56,954 --> 00:10:59,120 But if the string is compressible, like many strings 200 00:10:59,120 --> 00:11:01,328 we want to store-- you're storing English, whatever-- 201 00:11:01,328 --> 00:11:04,610 you should save somewhere between a factor of 2 and 10. 202 00:11:04,610 --> 00:11:06,110 This will try to save it. 203 00:11:06,110 --> 00:11:09,530 Of course, factor between 2 and 10 is not-- 204 00:11:09,530 --> 00:11:12,020 is a little scary, when there's this factor 5 out here. 205 00:11:12,020 --> 00:11:14,420 That might dominate whatever savings you get. 206 00:11:14,420 --> 00:11:17,420 But in theory, this could be a lot better. 207 00:11:17,420 --> 00:11:20,090 And this is just the first result in this series. 208 00:11:20,090 --> 00:11:23,240 Now we can get 1 times Hk of T, and then it's 209 00:11:23,240 --> 00:11:27,960 a lot more interesting. 210 00:11:27,960 --> 00:11:30,020 OK, so that was 0-th order entropy. 211 00:11:30,020 --> 00:11:32,210 What's this k-th order entropy business? 212 00:11:32,210 --> 00:11:34,790 Essentially, it's about taking-- instead of writing a code 213 00:11:34,790 --> 00:11:37,520 word for a single letter, you can write a code word 214 00:11:37,520 --> 00:11:43,670 for a letter that depends on the previous k characters. 215 00:11:43,670 --> 00:11:47,170 So I'm going to write down a definition. 216 00:11:47,170 --> 00:11:54,970 Hk of T is going to be the sum over all words of length k. 217 00:11:54,970 --> 00:11:58,970 This is going to be our context of the probability, 218 00:11:58,970 --> 00:12:09,740 or empirical probability of w occurring times the 0-th order 219 00:12:09,740 --> 00:12:24,005 entropy of the string of successor characters of w. 220 00:12:26,510 --> 00:12:29,660 So again, the empirical probability of w occurring 221 00:12:29,660 --> 00:12:37,730 is the number of occurrences of w, divided by T, basically. 
222 00:12:37,730 --> 00:12:44,570 So the idea is, now you get to encode a character depending 223 00:12:44,570 --> 00:12:47,180 on the context of the last k characters. 224 00:12:47,180 --> 00:12:50,150 So we're summing over all possible contexts 225 00:12:50,150 --> 00:12:54,260 of k characters, and we're taking the expectation 226 00:12:54,260 --> 00:12:58,040 over all possible context w. 227 00:12:58,040 --> 00:13:01,070 That's the sum of the probabilities times something. 228 00:13:01,070 --> 00:13:05,540 And then condition on w being the context, the last k 229 00:13:05,540 --> 00:13:06,440 characters. 230 00:13:06,440 --> 00:13:10,790 We want to measure what characters follow that. 231 00:13:10,790 --> 00:13:13,130 And there, we can use a 0-th order encoding. 232 00:13:13,130 --> 00:13:15,140 I mean, we've already conditioned 233 00:13:15,140 --> 00:13:17,940 on w being right there. 234 00:13:17,940 --> 00:13:21,170 So for all occurrences of w, you look at the next character 235 00:13:21,170 --> 00:13:23,660 right after it, and you take 0-th order entropy 236 00:13:23,660 --> 00:13:26,600 of that, that's called k-th order entropy. 237 00:13:26,600 --> 00:13:31,230 OK, you have to think about it for a while, too. 238 00:13:31,230 --> 00:13:34,520 But this essentially means the best, 239 00:13:34,520 --> 00:13:37,400 you can prove this is the best encoding you can do, 240 00:13:37,400 --> 00:13:40,520 if the codeword of a letter can depend on the previous k 241 00:13:40,520 --> 00:13:41,540 characters. 242 00:13:41,540 --> 00:13:44,000 Of course, if you have such a code it's easy to decompress, 243 00:13:44,000 --> 00:13:46,620 because as you're decompressing, you know what the previous k 244 00:13:46,620 --> 00:13:48,780 characters were. 245 00:13:48,780 --> 00:13:51,980 OK, interesting thing about this index or this data structure, 246 00:13:51,980 --> 00:13:54,620 is it's independent of k. 
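The k-th order definition can be sketched the same way: group the text by the k characters of context, collect the character that follows each occurrence, and average the zeroth-order entropies weighted by how often each context occurs. The names hk and followers are hypothetical, and dividing by the full text length follows the "divided by T, basically" convention; treat this as a sketch under those assumptions.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """Zeroth-order empirical entropy of a sequence, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum((c / n) * log2(n / c) for c in Counter(s).values())


def hk(t, k):
    """k-th order empirical entropy of text t, in bits per character.

    For every context w of length k, collect the characters that
    immediately follow an occurrence of w, then weight the zeroth-order
    entropy of that collection by the empirical probability of w.
    """
    n = len(t)
    if n == 0:
        return 0.0
    followers = defaultdict(list)
    for i in range(n - k):
        followers[t[i:i + k]].append(t[i + k])
    return sum(len(f) / n * h0(f) for f in followers.values())


# With k = 0 the only context is the empty string, so hk reduces to h0.
print(hk("abababab", 0))
# With one character of context, the next character is fully determined.
print(hk("abababab", 1))
```

The alternating string shows why context helps: its zeroth-order entropy is one bit per character, but once the previous character is known nothing is left to encode.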
247 00:13:54,620 --> 00:13:57,080 The data structure doesn't know what k is. 248 00:13:57,080 --> 00:14:00,200 This works for all k. 249 00:14:00,200 --> 00:14:05,032 For any fixed k-- k has to be constant here. 250 00:14:05,032 --> 00:14:07,490 There are other data structures where k can be logarithmic, 251 00:14:07,490 --> 00:14:08,870 or so. 252 00:14:08,870 --> 00:14:11,020 But here, we'll think of k as a constant. 253 00:14:11,020 --> 00:14:14,150 And so this is really a neat thing about compression. 254 00:14:14,150 --> 00:14:17,750 There's a technique called the Burrows-Wheeler transform. 255 00:14:17,750 --> 00:14:20,431 And Lempel-Ziv does similar things. 256 00:14:20,431 --> 00:14:22,430 You may have heard of those compression schemes. 257 00:14:22,430 --> 00:14:24,430 They're used in bzip, and things like-- bzip 258 00:14:24,430 --> 00:14:28,190 is named after Burrows-Wheeler, I believe. 259 00:14:28,190 --> 00:14:32,150 And those compression schemes achieve 260 00:14:32,150 --> 00:14:35,330 Hk of T bits per character-- 261 00:14:35,330 --> 00:14:37,220 so Hk of T times T-- 262 00:14:37,220 --> 00:14:39,720 for all k. 263 00:14:39,720 --> 00:14:42,410 So if your text is really good, given 264 00:14:42,410 --> 00:14:45,320 the context of the last five letters, or three letters. 265 00:14:45,320 --> 00:14:49,110 In some sense, the compression scheme adapts to that. 266 00:14:49,110 --> 00:14:53,360 So this is what we call a self index, in that this also 267 00:14:53,360 --> 00:14:54,260 stores the string. 268 00:14:54,260 --> 00:14:56,690 You can read the data of the string. 269 00:14:56,690 --> 00:14:59,360 And so whereas over here, we just 270 00:14:59,360 --> 00:15:00,920 stored the string uncompressed. 271 00:15:00,920 --> 00:15:02,930 Here we're effectively storing the string 272 00:15:02,930 --> 00:15:05,450 in a compressed form, and the data structure 273 00:15:05,450 --> 00:15:06,920 is similarly compressed. 
274 00:15:06,920 --> 00:15:08,990 So if your string is compressible by more 275 00:15:08,990 --> 00:15:13,600 than a factor of 5, this will be really good. 276 00:15:13,600 --> 00:15:17,390 And that's the FM-index bound. 277 00:15:17,390 --> 00:15:21,530 Now that you have that Hk stuff, it's a lot easier 278 00:15:21,530 --> 00:15:25,010 to state all other results. 279 00:15:25,010 --> 00:15:28,370 So we have-- oh, I didn't give a query bound. 280 00:15:28,370 --> 00:15:30,410 That was the space. 281 00:15:30,410 --> 00:15:46,070 Query is P plus size of output times log to the epsilon T. 282 00:15:46,070 --> 00:15:49,520 So, similar to this one, but we don't 283 00:15:49,520 --> 00:15:54,470 have this trick over here. 284 00:15:54,470 --> 00:15:56,880 Another early result is by Sadakane. 285 00:16:00,205 --> 00:16:10,510 I think also, maybe 2001, I have the journal referenced as 2003. 286 00:16:10,510 --> 00:16:13,737 This is in some ways better, some ways worse, 287 00:16:13,737 --> 00:16:15,695 it's kind of incomparable to the other results. 288 00:16:32,380 --> 00:16:40,435 This is bits, and then the query has an extra log factor. 289 00:16:48,840 --> 00:16:50,439 This is again, another early result 290 00:16:50,439 --> 00:16:51,480 that I want to highlight. 291 00:16:51,480 --> 00:16:53,520 Now I'm going to start skipping results. 292 00:16:53,520 --> 00:16:55,320 The main innovation here, is that it 293 00:16:55,320 --> 00:16:57,690 works well for large alphabets. 294 00:16:57,690 --> 00:17:00,030 This is a very small dependence on sigma, 295 00:17:00,030 --> 00:17:02,841 whereas-- as I mentioned, this structure really 296 00:17:02,841 --> 00:17:04,424 doesn't work well for large alphabets. 297 00:17:04,424 --> 00:17:06,636 Here we're getting-- not getting k-th order entropy, 298 00:17:06,636 --> 00:17:08,010 we're getting 0-th order entropy. 299 00:17:08,010 --> 00:17:12,270 It's a somewhat weaker result. 
The dependence on epsilon 300 00:17:12,270 --> 00:17:13,349 is more like this one. 301 00:17:16,680 --> 00:17:19,079 But if you just want a log factor here, 302 00:17:19,079 --> 00:17:21,690 then this is a 1 plus epsilon times H0. 303 00:17:21,690 --> 00:17:23,490 So in that sense, we're doing better-- 304 00:17:23,490 --> 00:17:26,880 only a 1 plus epsilon, which is better. 305 00:17:26,880 --> 00:17:30,300 This thing was always at least 2. 306 00:17:30,300 --> 00:17:33,000 This thing was always at least 5, the lead constant. 307 00:17:33,000 --> 00:17:34,980 Here the lead constant can be 1 plus epsilon. 308 00:17:34,980 --> 00:17:38,010 This is almost succinct, but not quite. 309 00:17:38,010 --> 00:17:40,050 It doesn't quite compress as well-- 310 00:17:40,050 --> 00:17:41,520 it only uses 0-th order entropy-- 311 00:17:41,520 --> 00:17:43,140 but that's still not bad. 312 00:17:43,140 --> 00:17:44,610 And then the other big innovation 313 00:17:44,610 --> 00:17:47,670 is the dependence on sigma is small. 314 00:17:47,670 --> 00:17:49,800 The query is a little bit worse. 315 00:17:53,050 --> 00:17:55,710 OK, now fast forward a little bit. 316 00:17:58,620 --> 00:18:00,750 I want to talk about succinct data structures 317 00:18:00,750 --> 00:18:04,785 for suffix-tree-like queries. 318 00:18:07,350 --> 00:18:10,050 So there's two succinct data structures out there, 319 00:18:10,050 --> 00:18:13,620 with more or less the same authors as the first two 320 00:18:13,620 --> 00:18:15,480 results I talked about. 321 00:18:15,480 --> 00:18:19,050 So Grossi and Vitter, together with Gupta, 322 00:18:19,050 --> 00:18:27,510 can get Hk of T times T, which is optimal even 323 00:18:27,510 --> 00:18:33,390 with compression, with k-th order compression. 324 00:18:33,390 --> 00:18:34,830 And a good dependence on sigma. 325 00:18:38,440 --> 00:18:39,170 Yeah, I guess-- 326 00:18:39,170 --> 00:18:42,330 T log sigma is the uncompressed bound. 
327 00:18:42,330 --> 00:18:43,607 So you have to worry about-- 328 00:18:43,607 --> 00:18:45,190 when you're talking about compression, 329 00:18:45,190 --> 00:18:47,070 so here we have the optimal bound 330 00:18:47,070 --> 00:18:50,670 using k-th order entropy with a lead constant of 1, 331 00:18:50,670 --> 00:18:51,690 so that's great. 332 00:18:51,690 --> 00:18:53,220 That's what makes it succinct. 333 00:18:53,220 --> 00:18:54,840 As long as this is little o of that. 334 00:18:54,840 --> 00:18:58,170 This is going to be a little o of that, as long as Hk of T 335 00:18:58,170 --> 00:19:00,940 is not too small. 336 00:19:00,940 --> 00:19:06,420 If it's like 1 over log T, then actually this term dominates. 337 00:19:06,420 --> 00:19:11,080 But as long as it's bigger than log log T over log T, 338 00:19:11,080 --> 00:19:13,810 this thing, then you're fine. 339 00:19:13,810 --> 00:19:16,420 Just as long as you're not compressing a huge amount, 340 00:19:16,420 --> 00:19:18,541 then this will be lower-order. 341 00:19:21,940 --> 00:19:23,010 Sorry, query time. 342 00:19:23,010 --> 00:19:25,860 Query's a little bit worse, though. 343 00:19:25,860 --> 00:19:30,240 We have a log term with a P, only a log sigma, 344 00:19:30,240 --> 00:19:35,050 but then we also have this log squared over log log. 345 00:19:35,050 --> 00:19:41,280 Times log sigma, and here I haven't-- 346 00:19:41,280 --> 00:19:44,180 there isn't a clear dependence on the size of the output. 347 00:19:44,180 --> 00:19:46,140 So this is-- let's say size of output is 1. 348 00:19:46,140 --> 00:19:49,265 You just want to find one match. 349 00:19:49,265 --> 00:19:51,690 I won't write this dependence on the size of the output. 350 00:19:51,690 --> 00:19:54,065 My guess is this is multiplied by the size of the output, 351 00:19:54,065 --> 00:19:56,440 but it's not stated explicitly in the paper, 352 00:19:56,440 --> 00:19:58,180 so I want to be careful. 
353 00:19:58,180 --> 00:20:02,460 So we have a polylog additive slowdown here. 354 00:20:02,460 --> 00:20:04,670 So it's a little bit worse in time, 355 00:20:04,670 --> 00:20:06,420 but this space is obviously, a lot better. 356 00:20:06,420 --> 00:20:12,840 We've improved our constant factor from 5, over here, to 1. 357 00:20:12,840 --> 00:20:16,890 OK, and then there's one more paper 358 00:20:16,890 --> 00:20:29,075 I want to mention, by Ferragina, Manzini, Makinen, and Navarro. 359 00:20:32,160 --> 00:20:38,880 This is from just five years ago now, 2007. 360 00:20:38,880 --> 00:20:47,890 They also achieved 1 times Hk of T times T as the lead term. 361 00:20:47,890 --> 00:20:53,160 And they get T divided by log to the epsilon n, so this is-- 362 00:20:56,040 --> 00:20:59,370 yes, it's slight, there's probably a log sigma here, too. 363 00:20:59,370 --> 00:21:03,190 I'm not sure, it might just be T. Probably just T, actually. 364 00:21:03,190 --> 00:21:07,210 So we get rid of the log sigma, but this log log over log 365 00:21:07,210 --> 00:21:08,140 gets slightly smaller. 366 00:21:08,140 --> 00:21:12,040 It's only a log to the epsilon now. 367 00:21:12,040 --> 00:21:15,550 But the query bound is a little bit better. 368 00:21:15,550 --> 00:21:18,120 So the P plus-- 369 00:21:18,120 --> 00:21:27,380 size of the output times log to the 1 plus epsilon T query. 370 00:21:30,490 --> 00:21:33,010 So instead of basically log squared, we have log to 1 371 00:21:33,010 --> 00:21:35,290 plus epsilon, slightly better. 372 00:21:35,290 --> 00:21:39,350 They also have an order P counting query. 373 00:21:39,350 --> 00:21:41,710 So if you just want to know how many matches are there, 374 00:21:41,710 --> 00:21:47,530 they can do that really fast in kind of regular time order P. 375 00:21:47,530 --> 00:21:49,039 And this is obviously very small. 
376 00:21:49,039 --> 00:21:50,830 So this is probably the best result so far, 377 00:21:50,830 --> 00:21:57,040 still obviously, lots of open problems in this world. 378 00:21:57,040 --> 00:21:58,990 Still an active area of research. 379 00:21:58,990 --> 00:22:01,769 There are papers since these, but they 380 00:22:01,769 --> 00:22:03,310 don't achieve-- the space bounds they 381 00:22:03,310 --> 00:22:04,390 achieve are not quite as good. 382 00:22:04,390 --> 00:22:06,250 There may be like 2 times Hk, and then they 383 00:22:06,250 --> 00:22:07,770 can get better query bounds. 384 00:22:07,770 --> 00:22:09,620 A lot of papers that I'm not talking about, 385 00:22:09,620 --> 00:22:12,190 there's just a few too many. 386 00:22:12,190 --> 00:22:15,550 But if you just care about space, this is the best so far. 387 00:22:15,550 --> 00:22:21,612 Or I use these two, depending on exactly how big sigma is. 388 00:22:21,612 --> 00:22:24,070 Just to mention, there's some other cool things you can do. 389 00:22:24,070 --> 00:22:28,240 So these are small space static data structures. 390 00:22:28,240 --> 00:22:30,250 Some of them can be made dynamic. 391 00:22:30,250 --> 00:22:32,440 But in particular, there's work on, 392 00:22:32,440 --> 00:22:36,100 how do you actually build these data structures with low space? 393 00:22:36,100 --> 00:22:39,070 Because you don't really want to build a huge suffix tree 394 00:22:39,070 --> 00:22:40,180 and then compress it. 395 00:22:40,180 --> 00:22:42,138 Because the whole point is you have a hard time 396 00:22:42,138 --> 00:22:43,550 storing this data structure. 397 00:22:43,550 --> 00:22:47,080 So in fact, there's some papers-- 398 00:22:47,080 --> 00:22:49,480 I think more along the lines of these original results-- 399 00:22:49,480 --> 00:22:52,630 the Grossi-Vitter, Ferragina, Manzini, and Sadakane-- 400 00:22:52,630 --> 00:22:54,820 building those data structures. 
401 00:22:54,820 --> 00:22:57,250 And while you're building the amount of working space 402 00:22:57,250 --> 00:23:01,540 is at least proportional to the size of the final data 403 00:23:01,540 --> 00:23:02,170 structure. 404 00:23:02,170 --> 00:23:04,510 So that can be done. 405 00:23:04,510 --> 00:23:06,380 We're not going to go into it here. 406 00:23:06,380 --> 00:23:08,537 There are other papers about-- 407 00:23:08,537 --> 00:23:10,120 all of these papers are focused on how 408 00:23:10,120 --> 00:23:12,130 do I do a search, how do I search for a pattern, 409 00:23:12,130 --> 00:23:13,449 find all the matches. 410 00:23:13,449 --> 00:23:15,490 There's other things you can do with suffix trees 411 00:23:15,490 --> 00:23:19,180 like, given two suffixes, you can find the longest 412 00:23:19,180 --> 00:23:21,100 common prefix of them. 413 00:23:21,100 --> 00:23:23,200 So there's papers on how to do that kind of stuff 414 00:23:23,200 --> 00:23:26,200 in the compressed regime. 415 00:23:26,200 --> 00:23:28,990 There's papers on-- or there is a paper on how to do document 416 00:23:28,990 --> 00:23:32,290 retrieval, which is a problem we looked at two lectures ago, 417 00:23:32,290 --> 00:23:33,670 in the string lecture. 418 00:23:33,670 --> 00:23:35,350 You want to find-- not all the matches, 419 00:23:35,350 --> 00:23:36,974 you want to find all the documents that 420 00:23:36,974 --> 00:23:40,360 have this substring in them. 421 00:23:40,360 --> 00:23:42,400 So that can be-- that reduces the size 422 00:23:42,400 --> 00:23:46,150 of the output in these bounds. 423 00:23:46,150 --> 00:23:49,240 That can also be done, Sadakane wrote a paper about that. 424 00:23:49,240 --> 00:23:50,316 Some work on dynamic-- 425 00:23:50,316 --> 00:23:52,690 there's actually a lot of work in implementing these data 426 00:23:52,690 --> 00:23:57,360 structures, definitely FM-index, and I believe, 427 00:23:57,360 --> 00:23:58,850 maybe the Sadakane one. 
428 00:23:58,850 --> 00:24:00,910 And maybe this-- versions of this one. 429 00:24:00,910 --> 00:24:03,160 I don't think the succinct ones have been implemented, 430 00:24:03,160 --> 00:24:04,760 although I don't know for sure. 431 00:24:04,760 --> 00:24:06,718 But there's a lot of work in implementing this, 432 00:24:06,718 --> 00:24:08,710 because people care, and indeed they're 433 00:24:08,710 --> 00:24:11,760 small and reasonably fast. 434 00:24:11,760 --> 00:24:14,170 So if you need a text index, there's 435 00:24:14,170 --> 00:24:18,490 freely available implementations of at least some of these. 436 00:24:18,490 --> 00:24:21,670 So this is one of-- 437 00:24:21,670 --> 00:24:25,350 I mean this is practical stuff, too. 438 00:24:25,350 --> 00:24:26,020 Cool. 439 00:24:26,020 --> 00:24:29,590 But as I said, I'm going to focus on the simplest I know, 440 00:24:29,590 --> 00:24:32,934 which is Grossi and Vitter. 441 00:24:32,934 --> 00:24:34,600 If you look at the paper, there are sort 442 00:24:34,600 --> 00:24:36,340 of successive improvements. 443 00:24:36,340 --> 00:24:39,280 And we're going to cover up to the point 444 00:24:39,280 --> 00:24:41,290 where we get a good space bound, and the query 445 00:24:41,290 --> 00:24:44,310 won't be quite as good. 446 00:24:44,310 --> 00:24:47,710 So that's going to be the bulk of the lecture. 447 00:24:47,710 --> 00:24:49,200 It's how to get that space bound. 448 00:24:52,230 --> 00:24:54,120 And as I mentioned, we're going to start out 449 00:24:54,120 --> 00:24:56,910 with a weaker bound, which is getting T log log T bits, 450 00:24:56,910 --> 00:25:01,050 and then we'll see how to improve that to T. 451 00:25:01,050 --> 00:25:04,440 And then we'll see how to improve it to 1 over epsilon 452 00:25:04,440 --> 00:25:05,580 times T. 453 00:25:05,580 --> 00:25:07,759 So it will be a series of improvements. 
454 00:25:11,799 --> 00:25:13,590 And we're going to start just with thinking 455 00:25:13,590 --> 00:25:16,390 about suffix arrays. 456 00:25:16,390 --> 00:25:19,110 So what is the compressed suffix array problem? 457 00:25:19,110 --> 00:25:22,500 Well, it's just that I have-- 458 00:25:22,500 --> 00:25:27,830 I want to be able to do queries of the form SA of k. 459 00:25:27,830 --> 00:25:29,580 If I imagine the suffixes in sorted order, 460 00:25:29,580 --> 00:25:30,930 what is the k-th suffix? 461 00:25:30,930 --> 00:25:32,520 Where does it begin? 462 00:25:32,520 --> 00:25:34,940 So I want to be able to represent that array. 463 00:25:34,940 --> 00:25:36,750 And using that, you could do searches, 464 00:25:36,750 --> 00:25:40,590 and later we'll see how to use that to make a suffix tree. 465 00:25:40,590 --> 00:25:44,730 But for now, that's just our goal, is to compute SA of k. 466 00:25:44,730 --> 00:25:47,400 OK, well, the idea is actually going to be very familiar. 467 00:25:47,400 --> 00:25:49,920 We saw it two lectures ago, when we did this divide 468 00:25:49,920 --> 00:25:52,200 and conquer for building a suffix array. 469 00:25:52,200 --> 00:25:55,770 We did this-- we divided the letters in our string 470 00:25:55,770 --> 00:25:58,170 by 0, 1, and 2, mod 3. 471 00:25:58,170 --> 00:26:00,050 We won't need mod 3. 472 00:26:00,050 --> 00:26:01,555 We'll just do mod 2 here. 473 00:26:01,555 --> 00:26:05,760 It won't actually matter what constant we use. 474 00:26:05,760 --> 00:26:07,830 But we're going to follow that recursion 475 00:26:07,830 --> 00:26:09,930 and use it to represent the suffix array, 476 00:26:09,930 --> 00:26:12,720 instead of using it to build it. 477 00:26:12,720 --> 00:26:16,020 So the base case, and set up some notation. 478 00:26:16,020 --> 00:26:20,730 T0 is going to represent T. The length of that string I'm 479 00:26:20,730 --> 00:26:22,290 going to call n0 or n. 
480 00:26:27,090 --> 00:26:31,590 And we have a suffix array, which 481 00:26:31,590 --> 00:26:37,530 I'm going to call SA 0, which is the suffix array of that text. 482 00:26:37,530 --> 00:26:38,820 So that's just notation. 483 00:26:38,820 --> 00:26:40,778 We're not actually storing all of those things. 484 00:26:43,230 --> 00:26:49,410 Now, the recursion is T k plus 1. 485 00:26:49,410 --> 00:26:52,770 That's going to be the next level, which is, we write-- 486 00:26:52,770 --> 00:26:55,880 we combine two letters, Tk-- 487 00:26:55,880 --> 00:26:57,580 sorry, square bracket-- 488 00:26:57,580 --> 00:27:02,970 2i comma Tk square bracket 2i plus 1. 489 00:27:02,970 --> 00:27:06,480 Combine two adjacent letters into one letter, 490 00:27:06,480 --> 00:27:12,480 and we do that for i equals 0, 1, up to n/2. 491 00:27:15,510 --> 00:27:17,820 That's our new string. 492 00:27:17,820 --> 00:27:19,440 I'm not going to sort these letters 493 00:27:19,440 --> 00:27:21,537 and remap the letters to compress the alphabet. 494 00:27:21,537 --> 00:27:23,370 I'm just going to leave those letters alone, 495 00:27:23,370 --> 00:27:24,750 as an ordered pair. 496 00:27:24,750 --> 00:27:29,370 In general, at level Tk, a single letter 497 00:27:29,370 --> 00:27:32,027 is actually 2 to the k letters. 498 00:27:32,027 --> 00:27:34,110 But still, this is a useful way to think about it, 499 00:27:34,110 --> 00:27:36,190 because it lets me think about fewer suffixes. 500 00:27:36,190 --> 00:27:38,880 Here, we only have the even suffixes, 501 00:27:38,880 --> 00:27:46,650 suffixes that begin at even positions relative to Tk. 502 00:27:46,650 --> 00:27:49,500 The size of this string, in terms of number of letters, 503 00:27:49,500 --> 00:27:51,360 is 1/2 of the original. 504 00:27:51,360 --> 00:27:56,430 So in general, this is going to be n over 2 to the k. 505 00:27:56,430 --> 00:28:01,680 And then we're interested in the suffix array SA k plus 1. 
506 00:28:01,680 --> 00:28:07,630 This is going to be just looking at the even values. 507 00:28:07,630 --> 00:28:17,340 So we extract the even entries from-- sorry, SA k. 508 00:28:17,340 --> 00:28:19,770 So if we already have SA k, we just 509 00:28:19,770 --> 00:28:22,380 take the even values that are in there. 510 00:28:22,380 --> 00:28:25,620 Those are the ones that are existing suffixes. 511 00:28:25,620 --> 00:28:28,050 Extract those, divide by 2. 512 00:28:28,050 --> 00:28:32,220 That will be the suffix array of this text. 513 00:28:32,220 --> 00:28:33,810 This is kind of backwards from how 514 00:28:33,810 --> 00:28:35,820 you would construct the thing. 515 00:28:35,820 --> 00:28:37,290 You would construct it bottom up. 516 00:28:37,290 --> 00:28:38,370 Here, we're imagining we already 517 00:28:38,370 --> 00:28:40,590 know the suffix arrays; we're just thinking about representation. 518 00:28:40,590 --> 00:28:43,440 So this is a top-down kind of definition 519 00:28:43,440 --> 00:28:45,650 of what we're trying to store. 520 00:28:45,650 --> 00:28:48,399 OK, so this is what we want to do. 521 00:28:48,399 --> 00:28:50,190 Now we are going to build things bottom up. 522 00:28:50,190 --> 00:28:51,689 We're going to imagine we've already 523 00:28:51,689 --> 00:28:54,210 represented SA k plus 1. 524 00:28:54,210 --> 00:28:57,060 And now we need to represent SA k. 525 00:28:57,060 --> 00:29:04,050 If we can represent SA k in terms of SA k plus 1 526 00:29:04,050 --> 00:29:06,420 with not too many bits, then you add up 527 00:29:06,420 --> 00:29:08,114 all of the levels of recursion. 528 00:29:08,114 --> 00:29:10,530 We'll have to talk about how many levels of this recursion 529 00:29:10,530 --> 00:29:11,113 we need to do. 530 00:29:11,113 --> 00:29:13,470 We're not going to go down to constant size. 531 00:29:13,470 --> 00:29:17,310 We'll just go log log n levels.
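[Editor's sketch, not part of the lecture.] The top-down definition just given (pair up adjacent letters to get T k plus 1, then keep the even values of SA k and divide by 2) can be checked on a small example. This is a minimal sketch under stated assumptions: the helper names `suffix_array` and `pair_up` are hypothetical, the suffix array is built naively by sorting, the empty suffix is included as position n so that every odd position has a successor, and the example text has even length.

```python
def suffix_array(text):
    """Naive suffix array: start positions 0..len(text) in sorted suffix order.

    Assumption of this sketch: the empty suffix (position len(text)) is
    included; it sorts first.
    """
    return sorted(range(len(text) + 1), key=lambda p: text[p:])


def pair_up(text):
    """T_{k+1}: glue letters 2i and 2i+1 of T_k into one ordered-pair letter."""
    assert len(text) % 2 == 0, "sketch assumes an even length"
    return tuple((text[2 * i], text[2 * i + 1]) for i in range(len(text) // 2))


t0 = tuple("abracadabra$")   # 12 letters, an arbitrary example text
sa0 = suffix_array(t0)
t1 = pair_up(t0)             # 6 pair-letters, left as ordered pairs
sa1 = suffix_array(t1)

# Top-down definition: keep the even values of SA_0 and divide them by 2.
extracted = [v // 2 for v in sa0 if v % 2 == 0]
print(extracted == sa1)
```

The pair-letters are compared as ordered pairs, which matches the remark about not remapping the alphabet.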
532 00:29:17,310 --> 00:29:20,940 But we just add up all those costs, 533 00:29:20,940 --> 00:29:23,826 and we'll get the overall size of our data structure. 534 00:29:27,560 --> 00:29:30,080 So how do we do this representation? 535 00:29:30,080 --> 00:29:34,190 I need to define two kind of weird things, 536 00:29:34,190 --> 00:29:37,280 and then we'll see why they're interesting. 537 00:29:37,280 --> 00:29:44,920 OK, the first thing is called even successor sub k of i. 538 00:29:44,920 --> 00:29:47,330 So let me define it. 539 00:29:47,330 --> 00:29:55,700 It's going to be i if the i-th suffix starts 540 00:29:55,700 --> 00:29:58,010 in an even position. 541 00:29:58,010 --> 00:30:00,754 So it doesn't do anything for the even guys. 542 00:30:00,754 --> 00:30:02,420 The interesting thing is when the suffix 543 00:30:02,420 --> 00:30:04,370 starts in an odd position. 544 00:30:04,370 --> 00:30:06,730 Then we're going to write down a different number j. 545 00:30:09,860 --> 00:30:14,120 This is going to look kind of weird, but it's actually-- 546 00:30:14,120 --> 00:30:19,330 it's simple after you think about it for 10 minutes. 547 00:30:19,330 --> 00:30:23,880 This one is odd. 548 00:30:23,880 --> 00:30:26,330 OK, so the other situation is that SA k-- 549 00:30:26,330 --> 00:30:29,550 the i-th suffix starts at an even position. 550 00:30:29,550 --> 00:30:31,910 So let me draw a little picture. 551 00:30:31,910 --> 00:30:36,814 So here is SA of i. 552 00:30:39,600 --> 00:30:42,920 OK, if this happens to be odd, this position 553 00:30:42,920 --> 00:30:44,645 in the text-- this is Tk. 554 00:30:47,300 --> 00:30:50,540 Then I want to go here. 555 00:30:50,540 --> 00:30:51,050 OK? 556 00:30:51,050 --> 00:30:53,008 Because that's an even position, it's a suffix, 557 00:30:53,008 --> 00:30:55,080 it's right next to the suffix I care about. 558 00:30:55,080 --> 00:30:57,710 It is what we call the even successor suffix. 
559 00:30:57,710 --> 00:30:59,810 But I don't want to know the index of that. 560 00:30:59,810 --> 00:31:03,590 The index of that would just be SA k of i plus 1. 561 00:31:03,590 --> 00:31:07,580 I want to map backwards through SA inverse. 562 00:31:07,580 --> 00:31:12,320 I want to know, what is the rank of that suffix? 563 00:31:12,320 --> 00:31:17,120 Which suffix j starts right there? 564 00:31:17,120 --> 00:31:19,790 I want to know that the j-th suffix starts right 565 00:31:19,790 --> 00:31:23,480 after the i-th suffix, and I want to write down j. 566 00:31:23,480 --> 00:31:26,590 We'll see why this is the right thing in a moment. 567 00:31:26,590 --> 00:31:29,270 We're just mapping through SA, adding 1, 568 00:31:29,270 --> 00:31:32,350 and then mapping backwards through SA. 569 00:31:32,350 --> 00:31:33,584 So that's a function. 570 00:31:33,584 --> 00:31:35,750 We're going to store that function in a particular-- 571 00:31:35,750 --> 00:31:40,670 in a very weird way, which we'll get to in a moment. 572 00:31:40,670 --> 00:31:45,420 OK, next thing we need is called even rank. 573 00:31:45,420 --> 00:31:47,910 This is going to be like our rank function. 574 00:31:47,910 --> 00:31:49,850 We've had it before. 575 00:31:49,850 --> 00:31:55,220 This is going to be the number of even suffixes-- 576 00:31:55,220 --> 00:31:59,090 even suffixes are suffixes starting at even positions-- 577 00:31:59,090 --> 00:32:05,600 preceding the i-th suffix. 578 00:32:05,600 --> 00:32:09,570 i-th suffix meaning the i-th one in sorted order. 579 00:32:09,570 --> 00:32:11,180 So the suffix SA of i. 580 00:32:14,600 --> 00:32:15,980 Yes, so this is-- 581 00:32:15,980 --> 00:32:18,170 let me be more precise. 582 00:32:18,170 --> 00:32:28,460 This is the number of even values in SA k up to i. 583 00:32:28,460 --> 00:32:30,890 So we're looking-- so this was the text. 
584 00:32:30,890 --> 00:32:33,680 Now we're looking at the suffix array, which has 585 00:32:33,680 --> 00:32:35,660 the suffixes in sorted order. 586 00:32:35,660 --> 00:32:38,440 We're looking at position i here, and we want to know, 587 00:32:38,440 --> 00:32:41,270 of all of these values, which ones are even? 588 00:32:41,270 --> 00:32:42,540 Or how many are even-- 589 00:32:42,540 --> 00:32:44,390 that's the even rank. 590 00:32:44,390 --> 00:32:46,250 Again, a weird thing, we'll see why it's 591 00:32:46,250 --> 00:32:47,530 the right thing in a moment. 592 00:32:55,760 --> 00:32:56,690 Right now, in fact. 593 00:33:02,440 --> 00:33:07,900 So here is observation 3, putting these together. 594 00:33:07,900 --> 00:33:09,295 This is a rather long equation. 595 00:33:12,040 --> 00:33:13,510 Ultimately, I want to know-- 596 00:33:13,510 --> 00:33:16,260 I want to represent SA k of i. 597 00:33:16,260 --> 00:33:17,830 I'm trying to represent that. 598 00:33:17,830 --> 00:33:22,090 And I want the right-hand side to only refer to SA k plus 1. 599 00:33:22,090 --> 00:33:24,280 So here's the claim. 600 00:33:24,280 --> 00:33:28,540 Take 2 times SA k plus 1 of-- 601 00:33:35,187 --> 00:33:36,520 I'm going to need another board. 602 00:33:50,240 --> 00:33:51,830 Not of i. 603 00:33:51,830 --> 00:34:03,110 Even rank of even successor of i, minus the quantity 1 minus 604 00:34:03,110 --> 00:34:14,060 is even suffix of i. 605 00:34:14,060 --> 00:34:16,320 OK, so that's the equation. 606 00:34:16,320 --> 00:34:19,580 Let me unpack this a little bit. 607 00:34:19,580 --> 00:34:22,240 The idea is, we want to know about a suffix i. 608 00:34:22,240 --> 00:34:24,080 If i happens to be even-- 609 00:34:24,080 --> 00:34:26,929 sorry, not if i happens to be even-- if SA of i 610 00:34:26,929 --> 00:34:29,510 happens to be even, we're golden. 611 00:34:29,510 --> 00:34:33,260 Because that suffix is represented by SA k plus 1, 612 00:34:33,260 --> 00:34:34,429 but it might not be even.
613 00:34:34,429 --> 00:34:38,120 So we want to round it to an even suffix. 614 00:34:38,120 --> 00:34:41,150 Knowing about this odd suffix is just about as good 615 00:34:41,150 --> 00:34:44,340 as knowing about the suffix that starts right after it. 616 00:34:44,340 --> 00:34:46,310 So that's what even successor does. 617 00:34:46,310 --> 00:34:51,620 This is rounding to an even suffix, 618 00:34:51,620 --> 00:34:54,949 meaning a suffix starting at an even position. 619 00:34:59,870 --> 00:35:04,820 Now there's this issue that over here, we 620 00:35:04,820 --> 00:35:08,630 have this relation between SA k and SA k plus 1, 621 00:35:08,630 --> 00:35:11,970 but it extracts the even entries. 622 00:35:11,970 --> 00:35:14,300 So if you think about the suffix array, which now I'm 623 00:35:14,300 --> 00:35:16,508 going to draw a vertical, because that's more normal. 624 00:35:19,580 --> 00:35:21,964 Some of these values are going to be even, 625 00:35:21,964 --> 00:35:24,380 but you don't really know which ones are going to be even. 626 00:35:24,380 --> 00:35:26,530 It's arbitrary subset of-- 627 00:35:26,530 --> 00:35:29,630 in SA k, our even values. 628 00:35:29,630 --> 00:35:35,950 And those are the ones that you extract and form SA k plus 1. 629 00:35:35,950 --> 00:35:38,600 But it's an arbitrary subset, that's kind of a-- 630 00:35:38,600 --> 00:35:40,550 you can't just divide by 2 or something. 631 00:35:40,550 --> 00:35:42,140 It's not the right thing. 632 00:35:42,140 --> 00:35:45,680 If I'm given an index into here, even if it's an even one, 633 00:35:45,680 --> 00:35:48,470 I need to know what the corresponding index is 634 00:35:48,470 --> 00:35:50,270 over here. 635 00:35:50,270 --> 00:35:53,390 And that, I claim, is exactly even rank. 636 00:35:53,390 --> 00:35:59,090 Because what position does this cell become over here? 637 00:35:59,090 --> 00:36:03,140 Well, however many even numbers there are above it. 
638 00:36:03,140 --> 00:36:05,840 So you take-- that's what this definition was, the number 639 00:36:05,840 --> 00:36:08,060 of even values in that prefix. 640 00:36:08,060 --> 00:36:12,360 That is the position you will be in, in SA k plus 1. 641 00:36:12,360 --> 00:36:15,510 So this is what I would call the name-- 642 00:36:15,510 --> 00:36:17,630 we've now rounded to an even suffix but now 643 00:36:17,630 --> 00:36:21,530 we need to find the name of that even suffix-- 644 00:36:21,530 --> 00:36:25,070 in SA k plus 1. 645 00:36:25,070 --> 00:36:29,120 So that's exactly what even rank does. 646 00:36:29,120 --> 00:36:33,050 So now we can dereference SA k plus 1 of that thing. 647 00:36:33,050 --> 00:36:38,120 That will give us a position into the text T k 648 00:36:38,120 --> 00:36:42,830 plus 1, where that suffix begins. 649 00:36:42,830 --> 00:36:47,960 Now that's an index into this divided by 2 string, 650 00:36:47,960 --> 00:36:51,560 we need to uncompress that to an index into the actual string. 651 00:36:51,560 --> 00:36:52,560 And there are two parts. 652 00:36:52,560 --> 00:36:55,220 One is we need to multiply by 2, because every letter in T 653 00:36:55,220 --> 00:36:57,770 k plus 1 is two letters in Tk. 654 00:36:57,770 --> 00:36:58,670 So multiply by 2. 655 00:36:58,670 --> 00:37:01,070 And sometimes we need to subtract 1. 656 00:37:01,070 --> 00:37:04,670 We basically need to subtract 1 if even successor did 657 00:37:04,670 --> 00:37:05,505 anything. 658 00:37:05,505 --> 00:37:08,540 If even successor essentially moved us to the right by 1, 659 00:37:08,540 --> 00:37:10,040 now we need to move back to the left 660 00:37:10,040 --> 00:37:13,080 by 1, if this moved us at all. 661 00:37:13,080 --> 00:37:15,620 So I have one more function here, which is is even suffix. 662 00:37:15,620 --> 00:37:19,670 Was SA of i an even-- 663 00:37:19,670 --> 00:37:22,490 SA sub k of i, an even number already.
664 00:37:22,490 --> 00:37:25,580 Which means that even successor did nothing. 665 00:37:25,580 --> 00:37:30,180 If it did nothing, then 1 minus 1 is 0, 666 00:37:30,180 --> 00:37:31,550 and so nothing happens. 667 00:37:31,550 --> 00:37:33,800 If it did something, then is even suffix 668 00:37:33,800 --> 00:37:35,780 will be 0, because it was odd. 669 00:37:35,780 --> 00:37:37,250 And then we're subtracting 1. 670 00:37:37,250 --> 00:37:40,550 So this just means subtract 1, if it was odd. 671 00:37:40,550 --> 00:37:43,130 You might say minus is odd suffix, instead of 672 00:37:43,130 --> 00:37:44,510 1 minus is even suffix. 673 00:37:44,510 --> 00:37:47,330 But it turns out, this is the thing I want to store, 674 00:37:47,330 --> 00:37:50,424 so I wrote it in a weird way. 675 00:37:50,424 --> 00:37:51,590 Why did I write it that way? 676 00:37:51,590 --> 00:37:57,380 Because is even suffix is related to even rank. 677 00:37:57,380 --> 00:38:04,040 Even rank is just rank sub 1 of is even suffix. 678 00:38:04,040 --> 00:38:05,900 And we already saw how to do rank sub 1, 679 00:38:05,900 --> 00:38:09,650 and so that's why I wanted to reuse it. 680 00:38:09,650 --> 00:38:12,800 I think you see now why this equation holds. 681 00:38:12,800 --> 00:38:16,820 What remains is how to store is even suffix, 682 00:38:16,820 --> 00:38:18,957 even rank, even successor. 683 00:38:23,340 --> 00:38:26,590 One other thing that remains, is to say 684 00:38:26,590 --> 00:38:29,240 when to stop this recursion. 685 00:38:29,240 --> 00:38:32,885 So I claim it's enough to just do this recursion for log log n 686 00:38:32,885 --> 00:38:33,385 levels. 687 00:38:43,900 --> 00:38:46,900 And then I'll call log log n l, the number 688 00:38:46,900 --> 00:38:48,620 of levels in this recursion.
689 00:38:48,620 --> 00:38:53,440 Because at that point, n sub l equals n over-- 690 00:38:53,440 --> 00:38:55,210 it's n over 2 to the l, so that's 691 00:38:55,210 --> 00:38:57,760 going to be n over log n. 692 00:38:57,760 --> 00:39:00,880 Once I have a string of length n over log n, 693 00:39:00,880 --> 00:39:04,600 I can afford the regular boring representation 694 00:39:04,600 --> 00:39:11,310 of a suffix array. 695 00:39:11,310 --> 00:39:16,560 I can afford T log T bits, when T is only n over log n. 696 00:39:16,560 --> 00:39:18,790 If you want to be a little extra clever, 697 00:39:18,790 --> 00:39:22,500 you can put a factor 2 here, and then there's a square here. 698 00:39:22,500 --> 00:39:25,140 And so then you're really paying little o of T 699 00:39:25,140 --> 00:39:27,485 in order to store that thing. 700 00:39:27,485 --> 00:39:28,860 So once you get down to here, you 701 00:39:28,860 --> 00:39:31,510 can afford a simple representation. 702 00:39:31,510 --> 00:39:36,210 Now let's think about how to compute SA, 703 00:39:36,210 --> 00:39:40,360 like the original SA, sub 0, of an index. 704 00:39:40,360 --> 00:39:46,740 Well I apply this formula at all times, 705 00:39:46,740 --> 00:39:49,690 I do all these computations. 706 00:39:49,690 --> 00:39:52,806 And now I've reduced the problem to SA 1, 707 00:39:52,806 --> 00:39:54,180 and then I do these computations. 708 00:39:54,180 --> 00:39:56,170 I reduce it to SA 2, and so on. 709 00:39:56,170 --> 00:40:00,350 After l steps, I'll have reduced it to an SA query 710 00:40:00,350 --> 00:40:02,070 in a boring old suffix array, which 711 00:40:02,070 --> 00:40:04,090 I've just stored as an array. 712 00:40:04,090 --> 00:40:07,160 So then I can answer it, and then I pop up the recursion, 713 00:40:07,160 --> 00:40:11,460 log log n times, doing these adjustments as appropriate. 714 00:40:11,460 --> 00:40:15,780 In the end, I get the correct index into the original text T.
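[Editor's sketch, not part of the lecture.] The recursive query just described can be written out as follows. This is a minimal sketch under several assumptions that are not from the lecture: all names (`build_levels`, `lookup`, and so on) are hypothetical, the three helper functions are stored as plain Python lists rather than in any succinct encoding, the empty suffix is included so that the last odd position has a successor, the text length is divisible by 2 to the number of levels, and indices are 0-based with even rank counting inclusively (hence the extra `- 1` when indexing into the next level).

```python
def suffix_array(text):
    """Naive suffix array over positions 0..len(text), empty suffix included."""
    return sorted(range(len(text) + 1), key=lambda p: text[p:])


def pair_up(text):
    """T_{k+1}: combine letters 2i and 2i+1 of T_k into one pair-letter."""
    return tuple((text[2 * i], text[2 * i + 1]) for i in range(len(text) // 2))


def build_levels(text, num_levels):
    """Per-level tables (is_even, even_succ, even_rank) plus the bottom array."""
    levels = []
    t = tuple(text)
    for _ in range(num_levels):
        assert len(t) % 2 == 0, "sketch assumes the length halves cleanly"
        sa = suffix_array(t)
        inverse = {pos: rank for rank, pos in enumerate(sa)}  # SA inverse
        is_even = [1 if pos % 2 == 0 else 0 for pos in sa]
        # even_succ[i] = i if SA[i] is even, else the rank j with SA[j] = SA[i] + 1
        even_succ = [i if pos % 2 == 0 else inverse[pos + 1]
                     for i, pos in enumerate(sa)]
        # even_rank[i] = number of even values in SA[0..i], inclusive
        even_rank, count = [], 0
        for bit in is_even:
            count += bit
            even_rank.append(count)
        levels.append((is_even, even_succ, even_rank))
        t = pair_up(t)
    return levels, suffix_array(t)  # bottom level stored as a plain array


def lookup(levels, bottom_sa, i, k=0):
    """SA_k(i) = 2 * SA_{k+1}(even_rank(even_succ(i))) - (1 - is_even(i))."""
    if k == len(levels):
        return bottom_sa[i]
    is_even, even_succ, even_rank = levels[k]
    j = even_succ[i]                 # round to an even suffix
    index_below = even_rank[j] - 1   # its name in SA_{k+1}, 0-based
    return 2 * lookup(levels, bottom_sa, index_below, k + 1) - (1 - is_even[i])


text = "abracadabra$"                # length 12 halves cleanly twice: 12, 6, 3
levels, bottom_sa = build_levels(text, 2)
recovered = [lookup(levels, bottom_sa, i) for i in range(len(text) + 1)]
print(recovered == suffix_array(tuple(text)))
```

Each call does a constant amount of work per level, so the number of steps is the number of levels, which is where the log log n query time comes from once the three helpers answer in constant time.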
715 00:40:15,780 --> 00:40:17,850 How much time did it take? 716 00:40:17,850 --> 00:40:19,110 Order log log n time. 717 00:40:23,640 --> 00:40:30,050 So I can do a log log n time query to SA. 718 00:40:34,740 --> 00:40:37,940 This is, of course, assuming that even rank, even successor, 719 00:40:37,940 --> 00:40:42,030 and is even suffix are all constant time operations. 720 00:40:42,030 --> 00:40:44,340 So what remains is to do each of these 721 00:40:44,340 --> 00:40:46,740 in small space and constant time. 722 00:40:46,740 --> 00:40:50,790 Then my overall query time will only go up by log log factor. 723 00:40:50,790 --> 00:40:52,830 This is actually going to be pretty good, 724 00:40:52,830 --> 00:40:54,470 we're not going to-- 725 00:40:54,470 --> 00:40:56,550 we're going to achieve log log query 726 00:40:56,550 --> 00:40:59,920 when we have T log log T bits. 727 00:40:59,920 --> 00:41:02,150 That'll be our first encoding of these things. 728 00:41:02,150 --> 00:41:03,608 Later on, we're going have to go up 729 00:41:03,608 --> 00:41:05,970 to log to the epsilon, which is worse than log log n. 730 00:41:08,620 --> 00:41:09,600 Clear, so far? 731 00:41:09,600 --> 00:41:12,447 Everything is pretty easy at this point now. 732 00:41:12,447 --> 00:41:14,280 It's going to remain easy, it's just there's 733 00:41:14,280 --> 00:41:15,780 a lot of pieces to the puzzle. 734 00:41:15,780 --> 00:41:18,300 This is the first-- this is the big idea. 735 00:41:18,300 --> 00:41:21,240 Next thing is some fancy encoding schemes 736 00:41:21,240 --> 00:41:22,615 to make these things quite small. 737 00:41:22,615 --> 00:41:23,114 Question? 738 00:41:23,114 --> 00:41:25,690 AUDIENCE: [INAUDIBLE] Did you say what the space [INAUDIBLE] 739 00:41:25,690 --> 00:41:26,640 was? 
740 00:41:26,640 --> 00:41:27,510 ERIK DEMAINE: We haven't analyzed space yet, 741 00:41:27,510 --> 00:41:28,710 because I haven't said how we're actually 742 00:41:28,710 --> 00:41:29,812 storing these functions. 743 00:41:29,812 --> 00:41:31,770 If you stored these functions explicitly, you'd 744 00:41:31,770 --> 00:41:34,590 have bad space, probably still T log T. But it turns out, 745 00:41:34,590 --> 00:41:38,400 these functions can be encoded in a clever way that's small-- 746 00:41:38,400 --> 00:41:41,610 smaller, it's going to be T log log T. 747 00:41:41,610 --> 00:41:44,710 And still has constant time query. 748 00:41:44,710 --> 00:41:47,266 AUDIENCE: Without the functions, how much space are we using? 749 00:41:47,266 --> 00:41:48,765 ERIK DEMAINE: Without the functions, 750 00:41:48,765 --> 00:41:51,090 we're using, essentially, no space. 751 00:41:51,090 --> 00:41:54,096 I guess, at the end where we're using-- 752 00:41:54,096 --> 00:41:55,470 the only thing we've said so far, 753 00:41:55,470 --> 00:41:58,120 is at the end we use an explicit suffix array. 754 00:41:58,120 --> 00:42:00,480 And if you set this to 2 log log T, 755 00:42:00,480 --> 00:42:04,010 then this would be like n over log n bits of space. 756 00:42:04,010 --> 00:42:07,860 Because it's going to be this times-- 757 00:42:07,860 --> 00:42:12,980 I mean, the space at the bottom is going to be nl log nl. 758 00:42:12,980 --> 00:42:15,150 That's to store an explicit suffix array, 759 00:42:15,150 --> 00:42:17,400 so it's going to be this times log of this, 760 00:42:17,400 --> 00:42:23,190 which is going to be n over log n, if we put the 2 in. 761 00:42:23,190 --> 00:42:26,280 So that part's really cheap, and that's little o of n. 762 00:42:26,280 --> 00:42:28,670 Of course, we probably also have to store the text. 763 00:42:28,670 --> 00:42:30,720 So that's n bits.
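[Editor's note, not part of the lecture.] The arithmetic behind the bottom-level space bound can be spelled out as a short worked derivation. It assumes base-2 logarithms and ignores rounding; the symbols follow the lecture's n sub l and l.

```latex
% Bottom level of the recursion after \ell halvings: n_\ell = n / 2^{\ell}.
% An explicit suffix array on n_\ell entries costs about n_\ell \log n_\ell bits.
\begin{align*}
\ell = \log\log n:\quad
  & n_\ell = \frac{n}{\log n},
  & n_\ell \log n_\ell &\le \frac{n}{\log n}\cdot\log n = n = O(n) \text{ bits},\\[4pt]
\ell = 2\log\log n:\quad
  & n_\ell = \frac{n}{\log^2 n},
  & n_\ell \log n_\ell &\le \frac{n}{\log^2 n}\cdot\log n = \frac{n}{\log n} = o(n) \text{ bits}.
\end{align*}
```

Doubling the number of levels only changes the query time by a constant factor, since it is still proportional to log log n.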
764 00:42:30,720 --> 00:42:32,572 I didn't mention-- I'm going to assume, 765 00:42:32,572 --> 00:42:33,780 I don't think we need it yet. 766 00:42:33,780 --> 00:42:36,600 At some point I will assume that the alphabet's binary. 767 00:42:36,600 --> 00:42:39,700 So I'm going to leave off-- when I say n bits, really it's 768 00:42:39,700 --> 00:42:42,270 n log sigma bits, or n characters, or whatever. 769 00:42:42,270 --> 00:42:45,850 But I'm not going to worry about that here. 770 00:42:45,850 --> 00:42:48,810 Are there questions? 771 00:42:48,810 --> 00:42:51,064 So now, it's an encoding problem. 772 00:42:51,064 --> 00:42:52,230 How do we encode these guys? 773 00:42:57,120 --> 00:43:00,419 Actually, even successor is the only thing that's non-trivial. 774 00:43:00,419 --> 00:43:02,460 We're going to do the obvious thing for the rest. 775 00:43:05,110 --> 00:43:07,910 So let me tell you about the obvious ones, easy ones. 776 00:43:16,527 --> 00:43:18,110 At least, the first revision we're not 777 00:43:18,110 --> 00:43:20,540 going to do anything fancy with them, later on we will. 778 00:43:26,170 --> 00:43:27,865 Sorry, is even suffix. 779 00:43:37,010 --> 00:43:39,080 We're just going to store this as a bit vector. 780 00:43:39,080 --> 00:43:47,990 This is 1 if SA k is even, 0 if it's odd. 781 00:43:47,990 --> 00:43:49,840 So if we just store that as a bit vector, 782 00:43:49,840 --> 00:43:55,730 this is n sub k bits that we can afford. 783 00:43:55,730 --> 00:43:57,439 Because this is a geometric series, 784 00:43:57,439 --> 00:43:58,480 it's going to be order n. 785 00:44:02,030 --> 00:44:03,290 Next is even rank. 786 00:44:06,830 --> 00:44:10,550 This is just the rank one structure 787 00:44:10,550 --> 00:44:14,900 that we covered last class, on this thing. 788 00:44:14,900 --> 00:44:19,420 So this is going to be nk-- 789 00:44:19,420 --> 00:44:25,610 I think we did log log nk over log nk.
790 00:44:25,610 --> 00:44:28,670 And this can be improved to nk over log to the k-- 791 00:44:28,670 --> 00:44:31,760 or log to the something of nk. 792 00:44:31,760 --> 00:44:33,830 But that's an OK bound. 793 00:44:33,830 --> 00:44:36,740 It's little o of n. Again, this is geometric, 794 00:44:36,740 --> 00:44:39,890 so this overall will be little o of n. 795 00:44:39,890 --> 00:44:48,032 So those are easy, the remaining part is doing even successor. 796 00:45:00,120 --> 00:45:03,370 A little optimization. 797 00:45:03,370 --> 00:45:09,640 For the i's where SA k of i is even, 798 00:45:09,640 --> 00:45:11,320 we don't really need to store anything. 799 00:45:11,320 --> 00:45:14,870 Because then, even successor is the identity function. 800 00:45:14,870 --> 00:45:16,810 So let's forget about those guys. 801 00:45:16,810 --> 00:45:21,640 I'll say, it's trivial for even successors-- 802 00:45:21,640 --> 00:45:24,490 for even suffixes. 803 00:45:29,350 --> 00:45:33,130 So what I'd like to do, is store the answers for odd suffixes. 804 00:45:33,130 --> 00:45:36,550 That's what we're going to do. 805 00:45:36,550 --> 00:45:39,880 We're going to store them in a weird way, as we will see. 806 00:45:50,388 --> 00:45:52,100 So that's the odd suffixes. 807 00:45:52,100 --> 00:45:59,054 There are nk over 2 evens, and there are nk over 2 odds. 808 00:45:59,054 --> 00:46:00,470 So we've just saved a factor of 2. 809 00:46:00,470 --> 00:46:02,350 This wasn't a very deep observation. 810 00:46:02,350 --> 00:46:06,320 But it turns out, if you focus in on the odd ones, 811 00:46:06,320 --> 00:46:08,500 there's a nice little structure to them. 812 00:46:12,330 --> 00:46:14,770 This step isn't really necessary, 813 00:46:14,770 --> 00:46:16,060 but it saves a factor of 2. 814 00:46:24,910 --> 00:46:29,560 Now the kind of interesting observation. 815 00:46:29,560 --> 00:46:34,269 What I'd like to do is store these answers in order by i.
816 00:46:34,269 --> 00:46:35,560 That's the obvious thing to do. 817 00:46:35,560 --> 00:46:37,060 I want to store basically an array. 818 00:46:40,780 --> 00:46:43,600 Just store it in order by i, so I'm 819 00:46:43,600 --> 00:46:46,360 skipping the even suffixes, just storing the answers 820 00:46:46,360 --> 00:46:49,240 for the odd suffixes. 821 00:46:49,240 --> 00:46:52,750 So if I was given a number i, how would I look it up? 822 00:46:52,750 --> 00:46:59,180 Well, given an index i into the suffix array, 823 00:46:59,180 --> 00:47:00,925 what I need to know is-- 824 00:47:00,925 --> 00:47:05,290 this is basically the inverse of what we did with SA k plus 1. 825 00:47:05,290 --> 00:47:07,560 SA k plus 1 is extracting the even entries, 826 00:47:07,560 --> 00:47:09,640 here we're extracting the odd entries. 827 00:47:09,640 --> 00:47:13,660 So all I need to know is the odd rank of i, 828 00:47:13,660 --> 00:47:15,940 and then I look up into this array 829 00:47:15,940 --> 00:47:18,110 at position odd rank of i. 830 00:47:18,110 --> 00:47:20,260 That will give me the answer I want. 831 00:47:20,260 --> 00:47:23,350 Well, first I check, is it an even suffix, 832 00:47:23,350 --> 00:47:25,000 which I have stored as a bit vector. 833 00:47:25,000 --> 00:47:28,990 If it's an even suffix, I do nothing, I just return i. 834 00:47:28,990 --> 00:47:32,710 But if it's an odd suffix, then I compute the odd rank. 835 00:47:32,710 --> 00:47:34,290 How do I compute the odd rank? 836 00:47:34,290 --> 00:47:37,720 I take the even rank and take i minus that. 837 00:47:37,720 --> 00:47:42,362 Odd rank, we don't need to store anything for it. 838 00:47:42,362 --> 00:47:43,820 I mean, you could if you wanted to, 839 00:47:43,820 --> 00:47:47,530 but odd rank is just i minus even rank. 840 00:47:50,050 --> 00:47:55,590 Because every index is either odd or even. 841 00:47:55,590 --> 00:47:57,040 OK, great.
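[Editor's sketch, not part of the lecture.] The lookup procedure just described (check the bit vector, otherwise index an array of odd-suffix answers by odd rank) might look like this. It is a sketch under assumptions: the names are hypothetical, the answers are kept in a plain list, `even_rank` is computed by summing a prefix as a stand-in for the constant-time rank structure, the empty suffix is included, and with 0-based inclusive counting the odd rank comes out as `(i + 1) - even_rank(i)` rather than exactly "i minus even rank".

```python
def suffix_array(text):
    """Naive suffix array over positions 0..len(text), empty suffix included."""
    return sorted(range(len(text) + 1), key=lambda p: text[p:])


text = "abracadabra$"                 # even length, arbitrary example
sa = suffix_array(text)
inverse = {pos: rank for rank, pos in enumerate(sa)}

# Bit vector: 1 where the i-th suffix starts at an even position.
is_even = [1 if pos % 2 == 0 else 0 for pos in sa]

# Answers for the odd suffixes only, listed in order by i.
odd_answers = [inverse[pos + 1] for pos in sa if pos % 2 == 1]


def even_rank(i):
    """Number of even values in sa[0..i]; stand-in for a rank structure."""
    return sum(is_even[: i + 1])


def even_succ(i):
    if is_even[i]:
        return i                         # identity on even suffixes
    odd_rank = (i + 1) - even_rank(i)    # number of odd values in sa[0..i]
    return odd_answers[odd_rank - 1]     # 0-based slot in the odd-only array


checks = [
    sa[even_succ(i)] == (sa[i] if is_even[i] else sa[i] + 1)
    for i in range(len(sa))
]
print(all(checks))
```

Only about half of the indices get an entry in `odd_answers`, which is the factor-of-2 saving mentioned above.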
842 00:47:57,040 --> 00:48:00,900 So I can look up odd rank and then look at this array. 843 00:48:00,900 --> 00:48:02,580 That'll give me the answer I need. 844 00:48:02,580 --> 00:48:04,788 But I'm not going to actually store this as an array. 845 00:48:04,788 --> 00:48:07,350 I lied. 846 00:48:07,350 --> 00:48:09,370 But in any case, let's worry about how I'm 847 00:48:09,370 --> 00:48:10,880 going to store it in a moment. 848 00:48:10,880 --> 00:48:15,670 Let's think about i-- if I'm storing these answers-- 849 00:48:15,670 --> 00:48:21,091 the even successor answers, these j values, in order by i. 850 00:48:21,091 --> 00:48:25,210 I claim that order is a very special order, because what 851 00:48:25,210 --> 00:48:27,100 does it mean to order by i? 852 00:48:27,100 --> 00:48:31,240 Ordering by i, that means the suffixes are sorted, right? 853 00:48:31,240 --> 00:48:43,240 So this is the same thing as ordering by an odd suffix in Tk 854 00:48:43,240 --> 00:48:47,590 from SA of i onwards. 855 00:48:47,590 --> 00:48:49,360 That's the suffix that we're-- 856 00:48:49,360 --> 00:48:52,462 sorting by that suffix, is sorting by i. 857 00:48:55,060 --> 00:48:56,980 Now we can unpack an odd suffix-- 858 00:48:56,980 --> 00:49:00,190 it has the first character-- and then an even suffix. 859 00:49:00,190 --> 00:49:03,850 So this is the same thing as ordering by-- 860 00:49:03,850 --> 00:49:05,350 this should look familiar because we 861 00:49:05,350 --> 00:49:06,460 did the same kinds of tricks when 862 00:49:06,460 --> 00:49:07,710 we were building suffix trees. 863 00:49:21,010 --> 00:49:23,110 This is even. 864 00:49:26,390 --> 00:49:28,328 In fact, it's the even successor. 865 00:49:31,674 --> 00:49:41,560 There's a typo here, [? see ?] If we follow SA k, 866 00:49:41,560 --> 00:49:42,700 and then we add 1. 867 00:49:42,700 --> 00:49:45,820 If we follow SA k backwards, that 868 00:49:45,820 --> 00:49:48,430 was the definition of even successor.
869 00:49:48,430 --> 00:49:51,200 So I can rewrite this thing. 870 00:49:51,200 --> 00:50:06,500 This part is the same thing as Tk SA k even successor k of i, 871 00:50:06,500 --> 00:50:09,706 closed bracket, colon, closed bracket. 872 00:50:09,706 --> 00:50:11,980 Get that right? 873 00:50:11,980 --> 00:50:14,650 Yes. 874 00:50:14,650 --> 00:50:16,630 That was the definition of even successors. 875 00:50:16,630 --> 00:50:20,950 Even successor is the value j, for which if I do SA k of j, 876 00:50:20,950 --> 00:50:22,896 I get SA k of i plus 1. 877 00:50:22,896 --> 00:50:25,600 That's the definition. 878 00:50:25,600 --> 00:50:30,185 OK, now Tk of SA of k. 879 00:50:32,980 --> 00:50:35,400 Sorry, the suffix-- that's not Tk of. 880 00:50:35,400 --> 00:50:36,580 There's a colon here. 881 00:50:36,580 --> 00:50:40,690 The suffix of Tk starting at SA k. 882 00:50:40,690 --> 00:50:42,903 If I sort by those suffixes-- 883 00:50:46,770 --> 00:50:47,780 they're sorted, right? 884 00:50:47,780 --> 00:50:50,330 I mean, that was the point of the suffix array, 885 00:50:50,330 --> 00:50:51,860 is to sort the suffixes. 886 00:50:51,860 --> 00:50:57,350 So if I say I'm ordering by the suffixes given in order by SA 887 00:50:57,350 --> 00:50:59,660 k, they're already sorted. 888 00:50:59,660 --> 00:51:02,720 There's no reason to do this Tk of SA k part. 889 00:51:02,720 --> 00:51:07,250 This is going to be the same thing as the order 890 00:51:07,250 --> 00:51:16,513 by this first letter, Tk SA k of i comma, even successor. 891 00:51:20,257 --> 00:51:22,340 The suffix array is defined to have this property, 892 00:51:22,340 --> 00:51:23,881 that these orders are the same thing. 893 00:51:26,470 --> 00:51:28,510 And sorting by the suffixes is the same thing 894 00:51:28,510 --> 00:51:33,590 as sorting by the indices into the suffix array. 895 00:51:33,590 --> 00:51:36,702 Interesting, because this is what I want to store, right? 
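A small Python check of this ordering claim, as a sketch rather than anything from the lecture: the function name is hypothetical, the suffix array is built naively, indices are 0-based, and the text is assumed to have odd length. It lists the pairs (first character, even successor) for the odd suffixes in order by i, and the result comes out already sorted.

```python
def odd_pairs_in_suffix_array_order(text):
    # Naive suffix array and its inverse.
    sa = sorted(range(len(text)), key=lambda s: text[s:])
    inverse = {start: i for i, start in enumerate(sa)}
    # For each odd suffix, in order by i: (first character, even successor).
    return [(text[start], inverse[start + 1]) for start in sa if start % 2 == 1]

pairs = odd_pairs_in_suffix_array_order("banana$")
# Ordering by i and ordering by the pair value agree.
assert pairs == sorted(pairs)
```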
896 00:51:36,702 --> 00:51:38,660 Those are the answers that I'm trying to store. 897 00:51:38,660 --> 00:51:41,990 I'm trying to store even successor for every i 898 00:51:41,990 --> 00:51:43,910 that has an odd-- 899 00:51:43,910 --> 00:51:45,710 that starts in an odd suffix. 900 00:51:48,500 --> 00:51:51,740 So really, all I need to do is order by this thing. 901 00:51:51,740 --> 00:51:54,560 And then once I've ordered by this thing, 902 00:51:54,560 --> 00:52:00,450 I'll store these guys in order by their value. 903 00:52:00,450 --> 00:52:00,980 Cool. 904 00:52:00,980 --> 00:52:04,160 So these are the pairs I'm going to store. 905 00:52:04,160 --> 00:52:05,780 I'm not going to-- 906 00:52:05,780 --> 00:52:11,110 I'm going to store this comma this, for all i, in order 907 00:52:11,110 --> 00:52:12,242 by this value. 908 00:52:12,242 --> 00:52:13,430 That is my goal. 909 00:52:13,430 --> 00:52:16,010 If I can store these in order by this value, 910 00:52:16,010 --> 00:52:18,560 then by computing odd rank, I know 911 00:52:18,560 --> 00:52:20,762 where in this list of pairs to go. 912 00:52:20,762 --> 00:52:22,220 And I just look at the second value 913 00:52:22,220 --> 00:52:26,720 of the pair, that is my answer. 914 00:52:26,720 --> 00:52:28,290 Why am I storing this? 915 00:52:28,290 --> 00:52:28,790 We'll see. 916 00:52:31,500 --> 00:52:33,500 I don't know if you really need to, but you can. 917 00:52:36,080 --> 00:52:38,810 OK. 918 00:52:38,810 --> 00:52:41,276 So what we're going to-- 919 00:52:41,276 --> 00:52:43,040 I feel like it's cheating. 920 00:52:43,040 --> 00:52:44,780 I say, actually store these pairs. 921 00:52:44,780 --> 00:52:46,700 We're not really going to actually store them. 922 00:52:46,700 --> 00:52:49,070 We still have another trick up our sleeve. 923 00:52:49,070 --> 00:52:51,920 But more or less, we're going to store these pairs-- 924 00:52:51,920 --> 00:52:54,860 I'll cross out, actually. 
925 00:52:54,860 --> 00:53:02,906 Store these pairs in order by value. 926 00:53:02,906 --> 00:53:04,280 Storing them in order by value is 927 00:53:04,280 --> 00:53:06,820 the same thing as order by i. 928 00:53:06,820 --> 00:53:09,150 That's what we just proved. 929 00:53:09,150 --> 00:53:10,850 And at this point, is when I'm going 930 00:53:10,850 --> 00:53:12,350 to assume a binary alphabet. 931 00:53:16,971 --> 00:53:17,470 OK. 932 00:53:22,980 --> 00:53:28,330 Maybe, I'll go through here. 933 00:53:31,000 --> 00:53:31,960 Need lots of stuff. 934 00:53:35,180 --> 00:53:37,440 Think we don't need this giant recursion up here. 935 00:53:41,494 --> 00:53:42,910 Just remember, it's enough to know 936 00:53:42,910 --> 00:53:45,550 how to compute even successor, the rest is easy. 937 00:54:16,620 --> 00:54:17,300 So here we go. 938 00:54:19,434 --> 00:54:20,850 We're trying to store these pairs, 939 00:54:20,850 --> 00:54:30,740 so we're trying to store a sorted array of nk 940 00:54:30,740 --> 00:54:31,480 over 2 values. 941 00:54:34,550 --> 00:54:37,520 That's how many odd suffixes there are. 942 00:54:37,520 --> 00:54:46,460 And they're each 2 to the k plus log nk bits, I claim. 943 00:54:46,460 --> 00:54:47,450 Why? 944 00:54:47,450 --> 00:54:51,510 Because this was a single character in Tk. 945 00:54:51,510 --> 00:54:54,329 But a single character in Tk was actually 2 to the k bits, 946 00:54:54,329 --> 00:54:56,120 in the original string for binary alphabet, 947 00:54:56,120 --> 00:54:59,120 and general sigma to the k. 948 00:54:59,120 --> 00:55:01,470 So that's that part of this 2 to the k bits. 949 00:55:01,470 --> 00:55:03,710 The even successor, well, that's just an index 950 00:55:03,710 --> 00:55:05,570 into something of size nk. 951 00:55:05,570 --> 00:55:08,060 So it's log nk bits. 952 00:55:08,060 --> 00:55:08,880 OK, fine. 
953 00:55:08,880 --> 00:55:11,390 If I store that explicitly, I would be in trouble, 954 00:55:11,390 --> 00:55:16,470 because 2 to the k times nk is n. 955 00:55:16,470 --> 00:55:19,460 And so I would be storing n bits at every level-- 956 00:55:19,460 --> 00:55:23,300 well, so I guess they get n log log n space. 957 00:55:23,300 --> 00:55:25,110 That part's actually OK. 958 00:55:25,110 --> 00:55:27,200 I can afford that much if I'm just 959 00:55:27,200 --> 00:55:29,900 going for an n log log n bound. 960 00:55:29,900 --> 00:55:32,840 This part, not so much. 961 00:55:32,840 --> 00:55:35,270 Because in particular, when k equals 0, 962 00:55:35,270 --> 00:55:38,030 that's going to be n times log n. 963 00:55:38,030 --> 00:55:40,431 I don't want to spend n log n space. 964 00:55:40,431 --> 00:55:41,930 And the whole point, is we're trying 965 00:55:41,930 --> 00:55:43,346 to avoid storing these explicitly. 966 00:55:43,346 --> 00:55:45,920 Because if I did, I'd get n log n space. 967 00:55:45,920 --> 00:55:48,010 So we're not going to store them explicitly. 968 00:55:52,010 --> 00:56:01,397 As follows, we are going to store so there 969 00:56:01,397 --> 00:56:02,480 are these big bit vectors. 970 00:56:02,480 --> 00:56:07,550 We're going to look at the leading log nk bits. 971 00:56:07,550 --> 00:56:10,580 This is kind of weird, because the log nk bits 972 00:56:10,580 --> 00:56:12,050 we care about are at the end. 973 00:56:12,050 --> 00:56:14,150 But we're going to look at the leading log nk bits 974 00:56:14,150 --> 00:56:20,190 especially, because this is a sorted list of bit vectors. 975 00:56:20,190 --> 00:56:23,452 So if you look at the leading bits, most of the time, 976 00:56:23,452 --> 00:56:24,660 they're going to be the same. 977 00:56:24,660 --> 00:56:26,300 They don't change very much. 978 00:56:26,300 --> 00:56:28,850 Leading bits are going to be all 0's for a while, 979 00:56:28,850 --> 00:56:30,530 and then occasionally they'll increment. 
980 00:56:30,530 --> 00:56:31,904 How many times will it increment? 981 00:56:31,904 --> 00:56:36,440 nk times, at most, if we look at the leading log nk bits. 982 00:56:48,274 --> 00:56:49,690 Here's the crazy idea, we're going 983 00:56:49,690 --> 00:56:53,080 to use unary encoding, unary differential encoding. 984 00:56:59,440 --> 00:57:00,970 Differential encoding means, instead 985 00:57:00,970 --> 00:57:03,910 of storing a list of values, you store the first value. 986 00:57:03,910 --> 00:57:07,540 Then the next value, minus the first value, 987 00:57:07,540 --> 00:57:10,284 and then the next value minus that value, and so on. 988 00:57:10,284 --> 00:57:11,950 And unary means we're going to represent 989 00:57:11,950 --> 00:57:14,650 those differences in unary. 990 00:57:14,650 --> 00:57:18,370 Seems like a bad idea, but it turns out it's a good idea. 991 00:57:18,370 --> 00:57:20,230 So here's what it looks like, you look at-- 992 00:57:20,230 --> 00:57:22,020 I'm going to write down 0. 993 00:57:22,020 --> 00:57:27,270 I'm going to write down a bunch of 0's, however big v1 is. 994 00:57:27,270 --> 00:57:28,630 Then I'm going to write a 1. 995 00:57:28,630 --> 00:57:30,670 Then I'm going to write a bunch of 0's, however 996 00:57:30,670 --> 00:57:35,510 big v2 minus v1 is. 997 00:57:35,510 --> 00:57:39,130 Then I'll write a 1, and so on. 998 00:57:39,130 --> 00:57:43,200 0 to the lead, the leading bits of v-- 999 00:57:43,200 --> 00:57:44,340 sorry. 1000 00:57:44,340 --> 00:57:49,190 It's the leading bits of v2 minus the leading bits of v1. 1001 00:57:49,190 --> 00:57:51,960 That's what I meant. 1002 00:57:51,960 --> 00:57:56,240 And then leading bits of v3 minus the leading bits of v2. 1003 00:57:56,240 --> 00:57:59,860 And then 1, and so on. 1004 00:57:59,860 --> 00:58:02,290 OK, that is unary differential encoding. 1005 00:58:02,290 --> 00:58:05,380 I claim this is small, looks kind of crazy. 
1006 00:58:05,380 --> 00:58:09,210 But it's small, because how many 0's are there total? 1007 00:58:09,210 --> 00:58:11,050 Well, at most, nk 0's. 1008 00:58:11,050 --> 00:58:15,340 Because I start at the value 0. 1009 00:58:15,340 --> 00:58:18,760 With log nk bits, at most I get up to nk minus 1. 1010 00:58:18,760 --> 00:58:22,505 So the number of times I increment is, at most, nk. 1011 00:58:25,660 --> 00:58:26,770 How many 1's are there? 1012 00:58:30,040 --> 00:58:34,030 Well, there's one 1, per value. 1013 00:58:34,030 --> 00:58:35,620 So there's nk over 2 1's. 1014 00:58:40,840 --> 00:58:46,030 So total size of this bit vector is 3/2 nk. 1015 00:58:48,580 --> 00:58:54,630 So storing those leading bits in this weird way is cheap. 1016 00:58:54,630 --> 00:58:57,470 Linear-- again, this geometric series 1017 00:58:57,470 --> 00:58:59,240 is going to add up to 3/2. 1018 00:58:59,240 --> 00:59:04,640 All right, it's going to add up to 3 times n. 1019 00:59:04,640 --> 00:59:06,924 Cool. 1020 00:59:06,924 --> 00:59:08,340 But that's just the leading bits-- 1021 00:59:08,340 --> 00:59:09,646 I need to store this thing. 1022 00:59:09,646 --> 00:59:11,020 I need to store the leading bits, 1023 00:59:11,020 --> 00:59:12,644 and I need to store the remaining bits. 1024 00:59:12,644 --> 00:59:15,660 Now the remaining bits, there's only 2 to the k remaining bits. 1025 00:59:15,660 --> 00:59:17,040 We switched the order. 1026 00:59:17,040 --> 00:59:18,624 We looked at the high log nk bits, 1027 00:59:18,624 --> 00:59:20,040 but then the low-end bits, there's 1028 00:59:20,040 --> 00:59:21,390 going to be 2 to the k of them. 1029 00:59:21,390 --> 00:59:23,700 That I already said was OK. 1030 00:59:23,700 --> 00:59:25,890 We could afford that-- 1031 00:59:25,890 --> 00:59:30,020 kind of, we'd lose a log log factor. 1032 00:59:30,020 --> 00:59:34,430 So we store the trailing 2 to the k bits. 1033 00:59:34,430 --> 00:59:36,940 This we actually store explicitly.
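A rough Python sketch of this split, with assumptions flagged: `split_and_encode` and its `low_bits` parameter are hypothetical names, values are plain integers, and the bit vector is a Python list rather than a packed structure. Each sorted value is cut into a high part, stored by unary differential encoding, and a low part, stored explicitly.

```python
def split_and_encode(sorted_values, low_bits):
    unary = []      # unary differential encoding of the leading (high) parts
    trailing = []   # low parts, stored explicitly
    prev_high = 0
    for v in sorted_values:
        high = v >> low_bits
        # One 0 per increment of the leading part, then a 1 to end this value.
        unary.extend([0] * (high - prev_high))
        unary.append(1)
        trailing.append(v & ((1 << low_bits) - 1))
        prev_high = high
    return unary, trailing

unary, trailing = split_and_encode([3, 9, 10, 27], low_bits=2)
```

In this sketch the number of 0's equals the last leading value and the number of 1's equals the number of values, which is the counting argument used above.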
1034 00:59:41,280 --> 00:59:44,350 So this is going to be 2 to the k times 1035 00:59:44,350 --> 00:59:50,520 nk over 2, which is n/2 bits. 1036 00:59:50,520 --> 00:59:52,840 nk is n over 2 to the k. 1037 00:59:52,840 --> 00:59:56,650 Cancel, n over 2. 1038 00:59:56,650 --> 01:00:00,880 OK, so total number of bits-- we add these up-- 1039 01:00:00,880 --> 01:00:11,130 is going to be 1/2 n plus 3/2 nk plus-- 1040 01:00:11,130 --> 01:00:13,410 we'll get to this later. 1041 01:00:13,410 --> 01:00:19,710 And then the total, this we have to do for log log n levels. 1042 01:00:19,710 --> 01:00:25,170 We're summing k equals 0 to log log n. 1043 01:00:25,170 --> 01:00:26,940 This thing. 1044 01:00:26,940 --> 01:00:33,604 And this comes out to 1/2 n log log n. 1045 01:00:33,604 --> 01:00:35,270 This is bad, we want to get rid of that. 1046 01:00:35,270 --> 01:00:42,660 But that was our first aim, then we have 5n-- 1047 01:00:42,660 --> 01:00:43,800 did I miss a term? 1048 01:00:47,580 --> 01:00:48,080 OK. 1049 01:00:53,555 --> 01:00:57,060 Where did I miss the nk? 1050 01:00:57,060 --> 01:01:01,450 This was the cost for even successor. 1051 01:01:01,450 --> 01:01:05,010 OK, but there was also, is even suffix, which was nk bits, 1052 01:01:05,010 --> 01:01:08,010 and there was even rank, which was little o of that. 1053 01:01:08,010 --> 01:01:13,035 So there's an extra nk here for is even suffix. 1054 01:01:17,760 --> 01:01:20,490 OK, so we have nk plus 3/2 nk. 1055 01:01:20,490 --> 01:01:22,080 That's 5/2 nk. 1056 01:01:22,080 --> 01:01:23,850 And then the 1/2 disappears because it's 1057 01:01:23,850 --> 01:01:24,840 a geometric series. 1058 01:01:24,840 --> 01:01:28,710 So we end up with 5n, for what it's worth. 1059 01:01:28,710 --> 01:01:30,770 Plus big O of something. 1060 01:01:30,770 --> 01:01:32,730 OK, I left out something, because there's 1061 01:01:32,730 --> 01:01:35,640 one data structure we haven't yet described. 
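Written out, the sum being described looks roughly as follows. This is a reconstruction from the spoken terms, with n_k = n / 2^k and the rank and select overhead left out; the exact lower-order constant depends on how the endpoints of the sum are counted, which the lecture treats loosely.

```latex
\sum_{k=0}^{\log\log n}
  \Bigl( \underbrace{\tfrac{1}{2}\,n}_{\text{trailing bits}}
       + \underbrace{\tfrac{3}{2}\,n_k}_{\text{unary leading bits}}
       + \underbrace{n_k}_{\text{is-even-suffix}} \Bigr)
  = \tfrac{1}{2}\,n\,(\log\log n + 1) + \tfrac{5}{2}\sum_{k=0}^{\log\log n}\frac{n}{2^k}
  \le \tfrac{1}{2}\,n\log\log n + 5n + \tfrac{1}{2}\,n .
```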
1062 01:01:35,640 --> 01:01:37,290 There's one more thing we need. 1063 01:01:37,290 --> 01:01:40,230 And that comes up if you want to do a query in the structure. 1064 01:01:40,230 --> 01:01:41,280 How do I do a query? 1065 01:01:44,490 --> 01:01:46,620 I already did odd rank, so I'm just 1066 01:01:46,620 --> 01:01:50,430 trying to look up into the sorted array, at a given index. 1067 01:01:50,430 --> 01:01:55,740 Well, first thing is to compute the leading bits. 1068 01:01:55,740 --> 01:01:58,650 Actually, computing leading bits is really easy 1069 01:01:58,650 --> 01:02:00,720 if I have rank and select. 1070 01:02:00,720 --> 01:02:05,590 What I want, if I'm trying to index into index i, 1071 01:02:05,590 --> 01:02:07,860 I want the i-th one bit. 1072 01:02:07,860 --> 01:02:09,390 To look at the i-th one bit, which 1073 01:02:09,390 --> 01:02:18,690 is select sub 1 of i, which we already know how to do, 1074 01:02:18,690 --> 01:02:24,040 then that corresponds to the i-th value. 1075 01:02:24,040 --> 01:02:25,980 And in particular, if I look at how many 1076 01:02:25,980 --> 01:02:30,570 0's are there up to that point, it's 1077 01:02:30,570 --> 01:02:31,950 going to be the sum of this. 1078 01:02:31,950 --> 01:02:35,220 Plus this, plus this, it's a telescoping sum. 1079 01:02:35,220 --> 01:02:39,380 It's just going to give me the leading bits. 1080 01:02:39,380 --> 01:02:42,350 Because this plus this is just lead of v2. 1081 01:02:42,350 --> 01:02:44,930 This plus that is lead of v3. 1082 01:02:44,930 --> 01:02:45,944 So they all cancel. 1083 01:02:45,944 --> 01:02:47,360 I just count the number of 0 bits. 1084 01:02:47,360 --> 01:02:50,540 That's exactly the value I want to know. 1085 01:02:50,540 --> 01:02:56,960 So I want to do rank sub 0 of that position. 1086 01:02:56,960 --> 01:03:00,780 That will tell me the leading bits. 1087 01:03:00,780 --> 01:03:08,730 In a query, it's not really lead of i, I guess. 
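The select-then-rank query can be sketched as below. The function names are illustrative, both operations are linear scans standing in for the constant-time rank and select structures, and the index i counts from 0 here rather than from 1.

```python
def select1(bits, i):
    # Position of the i-th 1 bit, counting from 0.
    seen = -1
    for pos, bit in enumerate(bits):
        if bit == 1:
            seen += 1
            if seen == i:
                return pos
    raise IndexError("not enough 1 bits")

def rank0(bits, pos):
    # Number of 0 bits strictly before position pos.
    return bits[:pos].count(0)

def leading_bits(bits, i):
    # The 0's before the i-th 1 telescope to the leading bits of value i.
    return rank0(bits, select1(bits, i))

# Unary differential encoding of the leading values 0, 2, 2, 6.
unary = [1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
```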
1088 01:03:08,730 --> 01:03:13,040 Lead of vi is what we're trying to compute. 1089 01:03:13,040 --> 01:03:14,540 Now, we also need the trailing bits. 1090 01:03:14,540 --> 01:03:16,090 The trailing bits, they're just in an array, 1091 01:03:16,090 --> 01:03:17,220 so you just look that up. 1092 01:03:17,220 --> 01:03:18,303 You get the trailing bits. 1093 01:03:18,303 --> 01:03:20,180 You concatenate those two words, the leading 1094 01:03:20,180 --> 01:03:22,730 bits of the trailing bits-- boom, you have your answer. 1095 01:03:22,730 --> 01:03:25,820 That gives you the even successor. 1096 01:03:25,820 --> 01:03:28,130 So the only thing is we need to store 1097 01:03:28,130 --> 01:03:30,020 rank and a select structure. 1098 01:03:30,020 --> 01:03:36,320 And for rank, we used nk over log log nk space. 1099 01:03:36,320 --> 01:03:39,130 Again, that can be improved to nk over polylog nk. 1100 01:03:39,130 --> 01:03:40,880 But let's not worry about that. 1101 01:03:52,290 --> 01:03:54,700 Item 1 completes. 1102 01:03:54,700 --> 01:03:59,410 We now have a T log log T bit suffix array. 1103 01:03:59,410 --> 01:04:01,950 Next, we need to make it order T, 1104 01:04:01,950 --> 01:04:05,261 then we need to make it into suffix tree. 1105 01:04:05,261 --> 01:04:06,760 We're going to move a little faster. 1106 01:04:11,570 --> 01:04:12,465 Where to go now? 1107 01:04:22,080 --> 01:04:24,270 Now I want a compact suffix array. 1108 01:04:31,670 --> 01:04:33,770 I'm going to use the same definition. 1109 01:04:33,770 --> 01:04:36,849 Everything's going to be more or less the same. 1110 01:04:36,849 --> 01:04:38,765 I just can't afford to store all these levels. 1111 01:04:44,034 --> 01:04:45,200 There were log log n levels. 1112 01:04:45,200 --> 01:04:47,100 Log log n levels is too expensive. 1113 01:04:47,100 --> 01:04:49,240 Each one costs linear space. 1114 01:04:49,240 --> 01:04:52,096 So I'm only going to store a constant number of levels. 
1115 01:04:54,850 --> 01:05:00,350 Only store 1 over epsilon plus 1 levels. 1116 01:05:03,350 --> 01:05:06,410 And not just any levels, but the first level, 1117 01:05:06,410 --> 01:05:10,390 the epsilon l-th level, the 2 epsilon l-th level, 1118 01:05:10,390 --> 01:05:11,660 up to the l-th level. 1119 01:05:11,660 --> 01:05:14,486 So it's still log log n levels. 1120 01:05:14,486 --> 01:05:17,120 I'm just going to skip a lot of them. 1121 01:05:17,120 --> 01:05:19,130 Now, it's going to be different. 1122 01:05:19,130 --> 01:05:21,570 I can't use even successor anymore. 1123 01:05:21,570 --> 01:05:27,080 Instead, even is going to be replaced 1124 01:05:27,080 --> 01:05:32,740 with the notion of divisible by 2 to the epsilon l, 1125 01:05:32,740 --> 01:05:33,950 instead of divisible by 2. 1126 01:05:37,460 --> 01:05:40,310 So I do all this, but replace the notion of even 1127 01:05:40,310 --> 01:05:44,990 with divisible by 2 to the epsilon l. 1128 01:05:44,990 --> 01:05:57,860 Because this is when you are in SA sub k plus 1 epsilon l. 1129 01:05:57,860 --> 01:06:00,390 The whole name of the game is, you're 1130 01:06:00,390 --> 01:06:03,620 trying to do a query in SA k epsilon l, 1131 01:06:03,620 --> 01:06:07,250 and now you want to reduce it to SA k plus 1 epsilon l. 1132 01:06:07,250 --> 01:06:10,790 And these are the suffixes that are explicitly represented. 1133 01:06:10,790 --> 01:06:13,610 Everything else needs to be rounded to that value, then 1134 01:06:13,610 --> 01:06:17,215 rounded back, like we had with our giant formula before. 1135 01:06:17,215 --> 01:06:19,340 It's not so easy to write a single formula anymore, 1136 01:06:19,340 --> 01:06:21,156 it's now really an algorithm. 1137 01:06:24,440 --> 01:06:30,010 So to compute SA k epsilon l of i, 1138 01:06:30,010 --> 01:06:34,510 what you do is follow a new thing, which 1139 01:06:34,510 --> 01:06:38,600 I'm going to call just successor of i, 1140 01:06:38,600 --> 01:06:45,050 repeatedly to get a new index j.
1141 01:06:47,620 --> 01:06:51,810 Or I guess call it i prime, make it a little clearer-- 1142 01:06:51,810 --> 01:06:52,940 until it's even. 1143 01:07:00,080 --> 01:07:02,090 So before, we just had to make one step, 1144 01:07:02,090 --> 01:07:03,110 and then we were even. 1145 01:07:03,110 --> 01:07:06,680 Now, we're going to have to make potentially epsilon l steps. 1146 01:07:06,680 --> 01:07:08,620 So this could cost log log n. 1147 01:07:08,620 --> 01:07:10,860 Log log n, that's not much. 1148 01:07:10,860 --> 01:07:14,700 Actually-- sorry, not log log n. 1149 01:07:14,700 --> 01:07:16,850 This is going to cost 2 to the epsilon 1150 01:07:16,850 --> 01:07:20,020 l, because it's divisible by 2 to the epsilon l. 1151 01:07:20,020 --> 01:07:23,660 2 to the epsilon l is log to the epsilon. 1152 01:07:23,660 --> 01:07:33,650 So this now may take log to the epsilon T steps. 1153 01:07:33,650 --> 01:07:37,460 This is where we're going to get the log to the epsilon penalty, 1154 01:07:37,460 --> 01:07:38,800 in time. 1155 01:07:38,800 --> 01:07:42,410 OK, but it's simple linear search, nothing clever here. 1156 01:07:42,410 --> 01:07:44,040 Now, what is successor? 1157 01:07:44,040 --> 01:07:45,950 Well, successor is just the same thing. 1158 01:07:48,710 --> 01:07:51,740 If you're even in this strong sense, then nothing happens. 1159 01:07:51,740 --> 01:07:53,640 Otherwise, you just-- same definition. 1160 01:07:53,640 --> 01:07:56,360 This part is exactly the same. 1161 01:07:56,360 --> 01:07:59,180 Just go to the next position, the next suffix. 1162 01:07:59,180 --> 01:08:01,340 But now we have to follow it several times, 1163 01:08:01,340 --> 01:08:04,220 until we get to an even one. 1164 01:08:04,220 --> 01:08:05,630 OK. 1165 01:08:05,630 --> 01:08:13,220 Then we recurse, just like before on SA k plus 1166 01:08:13,220 --> 01:08:15,260 1, epsilon l. 
1167 01:08:15,260 --> 01:08:20,456 The next level down of the-- 1168 01:08:20,456 --> 01:08:22,430 I think we can still call it even rank. 1169 01:08:35,520 --> 01:08:46,370 And then we multiply by 2 to the epsilon l. 1170 01:08:49,319 --> 01:08:57,020 And then subtract the number of steps we did, in 1. 1171 01:09:03,319 --> 01:09:05,160 We made several steps here, we need 1172 01:09:05,160 --> 01:09:07,180 to undo those steps at the end. 1173 01:09:07,180 --> 01:09:07,680 That's it. 1174 01:09:07,680 --> 01:09:09,846 So it's just the same as before, except before there 1175 01:09:09,846 --> 01:09:12,010 was one step here, and at most, one step here. 1176 01:09:12,010 --> 01:09:14,439 Now you just count them, subtract at the end. 1177 01:09:14,439 --> 01:09:16,620 So exactly the same template, just 1178 01:09:16,620 --> 01:09:18,689 skipping a lot of the levels. 1179 01:09:18,689 --> 01:09:25,529 And now the space is going to be 1 over epsilon, plus 1 times n. 1180 01:09:25,529 --> 01:09:27,880 That's it. 1181 01:09:27,880 --> 01:09:29,560 OK, so let me analyze a little bit. 1182 01:09:35,340 --> 01:09:37,920 So you have to check that all of this works. 1183 01:09:37,920 --> 01:09:39,689 Is is even suffix, that's easy. 1184 01:09:39,689 --> 01:09:40,680 It's still nk bits. 1185 01:09:40,680 --> 01:09:42,479 Even rank, still nk bits. 1186 01:09:42,479 --> 01:09:45,240 Even successor, we did all this fancy encoding. 1187 01:09:45,240 --> 01:09:47,472 The one thing you can't do, is this part. 1188 01:09:47,472 --> 01:09:49,680 I mean, there aren't very many even suffixes anymore. 1189 01:09:49,680 --> 01:09:54,970 So it really doesn't help you, it buys you a very tiny factor. 1190 01:09:54,970 --> 01:10:00,840 But 1 over 2 to the epsilon l are going to be even. 1191 01:10:00,840 --> 01:10:01,891 So that's very few. 1192 01:10:01,891 --> 01:10:04,140 So you still have to store all the answers, basically. 
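One level of this algorithm can be sketched in Python, under several assumptions that are not from the lecture: the names `build_sampled` and `lookup` are hypothetical, `block` plays the role of 2 to the epsilon l, the successor table is stored as a plain list instead of the compressed encoding, the next level is stored explicitly instead of being answered recursively, indices are 0-based, and the last text position is assumed to be a multiple of `block`.

```python
def build_sampled(text, block):
    n = len(text)
    assert (n - 1) % block == 0, "sketch assumes the last position is sampled"
    sa = sorted(range(n), key=lambda s: text[s:])  # naive suffix array
    inverse = {start: i for i, start in enumerate(sa)}
    # "Even" in the strong sense: start position divisible by block.
    marked = [start % block == 0 for start in sa]
    # Next stored level: marked entries divided by block, in the same order.
    sampled = [start // block for start in sa if start % block == 0]
    # Successor: stay put if marked, otherwise move to the next suffix.
    succ = [i if marked[i] else inverse[sa[i] + 1] for i in range(n)]
    return marked, sampled, succ

def lookup(marked, sampled, succ, block, i):
    steps = 0
    while not marked[i]:      # at most block - 1 steps
        i = succ[i]
        steps += 1
    rank = sum(marked[:i])    # linear-scan stand-in for the rank structure
    return block * sampled[rank] - steps  # undo the steps at the end

marked, sampled, succ = build_sampled("abracada$", 4)
```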
1193 01:10:04,140 --> 01:10:06,730 But you can do all this ordering trick, it still works. 1194 01:10:06,730 --> 01:10:10,650 We weren't really exploiting the fact that it was odd. 1195 01:10:10,650 --> 01:10:13,290 And now you have to-- this is not a single character, 1196 01:10:13,290 --> 01:10:16,800 it's a bunch of characters. 1197 01:10:16,800 --> 01:10:19,860 But still-- and so now instead of 2 to the k, 1198 01:10:19,860 --> 01:10:24,480 it's probably 2 to the k epsilon l. 1199 01:10:24,480 --> 01:10:25,950 But it all works out. 1200 01:10:25,950 --> 01:10:29,160 It's just a renaming of everything. 1201 01:10:29,160 --> 01:10:32,665 It's still going to be linear number of bits, I claim. 1202 01:10:32,665 --> 01:10:34,790 I don't want to go through a formal proof for that, 1203 01:10:34,790 --> 01:10:35,581 we don't have time. 1204 01:10:38,290 --> 01:10:39,550 But all the same tricks work. 1205 01:10:45,730 --> 01:10:53,500 So the claim is space going to be sum 1206 01:10:53,500 --> 01:10:58,850 k equals 0 to 1 over epsilon. 1207 01:10:58,850 --> 01:11:05,480 nk epsilon l, plus n, plus 2 nk epsilon l, 1208 01:11:05,480 --> 01:11:12,702 plus the select bound, n over log log n. 1209 01:11:17,270 --> 01:11:17,900 Why? 1210 01:11:17,900 --> 01:11:20,930 Because this is storing the is even structure. 1211 01:11:20,930 --> 01:11:23,180 That was just nk bits. 1212 01:11:23,180 --> 01:11:27,784 And then, this is the successor. 1213 01:11:27,784 --> 01:11:29,049 This is, is even. 1214 01:11:33,270 --> 01:11:36,080 Same as we had over here, except there's no 1/2 anymore. 1215 01:11:36,080 --> 01:11:38,840 It's just n plus-- 1216 01:11:38,840 --> 01:11:43,430 claim is 2 nk epsilon l. 1217 01:11:43,430 --> 01:11:45,590 That's the right answer. 1218 01:11:45,590 --> 01:11:47,694 Yeah, that 3 was because of this, plus this. 1219 01:11:47,694 --> 01:11:50,110 So we still have the 3, just don't divide it by 2 anymore. 
1220 01:11:55,950 --> 01:12:04,060 So this equals some constant times n, 6n 1221 01:12:04,060 --> 01:12:05,840 plus 1 over epsilon n. 1222 01:12:09,410 --> 01:12:14,416 Plus order n over log log n bits. 1223 01:12:18,520 --> 01:12:19,840 OK, not bad. 1224 01:12:19,840 --> 01:12:22,400 Not quite as good as this bound for binary alphabet, 1225 01:12:22,400 --> 01:12:25,210 so ignore the log sigma. 1226 01:12:25,210 --> 01:12:26,980 Before we had 1 plus 1 over epsilon, now 1227 01:12:26,980 --> 01:12:28,540 we have 6 plus 1 over epsilon. 1228 01:12:32,494 --> 01:12:33,660 Kind of running out of time. 1229 01:12:33,660 --> 01:12:40,260 I'll just tell you, you can tune this to 1 over epsilon n, 1230 01:12:40,260 --> 01:12:44,050 plus the little o, with two very simple tricks. 1231 01:12:44,050 --> 01:12:45,310 Two simple observations. 1232 01:12:45,310 --> 01:12:51,540 The first one is, the successor structure. 1233 01:12:51,540 --> 01:12:55,760 At level 0, there's nothing to do. 1234 01:12:55,760 --> 01:12:56,260 Why? 1235 01:12:56,260 --> 01:13:02,710 Because level 0-- a single step just 1236 01:13:02,710 --> 01:13:04,900 corresponds to walking in the string. 1237 01:13:04,900 --> 01:13:08,437 I've got to think about this a little bit. 1238 01:13:08,437 --> 01:13:16,420 Successor-- Actually not quite clear to me why that's true, 1239 01:13:16,420 --> 01:13:17,820 but it turns out to be true. 1240 01:13:17,820 --> 01:13:20,660 It's an exercise, I guess. 1241 01:13:20,660 --> 01:13:23,580 At level 0, you don't need the successor structure. 1242 01:13:23,580 --> 01:13:27,210 So that actually saves you a big factor, because if you 1243 01:13:27,210 --> 01:13:28,680 can skip the very-- 1244 01:13:28,680 --> 01:13:32,340 k equals 0, then you get to skip-- you get to divide by 2 1245 01:13:32,340 --> 01:13:33,870 to the epsilon l, the space. 1246 01:13:33,870 --> 01:13:38,850 So that gets rid of this term.
1247 01:13:38,850 --> 01:13:43,630 Then there's this other term, which you can skip, 1248 01:13:43,630 --> 01:13:45,660 or you can store is even more efficiently. 1249 01:13:45,660 --> 01:13:48,219 So before, is even should be a bit vector. 1250 01:13:48,219 --> 01:13:50,010 Because half of them are even, half of them 1251 01:13:50,010 --> 01:13:52,200 are odd, that's the optimal thing to do. 1252 01:13:52,200 --> 01:13:55,920 But in this structure, most of them are not even. 1253 01:13:55,920 --> 01:14:00,240 So you can save a little bit using succinct dictionaries. 1254 01:14:00,240 --> 01:14:01,800 Because there are very few ones-- 1255 01:14:01,800 --> 01:14:05,160 you can achieve log, the total number of things, 1256 01:14:05,160 --> 01:14:08,240 choose the number of ones. 1257 01:14:08,240 --> 01:14:10,980 Log of that binomial coefficient; the total is the number 1258 01:14:10,980 --> 01:14:12,710 of 0's plus 1's. 1259 01:14:12,710 --> 01:14:15,170 Not going to work it out, it's worked out in the notes. 1260 01:14:15,170 --> 01:14:17,550 But if you store that more efficient dictionary, which 1261 01:14:17,550 --> 01:14:20,010 we claimed could be done last time, 1262 01:14:20,010 --> 01:14:23,760 then this turns out to get a nice sort of cascading thing. 1263 01:14:23,760 --> 01:14:27,210 And it's little o of n, in the end. 1264 01:14:27,210 --> 01:14:28,920 So that gets rid of this term. 1265 01:14:28,920 --> 01:14:32,580 And so you're left with just n times 1 over epsilon. 1266 01:14:32,580 --> 01:14:34,680 Plus 1, because you have to store the text also. 1267 01:14:37,200 --> 01:14:43,410 Or maybe because of this plus 1, anyway. 1268 01:14:43,410 --> 01:14:45,960 Boom. 1269 01:14:45,960 --> 01:14:48,720 That's all I want to say about this structure. 1270 01:14:48,720 --> 01:14:51,310 So I wanted to focus on the ideas, which got us 1271 01:14:51,310 --> 01:14:54,940 the T log log T.
Just apply the same ideas, 1272 01:14:54,940 --> 01:14:56,119 but much more sparsely. 1273 01:14:56,119 --> 01:14:57,910 You lose in running time, instead of paying 1274 01:14:57,910 --> 01:15:00,240 log log T. Now we pay-- 1275 01:15:00,240 --> 01:15:02,520 we pay log to the epsilon times log log T, 1276 01:15:02,520 --> 01:15:04,350 but that's just log to some other epsilon. 1277 01:15:06,870 --> 01:15:10,320 So that gives us better space. 1278 01:15:10,320 --> 01:15:13,700 Now linear space, instead of n log log n space. 1279 01:15:13,700 --> 01:15:16,290 Any questions about that? 1280 01:15:16,290 --> 01:15:16,790 All right. 1281 01:15:19,310 --> 01:15:24,740 Now, I get to hurry through transforming suffix arrays, 1282 01:15:24,740 --> 01:15:25,730 into suffix trees. 1283 01:15:35,611 --> 01:15:37,110 This is actually a much older paper. 1284 01:15:37,110 --> 01:15:45,710 It's by Munro, Raman, and Rao. 1285 01:15:45,710 --> 01:15:49,370 There's two versions of it in the same paper. 1286 01:15:49,370 --> 01:15:51,680 First version is going to be compact, second version 1287 01:15:51,680 --> 01:15:52,450 is succinct. 1288 01:15:52,450 --> 01:15:54,950 Probably won't have much time to cover the succinct version, 1289 01:15:54,950 --> 01:15:57,920 but here's what we do. 1290 01:15:57,920 --> 01:16:00,950 Start with compact. 1291 01:16:00,950 --> 01:16:04,460 Store compressed-- we're going to assume 1292 01:16:04,460 --> 01:16:12,230 binary alphabet again, as this paper does, I believe. 1293 01:16:12,230 --> 01:16:17,090 Store the suffix tree, but only store the trie part of it. 1294 01:16:17,090 --> 01:16:19,510 Suffix tree really consists of trie-- 1295 01:16:19,510 --> 01:16:22,260 binary trie, if it's a binary alphabet. 1296 01:16:22,260 --> 01:16:25,220 Plus, lengths on the edges. 1297 01:16:25,220 --> 01:16:26,960 Don't store the lengths. 1298 01:16:26,960 --> 01:16:30,980 Or, as Ian likes to call it, skip the skips.
1299 01:16:30,980 --> 01:16:33,050 The lengths of an edge is how many bits 1300 01:16:33,050 --> 01:16:36,530 you're supposed to skip, so skip those. 1301 01:16:36,530 --> 01:16:39,270 Just store the trie structure. 1302 01:16:39,270 --> 01:16:43,250 So the trie structure is on 2n plus 1 nodes, 1303 01:16:43,250 --> 01:16:45,705 because there is n leaves, and minus 1. 1304 01:16:45,705 --> 01:16:47,991 Telling me it's plus 1, I don't know. 1305 01:16:47,991 --> 01:16:50,820 2n plus a constant nodes. 1306 01:16:50,820 --> 01:16:55,160 So this is 4n bits. 1307 01:16:55,160 --> 01:16:57,440 We know how to do binary tries, finally 1308 01:16:57,440 --> 01:16:59,090 we're using last lecture. 1309 01:16:59,090 --> 01:17:01,100 We use rank and select a lot, but now are 1310 01:17:01,100 --> 01:17:02,360 using the binary trie. 1311 01:17:02,360 --> 01:17:05,670 We're going to store this using the balanced paren structure. 1312 01:17:05,670 --> 01:17:08,500 OK, so you have to double that-- this linear number of bits, 1313 01:17:08,500 --> 01:17:11,540 so if we're just looking for compact, that's fine. 1314 01:17:11,540 --> 01:17:13,630 Now the hard part is in a search, 1315 01:17:13,630 --> 01:17:18,630 where we go from one node, to the next node. 1316 01:17:18,630 --> 01:17:20,445 We need to know the length of this edge, 1317 01:17:20,445 --> 01:17:23,142 we've got to figure that out. 1318 01:17:23,142 --> 01:17:25,100 We need to know whether the pattern jumped off, 1319 01:17:25,100 --> 01:17:26,690 or something. 1320 01:17:26,690 --> 01:17:31,190 We need to know at position y, which letter of the pattern 1321 01:17:31,190 --> 01:17:33,620 should we branch on. 1322 01:17:33,620 --> 01:17:36,320 So we need to measure this length. 1323 01:17:36,320 --> 01:17:37,760 Not too hard. 1324 01:17:37,760 --> 01:17:40,280 What you do, you look at this subtree. 1325 01:17:40,280 --> 01:17:44,330 You look at the leftmost leaf and the rightmost leaf. 
1326 01:17:44,330 --> 01:17:46,190 You look at their longest common prefix, 1327 01:17:46,190 --> 01:17:48,304 starting from the character you care about. 1328 01:17:48,304 --> 01:17:50,720 And you look at the longest common prefix with the pattern 1329 01:17:50,720 --> 01:17:53,120 P. All sounds easy-- 1330 01:17:53,120 --> 01:17:55,430 how do you actually do it? 1331 01:17:55,430 --> 01:17:58,700 So you need to be able to find the leftmost leaf in a subtree. 1332 01:17:58,700 --> 01:18:02,420 Leaves in the balanced paren expression-- 1333 01:18:02,420 --> 01:18:05,270 I think last class, I mistakenly thought they were that. 1334 01:18:05,270 --> 01:18:07,190 In fact, they are this. 1335 01:18:07,190 --> 01:18:09,080 Think about it long enough. 1336 01:18:09,080 --> 01:18:11,832 This was leaves in the rooted ordered tree, 1337 01:18:11,832 --> 01:18:14,040 but what we care about are leaves in the binary tree. 1338 01:18:14,040 --> 01:18:15,353 And they always look like open paren, 1339 01:18:15,353 --> 01:18:16,730 closed paren, and closed paren. 1340 01:18:16,730 --> 01:18:19,820 So this is a leaf, and so what we're 1341 01:18:19,820 --> 01:18:22,250 asking for is in a subtree, we'll find the first leaf. 1342 01:18:22,250 --> 01:18:26,120 That's actually just going to be right after this open paren. 1343 01:18:26,120 --> 01:18:32,870 Or, I guess, you do a select, select sub this, 1344 01:18:32,870 --> 01:18:34,619 to jump to the next leaf. 1345 01:18:34,619 --> 01:18:36,660 Then also, you can jump to the end of the subtree 1346 01:18:36,660 --> 01:18:39,890 and then go back to the previous leaf, using rank and select. 1347 01:18:39,890 --> 01:18:42,050 So I won't go into details, but that's easy to do. 1348 01:18:42,050 --> 01:18:44,000 So you can identify the two leaves 1349 01:18:44,000 --> 01:18:47,720 using rank sub, this thing.
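The leaf bookkeeping just described can be sketched in a few lines. This is only an illustration under stated assumptions: the binary tree is given as nested `(left, right)` tuples, it is encoded with the usual "left child becomes first child, right child becomes next sibling" mapping plus one extra root, and the rank operation is a naive linear scan rather than the constant-time structure from last lecture. The names `encode`, `to_parens`, `leaf_rank`, `subtree_end`, and `leaf_range` are hypothetical and do not come from the lecture or the paper.

```python
def encode(node):
    """Balanced-paren string for a binary tree given as None or (left, right)."""
    if node is None:
        return ""
    left, right = node
    # Left child becomes the first child, right child becomes the next sibling.
    return "(" + encode(left) + ")" + encode(right)


def to_parens(root):
    # Wrap everything in the extra root of the rooted ordered tree.
    return "(" + encode(root) + ")"


def leaf_rank(parens, pos):
    """Number of binary-tree leaves, i.e. "())" occurrences, starting before pos."""
    return sum(1 for i in range(pos) if parens.startswith("())", i))


def subtree_end(parens, pos):
    """Position of the close paren that ends the binary subtree opened at pos."""
    depth = 0
    for i in range(pos, len(parens)):
        depth += 1 if parens[i] == "(" else -1
        if depth < 0:
            # This close paren belongs to the parent in the ordered tree.
            return i
    raise ValueError("pos is not inside a balanced expression")


def leaf_range(parens, pos):
    """(leftmost, rightmost) leaf numbers, 0-based, under the node opened at pos."""
    first = leaf_rank(parens, pos)
    last = leaf_rank(parens, subtree_end(parens, pos)) - 1
    return first, last
```

For example, with `LEAF = (None, None)` and the tree `(LEAF, (LEAF, LEAF))`, `to_parens` gives `((())(())())`, and calling `leaf_range` at the open paren of the right internal node (position 5) returns the pair of leaf numbers that bracket its subtree. Those two numbers are what get looked up in the suffix array.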
1350 01:18:47,720 --> 01:18:51,230 I can identify the leaf number, so I can identify 1351 01:18:51,230 --> 01:18:53,090 where these leaves are. 1352 01:18:53,090 --> 01:18:54,950 Now, I have a suffix array. 1353 01:18:54,950 --> 01:18:58,520 If I look up the suffix array of these two leaf numbers-- 1354 01:18:58,520 --> 01:19:01,490 remember leaves are ordered by suffix in sorted order 1355 01:19:01,490 --> 01:19:03,222 by suffix array. 1356 01:19:03,222 --> 01:19:05,180 These are really indices into the suffix array. 1357 01:19:05,180 --> 01:19:07,580 They're giving me-- oh, this is the i-th suffix, 1358 01:19:07,580 --> 01:19:08,870 this is the j-th suffix. 1359 01:19:08,870 --> 01:19:11,078 So I look at those two positions of the suffix array, 1360 01:19:11,078 --> 01:19:15,560 I teleport over to the string T. Now I have the actual suffixes 1361 01:19:15,560 --> 01:19:17,120 corresponding to this and this. 1362 01:19:17,120 --> 01:19:19,160 And I just look at where they match. 1363 01:19:19,160 --> 01:19:22,940 I know that if I've already gone down to depth d, 1364 01:19:22,940 --> 01:19:23,870 letter depth d. 1365 01:19:23,870 --> 01:19:26,120 I already know that they match the first d characters. 1366 01:19:26,120 --> 01:19:27,110 I don't compare those. 1367 01:19:27,110 --> 01:19:28,460 They're guaranteed to match. 1368 01:19:28,460 --> 01:19:30,800 So I start at position d plus 1. 1369 01:19:30,800 --> 01:19:32,930 I know they should match, but one more letter. 1370 01:19:32,930 --> 01:19:34,430 How many more letters do they match? 1371 01:19:34,430 --> 01:19:36,900 That is the length of this thing. 1372 01:19:36,900 --> 01:19:37,415 OK. 1373 01:19:37,415 --> 01:19:38,790 How can I afford to pay for that? 1374 01:19:38,790 --> 01:19:41,030 I'm just going to spend linear cost, the total number 1375 01:19:41,030 --> 01:19:42,405 of characters I compare, is going 1376 01:19:42,405 --> 01:19:44,820 to be equal to the length of the pattern.
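Here is a minimal sketch of that edge-length computation, assuming a plain Python string for T and a naively built suffix array (sorting all suffixes, which is neither succinct nor fast). The function names `suffix_array` and `edge_length` are hypothetical, and the example uses a general alphabet rather than the binary alphabet assumed in the lecture. The real search would also compare against the pattern P as it goes and stop early; this sketch only measures the edge.

```python
def suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def edge_length(text, sa, left_leaf, right_leaf, depth):
    """Length of the edge entering a node whose subtree spans the given leaves.

    left_leaf and right_leaf are indices into the suffix array, and depth is
    the letter depth already matched on the way down, so the first `depth`
    characters of both suffixes are known to agree and are not compared again.
    """
    a = sa[left_leaf] + depth
    b = sa[right_leaf] + depth
    if left_leaf == right_leaf:
        # A single leaf: the edge runs to the end of that suffix.
        return len(text) - a
    k = 0
    while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
        k += 1
    return k
```

On `banana$`, the two suffixes starting with `n` are leaves 5 and 6, and `edge_length(text, sa, 5, 6, 0)` measures the edge labeled `na` from the root. Each call makes two suffix array accesses, which is where the multiplicative suffix-array cost in the running time comes from.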
1377 01:19:44,820 --> 01:19:47,750 So we're going to end up getting length of the pattern, 1378 01:19:47,750 --> 01:19:51,762 times the cost to do a suffix array access. 1379 01:19:51,762 --> 01:19:53,720 Because I have to do this at every single step, 1380 01:19:53,720 --> 01:19:55,460 in the worst case. 1381 01:19:55,460 --> 01:19:58,010 So not perfect, but pretty good. 1382 01:19:58,010 --> 01:20:01,864 Roughly P, suffix array access is like log to the epsilon. 1383 01:20:01,864 --> 01:20:03,530 So we're getting a P log to the epsilon. 1384 01:20:03,530 --> 01:20:08,284 Not quite as good as this bound, but because here the P 1385 01:20:08,284 --> 01:20:09,950 is not multiplied by log to the epsilon. 1386 01:20:09,950 --> 01:20:12,000 But, it's just log to the epsilon. 1387 01:20:12,000 --> 01:20:13,760 If you want to see a better way to do it, 1388 01:20:13,760 --> 01:20:15,434 you can read the Grossi-Vitter paper. 1389 01:20:15,434 --> 01:20:16,850 But this is a decent way to do it. 1390 01:20:20,900 --> 01:20:25,997 Now briefly, this is the compact version, 1391 01:20:25,997 --> 01:20:27,830 and let me tell you how to make it succinct. 1392 01:20:33,504 --> 01:20:35,170 I'm not going to touch the suffix array. 1393 01:20:35,170 --> 01:20:37,880 Suffix array, to make that succinct is harder. 1394 01:20:37,880 --> 01:20:41,200 But if I just want to make the suffix tree parts succinct, 1395 01:20:41,200 --> 01:20:43,870 I can use this same idea, but I can't 1396 01:20:43,870 --> 01:20:46,300 afford to store the whole trie. 1397 01:20:46,300 --> 01:20:48,820 So just going to use a little bit of indirection. 1398 01:20:48,820 --> 01:20:50,320 You can use as little as you want, 1399 01:20:50,320 --> 01:20:54,650 this is the log log log log log log log n factor. 1400 01:21:00,150 --> 01:21:15,250 Use the suffix tree above every b-th suffix. 1401 01:21:15,250 --> 01:21:19,960 So throw away all but a 1/b fraction of the leaves. 
1402 01:21:19,960 --> 01:21:22,600 And then, take the tree that remains. 1403 01:21:22,600 --> 01:21:25,210 So once you do a search, you won't find exactly the leaf 1404 01:21:25,210 --> 01:21:27,580 you want, but you'll be within an additive b 1405 01:21:27,580 --> 01:21:29,860 of the leaf you want. b here can be arbitrarily small. 1406 01:21:29,860 --> 01:21:32,530 This can be log log log log log n. 1407 01:21:32,530 --> 01:21:34,450 But something super constant. 1408 01:21:34,450 --> 01:21:37,180 Then if I use this structure, instead of being n-- 1409 01:21:37,180 --> 01:21:38,580 order n over b space-- 1410 01:21:38,580 --> 01:21:40,090 instead of being order n space, it's 1411 01:21:40,090 --> 01:21:43,000 going to be order n over b bits. 1412 01:21:43,000 --> 01:21:44,026 So, we win. 1413 01:21:44,026 --> 01:21:45,400 The only issue is now, how do you 1414 01:21:45,400 --> 01:21:49,890 find the correct leaf, as opposed to the incorrect leaf? 1415 01:21:53,161 --> 01:21:54,910 I don't really have time to talk about it. 1416 01:21:54,910 --> 01:21:57,100 You can look at the notes. 1417 01:21:57,100 --> 01:21:58,600 Rough idea is, well, you can have 1418 01:21:58,600 --> 01:22:01,810 a look-up table that lets you do whatever you want on b bits. 1419 01:22:01,810 --> 01:22:05,140 As long as b is less than, like, 1/2 log n. 1420 01:22:05,140 --> 01:22:09,730 Then you can encompass the whole trie, more or less. 1421 01:22:09,730 --> 01:22:11,620 And just hit it with a big look-up table 1422 01:22:11,620 --> 01:22:13,600 and do everything in constant time. 1423 01:22:13,600 --> 01:22:20,320 It's not quite so simple, because-- 1424 01:22:20,320 --> 01:22:21,610 easy summary, here. 1425 01:22:32,900 --> 01:22:38,840 Essentially, what you're doing is-- 1426 01:22:38,840 --> 01:22:40,175 these are the blocks. 1427 01:22:40,175 --> 01:22:42,410 So this is length b. 
1428 01:22:42,410 --> 01:22:45,560 You're finding this suffix, and you want to know, 1429 01:22:45,560 --> 01:22:47,439 which of these is the correct one. 1430 01:22:47,439 --> 01:22:49,730 In some sense, you have to do the search simultaneously 1431 01:22:49,730 --> 01:22:51,620 for all b of these guys. 1432 01:22:51,620 --> 01:22:53,870 And so you run down the search again, 1433 01:22:53,870 --> 01:22:55,647 but instead of searching for one pattern, 1434 01:22:55,647 --> 01:22:57,980 you search for all b of these patterns at the same time. 1435 01:22:57,980 --> 01:23:00,830 Now they're mostly the same, and so you can 1436 01:23:00,830 --> 01:23:02,280 prove it doesn't hurt you much. 1437 01:23:02,280 --> 01:23:04,390 Maybe it hurts you an additive b. 1438 01:23:04,390 --> 01:23:07,730 I believe the correct answer is, in time, you end up 1439 01:23:07,730 --> 01:23:13,430 paying order P plus b time. 1440 01:23:13,430 --> 01:23:17,449 Sorry, times the cost of a suffix array access. 1441 01:23:17,449 --> 01:23:19,490 OK, so we're still paying the log to the epsilon, 1442 01:23:19,490 --> 01:23:21,120 because of the suffix array. 1443 01:23:21,120 --> 01:23:24,310 If that was constant, it would be free. 1444 01:23:24,310 --> 01:23:29,082 P plus b time is fine, if b is log log log log n. 1445 01:23:29,082 --> 01:23:30,290 Or you can make it log log n. 1446 01:23:30,290 --> 01:23:33,230 Then you save a log log n factor in the bits. 1447 01:23:33,230 --> 01:23:34,760 You pay an additive log log n. 1448 01:23:34,760 --> 01:23:37,301 That's going to be absorbed by the log to the epsilon anyway. 1449 01:23:37,301 --> 01:23:38,750 So it's pretty efficient. 1450 01:23:38,750 --> 01:23:40,625 I guess you can make this log to the epsilon, 1451 01:23:40,625 --> 01:23:43,280 if you felt like it, to balance out here. 1452 01:23:43,280 --> 01:23:45,740 Still would be P times log to the epsilon.
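Written out, the trade-off just described looks roughly like this. The symbol $t_{SA}$ for the cost of one suffix array access is notation introduced here for convenience, not something from the lecture, and the bounds are the ones stated above rather than a re-derivation.

```latex
\begin{align*}
\text{extra space for the sampled trie} &= O\!\left(\frac{n}{b}\right) \text{ bits}, \\
\text{search time} &= O\big((|P| + b)\cdot t_{SA}\big), \qquad t_{SA} = O(\log^{\varepsilon} n). \\
\intertext{Choosing $b = \log\log n$:}
\text{extra space} &= O\!\left(\frac{n}{\log\log n}\right) = o(n) \text{ bits}, \\
\text{search time} &= O\big((|P| + \log\log n)\,\log^{\varepsilon} n\big).
\end{align*}
```

Any super-constant $b$ keeps the extra space at $o(n)$ bits. A smaller $b$ such as $\log\log\log\log n$ shrinks the additive term in the time and saves less space.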
1453 01:23:45,740 --> 01:23:48,080 And so this stuff is really quite cheap, 1454 01:23:48,080 --> 01:23:50,360 see the notes for details. 1455 01:23:50,360 --> 01:23:54,150 That ends our succinct coverage. 1456 01:23:54,150 --> 01:23:57,700 Sorry, it was a little more succinct than intended. 1457 01:23:57,700 --> 01:23:59,320 Get the idea.