1 00:00:00,090 --> 00:00:02,490 The following content is provided under a Creative 2 00:00:02,490 --> 00:00:04,059 Commons license. 3 00:00:04,059 --> 00:00:06,330 Your support will help MIT OpenCourseWare 4 00:00:06,330 --> 00:00:10,720 continue to offer high quality educational resources for free. 5 00:00:10,720 --> 00:00:13,320 To make a donation or view additional materials 6 00:00:13,320 --> 00:00:17,280 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,280 --> 00:00:19,790 at ocw.mit.edu. 8 00:00:19,790 --> 00:00:20,790 ERIK DEMAINE: All right. 9 00:00:20,790 --> 00:00:24,270 Today's lecture's full of tries and trays, and trees. 10 00:00:24,270 --> 00:00:25,170 Oh, my. 11 00:00:25,170 --> 00:00:30,150 Lots of different synonyms all coming from trees. 12 00:00:30,150 --> 00:00:31,620 In particular, we're going to cover 13 00:00:31,620 --> 00:00:34,320 suffix trees today and various representations of them, 14 00:00:34,320 --> 00:00:36,387 and how to build them in linear time. 15 00:00:36,387 --> 00:00:37,470 Now, they are good things. 16 00:00:37,470 --> 00:00:39,840 Some of you may have seen suffix trees before, 17 00:00:39,840 --> 00:00:43,260 but hopefully, haven't actually seen most of the things 18 00:00:43,260 --> 00:00:46,360 we're going to cover, except for the very basics. 19 00:00:46,360 --> 00:00:50,580 So the general problem we're interested in solving today 20 00:00:50,580 --> 00:00:51,480 is string matching. 21 00:00:57,660 --> 00:01:02,610 And in string matching we are given two strings. 22 00:01:02,610 --> 00:01:07,290 One of them we call the text T and the other one 23 00:01:07,290 --> 00:01:18,630 we call a pattern P. These are both strings 24 00:01:18,630 --> 00:01:20,955 over some alphabet. 25 00:01:20,955 --> 00:01:25,950 And the alphabet we're going to always call capital Sigma. 26 00:01:25,950 --> 00:01:26,947 Think of that. 27 00:01:26,947 --> 00:01:27,780 It could be binary-- 28 00:01:27,780 --> 00:01:28,470 0 and 1. 
29 00:01:28,470 --> 00:01:31,950 Could be ASCII, so there's 256 characters in there. 30 00:01:31,950 --> 00:01:35,280 Could be Unicode-- pick your favorite alphabet. 31 00:01:35,280 --> 00:01:40,410 Then it could be ACGT for DNA. 32 00:01:40,410 --> 00:01:43,486 And the goal is to find the occurrences 33 00:01:43,486 --> 00:01:44,610 of the pattern in the text. 34 00:01:52,960 --> 00:01:55,170 Could be we want to find some of those occurrences 35 00:01:55,170 --> 00:01:56,860 or all of them, or count them. 36 00:02:09,870 --> 00:02:14,280 And in this lecture, we're only interested in substring 37 00:02:14,280 --> 00:02:14,930 searches. 38 00:02:14,930 --> 00:02:17,580 So the pattern is just a string. 39 00:02:17,580 --> 00:02:25,000 You want to know all the places where P occurs. 40 00:02:25,000 --> 00:02:29,910 P might appear multiple times, even overlapping itself-- 41 00:02:29,910 --> 00:02:31,740 in those two positions, whatever. 42 00:02:31,740 --> 00:02:34,446 You want to find all the shifts of P where it's identical to T. 43 00:02:34,446 --> 00:02:35,820 Now, there are lots of variations 44 00:02:35,820 --> 00:02:37,278 on this problem which we won't look 45 00:02:37,278 --> 00:02:41,700 at in this lecture, such as when the pattern has wildcards 46 00:02:41,700 --> 00:02:44,152 in it, or you could imagine it being a regular expression, 47 00:02:44,152 --> 00:02:45,610 or you don't want to match exactly, 48 00:02:45,610 --> 00:02:48,930 you want to match approximately, you could have some mismatches, 49 00:02:48,930 --> 00:02:52,170 or it could require some edits to match 50 00:02:52,170 --> 00:02:54,045 T. We're not going to look at those problems. 51 00:02:56,920 --> 00:02:59,460 This is both an algorithmic problem and a data structures 52 00:02:59,460 --> 00:03:00,780 problem. 53 00:03:00,780 --> 00:03:02,730 If I give you the text and the pattern, 54 00:03:02,730 --> 00:03:04,200 I just want to know the answer. 
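To make the problem statement concrete, here is a minimal sketch of the brute-force version of substring search: try every shift of P against T and keep the ones that match. The function name `find_occurrences` is made up for illustration, and this is the naive quadratic method, not one of the linear-time algorithms or the data structure the lecture goes on to build.

```python
def find_occurrences(text, pattern):
    """Return every shift s such that text[s:s+len(pattern)] == pattern.

    Brute force: O(|T| * |P|) time. Overlapping matches are all reported.
    """
    m = len(pattern)
    return [s for s in range(len(text) - m + 1) if text[s:s + m] == pattern]


# The two occurrences of "ana" in "banana" overlap each other.
print(find_occurrences("banana", "ana"))  # [1, 3]
```

Counting occurrences is then just the length of the returned list, and "find some occurrence" is its first element if there is one.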
55 00:03:04,200 --> 00:03:05,880 You can do that in linear time-- it's 56 00:03:05,880 --> 00:03:12,694 famous Knuth-Morris-Pratt, or Boyer-Moore, or Rabin-Karp. 57 00:03:12,694 --> 00:03:14,610 Lots of linear time algorithms for doing that. 58 00:03:14,610 --> 00:03:16,410 We're interested in the data structure 59 00:03:16,410 --> 00:03:21,070 version of the problem, static data structure. 60 00:03:21,070 --> 00:03:24,210 So we're given the text up front, 61 00:03:24,210 --> 00:03:27,660 given T. We want to preprocess T. 62 00:03:27,660 --> 00:03:34,110 And then the query consists of the pattern. 63 00:03:34,110 --> 00:03:37,980 Imagine T being very big, P being not so big. 64 00:03:37,980 --> 00:03:45,400 And we'd like to spend something like order 65 00:03:45,400 --> 00:03:47,820 P time to do a query. 66 00:03:51,222 --> 00:03:53,430 That would be ideal because you have to at least look 67 00:03:53,430 --> 00:03:55,170 at the query and you don't really 68 00:03:55,170 --> 00:03:57,240 want to spend time looking at the text. 69 00:03:57,240 --> 00:04:03,405 You'd also like something like order T space. 70 00:04:03,405 --> 00:04:05,280 We don't want the space of the data structure 71 00:04:05,280 --> 00:04:07,590 to be much bigger than the original text. 72 00:04:07,590 --> 00:04:10,727 So these are goals which we will more or less achieve, depending 73 00:04:10,727 --> 00:04:12,060 on exactly the problem you want. 74 00:04:12,060 --> 00:04:13,684 Sometimes we'll achieve this, sometimes 75 00:04:13,684 --> 00:04:16,019 we'll achieve almost this. 76 00:04:16,019 --> 00:04:21,492 But these are really nice running times and space. 77 00:04:21,492 --> 00:04:22,200 It's all optimal. 78 00:04:24,710 --> 00:04:26,660 Before we get to that problem, I want 79 00:04:26,660 --> 00:04:32,400 to solve a simpler problem which is necessary to solve this one. 80 00:04:32,400 --> 00:04:33,800 We'll call it a warm up. 
81 00:04:36,460 --> 00:04:41,860 And that's a good friend-- the predecessor problem, 82 00:04:41,860 --> 00:04:43,220 but now among strings. 83 00:04:48,060 --> 00:04:50,970 Let's say we have k strings-- 84 00:04:50,970 --> 00:04:54,410 k texts-- T1 to T k. 85 00:04:54,410 --> 00:04:57,200 And now the query is you're given some pattern P 86 00:04:57,200 --> 00:04:59,090 and you want to know where P fits 87 00:04:59,090 --> 00:05:02,070 among these strings in lexical order. 88 00:05:02,070 --> 00:05:04,160 So a regular predecessor, but now comparison 89 00:05:04,160 --> 00:05:06,080 is string comparison. 90 00:05:06,080 --> 00:05:08,964 Of course, you could try to solve that using our existing 91 00:05:08,964 --> 00:05:10,130 predecessor data structures. 92 00:05:10,130 --> 00:05:11,457 But they won't do very well. 93 00:05:11,457 --> 00:05:13,040 Even a binary search tree is not going 94 00:05:13,040 --> 00:05:15,350 to do well here because comparing two strings 95 00:05:15,350 --> 00:05:18,350 could take a very long time if those strings are long. 96 00:05:18,350 --> 00:05:19,980 So we don't want to do that. 97 00:05:19,980 --> 00:05:24,470 Instead, we're going to build a trie. 98 00:05:24,470 --> 00:05:28,010 Now, tries we've seen in fast sorting 99 00:05:28,010 --> 00:05:31,540 lecture, when w is at least log to the two plus epsilon of n. 100 00:05:34,110 --> 00:05:36,740 We used tries in a particular setting there. 101 00:05:36,740 --> 00:05:40,130 We're going to use them in their more native setting 102 00:05:40,130 --> 00:05:41,810 today a lot. 103 00:05:45,080 --> 00:05:51,710 In this setting-- again, a trie is a rooted tree. 104 00:05:51,710 --> 00:05:53,495 The children branches are labeled. 105 00:06:00,000 --> 00:06:01,710 And in this case, they're labeled 106 00:06:01,710 --> 00:06:04,440 with letters in the alphabet-- 107 00:06:04,440 --> 00:06:06,030 Sigma. 108 00:06:06,030 --> 00:06:07,830 So you have a node. 
109 00:06:07,830 --> 00:06:13,020 And let's say, we have the English alphabet-- a, b, 110 00:06:13,020 --> 00:06:14,586 up to z. 111 00:06:14,586 --> 00:06:16,560 Those are your 26 possible children. 112 00:06:16,560 --> 00:06:19,200 Some of them may not exist, they are null pointers. 113 00:06:19,200 --> 00:06:23,340 Others may point to actual nodes. 114 00:06:23,340 --> 00:06:25,560 That is a trie in its native setting, 115 00:06:25,560 --> 00:06:28,750 which is when the alphabet is something you care about. 116 00:06:28,750 --> 00:06:31,260 Now, when we used tries before, our alphabet 117 00:06:31,260 --> 00:06:33,240 just represented some digit in some kind 118 00:06:33,240 --> 00:06:35,900 of arbitrary representation. 119 00:06:35,900 --> 00:06:38,160 The digit was made up of log to the epsilon bits. 120 00:06:38,160 --> 00:06:39,839 We were just using it as a tool. 121 00:06:39,839 --> 00:06:41,380 But this is where tries actually come 122 00:06:41,380 --> 00:06:45,660 from-- they come from trying to retrieve strings out 123 00:06:45,660 --> 00:06:48,840 of some database, in this case. 124 00:06:48,840 --> 00:06:52,320 We're doing predecessor-- this is a practical problem. 125 00:06:52,320 --> 00:06:54,850 Like a lot of library search engines, 126 00:06:54,850 --> 00:06:56,850 you type in the beginning of the title of a book 127 00:06:56,850 --> 00:07:01,080 and you want to know what is the preceding and succeeding book 128 00:07:01,080 --> 00:07:03,690 title of what you query for. 129 00:07:03,690 --> 00:07:06,202 So this is something people care about. 130 00:07:06,202 --> 00:07:07,910 Although really they want us-- typically, 131 00:07:07,910 --> 00:07:10,440 we want to solve this problem because it's harder. 132 00:07:13,020 --> 00:07:15,390 So that's a trie. 133 00:07:15,390 --> 00:07:23,070 Now, to make this actually work, what we'd like to do 134 00:07:23,070 --> 00:07:25,410 is represent our strings. 
135 00:07:25,410 --> 00:07:29,510 So how do we use this structure to represent strings T1 to T k? 136 00:07:33,090 --> 00:07:35,960 We're going to represent those strings 137 00:07:35,960 --> 00:07:38,670 in the obvious way, which we've done many times 138 00:07:38,670 --> 00:07:41,970 in the past when we were doing integer data structures-- 139 00:07:41,970 --> 00:07:43,365 as root to leaf paths. 140 00:07:48,790 --> 00:07:51,370 Because any root to leaf path is just a sequence of letters, 141 00:07:51,370 --> 00:07:52,203 and that's a string. 142 00:07:52,203 --> 00:07:54,520 So we just throw them in there. 143 00:07:54,520 --> 00:07:59,360 Now, to do that, we need to change things a little bit. 144 00:07:59,360 --> 00:08:07,390 We're going to add a new letter, which we usually present as $ 145 00:08:07,390 --> 00:08:10,060 sign, to the end of every string. 146 00:08:21,550 --> 00:08:24,390 I have an example. 147 00:08:24,390 --> 00:08:33,039 We're going to do four strings-- 148 00:08:37,679 --> 00:08:41,590 various spellings of Anna and Ann. 149 00:08:41,590 --> 00:08:47,170 And say, we'd like to throw these into a trie. 150 00:08:47,170 --> 00:08:48,490 They all start with a. 151 00:08:48,490 --> 00:08:52,600 So at the root, there's going to be four branches corresponding 152 00:08:52,600 --> 00:08:55,940 to $ sign, a, e, and n. 153 00:08:55,940 --> 00:08:57,850 I'm supposing my alphabet is just a, 154 00:08:57,850 --> 00:09:01,150 e, n because that's all that appears here. 155 00:09:01,150 --> 00:09:04,570 But everything will be on the a branch. 156 00:09:04,570 --> 00:09:08,420 And then from there we're going to have-- 157 00:09:08,420 --> 00:09:12,710 let's see-- they all go to n next. 158 00:09:12,710 --> 00:09:16,780 So they all follow this branch. 159 00:09:16,780 --> 00:09:20,950 Then one of them goes to a. 160 00:09:20,950 --> 00:09:23,440 These all go to n afterwards. 161 00:09:23,440 --> 00:09:31,240 So we've got $ sign, a we use, e, n we use. 
162 00:09:31,240 --> 00:09:33,880 And on the a branch, we are done. 163 00:09:33,880 --> 00:09:38,680 This corresponds to a, n, a. 164 00:09:38,680 --> 00:09:39,310 We're finished. 165 00:09:39,310 --> 00:09:42,670 And we imagine there being $ sign at the end of this string. 166 00:09:42,670 --> 00:09:48,940 So we follow the $ sign child. 167 00:09:48,940 --> 00:09:50,920 The others are blank. 168 00:09:50,920 --> 00:09:55,870 And this leaf here corresponds to a, n, a. 169 00:09:55,870 --> 00:10:00,714 On the other hand, if we could do a, n, n, 170 00:10:00,714 --> 00:10:01,630 there's three options. 171 00:10:01,630 --> 00:10:03,000 We could be done. 172 00:10:03,000 --> 00:10:05,750 Or there could be an a or an e to follow. 173 00:10:05,750 --> 00:10:09,310 So if we're done, that would correspond to the $ sign 174 00:10:09,310 --> 00:10:10,660 pointer. 175 00:10:10,660 --> 00:10:14,530 That's going to be a leaf corresponding to this string 176 00:10:14,530 --> 00:10:16,840 here. 177 00:10:16,840 --> 00:10:22,465 Or it could be an a and then we're done. 178 00:10:26,380 --> 00:10:28,920 And then we have a leaf corresponding 179 00:10:28,920 --> 00:10:32,640 to Anna, a, n, n, a. 180 00:10:32,640 --> 00:10:39,230 Or could be we have an e next and then we're done. 181 00:10:44,210 --> 00:10:44,970 OK. 182 00:10:44,970 --> 00:10:46,770 Not very exciting but that is the trie 183 00:10:46,770 --> 00:10:52,260 representation of a, n, a; a, n, n; a, n, n, a; a, n, n, e. 184 00:10:52,260 --> 00:10:57,930 And you can see there is exactly one leaf per word down here. 185 00:10:57,930 --> 00:11:00,840 And furthermore, if you take in order traversal 186 00:11:00,840 --> 00:11:03,934 of those leaves, you get these strings in order. 187 00:11:03,934 --> 00:11:05,850 And typically, if you're going to store a data 188 00:11:05,850 --> 00:11:08,379 structure like this, you would store these actual pointers. 
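The example on the board can be sketched in a few lines. This is an illustrative version, not the lecture's own code: each node is a Python dict from letter to child (so it is the hash-table style of node), `build_trie` and `leaves_in_order` are made-up names, and the `$` terminator is assumed to sort before every real letter, which holds for lowercase ASCII.

```python
END = "$"  # terminator appended to every string; sorts before 'a'..'z'


def build_trie(words):
    """Insert each word, followed by END, as a root-to-leaf path."""
    root = {}
    for w in words:
        node = root
        for ch in w + END:
            node = node.setdefault(ch, {})
    return root


def leaves_in_order(node, prefix=""):
    """Visit children in sorted letter order; each END edge is one stored word."""
    for ch in sorted(node):
        if ch == END:
            yield prefix
        else:
            yield from leaves_in_order(node[ch], prefix + ch)


trie = build_trie(["anne", "ana", "anna", "ann"])
print(list(leaves_in_order(trie)))  # ['ana', 'ann', 'anna', 'anne']
```

The words go in unsorted and come out in lexical order, one leaf per word, which is the in-order traversal property described above.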
189 00:11:08,379 --> 00:11:10,670 So once you get to a leaf, you know which word matched. 190 00:11:13,760 --> 00:11:15,600 So that's a trie. 191 00:11:15,600 --> 00:11:18,885 Seems pretty trivial. 192 00:11:18,885 --> 00:11:19,385 Trievial? 193 00:11:23,860 --> 00:11:26,050 But it turns out there's something already 194 00:11:26,050 --> 00:11:30,580 pretty interesting about this data structure. 195 00:11:30,580 --> 00:11:32,820 How do you do a predecessor search? 196 00:11:32,820 --> 00:11:37,600 If I'm searching for something like, I don't know, a, n, e-- 197 00:11:37,600 --> 00:11:39,850 because I made a typo-- 198 00:11:39,850 --> 00:11:44,410 then I follow a, n, and then I follow this e branch here 199 00:11:44,410 --> 00:11:45,490 and discover-- whoops-- 200 00:11:45,490 --> 00:11:46,960 there's nothing here. 201 00:11:46,960 --> 00:11:49,590 But right at that node I see, OK, well, my predecessor is 202 00:11:49,590 --> 00:11:51,610 going to be the max in this subtree, which 203 00:11:51,610 --> 00:11:53,500 happens to be a, n, a. 204 00:11:53,500 --> 00:11:56,230 My successor is going to be the min in this subtree, which 205 00:11:56,230 --> 00:11:57,790 happens to be a, n, n. 206 00:11:57,790 --> 00:11:59,170 And so I find what I want. 207 00:11:59,170 --> 00:12:01,010 How long does it take me to do that? 208 00:12:01,010 --> 00:12:03,970 Well, if I store subtree mins and maxs, 209 00:12:03,970 --> 00:12:05,890 then I just have to walk down the tree. 210 00:12:05,890 --> 00:12:09,034 That will take order P time to walk down. 211 00:12:09,034 --> 00:12:10,450 And then, once I'm at a node, I've 212 00:12:10,450 --> 00:12:14,740 got to do a predecessor or successor in the node. 213 00:12:14,740 --> 00:12:15,740 So there are two issues. 214 00:12:15,740 --> 00:12:19,030 One is, given a node, how do you know which way to walk down? 215 00:12:19,030 --> 00:12:21,130 And then, when you're done, how do you 216 00:12:21,130 --> 00:12:22,270 do predecessor in a node? 
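The predecessor walk can be sketched as follows, under some loudly stated assumptions: nodes are plain dicts, the names `predecessor` and `subtree_max` are hypothetical, and "predecessor" here means the largest stored word that is less than or equal to the query. The sketch also cheats on exactly the two issues just raised: it scans a node's children linearly to find the next smaller letter, and it walks down to find a subtree max instead of storing it, so it does not achieve the order P bound.

```python
END = "$"  # terminator; assumed to sort before every real letter


def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w + END:
            node = node.setdefault(ch, {})
    return root


def subtree_max(node, prefix):
    """Largest word below `node`: keep following the largest child letter."""
    while True:
        ch = max(node)
        if ch == END:  # END is the max only when it is the only child
            return prefix
        node, prefix = node[ch], prefix + ch


def predecessor(root, query):
    """Largest stored word <= query, or None if every word is larger."""
    node, prefix = root, ""
    best = None  # deepest (node, smaller_letter, prefix) seen on the way down
    for ch in query + END:
        smaller = [c for c in node if c < ch]  # linear scan of the node
        if smaller:
            best = (node, max(smaller), prefix)
        if ch not in node:
            break  # fell off the trie
        if ch == END:
            return query  # exact match
        node, prefix = node[ch], prefix + ch
    if best is None:
        return None
    node, c, prefix = best
    return prefix if c == END else subtree_max(node[c], prefix + c)


trie = build_trie(["ana", "ann", "anna", "anne"])
print(predecessor(trie, "ane"))  # ana
```

Successor is symmetric: remember the deepest larger sibling and take the min of its subtree.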
217 00:12:22,270 --> 00:12:23,630 It's the fundamental question. 218 00:12:23,630 --> 00:12:25,505 Now, this is something we spent a lot of time 219 00:12:25,505 --> 00:12:26,950 doing in, say, fusion trees. 220 00:12:26,950 --> 00:12:28,160 That was the big challenge. 221 00:12:28,160 --> 00:12:30,310 So this is not really so trivial-- 222 00:12:30,310 --> 00:12:32,560 how do I represent a node? 223 00:12:32,560 --> 00:12:35,470 One way to make it trivial is to assume that the alphabet is 224 00:12:35,470 --> 00:12:37,145 constant size, like two. 225 00:12:37,145 --> 00:12:38,770 Then, of course, there's nothing to do. 226 00:12:38,770 --> 00:12:40,370 It's a binary trie. 227 00:12:40,370 --> 00:12:41,890 You look at 0, you look at 1. 228 00:12:41,890 --> 00:12:44,560 You can figure out anything you need to do 229 00:12:44,560 --> 00:12:45,780 if the alphabet is constant. 230 00:12:45,780 --> 00:12:47,530 But things get interesting if you imagine, 231 00:12:47,530 --> 00:12:49,150 well, the alphabet is some parameter, 232 00:12:49,150 --> 00:12:50,600 we don't know how big it is. 233 00:12:50,600 --> 00:12:51,950 It might be substantial. 234 00:12:51,950 --> 00:12:55,460 So let's think about how you might represent 235 00:12:55,460 --> 00:12:58,390 a trie or the node of a trie. 236 00:13:03,790 --> 00:13:08,320 Let's call this trie node representation. 237 00:13:11,620 --> 00:13:13,040 Any suggestions? 238 00:13:13,040 --> 00:13:15,550 What are the obvious ways to represent the node of a trie? 239 00:13:15,550 --> 00:13:18,156 Nothing fancy. 240 00:13:18,156 --> 00:13:19,780 I have three obvious answers, at least. 241 00:13:19,780 --> 00:13:20,547 AUDIENCE: Array. 242 00:13:20,547 --> 00:13:21,380 ERIK DEMAINE: Array. 243 00:13:21,380 --> 00:13:21,570 Good. 244 00:13:21,570 --> 00:13:22,570 That was my number one. 245 00:13:25,090 --> 00:13:26,740 Any more? 246 00:13:26,740 --> 00:13:28,365 That's I think the most obvious. 
247 00:13:28,365 --> 00:13:28,990 AUDIENCE: Tree. 248 00:13:28,990 --> 00:13:29,915 ERIK DEMAINE: Tree. 249 00:13:29,915 --> 00:13:30,415 Good. 250 00:13:30,415 --> 00:13:32,350 Do a binary search tree. 251 00:13:32,350 --> 00:13:32,925 Or? 252 00:13:32,925 --> 00:13:33,800 AUDIENCE: Hash table. 253 00:13:33,800 --> 00:13:34,360 ERIK DEMAINE: Hash table. 254 00:13:34,360 --> 00:13:34,860 Good. 255 00:13:39,550 --> 00:13:44,950 So for each of them we have query time and space. 256 00:13:48,590 --> 00:13:51,420 If I use an array, meaning I have-- 257 00:13:51,420 --> 00:13:53,320 let's say, for a through z-- 258 00:13:53,320 --> 00:13:59,710 I have a pointer that either is null or points to the child. 259 00:13:59,710 --> 00:14:02,170 This is going to be really fast because you're at a node. 260 00:14:02,170 --> 00:14:05,350 If I want to know, I just look at that i-th letter 261 00:14:05,350 --> 00:14:08,560 in my pattern P. I say, oh, it's a j. 262 00:14:08,560 --> 00:14:11,577 So I look at the j position and I follow it. 263 00:14:11,577 --> 00:14:13,910 You might wonder, how do I do predecessor and successor? 264 00:14:13,910 --> 00:14:15,493 Well, this is a static data structure. 265 00:14:15,493 --> 00:14:17,470 So for every cell, if it's null, I 266 00:14:17,470 --> 00:14:21,460 can store the predecessor and successor in the node. 267 00:14:21,460 --> 00:14:23,910 With no more space. 268 00:14:23,910 --> 00:14:28,270 This is Sigma space per node. 269 00:14:28,270 --> 00:14:34,570 So the amount of space is T Sigma, which is not so great. 270 00:14:34,570 --> 00:14:37,388 But the query is fast, query is order P time. 271 00:14:40,230 --> 00:14:40,850 BST. 272 00:14:40,850 --> 00:14:43,110 The idea of the BST is instead of having 273 00:14:43,110 --> 00:14:47,270 a node that has some pointers, some of which may be absent, 274 00:14:47,270 --> 00:14:54,050 let's expand it out into something like this. 275 00:14:54,050 --> 00:14:55,610 Actually, I'll use colors. 
276 00:14:55,610 --> 00:14:58,820 This will make life a little bit cleaner in a moment. 277 00:14:58,820 --> 00:15:01,830 Because we are going to modify this approach. 278 00:15:01,830 --> 00:15:06,470 So let's say that the pointers you care about are red. 279 00:15:06,470 --> 00:15:08,670 Those are the actual letter pointers you want to do. 280 00:15:08,670 --> 00:15:11,360 So the idea is to expand out this high degree node 281 00:15:11,360 --> 00:15:12,950 into binary nodes. 282 00:15:12,950 --> 00:15:16,150 You put appropriate keys in here so you can do a binary search. 283 00:15:16,150 --> 00:15:20,120 And then, eventually, you get down to where you need to go. 284 00:15:20,120 --> 00:15:23,960 This structure has height log Sigma. 285 00:15:23,960 --> 00:15:28,110 So the query time is going to be P log Sigma. 286 00:15:28,110 --> 00:15:31,310 So that goes up a little bit, not perfect. 287 00:15:31,310 --> 00:15:33,300 But the space now becomes linear, 288 00:15:33,300 --> 00:15:35,210 so that's an improvement. 289 00:15:35,210 --> 00:15:38,630 Ideally, we'd like the best of both of these-- 290 00:15:38,630 --> 00:15:42,740 optimal query time, optimal space, linear space. 291 00:15:42,740 --> 00:15:45,050 And hash tables achieve that. 292 00:15:45,050 --> 00:15:51,530 They give you order P query and order T space. 293 00:15:51,530 --> 00:15:53,840 Again, the issue is some of these cells are absent so 294 00:15:53,840 --> 00:15:54,740 don't use an array. 295 00:15:54,740 --> 00:15:56,840 That's like a direct mapped hash table. 296 00:15:56,840 --> 00:15:58,700 Use a hash table, use hashing. 297 00:15:58,700 --> 00:16:01,880 That way you can use linear space per node, however many 298 00:16:01,880 --> 00:16:04,380 occupied children there are. 299 00:16:04,380 --> 00:16:06,020 What is T here, by the way? 300 00:16:06,020 --> 00:16:11,120 T is the sum of the lengths of the T i's-- 301 00:16:11,120 --> 00:16:13,220 because here we're storing multiple T i's. 
302 00:16:13,220 --> 00:16:19,700 Or it's the number of nodes in the tree, which, 303 00:16:19,700 --> 00:16:22,160 if your strings happen to have a lot of common prefixes, 304 00:16:22,160 --> 00:16:24,159 the number of nodes in the trie could be smaller 305 00:16:24,159 --> 00:16:28,980 than that, but not in general. 306 00:16:28,980 --> 00:16:30,841 What's the problem with the hash table? 307 00:16:30,841 --> 00:16:31,340 Question. 308 00:16:31,340 --> 00:16:38,330 AUDIENCE: [INAUDIBLE] 309 00:16:38,330 --> 00:16:39,080 ERIK DEMAINE: Yes. 310 00:16:39,080 --> 00:16:41,570 For the BST, we need to store some keys in this node. 311 00:16:41,570 --> 00:16:42,992 That lets you do a binary search. 312 00:16:42,992 --> 00:16:44,450 For example, every node could store 313 00:16:44,450 --> 00:16:46,220 the max in the left subtree-- 314 00:16:46,220 --> 00:16:48,353 just within this little tree, though. 315 00:16:48,353 --> 00:16:53,090 AUDIENCE: [INAUDIBLE] 316 00:16:53,090 --> 00:16:54,980 ERIK DEMAINE: It is order T. Sorry, 317 00:16:54,980 --> 00:16:57,040 I see-- why is it not O T Sigma space? 318 00:16:57,040 --> 00:16:58,580 You're only storing one letter here, 319 00:16:58,580 --> 00:17:02,130 so that fits in a single word, and two pointers. 320 00:17:02,130 --> 00:17:04,099 So every node only takes constant space. 321 00:17:04,099 --> 00:17:08,730 It's only T space, not T Sigma. 322 00:17:08,730 --> 00:17:10,940 Other questions? 323 00:17:10,940 --> 00:17:11,450 Or answers? 324 00:17:11,450 --> 00:17:13,790 There's a problem with hashing-- 325 00:17:13,790 --> 00:17:18,260 doesn't actually solve the problem we want to solve. 326 00:17:18,260 --> 00:17:20,520 It doesn't solve predecessor. 327 00:17:20,520 --> 00:17:22,940 Because hashing mixes up the order of the nodes. 
328 00:17:22,940 --> 00:17:25,010 This is the problem we had with-- 329 00:17:25,010 --> 00:17:29,460 what's it called-- signature sort, which hashed, 330 00:17:29,460 --> 00:17:32,790 it messed up, it permuted all the things in the nodes 331 00:17:32,790 --> 00:17:34,950 and so you didn't know-- 332 00:17:34,950 --> 00:17:37,410 I mean, in a hash table, you can't solve predecessor. 333 00:17:37,410 --> 00:17:40,289 That's what the predecessor problem is for. 334 00:17:40,289 --> 00:17:42,330 I guess you could try to throw a predecessor data 335 00:17:42,330 --> 00:17:43,500 structure in here. 336 00:17:43,500 --> 00:17:46,960 Actually, I hadn't thought of that before. 337 00:17:46,960 --> 00:17:49,680 So we could use y-fast tries or something. 338 00:17:49,680 --> 00:17:55,650 And we would get order T space and-- 339 00:17:55,650 --> 00:17:58,510 I guess, with high probability, this 340 00:17:58,510 --> 00:18:00,990 is also with high probability-- 341 00:18:00,990 --> 00:18:06,180 we get order P log log Sigma, I guess. 342 00:18:06,180 --> 00:18:09,270 Because I use Van Emde Boas. 343 00:18:09,270 --> 00:18:15,840 I'm going to have to call it 3.5, Van Emde Boas. 344 00:18:15,840 --> 00:18:17,280 So that would be another approach. 345 00:18:17,280 --> 00:18:22,409 So hashing will not do a predecessor. 346 00:18:22,409 --> 00:18:24,950 We'll do exact search, which is still an interesting problem. 347 00:18:24,950 --> 00:18:26,783 Might give you some strings I want to know-- 348 00:18:26,783 --> 00:18:29,640 is this string in your set? 349 00:18:29,640 --> 00:18:32,734 But it won't solve the predecessor problem. 350 00:18:32,734 --> 00:18:34,650 So this is an interesting solution-- hashing-- 351 00:18:34,650 --> 00:18:36,070 but not quite what we want. 352 00:18:36,070 --> 00:18:38,361 And Van Emde Boas doesn't quite do what we want either. 
353 00:18:38,361 --> 00:18:40,620 It improves over the BST approach 354 00:18:40,620 --> 00:18:42,900 but we get another log in there. 355 00:18:42,900 --> 00:18:46,080 But it's still not order P. I kind of like order P. 356 00:18:46,080 --> 00:18:49,360 Or at least, instead of order P times log log Sigma, 357 00:18:49,360 --> 00:18:54,030 I kind of like order P plus log Sigma. 358 00:18:54,030 --> 00:18:56,470 And order P plus log Sigma is known. 359 00:18:56,470 --> 00:19:01,010 So that's what I want to tell you about. 360 00:19:01,010 --> 00:19:03,710 And this is normally done with a structure 361 00:19:03,710 --> 00:19:09,740 called trays, which is a portmanteau, I guess, 362 00:19:09,740 --> 00:19:12,410 of tree and array. 363 00:19:12,410 --> 00:19:15,140 Somewhere in there there's a tree and an array, 364 00:19:15,140 --> 00:19:18,630 so it's a bit of an awkward word. 365 00:19:18,630 --> 00:19:23,690 Those were developed by Kopelowitz and Lewenstein, 366 00:19:23,690 --> 00:19:28,460 in 2006, a fairly recent innovation. 367 00:19:28,460 --> 00:19:31,240 I'll have this number 6-- 368 00:19:31,240 --> 00:19:41,900 trays, achieve order P plus log Sigma and order T space. 369 00:19:41,900 --> 00:19:42,870 So this is pretty good. 370 00:19:42,870 --> 00:19:45,680 And they will do predecessor and successor-- definitely 371 00:19:45,680 --> 00:19:47,037 an improvement over the BST. 372 00:19:50,690 --> 00:19:55,460 It's open whether you could do order P plus log log Sigma. 373 00:19:55,460 --> 00:19:58,530 This is as far as I can tell, no one has worked on this. 374 00:19:58,530 --> 00:20:03,920 Maybe we will work on it today. 375 00:20:03,920 --> 00:20:05,360 So something to think about-- 376 00:20:05,360 --> 00:20:06,901 whether you could get the best of all 377 00:20:06,901 --> 00:20:09,174 of these worlds for predecessor. 378 00:20:09,174 --> 00:20:11,590 There's a lower bound-- you need to spend at least log log 379 00:20:11,590 --> 00:20:12,170 Sigma time. 
380 00:20:12,170 --> 00:20:14,319 Because even if your trie is a single node, 381 00:20:14,319 --> 00:20:15,860 you have the predecessor lower bound. 382 00:20:15,860 --> 00:20:18,470 And we know log log universe size is the best you 383 00:20:18,470 --> 00:20:19,860 can do in this regime. 384 00:20:23,320 --> 00:20:26,560 So that's where we're going. 385 00:20:26,560 --> 00:20:28,270 Instead of describing trays, though, I'm 386 00:20:28,270 --> 00:20:29,645 going to describe a new way to do 387 00:20:29,645 --> 00:20:32,200 it, which has never been seen before 388 00:20:32,200 --> 00:20:33,452 in any class or anywhere. 389 00:20:33,452 --> 00:20:34,410 Because it's brand new. 390 00:20:34,410 --> 00:20:37,720 It's developed by Martin Farach-Colton, who 391 00:20:37,720 --> 00:20:41,290 did the LCA and the level ancestor structures 392 00:20:41,290 --> 00:20:43,480 that we saw in last class. 393 00:20:43,480 --> 00:20:46,160 And he just told it to me and it's really cool 394 00:20:46,160 --> 00:20:47,590 so we're going to cover it. 395 00:20:51,640 --> 00:20:54,755 A simpler way to get this same bound of trays. 396 00:20:58,990 --> 00:21:00,670 And the first thing we're going to do 397 00:21:00,670 --> 00:21:02,270 is use a weight balanced BST. 398 00:21:07,040 --> 00:21:16,670 This will achieve P plus log k query and linear space. 399 00:21:20,650 --> 00:21:23,210 k, remember, is the number of strings 400 00:21:23,210 --> 00:21:26,670 that we're storing, so it's the number of leaves in the trie. 401 00:21:26,670 --> 00:21:28,789 So it's not quite as good as P plus log Sigma 402 00:21:28,789 --> 00:21:30,080 but it's going to be a warm up. 403 00:21:30,080 --> 00:21:32,621 We're going to do this and then we're going to improve it. 404 00:21:34,550 --> 00:21:39,470 Remember weight balanced trees, we talked about them way back 405 00:21:39,470 --> 00:21:41,550 in lecture 3, I believe. 406 00:21:41,550 --> 00:21:44,690 There is an issue of what is the weight. 
407 00:21:44,690 --> 00:21:46,850 And typically, you say, the weight of a subtree 408 00:21:46,850 --> 00:21:48,670 is the number of nodes in the subtree. 409 00:21:48,670 --> 00:21:50,420 I'm going to change that slightly and say, 410 00:21:50,420 --> 00:21:53,360 the weight of a subtree is the number of descendant 411 00:21:53,360 --> 00:21:59,300 leaves in the subtree, not the number of nodes, 412 00:21:59,300 --> 00:22:01,890 because it's log k. 413 00:22:01,890 --> 00:22:04,460 We really care about the number of leaves down there. 414 00:22:04,460 --> 00:22:05,960 There could be long paths here which 415 00:22:05,960 --> 00:22:08,690 we are not so excited about. 416 00:22:08,690 --> 00:22:11,150 We really care about how many leaves are down there. 417 00:22:11,150 --> 00:22:14,881 Like the weight of this node here is three-- 418 00:22:14,881 --> 00:22:16,130 there's three leaves below it. 419 00:22:21,410 --> 00:22:23,800 You may recall weight balanced BSTs 420 00:22:23,800 --> 00:22:25,910 trying to make the weight of the left subtree 421 00:22:25,910 --> 00:22:28,860 within a constant factor of the weight of the right subtree. 422 00:22:28,860 --> 00:22:31,130 Because we're static, we can be even simpler 423 00:22:31,130 --> 00:22:33,335 and say, find the optimal partition. 424 00:22:37,190 --> 00:22:40,430 So we're thinking about this approach-- 425 00:22:40,430 --> 00:22:44,570 idea of expanding a large degree node into some binary tree. 426 00:22:44,570 --> 00:22:46,610 We have a choice of what binary tree to use. 427 00:22:46,610 --> 00:22:48,694 With three nodes it may be not many choices-- that 428 00:22:48,694 --> 00:22:50,693 could be this or it could be a straight this way 429 00:22:50,693 --> 00:22:51,957 or a straight line that way. 430 00:22:51,957 --> 00:22:52,790 Those are different. 
431 00:22:52,790 --> 00:22:55,167 And if one of these guys is really heavy, 432 00:22:55,167 --> 00:22:56,750 one of these children is really heavy, 433 00:22:56,750 --> 00:22:58,675 you want to put it closer to the root. 434 00:22:58,675 --> 00:23:00,050 So that's what we're going to do. 435 00:23:04,060 --> 00:23:06,720 Let me draw it this way. 436 00:23:06,720 --> 00:23:09,270 That's kind of an array. 437 00:23:09,270 --> 00:23:13,490 But what this array represents is for a node-- 438 00:23:13,490 --> 00:23:16,040 so here's my node, it has lots of children. 439 00:23:16,040 --> 00:23:18,800 Some of these are heavy, some of them 440 00:23:18,800 --> 00:23:21,567 are light, lighter than others. 441 00:23:21,567 --> 00:23:23,150 We don't know how they're distributed. 442 00:23:23,150 --> 00:23:27,374 But they're ordered, we have to preserve the order. 443 00:23:27,374 --> 00:23:28,790 What this is supposed to represent 444 00:23:28,790 --> 00:23:31,890 is the total number of leaves in this subtree. 445 00:23:31,890 --> 00:23:36,050 So the total number of leaves here. 446 00:23:36,050 --> 00:23:40,490 And then I'm going to partition this rectangle into groups 447 00:23:40,490 --> 00:23:43,430 corresponding to these sizes. 448 00:23:43,430 --> 00:23:48,230 So these are small, medium, small, little less than medium, 449 00:23:48,230 --> 00:23:51,290 big, and then small. 450 00:23:51,290 --> 00:23:52,550 Something like that. 451 00:23:52,550 --> 00:23:54,950 So these horizontal lengths correspond 452 00:23:54,950 --> 00:23:57,260 to the number of leaves in these things, 453 00:23:57,260 --> 00:24:00,224 correspond to the weight of my children. 454 00:24:00,224 --> 00:24:01,640 So I look at that and I say, well, 455 00:24:01,640 --> 00:24:04,160 what I'd really like to do is split this 456 00:24:04,160 --> 00:24:06,690 in the middle, which is, maybe, here. 457 00:24:06,690 --> 00:24:08,730 I say, OK, well, then I'll split here. 
458 00:24:08,730 --> 00:24:11,090 That's pretty close to the middle. 459 00:24:11,090 --> 00:24:13,547 So my left subtree will consist of these guys, 460 00:24:13,547 --> 00:24:15,380 my right subtree will consist of these guys. 461 00:24:15,380 --> 00:24:16,760 And then I recurse-- 462 00:24:16,760 --> 00:24:19,490 over here I've split at the middle, 463 00:24:19,490 --> 00:24:21,440 I find the thing that's closest to the middle. 464 00:24:21,440 --> 00:24:22,898 Over here I've split at the middle, 465 00:24:22,898 --> 00:24:26,724 I find the thing that's closest to the middle. 466 00:24:26,724 --> 00:24:27,890 It's pretty much determined. 467 00:24:27,890 --> 00:24:30,642 So my root node corresponds to this one. 468 00:24:30,642 --> 00:24:31,850 It's going to partition here. 469 00:24:31,850 --> 00:24:33,935 So over on the right, there's going to be-- 470 00:24:37,250 --> 00:24:39,470 here's going to be the big tree and then here 471 00:24:39,470 --> 00:24:40,379 is the small tree. 472 00:24:40,379 --> 00:24:42,170 So this small tree corresponds to this one. 473 00:24:42,170 --> 00:24:44,270 This big tree corresponds to this interval. 474 00:24:44,270 --> 00:24:47,390 Then on the left we've got four things we need to store. 475 00:24:47,390 --> 00:24:51,970 So these are the red pointers that we had before. 476 00:24:51,970 --> 00:24:54,960 Then over on the left, we're going to have a partition. 477 00:24:54,960 --> 00:24:57,250 And then there's going to be two guys here. 478 00:24:57,250 --> 00:24:59,810 It doesn't really matter how we store them. 479 00:24:59,810 --> 00:25:03,050 It's something like this. 480 00:25:03,050 --> 00:25:10,214 There is medium and small. 481 00:25:10,214 --> 00:25:12,255 And then over on the left, we also have two guys. 482 00:25:12,255 --> 00:25:15,740 So it's going to be, again, something like this. 483 00:25:20,440 --> 00:25:23,370 You got medium and small. 484 00:25:23,370 --> 00:25:25,880 So you see how that worked. 
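
The split-closest-to-the-middle construction just described can be sketched in a few lines of Python. This is only an illustration, not the lecture's own code: the function name `build_weight_balanced`, the nested-tuple representation, and the example weights are all assumptions chosen to mirror the board picture (small, medium, small, a little less than medium, big, small). It assumes a non-empty list of child weights.

```python
from itertools import accumulate


def build_weight_balanced(weights):
    """Build a binary tree over ordered children, given their leaf counts.

    A leaf of the result is a child's index; an internal node is a
    (left, right) tuple. Each split is placed at the child boundary
    closest to the middle of the total weight, so child order is kept.
    """
    # prefix[i] = total weight of children 0..i-1
    prefix = [0] + list(accumulate(weights))

    def build(lo, hi):
        if hi - lo == 1:
            return lo  # a single child: nothing left to split
        middle = (prefix[lo] + prefix[hi]) / 2
        # Only boundaries between children are legal cut points.
        cut = min(range(lo + 1, hi), key=lambda c: abs(prefix[c] - middle))
        return (build(lo, cut), build(cut, hi))

    return build(0, len(weights))


# Hypothetical weights: small, medium, small, less than medium, big, small.
tree = build_weight_balanced([1, 3, 1, 2, 6, 1])
print(tree)  # (((0, 1), (2, 3)), (4, 5))
```

With these made-up weights the heavy child (index 4) ends up as a grandchild of the root, which is the behavior the picture is meant to show. The linear scan for the cut keeps the sketch short; a binary search over the prefix sums would also work.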
485 00:25:25,880 --> 00:25:29,270 Our main goal was to make this big guy as close to the root 486 00:25:29,270 --> 00:25:30,141 as possible. 487 00:25:30,141 --> 00:25:32,390 It was the biggest and that's basically what happened. 488 00:25:32,390 --> 00:25:34,206 This one is really big. 489 00:25:34,206 --> 00:25:36,330 And we couldn't quite put it as a child of the root 490 00:25:36,330 --> 00:25:37,790 because it appeared in the middle, 491 00:25:37,790 --> 00:25:41,480 but we could put it as a grandchild of the root. 492 00:25:41,480 --> 00:25:43,400 In general, if you have a super heavy child, 493 00:25:43,400 --> 00:25:47,150 it will always become a child or grandchild of the root. 494 00:25:47,150 --> 00:25:50,149 So in a constant number of traversals you'll get there. 495 00:25:50,149 --> 00:25:52,190 Now again, you fill in these nodes with some keys 496 00:25:52,190 --> 00:25:53,780 so you can do a binary search. 497 00:25:53,780 --> 00:25:57,950 But now the binary search might go faster 498 00:25:57,950 --> 00:26:01,490 than log Sigma, which is what we had before. 499 00:26:01,490 --> 00:26:04,340 And indeed, you can prove that this really works well. 500 00:26:12,410 --> 00:26:13,610 So what's the claim? 501 00:26:17,030 --> 00:26:32,950 Claim is every two edges you follow either 502 00:26:32,950 --> 00:26:35,654 advance one letter in P-- 503 00:26:38,440 --> 00:26:41,177 these are the red edges that we want to follow. 504 00:26:41,177 --> 00:26:42,760 So if we follow a red edge, then we've 505 00:26:42,760 --> 00:26:45,020 made progress to the next node. 506 00:26:45,020 --> 00:26:48,730 So this would be following a red edge. 507 00:26:48,730 --> 00:27:04,270 Or we reduce the number of candidate T_i's by 2/3 508 00:27:04,270 --> 00:27:08,110 or, I guess, to 2/3 of its original value. 509 00:27:08,110 --> 00:27:10,210 So we lose a third of the strings. 510 00:27:10,210 --> 00:27:12,100 That's what I'd like to claim.
511 00:27:12,100 --> 00:27:14,450 And it's not too hard to see this. 512 00:27:14,450 --> 00:27:17,685 You have to imagine all of these possible partitions. 513 00:27:17,685 --> 00:27:19,720 It's a little bit awkward. 514 00:27:19,720 --> 00:27:20,890 The idea is the following. 515 00:27:20,890 --> 00:27:23,170 If you take one of these arrays-- 516 00:27:23,170 --> 00:27:26,630 this view of all the leaves just laid out on the line-- 517 00:27:26,630 --> 00:27:30,800 you say, well, I'd like to split in half and half. 518 00:27:30,800 --> 00:27:33,490 But that will never happen unless I'm really lucky. 519 00:27:33,490 --> 00:27:37,540 So let's think about this one third splitting. 520 00:27:37,540 --> 00:27:41,830 If I were able to cut anywhere in here, then in one step, 521 00:27:41,830 --> 00:27:45,430 actually, I would achieve this 2/3 reduction. 522 00:27:45,430 --> 00:27:46,690 I'd lose a third of the nodes. 523 00:27:50,530 --> 00:27:56,170 If I end up cutting here, for example, then either I 524 00:27:56,170 --> 00:27:58,420 go to the left and I lost almost 2/3 of the nodes, 525 00:27:58,420 --> 00:27:59,878 or I go to the right and I at least 526 00:27:59,878 --> 00:28:02,410 lost this one third of the nodes-- or one third of the leaves, 527 00:28:02,410 --> 00:28:04,300 I should say. 528 00:28:04,300 --> 00:28:06,250 So that would be a good situation 529 00:28:06,250 --> 00:28:07,610 if I got some cut in here. 530 00:28:07,610 --> 00:28:10,570 But it might be there is no possible cut I can make in here 531 00:28:10,570 --> 00:28:16,879 because there's a giant child in here that has more than one 532 00:28:16,879 --> 00:28:17,670 third of the nodes. 533 00:28:17,670 --> 00:28:20,710 It would have to span all the way across here. 534 00:28:20,710 --> 00:28:22,630 So I can't make any cuts, I can only 535 00:28:22,630 --> 00:28:25,060 cut at child boundaries.
536 00:28:25,060 --> 00:28:29,880 In that situation, you make this-- 537 00:28:29,880 --> 00:28:34,180 well, this is when I need to follow two edges, not one. 538 00:28:34,180 --> 00:28:36,550 When there's a super big child like that, as we said, 539 00:28:36,550 --> 00:28:39,280 it will become a grandchild of the root. 540 00:28:39,280 --> 00:28:40,720 So it will be-- 541 00:28:40,720 --> 00:28:45,070 there's the root and then here is the giant tree. 542 00:28:45,070 --> 00:28:50,270 And then there's going to be the other stuff here and here. 543 00:28:50,270 --> 00:28:53,890 So after I go down either one step or two steps, 544 00:28:53,890 --> 00:28:57,390 I will either get here-- 545 00:28:57,390 --> 00:29:03,020 sorry, more red chalk, this was a red pointer. 546 00:29:03,020 --> 00:29:04,840 Now, this is going to a child. 547 00:29:04,840 --> 00:29:07,090 So if I went there, I'm happy in two steps. 548 00:29:07,090 --> 00:29:12,260 I advance one letter in P. Or in two steps, I went here 549 00:29:12,260 --> 00:29:13,060 or I went here. 550 00:29:13,060 --> 00:29:14,780 And this was a huge amount of the nodes, 551 00:29:14,780 --> 00:29:16,363 this is at least a third of the nodes. 552 00:29:16,363 --> 00:29:18,220 Again, if I end up here or end up here, 553 00:29:18,220 --> 00:29:20,990 I lost 2/3 of the candidate leaves. 554 00:29:20,990 --> 00:29:23,740 I mean, I lost one third of the candidate leaves, 555 00:29:23,740 --> 00:29:25,570 leaving 2/3 of them. 556 00:29:28,840 --> 00:29:33,340 If this happens, I charge to this order P term. 557 00:29:33,340 --> 00:29:34,810 And if the other situation happens, 558 00:29:34,810 --> 00:29:37,360 I charge to the log k term-- because I can only reduce k 559 00:29:37,360 --> 00:29:41,260 by a factor of 2/3-- 560 00:29:41,260 --> 00:29:44,110 order log k times. 561 00:29:44,110 --> 00:29:50,720 This implies order P plus log k search. 562 00:29:50,720 --> 00:29:52,440 So a very simple idea.
563 00:29:52,440 --> 00:29:54,900 Just change the way we do BSTs. 564 00:29:54,900 --> 00:29:57,210 And we get, in some cases, a better bound. 565 00:29:57,210 --> 00:29:59,940 But not in all cases because maybe 566 00:29:59,940 --> 00:30:03,240 P plus log k might be bigger than P times log Sigma. 567 00:30:03,240 --> 00:30:06,620 And k and Sigma are kind of incomparable, so we don't know. 568 00:30:06,620 --> 00:30:13,350 That's where method 5 comes in, which 569 00:30:13,350 --> 00:30:15,620 is our good friend from last class-- 570 00:30:15,620 --> 00:30:18,752 leaf trimming and indirection. 571 00:30:22,200 --> 00:30:26,640 So we're going to use this idea of finding-- 572 00:30:26,640 --> 00:30:33,750 we're going to cut below maximally deep nodes 573 00:30:33,750 --> 00:30:36,200 with the right number of descendants in them. 574 00:30:43,820 --> 00:30:48,910 So we need at least Sigma descendants. 575 00:30:53,820 --> 00:30:56,090 It could just be descendants or descendant leaves, 576 00:30:56,090 --> 00:30:57,090 doesn't actually matter. 577 00:31:02,890 --> 00:31:06,416 Let me draw a picture, maybe. 578 00:31:06,416 --> 00:31:08,040 This is pretty much what we did before, 579 00:31:08,040 --> 00:31:12,500 except before this magic number was log n that we needed or 1/2 580 00:31:12,500 --> 00:31:13,820 log n or something. 581 00:31:13,820 --> 00:31:16,260 Now it's going to be Sigma that we need. 582 00:31:16,260 --> 00:31:18,815 So it is we find these maximally deep nodes-- 583 00:31:18,815 --> 00:31:22,880 these dots-- that have at least-- 584 00:31:22,880 --> 00:31:25,310 I guess, there is really multiple things hanging off 585 00:31:25,310 --> 00:31:25,970 here. 586 00:31:25,970 --> 00:31:29,990 In general, it could be several things hanging off. 587 00:31:29,990 --> 00:31:31,490 But the total number of descendants 588 00:31:31,490 --> 00:31:35,970 of each of these nodes is at least Sigma. 
589 00:31:35,970 --> 00:31:37,880 So what that implies is that the number 590 00:31:37,880 --> 00:31:43,420 of these dots, the number of the leaves in the top tree-- 591 00:31:43,420 --> 00:31:51,890 so up here-- number of leaves is at most T over Sigma. 592 00:31:51,890 --> 00:31:54,890 Because we can charge each of these nodes to Sigma 593 00:31:54,890 --> 00:31:58,490 descendants in each of them. 594 00:31:58,490 --> 00:32:03,160 So that's good because it says we can use method 1-- 595 00:32:03,160 --> 00:32:06,830 the simple array method-- which is fast but spacious. 596 00:32:06,830 --> 00:32:11,870 But if our new size of the trie gets divided by a Sigma factor, 597 00:32:11,870 --> 00:32:14,100 then this turns out to be linear. 598 00:32:14,100 --> 00:32:15,860 So up here we use method 1. 599 00:32:18,980 --> 00:32:21,170 Now, you got to be a little careful because we can't 600 00:32:21,170 --> 00:32:23,127 use method 1 on all the nodes. 601 00:32:23,127 --> 00:32:24,710 We can definitely use it on the leaves 602 00:32:24,710 --> 00:32:26,600 because there aren't too many leaves. 603 00:32:26,600 --> 00:32:32,910 That means we can also use it on the number of branching nodes. 604 00:32:32,910 --> 00:32:35,000 Number of branching nodes is also 605 00:32:35,000 --> 00:32:37,940 going to be, at most, T over Sigma 606 00:32:37,940 --> 00:32:40,550 because it's actually one fewer branching node 607 00:32:40,550 --> 00:32:43,310 than there are leaves. 608 00:32:43,310 --> 00:32:49,340 So great, I can use arrays on the leaves, 609 00:32:49,340 --> 00:32:52,340 I can use arrays on the branching nodes. 610 00:32:52,340 --> 00:32:54,950 I can't use it on the non-branching nodes. 611 00:32:54,950 --> 00:32:58,220 Non-branching nodes are nodes with a single descendant 612 00:32:58,220 --> 00:33:00,650 and everything else is null. 613 00:33:00,650 --> 00:33:03,490 What do I do for those nodes? 614 00:33:03,490 --> 00:33:06,470 Very difficult. 
I just store that one pointer in a storage 615 00:33:06,470 --> 00:33:07,340 label. 616 00:33:07,340 --> 00:33:09,670 I guess you could think of that as method 2 617 00:33:09,670 --> 00:33:11,360 in a very trivial case. 618 00:33:11,360 --> 00:33:13,670 You see-- is this the right label? 619 00:33:13,670 --> 00:33:15,310 Yes or no. 620 00:33:15,310 --> 00:33:17,975 So this is the non-branching nodes. 621 00:33:22,730 --> 00:33:25,910 Non-branching top nodes-- 622 00:33:25,910 --> 00:33:28,430 I will use method 2. 623 00:33:28,430 --> 00:33:30,260 So I guess this is really-- 624 00:33:30,260 --> 00:33:32,510 well, for these guys I use method 1, 625 00:33:32,510 --> 00:33:35,930 for these guys I use method 1. 626 00:33:35,930 --> 00:33:37,160 So I can afford all this. 627 00:33:37,160 --> 00:33:38,555 This will take order T space. 628 00:33:43,070 --> 00:33:46,280 And it will be fast because either I'm using arrays 629 00:33:46,280 --> 00:33:48,230 or I really don't have any work to do, 630 00:33:48,230 --> 00:33:50,440 and so it doesn't really matter what I do. 631 00:33:50,440 --> 00:33:52,190 But except I can't use arrays because they 632 00:33:52,190 --> 00:33:53,990 would be too spacious. 633 00:33:53,990 --> 00:33:55,170 So that handles the top. 634 00:33:55,170 --> 00:33:57,530 Now, the issue is, what about these bottom structures? 635 00:33:57,530 --> 00:33:59,540 The bottom structures-- what do we know? 636 00:33:59,540 --> 00:34:03,450 They have to have less than Sigma nodes, 637 00:34:03,450 --> 00:34:05,660 less than Sigma descendants. 638 00:34:05,660 --> 00:34:09,260 Also less than Sigma leaves. 639 00:34:09,260 --> 00:34:15,889 So in other words, in these trees 640 00:34:15,889 --> 00:34:19,100 we have k less than Sigma. 641 00:34:19,100 --> 00:34:21,889 Well, then we can afford to use method 4. 642 00:34:21,889 --> 00:34:25,530 Because our whole goal is to get k down to Sigma in this bound. 
643 00:34:25,530 --> 00:34:29,105 So in the bottom trees, we use method 4. 644 00:34:31,730 --> 00:34:33,500 Method 4 was always linear space. 645 00:34:33,500 --> 00:34:36,260 And the issue was we paid P plus log k. 646 00:34:36,260 --> 00:34:44,239 But now in here, k is less than Sigma in these trees. 647 00:34:44,239 --> 00:34:50,690 So that means we get order P plus log Sigma query time. 648 00:34:53,659 --> 00:34:56,550 And that's the best we know how to do if you want predecessor 649 00:34:56,550 --> 00:34:57,780 at the nodes. 650 00:34:57,780 --> 00:35:01,830 So it matches this tray bound in pretty easy way. 651 00:35:01,830 --> 00:35:04,790 Just to apply weight balanced, clean things up a little bit. 652 00:35:04,790 --> 00:35:07,800 But only do that at the leaves and everywhere up 653 00:35:07,800 --> 00:35:09,120 here, basically. 654 00:35:09,120 --> 00:35:11,114 Except the non-branching nodes use arrays. 655 00:35:11,114 --> 00:35:13,280 So for the most part arrays and then, at the bottom, 656 00:35:13,280 --> 00:35:16,400 you use weight balance. 657 00:35:16,400 --> 00:35:19,340 This is how you ought to represent a trie. 658 00:35:19,340 --> 00:35:22,172 If you want to preserve the order of the children, 659 00:35:22,172 --> 00:35:23,630 this is the best we know how to do. 660 00:35:23,630 --> 00:35:26,330 If you don't want to preserve order, just use a hash table. 661 00:35:26,330 --> 00:35:28,145 So it depends on the application. 662 00:35:32,110 --> 00:35:36,370 One fun application of this is string sorting. 663 00:35:39,370 --> 00:35:40,930 It's not a data structures problem 664 00:35:40,930 --> 00:35:42,804 so I don't want to spend too much time on it. 665 00:35:42,804 --> 00:35:45,340 But you use this trie data structure to sort strings. 666 00:35:45,340 --> 00:35:47,590 You just throw in a string and then throw in a string. 667 00:35:47,590 --> 00:35:52,670 We didn't talk about dynamic tries but it can be done. 
668 00:35:52,670 --> 00:35:54,670 And if you throw it, you just sort of find 669 00:35:54,670 --> 00:35:57,220 where you fall off and then add the thing. 670 00:35:57,220 --> 00:35:59,770 Now, you have to maintain all this funky stuff 671 00:35:59,770 --> 00:36:03,370 but weight balanced trees can be made dynamic and indirection 672 00:36:03,370 --> 00:36:05,030 can be made dynamic. 673 00:36:05,030 --> 00:36:09,880 So you end up with this sort of simple incremental scheme. 674 00:36:09,880 --> 00:36:14,410 You end up with T plus k log Sigma 675 00:36:14,410 --> 00:36:21,790 to sort k strings of total size T with alphabet size Sigma. 676 00:36:21,790 --> 00:36:22,720 This is good. 677 00:36:22,720 --> 00:36:26,077 If I used, for example, merge sort to sort strings, 678 00:36:26,077 --> 00:36:27,160 it's going to be very bad. 679 00:36:27,160 --> 00:36:32,650 It's going to be something like T times k times log something. 680 00:36:32,650 --> 00:36:34,150 We didn't really care about the log. 681 00:36:34,150 --> 00:36:35,420 T times k is bad. 682 00:36:35,420 --> 00:36:39,400 That's because comparing strings could potentially take T time. 683 00:36:39,400 --> 00:36:40,915 And then there's k of them. 684 00:36:40,915 --> 00:36:42,290 But this is linear. 685 00:36:42,290 --> 00:36:44,870 This is the sum of the lengths of the strings. 686 00:36:44,870 --> 00:36:46,510 There's this extra little term. 687 00:36:46,510 --> 00:36:48,670 But most of the time that's going to be dominated 688 00:36:48,670 --> 00:36:51,600 by the length of the strings. 689 00:36:51,600 --> 00:36:55,217 So that's a good way to sort strings using tries. 690 00:36:55,217 --> 00:36:57,800 Tries by themselves, I mean this is about all there is to say. 691 00:36:57,800 --> 00:37:02,870 So let's move on to suffix trees and compressed tries. 692 00:37:02,870 --> 00:37:06,721 Now, we actually did compressed tries in the signature sort 693 00:37:06,721 --> 00:37:07,220 lecture. 
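
The trie-based string sort mentioned a moment ago can be sketched as follows. This is a minimal illustration, not the dynamic structure from the lecture: it uses plain Python dictionaries for the nodes and sorts each node's keys during the traversal, so it does not reach the T plus k log Sigma bound. The name `trie_sort` is made up for this example, and it assumes the strings use lowercase letters and never contain the `$` terminator.

```python
def trie_sort(strings):
    """Sort strings by inserting them into a trie and walking it in order."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        # "$" marks end-of-string; the count handles duplicate strings.
        node["$"] = node.get("$", 0) + 1

    out = []

    def walk(node, prefix):
        # "$" sorts before lowercase letters, so a string comes out
        # before any longer string that extends it.
        for key in sorted(node):
            if key == "$":
                out.extend([prefix] * node[key])
            else:
                walk(node[key], prefix + key)

    walk(root, "")
    return out


print(trie_sort(["anne", "ana", "ann", "anna", "ana"]))
# ['ana', 'ana', 'ann', 'anna', 'anne']
```

Each character of each string is touched once on the way in, which is where the term linear in the total length comes from; the per-node ordering cost is what the weight-balanced representation is meant to control.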
694 00:37:14,492 --> 00:37:15,950 Actually, why don't I go over here? 695 00:37:25,210 --> 00:37:28,230 So tries-- branches were labeled with letters. 696 00:37:28,230 --> 00:37:32,160 That's still going to be true for a compressed trie. 697 00:37:32,160 --> 00:37:35,190 But as we saw in that lecture, in compressed trie 698 00:37:35,190 --> 00:37:37,820 we're going to get rid of the non-branching nodes. 699 00:37:41,650 --> 00:37:44,010 So idea with the compressed trie is very simple-- 700 00:37:44,010 --> 00:37:49,500 just contract non-branching paths into a single edge. 701 00:38:03,580 --> 00:38:05,800 This is our example of a trie. 702 00:38:05,800 --> 00:38:08,440 We're just going to modify it to make a compressed trie. 703 00:38:14,890 --> 00:38:17,920 Here we have a non-branching path. 704 00:38:17,920 --> 00:38:20,770 We have to follow an a, and then we have to follow an n. 705 00:38:20,770 --> 00:38:22,330 There's no point in having this node. 706 00:38:22,330 --> 00:38:24,038 You might as well just have a single edge 707 00:38:24,038 --> 00:38:26,560 that says a-n on it. 708 00:38:26,560 --> 00:38:29,470 So we go from here, from the root. 709 00:38:29,470 --> 00:38:33,370 We're going to have an edge that says a-n. 710 00:38:37,230 --> 00:38:40,560 And in some sense, the key of this child is a. 711 00:38:40,560 --> 00:38:42,540 If you're starting up here and you want to know 712 00:38:42,540 --> 00:38:45,820 which way should I go, you should only go this way 713 00:38:45,820 --> 00:38:47,700 if your first letter is a. 714 00:38:47,700 --> 00:38:49,410 After that, your next letter better be n, 715 00:38:49,410 --> 00:38:51,420 otherwise you fell off the tree. 716 00:38:51,420 --> 00:38:53,280 So that's the compression we're doing. 717 00:38:53,280 --> 00:38:55,310 Now, here we have-- this is a branching node, 718 00:38:55,310 --> 00:38:56,700 so that node we keep intact. 719 00:39:00,490 --> 00:39:03,840 This is an n, this is an a here. 
720 00:39:03,840 --> 00:39:06,370 But here it's non-branching. 721 00:39:06,370 --> 00:39:08,320 Let me draw this a little bit longer. 722 00:39:08,320 --> 00:39:10,350 In reality, it's just a single edge. 723 00:39:10,350 --> 00:39:13,000 And again, the key is a, and then you must have a $ sign 724 00:39:13,000 --> 00:39:14,070 afterwards. 725 00:39:14,070 --> 00:39:16,730 Then you reach a leaf, the first leaf. 726 00:39:16,730 --> 00:39:18,420 If we follow the n branch-- 727 00:39:18,420 --> 00:39:22,278 this is branching, so that node is preserved. 728 00:39:25,560 --> 00:39:28,730 If I go this way, it's a $ sign and I reach a leaf. 729 00:39:28,730 --> 00:39:33,030 If I go this way it's an a that must be followed by a $ sign, 730 00:39:33,030 --> 00:39:34,020 so that's a leaf. 731 00:39:34,020 --> 00:39:37,635 And if I go this way, it must be an e, followed by a $ sign, 732 00:39:37,635 --> 00:39:39,780 which is a leaf. 733 00:39:39,780 --> 00:39:43,080 Again, these four leaves can point to these places. 734 00:39:43,080 --> 00:39:44,640 That's a compressed trie. 735 00:39:44,640 --> 00:39:45,967 Pretty obvious. 736 00:39:45,967 --> 00:39:48,300 The nice thing about the compressed trie is the number-- 737 00:39:48,300 --> 00:39:50,258 here we knew the number of branching nodes, 738 00:39:50,258 --> 00:39:51,780 it was at most the number of leaves. 739 00:39:51,780 --> 00:39:53,510 Over here, the number of internal nodes 740 00:39:53,510 --> 00:39:54,843 is at most the number of leaves. 741 00:39:54,843 --> 00:39:59,540 So this structure has order k nodes in total 742 00:39:59,540 --> 00:40:02,160 because we got rid of all the non-branching nodes. 743 00:40:02,160 --> 00:40:04,852 I guess except the root, the root might not be branching. 744 00:40:07,500 --> 00:40:09,330 We've got a big O there to cover us.
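
The contraction of non-branching paths described above can be sketched in Python. The nested-dictionary representation and the function names `build_trie` and `compress` are assumptions made for this illustration. The four input strings are chosen to be consistent with the edges read off the board (an edge a-n from the root, then a, n, and the $ signs), not quoted from the lecture.

```python
def build_trie(strings):
    """Plain trie as nested dicts: one letter per edge, "$" ends a string."""
    root = {}
    for s in strings:
        node = root
        for ch in s + "$":
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Contract every non-branching path into one multi-letter edge."""
    out = {}
    for ch, child in node.items():
        label = ch
        # Keep swallowing single-child nodes into the edge label.
        while len(child) == 1:
            (next_ch, grandchild), = child.items()
            label += next_ch
            child = grandchild
        out[label] = compress(child)
    return out


trie = build_trie(["ana", "ann", "anna", "anne"])
print(compress(trie))
# {'an': {'a$': {}, 'n': {'$': {}, 'a$': {}, 'e$': {}}}}
```

The first letter of each edge label still acts as the key for choosing a child, as in the lecture; the remaining letters on the label must then match or the search falls off the tree.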
745 00:40:12,000 --> 00:40:14,790 And all the things we said about representing tries here, 746 00:40:14,790 --> 00:40:18,300 you can do the same thing with a compressed trie. 747 00:40:18,300 --> 00:40:22,990 I need to write down that 3.5 here. 748 00:40:33,980 --> 00:40:35,990 And in fact, these results get better because 749 00:40:35,990 --> 00:40:40,880 before order T meant the number of nodes in the trie. 750 00:40:40,880 --> 00:40:42,730 Now order T will be the number of nodes 751 00:40:42,730 --> 00:40:45,770 in the compressed trie, which is actually order k. 752 00:40:45,770 --> 00:40:50,902 So life gets really good in this world. 753 00:40:50,902 --> 00:40:52,610 I did it in the trie setting because it's 754 00:40:52,610 --> 00:40:53,760 just simpler to think about. 755 00:40:53,760 --> 00:40:55,968 But really, you would always store a compressed trie. 756 00:40:55,968 --> 00:40:57,942 There's no point in storing a trie. 757 00:40:57,942 --> 00:41:00,080 You can still do the same kinds of searches. 758 00:41:04,010 --> 00:41:09,150 But really, compressed tries are warm up for suffix trees. 759 00:41:09,150 --> 00:41:10,820 So let's talk about suffix trees. 760 00:41:14,720 --> 00:41:18,910 Suffix trees are a compressed trie. 761 00:41:18,910 --> 00:41:22,790 So really they should be called suffix tries. 762 00:41:22,790 --> 00:41:27,050 And occasionally, people will call them suffix tries. 763 00:41:27,050 --> 00:41:28,745 But most people call them suffix trees, 764 00:41:28,745 --> 00:41:31,550 so for consistency I'll call them trees as well. 765 00:41:31,550 --> 00:41:32,423 But they are tries. 766 00:41:42,542 --> 00:41:45,110 I'm going to introduce some notation here. 767 00:41:53,457 --> 00:41:55,040 With tries, we are thinking about lots 768 00:41:55,040 --> 00:41:56,240 of different strings. 769 00:41:56,240 --> 00:41:59,590 In this case, we're going back to our string matching problem. 
770 00:41:59,590 --> 00:42:02,940 We have a single text and we want to preprocess that text. 771 00:42:02,940 --> 00:42:04,940 But we're going to turn it into multiple strings 772 00:42:04,940 --> 00:42:07,970 by looking at all suffixes of the string. 773 00:42:07,970 --> 00:42:09,860 This is Python notation for everything 774 00:42:09,860 --> 00:42:12,590 from letter i onwards. 775 00:42:12,590 --> 00:42:15,440 And we do that for all i, so that's a lot of strings. 776 00:42:15,440 --> 00:42:18,824 And we build the compressed trie over them. 777 00:42:18,824 --> 00:42:19,490 That's the idea. 778 00:42:19,490 --> 00:42:22,340 And to make it work out-- because you remember, 779 00:42:22,340 --> 00:42:25,790 with tries we had to append a $ sign to every string. 780 00:42:25,790 --> 00:42:28,700 In this case, we just have to append a $ sign to T, 781 00:42:28,700 --> 00:42:31,220 and then all suffixes will end with a $ sign. 782 00:42:31,220 --> 00:42:33,590 So that covers us. $ sign, again, 783 00:42:33,590 --> 00:42:36,510 is a character not appearing in the alphabet. 784 00:42:36,510 --> 00:42:37,372 And that's it. 785 00:42:37,372 --> 00:42:38,330 So that's a definition. 786 00:42:38,330 --> 00:42:39,163 Let's do an example. 787 00:42:49,240 --> 00:42:51,880 At this point, we're going for this goal of order P query, 788 00:42:51,880 --> 00:42:54,010 order T space. 789 00:42:54,010 --> 00:42:57,130 Suffix trees will be a way to achieve that goal. 790 00:43:03,820 --> 00:43:10,375 Let's do my favorite example which is banana. 791 00:43:13,540 --> 00:43:17,650 I had a friend who said, I know how to spell banana, 792 00:43:17,650 --> 00:43:19,900 I just don't know when to stop. 793 00:43:19,900 --> 00:43:22,990 There's a nice pattern to it and a lot of repeated letters 794 00:43:22,990 --> 00:43:24,700 and so on. 795 00:43:24,700 --> 00:43:26,230 I've got to number the characters. 796 00:43:26,230 --> 00:43:28,880 He said that when he was like six, not when he was older.
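
The definition above, all suffixes T[i:] of the text with a $ sign appended, is easy to write out directly in Python. This is just an illustrative sketch for the banana example; the variable names are made up, and sorting relies on `$` comparing smaller than lowercase letters, which holds for ASCII.

```python
T = "banana" + "$"  # the terminator is appended once, to T itself

# One suffix per starting position i, using the slice notation T[i:].
suffixes = [T[i:] for i in range(len(T))]
for i, s in enumerate(suffixes):
    print(i, s)

# Starting positions listed in sorted order of their suffixes.
order = sorted(range(len(T)), key=lambda i: T[i:])
print(order)  # [6, 5, 3, 1, 0, 4, 2]
```

Materializing every suffix like this takes quadratic space in the length of T, so it is only for looking at small examples.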
797 00:43:31,277 --> 00:43:33,610 It's a little harder when you're writing it on the board 798 00:43:33,610 --> 00:43:36,580 but we all know how to spell banana, I hope. 799 00:43:36,580 --> 00:43:37,790 I got it right, right? 800 00:43:37,790 --> 00:43:40,600 It should be 7 letters, including the $ sign. 801 00:43:43,395 --> 00:43:44,020 There they are. 802 00:43:44,020 --> 00:43:46,190 So there's a suffix which is the whole string. 803 00:43:46,190 --> 00:43:48,670 There's a suffix which is a, n, a, n, a, $ sign. 804 00:43:48,670 --> 00:43:50,710 There is a suffix which is n, a, n, a, $ sign. 805 00:43:50,710 --> 00:43:52,459 There's a suffix which is a, n, a, $ sign. 806 00:43:52,459 --> 00:43:53,890 Suffix n, a, $ sign. a, $ sign. 807 00:43:53,890 --> 00:43:55,140 And $ sign. 808 00:43:55,140 --> 00:43:58,340 And empty, I suppose, but we're not going to store that one. 809 00:43:58,340 --> 00:44:01,210 You don't need to. 810 00:44:01,210 --> 00:44:02,422 Cool. 811 00:44:02,422 --> 00:44:04,630 I'm going to cheat a little bit and look at my figure 812 00:44:04,630 --> 00:44:07,350 because it is a little bit of thinking. 813 00:44:07,350 --> 00:44:09,400 The final challenge of this lecture 814 00:44:09,400 --> 00:44:14,560 will be to construct this diagram in linear time. 815 00:44:14,560 --> 00:44:38,115 But I'm, just for now, going to cheat 816 00:44:38,115 --> 00:44:39,990 because it's a little tricky to do it and get 817 00:44:39,990 --> 00:44:41,340 all the nodes in sorted order. 818 00:44:57,980 --> 00:44:59,320 So that should give it to us. 819 00:44:59,320 --> 00:45:02,155 And then the suffixes. 820 00:45:02,155 --> 00:45:04,630 Here is another color. 821 00:45:04,630 --> 00:45:14,810 6, 5, 3, 1, 0, 4, 2. 822 00:45:14,810 --> 00:45:16,420 Cool. 823 00:45:16,420 --> 00:45:18,580 This I claim is a suffix tree of banana. 824 00:45:18,580 --> 00:45:20,530 You see the banana substring.
825 00:45:20,530 --> 00:45:24,670 Then the next one is a, n, a, n, a, $ sign. 826 00:45:24,670 --> 00:45:27,700 Then the next one is n, a, n, a, $ sign. 827 00:45:27,700 --> 00:45:31,570 Then the next one is a, n, a, $ sign. 828 00:45:31,570 --> 00:45:34,380 Next one is n, a, $ sign. 829 00:45:34,380 --> 00:45:35,840 Next one is a, $ sign. 830 00:45:35,840 --> 00:45:37,710 And then $ sign. 831 00:45:37,710 --> 00:45:39,917 So that's a nice, clean representation 832 00:45:39,917 --> 00:45:40,750 of all the suffixes. 833 00:45:40,750 --> 00:45:43,000 And you can see that if you wanted to search from 834 00:45:43,000 --> 00:45:45,250 the middle of this string-- suppose I want to search 835 00:45:45,250 --> 00:45:46,510 for n-a-n-- 836 00:45:46,510 --> 00:45:47,490 then it's right there. 837 00:45:47,490 --> 00:45:51,700 Just do n, a, n, then I'm done. 838 00:45:51,700 --> 00:45:54,130 This virtual node in the middle here, 839 00:45:54,130 --> 00:45:56,980 about one third of the way down the edge, 840 00:45:56,980 --> 00:46:00,100 that represents n-a-n. 841 00:46:00,100 --> 00:46:02,170 And indeed, if you look at the descendant leaf, 842 00:46:02,170 --> 00:46:05,470 that corresponds to an occurrence of n-a-n. 843 00:46:05,470 --> 00:46:08,830 If I was going to look for a-n, I 844 00:46:08,830 --> 00:46:12,880 would do a, n, so halfway down this edge. 845 00:46:12,880 --> 00:46:17,920 And then this subtree represents all the occurrences of a-n. 846 00:46:17,920 --> 00:46:19,210 Think about it. 847 00:46:19,210 --> 00:46:21,220 There's two of them-- 848 00:46:21,220 --> 00:46:25,450 one that starts at position 3, one that starts at position 1. 849 00:46:25,450 --> 00:46:27,067 Here's one occurrence of a-n, here's 850 00:46:27,067 --> 00:46:28,150 another occurrence of a-n. 851 00:46:28,150 --> 00:46:29,858 This works even when they're overlapping. 852 00:46:29,858 --> 00:46:32,717 If I search for a-n-a, I would get here.
853 00:46:32,717 --> 00:46:35,050 And then these are the two occurrences of a-n-a and they 854 00:46:35,050 --> 00:46:36,400 actually overlap each other-- 855 00:46:36,400 --> 00:46:38,764 this one and this one. 856 00:46:38,764 --> 00:46:40,180 So this is a great data structure, 857 00:46:40,180 --> 00:46:43,940 it solves what we need. 858 00:46:43,940 --> 00:46:46,512 It's all substring searching. 859 00:47:01,460 --> 00:47:03,350 Applications of suffix trees. 860 00:47:18,570 --> 00:47:21,860 Just do a search in the trie for a particular pattern. 861 00:47:21,860 --> 00:47:42,800 We get a subtree representing all of the occurrences of P in T. 862 00:47:42,800 --> 00:47:44,150 So this is great. 863 00:47:44,150 --> 00:47:47,690 In order P time, walking down this structure, 864 00:47:47,690 --> 00:47:49,820 I can figure out all the occurrences. 865 00:47:49,820 --> 00:47:52,190 And then, if I want to know how many there were, 866 00:47:52,190 --> 00:47:54,110 I could just store subtree sizes-- 867 00:47:54,110 --> 00:47:55,940 number of leaves below every node. 868 00:47:55,940 --> 00:47:59,270 If I wanted to list them, I could just 869 00:47:59,270 --> 00:48:00,890 do an in-order traversal. 870 00:48:00,890 --> 00:48:03,230 And I'll even get them in order. 871 00:48:03,230 --> 00:48:08,900 So in particular, if I wanted to list the first 10 occurrences, 872 00:48:08,900 --> 00:48:12,800 I could store the left-most leaf from every node, teleport down 873 00:48:12,800 --> 00:48:14,870 to the first occurrence in constant time. 874 00:48:14,870 --> 00:48:17,600 And then I could just have a linked list of all the leaves. 875 00:48:17,600 --> 00:48:19,760 So once I find the first one, I can just 876 00:48:19,760 --> 00:48:22,880 follow until I find, oh, that's not an occurrence of P. 877 00:48:22,880 --> 00:48:25,520 So I can list the first k of them in order k time 878 00:48:25,520 --> 00:48:28,160 once I've done the search of order P time.
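
To make the search concrete, here is a rough Python sketch of walking down from the root and reading off the occurrences of P. It is deliberately naive and differs from the lecture in two labeled ways: it builds an uncompressed suffix trie in quadratic time, not the compressed suffix tree in linear time, and each node keeps an explicit list of starting positions where the lecture stores leaf counts and leaf links. The names `build_suffix_trie` and `occurrences` are made up for this example.

```python
def build_suffix_trie(text):
    """Naive uncompressed suffix trie of text + "$" (quadratic time/space)."""
    text += "$"
    root = {"children": {}, "starts": []}
    for i in range(len(text)):
        node = root
        node["starts"].append(i)
        for ch in text[i:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "starts": []}
            )
            # Suffix i passes through this node, so it is a leaf below it.
            node["starts"].append(i)
    return root


def occurrences(root, pattern):
    """Walk down one letter of the pattern at a time; O(|P|) dict lookups."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []  # fell off the trie: no occurrence
    return node["starts"]


trie = build_suffix_trie("banana")
print(occurrences(trie, "an"))   # [1, 3]
print(occurrences(trie, "ana"))  # [1, 3]  (the two occurrences overlap)
print(occurrences(trie, "nan"))  # [2]
```

The length of the returned list plays the role of the stored subtree size: it counts the leaves below the node where the search ends.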
879 00:48:28,160 --> 00:48:30,110 So this is really good searching. 880 00:48:30,110 --> 00:48:32,360 And it's the ideal situation. 881 00:48:32,360 --> 00:48:34,520 You can list any information you want about all 882 00:48:34,520 --> 00:48:38,150 of the answers in the optimal time and size of the output. 883 00:48:40,670 --> 00:48:43,640 How big is this data structure? 884 00:48:43,640 --> 00:48:51,008 Well, there are T suffixes, so k is the size of T. 885 00:48:51,008 --> 00:48:53,630 And when we look at our trie representations, 886 00:48:53,630 --> 00:48:55,730 our general goal was to get-- 887 00:48:55,730 --> 00:48:59,817 here, capital T was the sum of the lengths. 888 00:48:59,817 --> 00:49:01,400 Well, sum of the lengths is not good-- 889 00:49:01,400 --> 00:49:02,702 that would be quadratic-- 890 00:49:02,702 --> 00:49:04,160 sum of the lengths of the suffixes. 891 00:49:04,160 --> 00:49:08,420 But we also said, or the number of nodes in the trie. 892 00:49:08,420 --> 00:49:10,745 And we know the number of leaves in this trie 893 00:49:10,745 --> 00:49:15,054 is exactly the size of T. And so because it's a compressed trie, 894 00:49:15,054 --> 00:49:16,470 the number of internal nodes 895 00:49:16,470 --> 00:49:19,640 is also less than the size of T. So the total number of nodes 896 00:49:19,640 --> 00:49:24,890 here is order T. And so if we use any 897 00:49:24,890 --> 00:49:26,750 of the reasonable representations, 898 00:49:26,750 --> 00:49:27,900 we get order T space. 899 00:49:33,020 --> 00:49:36,020 Now, there's one issue which is, how long does a search for P 900 00:49:36,020 --> 00:49:36,950 cost? 901 00:49:36,950 --> 00:49:38,630 And it depends on our representation, 902 00:49:38,630 --> 00:49:41,180 it depends how quickly we can traverse a node. 903 00:49:41,180 --> 00:49:42,860 If we use hashing-- 904 00:49:42,860 --> 00:49:51,740 method 3-- use hashing, then we get order P time.
905 00:49:55,310 --> 00:49:58,100 But the trouble with hashing is it permutes 906 00:49:58,100 --> 00:50:00,650 the children of every node. 907 00:50:00,650 --> 00:50:02,360 So in that situation, the leaves will not 908 00:50:02,360 --> 00:50:05,799 be ordered in the same way that they're ordered in the string. 909 00:50:05,799 --> 00:50:08,090 So if you really want to be able to find the first five 910 00:50:08,090 --> 00:50:11,060 occurrences of the pattern P, you can't use hashing. 911 00:50:11,060 --> 00:50:12,680 You can find some five occurrences 912 00:50:12,680 --> 00:50:15,200 but you won't find the first in the usual ordering 913 00:50:15,200 --> 00:50:16,770 of the string. 914 00:50:16,770 --> 00:50:19,280 So if you really want the first five 915 00:50:19,280 --> 00:50:23,750 and you want them in order, then you should use trays-- 916 00:50:23,750 --> 00:50:26,100 this method 6 that we used. 917 00:50:26,100 --> 00:50:26,780 6? 918 00:50:26,780 --> 00:50:28,220 5. 919 00:50:28,220 --> 00:50:35,230 If we use trays, then it will be order P times log Sigma-- 920 00:50:38,050 --> 00:50:40,640 sorry, order P plus log Sigma. 921 00:50:40,640 --> 00:50:43,720 That was our query time. 922 00:50:43,720 --> 00:50:47,030 Here, P plus log Sigma. 923 00:50:47,030 --> 00:50:50,240 Small penalty to pay but the nice thing is then your answers 924 00:50:50,240 --> 00:50:52,310 are represented in order. 925 00:50:52,310 --> 00:50:56,840 No permutation, no hashing, no randomization. 926 00:50:56,840 --> 00:50:58,940 This is the reason suffix trees were invented-- 927 00:50:58,940 --> 00:51:00,680 they let you do searches fast. 928 00:51:00,680 --> 00:51:03,500 But actually, they let you do a ton of things fast. 929 00:51:03,500 --> 00:51:05,930 And I want to quickly give you an overview 930 00:51:05,930 --> 00:51:08,999 of the zillions of things you can do with the suffix tree.
931 00:51:08,999 --> 00:51:10,790 And then I want to get to how to build them 932 00:51:10,790 --> 00:51:16,205 in linear time, which has some interesting algorithms/data 933 00:51:16,205 --> 00:51:19,184 structures. 934 00:51:19,184 --> 00:51:20,600 I already talked about if you want 935 00:51:20,600 --> 00:51:21,980 to find the first k occurrences, you 936 00:51:21,980 --> 00:51:23,150 can do that in order k time. 937 00:51:23,150 --> 00:51:25,280 If you want to find the number of occurrences, 938 00:51:25,280 --> 00:51:26,654 you can do that in constant time, 939 00:51:26,654 --> 00:51:29,174 just by augmenting the subtree sizes. 940 00:51:29,174 --> 00:51:30,590 Here's another thing you could do. 941 00:51:30,590 --> 00:51:32,990 Suppose you have a very long string. 942 00:51:32,990 --> 00:51:35,160 I mean think of T as an entire document. 943 00:51:35,160 --> 00:51:38,360 You know, it could be the Merriam-Webster dictionary 944 00:51:38,360 --> 00:51:41,320 or it could be the web. 945 00:51:41,320 --> 00:51:44,069 We're imagining T to be the huge data structure. 946 00:51:44,069 --> 00:51:46,610 And then we're able to search for substrings within that data 947 00:51:46,610 --> 00:51:50,130 structure very fast. 948 00:51:50,130 --> 00:51:52,430 So that's cool. 949 00:51:52,430 --> 00:51:53,680 Here's an interesting puzzle. 950 00:51:53,680 --> 00:51:57,790 What is the longest substring-- what is the longest string that 951 00:51:57,790 --> 00:52:00,280 appears twice on the web? 952 00:52:00,280 --> 00:52:02,260 This is called the longest repeated substring. 953 00:52:02,260 --> 00:52:04,610 Could be overlapping, maybe not. 954 00:52:04,610 --> 00:52:07,690 Well, you take the web, you throw it in the suffix tree-- 955 00:52:07,690 --> 00:52:09,500 not sure anyone could actually do that-- 956 00:52:09,500 --> 00:52:11,762 but small part of the web. 957 00:52:11,762 --> 00:52:13,345 Dictionary-- this would be no problem. 
958 00:52:17,260 --> 00:52:18,560 Wikipedia would be feasible. 959 00:52:18,560 --> 00:52:21,280 You take Wikipedia, you throw it in the suffix tree. 960 00:52:21,280 --> 00:52:24,820 And what I'm interested in is, basically, 961 00:52:24,820 --> 00:52:29,230 a node that has two, at least two descendant leaves. 962 00:52:29,230 --> 00:52:31,749 And if I'm counting the number of leaves at every node, 963 00:52:31,749 --> 00:52:33,790 I could just do one pass over this data structure 964 00:52:33,790 --> 00:52:35,529 and find what are all the nodes that have 965 00:52:35,529 --> 00:52:36,820 at least two descendant leaves. 966 00:52:36,820 --> 00:52:39,520 That's all the internal nodes. 967 00:52:39,520 --> 00:52:42,280 And then among them I'd also like to know how deep is it. 968 00:52:42,280 --> 00:52:46,330 Because the depth corresponds to how long the string is. 969 00:52:46,330 --> 00:52:48,280 This one is a-n-a so this one has, 970 00:52:48,280 --> 00:52:51,036 I call it, a letter depth of 3. 971 00:52:51,036 --> 00:52:52,410 This one has a letter depth of 1. 972 00:52:52,410 --> 00:52:53,785 This one has a letter depth of 2. 973 00:52:53,785 --> 00:52:55,784 So I just want to find the deepest node that has 974 00:52:55,784 --> 00:52:57,130 at least two descendant leaves. 975 00:52:57,130 --> 00:53:00,151 In linear time, I could find the longest repeated substring. 976 00:53:00,151 --> 00:53:01,900 Or I could find the longest substring that 977 00:53:01,900 --> 00:53:03,520 appears five times or whatever. 978 00:53:03,520 --> 00:53:05,530 I just do one pass over this thing, 979 00:53:05,530 --> 00:53:08,087 find the deepest node that has my threshold of leaves. 980 00:53:08,087 --> 00:53:09,670 So that's kind of a neat thing you can 981 00:53:09,670 --> 00:53:11,440 do in linear time on a string. 982 00:53:14,780 --> 00:53:16,580 Here's another fun one. 983 00:53:16,580 --> 00:53:18,920 Suppose I have this giant string. 
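The longest-repeated-substring idea above (find the deepest node, by letter depth, with at least a threshold number of descendant leaves) can be sketched as follows. This is an illustrative Python sketch, not the linear-time method: it uses an uncompressed suffix trie, so it takes quadratic time and space, and the function name and the `min_occurrences` parameter are assumptions made for this example. It assumes the text contains no `$` and that `min_occurrences` is at least 2.

```python
def longest_repeated_substring(text, min_occurrences=2):
    """Longest substring occurring at least min_occurrences times (overlaps allowed)."""
    s = text + "$"
    root = {"children": {}, "leaves": 0}
    # Insert every suffix, counting how many suffixes pass through each node.
    for i in range(len(s)):
        node = root
        node["leaves"] += 1
        for ch in s[i:]:
            node = node["children"].setdefault(ch, {"children": {}, "leaves": 0})
            node["leaves"] += 1

    # One pass over the trie: keep the deepest node that meets the threshold.
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node["leaves"] < min_occurrences:
            continue  # nothing below this node can meet the threshold either
        if len(prefix) > len(best):
            best = prefix  # len(prefix) is the letter depth of this node
        for ch, child in node["children"].items():
            stack.append((child, prefix + ch))
    return best


print(longest_repeated_substring("banana"))     # ana
print(longest_repeated_substring("banana", 3))  # a
```

If several substrings tie for the longest length, this sketch returns whichever one the traversal reaches first.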
984 00:53:18,920 --> 00:53:21,930 And I just want to compare two substrings in it. 985 00:53:21,930 --> 00:53:25,730 So here's my giant string. 986 00:53:25,730 --> 00:53:29,360 And suppose I want to measure how long is the repeated 987 00:53:29,360 --> 00:53:30,290 substring. 988 00:53:30,290 --> 00:53:31,940 So I say, well, I've got position i, 989 00:53:31,940 --> 00:53:32,944 I've got position j. 990 00:53:32,944 --> 00:53:35,360 Let's say I already know that they match for a little bit. 991 00:53:35,360 --> 00:53:37,220 I want to know, how long do they match? 992 00:53:37,220 --> 00:53:40,790 How far can I go to the right and have them still match? 993 00:53:43,217 --> 00:53:44,050 How could I do that? 994 00:53:44,050 --> 00:53:46,580 Well, I could look at the suffix starting at i. 995 00:53:46,580 --> 00:53:48,510 That corresponds to a leaf over here. 996 00:53:48,510 --> 00:53:51,380 And I could look at the suffix starting at j. 997 00:53:51,380 --> 00:53:55,100 That corresponds to some other leaf. 998 00:53:55,100 --> 00:53:59,000 And what is the length of the longest common prefix 999 00:53:59,000 --> 00:54:02,560 of those two suffixes in the suffix tree? 1000 00:54:07,040 --> 00:54:12,150 Three letters-- LCA. 1001 00:54:12,150 --> 00:54:16,110 If I take the LCA of those two leaves-- for example, 1002 00:54:16,110 --> 00:54:19,270 I take these two leaves-- 1003 00:54:19,270 --> 00:54:21,970 the LCA gives me the longest common prefix. 1004 00:54:21,970 --> 00:54:23,500 Then they branch. 1005 00:54:23,500 --> 00:54:25,780 So longest common prefix of these two suffixes 1006 00:54:25,780 --> 00:54:28,360 is the letter a, so it's just length 1. 1007 00:54:28,360 --> 00:54:31,030 And again, if I label every node with the letter depth, 1008 00:54:31,030 --> 00:54:33,340 I can figure out exactly how long these guys match, 1009 00:54:33,340 --> 00:54:35,450 even if they overlap. 
1010 00:54:35,450 --> 00:54:37,150 So in constant time-- because we already 1011 00:54:37,150 --> 00:54:39,670 have a constant time LCA query. 1012 00:54:39,670 --> 00:54:41,590 Linear space, constant time query. 1013 00:54:41,590 --> 00:54:43,031 Given any two positions i and j, I 1014 00:54:43,031 --> 00:54:45,280 can tell you how long they match for in constant time. 1015 00:54:45,280 --> 00:54:47,549 Boom-- instantaneously. 1016 00:54:47,549 --> 00:54:48,340 It's kind of crazy. 1017 00:54:48,340 --> 00:54:51,310 So you can do tons of these queries instantly. 1018 00:54:51,310 --> 00:54:53,770 That's one reason why people care about LCAs, 1019 00:54:53,770 --> 00:54:54,770 there are other reasons. 1020 00:54:54,770 --> 00:54:58,630 But mostly LCAs were developed for suffix trees 1021 00:54:58,630 --> 00:54:59,800 to answer queries like that. 1022 00:55:02,650 --> 00:55:03,310 Got some more. 1023 00:55:08,940 --> 00:55:11,010 Why don't I just write-- 1024 00:55:11,010 --> 00:55:19,620 LCP of one suffix and another suffix 1025 00:55:19,620 --> 00:55:22,250 is equivalent to an LCA query. 1026 00:55:22,250 --> 00:55:25,050 And so we can do that in constant time 1027 00:55:25,050 --> 00:55:26,462 after pre-processing. 1028 00:55:38,600 --> 00:55:39,920 Here's another one. 1029 00:55:39,920 --> 00:55:52,180 Suppose I want to find all occurrences of T i to j. 1030 00:55:55,810 --> 00:55:57,670 So I give you a substring and I want 1031 00:55:57,670 --> 00:56:00,070 to know where does that occur. 1032 00:56:00,070 --> 00:56:03,800 The substring is restricted to come from the text. 1033 00:56:03,800 --> 00:56:05,080 Now, this is a little subtle. 1034 00:56:05,080 --> 00:56:08,860 Of course, I could solve it in j minus i plus 1 time. 1035 00:56:08,860 --> 00:56:10,660 I just do the search. 1036 00:56:10,660 --> 00:56:14,470 But what if I want to do it in constant time? 1037 00:56:14,470 --> 00:56:16,435 Maybe this is a really big substring. 
1038 00:56:16,435 --> 00:56:18,740 But I still know it appears multiple times. 1039 00:56:18,740 --> 00:56:21,120 I want to know how many times does it appear. 1040 00:56:21,120 --> 00:56:24,100 I claim I can do this in constant time. 1041 00:56:24,100 --> 00:56:26,320 How? 1042 00:56:26,320 --> 00:56:30,040 This is a level ancestor query. 1043 00:56:30,040 --> 00:56:32,050 Why is it a level ancestor query? 1044 00:56:32,050 --> 00:56:35,380 If I look at the suffix starting at i, 1045 00:56:35,380 --> 00:56:38,470 and then I just want to trim off, I want to stop. 1046 00:56:38,470 --> 00:56:40,510 Or I don't care about the entire suffix, 1047 00:56:40,510 --> 00:56:43,330 I just want to do that j. 1048 00:56:43,330 --> 00:56:45,580 It's like saying, well, suppose I'm looking 1049 00:56:45,580 --> 00:56:47,590 for occurrences of a-n-a. 1050 00:56:47,590 --> 00:56:51,520 So I go and I start at the first occurrence of a-n-a, 1051 00:56:51,520 --> 00:56:54,930 which is a-n-a-n-a-$, so this is the leaf corresponding 1052 00:56:54,930 --> 00:56:55,974 to a-n-a. 1053 00:56:55,974 --> 00:56:58,140 And then if I want to find all occurrences of a-n-a, 1054 00:56:58,140 --> 00:57:03,910 I just need to go up to the ancestor that represents a-n-a. 1055 00:57:03,910 --> 00:57:09,787 This is what I call a weighted level ancestor. 1056 00:57:09,787 --> 00:57:11,370 That's not quite the problem we solved 1057 00:57:11,370 --> 00:57:18,490 in the last lecture, lecture 15, because now it's weighted. 1058 00:57:18,490 --> 00:57:28,577 So it's level ancestor j minus i of the T i suffix leaf. 1059 00:57:28,577 --> 00:57:30,160 So I find this leaf, which I just have 1060 00:57:30,160 --> 00:57:31,450 an array of all the leaves. 1061 00:57:31,450 --> 00:57:34,737 Given a suffix, tell me what leaf it is in the suffix tree. 1062 00:57:34,737 --> 00:57:36,820 And then I want to find the j minus i-th ancestor, 1063 00:57:36,820 --> 00:57:39,820 except the edges don't just have unit length. 
1064 00:57:39,820 --> 00:57:42,160 So here I want to find the third ancestor, 1065 00:57:42,160 --> 00:57:45,190 except it's really the ancestor in the compressed trie. 1066 00:57:45,190 --> 00:57:47,240 I want to do the j minus i-th ancestor 1067 00:57:47,240 --> 00:57:49,900 in the suffix in the trie, but what 1068 00:57:49,900 --> 00:57:51,850 I have is a compressed tree. 1069 00:57:51,850 --> 00:57:54,880 And so these edges are labeled with how many characters 1070 00:57:54,880 --> 00:57:58,000 are on them and I got to deal with that. 1071 00:57:58,000 --> 00:58:00,980 Fortunately, the data structure we gave for a level ancestor-- 1072 00:58:00,980 --> 00:58:03,040 which was constant time query, linear space-- 1073 00:58:03,040 --> 00:58:05,140 can be fairly easily adapted to weights. 1074 00:58:08,170 --> 00:58:10,710 Not quite in constant time though. 1075 00:58:10,710 --> 00:58:14,860 It can be solved in log log n time. 1076 00:58:14,860 --> 00:58:17,440 And I think that's optimal. 1077 00:58:17,440 --> 00:58:23,710 Because if your thing is a single path with maybe 1078 00:58:23,710 --> 00:58:28,720 the occasional branch, then finding your i-th ancestor here 1079 00:58:28,720 --> 00:58:31,430 is like solving a predecessor problem. 1080 00:58:31,430 --> 00:58:36,190 Because you say, well, from the i-th position up, 1081 00:58:36,190 --> 00:58:40,150 I want to know what is the previous-- 1082 00:58:40,150 --> 00:58:41,887 I want to round up or round down. 1083 00:58:41,887 --> 00:58:43,720 So I want to do a predecessor or a successor 1084 00:58:43,720 --> 00:58:45,600 on this straight line. 1085 00:58:45,600 --> 00:58:47,200 And so for a predecessor you need 1086 00:58:47,200 --> 00:58:51,610 log log time for the right parameters 1087 00:58:51,610 --> 00:58:53,206 and this can be achieved. 1088 00:58:53,206 --> 00:58:55,330 And the basic idea is you use ladder decomposition, 1089 00:58:55,330 --> 00:58:56,440 just like before. 
1090 00:58:56,440 --> 00:58:58,840 But now a ladder can't be represented by an array 1091 00:58:58,840 --> 00:59:01,760 because there are lots of absent places in the array. 1092 00:59:01,760 --> 00:59:04,540 Now instead, use a predecessor, use a Van Emde Boas 1093 00:59:04,540 --> 00:59:06,260 to represent a ladder. 1094 00:59:06,260 --> 00:59:07,870 So that's basically all you do. 1095 00:59:07,870 --> 00:59:13,630 Van Emde Boas represents a ladder. 1096 00:59:13,630 --> 00:59:15,517 That's what you do in the top structure. 1097 00:59:15,517 --> 00:59:17,350 Remember, we had indirection, leaf trimming, 1098 00:59:17,350 --> 00:59:19,058 top was this thing, ladder decomposition. 1099 00:59:19,058 --> 00:59:21,317 Bottom was lookup tables. 1100 00:59:21,317 --> 00:59:23,650 The other problem is you can't use lookup tables anymore 1101 00:59:23,650 --> 00:59:26,530 because in one of these tiny trees 1102 00:59:26,530 --> 00:59:28,030 you could have a super long path. 1103 00:59:28,030 --> 00:59:30,040 It's non-branching, they got compressed. 1104 00:59:30,040 --> 00:59:31,420 And you can't afford to enumerate 1105 00:59:31,420 --> 00:59:33,410 all possible situations. 1106 00:59:33,410 --> 00:59:35,092 It's kind of annoying. 1107 00:59:35,092 --> 00:59:37,300 So instead of using lookup tables-- this was actually 1108 00:59:37,300 --> 00:59:40,960 an idea from some students in this class last time 1109 00:59:40,960 --> 00:59:43,750 I taught this material-- they said, oh, well, instead 1110 00:59:43,750 --> 00:59:47,180 of using a lookup table, you can use ladder decomposition. 1111 00:59:47,180 --> 00:59:49,960 So down here, in the compressed tree, 1112 00:59:49,960 --> 00:59:52,240 we have log n different nodes. 1113 00:59:52,240 --> 00:59:55,090 If you use ladder decomposition on that thing-- but not 1114 00:59:55,090 --> 00:59:56,030 the hybrid structure. 1115 00:59:56,030 --> 00:59:58,360 Remember, we used jump pointers plus ladders.
1116 00:59:58,360 --> 00:59:59,890 Jump pointers still work here, just 1117 00:59:59,890 --> 01:00:03,160 you have to round them to a different place. 1118 01:00:03,160 --> 01:00:04,750 Down here, I'm not going to try to do 1119 01:00:04,750 --> 01:00:06,010 jump pointers plus ladders. 1120 01:00:06,010 --> 01:00:07,210 I'll just do ladders. 1121 01:00:07,210 --> 01:00:10,120 And remember, just ladders gave us a log n query time. 1122 01:00:10,120 --> 01:00:18,300 But now n is log T. And so we get log log T query time. 1123 01:00:18,300 --> 01:00:20,050 And that's, basically, all you have to do. 1124 01:00:22,960 --> 01:00:24,960 So you're always jumping to the top of a ladder. 1125 01:00:24,960 --> 01:00:27,262 You'll only have to traverse log log T ladders. 1126 01:00:27,262 --> 01:00:28,720 The very last ladder you might have 1127 01:00:28,720 --> 01:00:32,500 to do a predecessor query that will cost you log log log T. 1128 01:00:32,500 --> 01:00:35,050 But overall, it will be log log T time just 1129 01:00:35,050 --> 01:00:39,730 by this kind of tweak to our level ancestor data structure. 1130 01:00:39,730 --> 01:00:43,120 So I thought that was kind of a fun connection. 1131 01:00:43,120 --> 01:00:46,450 This is the reason, essentially, level ancestors were developed. 1132 01:00:46,450 --> 01:00:48,460 And people use them because you can 1133 01:00:48,460 --> 01:00:51,800 do these kinds of things in nearly constant time, 1134 01:00:51,800 --> 01:00:54,530 even if the substring is huge. 1135 01:00:54,530 --> 01:00:57,760 So maybe I know ahead of time all the queries 1136 01:00:57,760 --> 01:00:59,860 I might want to do. 1137 01:00:59,860 --> 01:01:03,310 I just throw them into the text, just add them in there. 1138 01:01:03,310 --> 01:01:05,770 Then I've cut these substrings, they're now 1139 01:01:05,770 --> 01:01:07,200 represented in the suffix tree. 
1140 01:01:07,200 --> 01:01:10,480 Now any substring I want to query in log log n time, 1141 01:01:10,480 --> 01:01:13,960 I can find all the occurrences of that string, 1142 01:01:13,960 --> 01:01:16,670 even if the substring is huge. 1143 01:01:16,670 --> 01:01:19,060 So if you know what queries you want, 1144 01:01:19,060 --> 01:01:20,980 you can preprocess them and solve them 1145 01:01:20,980 --> 01:01:24,430 even faster than order P time. 1146 01:01:24,430 --> 01:01:25,900 Cool. 1147 01:01:25,900 --> 01:01:32,480 Another thing you can do is represent multiple documents. 1148 01:01:32,480 --> 01:01:35,580 And that's what I was sort of getting at there. 1149 01:01:35,580 --> 01:01:37,270 If you have multiple documents-- say, 1150 01:01:37,270 --> 01:01:39,670 you're storing the entire web or Wikipedia. 1151 01:01:39,670 --> 01:01:41,560 Like there's multiple pages. 1152 01:01:41,560 --> 01:01:43,480 You want to separate them. 1153 01:01:43,480 --> 01:01:47,860 All you need to do is say, OK, I'll take my first string 1154 01:01:47,860 --> 01:01:49,990 and then put a special $ sign after it. 1155 01:01:49,990 --> 01:01:52,980 Then take my second string, put a special $ sign after it. 1156 01:01:52,980 --> 01:01:56,860 And take my k-th string and put a special $ sign after it. 1157 01:01:56,860 --> 01:01:59,710 Just concatenate them with different $ signs in between 1158 01:01:59,710 --> 01:02:00,460 them. 1159 01:02:00,460 --> 01:02:03,885 Then build the suffix tree on this thing which I'll call T 1160 01:02:03,885 --> 01:02:06,010 So you can use the same suffix tree data structure, 1161 01:02:06,010 --> 01:02:08,140 but now, in some sense, you're representing 1162 01:02:08,140 --> 01:02:11,964 all of these documents and all the ways they interweave. 1163 01:02:11,964 --> 01:02:13,630 Because there are some shared substrings 1164 01:02:13,630 --> 01:02:15,838 here that are shared by this, and this, and whatever. 
1165 01:02:15,838 --> 01:02:18,070 And those will be represented in the same structure. 1166 01:02:18,070 --> 01:02:20,050 Or I can do a search and then I've effectively 1167 01:02:20,050 --> 01:02:23,500 found all the documents that contain it. 1168 01:02:23,500 --> 01:02:25,570 One issue, though. 1169 01:02:25,570 --> 01:02:27,820 Suppose, I want to find all the documents containing 1170 01:02:27,820 --> 01:02:31,390 the word MIT or something. 1171 01:02:31,390 --> 01:02:34,927 Maybe all k of them match, maybe one document matches, 1172 01:02:34,927 --> 01:02:36,010 maybe two documents match. 1173 01:02:36,010 --> 01:02:37,930 Suppose, two documents match. 1174 01:02:37,930 --> 01:02:40,990 The first document mentions MIT a billion times. 1175 01:02:40,990 --> 01:02:46,330 The second document has MIT in it once. 1176 01:02:46,330 --> 01:02:47,980 Then suffix trees are kind of annoying 1177 01:02:47,980 --> 01:02:50,980 because they will find that billion and one matches 1178 01:02:50,980 --> 01:02:51,907 as a subtree. 1179 01:02:51,907 --> 01:02:54,490 But if I just want to know the answer, oh, these two documents 1180 01:02:54,490 --> 01:02:57,070 match, I'd like to do that in order 2 time, 1181 01:02:57,070 --> 01:03:02,230 not order billion time, to use technical terms. 1182 01:03:02,230 --> 01:03:08,080 And that is called the document retrieval problem or a document 1183 01:03:08,080 --> 01:03:09,830 retrieval data structure. 1184 01:03:09,830 --> 01:03:14,320 This is a problem considered by Muthukrishnan in 2002. 1185 01:03:14,320 --> 01:03:22,510 Document retrieval you can do in order P plus number 1186 01:03:22,510 --> 01:03:26,150 of documents matching. 1187 01:03:26,150 --> 01:03:30,449 So if I want to list all the documents that match, 1188 01:03:30,449 --> 01:03:32,740 I could do it in order the number of documents that match, 1189 01:03:32,740 --> 01:03:37,270 not order the number of occurrences of the string.
1190 01:03:37,270 --> 01:03:39,760 So I still got to do the P search in the beginning, 1191 01:03:39,760 --> 01:03:41,920 and then this is better. 1192 01:03:41,920 --> 01:03:45,340 And the funny thing is the solution to this data structure 1193 01:03:45,340 --> 01:03:49,717 uses RMQ, range minimum queries, from last lecture. 1194 01:03:49,717 --> 01:03:51,050 So let me tell you how it works. 1195 01:03:51,050 --> 01:03:52,133 It's actually very simple. 1196 01:03:56,730 --> 01:04:01,460 And then I think we'll move on to how to build a suffix tree. 1197 01:04:01,460 --> 01:04:03,500 So document retrieval. 1198 01:04:08,220 --> 01:04:09,470 Here's what we're going to do. 1199 01:04:25,040 --> 01:04:27,530 Remember, these different $ signs i represent different 1200 01:04:27,530 --> 01:04:30,230 documents. 1201 01:04:30,230 --> 01:04:32,450 I want to remember which suffixes 1202 01:04:32,450 --> 01:04:35,060 came from the same document. 1203 01:04:35,060 --> 01:04:40,790 So at every $ sign i, I want to store the number 1204 01:04:40,790 --> 01:04:44,990 of the previous $ sign i. 1205 01:04:44,990 --> 01:04:48,260 Let's suppose, the suffixes, when they get to one of the $ 1206 01:04:48,260 --> 01:04:51,490 signs, I can just stop, I don't have to store the rest, 1207 01:04:51,490 --> 01:04:52,490 I'm going to throw away. 1208 01:04:52,490 --> 01:04:55,490 Whenever I hit a $ sign, I will stop the suffix tree. 1209 01:04:55,490 --> 01:04:57,410 That way, the $ signs really are leaves, 1210 01:04:57,410 --> 01:04:59,741 all of them now become leaves. 1211 01:04:59,741 --> 01:05:01,490 So I don't really care about a suffix that 1212 01:05:01,490 --> 01:05:02,480 goes all the way through here. 1213 01:05:02,480 --> 01:05:04,040 I just want the suffix to the $ sign, 1214 01:05:04,040 --> 01:05:06,960 as it represents the individual documents. 1215 01:05:06,960 --> 01:05:08,810 So $ sign i's are leaves. 
1216 01:05:08,810 --> 01:05:11,600 And I want each of them just to store a pointer, basically, 1217 01:05:11,600 --> 01:05:14,930 to the previous one of the same type, the same $ sign i. 1218 01:05:14,930 --> 01:05:16,370 It came from the same document. 1219 01:05:22,860 --> 01:05:26,990 Now, here's the idea. 1220 01:05:26,990 --> 01:05:30,470 I did a search, I got down to a node, 1221 01:05:30,470 --> 01:05:32,570 and now there's this big subtree here. 1222 01:05:32,570 --> 01:05:36,400 And this subtree has a bunch of leaves in it, 1223 01:05:36,400 --> 01:05:40,460 those represent all the occurrences of the pattern P. 1224 01:05:40,460 --> 01:05:43,120 And let's suppose that those leaves are numbered. 1225 01:05:43,120 --> 01:05:48,620 I'm numbering the leaves from 1 to n, I guess. 1226 01:05:48,620 --> 01:05:51,440 Then in here, the leaves will be an interval-- 1227 01:05:51,440 --> 01:05:54,710 interval l, comma, n. 1228 01:05:54,710 --> 01:05:57,560 And the trouble is a lot of these have the same label $ 1229 01:05:57,560 --> 01:05:58,640 sign i. 1230 01:05:58,640 --> 01:06:01,370 And I just want to find the unique ones. 1231 01:06:01,370 --> 01:06:02,120 How do I do that? 1232 01:06:07,760 --> 01:06:15,560 What we do is find the first occurrence of $ sign i for each 1233 01:06:15,560 --> 01:06:17,090 i. 1234 01:06:17,090 --> 01:06:19,560 If I could just find the first occurrence of $ sign i for each 1235 01:06:19,560 --> 01:06:20,060 i, 1236 01:06:20,060 --> 01:06:24,370 I'd then only have to pay order number of distinct documents, 1237 01:06:24,370 --> 01:06:27,170 rather than having to pay for every match within the document. 1238 01:06:27,170 --> 01:06:30,860 Now, one way to define the first $ sign i is-- 1239 01:06:30,860 --> 01:06:35,690 that's a $ sign i whose stored value-- 1240 01:06:35,690 --> 01:06:38,620 we said we store the leaf number of the previous $ sign i-- 1241 01:06:38,620 --> 01:06:45,301 whose stored value is less than l.
1242 01:06:45,301 --> 01:06:46,550 So we find some position here. 1243 01:06:46,550 --> 01:06:48,300 If the previous guy is less than l, 1244 01:06:48,300 --> 01:06:51,950 that means it was the first of that type. 1245 01:06:51,950 --> 01:06:56,330 If we store this, that's the definition of being first. 1246 01:06:56,330 --> 01:07:01,610 So in this interval, I want to find $ sign i's that have very 1247 01:07:01,610 --> 01:07:04,070 small stored values. 1248 01:07:04,070 --> 01:07:06,050 How would I find the very best one? 1249 01:07:06,050 --> 01:07:07,430 Range minimum query. 1250 01:07:07,430 --> 01:07:12,560 So we do a range minimum query on l, comma, n. 1251 01:07:12,560 --> 01:07:15,200 If there's any firsts in there, this will find it. 1252 01:07:18,580 --> 01:07:24,470 Find, let's say, a position m with the smallest 1253 01:07:24,470 --> 01:07:25,790 possible stored value. 1254 01:07:37,430 --> 01:07:43,080 If the stored number is less than l, 1255 01:07:43,080 --> 01:07:44,570 then output that answer. 1256 01:07:48,480 --> 01:07:53,890 And then recurse on the remaining intervals. 1257 01:07:53,890 --> 01:08:01,860 So there's going to be from l to m minus 1 and m plus 1 to n. 1258 01:08:01,860 --> 01:08:05,284 So we find the best candidate, the minimum. 1259 01:08:05,284 --> 01:08:06,450 That's the minimum stored value. 1260 01:08:06,450 --> 01:08:09,210 If anything is going to be less than l, that would be it. 1261 01:08:09,210 --> 01:08:12,459 If it is less than l, we output it, then we recurse over here 1262 01:08:12,459 --> 01:08:13,500 and we recurse over here. 1263 01:08:13,500 --> 01:08:15,750 At some point this will stop finding things. 1264 01:08:15,750 --> 01:08:17,910 We're going to do another RMQ over here. 1265 01:08:17,910 --> 01:08:21,189 Might not find anything, then we just stop that recursion.
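The recursion just described can be sketched in Python. Several things here are assumptions made for the illustration: leaves are numbered from 0 rather than 1, a leaf with no earlier leaf from the same document stores -1, the query interval is written `[l, r]`, and the range minimum query is a naive scan standing in for the constant-time RMQ structure from the previous lecture. The names `build_prev` and `distinct_documents` are made up for this sketch.

```python
def build_prev(doc_ids):
    """For each leaf (in suffix-tree leaf order), store the position of the
    previous leaf from the same document, or -1 if there is none."""
    last_seen = {}
    prev = []
    for pos, doc in enumerate(doc_ids):
        prev.append(last_seen.get(doc, -1))
        last_seen[doc] = pos
    return prev


def distinct_documents(doc_ids, prev, l, r):
    """Report each document that labels some leaf in doc_ids[l..r], once each."""
    found = []
    stack = [(l, r)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Naive RMQ: position of the smallest stored value in [lo, hi].
        m = min(range(lo, hi + 1), key=prev.__getitem__)
        if prev[m] < l:
            # The previous leaf of this document lies outside [l, r],
            # so m is the first leaf of its document inside the interval.
            found.append(doc_ids[m])
            stack.append((lo, m - 1))
            stack.append((m + 1, hi))
        # Otherwise no leaf in [lo, hi] is a first occurrence: stop here.
    return found


doc_ids = [0, 1, 0, 0, 2, 1, 0]   # document of each leaf, in leaf order
prev = build_prev(doc_ids)        # [-1, -1, 0, 2, -1, 1, 3]
print(sorted(distinct_documents(doc_ids, prev, 2, 6)))  # [0, 1, 2]
```

With a constant-time RMQ in place of the scan, each successful query reports a new document and each unsuccessful one ends a branch of the recursion.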
1266 01:08:21,189 --> 01:08:23,151 But the number of recursions we have to do 1267 01:08:23,151 --> 01:08:25,109 is going to be equal to the number of documents 1268 01:08:25,109 --> 01:08:27,810 that match, maybe plus 1. 1269 01:08:27,810 --> 01:08:30,660 So we achieved this bound using RMQ because RMQ 1270 01:08:30,660 --> 01:08:33,689 we can do in constant time with appropriate pre-processing. 1271 01:08:33,689 --> 01:08:35,770 Now, the RMQ is over an array. 1272 01:08:35,770 --> 01:08:40,590 It's over this array of stored values indexed by leaves. 1273 01:08:40,590 --> 01:08:42,330 And this idea of taking the leaves 1274 01:08:42,330 --> 01:08:46,290 and writing them down in order is actually something we need. 1275 01:08:46,290 --> 01:08:48,180 It's called a suffix array. 1276 01:08:56,970 --> 01:08:59,340 We're going to use this alternate representation 1277 01:08:59,340 --> 01:09:02,640 of suffix trees in order to compute them. 1278 01:09:02,640 --> 01:09:04,899 Suffix arrays in some sense are easier to think about. 1279 01:09:16,410 --> 01:09:18,750 The idea with the suffix array is 1280 01:09:18,750 --> 01:09:21,540 to write down all the suffixes, sort them. 1281 01:09:25,979 --> 01:09:27,090 This is conceptual. 1282 01:09:27,090 --> 01:09:28,710 Imagine you take all these suffixes. 1283 01:09:28,710 --> 01:09:30,370 Their total size is quadratic in T 1284 01:09:30,370 --> 01:09:32,142 so you'd never actually want to do this. 1285 01:09:32,142 --> 01:09:33,600 But just imagine writing them down, 1286 01:09:33,600 --> 01:09:37,560 sorting them lexically using our string sorting algorithms. 1287 01:09:37,560 --> 01:09:40,560 And then we can't represent them explicitly 1288 01:09:40,560 --> 01:09:41,850 because it would be too big. 1289 01:09:41,850 --> 01:09:48,254 Just write down their index, just store the indices. 1290 01:09:51,729 --> 01:09:52,770 Let's do this for banana. 1291 01:09:55,350 --> 01:09:58,545 Banana's over here. 
1292 01:09:58,545 --> 01:10:00,060 It'll make my life a little harder. 1293 01:10:17,090 --> 01:10:19,640 Actually, they're already here in sorted order. 1294 01:10:19,640 --> 01:10:23,720 If dollar sign, I'm supposing, is first, first suffix is $, 1295 01:10:23,720 --> 01:10:28,460 then a-$, then a-n-a-$, then a-n-a-n-a-$, then banana, 1296 01:10:28,460 --> 01:10:31,880 then n-a-$, then n-a-n-a-$. 1297 01:10:31,880 --> 01:10:33,980 I'll just write that down over here. 1298 01:10:33,980 --> 01:10:56,580 $, a-$, a-n-a-$, a-n-a-n-a-$, then banana, then n-a-$, 1299 01:10:56,580 --> 01:10:59,420 then n-a-n-a-$. 1300 01:10:59,420 --> 01:11:02,220 If you look at these, they're indeed in sorted order-- $, 1301 01:11:02,220 --> 01:11:04,330 a's, b's, n's. 1302 01:11:04,330 --> 01:11:06,740 Everything is sorted here lexically. 1303 01:11:06,740 --> 01:11:09,160 Now, I can't store this because it's quadratic size. 1304 01:11:09,160 --> 01:11:11,740 Instead, I just write down the numbers that are down there. 1305 01:11:11,740 --> 01:11:14,620 This was the sixth suffix, it was starting at position 6. 1306 01:11:14,620 --> 01:11:21,370 Then 5, then 3, then 1, then 0-- 1307 01:11:21,370 --> 01:11:27,010 that's everything-- then 4, then 2. 1308 01:11:27,010 --> 01:11:31,810 This thing is the suffix array. 1309 01:11:31,810 --> 01:11:33,550 It also has linear size. 1310 01:11:33,550 --> 01:11:37,570 It's just a permutation on the suffix labels, suffix indices. 1311 01:11:46,115 --> 01:11:48,699 I still want to tell you about it. 1312 01:11:48,699 --> 01:11:50,240 There's some other information that's 1313 01:11:50,240 --> 01:11:55,630 helpful to write down about the suffix array. 1314 01:11:55,630 --> 01:11:59,600 It's called longest common prefix information, LCP. 1315 01:11:59,600 --> 01:12:03,580 The idea is to look at adjacent elements in the suffix array. 1316 01:12:03,580 --> 01:12:06,080 In some sense, this represents the same information, right? 
1317 01:12:06,080 --> 01:12:08,360 Our whole goal is to sort the suffixes. 1318 01:12:08,360 --> 01:12:12,200 If we could do this, then, as we'll see, 1319 01:12:12,200 --> 01:12:13,430 we can also build this. 1320 01:12:13,430 --> 01:12:15,472 And this is sort of what we really want. 1321 01:12:15,472 --> 01:12:17,180 The suffix array by itself is pretty good 1322 01:12:17,180 --> 01:12:19,130 if you add in LCP information. 1323 01:12:19,130 --> 01:12:21,500 LCP is-- what is the longest common prefix of these two 1324 01:12:21,500 --> 01:12:22,000 suffixes? 1325 01:12:22,000 --> 01:12:23,960 In this case, 0. 1326 01:12:23,960 --> 01:12:26,090 In this case, one letter. 1327 01:12:26,090 --> 01:12:30,770 In this case, three letters match. 1328 01:12:30,770 --> 01:12:33,440 So here the value is 3. 1329 01:12:33,440 --> 01:12:36,320 And the next one, zero letters match. 1330 01:12:36,320 --> 01:12:39,660 Next one, zero letters match. 1331 01:12:39,660 --> 01:12:45,380 Next one, two letters match. 1332 01:12:45,380 --> 01:12:47,720 So this is another array you could store here-- 1333 01:12:47,720 --> 01:12:49,805 0, 1, 3, 0, 0, 2. 1334 01:12:49,805 --> 01:12:51,980 AUDIENCE: Longest common prefix? 1335 01:12:51,980 --> 01:12:56,270 ERIK DEMAINE: Longest common prefix of the suffixes. 1336 01:12:56,270 --> 01:12:58,200 Because each of these is a suffix but here 1337 01:12:58,200 --> 01:13:01,400 we're interested in how long they match for. 1338 01:13:01,400 --> 01:13:05,360 I claim if you have this suffix array and this LCP information, 1339 01:13:05,360 --> 01:13:08,570 you can build this structure. 1340 01:13:08,570 --> 01:13:12,680 Anyone wants to tell me how to build this using this? 1341 01:13:12,680 --> 01:13:17,930 It's a one word or two word answer that we saw, I think, 1342 01:13:17,930 --> 01:13:19,480 last class. 1343 01:13:19,480 --> 01:13:21,860 But we saw a lot of things last class, so it's maybe not 1344 01:13:21,860 --> 01:13:22,360 obvious. 
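The banana example above can be reproduced with a short Python sketch. It is deliberately naive, sorting the suffixes by direct string comparison, which is exactly the quadratic approach the lecture says you would not use in practice. It assumes `$` sorts before every character of the text, which holds for lowercase letters, and the function names are made up for this illustration.

```python
def suffix_array(text):
    """Start indices of the suffixes of text + '$', in sorted order (naive)."""
    s = text + "$"
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(text, sa):
    """lcp[k] = length of the longest common prefix of the suffixes
    at positions k and k + 1 of the suffix array."""
    s = text + "$"

    def lcp(a, b):
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        return n

    return [lcp(sa[k], sa[k + 1]) for k in range(len(sa) - 1)]


sa = suffix_array("banana")
print(sa)                       # [6, 5, 3, 1, 0, 4, 2]
print(lcp_array("banana", sa))  # [0, 1, 3, 0, 0, 2]
```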
1345 01:13:30,530 --> 01:13:32,730 Magic words are Cartesian tree. 1346 01:13:35,570 --> 01:13:41,970 Cartesian tree was how we converted RMQ into LCA, 1347 01:13:41,970 --> 01:13:42,630 I think. 1348 01:13:42,630 --> 01:13:43,250 Yeah? 1349 01:13:43,250 --> 01:13:45,890 Which was you take the minimum value in the array, 1350 01:13:45,890 --> 01:13:50,480 make that the root, and then recurse on the two sides. 1351 01:13:50,480 --> 01:13:54,110 So a Cartesian tree of the LCP array, 1352 01:13:54,110 --> 01:13:57,800 basically, gives you this transformation. 1353 01:13:57,800 --> 01:13:59,930 The minimum values here are the 0's. 1354 01:13:59,930 --> 01:14:02,990 Now, before we just broke ties, we picked an arbitrary 0, 1355 01:14:02,990 --> 01:14:04,050 put it at the root. 1356 01:14:04,050 --> 01:14:07,500 Now I want to take all the 0's, put them at the root. 1357 01:14:07,500 --> 01:14:13,130 If I do that, I get three 0's at the root 1358 01:14:13,130 --> 01:14:16,170 and then I have everything in between. 1359 01:14:16,170 --> 01:14:19,430 So there's nothing left of the first 0. 1360 01:14:19,430 --> 01:14:21,740 Then next one, there's these guys 1361 01:14:21,740 --> 01:14:24,320 and the mins are going to be 1 and then 3. 1362 01:14:24,320 --> 01:14:27,800 So here I'm going to get a 1 when I recurse and then 3. 1363 01:14:34,910 --> 01:14:36,740 There's nothing in between these 0's. 1364 01:14:36,740 --> 01:14:39,380 And after the last 0, there's a 2. 1365 01:14:39,380 --> 01:14:41,840 So this would be the Cartesian tree, a slightly different 1366 01:14:41,840 --> 01:14:43,760 version where we don't break ties, 1367 01:14:43,760 --> 01:14:45,800 we take all the mins simultaneously, 1368 01:14:45,800 --> 01:14:47,670 put them at the root. 1369 01:14:47,670 --> 01:14:50,300 Now, does that look like this thing? 1370 01:14:50,300 --> 01:14:50,877 Yeah. 1371 01:14:50,877 --> 01:14:52,085 Everything except the leaves. 
1372 01:14:52,085 --> 01:14:53,960 [INAUDIBLE] are missing at the leaves. 1373 01:14:53,960 --> 01:14:57,170 The leaves are represented by these values. 1374 01:14:57,170 --> 01:15:00,410 Just visit them in order here, do an in-order traversal 1375 01:15:00,410 --> 01:15:02,270 of the missing pointers here. 1376 01:15:02,270 --> 01:15:12,260 We're going to get 6, and then 5, and then 3 and then 1, 1377 01:15:12,260 --> 01:15:19,609 and then 0, and then 4, and then 2. 1378 01:15:19,609 --> 01:15:21,900 Now, the meaning of these values is slightly different. 1379 01:15:21,900 --> 01:15:24,740 Maybe I should circle them in red. 1380 01:15:24,740 --> 01:15:27,570 These leaves are just like these leaves. 1381 01:15:27,570 --> 01:15:30,389 They're exactly the labels we wrote down in the same order. 1382 01:15:30,389 --> 01:15:31,930 These numbers are slightly different. 1383 01:15:31,930 --> 01:15:33,569 What they represent are letter depths. 1384 01:15:33,569 --> 01:15:36,110 The letter depth of this node is 0, letter depth of this node 1385 01:15:36,110 --> 01:15:37,670 is 1, letter depth of this node is 3. 1386 01:15:37,670 --> 01:15:38,753 That's what I wrote here-- 1387 01:15:38,753 --> 01:15:40,804 1, 3, 2. 1388 01:15:40,804 --> 01:15:42,240 This one says, 2. 1389 01:15:42,240 --> 01:15:44,000 These LCPs are exactly the letter depth. 1390 01:15:44,000 --> 01:15:45,967 That's how far down the tree you are. 1391 01:15:45,967 --> 01:15:48,050 Once you have this structure and the letter depth, 1392 01:15:48,050 --> 01:15:50,265 you can very easily put in these labels. 1393 01:15:50,265 --> 01:15:51,410 I won't say how to do that. 1394 01:15:51,410 --> 01:15:53,210 But in linear time, if I could build 1395 01:15:53,210 --> 01:15:59,030 the suffix array plus the LCPs, I could build the suffix tree. 1396 01:15:59,030 --> 01:16:02,081 So our real goal is to build this information, these two 1397 01:16:02,081 --> 01:16:02,580 arrays.
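The Cartesian-tree step described above can be sketched in Python as follows. This is only an illustration under some assumptions: the name `suffix_tree_shape` and the nested-tuple representation are invented here, edge labels are left out (as in the lecture), and the repeated `min` makes it quadratic in the worst case rather than linear. It takes all the minimum LCP values at once, puts them in one node, and recurses on the gaps in between; a gap with a single suffix becomes a leaf.

```python
def suffix_tree_shape(sa, lcp, lo=0, hi=None):
    # Works on suffix-array positions lo..hi (inclusive).
    # lcp[r] is the LCP of the suffixes at positions r and r + 1.
    if hi is None:
        hi = len(sa) - 1
    if lo == hi:
        return sa[lo]                      # a leaf, labeled by its suffix index
    depth = min(lcp[lo:hi])                # letter depth of this internal node
    children, start = [], lo
    for r in range(lo, hi):
        if lcp[r] == depth:                # every minimum splits off a child
            children.append(suffix_tree_shape(sa, lcp, start, r))
            start = r + 1
    children.append(suffix_tree_shape(sa, lcp, start, hi))
    return (depth, children)

sa = [6, 5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]
print(suffix_tree_shape(sa, lcp))
# (0, [6, (1, [5, (3, [3, 1])]), 0, (2, [4, 2])])
```

Reading the result: the root has letter depth 0, below it are the nodes with letter depths 1, 3, and 2, and the leaves appear left to right as 6, 5, 3, 1, 0, 4, 2.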
1398 01:16:02,580 --> 01:16:04,300 If we could do it in linear time, 1399 01:16:04,300 --> 01:16:07,470 we'd get a suffix tree in linear time. 1400 01:16:07,470 --> 01:16:10,100 So that is what remains to be done. 1401 01:16:17,379 --> 01:16:18,170 We're going to do-- 1402 01:16:24,565 --> 01:16:27,200 not quite linear time. 1403 01:16:27,200 --> 01:16:30,620 If you want a nicely sorted suffix 1404 01:16:30,620 --> 01:16:33,830 tree where all the children are labeled here-- 1405 01:16:33,830 --> 01:16:36,020 so in particular, if I just had a single node, 1406 01:16:36,020 --> 01:16:39,650 I have to be able to sort the letters in the alphabet. 1407 01:16:39,650 --> 01:16:41,090 However long that takes. 1408 01:16:41,090 --> 01:16:42,560 Maybe it's a small alphabet and you 1409 01:16:42,560 --> 01:16:45,500 can do linear time sorting by radix sort or whatever. 1410 01:16:45,500 --> 01:16:47,260 However long that takes, we do it once. 1411 01:16:47,260 --> 01:16:50,377 Then the rest will be order T time. 1412 01:16:50,377 --> 01:16:51,210 Here's how we do it. 1413 01:16:51,210 --> 01:16:54,217 First step-- sort the alphabet. 1414 01:16:54,217 --> 01:16:56,300 This will turn out to be more interesting than you 1415 01:16:56,300 --> 01:16:57,170 might think. 1416 01:16:57,170 --> 01:16:58,430 I'll come back to it. 1417 01:16:58,430 --> 01:17:01,250 Second step-- replace each letter 1418 01:17:01,250 --> 01:17:03,920 by its index in the sorted order. 1419 01:17:03,920 --> 01:17:07,130 This sounds boring but it will be useful for later. 1420 01:17:15,710 --> 01:17:19,670 Third step-- the big idea. 1421 01:17:19,670 --> 01:17:23,180 This is an algorithm by Karkkainen and Sanders, 1422 01:17:23,180 --> 01:17:25,750 from 2003. 1423 01:17:25,750 --> 01:17:27,950 The problem was first solved in this running time 1424 01:17:27,950 --> 01:17:31,640 by Martin Farach-Colton, our good friend. 1425 01:17:31,640 --> 01:17:33,260 But then it got simplified. 
1426 01:17:33,260 --> 01:17:36,170 So I'll tell you a little bit about that in a moment. 1427 01:17:38,750 --> 01:17:41,270 And there's going to be a lot of writing here. 1428 01:17:52,880 --> 01:17:55,410 The idea here is we're going to take the 3i-th letter, 1429 01:17:55,410 --> 01:17:57,440 3i plus first, 3i plus second letter, 1430 01:17:57,440 --> 01:18:00,020 concatenate them into a single triple letter-- 1431 01:18:00,020 --> 01:18:01,800 think of it as a single letter. 1432 01:18:01,800 --> 01:18:03,402 And then just do that for all i. 1433 01:18:03,402 --> 01:18:05,110 So it's like I take these guys, make them 1434 01:18:05,110 --> 01:18:07,070 one letter, these guys, make them one letter. 1435 01:18:07,070 --> 01:18:10,100 Now, I could start at 0, or I could start at 1, 1436 01:18:10,100 --> 01:18:12,230 or I could start at 2. 1437 01:18:12,230 --> 01:18:14,260 Do them all. 1438 01:18:14,260 --> 01:18:22,910 So this is going to be 3i plus 1, 3i plus 2, 3i plus 3. 1439 01:18:22,910 --> 01:18:32,390 And this one is going to be 3i plus 2, 3i plus 3, 3i plus 4. 1440 01:18:32,390 --> 01:18:33,860 We're going to do this to recurse. 1441 01:18:33,860 --> 01:18:36,320 But the point is, if I want to represent 1442 01:18:36,320 --> 01:18:38,750 all the suffixes of T, suffix could 1443 01:18:38,750 --> 01:18:41,600 start at a position 0 mod 3, or position 1 mod 3, 1444 01:18:41,600 --> 01:18:43,800 or position 2 mod 3. 1445 01:18:43,800 --> 01:18:46,250 So if I could sort all the suffixes of these guys, 1446 01:18:46,250 --> 01:18:49,120 I would effectively sort all the suffixes of the original T. 1447 01:18:49,120 --> 01:18:51,800 This tripling up doesn't really change things, 1448 01:18:51,800 --> 01:18:53,880 up to like plus 1 or 2. 1449 01:19:03,500 --> 01:19:05,180 Next, I believe, is recursion. 1450 01:19:13,726 --> 01:19:18,320 I'm going to take T0 and T1 and concatenate them. 1451 01:19:18,320 --> 01:19:21,320 This thing has size 2/3 n.
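The first three steps (sort the alphabet, replace letters by their index, form the triples) might look like this in Python. The helper names `rank_letters` and `triples` are made up for the sketch. One deliberate difference from the board: ranks start at 1 here so that 0 can be used as padding past the end of the text, which is an assumption of this sketch rather than something stated in the lecture.

```python
def rank_letters(text):
    # Steps 1 and 2: sort the alphabet, then replace each letter by its rank.
    alphabet = sorted(set(text))
    index = {c: r + 1 for r, c in enumerate(alphabet)}  # 0 is kept for padding
    return [index[c] for c in text]

def triples(t, offset):
    # Step 3: group letters offset, offset+1, offset+2, then offset+3, ...
    # Each tuple plays the role of one "triple letter".
    padded = t + [0, 0]
    return [tuple(padded[i:i + 3]) for i in range(offset, len(t), 3)]

t = rank_letters("banana$")
print(t)              # [3, 2, 4, 2, 4, 2, 1]
print(triples(t, 0))  # T0: [(3, 2, 4), (2, 4, 2), (1, 0, 0)]
print(triples(t, 1))  # T1: [(2, 4, 2), (4, 2, 1)]
print(triples(t, 2))  # T2: [(4, 2, 4), (2, 1, 0)]
```

Each of T0, T1, T2 has about a third as many triple letters as the text has letters, which is where the 2/3 n for the concatenation of T0 and T1 comes from.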
1452 01:19:21,320 --> 01:19:24,500 It has number of characters 2/3 n because each of them 1453 01:19:24,500 --> 01:19:26,420 has a third of the number of characters. 1454 01:19:26,420 --> 01:19:28,020 Of course, all the information is still there, 1455 01:19:28,020 --> 01:19:28,978 which is kind of weird. 1456 01:19:28,978 --> 01:19:31,680 But if we treat this as a single character, which 1457 01:19:31,680 --> 01:19:35,435 then has a 1/3 n, we can't afford to recurse on all three. 1458 01:19:35,435 --> 01:19:37,850 We can only afford to recurse on two out of the three 1459 01:19:37,850 --> 01:19:40,850 because then we're going to get a recurrence of the form T of n 1460 01:19:40,850 --> 01:19:46,370 is T of 2/3 n plus order n. 1461 01:19:46,370 --> 01:19:48,682 And this is geometric, so it's order n. 1462 01:19:48,682 --> 01:19:50,390 That's how we're going to get linear time 1463 01:19:50,390 --> 01:19:53,630 after the first sort. 1464 01:19:53,630 --> 01:19:56,000 If this was 3/3 n, then this would be n log n. 1465 01:19:56,000 --> 01:19:58,310 We don't want to do that. 1466 01:19:58,310 --> 01:19:59,832 So that's what I can afford. 1467 01:19:59,832 --> 01:20:01,040 Now I've got to deal with it. 1468 01:20:01,040 --> 01:20:03,020 What this tells me is, the sorted order of all 1469 01:20:03,020 --> 01:20:05,420 the suffixes of T0 and T1, all the suffixes 1470 01:20:05,420 --> 01:20:08,630 starting at positions that are 0 or 1 mod 3. 1471 01:20:12,140 --> 01:20:17,840 Next thing we'd like to do is sort the suffixes of T2. 1472 01:20:17,840 --> 01:20:20,150 We can do that, I claim, by radix sort. 1473 01:20:26,160 --> 01:20:27,190 How do we do that? 1474 01:20:27,190 --> 01:20:30,240 Well, if you look at a suffix T 2i, 1475 01:20:30,240 --> 01:20:37,320 this is the same thing as T from 3i plus 2 onwards. 1476 01:20:37,320 --> 01:20:43,140 Which we can think of as that first character, comma, 1477 01:20:43,140 --> 01:20:44,265 the next character onwards. 
1478 01:20:48,840 --> 01:20:51,150 Sorry, that's the angle bracket. 1479 01:20:51,150 --> 01:20:56,680 And this thing is, basically, T0 of i plus 1 onwards. 1480 01:20:56,680 --> 01:20:58,260 So if I strip off the first letter, 1481 01:20:58,260 --> 01:21:00,000 then I get a suffix that I know about. 1482 01:21:00,000 --> 01:21:02,830 I know the sorted order of all the T0 suffixes. 1483 01:21:02,830 --> 01:21:04,830 So this is really just a-- you can think of this 1484 01:21:04,830 --> 01:21:06,679 as a two character value. 1485 01:21:06,679 --> 01:21:08,220 There's a single character from Sigma 1486 01:21:08,220 --> 01:21:12,660 here, which we've already reduced down to-- 1487 01:21:12,660 --> 01:21:18,210 this is an integer between 0 and Sigma minus 1. 1488 01:21:18,210 --> 01:21:21,150 This thing you can do the same thing with these recursive 1489 01:21:21,150 --> 01:21:22,110 values. 1490 01:21:22,110 --> 01:21:24,210 So you've just got two values. 1491 01:21:24,210 --> 01:21:24,810 Small. 1492 01:21:24,810 --> 01:21:27,540 You can radix sort them in linear time. 1493 01:21:27,540 --> 01:21:30,990 And then we will have sorted T2 suffixes because we already 1494 01:21:30,990 --> 01:21:32,220 knew the order of these guys. 1495 01:21:34,800 --> 01:21:45,780 One more thing, which we have to merge suffixes of T0 and T1 1496 01:21:45,780 --> 01:21:52,470 with suffixes of T2. 1497 01:21:52,470 --> 01:21:55,755 And this is where we use the fact 1498 01:21:55,755 --> 01:21:58,130 that there are three of these things and not two of them. 1499 01:21:58,130 --> 01:22:01,040 This is a weird case where three way divide and conquer works. 1500 01:22:01,040 --> 01:22:02,952 Two way divide and conquer is what 1501 01:22:02,952 --> 01:22:04,160 Farach-Colton did originally. 1502 01:22:04,160 --> 01:22:07,070 It's much more complicated because of this merge step. 1503 01:22:07,070 --> 01:22:09,050 Merge gets painful. 
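Going back to the T2 step for a moment, here is a hedged Python sketch of sorting the suffixes that start at positions 2 mod 3. Each one is keyed by a pair: its first character and the rank of the suffix right after it, which starts at a position 0 mod 3. Two things are stand-ins: the ranks come from a naive sort instead of the recursive call, and Python's `sorted` on pairs replaces the two-digit radix sort. The name `sort_t2_suffixes` is hypothetical.

```python
def sort_t2_suffixes(text, rank):
    n = len(text)

    def key(p):
        # First letter, then the already-known rank of the suffix at p + 1.
        # An empty remainder sorts before everything, hence -1.
        rest = rank[p + 1] if p + 1 < n else -1
        return (text[p], rest)

    return sorted(range(2, n, 3), key=key)

text = "banana$"
# Naive stand-in for the recursive call: ranks of all suffixes.
order = sorted(range(len(text)), key=lambda i: text[i:])
rank = {p: r for r, p in enumerate(order)}
print(sort_t2_suffixes(text, rank))  # [5, 2]
```

Only ranks at positions 0 mod 3 are ever looked up, so ranks taken among the T0 and T1 suffixes alone would work just as well.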
1504 01:22:09,050 --> 01:22:13,490 I claim this merging is easy because merging 1505 01:22:13,490 --> 01:22:16,500 is linear time, provided your comparison is constant time. 1506 01:22:16,500 --> 01:22:21,610 So if I need to compare a T0 suffix with a T2 suffix, 1507 01:22:21,610 --> 01:22:23,840 if I want to do that comparison, I 1508 01:22:23,840 --> 01:22:26,030 strip off the first letter from this one. 1509 01:22:26,030 --> 01:22:29,330 It turns into a T1 suffix, the first character 1510 01:22:29,330 --> 01:22:30,155 plus a T1 suffix. 1511 01:22:30,155 --> 01:22:32,113 If I strip out the first character of this one, 1512 01:22:32,113 --> 01:22:35,810 it turns into the first character and then a T0 suffix. 1513 01:22:35,810 --> 01:22:38,150 And these things I know how to compare because I already 1514 01:22:38,150 --> 01:22:40,970 sorted T0, comma, T1. 1515 01:22:40,970 --> 01:22:47,900 If I need to compare T1 suffix with the T2 suffix, 1516 01:22:47,900 --> 01:22:48,830 how do I do it? 1517 01:22:48,830 --> 01:22:51,440 I strip off the first two letters of this one, 1518 01:22:51,440 --> 01:22:52,850 I get a T0 suffix. 1519 01:22:52,850 --> 01:22:55,340 I strip off the first two letters of this one, 1520 01:22:55,340 --> 01:22:56,829 I get a T1 suffix. 1521 01:22:56,829 --> 01:22:59,120 I can't strip off one letter because this would turn it 1522 01:22:59,120 --> 01:23:00,953 into a T2 and I don't know how to compare T2 1523 01:23:00,953 --> 01:23:02,960 to other things, that's the whole point. 1524 01:23:02,960 --> 01:23:05,420 I guess, it's a T2 versus a T0, if I 1525 01:23:05,420 --> 01:23:07,330 did that, which is this case. 1526 01:23:07,330 --> 01:23:09,200 But here, I strip off two letters, 1527 01:23:09,200 --> 01:23:10,850 I get something I know how to compare. 1528 01:23:10,850 --> 01:23:13,290 This technique does not work if you only have two things. 
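The two comparison cases just described can be sketched as a merge in Python. This is an illustration under assumptions: `merge_suffixes` and its parameter names are invented, `rank01` is assumed to hold the ranks of the suffixes at positions 0 or 1 mod 3, and in the usage lines those ranks come from a naive sort rather than the recursion.

```python
def merge_suffixes(text, rank01, sorted01, sorted2):
    n = len(text)

    def ch(p):
        return text[p] if p < n else ""    # past the end sorts first

    def rk(p):
        return rank01[p] if p < n else -1  # empty suffix sorts first

    def less(a, b):
        # a starts at 0 or 1 mod 3, b starts at 2 mod 3.
        if a % 3 == 0:
            # Strip one letter: a + 1 is 1 mod 3, b + 1 is 0 mod 3.
            return (ch(a), rk(a + 1)) < (ch(b), rk(b + 1))
        # Strip two letters: a + 2 is 0 mod 3, b + 2 is 1 mod 3.
        return (ch(a), ch(a + 1), rk(a + 2)) < (ch(b), ch(b + 1), rk(b + 2))

    merged, i, j = [], 0, 0
    while i < len(sorted01) and j < len(sorted2):
        if less(sorted01[i], sorted2[j]):
            merged.append(sorted01[i])
            i += 1
        else:
            merged.append(sorted2[j])
            j += 1
    return merged + sorted01[i:] + sorted2[j:]

text = "banana$"
sorted01 = [6, 3, 1, 0, 4]                 # suffixes at 0 or 1 mod 3, in sorted order
rank01 = {p: r for r, p in enumerate(sorted01)}
print(merge_suffixes(text, rank01, sorted01, [5, 2]))  # [6, 5, 3, 1, 0, 4, 2]
```

Every comparison looks at one or two letters plus one rank, so each is constant time and the whole merge is linear.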
1529 01:23:13,290 --> 01:23:15,540 It only works if you have three things because there are 1530 01:23:15,540 --> 01:23:17,700 sort of these situations. 1531 01:23:17,700 --> 01:23:18,950 So constant time. 1532 01:23:18,950 --> 01:23:21,620 By comparing these little tuples, the first character 1533 01:23:21,620 --> 01:23:26,000 or two plus the remaining suffix, I can do the comparator 1534 01:23:26,000 --> 01:23:27,980 and merge. 1535 01:23:27,980 --> 01:23:31,010 And then if I can do that, everything is linear time. 1536 01:23:31,010 --> 01:23:33,710 The only interesting thing is how do I sort the alphabet 1537 01:23:33,710 --> 01:23:34,970 when I recurse? 1538 01:23:34,970 --> 01:23:38,690 And for that, you use radix sort. 1539 01:23:38,690 --> 01:23:44,500 So the first time, you pay sort of Sigma. 1540 01:23:44,500 --> 01:23:46,250 We don't know how long that takes, depends 1541 01:23:46,250 --> 01:23:47,330 on your alphabet. 1542 01:23:47,330 --> 01:23:49,359 But every following recursion it's a radix 1543 01:23:49,359 --> 01:23:51,650 sort because you have a triple of values, each of which 1544 01:23:51,650 --> 01:23:52,990 is small. 1545 01:23:52,990 --> 01:23:54,650 And so you can do it in linear time. 1546 01:23:54,650 --> 01:23:57,860 Because there's only three digits to the thing 1547 01:23:57,860 --> 01:23:59,000 you're sorting. 1548 01:23:59,000 --> 01:24:01,880 So overall, this is a recursive algorithm. 1549 01:24:01,880 --> 01:24:05,000 It gives you linear time because you're 1550 01:24:05,000 --> 01:24:07,910 making one recursive call of 2/3 the size. 1551 01:24:07,910 --> 01:24:11,840 Pretty clever and simple. 1552 01:24:11,840 --> 01:24:14,330 And that's suffix trees and how you build them. 1553 01:24:14,330 --> 01:24:17,270 First you get suffix arrays, you can do the same thing 1554 01:24:17,270 --> 01:24:20,330 and get LCP information at the same time, 1555 01:24:20,330 --> 01:24:22,220 it's written in the nodes.
1556 01:24:22,220 --> 01:24:23,330 Then you get suffix trees. 1557 01:24:23,330 --> 01:24:25,480 And then you're done.
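To spell out why one recursive call on 2/3 of the input gives linear time, the recurrence can be unrolled into a geometric series. The constant c below stands for the hidden constant in the order-n work per level; it is a symbol introduced for this note, not one from the lecture.

```latex
\begin{align*}
T(n) &= T\!\left(\tfrac{2}{3}n\right) + cn \\
     &\le cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k} = 3cn = O(n).
\end{align*}
```

This bound covers everything after the first alphabet sort, which is paid once on top.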