1
00:00:00,090 --> 00:00:02,490
The following content is
provided under a Creative

2
00:00:02,490 --> 00:00:04,059
Commons license.

3
00:00:04,059 --> 00:00:06,330
Your support will help
MIT OpenCourseWare

4
00:00:06,330 --> 00:00:10,720
continue to offer high quality
educational resources for free.

5
00:00:10,720 --> 00:00:13,320
To make a donation or
view additional materials

6
00:00:13,320 --> 00:00:17,280
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,280 --> 00:00:19,790
at ocw.mit.edu.

8
00:00:19,790 --> 00:00:20,790
ERIK DEMAINE: All right.

9
00:00:20,790 --> 00:00:24,270
Today's lecture's full of
tries and trays, and trees.

10
00:00:24,270 --> 00:00:25,170
Oh, my.

11
00:00:25,170 --> 00:00:30,150
Lots of different synonyms
all coming from trees.

12
00:00:30,150 --> 00:00:31,620
In particular,
we're going to cover

13
00:00:31,620 --> 00:00:34,320
suffix trees today and various
representations of them,

14
00:00:34,320 --> 00:00:36,387
and how to build
them in linear time.

15
00:00:36,387 --> 00:00:37,470
Now, they are good things.

16
00:00:37,470 --> 00:00:39,840
Some of you may have
seen suffix trees before,

17
00:00:39,840 --> 00:00:43,260
but hopefully, haven't actually
seen most of the things

18
00:00:43,260 --> 00:00:46,360
we're going to cover,
except for the very basics.

19
00:00:46,360 --> 00:00:50,580
So the general problem we're
interested in solving today

20
00:00:50,580 --> 00:00:51,480
is string matching.

21
00:00:57,660 --> 00:01:02,610
And in string matching
we are given two strings.

22
00:01:02,610 --> 00:01:07,290
One of them we call the
text T and the other one

23
00:01:07,290 --> 00:01:18,630
we call a pattern P.
These are both strings

24
00:01:18,630 --> 00:01:20,955
over some alphabet.

25
00:01:20,955 --> 00:01:25,950
And the alphabet we're going
to always call capital Sigma.

26
00:01:25,950 --> 00:01:26,947
Think of that.

27
00:01:26,947 --> 00:01:27,780
It could be binary--

28
00:01:27,780 --> 00:01:28,470
0 and 1.

29
00:01:28,470 --> 00:01:31,950
Could be ASCII, so there's
256 characters in there.

30
00:01:31,950 --> 00:01:35,280
Could be Unicode-- pick
your favorite alphabet.

31
00:01:35,280 --> 00:01:40,410
Then it could be ACGT for DNA.

32
00:01:40,410 --> 00:01:43,486
And the goal is to
find the occurrences

33
00:01:43,486 --> 00:01:44,610
of the pattern in the text.

34
00:01:52,960 --> 00:01:55,170
Could be we want to find
some of those occurrences

35
00:01:55,170 --> 00:01:56,860
or all of them, or count them.

36
00:02:09,870 --> 00:02:14,280
And in this lecture, we're
only interested in substring

37
00:02:14,280 --> 00:02:14,930
searches.

38
00:02:14,930 --> 00:02:17,580
So the pattern is just a string.

39
00:02:17,580 --> 00:02:25,000
You want to know all the
places where P occurs.

40
00:02:25,000 --> 00:02:29,910
P might appear multiple times,
even overlapping itself--

41
00:02:29,910 --> 00:02:31,740
in those two
positions, whatever.

42
00:02:31,740 --> 00:02:34,446
You want to find all the shifts
of P where it's identical to T.

43
00:02:34,446 --> 00:02:35,820
Now, there are
lots of variations

44
00:02:35,820 --> 00:02:37,278
on this problem
which we won't look

45
00:02:37,278 --> 00:02:41,700
at in this lecture, such as
when the pattern has wildcards

46
00:02:41,700 --> 00:02:44,152
in it, or you could imagine
it being a regular expression,

47
00:02:44,152 --> 00:02:45,610
or you don't want
to match exactly,

48
00:02:45,610 --> 00:02:48,930
you want to match approximately,
you could have some mismatches,

49
00:02:48,930 --> 00:02:52,170
or it could require
some edits to match

50
00:02:52,170 --> 00:02:54,045
T. We're not going to
look at those problems.

51
00:02:56,920 --> 00:02:59,460
This is both an algorithmic
problem and a data structures

52
00:02:59,460 --> 00:03:00,780
problem.

53
00:03:00,780 --> 00:03:02,730
If I give you this
text and the pattern,

54
00:03:02,730 --> 00:03:04,200
I just want to know the answer.

55
00:03:04,200 --> 00:03:05,880
You can do that in
linear time-- it's

56
00:03:05,880 --> 00:03:12,694
famous Knuth-Morris-Pratt, or
Boyer-Moore, or Rabin-Karp.

57
00:03:12,694 --> 00:03:14,610
Lots of linear time
algorithms for doing that.

58
00:03:14,610 --> 00:03:16,410
We're interested in
the data structure

59
00:03:16,410 --> 00:03:21,070
version of the problem,
static data structure.

60
00:03:21,070 --> 00:03:24,210
So we're given
the text up front,

61
00:03:24,210 --> 00:03:27,660
given T. We want
to preprocess T.

62
00:03:27,660 --> 00:03:34,110
And then the query
consists of the pattern.

63
00:03:34,110 --> 00:03:37,980
Imagine T being very
big, P being not so big.

64
00:03:37,980 --> 00:03:45,400
And we'd like to spend
something like order

65
00:03:45,400 --> 00:03:47,820
P time to do a query.

66
00:03:51,222 --> 00:03:53,430
That would be ideal because
you have to at least look

67
00:03:53,430 --> 00:03:55,170
at the query and
you don't really

68
00:03:55,170 --> 00:03:57,240
want to spend time
looking at the text.

69
00:03:57,240 --> 00:04:03,405
You'd also like something
like order T space.

70
00:04:03,405 --> 00:04:05,280
We don't want the space
of the data structure

71
00:04:05,280 --> 00:04:07,590
to be much bigger than
the original text.

72
00:04:07,590 --> 00:04:10,727
So these are goals which we will
more or less achieve, depending

73
00:04:10,727 --> 00:04:12,060
on exactly the problem you want.

74
00:04:12,060 --> 00:04:13,684
Sometimes we'll
achieve this, sometimes

75
00:04:13,684 --> 00:04:16,019
we'll achieve almost this.

76
00:04:16,019 --> 00:04:21,492
But these are really nice
running times and space.

77
00:04:21,492 --> 00:04:22,200
It's all optimal.

78
00:04:24,710 --> 00:04:26,660
Before we get to
that problem, I want

79
00:04:26,660 --> 00:04:32,400
to solve a simpler problem which
is necessary to solve this one.

80
00:04:32,400 --> 00:04:33,800
We'll call it a warm up.

81
00:04:36,460 --> 00:04:41,860
And that's a good friend--
the predecessor problem,

82
00:04:41,860 --> 00:04:43,220
but now among strings.

83
00:04:48,060 --> 00:04:50,970
Let's say we have k strings--

84
00:04:50,970 --> 00:04:54,410
k texts-- T1 to T k.

85
00:04:54,410 --> 00:04:57,200
And now the query is
you're given some pattern P

86
00:04:57,200 --> 00:04:59,090
and you want to
know where P fits

87
00:04:59,090 --> 00:05:02,070
among these strings
in lexical order.

88
00:05:02,070 --> 00:05:04,160
So a regular predecessor,
but now comparison

89
00:05:04,160 --> 00:05:06,080
is string comparison.

90
00:05:06,080 --> 00:05:08,964
Of course, you could try to
solve that using our existing

91
00:05:08,964 --> 00:05:10,130
predecessor data structures.

92
00:05:10,130 --> 00:05:11,457
But they won't do very well.

93
00:05:11,457 --> 00:05:13,040
Even a binary search
tree is not going

94
00:05:13,040 --> 00:05:15,350
to do well here because
comparing two strings

95
00:05:15,350 --> 00:05:18,350
could take a very long time
if those strings are long.

96
00:05:18,350 --> 00:05:19,980
So we don't want to do that.

97
00:05:19,980 --> 00:05:24,470
Instead, we're going
to build a trie.

98
00:05:24,470 --> 00:05:28,010
Now, tries we've
seen in fast sorting

99
00:05:28,010 --> 00:05:31,540
lecture, when w is at least
log to the 2 plus epsilon of n.

100
00:05:34,110 --> 00:05:36,740
We used tries in a
particular setting there.

101
00:05:36,740 --> 00:05:40,130
We're going to use them in
their more native setting

102
00:05:40,130 --> 00:05:41,810
today a lot.

103
00:05:45,080 --> 00:05:51,710
In this setting-- again,
a trie is a rooted tree.

104
00:05:51,710 --> 00:05:53,495
The children
branches are labeled.

105
00:06:00,000 --> 00:06:01,710
And in this case,
they're labeled

106
00:06:01,710 --> 00:06:04,440
with letters in the alphabet--

107
00:06:04,440 --> 00:06:06,030
Sigma.

108
00:06:06,030 --> 00:06:07,830
So you have a node.

109
00:06:07,830 --> 00:06:13,020
And let's say, we have the
English alphabet-- a, b,

110
00:06:13,020 --> 00:06:14,586
up to z.

111
00:06:14,586 --> 00:06:16,560
Those are your 26
possible children.

112
00:06:16,560 --> 00:06:19,200
Some of them may not exist,
they are null pointers.

113
00:06:19,200 --> 00:06:23,340
Others may point
to actual nodes.

114
00:06:23,340 --> 00:06:25,560
That is a trie in
its native setting,

115
00:06:25,560 --> 00:06:28,750
which is when the alphabet
is something you care about.

116
00:06:28,750 --> 00:06:31,260
Now, when we used tries
before, our alphabet

117
00:06:31,260 --> 00:06:33,240
just represented some
digit in some kind

118
00:06:33,240 --> 00:06:35,900
of arbitrary representation.

119
00:06:35,900 --> 00:06:38,160
The digit was made up of
log to the epsilon bits.

120
00:06:38,160 --> 00:06:39,839
We were just using it as a tool.

121
00:06:39,839 --> 00:06:41,380
But this is where
tries actually come

122
00:06:41,380 --> 00:06:45,660
from-- they come from trying
to retrieve strings out

123
00:06:45,660 --> 00:06:48,840
of some database, in this case.

124
00:06:48,840 --> 00:06:52,320
We're doing predecessor--
this is a practical problem.

125
00:06:52,320 --> 00:06:54,850
Like a lot of library
search engines,

126
00:06:54,850 --> 00:06:56,850
you type in the beginning
of the title of a book

127
00:06:56,850 --> 00:07:01,080
and you want to know what is the
preceding and succeeding book

128
00:07:01,080 --> 00:07:03,690
title of what you query for.

129
00:07:03,690 --> 00:07:06,202
So this is something
people care about.

130
00:07:06,202 --> 00:07:07,910
Although really they
want us-- typically,

131
00:07:07,910 --> 00:07:10,440
we want to solve this
problem because it's harder.

132
00:07:13,020 --> 00:07:15,390
So that's a trie.

133
00:07:15,390 --> 00:07:23,070
Now, to make this actually
work, what we'd like to do

134
00:07:23,070 --> 00:07:25,410
is represent our strings.

135
00:07:25,410 --> 00:07:29,510
So how do we use this structure
to represent strings T1 to T k?

136
00:07:33,090 --> 00:07:35,960
We're going to
represent those strings

137
00:07:35,960 --> 00:07:38,670
in the obvious way, which
we've done many times

138
00:07:38,670 --> 00:07:41,970
in the past when we were doing
integer data structures--

139
00:07:41,970 --> 00:07:43,365
as root to leaf paths.

140
00:07:48,790 --> 00:07:51,370
Because any root to leaf path
is just a sequence of letters,

141
00:07:51,370 --> 00:07:52,203
and that's a string.

142
00:07:52,203 --> 00:07:54,520
So we just throw them in there.

143
00:07:54,520 --> 00:07:59,360
Now, to do that, we need to
change things a little bit.

144
00:07:59,360 --> 00:08:07,390
We're going to add a new letter,
which we usually represent as $

145
00:08:07,390 --> 00:08:10,060
sign, to the end
of every string.

146
00:08:21,550 --> 00:08:24,390
I have an example.

147
00:08:24,390 --> 00:08:33,039
We're going to do four strings--

148
00:08:37,679 --> 00:08:41,590
various spellings
of Anna and Ann.

149
00:08:41,590 --> 00:08:47,170
And say, we'd like to
throw these into a trie.

150
00:08:47,170 --> 00:08:48,490
They all start with a.

151
00:08:48,490 --> 00:08:52,600
So at the root, there's going to
be four branches corresponding

152
00:08:52,600 --> 00:08:55,940
to $ sign, a, e, and n.

153
00:08:55,940 --> 00:08:57,850
I'm supposing my
alphabet is just a,

154
00:08:57,850 --> 00:09:01,150
e, n because that's
all that appears here.

155
00:09:01,150 --> 00:09:04,570
But everything will
be on the a branch.

156
00:09:04,570 --> 00:09:08,420
And then from there
we're going to have--

157
00:09:08,420 --> 00:09:12,710
let's see-- they
all go to n next.

158
00:09:12,710 --> 00:09:16,780
So they all follow this branch.

159
00:09:16,780 --> 00:09:20,950
Then one of them goes to a.

160
00:09:20,950 --> 00:09:23,440
These all go to n afterwards.

161
00:09:23,440 --> 00:09:31,240
So we've got $ sign,
a we use, e, n we use.

162
00:09:31,240 --> 00:09:33,880
And on the a
branch, we are done.

163
00:09:33,880 --> 00:09:38,680
This corresponds to a, n, a.

164
00:09:38,680 --> 00:09:39,310
We're finished.

165
00:09:39,310 --> 00:09:42,670
And we imagine there being $
sign at the end of this string.

166
00:09:42,670 --> 00:09:48,940
So we follow the $ sign child.

167
00:09:48,940 --> 00:09:50,920
The others are blank.

168
00:09:50,920 --> 00:09:55,870
And this leaf here
corresponds to a, n, a.

169
00:09:55,870 --> 00:10:00,714
On the other hand, if
we do a, n, n,

170
00:10:00,714 --> 00:10:01,630
there's three options.

171
00:10:01,630 --> 00:10:03,000
We could be done.

172
00:10:03,000 --> 00:10:05,750
Or there could be an
a or an e to follow.

173
00:10:05,750 --> 00:10:09,310
So if we're done, that would
correspond to the $ sign

174
00:10:09,310 --> 00:10:10,660
pointer.

175
00:10:10,660 --> 00:10:14,530
That's going to be a leaf
corresponding to this string

176
00:10:14,530 --> 00:10:16,840
here.

177
00:10:16,840 --> 00:10:22,465
Or it could be an a
and then we're done.

178
00:10:26,380 --> 00:10:28,920
And then we have a
leaf corresponding

179
00:10:28,920 --> 00:10:32,640
to Anna, a, n, n, a.

180
00:10:32,640 --> 00:10:39,230
Or could be we have an e
next and then we're done.

181
00:10:44,210 --> 00:10:44,970
OK.

182
00:10:44,970 --> 00:10:46,770
Not very exciting
but that is the trie

183
00:10:46,770 --> 00:10:52,260
representation of a, n, a; a,
n, n; a, n, n, a; a, n, n, e.

184
00:10:52,260 --> 00:10:57,930
And you can see there is exactly
one leaf per word down here.

185
00:10:57,930 --> 00:11:00,840
And furthermore, if you
take an in-order traversal

186
00:11:00,840 --> 00:11:03,934
of those leaves, you get
these strings in order.

187
00:11:03,934 --> 00:11:05,850
And typically, if you're
going to store a data

188
00:11:05,850 --> 00:11:08,379
structure like this, you would
store these actual pointers.

189
00:11:08,379 --> 00:11:10,670
So once you get to a leaf,
you know which word matched.
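
[Illustrative sketch, not part of the spoken lecture.] The trie just drawn on the board can be written down roughly as follows. The names `Node`, `build_trie`, and `leaves_in_order` are assumptions made for this example, and a Python dict stands in for the child pointers. Each word gets the $ terminator appended, each leaf stores its word, and visiting children in label order lists the words in sorted order.

```python
END = "$"  # terminator; ord("$") < ord("a"), so it sorts before every letter


class Node:
    def __init__(self):
        self.children = {}  # letter -> Node; missing keys play the role of null pointers
        self.word = None    # set only on leaves: the word this leaf represents


def build_trie(words):
    """Insert every word, with END appended, as a root-to-leaf path."""
    root = Node()
    for word in words:
        node = root
        for ch in word + END:
            node = node.children.setdefault(ch, Node())
        node.word = word
    return root


def leaves_in_order(node):
    """Yield the stored words by visiting children in label order."""
    if node.word is not None:
        yield node.word
    for ch in sorted(node.children):
        yield from leaves_in_order(node.children[ch])


trie = build_trie(["anne", "ana", "anna", "ann"])
print(list(leaves_in_order(trie)))  # ['ana', 'ann', 'anna', 'anne']
```

Because $ sorts before every letter, a word that is a prefix of another (ann versus anna) comes out first, which matches lexical order.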

190
00:11:13,760 --> 00:11:15,600
So that's a trie.

191
00:11:15,600 --> 00:11:18,885
Seems pretty trivial.

192
00:11:18,885 --> 00:11:19,385
Trievial?

193
00:11:23,860 --> 00:11:26,050
But it turns out there's
something already

194
00:11:26,050 --> 00:11:30,580
pretty interesting about
this data structure.

195
00:11:30,580 --> 00:11:32,820
How do you do a
predecessor search?

196
00:11:32,820 --> 00:11:37,600
If I'm searching for something
like, I don't know, a, n, e--

197
00:11:37,600 --> 00:11:39,850
because I made a typo--

198
00:11:39,850 --> 00:11:44,410
then I follow a, n, and then
I follow this e branch here

199
00:11:44,410 --> 00:11:45,490
and discover-- whoops--

200
00:11:45,490 --> 00:11:46,960
there's nothing here.

201
00:11:46,960 --> 00:11:49,590
But right at that node I see,
OK, well, my predecessor is

202
00:11:49,590 --> 00:11:51,610
going to be the max
in this subtree, which

203
00:11:51,610 --> 00:11:53,500
happens to be a, n, a.

204
00:11:53,500 --> 00:11:56,230
My successor is going to be
the min in this subtree, which

205
00:11:56,230 --> 00:11:57,790
happens to be a, n, n.

206
00:11:57,790 --> 00:11:59,170
And so I find what I want.

207
00:11:59,170 --> 00:12:01,010
How long does it
take me to do that?

208
00:12:01,010 --> 00:12:03,970
Well, if I store
subtree mins and maxs,

209
00:12:03,970 --> 00:12:05,890
then I just have to
walk down the tree.

210
00:12:05,890 --> 00:12:09,034
That will take order
P time to walk down.

211
00:12:09,034 --> 00:12:10,450
And then, once I'm
at a node, I've

212
00:12:10,450 --> 00:12:14,740
got to do a predecessor
or successor in the node.
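
[Illustrative sketch, not part of the spoken lecture.] The walk-down search can be sketched like this, assuming every node stores the min and max word of its subtree. The names `min_word`, `max_word`, and `neighbors` are made up for the example. The scan over children inside `neighbors` is a placeholder that costs time proportional to the alphabet; how to do that step quickly inside a node is exactly the question the lecture turns to next.

```python
END = "$"  # terminator, sorts before every lowercase letter


class Node:
    def __init__(self):
        self.children = {}    # letter -> Node
        self.min_word = None  # smallest word stored below this node
        self.max_word = None  # largest word stored below this node


def build_trie(words):
    root = Node()
    # Inserting in sorted order means the first word to reach a node is its
    # subtree min and the last word to reach it is its subtree max.
    for word in sorted(words):
        node = root
        path = [root]
        for ch in word + END:
            node = node.children.setdefault(ch, Node())
            path.append(node)
        for n in path:
            if n.min_word is None:
                n.min_word = word
            n.max_word = word
    return root


def neighbors(root, pattern):
    """Return (predecessor, successor) of pattern among the stored words."""
    pred = succ = None
    node = root
    for ch in pattern + END:
        # Placeholder linear scan; a real node representation replaces this.
        smaller = [c for c in node.children if c < ch]
        larger = [c for c in node.children if c > ch]
        if smaller:  # a deeper candidate shares a longer prefix, so it wins
            pred = node.children[max(smaller)].max_word
        if larger:
            succ = node.children[min(larger)].min_word
        node = node.children.get(ch)
        if node is None:  # fell off the trie: pattern is not stored
            return pred, succ
    return pattern, pattern  # pattern itself is stored


trie = build_trie(["ana", "ann", "anna", "anne"])
print(neighbors(trie, "ane"))  # ('ana', 'ann')
```

The candidates are updated at every node on the way down, so the answer is still right when the node where the search falls off has no smaller or no larger child.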

213
00:12:14,740 --> 00:12:15,740
So there are two issues.

214
00:12:15,740 --> 00:12:19,030
One is, given a node, how do
you know which way to walk down?

215
00:12:19,030 --> 00:12:21,130
And then, when you're
done, how do you

216
00:12:21,130 --> 00:12:22,270
do predecessor in a node?

217
00:12:22,270 --> 00:12:23,630
It's the fundamental question.

218
00:12:23,630 --> 00:12:25,505
Now, this is something
we spent a lot of time

219
00:12:25,505 --> 00:12:26,950
doing in, say, fusion trees.

220
00:12:26,950 --> 00:12:28,160
That was the big challenge.

221
00:12:28,160 --> 00:12:30,310
So this is not
really so trivial--

222
00:12:30,310 --> 00:12:32,560
how do I represent a node?

223
00:12:32,560 --> 00:12:35,470
One way to make it trivial is
to assume that the alphabet is

224
00:12:35,470 --> 00:12:37,145
constant size, like two.

225
00:12:37,145 --> 00:12:38,770
Then, of course,
there's nothing to do.

226
00:12:38,770 --> 00:12:40,370
It's a binary trie.

227
00:12:40,370 --> 00:12:41,890
You look at 0, you look at 1.

228
00:12:41,890 --> 00:12:44,560
You can figure out
anything you need to do

229
00:12:44,560 --> 00:12:45,780
if the alphabet is constant.

230
00:12:45,780 --> 00:12:47,530
But things get interesting
if you imagine,

231
00:12:47,530 --> 00:12:49,150
well, the alphabet
is some parameter,

232
00:12:49,150 --> 00:12:50,600
we don't know how big it is.

233
00:12:50,600 --> 00:12:51,950
It might be substantial.

234
00:12:51,950 --> 00:12:55,460
So let's think about
how you might represent

235
00:12:55,460 --> 00:12:58,390
a trie or the node of a trie.

236
00:13:03,790 --> 00:13:08,320
Let's call this trie
node representation.

237
00:13:11,620 --> 00:13:13,040
Any suggestions?

238
00:13:13,040 --> 00:13:15,550
What are the obvious ways to
represent the node of a trie?

239
00:13:15,550 --> 00:13:18,156
Nothing fancy.

240
00:13:18,156 --> 00:13:19,780
I have three obvious
answers, at least.

241
00:13:19,780 --> 00:13:20,547
AUDIENCE: Array.

242
00:13:20,547 --> 00:13:21,380
ERIK DEMAINE: Array.

243
00:13:21,380 --> 00:13:21,570
Good.

244
00:13:21,570 --> 00:13:22,570
That was my number one.

245
00:13:25,090 --> 00:13:26,740
Any more?

246
00:13:26,740 --> 00:13:28,365
That's I think the most obvious.

247
00:13:28,365 --> 00:13:28,990
AUDIENCE: Tree.

248
00:13:28,990 --> 00:13:29,915
ERIK DEMAINE: Tree.

249
00:13:29,915 --> 00:13:30,415
Good.

250
00:13:30,415 --> 00:13:32,350
Do a binary search tree.

251
00:13:32,350 --> 00:13:32,925
Or?

252
00:13:32,925 --> 00:13:33,800
AUDIENCE: Hash table.

253
00:13:33,800 --> 00:13:34,360
ERIK DEMAINE: Hash table.

254
00:13:34,360 --> 00:13:34,860
Good.

255
00:13:39,550 --> 00:13:44,950
So for each of them we
have query time and space.

256
00:13:48,590 --> 00:13:51,420
If I use an array,
meaning I have--

257
00:13:51,420 --> 00:13:53,320
let's say, for a through z--

258
00:13:53,320 --> 00:13:59,710
I have a pointer that either
is null or points to the child.

259
00:13:59,710 --> 00:14:02,170
This is going to be really
fast when you're at a node.

260
00:14:02,170 --> 00:14:05,350
If I want to know, I just
look at that i-th letter

261
00:14:05,350 --> 00:14:08,560
in my pattern P. I
say, oh, it's a j.

262
00:14:08,560 --> 00:14:11,577
So I look at the j
position and I follow it.

263
00:14:11,577 --> 00:14:13,910
You might wonder, how do I
do predecessor and successor?

264
00:14:13,910 --> 00:14:15,493
Well, this is a
static data structure.

265
00:14:15,493 --> 00:14:17,470
So for every cell,
if it's null, I

266
00:14:17,470 --> 00:14:21,460
can store the predecessor
and successor in the node.

267
00:14:21,460 --> 00:14:23,910
With no more space.

268
00:14:23,910 --> 00:14:28,270
This is Sigma space per node.

269
00:14:28,270 --> 00:14:34,570
So the amount of space is T
Sigma, which is not so great.

270
00:14:34,570 --> 00:14:37,388
But the query is fast,
query is order P time.
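
[Illustrative sketch, not part of the spoken lecture.] An array node might look like the following. The alphabet `ALPHABET` ($ plus a through z) and the class name `ArrayNode` are assumptions for the example. Each node has one slot per letter, and because the structure is static, the nearest occupied slot on either side is precomputed for every position, so child lookup and in-node predecessor or successor are all constant time, at the cost of Sigma space per node.

```python
ALPHABET = "$abcdefghijklmnopqrstuvwxyz"  # assumed alphabet for this sketch
INDEX = {c: i for i, c in enumerate(ALPHABET)}


class ArrayNode:
    def __init__(self, labels_to_children):
        n = len(ALPHABET)
        self.slots = [None] * n  # one cell per letter; None means no child
        for c, child in labels_to_children.items():
            self.slots[INDEX[c]] = child
        # pred[i]: largest occupied index strictly below i (or None)
        self.pred = [None] * n
        last = None
        for i in range(n):
            self.pred[i] = last
            if self.slots[i] is not None:
                last = i
        # succ[i]: smallest occupied index strictly above i (or None)
        self.succ = [None] * n
        nxt = None
        for i in reversed(range(n)):
            self.succ[i] = nxt
            if self.slots[i] is not None:
                nxt = i

    def child(self, ch):
        return self.slots[INDEX[ch]]

    def pred_label(self, ch):
        i = self.pred[INDEX[ch]]
        return None if i is None else ALPHABET[i]

    def succ_label(self, ch):
        i = self.succ[INDEX[ch]]
        return None if i is None else ALPHABET[i]


node = ArrayNode({"$": "leaf ann", "a": "subtree anna", "e": "subtree anne"})
print(node.child("a"), node.pred_label("n"), node.succ_label("b"))
```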

271
00:14:40,230 --> 00:14:40,850
BST.

272
00:14:40,850 --> 00:14:43,110
The idea of the BST
is instead of having

273
00:14:43,110 --> 00:14:47,270
a node that has some pointers,
some of which may be absent,

274
00:14:47,270 --> 00:14:54,050
let's expand it out into
something like this.

275
00:14:54,050 --> 00:14:55,610
Actually, I'll use colors.

276
00:14:55,610 --> 00:14:58,820
This will make life a little
bit cleaner in a moment.

277
00:14:58,820 --> 00:15:01,830
Because we are going to
modify this approach.

278
00:15:01,830 --> 00:15:06,470
So let's say that the pointers
you care about are red.

279
00:15:06,470 --> 00:15:08,670
Those are the actual letter
pointers you want to do.

280
00:15:08,670 --> 00:15:11,360
So the idea is to expand
out this high degree node

281
00:15:11,360 --> 00:15:12,950
into binary nodes.

282
00:15:12,950 --> 00:15:16,150
You put appropriate keys in here
so you can do a binary search.

283
00:15:16,150 --> 00:15:20,120
And then, eventually, you get
down to where you need to go.

284
00:15:20,120 --> 00:15:23,960
This structure has
height log Sigma.

285
00:15:23,960 --> 00:15:28,110
So the query time is
going to be P log Sigma.

286
00:15:28,110 --> 00:15:31,310
So that goes up a
little bit, not perfect.

287
00:15:31,310 --> 00:15:33,300
But the space now
becomes linear,

288
00:15:33,300 --> 00:15:35,210
so that's an improvement.
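
[Illustrative sketch, not part of the spoken lecture.] The lecture expands each node into a small binary search tree. For a static structure, a sorted array searched with `bisect` gives the same log Sigma per-node cost and space proportional to the number of children, so this sketch swaps that in rather than building pointer-based BST nodes. The class name `SortedNode` is an assumption for the example.

```python
from bisect import bisect_left, bisect_right


class SortedNode:
    """Children kept in label order; binary search replaces the in-node BST."""

    def __init__(self, labels_to_children):
        self.labels = sorted(labels_to_children)
        self.kids = [labels_to_children[c] for c in self.labels]

    def child(self, ch):
        i = bisect_left(self.labels, ch)
        if i < len(self.labels) and self.labels[i] == ch:
            return self.kids[i]
        return None  # no child with this label

    def pred_label(self, ch):
        """Largest child label strictly smaller than ch, or None."""
        i = bisect_left(self.labels, ch)
        return self.labels[i - 1] if i > 0 else None

    def succ_label(self, ch):
        """Smallest child label strictly larger than ch, or None."""
        i = bisect_right(self.labels, ch)
        return self.labels[i] if i < len(self.labels) else None


node = SortedNode({"$": "leaf ann", "a": "subtree anna", "e": "subtree anne"})
print(node.child("a"), node.pred_label("c"), node.succ_label("c"))
```

Walking a pattern down a trie built from such nodes does one binary search per letter, which is where the P log Sigma query time comes from.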

289
00:15:35,210 --> 00:15:38,630
Ideally, we'd like the
best of both of these--

290
00:15:38,630 --> 00:15:42,740
optimal query time, optimal
space, linear space.

291
00:15:42,740 --> 00:15:45,050
And hash tables achieve that.

292
00:15:45,050 --> 00:15:51,530
They give you order P
query and order T space.

293
00:15:51,530 --> 00:15:53,840
Again, the issue is some of
these cells are absent so

294
00:15:53,840 --> 00:15:54,740
don't use an array.

295
00:15:54,740 --> 00:15:56,840
That's like a direct
mapped hash table.

296
00:15:56,840 --> 00:15:58,700
Use a hash table, use hashing.

297
00:15:58,700 --> 00:16:01,880
That way you can use linear
space per node, however many

298
00:16:01,880 --> 00:16:04,380
occupied children there are.

299
00:16:04,380 --> 00:16:06,020
What is T here, by the way?

300
00:16:06,020 --> 00:16:11,120
T is the sum of the
lengths of the T i's--

301
00:16:11,120 --> 00:16:13,220
because here we're
storing multiple T i's.

302
00:16:13,220 --> 00:16:19,700
Or it's the number of
nodes in the tree, which,

303
00:16:19,700 --> 00:16:22,160
if your strings happen to
have a lot of common prefixes,

304
00:16:22,160 --> 00:16:24,159
the number of nodes in
the trie could be smaller

305
00:16:24,159 --> 00:16:28,980
than that, but not in general.

306
00:16:28,980 --> 00:16:30,841
What's the problem
with the hash table?

307
00:16:30,841 --> 00:16:31,340
Question.

308
00:16:31,340 --> 00:16:38,330
AUDIENCE: [INAUDIBLE]

309
00:16:38,330 --> 00:16:39,080
ERIK DEMAINE: Yes.

310
00:16:39,080 --> 00:16:41,570
For the BST, we need to
store some keys in this node.

311
00:16:41,570 --> 00:16:42,992
That lets you do
a binary search.

312
00:16:42,992 --> 00:16:44,450
For example, every
node could store

313
00:16:44,450 --> 00:16:46,220
the max in the left subtree--

314
00:16:46,220 --> 00:16:48,353
just within this
little tree, though.

315
00:16:48,353 --> 00:16:53,090
AUDIENCE: [INAUDIBLE]

316
00:16:53,090 --> 00:16:54,980
ERIK DEMAINE: It
is order T. Sorry,

317
00:16:54,980 --> 00:16:57,040
I see-- why is it
not O T Sigma space?

318
00:16:57,040 --> 00:16:58,580
You're only storing
one letter here,

319
00:16:58,580 --> 00:17:02,130
so that fits in a single
word, and two pointers.

320
00:17:02,130 --> 00:17:04,099
So every node only
takes constant space.

321
00:17:04,099 --> 00:17:08,730
It's only T space, not T Sigma.

322
00:17:08,730 --> 00:17:10,940
Other questions?

323
00:17:10,940 --> 00:17:11,450
Or answers?

324
00:17:11,450 --> 00:17:13,790
There's a problem with hashing--

325
00:17:13,790 --> 00:17:18,260
doesn't actually solve the
problem we want to solve.

326
00:17:18,260 --> 00:17:20,520
It doesn't solve predecessor.

327
00:17:20,520 --> 00:17:22,940
Because hashing mixes up
the order of the nodes.

328
00:17:22,940 --> 00:17:25,010
This is the problem
we had with--

329
00:17:25,010 --> 00:17:29,460
what's it called-- signature
sort, which hashed,

330
00:17:29,460 --> 00:17:32,790
it messed up, it permuted
all the things in the nodes

331
00:17:32,790 --> 00:17:34,950
and so you didn't know--

332
00:17:34,950 --> 00:17:37,410
I mean, in a hash table,
you can't solve predecessor.

333
00:17:37,410 --> 00:17:40,289
That's what the
predecessor problem is for.

334
00:17:40,289 --> 00:17:42,330
I guess you could try to
throw a predecessor data

335
00:17:42,330 --> 00:17:43,500
structure in here.

336
00:17:43,500 --> 00:17:46,960
Actually, I hadn't
thought of that before.

337
00:17:46,960 --> 00:17:49,680
So we could use y-fast
tries or something.

338
00:17:49,680 --> 00:17:55,650
And we would get
order T space and--

339
00:17:55,650 --> 00:17:58,510
I guess, with high
probability, this

340
00:17:58,510 --> 00:18:00,990
is also with high probability--

341
00:18:00,990 --> 00:18:06,180
we get order P log
log Sigma, I guess.

342
00:18:06,180 --> 00:18:09,270
Because I use Van Emde Boas.

343
00:18:09,270 --> 00:18:15,840
I'm going to have to call
it 3.5, Van Emde Boas.

344
00:18:15,840 --> 00:18:17,280
So that would be
another approach.

345
00:18:17,280 --> 00:18:22,409
So hashing will not
do a predecessor.

346
00:18:22,409 --> 00:18:24,950
We'll do exact search, which is
still an interesting problem.

347
00:18:24,950 --> 00:18:26,783
Might give you some
strings I want to know--

348
00:18:26,783 --> 00:18:29,640
is this string in your set?

349
00:18:29,640 --> 00:18:32,734
But it won't solve the
predecessor problem.

350
00:18:32,734 --> 00:18:34,650
So this is an interesting
solution-- hashing--

351
00:18:34,650 --> 00:18:36,070
but not quite what we want.

352
00:18:36,070 --> 00:18:38,361
And Van Emde Boas doesn't
quite do what we want either.

353
00:18:38,361 --> 00:18:40,620
It improves over
the BST approach

354
00:18:40,620 --> 00:18:42,900
but we get another log in there.

355
00:18:42,900 --> 00:18:46,080
But it's still not order
P. I kind of like order P.

356
00:18:46,080 --> 00:18:49,360
Or at least, instead of
order P times log log Sigma,

357
00:18:49,360 --> 00:18:54,030
I kind of like order
P plus log Sigma.

358
00:18:54,030 --> 00:18:56,470
And order P plus
log Sigma is known.

359
00:18:56,470 --> 00:19:01,010
So that's what I want
to tell you about.

360
00:19:01,010 --> 00:19:03,710
And this is normally
done with a structure

361
00:19:03,710 --> 00:19:09,740
called trays, which is
a portmanteau, I guess,

362
00:19:09,740 --> 00:19:12,410
of tree and array.

363
00:19:12,410 --> 00:19:15,140
Somewhere in there there's
a tree and an array,

364
00:19:15,140 --> 00:19:18,630
so it's a bit of
an awkward word.

365
00:19:18,630 --> 00:19:23,690
But those are developed by
Kopelowitz and Lewenstein,

366
00:19:23,690 --> 00:19:28,460
in 2006, a fairly
recent innovation.

367
00:19:28,460 --> 00:19:31,240
I'll call this number 4--

368
00:19:31,240 --> 00:19:41,900
trays, achieve order P plus
log Sigma and order T space.

369
00:19:41,900 --> 00:19:42,870
So this is pretty good.

370
00:19:42,870 --> 00:19:45,680
And they will do predecessor
and successor-- definitely

371
00:19:45,680 --> 00:19:47,037
an improvement over the BST.

372
00:19:50,690 --> 00:19:55,460
It's open whether you could
do order P plus log log Sigma.

373
00:19:55,460 --> 00:19:58,530
As far as I can tell,
no one has worked on this.

374
00:19:58,530 --> 00:20:03,920
Maybe we will work on it today.

375
00:20:03,920 --> 00:20:05,360
So something to think about--

376
00:20:05,360 --> 00:20:06,901
whether you could
get the best of all

377
00:20:06,901 --> 00:20:09,174
of these worlds for predecessor.

378
00:20:09,174 --> 00:20:11,590
There's a lower bound-- you
need to spend at least log log

379
00:20:11,590 --> 00:20:12,170
Sigma time.

380
00:20:12,170 --> 00:20:14,319
Because even if your
trie is a single node,

381
00:20:14,319 --> 00:20:15,860
you have the
predecessor lower bound.

382
00:20:15,860 --> 00:20:18,470
And we know log log universe
size is the best you

383
00:20:18,470 --> 00:20:19,860
can do in this regime.

384
00:20:23,320 --> 00:20:26,560
So that's where we're going.

385
00:20:26,560 --> 00:20:28,270
Instead of describing
trays, though, I'm

386
00:20:28,270 --> 00:20:29,645
going to describe
a new way to do

387
00:20:29,645 --> 00:20:32,200
it, which has never
been seen before

388
00:20:32,200 --> 00:20:33,452
in any class or anywhere.

389
00:20:33,452 --> 00:20:34,410
Because it's brand new.

390
00:20:34,410 --> 00:20:37,720
It's developed by Martin
Farach-Colton, who

391
00:20:37,720 --> 00:20:41,290
did the LCA and the
level ancestor structures

392
00:20:41,290 --> 00:20:43,480
that we saw in last class.

393
00:20:43,480 --> 00:20:46,160
And he just told it to
me and it's really cool

394
00:20:46,160 --> 00:20:47,590
so we're going to cover it.

395
00:20:51,640 --> 00:20:54,755
A simpler way to get
this same bound of trays.

396
00:20:58,990 --> 00:21:00,670
And the first thing
we're going to do

397
00:21:00,670 --> 00:21:02,270
is use a weight balanced BST.

398
00:21:07,040 --> 00:21:16,670
This will achieve P plus log
k query and linear space.

399
00:21:20,650 --> 00:21:23,210
k, remember, is the
number of strings

400
00:21:23,210 --> 00:21:26,670
that we're storing, so it's the
number of leaves in the trie.

401
00:21:26,670 --> 00:21:28,789
So it's not quite as
good as P plus log Sigma

402
00:21:28,789 --> 00:21:30,080
but it's going to be a warm up.

403
00:21:30,080 --> 00:21:32,621
We're going to do this and
then we're going to improve it.

404
00:21:34,550 --> 00:21:39,470
Remember weight balanced trees,
we talked about them way back

405
00:21:39,470 --> 00:21:41,550
in lecture 3, I believe.

406
00:21:41,550 --> 00:21:44,690
There is an issue of
what is the weight.

407
00:21:44,690 --> 00:21:46,850
And typically, you say,
the weight of a subtree

408
00:21:46,850 --> 00:21:48,670
is the number of
nodes in the subtree.

409
00:21:48,670 --> 00:21:50,420
I'm going to change
that slightly and say,

410
00:21:50,420 --> 00:21:53,360
the weight of a subtree is
the number of descendant

411
00:21:53,360 --> 00:21:59,300
leaves in the subtree,
not the number of nodes,

412
00:21:59,300 --> 00:22:01,890
because it's log k.

413
00:22:01,890 --> 00:22:04,460
We really care about the
number of leaves down there.

414
00:22:04,460 --> 00:22:05,960
There could be long
paths here which

415
00:22:05,960 --> 00:22:08,690
we are not so excited about.

416
00:22:08,690 --> 00:22:11,150
We really care about how
many leaves are down there.

417
00:22:11,150 --> 00:22:14,881
Like the weight of this
node here is three--

418
00:22:14,881 --> 00:22:16,130
there's three leaves below it.

419
00:22:21,410 --> 00:22:23,800
You may recall
weight balanced BSTs

420
00:22:23,800 --> 00:22:25,910
trying to make the weight
of the left subtree

421
00:22:25,910 --> 00:22:28,860
within a constant factor of the
weight of the right subtree.

422
00:22:28,860 --> 00:22:31,130
Because we're static,
we can be even simpler

423
00:22:31,130 --> 00:22:33,335
and say, find the
optimal partition.

424
00:22:37,190 --> 00:22:40,430
So we're thinking
about this approach--

425
00:22:40,430 --> 00:22:44,570
idea of expanding a large degree
node into some binary tree.

426
00:22:44,570 --> 00:22:46,610
We have a choice of
what binary tree to use.

427
00:22:46,610 --> 00:22:48,694
With three nodes it may
be not many choices-- that

428
00:22:48,694 --> 00:22:50,693
could be this or it could
be a straight line this way

429
00:22:50,693 --> 00:22:51,957
or a straight line that way.

430
00:22:51,957 --> 00:22:52,790
Those are different.

431
00:22:52,790 --> 00:22:55,167
And if one of these
guys is really heavy,

432
00:22:55,167 --> 00:22:56,750
one of these children
is really heavy,

433
00:22:56,750 --> 00:22:58,675
you want to put it
closer to the root.

434
00:22:58,675 --> 00:23:00,050
So that's what
we're going to do.

435
00:23:04,060 --> 00:23:06,720
Let me draw it this way.

436
00:23:06,720 --> 00:23:09,270
That's kind of an array.

437
00:23:09,270 --> 00:23:13,490
But what this array
represents is for a node--

438
00:23:13,490 --> 00:23:16,040
so here's my node, it
has lots of children.

439
00:23:16,040 --> 00:23:18,800
Some of these are
heavy, some of them

440
00:23:18,800 --> 00:23:21,567
are light, lighter than others.

441
00:23:21,567 --> 00:23:23,150
We don't know how
they're distributed.

442
00:23:23,150 --> 00:23:27,374
But they're ordered, we
have to preserve the order.

443
00:23:27,374 --> 00:23:28,790
What this is
supposed to represent

444
00:23:28,790 --> 00:23:31,890
is the total number of
leaves in this subtree.

445
00:23:31,890 --> 00:23:36,050
So the total number
of leaves here.

446
00:23:36,050 --> 00:23:40,490
And then I'm going to partition
this rectangle into groups

447
00:23:40,490 --> 00:23:43,430
corresponding to these sizes.

448
00:23:43,430 --> 00:23:48,230
So these are small, medium,
small, little less than medium,

449
00:23:48,230 --> 00:23:51,290
big, and then small.

450
00:23:51,290 --> 00:23:52,550
Something like that.

451
00:23:52,550 --> 00:23:54,950
So these horizontal
lengths correspond

452
00:23:54,950 --> 00:23:57,260
to the number of
leaves in these things,

453
00:23:57,260 --> 00:24:00,224
correspond to the
weight of my children.

454
00:24:00,224 --> 00:24:01,640
So I look at that
and I say, well,

455
00:24:01,640 --> 00:24:04,160
what I'd really like
to do is split this

456
00:24:04,160 --> 00:24:06,690
in the middle, which
is, maybe, here.

457
00:24:06,690 --> 00:24:08,730
I say, OK, well,
then I'll split here.

458
00:24:08,730 --> 00:24:11,090
That's pretty close
to the middle.

459
00:24:11,090 --> 00:24:13,547
So my left subtree will
consist of these guys,

460
00:24:13,547 --> 00:24:15,380
my right subtree will
consist of these guys.

461
00:24:15,380 --> 00:24:16,760
And then I recurse--

462
00:24:16,760 --> 00:24:19,490
over here I've
split at the middle,

463
00:24:19,490 --> 00:24:21,440
I find the thing that's
closest to the middle.

464
00:24:21,440 --> 00:24:22,898
Over here I've
split at the middle,

465
00:24:22,898 --> 00:24:26,724
I find the thing that's
closest to the middle.

466
00:24:26,724 --> 00:24:27,890
It's pretty much determined.

467
00:24:27,890 --> 00:24:30,642
So my root node
corresponds to this one.

468
00:24:30,642 --> 00:24:31,850
It's going to partition here.

469
00:24:31,850 --> 00:24:33,935
So over on the right,
there's going to be--

470
00:24:37,250 --> 00:24:39,470
here's going to be the
big tree and then here

471
00:24:39,470 --> 00:24:40,379
is the small tree.

472
00:24:40,379 --> 00:24:42,170
So this small tree
corresponds to this one.

473
00:24:42,170 --> 00:24:44,270
This big tree corresponds
to this interval.

474
00:24:44,270 --> 00:24:47,390
Then on the left we've got
four things we need to store.

475
00:24:47,390 --> 00:24:51,970
So these are the red
pointers that we had before.

476
00:24:51,970 --> 00:24:54,960
Then over on the left, we're
going to have a partition.

477
00:24:54,960 --> 00:24:57,250
And then there's going
to be two guys here.

478
00:24:57,250 --> 00:24:59,810
It doesn't really matter
how we store them.

479
00:24:59,810 --> 00:25:03,050
It's something like this.

480
00:25:03,050 --> 00:25:10,214
There is medium and small.

481
00:25:10,214 --> 00:25:12,255
And then over on the left,
we also have two guys.

482
00:25:12,255 --> 00:25:15,740
So it's going to be,
again, something like this.

483
00:25:20,440 --> 00:25:23,370
You got medium and small.

484
00:25:23,370 --> 00:25:25,880
So you see how that worked.

485
00:25:25,880 --> 00:25:29,270
Our main goal was to make this
big guy as close to the root

486
00:25:29,270 --> 00:25:30,141
as possible.

487
00:25:30,141 --> 00:25:32,390
It was the biggest and that's
basically what happened.

488
00:25:32,390 --> 00:25:34,206
This one is really big.

489
00:25:34,206 --> 00:25:36,330
And we couldn't quite put
it as a child of the root

490
00:25:36,330 --> 00:25:37,790
because it appeared
in the middle,

491
00:25:37,790 --> 00:25:41,480
but we could put it as a
grandchild of the root.

492
00:25:41,480 --> 00:25:43,400
In general, if you have
a super heavy child,

493
00:25:43,400 --> 00:25:47,150
it will always become a child
or grandchild of the root.

494
00:25:47,150 --> 00:25:50,149
So in constant number of
traversals you'll get there.
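A minimal sketch of the split-closest-to-the-middle rule described above, assuming the children are given as an ordered list of leaf counts. The names build_weight_balanced and depth_of, and the example weights, are illustrative and not from the lecture.

```python
def build_weight_balanced(weights, lo=0, hi=None):
    """Build a binary tree over children lo..hi-1, preserving their order.

    A leaf is the integer index of a child; an internal node is a
    (left, right) tuple. The cut is the child boundary whose left-side
    weight is closest to half of the total weight.
    """
    if hi is None:
        hi = len(weights)
    if hi - lo == 1:
        return lo
    total = sum(weights[lo:hi])
    best_cut, best_gap, prefix = None, None, 0
    for cut in range(lo + 1, hi):
        prefix += weights[cut - 1]
        gap = abs(2 * prefix - total)  # distance from the exact middle
        if best_gap is None or gap < best_gap:
            best_cut, best_gap = cut, gap
    return (build_weight_balanced(weights, lo, best_cut),
            build_weight_balanced(weights, best_cut, hi))


def depth_of(tree, child, depth=0):
    """Return the depth of a child index in the tree, or None if absent."""
    if isinstance(tree, int):
        return depth if tree == child else None
    for subtree in tree:
        found = depth_of(subtree, child, depth + 1)
        if found is not None:
            return found
    return None


# small, medium, small, a bit less than medium, big, small (made-up numbers)
weights = [1, 3, 1, 2, 6, 1]
tree = build_weight_balanced(weights)
```

With these weights the heavy child (index 4) lands two edges below the root, matching the grandchild behavior described above.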

495
00:25:50,149 --> 00:25:52,190
Now again, you fill in
these nodes with some keys

496
00:25:52,190 --> 00:25:53,780
so you can do a binary search.

497
00:25:53,780 --> 00:25:57,950
But now the binary
search might go faster

498
00:25:57,950 --> 00:26:01,490
than log Sigma, which
is what we had before.

499
00:26:01,490 --> 00:26:04,340
And indeed, you can prove
that this really works well.

500
00:26:12,410 --> 00:26:13,610
So what's the claim?

501
00:26:17,030 --> 00:26:32,950
Claim is every two
edges you follow either

502
00:26:32,950 --> 00:26:35,654
advance one letter in P--

503
00:26:38,440 --> 00:26:41,177
these are the red edges
that we want to follow.

504
00:26:41,177 --> 00:26:42,760
So if we follow a
red edge, then we've

505
00:26:42,760 --> 00:26:45,020
made progress to the next node.

506
00:26:45,020 --> 00:26:48,730
So this would be
following a red edge.

507
00:26:48,730 --> 00:27:04,270
Or we reduce the number of
candidate T i's by 2/3

508
00:27:04,270 --> 00:27:08,110
or, I guess, to 2/3
of its original value.

509
00:27:08,110 --> 00:27:10,210
So we lose a third
of the strings.

510
00:27:10,210 --> 00:27:12,100
That's what I'd like to claim.

511
00:27:12,100 --> 00:27:14,450
And it's not too
hard to see this.

512
00:27:14,450 --> 00:27:17,685
You have to imagine all of
these possible partitions.

513
00:27:17,685 --> 00:27:19,720
It's a little bit awkward.

514
00:27:19,720 --> 00:27:20,890
The idea is the following.

515
00:27:20,890 --> 00:27:23,170
If you take one
of these arrays--

516
00:27:23,170 --> 00:27:26,630
this view of all the leaves
just laid out on the line--

517
00:27:26,630 --> 00:27:30,800
you say, well, I'd like
to split in half and half.

518
00:27:30,800 --> 00:27:33,490
But that will never happen
unless I'm really lucky.

519
00:27:33,490 --> 00:27:37,540
So let's think about
this one third splitting.

520
00:27:37,540 --> 00:27:41,830
If I were able to cut anywhere
in here, then in one step,

521
00:27:41,830 --> 00:27:45,430
actually, I would achieve
this 2/3 reduction.

522
00:27:45,430 --> 00:27:46,690
I'd lose a third of the nodes.

523
00:27:50,530 --> 00:27:56,170
If I end up cutting here,
for example, then either I

524
00:27:56,170 --> 00:27:58,420
go to the left and I lost
almost 2/3 of the nodes,

525
00:27:58,420 --> 00:27:59,878
or I go to the
right and I at least

526
00:27:59,878 --> 00:28:02,410
lost this one third of the nodes
or one third of the leaves,

527
00:28:02,410 --> 00:28:04,300
I should say.

528
00:28:04,300 --> 00:28:06,250
So that would be
a good situation

529
00:28:06,250 --> 00:28:07,610
if I got some cut in here.

530
00:28:07,610 --> 00:28:10,570
But it might be there is no
possible cut I can make in here

531
00:28:10,570 --> 00:28:16,879
because there's a giant child
in here that has more than one

532
00:28:16,879 --> 00:28:17,670
third of the nodes.

533
00:28:17,670 --> 00:28:20,710
It would have to span
all the way across here.

534
00:28:20,710 --> 00:28:22,630
So I can't make any
cuts, I can only

535
00:28:22,630 --> 00:28:25,060
cut between child boundaries.

536
00:28:25,060 --> 00:28:29,880
In that situation,
you make this--

537
00:28:29,880 --> 00:28:34,180
well, this is when I need to
follow two edges, not one.

538
00:28:34,180 --> 00:28:36,550
When there's a super big
child like that, as we said,

539
00:28:36,550 --> 00:28:39,280
it will become a
grandchild of the root.

540
00:28:39,280 --> 00:28:40,720
So it will be--

541
00:28:40,720 --> 00:28:45,070
there's the root and then
here is the giant tree.

542
00:28:45,070 --> 00:28:50,270
And then there's going to be
the other stuff here and here.

543
00:28:50,270 --> 00:28:53,890
So after I go down to either
one step or two steps,

544
00:28:53,890 --> 00:28:57,390
I will either get here--

545
00:28:57,390 --> 00:29:03,020
sorry, more red chalk,
this was a red pointer.

546
00:29:03,020 --> 00:29:04,840
Now, this is going to a child.

547
00:29:04,840 --> 00:29:07,090
So if I went there,
I'm happy in two steps.

548
00:29:07,090 --> 00:29:12,260
I advance one letter in P.
Or in two steps, I went here

549
00:29:12,260 --> 00:29:13,060
or I went here.

550
00:29:13,060 --> 00:29:14,780
And this was a huge
amount of the nodes,

551
00:29:14,780 --> 00:29:16,363
this is at least a
third of the nodes.

552
00:29:16,363 --> 00:29:18,220
Again, if I end up
here or end up here,

553
00:29:18,220 --> 00:29:20,990
I lost 2/3 of the
candidate leaves.

554
00:29:20,990 --> 00:29:23,740
I mean, I lost one third
of the candidate leaves,

555
00:29:23,740 --> 00:29:25,570
leaving 2/3 of them.

556
00:29:28,840 --> 00:29:33,340
If this happens, I charge
to this order P term.

557
00:29:33,340 --> 00:29:34,810
And if the other
situation happens,

558
00:29:34,810 --> 00:29:37,360
I charge to the log k term--
because I can only reduce k

559
00:29:37,360 --> 00:29:41,260
by a factor of 2/3--

560
00:29:41,260 --> 00:29:44,110
order log k times.

561
00:29:44,110 --> 00:29:50,720
This implies order
P plus log k search.

562
00:29:50,720 --> 00:29:52,440
So a very simple idea.
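The charging argument can be written out as a short bound. This is a sketch of the counting only: each pair of edges either advances one letter of P, which happens at most |P| times, or shrinks the candidate set to at most 2/3 of its size, which can happen at most log base 3/2 of k times before one candidate is left.

```latex
\[
\#\text{edges followed}
  \;\le\; 2\Bigl(|P| + \bigl\lceil \log_{3/2} k \bigr\rceil\Bigr)
  \;=\; O\bigl(|P| + \log k\bigr)
\]
```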

563
00:29:52,440 --> 00:29:54,900
Just change the way we do BSTs.

564
00:29:54,900 --> 00:29:57,210
And we get, in some
cases, a better bound.

565
00:29:57,210 --> 00:29:59,940
But not in all
cases because maybe

566
00:29:59,940 --> 00:30:03,240
P plus log k might be bigger
than P times log Sigma.

567
00:30:03,240 --> 00:30:06,620
And k and Sigma are kind of
incomparable, so we don't know.

568
00:30:06,620 --> 00:30:13,350
That's where method
5 comes in, which

569
00:30:13,350 --> 00:30:15,620
is our good friend
from last class--

570
00:30:15,620 --> 00:30:18,752
leaf trimming and indirection.

571
00:30:22,200 --> 00:30:26,640
So we're going to use
this idea of finding--

572
00:30:26,640 --> 00:30:33,750
we're going to cut below
maximally deep nodes

573
00:30:33,750 --> 00:30:36,200
with the right number
of descendants in them.

574
00:30:43,820 --> 00:30:48,910
So we need at least
Sigma descendants.

575
00:30:53,820 --> 00:30:56,090
It could just be descendants
or descendant leaves,

576
00:30:56,090 --> 00:30:57,090
doesn't actually matter.

577
00:31:02,890 --> 00:31:06,416
Let me draw a picture, maybe.

578
00:31:06,416 --> 00:31:08,040
This is pretty much
what we did before,

579
00:31:08,040 --> 00:31:12,500
except before this magic number
was log n that we needed or 1/2

580
00:31:12,500 --> 00:31:13,820
log n or something.

581
00:31:13,820 --> 00:31:16,260
Now it's going to be
Sigma that we need.

582
00:31:16,260 --> 00:31:18,815
So it is we find these
maximally deep nodes--

583
00:31:18,815 --> 00:31:22,880
these dots-- that
have at least--

584
00:31:22,880 --> 00:31:25,310
I guess, there is really
multiple things hanging off

585
00:31:25,310 --> 00:31:25,970
here.

586
00:31:25,970 --> 00:31:29,990
In general, it could be
several things hanging off.

587
00:31:29,990 --> 00:31:31,490
But the total number
of descendants

588
00:31:31,490 --> 00:31:35,970
of each of these nodes
is at least Sigma.

589
00:31:35,970 --> 00:31:37,880
So what that implies
is that the number

590
00:31:37,880 --> 00:31:43,420
of these dots, the number of
the leaves in the top tree--

591
00:31:43,420 --> 00:31:51,890
so up here-- number of leaves
is at most T over Sigma.

592
00:31:51,890 --> 00:31:54,890
Because we can charge each
of these nodes to Sigma

593
00:31:54,890 --> 00:31:58,490
descendants in each of them.
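A minimal sketch of finding the maximally deep nodes with at least Sigma descendant leaves, assuming the trie is a plain nested dict and every string ends in "$". The names build_trie, count_leaves and cut_nodes are illustrative, and the leaf counts are recomputed at every node for brevity rather than stored.

```python
def build_trie(words):
    """Uncompressed trie as nested dicts; every word is terminated by '$'."""
    root = {}
    for word in words:
        node = root
        for ch in word + "$":
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Number of descendant leaves (a leaf is an empty dict)."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def cut_nodes(node, sigma, path=""):
    """Paths of the maximally deep nodes with >= sigma descendant leaves."""
    if count_leaves(node) < sigma:
        return []
    deeper = []
    for letter, child in sorted(node.items()):
        deeper += cut_nodes(child, sigma, path + letter)
    # If no child qualifies, this node itself is maximally deep.
    return deeper or [path]


trie = build_trie(["ana", "ann", "anna", "anne"])
cuts = cut_nodes(trie, sigma=2)
```

Each cut node owns at least sigma leaves of its own, so the number of cut nodes is at most the total number of leaves divided by sigma, which is the charging argument above.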

594
00:31:58,490 --> 00:32:03,160
So that's good because it
says we can use method 1--

595
00:32:03,160 --> 00:32:06,830
the simple array method--
which is fast but spacious.

596
00:32:06,830 --> 00:32:11,870
But if our new size of the trie
gets divided by a Sigma factor,

597
00:32:11,870 --> 00:32:14,100
then this turns
out to be linear.

598
00:32:14,100 --> 00:32:15,860
So up here we use method 1.

599
00:32:18,980 --> 00:32:21,170
Now, you got to be a little
careful because we can't

600
00:32:21,170 --> 00:32:23,127
use method 1 on all the nodes.

601
00:32:23,127 --> 00:32:24,710
We can definitely
use it on the leaves

602
00:32:24,710 --> 00:32:26,600
because there aren't
too many leaves.

603
00:32:26,600 --> 00:32:32,910
That means we can also use it on
the number of branching nodes.

604
00:32:32,910 --> 00:32:35,000
Number of branching
nodes is also

605
00:32:35,000 --> 00:32:37,940
going to be, at
most, T over Sigma

606
00:32:37,940 --> 00:32:40,550
because it's actually
one fewer branching node

607
00:32:40,550 --> 00:32:43,310
than there are leaves.

608
00:32:43,310 --> 00:32:49,340
So great, I can use
arrays on the leaves,

609
00:32:49,340 --> 00:32:52,340
I can use arrays on
the branching nodes.

610
00:32:52,340 --> 00:32:54,950
I can't use it on the
non-branching nodes.

611
00:32:54,950 --> 00:32:58,220
Non-branching nodes are nodes
with a single descendant

612
00:32:58,220 --> 00:33:00,650
and everything else is null.

613
00:33:00,650 --> 00:33:03,490
What do I do for those nodes?

614
00:33:03,490 --> 00:33:06,470
Very difficult. I just store
that one pointer and its

615
00:33:06,470 --> 00:33:07,340
label.

616
00:33:07,340 --> 00:33:09,670
I guess you could think
of that as method 2

617
00:33:09,670 --> 00:33:11,360
in a very trivial case.

618
00:33:11,360 --> 00:33:13,670
You see-- is this
the right label?

619
00:33:13,670 --> 00:33:15,310
Yes or no.

620
00:33:15,310 --> 00:33:17,975
So this is the
non-branching nodes.

621
00:33:22,730 --> 00:33:25,910
Non-branching top nodes--

622
00:33:25,910 --> 00:33:28,430
I will use method 2.

623
00:33:28,430 --> 00:33:30,260
So I guess this is really--

624
00:33:30,260 --> 00:33:32,510
well, for these
guys I use method 1,

625
00:33:32,510 --> 00:33:35,930
for these guys I use method 1.

626
00:33:35,930 --> 00:33:37,160
So I can afford all this.

627
00:33:37,160 --> 00:33:38,555
This will take order T space.

628
00:33:43,070 --> 00:33:46,280
And it will be fast because
either I'm using arrays

629
00:33:46,280 --> 00:33:48,230
or I really don't
have any work to do,

630
00:33:48,230 --> 00:33:50,440
and so it doesn't
really matter what I do.

631
00:33:50,440 --> 00:33:52,190
Except I can't use
arrays because they

632
00:33:52,190 --> 00:33:53,990
would be too spacious.

633
00:33:53,990 --> 00:33:55,170
So that handles the top.

634
00:33:55,170 --> 00:33:57,530
Now, the issue is, what about
these bottom structures?

635
00:33:57,530 --> 00:33:59,540
The bottom structures--
what do we know?

636
00:33:59,540 --> 00:34:03,450
They have to have
less than Sigma nodes,

637
00:34:03,450 --> 00:34:05,660
less than Sigma descendants.

638
00:34:05,660 --> 00:34:09,260
Also less than Sigma leaves.

639
00:34:09,260 --> 00:34:15,889
So in other words,
in these trees

640
00:34:15,889 --> 00:34:19,100
we have k less than Sigma.

641
00:34:19,100 --> 00:34:21,889
Well, then we can
afford to use method 4.

642
00:34:21,889 --> 00:34:25,530
Because our whole goal is to get
k down to Sigma in this bound.

643
00:34:25,530 --> 00:34:29,105
So in the bottom
trees, we use method 4.

644
00:34:31,730 --> 00:34:33,500
Method 4 was always
linear space.

645
00:34:33,500 --> 00:34:36,260
And the issue was we
paid P plus log k.

646
00:34:36,260 --> 00:34:44,239
But now in here, k is less
than Sigma in these trees.

647
00:34:44,239 --> 00:34:50,690
So that means we get order
P plus log Sigma query time.

648
00:34:53,659 --> 00:34:56,550
And that's the best we know how
to do if you want predecessor

649
00:34:56,550 --> 00:34:57,780
at the nodes.

650
00:34:57,780 --> 00:35:01,830
So it matches this tray
bound in a pretty easy way.

651
00:35:01,830 --> 00:35:04,790
Just to apply weight balanced,
clean things up a little bit.

652
00:35:04,790 --> 00:35:07,800
But only do that at the
leaves and everywhere up

653
00:35:07,800 --> 00:35:09,120
here, basically.

654
00:35:09,120 --> 00:35:11,114
Except the non-branching
nodes use arrays.

655
00:35:11,114 --> 00:35:13,280
So for the most part arrays
and then, at the bottom,

656
00:35:13,280 --> 00:35:16,400
you use weight balance.

657
00:35:16,400 --> 00:35:19,340
This is how you ought
to represent a trie.

658
00:35:19,340 --> 00:35:22,172
If you want to preserve
the order of the children,

659
00:35:22,172 --> 00:35:23,630
this is the best
we know how to do.

660
00:35:23,630 --> 00:35:26,330
If you don't want to preserve
order, just use a hash table.

661
00:35:26,330 --> 00:35:28,145
So it depends on
the application.

662
00:35:32,110 --> 00:35:36,370
One fun application of
this is string sorting.

663
00:35:39,370 --> 00:35:40,930
It's not a data
structures problem

664
00:35:40,930 --> 00:35:42,804
so I don't want to spend
too much time on it.

665
00:35:42,804 --> 00:35:45,340
But you use this trie data
structure to sort strings.

666
00:35:45,340 --> 00:35:47,590
You just throw in a string
and then throw in a string.

667
00:35:47,590 --> 00:35:52,670
We didn't talk about dynamic
tries but it can be done.

668
00:35:52,670 --> 00:35:54,670
And if you throw it,
you just sort of find

669
00:35:54,670 --> 00:35:57,220
where you fall off and
then add the thing.

670
00:35:57,220 --> 00:35:59,770
Now, you have to maintain
all this funky stuff

671
00:35:59,770 --> 00:36:03,370
but weight balanced trees can
be made dynamic and indirection

672
00:36:03,370 --> 00:36:05,030
can be made dynamic.

673
00:36:05,030 --> 00:36:09,880
So you end up with this sort
of simple incremental scheme.

674
00:36:09,880 --> 00:36:14,410
You end up with T
plus k log Sigma

675
00:36:14,410 --> 00:36:21,790
to sort k strings of total size
T with alphabet size Sigma.

676
00:36:21,790 --> 00:36:22,720
This is good.

677
00:36:22,720 --> 00:36:26,077
If I used, for example,
merge sort to sort strings,

678
00:36:26,077 --> 00:36:27,160
it's going to be very bad.

679
00:36:27,160 --> 00:36:32,650
It's going to be something like
T times k times log something.

680
00:36:32,650 --> 00:36:34,150
We didn't really
care about the log.

681
00:36:34,150 --> 00:36:35,420
T times k is bad.

682
00:36:35,420 --> 00:36:39,400
That's because comparing strings
could potentially take T time.

683
00:36:39,400 --> 00:36:40,915
And then there's k of them.

684
00:36:40,915 --> 00:36:42,290
But this is linear.

685
00:36:42,290 --> 00:36:44,870
This is the sum of the
lengths of the strings.

686
00:36:44,870 --> 00:36:46,510
There's this extra little term.

687
00:36:46,510 --> 00:36:48,670
But most of the time that's
going to be dominated

688
00:36:48,670 --> 00:36:51,600
by the length of the strings.

689
00:36:51,600 --> 00:36:55,217
So that's a good way to
sort strings using tries.
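A minimal sketch of sorting strings by inserting them into a trie and reading the leaves back in order. It uses a plain dict per node and sorts each node's children at traversal time, so it illustrates the idea only and does not achieve the T plus k log Sigma bound, which needs the node representation described above. The name trie_sort and the empty-string end marker are assumptions for this example.

```python
def trie_sort(strings):
    """Sort strings by trie insertion followed by an ordered traversal."""
    END = ""  # end-of-string marker; sorts before every real letter
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = node.get(END, 0) + 1  # count duplicates

    out = []

    def walk(node, prefix):
        for key in sorted(node):
            if key == END:
                out.extend([prefix] * node[key])
            else:
                walk(node[key], prefix + key)

    walk(root, "")
    return out


words = ["banana", "band", "ban", "apple", "ban"]
result = trie_sort(words)
```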

690
00:36:55,217 --> 00:36:57,800
Tries by themselves, I mean this
is about all there is to say.

691
00:36:57,800 --> 00:37:02,870
So let's move on to suffix
trees and compressed tries.

692
00:37:02,870 --> 00:37:06,721
Now, we actually did compressed
tries in the signature sort

693
00:37:06,721 --> 00:37:07,220
lecture.

694
00:37:14,492 --> 00:37:15,950
Actually, why don't
I go over here?

695
00:37:25,210 --> 00:37:28,230
So tries-- branches were
labeled with letters.

696
00:37:28,230 --> 00:37:32,160
That's still going to be
true for a compressed trie.

697
00:37:32,160 --> 00:37:35,190
But as we saw in that
lecture, in compressed trie

698
00:37:35,190 --> 00:37:37,820
we're going to get rid of
the non-branching nodes.

699
00:37:41,650 --> 00:37:44,010
So idea with the compressed
trie is very simple--

700
00:37:44,010 --> 00:37:49,500
just contract non-branching
paths into a single edge.

701
00:38:03,580 --> 00:38:05,800
This is our example of a trie.

702
00:38:05,800 --> 00:38:08,440
We're just going to modify
it to make a compressed trie.

703
00:38:14,890 --> 00:38:17,920
Here we have a
non-branching path.

704
00:38:17,920 --> 00:38:20,770
We have to follow an a, and
then we have to follow an n.

705
00:38:20,770 --> 00:38:22,330
There's no point in
having this node.

706
00:38:22,330 --> 00:38:24,038
You might as well just
have a single edge

707
00:38:24,038 --> 00:38:26,560
that says a-n on it.

708
00:38:26,560 --> 00:38:29,470
So we go from here,
from the root.

709
00:38:29,470 --> 00:38:33,370
We're going to have
an edge that says a-n.

710
00:38:37,230 --> 00:38:40,560
And in some sense, the
key of this child is a.

711
00:38:40,560 --> 00:38:42,540
If you're starting up
here and you want to know

712
00:38:42,540 --> 00:38:45,820
which way should I go, you
should only go this way

713
00:38:45,820 --> 00:38:47,700
if your first letter is a.

714
00:38:47,700 --> 00:38:49,410
After that, your next
letter better be n,

715
00:38:49,410 --> 00:38:51,420
otherwise you fell off the tree.

716
00:38:51,420 --> 00:38:53,280
So that's the
compression we're doing.

717
00:38:53,280 --> 00:38:55,310
Now, here we have-- this
is a branching node,

718
00:38:55,310 --> 00:38:56,700
so that node we keep intact.

719
00:39:00,490 --> 00:39:03,840
This is an n, this is an a here.

720
00:39:03,840 --> 00:39:06,370
But here it's non-branching.

721
00:39:06,370 --> 00:39:08,320
Let me draw this a
little bit longer.

722
00:39:08,320 --> 00:39:10,350
In reality, it's
just a single edge.

723
00:39:10,350 --> 00:39:13,000
And again, the key is a, and
then you must have a $ sign

724
00:39:13,000 --> 00:39:14,070
on afterwards.

725
00:39:14,070 --> 00:39:16,730
Then you reach a
leaf, the first leaf.

726
00:39:16,730 --> 00:39:18,420
If we follow the n branch--

727
00:39:18,420 --> 00:39:22,278
this is branching, so
that node is preserved.

728
00:39:25,560 --> 00:39:28,730
If I go this way, it's a
$ sign and I reach a leaf.

729
00:39:28,730 --> 00:39:33,030
If I go this way it's an a that
must be followed by a $ sign,

730
00:39:33,030 --> 00:39:34,020
so that's a leaf.

731
00:39:34,020 --> 00:39:37,635
And if I go this way, it must
be an e, followed by a $ sign,

732
00:39:37,635 --> 00:39:39,780
which is a leaf.

733
00:39:39,780 --> 00:39:43,080
Again, these four leaves
can point to these places.

734
00:39:43,080 --> 00:39:44,640
That's a compressed trie.

735
00:39:44,640 --> 00:39:45,967
Pretty obvious.
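A minimal sketch of the contraction step, assuming the trie is a nested dict with one letter per edge and "$" terminating each string. The names build_trie and compress are illustrative, not from the lecture.

```python
def build_trie(words):
    """Uncompressed trie as nested dicts; every word is terminated by '$'."""
    root = {}
    for word in words:
        node = root
        for ch in word + "$":
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Contract every non-branching path into a single multi-letter edge."""
    out = {}
    for letter, child in node.items():
        label = letter
        # Keep swallowing nodes that have exactly one child.
        while len(child) == 1:
            (next_letter, grandchild), = child.items()
            label += next_letter
            child = grandchild
        out[label] = compress(child)
    return out


compressed = compress(build_trie(["ana", "ann", "anna", "anne"]))
```

The first letter of each edge label still acts as the key for choosing a child; the remaining letters must match or the search falls off the tree.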

736
00:39:45,967 --> 00:39:48,300
The nice thing about the
compressed trie is the number--

737
00:39:48,300 --> 00:39:50,258
here we knew the number
of branching nodes,

738
00:39:50,258 --> 00:39:51,780
it was at most the
number of leaves.

739
00:39:51,780 --> 00:39:53,510
Over here, the number
of internal nodes

740
00:39:53,510 --> 00:39:54,843
is at most the number of leaves.

741
00:39:54,843 --> 00:39:59,540
So this structure has
order k nodes in total

742
00:39:59,540 --> 00:40:02,160
because we got rid of all
the non-branching nodes.

743
00:40:02,160 --> 00:40:04,852
I guess except the root, the
root might not be branching.

744
00:40:07,500 --> 00:40:09,330
We've got a big O
there to cover us.

745
00:40:12,000 --> 00:40:14,790
And all the things we said
about representing tries here,

746
00:40:14,790 --> 00:40:18,300
you can do the same thing
with a compressed trie.

747
00:40:18,300 --> 00:40:22,990
I need to write
down that 3.5 here.

748
00:40:33,980 --> 00:40:35,990
And in fact, these
results get better because

749
00:40:35,990 --> 00:40:40,880
before order T meant the
number of nodes in the trie.

750
00:40:40,880 --> 00:40:42,730
Now order T will be
the number of nodes

751
00:40:42,730 --> 00:40:45,770
in the compressed trie,
which is actually order k.

752
00:40:45,770 --> 00:40:50,902
So life gets really
good in this world.

753
00:40:50,902 --> 00:40:52,610
I did it in the trie
setting because it's

754
00:40:52,610 --> 00:40:53,760
just simpler to think about.

755
00:40:53,760 --> 00:40:55,968
But really, you would always
store a compressed trie.

756
00:40:55,968 --> 00:40:57,942
There's no point
in storing a trie.

757
00:40:57,942 --> 00:41:00,080
You can still do the
same kinds of searches.

758
00:41:04,010 --> 00:41:09,150
But really, compressed tries
are a warm-up for suffix trees.

759
00:41:09,150 --> 00:41:10,820
So let's talk
about suffix trees.

760
00:41:14,720 --> 00:41:18,910
Suffix trees are
a compressed trie.

761
00:41:18,910 --> 00:41:22,790
So really they should
be called suffix tries.

762
00:41:22,790 --> 00:41:27,050
And occasionally, people
will call them suffix tries.

763
00:41:27,050 --> 00:41:28,745
But most people call
them suffix trees,

764
00:41:28,745 --> 00:41:31,550
so for consistency I'll
call them trees as well.

765
00:41:31,550 --> 00:41:32,423
But they are tries.

766
00:41:42,542 --> 00:41:45,110
I'm going to introduce
some notation here.

767
00:41:53,457 --> 00:41:55,040
With tries, we are
thinking about lots

768
00:41:55,040 --> 00:41:56,240
of different strings.

769
00:41:56,240 --> 00:41:59,590
In this case, we're going back
to our string matching problem.

770
00:41:59,590 --> 00:42:02,940
We have a single text and we
want to preprocess that text.

771
00:42:02,940 --> 00:42:04,940
But we're going to turn
it into multiple strings

772
00:42:04,940 --> 00:42:07,970
by looking at all
suffixes of the string.

773
00:42:07,970 --> 00:42:09,860
This is Python
notation for everything

774
00:42:09,860 --> 00:42:12,590
from letter i onwards.

775
00:42:12,590 --> 00:42:15,440
And we do that for all i,
so that's a lot of strings.

776
00:42:15,440 --> 00:42:18,824
And we build the
compressed trie over them.

777
00:42:18,824 --> 00:42:19,490
That's the idea.

778
00:42:19,490 --> 00:42:22,340
And to make it work out--
because you remember,

779
00:42:22,340 --> 00:42:25,790
with tries we had to append
$ sign to every string.

780
00:42:25,790 --> 00:42:28,700
In this case, we'd just
have to append $ sign to T,

781
00:42:28,700 --> 00:42:31,220
and then all suffixes
will end with a $ sign.

782
00:42:31,220 --> 00:42:33,590
So that covers
us. $ sign, again,

783
00:42:33,590 --> 00:42:36,510
is a character not
appearing in the alphabet.

784
00:42:36,510 --> 00:42:37,372
And that's it.

785
00:42:37,372 --> 00:42:38,330
So that's a definition.

786
00:42:38,330 --> 00:42:39,163
Let's do an example.

787
00:42:49,240 --> 00:42:51,880
At this point, we're going for
this goal of order P query,

788
00:42:51,880 --> 00:42:54,010
order T space.

789
00:42:54,010 --> 00:42:57,130
Suffix trees will be a
way to achieve that goal.

790
00:43:03,820 --> 00:43:10,375
Let's do my favorite
example which is banana.

791
00:43:13,540 --> 00:43:17,650
I had a friend who said, I
know how to spell banana,

792
00:43:17,650 --> 00:43:19,900
I just don't know when to stop.

793
00:43:19,900 --> 00:43:22,990
There's a nice pattern to it
and a lot of repeated letters

794
00:43:22,990 --> 00:43:24,700
and so on.

795
00:43:24,700 --> 00:43:26,230
I've got to number
the characters.

796
00:43:26,230 --> 00:43:28,880
He said that when he was like
six, not when he was older.

797
00:43:31,277 --> 00:43:33,610
It's a little harder when
you're writing it on the board

798
00:43:33,610 --> 00:43:36,580
but we all know how to
spell banana, I hope.

799
00:43:36,580 --> 00:43:37,790
I got it right, right?

800
00:43:37,790 --> 00:43:40,600
It should be 7 letters,
including the $ sign.

801
00:43:43,395 --> 00:43:44,020
There they are.

802
00:43:44,020 --> 00:43:46,190
So there's a suffix which
is the whole string.

803
00:43:46,190 --> 00:43:48,670
There's a suffix which
is a, n, a, n, a, $ sign.

804
00:43:48,670 --> 00:43:50,710
There is a suffix which
is n, a, n, a, $ sign.

805
00:43:50,710 --> 00:43:52,459
There's a suffix which
is a, n, a, $ sign.

806
00:43:52,459 --> 00:43:53,890
Suffix n, a, $ sign. a, $ sign.

807
00:43:53,890 --> 00:43:55,140
And $ sign.

808
00:43:55,140 --> 00:43:58,340
And empty, I suppose, but we're
not going to store that one.

809
00:43:58,340 --> 00:44:01,210
You don't need to.

810
00:44:01,210 --> 00:44:02,422
Cool.

811
00:44:02,422 --> 00:44:04,630
I'm going to cheat a little
bit and look at my figure

812
00:44:04,630 --> 00:44:07,350
because it is a little
bit of thinking.

813
00:44:07,350 --> 00:44:09,400
The final challenge
of this lecture

814
00:44:09,400 --> 00:44:14,560
will be construct this
diagram in linear time.

815
00:44:14,560 --> 00:44:38,115
But I'm, just for
now, going to cheat

816
00:44:38,115 --> 00:44:39,990
because it's a little
tricky to do it and get

817
00:44:39,990 --> 00:44:41,340
all the nodes in sorted order.

818
00:44:57,980 --> 00:44:59,320
So that should give it to us.

819
00:44:59,320 --> 00:45:02,155
And then the suffixes.

820
00:45:02,155 --> 00:45:04,630
Here is another color.

821
00:45:04,630 --> 00:45:14,810
6, 5, 3, 1, 0, 4, 2.

822
00:45:14,810 --> 00:45:16,420
Cool.

823
00:45:16,420 --> 00:45:18,580
This I claim is a
suffix tree of banana.

824
00:45:18,580 --> 00:45:20,530
You see the banana substring.

825
00:45:20,530 --> 00:45:24,670
Then the next one is
a, n, a, n, a, $ sign.

826
00:45:24,670 --> 00:45:27,700
Then the next one is
n, a, n, a, $ sign.

827
00:45:27,700 --> 00:45:31,570
Then the next one
is a, n, a, $ sign.

828
00:45:31,570 --> 00:45:34,380
Next one is n, a, $ sign.

829
00:45:34,380 --> 00:45:35,840
Next one is a, $ sign.

830
00:45:35,840 --> 00:45:37,710
And then $ sign.

831
00:45:37,710 --> 00:45:39,917
So that's a nice,
clean representation

832
00:45:39,917 --> 00:45:40,750
of all the suffixes.

833
00:45:40,750 --> 00:45:43,000
And you can see that if
you wanted to search from

834
00:45:43,000 --> 00:45:45,250
the middle of this string--
suppose I want to search

835
00:45:45,250 --> 00:45:46,510
for n-a-n--

836
00:45:46,510 --> 00:45:47,490
then it's right there.

837
00:45:47,490 --> 00:45:51,700
Just do n, a, n, then I'm done.

838
00:45:51,700 --> 00:45:54,130
This virtual node
in the middle here

839
00:45:54,130 --> 00:45:56,980
along the one third of
the way down the edge,

840
00:45:56,980 --> 00:46:00,100
that represents n-a-n.

841
00:46:00,100 --> 00:46:02,170
And indeed, if you look
at the descendant leaf,

842
00:46:02,170 --> 00:46:05,470
that corresponds to an
occurrence of n-a-n.

843
00:46:05,470 --> 00:46:08,830
If I was going to
look for a-n, I

844
00:46:08,830 --> 00:46:12,880
would do a, n, so
halfway down this edge.

845
00:46:12,880 --> 00:46:17,920
And then this subtree represents
all the occurrences of a-n.

846
00:46:17,920 --> 00:46:19,210
Think about it.

847
00:46:19,210 --> 00:46:21,220
There's two of them--

848
00:46:21,220 --> 00:46:25,450
One that starts at position 3,
one that starts at position 1.

849
00:46:25,450 --> 00:46:27,067
Here's one occurrence
of a-n, here's

850
00:46:27,067 --> 00:46:28,150
another occurrence of a-n.

851
00:46:28,150 --> 00:46:29,858
This works even when
they're overlapping.

852
00:46:29,858 --> 00:46:32,717
If I search for a-n-a,
I would get here.

853
00:46:32,717 --> 00:46:35,050
And then these are the two
occurrences of a-n-a and they

854
00:46:35,050 --> 00:46:36,400
actually overlap each other--

855
00:46:36,400 --> 00:46:38,764
this one and this one.
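A minimal sketch of the banana searches above, assuming a naive uncompressed trie over all suffixes of the text plus "$". It takes quadratic time and space to build, so it is not the linear-time suffix tree construction; the searches behave the same way. The names suffix_trie, leaves and occurrences, and the "start" key stored at each leaf, are assumptions for this example.

```python
def suffix_trie(text):
    """Naive trie over every suffix text[i:] of text + '$'."""
    text += "$"
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["start"] = i  # leaf records where this suffix begins
    return root


def leaves(node):
    """Start positions below a node, visiting children in sorted order."""
    if "start" in node:
        return [node["start"]]
    found = []
    for ch in sorted(node):
        found += leaves(node[ch])
    return found


def occurrences(root, pattern):
    """Walk down along pattern; the leaves below are its occurrences."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # fell off the tree: no occurrence
        node = node[ch]
    return leaves(node)


root = suffix_trie("banana")
```

Reading all leaves in order gives 6, 5, 3, 1, 0, 4, 2, the same leaf labels as on the board, and searching for "an" or "ana" returns positions 3 and 1.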

856
00:46:38,764 --> 00:46:40,180
So this is a great
data structure,

857
00:46:40,180 --> 00:46:43,940
it solves what we need.

858
00:46:43,940 --> 00:46:46,512
It's all substring searching.

859
00:47:01,460 --> 00:47:03,350
Applications of suffix trees.

860
00:47:18,570 --> 00:47:21,860
Just do a search in the trie
for a particular pattern.

861
00:47:21,860 --> 00:47:42,800
We get a subtree representing all
of the occurrences of P in T.

862
00:47:42,800 --> 00:47:44,150
So this is great.

863
00:47:44,150 --> 00:47:47,690
In order P time, walking
down this structure,

864
00:47:47,690 --> 00:47:49,820
I can figure out
all the occurrences.

865
00:47:49,820 --> 00:47:52,190
And then, if I want to
know how many there were,

866
00:47:52,190 --> 00:47:54,110
I could just store
subtree sizes--

867
00:47:54,110 --> 00:47:55,940
number of leaves
below every node.

868
00:47:55,940 --> 00:47:59,270
If I wanted to list
them, I could just

869
00:47:59,270 --> 00:48:00,890
do an in-order traversal.

870
00:48:00,890 --> 00:48:03,230
And I'll even get them in order.

871
00:48:03,230 --> 00:48:08,900
So in particular, if I wanted to
list the first 10 occurrences,

872
00:48:08,900 --> 00:48:12,800
I could store the left-most leaf
from every node, teleport down

873
00:48:12,800 --> 00:48:14,870
to the first occurrence
in constant time.

874
00:48:14,870 --> 00:48:17,600
And then I could just have a
linked list of all the leaves.

875
00:48:17,600 --> 00:48:19,760
So once I find the
first one, I can just

876
00:48:19,760 --> 00:48:22,880
follow until I find, oh,
that's not an occurrence of P.

877
00:48:22,880 --> 00:48:25,520
So I can list the first
k of them in order k time

878
00:48:25,520 --> 00:48:28,160
once I've done the
search of order P time.
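A minimal sketch of that search, not from the lecture itself: it uses a plain, uncompressed suffix trie built naively in quadratic time, so it only illustrates the walk-down-then-read-the-subtree idea. The names `build_suffix_trie`, `leaves_in_order`, and `occurrences` are hypothetical, and `$` is assumed to sort before every letter.

```python
LEAF = None  # sentinel key; never collides with a one-character edge label


def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a nested-dict trie (quadratic size)."""
    terminated = text + "$"
    root = {}
    for start in range(len(terminated)):
        node = root
        for ch in terminated[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start  # remember where this suffix begins
    return root


def leaves_in_order(node):
    """Yield suffix start positions below node, visiting children in sorted order."""
    if LEAF in node:
        yield node[LEAF]
    for ch in sorted(key for key in node if key is not LEAF):
        yield from leaves_in_order(node[ch])


def occurrences(root, pattern):
    """Walk down one character at a time, then report the leaves of that subtree."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # fell off the trie: no occurrence
        node = node[ch]
    return list(leaves_in_order(node))


trie = build_suffix_trie("banana")
print(occurrences(trie, "an"))   # [3, 1]
print(occurrences(trie, "nan"))  # [2]
```

Here the count is just the length of the returned list. Storing the number of leaves at each node, as described above, would avoid walking the subtree at all.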

879
00:48:28,160 --> 00:48:30,110
So this is really
good searching.

880
00:48:30,110 --> 00:48:32,360
And it's the ideal situation.

881
00:48:32,360 --> 00:48:34,520
You can list any information
you want about all

882
00:48:34,520 --> 00:48:38,150
of the answers in time optimal
in the size of the output.

883
00:48:40,670 --> 00:48:43,640
How big is this data structure?

884
00:48:43,640 --> 00:48:51,008
Well, there are T suffixes,
so k is the size of T.

885
00:48:51,008 --> 00:48:53,630
And when we look at our
trie representations,

886
00:48:53,630 --> 00:48:55,730
our general goal was to get--

887
00:48:55,730 --> 00:48:59,817
here, capital T was
the sum of the lengths.

888
00:48:59,817 --> 00:49:01,400
Well, sum of the
lengths is not good--

889
00:49:01,400 --> 00:49:02,702
that would be quadratic--

890
00:49:02,702 --> 00:49:04,160
sum of the lengths
of the suffixes.

891
00:49:04,160 --> 00:49:08,420
But we also said, or the
number of nodes in the trie.

892
00:49:08,420 --> 00:49:10,745
And we know the number
of leaves in this trie

893
00:49:10,745 --> 00:49:15,054
is exactly the size of T. And so
because it's a compressed trie,

894
00:49:15,054 --> 00:49:16,470
the number of
internal nodes

895
00:49:16,470 --> 00:49:19,640
is also less than the size of
T. So the total number of nodes

896
00:49:19,640 --> 00:49:24,890
here is order T. And
so if we use any

897
00:49:24,890 --> 00:49:26,750
of the reasonable
representations,

898
00:49:26,750 --> 00:49:27,900
we get order T space.
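One way to spell out that counting argument, assuming every internal node of the compressed trie (including the root) has at least two children. The symbols $L$ (number of leaves) and $I$ (number of internal nodes) are introduced here only for this sketch:

```latex
\begin{align*}
\#\text{edges} &= L + I - 1 && \text{(a tree on $L + I$ nodes)} \\
\#\text{edges} &\ge 2I && \text{(each internal node has at least two children)} \\
\Rightarrow\quad I &\le L - 1, \qquad L + I \le 2L - 1 = O(|T|).
\end{align*}
```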

899
00:49:33,020 --> 00:49:36,020
Now, there's one issue which
is, how long does a search for P

900
00:49:36,020 --> 00:49:36,950
cost?

901
00:49:36,950 --> 00:49:38,630
And it depends on
our representation,

902
00:49:38,630 --> 00:49:41,180
it depends how quickly
we can traverse a node.

903
00:49:41,180 --> 00:49:42,860
If we use hashing--

904
00:49:42,860 --> 00:49:51,740
method 3-- use hashing,
then we get order P time.

905
00:49:55,310 --> 00:49:58,100
But the trouble with
hashing is it permutes

906
00:49:58,100 --> 00:50:00,650
the children of every node.

907
00:50:00,650 --> 00:50:02,360
So in that situation,
the leaves will not

908
00:50:02,360 --> 00:50:05,799
be ordered in the same way that
they're ordered in the string.

909
00:50:05,799 --> 00:50:08,090
So if you really want to be
able to find the first five

910
00:50:08,090 --> 00:50:11,060
occurrences of the pattern
P, you can't use hashing.

911
00:50:11,060 --> 00:50:12,680
You can find some
five occurrences

912
00:50:12,680 --> 00:50:15,200
but you won't find the
first five in the usual ordering

913
00:50:15,200 --> 00:50:16,770
of the string.

914
00:50:16,770 --> 00:50:19,280
So if you really
want the first five

915
00:50:19,280 --> 00:50:23,750
and you want them in order,
then you should use trays--

916
00:50:23,750 --> 00:50:26,100
this method 6 that we used.

917
00:50:26,100 --> 00:50:26,780
6?

918
00:50:26,780 --> 00:50:28,220
5.

919
00:50:28,220 --> 00:50:35,230
If we use trays, then it will
be order P times log Sigma--

920
00:50:38,050 --> 00:50:40,640
sorry, order P plus log Sigma.

921
00:50:40,640 --> 00:50:43,720
That was our query time.

922
00:50:43,720 --> 00:50:47,030
Here, P plus log Sigma.

923
00:50:47,030 --> 00:50:50,240
Small penalty to pay but the
nice thing is then your answers

924
00:50:50,240 --> 00:50:52,310
are represented in order.

925
00:50:52,310 --> 00:50:56,840
No permutation, no
hashing, no randomization.

926
00:50:56,840 --> 00:50:58,940
This is the reason suffix
trees were invented--

927
00:50:58,940 --> 00:51:00,680
they let you do searches fast.

928
00:51:00,680 --> 00:51:03,500
But actually, they let you
do a ton of things fast.

929
00:51:03,500 --> 00:51:05,930
And I want to quickly
give you an overview

930
00:51:05,930 --> 00:51:08,999
of the zillions of things you
can do with the suffix tree.

931
00:51:08,999 --> 00:51:10,790
And then I want to get
to how to build them

932
00:51:10,790 --> 00:51:16,205
in linear time, which has some
interesting algorithms/data

933
00:51:16,205 --> 00:51:19,184
structures.

934
00:51:19,184 --> 00:51:20,600
I already talked
about if you want

935
00:51:20,600 --> 00:51:21,980
to find the first
k occurrences, you

936
00:51:21,980 --> 00:51:23,150
can do that in order k time.

937
00:51:23,150 --> 00:51:25,280
If you want to find the
number of occurrences,

938
00:51:25,280 --> 00:51:26,654
you can do that
in constant time,

939
00:51:26,654 --> 00:51:29,174
just by augmenting
the subtree sizes.

940
00:51:29,174 --> 00:51:30,590
Here's another
thing you could do.

941
00:51:30,590 --> 00:51:32,990
Suppose you have a
very long string.

942
00:51:32,990 --> 00:51:35,160
I mean think of T as
an entire document.

943
00:51:35,160 --> 00:51:38,360
You know, it could be the
Merriam-Webster dictionary

944
00:51:38,360 --> 00:51:41,320
or it could be the web.

945
00:51:41,320 --> 00:51:44,069
We're imagining T to be
the huge data structure.

946
00:51:44,069 --> 00:51:46,610
And then we're able to search
for substrings within that data

947
00:51:46,610 --> 00:51:50,130
structure very fast.

948
00:51:50,130 --> 00:51:52,430
So that's cool.

949
00:51:52,430 --> 00:51:53,680
Here's an interesting puzzle.

950
00:51:53,680 --> 00:51:57,790
What is the longest substring--
what is the longest string that

951
00:51:57,790 --> 00:52:00,280
appears twice on the web?

952
00:52:00,280 --> 00:52:02,260
This is called the longest
repeated substring.

953
00:52:02,260 --> 00:52:04,610
Could be overlapping, maybe not.

954
00:52:04,610 --> 00:52:07,690
Well, you take the web, you
throw it in the suffix tree--

955
00:52:07,690 --> 00:52:09,500
not sure anyone could
actually do that--

956
00:52:09,500 --> 00:52:11,762
but small part of the web.

957
00:52:11,762 --> 00:52:13,345
Dictionary-- this
would be no problem.

958
00:52:17,260 --> 00:52:18,560
Wikipedia would be feasible.

959
00:52:18,560 --> 00:52:21,280
You take Wikipedia, you
throw it in the suffix tree.

960
00:52:21,280 --> 00:52:24,820
And what I'm interested
in is, basically,

961
00:52:24,820 --> 00:52:29,230
a node that has two, at
least two descendant leaves.

962
00:52:29,230 --> 00:52:31,749
And if I'm counting the number
of leaves at every node,

963
00:52:31,749 --> 00:52:33,790
I could just do one pass
over this data structure

964
00:52:33,790 --> 00:52:35,529
and find what are all
the nodes that have

965
00:52:35,529 --> 00:52:36,820
at least two descendant leaves.

966
00:52:36,820 --> 00:52:39,520
That's all the internal nodes.

967
00:52:39,520 --> 00:52:42,280
And then among them I'd also
like to know how deep is it.

968
00:52:42,280 --> 00:52:46,330
Because the depth corresponds
to how long the string is.

969
00:52:46,330 --> 00:52:48,280
This one is a-n-a
so this one has,

970
00:52:48,280 --> 00:52:51,036
I call it, a letter depth of 3.

971
00:52:51,036 --> 00:52:52,410
This one has a
letter depth of 1.

972
00:52:52,410 --> 00:52:53,785
This one has a
letter depth of 2.

973
00:52:53,785 --> 00:52:55,784
So I just want to find
the deepest node that has

974
00:52:55,784 --> 00:52:57,130
at least two descendant leaves.

975
00:52:57,130 --> 00:53:00,151
In linear time, I could find
the longest repeated substring.

976
00:53:00,151 --> 00:53:01,900
Or I could find the
longest substring that

977
00:53:01,900 --> 00:53:03,520
appears five times or whatever.

978
00:53:03,520 --> 00:53:05,530
I just do one pass
over this thing,

979
00:53:05,530 --> 00:53:08,087
find the deepest node that
has my threshold of leaves.

980
00:53:08,087 --> 00:53:09,670
So that's kind of a
neat thing you can

981
00:53:09,670 --> 00:53:11,440
do in linear time on a string.
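A rough sketch of the deepest-node idea, again on a naive uncompressed suffix trie rather than a real linear-time suffix tree, so the running time here is quadratic. The function name and the `min_count` parameter are hypothetical; in an uncompressed trie the letter depth of a node is simply the length of the path to it.

```python
def longest_repeated_substring(text, min_count=2):
    """Return a longest substring occurring at least min_count times (min_count >= 2)."""
    terminated = text + "$"
    root = {}
    for start in range(len(terminated)):
        node = root
        for ch in terminated[start:]:
            node = node.setdefault(ch, {})

    best = ""

    def count_leaves(node, path):
        """Return the number of leaves below node; track the deepest qualifying path."""
        nonlocal best
        if not node:  # no children: this is a leaf, i.e. one suffix
            return 1
        total = sum(count_leaves(child, path + ch) for ch, child in node.items())
        if total >= min_count and len(path) > len(best):
            best = path
        return total

    count_leaves(root, "")
    return best


print(longest_repeated_substring("banana"))               # ana
print(longest_repeated_substring("banana", min_count=3))  # a
```

The two occurrences of `ana` in `banana` overlap, which matches the remark that overlapping repeats are found as well.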

982
00:53:14,780 --> 00:53:16,580
Here's another fun one.

983
00:53:16,580 --> 00:53:18,920
Suppose I have
this giant string.

984
00:53:18,920 --> 00:53:21,930
And I just want to compare
two substrings in it.

985
00:53:21,930 --> 00:53:25,730
So here's my giant string.

986
00:53:25,730 --> 00:53:29,360
And suppose I want to measure
how long is the repeated

987
00:53:29,360 --> 00:53:30,290
substring.

988
00:53:30,290 --> 00:53:31,940
So I say, well,
I've got position i,

989
00:53:31,940 --> 00:53:32,944
I've got position j.

990
00:53:32,944 --> 00:53:35,360
Let's say I already know that
they match for a little bit.

991
00:53:35,360 --> 00:53:37,220
I want to know, how
long do they match?

992
00:53:37,220 --> 00:53:40,790
How far can I go to the right
and have them still match?

993
00:53:43,217 --> 00:53:44,050
How could I do that?

994
00:53:44,050 --> 00:53:46,580
Well, I could look at
the suffix starting at i.

995
00:53:46,580 --> 00:53:48,510
That corresponds to
a leaf over here.

996
00:53:48,510 --> 00:53:51,380
And I could look at the
suffix starting at j.

997
00:53:51,380 --> 00:53:55,100
That corresponds
to some other leaf.

998
00:53:55,100 --> 00:53:59,000
And what is the length of
the longest common prefix

999
00:53:59,000 --> 00:54:02,560
of those two suffixes
in the suffix tree?

1000
00:54:07,040 --> 00:54:12,150
Three letters-- LCA.

1001
00:54:12,150 --> 00:54:16,110
If I take the LCA of those
two leaves-- for example,

1002
00:54:16,110 --> 00:54:19,270
I take these two leaves--

1003
00:54:19,270 --> 00:54:21,970
the LCA gives me the
longest common prefix.

1004
00:54:21,970 --> 00:54:23,500
Then they branch.

1005
00:54:23,500 --> 00:54:25,780
So longest common prefix
of these two suffixes

1006
00:54:25,780 --> 00:54:28,360
is the letter a, so
it's just length 1.

1007
00:54:28,360 --> 00:54:31,030
And again, if I label every
node with the letter depth,

1008
00:54:31,030 --> 00:54:33,340
I can figure out exactly
how long these guys match,

1009
00:54:33,340 --> 00:54:35,450
even if they overlap.

1010
00:54:35,450 --> 00:54:37,150
So in constant time--
because we already

1011
00:54:37,150 --> 00:54:39,670
have a constant time LCA query.

1012
00:54:39,670 --> 00:54:41,590
Linear space,
constant time query.

1013
00:54:41,590 --> 00:54:43,031
Given any two
positions i and j, I

1014
00:54:43,031 --> 00:54:45,280
can tell you how long they
match for in constant time.

1015
00:54:45,280 --> 00:54:47,549
Boom-- instantaneously.

1016
00:54:47,549 --> 00:54:48,340
It's kind of crazy.

1017
00:54:48,340 --> 00:54:51,310
So you can do tons of
these queries instantly.

1018
00:54:51,310 --> 00:54:53,770
That's one reason why
people care about LCAs,

1019
00:54:53,770 --> 00:54:54,770
there are other reasons.

1020
00:54:54,770 --> 00:54:58,630
But mostly LCAs were
developed for suffix trees

1021
00:54:58,630 --> 00:54:59,800
to answer queries like that.

1022
00:55:02,650 --> 00:55:03,310
Got some more.

1023
00:55:08,940 --> 00:55:11,010
Why don't I just write--

1024
00:55:11,010 --> 00:55:19,620
LCP of one suffix
and another suffix

1025
00:55:19,620 --> 00:55:22,250
is equivalent to an LCA query.

1026
00:55:22,250 --> 00:55:25,050
And so we can do
that in constant time

1027
00:55:25,050 --> 00:55:26,462
after pre-processing.
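To make the LCP-equals-LCA statement concrete, here is a small sketch under simplifying assumptions: an uncompressed suffix trie with parent pointers, and an LCA found by walking up from both leaves. That walk is not constant time; it only stands in for the constant-time LCA structure. `Node`, `build_leaves`, and `lcp` are hypothetical names.

```python
class Node:
    def __init__(self, parent=None, depth=0):
        self.parent = parent
        self.depth = depth      # letter depth: number of characters from the root
        self.children = {}


def build_leaves(text):
    """Build a naive suffix trie of text + '$'; return the leaf of each suffix."""
    terminated = text + "$"
    root = Node()
    leaves = [None] * len(terminated)
    for start in range(len(terminated)):
        node = root
        for ch in terminated[start:]:
            if ch not in node.children:
                node.children[ch] = Node(node, node.depth + 1)
            node = node.children[ch]
        leaves[start] = node    # array from suffix index to its leaf
    return leaves


def lcp(leaves, i, j):
    """Length of the longest common prefix of suffixes i and j, via their LCA."""
    a, b = leaves[i], leaves[j]
    while a is not b:
        # Move the deeper node up (ties move a); the two meet at the LCA.
        if a.depth >= b.depth:
            a = a.parent
        else:
            b = b.parent
    return a.depth


leaves = build_leaves("banana")
print(lcp(leaves, 1, 3))  # 3, the shared prefix "ana"
print(lcp(leaves, 1, 5))  # 1, the shared prefix "a"
```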

1028
00:55:38,600 --> 00:55:39,920
Here's another one.

1029
00:55:39,920 --> 00:55:52,180
Suppose I want to find all
occurrences of T i to j.

1030
00:55:55,810 --> 00:55:57,670
So I give you a
substring and I want

1031
00:55:57,670 --> 00:56:00,070
to know where does that occur.

1032
00:56:00,070 --> 00:56:03,800
The substring is restricted
to come from the text.

1033
00:56:03,800 --> 00:56:05,080
Now, this is a little subtle.

1034
00:56:05,080 --> 00:56:08,860
Of course, I could solve it
in j minus i plus 1 time.

1035
00:56:08,860 --> 00:56:10,660
I just do the search.

1036
00:56:10,660 --> 00:56:14,470
But what if I want to
do it in constant time?

1037
00:56:14,470 --> 00:56:16,435
Maybe this is a
really big substring.

1038
00:56:16,435 --> 00:56:18,740
But I still know it
appears multiple times.

1039
00:56:18,740 --> 00:56:21,120
I want to know how many
times does it appear.

1040
00:56:21,120 --> 00:56:24,100
I claim I can do this
in constant time.

1041
00:56:24,100 --> 00:56:26,320
How?

1042
00:56:26,320 --> 00:56:30,040
This is a level ancestor query.

1043
00:56:30,040 --> 00:56:32,050
Why is it a level
ancestor query?

1044
00:56:32,050 --> 00:56:35,380
If I look at the
suffix starting at i,

1045
00:56:35,380 --> 00:56:38,470
and then I just want to
trim off, I want to stop.

1046
00:56:38,470 --> 00:56:40,510
Or I don't care about
the entire suffix,

1047
00:56:40,510 --> 00:56:43,330
I just want to go up to j.

1048
00:56:43,330 --> 00:56:45,580
It's like saying, well,
suppose I'm looking

1049
00:56:45,580 --> 00:56:47,590
for occurrences of a-n-a.

1050
00:56:47,590 --> 00:56:51,520
So I go and I start at the
first occurrence of a-n-a,

1051
00:56:51,520 --> 00:56:54,930
which is a-n-a-n-a-$, so this
is the leaf corresponding

1052
00:56:54,930 --> 00:56:55,974
to a-n-a.

1053
00:56:55,974 --> 00:56:58,140
And then if I want to find
all occurrences of a-n-a,

1054
00:56:58,140 --> 00:57:03,910
I just need to go up to the
ancestor that represents a-n-a.

1055
00:57:03,910 --> 00:57:09,787
This is what I call a
weighted level ancestor.

1056
00:57:09,787 --> 00:57:11,370
That's not quite the
problem we solved

1057
00:57:11,370 --> 00:57:18,490
in the last lecture, lecture
15, because now it's weighted.

1058
00:57:18,490 --> 00:57:28,577
So it's level ancestor j minus
i of the T i suffix leaf.

1059
00:57:28,577 --> 00:57:30,160
So I find this leaf,
which I just have

1060
00:57:30,160 --> 00:57:31,450
an array of all the leaves.

1061
00:57:31,450 --> 00:57:34,737
Given a suffix, tell me what
leaf it is in the suffix tree.

1062
00:57:34,737 --> 00:57:36,820
And then I want to find
the j minus i-th ancestor,

1063
00:57:36,820 --> 00:57:39,820
except the edges don't
just have unit length.

1064
00:57:39,820 --> 00:57:42,160
So here I want to find
the third ancestor,

1065
00:57:42,160 --> 00:57:45,190
except it's really the ancestor
in the compressed trie.

1066
00:57:45,190 --> 00:57:47,240
I want to do the j
minus i-th ancestor

1067
00:57:47,240 --> 00:57:49,900
in the suffix
trie, but what

1068
00:57:49,900 --> 00:57:51,850
I have is a compressed trie.

1069
00:57:51,850 --> 00:57:54,880
And so these edges are labeled
with how many characters

1070
00:57:54,880 --> 00:57:58,000
are on them and I got
to deal with that.

1071
00:57:58,000 --> 00:58:00,980
Fortunately, the data structure
we gave for a level ancestor--

1072
00:58:00,980 --> 00:58:03,040
which was constant time
query, linear space--

1073
00:58:03,040 --> 00:58:05,140
can be fairly easily
adapted to weights.

1074
00:58:08,170 --> 00:58:10,710
Not quite in
constant time though.

1075
00:58:10,710 --> 00:58:14,860
It can be solved
in log log n time.

1076
00:58:14,860 --> 00:58:17,440
And I think that's optimal.

1077
00:58:17,440 --> 00:58:23,710
Because if your thing is
a single path with maybe

1078
00:58:23,710 --> 00:58:28,720
the occasional branch, then
finding your i-th ancestor here

1079
00:58:28,720 --> 00:58:31,430
is like solving a
predecessor problem.

1080
00:58:31,430 --> 00:58:36,190
Because you say, well,
from the i-th position up,

1081
00:58:36,190 --> 00:58:40,150
I want to know what
is the previous--

1082
00:58:40,150 --> 00:58:41,887
I want to round
up or round down.

1083
00:58:41,887 --> 00:58:43,720
So I want to do a
predecessor or a successor

1084
00:58:43,720 --> 00:58:45,600
on this straight line.

1085
00:58:45,600 --> 00:58:47,200
And so for a
predecessor you need

1086
00:58:47,200 --> 00:58:51,610
log log time for
the right parameters

1087
00:58:51,610 --> 00:58:53,206
and this can be achieved.

1088
00:58:53,206 --> 00:58:55,330
And the basic idea is you
use ladder decomposition,

1089
00:58:55,330 --> 00:58:56,440
just like before.

1090
00:58:56,440 --> 00:58:58,840
But now a ladder can't be
represented by an array

1091
00:58:58,840 --> 00:59:01,760
because there are lots of
absent places in the array.

1092
00:59:01,760 --> 00:59:04,540
Now instead, use a predecessor structure,
a Van Emde Boas tree,

1093
00:59:04,540 --> 00:59:06,260
to represent a ladder.

1094
00:59:06,260 --> 00:59:07,870
So that's basically all you do.

1095
00:59:07,870 --> 00:59:13,630
Van Emde Boas
represents a ladder.

1096
00:59:13,630 --> 00:59:15,517
That's what you do
in the top structure.

1097
00:59:15,517 --> 00:59:17,350
Remember, we had
indirection, leaf trimming,

1098
00:59:17,350 --> 00:59:19,058
top was this thing,
ladder decomposition.

1099
00:59:19,058 --> 00:59:21,317
Bottom was lookup tables.

1100
00:59:21,317 --> 00:59:23,650
The other problem is you can't
use lookup tables anymore

1101
00:59:23,650 --> 00:59:26,530
because in one of
these tiny trees

1102
00:59:26,530 --> 00:59:28,030
you could have a
super long path.

1103
00:59:28,030 --> 00:59:30,040
It's non-branching,
they got compressed.

1104
00:59:30,040 --> 00:59:31,420
And you can't
afford to enumerate

1105
00:59:31,420 --> 00:59:33,410
all possible situations.

1106
00:59:33,410 --> 00:59:35,092
It's kind of annoying.

1107
00:59:35,092 --> 00:59:37,300
So instead of using lookup
tables-- this was actually

1108
00:59:37,300 --> 00:59:40,960
an idea from some students
in this class last time

1109
00:59:40,960 --> 00:59:43,750
I taught this material--
they said, oh, well, instead

1110
00:59:43,750 --> 00:59:47,180
of using a lookup table, you
can use ladder decomposition.

1111
00:59:47,180 --> 00:59:49,960
So down here, in
the compressed tree,

1112
00:59:49,960 --> 00:59:52,240
we have log n different nodes.

1113
00:59:52,240 --> 00:59:55,090
If you use ladder decomposition
on that thing-- but not

1114
00:59:55,090 --> 00:59:56,030
the hybrid structure.

1115
00:59:56,030 --> 00:59:58,360
Remember, we used jump
pointers plus ladders.

1116
00:59:58,360 --> 00:59:59,890
Jump pointers still
work here, just

1117
00:59:59,890 --> 01:00:03,160
you have to round them
to a different place.

1118
01:00:03,160 --> 01:00:04,750
Down here, I'm not
going to try to do

1119
01:00:04,750 --> 01:00:06,010
jump pointers plus ladders.

1120
01:00:06,010 --> 01:00:07,210
I'll just do ladders.

1121
01:00:07,210 --> 01:00:10,120
And remember, just ladders
gave us a log n query time.

1122
01:00:10,120 --> 01:00:18,300
But now n is log T. And so
we get log log T query time.

1123
01:00:18,300 --> 01:00:20,050
And that's, basically,
all you have to do.

1124
01:00:22,960 --> 01:00:24,960
So you're always jumping
to the top of a ladder.

1125
01:00:24,960 --> 01:00:27,262
You'll only have to
traverse log log T ladders.

1126
01:00:27,262 --> 01:00:28,720
The very last ladder
you might have

1127
01:00:28,720 --> 01:00:32,500
to do a predecessor query that
will cost you log log log T.

1128
01:00:32,500 --> 01:00:35,050
But overall, it will
be log log T time just

1129
01:00:35,050 --> 01:00:39,730
by this kind of tweak to our
level ancestor data structure.

1130
01:00:39,730 --> 01:00:43,120
So I thought that was
kind of a fun connection.

1131
01:00:43,120 --> 01:00:46,450
This is the reason, essentially,
level ancestors were developed.

1132
01:00:46,450 --> 01:00:48,460
And people use them
because you can

1133
01:00:48,460 --> 01:00:51,800
do these kinds of things
in nearly constant time,

1134
01:00:51,800 --> 01:00:54,530
even if the substring is huge.

1135
01:00:54,530 --> 01:00:57,760
So maybe I know ahead
of time all the queries

1136
01:00:57,760 --> 01:00:59,860
I might want to do.

1137
01:00:59,860 --> 01:01:03,310
I just throw them into the
text, just add them in there.

1138
01:01:03,310 --> 01:01:05,770
Then I've cut these
substrings, they're now

1139
01:01:05,770 --> 01:01:07,200
represented in the suffix tree.

1140
01:01:07,200 --> 01:01:10,480
Now any substring I want
to query in log log n time,

1141
01:01:10,480 --> 01:01:13,960
I can find all the
occurrences of that string,

1142
01:01:13,960 --> 01:01:16,670
even if the substring is huge.

1143
01:01:16,670 --> 01:01:19,060
So if you know what
queries you want,

1144
01:01:19,060 --> 01:01:20,980
you can preprocess
them and solve them

1145
01:01:20,980 --> 01:01:24,430
even faster than order P time.

1146
01:01:24,430 --> 01:01:25,900
Cool.

1147
01:01:25,900 --> 01:01:32,480
Another thing you can do is
represent multiple documents.

1148
01:01:32,480 --> 01:01:35,580
And that's what I was
sort of getting at there.

1149
01:01:35,580 --> 01:01:37,270
If you have multiple
documents-- say,

1150
01:01:37,270 --> 01:01:39,670
you're storing the
entire web or Wikipedia.

1151
01:01:39,670 --> 01:01:41,560
Like there's multiple pages.

1152
01:01:41,560 --> 01:01:43,480
You want to separate them.

1153
01:01:43,480 --> 01:01:47,860
All you need to do is say,
OK, I'll take my first string

1154
01:01:47,860 --> 01:01:49,990
and then put a special
$ sign after it.

1155
01:01:49,990 --> 01:01:52,980
Then take my second string,
put a special $ sign after it.

1156
01:01:52,980 --> 01:01:56,860
And take my k-th string and
put a special $ sign after it.

1157
01:01:56,860 --> 01:01:59,710
Just concatenate them with
different $ signs in between

1158
01:01:59,710 --> 01:02:00,460
them.

1159
01:02:00,460 --> 01:02:03,885
Then build the suffix tree on
this thing which I'll call T.

1160
01:02:03,885 --> 01:02:06,010
So you can use the same
suffix tree data structure,

1161
01:02:06,010 --> 01:02:08,140
but now, in some sense,
you're representing

1162
01:02:08,140 --> 01:02:11,964
all of these documents and
all the ways they interweave.

1163
01:02:11,964 --> 01:02:13,630
Because there are
some shared substrings

1164
01:02:13,630 --> 01:02:15,838
here that are shared by
this, and this, and whatever.

1165
01:02:15,838 --> 01:02:18,070
And those will be represented
in the same structure.

1166
01:02:18,070 --> 01:02:20,050
Or I can do a search and
then I've effectively

1167
01:02:20,050 --> 01:02:23,500
found all the documents
that contain it.

1168
01:02:23,500 --> 01:02:25,570
One issue, though.

1169
01:02:25,570 --> 01:02:27,820
Suppose, I want to find all
the documents containing

1170
01:02:27,820 --> 01:02:31,390
the word MIT or something.

1171
01:02:31,390 --> 01:02:34,927
Maybe all k of them match,
maybe one document matches,

1172
01:02:34,927 --> 01:02:36,010
maybe two documents match.

1173
01:02:36,010 --> 01:02:37,930
Suppose, two documents match.

1174
01:02:37,930 --> 01:02:40,990
The first document mentions
MIT a billion times.

1175
01:02:40,990 --> 01:02:46,330
The second document
has MIT in it once.

1176
01:02:46,330 --> 01:02:47,980
Then suffix trees
are kind of annoying

1177
01:02:47,980 --> 01:02:50,980
because they will find that
billion and one matches

1178
01:02:50,980 --> 01:02:51,907
as a subtree.

1179
01:02:51,907 --> 01:02:54,490
But if I just want to know the
answer, oh, these two documents

1180
01:02:54,490 --> 01:02:57,070
match, I'd like to do
that in order 2 time,

1181
01:02:57,070 --> 01:03:02,230
not order billion time,
to use technical terms.

1182
01:03:02,230 --> 01:03:08,080
And that is called the document
retrieval problem or a document

1183
01:03:08,080 --> 01:03:09,830
retrieval data structure.

1184
01:03:09,830 --> 01:03:14,320
This is a problem considered
by Muthukrishnan in 2002.

1185
01:03:14,320 --> 01:03:22,510
Document retrieval you can
do in order P plus number

1186
01:03:22,510 --> 01:03:26,150
of documents matching.

1187
01:03:26,150 --> 01:03:30,449
So if I want to list all
the documents that match,

1188
01:03:30,449 --> 01:03:32,740
I could do it in order the number
of documents that match,

1189
01:03:32,740 --> 01:03:37,270
not order the number of
occurrences of the string.

1190
01:03:37,270 --> 01:03:39,760
So I still got to do the
P search in the beginning,

1191
01:03:39,760 --> 01:03:41,920
and then this is better.

1192
01:03:41,920 --> 01:03:45,340
And the funny thing is the
solution to this data structure

1193
01:03:45,340 --> 01:03:49,717
uses RMQ, range minimum
queries, from last lecture.

1194
01:03:49,717 --> 01:03:51,050
So let me tell you how it works.

1195
01:03:51,050 --> 01:03:52,133
It's actually very simple.

1196
01:03:56,730 --> 01:04:01,460
And then I think we'll move on
to how to build a suffix tree.

1197
01:04:01,460 --> 01:04:03,500
So document retrieval.

1198
01:04:08,220 --> 01:04:09,470
Here's what we're going to do.

1199
01:04:25,040 --> 01:04:27,530
Remember, these different $
signs i represent different

1200
01:04:27,530 --> 01:04:30,230
documents.

1201
01:04:30,230 --> 01:04:32,450
I want to remember
which suffixes

1202
01:04:32,450 --> 01:04:35,060
came from the same document.

1203
01:04:35,060 --> 01:04:40,790
So at every $ sign i, I
want to store the number

1204
01:04:40,790 --> 01:04:44,990
of the previous $ sign i.

1205
01:04:44,990 --> 01:04:48,260
Let's suppose, the suffixes,
when they get to one of the $

1206
01:04:48,260 --> 01:04:51,490
signs, I can just stop, I
don't have to store the rest,

1207
01:04:51,490 --> 01:04:52,490
I'm going to throw it away.

1208
01:04:52,490 --> 01:04:55,490
Whenever I hit a $ sign, I
will stop the suffix tree.

1209
01:04:55,490 --> 01:04:57,410
That way, the $ signs
really are leaves,

1210
01:04:57,410 --> 01:04:59,741
all of them now become leaves.

1211
01:04:59,741 --> 01:05:01,490
So I don't really care
about a suffix that

1212
01:05:01,490 --> 01:05:02,480
goes all the way through here.

1213
01:05:02,480 --> 01:05:04,040
I just want the
suffix to the $ sign,

1214
01:05:04,040 --> 01:05:06,960
as it represents the
individual documents.

1215
01:05:06,960 --> 01:05:08,810
So $ sign i's are leaves.

1216
01:05:08,810 --> 01:05:11,600
And I want each of them just
to store a pointer, basically,

1217
01:05:11,600 --> 01:05:14,930
to the previous one of the
same type, the same $ sign i.

1218
01:05:14,930 --> 01:05:16,370
It came from the same document.

1219
01:05:22,860 --> 01:05:26,990
Now, here's the idea.

1220
01:05:26,990 --> 01:05:30,470
I did a search, I
got down to a node,

1221
01:05:30,470 --> 01:05:32,570
and now there's this
big subtree here.

1222
01:05:32,570 --> 01:05:36,400
And this subtree has a
bunch of leaves in it,

1223
01:05:36,400 --> 01:05:40,460
those represent all the
occurrences of the pattern P.

1224
01:05:40,460 --> 01:05:43,120
And let's suppose that
those leaves are numbered.

1225
01:05:43,120 --> 01:05:48,620
I'm numbering the leaves
from 1 to n, I guess.

1226
01:05:48,620 --> 01:05:51,440
Then in here, the leaves
will be an interval--

1227
01:05:51,440 --> 01:05:54,710
interval l, comma, n.

1228
01:05:54,710 --> 01:05:57,560
And the trouble is a lot of
these have the same label $

1229
01:05:57,560 --> 01:05:58,640
sign i.

1230
01:05:58,640 --> 01:06:01,370
And I just want to
find the unique ones.

1231
01:06:01,370 --> 01:06:02,120
How do I do that?

1232
01:06:07,760 --> 01:06:15,560
What we do is find the first
occurrence of $ sign i for each

1233
01:06:15,560 --> 01:06:17,090
i.

1234
01:06:17,090 --> 01:06:19,560
If I could just find the first
occurrence of $ sign i for each

1235
01:06:19,560 --> 01:06:20,060
i,

1236
01:06:20,060 --> 01:06:24,370
I'd then only have to pay order
number of distinct documents,

1237
01:06:24,370 --> 01:06:27,170
rather than having to pay for
every match within the document.

1238
01:06:27,170 --> 01:06:30,860
Now, one way to define
the first $ sign i is--

1239
01:06:30,860 --> 01:06:35,690
that's a $ sign i
whose stored value--

1240
01:06:35,690 --> 01:06:38,620
we said we store the leaf number
of the previous $ sign i--

1241
01:06:38,620 --> 01:06:45,301
whose stored value
is less than l.

1242
01:06:45,301 --> 01:06:46,550
So we find some position here.

1243
01:06:46,550 --> 01:06:48,300
If the previous
guy is less than l,

1244
01:06:48,300 --> 01:06:51,950
that means it was the
first of that type.

1245
01:06:51,950 --> 01:06:56,330
If we store this, that's
the definition of being first.

1246
01:06:56,330 --> 01:07:01,610
So in this interval, I want to
find $ sign i's that have very

1247
01:07:01,610 --> 01:07:04,070
small stored values.

1248
01:07:04,070 --> 01:07:06,050
How would I find
the very best one?

1249
01:07:06,050 --> 01:07:07,430
Range minimum query.

1250
01:07:07,430 --> 01:07:12,560
So we do a range minimum
query on l, comma, n.

1251
01:07:12,560 --> 01:07:15,200
If there are any firsts in
there, this will find it.

1252
01:07:18,580 --> 01:07:24,470
Find, let's say, a position
m with the smallest

1253
01:07:24,470 --> 01:07:25,790
possible stored value.

1254
01:07:37,430 --> 01:07:43,080
If the stored number
is less than l,

1255
01:07:43,080 --> 01:07:44,570
then output that answer.

1256
01:07:48,480 --> 01:07:53,890
And then recurse on the
remaining intervals.

1257
01:07:53,890 --> 01:08:01,860
So there's going to be from l
to m minus 1 and m plus 1 to n.

1258
01:08:01,860 --> 01:08:05,284
So we find the best
candidate, the minimum.

1259
01:08:05,284 --> 01:08:06,450
That's the minimum stored value.

1260
01:08:06,450 --> 01:08:09,210
If anything is going to be
less than l, that would be it.

1261
01:08:09,210 --> 01:08:12,459
If it is less than l, we output
it, then we recurse over here

1262
01:08:12,459 --> 01:08:13,500
and we recurse over here.

1263
01:08:13,500 --> 01:08:15,750
At some point this will
stop finding things.

1264
01:08:15,750 --> 01:08:17,910
We're going to do
another RMQ over here.

1265
01:08:17,910 --> 01:08:21,189
Might not find anything, then
we just stop that recursion.

1266
01:08:21,189 --> 01:08:23,151
But the number of
recursions we have to do

1267
01:08:23,151 --> 01:08:25,109
is going to be equal to
the number of documents

1268
01:08:25,109 --> 01:08:27,810
that match, maybe plus 1.

1269
01:08:27,810 --> 01:08:30,660
So we achieved this bound
using RMQ because RMQ

1270
01:08:30,660 --> 01:08:33,689
we can do in constant time with
appropriate pre-processing.
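The recursion can be sketched as follows, under some assumptions that are not from the lecture: leaves are numbered from 0, `doc_of_leaf[p]` gives the document ($ sign index) of leaf `p`, a stored value of -1 means "no previous leaf from this document", and the range minimum is found with a plain linear scan standing in for a constant-time RMQ structure. All names are hypothetical.

```python
def previous_same_doc(doc_of_leaf):
    """For each leaf, store the index of the previous leaf from the same document."""
    last_seen = {}
    prev = []
    for pos, doc in enumerate(doc_of_leaf):
        prev.append(last_seen.get(doc, -1))
        last_seen[doc] = pos
    return prev


def distinct_documents(doc_of_leaf, prev, lo, hi):
    """List each document that owns a leaf in [lo, hi] exactly once."""
    found = []
    pending = [(lo, hi)]
    while pending:
        left, right = pending.pop()
        if left > right:
            continue
        # Range minimum query over the stored values (linear scan in this sketch).
        m = min(range(left, right + 1), key=prev.__getitem__)
        # A stored value below lo means leaf m is the first of its document
        # inside the original interval. Otherwise nothing in this piece is first.
        if prev[m] < lo:
            found.append(doc_of_leaf[m])
            pending.append((left, m - 1))
            pending.append((m + 1, right))
    return found


doc_of_leaf = [0, 1, 0, 0, 2, 1, 0, 2]
prev = previous_same_doc(doc_of_leaf)
print(sorted(distinct_documents(doc_of_leaf, prev, 2, 6)))  # [0, 1, 2]
```

Each range minimum either reports a new document or ends its branch, so the number of queries is proportional to the number of matching documents, whatever the number of leaves in the interval.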

1271
01:08:33,689 --> 01:08:35,770
Now, the RMQ is over an array.

1272
01:08:35,770 --> 01:08:40,590
It's over this array of stored
values indexed by leaves.

1273
01:08:40,590 --> 01:08:42,330
And this idea of
taking the leaves

1274
01:08:42,330 --> 01:08:46,290
and writing them down in order
is actually something we need.

1275
01:08:46,290 --> 01:08:48,180
It's called a suffix array.

1276
01:08:56,970 --> 01:08:59,340
We're going to use this
alternate representation

1277
01:08:59,340 --> 01:09:02,640
of suffix trees in
order to compute them.

1278
01:09:02,640 --> 01:09:04,899
Suffix arrays in some sense
are easier to think about.

1279
01:09:16,410 --> 01:09:18,750
The idea with the
suffix array is

1280
01:09:18,750 --> 01:09:21,540
to write down all the
suffixes, sort them.

1281
01:09:25,979 --> 01:09:27,090
This is conceptual.

1282
01:09:27,090 --> 01:09:28,710
Imagine you take
all these suffixes.

1283
01:09:28,710 --> 01:09:30,370
Their total size
is quadratic in T

1284
01:09:30,370 --> 01:09:32,142
so you'd never actually
want to do this.

1285
01:09:32,142 --> 01:09:33,600
But just imagine
writing them down,

1286
01:09:33,600 --> 01:09:37,560
sorting them lexically using
our string sorting algorithms.

1287
01:09:37,560 --> 01:09:40,560
And then we can't
represent them explicitly

1288
01:09:40,560 --> 01:09:41,850
because it would be too big.

1289
01:09:41,850 --> 01:09:48,254
Just write down their index,
just store the indices.

1290
01:09:51,729 --> 01:09:52,770
Let's do this for banana.

1291
01:09:55,350 --> 01:09:58,545
Banana's over here.

1292
01:09:58,545 --> 01:10:00,060
It'll make my life
a little harder.

1293
01:10:17,090 --> 01:10:19,640
Actually, they're already
here in sorted order.

1294
01:10:19,640 --> 01:10:23,720
If dollar sign, I'm supposing,
is first, first suffix is $,

1295
01:10:23,720 --> 01:10:28,460
then a-$, then a-n-a-$, then
a-n-a-n-a-$, then banana,

1296
01:10:28,460 --> 01:10:31,880
then n-a-$, then n-a-n-a-$.

1297
01:10:31,880 --> 01:10:33,980
I'll just write
that down over here.

1298
01:10:33,980 --> 01:10:56,580
$, a-$, a-n-a-$, a-n-a-n-a-$,
then banana, then n-a-$,

1299
01:10:56,580 --> 01:10:59,420
then n-a-n-a-$.

1300
01:10:59,420 --> 01:11:02,220
If you look at these, they're
indeed in sorted order-- $,

1301
01:11:02,220 --> 01:11:04,330
a's, b's, n's.

1302
01:11:04,330 --> 01:11:06,740
Everything is sorted
here lexically.

1303
01:11:06,740 --> 01:11:09,160
Now, I can't store this
because it's quadratic size.

1304
01:11:09,160 --> 01:11:11,740
Instead, I just write down the
numbers that are down there.

1305
01:11:11,740 --> 01:11:14,620
This was the sixth suffix, it
was starting at position 6.

1306
01:11:14,620 --> 01:11:21,370
Then 5, then 3, then 1, then 0--

1307
01:11:21,370 --> 01:11:27,010
that's everything--
then 4, then 2.

1308
01:11:27,010 --> 01:11:31,810
This thing is the suffix array.

1309
01:11:31,810 --> 01:11:33,550
It also has linear size.

1310
01:11:33,550 --> 01:11:37,570
It's just a permutation on the
suffix labels, suffix indices.
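
To make the definition concrete, here is a deliberately naive sketch. It sorts the suffixes directly, which is exactly the expensive thing the lecture says you would never do at scale; the function name `naive_suffix_array` is an assumption for illustration, and it relies on `$` sorting before the letters, as it does in ASCII.

```python
def naive_suffix_array(text):
    # Sort suffix start positions by the suffix each one begins.
    # This materializes quadratic total text; it is for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])

print(naive_suffix_array("banana$"))  # [6, 5, 3, 1, 0, 4, 2]
```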

1311
01:11:46,115 --> 01:11:48,699
I still want to
tell you about it.

1312
01:11:48,699 --> 01:11:50,240
There's some other
information that's

1313
01:11:50,240 --> 01:11:55,630
helpful to write down
about the suffix array.

1314
01:11:55,630 --> 01:11:59,600
It's called longest common
prefix information, LCP.

1315
01:11:59,600 --> 01:12:03,580
The idea is to look at adjacent
elements in the suffix array.

1316
01:12:03,580 --> 01:12:06,080
In some sense, this represents
the same information, right?

1317
01:12:06,080 --> 01:12:08,360
Our whole goal is to
sort the suffixes.

1318
01:12:08,360 --> 01:12:12,200
If we could do this,
then, as we'll see,

1319
01:12:12,200 --> 01:12:13,430
we can also build this.

1320
01:12:13,430 --> 01:12:15,472
And this is sort of
what we really want.

1321
01:12:15,472 --> 01:12:17,180
The suffix array by
itself is pretty good

1322
01:12:17,180 --> 01:12:19,130
if you add in LCP information.

1323
01:12:19,130 --> 01:12:21,500
LCP is-- what is the longest
common prefix of these two

1324
01:12:21,500 --> 01:12:22,000
suffixes?

1325
01:12:22,000 --> 01:12:23,960
In this case, 0.

1326
01:12:23,960 --> 01:12:26,090
In this case, one letter.

1327
01:12:26,090 --> 01:12:30,770
In this case, three
letters match.

1328
01:12:30,770 --> 01:12:33,440
So here the value is 3.

1329
01:12:33,440 --> 01:12:36,320
And the next one,
zero letters match.

1330
01:12:36,320 --> 01:12:39,660
Next one, zero letters match.

1331
01:12:39,660 --> 01:12:45,380
Next one, two letters match.

1332
01:12:45,380 --> 01:12:47,720
So this is another array
you could store here--

1333
01:12:47,720 --> 01:12:49,805
0, 1, 3, 0, 0, 2.
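
A small sketch of that array, computed by direct character comparison of adjacent suffixes. The helper names are illustrative, and this is the brute-force definition rather than a linear-time method.

```python
def lcp_array(text, sa):
    def common_prefix(i, j):
        # Count matching letters of the suffixes starting at i and j.
        k = 0
        while (i + k < len(text) and j + k < len(text)
               and text[i + k] == text[j + k]):
            k += 1
        return k
    # One entry per pair of neighbors in the suffix array.
    return [common_prefix(sa[r], sa[r + 1]) for r in range(len(sa) - 1)]

sa = [6, 5, 3, 1, 0, 4, 2]
print(lcp_array("banana$", sa))  # [0, 1, 3, 0, 0, 2]
```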

1334
01:12:49,805 --> 01:12:51,980
AUDIENCE: Longest common prefix?

1335
01:12:51,980 --> 01:12:56,270
ERIK DEMAINE: Longest common
prefix of the suffixes.

1336
01:12:56,270 --> 01:12:58,200
Because each of these
is a suffix but here

1337
01:12:58,200 --> 01:13:01,400
we're interested in how
long they match for.

1338
01:13:01,400 --> 01:13:05,360
I claim if you have this suffix
array and this LCP information,

1339
01:13:05,360 --> 01:13:08,570
you can build this structure.

1340
01:13:08,570 --> 01:13:12,680
Anyone want to tell me how
to build this using this?

1341
01:13:12,680 --> 01:13:17,930
It's a one word or two word
answer that we saw, I think,

1342
01:13:17,930 --> 01:13:19,480
last class.

1343
01:13:19,480 --> 01:13:21,860
But we saw a lot of things
last class, so it's maybe not

1344
01:13:21,860 --> 01:13:22,360
obvious.

1345
01:13:30,530 --> 01:13:32,730
Magic words are Cartesian tree.

1346
01:13:35,570 --> 01:13:41,970
Cartesian tree was how we
converted RMQ into LCA,

1347
01:13:41,970 --> 01:13:42,630
I think.

1348
01:13:42,630 --> 01:13:43,250
Yeah?

1349
01:13:43,250 --> 01:13:45,890
Which was you take the
minimum value in the array,

1350
01:13:45,890 --> 01:13:50,480
make that the root, and then
recurse on the two sides.

1351
01:13:50,480 --> 01:13:54,110
So a Cartesian tree
of the LCP array,

1352
01:13:54,110 --> 01:13:57,800
basically, gives you
this transformation.

1353
01:13:57,800 --> 01:13:59,930
The minimum values
here are the 0's.

1354
01:13:59,930 --> 01:14:02,990
Now, before we just broke
ties, we picked an arbitrary 0,

1355
01:14:02,990 --> 01:14:04,050
put it at the root.

1356
01:14:04,050 --> 01:14:07,500
Now I want to take all the
0's, put them at the root.

1357
01:14:07,500 --> 01:14:13,130
If I do that, I get
three 0's at the root

1358
01:14:13,130 --> 01:14:16,170
and then I have
everything in between.

1359
01:14:16,170 --> 01:14:19,430
So there's nothing
left of the first 0.

1360
01:14:19,430 --> 01:14:21,740
Then next one,
there's these guys

1361
01:14:21,740 --> 01:14:24,320
and the mins are going
to be 1 and then 3.

1362
01:14:24,320 --> 01:14:27,800
So here I'm going to get a
1 when I recurse and then 3.

1363
01:14:34,910 --> 01:14:36,740
There's nothing in
between these 0's.

1364
01:14:36,740 --> 01:14:39,380
And after the last
0, there's a 2.

1365
01:14:39,380 --> 01:14:41,840
So this would be the Cartesian
tree, a slightly different

1366
01:14:41,840 --> 01:14:43,760
version where we
don't break ties,

1367
01:14:43,760 --> 01:14:45,800
we take all the
mins simultaneously,

1368
01:14:45,800 --> 01:14:47,670
put them at the root.

1369
01:14:47,670 --> 01:14:50,300
Now, does that look
like this thing?

1370
01:14:50,300 --> 01:14:50,877
Yeah.

1371
01:14:50,877 --> 01:14:52,085
Everything except the leaves.

1372
01:14:52,085 --> 01:14:53,960
[INAUDIBLE] are
missing at the leaves.

1373
01:14:53,960 --> 01:14:57,170
The leaves are represented
by these values.

1374
01:14:57,170 --> 01:15:00,410
Just visit them in order
here, do an in-order traversal

1375
01:15:00,410 --> 01:15:02,270
of the missing pointers here.

1376
01:15:02,270 --> 01:15:12,260
We're going to get 6, and
then 5, and then 3 and then 1,

1377
01:15:12,260 --> 01:15:19,609
and then 0, and
then 4, and then 2.

1378
01:15:19,609 --> 01:15:21,900
Now, the meaning of these
values is slightly different.

1379
01:15:21,900 --> 01:15:24,740
Maybe I should
circle them in red.

1380
01:15:24,740 --> 01:15:27,570
These leaves are just
like these leaves.

1381
01:15:27,570 --> 01:15:30,389
They're exactly the labels we
wrote down in the same order.

1382
01:15:30,389 --> 01:15:31,930
These numbers are
slightly different.

1383
01:15:31,930 --> 01:15:33,569
What they represent
are letter depths.

1384
01:15:33,569 --> 01:15:36,110
The letter depth of this node
is 0, letter depth of this node

1385
01:15:36,110 --> 01:15:37,670
is 1, letter depth
of this node is 3.

1386
01:15:37,670 --> 01:15:38,753
That's what I wrote here--

1387
01:15:38,753 --> 01:15:40,804
1, 3, 2.

1388
01:15:40,804 --> 01:15:42,240
This one says, 2.

1389
01:15:42,240 --> 01:15:44,000
These LCPs are exactly
the letter depth.

1390
01:15:44,000 --> 01:15:45,967
That's how far down
the tree you are.

1391
01:15:45,967 --> 01:15:48,050
Once you have this structure
and the letter depth,

1392
01:15:48,050 --> 01:15:50,265
you can very easily
put in these labels.

1393
01:15:50,265 --> 01:15:51,410
I won't say how to do that.

1394
01:15:51,410 --> 01:15:53,210
But in linear time,
if I could build

1395
01:15:53,210 --> 01:15:59,030
the suffix array plus the LCPs,
I could build suffix tree.
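
Here is a rough sketch of that tie-keeping Cartesian tree, assuming the suffix array and LCP array are already in hand. The tuple representation `(letter_depth, children)` and the name `build` are choices made for this example, and the repeated `min` makes it slower than linear; edge labels are left out, as in the lecture.

```python
def build(sa, lcp, lo, hi):
    """Subtree over leaves sa[lo..hi]; lcp[r] sits between sa[r] and sa[r+1]."""
    if lo == hi:
        return sa[lo]                    # leaf: a suffix start position
    depth = min(lcp[lo:hi])              # letter depth of this internal node
    children, start = [], lo
    for r in range(lo, hi):
        if lcp[r] == depth:              # every minimum splits the range
            children.append(build(sa, lcp, start, r))
            start = r + 1
    children.append(build(sa, lcp, start, hi))
    return (depth, children)

sa = [6, 5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]
print(build(sa, lcp, 0, len(sa) - 1))
# (0, [6, (1, [5, (3, [3, 1])]), 0, (2, [4, 2])])
```

Reading the leaves left to right gives back 6, 5, 3, 1, 0, 4, 2, and the internal nodes carry letter depths 0, 1, 3, and 2.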

1396
01:15:59,030 --> 01:16:02,081
So our real goal is to build
this information, these two

1397
01:16:02,081 --> 01:16:02,580
arrays.

1398
01:16:02,580 --> 01:16:04,300
If we could do it
in linear time,

1399
01:16:04,300 --> 01:16:07,470
we'd get a suffix
tree in linear time.

1400
01:16:07,470 --> 01:16:10,100
So that is what
remains to be done.

1401
01:16:17,379 --> 01:16:18,170
We're going to do--

1402
01:16:24,565 --> 01:16:27,200
not quite linear time.

1403
01:16:27,200 --> 01:16:30,620
If you want a
nicely sorted suffix

1404
01:16:30,620 --> 01:16:33,830
tree where all the
children are labeled here--

1405
01:16:33,830 --> 01:16:36,020
so in particular, if I
just had a single node,

1406
01:16:36,020 --> 01:16:39,650
I have to be able to sort
the letters in the alphabet.

1407
01:16:39,650 --> 01:16:41,090
However long that takes.

1408
01:16:41,090 --> 01:16:42,560
Maybe it's a small
alphabet and you

1409
01:16:42,560 --> 01:16:45,500
can do linear time sorting
by radix sort or whatever.

1410
01:16:45,500 --> 01:16:47,260
However long that
takes, we do it once.

1411
01:16:47,260 --> 01:16:50,377
Then the rest will
be order T time.

1412
01:16:50,377 --> 01:16:51,210
Here's how we do it.

1413
01:16:51,210 --> 01:16:54,217
First step-- sort the alphabet.

1414
01:16:54,217 --> 01:16:56,300
This will turn out to be
more interesting than you

1415
01:16:56,300 --> 01:16:57,170
might think.

1416
01:16:57,170 --> 01:16:58,430
I'll come back to it.

1417
01:16:58,430 --> 01:17:01,250
Second step--
replace each letter

1418
01:17:01,250 --> 01:17:03,920
by its index in
the sorted order.

1419
01:17:03,920 --> 01:17:07,130
This sounds boring but it
will be useful for later.
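
The first two steps might look like this in a small sketch. The function name `rank_letters` is made up, and Python's comparison sort stands in for whatever alphabet sort is appropriate.

```python
def rank_letters(text):
    alphabet = sorted(set(text))                      # step 1: sort the alphabet
    rank = {ch: r for r, ch in enumerate(alphabet)}
    return [rank[ch] for ch in text]                  # step 2: letter -> index

print(rank_letters("banana$"))  # [2, 1, 3, 1, 3, 1, 0]
```

After this, every letter is a small integer, which is what makes radix sorting possible later on.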

1420
01:17:15,710 --> 01:17:19,670
Third step-- the big idea.

1421
01:17:19,670 --> 01:17:23,180
This is an algorithm by
Karkkainen and Sanders,

1422
01:17:23,180 --> 01:17:25,750
from 2003.

1423
01:17:25,750 --> 01:17:27,950
The problem was first
solved in this running time

1424
01:17:27,950 --> 01:17:31,640
by Martin Farach-Colton,
our good friend.

1425
01:17:31,640 --> 01:17:33,260
But then it got simplified.

1426
01:17:33,260 --> 01:17:36,170
So I'll tell you a little
bit about that in a moment.

1427
01:17:38,750 --> 01:17:41,270
And there's going to be
a lot of writing here.

1428
01:17:52,880 --> 01:17:55,410
The idea here is we're going
to take the 3i-th letter,

1429
01:17:55,410 --> 01:17:57,440
3i plus first, 3i
plus second letter,

1430
01:17:57,440 --> 01:18:00,020
concatenate them into a
single triple letter--

1431
01:18:00,020 --> 01:18:01,800
think of it as a single letter.

1432
01:18:01,800 --> 01:18:03,402
And then just do that for all i.

1433
01:18:03,402 --> 01:18:05,110
So it's like I take
these guys, make them

1434
01:18:05,110 --> 01:18:07,070
one letter, these guys,
make them one letter.

1435
01:18:07,070 --> 01:18:10,100
Now, I could start at 0,
or I could start at 1,

1436
01:18:10,100 --> 01:18:12,230
or I could start at 2.

1437
01:18:12,230 --> 01:18:14,260
Do them all.

1438
01:18:14,260 --> 01:18:22,910
So this is going to be 3i
plus 1, 3i plus 2, 3i plus 3.

1439
01:18:22,910 --> 01:18:32,390
And this one is going to be 3i
plus 2, 3i plus 3, 3i plus 4.

1440
01:18:32,390 --> 01:18:33,860
We're going to do
this to recurse.

1441
01:18:33,860 --> 01:18:36,320
But the point is, if
I want to represent

1442
01:18:36,320 --> 01:18:38,750
all the suffixes
of T, suffix could

1443
01:18:38,750 --> 01:18:41,600
start at a position 0 mod
3, or position 1 mod 3,

1444
01:18:41,600 --> 01:18:43,800
or position 2 mod 3.

1445
01:18:43,800 --> 01:18:46,250
So if I could sort all the
suffixes of these guys,

1446
01:18:46,250 --> 01:18:49,120
I would effectively sort all
the suffixes of the original T.

1447
01:18:49,120 --> 01:18:51,800
This tripling up doesn't
really change things,

1448
01:18:51,800 --> 01:18:53,880
up to like plus 1 or 2.
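
As a sketch of the tripling step, assume the text is already a list of integer ranks starting at 1, so that 0 is free to use as padding past the end. That padding convention and the name `triples` are assumptions of this example, not something stated in the lecture.

```python
def triples(s, offset):
    """Letters of T_offset: the triple starting at 3i + offset, for every i."""
    padded = list(s) + [0, 0, 0]   # assumed padding, smaller than any real rank
    return [tuple(padded[p:p + 3]) for p in range(offset, len(s), 3)]

s = [3, 2, 4, 2, 4, 2, 1]          # banana$ with $=1, a=2, b=3, n=4
print(triples(s, 0))  # [(3, 2, 4), (2, 4, 2), (1, 0, 0)]
print(triples(s, 1))  # [(2, 4, 2), (4, 2, 1)]
print(triples(s, 2))  # [(4, 2, 4), (2, 1, 0)]
```

The suffix of T starting at position 3i + k corresponds to the suffix of T_k starting at its i-th triple.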

1449
01:19:03,500 --> 01:19:05,180
Next, I believe, is recursion.

1450
01:19:13,726 --> 01:19:18,320
I'm going to take T0 and
T1 and concatenate them.

1451
01:19:18,320 --> 01:19:21,320
This thing has size 2/3 n.

1452
01:19:21,320 --> 01:19:24,500
It has number of characters
2/3 n because each of them

1453
01:19:24,500 --> 01:19:26,420
has a third of the
number of characters.

1454
01:19:26,420 --> 01:19:28,020
Of course, all the
information is still there,

1455
01:19:28,020 --> 01:19:28,978
which is kind of weird.

1456
01:19:28,978 --> 01:19:31,680
But if we treat this as
a single character, which

1457
01:19:31,680 --> 01:19:35,435
then has a 1/3 n, we can't
afford to recurse on all three.

1458
01:19:35,435 --> 01:19:37,850
We can only afford to recurse
on two out of the three

1459
01:19:37,850 --> 01:19:40,850
because then we're going to get
a recurrence of the form T of n

1460
01:19:40,850 --> 01:19:46,370
is T of 2/3 n plus order n.

1461
01:19:46,370 --> 01:19:48,682
And this is geometric,
so it's order n.

1462
01:19:48,682 --> 01:19:50,390
That's how we're going
to get linear time

1463
01:19:50,390 --> 01:19:53,630
after the first sort.

1464
01:19:53,630 --> 01:19:56,000
If this was 3/3 n, then
this would be n log n.

1465
01:19:56,000 --> 01:19:58,310
We don't want to do that.
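
Written out, the geometric argument goes roughly like this, assuming the non-recursive work at each level is at most cn for some constant c (the constant is introduced here for illustration):

```latex
T(n) = T\!\left(\tfrac{2}{3}n\right) + cn
     \le cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
     = 3cn = O(n)
```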

1466
01:19:58,310 --> 01:19:59,832
So that's what I can afford.

1467
01:19:59,832 --> 01:20:01,040
Now I've got to deal with it.

1468
01:20:01,040 --> 01:20:03,020
What this tells me is,
the sorted order of all

1469
01:20:03,020 --> 01:20:05,420
the suffixes of T0 and
T1, all the suffixes

1470
01:20:05,420 --> 01:20:08,630
starting at positions
that are 0 or 1 mod 3.

1471
01:20:12,140 --> 01:20:17,840
Next thing we'd like to do
is sort the suffixes of T2.

1472
01:20:17,840 --> 01:20:20,150
We can do that, I
claim, by radix sort.

1473
01:20:26,160 --> 01:20:27,190
How do we do that?

1474
01:20:27,190 --> 01:20:30,240
Well, if you look
at a suffix of T2 starting at i,

1475
01:20:30,240 --> 01:20:37,320
this is the same thing as
T from 3i plus 2 onwards.

1476
01:20:37,320 --> 01:20:43,140
Which we can think of as
that first character, comma,

1477
01:20:43,140 --> 01:20:44,265
the next character onwards.

1478
01:20:48,840 --> 01:20:51,150
Sorry, that's the angle bracket.

1479
01:20:51,150 --> 01:20:56,680
And this thing is, basically,
T0 of i plus 1 onwards.

1480
01:20:56,680 --> 01:20:58,260
So if I strip off
the first letter,

1481
01:20:58,260 --> 01:21:00,000
then I get a suffix
that I know about.

1482
01:21:00,000 --> 01:21:02,830
I know the sorted order
of all the T0 suffixes.

1483
01:21:02,830 --> 01:21:04,830
So this is really just
a-- you can think of this

1484
01:21:04,830 --> 01:21:06,679
as a two character value.

1485
01:21:06,679 --> 01:21:08,220
There's a single
character from Sigma

1486
01:21:08,220 --> 01:21:12,660
here, which we've
already reduced down to--

1487
01:21:12,660 --> 01:21:18,210
this is an integer between
0 and Sigma minus 1.

1488
01:21:18,210 --> 01:21:21,150
This thing you can do the same
thing with these recursive

1489
01:21:21,150 --> 01:21:22,110
values.

1490
01:21:22,110 --> 01:21:24,210
So you've just got two values.

1491
01:21:24,210 --> 01:21:24,810
Small.

1492
01:21:24,810 --> 01:21:27,540
You can radix sort
them in linear time.

1493
01:21:27,540 --> 01:21:30,990
And then we will have sorted
T2 suffixes because we already

1494
01:21:30,990 --> 01:21:32,220
knew the order of these guys.
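
A sketch of that two-digit radix sort, under some assumptions: `s` is the text as integer ranks, `rank` is a dictionary giving each 0-or-1-mod-3 suffix start its position in the already sorted order, and a suffix starting past the end gets rank 0. All of these names are made up for the example.

```python
def bucket_pass(items, key, num_keys):
    """One stable counting-sort pass by key(item) in [0, num_keys)."""
    buckets = [[] for _ in range(num_keys)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]

def sort_t2_suffixes(s, rank):
    """Sort starts p with p % 3 == 2 by the pair (s[p], rank of suffix p + 1)."""
    n = len(s)
    starts = list(range(2, n, 3))
    rk = lambda q: rank[q] if q < n else 0
    # Least significant digit first (the known rank), then the leading letter.
    starts = bucket_pass(starts, lambda p: rk(p + 1), n + 1)
    starts = bucket_pass(starts, lambda p: s[p], max(s) + 1)
    return starts

s = [3, 2, 4, 2, 4, 2, 1]                # banana$ with $=1, a=2, b=3, n=4
rank = {6: 1, 3: 2, 1: 3, 0: 4, 4: 5}    # order of the 0 and 1 mod 3 suffixes
print(sort_t2_suffixes(s, rank))  # [5, 2]
```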

1495
01:21:34,800 --> 01:21:45,780
One more thing, which is that we have
to merge suffixes of T0 and T1

1496
01:21:45,780 --> 01:21:52,470
with suffixes of T2.

1497
01:21:52,470 --> 01:21:55,755
And this is where
we use the fact

1498
01:21:55,755 --> 01:21:58,130
that there are three of these
things and not two of them.

1499
01:21:58,130 --> 01:22:01,040
This is a weird case where three
way divide and conquer works.

1500
01:22:01,040 --> 01:22:02,952
Two way divide and
conquer is what

1501
01:22:02,952 --> 01:22:04,160
Farach-Colton did originally.

1502
01:22:04,160 --> 01:22:07,070
It's much more complicated
because of this merge step.

1503
01:22:07,070 --> 01:22:09,050
Merge gets painful.

1504
01:22:09,050 --> 01:22:13,490
I claim this merging
is easy because merging

1505
01:22:13,490 --> 01:22:16,500
is linear time, provided your
comparison is constant time.

1506
01:22:16,500 --> 01:22:21,610
So if I need to compare a
T0 suffix with a T2 suffix,

1507
01:22:21,610 --> 01:22:23,840
if I want to do
that comparison, I

1508
01:22:23,840 --> 01:22:26,030
strip off the first
letter from this one.

1509
01:22:26,030 --> 01:22:29,330
It turns into a T1 suffix,
the first character

1510
01:22:29,330 --> 01:22:30,155
plus a T1 suffix.

1511
01:22:30,155 --> 01:22:32,113
If I strip off the first
character of this one,

1512
01:22:32,113 --> 01:22:35,810
it turns into the first
character and then a T0 suffix.

1513
01:22:35,810 --> 01:22:38,150
And these things I know how
to compare because I already

1514
01:22:38,150 --> 01:22:40,970
sorted T0, comma, T1.

1515
01:22:40,970 --> 01:22:47,900
If I need to compare T1
suffix with the T2 suffix,

1516
01:22:47,900 --> 01:22:48,830
how do I do it?

1517
01:22:48,830 --> 01:22:51,440
I strip off the first
two letters of this one,

1518
01:22:51,440 --> 01:22:52,850
I get a T0 suffix.

1519
01:22:52,850 --> 01:22:55,340
I strip off the first
two letters of this one,

1520
01:22:55,340 --> 01:22:56,829
I get a T1 suffix.

1521
01:22:56,829 --> 01:22:59,120
I can't strip off one letter
because this would turn it

1522
01:22:59,120 --> 01:23:00,953
into a T2 and I don't
know how to compare T2

1523
01:23:00,953 --> 01:23:02,960
to other things,
that's the whole point.

1524
01:23:02,960 --> 01:23:05,420
I guess, it's a T2
versus a T0, if I

1525
01:23:05,420 --> 01:23:07,330
did that, which is this case.

1526
01:23:07,330 --> 01:23:09,200
But here, I strip
off two letters,

1527
01:23:09,200 --> 01:23:10,850
I get something I
know how to compare.

1528
01:23:10,850 --> 01:23:13,290
This technique does not work
if you only have two things.

1529
01:23:13,290 --> 01:23:15,540
It only works if you have
three things because they're

1530
01:23:15,540 --> 01:23:17,700
sort of these situations.

1531
01:23:17,700 --> 01:23:18,950
So constant time.

1532
01:23:18,950 --> 01:23:21,620
By comparing these little
tuples, the first character

1533
01:23:21,620 --> 01:23:26,000
or two plus the remaining
suffix, I can do the comparator

1534
01:23:26,000 --> 01:23:27,980
and merge.

1535
01:23:27,980 --> 01:23:31,010
And then if I can do that,
everything is linear time.
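
The comparator and merge can be sketched as below, under the same assumptions as before: `s` holds integer ranks, `rank` maps each 0-or-1-mod-3 suffix start to its position in the sorted order, and anything past the end of the text counts as 0. The function names are illustrative only.

```python
def t2_before(p2, p01, s, rank):
    """True if the suffix at p2 (p2 % 3 == 2) sorts before the one at p01."""
    n = len(s)
    letter = lambda q: s[q] if q < n else 0
    rk = lambda q: rank[q] if q < n else 0
    if p01 % 3 == 0:
        # Strip one letter: p2 + 1 is 0 mod 3 and p01 + 1 is 1 mod 3.
        return (letter(p2), rk(p2 + 1)) < (letter(p01), rk(p01 + 1))
    # p01 % 3 == 1. Strip two letters: p2 + 2 is 1 mod 3, p01 + 2 is 0 mod 3.
    return ((letter(p2), letter(p2 + 1), rk(p2 + 2))
            < (letter(p01), letter(p01 + 1), rk(p01 + 2)))

def merge(sorted01, sorted2, s, rank):
    out, i, j = [], 0, 0
    while i < len(sorted01) and j < len(sorted2):
        if t2_before(sorted2[j], sorted01[i], s, rank):
            out.append(sorted2[j]); j += 1
        else:
            out.append(sorted01[i]); i += 1
    return out + sorted01[i:] + sorted2[j:]

s = [3, 2, 4, 2, 4, 2, 1]                # banana$ with $=1, a=2, b=3, n=4
rank = {6: 1, 3: 2, 1: 3, 0: 4, 4: 5}
print(merge([6, 3, 1, 0, 4], [5, 2], s, rank))  # [6, 5, 3, 1, 0, 4, 2]
```

Each comparison looks at one or two letters plus one known rank, so it takes constant time and the merge is linear.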

1536
01:23:31,010 --> 01:23:33,710
The only interesting thing
is how do I sort the alphabet

1537
01:23:33,710 --> 01:23:34,970
when I recurse?

1538
01:23:34,970 --> 01:23:38,690
And for that, you
use radix sort.

1539
01:23:38,690 --> 01:23:44,500
So the first time,
you pay sort(Sigma).

1540
01:23:44,500 --> 01:23:46,250
We don't know how long
that takes, depends

1541
01:23:46,250 --> 01:23:47,330
on your alphabet.

1542
01:23:47,330 --> 01:23:49,359
But every following
recursion it's a radix

1543
01:23:49,359 --> 01:23:51,650
sort because you have a triple
of values, each of which

1544
01:23:51,650 --> 01:23:52,990
is small.

1545
01:23:52,990 --> 01:23:54,650
And so you can do
it in linear time.

1546
01:23:54,650 --> 01:23:57,860
Because there's only
three digits to the thing

1547
01:23:57,860 --> 01:23:59,000
you're sorting.

1548
01:23:59,000 --> 01:24:01,880
So overall, this is a
recursive algorithm.

1549
01:24:01,880 --> 01:24:05,000
It gives you linear
time because you're

1550
01:24:05,000 --> 01:24:07,910
making one recursive
call of 2/3 the size.

1551
01:24:07,910 --> 01:24:11,840
Pretty clever and simple.

1552
01:24:11,840 --> 01:24:14,330
And that's suffix trees
and how you build them.

1553
01:24:14,330 --> 01:24:17,270
First you get suffix arrays,
you can do the same thing

1554
01:24:17,270 --> 01:24:20,330
and get LCP information
at the same time,

1555
01:24:20,330 --> 01:24:22,220
it's written in the notes.

1556
01:24:22,220 --> 01:24:23,330
Then you get suffix trees.

1557
01:24:23,330 --> 01:24:25,480
And then you're done.