1
00:00:00,090 --> 00:00:02,490
The following content is
provided under a Creative

2
00:00:02,490 --> 00:00:04,030
Commons license.

3
00:00:04,030 --> 00:00:06,360
Your support will help
MIT OpenCourseWare

4
00:00:06,360 --> 00:00:10,720
continue to offer high quality
educational resources for free.

5
00:00:10,720 --> 00:00:13,320
To make a donation, or
view additional materials

6
00:00:13,320 --> 00:00:17,280
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,280 --> 00:00:18,450
at ocw.mit.edu.

8
00:00:21,139 --> 00:00:22,680
ERIK DEMAINE: All
right, welcome back

9
00:00:22,680 --> 00:00:26,580
to Succinct Data
Structures, part two of two.

10
00:00:26,580 --> 00:00:29,610
Today we're going to take all
the stuff we know about tries

11
00:00:29,610 --> 00:00:32,910
and apply them to the main
motivating application, which

12
00:00:32,910 --> 00:00:35,190
is suffix trees.

13
00:00:35,190 --> 00:00:37,320
And as we know, suffix
trees and suffix arrays

14
00:00:37,320 --> 00:00:39,510
are more or less equivalent.

15
00:00:39,510 --> 00:00:43,524
But if you build one,
you can build the other.

16
00:00:43,524 --> 00:00:44,940
But what we're
going to show today

17
00:00:44,940 --> 00:00:47,500
is they're equivalent also
from a space perspective.

18
00:00:47,500 --> 00:00:49,130
That will be the last topic.

19
00:00:49,130 --> 00:00:52,980
If you can succinctly
represent a suffix array,

20
00:00:52,980 --> 00:00:56,700
then you can transform-- with
a little o of n extra space,

21
00:00:56,700 --> 00:00:59,430
you can make a
suffix tree as well

22
00:00:59,430 --> 00:01:02,920
and do searches in roughly
the time we're used to,

23
00:01:02,920 --> 00:01:05,844
which is p plus size of output.

24
00:01:05,844 --> 00:01:07,260
It's not going to
be exactly that.

25
00:01:07,260 --> 00:01:09,660
We're going to lose like log
to the epsilons and such.

26
00:01:09,660 --> 00:01:12,630
But that's mostly caused--
this transformation only

27
00:01:12,630 --> 00:01:17,240
incurs like an additive log,
log, log, log, log, log n.

28
00:01:17,240 --> 00:01:19,800
You could have as
many logs as you want.

29
00:01:19,800 --> 00:01:23,700
Take any arbitrarily
slowly growing function,

30
00:01:23,700 --> 00:01:24,940
that it will--

31
00:01:24,940 --> 00:01:27,610
your space bound gets
closer and closer to linear.

32
00:01:27,610 --> 00:01:29,580
Anyway, that's what
we'll get to at the end.

33
00:01:29,580 --> 00:01:32,360
The bulk of the lecture will
be on building suffix arrays.

34
00:01:32,360 --> 00:01:34,290
Here we're going to lose
a log to the epsilon

35
00:01:34,290 --> 00:01:36,810
time in the query.

36
00:01:36,810 --> 00:01:40,901
And we're going to start out
improving down to T log log T

37
00:01:40,901 --> 00:01:41,400
space.

38
00:01:41,400 --> 00:01:43,500
Our baseline
is T log T space.

39
00:01:43,500 --> 00:01:46,020
That's a normal-- if you
just stored a suffix array

40
00:01:46,020 --> 00:01:47,850
as a bunch of numbers.
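
As a rough illustration of that baseline (a sketch for tiny inputs, not the lecture's construction), a suffix array stored as plain integers costs about T log T bits, while the text itself costs about T log sigma bits. The function names here are made up for this example, and the sort is deliberately naive.

```python
import math

def naive_suffix_array(text):
    # Sort suffix start positions by the suffix they begin.
    # Quadratic-ish work; only meant for small examples.
    return sorted(range(len(text)), key=lambda i: text[i:])

def plain_suffix_array_bits(text):
    # Each entry is a position in [0, T), so it needs ceil(log2 T) bits.
    n = len(text)
    return n * math.ceil(math.log2(n)) if n > 1 else n

def plain_text_bits(text):
    # Each character needs ceil(log2 sigma) bits,
    # where sigma is the number of distinct characters (assumed alphabet).
    sigma = len(set(text))
    return len(text) * max(1, math.ceil(math.log2(sigma)))

text = "banana$"
suffix_array = naive_suffix_array(text)
sa_bits = plain_suffix_array_bits(text)
text_bits = plain_text_bits(text)
```

On a toy string the gap is small, but the suffix array cost grows with log T per entry while the text cost stays at log sigma per character.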

41
00:01:47,850 --> 00:01:51,600
First we'll add another log,
then we'll get down to linear.

42
00:01:51,600 --> 00:01:56,230
That gives us a compact suffix
tree, or sorry, suffix array.

43
00:01:56,230 --> 00:01:59,490
It's also known how to
do succinct suffix arrays.

44
00:01:59,490 --> 00:02:03,270
But there are dozens of papers
on this topic, it's kind

45
00:02:03,270 --> 00:02:05,970
of a big field all to itself.

46
00:02:05,970 --> 00:02:09,210
And a lot of the techniques
are pretty complicated.

47
00:02:09,210 --> 00:02:13,140
So I'm going to try to keep it
to the bare minimum we can do,

48
00:02:13,140 --> 00:02:15,222
that will give us linear--

49
00:02:15,222 --> 00:02:17,670
linear number of bits of
space, giving us a compact data

50
00:02:17,670 --> 00:02:18,720
structure.

51
00:02:18,720 --> 00:02:21,390
But before we go to
those data structures,

52
00:02:21,390 --> 00:02:23,700
I want to give you a little
survey of what's known.

53
00:02:26,680 --> 00:02:34,589
So compact suffix arrays,
and trees, start out with--

54
00:02:34,589 --> 00:02:36,630
I'm going to start out
with the original results.

55
00:02:36,630 --> 00:02:39,088
And then I'll jump to sort of
the latest results, which are

56
00:02:39,088 --> 00:02:40,515
getting things to be succinct.

57
00:02:43,810 --> 00:02:48,660
So the first result on this
topic that got a compact suffix

58
00:02:48,660 --> 00:02:52,620
array, is by Grossi and Vitter.

59
00:02:52,620 --> 00:02:55,710
This was in 2000,
spring of 2000.

60
00:02:55,710 --> 00:02:58,530
And let me tell you the
bounds that they achieve.

61
00:02:58,530 --> 00:03:01,560
This is actually the solution
that we're going to look at.

62
00:03:26,340 --> 00:03:28,830
So this is the
first space bound.

63
00:03:31,670 --> 00:03:35,000
I guess the big term
here is T log sigma.

64
00:03:35,000 --> 00:03:38,030
That's how many bits it takes
just to write down the text.

65
00:03:38,030 --> 00:03:41,480
So this is what you might
call optimal, in this world.

66
00:03:41,480 --> 00:03:42,980
I mean, if you have
random text, you

67
00:03:42,980 --> 00:03:44,720
need that many bits
to write it down.

68
00:03:44,720 --> 00:03:46,610
So there's 1 times that.

69
00:03:46,610 --> 00:03:49,040
We're also going to have this
1 over epsilon times that.

70
00:03:49,040 --> 00:03:50,750
And this is actually
the data structure.

71
00:03:50,750 --> 00:03:52,541
It's going to store
the text, and then it's

72
00:03:52,541 --> 00:03:55,730
going to add on a data structure
of 1 over epsilon times that.

73
00:03:55,730 --> 00:04:01,326
So it's order-- order [? ops. ?]
There's some lower-order terms.

74
00:04:01,326 --> 00:04:03,200
We won't actually have
this lower-order term,

75
00:04:03,200 --> 00:04:06,860
because I'm going to focus
on binary alphabet here.

76
00:04:06,860 --> 00:04:07,887
Keep it simple.

77
00:04:07,887 --> 00:04:09,470
But if you have a
non-binary alphabet,

78
00:04:09,470 --> 00:04:12,410
they have another order
T bits, and so on.

79
00:04:12,410 --> 00:04:14,210
But you get to
control this constant.

80
00:04:14,210 --> 00:04:17,060
This will work for any
epsilon between 0 and 1.

81
00:04:19,850 --> 00:04:22,910
And why are you interested
in a small epsilon?

82
00:04:22,910 --> 00:04:27,060
Because if epsilon is small,
this space bound goes up.

83
00:04:27,060 --> 00:04:29,295
Well, that happens
in the query bound.

84
00:04:48,380 --> 00:04:51,590
So in the query bound, there's
this multiplicative log

85
00:04:51,590 --> 00:04:55,130
to the epsilon of T. So if you
really want queries to go fast,

86
00:04:55,130 --> 00:04:57,290
you don't want to pay
a big polylog here,

87
00:04:57,290 --> 00:04:59,490
then you're going to have
to pay for it in space.

88
00:04:59,490 --> 00:05:00,781
So those are the same epsilons.

89
00:05:05,150 --> 00:05:08,690
In the Grossi-Vitter paper,
they only multiply this

90
00:05:08,690 --> 00:05:09,920
by the size of the output.

91
00:05:09,920 --> 00:05:11,597
So if you want to
just output one guy,

92
00:05:11,597 --> 00:05:13,430
you only pay an additive
log to the epsilon.

93
00:05:13,430 --> 00:05:14,900
If you want to output
all the matches

94
00:05:14,900 --> 00:05:16,775
you have to pay a number
of matches times log

95
00:05:16,775 --> 00:05:17,690
to the epsilon.

96
00:05:17,690 --> 00:05:20,330
They achieve the P bound.

97
00:05:20,330 --> 00:05:23,050
In fact, they do a little bit
better than order P query.

98
00:05:23,050 --> 00:05:26,150
On a RAM, you can hope to do--

99
00:05:26,150 --> 00:05:33,170
save a log factor by reading log
base sigma of T, of the letters

100
00:05:33,170 --> 00:05:34,950
in one word operation.

101
00:05:34,950 --> 00:05:36,900
So I'm not going to go
into how to do this--

102
00:05:36,900 --> 00:05:41,257
I'm going to cover this paper
today, or a simplification

103
00:05:41,257 --> 00:05:41,840
of this paper.

104
00:05:41,840 --> 00:05:43,560
You might say, throw away.

105
00:05:43,560 --> 00:05:46,250
I'm going to get slightly
worse bounds than this.

106
00:05:46,250 --> 00:05:51,170
Space bound will be the
same, but I'm not going to--

107
00:05:51,170 --> 00:05:53,300
I'm not going to worry
about this log factor.

108
00:05:53,300 --> 00:05:55,490
And in fact, both
P and output are

109
00:05:55,490 --> 00:05:59,540
going to be multiplied
by log to the epsilon.

110
00:05:59,540 --> 00:06:01,700
So I won't achieve quite
the best query bound,

111
00:06:01,700 --> 00:06:03,590
but same space bound,
just to give you

112
00:06:03,590 --> 00:06:06,800
an idea of how it works.

113
00:06:06,800 --> 00:06:12,170
The next result-- yeah,
I'll go to another board.

114
00:06:12,170 --> 00:06:14,480
These bounds are a
bit big, as you see.

115
00:06:17,090 --> 00:06:19,860
The next result, which was
done later in the same year.

116
00:06:19,860 --> 00:06:21,830
So these are
probably discovered,

117
00:06:21,830 --> 00:06:25,690
basically at the same time.

118
00:06:25,690 --> 00:06:29,840
Because writing a paper
takes probably a year or so.

119
00:06:29,840 --> 00:06:31,550
So they were being
done in parallel,

120
00:06:31,550 --> 00:06:34,260
and then this was published
in the spring of 2000.

121
00:06:34,260 --> 00:06:36,899
This was published
in the fall of 2000.

122
00:06:36,899 --> 00:06:37,940
It's called the FM-index.

123
00:06:40,640 --> 00:06:43,754
And it achieves
this bound, which

124
00:06:43,754 --> 00:06:45,545
is going to take a
little while to explain.

125
00:07:09,270 --> 00:07:11,450
OK.

126
00:07:11,450 --> 00:07:16,160
Think of this right now,
as this is T log sigma.

127
00:07:16,160 --> 00:07:20,254
Ignore this H. This
is entropy stuff.

128
00:07:20,254 --> 00:07:21,920
But if you think of
this as T log sigma,

129
00:07:21,920 --> 00:07:24,470
we're getting 5
times T log sigma,

130
00:07:24,470 --> 00:07:27,800
plus some lower-order term.

131
00:07:27,800 --> 00:07:30,046
So it's a little less
flexible over here.

132
00:07:30,046 --> 00:07:31,670
We kind of got to
control the constant.

133
00:07:31,670 --> 00:07:35,600
Anything greater or equal
to 2 would be all right.

134
00:07:35,600 --> 00:07:37,580
Over here, it's
always at least 5.

135
00:07:37,580 --> 00:07:39,170
This has since been improved.

136
00:07:39,170 --> 00:07:41,750
I'm just telling
you the historical--

137
00:07:41,750 --> 00:07:45,410
these days people can get
down to at least 4 or so.

138
00:07:45,410 --> 00:07:46,470
Actually, get down to 1.

139
00:07:46,470 --> 00:07:49,730
We'll talk about it in a moment.

140
00:07:49,730 --> 00:07:51,380
Before I get to
the Hk part, I want

141
00:07:51,380 --> 00:07:53,180
to talk about the
lower-order term.

142
00:07:53,180 --> 00:07:55,340
There's some scary
parts like this.

143
00:07:55,340 --> 00:07:58,430
If sigma is at all
large, this is big trouble.

144
00:07:58,430 --> 00:08:00,320
Or even sigma log n--

145
00:08:00,320 --> 00:08:03,110
this is superpolynomial.

146
00:08:03,110 --> 00:08:05,960
So this cannot handle
very large sigma,

147
00:08:05,960 --> 00:08:07,580
whereas this solution can.

148
00:08:07,580 --> 00:08:11,930
And other structures can,
but this is an early result.

149
00:08:11,930 --> 00:08:14,990
This also gets bad when
sigma's very large.

150
00:08:14,990 --> 00:08:17,540
Even bigger-- when sigma's
bigger than log log T,

151
00:08:17,540 --> 00:08:20,660
then this starts to dominate.

152
00:08:20,660 --> 00:08:23,779
OK, but for sigma small, think
binary alphabets, whatever.

153
00:08:23,779 --> 00:08:25,820
This is good, and in many
ways is actually better

154
00:08:25,820 --> 00:08:26,750
than T log sigma.

155
00:08:26,750 --> 00:08:31,750
So let me tell you about
this Hk of T thing.

156
00:08:31,750 --> 00:08:34,985
This is what's called k-th
order empirical entropy.

157
00:08:45,770 --> 00:08:49,645
Maybe I should start with an
aside of 0-th order entropy,

158
00:08:49,645 --> 00:08:51,020
because we haven't
talked about--

159
00:08:51,020 --> 00:08:53,450
I guess we talked about entropy
in the context of binary search

160
00:08:53,450 --> 00:08:53,949
trees.

161
00:08:53,949 --> 00:08:55,820
We said, oh, if you've got--

162
00:08:55,820 --> 00:08:59,300
if you access item i
with probability P i,

163
00:08:59,300 --> 00:09:07,450
then there's this entropy bound,
which is sum of P log 1/P.

164
00:09:07,450 --> 00:09:11,630
So I don't know, let's
call this character x.

165
00:09:11,630 --> 00:09:23,420
So if you-- let's see,
you have H0 of a string s.

166
00:09:23,420 --> 00:09:30,030
You sum over all
characters in the alphabet,

167
00:09:30,030 --> 00:09:33,380
of the probability-- this
is not really a probability.

168
00:09:33,380 --> 00:09:40,490
This is going to be the
number of x's in s, divided

169
00:09:40,490 --> 00:09:41,900
by the length of s.

170
00:09:41,900 --> 00:09:43,810
This is what's called
empirical probability.

171
00:09:43,810 --> 00:09:46,167
It's what you observe
from this string.

172
00:09:46,167 --> 00:09:48,000
There's this many
occurrences in the string.

173
00:09:48,000 --> 00:09:49,220
You divide by the
length of the string.

174
00:09:49,220 --> 00:09:50,660
That's kind of
like a probability.

175
00:09:50,660 --> 00:09:52,390
It's scaled to be
like a probability.

176
00:09:52,390 --> 00:09:53,760
It's between 0 and 1.

177
00:09:53,760 --> 00:09:57,410
And if you take sum of P log
1/P, that gives you a bound.

178
00:09:57,410 --> 00:10:01,950
And this is the bound achieved
by say, Huffman coding,

179
00:10:01,950 --> 00:10:04,542
or the optimal code.

180
00:10:04,542 --> 00:10:06,500
If all you're allowed to
do is give a code word

181
00:10:06,500 --> 00:10:08,660
for each letter of the
alphabet, and then you

182
00:10:08,660 --> 00:10:10,820
write down a binary code
word for each letter

183
00:10:10,820 --> 00:10:11,750
of the alphabet.

184
00:10:11,750 --> 00:10:14,870
And you write that down for each
letter in s, then you achieve--

185
00:10:14,870 --> 00:10:17,480
I guess Huffman codes
achieve ceiling of this.

186
00:10:17,480 --> 00:10:19,610
If you want to achieve
exactly that bound,

187
00:10:19,610 --> 00:10:23,660
you can use arithmetic
coding, but we're not

188
00:10:23,660 --> 00:10:25,850
going to get into
those kinds of details.

189
00:10:25,850 --> 00:10:30,740
So if you used what's called a
0-th order code, where you just

190
00:10:30,740 --> 00:10:34,220
have a code for each
character of the alphabet,

191
00:10:34,220 --> 00:10:37,400
then the space bound you
would achieve is H0 of s,

192
00:10:37,400 --> 00:10:41,460
times the number
of characters in s.

193
00:10:41,460 --> 00:10:45,800
So that would be if you
substituted k equals 0 here.
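
A minimal sketch of the 0-th order empirical entropy just described, assuming base-2 logarithms so the result is in bits per character. The function names are made up for this example.

```python
import math
from collections import Counter

def empirical_h0(s):
    # Sum over characters x of p_x * log2(1 / p_x),
    # where p_x = (number of x's in s) / (length of s).
    n = len(s)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

def zeroth_order_bits(s):
    # Space of an ideal 0-th order code: H0(s) times the number of characters.
    return len(s) * empirical_h0(s)
```

For example, a string with two equally frequent characters gives 1 bit per character, and a string made of a single repeated character gives 0.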

194
00:10:45,800 --> 00:10:46,800
So that's kind of neat.

195
00:10:46,800 --> 00:10:50,030
This is a compressed
representation of the string.

196
00:10:50,030 --> 00:10:53,150
Over here, we just
wrote down the string.

197
00:10:53,150 --> 00:10:54,950
And if the string
is incompressible,

198
00:10:54,950 --> 00:10:56,954
yeah, T log sigma is optimal.

199
00:10:56,954 --> 00:10:59,120
But if the string is
compressible, like many strings

200
00:10:59,120 --> 00:11:01,328
we want to store-- you're
storing English, whatever--

201
00:11:01,328 --> 00:11:04,610
you should save somewhere
between a factor of 2 and 10.

202
00:11:04,610 --> 00:11:06,110
This will try to save it.

203
00:11:06,110 --> 00:11:09,530
Of course, factor
between 2 and 10 is not--

204
00:11:09,530 --> 00:11:12,020
is a little scary, when
there's this factor 5 out here.

205
00:11:12,020 --> 00:11:14,420
That might dominate
whatever savings you get.

206
00:11:14,420 --> 00:11:17,420
But in theory, this
could be a lot better.

207
00:11:17,420 --> 00:11:20,090
And this is just the first
result in this series.

208
00:11:20,090 --> 00:11:23,240
Now we can get 1 times
Hk of T, and then it's

209
00:11:23,240 --> 00:11:27,960
a lot more interesting.

210
00:11:27,960 --> 00:11:30,020
OK, so that was
0-th order entropy.

211
00:11:30,020 --> 00:11:32,210
What's this k-th order
entropy business?

212
00:11:32,210 --> 00:11:34,790
Essentially, it's about taking--
instead of writing a code

213
00:11:34,790 --> 00:11:37,520
word for a single letter,
you can write a code word

214
00:11:37,520 --> 00:11:43,670
for a letter that depends on
the previous k characters.

215
00:11:43,670 --> 00:11:47,170
So I'm going to write
down a definition.

216
00:11:47,170 --> 00:11:54,970
Hk of T is going to be the sum
over all words of length k.

217
00:11:54,970 --> 00:11:58,970
This is going to be our
context of the probability,

218
00:11:58,970 --> 00:12:09,740
or empirical probability of w
occurring times the 0-th order

219
00:12:09,740 --> 00:12:24,005
entropy of the string of
successor characters of w.

220
00:12:26,510 --> 00:12:29,660
So again, the empirical
probability of w occurring

221
00:12:29,660 --> 00:12:37,730
is the number of occurrences
of w, divided by T, basically.

222
00:12:37,730 --> 00:12:44,570
So the idea is, now you get to
encode a character depending

223
00:12:44,570 --> 00:12:47,180
on the context of the
last k characters.

224
00:12:47,180 --> 00:12:50,150
So we're summing over
all possible contexts

225
00:12:50,150 --> 00:12:54,260
of k characters, and we're
taking the expectation

226
00:12:54,260 --> 00:12:58,040
over all possible context w.

227
00:12:58,040 --> 00:13:01,070
That's the sum of the
probabilities times something.

228
00:13:01,070 --> 00:13:05,540
And then condition on w
being the context, the last k

229
00:13:05,540 --> 00:13:06,440
characters.

230
00:13:06,440 --> 00:13:10,790
We want to measure what
characters follow that.

231
00:13:10,790 --> 00:13:13,130
And there, we can use
a 0-th order encoding.

232
00:13:13,130 --> 00:13:15,140
I mean, we've
already conditioned

233
00:13:15,140 --> 00:13:17,940
on w being right there.

234
00:13:17,940 --> 00:13:21,170
So for all occurrences of w,
you look at the next character

235
00:13:21,170 --> 00:13:23,660
right after it, and you
take 0-th order entropy

236
00:13:23,660 --> 00:13:26,600
of that, that's called
k-th order entropy.
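
The definition above can be sketched directly: group the text by each length-k context w, collect the characters that follow w, and weight the 0-th order entropy of those followers by how often w occurs. This is a sketch under assumptions: base-2 logarithms, made-up function names, and each context counted only where it has a following character.

```python
import math
from collections import Counter, defaultdict

def empirical_h0(s):
    # 0-th order empirical entropy in bits per character.
    n = len(s)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

def empirical_hk(t, k):
    # Sum over contexts w of length k of
    # (occurrences of w followed by a character / |t|) * H0(followers of w).
    n = len(t)
    if n == 0:
        return 0.0
    followers = defaultdict(list)
    for i in range(n - k):
        followers[t[i:i + k]].append(t[i + k])
    return sum(len(f) * empirical_h0(f) for f in followers.values()) / n
```

With k equal to 0 there is a single empty context, so this reduces to H0. A text like "abababab" has H0 of 1 bit per character but first-order entropy 0, because each character is determined by the one before it.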

237
00:13:26,600 --> 00:13:31,230
OK, you have to think
about it for a while, too.

238
00:13:31,230 --> 00:13:34,520
But this essentially
means the best,

239
00:13:34,520 --> 00:13:37,400
you can prove this is the
best encoding you can do,

240
00:13:37,400 --> 00:13:40,520
if the codeword of a letter
can depend on the previous k

241
00:13:40,520 --> 00:13:41,540
characters.

242
00:13:41,540 --> 00:13:44,000
Of course, if you have such a
code it's easy to decompress,

243
00:13:44,000 --> 00:13:46,620
because as you're decompressing,
you know what the previous k

244
00:13:46,620 --> 00:13:48,780
characters were.

245
00:13:48,780 --> 00:13:51,980
OK, interesting thing about this
index or this data structure,

246
00:13:51,980 --> 00:13:54,620
is it's independent of k.

247
00:13:54,620 --> 00:13:57,080
The data structure
doesn't know what k is.

248
00:13:57,080 --> 00:14:00,200
This works for all k.

249
00:14:00,200 --> 00:14:05,032
For any fixed k-- k has
to be constant here.

250
00:14:05,032 --> 00:14:07,490
There are other data structures
like [? KB, ?] logarithmic,

251
00:14:07,490 --> 00:14:08,870
or so.

252
00:14:08,870 --> 00:14:11,020
But here, we'll think
of k as a constant.

253
00:14:11,020 --> 00:14:14,150
And so this is really a neat
thing about compression.

254
00:14:14,150 --> 00:14:17,750
There's a technique called
the Burrows-Wheeler transform.

255
00:14:17,750 --> 00:14:20,431
And Lempel-Ziv does
similar things.

256
00:14:20,431 --> 00:14:22,430
You may have heard of
those compression schemes.

257
00:14:22,430 --> 00:14:24,430
They're used in bzip,
and things like-- bzip

258
00:14:24,430 --> 00:14:28,190
is named after
Burrows-Wheeler, I believe.

259
00:14:28,190 --> 00:14:32,150
And those compression
schemes achieve

260
00:14:32,150 --> 00:14:35,330
Hk of T bits per character--

261
00:14:35,330 --> 00:14:37,220
so Hk of T times T--

262
00:14:37,220 --> 00:14:39,720
for all k.

263
00:14:39,720 --> 00:14:42,410
So if your text is
really good, given

264
00:14:42,410 --> 00:14:45,320
the context of the last five
letters, or three letters.

265
00:14:45,320 --> 00:14:49,110
In some sense, the compression
scheme adapts to that.

266
00:14:49,110 --> 00:14:53,360
So this is what we call a
self index, in that this also

267
00:14:53,360 --> 00:14:54,260
stores the string.

268
00:14:54,260 --> 00:14:56,690
You can read the
data of the string.

269
00:14:56,690 --> 00:14:59,360
And so whereas
over here, we just

270
00:14:59,360 --> 00:15:00,920
stored the string uncompressed.

271
00:15:00,920 --> 00:15:02,930
Here we're effectively
storing the string

272
00:15:02,930 --> 00:15:05,450
in a compressed form,
and the data structure

273
00:15:05,450 --> 00:15:06,920
is similarly compressed.

274
00:15:06,920 --> 00:15:08,990
So if your string is
compressible by more

275
00:15:08,990 --> 00:15:13,600
than a factor of 5, this
will be really good.

276
00:15:13,600 --> 00:15:17,390
And that's the FM-index bound.

277
00:15:17,390 --> 00:15:21,530
Now that you have that Hk
stuff, it's a lot easier

278
00:15:21,530 --> 00:15:25,010
to state all other results.

279
00:15:25,010 --> 00:15:28,370
So we have-- oh, I didn't
give a query bound.

280
00:15:28,370 --> 00:15:30,410
That was the space.

281
00:15:30,410 --> 00:15:46,070
Query is P plus size of output
times log to the epsilon T.

282
00:15:46,070 --> 00:15:49,520
So, similar to this
one, but we don't

283
00:15:49,520 --> 00:15:54,470
have this trick over here.

284
00:15:54,470 --> 00:15:56,880
Another early result
is by Sadakane.

285
00:16:00,205 --> 00:16:10,510
I think also, maybe 2001; I have
the journal reference as 2003.

286
00:16:10,510 --> 00:16:13,737
This is in some ways
better, some ways worse,

287
00:16:13,737 --> 00:16:15,695
it's kind of incomparable
to the other results.

288
00:16:32,380 --> 00:16:40,435
This is bits, and then the
query has an extra log factor.

289
00:16:48,840 --> 00:16:50,439
This is again,
another early result

290
00:16:50,439 --> 00:16:51,480
that I want to highlight.

291
00:16:51,480 --> 00:16:53,520
Now I'm going to start
skipping results.

292
00:16:53,520 --> 00:16:55,320
The main innovation
here, is that it

293
00:16:55,320 --> 00:16:57,690
works good for large alphabets.

294
00:16:57,690 --> 00:17:00,030
This is a very small
dependence on sigma,

295
00:17:00,030 --> 00:17:02,841
whereas-- as I mentioned,
this structure really

296
00:17:02,841 --> 00:17:04,424
doesn't work well
for large alphabets.

297
00:17:04,424 --> 00:17:06,636
Here we're getting-- not
getting k-th order entropy,

298
00:17:06,636 --> 00:17:08,010
we're getting 0-th
order entropy.

299
00:17:08,010 --> 00:17:12,270
It's a somewhat weaker result.
The dependence on epsilon

300
00:17:12,270 --> 00:17:13,349
is more like this one.

301
00:17:16,680 --> 00:17:19,079
But if you just want
a log factor here,

302
00:17:19,079 --> 00:17:21,690
then this is a 1 plus
epsilon times H0.

303
00:17:21,690 --> 00:17:23,490
So in that sense,
we're doing better--

304
00:17:23,490 --> 00:17:26,880
only a 1 plus epsilon,
which is better.

305
00:17:26,880 --> 00:17:30,300
This thing was
always at least 2.

306
00:17:30,300 --> 00:17:33,000
This thing was always at least
5, the lead constant.

307
00:17:33,000 --> 00:17:34,980
Here the lead constant
can be 1 plus epsilon.

308
00:17:34,980 --> 00:17:38,010
This is almost
succinct, but not quite.

309
00:17:38,010 --> 00:17:40,050
It doesn't quite
compress as well--

310
00:17:40,050 --> 00:17:41,520
it only uses 0-th
order entropy--

311
00:17:41,520 --> 00:17:43,140
but that's still not bad.

312
00:17:43,140 --> 00:17:44,610
And then the other
big innovation

313
00:17:44,610 --> 00:17:47,670
is that the dependence
on sigma is small.

314
00:17:47,670 --> 00:17:49,800
The query is a little bit worse.

315
00:17:53,050 --> 00:17:55,710
OK, now fast forward
a little bit.

316
00:17:58,620 --> 00:18:00,750
I want to talk about
succinct data structures

317
00:18:00,750 --> 00:18:04,785
for suffix-tree-like queries.

318
00:18:07,350 --> 00:18:10,050
So there's two succinct
data structures out there,

319
00:18:10,050 --> 00:18:13,620
with more or less the same
authors as the first two

320
00:18:13,620 --> 00:18:15,480
results I talked about.

321
00:18:15,480 --> 00:18:19,050
So Grossi and Vitter,
together with Gupta,

322
00:18:19,050 --> 00:18:27,510
can get Hk of T times
T, which is optimal even

323
00:18:27,510 --> 00:18:33,390
with compression, with
k-th order compression.

324
00:18:33,390 --> 00:18:34,830
And a good dependence on sigma.

325
00:18:38,440 --> 00:18:39,170
Yeah, I guess--

326
00:18:39,170 --> 00:18:42,330
T log sigma is the
uncompressed bound.

327
00:18:42,330 --> 00:18:43,607
So you have to worry about--

328
00:18:43,607 --> 00:18:45,190
when you're talking
about compression,

329
00:18:45,190 --> 00:18:47,070
so here we have
the optimal bound

330
00:18:47,070 --> 00:18:50,670
using k-th order entropy
with a lead constant of 1,

331
00:18:50,670 --> 00:18:51,690
so that's great.

332
00:18:51,690 --> 00:18:53,220
That's what makes it succinct.

333
00:18:53,220 --> 00:18:54,840
As long as this is
little o of that.

334
00:18:54,840 --> 00:18:58,170
This is going to be a little
o of that, as long as Hk of T

335
00:18:58,170 --> 00:19:00,940
is not too small.

336
00:19:00,940 --> 00:19:06,420
If it's like 1 over log T, then
actually this term dominates.

337
00:19:06,420 --> 00:19:11,080
But as long as it's bigger
than log log T over log T,

338
00:19:11,080 --> 00:19:13,810
this thing, then you're fine.

339
00:19:13,810 --> 00:19:16,420
Just as long as you're not
compressing a huge amount,

340
00:19:16,420 --> 00:19:18,541
then this will be lower-order.

341
00:19:21,940 --> 00:19:23,010
Sorry, query time.

342
00:19:23,010 --> 00:19:25,860
Query's a little
bit worse, though.

343
00:19:25,860 --> 00:19:30,240
We have a log term with
a P, only a log sigma,

344
00:19:30,240 --> 00:19:35,050
but then we also have this
log squared over log log.

345
00:19:35,050 --> 00:19:41,280
Times log sigma,
and here I haven't--

346
00:19:41,280 --> 00:19:44,180
there isn't a clear dependence
on the size of the output.

347
00:19:44,180 --> 00:19:46,140
So this is-- let's say
size of output is 1.

348
00:19:46,140 --> 00:19:49,265
You just want to find one match.

349
00:19:49,265 --> 00:19:51,690
I won't write this dependence
on the size of the output.

350
00:19:51,690 --> 00:19:54,065
My guess is this is multiplied
by the size of the output,

351
00:19:54,065 --> 00:19:56,440
but it's not stated
explicitly in the paper,

352
00:19:56,440 --> 00:19:58,180
so I want to be careful.

353
00:19:58,180 --> 00:20:02,460
So we have a polylog
additive slowdown here.

354
00:20:02,460 --> 00:20:04,670
So it's a little
bit worse in time,

355
00:20:04,670 --> 00:20:06,420
but this space is
obviously, a lot better.

356
00:20:06,420 --> 00:20:12,840
We've improved our constant
factor from 5, over here, to 1.

357
00:20:12,840 --> 00:20:16,890
OK, and then there's
one more paper

358
00:20:16,890 --> 00:20:29,075
I want to mention, by Ferragina,
Manzini, Makinen, and Navarro.

359
00:20:32,160 --> 00:20:38,880
This is from just five
years ago now, 2007.

360
00:20:38,880 --> 00:20:47,890
They also achieved 1 times Hk
of T times T as the lead term.

361
00:20:47,890 --> 00:20:53,160
And they get T divided by log
to the epsilon n, so this is--

362
00:20:56,040 --> 00:20:59,370
yes, it's slight, there's
probably a log sigma here, too.

363
00:20:59,370 --> 00:21:03,190
I'm not sure, it might just be
T. Probably just T, actually.

364
00:21:03,190 --> 00:21:07,210
So we get rid of the log sigma,
but this log log over log

365
00:21:07,210 --> 00:21:08,140
gets slightly smaller.

366
00:21:08,140 --> 00:21:12,040
It's only a log to
the epsilon now.

367
00:21:12,040 --> 00:21:15,550
But the query bound is
a little bit better.

368
00:21:15,550 --> 00:21:18,120
So the P plus--

369
00:21:18,120 --> 00:21:27,380
size of the output times log to
the 1 plus epsilon T query.

370
00:21:30,490 --> 00:21:33,010
So instead of basically log
squared, we have log to 1

371
00:21:33,010 --> 00:21:35,290
plus epsilon, slightly better.

372
00:21:35,290 --> 00:21:39,350
They also have an
order P counting query.

373
00:21:39,350 --> 00:21:41,710
So if you just want to know
how many matches are there,

374
00:21:41,710 --> 00:21:47,530
they can do that really fast in
kind of regular time order P.
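
To make the counting query concrete, here is a sketch on an ordinary, uncompressed suffix array. It is not the compressed structure from the paper and it takes about P log T time rather than order P. It only shows the idea that all matches form one contiguous range of the suffix array, so counting is two binary searches. The function names are made up for this example.

```python
def naive_suffix_array(text):
    # Sort suffix start positions by the suffix they begin (tiny inputs only).
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    # Suffixes that start with `pattern` are adjacent in the suffix array.
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start

text = "banana$"
sa = naive_suffix_array(text)
matches = count_occurrences(text, sa, "ana")
```

Reporting the matches would then mean reading off the suffix array entries in that range, which is where the per-output cost in the bounds above comes from.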

375
00:21:47,530 --> 00:21:49,039
And this is
obviously very small.

376
00:21:49,039 --> 00:21:50,830
So this is probably
the best result so far,

377
00:21:50,830 --> 00:21:57,040
still obviously, lots of
open problems in this world.

378
00:21:57,040 --> 00:21:58,990
Still an active
area of research.

379
00:21:58,990 --> 00:22:01,769
There are papers
since these, but they

380
00:22:01,769 --> 00:22:03,310
don't achieve-- the
space bounds they

381
00:22:03,310 --> 00:22:04,390
achieve are not quite as good.

382
00:22:04,390 --> 00:22:06,250
There may be like 2
times Hk, and then they

383
00:22:06,250 --> 00:22:07,770
can get better query bounds.

384
00:22:07,770 --> 00:22:09,620
A lot of papers that
I'm not talking about,

385
00:22:09,620 --> 00:22:12,190
there's just a few too many.

386
00:22:12,190 --> 00:22:15,550
But if you just care about
space, this is the best so far.

387
00:22:15,550 --> 00:22:21,612
Or one of these two, depending
on exactly how big sigma is.

388
00:22:21,612 --> 00:22:24,070
Just to mention, there's some
other cool things you can do.

389
00:22:24,070 --> 00:22:28,240
So these are small space
static data structures.

390
00:22:28,240 --> 00:22:30,250
Some of them can
be made dynamic.

391
00:22:30,250 --> 00:22:32,440
But in particular,
there's work on,

392
00:22:32,440 --> 00:22:36,100
how do you actually build these
data structures with low space?

393
00:22:36,100 --> 00:22:39,070
Because you don't really want
to build a huge suffix tree

394
00:22:39,070 --> 00:22:40,180
and then compress it.

395
00:22:40,180 --> 00:22:42,138
Because the whole point
is you have a hard time

396
00:22:42,138 --> 00:22:43,550
storing this data structure.

397
00:22:43,550 --> 00:22:47,080
So in fact, there's
some papers--

398
00:22:47,080 --> 00:22:49,480
I think more along the lines
of these original results--

399
00:22:49,480 --> 00:22:52,630
the Grossi-Vitter, Ferragina-Manzini,
and Sadakane--

400
00:22:52,630 --> 00:22:54,820
building those data structures.

401
00:22:54,820 --> 00:22:57,250
And while you're building
the amount of working space

402
00:22:57,250 --> 00:23:01,540
is at least proportional to
the size of the final data

403
00:23:01,540 --> 00:23:02,170
structure.

404
00:23:02,170 --> 00:23:04,510
So that can be done.

405
00:23:04,510 --> 00:23:06,380
We're not going to
go into it here.

406
00:23:06,380 --> 00:23:08,537
There are other papers about--

407
00:23:08,537 --> 00:23:10,120
all of these papers
are focused on how

408
00:23:10,120 --> 00:23:12,130
do I do a search, how do
I search for a pattern,

409
00:23:12,130 --> 00:23:13,449
find all the matches.

410
00:23:13,449 --> 00:23:15,490
There's other things you
can do with suffix trees

411
00:23:15,490 --> 00:23:19,180
like, given two suffixes,
you can find the longest

412
00:23:19,180 --> 00:23:21,100
common prefix of them.

413
00:23:21,100 --> 00:23:23,200
So there's papers on how
to do that kind of stuff

414
00:23:23,200 --> 00:23:26,200
in the compressed regime.

415
00:23:26,200 --> 00:23:28,990
There's papers on-- or there is
a paper on how to do document

416
00:23:28,990 --> 00:23:32,290
retrieval, which is a problem
we looked at two lectures ago,

417
00:23:32,290 --> 00:23:33,670
in the string lecture.

418
00:23:33,670 --> 00:23:35,350
You want to find--
not all the matches,

419
00:23:35,350 --> 00:23:36,974
you want to find all
the documents that

420
00:23:36,974 --> 00:23:40,360
have this substring in them.

421
00:23:40,360 --> 00:23:42,400
So that can be--
that reduces the size

422
00:23:42,400 --> 00:23:46,150
of the output in these bounds.

423
00:23:46,150 --> 00:23:49,240
That can also be done, Sadakane
wrote a paper about that.

424
00:23:49,240 --> 00:23:50,316
Some work on dynamic--

425
00:23:50,316 --> 00:23:52,690
there's actually a lot of work
in implementing these data

426
00:23:52,690 --> 00:23:57,360
structures, definitely
FM-index, and I believe,

427
00:23:57,360 --> 00:23:58,850
maybe the Sadakane one.

428
00:23:58,850 --> 00:24:00,910
And maybe this--
versions of this one.

429
00:24:00,910 --> 00:24:03,160
I don't think the succinct
ones have been implemented,

430
00:24:03,160 --> 00:24:04,760
although I don't know for sure.

431
00:24:04,760 --> 00:24:06,718
But there's a lot of work
in implementing this,

432
00:24:06,718 --> 00:24:08,710
because people care,
and indeed they're

433
00:24:08,710 --> 00:24:11,760
small and reasonably fast.

434
00:24:11,760 --> 00:24:14,170
So if you need a
text index, there's

435
00:24:14,170 --> 00:24:18,490
freely available implementations
of at least some of these.

436
00:24:18,490 --> 00:24:21,670
So this is one of--

437
00:24:21,670 --> 00:24:25,350
I mean this is
practical stuff, too.

438
00:24:25,350 --> 00:24:26,020
Cool.

439
00:24:26,020 --> 00:24:29,590
But as I said, I'm going to
focus on the simplest I know,

440
00:24:29,590 --> 00:24:32,934
which is Grossi and Vitter.

441
00:24:32,934 --> 00:24:34,600
If you look at the
paper, there are sort

442
00:24:34,600 --> 00:24:36,340
of successive improvements.

443
00:24:36,340 --> 00:24:39,280
And we're going to
cover up to the point

444
00:24:39,280 --> 00:24:41,290
where we get a good space
bound, and the query

445
00:24:41,290 --> 00:24:44,310
won't be quite as good.

446
00:24:44,310 --> 00:24:47,710
So that's going to be
the bulk of the lecture.

447
00:24:47,710 --> 00:24:49,200
It's how to get
that space bound.

448
00:24:52,230 --> 00:24:54,120
And as I mentioned,
we're going to start out

449
00:24:54,120 --> 00:24:56,910
with a weaker bound, which
is getting T log log T bits,

450
00:24:56,910 --> 00:25:01,050
and then we'll see how
to improve that to T.

451
00:25:01,050 --> 00:25:04,440
And then we'll see how to
improve it to 1 over epsilon

452
00:25:04,440 --> 00:25:05,580
times T.

453
00:25:05,580 --> 00:25:07,759
So it will be a series
of improvements.

454
00:25:11,799 --> 00:25:13,590
And we're going to
start just with thinking

455
00:25:13,590 --> 00:25:16,390
about suffix arrays.

456
00:25:16,390 --> 00:25:19,110
So what is the compressed
suffix array problem?

457
00:25:19,110 --> 00:25:22,500
Well, it's just that I have--

458
00:25:22,500 --> 00:25:27,830
I want to be able to do
queries of the form SA of k.

459
00:25:27,830 --> 00:25:29,580
If I imagine the
suffixes in sorted order,

460
00:25:29,580 --> 00:25:30,930
what is the k-th suffix?

461
00:25:30,930 --> 00:25:32,520
Where does it begin?

462
00:25:32,520 --> 00:25:34,940
So I want to be able to
represent that array.

463
00:25:34,940 --> 00:25:36,750
And using that, you
could do searches,

464
00:25:36,750 --> 00:25:40,590
and later we'll see how to use
that to make a suffix tree.

465
00:25:40,590 --> 00:25:44,730
But for now, that's just our
goal, is to compute SA of k.

466
00:25:44,730 --> 00:25:47,400
OK, well, the idea is actually
going to be very familiar.

467
00:25:47,400 --> 00:25:49,920
We saw it two lectures ago,
when we did this divide

468
00:25:49,920 --> 00:25:52,200
and conquer for
building a suffix array.

469
00:25:52,200 --> 00:25:55,770
We did this-- we divided
the letters in our string

470
00:25:55,770 --> 00:25:58,170
by 0, 1, and 2, mod 3.

471
00:25:58,170 --> 00:26:00,050
We won't need mod 3.

472
00:26:00,050 --> 00:26:01,555
We'll just do mod 2 here.

473
00:26:01,555 --> 00:26:05,760
It won't actually matter
what constant we use.

474
00:26:05,760 --> 00:26:07,830
But we're going to
follow that recursion

475
00:26:07,830 --> 00:26:09,930
and use it to represent
the suffix array,

476
00:26:09,930 --> 00:26:12,720
instead of using it to build it.

477
00:26:12,720 --> 00:26:16,020
So the base case, and
set up some notation.

478
00:26:16,020 --> 00:26:20,730
T0 is going to represent T.
The length of that string I'm

479
00:26:20,730 --> 00:26:22,290
going to call n0 or n.

480
00:26:27,090 --> 00:26:31,590
And we have a
suffix array, which

481
00:26:31,590 --> 00:26:37,530
I'm going to call SA 0, which is
the suffix array of that text.

482
00:26:37,530 --> 00:26:38,820
So that's just notation.

483
00:26:38,820 --> 00:26:40,778
We're not actually storing
all of those things.

484
00:26:43,230 --> 00:26:49,410
Now, the recursion
is T k plus 1.

485
00:26:49,410 --> 00:26:52,770
That's going to be the next
level, which is, we write--

486
00:26:52,770 --> 00:26:55,880
we combine two letters, Tk--

487
00:26:55,880 --> 00:26:57,580
sorry, square bracket--

488
00:26:57,580 --> 00:27:02,970
2i comma Tk square
bracket 2i plus 1.

489
00:27:02,970 --> 00:27:06,480
Combine two adjacent
letters into one letter,

490
00:27:06,480 --> 00:27:12,480
and we do that for i
equals 0, 1, up to n/2.

491
00:27:15,510 --> 00:27:17,820
That's our new string.

492
00:27:17,820 --> 00:27:19,440
I'm not going to
sort these letters

493
00:27:19,440 --> 00:27:21,537
and remap the letters to
compress the alphabet.

494
00:27:21,537 --> 00:27:23,370
I'm just going to leave
those letters alone,

495
00:27:23,370 --> 00:27:24,750
as an ordered pair.

496
00:27:24,750 --> 00:27:29,370
In general, at level
Tk, a single letter

497
00:27:29,370 --> 00:27:32,027
is actually 2 to the k letters.

498
00:27:32,027 --> 00:27:34,110
But still, this is a useful
way to think about it,

499
00:27:34,110 --> 00:27:36,190
because it lets me think
about fewer suffixes.

500
00:27:36,190 --> 00:27:38,880
Here, we only have
the even suffixes,

501
00:27:38,880 --> 00:27:46,650
suffixes that begin at even
positions relative to Tk.

502
00:27:46,650 --> 00:27:49,500
The size of this string, in
terms of number of letters,

503
00:27:49,500 --> 00:27:51,360
is 1/2 of the original.

504
00:27:51,360 --> 00:27:56,430
So in general, this is going
to be n over 2 to the k.

505
00:27:56,430 --> 00:28:01,680
And then we're interested in
the suffix array SA k plus 1.

506
00:28:01,680 --> 00:28:07,630
This is going to be just
looking at the even values.

507
00:28:07,630 --> 00:28:17,340
So if we extract even
entries from sorry, SA k.

508
00:28:17,340 --> 00:28:19,770
So if we already
have SA k, we just

509
00:28:19,770 --> 00:28:22,380
take the even values
that are in there.

510
00:28:22,380 --> 00:28:25,620
Those are the ones that
are existing suffixes.

511
00:28:25,620 --> 00:28:28,050
Extract those, divide by 2.

512
00:28:28,050 --> 00:28:32,220
That will be the suffix
array of this text.
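
A minimal sketch of the recursion just described, assuming a naive sorting-based suffix array, an even-length text, and a suffix array that also lists the empty suffix at position n (an assumption made here so every position has a successor). The names `suffix_array` and `next_level` are illustrative, not from the lecture.

```python
def suffix_array(text):
    """Naive suffix array over positions 0..len(text), including the empty suffix."""
    return sorted(range(len(text) + 1), key=lambda p: text[p:])


def next_level(text_k, sa_k):
    """Return (T_{k+1}, SA_{k+1}) from T_k (even length) and SA_k."""
    assert len(text_k) % 2 == 0
    # Combine two adjacent letters into one letter, kept as an ordered pair.
    text_next = [(text_k[2 * i], text_k[2 * i + 1]) for i in range(len(text_k) // 2)]
    # Extract the even entries of SA_k, in order, and divide them by 2.
    sa_next = [p // 2 for p in sa_k if p % 2 == 0]
    return text_next, sa_next


text0 = list("abracadabra$")      # length 12, '$' sorts below the letters
sa0 = suffix_array(text0)
text1, sa1 = next_level(text0, sa0)
```

Sorting the suffixes of `text1` directly gives the same array as `sa1`, which is the sense in which the extracted even entries are the suffix array of the paired text.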

513
00:28:32,220 --> 00:28:33,810
This is kind of
backwards from how

514
00:28:33,810 --> 00:28:35,820
you would construct the thing.

515
00:28:35,820 --> 00:28:37,290
You would construct
it bottom up.

516
00:28:37,290 --> 00:28:38,370
Here, we're
imagining-- we already

517
00:28:38,370 --> 00:28:40,590
know the suffix arrays; we're
just about representation.

518
00:28:40,590 --> 00:28:43,440
So this is a top-down
kind of definition

519
00:28:43,440 --> 00:28:45,650
of what we're trying to store.

520
00:28:45,650 --> 00:28:48,399
OK, so this is
what we want to do.

521
00:28:48,399 --> 00:28:50,190
Now we are going to
build things bottom up.

522
00:28:50,190 --> 00:28:51,689
We're going to
imagine we've already

523
00:28:51,689 --> 00:28:54,210
represented SA k plus 1.

524
00:28:54,210 --> 00:28:57,060
And now we need
to represent SA k.

525
00:28:57,060 --> 00:29:04,050
If we can represent SA k
in terms of SA k plus 1

526
00:29:04,050 --> 00:29:06,420
with not too many
bits, then you add up

527
00:29:06,420 --> 00:29:08,114
all of the levels of recursion.

528
00:29:08,114 --> 00:29:10,530
We'll have to talk about how
many levels of this recursion

529
00:29:10,530 --> 00:29:11,113
we need to do.

530
00:29:11,113 --> 00:29:13,470
We're not going to go
down to constant size.

531
00:29:13,470 --> 00:29:17,310
We'll just go log log n levels.

532
00:29:17,310 --> 00:29:20,940
But we just add up
all those costs,

533
00:29:20,940 --> 00:29:23,826
and we'll get the overall
size of our data structure.

534
00:29:27,560 --> 00:29:30,080
So how do we do
this representation?

535
00:29:30,080 --> 00:29:34,190
I need to define two
kind of weird things,

536
00:29:34,190 --> 00:29:37,280
and then we'll see why
they're interesting.

537
00:29:37,280 --> 00:29:44,920
OK, the first thing is called
even successor sub k of i.

538
00:29:44,920 --> 00:29:47,330
So let me define it.

539
00:29:47,330 --> 00:29:55,700
It's going to be i if
the i-th suffix starts

540
00:29:55,700 --> 00:29:58,010
in an even position.

541
00:29:58,010 --> 00:30:00,754
So it doesn't do anything
for the even guys.

542
00:30:00,754 --> 00:30:02,420
The interesting thing
is when the suffix

543
00:30:02,420 --> 00:30:04,370
starts in an odd position.

544
00:30:04,370 --> 00:30:06,730
Then we're going to write
down a different number j.

545
00:30:09,860 --> 00:30:14,120
This is going to look kind
of weird, but it's actually--

546
00:30:14,120 --> 00:30:19,330
it's simple after you think
about it for 10 minutes.

547
00:30:19,330 --> 00:30:23,880
This one is odd.

548
00:30:23,880 --> 00:30:26,330
OK, so the other
situation is that SA k--

549
00:30:26,330 --> 00:30:29,550
the i-th suffix starts
at an even position.

550
00:30:29,550 --> 00:30:31,910
So let me draw a little picture.

551
00:30:31,910 --> 00:30:36,814
So here is SA of i.

552
00:30:39,600 --> 00:30:42,920
OK, if this happens to
be odd, this position

553
00:30:42,920 --> 00:30:44,645
in the text-- this is Tk.

554
00:30:47,300 --> 00:30:50,540
Then I want to go here.

555
00:30:50,540 --> 00:30:51,050
OK?

556
00:30:51,050 --> 00:30:53,008
Because that's an even
position, it's a suffix,

557
00:30:53,008 --> 00:30:55,080
it's right next to the
suffix I care about.

558
00:30:55,080 --> 00:30:57,710
It is what we call the
even successor suffix.

559
00:30:57,710 --> 00:30:59,810
But I don't want to
know the index of that.

560
00:30:59,810 --> 00:31:03,590
The index of that would
just be SA k of i plus 1.

561
00:31:03,590 --> 00:31:07,580
I want to map backwards
through SA inverse.

562
00:31:07,580 --> 00:31:12,320
I want to know, what is
the rank of that suffix?

563
00:31:12,320 --> 00:31:17,120
Which suffix j
starts right there?

564
00:31:17,120 --> 00:31:19,790
I want to know that the
j-th suffix starts right

565
00:31:19,790 --> 00:31:23,480
after the i-th suffix, and
I want to write down j.

566
00:31:23,480 --> 00:31:26,590
We'll see why this is the
right thing in a moment.

567
00:31:26,590 --> 00:31:29,270
We're just mapping
through SA, adding 1,

568
00:31:29,270 --> 00:31:32,350
and then mapping
backwards through SA.
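
A small sketch of even successor as defined above, stored here as a plain table rather than in the compressed form the lecture gets to later. It assumes 0-based indices, an even-length text, and a suffix array that includes the empty suffix at position n; `even_successor_table` is a hypothetical helper name.

```python
def suffix_array(text):
    """Naive suffix array over positions 0..len(text), including the empty suffix."""
    return sorted(range(len(text) + 1), key=lambda p: text[p:])


def even_successor_table(sa):
    """even_succ[i] = i if SA[i] is even, else the rank j with SA[j] = SA[i] + 1."""
    inverse = [0] * len(sa)
    for rank, pos in enumerate(sa):
        inverse[pos] = rank            # SA inverse: text position -> rank
    return [i if sa[i] % 2 == 0 else inverse[sa[i] + 1] for i in range(len(sa))]


text = list("abracadabra$")            # even length, so every odd position has a successor
sa = suffix_array(text)
succ = even_successor_table(sa)
```

For an odd suffix this is exactly the three steps from the lecture: map through SA, add 1, map backwards through SA.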

569
00:31:32,350 --> 00:31:33,584
So that's a function.

570
00:31:33,584 --> 00:31:35,750
We're going to store that
function in a particular--

571
00:31:35,750 --> 00:31:40,670
in a very weird way, which
we'll get to in a moment.

572
00:31:40,670 --> 00:31:45,420
OK, next thing we need
is called even rank.

573
00:31:45,420 --> 00:31:47,910
This is going to be
like our rank function.

574
00:31:47,910 --> 00:31:49,850
We've had it before.

575
00:31:49,850 --> 00:31:55,220
This is going to be the
number of even suffixes--

576
00:31:55,220 --> 00:31:59,090
even suffixes are suffixes
starting at even positions--

577
00:31:59,090 --> 00:32:05,600
preceding the i-th suffix.

578
00:32:05,600 --> 00:32:09,570
i-th suffix meaning the
i-th one in sorted order.

579
00:32:09,570 --> 00:32:11,180
So the suffix SA of i.

580
00:32:14,600 --> 00:32:15,980
Yes, so this is--

581
00:32:15,980 --> 00:32:18,170
let me be more precise.

582
00:32:18,170 --> 00:32:28,460
This is the number of even
values in SA k up to i.

583
00:32:28,460 --> 00:32:30,890
So we're looking--
so this was the text.

584
00:32:30,890 --> 00:32:33,680
Now we're looking at the
suffix array, which has

585
00:32:33,680 --> 00:32:35,660
the suffixes in sorted order.

586
00:32:35,660 --> 00:32:38,440
We're looking at position i
here, and we want to know,

587
00:32:38,440 --> 00:32:41,270
of all of these values,
which ones are even?

588
00:32:41,270 --> 00:32:42,540
Or how many are even--

589
00:32:42,540 --> 00:32:44,390
that's the even rank.

590
00:32:44,390 --> 00:32:46,250
Again, a weird thing,
we'll see why it's

591
00:32:46,250 --> 00:32:47,530
the right thing in a moment.

592
00:32:55,760 --> 00:32:56,690
Right now, in fact.

593
00:33:02,440 --> 00:33:07,900
So here is observation 3,
putting these together.

594
00:33:07,900 --> 00:33:09,295
This is a rather long equation.

595
00:33:12,040 --> 00:33:13,510
Ultimately, I want to know--

596
00:33:13,510 --> 00:33:16,260
I want to represent SA k of i.

597
00:33:16,260 --> 00:33:17,830
I'm trying to represent that.

598
00:33:17,830 --> 00:33:22,090
And I want the right-hand side
to only refer to SA k plus 1.

599
00:33:22,090 --> 00:33:24,280
So here's the claim.

600
00:33:24,280 --> 00:33:28,540
Take 2 times SA k plus 1 of--

601
00:33:35,187 --> 00:33:36,520
I'm going to need another board.

602
00:33:50,240 --> 00:33:51,830
Not of i.

603
00:33:51,830 --> 00:34:03,110
Even rank of even successor
of i, minus 1 minus

604
00:34:03,110 --> 00:34:14,060
is even suffix of i.

605
00:34:14,060 --> 00:34:16,320
OK, so that's the equation.
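
Written out in symbols, the spoken equation reads as follows. The names even-rank, even-succ, and is-even-suffix follow the lecture; the exact index convention (whether ranks count from 0 or 1) is an assumption that depends on how the arrays are indexed.

```latex
\mathrm{SA}_k[i] \;=\; 2 \cdot \mathrm{SA}_{k+1}\bigl[\,\text{even-rank}_k\bigl(\text{even-succ}_k(i)\bigr)\bigr]
\;-\; \bigl(1 - \text{is-even-suffix}_k(i)\bigr)
```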

606
00:34:16,320 --> 00:34:19,580
Let me unpack this a little bit.

607
00:34:19,580 --> 00:34:22,240
The idea is, we want to
know about a suffix i.

608
00:34:22,240 --> 00:34:24,080
If i happens to be even--

609
00:34:24,080 --> 00:34:26,929
sorry, not if i happens
to be even-- if SA of i

610
00:34:26,929 --> 00:34:29,510
happens to be
even, we're golden.

611
00:34:29,510 --> 00:34:33,260
Because that suffix is
represented by SA k plus 1,

612
00:34:33,260 --> 00:34:34,429
but it might not be even.

613
00:34:34,429 --> 00:34:38,120
So we want to round
it to an even suffix.

614
00:34:38,120 --> 00:34:41,150
Knowing about this odd
suffix is just about as good

615
00:34:41,150 --> 00:34:44,340
as knowing about the suffix
that starts right after it.

616
00:34:44,340 --> 00:34:46,310
So that's what even
successor does.

617
00:34:46,310 --> 00:34:51,620
This is rounding
to an even suffix,

618
00:34:51,620 --> 00:34:54,949
meaning a suffix starting
at an even position.

619
00:34:59,870 --> 00:35:04,820
Now there's this issue
that over here, we

620
00:35:04,820 --> 00:35:08,630
have this relation between
SA k and SA k plus 1,

621
00:35:08,630 --> 00:35:11,970
but it extracts
the even entries.

622
00:35:11,970 --> 00:35:14,300
So if you think about the
suffix array, which now I'm

623
00:35:14,300 --> 00:35:16,508
going to draw a vertical,
because that's more normal.

624
00:35:19,580 --> 00:35:21,964
Some of these values
are going to be even,

625
00:35:21,964 --> 00:35:24,380
but you don't really know which
ones are going to be even.

626
00:35:24,380 --> 00:35:26,530
It's an arbitrary subset of--

627
00:35:26,530 --> 00:35:29,630
in SA k, are even values.

628
00:35:29,630 --> 00:35:35,950
And those are the ones that you
extract and form SA k plus 1.

629
00:35:35,950 --> 00:35:38,600
But it's an arbitrary
subset, that's kind of a--

630
00:35:38,600 --> 00:35:40,550
you can't just divide
by 2 or something.

631
00:35:40,550 --> 00:35:42,140
It's not the right thing.

632
00:35:42,140 --> 00:35:45,680
If I'm given an index into
here, even if it's an even one,

633
00:35:45,680 --> 00:35:48,470
I need to know what the
corresponding index is

634
00:35:48,470 --> 00:35:50,270
over here.

635
00:35:50,270 --> 00:35:53,390
And that, I claim,
is exactly even rank.

636
00:35:53,390 --> 00:35:59,090
Because what position does
this cell become over here?

637
00:35:59,090 --> 00:36:03,140
Well, however many even
numbers there are above it.

638
00:36:03,140 --> 00:36:05,840
So you take-- that's what
this definition was, a number

639
00:36:05,840 --> 00:36:08,060
of even values in that prefix.

640
00:36:08,060 --> 00:36:12,360
That is the position you
will be in, in SA k plus 1.

641
00:36:12,360 --> 00:36:15,510
So this is what I
would call the name--

642
00:36:15,510 --> 00:36:17,630
we've now rounded to
an even suffix but now

643
00:36:17,630 --> 00:36:21,530
we need to find the name
of that even suffix--

644
00:36:21,530 --> 00:36:25,070
in SA k plus 1.

645
00:36:25,070 --> 00:36:29,120
So that's exactly
what even rank does.

646
00:36:29,120 --> 00:36:33,050
So now we can dereference
SA k plus 1 of that thing.

647
00:36:33,050 --> 00:36:38,120
That will give us a
position into the text T k

648
00:36:38,120 --> 00:36:42,830
plus 1, where that
suffix begins.

649
00:36:42,830 --> 00:36:47,960
Now that's an index into
this divided by 2 string,

650
00:36:47,960 --> 00:36:51,560
we need to uncompress that to
an index into the actual string.

651
00:36:51,560 --> 00:36:52,560
And there are two parts.

652
00:36:52,560 --> 00:36:55,220
One is we need to multiply by
2, because every letter in T

653
00:36:55,220 --> 00:36:57,770
k plus 1 is two letters in Tk.

654
00:36:57,770 --> 00:36:58,670
So multiply by 2.

655
00:36:58,670 --> 00:37:01,070
And sometimes we
need to subtract 1.

656
00:37:01,070 --> 00:37:04,670
We basically need to subtract
1 if even successor did
1 if if even successor did

657
00:37:04,670 --> 00:37:05,505
anything.

658
00:37:05,505 --> 00:37:08,540
If even successor essentially
moved us to the right by 1,

659
00:37:08,540 --> 00:37:10,040
now we need to move
back to the left

660
00:37:10,040 --> 00:37:13,080
by 1, if this moved us at all.

661
00:37:13,080 --> 00:37:15,620
So I have one more function
here, which is is even suffix.

662
00:37:15,620 --> 00:37:19,670
Was SA of i an even--

663
00:37:19,670 --> 00:37:22,490
SA sub k of i, an
even number already.

664
00:37:22,490 --> 00:37:25,580
Which means that even
successor did nothing.

665
00:37:25,580 --> 00:37:30,180
If it did nothing,
then 1 minus 1 is 0,

666
00:37:30,180 --> 00:37:31,550
and so nothing happens.

667
00:37:31,550 --> 00:37:33,800
If it did something,
then is even suffix

668
00:37:33,800 --> 00:37:35,780
will be 0, because it was odd.

669
00:37:35,780 --> 00:37:37,250
And then we're subtracting 1.

670
00:37:37,250 --> 00:37:40,550
So this just means
subtract 1, if it was odd.

671
00:37:40,550 --> 00:37:43,130
You might say minus is
odd suffix, instead of

672
00:37:43,130 --> 00:37:44,510
1 minus is even suffix.

673
00:37:44,510 --> 00:37:47,330
But it turns out, this is
the thing I want to store,

674
00:37:47,330 --> 00:37:50,424
so I wrote it in a weird way.

675
00:37:50,424 --> 00:37:51,590
Why did I write it that way?

676
00:37:51,590 --> 00:37:57,380
Because is even suffix
is related to even rank.

677
00:37:57,380 --> 00:38:04,040
Even rank is just rank
sub 1 of is even suffix.

678
00:38:04,040 --> 00:38:05,900
And we already saw
how to do rank sub 1,

679
00:38:05,900 --> 00:38:09,650
and so that's why I
wanted to reuse it.

680
00:38:09,650 --> 00:38:12,800
I think you see now why
this equation holds.
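
As a sanity check, the equation can be tested on a small example. This is a sketch under stated assumptions: 0-based arrays, a naive suffix array that includes the empty suffix, an even-length text, and an even rank that counts even entries up to and including index i (so the index into SA k plus 1 is that count minus 1). All function names are illustrative.

```python
def suffix_array(text):
    """Naive suffix array over positions 0..len(text), including the empty suffix."""
    return sorted(range(len(text) + 1), key=lambda p: text[p:])


text = list("abracadabra$")                        # even length
sa_k = suffix_array(text)
sa_next = [p // 2 for p in sa_k if p % 2 == 0]     # SA_{k+1}: even entries, halved

is_even_suffix = [1 if p % 2 == 0 else 0 for p in sa_k]
inverse = [0] * len(sa_k)
for rank, pos in enumerate(sa_k):
    inverse[pos] = rank


def even_succ(i):
    """Round the i-th suffix to an even suffix (identity if already even)."""
    return i if is_even_suffix[i] else inverse[sa_k[i] + 1]


def even_rank(i):
    """Number of even values in SA_k[0..i], inclusive."""
    return sum(is_even_suffix[: i + 1])


def sa_from_next_level(i):
    """Recover SA_k[i] using only SA_{k+1} and the three helper functions."""
    j = even_succ(i)
    return 2 * sa_next[even_rank(j) - 1] - (1 - is_even_suffix[i])


recovered = [sa_from_next_level(i) for i in range(len(sa_k))]
```

Here `recovered` matches `sa_k` entry by entry, which is what the equation claims.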

681
00:38:12,800 --> 00:38:16,820
What remains is how to
store is even suffix,

682
00:38:16,820 --> 00:38:18,957
even rank, even successor.

683
00:38:23,340 --> 00:38:26,590
One other thing that
remains, is to say

684
00:38:26,590 --> 00:38:29,240
when to stop this recursion.

685
00:38:29,240 --> 00:38:32,885
So I claim it's enough to just
do this recursion for log log n

686
00:38:32,885 --> 00:38:33,385
levels.

687
00:38:43,900 --> 00:38:46,900
And then I'll call log
log n l, the number

688
00:38:46,900 --> 00:38:48,620
of levels in this recursion.

689
00:38:48,620 --> 00:38:53,440
Because at that point,
n sub l equals n over--

690
00:38:53,440 --> 00:38:55,210
it's n over 2 to
the l, so that's

691
00:38:55,210 --> 00:38:57,760
going to be n over log n.

692
00:38:57,760 --> 00:39:00,880
Once I have a string
of length n over log n,

693
00:39:00,880 --> 00:39:04,600
I can afford the regular
boring representation

694
00:39:04,600 --> 00:39:11,310
of a suffix array.

695
00:39:11,310 --> 00:39:16,560
I can afford T log T bits,
when T is only n over log n.

696
00:39:16,560 --> 00:39:18,790
If you want to be a
little extra clever,

697
00:39:18,790 --> 00:39:22,500
you can put a factor 2 here,
and then there's a square here.

698
00:39:22,500 --> 00:39:25,140
And so then you're really
paying little o of T

699
00:39:25,140 --> 00:39:27,485
in order to store that thing.

700
00:39:27,485 --> 00:39:28,860
So once you get
down to here, you

701
00:39:28,860 --> 00:39:31,510
can afford a simple
representation.

702
00:39:31,510 --> 00:39:36,210
Now let's think about
how to compute SA,

703
00:39:36,210 --> 00:39:40,360
like the original SA,
sub 0, of an index.

704
00:39:40,360 --> 00:39:46,740
Well I apply this
formula at all times,

705
00:39:46,740 --> 00:39:49,690
I do all these computations.

706
00:39:49,690 --> 00:39:52,806
And now I've reduced
the problem to SA 1,

707
00:39:52,806 --> 00:39:54,180
and then I do
these computations.

708
00:39:54,180 --> 00:39:56,170
I reduce it to SA 2, and so on.

709
00:39:56,170 --> 00:40:00,350
After l steps, I'll have
reduced it to an SA query

710
00:40:00,350 --> 00:40:02,070
in a boring old
suffix array, which

711
00:40:02,070 --> 00:40:04,090
I've just stored as an array.

712
00:40:04,090 --> 00:40:07,160
So then I can answer it, and
then I pop up the recursion,

713
00:40:07,160 --> 00:40:11,460
log log n times, doing these
adjustments as appropriate.

714
00:40:11,460 --> 00:40:15,780
In the end, I get the correct
index into the original text T.
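
The query procedure can be sketched as a short recursion. This is an illustration only: the per-level tables are stored explicitly here (not in the compressed encodings discussed later), the text length is assumed to be a multiple of 2 to the number of levels, and the suffix arrays include the empty suffix. `build_levels` and `lookup` are hypothetical names.

```python
def suffix_array(text):
    """Naive suffix array over positions 0..len(text), including the empty suffix."""
    return sorted(range(len(text) + 1), key=lambda p: text[p:])


def build_levels(text, levels):
    """Return per-level (is_even, even_succ, even_rank) tables and the bottom array."""
    assert len(text) % (2 ** levels) == 0
    sa = suffix_array(text)
    tables = []
    for _ in range(levels):
        inverse = [0] * len(sa)
        for rank, pos in enumerate(sa):
            inverse[pos] = rank
        is_even = [1 if p % 2 == 0 else 0 for p in sa]
        succ = [i if is_even[i] else inverse[sa[i] + 1] for i in range(len(sa))]
        rank_upto, count = [], 0
        for bit in is_even:                    # inclusive prefix counts of even entries
            count += bit
            rank_upto.append(count)
        tables.append((is_even, succ, rank_upto))
        sa = [p // 2 for p in sa if p % 2 == 0]    # next level's suffix array
    return tables, sa                              # sa is now the explicit bottom level


def lookup(tables, bottom_sa, i, level=0):
    """Compute SA_level[i] by recursing down to the explicitly stored bottom array."""
    if level == len(tables):
        return bottom_sa[i]
    is_even, succ, rank_upto = tables[level]
    j = succ[i]
    below = lookup(tables, bottom_sa, rank_upto[j] - 1, level + 1)
    return 2 * below - (1 - is_even[i])


text = list("abracadabra$")                    # length 12, a multiple of 2**2
tables, bottom = build_levels(text, levels=2)
answers = [lookup(tables, bottom, i) for i in range(len(text) + 1)]
```

Each call does constant work per level with these tables, so a query costs time proportional to the number of levels.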

715
00:40:15,780 --> 00:40:17,850
How much time did it take?

716
00:40:17,850 --> 00:40:19,110
Order log log n time.

717
00:40:23,640 --> 00:40:30,050
So I can do a log log
n time query to SA.

718
00:40:34,740 --> 00:40:37,940
This is, of course, assuming
that even rank, even successor,

719
00:40:37,940 --> 00:40:42,030
and is even suffix are all
constant time operations.

720
00:40:42,030 --> 00:40:44,340
So what remains is
to do each of these

721
00:40:44,340 --> 00:40:46,740
in small space
and constant time.

722
00:40:46,740 --> 00:40:50,790
Then my overall query time will
only go up by log log factor.

723
00:40:50,790 --> 00:40:52,830
This is actually going
to be pretty good,

724
00:40:52,830 --> 00:40:54,470
we're not going to--

725
00:40:54,470 --> 00:40:56,550
we're going to
achieve log log query

726
00:40:56,550 --> 00:40:59,920
when we have T log log T bits.

727
00:40:59,920 --> 00:41:02,150
That'll be our first
encoding of these things.

728
00:41:02,150 --> 00:41:03,608
Later on, we're
going to have to go up

729
00:41:03,608 --> 00:41:05,970
to log to the epsilon, which
is worse than log log n.

730
00:41:08,620 --> 00:41:09,600
Clear, so far?

731
00:41:09,600 --> 00:41:12,447
Everything is pretty
easy at this point now.

732
00:41:12,447 --> 00:41:14,280
It's going to remain
easy, it's just there's

733
00:41:14,280 --> 00:41:15,780
a lot of pieces to the puzzle.

734
00:41:15,780 --> 00:41:18,300
This is the first--
this is the big idea.

735
00:41:18,300 --> 00:41:21,240
Next thing is some
fancy encoding schemes

736
00:41:21,240 --> 00:41:22,615
to make these
things quite small.

737
00:41:22,615 --> 00:41:23,114
Question?

738
00:41:23,114 --> 00:41:25,690
AUDIENCE: [INAUDIBLE] Did you
say what the space [INAUDIBLE]

739
00:41:25,690 --> 00:41:26,640
was?

740
00:41:26,640 --> 00:41:27,510
ERIK DEMAINE: We haven't
analyzed space yet,

741
00:41:27,510 --> 00:41:28,710
because I haven't said
how we're actually

742
00:41:28,710 --> 00:41:29,812
storing these functions.

743
00:41:29,812 --> 00:41:31,770
If you stored these
functions explicitly, you'd

744
00:41:31,770 --> 00:41:34,590
have bad space, probably still
T log T. But it turns out,

745
00:41:34,590 --> 00:41:38,400
these functions can be encoded
in a clever way that's small--

746
00:41:38,400 --> 00:41:41,610
smaller, it's going
to be T log log T.

747
00:41:41,610 --> 00:41:44,710
And still has
constant time query.

748
00:41:44,710 --> 00:41:47,266
AUDIENCE: Without the functions,
how much space are we using?

749
00:41:47,266 --> 00:41:48,765
ERIK DEMAINE: Without
the functions,

750
00:41:48,765 --> 00:41:51,090
we're using,
essentially, no space.

751
00:41:51,090 --> 00:41:54,096
I guess, at the end
where we're using--

752
00:41:54,096 --> 00:41:55,470
the only thing
we've said so far,

753
00:41:55,470 --> 00:41:58,120
is at the end we use an
explicit suffix array.

754
00:41:58,120 --> 00:42:00,480
And if you set this
to 2 log log T,

755
00:42:00,480 --> 00:42:04,010
then this would be like n
over log n bits of space.

756
00:42:04,010 --> 00:42:07,860
Because it's going
to be this times--

757
00:42:07,860 --> 00:42:12,980
I mean, the space at the bottom
is going to be nl log nl.

758
00:42:12,980 --> 00:42:15,150
That's to store an
explicit suffix array,

759
00:42:15,150 --> 00:42:17,400
so it's going to be
this times log of this,

760
00:42:17,400 --> 00:42:23,190
which is going to be n over
log n, if we put the 2 in.

761
00:42:23,190 --> 00:42:26,280
So that part's really cheap,
and that's little o of n.
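
The arithmetic behind that bound, written out as a sketch. It assumes logarithms base 2 and takes the number of levels to be exactly 2 log log n, ignoring rounding.

```latex
\ell = 2\log\log n
\;\Longrightarrow\;
n_\ell = \frac{n}{2^{\ell}} = \frac{n}{\log^{2} n},
\qquad
n_\ell \log n_\ell \;\le\; \frac{n}{\log^{2} n}\cdot \log n \;=\; \frac{n}{\log n} \;=\; o(n).
```

With only log log n levels the same calculation gives (n / log n) times log n, which is order n bits for the bottom array.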

762
00:42:26,280 --> 00:42:28,670
Of course, we probably also
have to store the text.

763
00:42:28,670 --> 00:42:30,720
So that's n bits.

764
00:42:30,720 --> 00:42:32,572
I didn't mention--
I'm going to assume,

765
00:42:32,572 --> 00:42:33,780
I don't think we need it yet.

766
00:42:33,780 --> 00:42:36,600
At some point I will assume
that the alphabet's binary.

767
00:42:36,600 --> 00:42:39,700
So I'm going to leave off--
when I say n bits, really it's

768
00:42:39,700 --> 00:42:42,270
n log sigma bits, or n
characters, or whatever.

769
00:42:42,270 --> 00:42:45,850
But I'm not going to
worry about that here.

770
00:42:45,850 --> 00:42:48,810
Are there questions?

771
00:42:48,810 --> 00:42:51,064
So now, it's an
encoding problem.

772
00:42:51,064 --> 00:42:52,230
How do we encode these guys?

773
00:42:57,120 --> 00:43:00,419
Actually, even successor is the
only thing that's non-trivial.

774
00:43:00,419 --> 00:43:02,460
We're going to do the
obvious thing for the rest.

775
00:43:05,110 --> 00:43:07,910
So let me tell you about
the obvious ones, easy ones.

776
00:43:16,527 --> 00:43:18,110
At least, the first
revision we're not

777
00:43:18,110 --> 00:43:20,540
going to do anything fancy
with them, later on we will.

778
00:43:26,170 --> 00:43:27,865
Sorry, is even suffix.

779
00:43:37,010 --> 00:43:39,080
We're just going to store
this as a bit vector.

780
00:43:39,080 --> 00:43:47,990
This is 1 if SA k is
even, 0 if it's odd.

781
00:43:47,990 --> 00:43:49,840
So if we just store
that as a bit vector,

782
00:43:49,840 --> 00:43:55,730
this is n sub k bits
that we can afford.

783
00:43:55,730 --> 00:43:57,439
Because this is a
geometric series,

784
00:43:57,439 --> 00:43:58,480
it's going to be order n.
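
The geometric series in question, as a short worked sum. It assumes n sub k is n over 2 to the k, as defined earlier, and sums the bit vectors over all levels of the recursion.

```latex
\sum_{k=0}^{\ell-1} n_k \;=\; \sum_{k=0}^{\ell-1} \frac{n}{2^{k}} \;<\; n \sum_{k=0}^{\infty} 2^{-k} \;=\; 2n \;=\; O(n).
```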

785
00:44:02,030 --> 00:44:03,290
Next is even rank.

786
00:44:06,830 --> 00:44:10,550
This is just the
rank one structure

787
00:44:10,550 --> 00:44:14,900
that we covered last
class, on this thing.

788
00:44:14,900 --> 00:44:19,420
So this is going to be nk--

789
00:44:19,420 --> 00:44:25,610
I think we did log
log nk over log nk.

790
00:44:25,610 --> 00:44:28,670
And this can be improved
to nk over log to the k--

791
00:44:28,670 --> 00:44:31,760
or log to the something of nk.

792
00:44:31,760 --> 00:44:33,830
But that's an OK bound.

793
00:44:33,830 --> 00:44:36,740
It's little o of n.
Again, this is geometric,
Again, this is geometric,

794
00:44:36,740 --> 00:44:39,890
so this overall will
be little o of n.

795
00:44:39,890 --> 00:44:48,032
So those are easy, the remaining
part is doing even successor.

796
00:45:00,120 --> 00:45:03,370
A little optimization.

797
00:45:03,370 --> 00:45:09,640
For the i's where
SA k of i is even,

798
00:45:09,640 --> 00:45:11,320
we don't really need
to store anything.

799
00:45:11,320 --> 00:45:14,870
Because then, even successor
is the identity function.

800
00:45:14,870 --> 00:45:16,810
So let's forget
about those guys.

801
00:45:16,810 --> 00:45:21,640
I'll say, it's trivial
for even successors--

802
00:45:21,640 --> 00:45:24,490
for even suffixes.

803
00:45:29,350 --> 00:45:33,130
So what I'd like to do, is store
the answers for odd suffixes.

804
00:45:33,130 --> 00:45:36,550
That's what we're going to do.

805
00:45:36,550 --> 00:45:39,880
We're going to store them in
a weird way, as we will see.

806
00:45:50,388 --> 00:45:52,100
So that's the odd suffixes.

807
00:45:52,100 --> 00:45:59,054
There are nk over 2 evens,
and there are nk over 2 odds.

808
00:45:59,054 --> 00:46:00,470
So we've just saved
a factor of 2.

809
00:46:00,470 --> 00:46:02,350
This wasn't a very
deep observation.

810
00:46:02,350 --> 00:46:06,320
But it turns out, if you
focus in on the odd ones,

811
00:46:06,320 --> 00:46:08,500
there's a nice little
structure to them.

812
00:46:12,330 --> 00:46:14,770
This step isn't
really necessary,

813
00:46:14,770 --> 00:46:16,060
but it saves a factor of 2.

814
00:46:24,910 --> 00:46:29,560
Now the kind of
interesting observation.

815
00:46:29,560 --> 00:46:34,269
What I'd like to do is store
these answers in order by i.

816
00:46:34,269 --> 00:46:35,560
That's the obvious thing to do.

817
00:46:35,560 --> 00:46:37,060
I want to store
basically an array.

818
00:46:40,780 --> 00:46:43,600
Just store it in
order by i, so I'm

819
00:46:43,600 --> 00:46:46,360
skipping the even suffixes,
just storing the answers

820
00:46:46,360 --> 00:46:49,240
for the odd suffixes.

821
00:46:49,240 --> 00:46:52,750
So if I was given a number
i, how would I look it up?

822
00:46:52,750 --> 00:46:59,180
Well, given an index i
into the suffix array,

823
00:46:59,180 --> 00:47:00,925
what I need to know is--

824
00:47:00,925 --> 00:47:05,290
this is basically the inverse
of what we did with SA k plus 1.

825
00:47:05,290 --> 00:47:07,560
SA k plus 1 is extracting
the even entries,

826
00:47:07,560 --> 00:47:09,640
here we're extracting
the odd entries.

827
00:47:09,640 --> 00:47:13,660
So all I need to know
is the odd rank of i,

828
00:47:13,660 --> 00:47:15,940
and then I look
up into this array

829
00:47:15,940 --> 00:47:18,110
at position odd rank of i.

830
00:47:18,110 --> 00:47:20,260
That will give me
the answer I want.

831
00:47:20,260 --> 00:47:23,350
Well, first I check:
is it an even suffix,

832
00:47:23,350 --> 00:47:25,000
which I have stored
as a bit vector.

833
00:47:25,000 --> 00:47:28,990
If it's an even suffix, I
do nothing, I just return i.

834
00:47:28,990 --> 00:47:32,710
But if it's an odd suffix,
then I compute the odd rank.

835
00:47:32,710 --> 00:47:34,290
How do I compute the odd rank?

836
00:47:34,290 --> 00:47:37,720
I take the even rank
and take i minus that.

837
00:47:37,720 --> 00:47:42,362
Odd rank, we don't need
to store anything for it.

838
00:47:42,362 --> 00:47:43,820
I mean, you could
if you wanted to,

839
00:47:43,820 --> 00:47:47,530
but odd rank is just
i minus even rank.

840
00:47:50,050 --> 00:47:55,590
Because every index
is either odd or even.

841
00:47:55,590 --> 00:47:57,040
OK, great.

842
00:47:57,040 --> 00:48:00,900
So I can look up odd rank
and then look at this array.

843
00:48:00,900 --> 00:48:02,580
That'll give me
the answer I need.
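
A sketch of this lookup, with the answers for odd suffixes kept in a plain array purely for illustration (the lecture goes on to store them differently). It assumes 0-based indices, an inclusive even rank, and a suffix array that includes the empty suffix; with those conventions the slot for an odd suffix i is i minus even rank of i. The names are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array over positions 0..len(text), including the empty suffix."""
    return sorted(range(len(text) + 1), key=lambda p: text[p:])


text = list("abracadabra$")                    # even length
sa = suffix_array(text)
inverse = [0] * len(sa)
for rank, pos in enumerate(sa):
    inverse[pos] = rank

is_even_suffix = [1 if p % 2 == 0 else 0 for p in sa]

# Answers for odd suffixes only, stored in order by i.
odd_answers = [inverse[sa[i] + 1] for i in range(len(sa)) if not is_even_suffix[i]]


def even_rank(i):
    """Number of even values in SA[0..i], inclusive."""
    return sum(is_even_suffix[: i + 1])


def even_succ(i):
    """Identity for even suffixes; otherwise look up the odd-only array."""
    if is_even_suffix[i]:
        return i
    odd_index = i - even_rank(i)               # number of odd suffixes before i
    return odd_answers[odd_index]
```

Nothing extra is stored for odd rank: it falls out of the index and the even rank, because every entry is either odd or even.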

844
00:48:02,580 --> 00:48:04,788
But I'm not going to actually
store this as an array.

845
00:48:04,788 --> 00:48:07,350
I lied.

846
00:48:07,350 --> 00:48:09,370
But in any case, let's
worry about how I'm

847
00:48:09,370 --> 00:48:10,880
going to store it in a moment.

848
00:48:10,880 --> 00:48:15,670
Let's think about i-- if
I'm storing these answers--

849
00:48:15,670 --> 00:48:21,091
the even successor answers,
these j values, in order by i.

850
00:48:21,091 --> 00:48:25,210
I claim that order is a very
special order, because what

851
00:48:25,210 --> 00:48:27,100
does it mean to order by i?

852
00:48:27,100 --> 00:48:31,240
Ordering by i, that means the
suffixes are sorted, right?

853
00:48:31,240 --> 00:48:43,240
So this is the same thing as
ordering by an odd suffix in Tk

854
00:48:43,240 --> 00:48:47,590
from SA of i onwards.

855
00:48:47,590 --> 00:48:49,360
That's the suffix that we're--

856
00:48:49,360 --> 00:48:52,462
sorting by that suffix,
is sorting by i.

857
00:48:55,060 --> 00:48:56,980
Now we can unpack
an odd suffix--

858
00:48:56,980 --> 00:49:00,190
it has the first character--
and then an even suffix.

859
00:49:00,190 --> 00:49:03,850
So this is the same
thing as ordering by--

860
00:49:03,850 --> 00:49:05,350
this should look
familiar because we

861
00:49:05,350 --> 00:49:06,460
did the same kinds
of tricks when

862
00:49:06,460 --> 00:49:07,710
we were building suffix trees.

863
00:49:21,010 --> 00:49:23,110
This is even.

864
00:49:26,390 --> 00:49:28,328
In fact, it's the
even successor.

865
00:49:31,674 --> 00:49:41,560
There's a typo here,
[? see ?] If we follow SA k,

866
00:49:41,560 --> 00:49:42,700
and then we add 1.

867
00:49:42,700 --> 00:49:45,820
If we follow SA
k backwards, that

868
00:49:45,820 --> 00:49:48,430
was the definition
of even successor.

869
00:49:48,430 --> 00:49:51,200
So I can rewrite this thing.

870
00:49:51,200 --> 00:50:06,500
This part is the same thing as
Tk SA k even successor k of i,

871
00:50:06,500 --> 00:50:09,706
closed bracket,
colon, closed bracket.

872
00:50:09,706 --> 00:50:11,980
Get that right?

873
00:50:11,980 --> 00:50:14,650
Yes.

874
00:50:14,650 --> 00:50:16,630
That was the definition
of even successors.

875
00:50:16,630 --> 00:50:20,950
Even successor is the value j,
for which if I do SA k of j,

876
00:50:20,950 --> 00:50:22,896
I get SA k of i plus 1.

877
00:50:22,896 --> 00:50:25,600
That's the definition.

878
00:50:25,600 --> 00:50:30,185
OK, now Tk of SA of k.

879
00:50:32,980 --> 00:50:35,400
Sorry, the suffix--
that's not Tk of.

880
00:50:35,400 --> 00:50:36,580
There's a colon here.

881
00:50:36,580 --> 00:50:40,690
The suffix of Tk
starting at SA k.

882
00:50:40,690 --> 00:50:42,903
If I sort by those suffixes--

883
00:50:46,770 --> 00:50:47,780
they're sorted, right?

884
00:50:47,780 --> 00:50:50,330
I mean, that was the
point of the suffix array,

885
00:50:50,330 --> 00:50:51,860
is to sort the suffixes.

886
00:50:51,860 --> 00:50:57,350
So if I say I'm ordering by the
suffixes given in order by SA

887
00:50:57,350 --> 00:50:59,660
k, they're already sorted.

888
00:50:59,660 --> 00:51:02,720
There's no reason to do
this Tk of SA k part.

889
00:51:02,720 --> 00:51:07,250
This is going to be the
same thing as the order

890
00:51:07,250 --> 00:51:16,513
by this first letter, Tk SA
k of i comma, even successor.

891
00:51:20,257 --> 00:51:22,340
The suffix array is defined
to have this property,

892
00:51:22,340 --> 00:51:23,881
that these orders
are the same thing.

893
00:51:26,470 --> 00:51:28,510
And sorting by the
suffixes is the same thing

894
00:51:28,510 --> 00:51:33,590
as sorting by the indices
into the suffix array.
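[Editor's sketch] A small Python check of this ordering claim at level 0 only (Tk is the text itself), assuming a naively built suffix array and an odd-length text; `suffix_array` and `odd_suffix_pairs` are hypothetical helper names. It lists the pair (first letter, even successor) for each odd suffix in order by i and confirms the list is already sorted.
```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    return sorted(range(len(text)), key=lambda p: text[p:])
def odd_suffix_pairs(text):
    sa = suffix_array(text)
    inverse = {p: i for i, p in enumerate(sa)}
    # For each odd suffix, in order by i: (first letter, even successor).
    return [(text[p], inverse[p + 1]) for p in sa if p % 2 == 1]
pairs = odd_suffix_pairs("panamabanana$")
print(pairs == sorted(pairs))
```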

895
00:51:33,590 --> 00:51:36,702
Interesting, because this is
what I want to store, right?

896
00:51:36,702 --> 00:51:38,660
Those are the answers
that I'm trying to store.

897
00:51:38,660 --> 00:51:41,990
I'm trying to store even
successor for every i

898
00:51:41,990 --> 00:51:43,910
that has an odd--

899
00:51:43,910 --> 00:51:45,710
that starts in an odd suffix.

900
00:51:48,500 --> 00:51:51,740
So really, all I need to
do is order by this thing.

901
00:51:51,740 --> 00:51:54,560
And then once I've
ordered by this thing,

902
00:51:54,560 --> 00:52:00,450
I'll store these guys
in order by their value.

903
00:52:00,450 --> 00:52:00,980
Cool.

904
00:52:00,980 --> 00:52:04,160
So these are the pairs
I'm going to store.

905
00:52:04,160 --> 00:52:05,780
I'm not going to--

906
00:52:05,780 --> 00:52:11,110
I'm going to store this comma
this, for all i, in order

907
00:52:11,110 --> 00:52:12,242
by this value.

908
00:52:12,242 --> 00:52:13,430
That is my goal.

909
00:52:13,430 --> 00:52:16,010
If I can store these
in order by this value,

910
00:52:16,010 --> 00:52:18,560
then by computing
odd rank, I know

911
00:52:18,560 --> 00:52:20,762
where in this list
of pairs to go.

912
00:52:20,762 --> 00:52:22,220
And I just look at
the second value

913
00:52:22,220 --> 00:52:26,720
of the pair, that is my answer.

914
00:52:26,720 --> 00:52:28,290
Why am I storing this?

915
00:52:28,290 --> 00:52:28,790
We'll see.

916
00:52:31,500 --> 00:52:33,500
I don't know if you really
need to, but you can.

917
00:52:36,080 --> 00:52:38,810
OK.

918
00:52:38,810 --> 00:52:41,276
So what we're going to--

919
00:52:41,276 --> 00:52:43,040
I feel like it's cheating.

920
00:52:43,040 --> 00:52:44,780
I say, actually
store these pairs.

921
00:52:44,780 --> 00:52:46,700
We're not really going
to actually store them.

922
00:52:46,700 --> 00:52:49,070
We still have another
trick up our sleeve.

923
00:52:49,070 --> 00:52:51,920
But more or less, we're
going to store these pairs--

924
00:52:51,920 --> 00:52:54,860
I'll cross out, actually.

925
00:52:54,860 --> 00:53:02,906
Store these pairs
in order by value.

926
00:53:02,906 --> 00:53:04,280
Storing them in
order by value is

927
00:53:04,280 --> 00:53:06,820
the same thing as order by i.

928
00:53:06,820 --> 00:53:09,150
That's what we just proved.

929
00:53:09,150 --> 00:53:10,850
And at this point,
is when I'm going

930
00:53:10,850 --> 00:53:12,350
to assume a binary alphabet.

931
00:53:16,971 --> 00:53:17,470
OK.

932
00:53:22,980 --> 00:53:28,330
Maybe, I'll go through here.

933
00:53:31,000 --> 00:53:31,960
Need lots of stuff.

934
00:53:35,180 --> 00:53:37,440
Think we don't need this
giant recursion up here.

935
00:53:41,494 --> 00:53:42,910
Just remember,
it's enough to know

936
00:53:42,910 --> 00:53:45,550
how to compute even
successor, the rest is easy.

937
00:54:16,620 --> 00:54:17,300
So here we go.

938
00:54:19,434 --> 00:54:20,850
We're trying to
store these pairs,

939
00:54:20,850 --> 00:54:30,740
so we're trying to store
a sorted array of nk

940
00:54:30,740 --> 00:54:31,480
over 2 values.

941
00:54:34,550 --> 00:54:37,520
That's how many odd
suffixes there are.

942
00:54:37,520 --> 00:54:46,460
And they're each 2 to the k
plus log nk bits, I claim.

943
00:54:46,460 --> 00:54:47,450
Why?

944
00:54:47,450 --> 00:54:51,510
Because this was a
single character in Tk.

945
00:54:51,510 --> 00:54:54,329
But a single character in Tk
was actually 2 to the k bits,

946
00:54:54,329 --> 00:54:56,120
in the original string
for binary alphabet,

947
00:54:56,120 --> 00:54:59,120
and general sigma to the k.

948
00:54:59,120 --> 00:55:01,470
So that's that part of
this 2 to the k bits.

949
00:55:01,470 --> 00:55:03,710
The even successor, well,
that's just an index

950
00:55:03,710 --> 00:55:05,570
into something of size nk.

951
00:55:05,570 --> 00:55:08,060
So it's log nk bits.

952
00:55:08,060 --> 00:55:08,880
OK, fine.

953
00:55:08,880 --> 00:55:11,390
If I store that explicitly,
I would be in trouble,

954
00:55:11,390 --> 00:55:16,470
because 2 to the
k times nk is n.

955
00:55:16,470 --> 00:55:19,460
And so I would be storing
n bits at every level--

956
00:55:19,460 --> 00:55:23,300
well, so I guess I'd
get n log log n space.

957
00:55:23,300 --> 00:55:25,110
That part's actually OK.

958
00:55:25,110 --> 00:55:27,200
I can afford that
much if I'm just

959
00:55:27,200 --> 00:55:29,900
going for an n log log n bound.

960
00:55:29,900 --> 00:55:32,840
This part, not so much.

961
00:55:32,840 --> 00:55:35,270
Because in particular,
when k equals 0,

962
00:55:35,270 --> 00:55:38,030
that's going to
be n times log n.

963
00:55:38,030 --> 00:55:40,431
I don't want to
spend n log n space.

964
00:55:40,431 --> 00:55:41,930
And the whole point,
is we're trying

965
00:55:41,930 --> 00:55:43,346
to avoid storing
these explicitly.

966
00:55:43,346 --> 00:55:45,920
Because if I did, I'd
get n log n space.

967
00:55:45,920 --> 00:55:48,010
So we're not going to
store them explicitly.

968
00:55:52,010 --> 00:56:01,397
As follows. We are
going to store-- so there

969
00:56:01,397 --> 00:56:02,480
are these big bit vectors.

970
00:56:02,480 --> 00:56:07,550
We're going to look at
the leading log nk bits.

971
00:56:07,550 --> 00:56:10,580
This is kind of weird,
because the log nk bits

972
00:56:10,580 --> 00:56:12,050
we care about are at the end.

973
00:56:12,050 --> 00:56:14,150
But we're going to look
at the leading log nk bits

974
00:56:14,150 --> 00:56:20,190
especially, because this is
a sorted list of bit vectors.

975
00:56:20,190 --> 00:56:23,452
So if you look at the leading
bits, most of the time,

976
00:56:23,452 --> 00:56:24,660
they're going to be the same.

977
00:56:24,660 --> 00:56:26,300
They don't change very much.

978
00:56:26,300 --> 00:56:28,850
Leading bits are going to
be all 0's for a while,

979
00:56:28,850 --> 00:56:30,530
and then occasionally
they'll increment.

980
00:56:30,530 --> 00:56:31,904
How many times
will it increment?

981
00:56:31,904 --> 00:56:36,440
nk times, at most, if we look
at the leading log nk bits.

982
00:56:48,274 --> 00:56:49,690
Here's the crazy
idea, we're going

983
00:56:49,690 --> 00:56:53,080
to use unary encoding,
unary differential encoding.

984
00:56:59,440 --> 00:57:00,970
Differential encoding
means, instead

985
00:57:00,970 --> 00:57:03,910
of storing a list of values,
you store the first value.

986
00:57:03,910 --> 00:57:07,540
Then the next value,
minus the first value,

987
00:57:07,540 --> 00:57:10,284
and then the next value
minus that value, and so on.

988
00:57:10,284 --> 00:57:11,950
And unary means we're
going to represent

989
00:57:11,950 --> 00:57:14,650
those differences in unary.

990
00:57:14,650 --> 00:57:18,370
Seems like a bad idea, but it
turns out it's a good idea.

991
00:57:18,370 --> 00:57:20,230
So here's what it looks
like, you look at--

992
00:57:20,230 --> 00:57:22,020
I'm going to write down 0.

993
00:57:22,020 --> 00:57:27,270
I'm going to write down a bunch
of 0's, however big v1 is.

994
00:57:27,270 --> 00:57:28,630
Then I'm going to write a 1.

995
00:57:28,630 --> 00:57:30,670
Then I'm going to write
a bunch of 0's, however

996
00:57:30,670 --> 00:57:35,510
big v2 minus v1 is.

997
00:57:35,510 --> 00:57:39,130
Then I'll write a 1, and so on.

998
00:57:39,130 --> 00:57:43,200
0 to the lead, the
leading bits of v--

999
00:57:43,200 --> 00:57:44,340
sorry.

1000
00:57:44,340 --> 00:57:49,190
It's the leading bits of v2
minus the leading bits of v1.

1001
00:57:49,190 --> 00:57:51,960
That's what I meant.

1002
00:57:51,960 --> 00:57:56,240
And then leading bits of v3
minus the leading bits of v2.

1003
00:57:56,240 --> 00:57:59,860
And then 1, and so on.

1004
00:57:59,860 --> 00:58:02,290
OK, that is unary
differential encoding.

1005
00:58:02,290 --> 00:58:05,380
I claim this is small,
looks kind of crazy.

1006
00:58:05,380 --> 00:58:09,210
But it's small, because how
many 0's are there total?

1007
00:58:09,210 --> 00:58:11,050
Well, at most, nk 0's.

1008
00:58:11,050 --> 00:58:15,340
Because I start at the value 0.

1009
00:58:15,340 --> 00:58:18,760
With log nk bits, at most
I get up to nk minus 1.

1010
00:58:18,760 --> 00:58:22,505
So the number of times I
increment is, at most, nk.

1011
00:58:25,660 --> 00:58:26,770
How many 1's are there?

1012
00:58:30,040 --> 00:58:34,030
Well, there's one 1, per value.

1013
00:58:34,030 --> 00:58:35,620
So there's nk over 2 1's.

1014
00:58:40,840 --> 00:58:46,030
So total size of this
bit vector is 3/2 nk.

1015
00:58:48,580 --> 00:58:54,630
So storing those leading bits
in this weird way is cheap.

1016
00:58:54,630 --> 00:58:57,470
Linear-- again, this
geometric series

1017
00:58:57,470 --> 00:58:59,240
is going to add up to 3/2.

1018
00:58:59,240 --> 00:59:04,640
All right, it's going
to add up to 3 times n.

1019
00:59:04,640 --> 00:59:06,924
Cool.

1020
00:59:06,924 --> 00:59:08,340
But that's just
the leading bits--

1021
00:59:08,340 --> 00:59:09,646
I need to store this thing.

1022
00:59:09,646 --> 00:59:11,020
I need to store
the leading bits,

1023
00:59:11,020 --> 00:59:12,644
and I need to store
the remaining bits.

1024
00:59:12,644 --> 00:59:15,660
Now the remaining bits, there's
only 2 to the k remaining bits.

1025
00:59:15,660 --> 00:59:17,040
We switched the order.

1026
00:59:17,040 --> 00:59:18,624
We looked at the
high log nk bits,

1027
00:59:18,624 --> 00:59:20,040
but then the low
end bits, there's

1028
00:59:20,040 --> 00:59:21,390
going to be 2 to the k of them.

1029
00:59:21,390 --> 00:59:23,700
That I already said was OK.

1030
00:59:23,700 --> 00:59:25,890
We could afford that--

1031
00:59:25,890 --> 00:59:30,020
kind of, we'd lose
a log log factor.

1032
00:59:30,020 --> 00:59:34,430
So we store the trailing
2 to the k bits.

1033
00:59:34,430 --> 00:59:36,940
This we actually
store explicitly.

1034
00:59:41,280 --> 00:59:44,350
So this is going to
be 2 to the k times

1035
00:59:44,350 --> 00:59:50,520
nk over 2, which is n/2 bits.

1036
00:59:50,520 --> 00:59:52,840
nk is n over 2 to the k.

1037
00:59:52,840 --> 00:59:56,650
Cancel, n over 2.

1038
00:59:56,650 --> 01:00:00,880
OK, so total number of
bits-- we add these up--

1039
01:00:00,880 --> 01:00:11,130
is going to be 1/2
n plus 3/2 nk plus--

1040
01:00:11,130 --> 01:00:13,410
we'll get to this later.

1041
01:00:13,410 --> 01:00:19,710
And then the total, this we
have to do for log log n levels.

1042
01:00:19,710 --> 01:00:25,170
We're summing k
equals 0 to log log n.

1043
01:00:25,170 --> 01:00:26,940
This thing.

1044
01:00:26,940 --> 01:00:33,604
And this comes out
to 1/2 n log log n.

1045
01:00:33,604 --> 01:00:35,270
This is bad, we want
to get rid of that.

1046
01:00:35,270 --> 01:00:42,660
But that was our first
aim, then we have 5n--

1047
01:00:42,660 --> 01:00:43,800
did I miss a term?

1048
01:00:47,580 --> 01:00:48,080
OK.

1049
01:00:53,555 --> 01:00:57,060
Where did I miss the nk?

1050
01:00:57,060 --> 01:01:01,450
This was the cost
for even successor.

1051
01:01:01,450 --> 01:01:05,010
OK, but there was also, is
even suffix, which was nk bits,

1052
01:01:05,010 --> 01:01:08,010
and there was even rank,
which was little o of that.

1053
01:01:08,010 --> 01:01:13,035
So there's an extra nk
here for is even suffix.

1054
01:01:17,760 --> 01:01:20,490
OK, so we have nk plus 3/2 nk.

1055
01:01:20,490 --> 01:01:22,080
That's 5/2 nk.

1056
01:01:22,080 --> 01:01:23,850
And then the 1/2
disappears because it's

1057
01:01:23,850 --> 01:01:24,840
a geometric series.

1058
01:01:24,840 --> 01:01:28,710
So we end up with 5n,
for what it's worth.
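[Editor's note] The tally written out as one sum, using the per-level costs quoted in the lecture and assuming levels k = 0 through l with l = log log n; the o(n) rank and select terms are left out, and the extra n/2 from counting l + 1 levels is lower order.
```latex
\[
\sum_{k=0}^{\ell}\Bigl(
  \underbrace{\tfrac{1}{2}n}_{\text{trailing bits}}
  + \underbrace{\tfrac{3}{2}n_k}_{\text{unary vector}}
  + \underbrace{n_k}_{\text{is-even-suffix}}
\Bigr)
= \tfrac{1}{2}n(\ell+1) + \tfrac{5}{2}\sum_{k=0}^{\ell}\frac{n}{2^{k}}
\le \tfrac{1}{2}n(\ell+1) + 5n,
\qquad n_k = \frac{n}{2^{k}},\quad \ell = \log\log n .
\]
```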

1059
01:01:28,710 --> 01:01:30,770
Plus big O of something.

1060
01:01:30,770 --> 01:01:32,730
OK, I left out something,
because there's

1061
01:01:32,730 --> 01:01:35,640
one data structure we
haven't yet described.

1062
01:01:35,640 --> 01:01:37,290
There's one more thing we need.

1063
01:01:37,290 --> 01:01:40,230
And that comes up if you want
to do a query in the structure.

1064
01:01:40,230 --> 01:01:41,280
How do I do a query?

1065
01:01:44,490 --> 01:01:46,620
I already did odd
rank, so I'm just

1066
01:01:46,620 --> 01:01:50,430
trying to look up into the
sorted array, at a given index.

1067
01:01:50,430 --> 01:01:55,740
Well, first thing is to
compute the leading bits.

1068
01:01:55,740 --> 01:01:58,650
Actually, computing
leading bits is really easy

1069
01:01:58,650 --> 01:02:00,720
if I have rank and select.

1070
01:02:00,720 --> 01:02:05,590
What I want, if I'm trying
to index into index i,

1071
01:02:05,590 --> 01:02:07,860
I want the i-th one bit.

1072
01:02:07,860 --> 01:02:09,390
To look at the
i-th one bit, which

1073
01:02:09,390 --> 01:02:18,690
is select sub 1 of i, which
we already know how to do,

1074
01:02:18,690 --> 01:02:24,040
then that corresponds
to the i-th value.

1075
01:02:24,040 --> 01:02:25,980
And in particular,
if I look at how many

1076
01:02:25,980 --> 01:02:30,570
0's are there up
to that point, it's

1077
01:02:30,570 --> 01:02:31,950
going to be the sum of this.

1078
01:02:31,950 --> 01:02:35,220
Plus this, plus this,
it's a telescoping sum.

1079
01:02:35,220 --> 01:02:39,380
It's just going to give
me the leading bits.

1080
01:02:39,380 --> 01:02:42,350
Because this plus this
is just lead of v2.

1081
01:02:42,350 --> 01:02:44,930
This plus that is lead of v3.

1082
01:02:44,930 --> 01:02:45,944
So they all cancel.

1083
01:02:45,944 --> 01:02:47,360
I just count the
number of 0 bits.

1084
01:02:47,360 --> 01:02:50,540
That's exactly the
value I want to know.

1085
01:02:50,540 --> 01:02:56,960
So I want to do rank
sub 0 of that position.

1086
01:02:56,960 --> 01:03:00,780
That will tell me
the leading bits.

1087
01:03:00,780 --> 01:03:08,730
In a query, it's not
really lead of i, I guess.

1088
01:03:08,730 --> 01:03:13,040
Lead of vi is what
we're trying to compute.

1089
01:03:13,040 --> 01:03:14,540
Now, we also need
the trailing bits.

1090
01:03:14,540 --> 01:03:16,090
The trailing bits,
they're just in an array,

1091
01:03:16,090 --> 01:03:17,220
so you just look that up.

1092
01:03:17,220 --> 01:03:18,303
You get the trailing bits.

1093
01:03:18,303 --> 01:03:20,180
You concatenate those
two words, the leading

1094
01:03:20,180 --> 01:03:22,730
bits and the trailing bits--
boom, you have your answer.

1095
01:03:22,730 --> 01:03:25,820
That gives you the
even successor.
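[Editor's sketch] The encoding and the query side by side in Python, under stated assumptions: each value is split into high bits (unary differential) and `low_bits` trailing bits stored explicitly, and select and rank are naive scans standing in for the constant-time structures. The names `encode`, `select1`, `rank0`, and `lookup` are illustrative.
```python
def encode(sorted_values, low_bits):
    bits, trailing, previous_lead = [], [], 0
    for value in sorted_values:
        lead = value >> low_bits
        bits.extend([0] * (lead - previous_lead))   # difference of leading bits, in unary
        bits.append(1)                              # one 1 per value
        trailing.append(value & ((1 << low_bits) - 1))
        previous_lead = lead
    return bits, trailing
def select1(bits, i):
    # Position of the i-th one bit, counting from i = 0 (naive scan).
    ones_seen = 0
    for position, bit in enumerate(bits):
        if bit:
            if ones_seen == i:
                return position
            ones_seen += 1
    raise IndexError(i)
def rank0(bits, position):
    # Number of zero bits strictly before position (naive scan).
    return position - sum(bits[:position])
def lookup(bits, trailing, low_bits, i):
    lead = rank0(bits, select1(bits, i))            # telescoping sum of the differences
    return (lead << low_bits) | trailing[i]         # concatenate leading and trailing bits
values = [3, 5, 5, 12, 13, 30]
bits, trailing = encode(values, low_bits=2)
print([lookup(bits, trailing, 2, i) for i in range(len(values))] == values)
```
The zeros in `bits` never exceed the largest leading value, and there is exactly one 1 per stored value, which is where the 3/2 nk count comes from.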

1096
01:03:25,820 --> 01:03:28,130
So the only thing
is we need to store

1097
01:03:28,130 --> 01:03:30,020
rank and a select structure.

1098
01:03:30,020 --> 01:03:36,320
And for rank, we used nk
over log log nk space.

1099
01:03:36,320 --> 01:03:39,130
Again, that can be improved
to nk over polylog nk.

1100
01:03:39,130 --> 01:03:40,880
But let's not worry about that.

1101
01:03:52,290 --> 01:03:54,700
Item 1 completes.

1102
01:03:54,700 --> 01:03:59,410
We now have a T log
log T bit suffix array.

1103
01:03:59,410 --> 01:04:01,950
Next, we need to
make it order T,

1104
01:04:01,950 --> 01:04:05,261
then we need to make
it into a suffix tree.

1105
01:04:05,261 --> 01:04:06,760
We're going to move
a little faster.

1106
01:04:11,570 --> 01:04:12,465
Where to go now?

1107
01:04:22,080 --> 01:04:24,270
Now I want a compact
suffix array.

1108
01:04:31,670 --> 01:04:33,770
I'm going to use
the same definition.

1109
01:04:33,770 --> 01:04:36,849
Everything's going to be
more or less the same.

1110
01:04:36,849 --> 01:04:38,765
I just can't afford to
store all these levels.

1111
01:04:44,034 --> 01:04:45,200
There were log log n levels.

1112
01:04:45,200 --> 01:04:47,100
Log log n levels
is too expensive.

1113
01:04:47,100 --> 01:04:49,240
Each one costs linear space.

1114
01:04:49,240 --> 01:04:52,096
So I'm only going to store
a constant number of levels.

1115
01:04:54,850 --> 01:05:00,350
Only store 1 over
epsilon plus 1 levels.

1116
01:05:03,350 --> 01:05:06,410
And not just any levels,
but the first level,

1117
01:05:06,410 --> 01:05:10,390
the epsilon l-th level,
the 2 epsilon l-th level,

1118
01:05:10,390 --> 01:05:11,660
up to the l-th level.

1119
01:05:11,660 --> 01:05:14,486
So it's still log log n levels.

1120
01:05:14,486 --> 01:05:17,120
I'm just going to
skip a lot of them.

1121
01:05:17,120 --> 01:05:19,130
Now, it's going to be different.

1122
01:05:19,130 --> 01:05:21,570
I can't use even
successor anymore.

1123
01:05:21,570 --> 01:05:27,080
Instead, even is
going to be replaced

1124
01:05:27,080 --> 01:05:32,740
with the notion of divisible
by 2 to the epsilon l,

1125
01:05:32,740 --> 01:05:33,950
instead of divisible by 2.

1126
01:05:37,460 --> 01:05:40,310
So I do all this, but
replace the notion of even

1127
01:05:40,310 --> 01:05:44,990
with divisible by 2 to the epsilon l.

1128
01:05:44,990 --> 01:05:57,860
Because this is when you are
in SA sub k plus 1 epsilon l.

1129
01:05:57,860 --> 01:06:00,390
The whole name of
the game is, you're

1130
01:06:00,390 --> 01:06:03,620
trying to do a query
in SA k epsilon l,

1131
01:06:03,620 --> 01:06:07,250
and now you want to reduce
it to SA k plus 1 epsilon l.

1132
01:06:07,250 --> 01:06:10,790
And these are the suffixes that
are explicitly represented.

1133
01:06:10,790 --> 01:06:13,610
Everything else needs to be
rounded to that value, then

1134
01:06:13,610 --> 01:06:17,215
rounded back, like we had
with our giant formula before.

1135
01:06:17,215 --> 01:06:19,340
It's not so easy to write
a single formula anymore,

1136
01:06:19,340 --> 01:06:21,156
it's now really an algorithm.

1137
01:06:24,440 --> 01:06:30,010
So to compute SA
k epsilon l of i,

1138
01:06:30,010 --> 01:06:34,510
what you do is follow
a new thing, which

1139
01:06:34,510 --> 01:06:38,600
I'm going to call
just successor of i,

1140
01:06:38,600 --> 01:06:45,050
repeatedly to get a new index j.

1141
01:06:47,620 --> 01:06:51,810
Or I guess call it i prime,
make it a little clearer--

1142
01:06:51,810 --> 01:06:52,940
until it's even.

1143
01:07:00,080 --> 01:07:02,090
So before, we just
had to make one step,

1144
01:07:02,090 --> 01:07:03,110
and then we were even.

1145
01:07:03,110 --> 01:07:06,680
Now, we're going to have to make
potentially epsilon l steps.

1146
01:07:06,680 --> 01:07:08,620
So this could cost log log n.

1147
01:07:08,620 --> 01:07:10,860
Log log n, that's not much.

1148
01:07:10,860 --> 01:07:14,700
Actually-- sorry, not log log n.

1149
01:07:14,700 --> 01:07:16,850
This is going to
cost 2 to the epsilon

1150
01:07:16,850 --> 01:07:20,020
l, because it's divisible
by 2 to the epsilon l.

1151
01:07:20,020 --> 01:07:23,660
2 to the epsilon l is
log to the epsilon.

1152
01:07:23,660 --> 01:07:33,650
So this now may take log
to the epsilon T steps.

1153
01:07:33,650 --> 01:07:37,460
This is where we're going to get
the log to the epsilon penalty,

1154
01:07:37,460 --> 01:07:38,800
in time.

1155
01:07:38,800 --> 01:07:42,410
OK, but it's simple linear
search, nothing clever here.

1156
01:07:42,410 --> 01:07:44,040
Now, what is successor?

1157
01:07:44,040 --> 01:07:45,950
Well, successor is
just the same thing.

1158
01:07:48,710 --> 01:07:51,740
If you're even in this strong
sense, then nothing happens.

1159
01:07:51,740 --> 01:07:53,640
Otherwise, you just--
same definition.

1160
01:07:53,640 --> 01:07:56,360
This part is exactly the same.

1161
01:07:56,360 --> 01:07:59,180
Just go to the next
position, the next suffix.

1162
01:07:59,180 --> 01:08:01,340
But now we have to
follow it several times,

1163
01:08:01,340 --> 01:08:04,220
until we get to an even one.

1164
01:08:04,220 --> 01:08:05,630
OK.

1165
01:08:05,630 --> 01:08:13,220
Then we recurse, just
like before on SA k plus

1166
01:08:13,220 --> 01:08:15,260
1, epsilon l.

1167
01:08:15,260 --> 01:08:20,456
The next level down of the--

1168
01:08:20,456 --> 01:08:22,430
I think we can still
call it even rank.

1169
01:08:35,520 --> 01:08:46,370
And then we multiply
by 2 to the epsilon l.

1170
01:08:49,319 --> 01:08:57,020
And then subtract the number
of steps we did, in 1.

1171
01:09:03,319 --> 01:09:05,160
We made several
steps here, we need

1172
01:09:05,160 --> 01:09:07,180
to undo those steps at the end.

1173
01:09:07,180 --> 01:09:07,680
That's it.

1174
01:09:07,680 --> 01:09:09,846
So it's just the same as
before, except before there

1175
01:09:09,846 --> 01:09:12,010
was one step here, and
at most, one step here.

1176
01:09:12,010 --> 01:09:14,439
Now you just count them,
subtract at the end.

1177
01:09:14,439 --> 01:09:16,620
So exactly the
same template, just

1178
01:09:16,620 --> 01:09:18,689
skipping a lot of the levels.
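[Editor's sketch] One level of this template in Python, with several simplifying assumptions that are not from the lecture: the next level is stored explicitly instead of recursing, the successor wraps around cyclically so the last suffix also has one, rank is a naive scan, and `step` plays the role of 2 to the epsilon l. The names `build_level` and `lookup` are hypothetical.
```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    return sorted(range(len(text)), key=lambda p: text[p:])
def build_level(text, step):
    sa = suffix_array(text)
    n = len(sa)
    inverse = [0] * n
    for i, p in enumerate(sa):
        inverse[p] = i
    # Assumption: cyclic successor, so position n-1 steps to position 0.
    successor = [inverse[(p + 1) % n] for p in sa]
    is_sampled = [p % step == 0 for p in sa]        # "even" in the strong sense
    # Stand-in for the next stored level: sampled positions divided by step.
    next_level = [p // step for p in sa if p % step == 0]
    return successor, is_sampled, next_level
def lookup(i, successor, is_sampled, next_level, step):
    n = len(successor)
    steps = 0
    while not is_sampled[i]:                        # follow successor until "even"
        i = successor[i]
        steps += 1
    rank = sum(is_sampled[:i])                      # naive rank among sampled entries
    return (next_level[rank] * step - steps) % n    # scale back up, undo the steps
text = "abracadabra$"
successor, is_sampled, next_level = build_level(text, step=4)
recovered = [lookup(i, successor, is_sampled, next_level, 4) for i in range(len(text))]
print(recovered == suffix_array(text))
```
The while loop runs fewer than `step` times, which matches the log to the epsilon penalty in query time.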

1179
01:09:18,689 --> 01:09:25,529
And now the space is going to be
1 over epsilon, plus 1 times n.

1180
01:09:25,529 --> 01:09:27,880
That's it.

1181
01:09:27,880 --> 01:09:29,560
OK, so let me
analyze a little bit.

1182
01:09:35,340 --> 01:09:37,920
So you have to check
that all of this works.

1183
01:09:37,920 --> 01:09:39,689
The is-even-suffix structure, that's easy.

1184
01:09:39,689 --> 01:09:40,680
It's still nk bits.

1185
01:09:40,680 --> 01:09:42,479
Even rank, still nk bits.

1186
01:09:42,479 --> 01:09:45,240
Even successor, we did
all this fancy encoding.

1187
01:09:45,240 --> 01:09:47,472
The one thing you
can't do, is this part.

1188
01:09:47,472 --> 01:09:49,680
I mean, there aren't very
many even suffixes anymore.

1189
01:09:49,680 --> 01:09:54,970
So it really doesn't help you,
it buys you a very tiny factor.

1190
01:09:54,970 --> 01:10:00,840
But 1 over 2 to the epsilon
l are going to be even.

1191
01:10:00,840 --> 01:10:01,891
So that's very few.

1192
01:10:01,891 --> 01:10:04,140
So you still have to store
all the answers, basically.

1193
01:10:04,140 --> 01:10:06,730
But you can do all this
ordering trick, it still works.

1194
01:10:06,730 --> 01:10:10,650
We weren't really exploiting
the fact that it was odd.

1195
01:10:10,650 --> 01:10:13,290
And now you have to-- this
is not a single character,

1196
01:10:13,290 --> 01:10:16,800
it's a bunch of characters.

1197
01:10:16,800 --> 01:10:19,860
But still-- and so now
instead of 2 to the k,

1198
01:10:19,860 --> 01:10:24,480
it's probably 2 to
the k epsilon l.

1199
01:10:24,480 --> 01:10:25,950
But it all works out.

1200
01:10:25,950 --> 01:10:29,160
It's just a renaming
of everything.

1201
01:10:29,160 --> 01:10:32,665
It's still going to be linear
number of bits, I claim.

1202
01:10:32,665 --> 01:10:34,790
I don't want to go through
a formal proof for that,

1203
01:10:34,790 --> 01:10:35,581
we don't have time.

1204
01:10:38,290 --> 01:10:39,550
But all the same tricks work.

1205
01:10:45,730 --> 01:10:53,500
So the claim is
the space is going to be the sum,

1206
01:10:53,500 --> 01:10:58,850
k equals 0 to 1 over epsilon.

1207
01:10:58,850 --> 01:11:05,480
nk epsilon l, plus n,
plus 2 nk epsilon l,

1208
01:11:05,480 --> 01:11:12,702
plus the select bound,
n over log log n.

1209
01:11:17,270 --> 01:11:17,900
Why?

1210
01:11:17,900 --> 01:11:20,930
Because this is storing
the is even structure.

1211
01:11:20,930 --> 01:11:23,180
That was just nk bits.

1212
01:11:23,180 --> 01:11:27,784
And then, this is the successor.

1213
01:11:27,784 --> 01:11:29,049
This is, is even.

1214
01:11:33,270 --> 01:11:36,080
Same as we had over here,
except there's no 1/2 anymore.

1215
01:11:36,080 --> 01:11:38,840
It's just n plus--

1216
01:11:38,840 --> 01:11:43,430
claim is 2 nk epsilon l.

1217
01:11:43,430 --> 01:11:45,590
That's the right answer.

1218
01:11:45,590 --> 01:11:47,694
Yeah, that 3 was because
of this, plus this.

1219
01:11:47,694 --> 01:11:50,110
So we still have the 3, just
don't divide it by 2 anymore.

1220
01:11:55,950 --> 01:12:04,060
So this equals some
constant times n, 6n

1221
01:12:04,060 --> 01:12:05,840
plus 1 over epsilon n.

1222
01:12:09,410 --> 01:12:14,416
Plus order n over
log log n bits.

1223
01:12:18,520 --> 01:12:19,840
OK, not bad.

1224
01:12:19,840 --> 01:12:22,400
Not quite as good as this
bound for binary alphabet,

1225
01:12:22,400 --> 01:12:25,210
so ignore the log sigma.

1226
01:12:25,210 --> 01:12:26,980
Before we had 1 plus
1 over epsilon, now

1227
01:12:26,980 --> 01:12:28,540
we have 6 plus 1 over epsilon.

1228
01:12:32,494 --> 01:12:33,660
Kind of running out of time.

1229
01:12:33,660 --> 01:12:40,260
I'll just tell you, you can
tune this to 1 over epsilon n,

1230
01:12:40,260 --> 01:12:44,050
plus the little o, with
two very simple tricks.

1231
01:12:44,050 --> 01:12:45,310
Two simple observations.

1232
01:12:45,310 --> 01:12:51,540
The first one is, the
successor structure.

1233
01:12:51,540 --> 01:12:55,760
At level 0, there's
nothing to do.

1234
01:12:55,760 --> 01:12:56,260
Why?

1235
01:12:56,260 --> 01:13:02,710
Because level 0--
a single step just

1236
01:13:02,710 --> 01:13:04,900
corresponds to
walking in the string.

1237
01:13:04,900 --> 01:13:08,437
I've got to think about
this a little bit.

1238
01:13:08,437 --> 01:13:16,420
Successor-- Actually not quite
clear to me why that's true,

1239
01:13:16,420 --> 01:13:17,820
but it turns out to be true.

1240
01:13:17,820 --> 01:13:20,660
It's an exercise, I guess.

1241
01:13:20,660 --> 01:13:23,580
At level 0, you don't need
the successor structure.

1242
01:13:23,580 --> 01:13:27,210
So that actually saves you
a big factor, because if you

1243
01:13:27,210 --> 01:13:28,680
can skip the very--

1244
01:13:28,680 --> 01:13:32,340
k equals 0, then you get to
skip-- you get to divide by 2

1245
01:13:32,340 --> 01:13:33,870
to the epsilon l, the space.

1246
01:13:33,870 --> 01:13:38,850
So that gets rid of this term.

1247
01:13:38,850 --> 01:13:43,630
Then there's this other
term, which you can skip,

1248
01:13:43,630 --> 01:13:45,660
or you can store is
even more efficiently.

1249
01:13:45,660 --> 01:13:48,219
So before, is-even
should be a bit vector.

1250
01:13:48,219 --> 01:13:50,010
Because half of them
are even, half of them

1251
01:13:50,010 --> 01:13:52,200
are odd, that's the
optimal thing to do.

1252
01:13:52,200 --> 01:13:55,920
But in this structure,
most of them are not even.

1253
01:13:55,920 --> 01:14:00,240
So you can save a little bit
using succinct dictionaries.

1254
01:14:00,240 --> 01:14:01,800
Because there are
very few ones--

1255
01:14:01,800 --> 01:14:05,160
you can achieve log, the
total number of things,

1256
01:14:05,160 --> 01:14:08,240
choose the number of ones.

1257
01:14:08,240 --> 01:14:10,980
Log of that binomial
coefficient is the number

1258
01:14:10,980 --> 01:14:12,710
of 0's plus 1's.

1259
01:14:12,710 --> 01:14:15,170
Not going to work it out,
it's worked out in the notes.

1260
01:14:15,170 --> 01:14:17,550
But if you store that more
efficient dictionary, which

1261
01:14:17,550 --> 01:14:20,010
we claimed could
be done last time,

1262
01:14:20,010 --> 01:14:23,760
then this turns out to get a
nice sort of cascading thing.

1263
01:14:23,760 --> 01:14:27,210
And it's little of
of n, in the end.

1264
01:14:27,210 --> 01:14:28,920
So that gets rid of this term.

1265
01:14:28,920 --> 01:14:32,580
And so you're left with
just n times 1 over epsilon.

1266
01:14:32,580 --> 01:14:34,680
Plus 1, because you have
to store the text also.

1267
01:14:37,200 --> 01:14:43,410
Or maybe because of
this plus 1, anyway.

1268
01:14:43,410 --> 01:14:45,960
Boom.

1269
01:14:45,960 --> 01:14:48,720
That's all I want to say
about this structure.

1270
01:14:48,720 --> 01:14:51,310
So I wanted to focus on
the ideas, which got us

1271
01:14:51,310 --> 01:14:54,940
the T log log T. Just
apply the same ideas,

1272
01:14:54,940 --> 01:14:56,119
but much more sparsely.

1273
01:14:56,119 --> 01:14:57,910
You lose in running
time, instead of paying

1274
01:14:57,910 --> 01:15:00,240
log log T. Now we pay--

1275
01:15:00,240 --> 01:15:02,520
we pay log to the
epsilon times log log T,

1276
01:15:02,520 --> 01:15:04,350
but that's just log
to some other epsilon.

1277
01:15:06,870 --> 01:15:10,320
So that gives us better space.

1278
01:15:10,320 --> 01:15:13,700
Now linear space, instead
of n log log n space.

1279
01:15:13,700 --> 01:15:16,290
Any questions about that?

1280
01:15:16,290 --> 01:15:16,790
All right.

1281
01:15:19,310 --> 01:15:24,740
Now, I get to hurry through
transforming suffix arrays,

1282
01:15:24,740 --> 01:15:25,730
into suffix trees.

1283
01:15:35,611 --> 01:15:37,110
This is actually a
much older paper.

1284
01:15:37,110 --> 01:15:45,710
It's by Munro,
Raman, and Rao.

1285
01:15:45,710 --> 01:15:49,370
There's two versions of
it in the same paper.

1286
01:15:49,370 --> 01:15:51,680
First version is going to
be compact, second version

1287
01:15:51,680 --> 01:15:52,450
is succinct.

1288
01:15:52,450 --> 01:15:54,950
Probably won't have much time
to cover the succinct version,

1289
01:15:54,950 --> 01:15:57,920
but here's what we do.

1290
01:15:57,920 --> 01:16:00,950
Start with compact.

1291
01:16:00,950 --> 01:16:04,460
Store compressed--
we're going to assume

1292
01:16:04,460 --> 01:16:12,230
binary alphabet again, as
this paper does, I believe.

1293
01:16:12,230 --> 01:16:17,090
Store the suffix tree, but
only store the trie part of it.

1294
01:16:17,090 --> 01:16:19,510
Suffix tree really
consists of trie--

1295
01:16:19,510 --> 01:16:22,260
binary trie, if it's
a binary alphabet.

1296
01:16:22,260 --> 01:16:25,220
Plus, lengths on the edges.

1297
01:16:25,220 --> 01:16:26,960
Don't store the lengths.

1298
01:16:26,960 --> 01:16:30,980
Or, as Ian likes to
call it, skip the skips.

1299
01:16:30,980 --> 01:16:33,050
The length of an
edge is how many bits

1300
01:16:33,050 --> 01:16:36,530
you're supposed to
skip, so skip those.

1301
01:16:36,530 --> 01:16:39,270
Just store the trie structure.

1302
01:16:39,270 --> 01:16:43,250
So the trie structure
is on 2n plus 1 nodes,

1303
01:16:43,250 --> 01:16:45,705
because there are n
leaves, and n minus 1.

1304
01:16:45,705 --> 01:16:47,991
Telling me it's plus
1, I don't know.

1305
01:16:47,991 --> 01:16:50,820
2n plus a constant nodes.

1306
01:16:50,820 --> 01:16:55,160
So this is 4n bits.

1307
01:16:55,160 --> 01:16:57,440
We know how to do
binary tries, finally

1308
01:16:57,440 --> 01:16:59,090
we're using last lecture.

1309
01:16:59,090 --> 01:17:01,100
We use rank and select
a lot, but now we're

1310
01:17:01,100 --> 01:17:02,360
using the binary trie.

1311
01:17:02,360 --> 01:17:05,670
We're going to store this using
the balanced paren structure.

1312
01:17:05,670 --> 01:17:08,500
OK, so you have to double that--
this linear number of bits,

1313
01:17:08,500 --> 01:17:11,540
so if we're just looking
for compact, that's fine.

1314
01:17:11,540 --> 01:17:13,630
Now the hard part
is in a search,

1315
01:17:13,630 --> 01:17:18,630
where we go from one
node, to the next node.

1316
01:17:18,630 --> 01:17:20,445
We need to know the
length of this edge,

1317
01:17:20,445 --> 01:17:23,142
we've got to figure that out.

1318
01:17:23,142 --> 01:17:25,100
We need to know whether
the pattern jumped off,

1319
01:17:25,100 --> 01:17:26,690
or something.

1320
01:17:26,690 --> 01:17:31,190
We need to know at position
y, which letter of the pattern

1321
01:17:31,190 --> 01:17:33,620
should we branch on.

1322
01:17:33,620 --> 01:17:36,320
So we need to
measure this length.

1323
01:17:36,320 --> 01:17:37,760
Not too hard.

1324
01:17:37,760 --> 01:17:40,280
What you do, you
look at this subtree.

1325
01:17:40,280 --> 01:17:44,330
You look at the leftmost
leaf and the rightmost leaf.

1326
01:17:44,330 --> 01:17:46,190
You look at their
longest common prefix,

1327
01:17:46,190 --> 01:17:48,304
starting from the
character you care about.

1328
01:17:48,304 --> 01:17:50,720
And you look at the longest
common prefix with the pattern

1329
01:17:50,720 --> 01:17:53,120
P. All sounds easy--

1330
01:17:53,120 --> 01:17:55,430
how do you actually do it?

1331
01:17:55,430 --> 01:17:58,700
So you need to be able to find
the leftmost leaf in a subtree.

1332
01:17:58,700 --> 01:18:02,420
Leaves in the balanced
paren expression--

1333
01:18:02,420 --> 01:18:05,270
I think last class, I mistakenly
thought they were that.

1334
01:18:05,270 --> 01:18:07,190
In fact, they are this.

1335
01:18:07,190 --> 01:18:09,080
Think about it long enough.

1336
01:18:09,080 --> 01:18:11,832
This was leaves in
the rooted ordered tree,

1337
01:18:11,832 --> 01:18:14,040
but what we care about are
leaves in the binary tree.

1338
01:18:14,040 --> 01:18:15,353
And they always look
like open paren,

1339
01:18:15,353 --> 01:18:16,730
closed paren, and closed paren.

1340
01:18:16,730 --> 01:18:19,820
So this is a leaf,
and so what we're

1341
01:18:19,820 --> 01:18:22,250
asking for is in a subtree,
we'll find the first leaf.

1342
01:18:22,250 --> 01:18:26,120
That's actually just going to
be right after this open paren.

1343
01:18:26,120 --> 01:18:32,870
Or, I guess, you do a
select, select sub this,

1344
01:18:32,870 --> 01:18:34,619
to jump to the next leaf.

1345
01:18:34,619 --> 01:18:36,660
Then also, you can jump
to the end of the subtree

1346
01:18:36,660 --> 01:18:39,890
and then go back to the previous
leaf, using rank and select.

1347
01:18:39,890 --> 01:18:42,050
So I won't go into details,
but that's easy to do.

1348
01:18:42,050 --> 01:18:44,000
So you can identify
the two leaves

1349
01:18:44,000 --> 01:18:47,720
using rank sub, this thing.

1350
01:18:47,720 --> 01:18:51,230
I can identify the leaf
number, so I can identify

1351
01:18:51,230 --> 01:18:53,090
where these leaves are.

1352
01:18:53,090 --> 01:18:54,950
Now, I have a suffix array.

1353
01:18:54,950 --> 01:18:58,520
If I look up the suffix array
of these two leaf numbers--

1354
01:18:58,520 --> 01:19:01,490
remember leaves are ordered
by suffix in sorted order

1355
01:19:01,490 --> 01:19:03,222
by suffix array.

1356
01:19:03,222 --> 01:19:05,180
These are really indices
into the suffix array.

1357
01:19:05,180 --> 01:19:07,580
They're giving me-- oh,
this is the i-th suffix,

1358
01:19:07,580 --> 01:19:08,870
this is the j-th suffix.

1359
01:19:08,870 --> 01:19:11,078
So I look at those two
positions of the suffix array,

1360
01:19:11,078 --> 01:19:15,560
I teleport over to the string T.
Now I have the actual suffixes

1361
01:19:15,560 --> 01:19:17,120
corresponding to this and this.

1362
01:19:17,120 --> 01:19:19,160
And I just look at
where they match.

1363
01:19:19,160 --> 01:19:22,940
I know that if I've already
gone down to depth d,

1364
01:19:22,940 --> 01:19:23,870
letter depth d.

1365
01:19:23,870 --> 01:19:26,120
I already know that they
match the first d characters.

1366
01:19:26,120 --> 01:19:27,110
I don't compare those.

1367
01:19:27,110 --> 01:19:28,460
They're guaranteed to match.

1368
01:19:28,460 --> 01:19:30,800
So I start at position d plus 1.

1369
01:19:30,800 --> 01:19:32,930
I know they should match,
but one more letter.

1370
01:19:32,930 --> 01:19:34,430
How many more letters
do they match?

1371
01:19:34,430 --> 01:19:36,900
That is the length
of this thing.
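
The edge-length measurement described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: it uses a plain, uncompressed suffix array built naively, ordinary letters instead of the binary alphabet, and the names `build_suffix_array` and `edge_length` are hypothetical rather than taken from the paper.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes, in sorted order."""
    return sorted(range(len(text)), key=lambda start: text[start:])


def edge_length(text, sa, left_leaf, right_leaf, depth):
    """Count how many more letters the leftmost and rightmost leaf suffixes
    of a subtree share, skipping the `depth` letters already known to match."""
    a = sa[left_leaf] + depth   # teleport into the text for the leftmost leaf
    b = sa[right_leaf] + depth  # and for the rightmost leaf
    length = 0
    while (a + length < len(text) and b + length < len(text)
           and text[a + length] == text[b + length]):
        length += 1
    return length


text = "banana$"
sa = build_suffix_array(text)
# Leaves 2 and 3 are the suffixes "ana$" and "anana$". After the first
# letter "a" (depth 1), they agree on "na" and then differ.
skip = edge_length(text, sa, 2, 3, 1)
```

Each call makes two suffix array lookups, so in the compressed setting every step of the search pays the suffix array access cost.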

1372
01:19:36,900 --> 01:19:37,415
OK.

1373
01:19:37,415 --> 01:19:38,790
How can I afford
to pay for that?

1374
01:19:38,790 --> 01:19:41,030
I'm just going to spend linear
cost, the total number

1375
01:19:41,030 --> 01:19:42,405
of characters I
compare, is going

1376
01:19:42,405 --> 01:19:44,820
to be equal to the
length of the pattern.

1377
01:19:44,820 --> 01:19:47,750
So we're going to end up
getting length of the pattern,

1378
01:19:47,750 --> 01:19:51,762
times the cost to do
a suffix array access.

1379
01:19:51,762 --> 01:19:53,720
Because I have to do this
at every single step,

1380
01:19:53,720 --> 01:19:55,460
in the worst case.

1381
01:19:55,460 --> 01:19:58,010
So not perfect, but pretty good.

1382
01:19:58,010 --> 01:20:01,864
Roughly P, suffix array access
is like log to the epsilon.

1383
01:20:01,864 --> 01:20:03,530
So we're getting a P
log to the epsilon.

1384
01:20:03,530 --> 01:20:08,284
Not quite as good as this
bound, because here the P

1385
01:20:08,284 --> 01:20:09,950
is not multiplied by
log to the epsilon.

1386
01:20:09,950 --> 01:20:12,000
But, it's just log
to the epsilon.

1387
01:20:12,000 --> 01:20:13,760
If you want to see a
better way to do it,

1388
01:20:13,760 --> 01:20:15,434
you can read the
Grossi-Vitter paper.

1389
01:20:15,434 --> 01:20:16,850
But this is a
decent way to do it.

1390
01:20:20,900 --> 01:20:25,997
Now briefly, this is
the compact version,

1391
01:20:25,997 --> 01:20:27,830
and let me tell you how
to make it succinct.

1392
01:20:33,504 --> 01:20:35,170
I'm not going to touch
the suffix array.

1393
01:20:35,170 --> 01:20:37,880
Suffix array, to make
that succinct is harder.

1394
01:20:37,880 --> 01:20:41,200
But if I just want to make the
suffix tree part succinct,

1395
01:20:41,200 --> 01:20:43,870
I can use this same
idea, but I can't

1396
01:20:43,870 --> 01:20:46,300
afford to store the whole trie.

1397
01:20:46,300 --> 01:20:48,820
So just going to use a
little bit of indirection.

1398
01:20:48,820 --> 01:20:50,320
You can use as
little as you want,

1399
01:20:50,320 --> 01:20:54,650
this is the log log log
log log log log n factor.

1400
01:21:00,150 --> 01:21:15,250
Use the suffix tree
above every b-th suffix.

1401
01:21:15,250 --> 01:21:19,960
So throw away all but a
1/b fraction of the leaves.

1402
01:21:19,960 --> 01:21:22,600
And then, take the
tree that remains.

1403
01:21:22,600 --> 01:21:25,210
So once you do a search, you
won't find exactly the leaf

1404
01:21:25,210 --> 01:21:27,580
you want, but you'll
be within an additive b

1405
01:21:27,580 --> 01:21:29,860
of the leaf you want. b here
can be arbitrarily small.

1406
01:21:29,860 --> 01:21:32,530
This can be log
log log log log n.

1407
01:21:32,530 --> 01:21:34,450
But something super constant.

1408
01:21:34,450 --> 01:21:37,180
Then if I use this structure,
instead of being n--

1409
01:21:37,180 --> 01:21:38,580
order n over b space--

1410
01:21:38,580 --> 01:21:40,090
instead of being
order n space, it's

1411
01:21:40,090 --> 01:21:43,000
going to be order n over b bits.

1412
01:21:43,000 --> 01:21:44,026
So, we win.

1413
01:21:44,026 --> 01:21:45,400
The only issue is
now, how do you

1414
01:21:45,400 --> 01:21:49,890
find the correct leaf, as
opposed to the incorrect leaf?

1415
01:21:53,161 --> 01:21:54,910
I don't really have
time to talk about it.

1416
01:21:54,910 --> 01:21:57,100
You can look at the notes.

1417
01:21:57,100 --> 01:21:58,600
Rough idea is,
well, you can have

1418
01:21:58,600 --> 01:22:01,810
a look-up table that lets you
do whatever you want on b bits.

1419
01:22:01,810 --> 01:22:05,140
As long as b is less
than, like, 1/2 log n.

1420
01:22:05,140 --> 01:22:09,730
Then you can encompass the
whole trie, more or less.

1421
01:22:09,730 --> 01:22:11,620
And just hit it with
a big look-up table

1422
01:22:11,620 --> 01:22:13,600
and do everything
in constant time.

1423
01:22:13,600 --> 01:22:20,320
It's not quite so
simple, because--

1424
01:22:20,320 --> 01:22:21,610
easy summary, here.

1425
01:22:32,900 --> 01:22:38,840
Essentially, what
you're doing is--

1426
01:22:38,840 --> 01:22:40,175
these are the blocks.

1427
01:22:40,175 --> 01:22:42,410
So this is length b.

1428
01:22:42,410 --> 01:22:45,560
You're finding this suffix,
and you want to know,

1429
01:22:45,560 --> 01:22:47,439
which of these is
the correct one.

1430
01:22:47,439 --> 01:22:49,730
In some sense, you have to
do the search simultaneously

1431
01:22:49,730 --> 01:22:51,620
for all b of these guys.

1432
01:22:51,620 --> 01:22:53,870
And so you run down
the search again,

1433
01:22:53,870 --> 01:22:55,647
but instead of searching
for one pattern,

1434
01:22:55,647 --> 01:22:57,980
you search for all b of these
patterns at the same time.

1435
01:22:57,980 --> 01:23:00,830
Now they're mostly the
same, and so you can

1436
01:23:00,830 --> 01:23:02,280
prove it doesn't hurt you much.

1437
01:23:02,280 --> 01:23:04,390
Maybe it hurts
you an additive b.

1438
01:23:04,390 --> 01:23:07,730
I believe the correct answer
is, in time, you end up

1439
01:23:07,730 --> 01:23:13,430
paying order p plus b time.

1440
01:23:13,430 --> 01:23:17,449
Sorry, times the cost of
a suffix array access.

1441
01:23:17,449 --> 01:23:19,490
OK, so we're still paying
the log to the epsilon,

1442
01:23:19,490 --> 01:23:21,120
because of the suffix array.

1443
01:23:21,120 --> 01:23:24,310
If that was constant,
it would be free.

1444
01:23:24,310 --> 01:23:29,082
P plus b time is fine, if
b is log log log log n.

1445
01:23:29,082 --> 01:23:30,290
Or you can make it log log n.

1446
01:23:30,290 --> 01:23:33,230
Then you save a log log
n factor in the bits.

1447
01:23:33,230 --> 01:23:34,760
You pay an additive log log n.

1448
01:23:34,760 --> 01:23:37,301
That's going to be absorbed by
the log to the epsilon anyway.

1449
01:23:37,301 --> 01:23:38,750
So it's pretty efficient.

1450
01:23:38,750 --> 01:23:40,625
I guess you can make
this log to the epsilon,

1451
01:23:40,625 --> 01:23:43,280
if you felt like it,
to balance out here.

1452
01:23:43,280 --> 01:23:45,740
Still would be P times
log to the epsilon.

1453
01:23:45,740 --> 01:23:48,080
And so this stuff is
really quite cheap,

1454
01:23:48,080 --> 01:23:50,360
see the notes for details.

1455
01:23:50,360 --> 01:23:54,150
That ends our succinct coverage.

1456
01:23:54,150 --> 01:23:57,700
Sorry, it was a little more
succinct than intended.

1457
01:23:57,700 --> 01:23:59,320
You get the idea.