The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIK DEMAINE: All right. Today is all about the predecessor problem, which is a problem we've certainly talked about implicitly with, say, binary search trees. You want to be able to insert and delete into a set, and compute the predecessor and successor of any given key. So maybe define that formally.

And this is not really our first, but it is an example of an integer data structure. And for whatever reason, I don't brand hashing as an integer data structure, just because it's its own beast. But in particular, today, I need to be a little more formal about the models of computation we're allowing-- or I want to be. In particular, because, in the predecessor problem-- which is insert, delete, predecessor, successor-- there are actually lower bounds that say you cannot do better than such and such.
With hashing, there aren't really any lower bounds, because you can do everything in constant time with high probability. I mean, there are maybe some lower bounds on deterministic hashing. That's harder. But if you allow randomization, there are no real lower bounds, whereas for predecessor, there are.

And in general, for the predecessor problem, the key thing I want to highlight is that we're maintaining here a set-- the set is called S-- of n elements, which live in some universe, U-- just like last time. When you insert, you can insert an arbitrary element of the universe. It probably shouldn't already be in S, or it will get thrown away. But the key thing is that predecessor and successor operate not just on the keys in S-- you can give it any key. It doesn't have to be in there. And it will find the previous key that is in S, or the next key that is in S. So the predecessor of x is the largest key in your set that is less than or equal to x. And the successor is the smallest that is larger-- of course, if there is one.
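As a point of reference, those two query operations can be sketched on a plain sorted list (a minimal sketch, not the fast structures this lecture builds; the function names are mine):

```python
import bisect

def predecessor(S, x):
    """Largest key in sorted list S that is <= x, or None if none exists."""
    i = bisect.bisect_right(S, x)
    return S[i - 1] if i > 0 else None

def successor(S, x):
    """Smallest key in sorted list S strictly larger than x, or None."""
    i = bisect.bisect_right(S, x)
    return S[i] if i < len(S) else None

S = [3, 7, 10, 42]
predecessor(S, 9)   # 7 -- note 9 itself need not be in S
successor(S, 9)     # 10
```

With a balanced binary search tree in place of the list, each of these is the familiar O(log n) in the comparison model, which is exactly the bound the integer models below will beat.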
So those are the kinds of operations we want to do. Now, we know how to do all of this in log n time, no problem, with binary search trees, in the comparison model. But I want to introduce two more, say, realistic models of computers, that ignore the memory hierarchy, but think about regular RAM machines-- random access machines-- and what they can really do. And it's a model we're going to be working with for the next, I think, five lectures. So it's important to set the stage right.

So these are models for integer data structures. In general, we have a unifying concept, which is a word of information, a word of data, a word of memory. It's used all over the place-- a word of input. A word here is in the machine-theoretic sense, not the linguistic sense. It's going to be a w-bit integer. And so this defines the universe, which is-- I'm going to assume they're all unsigned integers-- 0 up to 2 to the w minus 1. Those are all the unsigned integers you can represent with w bits. We'll also call this number, 2 to the w, little u. That is the size of the universe, which is capital U. So this matches notation from last time.
But I'm really highlighting how many bits we have, which is w. Now, here's where things get interesting. I'm going to get to a model called a word RAM, which is what you might expect, more or less. But before I get there, I want to define something called the transdichotomous RAM-- tough word to spell. It just means bridging a dichotomy-- bridging two worlds, if you will. RAM is a random access machine. I've certainly mentioned the word RAM before. But now we're going to get a little more precise about it.

So in general, in the RAM, memory is an array, and you can do random access into the array. But now, we're going to say the cells of the memory-- each slot in that array-- are words. Everything is going to be a word. Every input-- all these x's-- is going to be a word. Everything will be a word. And in particular, the things in your memory are words. Let's say you have s of them. That's your space bound. In general, in the transdichotomous RAM, you can do any operation that reads and writes a constant number of words in memory. And in particular, you can do random access to that memory.
But in particular, we use words to serve as pointers. Here's my memory of words. Each of them is w bits-- s of them, from, I guess, 0 to s minus 1. And if you have, like, the number 3 here, that can be used as a pointer to the third slot of memory. One, two, three. You can use numbers as indexes into memory. So that's what I mean by "words serve as pointers." So in particular, you can implement a pointer machine-- no surprise. But for this to work, we need a lower bound on w. This implies w has to be at least log of the space bound. Otherwise, you just can't index your whole memory. If you've got slots 0 through s minus 1, then 2 to the w minus 1 had better be at least s minus 1. So we get this lower bound. And in particular, presumably, s is at least your problem size, n. If you're trying to maintain n items, you've got to store them. So w is at least log n.

Now, this relation is essentially a statement bridging two worlds. Namely, you have, on the one hand, your model of computation, which has a particular word size.
And in reality, we think of that as being 32 or 64, or maybe 128-- some fancy operations on Intel machines let you do 128 bits or so. And then there's your problem size, which we think of as an input. Now, this is relating the two. It's a little weird. I guess you could say it's just a limitation that, for a given CPU, there are only certain problems you can solve. But theoretically, it makes a lot of sense to relate these two. Because if you're in a RAM, and you've got to be able to index your data, you need at least that many bits just to be able to talk about all those things. And so the claim is, basically, machines will grow to accommodate memory size. As memory size grows, you'll need more bits.

Now, in reality, there are only about 2 to the 256-- what do you call them-- particles in the known universe. So word size probably won't get that much bigger. Beyond 256 should be OK. But theoretically, this is a nice way to formalize this claim that word sizes don't need to get too big unless memories get gigantic. So it may seem weird at first, but it's very natural.
And all real-world machines have big enough words to accommodate that. Word size could be bigger, and that will give you, essentially, more parallelism. But it should be at least that big. All right. Enough proselytizing. That's the transdichotomous RAM. The end.

And the word RAM is a specific version of the transdichotomous RAM, where you restrict the operations to C-like operations. These are sort of the standard instructions on basically all computers, except a few RISC architectures don't have multiplication and division. But everything else is on everything. So these are the operators, unless I missed one, in C. They're all in Python, and-- pick your language-- most languages. You've got integer arithmetic, including mod. You've got bitwise AND, bitwise OR, bitwise XOR, bitwise negation, and shift left and shift right. These we all view as taking constant time. They take one or two words as inputs, they compute an answer, and they write out another word.
Of course, there's also random access-- array dereference, I guess. So that's the word RAM. You restrict to these operations. Whereas in the transdichotomous RAM, you can do weird things, as long as they only involve a constant number of words, the word RAM is the regular thing. So this is basically the standard model that all integer data structures use, pretty much. If they don't use this model, they have to say so. Otherwise, this model has become accepted as the normal one. It took several years before people realized that's a good model-- good enough to capture pretty much everything we want.

The cool thing about the word RAM is, it lets you do things on w bits in parallel. You can take the AND of w bits, pairwise, all at once. So you get some speedup. But it's a natural generalization of something like the comparison model. The comparison model-- I guess I didn't write those-- has more operations: less than, greater than, and so on. You can compare two numbers in constant time and get a Boolean output via, say, subtraction, and computing the sign.
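A toy sketch of that word RAM instruction set (my own illustration; the word size w = 8 and the explicit masking are assumptions, needed to emulate fixed-width unsigned words, since Python integers are unbounded):

```python
w = 8                  # toy word size; real machines use 32 or 64
MASK = (1 << w) - 1    # keep every result to w bits (unsigned wraparound)

a, b = 0b11001010, 0b10100110

add = (a + b) & MASK   # integer arithmetic, truncated to a word
band = a & b           # bitwise AND -- w bit-positions "in parallel"
bxor = a ^ b           # bitwise XOR
bnot = ~a & MASK       # bitwise negation
shl = (a << 3) & MASK  # shift left
shr = a >> 3           # shift right

def less_than(a, b):
    # unsigned compare via subtraction plus sign, as described above:
    # do the subtraction in a (w+1)-bit register and read the top (borrow) bit
    return ((a - b) & ((1 << (w + 1)) - 1)) >> w == 1

less_than(0b10100110, 0b11001010)   # True  (166 < 202)
```

Each line touches a constant number of words, which is why the model charges constant time for it.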
And you think of comparisons as taking constant time-- so why not all of these things? Cool.

One more model-- this is kind of a weird one. It's called the cell-probe model, which is: we just count the number of memory reads and writes that we need to do to perform a data structure operation or query. Like, you're looking at predecessor, and you just want to know, how much of the data structure do I have to read in order to be able to answer the predecessor problem? How much do I have to write out to do an insertion, or whatever? And so in this model, computation is free.

And this is kind of like the external memory model and the cache-oblivious models. There, we were measuring how many block reads and writes there are. Here, our blocks are actually our words. So there is a bit of a relation, except there's no real-- you can either think of there being no cache here, because you're just reading in a constant number of words, doing something, spitting stuff out.
Or, in the cell-probe model, you could imagine there being an infinite cache for this operation, but no cache from operation to operation. It's just, how much do I have to read, information-theoretically, to solve a particular predecessor problem? We'll deal with this a lot in a couple of lectures-- not quite yet. This model is just used for lower bounds. It's not a realistic model, because you have to pay for computation in the real world. But if you can prove that you need to read at least a certain number of words, then, of course, you have to do at least that many operations. So it's nice for lower bounds.

In general, we have this sort of hierarchy of models, where cell probe is the most powerful, strongest. Below cell probe, we have transdichotomous RAM, then word RAM, then-- just to fit it in context with what we've been doing-- below that is pointer machine, and below that would be binary search tree. I've mentioned before, pointer machines are more powerful than binary search trees. And of course, we can implement a pointer machine on a word RAM. So we have these relations.
There are, of course, other models. But this is a quick picture of the models we've seen so far.

So now, we have this notion of a word. In the predecessor problem, these elements are words. They're w-bit integers, from the universe we defined. And we want to be able to insert, delete, predecessor, and successor over words. So that's our challenge.

In the binary search tree model, we know the answer to this problem is Theta(log n). In general, with any comparison-based data structure, you need Theta(log n) in the worst case. It's an easy lower bound. But we're going to do better in these other models-- in the word RAM. So here are some results.

The first data structure is called Van Emde Boas. You might guess it is by van Emde Boas-- Peter. It actually has a couple of other authors in some versions of the papers, which makes it a little bit confusing. But for whatever reason, the data structure is just named Van Emde Boas. And it achieves log w per operation. I think I'll rewrite this: this is log log u per operation. But it requires u space.
So think of u space as being, like, for every item in the universe, I store: yes or no, is it in the set? So that's a lot of space, unless n and u are not too different. But we can do better. The cool thing, though, is the running time. This is really fast-- log log u. If you think about, for example, the universe being polynomial in n-- polynomial in n is the same as 2 to the c log n-- or you can even go crazy and raise log to a power, like 2 to the log to the fifth of n. For all those things, you take log twice, and log log u becomes Theta(log log n). So as long as your word size is not insanely large, you're getting log log n performance. So in general, when, let's say, w is polylog n, then we're getting this kind of performance. And I think on most computers, w is polylogarithmic. We said it has to be at least log. It's also, generally, not so much bigger than log. So log squared is probably fine most of the time, unless you have a really small problem. OK, so cool. But the space is giant.
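That "take log twice" claim is easy to check numerically (my own illustration; base-2 logs assumed throughout):

```python
import math

def loglog(u):
    return math.log2(math.log2(u))

n = 2 ** 32   # so log log n = log2(32) = 5

# if u is polynomial in n, say u = n^c = 2^(c log n), then
# log log u = log2(c) + log log n -- only an additive constant away:
for c in (1, 2, 8):
    assert abs(loglog(n ** c) - (math.log2(c) + loglog(n))) < 1e-9

loglog(n), loglog(n ** 8)   # (5.0, 8.0)
```

So even for universes as large as n to the eighth power, log log u is only a constant additive term above log log n.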
So how do we do better than that? Well, there are a couple of answers. One is that you can achieve log w with high probability, and order n space. With a slight tweak, basically, you combine Van Emde Boas plus hashing, and you get that. I don't actually know what the reference is for this result. It's been an exercise in various courses, and so on. I can talk more about that later.

Then, alternatively, there's another data structure which, in many ways, is simpler. It really embraces hashing. It's called y-fast trees. It achieves the same bounds-- so, log w with high probability and linear space. It's basically just a hash table with some cleverness. So we'll get there. Even though it's simpler, we're going to start with this structure. Historically, this is the way it happened-- Van Emde Boas, then y-fast trees, which are by Willard. And it'll be kind of a nice finale.

There's another data structure I want to talk about, which is designed for the case when w is very large-- much bigger than polylog n. In that case, there's something called fusion trees.
And you can achieve log base w of n-- and, I guess, with high probability and linear space. The original fusion trees are static, and you can do log base w of n deterministic queries. But there's a later version that's dynamic; it achieves this using hashing for updates-- insertions and deletions. Cool.

So this is an almost upside-down bound. It's obviously always an improvement over just log base 2 of n. But it's sometimes better and sometimes worse than log w. In fact, it kind of makes sense to take the min of them. When w is small, you want to use log w. When w is big, you want to use log base w of n. They're going to balance out when w is 2 to the root log n-- something like that. The easy way to see when these balance out is to set them equal. That will be when log w equals log n divided by log w-- let me do that over here. log w equals log n over log w. Then this is like saying log squared w equals log n, or log w equals root log n. So I was right: w is 2 to the root log n, which is a weird quantity.
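A quick numerical sanity check of that balance point (my own illustration; base-2 logs assumed):

```python
import math

n = 2 ** 64                      # log n = 64

def veb_cost(w):                 # Van Emde Boas: O(log w)
    return math.log2(w)

def fusion_cost(w):              # fusion trees: O(log base w of n)
    return math.log2(n) / math.log2(w)

# taking the min of the two structures, for various word sizes:
for w in (2 ** 4, 2 ** 8, 2 ** 16):
    print(w, min(veb_cost(w), fusion_cost(w)))
# w = 2^4  -> min is 4  (Van Emde Boas wins: small w)
# w = 2^8  -> min is 8  (balance: w = 2^sqrt(log n), both cost sqrt(log n))
# w = 2^16 -> min is 4  (fusion trees win: large w)
```

The min is worst exactly at w = 2 to the root log n, where both structures cost root log n; for any other w, one of the two does strictly better.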
But the easy thing to think about is this one-- log w is root log n. And in that case, the running time you get is root log n. So it's always, at most, this. And the worst case is when these things are balanced-- when these two are the same, and they both achieve root log n. But if w is smaller or larger than this threshold, these structures will be even better than root log n. But in particular, it's a nice way to think about it: we're doing sort of a square-factor improvement over binary search trees. And we can do this with high probability in linear space. So that's cool.

It turns out it's also pretty much optimal. And that's not at all obvious, and wasn't known for many years. So there's a cell-probe lower bound. So these are all in the word RAM model-- all these results. The first one actually kind of works in the pointer machine; I'll talk about that later. This lower bound is a little bit messy to state. The bound is slightly more complicated than what we've seen. But I'm going to restrict to a special situation, which is: if you have n polylog n space.
So this is a lower bound on static predecessor. All you need to do is solve predecessor and successor, or even just predecessor. There are no inserts and deletes. In that case, if you use lots of space, like u space, of course, you can do constant time for everything-- you just store all the answers. But if you want space that's not much bigger than n-- in particular, if you want to be able to do updates in polylog, this is the most space you could ever hope to achieve. So assuming that, which is pretty reasonable, there's a lower bound of the min of two things-- log base w of n, which is fusion trees, and, roughly, log w, which is Van Emde Boas. But it's slightly smaller than that.

Yeah-- pretty weird. Let me tell you the consequences; they're a little easier to think about. Van Emde Boas is going to be optimal for the kinds of cases we care about, which is when w is polylog n. And fusion trees are optimal when w is big. The balanced bound is square root of log n over log log n. OK-- a little messy. So there's this divided by log of log w over log n. If w is polylog n, then this is just order log log n.
And so this cancels; this becomes constant. So in these situations-- which are the ones I mentioned over here, w is polylog n, which is when we get log log n performance, and that's kind of the case we care about-- Van Emde Boas is the best thing to do. It turns out this is actually the right answer. You can do slightly better-- it's almost an exercise. You can tweak Van Emde Boas and get this slight improvement. But for most word sizes, it really doesn't matter; you're not saving much. Cool.

So other than that little factor, these are the right answers. You have to know about Van Emde Boas. You have to know about fusion trees. And so this lecture is about Van Emde Boas; next lecture is about fusion trees. This result is from 2006 and 2007, so it's pretty recent.

So let's start on Van Emde Boas. Yeah, let's dive into it. I'll talk about history a little later.
449 00:24:44,160 --> 00:24:45,800 The central idea, I guess, if you 450 00:24:45,800 --> 00:24:49,530 wanted to sum up Van Emde Boas in an equation, which 451 00:24:49,530 --> 00:24:52,980 is something we very rarely get to do in algorithms, 452 00:24:52,980 --> 00:24:55,115 is to think about this recurrence-- 453 00:24:55,115 --> 00:25:00,430 T of u is T of square root of u plus order 1. 454 00:25:00,430 --> 00:25:02,310 What does this solve to? 455 00:25:02,310 --> 00:25:05,470 log log u. 456 00:25:05,470 --> 00:25:10,080 All right, just think of taking logs. 457 00:25:10,080 --> 00:25:13,470 This is the same as T of w equals T of w 458 00:25:13,470 --> 00:25:15,930 over 2 plus order 1. 459 00:25:15,930 --> 00:25:17,600 w is the word size. 460 00:25:17,600 --> 00:25:19,470 And so this is log w. 461 00:25:19,470 --> 00:25:22,188 It's the same thing. 462 00:25:22,188 --> 00:25:25,410 If we could achieve this recurrence, then-- 463 00:25:25,410 --> 00:25:28,090 boom-- we get our bound of log w. 464 00:25:30,810 --> 00:25:32,800 So how do we do it. 465 00:25:32,800 --> 00:25:45,480 We split the universe into root u clusters, 466 00:25:45,480 --> 00:25:48,290 each of size root u. 467 00:25:51,980 --> 00:25:59,940 OK, so, if here is our universe, then I just 468 00:25:59,940 --> 00:26:03,090 split every square root of u items. 469 00:26:03,090 --> 00:26:06,870 So each of these is root u long. 470 00:26:06,870 --> 00:26:09,132 The number of them is square root of u. 471 00:26:09,132 --> 00:26:10,590 And then somehow, I want to recurse 472 00:26:10,590 --> 00:26:14,460 on each of these clusters. 473 00:26:14,460 --> 00:26:16,400 And I only get to recurse on one of them-- 474 00:26:16,400 --> 00:26:17,400 so a pretty simple idea. 475 00:26:34,550 --> 00:26:35,080 Yeah. 476 00:26:35,080 --> 00:26:36,621 So I'll talk about how to actually do 477 00:26:36,621 --> 00:26:37,810 that recursion in a moment. 
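A quick sanity check of that recurrence (a sketch in Python, not from the lecture): taking square roots of u repeatedly is the same as halving the bit-width w, so the recursion bottoms out after about log w = log log u levels.

```python
import math

def depth(u):
    # How many times we can take sqrt(u) before the universe shrinks to 2,
    # i.e. how deep the T(u) = T(sqrt(u)) + O(1) recursion goes.
    count = 0
    while u > 2:
        u = math.isqrt(u)
        count += 1
    return count

# u = 2^64: the bit-width halves 64 -> 32 -> 16 -> 8 -> 4 -> 2,
# so the depth is log2(64) = 6 = log log u.
print(depth(2**64))
```

For u = 2^64 this prints 6, matching log2(64) = log2(w) = log log u.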
478 00:26:37,810 --> 00:26:39,309 Before I get there, I want to define 479 00:26:39,309 --> 00:26:43,790 a sort of hierarchical coordinate system. 480 00:26:43,790 --> 00:26:46,730 This is a new way of phrasing it for me. 481 00:26:46,730 --> 00:26:48,790 So I hope you like it. 482 00:26:48,790 --> 00:26:52,510 If we have a word x, I want to write it as two 483 00:26:52,510 --> 00:26:55,440 coordinates-- c and i. 484 00:26:55,440 --> 00:26:57,470 I'm going to use angle brackets, so it 485 00:26:57,470 --> 00:27:00,700 doesn't get too confusing. c is which cluster you're in. 486 00:27:00,700 --> 00:27:03,970 So this is cluster 0, cluster 1, cluster 2, cluster 3. 487 00:27:03,970 --> 00:27:05,820 i is your index within the cluster. 488 00:27:05,820 --> 00:27:09,437 So this is 0, 1, 2, 3, 4, 5-- up to root u minus 1 489 00:27:09,437 --> 00:27:10,270 within this cluster. 490 00:27:10,270 --> 00:27:12,670 Then 0, 1, 2, 3, 4, 5 up to root u minus 1 491 00:27:12,670 --> 00:27:15,220 within this cluster-- so the i is 492 00:27:15,220 --> 00:27:19,720 your index within the cluster, like this, 493 00:27:19,720 --> 00:27:23,750 and c is which cluster you are in. 494 00:27:23,750 --> 00:27:24,250 OK. 495 00:27:24,250 --> 00:27:25,540 Pretty simple. 496 00:27:25,540 --> 00:27:29,260 And there's easy arithmetic to do this. 497 00:27:29,260 --> 00:27:33,430 c is x integer divide root u. 498 00:27:33,430 --> 00:27:38,200 And i is x integer mod root u. 499 00:27:38,200 --> 00:27:41,560 I used Python notation here. 500 00:27:41,560 --> 00:27:44,517 So fine, I think you all know this-- 501 00:27:44,517 --> 00:27:45,100 pretty simple. 502 00:27:45,100 --> 00:27:47,230 And if I gave you c and i, you could 503 00:27:47,230 --> 00:27:49,070 reconstruct x by just saying, oh, well, 504 00:27:49,070 --> 00:27:52,600 that's c times root u plus i. 505 00:27:52,600 --> 00:27:55,690 So in constant time, you can decompose a number 506 00:27:55,690 --> 00:27:56,950 into its two coordinates. 
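In the Python notation he mentions, the divide/mod decomposition and its inverse can be sketched like this (a toy example with u = 16, not the board's exact code):

```python
u = 16        # toy universe; root u = 4
root = 4

def coords(x):
    # c = which cluster, i = index within the cluster -- just divmod(x, root)
    return x // root, x % root

def rebuild(c, i):
    # inverse: x = c * root + i
    return c * root + i

c, i = coords(9)
print(c, i, rebuild(c, i))   # 9 lives in cluster 2, at index 1
```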
507 00:27:56,950 --> 00:27:59,581 That's the point. 508 00:27:59,581 --> 00:28:01,080 In fact, it's much easier than this. 509 00:28:01,080 --> 00:28:02,950 You don't even have to do division 510 00:28:02,950 --> 00:28:04,910 if you think of everything in binary, 511 00:28:04,910 --> 00:28:06,580 which computers tend to do. 512 00:28:06,580 --> 00:28:16,560 So the binary perspective is that x is a word. 513 00:28:16,560 --> 00:28:18,410 So it's a bunch of bits. 514 00:28:18,410 --> 00:28:25,000 0, 1, 1, 0, 1, 0, 0, 1-- whatever. 515 00:28:25,000 --> 00:28:29,440 Divide that bit sequence in half, and then this part 516 00:28:29,440 --> 00:28:32,920 is c, this part is i. 517 00:28:32,920 --> 00:28:35,740 And if you assume that w is a power of 2, 518 00:28:35,740 --> 00:28:37,012 these two are identical. 519 00:28:37,012 --> 00:28:38,470 If they're not a power of 2, you've 520 00:28:38,470 --> 00:28:40,600 got to round a little bit here. 521 00:28:40,600 --> 00:28:42,470 It doesn't matter. 522 00:28:42,470 --> 00:28:46,190 But you can use this definition instead of this one either way. 523 00:28:46,190 --> 00:28:48,430 So in this case, c is-- 524 00:28:48,430 --> 00:28:54,100 ooh, boy-- x shifted right, w over 2, basically. 525 00:28:54,100 --> 00:28:58,100 So this w over 2-- 526 00:28:58,100 --> 00:29:00,220 w over 2. 527 00:29:00,220 --> 00:29:04,220 The whole thing is w bits. 528 00:29:04,220 --> 00:29:07,150 So if I shift right, I get rid of the low order bits, 529 00:29:07,150 --> 00:29:08,240 if I want. 530 00:29:08,240 --> 00:29:10,120 i is slightly more annoying. 531 00:29:10,120 --> 00:29:18,070 But I can do it as an AND with 1 532 00:29:18,070 --> 00:29:24,690 shifted left w over 2, minus 1. 533 00:29:24,690 --> 00:29:26,290 That's probably how you do it in C. 534 00:29:26,290 --> 00:29:27,190 I don't know if you're used to this. 
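The binary version being described here can be sketched as follows (my Python, assuming w is a power of 2; the bit pattern is the one from the board):

```python
w = 8                      # word size in bits; assume a power of 2

def high(x):
    # c: shift right by w/2 to drop the low-order bits
    return x >> (w // 2)

def low(x):
    # i: AND with (1 << w/2) - 1, a mask of w/2 one bits
    return x & ((1 << (w // 2)) - 1)

def rebuild(c, i):
    # inverse: shift c back up and OR in i
    return (c << (w // 2)) | i

x = 0b01101001             # the example word from the board
print(bin(high(x)), bin(low(x)), rebuild(high(x), low(x)) == x)
```

For x = 01101001, the top half c is 0110 and the bottom half i is 1001, and reassembling them gives x back.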
535 00:29:27,190 --> 00:29:29,470 But if I take a 1 bit, I shift it over to here, 536 00:29:29,470 --> 00:29:30,310 and I subtract 1. 537 00:29:30,310 --> 00:29:31,900 Then I get a whole bunch of 1 bits. 538 00:29:31,900 --> 00:29:34,660 And then you mask with that bit pattern. 539 00:29:34,660 --> 00:29:36,850 So I'm masking with 1, 1, 1, 1. 540 00:29:36,850 --> 00:29:38,890 Then I'll just get the low order bits. 541 00:29:38,890 --> 00:29:40,900 Computers do this super fast-- way 542 00:29:40,900 --> 00:29:42,610 faster than integer division. 543 00:29:42,610 --> 00:29:44,830 Because this is just like routing bits around. 544 00:29:44,830 --> 00:29:47,770 So this is easy to do on a typical CPU. 545 00:29:47,770 --> 00:29:49,720 And this will be much faster than this code, 546 00:29:49,720 --> 00:29:53,440 even though it looks like more operations, typically. 547 00:29:53,440 --> 00:29:54,010 All right. 548 00:29:54,010 --> 00:29:54,940 So fine. 549 00:29:54,940 --> 00:29:58,090 The point is, I can decompose x into c and i. 550 00:29:58,090 --> 00:30:01,410 Of course, I can also do the reverse. 551 00:30:01,410 --> 00:30:06,355 This would be c shifted left w over 2, OR'd with i. 552 00:30:10,160 --> 00:30:11,920 It's a slight diversion. 553 00:30:11,920 --> 00:30:15,400 Now, I can tell you the actual recursion, 554 00:30:15,400 --> 00:30:19,240 and then talk about how to maintain it. 555 00:30:19,240 --> 00:30:24,580 So we're going to define a recursive Van Emde Boas 556 00:30:24,580 --> 00:30:32,280 structure of size u and word size w. 557 00:30:37,660 --> 00:30:39,670 And what it's going to look like is, 558 00:30:39,670 --> 00:30:48,830 we have a bunch of clusters, each of size square root of u. 559 00:30:54,820 --> 00:30:56,747 So this represents the first root u items. 560 00:30:56,747 --> 00:30:58,330 This represents the next root u items. 561 00:30:58,330 --> 00:31:01,100 This represents the last root u items, and so on. 
562 00:31:01,100 --> 00:31:03,710 So that's the obvious recursion from this. 563 00:31:03,710 --> 00:31:05,540 So this is going to be a Van Emde Boas 564 00:31:05,540 --> 00:31:07,850 structure of size root u. 565 00:31:07,850 --> 00:31:09,980 And then we also have a structure 566 00:31:09,980 --> 00:31:14,930 up top, which is called the summary structure. 567 00:31:14,930 --> 00:31:19,250 And the idea is, it represents, for each of these clusters, 568 00:31:19,250 --> 00:31:21,620 is the cluster empty or not? 569 00:31:21,620 --> 00:31:24,620 Does this cluster have any items in it? 570 00:31:24,620 --> 00:31:25,770 Yes or no. 571 00:31:25,770 --> 00:31:28,940 If yes, then the name of this cluster 572 00:31:28,940 --> 00:31:31,340 is in the summary structure. 573 00:31:31,340 --> 00:31:33,950 So notice, by this hierarchical decomposition, 574 00:31:33,950 --> 00:31:37,340 the cluster number and the index are 575 00:31:37,340 --> 00:31:40,020 valid names of items within these substructures. 576 00:31:40,020 --> 00:31:43,977 And basically we're going to use the i part to talk about things 577 00:31:43,977 --> 00:31:44,810 within the clusters. 578 00:31:44,810 --> 00:31:46,640 And we're going to use the c part to talk about things 579 00:31:46,640 --> 00:31:47,848 within the summary structure. 580 00:31:47,848 --> 00:31:50,280 They're both numbers between 0 and root u minus 1. 581 00:31:50,280 --> 00:31:54,170 And so we get this perspective. 582 00:31:54,170 --> 00:31:54,830 All right. 583 00:31:54,830 --> 00:32:01,730 So formally, or some notation, cluster i-- 584 00:32:01,730 --> 00:32:05,300 so we're going to have an array of clusters. 585 00:32:05,300 --> 00:32:10,470 It is Van Emde Boas thing of size square root u, 586 00:32:10,470 --> 00:32:15,620 and word size w over 2. 587 00:32:15,620 --> 00:32:19,100 This is slightly weird, because the machine, of course, 588 00:32:19,100 --> 00:32:20,630 its word size remains w. 
589 00:32:20,630 --> 00:32:22,890 It doesn't get smaller as you recurse. 590 00:32:22,890 --> 00:32:24,890 We're not going to try to spread the parallelism 591 00:32:24,890 --> 00:32:26,950 around or whatever. 592 00:32:26,950 --> 00:32:28,700 But this is just a notational convenience. 593 00:32:28,700 --> 00:32:31,340 I want to say the word size conceptually 594 00:32:31,340 --> 00:32:34,040 goes down to w over 2, so that this definition still 595 00:32:34,040 --> 00:32:35,270 makes sense. 596 00:32:35,270 --> 00:32:38,480 Because as I look at a smaller part of the word, 597 00:32:38,480 --> 00:32:41,690 in order to divide it in half, I have to shift right 598 00:32:41,690 --> 00:32:42,950 by a smaller amount. 599 00:32:42,950 --> 00:32:47,330 So that's the w that I'm passing into the structure. 600 00:32:47,330 --> 00:32:52,710 OK, and then v dot summary is the same thing. 601 00:32:52,710 --> 00:32:58,110 It's also a Van Emde Boas thing of size root u. 602 00:32:58,110 --> 00:33:01,550 Then the one other clever idea, which makes all of this work, 603 00:33:01,550 --> 00:33:05,040 is that we store the minimum element in v dot min. 604 00:33:10,490 --> 00:33:13,144 And we do not store it recursively. 605 00:33:20,070 --> 00:33:27,080 So there's also one item here, size 1, which is the min. 606 00:33:27,080 --> 00:33:28,530 It's just stored off to the side. 607 00:33:28,530 --> 00:33:30,590 It doesn't live in these structures. 608 00:33:30,590 --> 00:33:33,204 Every other item lives down here. 609 00:33:33,204 --> 00:33:35,120 And furthermore, if one of these is not empty, 610 00:33:35,120 --> 00:33:38,750 there's also a corresponding item up here. 611 00:33:38,750 --> 00:33:43,880 This turns out to be crucial to make a Van Emde Boas work. 612 00:33:43,880 --> 00:33:46,850 And then v dot max, we also need-- 613 00:33:46,850 --> 00:33:48,382 but it can be stored recursively. 
614 00:33:48,382 --> 00:33:50,090 So just think of it as a copy of whatever 615 00:33:50,090 --> 00:33:52,632 the maximum element is. 616 00:33:52,632 --> 00:33:54,590 OK, so in constant time, we can compute the min 617 00:33:54,590 --> 00:33:55,520 and compute the max. 618 00:33:55,520 --> 00:33:56,150 That's good. 619 00:33:56,150 --> 00:33:59,840 But then I claim also in log w time-- log log u time-- 620 00:33:59,840 --> 00:34:02,492 we can do insert, delete, predecessor, successor. 621 00:34:08,889 --> 00:34:09,770 So let's do that. 622 00:34:22,380 --> 00:34:24,040 This data structure-- the solution 623 00:34:24,040 --> 00:34:26,364 is both simple and a little bit subtle. 624 00:34:26,364 --> 00:34:28,030 And so this will be one of the few times 625 00:34:28,030 --> 00:34:30,250 I'm going to write explicit pseudocode-- say 626 00:34:30,250 --> 00:34:33,400 exactly how to maintain this data structure. 627 00:34:33,400 --> 00:34:35,320 It's short code, which is good. 628 00:34:35,320 --> 00:34:38,739 Each algorithm is only a few lines. 629 00:34:38,739 --> 00:34:40,301 But every line matters. 630 00:34:40,301 --> 00:34:42,550 So I want to write them down so I can talk about them. 631 00:34:46,040 --> 00:34:49,030 And with this new hierarchical notation, 632 00:34:49,030 --> 00:34:52,460 I think it's even easier to write these down. 633 00:34:52,460 --> 00:34:54,690 Let's see how I do. 634 00:36:08,510 --> 00:36:10,934 OK, so we'll start with the successor code. 635 00:36:10,934 --> 00:36:12,350 Predecessor is, of course, symmetric. 636 00:36:28,000 --> 00:36:31,624 And it basically has two cases. 637 00:36:31,624 --> 00:36:33,540 There's a special case in the beginning, which 638 00:36:33,540 --> 00:36:36,270 is, if the thing you're querying happens to be less 639 00:36:36,270 --> 00:36:38,670 than the minimum of the whole thing, then of course, 640 00:36:38,670 --> 00:36:40,572 the minimum is the successor. 
641 00:36:40,572 --> 00:36:42,780 This has to be done specially, because the min is not 642 00:36:42,780 --> 00:36:43,987 stored recursively. 643 00:36:43,987 --> 00:36:45,570 And so you've got to check for the min 644 00:36:45,570 --> 00:36:48,251 every single level of the recursion. 645 00:36:48,251 --> 00:36:49,500 But that's just constant time. 646 00:36:49,500 --> 00:36:50,334 No big deal. 647 00:36:50,334 --> 00:36:51,750 Then the interesting thing is, we 648 00:36:51,750 --> 00:36:54,420 have recursions on both sides-- 649 00:36:54,420 --> 00:36:58,150 in both cases-- but only one. 650 00:36:58,150 --> 00:37:00,380 The key is, we want this recurrence-- 651 00:37:00,380 --> 00:37:05,820 T of u is 1 times T of root u plus order 1. 652 00:37:05,820 --> 00:37:07,460 That gives us log log u. 653 00:37:07,460 --> 00:37:12,760 If there was a 2 here, we would get log u, which is no good. 654 00:37:12,760 --> 00:37:13,560 We want the one. 655 00:37:13,560 --> 00:37:16,810 So in one case, we call successor on a cluster. 656 00:37:16,810 --> 00:37:18,630 In the other case, we call successor 657 00:37:18,630 --> 00:37:22,230 on the summary structure. 658 00:37:22,230 --> 00:37:24,840 But we don't want to do both. 659 00:37:24,840 --> 00:37:27,900 So let's just think about, intuitively, what's going on. 660 00:37:27,900 --> 00:37:29,420 We've got this-- 661 00:37:29,420 --> 00:37:31,200 I guess I can do it in the same picture. 662 00:37:31,200 --> 00:37:34,710 We've got this summary and a bunch of clusters. 663 00:37:34,710 --> 00:37:36,870 And let's say you want to compute, what's 664 00:37:36,870 --> 00:37:39,040 the successor of this item? 665 00:37:39,040 --> 00:37:40,830 So via this transformation, we compute 666 00:37:40,830 --> 00:37:44,100 which cluster it lives in and where it is within the cluster. 667 00:37:44,100 --> 00:37:45,040 That's i. 668 00:37:45,040 --> 00:37:46,560 So it's some item here. 
669 00:37:46,560 --> 00:37:49,650 Now, it could be the successor is inside the same cluster. 670 00:37:49,650 --> 00:37:51,870 Maybe there's an item right there. 671 00:37:51,870 --> 00:37:54,330 Then want to recurse in here. 672 00:37:54,330 --> 00:37:57,090 Or it could be, it's in some future cluster. 673 00:38:00,570 --> 00:38:02,910 Let's do the first case. 674 00:38:02,910 --> 00:38:08,190 If, basically, we are less than the max of our own cluster, 675 00:38:08,190 --> 00:38:12,064 that means that the answer is in there. 676 00:38:12,064 --> 00:38:13,980 Figure out what the max is in this structure-- 677 00:38:13,980 --> 00:38:18,780 the rightmost item in s that's inside this cluster c. 678 00:38:18,780 --> 00:38:21,300 This is c. 679 00:38:21,300 --> 00:38:25,845 If our index is less than the max's index, then if we recurse 680 00:38:25,845 --> 00:38:28,219 in here, we will find an answer. 681 00:38:28,219 --> 00:38:29,760 If we're bigger than the max, then we 682 00:38:29,760 --> 00:38:31,051 won't find an answer down here. 683 00:38:31,051 --> 00:38:32,770 We have to recurse somewhere else. 684 00:38:32,770 --> 00:38:34,890 So that's what we do. 685 00:38:34,890 --> 00:38:37,500 If we're less than the max, then we just 686 00:38:37,500 --> 00:38:42,090 recursively find the successor of our index within cluster c. 687 00:38:42,090 --> 00:38:45,630 And we have to add on the c in front. 688 00:38:45,630 --> 00:38:47,460 Because successor within this cluster 689 00:38:47,460 --> 00:38:50,370 will only give an index within the cluster. 690 00:38:50,370 --> 00:38:54,620 And we have to prepend this c part to give a global name. 691 00:38:54,620 --> 00:38:56,070 OK, so that's case 1. 692 00:38:56,070 --> 00:38:57,520 Very easy. 693 00:38:57,520 --> 00:39:01,590 The other case is where we're slightly clever, in some sense. 
694 00:39:01,590 --> 00:39:06,630 We say, OK, well, if there's no successor within the cluster, 695 00:39:06,630 --> 00:39:08,040 maybe it's in the next cluster. 696 00:39:08,040 --> 00:39:09,660 Of course, that one might be empty, in which case, 697 00:39:09,660 --> 00:39:10,480 it's in the next cluster. 698 00:39:10,480 --> 00:39:13,050 But that one might be empty, so look at the next cluster. 699 00:39:13,050 --> 00:39:15,630 We need to find, what is the next non-empty cluster? 700 00:39:15,630 --> 00:39:19,020 For that, we use the summary structure. 701 00:39:19,020 --> 00:39:22,230 So we go up to position c here. 702 00:39:22,230 --> 00:39:25,400 We say, OK, what is the next non-empty structure after c? 703 00:39:25,400 --> 00:39:27,950 Because we know that's going to be where 704 00:39:27,950 --> 00:39:30,187 our answer lives for successor. 705 00:39:30,187 --> 00:39:31,770 So that's going to give us, basically, 706 00:39:31,770 --> 00:39:36,750 a pointer to one of these structures-- c prime, which-- 707 00:39:36,750 --> 00:39:38,249 all these guys are empty. 708 00:39:38,249 --> 00:39:39,790 And so there's no successor in there. 709 00:39:39,790 --> 00:39:43,150 The successor is then the min in this structure. 710 00:39:43,150 --> 00:39:44,160 So that's all we do. 711 00:39:44,160 --> 00:39:48,130 Compute the successor of c in the summary structure. 712 00:39:48,130 --> 00:39:51,900 And then, in that cluster, c prime, 713 00:39:51,900 --> 00:39:54,240 find the min, which takes constant time, 714 00:39:54,240 --> 00:39:59,060 and then prepend c prime to that to get a global name. 715 00:39:59,060 --> 00:40:01,320 And that's our successor. 716 00:40:01,320 --> 00:40:01,970 Yeah, question. 717 00:40:01,970 --> 00:40:05,864 AUDIENCE: Could you repeat why min is not recursive? 
718 00:40:05,864 --> 00:40:07,238 Because looking at this, it looks 719 00:40:07,238 --> 00:40:10,368 like all these smaller [INAUDIBLE] trees have 720 00:40:10,368 --> 00:40:12,715 [INAUDIBLE] 721 00:40:12,715 --> 00:40:13,590 ERIK DEMAINE: Ah, OK. 722 00:40:13,590 --> 00:40:14,295 Sorry. 723 00:40:14,295 --> 00:40:16,505 The question is, why is the minimum not recursive? 724 00:40:16,505 --> 00:40:18,380 The answer to that question is not yet clear. 725 00:40:18,380 --> 00:40:19,890 It will have to do with insertion. 726 00:40:19,890 --> 00:40:22,060 But I think what exactly this means, 727 00:40:22,060 --> 00:40:25,440 I maybe didn't state carefully enough. 728 00:40:25,440 --> 00:40:28,020 Every Van Emde Boas structure has a min-- 729 00:40:28,020 --> 00:40:29,460 stores a min. 730 00:40:29,460 --> 00:40:32,080 In that sense, this is done-- 731 00:40:32,080 --> 00:40:34,320 that's funny-- not so recursively. 732 00:40:34,320 --> 00:40:36,180 But every one stores it. 733 00:40:36,180 --> 00:40:38,850 The point is that this item doesn't 734 00:40:38,850 --> 00:40:40,740 get put into one of these clusters 735 00:40:40,740 --> 00:40:42,670 recursively-- just the item. 736 00:40:42,670 --> 00:40:44,310 But each of these has its own min, 737 00:40:44,310 --> 00:40:46,620 which is then not stored at the next level down. 738 00:40:46,620 --> 00:40:48,720 And each of those has its own min, which is not 739 00:40:48,720 --> 00:40:50,190 stored at the next level down. 740 00:40:50,190 --> 00:40:52,444 Think of this as kind of like a little buffer. 741 00:40:52,444 --> 00:40:54,360 The first time I insert it into the structure, 742 00:40:54,360 --> 00:40:55,568 I just stick it into the min. 743 00:40:55,568 --> 00:40:57,787 I don't touch anything else. 744 00:40:57,787 --> 00:40:59,870 You'll see when we get to the insertion algorithm. 745 00:40:59,870 --> 00:41:02,430 But it sort of slows things down from trickling. 
746 00:41:02,430 --> 00:41:07,126 AUDIENCE: So putting that min, is that what prevents from-- 747 00:41:07,126 --> 00:41:09,000 ERIK DEMAINE: That will prevent the insertion 748 00:41:09,000 --> 00:41:11,051 from doing two recursions instead of one. 749 00:41:11,051 --> 00:41:12,300 So we'll see that in a moment. 750 00:41:12,300 --> 00:41:15,379 At this point, just successor is very clear. 751 00:41:15,379 --> 00:41:17,920 This would work whether the min is stored recursively or not. 752 00:41:17,920 --> 00:41:20,440 But we need to know what the min is of every structure, 753 00:41:20,440 --> 00:41:23,382 and we need to know the max of every structure. 754 00:41:23,382 --> 00:41:25,840 At this point, you could just say that min and max could be 755 00:41:25,840 --> 00:41:27,610 copies-- no big deal-- 756 00:41:27,610 --> 00:41:28,550 and we'd be happy. 757 00:41:28,550 --> 00:41:31,294 And of course, predecessor does the same thing. 758 00:41:31,294 --> 00:41:33,710 So the slight cleverness here is that we use the min here. 759 00:41:33,710 --> 00:41:36,640 This could have been a successor operation with minus infinity 760 00:41:36,640 --> 00:41:37,840 as the query. 761 00:41:37,840 --> 00:41:40,120 But that would be two recursions. 762 00:41:40,120 --> 00:41:41,137 We can only afford one. 763 00:41:41,137 --> 00:41:42,970 Fortunately, it's the min item that we need. 764 00:41:42,970 --> 00:41:45,740 So we're done with successor. 765 00:41:45,740 --> 00:41:46,870 That was the easy case-- 766 00:41:46,870 --> 00:41:47,800 or the easy one. 767 00:41:47,800 --> 00:41:50,710 Insert is slightly harder. 768 00:41:50,710 --> 00:41:53,065 Delete is just slightly messier. 769 00:41:53,065 --> 00:41:54,570 It's basically the same as insert. 770 00:41:59,610 --> 00:42:03,790 So insert-- let me write the code again. 771 00:43:17,170 --> 00:43:20,340 Insertion also has two main cases. 772 00:43:20,340 --> 00:43:22,620 There's this case, and the other case. 
773 00:43:22,620 --> 00:43:23,850 But there's no else here. 774 00:43:23,850 --> 00:43:25,650 This happens in both cases. 775 00:43:25,650 --> 00:43:27,900 And then there's some just annoying little details 776 00:43:27,900 --> 00:43:28,800 at the beginning. 777 00:43:28,800 --> 00:43:31,410 Just like over here, we had to check for the min specially, 778 00:43:31,410 --> 00:43:34,170 here, we've got to update the min and max. 779 00:43:34,170 --> 00:43:37,836 And there's a special case, which I haven't mentioned yet. 780 00:43:37,836 --> 00:43:44,700 v dot min-- special case is, it will be this value, none, 781 00:43:44,700 --> 00:43:48,480 if the whole structure is empty. 782 00:43:48,480 --> 00:43:52,740 So this is the obvious way to tell whether a structure is 783 00:43:52,740 --> 00:43:54,247 empty and has no min. 784 00:43:54,247 --> 00:43:55,830 Because if there's any items in there, 785 00:43:55,830 --> 00:43:57,810 there's going to be one in the min slot. 786 00:43:57,810 --> 00:44:00,410 So first thing we do is check, is our structure empty? 787 00:44:00,410 --> 00:44:04,710 If it's empty, the min and the max become the inserted item. 788 00:44:04,710 --> 00:44:06,050 We're done. 789 00:44:06,050 --> 00:44:07,410 So that's the easy case. 790 00:44:07,410 --> 00:44:11,820 We do not store it recursively in here. 791 00:44:11,820 --> 00:44:14,580 That's what this means. 792 00:44:14,580 --> 00:44:17,894 This element does not get stored in any of the clusters. 793 00:44:17,894 --> 00:44:20,310 If it's not the very first item, or it's not the min item, 794 00:44:20,310 --> 00:44:24,520 then we're going to recursively insert it into a cluster. 795 00:44:24,520 --> 00:44:29,130 So if we have x in cluster c, we always 796 00:44:29,130 --> 00:44:36,840 insert index i into cluster c, except if it's the min. 797 00:44:36,840 --> 00:44:39,480 Now, it could be where a structure is non-empty. 798 00:44:39,480 --> 00:44:40,612 There is a min item there. 
799 00:44:40,612 --> 00:44:41,820 But we are less than the min. 800 00:44:41,820 --> 00:44:43,650 In that case, we're the new min, and we just swap those. 801 00:44:43,650 --> 00:44:45,733 And now, we have to recursively insert the old min 802 00:44:45,733 --> 00:44:47,680 into the rest of the structure. 803 00:44:47,680 --> 00:44:49,290 So that's a simple case. 804 00:44:49,290 --> 00:44:50,930 Then we also have to update v dot max, 805 00:44:50,930 --> 00:44:51,930 just in the obvious way. 806 00:44:51,930 --> 00:44:55,869 This is the easy way to maintain the v dot max invariant, 807 00:44:55,869 --> 00:44:56,910 that it is the maximum item. 808 00:44:56,910 --> 00:45:00,240 OK, now we have the two cases. 809 00:45:00,240 --> 00:45:02,100 I mean, this is really the obvious thing 810 00:45:02,100 --> 00:45:03,870 to do to get insertion. 811 00:45:03,870 --> 00:45:06,900 We have to update the summary structure, meaning, 812 00:45:06,900 --> 00:45:10,020 if the cluster that we are inserting into-- cluster c-- 813 00:45:10,020 --> 00:45:13,330 is empty, that means it was not yet in the summary structure. 814 00:45:13,330 --> 00:45:14,500 We need to put it in there. 815 00:45:14,500 --> 00:45:17,190 So we just insert c into v dot summary-- 816 00:45:17,190 --> 00:45:18,370 pretty obvious. 817 00:45:18,370 --> 00:45:24,044 And in all cases, we insert our item into cluster c. 818 00:45:24,044 --> 00:45:25,710 This looks bad, however, because there's 819 00:45:25,710 --> 00:45:27,820 two recursions in some cases. 820 00:45:27,820 --> 00:45:29,880 If this if doesn't hold, it's one recursion. 821 00:45:29,880 --> 00:45:30,930 Everything's fine. 822 00:45:30,930 --> 00:45:34,320 So if the cluster was already in use, great. 823 00:45:34,320 --> 00:45:35,770 This is one recursion. 824 00:45:35,770 --> 00:45:37,370 This is constant work. 825 00:45:37,370 --> 00:45:38,550 We're done. 
826 00:45:38,550 --> 00:45:40,800 The worry is, if the cluster was empty 827 00:45:40,800 --> 00:45:44,670 before, then this insertion is a whole recursion. 828 00:45:44,670 --> 00:45:48,010 That's scary, because we can't afford a second recursion. 829 00:45:48,010 --> 00:45:50,310 But it's all OK. 830 00:45:50,310 --> 00:45:53,160 Because if we do this recursion, that 831 00:45:53,160 --> 00:45:56,250 means that this cluster was empty, which means, 832 00:45:56,250 --> 00:45:59,910 in this recursion, we fall into this very first case. 833 00:45:59,910 --> 00:46:01,950 That structure, its min is none. 834 00:46:01,950 --> 00:46:03,750 That's what we just checked for. 835 00:46:03,750 --> 00:46:06,572 If it's none, we do constant work and stop. 836 00:46:06,572 --> 00:46:10,250 So everything's OK. 837 00:46:10,250 --> 00:46:13,170 If we recursed in the summary structure, 838 00:46:13,170 --> 00:46:15,060 this recursion will be a shallow recursion. 839 00:46:15,060 --> 00:46:16,290 It just does one thing. 840 00:46:16,290 --> 00:46:23,340 You could actually put this code into this if case, 841 00:46:23,340 --> 00:46:25,050 and make this an else case. 842 00:46:25,050 --> 00:46:26,814 That's another way to write the code. 843 00:46:26,814 --> 00:46:28,480 But this will be a very short recursion. 844 00:46:28,480 --> 00:46:30,580 So either you just do this recursion, 845 00:46:30,580 --> 00:46:32,160 which could be expensive, or you just 846 00:46:32,160 --> 00:46:34,470 do this one, in which case, we know this one was cheap. 847 00:46:34,470 --> 00:46:36,790 If this happens, we know this will take constant time. 848 00:46:36,790 --> 00:46:39,660 So in both cases, we get this recursion-- 849 00:46:39,660 --> 00:46:43,200 square root of u plus constant. 850 00:46:43,200 --> 00:46:45,086 And so we get log log u insertion. 851 00:46:48,507 --> 00:46:49,590 Do you want to see delete? 852 00:46:49,590 --> 00:46:51,410 I mean, it's basically the same thing. 
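Putting the pieces described so far into one runnable sketch (my own Python, not the lecture's board pseudocode; it assumes distinct keys, creates clusters lazily to stay short, and covers insert and successor as just explained):

```python
class VEB:
    """Sketch of a van Emde Boas structure over w-bit keys (u = 2**w).

    As in the lecture: v.min is stored off to the side, never recursively;
    v.max is just a copy; and each insert/successor makes only one deep
    recursive call, giving T(u) = T(sqrt(u)) + O(1) = O(log log u).
    """

    def __init__(self, w):
        self.w = w            # this level conceptually handles w-bit keys
        self.min = None       # None means this structure is empty
        self.max = None
        self.clusters = {}    # cluster number -> VEB(w // 2), made lazily
        self.summary = None   # VEB(w // 2) over the non-empty cluster numbers

    def _high(self, x):                       # c: which cluster
        return x >> (self.w // 2)

    def _low(self, x):                        # i: index within the cluster
        return x & ((1 << (self.w // 2)) - 1)

    def _index(self, c, i):                   # rebuild x from <c, i>
        return (c << (self.w // 2)) | i

    def insert(self, x):
        if self.min is None:                  # empty: x becomes min and max, O(1)
            self.min = self.max = x
            return
        if x < self.min:                      # x is the new min; the old min
            x, self.min = self.min, x         # gets inserted recursively instead
        if self.w > 1:
            c, i = self._high(x), self._low(x)
            if c not in self.clusters:
                self.clusters[c] = VEB(self.w // 2)
            if self.clusters[c].min is None:  # cluster was empty: update summary
                if self.summary is None:      # (then the cluster insert below
                    self.summary = VEB(self.w // 2)     # hits the O(1) case)
                self.summary.insert(c)
            self.clusters[c].insert(i)
        if x > self.max:
            self.max = x

    def successor(self, x):
        if self.min is not None and x < self.min:
            return self.min                   # min is off to the side: check it
        if self.w == 1:                       # base case: universe is {0, 1}
            return 1 if x == 0 and self.max == 1 else None
        c, i = self._high(x), self._low(x)
        cl = self.clusters.get(c)
        if cl is not None and cl.max is not None and i < cl.max:
            return self._index(c, cl.successor(i))     # answer is in cluster c
        if self.summary is not None:
            cp = self.summary.successor(c)              # next non-empty cluster
            if cp is not None:
                return self._index(cp, self.clusters[cp].min)
        return None
```

With w = 4 (so u = 16) and the set {1, 9, 10, 15}, successor(4) returns 9 and successor(15) returns None. Delete, covered next in the lecture, follows the same one-deep-recursion pattern.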
853 00:46:51,410 --> 00:46:53,995 It's in the notes. 854 00:46:53,995 --> 00:46:55,620 I mean, you do the obvious thing, which 855 00:46:55,620 --> 00:46:57,919 is, you delete in the cluster. 856 00:46:57,919 --> 00:46:59,460 And then if it became empty, you also 857 00:46:59,460 --> 00:47:02,550 have to delete in the summary structure. 858 00:47:02,550 --> 00:47:05,510 So there's, again, a chance that you do two recursions. 859 00:47:05,510 --> 00:47:08,130 But-- OK, I'm talking about it. 860 00:47:08,130 --> 00:47:10,920 Maybe I'll write a little bit of the code. 861 00:47:17,672 --> 00:47:19,130 I think I won't write all the code, 862 00:47:19,130 --> 00:47:20,450 though-- just the main stuff. 863 00:47:24,600 --> 00:47:31,130 So if we want to delete, then basically, 864 00:47:31,130 --> 00:47:36,800 we delete in cluster c, index i. 865 00:47:39,920 --> 00:47:44,510 And then if the cluster has become empty 866 00:47:44,510 --> 00:47:49,970 as a result of that, then we have 867 00:47:49,970 --> 00:47:53,870 to delete cluster c from the summary structure, 868 00:47:53,870 --> 00:47:56,240 so that our predecessor and successor queries actually 869 00:47:56,240 --> 00:47:56,930 still work. 870 00:48:04,132 --> 00:48:05,590 OK, so that's the bulk of the code. 871 00:48:05,590 --> 00:48:07,256 I mean, that's where the action happens. 872 00:48:07,256 --> 00:48:09,190 And the worry would be, in this if case, we're 873 00:48:09,190 --> 00:48:12,400 doing two recursive deletes. 874 00:48:12,400 --> 00:48:16,300 The claim is, if we do this second delete, 875 00:48:16,300 --> 00:48:19,930 which is potentially expensive-- this one was really cheap-- 876 00:48:19,930 --> 00:48:23,429 the claim is that emptying a Van Emde Boas structure 877 00:48:23,429 --> 00:48:24,970 takes constant time-- like, if you're 878 00:48:24,970 --> 00:48:26,841 deleting the last element. 879 00:48:26,841 --> 00:48:27,340 Why? 
880 00:48:27,340 --> 00:48:29,760 Because when you're deleting the last element, 881 00:48:29,760 --> 00:48:32,260 it's in the min right here. 882 00:48:32,260 --> 00:48:35,020 Everything below it-- all the recursive structures-- 883 00:48:35,020 --> 00:48:36,670 will be empty if there's only one item, 884 00:48:36,670 --> 00:48:37,919 because it will be right here. 885 00:48:37,919 --> 00:48:39,970 And you can check that from the insertion. 886 00:48:39,970 --> 00:48:43,390 If it was empty, all we did was change v dot min and v dot max. 887 00:48:43,390 --> 00:48:45,690 So the inverse, which I want right here, 888 00:48:45,690 --> 00:48:48,350 is just to clear out v dot min and v dot max. 889 00:48:48,350 --> 00:48:52,630 So if this ends up happening, this only took constant time. 890 00:48:52,630 --> 00:48:55,570 You don't have to recurse when you're deleting the last item. 891 00:48:55,570 --> 00:48:59,000 So in either case, you're really only doing one deep recursion. 892 00:48:59,000 --> 00:49:01,870 So you get the same recurrence, and you get log log u. 893 00:49:01,870 --> 00:49:04,390 So for the details, check out the notes. 894 00:49:04,390 --> 00:49:09,250 I want to go to other perspectives of Van Emde Boas. 895 00:49:09,250 --> 00:49:11,110 This is one way to think about it. 896 00:49:11,110 --> 00:49:14,260 And amusingly, and this is probably the most taught way 897 00:49:14,260 --> 00:49:16,540 to do Van Emde Boas. 898 00:49:16,540 --> 00:49:19,120 It's, in CLRS, described this way, 899 00:49:19,120 --> 00:49:21,967 because in 2001, when I first came here, 900 00:49:21,967 --> 00:49:24,550 I presented Van Emde Boas like this in an undergrad algorithms 901 00:49:24,550 --> 00:49:26,930 class with more details. 902 00:49:26,930 --> 00:49:29,500 You guys are grads, so I did it like three times faster 903 00:49:29,500 --> 00:49:34,497 than I would in 6046. 904 00:49:34,497 --> 00:49:36,080 So now, it's in textbooks and whatnot. 
905 00:49:36,080 --> 00:49:37,640 But this is not how Van Emde Boas 906 00:49:37,640 --> 00:49:39,832 presented this data structure-- just out 907 00:49:39,832 --> 00:49:40,790 of historical interest. 908 00:49:40,790 --> 00:49:44,401 This is a way that I believe was invented by Michael Bender 909 00:49:44,401 --> 00:49:46,400 and Martin Farach-Colton, who are the co-authors 910 00:49:46,400 --> 00:49:48,080 on "Cache-oblivious B-trees." 911 00:49:48,080 --> 00:49:49,730 And around 2001, they were looking 912 00:49:49,730 --> 00:49:52,680 at lots of old data structures and simplifying them. 913 00:49:52,680 --> 00:49:54,800 And I think this is a very clean, simple way 914 00:49:54,800 --> 00:49:56,429 to think about Van Emde Boas. 915 00:49:56,429 --> 00:49:58,220 But I want to tell you the other way, which 916 00:49:58,220 --> 00:50:02,600 is the way it originally appeared in the papers. 917 00:50:02,600 --> 00:50:05,840 There are actually three papers by van Emde 918 00:50:05,840 --> 00:50:09,260 Boas about this structure. 919 00:50:09,260 --> 00:50:10,640 Many papers appear twice-- 920 00:50:10,640 --> 00:50:12,770 once in a conference, once in a journal-- 921 00:50:12,770 --> 00:50:15,350 for this one, there are three relevant papers. 922 00:50:15,350 --> 00:50:18,105 There's a conference version and a journal version. 923 00:50:18,105 --> 00:50:20,480 The only weird thing there is that the conference version 924 00:50:20,480 --> 00:50:22,150 has one author-- van Emde Boas. 925 00:50:22,150 --> 00:50:24,130 The journal version has three authors-- 926 00:50:24,130 --> 00:50:28,135 van Emde Boas, Kaas, and Zijlstra. 927 00:50:28,135 --> 00:50:30,260 And they're acknowledged in the conference version, 928 00:50:30,260 --> 00:50:32,990 so I guess they helped even more. 929 00:50:32,990 --> 00:50:35,540 In particular, they, I think, implemented this data structure 930 00:50:35,540 --> 00:50:36,180 for the first time.
931 00:50:36,180 --> 00:50:37,554 It's a really easy data structure 932 00:50:37,554 --> 00:50:38,860 to implement, and very fast. 933 00:50:41,370 --> 00:50:43,400 Then there's a third paper by van Emde Boas 934 00:50:43,400 --> 00:50:47,010 only in a journal which improves the space a little bit. 935 00:50:47,010 --> 00:50:50,860 So we'll see a little bit what that's about. 936 00:50:50,860 --> 00:50:52,610 But what I like about both of these papers 937 00:50:52,610 --> 00:51:00,140 is they offer a simpler way to get log log u, successor, 938 00:51:00,140 --> 00:51:01,490 predecessor. 939 00:51:01,490 --> 00:51:04,490 Let's not worry about insertions and deletions for a little bit, 940 00:51:04,490 --> 00:51:08,990 and take what I'll call the simple tree view. 941 00:51:14,660 --> 00:51:18,150 So I'm going to draw a picture-- 942 00:51:18,150 --> 00:51:21,760 0, 1, 0, 0, 0, 0, 0-- 943 00:51:27,780 --> 00:51:29,720 OK. 944 00:51:29,720 --> 00:51:36,200 This is what we call a bit vector, meaning, here's 945 00:51:36,200 --> 00:51:38,510 item zero, item one, item two. 946 00:51:38,510 --> 00:51:42,195 And here is u minus 1. 947 00:51:42,195 --> 00:51:45,470 And I'll put a 1 if that element is in my set, and a 0 948 00:51:45,470 --> 00:51:47,280 otherwise. 949 00:51:47,280 --> 00:51:51,680 OK, so one is in the set, nine-- 950 00:51:51,680 --> 00:51:55,190 I think-- is in the set, 10, and 15 are in the set. 951 00:51:58,597 --> 00:51:59,930 I kind of want to maintain this. 952 00:51:59,930 --> 00:52:00,980 This is, of course, easy to maintain 953 00:52:00,980 --> 00:52:02,146 by insertions and deletions. 954 00:52:02,146 --> 00:52:03,740 I just flip a bit on or off. 955 00:52:03,740 --> 00:52:05,760 But I want to be able to do successor queries. 956 00:52:05,760 --> 00:52:07,850 And if I want the successor of, say, this 0, 957 00:52:07,850 --> 00:52:08,744 finding the next 1-- 958 00:52:08,744 --> 00:52:10,160 I don't want to have to walk down. 
959 00:52:10,160 --> 00:52:12,890 That would take order u time-- very bad. 960 00:52:12,890 --> 00:52:15,290 So the obvious thing to do is build a tree on this thing. 961 00:52:20,990 --> 00:52:25,265 And I'm going to put in here the or of the two children. 962 00:52:25,265 --> 00:52:27,140 Every node will store the or of its children. 963 00:52:31,160 --> 00:52:32,990 And then keep building the tree. 964 00:52:44,630 --> 00:52:49,400 Now we have a binary tree, with bits on the vertices. 965 00:52:49,400 --> 00:52:51,500 And I claim, if I want to compute 966 00:52:51,500 --> 00:52:54,020 the successor of this item, I can do it 967 00:52:54,020 --> 00:52:58,290 in a pretty natural way in log log u time. 968 00:52:58,290 --> 00:53:03,850 So keep in mind, this height here is w-- 969 00:53:03,850 --> 00:53:04,350 log u. 970 00:53:07,610 --> 00:53:09,270 So I need to achieve log w. 971 00:53:09,270 --> 00:53:12,740 So of course, you could try just walking down this tree, 972 00:53:12,740 --> 00:53:14,660 or walking up and then back down. 973 00:53:14,660 --> 00:53:17,510 That would take order w time. 974 00:53:17,510 --> 00:53:19,340 That's the obvious BST approach. 975 00:53:19,340 --> 00:53:21,600 I want to do log w. 976 00:53:21,600 --> 00:53:22,360 So how do I do it? 977 00:53:22,360 --> 00:53:27,626 I'm going to binary search on the height. 978 00:53:27,626 --> 00:53:29,920 How could I binary search on the height? 979 00:53:29,920 --> 00:53:33,340 Well, what I'd really like to do, in some sense-- 980 00:53:33,340 --> 00:53:37,570 if I look at the path of this node to the root-- 981 00:53:37,570 --> 00:53:40,940 where is my red chalk? 982 00:53:40,940 --> 00:53:43,710 So here's the path to the root. 983 00:53:46,840 --> 00:53:50,540 These bits are saying, is there anybody down here? 984 00:53:50,540 --> 00:53:52,870 That's what the or gives you. 985 00:53:52,870 --> 00:53:55,540 So it's like the summary structure.
986 00:53:55,540 --> 00:53:59,590 If I want to search for this guy-- well, if I walked up, 987 00:53:59,590 --> 00:54:01,660 eventually, I find a 1. 988 00:54:01,660 --> 00:54:04,180 And that's when I find the first nearby element. 989 00:54:04,180 --> 00:54:06,220 Now, in this case it's not the successor I find. 990 00:54:06,220 --> 00:54:08,110 It's really the predecessor I found. 991 00:54:08,110 --> 00:54:11,320 When you get to the first one-- the transition from 0 to 1-- 992 00:54:11,320 --> 00:54:12,730 you look at your sibling-- 993 00:54:12,730 --> 00:54:15,250 the other child of that one. 994 00:54:15,250 --> 00:54:19,210 And down in this subtree, there will be either the predecessor 995 00:54:19,210 --> 00:54:20,274 or the successor. 996 00:54:20,274 --> 00:54:21,940 In this case, we've got the predecessor, 997 00:54:21,940 --> 00:54:23,460 because it was to the left. 998 00:54:23,460 --> 00:54:25,140 We take the max element in there, 999 00:54:25,140 --> 00:54:27,081 and that's the predecessor of this item. 1000 00:54:27,081 --> 00:54:29,080 If instead, we had found this was our first one, 1001 00:54:29,080 --> 00:54:30,856 then we look over here, take the min-- 1002 00:54:30,856 --> 00:54:32,230 there's, of course, nothing here. 1003 00:54:32,230 --> 00:54:35,110 But in that situation, the min over there 1004 00:54:35,110 --> 00:54:36,670 would be our successor. 1005 00:54:36,670 --> 00:54:39,220 So we can't guarantee which one we find. 1006 00:54:39,220 --> 00:54:42,130 But we will find either the predecessor or the successor 1007 00:54:42,130 --> 00:54:45,410 if we could find the first transition from 0 to 1. 1008 00:54:45,410 --> 00:54:47,470 And we can do that via binary search, 1009 00:54:47,470 --> 00:54:49,596 because this string is monotone. 
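Here is one way the search just described might look in code. This is an illustrative sketch, not the lecture's code: the OR-tree sits heap-style in an array (leaf x at index u + x), and `subtree_min`/`subtree_max` walk down for simplicity, whereas the real structure stores the min and the max at every node so that step is constant time.

```python
u = 16
H = u.bit_length() - 1          # height of the tree: log2(u) levels of ORs
tree = [0] * (2 * u)            # heap-style: root at index 1, leaf x at u + x

def insert(x):
    j = u + x
    while j >= 1 and tree[j] == 0:   # set the leaf, propagate the OR upward
        tree[j] = 1
        j //= 2

def subtree_min(j):             # walk down to the smallest 1-leaf; the real
    while j < u:                # structure stores min/max per node instead,
        j = 2 * j if tree[2 * j] else 2 * j + 1     # making this O(1)
    return j - u

def subtree_max(j):
    while j < u:
        j = 2 * j + 1 if tree[2 * j + 1] else 2 * j
    return j - u

def pred_or_succ(x):
    """Return ('pred', p) or ('succ', s) for a key x not in the set, by
    binary searching x's root-to-leaf path for the 0 -> 1 transition."""
    if tree[u + x]:
        return ('member', x)    # the special case set aside in the lecture
    if not tree[1]:
        return None             # empty set: no 1 anywhere on the path
    lo, hi = 1, H               # heights above the leaf
    while lo < hi:              # O(log log u) probes of the path
        mid = (lo + hi) // 2
        if tree[(u + x) >> mid]:
            hi = mid
        else:
            lo = mid + 1
    child = (u + x) >> (lo - 1)  # the 0-node just below the first 1-ancestor
    sib = child ^ 1              # its sibling is guaranteed to be a 1
    if sib < child:              # sibling to the left: it holds the predecessor
        return ('pred', subtree_max(sib))
    return ('succ', subtree_min(sib))   # sibling to the right: the successor

for key in [1, 9, 10, 15]:       # the example set on the board
    insert(key)
```

Only the `while lo < hi` loop does the real work: log log u probes of the path, instead of walking all log u levels.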
1010 00:54:49,596 --> 00:54:51,220 It's a whole bunch of zeros for a while, 1011 00:54:51,220 --> 00:54:52,761 and then once you get a 1, it's going 1012 00:54:52,761 --> 00:54:54,730 to continue to be 1, because those are ors. 1013 00:54:54,730 --> 00:54:55,880 That one will propagate up. 1014 00:55:18,090 --> 00:55:21,210 So this is the new idea to get log log u, predecessor, 1015 00:55:21,210 --> 00:55:25,376 successor is to-- 1016 00:55:25,376 --> 00:55:34,555 let's say-- any root-to-leaf path is monotone. 1017 00:55:34,555 --> 00:55:37,090 It's 0 for a while, and then it becomes 1 forever. 1018 00:55:40,550 --> 00:55:44,890 So we should be able to binary search for the 0 1019 00:55:44,890 --> 00:55:46,236 to 1 transition. 1020 00:55:51,200 --> 00:55:57,470 And it either looks like this, or it looks like this. 1021 00:55:57,470 --> 00:56:04,745 So our query was somewhere down here in the 0 part. 1022 00:56:04,745 --> 00:56:06,370 I'm assuming that our query is not a 1. 1023 00:56:06,370 --> 00:56:08,990 Otherwise, it's an immediate 0 to 1 transition. 1024 00:56:08,990 --> 00:56:10,410 And that's a special case. 1025 00:56:10,410 --> 00:56:11,770 It's easy to deal with. 1026 00:56:11,770 --> 00:56:17,190 And then there's the other tree-- 1027 00:56:17,190 --> 00:56:19,450 the sibling of x-- 1028 00:56:19,450 --> 00:56:22,810 the other child of the 1. 1029 00:56:22,810 --> 00:56:25,870 And in this case, we want to take the min. 1030 00:56:25,870 --> 00:56:28,240 And that will give us our successor of x. 1031 00:56:31,540 --> 00:56:34,219 And in this case, we want to take the max over here, 1032 00:56:34,219 --> 00:56:36,010 and that will give us the predecessor of x. 1033 00:56:41,110 --> 00:56:42,860 So as long as we have the min and max of subtrees, 1034 00:56:42,860 --> 00:56:44,690 this is constant time. 1035 00:56:44,690 --> 00:56:47,480 We find either the predecessor or the successor. 1036 00:56:47,480 --> 00:56:49,400 Now, how do we get the other one?
1037 00:56:49,400 --> 00:56:50,330 Pretty easy. 1038 00:56:50,330 --> 00:56:54,140 Just store a linked list of all the items, in order. 1039 00:56:54,140 --> 00:56:57,980 So I'm going to store a pointer from this one to this one, 1040 00:56:57,980 --> 00:56:59,390 and vice versa-- 1041 00:56:59,390 --> 00:57:01,020 and this one to this one. 1042 00:57:01,020 --> 00:57:04,210 This is actually really easy to maintain. 1043 00:57:04,210 --> 00:57:07,394 Because when you insert, if you can compute 1044 00:57:07,394 --> 00:57:08,810 the predecessor and the successor, 1045 00:57:08,810 --> 00:57:10,280 you can just stick it in the linked list. 1046 00:57:10,280 --> 00:57:11,100 That's really easy. 1047 00:57:11,100 --> 00:57:13,260 We know how to do that in constant time. 1048 00:57:13,260 --> 00:57:15,770 So once you do this, it's enough to find one of them, 1049 00:57:15,770 --> 00:57:17,270 as long as you know which one it is. 1050 00:57:17,270 --> 00:57:18,830 Because then you just follow a pointer-- 1051 00:57:18,830 --> 00:57:19,990 either a forward or a backward pointer-- 1052 00:57:19,990 --> 00:57:21,060 and you get the other one. 1053 00:57:21,060 --> 00:57:22,268 So whichever one you wanted-- 1054 00:57:22,268 --> 00:57:24,350 you find both the predecessor and successor 1055 00:57:24,350 --> 00:57:26,690 at the cost of finding either one. 1056 00:57:26,690 --> 00:57:30,170 So that's a cute little trick. 1057 00:57:30,170 --> 00:57:34,610 This is hard to maintain, dynamically, at the moment. 1058 00:57:34,610 --> 00:57:37,670 But this is, I think, where the Van Emde Boas 1059 00:57:37,670 --> 00:57:39,080 structure came from. 1060 00:57:39,080 --> 00:57:42,830 It's nice to think about it in the tree view. 1061 00:57:42,830 --> 00:57:51,320 So we get log log u predecessor and successor. 1062 00:57:54,260 --> 00:57:58,040 I should say what this relies on is the ability to binary search 1063 00:57:58,040 --> 00:57:59,750 on any root-to-node path.
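A sketch of the linked-list trick, with dicts standing in for the forward and backward pointers (the names here are illustrative, not the lecture's):

```python
# dicts stand in for the forward/backward pointers of a sorted linked list
pred_ptr, succ_ptr = {}, {}

def splice_in(x, pred, succ):
    # insertion is easy once the neighbors are known: just relink them
    pred_ptr[x], succ_ptr[x] = pred, succ
    if pred is not None:
        succ_ptr[pred] = x
    if succ is not None:
        pred_ptr[succ] = x

def both_neighbors(kind, y):
    # the tree search returned ('pred', y) or ('succ', y); one pointer hop
    # recovers the other neighbor, so we always get the pair
    return (y, succ_ptr[y]) if kind == 'pred' else (pred_ptr[y], y)

# build 1 <-> 9 <-> 10 <-> 15, the example set on the board
for x, p in [(1, None), (9, 1), (10, 9), (15, 10)]:
    splice_in(x, p, None)
```

So whichever of the two the tree search happens to return, one pointer hop recovers the pair.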
1064 00:57:59,750 --> 00:58:03,740 Now, there aren't enough pointers to do that. 1065 00:58:03,740 --> 00:58:04,680 So you have a choice. 1066 00:58:04,680 --> 00:58:07,520 Either you realize, oh, this is a bunch 1067 00:58:07,520 --> 00:58:09,990 of bits in a complete binary tree, 1068 00:58:09,990 --> 00:58:14,090 so I can store them sequentially in an array. 1069 00:58:14,090 --> 00:58:18,050 And given a particular node position in that array, 1070 00:58:18,050 --> 00:58:20,660 I can compute, what is the second ancestor, 1071 00:58:20,660 --> 00:58:23,970 or the fourth ancestor or whatever, in constant time. 1072 00:58:23,970 --> 00:58:26,330 I just do some arithmetic and I can compute from here 1073 00:58:26,330 --> 00:58:27,205 where to go to there. 1074 00:58:27,205 --> 00:58:29,630 It's like the regular old heaps, but a little bit 1075 00:58:29,630 --> 00:58:31,255 embellished, because you have to divide 1076 00:58:31,255 --> 00:58:33,540 by a larger power of two, not just one of them. 1077 00:58:33,540 --> 00:58:36,000 So that's one way to do it. 1078 00:58:36,000 --> 00:58:39,310 So in a RAM, that all works fine. 1079 00:58:39,310 --> 00:58:42,145 When van Emde Boas wrote this paper, though, the RAM didn't-- 1080 00:58:42,145 --> 00:58:43,490 it kind of existed. 1081 00:58:43,490 --> 00:58:45,590 It just wasn't as well-developed then. 1082 00:58:45,590 --> 00:58:49,520 And the hot thing at the time was the pointer machine, 1083 00:58:49,520 --> 00:58:52,310 or I guess at that point, they called it the Pascal machine, 1084 00:58:52,310 --> 00:58:53,690 more or less. 1085 00:58:53,690 --> 00:58:55,280 Pascal does have arrays. 1086 00:58:55,280 --> 00:58:59,660 And the funny thing is, Van Emde Boas does use arrays, 1087 00:58:59,660 --> 00:59:01,220 but mostly it's pointers. 1088 00:59:01,220 --> 00:59:03,840 And you can get rid of the arrays from their structure.
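The arithmetic being referred to is just heap indexing. With the root stored at index 1, dividing by 2**k -- a right shift -- jumps k levels up:

```python
def ancestor(j, k):
    # k-th ancestor of heap index j: divide by 2**k, i.e. shift right by k
    return j >> k
```

For example, node 13's parent is 6, its grandparent is 3, and its great-grandparent is the root, 1.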
1089 00:59:03,840 --> 00:59:07,040 And essentially, in the end, Van Emde Boas, 1090 00:59:07,040 --> 00:59:10,744 as presented like this, is in a pointer machine. 1091 00:59:10,744 --> 00:59:12,410 Let me tell you a little bit about that. 1092 00:59:16,440 --> 00:59:28,040 So original Van Emde Boas, which I'll call stratified trees-- 1093 00:59:28,040 --> 00:59:30,980 that's what he called it-- 1094 00:59:30,980 --> 00:59:35,520 is basically this tree structure with a lot more pointers. 1095 00:59:35,520 --> 00:59:39,500 So in particular, each leaf-- 1096 00:59:39,500 --> 00:59:42,000 or every node, actually, let's say-- 1097 00:59:42,000 --> 00:59:54,080 stores a pointer to 2 to the ith ancestor, 1098 00:59:54,080 --> 01:00:02,123 where i is 0, 1, up to log w. 1099 01:00:02,123 --> 01:00:04,590 Because it was the 2 to the-- here. 1100 01:00:04,590 --> 01:00:08,047 So once you get the ancestor immediately 1101 01:00:08,047 --> 01:00:10,130 above me, two steps above me, four steps above me, 1102 01:00:10,130 --> 01:00:11,880 eight steps above me, that's what I really 1103 01:00:11,880 --> 01:00:13,700 need to do this binary search. 1104 01:00:13,700 --> 01:00:16,094 The first thing I need is halfway up. 1105 01:00:16,094 --> 01:00:17,510 And then if I have to go down, I'm 1106 01:00:17,510 --> 01:00:19,310 going to need a quarter of the way up. 1107 01:00:19,310 --> 01:00:21,800 And if I have to go down, I want an eighth of the way up. 1108 01:00:21,800 --> 01:00:25,110 Whenever I go up, from-- if I decide, oh, this is a 0. 1109 01:00:25,110 --> 01:00:26,432 I've got to go above here. 1110 01:00:26,432 --> 01:00:27,890 Then I do the same thing from here. 1111 01:00:27,890 --> 01:00:29,970 I want to go halfway up from here-- 1112 01:00:29,970 --> 01:00:30,830 from this node. 1113 01:00:30,830 --> 01:00:36,140 So as long as every node knows how to go up by any power of 2, 1114 01:00:36,140 --> 01:00:36,800 we're golden. 
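This pointer table is, in modern terms, exactly the bookkeeping known as binary lifting. A small illustrative sketch -- the dict-of-lists representation is a choice made here; in the paper these pointers hang off each node:

```python
def build_up(parent, height):
    """up[v][i] = the 2**i-th ancestor of v (None once you pass the root).
    parent maps each node to its parent (the root maps to None)."""
    levels = height.bit_length()              # enough i's to span the height
    up = {v: [p] for v, p in parent.items()}  # i = 0: the plain parent
    for i in range(1, levels + 1):
        for v in up:
            half = up[v][i - 1]               # go up 2**(i-1) steps...
            up[v].append(up[half][i - 1] if half in up else None)  # ...twice
    return up
```

With these pointers in hand, every "go halfway up from here" step of the binary search is a single pointer follow.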
1115 01:00:36,800 --> 01:00:39,396 We can do a binary search. 1116 01:00:39,396 --> 01:00:41,270 The trouble with this is, it increases space. 1117 01:00:41,270 --> 01:00:47,560 This is u log w space, which is a little bit bigger than u. 1118 01:00:47,560 --> 01:00:50,060 And the original van Emde Boas paper, conference and journal 1119 01:00:50,060 --> 01:00:51,830 version, achieves this bound-- 1120 01:00:51,830 --> 01:00:53,210 not u. 1121 01:00:53,210 --> 01:00:55,790 Little historical fun fact-- 1122 01:00:55,790 --> 01:00:58,250 not terribly well known. 1123 01:00:58,250 --> 01:00:58,790 Cool. 1124 01:00:58,790 --> 01:01:01,235 So that's stratified trees. 1125 01:01:05,620 --> 01:01:08,100 Anything else? 1126 01:01:08,100 --> 01:01:08,600 All right. 1127 01:01:08,600 --> 01:01:09,680 Stratified tree. 1128 01:01:09,680 --> 01:01:10,730 Right. 1129 01:01:10,730 --> 01:01:13,547 At this point, we have fast search, but slow updates. 1130 01:01:13,547 --> 01:01:15,130 Let me tell you about updates in a second. 1131 01:01:15,130 --> 01:01:16,080 Yeah, question. 1132 01:01:16,080 --> 01:01:19,002 AUDIENCE: So once you do binary search to find the first 1, 1133 01:01:19,002 --> 01:01:22,910 how do you walk back down the tree-- 1134 01:01:22,910 --> 01:01:24,400 ERIK DEMAINE: Oh, I didn't mention, 1135 01:01:24,400 --> 01:01:26,360 but also, every node stores min and max. 1136 01:01:33,500 --> 01:01:36,950 So that lets me do the teleportation back down. 1137 01:01:36,950 --> 01:01:39,570 Every node knows the min and the max of its subtree. 1138 01:01:39,570 --> 01:01:40,070 Right. 1139 01:01:40,070 --> 01:01:42,917 One more thing I was forgetting here-- 1140 01:01:42,917 --> 01:01:44,750 when I say, this is a lot of pointers to store. 1141 01:01:44,750 --> 01:01:47,330 You can't store them all in one node. 1142 01:01:47,330 --> 01:01:50,220 And in the van Emde Boas paper, it's stored in an array.
1143 01:01:50,220 --> 01:01:51,470 But it doesn't really need to be an array. 1144 01:01:51,470 --> 01:01:53,120 It could just as well be a linked list. 1145 01:01:53,120 --> 01:01:56,600 And that's how you get a pointer machine. 1146 01:01:56,600 --> 01:01:59,486 So this could be a linked list. 1147 01:01:59,486 --> 01:02:01,610 And then this whole thing works in a pointer machine, 1148 01:02:01,610 --> 01:02:03,230 which is kind of neat. 1149 01:02:03,230 --> 01:02:07,479 And it's a little weird, because if you used a comparison 1150 01:02:07,479 --> 01:02:09,770 pointer machine, where all you can do is compare items, 1151 01:02:09,770 --> 01:02:12,110 there's a lower bound of log n, because you only 1152 01:02:12,110 --> 01:02:14,780 have branching factor constant. 1153 01:02:14,780 --> 01:02:18,620 But here, the formulation of the problem is, when I say, 1154 01:02:18,620 --> 01:02:20,480 give me the successor of this, I actually 1155 01:02:20,480 --> 01:02:23,967 give you a pointer to this item. 1156 01:02:23,967 --> 01:02:26,300 And then from there, you can do all this jumping around, 1157 01:02:26,300 --> 01:02:28,520 and find your predecessor or successor. 1158 01:02:28,520 --> 01:02:30,830 So in this world, you need at least u space, 1159 01:02:30,830 --> 01:02:32,555 even to be able to specify the input. 1160 01:02:35,290 --> 01:02:37,750 So that's kind of a limitation of the pointer machine. 1161 01:02:37,750 --> 01:02:39,958 And you can actually show in the pointer machine that log 1162 01:02:39,958 --> 01:02:46,070 log u is optimal for any predecessor data structure 1163 01:02:46,070 --> 01:02:47,510 in the pointer machine. 1164 01:02:47,510 --> 01:02:53,567 So there's a matching lower bound of log log u in this model. 1165 01:02:53,567 --> 01:02:54,650 And you need u space. 1166 01:02:54,650 --> 01:02:56,020 So it's not very exciting. 1167 01:02:56,020 --> 01:02:58,120 What we like is the word RAM.
1168 01:02:58,120 --> 01:03:00,070 There, we can reduce space to n. 1169 01:03:00,070 --> 01:03:03,400 And that's what I want to do next, I believe-- 1170 01:03:03,400 --> 01:03:04,780 almost next. 1171 01:03:04,780 --> 01:03:08,380 One more mention-- actual stratified trees-- 1172 01:03:08,380 --> 01:03:11,110 here, we got query fast, update slow. 1173 01:03:11,110 --> 01:03:13,990 Stratified trees actually do update fast, as well. 1174 01:03:13,990 --> 01:03:17,710 Essentially, it's this idea, plus you don't recursively 1175 01:03:17,710 --> 01:03:20,290 store the min, which, of course, makes 1176 01:03:20,290 --> 01:03:21,900 all these bits no longer accurate, 1177 01:03:21,900 --> 01:03:23,740 so it gets much messier. 1178 01:03:23,740 --> 01:03:26,800 But in the end, it's doing exactly the same thing 1179 01:03:26,800 --> 01:03:28,175 as this recursion. 1180 01:03:28,175 --> 01:03:30,100 In fact, you can draw the picture. 1181 01:03:30,100 --> 01:03:35,870 It is this part up here-- 1182 01:03:35,870 --> 01:03:37,330 the top half of the tree-- 1183 01:03:37,330 --> 01:03:38,130 this is summary. 1184 01:03:41,110 --> 01:03:46,408 And each of these bottom halves is a cluster. 1185 01:03:46,408 --> 01:03:51,820 And there are root u clusters down here. 1186 01:03:51,820 --> 01:03:53,500 So those are smaller structures. 1187 01:03:53,500 --> 01:03:57,610 And there's one root u sized Van Emde Boas structure, which 1188 01:03:57,610 --> 01:03:58,900 is a summary structure. 1189 01:03:58,900 --> 01:04:02,617 These bits here are the bit vector representation 1190 01:04:02,617 --> 01:04:03,700 of the summary structure. 1191 01:04:03,700 --> 01:04:05,283 It's, is there anyone in this cluster? 1192 01:04:05,283 --> 01:04:08,597 Is there anyone in this cluster, and so on? 1193 01:04:08,597 --> 01:04:10,930 This, of course, also looks a lot like the Van Emde Boas 1194 01:04:10,930 --> 01:04:11,650 layout.
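In arithmetic, the correspondence in this picture is just splitting a key with divmod. Taking u = 16 as in the earlier example:

```python
import math

u = 16
r = math.isqrt(u)        # sqrt(u) clusters, each covering sqrt(u) keys

def split(x):
    # (which cluster x falls in, where it sits inside that cluster)
    return divmod(x, r)
```

So key 9 is entry 1 of cluster 2, and key 15 is the last entry of the last cluster.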
1195 01:04:11,650 --> 01:04:14,137 Take a binary tree, cut it in half, do the top, 1196 01:04:14,137 --> 01:04:15,220 recursively do the bottom. 1197 01:04:15,220 --> 01:04:17,428 So that's why it was called the Van Emde Boas layout, 1198 01:04:17,428 --> 01:04:18,791 is this picture. 1199 01:04:18,791 --> 01:04:20,290 But if you take this tree structure, 1200 01:04:20,290 --> 01:04:22,039 and then you don't recursively store mins, 1201 01:04:22,039 --> 01:04:24,880 and then the bits are not quite accurate, it's messy. 1202 01:04:24,880 --> 01:04:26,769 And so stratified trees-- you should 1203 01:04:26,769 --> 01:04:28,060 try to read the original paper. 1204 01:04:28,060 --> 01:04:28,870 It's a mess. 1205 01:04:28,870 --> 01:04:31,640 Whereas this code-- pretty clean. 1206 01:04:31,640 --> 01:04:33,280 And so once you say, oh, I'm just 1207 01:04:33,280 --> 01:04:35,440 going to store all these clusters as an array 1208 01:04:35,440 --> 01:04:37,540 and not worry about keeping track of the tree, 1209 01:04:37,540 --> 01:04:39,620 it actually gets a lot easier. 1210 01:04:39,620 --> 01:04:43,090 And that was the Bender/Farach-Colton cleaning 1211 01:04:43,090 --> 01:04:44,920 up, which never appeared in print. 1212 01:04:44,920 --> 01:04:48,380 But it's appeared in the lecture notes all over the place-- 1213 01:04:48,380 --> 01:04:50,700 and now CLRS. 1214 01:04:50,700 --> 01:04:51,940 Cool. 1215 01:04:51,940 --> 01:04:53,656 I want to tell you about two more things. 1216 01:04:53,656 --> 01:04:55,030 It's actually going to get easier 1217 01:04:55,030 --> 01:04:57,989 the more time we spend with this data structure. 1218 01:05:21,970 --> 01:05:24,790 All right. 1219 01:05:24,790 --> 01:05:27,970 Let me draw a box. 1220 01:05:27,970 --> 01:05:31,930 At this point, we've seen a clean way to get Van Emde Boas. 1221 01:05:31,930 --> 01:05:34,570 And we've seen a cute way in a tree 1222 01:05:34,570 --> 01:05:37,240 to get search fast, but update slow. 
1223 01:05:37,240 --> 01:05:39,280 I want to talk a little more about that. 1224 01:05:39,280 --> 01:05:41,200 Let's suppose I have this data structure. 1225 01:05:41,200 --> 01:05:46,120 It achieves log w query, which is fast, 1226 01:05:46,120 --> 01:05:50,590 but it only achieves w update, which is slow. 1227 01:05:50,590 --> 01:05:52,270 How do you update the structure? 1228 01:05:52,270 --> 01:05:54,010 You update one bit at the bottom, 1229 01:05:54,010 --> 01:05:56,980 and then you've got to update all the bits up the path. 1230 01:05:56,980 --> 01:05:59,590 So you spend w time to do an update over here. 1231 01:06:02,410 --> 01:06:05,680 If updates are slow, I just want to do fewer updates. 1232 01:06:05,680 --> 01:06:07,570 We have a trick for doing this, which 1233 01:06:07,570 --> 01:06:10,630 is, you put little things down here of size theta w. 1234 01:06:16,810 --> 01:06:19,810 And then only one item from here gets promoted 1235 01:06:19,810 --> 01:06:21,490 into the top structure. 1236 01:06:21,490 --> 01:06:26,860 We only end up having n over w items up here, and about 1 1237 01:06:26,860 --> 01:06:29,050 over w as many updates. 1238 01:06:29,050 --> 01:06:31,930 If I want to do an insertion, I do a search here 1239 01:06:31,930 --> 01:06:33,880 to figure out which of these little-- 1240 01:06:33,880 --> 01:06:39,370 I'll call these "chunks--" which little chunk it belongs in. 1241 01:06:39,370 --> 01:06:41,800 I do an insert there. 1242 01:06:41,800 --> 01:06:43,360 If that structure gets too big-- it's 1243 01:06:43,360 --> 01:06:45,730 bigger than, say, 2 times w, or 4 times w, 1244 01:06:45,730 --> 01:06:48,516 whatever-- then I'll split it. 1245 01:06:48,516 --> 01:06:50,765 And if I delete from something, and it gets too small, 1246 01:06:50,765 --> 01:06:53,330 I'll merge with the neighbor, or maybe re-split-- 1247 01:06:53,330 --> 01:06:55,580 just like B-trees. 1248 01:06:55,580 --> 01:06:58,820 We've done this many times, by now.
1249 01:06:58,820 --> 01:07:01,490 But only when it splits, or I do a merge, 1250 01:07:01,490 --> 01:07:03,050 do I have to do an update up here. 1251 01:07:03,050 --> 01:07:05,540 Only when the set of chunks changes do 1252 01:07:05,540 --> 01:07:07,790 I need to do a single insertion or deletion 1253 01:07:07,790 --> 01:07:10,100 up here-- or a constant number. 1254 01:07:10,100 --> 01:07:15,860 So this update time goes down by a factor of w. 1255 01:07:15,860 --> 01:07:18,567 But I have to pay whatever the update cost is here. 1256 01:07:18,567 --> 01:07:20,150 So what do I do with this data structure? 1257 01:07:20,150 --> 01:07:21,530 I don't want to use Van Emde Boas, because this 1258 01:07:21,530 --> 01:07:22,580 could be a very big universe. 1259 01:07:22,580 --> 01:07:23,205 Who knows what? 1260 01:07:23,205 --> 01:07:26,360 I use a binary search tree. 1261 01:07:26,360 --> 01:07:28,250 Here, I can afford a binary search tree, 1262 01:07:28,250 --> 01:07:30,610 because then it's only log w. 1263 01:07:30,610 --> 01:07:32,914 log w is the bound we're trying to get. 1264 01:07:32,914 --> 01:07:34,580 So you can do these binary search trees. 1265 01:07:34,580 --> 01:07:35,490 It's trivial. 1266 01:07:35,490 --> 01:07:37,970 Just do insert, delete, search. 1267 01:07:37,970 --> 01:07:39,994 Everything will be log w. 1268 01:07:39,994 --> 01:07:42,410 So if I want to do a search, I search through here, which, 1269 01:07:42,410 --> 01:07:44,476 conveniently, is already fast-- log w-- 1270 01:07:44,476 --> 01:07:46,850 and then I do a search through here, which is also log w. 1271 01:07:46,850 --> 01:07:47,445 So it's nice and balanced. 1272 01:07:47,445 --> 01:07:48,530 Everything's log w. 1273 01:07:51,632 --> 01:07:53,840 If I want to do an insertion, I do an insertion here. 1274 01:07:53,840 --> 01:07:56,090 If it splits, I do an insertion here.
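Here is an illustrative sketch of this indirection scheme, with plain sorted lists standing in for both the top structure and the little BSTs, and with deletions and merges omitted. The linear scan over chunk minima is purely for readability -- it stands in for the fast top-structure search:

```python
import bisect

W = 4               # stand-in for the word size; chunks hold Theta(W) keys
chunks = []         # sorted list of sorted lists; chunks[j][0] is the key
                    # "promoted" into the top structure

def insert(x):
    if not chunks:
        chunks.append([x])
        return
    reps = [c[0] for c in chunks]          # the promoted representatives
    j = max(0, bisect.bisect_right(reps, x) - 1)
    bisect.insort(chunks[j], x)            # insert into the little chunk
    if len(chunks[j]) > 2 * W:             # chunk too big: split it in two.
        half = len(chunks[j]) // 2         # Only now does the top structure
        chunks[j:j + 1] = [chunks[j][:half], chunks[j][half:]]  # change.

def successor(x):
    if not chunks:
        return None
    reps = [c[0] for c in chunks]
    j = max(0, bisect.bisect_right(reps, x) - 1)
    k = bisect.bisect_right(chunks[j], x)  # search inside the chunk: log W
    if k < len(chunks[j]):
        return chunks[j][k]
    return chunks[j + 1][0] if j + 1 < len(chunks) else None
```

A split only happens after Theta(W) insertions have landed in one chunk, which is exactly what pays for the occasional expensive top-structure update.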
1275 01:07:56,090 --> 01:08:00,470 But that order w update cost, I charge to the order 1276 01:08:00,470 --> 01:08:02,660 w updates I would have had to do in this chunk 1277 01:08:02,660 --> 01:08:04,700 before it got split. 1278 01:08:04,700 --> 01:08:08,060 So this our good friend indirection, a technique we 1279 01:08:08,060 --> 01:08:09,800 will use over and over in this class. 1280 01:08:09,800 --> 01:08:14,450 It's very helpful when you're almost at the right bound. 1281 01:08:14,450 --> 01:08:17,479 And that's actually in the follow-up van Emde Boas paper. 1282 01:08:17,479 --> 01:08:20,520 A similar indirection trick is in there. 1283 01:08:20,520 --> 01:08:31,370 So we can charge the order w update in top to-- 1284 01:08:31,370 --> 01:08:33,020 that's the cost of the update-- 1285 01:08:33,020 --> 01:08:38,180 to the order w updates that have actually 1286 01:08:38,180 --> 01:08:42,740 been performed in the bottom. 1287 01:08:42,740 --> 01:08:44,600 Because when somebody gets split, 1288 01:08:44,600 --> 01:08:47,330 it's nice in its average state-- or when it gets merged, 1289 01:08:47,330 --> 01:08:48,640 it's going to be close to its average state. 1290 01:08:48,640 --> 01:08:50,598 You have to do a lot of insertions or deletions 1291 01:08:50,598 --> 01:08:54,630 to get it out of whack, and cause a split or a merge. 1292 01:08:54,630 --> 01:08:56,000 So-- boom. 1293 01:08:56,000 --> 01:09:01,700 This means the updates become log w. 1294 01:09:01,700 --> 01:09:04,130 Searches are also log w. 1295 01:09:04,130 --> 01:09:07,550 So we've got Van Emde Boas again, in a new way. 1296 01:09:07,550 --> 01:09:11,127 Bonus points-- if you take this structure-- 1297 01:09:14,330 --> 01:09:16,950 even this structure, if we did it in the array form-- 1298 01:09:16,950 --> 01:09:17,450 great. 1299 01:09:17,450 --> 01:09:18,890 It was order u space. 
1300 01:09:18,890 --> 01:09:20,750 If we did it with all these pointers, 1301 01:09:20,750 --> 01:09:22,708 and we wanted a pointer machine data structure, 1302 01:09:22,708 --> 01:09:25,589 we needed u log w space. 1303 01:09:25,589 --> 01:09:28,130 But with this indirection trick, you can also get rid of the log w 1304 01:09:28,130 --> 01:09:30,160 factor in space. 1305 01:09:30,160 --> 01:09:31,340 It's a little less obvious. 1306 01:09:31,340 --> 01:09:33,020 But you take this-- 1307 01:09:33,020 --> 01:09:34,859 here, we reduced n by a factor of w. 1308 01:09:34,859 --> 01:09:37,420 You can also reduce u by a factor of w. 1309 01:09:37,420 --> 01:09:38,420 I'll just wave my hands. 1310 01:09:38,420 --> 01:09:39,500 That's possible. 1311 01:09:39,500 --> 01:09:41,510 So u gets a little bit smaller. 1312 01:09:41,510 --> 01:09:44,120 And so when we pay u log w space, 1313 01:09:44,120 --> 01:09:46,040 if u got smaller by a factor of w, 1314 01:09:46,040 --> 01:09:48,689 this basically disappears. 1315 01:09:48,689 --> 01:09:50,580 So you get, at most, order u space. 1316 01:09:53,210 --> 01:09:54,390 But order u is not order n. 1317 01:09:54,390 --> 01:09:56,550 I want order n space, darn it. 1318 01:09:56,550 --> 01:10:01,400 So let's reduce space. 1319 01:10:01,400 --> 01:10:04,015 As I said, this is going to get easier and easier. 1320 01:10:04,015 --> 01:10:06,390 By the end, we will have very little of a data structure. 1321 01:10:06,390 --> 01:10:10,150 But still, we'll have log log u. 1322 01:10:10,150 --> 01:10:13,860 And you thought this was easy, but wait, there's more. 1323 01:10:16,680 --> 01:10:19,410 Right now, we have two ways to get log log u-- 1324 01:10:19,410 --> 01:10:22,950 query and order u space. 1325 01:10:22,950 --> 01:10:25,170 There's the one I'm erasing, and there's 1326 01:10:25,170 --> 01:10:28,410 this-- take this tree structure with the very simple pointers. 1327 01:10:28,410 --> 01:10:29,867 Add indirection.
1328 01:10:29,867 --> 01:10:31,950 So admittedly, it's more complicated to implement. 1329 01:10:31,950 --> 01:10:33,526 But conceptually, it's super simple. 1330 01:10:33,526 --> 01:10:35,400 It's like, do this obvious tree binary search 1331 01:10:35,400 --> 01:10:36,930 on the level thing. 1332 01:10:36,930 --> 01:10:40,080 And then add indirection, and it fixes all your bounds, 1333 01:10:40,080 --> 01:10:41,670 magically. 1334 01:10:41,670 --> 01:10:43,600 So conceptually, very simple-- 1335 01:10:43,600 --> 01:10:47,850 practically, you definitely want to do this-- much simpler. 1336 01:10:47,850 --> 01:10:51,225 Now, what about saving space? 1337 01:10:54,420 --> 01:10:56,820 Very simple idea-- which, I think, 1338 01:10:56,820 --> 01:11:01,560 again, comes from Michael Bender and Martin Farach-Colton. 1339 01:11:01,560 --> 01:11:05,490 Don't store empty structures. 1340 01:11:05,490 --> 01:11:09,620 So in this picture, we had an array of all the clusters. 1341 01:11:09,620 --> 01:11:13,200 But a cluster could be entirely empty, like this one-- 1342 01:11:13,200 --> 01:11:15,470 this entirely empty cluster. 1343 01:11:15,470 --> 01:11:16,440 Don't store it. 1344 01:11:16,440 --> 01:11:18,110 It's a waste. 1345 01:11:18,110 --> 01:11:20,710 If you store them all, you're going to spend order u space. 1346 01:11:20,710 --> 01:11:22,130 If you don't store them all-- 1347 01:11:22,130 --> 01:11:23,780 just don't store the empty ones-- 1348 01:11:23,780 --> 01:11:25,210 I claim you get order n space. 1349 01:11:25,210 --> 01:11:27,910 Done. 1350 01:11:27,910 --> 01:11:30,350 So I'm going back to the structure I erased. 1351 01:11:30,350 --> 01:11:33,110 Ignore the tree perspective for a while. 1352 01:11:33,110 --> 01:11:39,440 Don't store empty clusters. 1353 01:11:39,440 --> 01:11:41,960 OK, now, this sounds easy. 1354 01:11:41,960 --> 01:11:44,090 But in reality, it's a little bit more annoying.
1355 01:11:44,090 --> 01:11:47,640 Because we wanted to have an array of clusters. 1356 01:11:47,640 --> 01:11:51,260 So we could quickly find the cluster. 1357 01:11:51,260 --> 01:11:52,810 If you store an array, you're going 1358 01:11:52,810 --> 01:11:54,650 to spend at least square root of u space. 1359 01:11:54,650 --> 01:11:56,720 Because at the very beginning, you say, 1360 01:11:56,720 --> 01:11:58,040 here are my root u clusters. 1361 01:11:58,040 --> 01:11:59,748 Now, some of them might be null pointers. 1362 01:11:59,748 --> 01:12:03,980 But I can't afford to store that entire array of clusters. 1363 01:12:03,980 --> 01:12:05,410 So don't use an array. 1364 01:12:05,410 --> 01:12:06,980 Use a perfect hash table. 1365 01:12:10,730 --> 01:12:13,990 So v dot cluster, instead of being an array, 1366 01:12:13,990 --> 01:12:18,650 is now, let's say, a dynamic perfect hashing. 1367 01:12:18,650 --> 01:12:21,250 And I'm going to use the version which I did not present. 1368 01:12:21,250 --> 01:12:23,750 The version I presented, which used universal hashing, 1369 01:12:23,750 --> 01:12:26,270 was order 1 expected. 1370 01:12:26,270 --> 01:12:30,410 But I said that it can be constant with high probability 1371 01:12:30,410 --> 01:12:31,040 per operation. 1372 01:12:31,040 --> 01:12:34,070 It's a little bit stronger. 1373 01:12:34,070 --> 01:12:35,540 So now, everything's fine. 1374 01:12:35,540 --> 01:12:37,880 If I do an index v dot cluster c, 1375 01:12:37,880 --> 01:12:41,120 that's still constant time, with high probability now. 1376 01:12:41,120 --> 01:12:45,940 And I claim this structure is now order n space. 1377 01:12:45,940 --> 01:12:47,540 Why is it order n space? 1378 01:12:47,540 --> 01:12:56,510 By simple amortization-- charge each table entry in that 1379 01:12:56,510 --> 01:13:01,250 hash table to the min of the cluster. 1380 01:13:06,924 --> 01:13:08,340 We're only storing non-empty ones. 
1381 01:13:08,340 --> 01:13:11,840 So if one of these guys exists in the hash table-- 1382 01:13:11,840 --> 01:13:13,670 we had to store a pointer to it-- then 1383 01:13:13,670 --> 01:13:16,100 that means the summary structure is non-zero. 1384 01:13:16,100 --> 01:13:17,990 It means this guy is not empty. 1385 01:13:17,990 --> 01:13:19,810 So it has an item in its min. 1386 01:13:19,810 --> 01:13:22,820 Charge the space up here to store the pointer to that min 1387 01:13:22,820 --> 01:13:23,810 guy. 1388 01:13:23,810 --> 01:13:27,200 Then each item-- each min item-- 1389 01:13:27,200 --> 01:13:28,880 only gets charged once. 1390 01:13:28,880 --> 01:13:32,380 Because it only has one parent that has a pointer to it. 1391 01:13:32,380 --> 01:13:33,980 So you only charge once. 1392 01:13:33,980 --> 01:13:38,290 And therefore-- charge each table entry-- 1393 01:13:43,010 --> 01:13:45,050 only charge each element once. 1394 01:13:49,800 --> 01:13:50,910 And that's all your space. 1395 01:13:50,910 --> 01:13:53,310 So it's order n space. 1396 01:13:53,310 --> 01:13:56,250 Done. 1397 01:13:56,250 --> 01:13:57,050 Kind of crazy. 1398 01:13:57,050 --> 01:14:00,180 I guess, if you want, there's also the pointer to the summary 1399 01:14:00,180 --> 01:14:00,680 structure. 1400 01:14:00,680 --> 01:14:02,290 You could charge that to your own min. 1401 01:14:02,290 --> 01:14:03,581 And then you're charging twice. 1402 01:14:03,581 --> 01:14:07,290 But it's constant per item. 1403 01:14:07,290 --> 01:14:08,517 So this is kind of funny. 1404 01:14:08,517 --> 01:14:10,850 Again, it doesn't appear in print anywhere, except maybe 1405 01:14:10,850 --> 01:14:12,920 as an exercise in CLRS now. 1406 01:14:12,920 --> 01:14:16,760 But you get linear order n space, 1407 01:14:16,760 --> 01:14:18,890 just by adding hashing in the obvious way. 
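[The trick just described-- replace the cluster array with a hash table and only ever create non-empty clusters-- can be sketched in code. This is a minimal illustration, not the lecture's full van Emde Boas structure (no summary structure, no predecessor query): Python's built-in dict stands in for the dynamic perfect hash table, and the Node class, ubits field, and member method are illustrative names, not from the lecture.]

```python
# Sketch: a recursive cluster node whose sub-clusters live in a dict
# (standing in for the dynamic perfect hash table), so empty clusters
# are never allocated -- this is what gives order n total space.

class Node:
    def __init__(self, ubits):
        self.ubits = ubits      # keys are ubits-bit integers
        self.min = None         # min lives here; it is NOT stored recursively
        self.max = None
        self.clusters = {}      # hash table: only non-empty clusters appear

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x
            return
        if x < self.min:
            self.min, x = x, self.min   # new min displaces old min downward
        if x > self.max:
            self.max = x
        if self.ubits > 1:
            half = self.ubits // 2
            hi, lo = x >> half, x & ((1 << half) - 1)
            if hi not in self.clusters:         # create lazily, never up front
                self.clusters[hi] = Node(half)
            self.clusters[hi].insert(lo)

    def member(self, x):
        if x == self.min or x == self.max:
            return True
        if self.ubits <= 1:
            return False
        half = self.ubits // 2
        hi, lo = x >> half, x & ((1 << half) - 1)
        return hi in self.clusters and self.clusters[hi].member(lo)
```

[The amortization from the board argument is visible here: each dict entry is paid for by the min of the cluster it points to, so total space is proportional to the number of inserted elements.]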
1408 01:14:18,890 --> 01:14:23,180 Now, for whatever reason, Willard didn't see this, 1409 01:14:23,180 --> 01:14:25,160 or wanted to do his own thing, and so he 1410 01:14:25,160 --> 01:14:30,409 found another way to do order n space log log u query 1411 01:14:30,409 --> 01:14:30,950 with hashing. 1412 01:14:34,000 --> 01:14:35,570 Well, I guess, also, you had to think 1413 01:14:35,570 --> 01:14:36,780 of it in this simple form. 1414 01:14:36,780 --> 01:14:38,510 It's harder to do this in the tree. 1415 01:14:38,510 --> 01:14:39,800 It can be done, I think. 1416 01:14:39,800 --> 01:14:42,530 But this is a simpler view than the tree, I think. 1417 01:14:42,530 --> 01:14:45,320 And then boom-- order n space. 1418 01:14:45,320 --> 01:14:48,140 But it turns out there's another way to do it. 1419 01:14:48,140 --> 01:14:49,820 This is a completely different way 1420 01:14:49,820 --> 01:14:52,940 to do Van Emde Boas-- actually, not that completely different. 1421 01:14:52,940 --> 01:14:58,613 It's another way to do this with hashing. 1422 01:15:02,690 --> 01:15:06,140 And we're going to start with what's called x-fast trees, 1423 01:15:06,140 --> 01:15:08,510 and then we will modify it to get y-fast trees. 1424 01:15:08,510 --> 01:15:11,882 That's Willard's terminology. 1425 01:15:11,882 --> 01:15:15,560 OK, so x-fast trees is, store this tree, 1426 01:15:15,560 --> 01:15:18,060 but don't store the zeros. 1427 01:15:18,060 --> 01:15:21,590 So don't store zeros. 1428 01:15:21,590 --> 01:15:27,530 Only store the ones in the-- we call this the simple tree view. 1429 01:15:27,530 --> 01:15:29,030 This is why I, in particular, wanted 1430 01:15:29,030 --> 01:15:30,655 to tell you about the simple tree view, 1431 01:15:30,655 --> 01:15:33,320 because it is really what x-fast trees do. 1432 01:15:33,320 --> 01:15:35,330 So what do I mean by only store the ones? 1433 01:15:35,330 --> 01:15:41,280 Well, each of these ones has sort of a name. 
1434 01:15:41,280 --> 01:15:42,590 What is the name of this item? 1435 01:15:42,590 --> 01:15:43,580 Its name is one-- 1436 01:15:43,580 --> 01:15:46,227 or in other words, 0, 0, 0, 1. 1437 01:15:46,227 --> 01:15:47,810 Each of these nodes, you can think of, 1438 01:15:47,810 --> 01:15:49,580 what is the path to get here? 1439 01:15:49,580 --> 01:15:52,730 Like, the path to get to this one is 1, 0, 0. 1440 01:15:52,730 --> 01:15:53,450 1 means right. 1441 01:15:53,450 --> 01:15:54,800 0 means left. 1442 01:15:54,800 --> 01:15:56,960 Those names give you the binary indicator 1443 01:15:56,960 --> 01:16:00,860 of where that node is in the tree, in some sense. 1444 01:16:00,860 --> 01:16:13,316 So store the ones as binary strings in a hash table-- 1445 01:16:17,260 --> 01:16:19,270 again, a dynamic perfect hash table. 1446 01:16:19,270 --> 01:16:22,150 Let's say I can get constant with high probability. 1447 01:16:22,150 --> 01:16:23,860 OK. 1448 01:16:23,860 --> 01:16:26,280 And if you're a little concerned-- 1449 01:16:26,280 --> 01:16:29,290 so what this means-- the ones are exactly 1450 01:16:29,290 --> 01:16:32,050 the prefixes of the paths to each of the items. 1451 01:16:32,050 --> 01:16:33,280 This was item one. 1452 01:16:33,280 --> 01:16:37,420 And so I want to store this one, which is empty string, 1453 01:16:37,420 --> 01:16:40,120 this one, which is 0, this one, which is 00, this one, 1454 01:16:40,120 --> 01:16:44,060 which is 000, this one, which is 0001. 1455 01:16:44,060 --> 01:16:49,600 So I take 0001, which is the item I want to store. 1456 01:16:49,600 --> 01:16:51,970 And there's all these prefixes, which 1457 01:16:51,970 --> 01:16:54,310 are the items I want to store. 1458 01:16:54,310 --> 01:16:56,290 And for this really to make sense, 1459 01:16:56,290 --> 01:16:58,180 you also need the length of the string. 1460 01:16:58,180 --> 01:17:01,810 Strings of different lengths should be in different worlds. 
1461 01:17:01,810 --> 01:17:03,790 So the way, actually, x-fast trees originally 1462 01:17:03,790 --> 01:17:05,500 did it in the paper is, have a different hash 1463 01:17:05,500 --> 01:17:07,124 table for strings of different lengths. 1464 01:17:07,124 --> 01:17:09,310 So that's probably an easier way to think about it. 1465 01:17:09,310 --> 01:17:11,770 You store all the items themselves in a hash table. 1466 01:17:11,770 --> 01:17:13,330 You store all the prefixes of all 1467 01:17:13,330 --> 01:17:16,240 but the last bit in a separate hash table, 1468 01:17:16,240 --> 01:17:20,120 all but the last two bits in a separate hash table, and so on. 1469 01:17:20,120 --> 01:17:22,560 Now, what does this let you do? 1470 01:17:22,560 --> 01:17:24,220 It lets you do this-- 1471 01:17:24,220 --> 01:17:28,162 binary search for the 0 to 1 transition. 1472 01:17:30,760 --> 01:17:32,440 What we did here was-- 1473 01:17:32,440 --> 01:17:35,470 I look at the bit, is it 0 or 1? 1474 01:17:35,470 --> 01:17:38,200 Instead of doing that, you do a query into the hash table, 1475 01:17:38,200 --> 01:17:39,760 and say, is it in the hash table? 1476 01:17:39,760 --> 01:17:42,900 It's in the hash table if and only if it is one. 1477 01:17:42,900 --> 01:17:44,980 So looking at a bit in this conceptual tree 1478 01:17:44,980 --> 01:17:47,350 is the same thing as checking for containment 1479 01:17:47,350 --> 01:17:48,760 in this hash table. 1480 01:17:48,760 --> 01:17:52,630 But now, we don't have to store the zeros, which is cool. 1481 01:17:55,930 --> 01:18:04,600 We can now do search, predecessor or successor, fast, 1482 01:18:04,600 --> 01:18:11,650 in log w time, via this old thing. 1483 01:18:11,650 --> 01:18:15,080 Again, you have to have min and max pointers, as well. 1484 01:18:15,080 --> 01:18:17,039 So in this hash table, you store the min 1485 01:18:17,039 --> 01:18:18,205 and the max of your subtree. 
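[The binary search for the 0-to-1 transition can be sketched as follows. As in Willard's paper, there is one hash table per prefix length (Python sets stand in for the dynamic perfect hash tables); the word size W, function names, and the choice to return just the length of the deepest 1-node are illustrative assumptions.]

```python
# Sketch of the x-fast tree query idea: store every prefix of every
# key's bit string, one set per prefix length, then binary-search over
# prefix LENGTHS -- log w hash-table probes instead of w bit lookups.

W = 8  # word size in bits (an assumption for the example)

def build_prefix_tables(keys):
    # tables[length] holds the length-bit prefixes of all stored keys,
    # i.e. exactly the 1-nodes at that depth of the conceptual tree.
    tables = [set() for _ in range(W + 1)]
    for x in keys:
        for length in range(W + 1):
            tables[length].add(x >> (W - length))
    return tables

def longest_prefix_len(tables, x):
    # Invariant: the length-lo prefix of x is present (tables[0] always
    # holds the empty prefix); lengths above hi are known absent.
    lo, hi = 0, W
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if (x >> (W - mid)) in tables[mid]:
            lo = mid          # prefix is a 1-node: transition is deeper
        else:
            hi = mid - 1      # prefix is a 0-node: transition is shallower
    return lo
```

[In the real structure, the deepest 1-node found this way carries the min/max pointers used to finish the predecessor or successor query. Note also that each key contributes W + 1 prefixes, which is exactly the n w space overhead mentioned next.]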
1486 01:18:21,030 --> 01:18:22,960 Or actually, from a 1, you actually 1487 01:18:22,960 --> 01:18:25,270 need the max of the left subtree, 1488 01:18:25,270 --> 01:18:27,170 and you need the min of the right subtree. 1489 01:18:27,170 --> 01:18:30,190 But it's a constant amount of information per thing. 1490 01:18:30,190 --> 01:18:36,190 This is not perfect, however, in that it uses nw space. 1491 01:18:38,920 --> 01:18:40,840 And also, updates are slow. 1492 01:18:40,840 --> 01:18:42,775 It's order w updates. 1493 01:19:01,540 --> 01:19:02,650 But we're almost there. 1494 01:19:02,650 --> 01:19:06,520 Because we have fast queries, slow updates, not 1495 01:19:06,520 --> 01:19:07,990 optimal space. 1496 01:19:07,990 --> 01:19:09,340 Take this. 1497 01:19:09,340 --> 01:19:11,955 Add indirection-- done. 1498 01:19:11,955 --> 01:19:12,955 And that's y-fast trees. 1499 01:19:17,950 --> 01:19:20,660 y-fast trees-- you take x-fast trees, 1500 01:19:20,660 --> 01:19:25,470 you add this indirection right here, 1501 01:19:25,470 --> 01:19:33,266 and you get log w per operation, order n space. 1502 01:19:33,266 --> 01:19:34,950 Of course, this is with high probability 1503 01:19:34,950 --> 01:19:37,740 because we're using hashing. 1504 01:19:37,740 --> 01:19:40,290 Because we have a factor w bad here, 1505 01:19:40,290 --> 01:19:41,400 we have factor w bad here. 1506 01:19:41,400 --> 01:19:42,390 You divide by w. 1507 01:19:42,390 --> 01:19:43,770 You're done. 1508 01:19:43,770 --> 01:19:48,360 Up here, you have n over w space. n over w times w is n. 1509 01:19:48,360 --> 01:19:51,360 Queries, just like before, remain log w. 1510 01:19:51,360 --> 01:19:53,040 But now-- boom-- 1511 01:19:53,040 --> 01:19:56,160 updates, we pay log w because of the binary search 1512 01:19:56,160 --> 01:19:59,035 trees at the bottom, but pretty cool. 1513 01:19:59,035 --> 01:20:00,840 Isn't that neat? 1514 01:20:00,840 --> 01:20:02,190 I've never seen this before. 
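[A static sketch of the y-fast indirection just described: split the sorted keys into groups of about w, keep each group in its own small balanced structure, and put only one representative per group into the expensive x-fast layer. Here a sorted list searched with bisect stands in for both the x-fast layer and the bottom balanced search trees; W, build, and predecessor are illustrative names, and the real structure additionally splits and merges groups under updates to keep them near size w.]

```python
# Sketch of y-fast indirection: only n/w representatives enter the
# x-fast layer, so its n*w-per-key space cost collapses to order n.

from bisect import bisect_right

W = 8  # word size in bits (an assumption for the example)

def build(keys):
    keys = sorted(keys)
    groups = [keys[i:i + W] for i in range(0, len(keys), W)]
    reps = [g[0] for g in groups]   # one representative per group
    return reps, groups

def predecessor(reps, groups, x):
    # Step 1: representative search -- log w time in the real x-fast layer.
    i = bisect_right(reps, x) - 1
    if i < 0:
        return None                 # no key <= x anywhere
    # Step 2: search inside one group of size about w -- another log w.
    g = groups[i]
    j = bisect_right(g, x) - 1
    return g[j]                     # g[0] = reps[i] <= x, so j >= 0
```

[The two log w searches are why the overall bound is log w per operation, and the n/w representatives times w cost per representative is the order n space calculation from the board.]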
1515 01:20:02,190 --> 01:20:04,690 OK, I've seen x-fast trees and y-fast trees. 1516 01:20:04,690 --> 01:20:06,690 But it's really just the same-- 1517 01:20:06,690 --> 01:20:09,990 we're taking Van Emde Boas, looking at it in the tree view. 1518 01:20:09,990 --> 01:20:11,699 You can see where Willard got this stuff. 1519 01:20:11,699 --> 01:20:14,073 It's like, oh, man, I really want to store all these bits, 1520 01:20:14,073 --> 01:20:15,300 but hey, it's way too big. 1521 01:20:15,300 --> 01:20:17,280 Just don't store the zeros. 1522 01:20:17,280 --> 01:20:19,260 That means we should use a hash table. 1523 01:20:19,260 --> 01:20:22,770 Ah, hash table just gives you whether the bit is in or out. 1524 01:20:22,770 --> 01:20:24,480 Great. 1525 01:20:24,480 --> 01:20:25,749 Now use indirection. 1526 01:20:25,749 --> 01:20:27,540 And indirection was already floating around 1527 01:20:27,540 --> 01:20:29,220 as a concept at the time-- 1528 01:20:29,220 --> 01:20:30,560 slightly different parameters. 1529 01:20:30,560 --> 01:20:32,970 Van Emde Boas had his own indirection 1530 01:20:32,970 --> 01:20:38,187 to reduce the space from u times log w to u. 1531 01:20:38,187 --> 01:20:40,020 But Willard did it, and-- boom-- it got down 1532 01:20:40,020 --> 01:20:43,140 to n space in this way. 1533 01:20:43,140 --> 01:20:45,700 But as you saw, you can also do it directly to Van Emde Boas. 1534 01:20:45,700 --> 01:20:47,304 All these ideas can be interchanged. 1535 01:20:47,304 --> 01:20:48,720 You can combine any data structure 1536 01:20:48,720 --> 01:20:50,220 you want with any space saving trick 1537 01:20:50,220 --> 01:20:51,810 you want, with indirection, if you 1538 01:20:51,810 --> 01:20:55,290 need to, to speed things up and reduce space a little bit. 1539 01:20:55,290 --> 01:20:57,190 So there's many, many ways to do this. 1540 01:20:57,190 --> 01:20:59,230 But in the end, you get log w per operation, 1541 01:20:59,230 --> 01:21:00,750 and order n space. 
1542 01:21:00,750 --> 01:21:02,070 And that's sort of result one. 1543 01:21:02,070 --> 01:21:04,530 And it's probably the most useful predecessor data 1544 01:21:04,530 --> 01:21:05,487 structure, in general. 1545 01:21:05,487 --> 01:21:07,320 But next time, we'll see fusion trees, which 1546 01:21:07,320 --> 01:21:10,730 are good for when w is huge.