1
00:00:07,000 --> 00:00:10,000
Good morning.
Today we're going to talk about

2
00:00:10,000 --> 00:00:14,000
a balanced search structure,
so a data structure that

3
00:00:14,000 --> 00:00:18,000
maintains a dynamic set subject
to insertion,

4
00:00:18,000 --> 00:00:21,000
deletion, and search called
skip lists.

5
00:00:21,000 --> 00:00:25,000
So, I'll call this a dynamic
search structure because it's a

6
00:00:25,000 --> 00:00:28,000
data structure.
It supports search,

7
00:00:28,000 --> 00:00:33,000
and it's dynamic,
meaning insert and delete.

8
00:00:33,000 --> 00:00:39,000
So, what other dynamic search
structures do we know,

9
00:00:39,000 --> 00:00:45,000
just for sake of comparison,
and to wake everyone up?

10
00:00:45,000 --> 00:00:50,000
Shout them out,
efficient, I should say,

11
00:00:50,000 --> 00:00:55,000
also good, logarithmic time per
operation.

12
00:00:55,000 --> 00:01:01,000
So, this is a really easy
question to get us off the

13
00:01:01,000 --> 00:01:05,000
ground.
You've seen them all in the

14
00:01:05,000 --> 00:01:08,000
last week, so it shouldn't be so
hard.

15
00:01:08,000 --> 00:01:11,000
Treap, good.
On the problem set we saw

16
00:01:11,000 --> 00:01:13,000
treaps.
That's, in some sense,

17
00:01:13,000 --> 00:01:17,000
the simplest dynamic search
structure you can get from first

18
00:01:17,000 --> 00:01:21,000
principles because all we needed
was a bound on a randomly

19
00:01:21,000 --> 00:01:26,000
constructed binary search tree.
And then treaps did well.

20
00:01:26,000 --> 00:01:30,000
So, that was sort of the first
one you saw depending on when

21
00:01:30,000 --> 00:01:34,000
you did your problem set.
What else?

22
00:01:34,000 --> 00:01:36,000
Charles?
Red black trees,

23
00:01:36,000 --> 00:01:40,000
good answer.
So, that was exactly one week

24
00:01:40,000 --> 00:01:44,000
ago.
I hope you still remember it.

25
00:01:44,000 --> 00:01:48,000
They have guaranteed log n
performance.

26
00:01:48,000 --> 00:01:55,000
So, this was an expected bound.
This was a worst-case order log

27
00:01:55,000 --> 00:01:58,000
n per operation,
insert, delete,

28
00:01:58,000 --> 00:02:02,000
and search.
And, there was one more for

29
00:02:02,000 --> 00:02:07,000
those who went to recitation on
Friday: B trees,

30
00:02:07,000 --> 00:02:10,000
good.
And, by B trees,

31
00:02:10,000 --> 00:02:14,000
I also include two-three trees,
two-three-four trees,

32
00:02:14,000 --> 00:02:16,000
and all those guys.
So, if B is a constant,

33
00:02:16,000 --> 00:02:19,000
or if you search your B-tree
nodes a little bit cleverly,

34
00:02:19,000 --> 00:02:22,000
then these have guaranteed
order log n performance,

35
00:02:22,000 --> 00:02:24,000
so, worst case,
order log n.

36
00:02:24,000 --> 00:02:27,000
So, you should know this.
These are all balanced search

37
00:02:27,000 --> 00:02:29,000
structures.
They are dynamic.

38
00:02:29,000 --> 00:02:31,000
They support insertions and
deletions.

39
00:02:31,000 --> 00:02:34,000
They support searches,
finding a given key.

40
00:02:34,000 --> 00:02:37,000
And if you don't find the key,
you find its predecessor and

41
00:02:37,000 --> 00:02:42,000
successor pretty easily in all
of these structures.

42
00:02:42,000 --> 00:02:44,000
If you want to augment some
data structure,

43
00:02:44,000 --> 00:02:48,000
you should think about which
one of these is easiest to

44
00:02:48,000 --> 00:02:53,000
augment, as in Monday's lecture.
So, the question I want to pose

45
00:02:53,000 --> 00:02:56,000
to you is: suppose I gave you
all a laptop right now,

46
00:02:56,000 --> 00:02:59,000
which would be great.
Then I asked you,

47
00:02:59,000 --> 00:03:03,000
in order to keep this laptop
you have to implement one of

48
00:03:03,000 --> 00:03:06,000
these data structures,
let's say, within this class

49
00:03:06,000 --> 00:03:09,000
hour.
Do you think you could do it?

50
00:03:09,000 --> 00:03:12,000
How many people think you could
do it?

51
00:03:12,000 --> 00:03:13,000
A couple people,
a few people,

52
00:03:13,000 --> 00:03:15,000
OK, all front row people,
good.

53
00:03:15,000 --> 00:03:19,000
I could probably do it.
My preference would be B trees.

54
00:03:19,000 --> 00:03:21,000
They're sort of the simplest in
my mind.

55
00:03:21,000 --> 00:03:23,000
This is without using the
textbook.

56
00:03:23,000 --> 00:03:25,000
This would be a closed book
exam.

57
00:03:25,000 --> 00:03:30,000
I don't have enough laptops to
do it, unfortunately.

58
00:03:30,000 --> 00:03:32,000
So, B trees are pretty
reasonable.

59
00:03:32,000 --> 00:03:35,000
Deletion, you have to remember
stealing from a sibling and

60
00:03:35,000 --> 00:03:37,000
whatnot.
So, deletions are a bit tricky.

61
00:03:37,000 --> 00:03:40,000
Red black trees,
I can never remember it.

62
00:03:40,000 --> 00:03:43,000
I'd have to look it up,
or re-derive the three cases.

63
00:03:43,000 --> 00:03:46,000
Treaps are a bit fancy.
So, that would take a little

64
00:03:46,000 --> 00:03:49,000
while to remember exactly how
those work.

65
00:03:49,000 --> 00:03:51,000
You'd have to solve your
problem set again,

66
00:03:51,000 --> 00:03:55,000
if you don't have it memorized.
Skip lists, on the other hand,

67
00:03:55,000 --> 00:03:57,000
are a data structure you will
never forget,

68
00:03:57,000 --> 00:04:00,000
and something you can implement
within an hour,

69
00:04:00,000 --> 00:04:03,000
no problem.
I've made this claim a couple

70
00:04:03,000 --> 00:04:05,000
times before,
and I always felt bad because I

71
00:04:05,000 --> 00:04:10,000
had never actually done it.
So, this morning,

72
00:04:10,000 --> 00:04:13,000
I implemented skip lists,
and it took me ten minutes to

73
00:04:13,000 --> 00:04:17,000
implement a linked list,
and 30 minutes to implement

74
00:04:17,000 --> 00:04:19,000
skip lists.
And another 30 minutes

75
00:04:19,000 --> 00:04:21,000
debugging them.
There you go.

76
00:04:21,000 --> 00:04:24,000
It can be done.
Skip lists are really simple.

77
00:04:24,000 --> 00:04:27,000
And, at no point writing the
code did I have to think,

78
00:04:27,000 --> 00:04:32,000
whereas every other structure I
would have to think.

79
00:04:32,000 --> 00:04:36,000
There was one moment when I
thought, ah, how do I flip a

80
00:04:36,000 --> 00:04:38,000
coin?
That was the entire amount of

81
00:04:38,000 --> 00:04:41,000
thinking.
So, skip lists are a randomized

82
00:04:41,000 --> 00:04:44,000
structure.
Let's add in another adjective

83
00:04:44,000 --> 00:04:46,000
here, and let's also add in
simple.

84
00:04:46,000 --> 00:04:49,000
So, we have a simple,
efficient, dynamic,

85
00:04:49,000 --> 00:04:53,000
randomized search structure:
all those things together.

86
00:04:53,000 --> 00:04:57,000
So, it's sort of like treaps
in that the bound is only a

87
00:04:57,000 --> 00:05:01,000
randomized bound.
But today, we're going to see a

88
00:05:01,000 --> 00:05:06,000
much stronger bound than an
expectation bound.

89
00:05:06,000 --> 00:05:11,000
So, in particular,
skip lists will run in order

90
00:05:11,000 --> 00:05:17,000
log n expected time.
So, the running time for each

91
00:05:17,000 --> 00:05:22,000
operation will be order log n in
expectation.

92
00:05:22,000 --> 00:05:28,000
But, we're going to prove a
much stronger result, that they're

93
00:05:28,000 --> 00:05:34,000
order log n, with high
probability.

94
00:05:34,000 --> 00:05:37,000
So, this is a very strong
claim.

95
00:05:37,000 --> 00:05:42,000
And it means that the running
time of each operation,

96
00:05:42,000 --> 00:05:48,000
the running time of every
operation is order log n almost

97
00:05:48,000 --> 00:05:54,000
always in a certain sense.
Why don't I foreshadow that?

98
00:05:54,000 --> 00:05:59,000
So, it's something like,
the probability that it's order

99
00:05:59,000 --> 00:06:05,000
log n is at least one minus one
over some polynomial,

100
00:06:05,000 --> 00:06:08,000
in n.
And, you get to set the

101
00:06:08,000 --> 00:06:10,000
polynomial however large you
like.
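
One way to write the foreshadowed claim down, as a sketch only. The symbols T (the running time of one operation on an n-element skip list), c (a constant that may depend on alpha), and alpha (the exponent of the polynomial "you get to set") are names assumed here, and the precise statement proved later may differ in its details:

```latex
\[
  \Pr\bigl[\, T \le c \log n \,\bigr] \;\ge\; 1 - \frac{1}{n^{\alpha}},
  \qquad \alpha \ge 1 .
\]
```

The larger you pick alpha, the smaller the failure probability, at the price of a larger constant c.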

102
00:06:10,000 --> 00:06:13,000
So, what this basically means
is that almost all the time,

103
00:06:13,000 --> 00:06:16,000
you take your skip lists,
you do a polynomial number of

104
00:06:16,000 --> 00:06:18,000
operations on it,
because presumably you are

105
00:06:18,000 --> 00:06:21,000
running a polynomial time
algorithm that's using this data

106
00:06:21,000 --> 00:06:23,000
structure.
Do a polynomial number of

107
00:06:23,000 --> 00:06:26,000
inserts, deletes, searches,
every single one of them will

108
00:06:26,000 --> 00:06:30,000
take order log n time,
almost guaranteed.

109
00:06:30,000 --> 00:06:33,000
So this is a really strong
bound on the tail of the

110
00:06:33,000 --> 00:06:36,000
distribution.
The mean is order log n.

111
00:06:36,000 --> 00:06:39,000
That's not so exciting.
But, in fact,

112
00:06:39,000 --> 00:06:43,000
almost all of the weight of
this probability distribution is

113
00:06:43,000 --> 00:06:47,000
right around the log n,
just tiny little epsilons,

114
00:06:47,000 --> 00:06:51,000
very tiny probabilities that you
could be bigger than log n.

115
00:06:51,000 --> 00:06:55,000
So that's where we are going.
This is a data structure by

116
00:06:55,000 --> 00:07:00,000
Pugh in 1989.
This is the most recent.

117
00:07:00,000 --> 00:07:03,000
Actually, no,
sorry, treaps are more recent.

118
00:07:03,000 --> 00:07:06,000
They were like '93 or so,
but a fairly recent data

119
00:07:06,000 --> 00:07:09,000
structure for just insert,
delete, search.

120
00:07:09,000 --> 00:07:13,000
And, it's very simple.
You can derive it if you don't

121
00:07:13,000 --> 00:07:16,000
know anything about data
structures, well,

122
00:07:16,000 --> 00:07:19,000
almost nothing.
Now, analyzing that the

123
00:07:19,000 --> 00:07:21,000
performance is log n,
that, of course,

124
00:07:21,000 --> 00:07:25,000
takes some sophistication.
But the data structure itself

125
00:07:25,000 --> 00:07:30,000
is very simple.
We're going to start from

126
00:07:30,000 --> 00:07:34,000
scratch.
Suppose you don't know what a

127
00:07:34,000 --> 00:07:38,000
red black tree is.
You don't know what a B tree

128
00:07:38,000 --> 00:07:41,000
is.
Suppose you don't even know

129
00:07:41,000 --> 00:07:45,000
what a tree is.
What is the simplest data

130
00:07:45,000 --> 00:07:51,000
structure for storing a bunch of
items, for storing a dynamic set?

131
00:07:51,000 --> 00:07:54,000
A list, good,
a linked list.

132
00:07:54,000 --> 00:07:58,000
Now, suppose that it's a sorted
linked list.

133
00:07:58,000 --> 00:08:05,000
So, I'm going to be a little
bit fancier there.

134
00:08:05,000 --> 00:08:10,000
So, if you have a linked list
of items, here it is,

135
00:08:10,000 --> 00:08:16,000
maybe we'll make it doubly
linked just for kicks,

136
00:08:16,000 --> 00:08:22,000
how long does it take to search
in a sorted linked list?

137
00:08:22,000 --> 00:08:26,000
Log n is one answer.
n is the other answer.

138
00:08:26,000 --> 00:08:31,000
Which one is right?
n is the right answer.

139
00:08:31,000 --> 00:08:35,000
So, even though it's sorted,
we can't do binary search

140
00:08:35,000 --> 00:08:38,000
because we don't have
random-access into a linked

141
00:08:38,000 --> 00:08:40,000
list.
So, suppose I'm only given a

142
00:08:40,000 --> 00:08:44,000
pointer to the head.
Otherwise, I'm assuming it's an

143
00:08:44,000 --> 00:08:46,000
array.
So, in a sorted array you can

144
00:08:46,000 --> 00:08:48,000
search in log n.
Sorted linked list:

145
00:08:48,000 --> 00:08:51,000
you've still got to scan
through the darn thing.

146
00:08:51,000 --> 00:08:53,000
So, theta n,
worst case search.
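
A minimal sketch of that linear scan, assuming integer keys and a singly linked list reached only through its head. The names `Node`, `build_sorted_list`, and `search` are illustrative, not from the lecture:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    next: Optional["Node"] = None


def build_sorted_list(keys):
    """Build a singly linked list holding keys in sorted order; return the head."""
    head = None
    for key in sorted(keys, reverse=True):
        head = Node(key, head)
    return head


def search(head, x):
    """Scan from the head; return (found, steps).

    steps counts the visited nodes whose key is <= x.
    """
    steps = 0
    node = head
    while node is not None and node.key <= x:
        steps += 1
        if node.key == x:
            return True, steps
        node = node.next
    return False, steps


head = build_sorted_list(range(1, 101))
print(search(head, 100))  # the largest key forces a scan of all 100 nodes
```

Sorting lets the scan stop early on a miss, but with no random access there is no way to jump to the middle, so the worst case still touches every node.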

147
00:08:53,000 --> 00:08:56,000
Not so good,
but if we just try to improve

148
00:08:56,000 --> 00:08:59,000
it a little bit,
we will discover skip lists

149
00:08:59,000 --> 00:09:03,000
automatically.
So, this is our starting point:

150
00:09:03,000 --> 00:09:06,000
sorted linked lists,
theta n time.

151
00:09:06,000 --> 00:09:09,000
And, I'm not going to think too
much about insertions and

152
00:09:09,000 --> 00:09:12,000
deletions for the moment.
Let's just get search better,

153
00:09:12,000 --> 00:09:15,000
and then we'll worry about
updates.

154
00:09:15,000 --> 00:09:17,000
Updates are where randomization
will come in.

155
00:09:17,000 --> 00:09:21,000
Search: pretty easy idea.
So, how can we make a linked

156
00:09:21,000 --> 00:09:23,000
list better?
Suppose all we know about are

157
00:09:23,000 --> 00:09:26,000
linked lists.
What can I do to make it

158
00:09:26,000 --> 00:09:28,000
faster?
This is where you need a little

159
00:09:28,000 --> 00:09:32,000
bit of innovation,
some creativity.

160
00:09:32,000 --> 00:09:37,000
More links: that's a good idea.
So, I could try to maybe add

161
00:09:37,000 --> 00:09:40,000
pointers to go a couple steps
ahead.

162
00:09:40,000 --> 00:09:45,000
If I had log n pointers,
I could do all powers of two

163
00:09:45,000 --> 00:09:48,000
ahead.
That's a pretty good search

164
00:09:48,000 --> 00:09:51,000
structure.
Some people use that;

165
00:09:51,000 --> 00:09:56,000
like, some peer-to-peer
networks use that idea.

166
00:09:56,000 --> 00:10:01,000
But that's a little too fancy
for me.

167
00:10:01,000 --> 00:10:03,000
Ah, good.
You could try to build a tree

168
00:10:03,000 --> 00:10:07,000
on this linear structure.
That's essentially where we're

169
00:10:07,000 --> 00:10:09,000
going.
So, you could try to put

170
00:10:09,000 --> 00:10:12,000
pointers to, like,
the middle of the list from the

171
00:10:12,000 --> 00:10:14,000
root.
So, you search between either

172
00:10:14,000 --> 00:10:16,000
here.
You point to the median,

173
00:10:16,000 --> 00:10:20,000
so you can compare against the
median, and know whether you

174
00:10:20,000 --> 00:10:23,000
should go in the first half or
the second half. That's

175
00:10:23,000 --> 00:10:27,000
definitely on the right track,
also a bit too sophisticated.

176
00:10:27,000 --> 00:10:29,000
Another list:
yes.

177
00:10:29,000 --> 00:10:32,000
Yes, good.
So, we are going to use two

178
00:10:32,000 --> 00:10:34,000
lists.
That's sort of the next

179
00:10:34,000 --> 00:10:38,000
simplest thing you could do.
OK, and as you suggested,

180
00:10:38,000 --> 00:10:41,000
we could maybe have pointers
between them.

181
00:10:41,000 --> 00:10:46,000
So, maybe we have some elements
down here, some of the elements

182
00:10:46,000 --> 00:10:48,000
up here.
We want to have pointers

183
00:10:48,000 --> 00:10:51,000
between the lists.
OK, it gets a little bit crazy

184
00:10:51,000 --> 00:10:54,000
in how exactly you might do
that.

185
00:10:54,000 --> 00:10:56,000
But somehow,
this feels good.

186
00:10:56,000 --> 00:10:58,000
So this is one linked list:
L_1.

187
00:10:58,000 --> 00:11:02,000
This is another linked list:
L_2.

188
00:11:02,000 --> 00:11:12,000
And, to give you some
inspiration, I want to give you,

189
00:11:12,000 --> 00:11:19,000
so let's play a game.
The game is,

190
00:11:19,000 --> 00:11:29,000
what is this sequence?
So, the sequence is 14.

191
00:11:29,000 --> 00:11:38,000
If you know the answer,
shout it out.

192
00:11:38,000 --> 00:11:42,000
Anyone yet? OK, it's tricky.

193
00:11:54,000 --> 00:11:58,000
It's a bit of a small class,
so I hope someone knows the

194
00:11:58,000 --> 00:11:59,000
answer.

195
00:12:10,000 --> 00:12:14,000
How many TA's know the answer?
Just a couple,

196
00:12:14,000 --> 00:12:19,000
OK, if you're looking at the
slides, probably you know the

197
00:12:19,000 --> 00:12:21,000
answer.
That's cheating.

198
00:12:21,000 --> 00:12:26,000
OK, I'll give you a hint.
It is not a mathematical

199
00:12:26,000 --> 00:12:29,000
sequence.
This is a real-life sequence.

200
00:12:29,000 --> 00:12:32,000
Yeah?
Yeah, and what city?

201
00:12:32,000 --> 00:12:36,000
New York, yeah,
this is the 7th Ave line.

202
00:12:36,000 --> 00:12:40,000
This is my favorite subway line
in New York.

203
00:12:40,000 --> 00:12:46,000
But, what's a cool feature of
the New York City subway?

204
00:12:46,000 --> 00:12:49,000
OK, it's a skip list.
Good answer.

205
00:12:49,000 --> 00:12:54,000
[LAUGHTER] Indeed it is.
Skip lists are so practical.

206
00:12:54,000 --> 00:13:00,000
They've been implemented in the
subway system.

207
00:13:00,000 --> 00:13:03,000
How cool is that?
OK, Boston subway is pretty

208
00:13:03,000 --> 00:13:08,000
cool because it's the oldest
subway definitely in the United

209
00:13:08,000 --> 00:13:11,000
States, maybe in the world.
New York is close,

210
00:13:11,000 --> 00:13:16,000
and it has other nice features
like it's open 24 hours.

211
00:13:16,000 --> 00:13:20,000
That's a definite plus,
but it also has this feature of

212
00:13:20,000 --> 00:13:23,000
express lines.
So, it's a bit of an

213
00:13:23,000 --> 00:13:26,000
abstraction,
but the 7th Ave line has

214
00:13:26,000 --> 00:13:29,000
essentially two kinds of cars.
These are street numbers by the

215
00:13:29,000 --> 00:13:31,000
way.
This is, Penn Station,

216
00:13:31,000 --> 00:13:33,000
Times Square,
and so on.

217
00:13:33,000 --> 00:13:36,000
So, there are essentially two
lines.

218
00:13:36,000 --> 00:13:39,000
There's the express line which
goes 14, to 34,

219
00:13:39,000 --> 00:13:41,000
to 42, to 72,
to 96.

220
00:13:41,000 --> 00:13:45,000
And then, there's the local
line which stops at every stop.

221
00:13:45,000 --> 00:13:49,000
And, they accomplish this with
four sets of tracks.

222
00:13:49,000 --> 00:13:54,000
So, I mean, the express lines
have their own dedicated track.

223
00:13:54,000 --> 00:13:57,000
If you want to go to stop 59
from, let's say,

224
00:13:57,000 --> 00:14:00,000
Penn Station,
well, let's say from lower west

225
00:14:00,000 --> 00:14:05,000
side, you get on the express
line.

226
00:14:05,000 --> 00:14:10,000
You jump to 42 pretty quickly,
and then you switch over to the

227
00:14:10,000 --> 00:14:16,000
local line, and go on to 59 or
wherever I said I was going.

228
00:14:16,000 --> 00:14:21,000
OK, so this is express and
local lines, and we can

229
00:14:21,000 --> 00:14:25,000
represent that with a couple of
lists.

230
00:14:25,000 --> 00:14:29,000
We have one list,
sure, we have one list on the

231
00:14:29,000 --> 00:14:34,000
bottom, so leave some space up
here.

232
00:14:34,000 --> 00:14:48,000
This is the local line,
L_2, 34, 42,

233
00:14:48,000 --> 00:15:02,000
50, 59, 66, 72,
79, and so on.

234
00:15:02,000 --> 00:15:08,000
And then we had the express
line on top, which only stops at

235
00:15:08,000 --> 00:15:11,000
14, 34, 42, 72,
and so on.

236
00:15:11,000 --> 00:15:16,000
I'm not going to redraw the
whole list.

237
00:15:16,000 --> 00:15:21,000
You get the idea.
And so, what we're going to do

238
00:15:21,000 --> 00:15:27,000
is put links between the
local and express lines,

239
00:15:27,000 --> 00:15:34,000
wherever they happen to meet.
And, that's our two linked list

240
00:15:34,000 --> 00:15:38,000
structure.
So, that's what I actually

241
00:15:38,000 --> 00:15:42,000
meant when I was trying to draw
some picture.

242
00:15:42,000 --> 00:15:47,000
Now, this has a property that
in one list, the bottom list,

243
00:15:47,000 --> 00:15:52,000
every element occurs.
And the top list just copies

244
00:15:52,000 --> 00:15:56,000
some of those elements.
And we're going to preserve

245
00:15:56,000 --> 00:16:00,000
that property.
So, L_2 stores all the

246
00:16:00,000 --> 00:16:05,000
elements, and L_1 stores some
subset.

247
00:16:05,000 --> 00:16:10,000
And, it's still open which ones
we should store.

248
00:16:10,000 --> 00:16:16,000
That's the one thing we need to
think about.

249
00:16:16,000 --> 00:16:23,000
But, our inspiration is from
the New York subway system.

250
00:16:23,000 --> 00:16:30,000
OK, there, that's the idea.
Of course, we're also going to

251
00:16:30,000 --> 00:16:36,000
use more than two lists.
OK, we also have links.

252
00:16:36,000 --> 00:16:44,000
Let's say links between
equal keys in L_1 and L_2.

253
00:16:44,000 --> 00:16:46,000
Good.
So, just for the sake of

254
00:16:46,000 --> 00:16:50,000
completeness,
and because we will need this

255
00:16:50,000 --> 00:16:55,000
later, let's talk about searches
before we worry about how these

256
00:16:55,000 --> 00:17:00,000
lists are actually constructed.
Of course, if I wanted that

257
00:17:00,000 --> 00:17:04,000
board.
So, if you want to search for

258
00:17:04,000 --> 00:17:06,000
an element, x,
what do you do?

259
00:17:06,000 --> 00:17:09,000
Well, this is the taking the
subway algorithm.

260
00:17:09,000 --> 00:17:14,000
And, suppose you always start
in the upper left corner of the

261
00:17:14,000 --> 00:17:17,000
subway system,
if you're always in the lower

262
00:17:17,000 --> 00:17:21,000
west side, 14th St,
and I don't know exactly where

263
00:17:21,000 --> 00:17:25,000
that is, but more or less,
somewhere down at the bottom of

264
00:17:25,000 --> 00:17:27,000
Manhattan.
And, you want to go to a

265
00:17:27,000 --> 00:17:33,000
particular station like 59.
Well, you'd stay on the express

266
00:17:33,000 --> 00:17:37,000
line as long as you can because
it happens that we started on

267
00:17:37,000 --> 00:17:39,000
the express line.
And then, you go down.

268
00:17:39,000 --> 00:17:43,000
And then you take the local
line the rest of the way.

269
00:17:43,000 --> 00:17:47,000
That's clearly the right thing
to do if you always start in the

270
00:17:47,000 --> 00:17:50,000
top left corner.
So, I'm going to write that

271
00:17:50,000 --> 00:17:54,000
down in some kind of an
algorithm because we will be

272
00:17:54,000 --> 00:17:56,000
generalizing it.
It's pretty obvious at this

273
00:17:56,000 --> 00:18:00,000
point.
It will remain obvious.

274
00:18:00,000 --> 00:18:06,000
So, I want to walk right in the
top list until that would go too

275
00:18:06,000 --> 00:18:09,000
far.
So, you imagine giving someone

276
00:18:09,000 --> 00:18:14,000
directions on the subway system
they've never been on.

277
00:18:14,000 --> 00:18:17,000
So, you say,
OK, you start at 14th.

278
00:18:17,000 --> 00:18:22,000
Take the express line,
and when you get to 72nd,

279
00:18:22,000 --> 00:18:25,000
you've gone too far.
Go back one,

280
00:18:25,000 --> 00:18:30,000
and then go down to the local
line.

281
00:18:30,000 --> 00:18:32,000
It's really annoying
directions.

282
00:18:32,000 --> 00:18:37,000
But this is what an algorithm
has to do because it's never

283
00:18:37,000 --> 00:18:41,000
taken the subway before.
So, it's going to check,

284
00:18:41,000 --> 00:18:45,000
so let's do it here.
So, suppose I'm aiming for 59.

285
00:18:45,000 --> 00:18:49,000
So, I start at 14,
say the first thing I do is go

286
00:18:49,000 --> 00:18:51,000
to 34.
Then from there,

287
00:18:51,000 --> 00:18:54,000
I go to 42.
Still good because 59 is bigger

288
00:18:54,000 --> 00:18:56,000
than 42.
I go right again.

289
00:18:56,000 --> 00:18:59,000
I say, oops,
72 is too big.

290
00:18:59,000 --> 00:19:04,000
That was too far.
So, I go back to where I just

291
00:19:04,000 --> 00:19:07,000
was.
Then I go down and then I keep

292
00:19:07,000 --> 00:19:12,000
going right until I find the
element that I want,

293
00:19:12,000 --> 00:19:17,000
or discover that it's not in
the bottom list because bottom

294
00:19:17,000 --> 00:19:21,000
list has everyone.
So, that's the algorithm.

295
00:19:21,000 --> 00:19:27,000
Stop when going right would go
too far, and you discover that

296
00:19:27,000 --> 00:19:31,000
with a comparison.
Then you walk down to L_2.

297
00:19:31,000 --> 00:19:35,000
And then you walk right in L_2
until you find x,

298
00:19:35,000 --> 00:19:40,000
or you find something greater
than x, in which case x is

299
00:19:40,000 --> 00:19:46,000
definitely not on your list.
And you found the predecessor

300
00:19:46,000 --> 00:19:49,000
and successor,
which may be your goal.
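
The walk just described can be sketched as follows, under some assumptions: the smallest key appears in both lists, keys are integers, and instead of stepping right and backing up, the code peeks at the next express stop before moving. `Node`, `build_line`, `build_two_lists`, and `search` are illustrative names, and the stop lists contain only the street numbers mentioned in the lecture:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    next: Optional["Node"] = None  # next stop on the same line
    down: Optional["Node"] = None  # same key on the local line (express nodes only)


def build_line(keys):
    """Link sorted keys into one list; return (head, {key: node})."""
    nodes = [Node(k) for k in sorted(keys)]
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    return nodes[0], {n.key: n for n in nodes}


def build_two_lists(local_keys, express_keys):
    """L_2 holds every key, L_1 a subset; equal keys are linked top to bottom."""
    _, l2_nodes = build_line(local_keys)
    l1_head, l1_nodes = build_line(express_keys)
    for key, node in l1_nodes.items():
        node.down = l2_nodes[key]
    return l1_head


def search(l1_head, x):
    """Return (found, path), where path lists the (line, key) stops visited."""
    node = l1_head
    path = [("L1", node.key)]
    # Walk right in L_1 until going right would go too far.
    while node.next is not None and node.next.key <= x:
        node = node.next
        path.append(("L1", node.key))
    # Walk down to L_2.
    node = node.down
    path.append(("L2", node.key))
    # Walk right in L_2 until we reach x or the next stop would pass it.
    while node.key < x and node.next is not None and node.next.key <= x:
        node = node.next
        path.append(("L2", node.key))
    return node.key == x, path


top = build_two_lists([14, 34, 42, 50, 59, 66, 72, 79], [14, 34, 42, 72])
print(search(top, 59))
```

On a miss, the last stop in the path is the predecessor of x and its `next` pointer leads to the successor, which is the position an insertion would use.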

301
00:19:49,000 --> 00:19:52,000
If you didn't find where x was,
you should find where it would

302
00:19:52,000 --> 00:19:55,000
go if it were there,
because then maybe you could

303
00:19:55,000 --> 00:19:58,000
insert there.
We're going to use this

304
00:19:58,000 --> 00:20:00,000
algorithm in insertion.
OK, so that's search:

305
00:20:00,000 --> 00:20:05,000
pretty easy at this point.
Now, what we haven't discussed

306
00:20:05,000 --> 00:20:08,000
is how fast the search algorithm
is, and it depends,

307
00:20:08,000 --> 00:20:12,000
of course, which elements we're
going to store in L_1,

308
00:20:12,000 --> 00:20:14,000
which subset of elements should
go in L_1.

309
00:20:14,000 --> 00:20:18,000
Now, in the subway system,
you probably put all the

310
00:20:18,000 --> 00:20:21,000
popular stations in L_1.
But here, we want worst-case

311
00:20:21,000 --> 00:20:24,000
performance.
So, we don't have some

312
00:20:24,000 --> 00:20:26,000
probability distribution on the
nodes.

313
00:20:26,000 --> 00:20:30,000
We'd just like every node to be
accessed sort of as quickly as

314
00:20:30,000 --> 00:20:35,000
possible, uniformly.
So, we want to minimize the

315
00:20:35,000 --> 00:20:39,000
maximum time over all queries.
So, any ideas what we should do

316
00:20:39,000 --> 00:20:42,000
with L_1?
Should I put all the nodes of

317
00:20:42,000 --> 00:20:46,000
L_1 in the beginning?
OK, it's a strict subset.

318
00:20:46,000 --> 00:20:49,000
Suppose I told you what the
size of L_1 was.

319
00:20:49,000 --> 00:20:53,000
I can tell you,
I could afford to build this

320
00:20:53,000 --> 00:20:56,000
many express stops.
How should you distribute them

321
00:20:56,000 --> 00:21:02,000
among the elements of L_2?
Uniformly, good.

322
00:21:02,000 --> 00:21:08,000
So, what nodes,
sorry, what keys,

323
00:21:08,000 --> 00:21:17,000
let's say, go in L_1?
Well, definitely the best thing

324
00:21:17,000 --> 00:21:24,000
to do is to spread them out
uniformly, OK,

325
00:21:24,000 --> 00:21:35,000
which is definitely not what
the 7th Ave line looks like.

326
00:21:35,000 --> 00:21:39,000
But, let's imagine that we
could reengineer everything.

327
00:21:39,000 --> 00:21:45,000
So, we're going to try to space
these things out a little bit

328
00:21:45,000 --> 00:21:47,000
more.
So, 34 and 42nd are way too

329
00:21:47,000 --> 00:21:50,000
close.
We'll take a few more stops.

330
00:21:50,000 --> 00:21:54,000
And, now we can start to
analyze things.

331
00:21:54,000 --> 00:21:57,000
OK, as a function of the length
of L_1.

332
00:21:57,000 --> 00:22:03,000
So, the cost of a search is now
roughly, so, I want a function

333
00:22:03,000 --> 00:22:07,000
of the length of L_1,
and the length of L_2,

334
00:22:07,000 --> 00:22:11,000
which is all the elements,
n.

335
00:22:11,000 --> 00:22:18,000
What is the cost of the search
if I spread out all the elements

336
00:22:18,000 --> 00:22:20,000
in L_1 uniformly?
Yeah?

337
00:22:20,000 --> 00:22:26,000
Right, the total number of
elements in the top list,

338
00:22:26,000 --> 00:22:33,000
plus the division between the
bottom and the top.

339
00:22:33,000 --> 00:22:36,000
So, I'll write the length of
L_1 plus the length of L_2

340
00:22:36,000 --> 00:22:39,000
divided by the length of L_1.
OK, this is roughly,

341
00:22:39,000 --> 00:22:42,000
I mean, there's maybe a plus
one or so here because in the

342
00:22:42,000 --> 00:22:46,000
worst case, I have to search
through all of L_1 because the

343
00:22:46,000 --> 00:22:49,000
station I could be looking for
could be the max.

344
00:22:49,000 --> 00:22:52,000
OK, and maybe I'm not lucky,
and the max is not on the

345
00:22:52,000 --> 00:22:54,000
express line.
So then, I have to go down to

346
00:22:54,000 --> 00:22:57,000
the local line.
And how many stops will I have

347
00:22:57,000 --> 00:23:01,000
to go on the local line?
Well, L_1 just evenly

348
00:23:01,000 --> 00:23:04,000
partitions L_2.
So this is the number of

349
00:23:04,000 --> 00:23:08,000
consecutive stations between two
express stops.

350
00:23:08,000 --> 00:23:12,000
So, I take the express,
possibly this long,

351
00:23:12,000 --> 00:23:15,000
but I take the local possibly
this long.

352
00:23:15,000 --> 00:23:18,000
And, this is in L_2.
And there is,

353
00:23:18,000 --> 00:23:20,000
plus, a constant,
for example,

354
00:23:20,000 --> 00:23:24,000
for walking down.
But that's basically the number

355
00:23:24,000 --> 00:23:28,000
of nodes that I visit.
So, I'd like to minimize this

356
00:23:28,000 --> 00:23:36,000
function.
Now, L_2, I'm going to call

357
00:23:36,000 --> 00:23:47,000
that n because that's the total
number of elements.

358
00:23:47,000 --> 00:23:55,000
L_1, I can choose to be
whatever I want.

359
00:23:55,000 --> 00:24:03,000
So, let's go over here.
So, I want to minimize L_1 plus

360
00:24:03,000 --> 00:24:07,000
n over L_1.
And I get to choose L_1.

361
00:24:07,000 --> 00:24:11,000
Now, I could differentiate
this, set it to zero,

362
00:24:11,000 --> 00:24:15,000
and go crazy.
Or, I could realize that,

363
00:24:15,000 --> 00:24:19,000
I mean, that's not hard.
But, that's a little bit too

364
00:24:19,000 --> 00:24:22,000
fancy for me.
So, I could say,

365
00:24:22,000 --> 00:24:26,000
well, this is clearly best when
L_1 is small.

366
00:24:26,000 --> 00:24:32,000
And this is clearly best when
L_1 is large.

367
00:24:32,000 --> 00:24:37,000
So, there's a trade-off there.
And, the trade-off will be

368
00:24:37,000 --> 00:24:44,000
roughly minimized up to constant
factors when these two terms are

369
00:24:44,000 --> 00:24:48,000
equal.
That's when I have pretty good

370
00:24:48,000 --> 00:24:53,000
balance between the two ends of
the trade-off.

371
00:24:53,000 --> 00:24:56,000
So, this is up to constant
factors.

372
00:24:56,000 --> 00:25:03,000
I can let L_1 equal n over L_1,
OK, because at most I'm losing

373
00:25:03,000 --> 00:25:10,000
a factor of two there when they
happen to be equal.

374
00:25:10,000 --> 00:25:14,000
So now, I just solve this.
This is really easy.

375
00:25:14,000 --> 00:25:18,000
This is (L_1)^2 equals n.
So, L_1 is the square root of

376
00:25:18,000 --> 00:25:20,000
n.
OK, so the cost that I'm

377
00:25:20,000 --> 00:25:24,000
getting over here,
L_1 plus L_2 over L_1 is the

378
00:25:24,000 --> 00:25:28,000
square root of n plus n over
root n, which is,

379
00:25:28,000 --> 00:25:32,000
again, root n.
So, I get two root n.
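
A quick numeric check of that trade-off, as a sketch: it treats the cost as exactly |L_1| + n / |L_1|, ignoring the additive constants, and tries every integer size for L_1. The function names `search_cost` and `best_l1_len` are illustrative:

```python
import math


def search_cost(l1_len, n):
    """Approximate worst-case nodes visited: |L_1| + n / |L_1|."""
    return l1_len + n / l1_len


def best_l1_len(n):
    """Try every integer |L_1| from 1 to n and return the cheapest one."""
    return min(range(1, n + 1), key=lambda k: search_cost(k, n))


n = 10_000
k = best_l1_len(n)
# Compare the brute-force optimum with the root-n prediction.
print(k, search_cost(k, n), 2 * math.sqrt(n))
```

For a perfect square n, the brute-force minimum lands exactly on the square root of n, which agrees with setting the two terms equal.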

380
00:25:32,000 --> 00:25:36,000
So, search cost,
and I'm caring about the

381
00:25:36,000 --> 00:25:39,000
constant here,
because it will matter in a

382
00:25:39,000 --> 00:25:41,000
moment.
Two square root of n:

383
00:25:41,000 --> 00:25:45,000
I'm not caring about the
additive constant,

384
00:25:45,000 --> 00:25:48,000
but the multiplicative constant
I care about.

385
00:25:48,000 --> 00:25:52,000
OK, that seems good.
We started with a linked list

386
00:25:52,000 --> 00:25:56,000
that searched in n time,
theta n time per operation.

387
00:25:56,000 --> 00:26:03,000
Now we have two linked lists,
search in theta root n time.

388
00:26:03,000 --> 00:26:07,000
It seems pretty good.
This is what the structure

389
00:26:07,000 --> 00:26:10,000
looks like.
We have root n guys here.

390
00:26:10,000 --> 00:26:15,000
This is in the local line.
And, we have one express stop

391
00:26:15,000 --> 00:26:19,000
which represents that.
But we have another root n

392
00:26:19,000 --> 00:26:24,000
values in the local line.
And we have one express stop

393
00:26:24,000 --> 00:26:28,000
that represents that.
And these two are linked,

394
00:26:28,000 --> 00:26:31,000
and so on.

395
00:26:42,000 --> 00:26:44,000
Well, I should put some dot,
dot, dots in there.

396
00:26:44,000 --> 00:26:47,000
OK, so each of these chunks has
length root n,

397
00:26:47,000 --> 00:26:49,000
and the number of
representatives up here is

398
00:26:49,000 --> 00:26:52,000
square root of n.
The number of express stops is

399
00:26:52,000 --> 00:26:54,000
square root of n.
So clearly, things are balanced

400
00:26:54,000 --> 00:26:55,000
now.
I search for,

401
00:26:55,000 --> 00:26:57,000
at most, square root of n up
here.

402
00:26:57,000 --> 00:27:00,000
Then I search in one of these
lists for, at most,

403
00:27:00,000 --> 00:27:04,000
square root of n.
So, every search takes,

404
00:27:04,000 --> 00:27:10,000
at most, two root n.
Cool, what should we do next?

405
00:27:10,000 --> 00:27:15,000
So, again, ignore insertions
and deletions.

406
00:27:15,000 --> 00:27:22,000
I want to make searches faster
because square root of n is not

407
00:27:22,000 --> 00:27:25,000
so hot as we know.
Sorry?

408
00:27:25,000 --> 00:27:30,000
More lines.
Let's add a super express line,

409
00:27:30,000 --> 00:27:35,000
or another linked list.
OK, this was two.

410
00:27:35,000 --> 00:27:41,000
Why not do three?
So, we started with a sorted

411
00:27:41,000 --> 00:27:45,000
linked list.
Then we went to two.

412
00:27:45,000 --> 00:27:48,000
This gave us two square root of
n.

413
00:27:48,000 --> 00:27:52,000
Now, I want three sorted linked
lists.

414
00:27:52,000 --> 00:27:57,000
I didn't pluralize here.
Any guesses what the running

415
00:27:57,000 --> 00:28:02,000
time might be?
This is just guesswork.

416
00:28:02,000 --> 00:28:05,000
Don't think.
From two square root of n,

417
00:28:05,000 --> 00:28:08,000
you would go to,
sorry?

418
00:28:08,000 --> 00:28:12,000
Two square root of two,
fourth root of n?

419
00:28:12,000 --> 00:28:17,000
That's on the right track.
Both the constant and the root

420
00:28:17,000 --> 00:28:20,000
change, but not quite so
fancily.

421
00:28:20,000 --> 00:28:24,000
Three times the cube root:
good.

422
00:28:24,000 --> 00:28:29,000
Intuition is very helpful here.
It doesn't matter what the

423
00:28:29,000 --> 00:28:35,000
right answer is.
Use your intuition.

424
00:28:35,000 --> 00:28:37,000
You can prove that.
It's not so hard.

425
00:28:37,000 --> 00:28:40,000
You now have three lists,
and what you want to balance

426
00:28:40,000 --> 00:28:44,000
are the length of the top
list, the ratio between the top

427
00:28:44,000 --> 00:28:47,000
two lists, and the ratio between
the bottom two lists.

428
00:28:47,000 --> 00:28:50,000
So, you want these three to
multiply out to n,

429
00:28:50,000 --> 00:28:53,000
because the top times the ratio
times the ratio:

430
00:28:53,000 --> 00:28:56,000
that has to equal n.
And, so that's where you get

431
00:28:56,000 --> 00:28:59,000
the cube root of n.
Each of these should be equal.

432
00:28:59,000 --> 00:29:03,000
So, you set them equal because the
cost is the sum of those three

433
00:29:03,000 --> 00:29:07,000
things.
So, you set each of them to

434
00:29:07,000 --> 00:29:11,000
cube root of n,
and there are three of them.

435
00:29:11,000 --> 00:29:15,000
OK, check it at home if you
want to be more sure.

436
00:29:15,000 --> 00:29:21,000
Obviously, we want a few more.
So, let's think about k sorted

437
00:29:21,000 --> 00:29:24,000
lists.
k sorted lists will be k times

438
00:29:24,000 --> 00:29:28,000
the k'th root of n.
You probably guessed that by

439
00:29:28,000 --> 00:29:33,000
now.
So, what should we set k to?

440
00:29:33,000 --> 00:29:38,000
I don't want the exact minimum.
What's a good value for k?

441
00:29:38,000 --> 00:29:41,000
Should I set it to n?
n's kind of nice,

442
00:29:41,000 --> 00:29:44,000
because the n'th root of n is
just one.

443
00:29:44,000 --> 00:29:48,000
But then the total is n.
So, this is why I cared about

444
00:29:48,000 --> 00:29:53,000
the lead constant because it's
going to grow as I add more

445
00:29:53,000 --> 00:29:56,000
lists.
What's the biggest reasonable

446
00:29:56,000 --> 00:30:03,000
value of k that I could use?
Log n, because I have a k out

447
00:30:03,000 --> 00:30:07,000
there.
I certainly don't want to use

448
00:30:07,000 --> 00:30:13,000
more than log n.
So, log n times the log n'th

449
00:30:13,000 --> 00:30:18,000
root of n, and this is a little hard
to draw.

450
00:30:18,000 --> 00:30:23,000
Now, what is the log n'th root
of n?

451
00:30:23,000 --> 00:30:27,000
That's what you're all thinking
about.

452
00:30:27,000 --> 00:30:34,000
What is the log n'th root of n
minus two?

453
00:30:34,000 --> 00:30:39,000
It's one of these good
questions whose answer is?

454
00:30:39,000 --> 00:30:43,000
Oh man.
Remember the definition of

455
00:30:43,000 --> 00:30:47,000
root?
OK, the root is n to the one

456
00:30:47,000 --> 00:30:51,000
over log n.
OK, good, remember the

457
00:30:51,000 --> 00:30:55,000
definition of having a power,
A to the B?

458
00:30:55,000 --> 00:30:59,000
It was like two to the power,
B log A?

459
00:30:59,000 --> 00:31:06,000
Does that sound familiar?
So, this is two to the log n

460
00:31:06,000 --> 00:31:11,000
over log n, which is,
I hope you can get it at this

461
00:31:11,000 --> 00:31:17,000
point, two.
Wow, so the log n'th root of n

462
00:31:17,000 --> 00:31:20,000
minus two is zero:
my favorite answer.

463
00:31:20,000 --> 00:31:23,000
OK, this is two.
So this whole thing is two log

464
00:31:23,000 --> 00:31:26,000
n: pretty nifty.
So, you could be a little

465
00:31:26,000 --> 00:31:31,000
fancier and tweak this a little
bit, but two log n is plenty

466
00:31:31,000 --> 00:31:36,000
good for me.
We clearly don't want to use

467
00:31:36,000 --> 00:31:41,000
any more lists,
but log n lists sounds pretty

468
00:31:41,000 --> 00:31:45,000
good.
I get, now, logarithmic search

469
00:31:45,000 --> 00:31:47,000
time.
Let's check.
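
The intuition above can be checked numerically. This is a minimal sketch that assumes the cost model from the lecture, k times the k'th root of n, with logs taken base 2; the function name search_cost and the choice n = 2^16 are illustrative only.

```python
import math

def search_cost(n, k):
    """Approximate search cost with k balanced sorted lists: k * n^(1/k)."""
    return k * n ** (1.0 / k)

n = 2 ** 16
# Compare one list, two lists, three lists, log n lists, and n lists.
for k in (1, 2, 3, int(math.log2(n)), n):
    print(k, round(search_cost(n, k), 1))
```

With k = log n the k'th root of n collapses to two, so the cost is two log n; pushing k all the way to n makes the leading factor dominate again.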

470
00:31:47,000 --> 00:31:52,000
I mean, we sort of did this all
intuitively.

471
00:31:52,000 --> 00:31:56,000
Let's draw what the list looks
like.

472
00:31:56,000 --> 00:32:01,000
But, it will work.
So, I'm going to redraw this

473
00:32:01,000 --> 00:32:07,000
example because you have to,
also.

474
00:32:07,000 --> 00:32:14,000
So, let's redesign that New
York City subway system.

475
00:32:14,000 --> 00:32:22,000
And, I want you to leave three
blank lines up here.

476
00:32:22,000 --> 00:32:29,000
So, you should have this
memorized by now.

477
00:32:29,000 --> 00:32:34,000
But I don't.
So, we are not allowed to

478
00:32:34,000 --> 00:32:38,000
change the local line,
though it would be nice,

479
00:32:38,000 --> 00:32:43,000
add a few more stops there.
OK, we can stop at 79th Street.

480
00:32:43,000 --> 00:32:47,000
That's enough.
So now, we have log n lists.

481
00:32:47,000 --> 00:32:53,000
And here, log n is about four.
So, I want to make a bunch of

482
00:32:53,000 --> 00:32:55,000
lists here.
In particular,

483
00:32:55,000 --> 00:33:02,000
14 will appear on all of them.
So, why don't I draw those in?

484
00:33:02,000 --> 00:33:05,000
And, the question is,
which elements go in here?

485
00:33:05,000 --> 00:33:08,000
So, I have log n lists.
And, my goal is to balance the

486
00:33:08,000 --> 00:33:12,000
number of items up here,
and the ratio between these two

487
00:33:12,000 --> 00:33:15,000
lists, and the ratio between
these two lists,

488
00:33:15,000 --> 00:33:18,000
and the ratio between these two
lists.

489
00:33:18,000 --> 00:33:20,000
I want all these things to be
balanced.

490
00:33:20,000 --> 00:33:24,000
There are log n of them.
So, the product of all those

491
00:33:24,000 --> 00:33:27,000
ratios better be n,
the number of elements down

492
00:33:27,000 --> 00:33:29,000
here.
So, the product of all these

493
00:33:29,000 --> 00:33:36,000
ratios is n.
And there's log n of them;

494
00:33:36,000 --> 00:33:44,000
how big is each ratio?
So, I'll call the ratio r.

495
00:33:44,000 --> 00:33:52,000
The ratio's r.
I should have r to the power of

496
00:33:52,000 --> 00:33:56,000
log n equals n.
What's r?

497
00:33:56,000 --> 00:34:02,000
What's r minus two?
Zero.

498
00:34:02,000 --> 00:34:05,000
OK, this should be two to the
power of log n.

499
00:34:05,000 --> 00:34:09,000
So, if the ratio between the
number of elements here and here

500
00:34:09,000 --> 00:34:12,000
is two all the way down,
then I will have n elements at

501
00:34:12,000 --> 00:34:15,000
the bottom, which is what I
want.

502
00:34:15,000 --> 00:34:18,000
So, in other words,
I want half the elements here,

503
00:34:18,000 --> 00:34:22,000
a quarter of the elements here,
an eighth of the elements here,

504
00:34:22,000 --> 00:34:25,000
and so on.
So, I'm going to take half of

505
00:34:25,000 --> 00:34:28,000
the elements evenly spaced out:
34th, 50th, 66th,

506
00:34:28,000 --> 00:34:32,000
79th, and so on.
So, this is our new

507
00:34:32,000 --> 00:34:35,000
semi-express line:
not terribly fast,

508
00:34:35,000 --> 00:34:39,000
but you save a factor of two
for going up there.

509
00:34:39,000 --> 00:34:42,000
And, when you're done,
you go down,

510
00:34:42,000 --> 00:34:44,000
and you walk,
at most, one step.

511
00:34:44,000 --> 00:34:47,000
And you find what you're
looking for.

512
00:34:47,000 --> 00:34:52,000
OK, and then we do the same
thing over and over and over

513
00:34:52,000 --> 00:34:56,000
until we run out of elements.
I can't read my own writing.

514
00:34:56,000 --> 00:34:59,000
It's 79th.

515
00:35:11,000 --> 00:35:14,000
OK, if I had a bigger example,
there would be more levels,

516
00:35:14,000 --> 00:35:19,000
but this is just barely enough.
Let's say two elements is where

517
00:35:19,000 --> 00:35:21,000
I stop.
So, this looks good.

518
00:35:21,000 --> 00:35:24,000
Does this look like a structure
you've seen before,

519
00:35:24,000 --> 00:35:25,000
at all, vaguely?
Yes?

520
00:35:25,000 --> 00:35:28,000
A tree: yes.
It looks a lot like a binary

521
00:35:28,000 --> 00:35:31,000
tree.
I'll just leave it at that.

522
00:35:31,000 --> 00:35:34,000
In your problem set,
you'll understand why skip

523
00:35:34,000 --> 00:35:38,000
lists are really like trees.
But it's more or less a tree.

524
00:35:38,000 --> 00:35:41,000
Let's say at this level,
it looks sort of like binary

525
00:35:41,000 --> 00:35:42,000
search.
You look at 14;

526
00:35:42,000 --> 00:35:44,000
you look at 50,
and therefore,

527
00:35:44,000 --> 00:35:48,000
you decide whether you are in
the left half or the right

528
00:35:48,000 --> 00:35:50,000
half.
And that's sort of like a tree.

529
00:35:50,000 --> 00:35:54,000
It's not quite a tree because
we have this element repeated

530
00:35:54,000 --> 00:35:55,000
all over.
But more or less,

531
00:35:55,000 --> 00:35:59,000
this is a binary tree.
At depth I, we have two to the

532
00:35:59,000 --> 00:36:04,000
I nodes, just like a tree,
just like a balanced tree.

533
00:36:04,000 --> 00:36:08,000
I'm going to call this
structure an ideal skip list.

534
00:36:08,000 --> 00:36:13,000
And, if all we are doing is
searches, ideal skip lists are

535
00:36:13,000 --> 00:36:15,000
pretty good.
Maybe in practice:

536
00:36:15,000 --> 00:36:20,000
not quite as good as a binary
search tree, but up to constant

537
00:36:20,000 --> 00:36:24,000
factors: just as good.
So, for example,

538
00:36:24,000 --> 00:36:28,000
I mean, we can generalize
search, just check that it's log

539
00:36:28,000 --> 00:36:32,000
n.
So, the search procedure is you

540
00:36:32,000 --> 00:36:36,000
start at the top left.
So, let's say we are looking

541
00:36:36,000 --> 00:36:38,000
for 72.
You start at the top left.

542
00:36:38,000 --> 00:36:41,000
14 is smaller than 72,
so I try to go right.

543
00:36:41,000 --> 00:36:44,000
79 is too big.
So, I follow this arrow,

544
00:36:44,000 --> 00:36:47,000
but I say, oops,
that's too much.

545
00:36:47,000 --> 00:36:49,000
So, instead,
I go down 14 still.

546
00:36:49,000 --> 00:36:53,000
I go to the right:
oh, 50, that's still smaller

547
00:36:53,000 --> 00:36:55,000
than 72: OK.
I tried to go right again.

548
00:36:55,000 --> 00:36:58,000
Oh: 79, that's too big.
That's no good.

549
00:36:58,000 --> 00:37:00,000
So, I go down.
So, I get 50.

550
00:37:00,000 --> 00:37:05,000
I do the same thing over and
over.

551
00:37:05,000 --> 00:37:07,000
I try to go to the right:
oh, 66, that's OK.

552
00:37:07,000 --> 00:37:09,000
Try to go to the right:
oh, 79, that's too big.

553
00:37:09,000 --> 00:37:11,000
So I go down.
Now I go to the right and,

554
00:37:11,000 --> 00:37:14,000
oh, 72: done.
Otherwise, I'd go too far and

555
00:37:14,000 --> 00:37:16,000
try to go down and say,
oops, element must not be

556
00:37:16,000 --> 00:37:18,000
there.
It's a very simple search

557
00:37:18,000 --> 00:37:21,000
algorithm: same as here except
just remove the L_1 and L_2.

558
00:37:21,000 --> 00:37:23,000
Go right until that would go
too far.

559
00:37:23,000 --> 00:37:25,000
Then go down.
Then go right until we'd go too

560
00:37:25,000 --> 00:37:28,000
far, and then go down.
You might have to do this log n

561
00:37:28,000 --> 00:37:30,000
times.
In each level,

562
00:37:30,000 --> 00:37:34,000
you're clearly only walking a
couple of steps because the

563
00:37:34,000 --> 00:37:37,000
ratio between these two sizes is
only two.

564
00:37:37,000 --> 00:37:40,000
So, this will cost two log n
for search.
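
The search just walked through can be sketched as follows. This is an illustration under assumptions, not the lecture's code: each level is a plain Python list holding every other element of the level below, and the stops 23, 42, and 59 are filled in as assumed local stops since only 14, 34, 50, 66, 72, and 79 are named here. The helper names build_ideal_levels and search are hypothetical.

```python
def build_ideal_levels(keys):
    """Return the levels of an ideal skip list, top level first.

    Each level keeps every other key of the level below it, and we stop
    once a level has at most two keys.
    """
    levels = [list(keys)]
    while len(levels[-1]) > 2:
        levels.append(levels[-1][::2])
    return levels[::-1]

def search(levels, target):
    """Go right until that would go too far, then go down. Repeat."""
    visited = []
    i = 0
    for depth, level in enumerate(levels):
        visited.append(level[i])
        while i + 1 < len(level) and level[i + 1] <= target:
            i += 1
            visited.append(level[i])
        if depth + 1 < len(levels):
            i *= 2  # position j on one level is position 2j on the level below
    return level[i] == target, visited

stops = [14, 23, 34, 42, 50, 59, 66, 72, 79]
levels = build_ideal_levels(stops)
found, visited = search(levels, 72)
print(found, visited)
```

The visited list repeats a key each time the search drops down a level, which mirrors the "go down 14 still" step in the walkthrough.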

565
00:37:40,000 --> 00:37:42,000
Good, I mean,
so that was to check because we

566
00:37:42,000 --> 00:37:46,000
were using intuition over here;
a little bit shaky.

567
00:37:46,000 --> 00:37:50,000
So, this is an ideal skip list,
but we have to support insertions

568
00:37:50,000 --> 00:37:53,000
and deletions.
As soon as we do an insert and

569
00:37:53,000 --> 00:37:57,000
delete, there's no way we're
going to maintain the structure.

570
00:37:57,000 --> 00:38:03,000
It's a bit too special.
There is only one of these

571
00:38:03,000 --> 00:38:09,000
where everything is perfectly
spaced out, and everything is

572
00:38:09,000 --> 00:38:13,000
beautiful.
So, we can't do that.

573
00:38:13,000 --> 00:38:20,000
We're going to maintain roughly
this structure as best we can.

574
00:38:20,000 --> 00:38:27,000
And, if any one of you knows
someone in New York City subway

575
00:38:27,000 --> 00:38:31,000
planning, you can tell them
this.

576
00:38:31,000 --> 00:38:37,000
OK, so: skip lists.
So, I mean, this is basically

577
00:38:37,000 --> 00:38:42,000
our data structure.
You could use this as a

578
00:38:42,000 --> 00:38:46,000
starting point,
but then you start using skip

579
00:38:46,000 --> 00:38:49,000
lists.
And, we need to somehow

580
00:38:49,000 --> 00:38:54,000
implement insertions and
deletions, and maintain roughly

581
00:38:54,000 --> 00:39:01,000
this structure well enough that
the search still costs order log

582
00:39:01,000 --> 00:39:05,000
n time.
So, let's focus on insertions.

583
00:39:05,000 --> 00:39:09,000
If we do insertions right,
it turns out deletions are

584
00:39:09,000 --> 00:39:11,000
really trivial.

585
00:39:28,000 --> 00:39:31,000
And again, this is all from
first principles.

586
00:39:31,000 --> 00:39:34,000
We're not allowed to use
anything fancy.

587
00:39:34,000 --> 00:39:38,000
But, it would be nice if we
used some good chalk.

588
00:39:38,000 --> 00:39:42,000
This one looks better.
So, suppose you want to insert

589
00:39:42,000 --> 00:39:46,000
an element, x.
We said how to search for an

590
00:39:46,000 --> 00:39:48,000
element.
So, how do we insert it?

591
00:39:48,000 --> 00:39:53,000
Well, the first thing we should
do is figure out where it goes.

592
00:39:53,000 --> 00:39:57,000
So, we search for x.
We call search of x to find

593
00:39:57,000 --> 00:40:03,000
where x fits in the bottom list,
not just any list.

594
00:40:03,000 --> 00:40:06,000
Pretty easy to find out where
it fits in the top list.

595
00:40:06,000 --> 00:40:08,000
That takes, like,
constant time.

596
00:40:08,000 --> 00:40:11,000
What we want to know:
because the top list has

597
00:40:11,000 --> 00:40:14,000
constant length,
we want to know where x goes in

598
00:40:14,000 --> 00:40:17,000
the bottom list.
So, let's say we want to insert

599
00:40:17,000 --> 00:40:19,000
a search for 80.
Well, it is a bit too big.

600
00:40:19,000 --> 00:40:22,000
Let's search for 75.
So, we'll find that 75 fits

601
00:40:22,000 --> 00:40:25,000
right here between 72 and 79
using the same path.

602
00:40:25,000 --> 00:40:29,000
OK, if it's there already,
we complain because I'm going

603
00:40:29,000 --> 00:40:32,000
to assume all keys are distinct
for now just so the picture

604
00:40:32,000 --> 00:40:38,000
stays simple.
But this works fine even if you

605
00:40:38,000 --> 00:40:42,000
are inserting the same key over
and over.

606
00:40:42,000 --> 00:40:47,000
So, that seems good.
One thing we should clearly do

607
00:40:47,000 --> 00:40:50,000
is insert x into the bottom
list.

608
00:40:50,000 --> 00:40:55,000
We now know where it fits.
It should go there.

609
00:40:55,000 --> 00:40:59,000
Because we want to maintain
this invariant,

610
00:40:59,000 --> 00:41:06,000
that the bottom list contains
all the elements.

611
00:41:06,000 --> 00:41:10,000
So, there we go.
We've maintained the invariant.

612
00:41:10,000 --> 00:41:14,000
The bottom list contains all
the elements.

613
00:41:14,000 --> 00:41:18,000
So, we search for 75.
We say, oh, 75 goes here,

614
00:41:18,000 --> 00:41:24,000
and we just sort of link in 75.
You know how to do a linked

615
00:41:24,000 --> 00:41:29,000
list, I hope.
Let me just erase that pointer.

616
00:41:29,000 --> 00:41:32,000
All the work in implementing
skip lists is the linked list

617
00:41:32,000 --> 00:41:34,000
manipulation.
Is that enough?

618
00:41:34,000 --> 00:41:38,000
No, it would be fine for now
because now there's only a chain

619
00:41:38,000 --> 00:41:41,000
of length three here that you'd
have to walk over if you're

620
00:41:41,000 --> 00:41:44,000
looking for something in this
range.

621
00:41:44,000 --> 00:41:47,000
But if I just keep inserting
75, and 76, then 76 plus

622
00:41:47,000 --> 00:41:51,000
epsilon, 76 plus two epsilon,
and so on, just pack a whole

623
00:41:51,000 --> 00:41:54,000
bunch of elements in here,
this chain will get really

624
00:41:54,000 --> 00:41:55,000
long.
Now, suddenly,

625
00:41:55,000 --> 00:41:58,000
things are not so balanced.
If I do a search,

626
00:41:58,000 --> 00:42:02,000
I'll pay an arbitrarily long
amount of time here to search for

627
00:42:02,000 --> 00:42:05,000
someone.
If I insert k things,

628
00:42:05,000 --> 00:42:08,000
it'll take k time.
I want it to stay log n.

629
00:42:08,000 --> 00:42:11,000
If I only insert log n items,
it's OK for now.

630
00:42:11,000 --> 00:42:15,000
What I want to do is decide
which of these lists contain 75.

631
00:42:15,000 --> 00:42:17,000
So, clearly it goes on the
bottom.

632
00:42:17,000 --> 00:42:19,000
Every element goes in the
bottom.

633
00:42:19,000 --> 00:42:21,000
Should it go up a level?
Maybe.

634
00:42:21,000 --> 00:42:23,000
It depends.
It's not clear yet.

635
00:42:23,000 --> 00:42:27,000
If I insert a few items here,
definitely some of them should

636
00:42:27,000 --> 00:42:39,000
go on the next level.
Should it go two levels up?

637
00:42:39,000 --> 00:42:57,000
Maybe, but even less likely.
So, what should I do?

638
00:42:57,000 --> 00:43:01,000
Yeah?
Right, so you maintain the

639
00:43:01,000 --> 00:43:05,000
ideal partition size,
which may be like the length of

640
00:43:05,000 --> 00:43:07,000
this chain.
And you see,

641
00:43:07,000 --> 00:43:10,000
well, if that gets too long,
then I should split it in the

642
00:43:10,000 --> 00:43:14,000
middle, promote that guy up to
the next level,

643
00:43:14,000 --> 00:43:18,000
and do the same thing up here.
If this chain gets too long

644
00:43:18,000 --> 00:43:21,000
between two consecutive next
level express stops,

645
00:43:21,000 --> 00:43:23,000
then I'll promote the middle
guy.

646
00:43:23,000 --> 00:43:26,000
And that's what you'll do in
your problem set.

647
00:43:26,000 --> 00:43:30,000
That's too fancy for me.
I don't need no stinking

648
00:43:30,000 --> 00:43:34,000
counters.
What else could I do?

649
00:43:46,000 --> 00:43:48,000
I could try to maintain the
ideal skip list structure.

650
00:43:48,000 --> 00:43:51,000
That will be too expensive.
Like I say, 75 is the guy that

651
00:43:51,000 --> 00:43:54,000
gets promoted,
and this guy gets demoted all

652
00:43:54,000 --> 00:43:55,000
the way down.
But that will propagate

653
00:43:55,000 --> 00:43:58,000
everything to the right.
And that could cost linear time

654
00:43:58,000 --> 00:44:01,000
for update.
Other idea?

655
00:44:01,000 --> 00:44:07,000
If I only want half of them to
go up, I could flip a coin.

656
00:44:07,000 --> 00:44:11,000
Good idea.
All right, for that,

657
00:44:11,000 --> 00:44:16,000
I will give you a quarter.
It's a good one.

658
00:44:16,000 --> 00:44:19,000
It's the Old Line State,
Maryland.

659
00:44:19,000 --> 00:44:24,000
There you go.
However, you have to perform

660
00:44:24,000 --> 00:44:32,000
some services for that quarter,
namely, flip the coin.

661
00:44:32,000 --> 00:44:34,000
Can you flip a coin?
Good.

662
00:44:34,000 --> 00:44:38,000
What did you get?
Tails, OK, that's the first

663
00:44:38,000 --> 00:44:42,000
random bit.
What we are going to do is build

664
00:44:42,000 --> 00:44:45,000
a skip list.
Maybe I should tell you how

665
00:44:45,000 --> 00:44:48,000
first.
OK, but the idea is flip a

666
00:44:48,000 --> 00:44:50,000
coin.
If it's heads,

667
00:44:50,000 --> 00:44:55,000
so, sorry, if it's heads,
we will promote it to the next

668
00:44:55,000 --> 00:45:03,000
level, and flip again.
So, this is an answer to the

669
00:45:03,000 --> 00:45:10,000
question, which other lists
should store x?

670
00:45:10,000 --> 00:45:16,000
How many other lists should we
add x to?

671
00:45:16,000 --> 00:45:22,000
Well, the algorithm is,
flip a coin,

672
00:45:22,000 --> 00:45:28,000
and if it comes out heads,
then promote x

673
00:45:28,000 --> 00:45:36,000
to the next level up,
and flip again.

674
00:45:36,000 --> 00:45:39,000
OK, that's key because we might
want this element to go

675
00:45:39,000 --> 00:45:41,000
arbitrarily high.
But for starters,

676
00:45:41,000 --> 00:45:43,000
we flip a coin.
It doesn't go to the next

677
00:45:43,000 --> 00:45:45,000
level.
Well, we'd like it to go to the

678
00:45:45,000 --> 00:45:49,000
next level with probability one
half because we want the ratio

679
00:45:49,000 --> 00:45:51,000
between these two sizes to be a
half, or sorry,

680
00:45:51,000 --> 00:45:54,000
two, depending which way you
take the ratio.

681
00:45:54,000 --> 00:45:56,000
So, I want roughly half the
elements up here.

682
00:45:56,000 --> 00:45:58,000
So, I flip a coin.
If it comes up heads,

683
00:45:58,000 --> 00:46:02,000
I go up here.
This is a fair coin.

684
00:46:02,000 --> 00:46:05,000
So I want it 50-50.
OK, then how often should that

685
00:46:05,000 --> 00:46:07,000
element go up to the next level
up?

686
00:46:07,000 --> 00:46:09,000
Well, with 50% probability
again.

687
00:46:09,000 --> 00:46:12,000
So, I flip another coin.
If it comes up heads,

688
00:46:12,000 --> 00:46:15,000
I'll go up another level.
And that will maintain the

689
00:46:15,000 --> 00:46:19,000
approximate ratio between these
two guys as being two.

690
00:46:19,000 --> 00:46:21,000
The expected ratio will
definitely be two,

691
00:46:21,000 --> 00:46:25,000
and so on, all the way up.
If I go up to the top and flip

692
00:46:25,000 --> 00:46:28,000
a coin, it comes up heads,
I'll make another level.

693
00:46:28,000 --> 00:46:33,000
This is the insertion
algorithm: dead simple.
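
The coin-flipping rule on its own can be sketched like this. It is a minimal illustration, assuming the coin is passed in as a function that returns True for heads; the names count_promotions and fair_coin are hypothetical, and the seed is there only to make the sketch repeatable.

```python
import random

def count_promotions(flip):
    """Flip until tails; return how many levels above the bottom the key goes."""
    promotions = 0
    while flip():  # True means heads: promote and flip again
        promotions += 1
    return promotions

rng = random.Random(0)  # seeded only so the sketch is repeatable
fair_coin = lambda: rng.random() < 0.5
samples = [count_promotions(fair_coin) for _ in range(10_000)]
print(sum(samples) / len(samples))  # average promotions per inserted key
```

Passing the coin in as a function keeps the rule separate from the source of randomness, so a scripted sequence of flips can stand in for the quarter.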

694
00:46:33,000 --> 00:46:38,000
The fancier one you will see on
your problem set.

695
00:46:38,000 --> 00:46:40,000
So, let's do it.

696
00:46:49,000 --> 00:46:53,000
OK, I also need someone to
generate random numbers.

697
00:46:53,000 --> 00:46:56,000
Who can generate random
numbers?

698
00:46:56,000 --> 00:47:00,000
Pseudo-random?
I'll give you a quarter.

699
00:47:00,000 --> 00:47:02,000
I have one here.
Here you go.

700
00:47:02,000 --> 00:47:05,000
That's a boring quarter.
Who would like to generate

701
00:47:05,000 --> 00:47:08,000
random numbers?
Someone volunteering someone

702
00:47:08,000 --> 00:47:10,000
else: that's a good way to do
it.

703
00:47:10,000 --> 00:47:13,000
Here you go.
You get a quarter,

704
00:47:13,000 --> 00:47:15,000
but you're not allowed to flip
it.

705
00:47:15,000 --> 00:47:18,000
No randomness for you;
well, OK, you can generate

706
00:47:18,000 --> 00:47:22,000
bits, and then compute a number.
So, give me a number.

707
00:47:22,000 --> 00:47:25,000
44, good answer.
OK, we already flipped a coin

708
00:47:25,000 --> 00:47:27,000
and I got tails.
Done.

709
00:47:27,000 --> 00:47:33,000
That's the insertion algorithm.
I'm going to make some more

710
00:47:33,000 --> 00:47:36,000
space actually,
put it way down here.

711
00:47:36,000 --> 00:47:41,000
OK, so 44 does not get promoted
because we got a tails.

712
00:47:41,000 --> 00:47:46,000
So, give me another number.
Nine, OK, I search for nine in

713
00:47:46,000 --> 00:47:49,000
this list.
I should mention one other

714
00:47:49,000 --> 00:47:53,000
thing, sorry.
I need a small change.

715
00:47:53,000 --> 00:47:57,000
This is just to make sure
searches still work.

716
00:47:57,000 --> 00:48:02,000
So, the worry is suppose I
insert something bigger and then

717
00:48:02,000 --> 00:48:07,000
I promote it.
This would look very bad for a

718
00:48:07,000 --> 00:48:11,000
skip list data structure because
I always want to start at the

719
00:48:11,000 --> 00:48:13,000
top left, and now there's no top
left.

720
00:48:13,000 --> 00:48:17,000
So, just minor change:
just let me remember that.

721
00:48:17,000 --> 00:48:21,000
The minor change is that I'm
going to store a special value

722
00:48:21,000 --> 00:48:25,000
minus infinity in every list.
So, minus infinity always gets

723
00:48:25,000 --> 00:48:29,000
promoted all the way to the top,
whatever the top happens to be

724
00:48:29,000 --> 00:48:32,000
now.
So, initially,

725
00:48:32,000 --> 00:48:35,000
that way I'll always have a top
left.

726
00:48:35,000 --> 00:48:38,000
Sorry, I forgot to mention
that.

727
00:48:38,000 --> 00:48:41,000
So, initially I'll just have
minus infinity.

728
00:48:41,000 --> 00:48:45,000
Then I insert 44.
I say, OK, 44 goes there,

729
00:48:45,000 --> 00:48:47,000
no promotion,
done.

730
00:48:47,000 --> 00:48:49,000
Now, we're going to insert
nine.

731
00:48:49,000 --> 00:48:53,000
Nine goes here.
So, minus infinity to nine,

732
00:48:53,000 --> 00:48:55,000
flip your coin,
heads.

733
00:48:55,000 --> 00:49:00,000
Did he actually flip it?
OK, good.

734
00:49:00,000 --> 00:49:02,000
He flipped it before,
yeah, sure.

735
00:49:02,000 --> 00:49:04,000
I'm just giving you a hard
time.

736
00:49:04,000 --> 00:49:09,000
So, we have nine up here.
We need to maintain this minus

737
00:49:09,000 --> 00:49:13,000
infinity just to make sure it
gets promoted along with

738
00:49:13,000 --> 00:49:16,000
everything else.
So, that looks like a nice skip

739
00:49:16,000 --> 00:49:18,000
list.
Flip it again.

740
00:49:18,000 --> 00:49:21,000
Tails, good.
OK, so this looks like an ideal

741
00:49:21,000 --> 00:49:23,000
skip list.
Isn't that great?

742
00:49:23,000 --> 00:49:27,000
It works every time.
OK, give me another number.

743
00:49:27,000 --> 00:49:32,000
26, OK, so I search for 26.
26 goes here.

744
00:49:32,000 --> 00:49:36,000
It clearly goes on the bottom
list.

745
00:49:36,000 --> 00:49:41,000
Here we go, 26,
and then I erase 44.

746
00:49:41,000 --> 00:49:46,000
Flip.
Tails, OK, another number.

747
00:49:46,000 --> 00:49:52,000
50, oh, a big one.
It costs me a little while to

748
00:49:52,000 --> 00:49:56,000
search, and I get over here.

749
00:49:56,000 --> 00:49:58,000
Flip.
Heads, good.

750
00:49:58,000 --> 00:50:05,000
So 50 gets promoted.
Flip it again.

751
00:50:05,000 --> 00:50:08,000
Tails, OK, still a reasonable
number.

752
00:50:08,000 --> 00:50:11,000
Another number?
12, it takes a little while to

753
00:50:11,000 --> 00:50:15,000
get exciting here.
OK, 12 goes here between nine

754
00:50:15,000 --> 00:50:18,000
and 26.
You're giving me a hard time

755
00:50:18,000 --> 00:50:20,000
here.
OK, flip.

756
00:50:20,000 --> 00:50:24,000
Heads, OK, 12 gets promoted.
I know you have to work a

757
00:50:24,000 --> 00:50:30,000
little bit, but we just came
here to search for 12.

758
00:50:30,000 --> 00:50:35,000
So, we know that nine was the
last point we went down.

759
00:50:35,000 --> 00:50:39,000
So, we promote 12.
It gets inserted up here.

760
00:50:39,000 --> 00:50:45,000
We are just inserting into this
particular linked list:

761
00:50:45,000 --> 00:50:48,000
nothing fancy.
We link the two twelves

762
00:50:48,000 --> 00:50:52,000
together.
It still looks kind of like a

763
00:50:52,000 --> 00:50:55,000
linked list.
Flip again.

764
00:50:55,000 --> 00:50:58,000
OK, tails, another number.
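
The insertions so far (44, 9, 26, 50, 12) can be replayed in code. This is a sketch under assumptions rather than a reference implementation: the Node and SkipList classes and their method names are hypothetical, each node carries only right and down pointers, and the coin is scripted to reproduce the flips called out in class.

```python
import math

class Node:
    def __init__(self, key):
        self.key = key
        self.right = None
        self.down = None

class SkipList:
    def __init__(self, flip):
        self.flip = flip                  # callable, True means heads
        self.head = Node(-math.inf)       # minus infinity is the top left

    def _path(self, key):
        """Last node visited on each level, top level first."""
        path, node = [], self.head
        while node is not None:
            while node.right is not None and node.right.key < key:
                node = node.right         # go right until that would go too far
            path.append(node)
            node = node.down              # then go down
        return path

    def contains(self, key):
        pred = self._path(key)[-1]
        return pred.right is not None and pred.right.key == key

    def insert(self, key):
        path, below = self._path(key), None
        while True:
            if path:
                pred = path.pop()         # predecessor on the current level
            else:
                # Promoted past the top: add a level and promote minus infinity.
                pred = Node(-math.inf)
                pred.down = self.head
                self.head = pred
            node = Node(key)
            node.right, pred.right = pred.right, node
            node.down, below = below, node
            if not self.flip():           # tails: stop promoting
                break

    def levels(self):
        """Keys on each level, top first, leaving out minus infinity."""
        out, head = [], self.head
        while head is not None:
            row, node = [], head.right
            while node is not None:
                row.append(node.key)
                node = node.right
            out.append(row)
            head = head.down
        return out

# Scripted flips: 44 tails; 9 heads, tails; 26 tails; 50 heads, tails; 12 heads, tails.
flips = iter([False, True, False, False, True, False, True, False])
sl = SkipList(lambda: next(flips))
for key in (44, 9, 26, 50, 12):
    sl.insert(key)
print(sl.levels())
```

Duplicate keys are not checked for here, in line with the assumption that all keys are distinct.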

765
00:50:58,000 --> 00:51:02,000
Jeez.
It's a good test of memory.

766
00:51:02,000 --> 00:51:05,000
37, what was it,
44 and 50?

767
00:51:05,000 --> 00:51:08,000
And 50 was at the next level
up.

768
00:51:08,000 --> 00:51:14,000
I think I should just keep
appending elements and have you

769
00:51:14,000 --> 00:51:18,000
flip coins.
OK, we just inserted 37.

770
00:51:18,000 --> 00:51:22,000
Tails.
OK, that's getting to be a long

771
00:51:22,000 --> 00:51:25,000
chain.
That looks a bit worse.

772
00:51:25,000 --> 00:51:29,000
OK, give me another number
larger than 50.

773
00:51:29,000 --> 00:51:34,000
51, good answer.
Thank you.

774
00:51:34,000 --> 00:51:37,000
OK, flip again.
And again.

775
00:51:37,000 --> 00:51:40,000
Tails.
Another number.

776
00:51:40,000 --> 00:51:45,000
Wait, someone else should pick
a number.

777
00:51:45,000 --> 00:51:49,000
It's not working.
What did you say?

778
00:51:49,000 --> 00:51:52,000
52, good answer.
Flip.

779
00:51:52,000 --> 00:51:58,000
Tails, not surprising.
We've gotten a lot of heads

780
00:51:58,000 --> 00:52:03,000
there.
OK, another number.

781
00:52:03,000 --> 00:52:06,000
53, thank you.
Flip.

782
00:52:06,000 --> 00:52:08,000
Heads, heads,
OK.

783
00:52:08,000 --> 00:52:13,000
Heads, heads,
you didn't flip.

784
00:52:13,000 --> 00:52:17,000
All right, 53,
you get the idea.

785
00:52:17,000 --> 00:52:26,000
If you get two consecutive
heads, then the guy goes up two

786
00:52:26,000 --> 00:52:32,000
levels.
OK, now flip for real.

787
00:52:32,000 --> 00:52:33,000
Heads.
Finally.

788
00:52:33,000 --> 00:52:39,000
Heads we've been waiting for.
If you flipped three heads in a

789
00:52:39,000 --> 00:52:44,000
row, you go three levels.
And each time,

790
00:52:44,000 --> 00:52:47,000
we keep promoting minus
infinity.

791
00:52:47,000 --> 00:52:50,000
Flip again.
Heads, oh my God.

792
00:52:50,000 --> 00:52:54,000
Where were they before?
Flip again.

793
00:52:54,000 --> 00:53:00,000
It better be tails this time.
Tails, good.

794
00:53:00,000 --> 00:53:04,000
OK, you get the idea.
Eventually you run out of board

795
00:53:04,000 --> 00:53:06,000
space.
Now, it's pretty rare that you

796
00:53:06,000 --> 00:53:10,000
go too high.
What's the probability that you

797
00:53:10,000 --> 00:53:13,000
go higher than log n?
Another easy log computation.

798
00:53:13,000 --> 00:53:17,000
Each time, I have a 50%
probability of going up.

799
00:53:17,000 --> 00:53:22,000
One in n probability of going
up log n levels because one half to

800
00:53:22,000 --> 00:53:24,000
the power of log n is one out of
n.
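
Written out, that computation looks like this, assuming logs are base 2 and the coin flips are fair and independent. It covers a single element; it says nothing yet about all n elements at once.

```latex
\Pr[\text{a given element is promoted at least } \lg n \text{ times}]
  = \left(\tfrac{1}{2}\right)^{\lg n} = 2^{-\lg n} = \frac{1}{n}.
```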

801
00:53:24,000 --> 00:53:28,000
So, it depends on n,
but I'm not going to go too

802
00:53:28,000 --> 00:53:32,000
high.
And, intuitively,

803
00:53:32,000 --> 00:53:37,000
this is not so bad.
So, these are skip lists.

804
00:53:37,000 --> 00:53:44,000
You have the ratios right in
expectation, which is a pretty

805
00:53:44,000 --> 00:53:49,000
weak statement.
This doesn't say anything about

806
00:53:49,000 --> 00:53:54,000
the lengths of these chains.
But intuitively,

807
00:53:54,000 --> 00:53:59,000
it's pretty good.
Let's say pretty good on

808
00:53:59,000 --> 00:54:03,000
average.
So, I had two semi-random

809
00:54:03,000 --> 00:54:05,000
processes going on here.
One is picking the numbers,

810
00:54:05,000 --> 00:54:08,000
and that, I don't want to
assume anything about.

811
00:54:08,000 --> 00:54:09,000
The numbers could be
adversarial.

812
00:54:09,000 --> 00:54:12,000
It could be sequential.
It could be reverse sorted.

813
00:54:12,000 --> 00:54:14,000
It could be random.
I don't know.

814
00:54:14,000 --> 00:54:15,000
So, it didn't matter what he
said.

815
00:54:15,000 --> 00:54:18,000
At least, it shouldn't matter.
I mean, it matters here.

816
00:54:18,000 --> 00:54:20,000
Don't worry.
You're still loved.

817
00:54:20,000 --> 00:54:22,000
You still get your $0.25.
But what the algorithm cares

818
00:54:22,000 --> 00:54:24,000
about is the outcomes of these
coins.

819
00:54:24,000 --> 00:54:27,000
And the probability,
the statement that this data

820
00:54:27,000 --> 00:54:30,000
structure is fast with high
probability is only about the

821
00:54:30,000 --> 00:54:34,000
random coins.
Right, it doesn't matter what

822
00:54:34,000 --> 00:54:38,000
the adversary chooses for
numbers as long as those coins

823
00:54:38,000 --> 00:54:43,000
are random, and the adversary
doesn't know the coins.

824
00:54:43,000 --> 00:54:46,000
It doesn't know the outcomes of
the coins.

825
00:54:46,000 --> 00:54:50,000
So, in that case,
on average, over all of the coin

826
00:54:50,000 --> 00:54:55,000
flips, you should be OK.
But the claim is not just that

827
00:54:55,000 --> 00:54:58,000
it's pretty good on average.
But, it's really,

828
00:54:58,000 --> 00:55:03,000
really good almost always.
OK, with really high

829
00:55:03,000 --> 00:55:07,000
probability it's log n.
So, for example,

830
00:55:07,000 --> 00:55:10,000
with probability,
one minus one over n,

831
00:55:10,000 --> 00:55:15,000
it's order of log n,
with probability one minus one

832
00:55:15,000 --> 00:55:19,000
over n^2 it's log n,
probability one minus one over

833
00:55:19,000 --> 00:55:24,000
n^100, it's order log n.
All those statements are true

834
00:55:24,000 --> 00:55:30,000
for any value of 100.
So, that's where we're going.

835
00:55:30,000 --> 00:55:33,000
OK, I should mention,
how do you delete in a skip

836
00:55:33,000 --> 00:55:34,000
list?
Find the element.

837
00:55:34,000 --> 00:55:37,000
You delete it all the way.
There's nothing fancy with

838
00:55:37,000 --> 00:55:40,000
delete.
Because we have all these

839
00:55:40,000 --> 00:55:43,000
independent, random choices,
all of these elements are sort

840
00:55:43,000 --> 00:55:47,000
of independent from each other.
We don't really care.

841
00:55:47,000 --> 00:55:49,000
So, delete an element,
just throw it away.

842
00:55:49,000 --> 00:55:53,000
The tricky part is insertion.
When I insert an element,

843
00:55:53,000 --> 00:55:56,000
I'm just going to randomly see
how high it should go.

844
00:55:56,000 --> 00:56:00,000
With probability one over two
to the i, it will go to height

845
00:56:00,000 --> 00:56:04,000
i.
Good, that's my time.
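
The insert and delete just described can be sketched roughly as follows. This is a minimal sketch, not the lecture's own code: the names `Node`, `SkipList`, `_random_height` and `_predecessors` are made up for illustration, and the minus-infinity sentinel here is simply kept as tall as the tallest element rather than one level above it.

```python
import random

class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # next[i] is the successor on level i

class SkipList:
    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.head = Node(float("-inf"), 1)  # the minus-infinity sentinel

    def _random_height(self):
        # Flip until tails: height i comes up with probability 1 / 2**i.
        h = 1
        while self.rng.random() < 0.5:  # heads: promote one more level
            h += 1
        return h

    def _predecessors(self, key):
        # Walk right and down; record the last node before `key` on each level.
        preds = [self.head] * len(self.head.next)
        node = self.head
        for level in reversed(range(len(self.head.next))):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            preds[level] = node
        return preds

    def search(self, key):
        cand = self._predecessors(key)[0].next[0]
        return cand is not None and cand.key == key

    def insert(self, key):
        if self.search(key):
            return
        h = self._random_height()
        while len(self.head.next) < h:
            self.head.next.append(None)  # keep promoting minus infinity
        preds = self._predecessors(key)
        node = Node(key, h)
        for level in range(h):
            node.next[level] = preds[level].next[level]
            preds[level].next[level] = node

    def delete(self, key):
        # Nothing fancy: find the element and unlink it on every level it reached.
        preds = self._predecessors(key)
        target = preds[0].next[0]
        if target is None or target.key != key:
            return False
        for level in range(len(target.next)):
            preds[level].next[level] = target.next[level]
        return True
```

The only randomness is in `_random_height`; the keys themselves can come in any order, which is the point made above about the adversary not seeing the coins.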

846
00:56:04,000 --> 00:56:08,000
I've been having too much fun
here.

847
00:56:08,000 --> 00:56:14,000
I've got to go a little bit
faster, OK.

848
00:56:25,000 --> 00:56:32,000
So here's the theorem.
Let's see exactly what we are

849
00:56:32,000 --> 00:56:38,000
proving first.
With high probability,

850
00:56:38,000 --> 00:56:46,000
this is a formal notion which I
will define in a second.

851
00:56:46,000 --> 00:56:55,000
Every search in an n-element skip
list costs order log n.

852
00:56:55,000 --> 00:57:03,000
So, that's the theorem.
Now I need to define with high

853
00:57:03,000 --> 00:57:06,000
probability.
So, with high probability.

854
00:57:06,000 --> 00:57:10,000
And, it's a bit of a long
phrase.

855
00:57:10,000 --> 00:57:15,000
So, often we will,
and you can abbreviate it WHP.

856
00:57:15,000 --> 00:57:20,000
So, if I have a random event,
and the random event here is

857
00:57:20,000 --> 00:57:26,000
that every search in an n
element skip list costs order

858
00:57:26,000 --> 00:57:32,000
log n, I want to know what it
means for that event E to occur

859
00:57:32,000 --> 00:57:36,000
with high probability.

860
00:57:47,000 --> 00:57:53,000
So this is the definition.
So, the statement is that for

861
00:57:53,000 --> 00:58:00,000
any alpha greater than or equal
to one, there is a suitable

862
00:58:00,000 --> 00:58:04,000
choice of constants --

863
00:58:16,000 --> 00:58:27,000
-- for which the event,
E, occurs with this probability

864
00:58:27,000 --> 00:58:37,000
I keep mentioning.
So, the probability at least

865
00:58:37,000 --> 00:58:46,000
one minus one over n to the
alpha.

866
00:58:46,000 --> 00:58:49,000
So, this is a bit imprecise,
but it will suffice for our

867
00:58:49,000 --> 00:58:52,000
purposes.
If you want a really formal

868
00:58:52,000 --> 00:58:55,000
definition, you can read the
lecture notes.

869
00:58:55,000 --> 00:58:59,000
There are special lecture notes
for this lecture on the stellar

870
00:58:59,000 --> 00:59:01,000
site.
And, there's the PowerPoint

871
00:59:01,000 --> 00:59:06,000
notes on the SMA site.
But, right, there's a bit of a

872
00:59:06,000 --> 00:59:08,000
subtlety in the choice of
constants here.

873
00:59:08,000 --> 00:59:11,000
There is a choice of this
constant.

874
00:59:11,000 --> 00:59:14,000
And there's a choice of this
constant.

875
00:59:14,000 --> 00:59:16,000
And, these are related.
And, there's alpha,

876
00:59:16,000 --> 00:59:19,000
which we get to set to whatever we
want.

877
00:59:19,000 --> 00:59:22,000
But the bottom line is,
we get to choose what

878
00:59:22,000 --> 00:59:24,000
probability we want this to be
true.

879
00:59:24,000 --> 00:59:28,000
If I want it to be true,
with probability one minus one

880
00:59:28,000 --> 00:59:32,000
over n^100, I can do that.
I just set alpha to a hundred,

881
00:59:32,000 --> 00:59:37,000
and up to this little constant
that's going to grow much slower

882
00:59:37,000 --> 00:59:41,000
than n to the alpha.
I get the error probability.

883
00:59:41,000 --> 00:59:45,000
So this thing is called the
error probability.

884
00:59:45,000 --> 00:59:48,000
The probability that I fail is
polynomially small,

885
00:59:48,000 --> 00:59:51,000
for any polynomial I want.
Now, with the same data

886
00:59:51,000 --> 00:59:54,000
structure, right,
I fixed the data structure.

887
00:59:54,000 --> 00:59:57,000
It doesn't depend on alpha.
Anything you want,

888
00:59:57,000 --> 01:00:01,717
any alpha value you want,
this data structure will take

889
01:00:01,717 --> 01:00:06,692
order of log n time.
Now, this constant will depend

890
01:00:06,692 --> 01:00:08,666
on alpha.
So, you know,

891
01:00:08,666 --> 01:00:14,141
if you want error probability one
over n^100, it's probably going to

892
01:00:14,141 --> 01:00:17,461
be, like, 100 log n.
It's still log n.

893
01:00:17,461 --> 01:00:22,128
OK, this is a very strong claim
about the tail of the

894
01:00:22,128 --> 01:00:27,064
distribution of the running time
of search, very strong.

895
01:00:27,064 --> 01:00:32,000
Let me give you an idea of how
strong it is.

896
01:00:32,000 --> 01:00:36,731
How many people know what
Boole's inequality is?

897
01:00:36,731 --> 01:00:42,671
How many people know what the
union bound is in probability?

898
01:00:42,671 --> 01:00:45,691
You should.
It's in Appendix C.

899
01:00:45,691 --> 01:00:49,214
Maybe you'll know it by the
theorem.

900
01:00:49,214 --> 01:00:55,154
It's good to know it by name.
It's sort of like linearity of

901
01:00:55,154 --> 01:00:58,476
expectations.
It's a lot easier to

902
01:00:58,476 --> 01:01:03,978
communicate to someone.
Linearity of expectations:

903
01:01:03,978 --> 01:01:07,554
instead of saying,
you know that thing where you

904
01:01:07,554 --> 01:01:11,510
sum up all the expectations of
things, and that's the

905
01:01:11,510 --> 01:01:15,086
expectation of the sum?
It's a lot easier to say

906
01:01:15,086 --> 01:01:18,815
linearity of expectation.
So, let me quiz you in a

907
01:01:18,815 --> 01:01:21,706
different way.
So, if I take a bunch of

908
01:01:21,706 --> 01:01:26,119
events, and I take their union,
either this happens or this

909
01:01:26,119 --> 01:01:29,847
happens, or so on.
So, this is the inclusive OR of

910
01:01:29,847 --> 01:01:31,521
k events.
And, instead,

911
01:01:31,521 --> 01:01:37,000
I look at the sum of the
probabilities of those events.

912
01:01:37,000 --> 01:01:40,111
OK, easy question:
are these equal?

913
01:01:40,111 --> 01:01:42,947
No, unless they are
independent.

914
01:01:42,947 --> 01:01:47,248
But can I say anything about
them, any relation?

915
01:01:47,248 --> 01:01:51,183
Smaller, yeah.
This is less than or equal to

916
01:01:51,183 --> 01:01:54,477
that.
OK, this should be intuitive to

917
01:01:54,477 --> 01:01:57,771
you from a probability point of
view.

918
01:01:57,771 --> 01:02:01,705
Look at the textbook.
OK: very basic result,

919
01:02:01,705 --> 01:02:07,041
trivial result almost.
What does this tell us?

920
01:02:07,041 --> 01:02:11,479
Well, suppose that E_i is some
kind of error event.

921
01:02:11,479 --> 01:02:15,295
We don't want it to happen.
OK, and suppose,

922
01:02:15,295 --> 01:02:19,467
mix some letters here.
Suppose I have a bunch of

923
01:02:19,467 --> 01:02:23,017
events which occur with high
probability.

924
01:02:23,017 --> 01:02:26,745
OK, call those E_i complement.
So, suppose,

925
01:02:26,745 --> 01:02:31,893
so this is the end of that
statement, E_i complement occurs

926
01:02:31,893 --> 01:02:37,063
with high probability.
OK, so then the probability of

927
01:02:37,063 --> 01:02:39,609
E_i is very small,
polynomially small.

928
01:02:39,609 --> 01:02:42,636
One over n to the alpha for any
alpha I want.

929
01:02:42,636 --> 01:02:46,007
Now, suppose I take a whole
bunch of these events,

930
01:02:46,007 --> 01:02:48,690
and let's say that k is
polynomial in n.

931
01:02:48,690 --> 01:02:52,405
So, I take a bunch of events,
which I'd like to happen.

932
01:02:52,405 --> 01:02:54,882
They all occur with high
probability.

933
01:02:54,882 --> 01:02:57,565
There are only polynomially many
of them.

934
01:02:57,565 --> 01:03:00,316
So let's say,
let me give this constant a

935
01:03:00,316 --> 01:03:03,000
name.
Let's call it c.

936
01:03:03,000 --> 01:03:05,873
Let's say I take n to the c
such events.

937
01:03:05,873 --> 01:03:09,926
Well, what's the probability
that all those events occur

938
01:03:09,926 --> 01:03:12,873
together?
Because they should most of the

939
01:03:12,873 --> 01:03:17,073
time occur together because
each one occurs most of the

940
01:03:17,073 --> 01:03:19,578
time, occurs with high
probability.

941
01:03:19,578 --> 01:03:23,115
So, I want to look at E_1 bar
intersect, E_2 bar,

942
01:03:23,115 --> 01:03:25,842
and so on.
So, each of these occurs with

943
01:03:25,842 --> 01:03:29,378
high probability.
What's the chance that they all

944
01:03:29,378 --> 01:03:32,166
occur?
It's also with high

945
01:03:32,166 --> 01:03:34,316
probability.
I'm changing the alpha.

946
01:03:34,316 --> 01:03:37,817
So, the union bound tells me
the probability of any one of

947
01:03:37,817 --> 01:03:40,090
these failing,
the probability of this

948
01:03:40,090 --> 01:03:42,608
failing, or this failing,
or this failing,

949
01:03:42,608 --> 01:03:44,573
which is this thing,
is, at most,

950
01:03:44,573 --> 01:03:47,276
the sum of the probabilities of
each failure.

951
01:03:47,276 --> 01:03:49,303
These are the error
probabilities.

952
01:03:49,303 --> 01:03:52,619
I know that each of them is,
at most, one over n to the

953
01:03:52,619 --> 01:03:55,875
alpha, with a constant in front.
If I add them all up,

954
01:03:55,875 --> 01:03:57,779
there's only n to the c of
them.

955
01:03:57,779 --> 01:04:01,034
So, I take this error
probability, and I multiply by n

956
01:04:01,034 --> 01:04:05,400
to the c.
So, I get like n to the c over

957
01:04:05,400 --> 01:04:08,679
n to the alpha,
which is one over n to the

958
01:04:08,679 --> 01:04:11,960
alpha minus c.
I can set alpha as big as I

959
01:04:11,960 --> 01:04:13,880
want.
So, I set it much,

960
01:04:13,880 --> 01:04:17,880
much bigger than c,
and this event occurs with high

961
01:04:17,880 --> 01:04:21,000
probability.
I sort of made a mess here,

962
01:04:21,000 --> 01:04:25,719
but this event occurs with high
probability because of this.

963
01:04:25,719 --> 01:04:30,599
Whatever the constant is here,
however many events I'm taking,

964
01:04:30,599 --> 01:04:35,000
I just set alpha to be bigger
than that.

965
01:04:35,000 --> 01:04:37,951
And, this event will occur with
high probability,

966
01:04:37,951 --> 01:04:40,041
too.
So, when I say here that every

967
01:04:40,041 --> 01:04:42,992
search costs order log n with
high probability,

968
01:04:42,992 --> 01:04:46,005
not only do I mean that if you
look at one search,

969
01:04:46,005 --> 01:04:48,587
it costs order log n with high
probability.

970
01:04:48,587 --> 01:04:51,969
You look at another search,
and it costs log n with high

971
01:04:51,969 --> 01:04:54,244
probability.
I mean, if you take every

972
01:04:54,244 --> 01:04:57,318
search, all of them take order
log n time with high

973
01:04:57,318 --> 01:04:59,593
probability.
So, this event that every

974
01:04:59,593 --> 01:05:03,036
single search you do takes order
log n, is true with high

975
01:05:03,036 --> 01:05:06,663
probability, assuming the number
of searches you are doing is

976
01:05:06,663 --> 01:05:10,887
polynomial in n.
So, I'm assuming that I'm not

977
01:05:10,887 --> 01:05:14,467
using this data structure
forever, just for a polynomial

978
01:05:14,467 --> 01:05:17,136
amount of time.
But, who's got more than a

979
01:05:17,136 --> 01:05:19,218
polynomial amount of time
anyway?

980
01:05:19,218 --> 01:05:21,757
This is MIT.
So, hopefully that's clear.
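
Written out, the union bound argument above looks roughly like this. It assumes k = n^c events E_i, each with error probability O(1/n^alpha), using the same symbols as the spoken derivation.

```latex
\Pr\Big[\bigcup_{i=1}^{k} E_i\Big] \le \sum_{i=1}^{k} \Pr[E_i]
  \le n^{c} \cdot O\!\left(\frac{1}{n^{\alpha}}\right)
  = O\!\left(\frac{1}{n^{\alpha - c}}\right),
\qquad\text{so}\qquad
\Pr\Big[\bigcap_{i=1}^{k} \overline{E_i}\Big] \ge 1 - O\!\left(\frac{1}{n^{\alpha - c}}\right).
```

Since alpha can be set as large as we like, alpha minus c can too, so the intersection still holds with high probability.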

981
01:05:21,757 --> 01:05:24,035
We'll see it a few more times.
Yeah?

982
01:05:24,035 --> 01:05:26,443
The algorithm doesn't depend on
Alpha.

983
01:05:26,443 --> 01:05:31,000
The question is how do you
choose alpha in the algorithm.

984
01:05:31,000 --> 01:05:33,925
So, we don't need to.
This is just sort of for an

985
01:05:33,925 --> 01:05:36,668
analysis tool.
This is saying that the farther

986
01:05:36,668 --> 01:05:39,838
out you get, so you say,
well, what's the probability

987
01:05:39,838 --> 01:05:43,190
that it's more than ten log n?
Well, it's like one over n^10.

988
01:05:43,190 --> 01:05:46,238
Let's say it's linear.
Well, what's the chance that

989
01:05:46,238 --> 01:05:49,407
you're more than 20 log n?
Well that's one over n^20.

990
01:05:49,407 --> 01:05:52,942
So, the point is the tail of
this distribution is getting

991
01:05:52,942 --> 01:05:54,466
really small,
really fast.

992
01:05:54,466 --> 01:05:57,758
And so choosing alpha is more
like sort of for your own

993
01:05:57,758 --> 01:06:00,135
feeling good.
OK, you can set it to 100,

994
01:06:00,135 --> 01:06:05,209
and then n is at least two.
So, that's like one over 2^100

995
01:06:05,209 --> 01:06:08,082
chance that you fail.
That's damn small.

996
01:06:08,082 --> 01:06:11,322
If you've got a real random
number generator,

997
01:06:11,322 --> 01:06:15,668
the chance that you're going to
hit one over 2^200 is pretty

998
01:06:15,668 --> 01:06:18,762
tiny, right?
So, let's say you set alpha to

999
01:06:18,762 --> 01:06:21,266
256, which is always a good
number.

1000
01:06:21,266 --> 01:06:25,759
2^256 is much bigger than the
number of particles in the known

1001
01:06:25,759 --> 01:06:29,000
universe, so,
the light matter.

1002
01:06:29,000 --> 01:06:32,898
So, actually I think this even
accounts for some notion of dark

1003
01:06:32,898 --> 01:06:34,533
matter.
So, this is really,

1004
01:06:34,533 --> 01:06:37,615
really, really big.
So, the chance that you pick a

1005
01:06:37,615 --> 01:06:41,576
random particle in the universe
that happens to be your favorite

1006
01:06:41,576 --> 01:06:45,161
particle, this one right here,
that's over one over 2^256,

1007
01:06:45,161 --> 01:06:47,487
or even smaller.
So, set alpha to 256,

1008
01:06:47,487 --> 01:06:51,260
the chance that your algorithm
takes more than order log n time

1009
01:06:51,260 --> 01:06:54,907
is a lot smaller than the chance
that a meteor strikes your

1010
01:06:54,907 --> 01:06:58,680
computer at the same time that
it has a floating point error,

1011
01:06:58,680 --> 01:07:02,642
at the same time that the earth
explodes because they're putting

1012
01:07:02,642 --> 01:07:06,415
a transport through this part of
the solar system at the same

1013
01:07:06,415 --> 01:07:08,113
time, I mean,
I could go on,

1014
01:07:08,113 --> 01:07:10,752
right?
It's really,

1015
01:07:10,752 --> 01:07:13,510
really unlikely that you are
more than log n.

1016
01:07:13,510 --> 01:07:15,705
And how unlikely:
you get to choose.

1017
01:07:15,705 --> 01:07:19,467
But it's just in the analysis;
the algorithm doesn't depend on

1018
01:07:19,467 --> 01:07:21,159
it.
It's the same algorithm,

1019
01:07:21,159 --> 01:07:23,040
very cool.
Sometimes, with high

1020
01:07:23,040 --> 01:07:25,297
probability bounds depend on
alpha.

1021
01:07:25,297 --> 01:07:27,680
I mean, the algorithm depends
on alpha.

1022
01:07:27,680 --> 01:07:32,307
But here, it will not.
OK, away we go.

1023
01:07:32,307 --> 01:07:37,692
So now you all understand the
claim.

1024
01:07:37,692 --> 01:07:45,384
So let's do a warm up.
We will also need this fact.

1025
01:07:45,384 --> 01:07:52,769
But it's pretty easy.
The lemma is that with high

1026
01:07:52,769 --> 01:08:01,692
probability, the number of
levels in the skip list is order

1027
01:08:01,692 --> 01:08:06,266
log n.
I think it's order log n,

1028
01:08:06,266 --> 01:08:09,349
certainly.
So, how do we prove that

1029
01:08:09,349 --> 01:08:12,613
something happens with high
probability?

1030
01:08:12,613 --> 01:08:18,144
Compute the probability that it
happened; show that it's high.

1031
01:08:18,144 --> 01:08:22,676
Even if you don't know what
high probability means,

1032
01:08:22,676 --> 01:08:26,122
in fact, I used to ask that
earlier on.

1033
01:08:26,122 --> 01:08:30,746
So, let's compute the chance
that it doesn't happen,

1034
01:08:30,746 --> 01:08:35,551
the error probability,
because that's just one minus

1035
01:08:35,551 --> 01:08:39,448
that, and it's cleaner.
So, I'd like to say,

1036
01:08:39,448 --> 01:08:42,710
let's say, that it's,
at most, c log n levels.

1037
01:08:42,710 --> 01:08:46,115
So, what's the error
probability for that event?

1038
01:08:46,115 --> 01:08:50,028
This is sort of an event.
I'll put it in squiggles just

1039
01:08:50,028 --> 01:08:53,000
like a set.
This is the probability that

1040
01:08:53,000 --> 01:08:56,260
there are strictly more than c
log n levels.

1041
01:08:56,260 --> 01:09:00,173
So, I want to say that that
probability is particularly

1042
01:09:00,173 --> 01:09:04,683
small, polynomially small.
Well, how do I make levels?

1043
01:09:04,683 --> 01:09:07,551
When I insert an element,
with probability a half,

1044
01:09:07,551 --> 01:09:09,984
it goes up.
And, the number of levels in

1045
01:09:09,984 --> 01:09:13,725
the skip list is the max over
all the elements of how high it

1046
01:09:13,725 --> 01:09:15,035
goes up.
But, max, oh,

1047
01:09:15,035 --> 01:09:17,779
that's a mess.
All right, you can compute the

1048
01:09:17,779 --> 01:09:21,022
expectation of the max if you
have a bunch of random

1049
01:09:21,022 --> 01:09:24,202
variables; each expectation
is a constant,

1050
01:09:24,202 --> 01:09:26,759
and you take the max.
It's like log n in

1051
01:09:26,759 --> 01:09:31,000
expectation, but we want a much
stronger statement.

1052
01:09:31,000 --> 01:09:35,815
And, we have this Boole's
inequality that says I have a

1053
01:09:35,815 --> 01:09:39,471
bunch of things,
polynomially many things.

1054
01:09:39,471 --> 01:09:43,841
Let's say we have n items.
Each one independently,

1055
01:09:43,841 --> 01:09:47,142
I don't even care if they're
independent.

1056
01:09:47,142 --> 01:09:52,582
If it goes up more than c log
n, yeah, the number of levels is

1057
01:09:52,582 --> 01:09:55,258
more than c log n.
So, this is,

1058
01:09:55,258 --> 01:10:00,163
at most, and then I want to
know, do any of those events

1059
01:10:00,163 --> 01:10:03,017
happen for any of the n
elements?

1060
01:10:03,017 --> 01:10:06,762
So, I just multiplied by n.
It's certainly,

1061
01:10:06,762 --> 01:10:10,597
at most, n times the
probability that x gets

1062
01:10:10,597 --> 01:10:15,502
promoted, this much here,
greater than or equal to c log n

1063
01:10:15,502 --> 01:10:18,734
times.
OK, if I pick,

1064
01:10:18,734 --> 01:10:21,041
for any element,
x, because it's the same for

1065
01:10:21,041 --> 01:10:23,191
each element.
They are done independently.

1066
01:10:23,191 --> 01:10:26,179
So, I'm just summing over x
here, and that's just a factor

1067
01:10:26,179 --> 01:10:26,756
of n.
Clear?

1068
01:10:26,756 --> 01:10:29,588
This is Boole's inequality.
Now, what's the probability

1069
01:10:29,588 --> 01:10:32,000
that x gets promoted c log n
times?

1070
01:10:32,000 --> 01:10:36,646
We did this before for log n.
It was one over n.

1071
01:10:36,646 --> 01:10:40,305
For c log n,
it's one over n to the c.

1072
01:10:40,305 --> 01:10:44,161
OK, this is n times two.
Let's be nicer:

1073
01:10:44,161 --> 01:10:47,324
one half to the power of c log
n.

1074
01:10:47,324 --> 01:10:53,257
One half to the power of c log
n is one over two to the c log

1075
01:10:53,257 --> 01:10:55,926
n.
The log n comes out here,

1076
01:10:55,926 --> 01:10:58,991
becomes an n.
We get n to the c.

1077
01:10:58,991 --> 01:11:05,022
So, this is n divided by n to
the c, which is n to the c minus

1078
01:11:05,022 --> 01:11:09,904
one.
And, I get to choose c to be

1079
01:11:09,904 --> 01:11:14,676
whatever I want.
So, I choose c minus one to be

1080
01:11:14,676 --> 01:11:17,477
alpha.
I think exactly that.

1081
01:11:17,477 --> 01:11:21,626
Oh, sorry, one over n to the c
minus one.

1082
01:11:21,626 --> 01:11:24,634
Thank you.
It better be small.

1083
01:11:24,634 --> 01:11:30,236
This is an upper bound.
So, probability is polynomially

1084
01:11:30,236 --> 01:11:32,956
small.
I get to choose,

1085
01:11:32,956 --> 01:11:36,484
and this is a bit of the trick.
I'm choosing this constant to

1086
01:11:36,484 --> 01:11:38,397
be large, large enough for
alpha.

1087
01:11:38,397 --> 01:11:40,610
The point is,
as c grows, alpha grows.

1088
01:11:40,610 --> 01:11:43,480
Therefore, I can set alpha to
be whatever I want,

1089
01:11:43,480 --> 01:11:46,290
set c accordingly.
So, there's a little bit more

1090
01:11:46,290 --> 01:11:49,459
words that have to go here.
But, they're in the notes.

1091
01:11:49,459 --> 01:11:51,851
I can set alpha to be as large
as I want.

1092
01:11:51,851 --> 01:11:55,199
So, I can make this probability
as small as I want in the

1093
01:11:55,199 --> 01:11:56,993
polynomial sense.
So, that's it.

1094
01:11:56,993 --> 01:11:58,727
Number of levels,
order log n:

1095
01:11:58,727 --> 01:12:02,224
wasn't that easy?
Boole's inequality:

1096
01:12:02,224 --> 01:12:06,026
the point is that when you're
dealing with high probability,

1097
01:12:06,026 --> 01:12:09,377
use Boole's inequality.
And, anything that's true for

1098
01:12:09,377 --> 01:12:12,664
one element is true for all of
them, just like that.
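
The arithmetic in the lemma can be checked numerically. This is a small sketch under the lecture's assumptions (fair coins, logs base two); the function name `levels_error_bound` is made up for illustration.

```python
import math

def levels_error_bound(n, c):
    """Union-bound estimate of Pr[more than c*log2(n) levels].

    Each of the n elements is promoted c*log2(n) times with probability
    (1/2)**(c*log2(n)) = 1/n**c, and Boole's inequality multiplies that by n.
    """
    return n * 0.5 ** (c * math.log2(n))

n = 1024
for c in (2, 3, 5):
    # The two printed numbers should agree: n / n**c == 1 / n**(c - 1).
    print(c, levels_error_bound(n, c), n ** (1 - c))
```

Setting c to alpha plus one gives an error probability of one over n to the alpha, which is how c gets chosen in the lemma.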

1099
01:12:12,664 --> 01:12:15,886
Just lose a factor of n,
but that's just one in the

1100
01:12:15,886 --> 01:12:18,271
alpha, and alpha is big:
big constant,

1101
01:12:18,271 --> 01:12:21,106
but it's big.
OK, so let's prove the theorem.

1102
01:12:21,106 --> 01:12:23,813
With high probability, searches cost
order log n.

1103
01:12:23,813 --> 01:12:27,422
We now know the height is order
log n, but it depends how

1104
01:12:27,422 --> 01:12:32,756
balanced this thing is.
It depends how long the chains

1105
01:12:32,756 --> 01:12:36,800
are to really know that a search
costs log n.

1106
01:12:36,800 --> 01:12:41,210
Just knowing a bound on the
height is not enough,

1107
01:12:41,210 --> 01:12:45,805
unlike a binary tree.
So, we have one cool idea for

1108
01:12:45,805 --> 01:12:49,389
this analysis.
And it's called backwards

1109
01:12:49,389 --> 01:12:52,697
analysis.
So, normally you think of a

1110
01:12:52,697 --> 01:12:58,210
search as starting in the top
left corner going right and down

1111
01:12:58,210 --> 01:13:04,000
until you get to the item that
you're looking for.

1112
01:13:04,000 --> 01:13:07,423
I'm going to look at the
reverse process.

1113
01:13:07,423 --> 01:13:12,558
You start at the item you're
looking for, and you go left and

1114
01:13:12,558 --> 01:13:15,896
up until you get to the top left
corner.

1115
01:13:15,896 --> 01:13:20,175
The number of steps in those
two walks is the same.

1116
01:13:20,175 --> 01:13:23,855
And, I'm not implementing an
algorithm here,

1117
01:13:23,855 --> 01:13:27,792
I'm just doing analysis.
So, those are the same

1118
01:13:27,792 --> 01:13:32,671
processes, just in reverse.
So, here's what it looks like.

1119
01:13:32,671 --> 01:13:35,409
You have a search,
and it starts,

1120
01:13:35,409 --> 01:13:42,000
which really means that it ends
at a node in the bottom list.

1121
01:13:42,000 --> 01:13:46,845
Then, each time you visit a
node in this search,

1122
01:13:46,845 --> 01:13:52,618
you either go left or up.
And, when do you go left or up?

1123
01:13:52,618 --> 01:13:56,639
Well, it depends what the coin
flip was.

1124
01:13:56,639 --> 01:14:02,000
So, if the node wasn't promoted
at this level.

1125
01:14:02,000 --> 01:14:08,317
So, if it wasn't promoted
higher, and that happened

1126
01:14:08,317 --> 01:14:14,003
exactly when we got a tails.
Then, we go left,

1127
01:14:14,003 --> 01:14:19,057
which really means we came from
the left.

1128
01:14:19,057 --> 01:14:25,754
Or, if we got a heads,
so if this node was promoted to

1129
01:14:25,754 --> 01:14:31,440
the next level,
which happened whenever we got

1130
01:14:31,440 --> 01:14:37,000
a heads at that particular
moment.

1131
01:14:37,000 --> 01:14:42,860
This is in the past some time
when we did the insertion.

1132
01:14:42,860 --> 01:14:45,844
Then we go, or came from,
up.

1133
01:14:45,844 --> 01:14:51,704
And, we stop at the root.
This is really where we start;

1134
01:14:51,704 --> 01:14:55,967
same thing.
So, either at the root or I'm

1135
01:14:55,967 --> 01:15:03,000
also going to think of this as
stopping at minus infinity.

1136
01:15:03,000 --> 01:15:05,562
OK, that was a bit messy,
but let me review.

1137
01:15:05,562 --> 01:15:08,602
So, normally we start up here.
Well, just looking at

1138
01:15:08,602 --> 01:15:11,344
everything backwards,
and in brackets is what's

1139
01:15:11,344 --> 01:15:13,966
really happening.
So, this search ends at the

1140
01:15:13,966 --> 01:15:17,364
node you were looking for.
It's always in the bottom list.

1141
01:15:17,364 --> 01:15:19,807
Then it says,
well, was this node promoted

1142
01:15:19,807 --> 01:15:21,952
higher?
If it was, I came from above.

1143
01:15:21,952 --> 01:15:25,410
If not, I came from the left.
It must have been in the bottom

1144
01:15:25,410 --> 01:15:28,033
chain somewhere.
OK, and that's true at every

1145
01:15:28,033 --> 01:15:31,870
node you visit.
It depends whether that coin

1146
01:15:31,870 --> 01:15:35,806
flipped heads or tails at the
time that you inserted that node

1147
01:15:35,806 --> 01:15:38,774
into that level.
But, these are just a bunch of

1148
01:15:38,774 --> 01:15:40,774
events.
I'm just going to check,

1149
01:15:40,774 --> 01:15:44,258
what is the probability that
it's heads, and what is the

1150
01:15:44,258 --> 01:15:47,096
probability that it's tails?
It's always a half.

1151
01:15:47,096 --> 01:15:50,516
Every time I look at a coin
flip, when it was flipped,

1152
01:15:50,516 --> 01:15:54,000
there was a probability of a half
of going either way.

1153
01:15:54,000 --> 01:15:56,967
That's the magic.
And, I'm not using that these

1154
01:15:56,967 --> 01:16:02,248
events are independent anyway.
For every element that I search

1155
01:16:02,248 --> 01:16:05,584
for, for every value,
x, that's another search.

1156
01:16:05,584 --> 01:16:08,123
Those events may not be
independent.

1157
01:16:08,123 --> 01:16:12,112
I can still use Boole's
inequality and conclude that all

1158
01:16:12,112 --> 01:16:15,375
of them are order log n with
high probability.

1159
01:16:15,375 --> 01:16:19,582
As long as I can prove that any
one event happens with high

1160
01:16:19,582 --> 01:16:22,556
probability.
So, I don't need independence

1161
01:16:22,556 --> 01:16:26,835
between them. I know that these coin
flips in a single search are

1162
01:16:26,835 --> 01:16:30,969
independent, but everything
else, for different searches I

1163
01:16:30,969 --> 01:16:35,803
don't care.
So, how long can this process

1164
01:16:35,803 --> 01:16:39,283
go on?
We want to know how many times

1165
01:16:39,283 --> 01:16:44,309
can I make this walk?
Well, when I hit the root node,

1166
01:16:44,309 --> 01:16:47,983
I'm done.
Well, how quickly would I hit

1167
01:16:47,983 --> 01:16:51,559
the root node?
Well, with probability,

1168
01:16:51,559 --> 01:16:57,068
a half, I go up each step.
The number of times I go up is,

1169
01:16:57,068 --> 01:17:02,000
at most, the number of levels
minus one.

1170
01:17:02,000 --> 01:17:05,410
And that's order log n with
high probability.

1171
01:17:05,410 --> 01:17:07,813
So, this is the only other
idea.

1172
01:17:07,813 --> 01:17:10,682
So, we are now proving this
theorem.

1173
01:17:10,682 --> 01:17:15,333
So, the number of up moves in a
search, which are really down

1174
01:17:15,333 --> 01:17:19,054
moves, but same thing,
is less than the number of

1175
01:17:19,054 --> 01:17:22,000
levels.
Certainly, you can't go up more

1176
01:17:22,000 --> 01:17:24,713
than there are levels in the
search.

1177
01:17:24,713 --> 01:17:27,968
And in insert,
you can go arbitrarily high.

1178
01:17:27,968 --> 01:17:32,000
But a search:
as high as you can go.

1179
01:17:32,000 --> 01:17:34,821
And this is,
at most, c log n with high

1180
01:17:34,821 --> 01:17:37,866
probability.
This is what we proved in the

1181
01:17:37,866 --> 01:17:40,242
lemma.
So, we have a bound on the

1182
01:17:40,242 --> 01:17:42,990
number of up moves.
Half of the moves,

1183
01:17:42,990 --> 01:17:45,440
roughly, are going to be up
moves.

1184
01:17:45,440 --> 01:17:49,004
So, this pretty much bounds
the number of moves.

1185
01:17:49,004 --> 01:17:51,752
Not quite.
So, what this means is that

1186
01:17:51,752 --> 01:17:54,797
with high probability,
so this is the same

1187
01:17:54,797 --> 01:17:58,955
probability, but I could choose
that as high as I want by

1188
01:17:58,955 --> 01:18:03,553
setting c large enough.
The number of moves,

1189
01:18:03,553 --> 01:18:06,893
in other words,
the cost of the search is at

1190
01:18:06,893 --> 01:18:11,320
most the number of coin flips
until we get c log n heads,

1191
01:18:11,320 --> 01:18:15,747
right, because in every step of
the search, I make a move,

1192
01:18:15,747 --> 01:18:19,009
and then I flip another coin,
conceptually.

1193
01:18:19,009 --> 01:18:22,504
There is another independent
coin lying there.

1194
01:18:22,504 --> 01:18:27,165
And it's either heads or tails.
Each of those is independent.

1195
01:18:27,165 --> 01:18:31,902
So, how many independent coin
flips does it take until I get c

1196
01:18:31,902 --> 01:18:37,206
log n heads?
The claim is that that's order

1197
01:18:37,206 --> 01:18:42,979
log n with high probability.
But we need to prove that.

1198
01:18:42,979 --> 01:18:48,324
So, this is a claim.
So, if you just sit there with

1199
01:18:48,324 --> 01:18:55,058
a coin, and you want to know how
many times does it take until I

1200
01:18:55,058 --> 01:19:00,082
get c log n heads,
the claim is that that number

1201
01:19:00,082 --> 01:19:05,000
is order log n with high
probability.
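
The claim above can be checked empirically. The following is a minimal sketch, not part of the lecture: it flips a simulated fair coin until c log n heads appear and reports how many flips that took. The function names, the choice c = 2, the base-2 logarithm, and the trial count are all assumptions made for illustration.

```python
import math
import random

def flips_until_heads(target_heads, rng):
    """Flip a fair coin until target_heads heads appear; return total flips."""
    flips = 0
    heads = 0
    while heads < target_heads:
        flips += 1
        if rng.random() < 0.5:  # heads with probability 1/2
            heads += 1
    return flips

def simulate(n, c=2, trials=2000, seed=0):
    """Return (target heads, worst-case flips, average flips) over many trials."""
    rng = random.Random(seed)
    target = math.ceil(c * math.log2(n))  # c log n heads, log taken base 2
    results = [flips_until_heads(target, rng) for _ in range(trials)]
    return target, max(results), sum(results) / trials

if __name__ == "__main__":
    target, worst, mean = simulate(1024)
    print(f"heads needed: {target}, worst flips: {worst}, mean flips: {mean:.1f}")
```

The average should sit near twice the number of heads, and even the worst trial should stay within a small constant multiple of c log n, which is what "order log n with high probability" predicts.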

1202
01:19:05,000 --> 01:19:08,595
As long as I prove that,
I know that the total number of

1203
01:19:08,595 --> 01:19:11,276
steps I make,
which is the number of heads

1204
01:19:11,276 --> 01:19:15,394
and tails is order log n because
I definitely know the number of

1205
01:19:15,394 --> 01:19:17,094
heads is, at most,
c log n.

1206
01:19:17,094 --> 01:19:21,147
The claim is that the number of
tails can't be too much bigger.

1207
01:19:21,147 --> 01:19:23,174
Notice, I can't just say c
here.

1208
01:19:23,174 --> 01:19:25,985
OK, it's really important that
I have log n.

1209
01:19:25,985 --> 01:19:28,208
Why?
Because with high probability,

1210
01:19:28,208 --> 01:19:32,000
it depends on n.
This notion depends on n.

1211
01:19:32,000 --> 01:19:35,434
Log n: it's true.
Anything bigger than log n:

1212
01:19:35,434 --> 01:19:38,087
it's true, like n.
If I put n here,

1213
01:19:38,087 --> 01:19:41,756
this is also true.
But, if I put a constant or a

1214
01:19:41,756 --> 01:19:46,126
log log n, this is not true.
It's really important that I

1215
01:19:46,126 --> 01:19:50,184
have log n here because my
notion of high probability

1216
01:19:50,184 --> 01:19:54,321
depends on what's written here.
OK, is it clear so far?

1217
01:19:54,321 --> 01:19:57,912
We're almost done,
which is good because I just

1218
01:19:57,912 --> 01:20:01,190
ran out of time.
Sorry, we're going to go a

1219
01:20:01,190 --> 01:20:07,528
couple minutes over.
So, I want to compute the error

1220
01:20:07,528 --> 01:20:12,308
probability here.
So, I want to compute the

1221
01:20:12,308 --> 01:20:17,886
probability that there is less
than c log n heads.

1222
01:20:17,886 --> 01:20:23,691
Let me skip this step.
So, I will be approximate and

1223
01:20:23,691 --> 01:20:29,382
say, what's the probability that
there is, at most,

1224
01:20:29,382 --> 01:20:33,923
c log n heads?
So, I need to say how many

1225
01:20:33,923 --> 01:20:37,549
coins we are flipping here for
what this event is.

1226
01:20:37,549 --> 01:20:40,139
So, I need to specify this
constant.

1227
01:20:40,139 --> 01:20:42,729
Let's say we flip ten c log n
coins.

1228
01:20:42,729 --> 01:20:47,169
Now I want to look at the error
probability under that event.

1229
01:20:47,169 --> 01:20:51,312
The probability that there is
at most c log n heads among

1230
01:20:51,312 --> 01:20:55,382
those ten c log n flips.
So, the claim is this should be

1231
01:20:55,382 --> 01:20:58,416
pretty small.
It's going to depend on ten.

1232
01:20:58,416 --> 01:21:01,672
Then I'll choose ten to be
arbitrarily large,

1233
01:21:01,672 --> 01:21:05,076
and I'll be done,
OK, make my life a little bit

1234
01:21:05,076 --> 01:21:10,054
easier.
Well, I would ask you normally,

1235
01:21:10,054 --> 01:21:15,770
but this is 6.042 material.
So, what's the probability that

1236
01:21:15,770 --> 01:21:19,021
we have, at most,
this many heads?

1237
01:21:19,021 --> 01:21:23,653
Well, that means that nine c
log n of the coins,

1238
01:21:23,653 --> 01:21:29,368
because there are ten c log n
flips, c log n heads at most,

1239
01:21:29,368 --> 01:21:34,000
nine c log n at least better be
tails.

1240
01:21:34,000 --> 01:21:37,148
So this is the probability that
all those other guys become

1241
01:21:37,148 --> 01:21:39,104
tails, which is already pretty
small.

1242
01:21:39,104 --> 01:21:41,330
And then, there is this
permutation thing.

1243
01:21:41,330 --> 01:21:44,532
So, if I had exactly c log n
heads, this would be the number

1244
01:21:44,532 --> 01:21:47,574
of ways to rearrange c log n
heads among ten c log n coin

1245
01:21:47,574 --> 01:21:49,475
flips.
OK, that's just the number of

1246
01:21:49,475 --> 01:21:51,375
permutations.
So, this is a bit big,

1247
01:21:51,375 --> 01:21:53,601
which is kind of annoying.
This is really,

1248
01:21:53,601 --> 01:21:55,665
really small.
The claim is this is much

1249
01:21:55,665 --> 01:21:58,000
smaller than that is big.

1250
01:22:14,000 --> 01:22:18,548
So, this is just some math.
I'm going to whiz through it.

1251
01:22:18,548 --> 01:22:21,390
So, you don't have to stay too
long.

1252
01:22:21,390 --> 01:22:26,020
But you should go over it.
You should know that y choose x

1253
01:22:26,020 --> 01:22:30,000
is, at most, ey over x to the x,
good fact.

1254
01:22:30,000 --> 01:22:35,032
Therefore, this is,
at most, ten c log n over c log

1255
01:22:35,032 --> 01:22:38,456
n, also known as ten.
These cancel.

1256
01:22:38,456 --> 01:22:43,691
There's an e out here.
And then I raise that to the c

1257
01:22:43,691 --> 01:22:48,020
log n power.
OK, then I divide by two to the

1258
01:22:48,020 --> 01:22:51,946
power, nine c log n.
OK, so what's this?

1259
01:22:51,946 --> 01:22:57,986
This is e times ten to the c
log n divided by two to the nine

1260
01:22:57,986 --> 01:23:02,355
c log n.
OK, claim this is very big.

1261
01:23:02,355 --> 01:23:06,367
This is not so big,
because I have a nine here.

1262
01:23:06,367 --> 01:23:09,769
So, let's work it out.
This e times ten,

1263
01:23:09,769 --> 01:23:13,345
that's a good number,
we can put upstairs.

1264
01:23:13,345 --> 01:23:17,096
So, we get two to the log of
ten times e,

1265
01:23:17,096 --> 01:23:21,109
and then c log n.
And then, we have over two to

1266
01:23:21,109 --> 01:23:25,121
the nine c log n.
So, we have this two to the c

1267
01:23:25,121 --> 01:23:31,946
log n in both cases.
So, this is two to the log,

1268
01:23:31,946 --> 01:23:38,669
ten e, minus nine, all times
c log n: some basic algebra.

1269
01:23:38,669 --> 01:23:43,199
So, I'm going to set,
not quite.

1270
01:23:43,199 --> 01:23:49,338
This is one over two to the
nine minus log:

1271
01:23:49,338 --> 01:23:58,253
so, just inverting everything
here, negating the sign in here.

1272
01:23:58,253 --> 01:24:06,000
And, this is my alpha because
the rest is n.

1273
01:24:06,000 --> 01:24:09,903
So, this is one over n to the
alpha when alpha is this

1274
01:24:09,903 --> 01:24:13,291
particular value:
nine minus log of ten times e

1275
01:24:13,291 --> 01:24:16,090
times c.
It's a bit of a strange thing.
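
Written out, the spoken calculation appears to be the chain below. This is a reconstruction from the transcript rather than the board itself; it assumes log means log base 2 (written \lg) and uses the fact that at most c \lg n heads among 10c \lg n flips means at least 9c \lg n tails.

```latex
\begin{align*}
\Pr[\text{at most } c\lg n \text{ heads in } 10c\lg n \text{ flips}]
  &\le \binom{10c\lg n}{c\lg n}\left(\frac{1}{2}\right)^{9c\lg n} \\
  &\le \left(\frac{e\cdot 10c\lg n}{c\lg n}\right)^{c\lg n}\frac{1}{2^{9c\lg n}}
     && \text{using } \binom{y}{x}\le\left(\frac{ey}{x}\right)^{x} \\
  &= \frac{(10e)^{c\lg n}}{2^{9c\lg n}}
   = \frac{2^{\lg(10e)\,c\lg n}}{2^{9c\lg n}} \\
  &= \frac{1}{2^{(9-\lg(10e))\,c\lg n}}
   = \frac{1}{n^{\alpha}},
  \qquad \alpha = \bigl(9-\lg(10e)\bigr)\,c .
\end{align*}
```

Since \lg(10e) is about 4.8, alpha is roughly 4.2c when the constant is ten.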

1276
01:24:16,090 --> 01:24:19,184
But, the point is,
as ten goes to infinity,

1277
01:24:19,184 --> 01:24:22,424
nine here is the number one
smaller than ten,

1278
01:24:22,424 --> 01:24:24,855
right?
We subtracted one somewhere

1279
01:24:24,855 --> 01:24:27,949
along the way.
So, as ten goes to infinity,

1280
01:24:27,949 --> 01:24:32,000
this is basically,
this is ten minus one.

1281
01:24:32,000 --> 01:24:35,100
This is log of ten times e.
e doesn't really matter.

1282
01:24:35,100 --> 01:24:37,531
The point is,
this is logarithmic in ten.

1283
01:24:37,531 --> 01:24:40,692
This is linear in ten.
The thing that's linear in ten

1284
01:24:40,692 --> 01:24:44,035
is much bigger than the thing
that's logarithmic in ten.

1285
01:24:44,035 --> 01:24:45,919
This is called abuse of
notation.

1286
01:24:45,919 --> 01:24:48,958
OK, as ten goes to infinity,
this goes to infinity,

1287
01:24:48,958 --> 01:24:51,329
gets bigger.
And, there is a c out here.

1288
01:24:51,329 --> 01:24:54,794
But, for any value of c that
you want, whatever value of c

1289
01:24:54,794 --> 01:24:58,015
you wanted in that claim,
I can make alpha arbitrarily

1290
01:24:58,015 --> 01:25:00,629
large by changing the constant
in the big O,

1291
01:25:00,629 --> 01:25:04,812
which here was ten.
OK, so that claim is true with

1292
01:25:04,812 --> 01:25:07,652
high probability.
Whatever probability you want,

1293
01:25:07,652 --> 01:25:10,673
which tells you alpha,
you set the constant in front of

1294
01:25:10,673 --> 01:25:13,089
the log N to be this number,
which grows,

1295
01:25:13,089 --> 01:25:15,929
and you're done.
You get the claim that getting order

1296
01:25:15,929 --> 01:25:19,312
log N heads takes order log N flips
with high probability,

1297
01:25:19,312 --> 01:25:21,548
therefore,
the number of steps in the

1298
01:25:21,548 --> 01:25:24,146
search is order log N with high
probability.
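
As a sanity check on the error probability, here is a small sketch, under stated assumptions, that compares the exact chance of seeing at most c log n heads in d c log n flips with the 1/n^alpha bound from the argument above. The helper names and the parameter d (standing in for the lecture's "ten") are hypothetical, and log is taken base 2.

```python
import math

def tail_at_most(k, m):
    """Exact Pr[at most k heads in m fair coin flips]."""
    return sum(math.comb(m, i) for i in range(k + 1)) / 2**m

def lecture_bound(n, c, d=10):
    """1 / n^alpha with alpha = (d - 1 - log2(d * e)) * c; d plays the role of 'ten'."""
    alpha = (d - 1 - math.log2(d * math.e)) * c
    return n ** (-alpha)

if __name__ == "__main__":
    n, c = 1024, 1
    k = c * int(math.log2(n))   # c log n heads (n is a power of two here)
    exact = tail_at_most(k, 10 * k)
    print(f"exact tail: {exact:.3e}, bound: {lecture_bound(n, c):.3e}")
```

Raising d makes alpha grow roughly linearly, so the bound shrinks quickly, matching the remark that the constant in the big O can be chosen to get any desired alpha.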

1299
01:25:24,146 --> 01:25:26,140
Really cool stuff;
read the notes.

1300
01:25:26,140 --> 01:25:29,000
Sorry I went so fast at the
end.