The last lecture of 6.046. We are here today to talk more about cache-oblivious algorithms. Last class, we saw several cache-oblivious algorithms, although none of them were too difficult. Today we will see two difficult cache-oblivious algorithms, a little bit more advanced. I figure we should do something advanced for the last class, just to reach some exciting climax. So without further ado, let's get started.

Last time, we looked at the binary search problem — or, we looked at binary search, rather. And binary search did not do so well in the cache-oblivious context. Some people asked me after class: is it possible to do binary search well, cache-obliviously? And indeed it is, with something called static search trees. This is really binary search. The abstract problem is: I give you N items, say presorted; build some static data structure so that you can search among those N items quickly. And quickly, I claim, means log base B of N. With B-trees, our goal is to get log base B of N, and we know that we can achieve that with B-trees when we know B. We'd like to do it when we don't know B.
And that's what cache-oblivious static search trees achieve. So here's what we're going to do. As you might suspect, we're going to use a tree. We're going to store our N elements in a complete binary tree. We can't use B-trees because we don't know what B is, so we'll use a binary tree. And the key is how we lay out the binary tree. The binary tree will have N nodes — or you can put the data in the leaves; it doesn't really matter.

So, here's our tree, with its N nodes. And we're storing them — I didn't say — in order, in the usual way, in order in a binary tree, which makes it a binary search tree. So now we do a search in this thing. The search will just start at the root and walk down some root-to-leaf path. At each point you know whether to go left or to go right, because things are in order; we're assuming here that we have an ordered universe of keys. So that's easy. We know that will take log N time. The question is: how many memory transfers? We'd like a lot of the nodes near the root to be somehow close together, in one block. But we don't know what the block size is.
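To make the starting point concrete, here is a minimal sketch — my own illustration, not code from the lecture — of a complete binary search tree stored in the "usual way," i.e. the heap order where node i's children sit at indices 2i and 2i+1, with a search that walks a root-to-leaf path. The helper names `build_complete_bst` and `search` are hypothetical.

```python
def build_complete_bst(sorted_keys):
    """Store sorted keys in a complete BST laid out in the usual heap
    order: node i has children 2i and 2i+1 (1-indexed).
    Assumes len(sorted_keys) == 2^h - 1 for some height h."""
    n = len(sorted_keys)
    tree = [None] * (n + 1)          # slot 0 unused
    it = iter(sorted_keys)

    def fill(i):                     # in-order traversal assigns sorted keys
        if i <= n:
            fill(2 * i)
            tree[i] = next(it)
            fill(2 * i + 1)

    fill(1)
    return tree

def search(tree, key):
    """Walk a root-to-leaf path: O(log N) comparisons."""
    i = 1
    while i < len(tree):
        if key == tree[i]:
            return True
        i = 2 * i if key < tree[i] else 2 * i + 1
    return False
```

This heap order is exactly the layout whose memory-transfer behavior is in question: nodes near the root are fine, but deep levels jump around in memory.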
So what we are going to do is carve the tree at the middle level. We're going to use divide and conquer for the layout of the tree — for how we order the nodes in memory. And the divide and conquer is based on cutting at the middle level, which is a bit weird; it's not our usual divide and conquer. And we'll see this more than once today.

So, when you cut at the middle level: if the height of your original tree is log N — maybe log N plus one or something; it's roughly log N — then the height of the top part will be log N over two, and the height of the bottom pieces will be log N over two. How many nodes will there be in the top tree? N over two? Not quite. Two to the log N over two: square root of N. So there will be about root N nodes up here. And therefore there will be about root N subtrees down here, one — or a couple — for each leaf of the top tree. So we have these subtrees of size root N, and there are about root N of them. This is how we are carving our tree. Now, we're going to recurse on each of the pieces. I'd like to redraw this slightly, sorry, just to make it a little bit clearer: these triangles are really trees, and they are connected by edges to the tree up here.
So what we are really doing is carving at the middle level of edges in the tree. And if N is not exactly a power of two, you have to round your level by taking floors or ceilings. But you cut roughly at the middle level of edges. There are a lot of edges there; you conceptually slice through them. That gives you a top tree and several bottom trees, each of size roughly root N. And then we are going to recursively lay out these root N plus one subtrees, and then concatenate.

So, this is the idea of the recursive layout. We saw recursive layouts with matrices last time; this is doing the same thing for a tree. So, I want to recursively lay out the top tree. Here's the top tree, and I imagine it being somehow squashed down into a linear array, recursively. And then I do the same thing for each of the bottom trees. So here are all the bottom trees, and I squash each of them down into some linear order. And then I concatenate those linear orders. That's the linear order of the whole tree. And you need a base case: a single node is stored in the only order of a single node there is. OK, so that's the recursive layout of a binary search tree.
It turns out this works really well. And let's quickly do a little example, just so it's completely clear what this layout is, because it's a bit bizarre, maybe, the first time you see it. So let me draw my favorite picture. Here's a tree of height four — or three, depending on how you count. We divide at the middle level, and we say: OK, that's the top tree, and then these are the bottom trees. There are four bottom trees, so there are four children hanging off the top tree. They each have the same size in this case; in general they should all be roughly the same size.

First we lay out the top tree, where we again divide at the middle level. We say: OK, this node comes first, and then the bottom subtrees of the top tree come next — two and three. I'm writing down the order in which these nodes are stored in the array. Then we visit this subtree, so we get four, five, six. Then we visit this one, so we get seven, eight, nine. Then the next subtree, 10, 11, 12, and then the last subtree, 13, 14, 15. So that's the order in which you store these 15 nodes. And you can build that up recursively.
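The recursive layout just described can be sketched in a few lines. This is my own sketch, assuming nodes carry their usual heap/BFS indices (children of i are 2i and 2i+1); the function name `veb_order` is mine (the layout is often called the van Emde Boas layout).

```python
def veb_order(h, root=1):
    """Recursive (van Emde Boas) layout of a complete binary tree of
    height h whose nodes carry their usual heap/BFS indices.
    Returns the node indices in layout (memory) order."""
    if h == 1:
        return [root]
    top_h = h // 2                       # cut at the middle level, rounding down
    bot_h = h - top_h
    order = veb_order(top_h, root)       # lay out the top tree first...
    first_leaf = root << (top_h - 1)     # leftmost leaf of the top tree
    for leaf in range(first_leaf, first_leaf + (1 << (top_h - 1))):
        for child in (2 * leaf, 2 * leaf + 1):   # roots of the bottom trees
            order += veb_order(bot_h, child)     # ...then each bottom tree
    return order
```

For the height-4 tree in the example, position k of this list holds the node the lecture labels k: the top tree's three nodes come first, then each bottom subtree's three nodes in turn.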
OK, so the structure is fairly simple — just a binary search tree, which we know and love — but stored in this funny order. This is not depth-first search order or level order, or lots of other natural things you might try, none of which work in the cache-oblivious context. This is pretty much the only thing that works. And the intuition is: we are trying to mimic all kinds of B-trees at once. If you want a binary tree, well, that's the original tree; it doesn't matter how you store things. If you want a tree where the branching factor is four, well, then here it is: these blocks give you a branching factor of four. If we had more leaves down here, there would be four children hanging off of that node, and these are all clustered together consecutively in memory. So, if your block size happens to be three, then this is a perfect way to store things for a block size of three. If your block size happens to be 15 — right, if we count, the number of nodes in here is 15 — then this recursion will give you a perfect blocking in terms of 15. And in general, it's actually mimicking block sizes of 2^k - 1.
Think powers of two. OK, that's the intuition. Let me give you the formal analysis to make it clearer. So, we claim that a search takes order log base B of N memory transfers. That's what we want to prove, no matter what B is.

So here's what we're going to do. You may recall, last time when we analyzed divide-and-conquer algorithms, we wrote our recurrence, and the base case was the key. Here, in fact, we are only going to think about the base case, in a certain sense. We don't really have recursion in the algorithm — the algorithm is just walking down some root-to-leaf path. We only have recursion in the definition of the layout. So we can be a little bit more flexible; we don't have to write a recurrence. We are just going to think about the base case. I want you to imagine: you start with the big triangle, you cut it at the middle, you get smaller triangles, and you keep recursively cutting. So imagine this process. The triangles halve in height each time; they're getting smaller and smaller. Stop cutting at the point where a triangle fits in a block, and look at that time.
OK, the recursion actually goes all the way down, but in the analysis, let's think about the point where the chunk fits in a block — where one of these triangles, one of these boxes, fits in a block. So, I'm going to call this a recursive level. I'm imagining expanding all of the recursions in parallel. This is some level of detail, some level of refinement of the trees, at which the tree you're looking at — the triangle — has small size. In other words, the number of nodes in that triangle is less than or equal to B.

OK, so let me draw a picture. I want to draw sort of the same picture, but where instead of nodes, I have little triangles of size at most B. So the picture looks something like this. We have a little triangle of size at most B. It has a bunch of children, which are subtrees of size at most B, the same size. And then these are in a chunk, and then we have other chunks that look like that, recursively, potentially. OK, so I haven't drawn everything: there would be a whole bunch — between B and B^2, in fact — of subtrees, other triangles of this size. So here, I had to refine the entire tree.
And then I refined each of the subtrees, here and here, at these levels. And it turned out that after these two recursive levels, everything fits in a block. Everything has the same size, so at some point they will all fit within a block. And they might actually be quite a bit smaller than the block. How small? Well, what I'm doing is cutting the number of levels in half at each step, and I stop when the height of one of these trees is essentially at most log B — because that's when the number of nodes in there will be roughly B. So, how small can the height be? I keep dividing in half and stopping when it's at most log B. So the height is at most log B, and it's at least half log B. Therefore, the number of nodes in a triangle could be anywhere between the square root of B and B. So a triangle could be a lot smaller than a block — smaller by more than a constant factor — but I claim that doesn't matter. It's OK. It could be as small as roughly square root of B; I'm not even going to write that down, because it doesn't play a role in the analysis. It's a worry, but it's OK, essentially because our bound only involves log B.
It doesn't involve B directly. So, here's what we do. We know that the height of one of these triangles of size at most B is at least half log B. Now look at a search path. When we do a search in this tree, we start up here — and I'm going to mess up the diagram now — and we follow some path; maybe I should have drawn it going down here. We visit some of these triangles, along a root-to-node path in the tree. So, how many of the triangles could the search visit? Well, at most the height of the tree divided by the height of one of the triangles. So, the search visits at most log N over half log B triangles, which looks good: that's log base B of N, modulo a factor of two. Now, what we should worry about is: how many blocks does a triangle occupy? One of these triangles fits in a block, and we know, by the recursive layout, that it is stored in a consecutive region of memory. So, how many blocks could it occupy? Two — because of alignment, it might fall across the boundary between blocks, but across at most one boundary. So, it fits in two blocks.
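The two facts just established can be written compactly:

```latex
\tfrac{1}{2}\log B \;\le\; \text{height of a triangle} \;\le\; \log B,
```

so a root-to-leaf search path, of length at most $\log N$, crosses at most

```latex
\frac{\log N}{\tfrac{1}{2}\log B} \;=\; 2\log_B N
```

triangles — each stored consecutively in memory, and hence touching at most two blocks of size $B$.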
So, each triangle fits in one block's worth of space, but may occupy at most two memory blocks of size B, depending on alignment. So, the number of memory transfers — in other words, the number of blocks we read, because all we are doing in a search is reading — is at most two blocks per triangle. There are this many triangles, so it's at most 4 log base B of N, which is order log base B of N. And there are papers about decreasing this constant 4 with more sophisticated data structures; you can get it down to a little bit less than two, I think. So, there you go. Not quite as good as B-trees in terms of the constant, but pretty good. And what's good is that this data structure works for all B at the same time — this analysis works for all B. So if we have a multilevel memory hierarchy, no problem.

Any questions about this data structure? This is already pretty sophisticated, but we are going to get even more sophisticated next. OK, good, no questions. This is either perfectly clear, or a little bit difficult, or both. So now — I debated with myself what exactly I would cover next. There are two natural things I could cover, both of which are complicated.
My first result in the cache-oblivious world was making this data structure dynamic. So, there is a dynamic B-tree that's cache-oblivious, that works for all values of B, and it gets log base B of N insert, delete, and search — whereas this static structure just gets search in log base B of N. That data structure, in our first paper, was damn complicated, and then it got simplified. It's now not too hard, but it takes a couple of lectures in an advanced algorithms class to teach it, so I'm not going to do that. But there you go; it exists.

Instead, we're going to cover our favorite problem, sorting, in the cache-oblivious context. And this is quite complicated — more than you'd expect — much more complicated than it is in a multithreaded setting, to get the right answer, anyway. Maybe getting the best answer in a multithreaded setting is also complicated; the version we got last week was pretty easy. But before we get to cache-oblivious sorting, let me talk about cache-aware sorting, because we need to know what bound we are aiming for. And just to warn you, I may not get to the full analysis of the full cache-oblivious sorting algorithm.
But I want to give you an idea of what goes into it, because it's pretty cool, I think — a lot of ideas. So, how might you sort? In the cache-aware setting, we assume we can do everything; basically, this means we have B-trees. That's the only other structure we know. How would you sort N numbers, given that that's the only data structure you have? Right: just insert them into the B-tree, and then do an in-order traversal. That's one way to sort, perfectly reasonable. We'll call it repeated insertion into a B-tree.

We know that in the usual setting, BST sort — where you use a balanced binary search tree, like red-black trees — takes N log N time, log N per operation, and that's an optimal sorting algorithm in the comparison model; we're only thinking about the comparison model here. So, how many memory transfers does this data structure take — sorry, this algorithm for sorting? The number of memory transfers as a function of N, MT of N, is? This is easy: N insertions — OK, and you also have to think about the in-order traversal. You have to remember back to your analysis of B-trees, but this is not too hard. How long do the N insertions take? N log base B of N.
How long does the traversal take? Less time. If we think about it, you can get away with N over B memory transfers for the traversal — quite a bit less than the insertions. So the total, N log base B of N, is bigger than N, which is actually pretty bad: N memory transfers means essentially you're doing random access, visiting every element in some random order. This is even worse — there's even a log factor. Now, the log factor goes down by this log B factor, but this is still a really bad sorting bound. So, unlike with normal algorithms, where using a search tree is a good way to sort, in cache-oblivious or cache-aware sorting it's really, really bad.

So, what's another natural algorithm you might try, given what we know about sorting? And let's even be cache-oblivious — all the algorithms we've seen today are cache-oblivious. So, what's a good one to try? Merge sort. OK, we did merge sort in multithreaded algorithms; let's try merge sort, a good divide-and-conquer thing. I'm going to call it binary merge sort, because it splits the array into two pieces and recurses on the two pieces, so you get a binary recursion tree. So, let's analyze it.
So, the number of memory transfers on N elements — I mean, it has a pretty good recursive layout, right? The two subarrays that we get when we partition our array are consecutive. So, we're recursing on this, and recursing on this. It's a nice cache-oblivious layout, and this is good even in the cache-aware setting. This is a pretty good algorithm, a lot better than the B-tree one, as we'll see. But what is the recurrence we get? Here we have to go back to last lecture, when we were thinking about recurrences for recursive cache-oblivious algorithms.

I mean, the first part should be pretty easy. There's an O — well, OK, let's put the O at the end, the divide and the conquer part at the end. The recursion is 2 MT of N over two, good. All right, that's just like the merge sort recurrence, and then there's the additive term that you should be thinking about. Normally, we would pay a linear additive term here, order N, because merging takes order N time. Now, the merge is three parallel scans — the two inputs and the output. OK, they're not quite parallel; they're interleaved.
They're a bit funnily interleaved, but as long as your cache stores at least three blocks, merging is also "linear time" in this setting — order N over B — which means you visit each block a constant number of times. OK, that's the recurrence. Now, we also need a base case, of course. We've seen two base cases: one, MT of B; and the other, MT of whatever fits in cache. Let's look at that one, because it's better. So, for some constant c, a problem of size cM fits in cache — actually, probably c is one here, but I'll just be careful. A problem of this size fits in cache, and in that case, the number of memory transfers is — anyone remember? We've used this base case more than once before. Do you remember? Sorry? cM over B. I've got a big O, so: M over B. Order M over B, because this is the size of the data — just to read it all in takes M over B. Once it's in cache, it doesn't really matter what I do, as long as I use linear space, for the right constant here.
As long as I use linear space in that algorithm, I'll stay in cache, and therefore not have to write anything out until the very end, when I spend M over B to write it out. So, I can't really spend more than M over B almost no matter what algorithm I have, as long as it uses linear space. So, this is a base case that's useful in pretty much any algorithm.

OK, that's the recurrence. Now we just have to solve it, and we'll see how good binary merge sort is. Again, I'm just going to give the intuition behind the solution to this recurrence; I won't use the substitution method to prove it formally. But this one's actually pretty simple. So, at the top — actually, I'm going to write it over here, otherwise I won't be able to see. At the top of the recursion, we have an N over B cost. I'll ignore the constants; there's probably also an additive one, which I'm ignoring here. Then we split into two problems of half the size, so we get a half N over B, and a half N over B. Usually this was N, half N, half N — you should recognize it from lecture one. So, the total on this level is N over B.
402 00:26:10,000 --> 00:26:12,000 The total on this level is N over B. 403 00:26:12,000 --> 00:26:16,000 And, you can prove by induction, that every level is N 404 00:26:16,000 --> 00:26:18,000 over B. The question is how many levels 405 00:26:18,000 --> 00:26:20,000 are there? Well, at the bottom, 406 00:26:20,000 --> 00:26:23,000 so, dot, dot, dot, at the bottom of this 407 00:26:23,000 --> 00:26:26,000 recursion tree we should get something of size M, 408 00:26:26,000 --> 00:26:30,000 and then we're paying M over B. Actually here we're paying M 409 00:26:30,000 --> 00:26:34,000 over B. So, it's a good thing those 410 00:26:34,000 --> 00:26:35,000 match. They should. 411 00:26:35,000 --> 00:26:40,000 So here, we have a bunch of leaves, all the size M over B. 412 00:26:40,000 --> 00:26:44,000 You can also compute the number of leaves here is N over M. 413 00:26:44,000 --> 00:26:49,000 If you want to be extra sure, you should always check the 414 00:26:49,000 --> 00:26:51,000 leaf level. It's a good idea. 415 00:26:51,000 --> 00:26:55,000 So we have N over M leaves, each costing M over B. 416 00:26:55,000 --> 00:27:00,000 This is an M. So, this is N over B also. 417 00:27:00,000 --> 00:27:04,000 So, every level here is N over B memory transfers. 418 00:27:04,000 --> 00:27:08,000 And the number of levels is, anyone? 419 00:27:08,000 --> 00:27:11,000 Log N over M. Yep, that's right. 420 00:27:11,000 --> 00:27:16,000 I just didn't hear it right. OK, we are starting at N. 421 00:27:16,000 --> 00:27:21,000 We're getting down to M. So, you can think of it as log 422 00:27:21,000 --> 00:27:26,000 N, the whole binary tree minus the subtrees log M, 423 00:27:26,000 --> 00:27:31,000 and that's the same as log N over M, OK, or however you want 424 00:27:31,000 --> 00:27:37,000 to think about it. The point is that this is a log 425 00:27:37,000 --> 00:27:40,000 base two. That's where we are not doing 426 00:27:40,000 --> 00:27:42,000 so great.
So this is actually a pretty 427 00:27:42,000 --> 00:27:46,000 good algorithm. So let me write the solution 428 00:27:46,000 --> 00:27:48,000 over here. So, the number of memory 429 00:27:48,000 --> 00:27:53,000 transfers on N items is going to be the number of levels times 430 00:27:53,000 --> 00:27:56,000 the cost of each level. So, this is N over B times log 431 00:27:56,000 --> 00:28:00,000 base two of N over M, which is a lot better than 432 00:28:00,000 --> 00:28:04,000 repeated insertion into a B tree. 433 00:28:04,000 --> 00:28:07,000 Here, we were getting N times log N over log B, 434 00:28:07,000 --> 00:28:12,000 OK, so N log N over log B. We're getting a log B savings 435 00:28:12,000 --> 00:28:16,000 over not doing anything, and here we are getting a 436 00:28:16,000 --> 00:28:19,000 factor of B savings, N log N over B. 437 00:28:19,000 --> 00:28:24,000 In fact, we even made it a little bit smaller by dividing 438 00:28:24,000 --> 00:28:28,000 this N by M. That doesn't matter too much. 439 00:28:28,000 --> 00:28:32,000 This dividing by B is a big one. 440 00:28:32,000 --> 00:28:35,000 OK, so we're almost there. This is almost an optimal 441 00:28:35,000 --> 00:28:37,000 algorithm. It's even cache oblivious, 442 00:28:37,000 --> 00:28:40,000 which is pretty cool. And that extra little step, 443 00:28:40,000 --> 00:28:43,000 which is that you should be able to get another log B 444 00:28:43,000 --> 00:28:46,000 factor improvement, I want to combine these two 445 00:28:46,000 --> 00:28:48,000 ideas. I want to keep this factor B 446 00:28:48,000 --> 00:28:51,000 improvement over N log N, and I want to keep this factor 447 00:28:51,000 --> 00:28:54,000 log B improvement over N log N, and get them together. 448 00:28:54,000 --> 00:28:57,000 So, first, before we do that cache obliviously, 449 00:28:57,000 --> 00:29:03,000 let's do it cache aware. So, this is the third cache 450 00:29:03,000 --> 00:29:07,000 aware algorithm.
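As a quick sanity check on the recurrence just solved (not from the lecture; the values of B and M below are made up, chosen so that M >= B^2), you can evaluate MT(N) = 2 MT(N/2) + N/B with the base case MT(O(M)) = O(M/B) and compare it against the claimed solution (N/B) log2(N/M):

```python
from math import log2

B, M = 8, 64          # hypothetical block and cache sizes (M >= B^2)

def mt_binary(n):
    """MT(N) = 2 MT(N/2) + N/B, with MT(N) = O(M/B) once N fits in cache."""
    if n <= M:
        return M // B                 # read it in, write it out
    return 2 * mt_binary(n // 2) + n // B

n = 2 ** 20
claimed = (n // B) * log2(n / M)      # (N/B) log2(N/M)
print(mt_binary(n), claimed)          # agree up to a small constant factor
```

The exact count and the claimed bound differ only by the leaf-level term, which is one more N/B.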
This one was also cache 451 00:29:07,000 --> 00:29:11,000 oblivious. So, how should I modify a merge 452 00:29:11,000 --> 00:29:18,000 sort in order to do better? I mean, I have this log base 453 00:29:18,000 --> 00:29:22,000 two, and I want a log base B, more or less. 454 00:29:22,000 --> 00:29:27,000 So, how would I do that with merge sort? 455 00:29:27,000 --> 00:29:30,000 Yeah? Split into B subarrays, 456 00:29:30,000 --> 00:29:32,000 yeah. Instead of doing binary merge 457 00:29:32,000 --> 00:29:35,000 sort, this is what I was hinting at here, instead of splitting it 458 00:29:35,000 --> 00:29:37,000 into two pieces, and recursing on the two 459 00:29:37,000 --> 00:29:40,000 pieces, and then merging them, I could split potentially into 460 00:29:40,000 --> 00:29:42,000 more pieces. OK, and to do that, 461 00:29:42,000 --> 00:29:45,000 I'm going to use my cache. So the idea is B pieces. 462 00:29:45,000 --> 00:29:48,000 This is actually not the best thing to do, although B pieces 463 00:29:48,000 --> 00:29:50,000 does work. And, it's what I was hinting at 464 00:29:50,000 --> 00:29:52,000 because I was saying I want a log B. 465 00:29:52,000 --> 00:29:55,000 It's actually not quite log B. It's log M over B. 466 00:29:55,000 --> 00:29:57,000 OK, but let's see. So, what is the most pieces I 467 00:29:57,000 --> 00:30:01,000 could split into? Right, well, 468 00:30:01,000 --> 00:30:06,000 I could split into N pieces. That would be good, 469 00:30:06,000 --> 00:30:11,000 wouldn't it, at only one recursive level? 470 00:30:11,000 --> 00:30:14,000 I can't split into N pieces. Why? 471 00:30:14,000 --> 00:30:19,000 What happens wrong when I split into N pieces? 472 00:30:19,000 --> 00:30:24,000 That would be the ultimate. You can't merge, 473 00:30:24,000 --> 00:30:27,000 exactly. So, if I have N pieces, 474 00:30:27,000 --> 00:30:33,000 you can't merge in cache. 
I mean, so in order to merge in 475 00:30:33,000 --> 00:30:37,000 cache, what I need is to be able to store an entire block from 476 00:30:37,000 --> 00:30:40,000 each of the lists that I'm merging. 477 00:30:40,000 --> 00:30:43,000 If I can store an entire block in cache for each of the lists, 478 00:30:43,000 --> 00:30:46,000 then it's a bunch of parallel scans. 479 00:30:46,000 --> 00:30:49,000 So this is like testing the limit of parallel scanning 480 00:30:49,000 --> 00:30:52,000 technology. If you have K parallel scans, 481 00:30:52,000 --> 00:30:55,000 and you can fit K blocks in cache, then all is well because 482 00:30:55,000 --> 00:30:58,000 you can scan through each of those K arrays, 483 00:30:58,000 --> 00:31:02,000 and have one block from each of the K arrays in cache at the 484 00:31:02,000 --> 00:31:05,000 same time. So, that's the idea. 485 00:31:05,000 --> 00:31:09,000 Now, how many blocks can I fit in cache? 486 00:31:09,000 --> 00:31:13,000 M over B. That's the biggest I could do. 487 00:31:13,000 --> 00:31:18,000 So this will give the best running time among these kinds 488 00:31:18,000 --> 00:31:24,000 of merge sort algorithms. This is an M over B way merge 489 00:31:24,000 --> 00:31:27,000 sort. OK, so now we get somewhat 490 00:31:27,000 --> 00:31:31,000 better recurrence. We split into M over B 491 00:31:31,000 --> 00:31:34,000 subproblems now, each of size, 492 00:31:34,000 --> 00:31:38,000 well, it's N divided by M over B without thinking. 493 00:31:38,000 --> 00:31:43,000 And, the claim is that the merge time is still linear 494 00:31:43,000 --> 00:31:48,000 because we have barely enough, OK, maybe I should describe 495 00:31:48,000 --> 00:31:50,000 this algorithm. So, we divide, 496 00:31:50,000 --> 00:31:55,000 because we've never really done non-binary merge sort. 497 00:31:55,000 --> 00:32:00,000 We divide into M over B equal size subarrays instead of two. 
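The merge just described, K parallel scans with one resident block per list, can be sketched in Python (a sketch only: heapq stands in for the "which list has the smallest head" comparison, and the block bookkeeping is implicit, since each list is consumed left to right in a single pass):

```python
import heapq

def kway_merge(runs):
    """Merge K sorted lists in one left-to-right pass over each,
    as in (M/B)-way merge sort: with K <= M/B, each run keeps one
    block resident in cache, so the merge is K parallel scans
    costing O(N/B) memory transfers in total."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        val, i, j = heapq.heappop(heap)
        merged.append(val)
        if j + 1 < len(runs[i]):                  # advance scan i by one
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return merged
```

For example, kway_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]) produces 1 through 9 in order.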
498 00:32:00,000 --> 00:32:06,000 Here, we are clearly doing a cache aware algorithm. 499 00:32:06,000 --> 00:32:11,000 We are assuming we know what M over B is. 500 00:32:11,000 --> 00:32:17,000 So, then we recursively sort each subarray, 501 00:32:17,000 --> 00:32:21,000 and then we conquer. We merge. 502 00:32:21,000 --> 00:32:29,000 And, the reason merge works is because we can afford one block 503 00:32:29,000 --> 00:32:34,000 in cache. So, let's call it one cache 504 00:32:34,000 --> 00:32:36,000 block per subarray. OK, actually, 505 00:32:36,000 --> 00:32:40,000 if you're careful, you also need one block for the 506 00:32:40,000 --> 00:32:44,000 output of the merged array before you write it out. 507 00:32:44,000 --> 00:32:47,000 So, it should be M over B minus one. 508 00:32:47,000 --> 00:32:50,000 But, let's ignore some additive constants. 509 00:32:50,000 --> 00:32:53,000 OK, so this is the recurrence we get. 510 00:32:53,000 --> 00:32:59,000 The base case is the same. And, what improves here? 511 00:32:59,000 --> 00:33:02,000 I mean, the per level cost doesn't change, 512 00:33:02,000 --> 00:33:06,000 I claim, because at the top we get N over B. 513 00:33:06,000 --> 00:33:09,000 This is as before. Then we split into M over B 514 00:33:09,000 --> 00:33:15,000 subproblems, each of which costs a one over M over B factor times 515 00:33:15,000 --> 00:33:18,000 N over B. OK, so you add all those up, 516 00:33:18,000 --> 00:33:23,000 you still get N over B because we are not decreasing the number 517 00:33:23,000 --> 00:33:26,000 of elements. We're just splitting them. 518 00:33:26,000 --> 00:33:31,000 There's now M over B subproblems, each of one over M 519 00:33:31,000 --> 00:33:36,000 over B the size. So, just like before, 520 00:33:36,000 --> 00:33:39,000 each level will sum to N over B. 521 00:33:39,000 --> 00:33:44,000 What changes is the number of levels because now we have 522 00:33:44,000 --> 00:33:49,000 bigger branching factor.
Instead of log base two, 523 00:33:49,000 --> 00:33:53,000 it's now log base the branching factor. 524 00:33:53,000 --> 00:33:59,000 So, the height of this tree is log base M over B of N over M, 525 00:33:59,000 --> 00:34:03,000 I believe. Let me make sure that agrees 526 00:34:03,000 --> 00:34:06,000 with me. Yeah. 527 00:34:06,000 --> 00:34:12,000 OK, and if you're careful, this counts not quite the 528 00:34:12,000 --> 00:34:18,000 number of levels, but the number of levels minus 529 00:34:18,000 --> 00:34:22,000 one. So, I'm going to add a plus one 530 00:34:22,000 --> 00:34:26,000 here. And the reason why is this is 531 00:34:26,000 --> 00:34:37,000 not quite the bound that I want. So, we have log base M over B. 532 00:34:37,000 --> 00:34:45,000 What I really want, actually, is N over B. 533 00:34:45,000 --> 00:34:55,000 I claim that these are the same because we have minus, 534 00:34:55,000 --> 00:35:01,000 yeah, that's good. OK, this might seem rather 535 00:35:01,000 --> 00:35:05,000 mysterious, but it's because I know what the sorting bound 536 00:35:05,000 --> 00:35:07,000 should be as I'm doing this arithmetic. 537 00:35:07,000 --> 00:35:10,000 So, I'm taking log base M over B of N over M. 538 00:35:10,000 --> 00:35:12,000 I'm not changing the base of the log. 539 00:35:12,000 --> 00:35:14,000 I'm just saying, well, N over M, 540 00:35:14,000 --> 00:35:17,000 that is N over B divided by M over B because then the B's 541 00:35:17,000 --> 00:35:20,000 cancel, and the M goes on the bottom. 542 00:35:20,000 --> 00:35:23,000 So, if I do that in the logs, I get log of N over B minus log 543 00:35:23,000 --> 00:35:26,000 of M over B, because I'm dividing. 544 00:35:26,000 --> 00:35:30,000 OK, now, log base M over B of M over B is one. 545 00:35:30,000 --> 00:35:33,000 So, these cancel, and I get log base M over B, 546 00:35:33,000 --> 00:35:36,000 N over B, which is what I was aiming for. 547 00:35:36,000 --> 00:35:39,000 Why?
Because that's the right bound 548 00:35:39,000 --> 00:35:43,000 as it's normally written. OK, that's what we will be 549 00:35:43,000 --> 00:35:48,000 trying to get cache obliviously. So, that's the height of the 550 00:35:48,000 --> 00:35:53,000 recursion tree, and at each level we are paying N over B memory 551 00:35:53,000 --> 00:35:56,000 transfers. So, the overall number of 552 00:35:56,000 --> 00:36:01,000 memory transfers for this M over B way merge sort is the sorting 553 00:36:01,000 --> 00:36:03,000 bound. 554 00:36:13,000 --> 00:36:19,000 This is, I'll put it in a box. This is the sorting bound, 555 00:36:19,000 --> 00:36:25,000 and it's very special because it is the optimal number of 556 00:36:25,000 --> 00:36:31,000 memory transfers for sorting N items cache aware. 557 00:36:31,000 --> 00:36:33,000 This has been known since, like, 1983. 558 00:36:33,000 --> 00:36:35,000 OK, this is the best thing to do. 559 00:36:35,000 --> 00:36:38,000 It's a really weird bound, but if you ignore all the 560 00:36:38,000 --> 00:36:41,000 divided by B's, it's sort of like N times log 561 00:36:41,000 --> 00:36:44,000 base M of N. So, that's a little bit more 562 00:36:44,000 --> 00:36:46,000 reasonable. But, there's lots of divided by 563 00:36:46,000 --> 00:36:49,000 B's. So, the number of the blocks in 564 00:36:49,000 --> 00:36:53,000 the input times log base the number of blocks in the cache of 565 00:36:53,000 --> 00:36:55,000 the number of blocks in the input. 566 00:36:55,000 --> 00:36:57,000 That's a little bit more intuitive. 567 00:36:57,000 --> 00:37:02,000 That is the bound. And that's what we are aiming 568 00:37:02,000 --> 00:37:04,000 for. So, this algorithm, 569 00:37:04,000 --> 00:37:08,000 crucially, assumed that we knew what M over B was. 570 00:37:08,000 --> 00:37:12,000 Now, we are going to try and do it without knowing M over B, 571 00:37:12,000 --> 00:37:17,000 do it cache obliviously.
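To make the arithmetic concrete, here is a small numeric check (the values of N, M, and B are made up, with M >= B^2) of the base-change identity used on the board, 1 + log_{M/B}(N/M) = log_{M/B}(N/B), and of the claim that the sorting bound beats both earlier bounds:

```python
from math import log

N, M, B = 2 ** 30, 2 ** 20, 2 ** 10   # hypothetical sizes (M >= B^2)

def log_base(x, base):
    return log(x) / log(base)

# base change: N/M = (N/B) / (M/B), and log_{M/B}(M/B) = 1
assert abs(1 + log_base(N / M, M / B) - log_base(N / B, M / B)) < 1e-9

sorting_bound = (N / B) * log_base(N / B, M / B)      # the boxed bound
btree_inserts = N * log_base(N, 2) / log_base(B, 2)   # N log N / log B
binary_msort = (N / B) * log_base(N / M, 2)           # (N/B) log2(N/M)
print(sorting_bound, binary_msort, btree_inserts)
```

With these numbers the sorting bound is about 2 million transfers, binary merge sort about 10 million, and repeated B-tree insertion about 3 billion.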
And that is the result of only 572 00:37:17,000 --> 00:37:19,000 a few years ago. Are you ready? 573 00:37:19,000 --> 00:37:23,000 Everything clear so far? It's a pretty natural 574 00:37:23,000 --> 00:37:26,000 algorithm. We were going to try to mimic 575 00:37:26,000 --> 00:37:31,000 it essentially and do a merge sort, but not M over B way merge 576 00:37:31,000 --> 00:37:36,000 sort because we don't know how. We're going to try and do it 577 00:37:36,000 --> 00:37:39,000 essentially a square root of N way merge sort. 578 00:37:39,000 --> 00:37:43,000 If you play around, that's the natural thing to do. 579 00:37:43,000 --> 00:37:46,000 The tricky part is that it's hard to merge square root of N 580 00:37:46,000 --> 00:37:50,000 lists at the same time, in a cache efficient way. 581 00:37:50,000 --> 00:37:54,000 We know that if the square root of N is bigger than M over B, 582 00:37:54,000 --> 00:37:57,000 you're hosed if you just do a straightforward merge. 583 00:37:57,000 --> 00:38:02,000 So, we need a fancy merge. We are going to do a divide and 584 00:38:02,000 --> 00:38:05,000 conquer merge. It's a lot like the 585 00:38:05,000 --> 00:38:10,000 multithreaded algorithms of last week, try and do a divide and 586 00:38:10,000 --> 00:38:14,000 conquer merge so that no matter how many lists are merging, 587 00:38:14,000 --> 00:38:18,000 as long as it's less than the square root of N, 588 00:38:18,000 --> 00:38:23,000 or actually cubed root of N, we can do it cache efficiently, 589 00:38:23,000 --> 00:38:24,000 OK? So, we'll do this, 590 00:38:24,000 --> 00:38:28,000 we need a bit of setup. But that's where we're going, 591 00:38:28,000 --> 00:38:33,000 cache oblivious sorting. So, we want to get the sorting 592 00:38:33,000 --> 00:38:36,000 bound, and, yeah. It turns out, 593 00:38:36,000 --> 00:38:40,000 to do cache oblivious sorting, you need an assumption about 594 00:38:40,000 --> 00:38:42,000 the cache size. 
This is kind of annoying, 595 00:38:42,000 --> 00:38:45,000 because we said, well, cache oblivious 596 00:38:45,000 --> 00:38:49,000 algorithms should work for all values of B and all values of M. 597 00:38:49,000 --> 00:38:53,000 But, you can actually prove you need an additional assumption in 598 00:38:53,000 --> 00:38:55,000 order to get this bound cache obliviously. 599 00:38:55,000 --> 00:38:58,000 That's the result of, like, last year by Gerth 600 00:38:58,000 --> 00:39:01,000 Brodal. So, and the assumption is, 601 00:39:01,000 --> 00:39:04,000 well, the assumption is fairly weak. 602 00:39:04,000 --> 00:39:07,000 That's the good news. OK, we've actually made an 603 00:39:07,000 --> 00:39:10,000 assumption several times. We said, well, 604 00:39:10,000 --> 00:39:13,000 assuming the cache can store at least three blocks, 605 00:39:13,000 --> 00:39:17,000 or assuming the cache can store at least four blocks, 606 00:39:17,000 --> 00:39:21,000 yeah, it's reasonable to say the cache can store at least 607 00:39:21,000 --> 00:39:25,000 four blocks, or at least any constant number of blocks. 608 00:39:25,000 --> 00:39:29,000 This is that the number of blocks that your cache can store 609 00:39:29,000 --> 00:39:33,000 is at least B to the epsilon blocks. 610 00:39:33,000 --> 00:39:36,000 This is saying your cache isn't, like, really narrow. 611 00:39:36,000 --> 00:39:37,000 It's about as tall as it is wide. 612 00:39:37,000 --> 00:39:40,000 This actually gives you a lot of slack. 613 00:39:40,000 --> 00:39:42,000 And, we're going to use a simple version of this 614 00:39:42,000 --> 00:39:44,000 assumption that M is at least B^2.
618 00:39:54,000 --> 00:39:57,000 That's a pretty reasonable assumption, and if you look at 619 00:39:57,000 --> 00:40:00,000 caches these days, they all satisfy this, 620 00:40:00,000 --> 00:40:04,000 at least for some epsilon. Pretty much universally, 621 00:40:04,000 --> 00:40:08,000 M is at least B^2 or so. OK, and in fact, 622 00:40:08,000 --> 00:40:12,000 if you think from our speed of light arguments from last time, 623 00:40:12,000 --> 00:40:16,000 B^2 or B^3 is actually the right thing to do. 624 00:40:16,000 --> 00:40:18,000 As you go out, I guess in 3-D, 625 00:40:18,000 --> 00:40:23,000 B^2 would be the surface area of the sphere out there. 626 00:40:23,000 --> 00:40:27,000 OK, so this is actually the natural thing of how much space 627 00:40:27,000 --> 00:40:32,000 you should have at a particular distance. 628 00:40:32,000 --> 00:40:35,000 Assuming we live in a constant dimensional space, 629 00:40:35,000 --> 00:40:40,000 that assumption would be true. This even allows going up to 42 630 00:40:40,000 --> 00:40:43,000 dimensions or whatever, OK, so a pretty reasonable 631 00:40:43,000 --> 00:40:44,000 assumption. Good. 632 00:40:44,000 --> 00:40:47,000 Now, we are going to achieve this bound. 633 00:40:47,000 --> 00:40:52,000 And what we are going to try to do is use an N to the epsilon 634 00:40:52,000 --> 00:40:56,000 way merge sort for some epsilon. And, if we assume that M is at 635 00:40:56,000 --> 00:41:02,000 least B^2, the epsilon will be one third, it turns out. 636 00:41:02,000 --> 00:41:08,000 So, we are going to do the cubed root of N way merge sort. 637 00:41:08,000 --> 00:41:14,000 I'll start by giving you and analyzing the sorting 638 00:41:14,000 --> 00:41:20,000 algorithms, assuming that we know how to do merge in a 639 00:41:20,000 --> 00:41:25,000 particular bound. OK, then we'll do the merge. 640 00:41:25,000 --> 00:41:31,000 The merge is the hard part. 
OK, so the merge, 641 00:41:31,000 --> 00:41:34,000 I'm going to give you the black box first of all. 642 00:41:34,000 --> 00:41:36,000 First of all, what does merge do? 643 00:41:36,000 --> 00:41:40,000 The K way merger is called the K funnel just because it looks 644 00:41:40,000 --> 00:41:42,000 like a funnel, which you'll see. 645 00:41:42,000 --> 00:41:45,000 So, a K funnel is a data structure, or is an algorithm, 646 00:41:45,000 --> 00:41:48,000 let's say, that looks like a data structure. 647 00:41:48,000 --> 00:41:52,000 And it merges K sorted lists. So, supposing you already have 648 00:41:52,000 --> 00:41:56,000 K lists, and they're sorted, and assuming that the lists are 649 00:41:56,000 --> 00:41:59,000 relatively long, so we need some additional 650 00:41:59,000 --> 00:42:03,000 assumptions for this black box to work, and we'll be able to 651 00:42:03,000 --> 00:42:09,000 get them as we sort. We want the total size of those 652 00:42:09,000 --> 00:42:12,000 lists. You add up all the elements, 653 00:42:12,000 --> 00:42:17,000 and all the lists should have size at least K^3 is the 654 00:42:17,000 --> 00:42:21,000 assumption. Then, it merges these lists 655 00:42:21,000 --> 00:42:25,000 using essentially the sorting bound. 656 00:42:25,000 --> 00:42:30,000 Actually, I should really say theta K^3. 657 00:42:30,000 --> 00:42:36,000 I also don't want to be too much bigger than K^3. 658 00:42:36,000 --> 00:42:42,000 Sorry about that. So, the number of memory 659 00:42:42,000 --> 00:42:50,000 transfers that this funnel merger uses is the sorting bound 660 00:42:50,000 --> 00:42:57,000 on K^3, so K^3 over B, log base M over B of K^3 over 661 00:42:57,000 --> 00:43:03,000 B, plus another K memory transfers. 662 00:43:03,000 --> 00:43:06,000 Now, K memory transfers is pretty reasonable. 663 00:43:06,000 --> 00:43:09,000 You've got to at least start reading each list, 664 00:43:09,000 --> 00:43:12,000 so you got to pay one memory transfer per list. 
665 00:43:12,000 --> 00:43:16,000 OK, but our challenge in some sense will be getting rid of 666 00:43:16,000 --> 00:43:19,000 this plus K. This is how fast we can merge. 667 00:43:19,000 --> 00:43:22,000 We'll do that after. Now, assuming we have this, 668 00:43:22,000 --> 00:43:26,000 let me tell you how to sort. This is, appropriately enough, 669 00:43:26,000 --> 00:43:31,000 called funnel sort. But in a certain sense, 670 00:43:31,000 --> 00:43:36,000 it's really cubed root of N way merge sort. 671 00:43:36,000 --> 00:43:41,000 OK, but we'll analyze it using this. 672 00:43:41,000 --> 00:43:47,000 OK, so funnel sort, we are going to define K to be 673 00:43:47,000 --> 00:43:52,000 N to the one third, and apply this merger. 674 00:43:52,000 --> 00:43:56,000 So, what do we do? It's just like here. 675 00:43:56,000 --> 00:44:05,000 We're going to divide our array into N to the one third. 676 00:44:05,000 --> 00:44:09,000 I mean, they should be consecutive subarrays. 677 00:44:09,000 --> 00:44:13,000 I'll call them segments of the array. 678 00:44:13,000 --> 00:44:18,000 OK, for cache oblivious, it's really crucial how these 679 00:44:18,000 --> 00:44:22,000 things are laid out. We're going to cut and get 680 00:44:22,000 --> 00:44:28,000 consecutive chunks of the array, N to the one third of them. 681 00:44:28,000 --> 00:44:34,000 Then I'm going to recursively sort them, and then I'm going to 682 00:44:34,000 --> 00:44:37,000 merge. OK, and I'm going to merge 683 00:44:37,000 --> 00:44:41,000 using the K funnel, the N to the one third funnel 684 00:44:41,000 --> 00:44:43,000 because, now, why do I use one third? 685 00:44:43,000 --> 00:44:48,000 Well, because of this three. OK, in order to use the N to 686 00:44:48,000 --> 00:44:51,000 the one third funnel, I need to guarantee that the 687 00:44:51,000 --> 00:44:55,000 total number of elements that I'm merging is at least the cube 688 00:44:55,000 --> 00:44:57,000 of this number, K^3.
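The recursion just described can be sketched as follows (a minimal sketch, not the lecture's construction: heapq.merge stands in for the K-funnel, so this captures the cube-root recursion shape but none of the cache efficiency):

```python
import heapq

def funnel_sort(a):
    """Funnel sort skeleton: split into ~N^(1/3) contiguous segments,
    recursively sort each segment, then merge all of them at once.
    A real K-funnel does that merge cache-efficiently; heapq.merge
    is only a stand-in so the recursion shape is visible."""
    n = len(a)
    if n <= 8:                           # tiny base case
        return sorted(a)
    k = max(2, round(n ** (1 / 3)))      # K = N^(1/3) segments
    size = -(-n // k)                    # ceil(n / k) elements each
    runs = [funnel_sort(a[i:i + size]) for i in range(0, n, size)]
    return list(heapq.merge(*runs))      # total size is N = K^3, as required
```

Note the segments are consecutive chunks of the array, matching the layout requirement in the lecture.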
689 00:44:57,000 --> 00:45:01,000 The cube of this number is N. That's exactly how many 690 00:45:01,000 --> 00:45:05,000 elements I have in total. OK, so this is exactly where I 691 00:45:05,000 --> 00:45:08,000 can apply the funnel. It's going to require that I 692 00:45:08,000 --> 00:45:11,000 have at least K^3 elements, so that I can only use an N to 693 00:45:11,000 --> 00:45:14,000 the one third funnel. I mean, if it didn't have this 694 00:45:14,000 --> 00:45:17,000 requirement, I could just say, well, I have N lists each of 695 00:45:17,000 --> 00:45:20,000 size one. OK, that's clearly not going to 696 00:45:20,000 --> 00:45:23,000 work very well for our merger, I mean, intuitively because 697 00:45:23,000 --> 00:45:26,000 this plus K will kill you. That will be a plus N which is 698 00:45:26,000 --> 00:45:30,000 way too big. But we can use an N to the one 699 00:45:30,000 --> 00:45:35,000 third funnel, and this is how we would sort. 700 00:45:35,000 --> 00:45:38,000 So, let's analyze this algorithm. 701 00:45:38,000 --> 00:45:42,000 Hopefully, it will give the sorting bound if I did 702 00:45:42,000 --> 00:45:47,000 everything correctly. OK, this is pretty easy. 703 00:45:47,000 --> 00:45:52,000 The only thing that makes this messy is I have to write the 704 00:45:52,000 --> 00:45:58,000 sorting bound over and over. OK, this is the cost of the 705 00:45:58,000 --> 00:46:02,000 merge. So that's at the root. 706 00:46:02,000 --> 00:46:07,000 But K^3 in this case is N. So at the root of the 707 00:46:07,000 --> 00:46:11,000 recursion, let me write the recurrence first. 708 00:46:11,000 --> 00:46:15,000 Sorry. So, we have memory transfers on 709 00:46:15,000 --> 00:46:19,000 N elements is N to the one third. 710 00:46:19,000 --> 00:46:24,000 Let me get this right. Yeah, N to the one third 711 00:46:24,000 --> 00:46:28,000 recursions, each of size N to the two thirds, 712 00:46:28,000 --> 00:46:34,000 OK, plus this time, except K^3 is N.
713 00:46:34,000 --> 00:46:40,000 So, this is plus N over B, log base M over B of N over B 714 00:46:40,000 --> 00:46:46,000 plus cubed root of N. This is the additive plus K term. 715 00:46:46,000 --> 00:46:52,000 OK, so that's my recurrence. The base case will be the 716 00:46:52,000 --> 00:46:57,000 usual. MT of some constant times M is 717 00:46:57,000 --> 00:47:02,000 order M over B. So, we sort of know what we 718 00:47:02,000 --> 00:47:06,000 should get here. Well, not really. 719 00:47:06,000 --> 00:47:09,000 So, in all the previous recurrences, 720 00:47:09,000 --> 00:47:15,000 we have the same costs at every level, and that's where we got 721 00:47:15,000 --> 00:47:20,000 our log factor. Now, we already have a log 722 00:47:20,000 --> 00:47:24,000 factor, so we better not get another one. 723 00:47:24,000 --> 00:47:28,000 Right, this is the bound we want to prove. 724 00:47:28,000 --> 00:47:33,000 So, let me cheat here for a second. 725 00:47:33,000 --> 00:47:36,000 All right, indeed. You may already be wondering, 726 00:47:36,000 --> 00:47:39,000 this N to the one third seems rather large. 727 00:47:39,000 --> 00:47:43,000 If it's bigger than this, we are already in trouble at 728 00:47:43,000 --> 00:47:45,000 the very top level of the recursion. 729 00:47:45,000 --> 00:47:49,000 So, I claim that that's OK. Let's look at N to the one 730 00:47:49,000 --> 00:47:51,000 third. OK, there is a base case here 731 00:47:51,000 --> 00:47:54,000 which covers all values of N that are, at most, 732 00:47:54,000 --> 00:47:58,000 some constant times M. So, if I'm in this case, 733 00:47:58,000 --> 00:48:02,000 I know that N is at least as big as the cache up to some 734 00:48:02,000 --> 00:48:06,000 constant. OK, now the cache is at least 735 00:48:06,000 --> 00:48:10,000 B^2, we've assumed. And you can do this with B to 736 00:48:10,000 --> 00:48:13,000 the one plus epsilon if you're more careful. 737 00:48:13,000 --> 00:48:15,000 So, N is at least B^2, OK?
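A quick numeric check of that base-case arithmetic, for a few made-up block sizes: once M >= B^2 and N is past the base case (N at least a constant times M), the N/B term dominates the additive cube-root-of-N term.

```python
for b in (2 ** 4, 2 ** 8, 2 ** 12):
    m = b * b                        # tall-cache assumption: M >= B^2
    n = 4 * m                        # any N past the base case, N >= c*M
    assert n / b >= n ** 0.5         # N/B >= sqrt(N), since B <= sqrt(N)
    assert n ** 0.5 > n ** (1 / 3)   # and sqrt(N) > cbrt(N) for N > 1
print("N/B dominates the +N^(1/3) term once N = Omega(M) and M >= B^2")
```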
738 00:48:15,000 --> 00:48:19,000 And then, I always have trouble with these. 739 00:48:19,000 --> 00:48:23,000 So this means that N divided by B is omega root N. 740 00:48:23,000 --> 00:48:26,000 OK, there's many things you could say here, 741 00:48:26,000 --> 00:48:30,000 and only one of them is right. So, why? 742 00:48:30,000 --> 00:48:34,000 So this says that the square root of N is at least B, 743 00:48:34,000 --> 00:48:38,000 and so N divided by B is at least N divided by square root of 744 00:48:38,000 --> 00:48:41,000 N. So that's at least the square 745 00:48:41,000 --> 00:48:43,000 root of N if you check that all out. 746 00:48:43,000 --> 00:48:48,000 I'm going to go through this arithmetic relatively quickly 747 00:48:48,000 --> 00:48:50,000 because it's tedious but necessary. 748 00:48:50,000 --> 00:48:54,000 OK, the square root of N is strictly bigger than cubed root 749 00:48:54,000 --> 00:48:57,000 of N. OK, so that means that N over B 750 00:48:57,000 --> 00:49:02,000 is strictly bigger than N to the one third. 751 00:49:02,000 --> 00:49:05,000 Here we have N over B times something that's bigger than 752 00:49:05,000 --> 00:49:07,000 one. So this term definitely 753 00:49:07,000 --> 00:49:10,000 dominates this term in this case. 754 00:49:10,000 --> 00:49:14,000 As long as I'm not in the base case, I know N is at least order 755 00:49:14,000 --> 00:49:16,000 M. This term disappears from my 756 00:49:16,000 --> 00:49:18,000 recurrence. OK, so, good. 757 00:49:18,000 --> 00:49:21,000 That was a bit close. Now, what we want to get is 758 00:49:21,000 --> 00:49:25,000 this running time overall. So, the recursive cost better 759 00:49:25,000 --> 00:49:29,000 be small, better be less than the constant factor increase 760 00:49:29,000 --> 00:49:35,000 over this. So, let's write the recurrence. 761 00:49:35,000 --> 00:49:39,000 So, we get N over B, log base M over B, 762 00:49:39,000 --> 00:49:44,000 N over B at the root.
Then, we split into a lot of 763 00:49:44,000 --> 00:49:49,000 subproblems, N to the one third subproblems here, 764 00:49:49,000 --> 00:49:55,000 and each one costs essentially this but with N replaced by N to 765 00:49:55,000 --> 00:50:00,000 the two thirds. OK, so N to the two thirds log 766 00:50:00,000 --> 00:50:04,000 base M over B, oops I forgot to divide it by B 767 00:50:04,000 --> 00:50:11,000 out here, of N to the two thirds divided by B. 768 00:50:11,000 --> 00:50:14,000 That's the cost of one of these nodes, N to the one third of 769 00:50:14,000 --> 00:50:17,000 them. What should they add up to? 770 00:50:17,000 --> 00:50:20,000 Well, there is N to the one third, and there's an N to the 771 00:50:20,000 --> 00:50:23,000 two thirds here that multiplies out to N. 772 00:50:23,000 --> 00:50:25,000 So, we get N over B. This looks bad. 773 00:50:25,000 --> 00:50:28,000 This looks the same. And we don't want to lose 774 00:50:28,000 --> 00:50:31,000 another log factor. But the good news is we have 775 00:50:31,000 --> 00:50:35,000 two thirds in here. OK, this is what we get in 776 00:50:35,000 --> 00:50:38,000 total at this level. It looks like the sorting 777 00:50:38,000 --> 00:50:41,000 bound, but in the log there's still a two thirds. 778 00:50:41,000 --> 00:50:45,000 Now, a power of two thirds inside a log comes out as a multiple of 779 00:50:45,000 --> 00:50:48,000 two thirds. So, this is in fact two thirds 780 00:50:48,000 --> 00:50:51,000 times N over B, log base M over B of N over B, 781 00:50:51,000 --> 00:50:54,000 the sorting bound. So, this is two thirds of the 782 00:50:54,000 --> 00:50:57,000 sorting bound. And this is the sorting bound, 783 00:50:57,000 --> 00:51:01,000 one times the sorting bound. So, it's going down 784 00:51:01,000 --> 00:51:02,000 geometrically, yea! 785 00:51:02,000 --> 00:51:05,000 OK, I'm not going to prove it, but it's true. 786 00:51:05,000 --> 00:51:08,000 This went down by a factor of two thirds.
787 00:51:08,000 --> 00:51:12,000 The next one will also go down by a factor of two thirds by 788 00:51:12,000 --> 00:51:14,000 induction. OK, if you prove it at one 789 00:51:14,000 --> 00:51:17,000 level, it should be true at all of them. 790 00:51:17,000 --> 00:51:19,000 And I'm going to skip the details there. 791 00:51:19,000 --> 00:51:23,000 So, we could check the leaf level just to make sure. 792 00:51:23,000 --> 00:51:25,000 That's always a good sanity check. 793 00:51:25,000 --> 00:51:30,000 At the leaves, we know our cost is M over B. 794 00:51:30,000 --> 00:51:32,000 OK, and how many leaves are there? 795 00:51:32,000 --> 00:51:34,000 Just like before, in some sense, 796 00:51:34,000 --> 00:51:38,000 we have N/M leaves. OK, so in fact the total cost 797 00:51:38,000 --> 00:51:41,000 at the bottom is N over B. And it turns out that that's 798 00:51:41,000 --> 00:51:44,000 what you get. So, you essentially, 799 00:51:44,000 --> 00:51:47,000 it looks funny, because you'd think that this 800 00:51:47,000 --> 00:51:51,000 would actually be smaller than this at some intuitive level. 801 00:51:51,000 --> 00:51:54,000 It's not. In fact, what's happening is 802 00:51:54,000 --> 00:51:57,000 you have this N over B times this log thing, 803 00:51:57,000 --> 00:52:00,000 whatever the log thing is. We don't care too much. 804 00:52:00,000 --> 00:52:05,000 Let's just call it log. What you are taking at the next 805 00:52:05,000 --> 00:52:08,000 level is two thirds times that log. 806 00:52:08,000 --> 00:52:11,000 And at the next level, it's four ninths times that log 807 00:52:11,000 --> 00:52:13,000 and so on. So, it's geometrically 808 00:52:13,000 --> 00:52:16,000 decreasing until the log gets down to one. 809 00:52:16,000 --> 00:52:17,000 And then you stop the recursion. 810 00:52:17,000 --> 00:52:21,000 And that's what you get N over B here with no log. 
811 00:52:21,000 --> 00:52:23,000 So, what you're doing is decreasing the log, 812 00:52:23,000 --> 00:52:27,000 not the N over B stuff. The two thirds should really be 813 00:52:27,000 --> 00:52:29,000 over here. In fact, the number of levels 814 00:52:29,000 --> 00:52:34,000 here is log log N. It's the number of times you 815 00:52:34,000 --> 00:52:39,000 have to divide a log by three halves before you get down to 816 00:52:39,000 --> 00:52:42,000 one, OK? So, we don't actually need 817 00:52:42,000 --> 00:52:45,000 that. We don't care how many levels 818 00:52:45,000 --> 00:52:49,000 there are because it's geometrically decreasing. 819 00:52:49,000 --> 00:52:52,000 It could be infinitely many levels. 820 00:52:52,000 --> 00:52:58,000 It's geometrically decreasing, and we get this as our running 821 00:52:58,000 --> 00:53:01,000 time. MT of N is the sorting bound 822 00:53:01,000 --> 00:53:05,000 for funnel sort. So, this is great. 823 00:53:05,000 --> 00:53:09,000 As long as we can get a funnel that merges this quickly, 824 00:53:09,000 --> 00:53:14,000 we get a sorting algorithm that sorts as fast as it possibly 825 00:53:14,000 --> 00:53:17,000 can. I didn't write that on the 826 00:53:17,000 --> 00:53:20,000 board that this is asymptotically optimal. 827 00:53:20,000 --> 00:53:25,000 Even if you knew what B and M were, this is the best that you 828 00:53:25,000 --> 00:53:28,000 could hope to do. And here, we are doing it no 829 00:53:28,000 --> 00:53:32,000 matter what B and M are. Good. 830 00:53:32,000 --> 00:53:35,000 Get ready for the funnel. The funnel will be another 831 00:53:35,000 --> 00:53:37,000 recursion. So, this is a recursive 832 00:53:37,000 --> 00:53:39,000 algorithm in a recursive algorithm. 833 00:53:39,000 --> 00:53:43,000 It's another divide and conquer, kind of like the static 834 00:53:43,000 --> 00:53:46,000 search trees we saw at the beginning of this lecture. 835 00:53:46,000 --> 00:53:49,000 So, these all tie together.
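The geometrically decreasing level costs described above can be sanity-checked numerically. Here is a small sketch (the helper names and the sample values of N, M, and B are illustrative, not from the lecture) that sums the per-level costs of two thirds to the i times the sorting bound:

```python
# Sketch (values and helper names are illustrative, not from the lecture):
# sum the per-level costs (2/3)^i * sorting_bound and check that the total
# stays within a constant factor of the sorting bound itself.
import math

def sorting_bound(n, m, b):
    # (N/B) * log base (M/B) of (N/B)
    return (n / b) * math.log(n / b, m / b)

def total_level_cost(n, m, b):
    base = sorting_bound(n, m, b)
    log_factor = math.log(n / b, m / b)  # remaining log at the current level
    total, shrink = 0.0, 1.0
    while log_factor * shrink >= 1.0:    # recurse until the log reaches 1
        total += base * shrink
        shrink *= 2 / 3                  # each level keeps 2/3 of the log
    return total + n / b                 # leaf level: N/B with no log

n, m, b = 2**30, 2**20, 2**10
ratio = total_level_cost(n, m, b) / sorting_bound(n, m, b)
# geometric series bounds the ratio by 1/(1 - 2/3) = 3 plus lower-order terms
```

With these sample parameters the ratio lands well under the geometric-series bound of 3, which is the "it's geometrically decreasing, yea" claim in numbers.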
836 00:54:03,000 --> 00:54:06,000 All right, the K funnel, so, I'm calling it K funnel 837 00:54:06,000 --> 00:54:10,000 because I want to think of it at some recursive level, 838 00:54:10,000 --> 00:54:14,000 not just N to the one third. OK, we're going to recursively 839 00:54:14,000 --> 00:54:17,000 use, in fact, the square root of K funnel. 840 00:54:17,000 --> 00:54:21,000 So, here's, and I need to achieve that bound. 841 00:54:21,000 --> 00:54:24,000 So, the recursion is like the static search tree, 842 00:54:24,000 --> 00:54:27,000 and a little bit hard to draw on one board, 843 00:54:27,000 --> 00:54:34,000 but here we go. So, we have a square root of K 844 00:54:34,000 --> 00:54:37,000 funnel. Recursively, 845 00:54:37,000 --> 00:54:44,000 we have a buffer up here. This is called the output 846 00:54:44,000 --> 00:54:50,000 buffer, and it has size K^3, and just for kicks, 847 00:54:50,000 --> 00:54:57,000 let's suppose it has filled up a little bit. 848 00:54:57,000 --> 00:55:06,000 And, we have some more buffers. And, let's suppose they've been 849 00:55:06,000 --> 00:55:13,000 filled up by different amounts. And each of these has size K to 850 00:55:13,000 --> 00:55:16,000 the three halves, of course. 851 00:55:16,000 --> 00:55:21,000 These are called buffers, let's say, 852 00:55:21,000 --> 00:55:28,000 the intermediate buffers. And, then hanging off of them, 853 00:55:28,000 --> 00:55:34,000 we have more funnels, the square root of K funnel 854 00:55:34,000 --> 00:55:40,000 here, and a square root of K funnel here, one for each 855 00:55:40,000 --> 00:55:47,000 buffer, one for each child of this funnel. 856 00:55:47,000 --> 00:55:53,000 OK, and then hanging off of these funnels are the input 857 00:55:53,000 --> 00:55:54,000 arrays. 858 00:56:07,000 --> 00:56:12,000 OK, I'm not going to draw all K of them, but there are K input 859 00:56:12,000 --> 00:56:16,000 arrays, input lists let's call them down at the bottom.
860 00:56:16,000 --> 00:56:21,000 OK, so the idea is we are going to merge bottom-up in this 861 00:56:21,000 --> 00:56:23,000 picture. We start with our K input 862 00:56:23,000 --> 00:56:26,000 arrays of total size at least K^3. 863 00:56:26,000 --> 00:56:31,000 That's what we're assuming we have up here. 864 00:56:31,000 --> 00:56:34,000 We are clustering them into groups of size square root of K, 865 00:56:34,000 --> 00:56:37,000 so, the square root of K groups, throw each of them into 866 00:56:37,000 --> 00:56:40,000 a square root of K funnel that recursively merges those square 867 00:56:40,000 --> 00:56:43,000 root of K lists. The output of those funnels we 868 00:56:43,000 --> 00:56:46,000 are putting into a buffer to sort of accumulate what the 869 00:56:46,000 --> 00:56:49,000 answer should be. These buffers have size 870 00:56:49,000 --> 00:56:52,000 exactly K to the three halves, which might not be perfect 871 00:56:52,000 --> 00:56:55,000 because we know that on average, there should be K to the three 872 00:56:55,000 --> 00:56:59,000 halves elements in each of these: each square root of K funnel, 873 00:56:59,000 --> 00:57:02,000 by definition, outputs the square root of K 874 00:57:02,000 --> 00:57:05,000 cubed elements per invocation, and that is K to the 875 00:57:05,000 --> 00:57:07,000 three halves. 876 00:57:07,000 --> 00:57:09,000 But some of these will be bigger. 877 00:57:09,000 --> 00:57:12,000 Some of them will be smaller. I've drawn it here. 878 00:57:12,000 --> 00:57:15,000 Some of them had emptied a bit more depending on how you merge 879 00:57:15,000 --> 00:57:16,000 things. But on average, 880 00:57:16,000 --> 00:57:18,000 these will all fill at the same time. 881 00:57:18,000 --> 00:57:22,000 And then, we plug them into a square root of K funnel, 882 00:57:22,000 --> 00:57:24,000 and then we get the output of size K^3. 883 00:57:24,000 --> 00:57:28,000 So, that is roughly what we should have happen.
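The intermediate buffer size can be read off directly from the funnel's defining convention (a K funnel produces an output of size K^3, so its square-root-of-K sub-funnels produce outputs of size the square root of K, cubed):

```latex
\text{output of one } \sqrt{K}\text{-funnel invocation}
\;=\; \left(\sqrt{K}\right)^{3}
\;=\; K^{3/2}
\;=\; \text{size of each intermediate buffer.}
```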
884 00:57:28,000 --> 00:57:31,000 OK, but in fact, some of these might fill first, 885 00:57:31,000 --> 00:57:36,000 and we have to do some merging in order to empty a buffer, 886 00:57:36,000 --> 00:57:39,000 make room for more stuff coming up. 887 00:57:39,000 --> 00:57:43,000 That's the picture. Now, before I actually tell you 888 00:57:43,000 --> 00:57:47,000 what the algorithm is, or analyze the algorithm, 889 00:57:47,000 --> 00:57:51,000 let's first just think about space, a very simple warm-up 890 00:57:51,000 --> 00:57:54,000 analysis. So, let's look at the space 891 00:57:54,000 --> 00:58:00,000 excluding the inputs and outputs, those buffers. 892 00:58:00,000 --> 00:58:02,000 OK, why do I want to exclude input and output buffers? 893 00:58:02,000 --> 00:58:05,000 Well, because I want to only count each buffer once, 894 00:58:05,000 --> 00:58:09,000 and this buffer is actually the input to this one and the output 895 00:58:09,000 --> 00:58:11,000 to this one. So, in order to recursively 896 00:58:11,000 --> 00:58:14,000 count all the buffers exactly once, I'm only going to count 897 00:58:14,000 --> 00:58:16,000 these middle buffers. And then separately, 898 00:58:16,000 --> 00:58:20,000 I'm going to have to think of the overall output and input 899 00:58:20,000 --> 00:58:22,000 buffers. But those are sort of given. 900 00:58:22,000 --> 00:58:23,000 I mean, I need K^3 for the output. 901 00:58:23,000 --> 00:58:26,000 I need K^3 for the input. So ignore those overall. 902 00:58:26,000 --> 00:58:29,000 And then if I count the middle buffers recursively, 903 00:58:29,000 --> 00:58:34,000 I'll get all the buffers. So, then we get a very simple 904 00:58:34,000 --> 00:58:39,000 recurrence for space.
S of K is roughly square root 905 00:58:39,000 --> 00:58:45,000 of K plus one times S of square root of K plus order K^2, 906 00:58:45,000 --> 00:58:51,000 K^2 because we have the square root of K of these buffers, 907 00:58:51,000 --> 00:58:54,000 each of size K to the three halves. 908 00:58:54,000 --> 00:58:58,000 Work that out, does that sound right? 909 00:58:58,000 --> 00:59:02,000 That sounds an awful lot like K^3, but maybe, 910 00:59:02,000 --> 00:59:06,000 all right. Oh, no, that's right. 911 00:59:06,000 --> 00:59:09,000 It's K to the three halves times the square root of K, 912 00:59:09,000 --> 00:59:13,000 which is K to the three halves plus a half, which is K to the 913 00:59:13,000 --> 00:59:16,000 four halves, which is K^2. Phew, OK, good. 914 00:59:16,000 --> 00:59:18,000 I'm just bad with my arithmetic here. 915 00:59:18,000 --> 00:59:20,000 OK, so K^2 total buffering here. 916 00:59:20,000 --> 00:59:23,000 You add them up for each level, each recursion, 917 00:59:23,000 --> 00:59:27,000 and the plus one here is to take into account the top guy, 918 00:59:27,000 --> 00:59:31,000 the square root of K bottom guys, so the square root of K 919 00:59:31,000 --> 00:59:33,000 plus one. If this were, 920 00:59:33,000 --> 00:59:36,000 well, let me just draw the recurrence tree. 921 00:59:36,000 --> 00:59:39,000 There's many ways you could solve this recurrence. 922 00:59:39,000 --> 00:59:41,000 A natural one is instead of looking at K, 923 00:59:41,000 --> 00:59:44,000 you look at log K, because here the log K is 924 00:59:44,000 --> 00:59:47,000 getting divided by two. I'm just going to draw the 925 00:59:47,000 --> 00:59:50,000 recursion tree, so you can see the intuition. 926 00:59:50,000 --> 00:59:53,000 But if you are going to solve it, you should probably take the 927 00:59:53,000 --> 00:59:57,000 logs, substitute by log. So, we have the square root of 928 00:59:57,000 --> 01:00:00,000 K plus one branching factor.
929 01:00:00,000 --> 01:00:03,729 And then, the problem is size square root of K, 930 01:00:03,729 --> 01:00:08,108 so this is going to be K, I believe, for each of these. 931 01:00:08,108 --> 01:00:12,324 This is the square root of K, squared, which is the cost of these 932 01:00:12,324 --> 01:00:14,513 levels. And, you keep going. 933 01:00:14,513 --> 01:00:19,540 I don't particularly care what the bottom looks like because at 934 01:00:19,540 --> 01:00:23,351 the top we have K^2. Then we have K times root K 935 01:00:23,351 --> 01:00:28,297 plus one cost at the next level. This is K to the three halves 936 01:00:28,297 --> 01:00:32,664 plus K. OK, so we go from K^2 to K to 937 01:00:32,664 --> 01:00:37,257 the three halves plus K. This is a super-geometric. 938 01:00:37,257 --> 01:00:41,207 It's like an exponential geometric decrease. 939 01:00:41,207 --> 01:00:45,800 This is decreasing really fast. So, it's order K^2. 940 01:00:45,800 --> 01:00:51,220 That's my hand-waving argument. OK, so the cost is basically 941 01:00:51,220 --> 01:00:56,456 the size of the buffers at the top level, the total space. 942 01:00:56,456 --> 01:01:01,601 We're going to need this. It's actually theta K^2 because 943 01:01:01,601 --> 01:01:06,398 I have a theta K^2 here. We are going to need this in 944 01:01:06,398 --> 01:01:09,249 order to analyze the time. That's why I mentioned it. 945 01:01:09,249 --> 01:01:12,368 It's not just a good feeling that the space is not too big. 946 01:01:12,368 --> 01:01:15,595 In fact, the funnel is a lot smaller than the total input size. 947 01:01:15,595 --> 01:01:18,177 The input size is K^3. But that's not so crucial. 948 01:01:18,177 --> 01:01:21,243 What's crucial is that it's K^2, and we'll use that in the 949 01:01:21,243 --> 01:01:22,480 analysis. OK, naturally, 950 01:01:22,480 --> 01:01:24,308 this thing is laid out recursively. 951 01:01:24,308 --> 01:01:26,675 You recursively store the funnel, top funnel.
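The hand-waving "super-geometric decrease" can be checked directly by evaluating the space recurrence S(K) = (sqrt(K) + 1) S(sqrt(K)) + Theta(K^2). A minimal sketch (the base case, constant space for tiny funnels, is an assumption for illustration):

```python
# Sketch: evaluate S(K) = (sqrt(K) + 1) * S(sqrt(K)) + K^2 and check it is O(K^2).
# The base case (constant space for tiny funnels) is assumed for illustration.
import math

def space(k):
    if k <= 2:
        return 1.0                      # assumed constant-size base case
    r = math.sqrt(k)
    return (r + 1) * space(r) + k * k   # sqrt(K)+1 sub-funnels plus the buffers

ratios = [space(k) / k**2 for k in (16, 256, 65536, 2**32)]
# each ratio stays bounded by a small constant, so S(K) = Theta(K^2)
```

The ratio S(K)/K^2 actually tends toward 1 as K grows, which is the "cost is basically the size of the buffers at the top level" observation.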
952 01:01:26,675 --> 01:01:29,256 Then, for example, you write out each buffer as a 953 01:01:29,256 --> 01:01:32,000 consecutive array, in this case. 954 01:01:32,000 --> 01:01:34,748 There's no recursion there. So just write them all out one 955 01:01:34,748 --> 01:01:36,243 by one. Don't interleave them or 956 01:01:36,243 --> 01:01:37,642 anything. Store them in order. 957 01:01:37,642 --> 01:01:40,005 And then, you write out recursively these funnels, 958 01:01:40,005 --> 01:01:41,934 the bottom funnels. OK, any way you do it 959 01:01:41,934 --> 01:01:44,634 recursively, as long as each funnel remains a consecutive 960 01:01:44,634 --> 01:01:46,418 chunk of memory, each buffer remains a 961 01:01:46,418 --> 01:01:49,167 consecutive chunk of memory, the time analysis that we are 962 01:01:49,167 --> 01:01:51,000 about to do will work. 963 01:02:14,000 --> 01:02:18,062 OK, let me actually give you the algorithm that we're 964 01:02:18,062 --> 01:02:21,265 analyzing. In order to make the funnel go, 965 01:02:21,265 --> 01:02:25,015 what we do is say, initially, all the buffers are 966 01:02:25,015 --> 01:02:27,671 empty. Everything is at the bottom. 967 01:02:27,671 --> 01:02:32,125 And what we are going to do is, say, fill the root buffer. 968 01:02:32,125 --> 01:02:36,040 Fill this one. And, that's a recursive 969 01:02:36,040 --> 01:02:41,542 algorithm, which I'll define in a second, how to fill a buffer. 970 01:02:41,542 --> 01:02:45,713 Once it's filled, that means everything has been 971 01:02:45,713 --> 01:02:50,682 pulled up, and then it's merged. OK, so that's how we get 972 01:02:50,682 --> 01:02:53,522 started. So, merge means: the merge 973 01:02:53,522 --> 01:02:58,402 algorithm is to fill the topmost buffer, the topmost output 974 01:02:58,402 --> 01:03:01,002 buffer. OK, and now, 975 01:03:01,002 --> 01:03:04,678 here's how you fill a buffer.
So, in general, 976 01:03:04,678 --> 01:03:08,355 if you expand out this recursion all the way, 977 01:03:08,355 --> 01:03:12,114 in the base case, I didn't mention you sort of 978 01:03:12,114 --> 01:03:16,710 get a little node there. So, if you look at an arbitrary 979 01:03:16,710 --> 01:03:20,386 buffer in this picture that you want to fill, 980 01:03:20,386 --> 01:03:23,979 so this one's empty and you want to fill it, 981 01:03:23,979 --> 01:03:28,407 then immediately below it will be a vertex which has two 982 01:03:28,407 --> 01:03:34,434 children, two other buffers. OK, maybe they look like this. 983 01:03:34,434 --> 01:03:39,141 You have no idea how big they are, except they are the same 984 01:03:39,141 --> 01:03:41,981 size. It could be a lot smaller than 985 01:03:41,981 --> 01:03:44,984 this one, a lot bigger, we don't know. 986 01:03:44,984 --> 01:03:48,554 But in the end, you do get a binary structure 987 01:03:48,554 --> 01:03:53,261 out of this just like we did with the binary search tree at 988 01:03:53,261 --> 01:03:56,913 the beginning. So, how do we fill this buffer? 989 01:03:56,913 --> 01:04:03,000 Well, we just merge these two child buffers as long as we can. 990 01:04:03,000 --> 01:04:08,854 So, we merge the two child buffers as long as they are both 991 01:04:08,854 --> 01:04:11,253 non-empty. So, in general, 992 01:04:11,253 --> 01:04:16,820 the invariant will be that this buffer, let me write down a 993 01:04:16,820 --> 01:04:19,795 sentence: whatever is in 994 01:04:19,795 --> 01:04:25,170 a buffer, and hasn't been 995 01:04:25,170 --> 01:04:29,009 used already, is a prefix of the merged 996 01:04:29,009 --> 01:04:34,000 output of the entire subtree beneath it. 997 01:04:34,000 --> 01:04:37,567 OK, so this is a partially merged subsequence of everything 998 01:04:37,567 --> 01:04:39,781 down here. This is a partially merged 999 01:04:39,781 --> 01:04:41,933 subsequence of everything down here.
1000 01:04:41,933 --> 01:04:44,824 I can just merge element by element off the top, 1001 01:04:44,824 --> 01:04:48,453 and that will give me outputs to put there until one of them 1002 01:04:48,453 --> 01:04:51,096 gets emptied. And, we have no idea which one 1003 01:04:51,096 --> 01:04:54,357 will empty first just because it depends on the order. 1004 01:04:54,357 --> 01:04:57,801 OK, whenever one of them empties, we recursively fill it, 1005 01:04:57,801 --> 01:05:01,000 and that's it. That's the algorithm. 1006 01:05:01,000 --> 01:05:05,000 Whenever one empties -- 1007 01:05:16,000 --> 01:05:20,391 -- we recursively fill it. And at the base case at the 1008 01:05:20,391 --> 01:05:23,456 leaves, there's sort of nothing to do. 1009 01:05:23,456 --> 01:05:27,846 I believe you just sort of directly read from an input 1010 01:05:27,846 --> 01:05:30,167 list. So, at the very bottom, 1011 01:05:30,167 --> 01:05:34,807 if you have some node here that's trying to merge between 1012 01:05:34,807 --> 01:05:39,198 these two, that's just a straightforward merge between 1013 01:05:39,198 --> 01:05:42,595 two lists. We know how to do that with two 1014 01:05:42,595 --> 01:05:44,832 parallel scans. So, in fact, 1015 01:05:44,832 --> 01:05:49,886 we can merge the entire thing here and just spit it out to the 1016 01:05:49,886 --> 01:05:52,786 buffer. Well, it depends how big the 1017 01:05:52,786 --> 01:05:56,100 buffer is. We can only merge it until the 1018 01:05:56,100 --> 01:06:01,445 buffer fills. Whenever a buffer is full, 1019 01:06:01,445 --> 01:06:05,394 we stop and we pop up the recursive layers. 1020 01:06:05,394 --> 01:06:11,131 OK, so we keep doing this merge until the buffer we are trying 1021 01:06:11,131 --> 01:06:14,047 to fill fills, and then we stop, 1022 01:06:14,047 --> 01:06:17,338 pop up. OK, that's the algorithm for 1023 01:06:17,338 --> 01:06:20,724 merging. Now, we just have to analyze 1024 01:06:20,724 --> 01:06:24,579 the algorithm.
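The fill routine just described can be sketched as a simplified binary merger. This is an illustration of the buffer-filling idea only; the real funnel's buffer sizes, recursive layout, and cache behavior are not modeled, and all names here are invented for the sketch:

```python
# Minimal sketch of the "fill a buffer" routine: a binary merge tree where
# every internal node owns a bounded output buffer, and an empty child buffer
# is recursively refilled before merging continues.
from collections import deque

class Merger:
    def __init__(self, left, right, capacity):
        self.left, self.right = left, right  # children: Merger, or deque (input list)
        self.buf = deque()                   # this node's output buffer
        self.capacity = capacity
        self.exhausted = False

    def _child_buf(self, child):
        return child.buf if isinstance(child, Merger) else child

    def _refill(self, child):
        # recursively fill a child merger whose buffer has emptied
        if isinstance(child, Merger) and not child.exhausted:
            child.fill()

    def fill(self):
        """Merge the two child buffers into self.buf until it fills,
        refilling a child buffer whenever it empties."""
        while len(self.buf) < self.capacity:
            for c in (self.left, self.right):
                if not self._child_buf(c):
                    self._refill(c)
            lb, rb = self._child_buf(self.left), self._child_buf(self.right)
            if lb and rb:
                src = lb if lb[0] <= rb[0] else rb  # take the smaller head
            elif lb or rb:
                src = lb or rb                       # one side is done
            else:
                self.exhausted = True                # both sides are done
                return
            self.buf.append(src.popleft())

# usage: merge four sorted runs through a two-level funnel of binary mergers
runs = [deque([1, 5, 9]), deque([2, 6]), deque([3, 7]), deque([4, 8, 10])]
root = Merger(Merger(runs[0], runs[1], 4), Merger(runs[2], runs[3], 4), 16)
root.fill()
merged = list(root.buf)  # the fully merged output, [1, 2, ..., 10]
```

The small child capacities force the refill path to trigger several times during the run, which is exactly the "whenever one empties, we recursively fill it" behavior.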
It's actually not too hard, 1025 01:06:24,579 --> 01:06:29,000 but it's a pretty clever analysis. 1026 01:06:29,000 --> 01:06:31,898 And, to top it off, it's an amortization, 1027 01:06:31,898 --> 01:06:35,159 your favorite. OK, so we get one last practice 1028 01:06:35,159 --> 01:06:39,072 at amortized analysis in the context of cache oblivious 1029 01:06:39,072 --> 01:06:41,971 algorithms. So, this is going to be a bit 1030 01:06:41,971 --> 01:06:45,231 sophisticated. We are going to combine all the 1031 01:06:45,231 --> 01:06:48,492 ideas we've seen. The main analysis idea we've 1032 01:06:48,492 --> 01:06:52,840 seen is that we are doing this recursion in the construction, 1033 01:06:52,840 --> 01:06:55,666 and if we imagine, we take our K funnel, 1034 01:06:55,666 --> 01:06:59,507 we split it in the middle level, make a whole bunch of 1035 01:06:59,507 --> 01:07:03,202 square root of K funnels, and so on, and then we cut 1036 01:07:03,202 --> 01:07:07,188 those in the middle level, get fourth root of K funnels, 1037 01:07:07,188 --> 01:07:10,666 and so on, and so on, at some point the funnel we 1038 01:07:10,666 --> 01:07:15,816 look at fits in cache. OK, before we said if it's in a 1039 01:07:15,816 --> 01:07:17,984 block. Now, we're going to say that at 1040 01:07:17,984 --> 01:07:20,913 some point, one of these funnels will fit in cache. 1041 01:07:20,913 --> 01:07:24,253 Each of the funnels at that recursive level of detail will 1042 01:07:24,253 --> 01:07:26,656 fit in cache. We are going to analyze that 1043 01:07:26,656 --> 01:07:29,000 level. We'll call that level J. 1044 01:07:29,000 --> 01:07:37,266 So, consider the first recursive level of detail, 1045 01:07:37,266 --> 01:07:45,877 and I'll call it J, at which every J funnel we have 1046 01:07:45,877 --> 01:07:53,800 fits, let's say, not only does it fit in cache, 1047 01:07:53,800 --> 01:08:02,337 but four of them fit in cache. It fits in one quarter of the 1048 01:08:02,337 --> 01:08:05,158 cache. 
OK, but we need to leave some 1049 01:08:05,158 --> 01:08:07,899 cache extra for doing other things. 1050 01:08:07,899 --> 01:08:11,607 But I want to make sure that the J funnel fits. 1051 01:08:11,607 --> 01:08:16,040 OK, now what does that mean? Well, we've analyzed space. 1052 01:08:16,040 --> 01:08:19,988 We know that the space of a J funnel is about J^2, 1053 01:08:19,988 --> 01:08:24,020 some constant times J^2. We'll call it C times J^2. 1054 01:08:24,020 --> 01:08:27,969 OK, so this is saying that C times J^2 is at most, 1055 01:08:27,969 --> 01:08:32,000 M over 4, one quarter of the cache. 1056 01:08:32,000 --> 01:08:35,915 OK, that means a J funnel, whatever size it happens to be, fits in a 1057 01:08:35,915 --> 01:08:38,803 quarter of the cache. OK, at some point in the 1058 01:08:38,803 --> 01:08:41,884 recursion, we'll have this big tree of J funnels, 1059 01:08:41,884 --> 01:08:44,515 with all sorts of buffers in between them, 1060 01:08:44,515 --> 01:08:46,697 and each of the J funnels will fit. 1061 01:08:46,697 --> 01:08:49,520 So, let's think about one of those J funnels. 1062 01:08:49,520 --> 01:08:51,960 Suppose J is like the square root of K. 1063 01:08:51,960 --> 01:08:55,618 So, this is the picture because otherwise I have to draw a 1064 01:08:55,618 --> 01:08:58,314 bigger one. So, suppose this is a J funnel. 1065 01:08:58,314 --> 01:09:03,000 It has a bunch of input buffers, has one output buffer. 1066 01:09:03,000 --> 01:09:06,366 So, we just want to think about how the J funnel executes. 1067 01:09:06,366 --> 01:09:09,259 And, for a long time, as long as these buffers are 1068 01:09:09,259 --> 01:09:12,330 all full, this is just a merger. It's doing something 1069 01:09:12,330 --> 01:09:14,515 recursively, but we don't really care.
1070 01:09:14,515 --> 01:09:17,468 As soon as this whole thing swaps in, and actually, 1071 01:09:17,468 --> 01:09:20,243 I should be drawing this, as soon as the funnel, 1072 01:09:20,243 --> 01:09:23,019 the output buffer, and the input buffer swap in, 1073 01:09:23,019 --> 01:09:25,676 in other words, you bring all those blocks in, 1074 01:09:25,676 --> 01:09:28,452 you can just merge, and you can go on your merry 1075 01:09:28,452 --> 01:09:33,000 way merging until something empties or you fill the output. 1076 01:09:33,000 --> 01:09:36,323 So, let's analyze that. Suppose everything is in 1077 01:09:36,323 --> 01:09:40,707 memory, because we know it fits. OK, well I have to be a little 1078 01:09:40,707 --> 01:09:43,676 bit careful. The input buffers are actually 1079 01:09:43,676 --> 01:09:48,202 pretty big in total size because the total size is K to the three 1080 01:09:48,202 --> 01:09:50,747 halves here versus K to the one half. 1081 01:09:50,747 --> 01:09:54,848 Actually, this is of size K. Let me draw a general picture. 1082 01:09:54,848 --> 01:09:57,676 We have a J funnel, because otherwise the 1083 01:09:57,676 --> 01:10:01,000 arithmetic is going to get messy. 1084 01:10:01,000 --> 01:10:04,854 We have a J funnel. Its size is C times J^2, 1085 01:10:04,854 --> 01:10:08,619 we're supposing. The number of inputs is J, 1086 01:10:08,619 --> 01:10:11,666 and the size of them is pretty big. 1087 01:10:11,666 --> 01:10:15,610 Where did we define that? We have a K funnel. 1088 01:10:15,610 --> 01:10:20,719 The total input size is K^3. So, the total input size here 1089 01:10:20,719 --> 01:10:24,663 would be J^3. We can't afford to put all that 1090 01:10:24,663 --> 01:10:27,980 in cache. That's an extra factor of J. 1091 01:10:27,980 --> 01:10:33,000 But, we can afford one block per input. 1092 01:10:33,000 --> 01:10:35,035 And for merging, that's all we need.
1093 01:10:35,035 --> 01:10:38,176 I claim that I can fit the first block of each of these 1094 01:10:38,176 --> 01:10:41,724 input arrays in cache at the same time along with the J funnel. 1095 01:10:41,724 --> 01:10:44,864 And so, for that duration, as long as all of that is in 1096 01:10:44,864 --> 01:10:48,238 cache, this thing can merge at full speed just like we were 1097 01:10:48,238 --> 01:10:51,204 doing parallel scans. You use up all the blocks down 1098 01:10:51,204 --> 01:10:54,752 here, and one of them empties. You go to the next block in the 1099 01:10:54,752 --> 01:10:57,602 input buffer and so on, just like the normal merge 1100 01:10:57,602 --> 01:11:00,859 analysis of parallel arrays, at this point we assume that 1101 01:11:00,859 --> 01:11:04,000 everything here is fitting in cache. 1102 01:11:04,000 --> 01:11:08,485 So, it's just like before. Of course, in fact, 1103 01:11:08,485 --> 01:11:13,668 it's recursive but we are analyzing it at this level. 1104 01:11:13,668 --> 01:11:19,250 OK, I need to prove that you can fit one block per input. 1105 01:11:19,250 --> 01:11:22,839 It's not hard. It's just computation. 1106 01:11:22,839 --> 01:11:28,720 And, it's basically the way that these funnels were designed 1107 01:11:28,720 --> 01:11:35,000 was so that you could fit one block per input buffer. 1108 01:11:35,000 --> 01:11:41,607 And, here's the argument. So, the claim is you can also 1109 01:11:41,607 --> 01:11:47,725 fit one memory block in the cache per input buffer. 1110 01:11:47,725 --> 01:11:52,497 So, this is in addition to one J funnel. 1111 01:11:52,497 --> 01:11:59,594 You could also fit one block for each of its input buffers. 1112 01:11:59,594 --> 01:12:06,230 OK, this is of the J funnel. It's not any funnel because 1113 01:12:06,230 --> 01:12:10,938 bigger funnels are way too big. OK, so here's how we prove 1114 01:12:10,938 --> 01:12:13,581 that. J^2 is at most a quarter M.
1115 01:12:13,581 --> 01:12:16,967 That's what we assumed here, actually C times J^2. 1116 01:12:16,967 --> 01:12:21,675 I'm not going to bother with the C because that's going to 1117 01:12:21,675 --> 01:12:25,887 make my life even harder. OK, I think this is even a 1118 01:12:25,887 --> 01:12:29,522 weaker constraint. So, the size of our funnel 1119 01:12:29,522 --> 01:12:35,110 is about J^2. That's at most a quarter of the 1120 01:12:35,110 --> 01:12:37,719 cache. That implies that J, 1121 01:12:37,719 --> 01:12:43,941 if we take square roots of both sides, is at most a half square 1122 01:12:43,941 --> 01:12:47,955 root of M. OK, also, we know that B is at 1123 01:12:47,955 --> 01:12:53,273 most square root of M because M is at least B squared. 1124 01:12:53,273 --> 01:12:58,993 So, we put these together, and we get J times B is at most 1125 01:12:58,993 --> 01:13:02,611 a half M. OK, now I claim that what we 1126 01:13:02,611 --> 01:13:05,718 are asking for here is J times B because in a J funnel, 1127 01:13:05,718 --> 01:13:08,825 there are J input arrays. And so, if you want one block 1128 01:13:08,825 --> 01:13:10,781 each, that costs a space of B each. 1129 01:13:10,781 --> 01:13:13,831 So, for each input buffer we have one block of size B, 1130 01:13:13,831 --> 01:13:16,938 and the claim is that that whole thing fits in half the 1131 01:13:16,938 --> 01:13:19,009 cache. And, we've only used a quarter 1132 01:13:19,009 --> 01:13:20,448 of the cache. So in total, 1133 01:13:20,448 --> 01:13:23,843 we use three quarters of the cache and that's all we'll use. 1134 01:13:23,843 --> 01:13:26,950 OK, so that's good news. We can also fit one more block 1135 01:13:26,950 --> 01:13:30,000 to the output. Not too big a deal. 1136 01:13:30,000 --> 01:13:33,401 So now, as long as this J funnel is running, 1137 01:13:33,401 --> 01:13:36,012 if it's all in cache, all is well. 1138 01:13:36,012 --> 01:13:39,889 What does that mean?
Let me first analyze how long 1139 01:13:39,889 --> 01:13:42,895 it takes for us to swap in this funnel. 1140 01:13:42,895 --> 01:13:47,563 OK, so how long does it take for us to read all the stuff in 1141 01:13:47,563 --> 01:13:50,806 a J funnel and one block per input buffer? 1142 01:13:50,806 --> 01:13:55,000 That's what it would take to get started. 1143 01:13:55,000 --> 01:14:02,344 So, this is swapping in a J funnel, which means reading the 1144 01:14:02,344 --> 01:14:09,434 J funnel in its entirety, and reading one block per input 1145 01:14:09,434 --> 01:14:14,120 buffer. OK, the cost of the swap in is 1146 01:14:14,120 --> 01:14:19,818 pretty natural. The size of the funnel divided 1147 01:14:19,818 --> 01:14:27,542 by B, because that's just sort of a linear scan to read it in, 1148 01:14:27,542 --> 01:14:34,000 and we need to read one block per buffer. 1149 01:14:34,000 --> 01:14:38,463 These buffers could be all over the place because they're pretty 1150 01:14:38,463 --> 01:14:40,942 big. So, let's say we pay one memory 1151 01:14:40,942 --> 01:14:45,264 transfer for each input buffer just to get started to read the 1152 01:14:45,264 --> 01:14:47,318 first block. OK, the claim is, 1153 01:14:47,318 --> 01:14:50,365 and here we need to do some more arithmetic. 1154 01:14:50,365 --> 01:14:52,348 This is, at most, J^3 over B. 1155 01:14:52,348 --> 01:14:54,757 OK, why is it, at most, J^3 over B? 1156 01:14:54,757 --> 01:15:00,000 Well, this was the first level at which things fit in cache. 1157 01:15:00,000 --> 01:15:04,119 That means the next level bigger, which is J^2, 1158 01:15:04,119 --> 01:15:08,327 which has size J^4, should be bigger than cache. 1159 01:15:08,327 --> 01:15:11,552 Otherwise we would have stopped then. 1160 01:15:11,552 --> 01:15:14,686 OK, so this is just more arithmetic. 1161 01:15:14,686 --> 01:15:19,164 You can either believe me or follow the arithmetic. 1162 01:15:19,164 --> 01:15:23,731 We know that J^4 is at least M.
So, this means that, 1163 01:15:23,731 --> 01:15:26,776 and we know that M is at least B^2. 1164 01:15:26,776 --> 01:15:29,462 Therefore, J^2, instead of J^4, 1165 01:15:29,462 --> 01:15:36,000 we take the square root of both sides, J^2 is at least B. 1166 01:15:36,000 --> 01:15:39,379 OK, so certainly J^2 over B is at most J^3 over B. 1167 01:15:39,379 --> 01:15:43,379 But also J is at most J^3 over B because J^2 is at least B. 1168 01:15:43,379 --> 01:15:46,896 Hopefully that should be clear. That's just algebra. 1169 01:15:46,896 --> 01:15:50,965 OK, so we're not going to use this bound because that's kind 1170 01:15:50,965 --> 01:15:53,655 of complicated. We're just going to say, 1171 01:15:53,655 --> 01:15:56,689 well, it costs J^3 over B to get swapped in. 1172 01:15:56,689 --> 01:16:00,000 Now, why is J^3 over B a good thing? 1173 01:16:00,000 --> 01:16:03,972 Because we know the total size of inputs to the J funnel is 1174 01:16:03,972 --> 01:16:06,232 J^3. So, to read all of the inputs 1175 01:16:06,232 --> 01:16:08,424 to the J funnel takes J^3 over B. 1176 01:16:08,424 --> 01:16:12,054 So, this is really just a linear extra cost to get the 1177 01:16:12,054 --> 01:16:14,657 whole thing swapped in. It sounds good. 1178 01:16:14,657 --> 01:16:17,671 To do the merging would also cost J^3 over B. 1179 01:16:17,671 --> 01:16:21,438 So, the swap-in costs J^3 over B, and to merge all these J^3 1180 01:16:21,438 --> 01:16:24,041 elements, if they were all there in the 1181 01:16:24,041 --> 01:16:28,013 inputs, it would take J^3 over B because once everything is 1182 01:16:28,013 --> 01:16:31,780 there, you're merging at full speed, B items per 1183 01:16:31,780 --> 01:16:36,859 memory transfer on average. OK, the problem is you're going 1184 01:16:36,859 --> 01:16:39,260 to swap out, which you may have imagined.
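The swap-in arithmetic above, consolidated in the lecture's notation (a sketch of the chain of inequalities, not additional material):

```latex
\text{swap-in cost} \;=\; O\!\left(\frac{J^2}{B} + J\right)
\;\le\; O\!\left(\frac{J^3}{B}\right),
\qquad\text{since } J^4 \ge M \ge B^2
\;\Rightarrow\; J^2 \ge B
\;\Rightarrow\; J \le \frac{J^3}{B}.
```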
1185 01:16:39,260 --> 01:16:41,899 As soon as one of your input buffers empties, 1186 01:16:41,899 --> 01:16:45,199 let's say this one's almost gone, as soon as it empties, 1187 01:16:45,199 --> 01:16:48,439 you're going to totally obliterate that funnel and swap 1188 01:16:48,439 --> 01:16:51,380 in this one in order to merge all the stuff there, 1189 01:16:51,380 --> 01:16:54,920 and fill this buffer back up. This is where the amortization 1190 01:16:54,920 --> 01:16:56,960 comes in. And this is where the log 1191 01:16:56,960 --> 01:17:00,680 factor comes in because so far we've basically paid a linear 1192 01:17:00,680 --> 01:17:07,034 cost. We are almost done. 1193 01:17:07,034 --> 01:17:17,897 So, we charge, sorry, I'm jumping ahead of 1194 01:17:17,897 --> 01:17:26,111 myself. So, when an input buffer 1195 01:17:26,111 --> 01:17:35,169 empties, we swap out. And we recursively fill that 1196 01:17:35,169 --> 01:17:37,881 buffer. OK, I'm going to assume that 1197 01:17:37,881 --> 01:17:42,065 there is absolutely no reuse, that the recursive filling 1198 01:17:42,065 --> 01:17:46,481 completely swapped everything out and I have to start from 1199 01:17:46,481 --> 01:17:50,046 scratch for this funnel. So, when that happens, 1200 01:17:50,046 --> 01:17:53,920 I fill this buffer, and then I come back and I say, 1201 01:17:53,920 --> 01:17:58,026 well, I go swap it back in. So when the recursive call 1202 01:17:58,026 --> 01:18:01,978 finishes, I swap back in. OK, so I recursively fill, 1203 01:18:01,978 --> 01:18:08,031 and then I swap back in. And, the swapping back in 1204 01:18:08,031 --> 01:18:13,012 costs J^3 over B. I'm going to charge that cost 1205 01:18:13,012 --> 01:18:16,910 to the elements that just got filled. 1206 01:18:16,910 --> 01:18:22,000 So this is an amortized charging argument. 1207 01:18:48,000 --> 01:18:51,322 How many are there? It's the only question.
1208 01:18:51,322 --> 01:18:54,169 It turns out, things are really good, 1209 01:18:54,169 --> 01:18:59,073 like here, for the square root of K funnel, each buffer 1210 01:18:59,073 --> 01:19:04,063 has size K to the three halves. OK, so this is a bit 1211 01:19:04,063 --> 01:19:08,395 complicated. But I claim that the number of 1212 01:19:08,395 --> 01:19:12,624 elements here that fill the buffer is J^3. 1213 01:19:12,624 --> 01:19:18,401 So, if you have a J funnel, the buffer that it fills has 1214 01:19:18,401 --> 01:19:22,114 size J^3. It should be correct if you 1215 01:19:22,114 --> 01:19:26,137 work it out. So, we're charging this J^3 1216 01:19:26,137 --> 01:19:31,501 over B cost to J^3 elements, which sounds like you're 1217 01:19:31,501 --> 01:19:38,000 charging, essentially, one over B to each element. 1218 01:19:38,000 --> 01:19:39,951 Sounds great. That means that, 1219 01:19:39,951 --> 01:19:43,718 so you're thinking overall, I mean, there are N elements, 1220 01:19:43,718 --> 01:19:46,678 and to each one you charge a one over B cost. 1221 01:19:46,678 --> 01:19:50,110 That sounds like the total running time is N over B. 1222 01:19:50,110 --> 01:19:52,195 That's a bit too fast for sorting. 1223 01:19:52,195 --> 01:19:55,559 We lost the log factor. So, what's going on is that 1224 01:19:55,559 --> 01:20:00,000 we're actually charging one element more than once. 1225 01:20:00,000 --> 01:20:02,729 And, this is something that we don't normally do, 1226 01:20:02,729 --> 01:20:05,913 never done it in this class, but you can do it as long as 1227 01:20:05,913 --> 01:20:08,471 you bound the number of times you charge. 1228 01:20:08,471 --> 01:20:10,916 OK, and whenever you do a charging argument, 1229 01:20:10,916 --> 01:20:13,304 you say, well, this doesn't happen too many 1230 01:20:13,304 --> 01:20:16,090 times because whenever this happens, that happens. 
1231 01:20:16,090 --> 01:20:18,705 You should say, you should prove that the thing 1232 01:20:18,705 --> 01:20:21,775 that you're charging to isn't charged to too 1233 01:20:21,775 --> 01:20:24,107 many times. So here, I have a quantifiable 1234 01:20:24,107 --> 01:20:26,153 thing that I'm charging to: elements. 1235 01:20:26,153 --> 01:20:29,394 So, I'm saying that for each element that happened to come 1236 01:20:29,394 --> 01:20:31,952 into this buffer, I'm going to charge it a one 1237 01:20:31,952 --> 01:20:35,992 over B cost. How many times does one element 1238 01:20:35,992 --> 01:20:38,755 get charged? Well, each time it gets charged, 1239 01:20:38,755 --> 01:20:40,812 it's moved into a new buffer. 1240 01:20:40,812 --> 01:20:43,254 How many buffers could it move through? 1241 01:20:43,254 --> 01:20:45,632 Well, it's just going up all the time. 1242 01:20:45,632 --> 01:20:49,102 Merging always goes up. So, we start here and you go to 1243 01:20:49,102 --> 01:20:52,059 the next buffer, and you go to the next buffer. 1244 01:20:52,059 --> 01:20:55,143 The number of buffers you visit is the right log, 1245 01:20:55,143 --> 01:20:59,000 it turns out. I don't know which log that is. 1246 01:20:59,000 --> 01:21:05,199 So, the number of charges of a one over B cost to each element 1247 01:21:05,199 --> 01:21:11,196 is the number of buffers it visits, and that's a log factor. 1248 01:21:11,196 --> 01:21:17,193 That's where we get an extra log factor on the running time. 1249 01:21:17,193 --> 01:21:23,291 This is the number of levels of J funnels that you can 1250 01:21:23,291 --> 01:21:26,849 visit. So, it's log K divided by log 1251 01:21:26,849 --> 01:21:33,228 J, if I got it right. OK, and we're almost done. 1252 01:21:33,228 --> 01:21:38,442 Let's wrap up a bit. Just a little bit more 1253 01:21:38,442 --> 01:21:44,278 arithmetic, unfortunately. So, log K over log J. 1254 01:21:44,278 --> 01:21:47,630 Now, J^2 is like M, roughly. 
1255 01:21:47,630 --> 01:21:54,956 It might be square root of M. But, log J is basically log M. 1256 01:21:54,956 --> 01:22:02,281 There's some constants there. So, the number of charges here 1257 01:22:02,281 --> 01:22:08,299 is theta of log K over log M. So, now this is a bit, 1258 01:22:08,299 --> 01:22:11,135 we haven't seen this in amortization necessarily, 1259 01:22:11,135 --> 01:22:14,265 but we just need to count up the total amount of charging. 1260 01:22:14,265 --> 01:22:17,219 All work gets charged to somebody, except we didn't 1261 01:22:17,219 --> 01:22:20,054 charge the very initial swapping in to anybody. 1262 01:22:20,054 --> 01:22:23,244 But, every time we do some swapping in, we charge it to 1263 01:22:23,244 --> 01:22:25,075 someone. So, how much does 1264 01:22:25,075 --> 01:22:27,970 everything get charged? Well, there are N elements. 1265 01:22:27,970 --> 01:22:31,632 Each gets charged a one over B cost, and the number of times 1266 01:22:31,632 --> 01:22:35,000 it gets charged is log K over log M. 1267 01:22:35,000 --> 01:22:39,246 So therefore, the total cost is the number of 1268 01:22:39,246 --> 01:22:44,342 elements times a one over B cost times this log thing. 1269 01:22:44,342 --> 01:22:49,650 OK, it's actually plus K. We forgot about a plus K, 1270 01:22:49,650 --> 01:22:55,171 but that's just to get started in the very beginning, 1271 01:22:55,171 --> 01:22:58,886 and start on all of the input lists. 1272 01:22:58,886 --> 01:23:06,000 OK, this is an amortization analysis to prove this bound. 1273 01:23:06,000 --> 01:23:10,914 Sorry, what was N here? I assumed that I started out 1274 01:23:10,914 --> 01:23:14,286 with K cubed elements at the bottom. 1275 01:23:14,286 --> 01:23:19,682 The total number of elements in the bottom was theta of K^3. 1276 01:23:19,682 --> 01:23:23,343 OK, so I should have written K^3, not N. 1277 01:23:23,343 --> 01:23:28,835 This should be almost the same as this, OK, but not quite. 
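Putting those pieces together, the total charge can be written out numerically. In this sketch the hidden theta constant is taken as 1, and the sample values of K, M, and B in the usage below are made up for illustration:

```python
import math

def funnel_merge_cost(K, M, B):
    """Amortized transfer bound sketched above: N = K^3 elements,
    each charged 1/B per buffer it moves through, with about
    log K / log M buffer levels, plus K to start the input lists.
    The hidden Theta constant is taken as 1 for illustration."""
    N = K ** 3
    charges_per_element = math.log(K) / math.log(M)
    return N * (1.0 / B) * charges_per_element + K
```

As expected, halving the block size B roughly doubles the dominant term, while the trailing +K start-up cost is unaffected.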
1278 01:23:28,835 --> 01:23:34,039 This is log base M of K, and if you do a little bit of 1279 01:23:34,039 --> 01:23:39,820 arithmetic, this should be K^3 over B times log base M over B 1280 01:23:39,820 --> 01:23:45,747 of K over B plus K. That's what I want to prove. 1281 01:23:45,747 --> 01:23:49,867 Actually there's a K^3 here instead of a K, 1282 01:23:49,867 --> 01:23:53,105 but that's just a factor of three. 1283 01:23:53,105 --> 01:23:58,600 And this follows because we assume we are not in the base 1284 01:23:58,600 --> 01:24:01,052 case. So, K is at least M, 1285 01:24:01,052 --> 01:24:06,252 which is at least B^2, and therefore K over B is omega of 1286 01:24:06,252 --> 01:24:10,716 the square root of K. OK, so K over B is basically 1287 01:24:10,716 --> 01:24:13,045 the same as K when you put it in a log. 1288 01:24:13,045 --> 01:24:16,354 So here we have log base M. I turned it into log base M 1289 01:24:16,354 --> 01:24:17,887 over B. That's even worse. 1290 01:24:17,887 --> 01:24:20,277 It doesn't matter. And, I have log of K. 1291 01:24:20,277 --> 01:24:23,525 I replaced it with K over B, but K over B is basically 1292 01:24:23,525 --> 01:24:25,303 the square root of K. So in a log, 1293 01:24:25,303 --> 01:24:30,261 that's just a factor of a half. So that concludes the analysis 1294 01:24:30,261 --> 01:24:33,654 of the funnel. We get this crazy running time, 1295 01:24:33,654 --> 01:24:37,424 which is basically the sorting bound plus a little bit. 1296 01:24:37,424 --> 01:24:40,817 We plug that into our funnel sort, and we get, 1297 01:24:40,817 --> 01:24:44,964 magically, optimal cache oblivious sorting just in time. 1298 01:24:44,964 --> 01:24:48,809 Tuesday is the final. The final is more in the style 1299 01:24:48,809 --> 01:24:53,107 of quiz one, so not too much creativity, mostly mastery of 1300 01:24:53,107 --> 01:24:55,369 material. It covers everything. 
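The base change in that last step, that log base M of K and log base M over B of K over B agree up to a constant factor when M and K are both at least B^2, can likewise be checked numerically; the sample values in the usage below are illustrative assumptions:

```python
import math

def log_base_ratio(K, M, B):
    """Ratio of log_M(K) to log_{M/B}(K/B). With M >= B^2 and
    K >= B^2, the two differ by only a constant factor, which is
    the 'factor of a half' argument above. Sample values used to
    exercise this are illustrative assumptions."""
    assert M >= B * B and K >= B * B
    return (math.log(K) / math.log(M)) / (math.log(K / B) / math.log(M / B))
```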
1301 01:24:55,369 --> 01:24:59,591 You don't have to worry about the details of funnel sort, 1302 01:24:59,591 --> 01:25:03,285 but everything else. So it's like quiz one but for 1303 01:25:03,285 --> 01:25:07,664 the entire class. It's three hours long, 1304 01:25:07,664 --> 01:25:10,766 and good luck. It's been a pleasure having 1305 01:25:10,766 --> 01:25:14,247 you, all the students. I'm sure Charles agrees, 1306 01:25:14,247 --> 01:25:17,000 so thanks everyone. It was a lot of fun.