NARRATOR: The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIK DEMAINE: Yeah. I'm going to talk about I/O models. Just to get a sense, how many people know a model called the I/O model? And how many people don't? It doesn't matter, I'm just curious.

As some of you may know, I/O models have a really rich history, and they're pretty fascinating. They all center on this problem of modeling the memory hierarchy in a computer. We have things like the RAM model of computation, where you can access anything in your memory at the same price. But the reality of computers is that you have things very close to you that are very cheap to access, and you have things very far from you that are big (you can get 3 terabyte disks these days) but very slow to access. And one of the big costs there is latency: the head has to move to the right position, and then you can read lots of data really fast. The disk can actually give you data very fast; the hard part is getting started reading stuff.

And so this is the sort of thing we want to model. These kinds of computers have been around for decades, as we'll see, and people have been trying to model them in as clean a way as possible, in a way that works well theoretically and matches practice in some ways.

I have just some fun additions to this slide. You can keep getting bigger, go to the internet, get to an exabyte or a zettabyte. You have to look up all the words for these. In the universe, you've got about 10 to the 83 atoms, so maybe roughly that many bits, but I don't know if there's a letter for that.

So how do we model this? Well, there are a lot of models. This is a partial list.
These are sort of the core models that were around, let's say, before this millennium. So we start in 1972 and work our way forward, and I'm going to go through all of these in different levels of detail. There are a couple of key features in a cache that we want to model, or maybe a few key features, and then there's some measure of simplicity, which is a little hard to define. The goal is to get all four of these things at once, and we get that, more or less, by the end.

So the first section is on this idealized two-level storage, which was introduced by Bob Floyd in 1972. This is what the first page of the paper looks like. It looks like it was typeset on a typewriter, with underlining: the good old days of computer science, the very early days of computer science. And this was published in a conference called The Complexity of Computer Computations. How many people have heard of that conference? No one. Wow. There it is. It's kind of a classic, because it had Karp's original paper on NP-completeness, so you've definitely read that paper. But there are a lot of neat papers in there, and a panel discussion, including what we should call algorithms, which is kind of a fun read.

So this is in the day when one of the state-of-the-art computers was the PDP-11. This is what a PDP-11, or one of them, looks like, probably one owned by Bell Labs, with Dennis Ritchie and Ken Thompson, the inventors of C and Unix, working away there. It has disks, each of which is about 2 megabytes in capacity. And it has internal memory, which was core memory at the time: each of these is a little circular magnetic core, and it stores 1 bit. In total, there are 8 kilobytes. So you get a sense of this already being an issue, and this is why the paper was written.

So here's the model he introduced, a very simple model, maybe the simplest we'll see. You have your CPU, which can do local computation. And then you have your memory, which is very big.
But in particular, it's divided into these blocks of size B, so each block can have up to B items. And what you're allowed to do in one block operation is read two of the blocks. You can read all the items in those blocks. So let's say you read these two blocks, and you pick some subset of those items to pick up. And then what you're allowed to do is store them somewhere else: you can pick some other target block, like this one, and copy those elements to overwrite that block. There's no computation in this model, because he was just interested in how you can permute items in that world. So it's a simple model, but you get the idea: you can read two blocks, take up to B items out of them, and stick them in here.

Here, we just ignore what the order is within a block, because we're assuming you can rearrange items once you read them in and spit them out. So don't worry about the order within the block; the question is, for every item, which block is it in? And we're assuming here that items are indivisible.

So here's the main theorem of that paper. If you're given N items and you want to permute them into N over B blocks, which means each of those blocks is going to be full (let's say that's the most interesting case), then you need to use order N over B log B block operations, even for a random permutation, on average and with high probability. So this is kind of nice, or kind of interesting, because just to touch those blocks requires N over B block operations. But there's an extra log factor that starts to creep up, which is maybe a little bit surprising; less surprising to people who are familiar with I/O models, but at the time, very new.

And I'm making a particular assumption here, but it's just a small thing. I thought I'd go through the proof of this theorem, because it's fairly simple. It's going to use a slightly simplified model where, instead of copying items, you actually move items. So these guys would disappear after you put them in this new block.
Because we're thinking about permutation problems, again, that doesn't really change anything. You can, for every item, see what path it follows to ultimately get to its target location, throw away all the extra copies, and just keep that one set of moves. And that will still be a valid solution in this model.

So how does the lower bound go? It's a simple potential argument. You look at, for every pair of blocks, how many items there are in block i that are destined for block j, that want to move from block i to block j. This count changes over time; block i is where the items currently are. So that's n_ij. You take n_ij log n_ij, and sum that up over all i's and j's. That's the potential function.

And our goal is to maximize that potential, because it's going to be (for those familiar with entropy) negative entropy. So it's going to be maximized when all the items are where they need to be. This is when everything is as clustered as possible; you can only have a cluster of size B, because only up to B items can be in the same place. One way to see this: in the target configuration, n_ii is B for all i. Everyone's where they're supposed to be. And so the potential is the number of items times log B. And each log n_ij is always at most log B, so that's the biggest the potential could ever hope to get.

So our goal is to increase the potential, that is, to decrease entropy, as much as possible. And we're starting with low potential. If you take a random permutation and compute the expected number of guys that are already in the block where they're supposed to be, it's very small, because most of them are going to be destined for some other block. So we're starting with a potential that's linear, and we need to get to N log B. And then the claim is that each block operation we do can only increase the potential by at most B.
And so that gives us this bound: the potential we need to get to, minus the potential we started with, divided by how much we can increase the potential in each step. That is basically N over B log B, minus a little-o term.

Why is this claim true? I'll just sketch it. The idea is this fun fact: (x + y) log (x + y) is at most x log x + y log y + (x + y). What this means is that if you have two clusters (our goal is to cluster things together and make bigger groups that are in the same place, or in the correct place) contributing x log x and y log y to this sum, and you merge them, then you now have this (x + y) log (x + y) potential. And the claim is that it could have only gone up by x plus y. And when you're moving B items, the total number of things you're moving is B, so you can only increase the potential by B.

So that was a quick sketch of this old paper. It's a fun read, quite clear, an easy argument.
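To make the potential function concrete, here's a minimal Python sketch, my own illustration with a made-up permutation of N = 8 items and B = 2, not anything from Floyd's paper. The scattered layout below has every n_ij equal to 1, so the potential is 0; the finished layout has potential N log B = 8. Since one block operation moves at most B = 2 items and, by the fun fact above, raises the potential by at most 2, you need at least (8 - 0)/2 = 4 operations, which is N over B times log B.

```python
import math
from collections import Counter

def potential(blocks, dest):
    # Phi = sum over pairs (i, j) of n_ij * log2(n_ij), where n_ij is the
    # number of items currently in block i that are destined for block j.
    counts = Counter((i, dest[item])
                     for i, blk in enumerate(blocks) for item in blk)
    return sum(n * math.log2(n) for n in counts.values())

B = 2
dest = {0: 3, 1: 1, 2: 0, 3: 2, 4: 1, 5: 3, 6: 2, 7: 0}  # made-up permutation

scattered = [[0, 1], [2, 3], [4, 5], [6, 7]]  # no two items share a destination
finished = [[2, 7], [1, 4], [3, 6], [0, 5]]   # every item in its target block

print(potential(scattered, dest))  # 0.0
print(potential(finished, dest))   # 8.0 = N * log2(B)
```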
So we proved this theorem, that you need at least N over B log B. But what is the right answer? There's actually not a matching upper bound. Of course, for B a constant, this is the right answer; it's N, but that's not so exciting. On the upper bound side, this paper has an almost matching upper bound. It's another log, but not quite the same log: N over B log N over B, instead of log B. And the rough idea of how to do that--

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: Yeah, question.

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: I said a tall disk assumption. I'm assuming N over B is greater than B: the number of blocks in your disk is at least the size of a block.

AUDIENCE: You needed that in the proof?

ERIK DEMAINE: I needed that in the proof, I think. Good question. Where, in the N over B log B bound?

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: Yeah. Exactly. Yeah, that's where I'm using it. Thanks. Otherwise this expectation doesn't work out. I mean, if you have one block, for example, this will fail, because you need zero operations. So there has to be some trade-off in the very small regime. OK.

So the way to get N over B log N over B is basically a radix sort. In one pass through the data, you can rewrite everything so the items whose low-order bit is 0 come before all the items whose low-order bit is 1. So in N over B transfers, you can sort by one bit of the target block ID of every item. And you do this log of N over B times, because that's how many blocks there are, so that's how many bits a block ID has. That's how many passes you need in a binary radix sort, and so you can achieve that bound.
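Here's a toy, in-memory sketch of that pass structure (my own illustration, not Floyd's code). Each iteration of the outer loop is one stable partition pass on one bit of the destination block ID; in the model, such a pass is just scans costing N over B transfers, and there are log2 of N over B passes.

```python
def permute_by_radix(items, dest, num_blocks):
    # LSD binary radix sort on the destination block ID: one stable
    # partition pass per bit. In the model, each pass costs N/B transfers.
    bits = max(1, (num_blocks - 1).bit_length())
    for bit in range(bits):
        zeros = [x for x in items if not (dest[x] >> bit) & 1]
        ones = [x for x in items if (dest[x] >> bit) & 1]
        items = zeros + ones                  # stable: preserves order
    return items

dest = {'a': 2, 'b': 0, 'c': 3, 'd': 1, 'e': 0, 'f': 2, 'g': 1, 'h': 3}
print(permute_by_radix(list('abcdefgh'), dest, num_blocks=4))
# ['b', 'e', 'd', 'g', 'a', 'f', 'c', 'h'], grouped by destination block
```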
And the paper actually claims that there's a matching lower bound. It's a little strange, because there's a careful proof given for the log B lower bound, and then this claim just says, "by information theoretic considerations," this is also a lower bound. This is in the days when we didn't distinguish between big O and big Omega, before [INAUDIBLE] paper. But this is not true, and we'll see that it's not true. It was settled about 14 years later, so we'll see the right answer. This is almost the right answer, but it doesn't quite work when B is very small. And one way to see that is when B is 1: when B is 1, the right answer is N, not N log N. So when B is less than log N over B, there's a slightly different answer, which we'll get to later. But that was the early days.

There are some other fun quotes from this paper, foreshadowing different things. One is the word RAM model, which is very common today, but not at the time. It says, obviously, these results apply for disks and drums, which was probably what he was thinking about originally, but also when the pages, the blocks, are words of internal memory and the records are the bits in those words. So this is the word RAM model. Here, I said just ignore the permutation within each block, but you can actually do all the things you need to do for these algorithms using shifts and logical OR, XOR, and AND operations. So all these algorithms work in the word RAM model too, which is kind of nifty.

Another thing foreshadows what we call the I/O model, which we'll get to in a little bit. It says, "work is in progress" (he got scooped, unfortunately, unless he meant by someone else) "attempting to study the case" where you can store more than two pages. Basically, this CPU can hold two of these blocks and then write one back out, but it has no bigger memory, or cache, than that. So that's where we were at the time.

The next chapter in this story is 1981. It's a good year; it's when I was born. And this is Hong and Kung's paper. You've probably heard about the red-blue pebble game. It's also a two-level model, but now there's a cache in the middle, and you can remember stuff for a while. I mean, you can remember up to M things before you have to kick them out. The difference here is that there are no blocks anymore; it's just items.

So let me tell you a little bit about the paper. This was the state of the art in computing at the time. The personal computer revolution was happening. They had the Apple II, TRS-80, VIC-20. All of these originally had about 4 kilobytes of RAM, and the disks could store maybe, I don't know, 360 kilobytes or so. But you could also connect a tape and other crazy things. So, again, this was relevant, and that's the setting in which they were writing.

They have this fun quote: "When a large computation is performed on a small device" (at that point, small devices were becoming common) you must decompose the computation into subcomputations. This is going to require a lot of I/O, and it's going to be slow. So how do we minimize I/O?
So their model: before I get to the red-blue pebble game model, it's based on a vanilla, single-color pebble game model by Hopcroft, Paul, and Valiant. This is the famous paper relating the time hierarchy and the space hierarchy. And what they said is, OK, let's think of the algorithm we're executing as a DAG. We start with some things that are inputs, and we want to compute stuff, where each computation depends on having these two values, and so on. In the end, we want to compute some outputs. So you can rewrite a computation in this kind of DAG form, and we're going to model the execution of that by playing a pebble game.

A node can have a pebble on it; for example, we could put a pebble on this node. In general, we are allowed to put a pebble on a node if all of its predecessors have a pebble. A pebble is going to correspond to being in memory. And we can also throw away a pebble, because we can just forget stuff. Unlike real life, you can just forget whatever you don't want to know anymore.

So you add a pebble. Let's say now we can add this pebble, because its predecessor has a pebble on it. We can add this pebble over here, add this pebble here. Now, we don't need this information anymore, because we've computed all the things out of it, so we can choose to remove that pebble. And now we can add this one, remove that one, add this one (you can check that I got all these right), add this one, remove that one, remove, add, remove, remove.

In the end, we want pebbles on the outputs, and we start with pebbles on the inputs. And in this case, their goal was to minimize the maximum number of pebbles over time. Here, there are up to four pebbles at any one moment; that means you need memory of size four. And they ended up proving that any DAG can be executed using N over log N maximum pebbles, which gave this theorem: if you use t units of time, you can fit in t over log t units of space. That was a neat advance, but it's beside the point.
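Here's a minimal checker for that single-color pebble game, a Python sketch of my own with a made-up four-node DAG. It enforces the placement rule (all predecessors pebbled) and reports the maximum number of pebbles ever in play, which is the memory needed.

```python
def max_pebbles(preds, moves, inputs, outputs):
    # preds[v] = predecessors of node v in the DAG.
    # moves = sequence of ('place', v) or ('remove', v) pebble moves.
    # Inputs start pebbled; outputs must end pebbled.
    pebbled = set(inputs)
    peak = len(pebbled)
    for op, v in moves:
        if op == 'place':
            assert all(u in pebbled for u in preds[v]), "preds must be pebbled"
            pebbled.add(v)
        else:
            pebbled.remove(v)            # forgetting is always allowed
        peak = max(peak, len(pebbled))
    assert outputs <= pebbled, "outputs must end up pebbled"
    return peak

preds = {'a': [], 'b': [], 'c': ['a', 'b'], 'd': ['c']}
moves = [('place', 'c'), ('remove', 'a'), ('remove', 'b'), ('place', 'd')]
print(max_pebbles(preds, moves, inputs={'a', 'b'}, outputs={'d'}))  # 3
```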
This is where Hong and Kung were coming from. They had this pebble model, and they wanted to use two colors of pebbles: one to represent the shallower level of the memory hierarchy, the cache, and the other to say that you're on disk somewhere. So a red pebble is going to be in cache; that's the hot stuff. And the blue pebbles are on disk; that's the cold stuff.

And it's basically the same rules. When you're initially placing a pebble, everything here has to be red: you can place a red pebble if your predecessors have red pebbles. We start out with the inputs being blue, so there are no red pebbles. But for free, or not for free, for unit cost, we can convert any red pebble to a blue pebble, or any blue pebble to a red pebble.

So let's go through this. I can make that one red. And now I can make this one red. Great. Now, I don't need this one right now, so I'm going to make it blue, meaning write it out to disk. I make this one red, make this one red. Now I can throw that one away; I don't need it in cache or on disk. I can put that one on disk, because I don't need it right now. I can bring that one back in from disk, write this one out, put that one onto disk, put that one onto disk. Now we'll go over here, read this back in from disk, finish off this section over here. And now I can throw that away, add this guy, throw that away. What do I need? Now I can write this out to disk; I'm done with that output. Now I've got to read all these guys in, and then I can do this one.

And so I needed a cache size here of four: the maximum number of red things at any moment was four. And I can get rid of those guys and write that one to disk. My goal is to get the outputs all blue.

But the objective here is different. Before, we were essentially minimizing cache size. Cache size now is given to us: we say we have a cache of size M.
But now, what we count are the number of reads and writes, the number of recolorings of pebbles. That is the number of I/Os.

And so you can think of this model as this picture I drew before. You have a cache; you can store up to M items. You can take any blue item and, for example, throw it away. I could move a red item over here and turn it blue; that corresponds to writing out to disk. I can bring a blue item back in to fill that spot; that corresponds to reading from disk. All of this as long as, at all times, I have at most M red items. And these are the same model.

So what Hong and Kung did is look at a bunch of different algorithms, not problems, but specific algorithms, things that you could compute in this DAG form. The DAG form is, I guess you could say, a class of algorithms: there are many ways to execute a DAG. You could follow any topological sort of the DAG; that's an algorithm in some sense. And so what they're finding is the best execution of these meta-algorithms, if you will. So that doesn't mean it's the best way to do matrix-vector multiplication. But it says that if you're following the standard algorithm, the standard DAG that you get from it, or the standard FFT DAG (I guess FFT actually is an algorithm), then the minimum number of memory transfers is this number of red-blue recolorings.

And so you get a variety. Of course, the speedups, relative to the regular RAM analysis, are going to be somewhere between 1 and M, I guess, for most problems at least. And for some problems, like matrix-vector multiplication and odd-even transposition sort [INAUDIBLE], you get very good speedups: M. Matrix multiplication, not quite as good: root M. And FFT: log M. Sorting was not analyzed here, because sorting is many different algorithms; just one specific algorithm is analyzed here, which also gives only log M.
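Before moving on, here's what that I/O accounting looks like as a sketch, my own Python illustration on the same made-up DAG as before. The move names (compute, read, write, evict) are my labels, not Hong and Kung's; reads and writes are the recolorings that cost one I/O each, and red pebbles are capped at M.

```python
def count_ios(preds, moves, inputs, outputs, M):
    # Red pebbles = items in cache (at most M at once); blue = items on disk.
    # Each read (blue item gains a red copy) or write (red item gains a
    # blue copy) is a recoloring and costs one I/O.
    red, blue, ios = set(), set(inputs), 0
    for op, v in moves:
        if op == 'compute':                      # free local computation
            assert all(u in red for u in preds[v])
            red.add(v)
        elif op == 'read':                       # bring a disk item into cache
            assert v in blue; red.add(v); ios += 1
        elif op == 'write':                      # write a cache item to disk
            assert v in red; blue.add(v); ios += 1
        elif op == 'evict':                      # forget it from cache, free
            red.discard(v)
        assert len(red) <= M, "cache overflow"
    assert outputs <= blue, "outputs must end on disk"
    return ios

preds = {'a': [], 'b': [], 'c': ['a', 'b'], 'd': ['c']}
moves = [('read', 'a'), ('read', 'b'), ('compute', 'c'), ('evict', 'a'),
         ('evict', 'b'), ('compute', 'd'), ('write', 'd')]
print(count_ios(preds, moves, inputs={'a', 'b'}, outputs={'d'}, M=3))  # 3
```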
So I don't want to go through these analyses, because a lot of them will follow from other results that we'll get to.

So at this point, we have two models. We have the idealized two-level storage of Floyd, and we have the red-blue pebble game of Hong and Kung. This one models caching, that you can store a bunch of things, but it does not have blocks. This one models blocking, but it does not have a cache, or rather it has a cache of constant size. So the idea is to merge these two models. And this is the Aggarwal and Vitter paper many of you have heard of, I'm sure. It was 1987, so six years after Hong and Kung. It has many names. I/O model is the original, I guess. External Memory Model is what I usually use, and a bunch of people here use. Disk Access Model has the nice advantage that you can call it the DAM model. And, again, our goal is to minimize the number of I/Os.

It's just a fusion of the two models. Now our cache has blocks of size B, and you have M over B of those blocks. And your disk is also divided into blocks of size B; we imagine it being as large as you need it to be, probably about order N. And what can you do? Well, you can pick up one of these blocks and read it in from disk to cache, kicking out whatever used to be there. You can do computation internally and change whatever these items are for free, let's say. You could measure time, but usually you just measure the number of memory transfers. And then you can take one of these blocks and write it back out to disk, kicking out whatever used to be there.

So it's the obvious hybrid of these models. But this turns out to be a really good model. Those other two models were interesting; they were toys; they were simple. This is basically as simple, but it spawned this whole field, and it's why we're here today. So this is a really cool model, with tons of results in it.
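As a concrete picture of the model, here's a toy disk access machine in Python, my own sketch, not Aggarwal and Vitter's. The disk is a list of blocks of B items, the cache holds M over B blocks, and every block read or write counts as one memory transfer. For simplicity it assumes evicted blocks are clean (already written back if needed).

```python
from collections import OrderedDict

class DAM:
    # Toy disk access machine: count block transfers between disk and a
    # cache of M // B block slots. Computation inside the cache is free.
    def __init__(self, disk, B, M):
        self.disk, self.slots = disk, M // B
        self.cache = OrderedDict()               # block index -> contents
        self.transfers = 0

    def read(self, i):
        if i not in self.cache:
            if len(self.cache) == self.slots:
                self.cache.popitem(last=False)   # evict oldest (assumed clean)
            self.cache[i] = list(self.disk[i])
            self.transfers += 1                  # one block read
        return self.cache[i]

    def write(self, i):
        self.disk[i] = list(self.cache[i])
        self.transfers += 1                      # one block write

# Touching all N = 32 items in order, with B = 4, costs N / B = 8 transfers.
B, M, data = 4, 16, list(range(32))
dam = DAM([data[k:k + B] for k in range(0, len(data), B)], B, M)
total = sum(sum(dam.read(i)) for i in range(len(dam.disk)))
print(total, dam.transfers)                      # 496 8
```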
It's interesting to see: I'm going to talk about a lot of models today, and we're sort of in the middle of them at the moment, but only two have really caught on in a big way and led to lots and lots of papers. This is one of them.

So let me tell you some basic results and how to get them. A simple algorithmic technique in external memory is to scan. So here's my data. If I just want to read items in order and stop at some point, after N items, then that costs me order N over B memory transfers. That's optimal; I've got to read the data in. I can accumulate, add them up, multiply them together, whatever. One thing to be careful about is the plus 1, or you could put a ceiling on that. If N is a lot less than B, then this is not a good strategy. But as long as N is at least order B, that's really efficient.

More generally, instead of just one scan, you can run up to M over B parallel scans. Because for a scan, you really just need to know: what is my current block? And we can fit M over B blocks in our cache. And so we can advance this scan a little bit, advance this scan a little bit, advance this one, and go back and forth, in any kind of interleaving we want of those M over B scans. Some of them could be read scans, and some could be write scans; some of them can go backwards, and some can go forwards. There are a lot of options here. And in particular, you can do something like this: given a little bit fewer than M over B lists of total size N, you can merge them all together. If they're sorted lists, you can merge them into one sorted list in optimal N over B time. So that's good. We'll use that in a moment.
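Here's what that multi-way merge looks like as a minimal in-memory Python sketch, my own illustration. In the model, each input list needs only its current block in cache, so with k at most M over B lists these are k parallel read scans plus one write scan, merging N total items in order N over B transfers; the heap bookkeeping happens in cache for free.

```python
import heapq

def kway_merge(sorted_lists):
    # One cursor per list = one parallel scan per list in the model.
    heap = [(lst[0], i, 0) for i, lst in enumerate(sorted_lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)      # smallest current head
        out.append(val)                      # the write scan
        if j + 1 < len(sorted_lists[i]):
            heapq.heappush(heap, (sorted_lists[i][j + 1], i, j + 1))
    return out

print(kway_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))  # [1, 2, ..., 9]
```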
Here I have a little bit of a thought experiment, originally by Lars Arge, who will be speaking later. You know, is this really a big deal? A factor of B doesn't sound so big. Do I care? For example, suppose I'm going to traverse a linked list in memory, but it's actually stored on disk. Is it really important that I sort that list and do a scan, versus jumping around with random access?

And this is back of the envelope, just computing what things ought to be. If you have about a gigabyte of data, a block size of 32 kilobytes, which is probably on the small side, and a 1 millisecond disk access time, which is really fast (usually it's at least 2 milliseconds), then if you do things in random order, on average every access is going to require a memory transfer. That'll take about 70 hours, three days. But if you do a scan, if you presorted everything and you do a scan, then it will only take you 32 seconds. So it's a factor of about 8,000 in time. Space is a lot bigger than we conceptualize, and it makes things that were impractical to do, say, daily, very practical. So that's why we're here.
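Redoing that back-of-the-envelope in Python, assuming 4-byte items (the item size isn't stated in the talk, so that's my assumption), the numbers land close to the ones quoted:

```python
# Back-of-the-envelope: 1 GB of data, 32 KB blocks, 1 ms per disk access.
items = 2**30 // 4             # ~a gigabyte of assumed 4-byte items
per_block = 32 * 2**10 // 4    # items per 32-kilobyte block
access_ms = 1                  # 1 ms per disk access

random_hours = items * access_ms / 1000 / 3600       # one transfer per item
scan_seconds = (items / per_block) * access_ms / 1000  # one per block
print(round(random_hours), round(scan_seconds))      # ~75 hours vs ~33 seconds
print(round(random_hours * 3600 / scan_seconds))     # a factor of ~8192
```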
Let's do another problem. How about search? Suppose I have the items in sorted order, and I want to do binary search. Well, the right thing is not binary search, but B-way search, so log base B of N. The plus 1 is to handle the case when B equals 1; then you want log base 2.

So we have our items, and we want to search. First, why is this the right bound? Why is this optimal? You can do an information theoretic argument in the comparison model, assuming you're just comparing items. Then, whenever you read in a block (say the blocks have already been sorted, and you read in some block), what you learn from looking at those B items is where your query element, x, fits among those B items. You already know everything about the B items and how they relate to each other, but you learn where x is. So that gives you log of B plus 1 bits of information, because there are B plus 1 places where x could be. And you need to learn log of N plus 1 bits in total: you want to know where x fits among all the items. And so you divide log of N plus 1 by log of B plus 1, and that's log base B plus 1 of N plus 1. So that's the lower bound.

And the upper bound, as you've probably guessed by now, is a B-tree. You just have B items per node, sort of uniformly distributed through the sorted list. And then once you get those items, you go to the appropriate subtree and recurse. The height of such a tree is log base B plus 1 of N, and so it works. B-trees have the nice property that you can also do insertions and deletions in the same amount of time, though that's no longer so optimal. For searches, this is the right answer.
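Here's a minimal B-way search over a sorted array, my own Python sketch assuming distinct keys, rather than a full B-tree. Each iteration of the while loop reads one "block" of B separator keys, which is one memory transfer in the model, and shrinks the candidate range by roughly a factor of B + 1, so there are about log base B + 1 of N transfers.

```python
def b_way_search(a, x, B):
    # Search for x in the sorted list a (distinct keys assumed).
    lo, hi = 0, len(a)                 # current candidate range a[lo:hi]
    while hi - lo > B:
        # One block read: B separators, evenly spaced through the range.
        seps = [lo + (k + 1) * (hi - lo) // (B + 1) for k in range(B)]
        new_lo, new_hi = lo, hi
        for s in seps:                 # find which of the B+1 gaps holds x
            if a[s] <= x:
                new_lo = s
            else:
                new_hi = s
                break
        lo, hi = new_lo, new_hi
    for i in range(lo, hi):            # final block: scan <= B leftovers
        if a[i] == x:
            return i
    return None

a = list(range(0, 1000, 3))            # 334 sorted, distinct keys
print(b_way_search(a, 501, B=10))      # 167, since a[167] == 501
print(b_way_search(a, 500, B=10))      # None: 500 isn't in the list
```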
So the next thing you might want to do: I keep saying, assume it's sorted. I'd really like some sorted data, please. So how do I sort my data? I think the Aggarwal and Vitter paper has this fun quote about how, today, one fourth of all computation is sorting, and some machines are devoted entirely to sorting. It was the problem of the day; everyone was sorting. I assume people still sort, but I'm guessing it's not the dominant activity anymore. And it's a big deal: can I sort within one day, so that all the stuff that I learned today, all the transactions that happened today, I can sort them?

So it turns out the right answer, the sorting bound, is N over B log base M over B of N over B. If you haven't seen that, it looks kind of like a big thing, but those of us in the know can recite it in our sleep. It comes up all over the place; lots of problems are as hard as sorting and can be solved in the sorting bound.

To go back to the problem I was talking about in Floyd's model, the permutation problem (I know the permutation, I know where things are supposed to go, I just need to move them there physically), there it's slightly better. You have the sorting bound, which is essentially what we had before, but in some cases just doing the naive thing is better: sometimes it's better to just take every item and stick it where it belongs with completely random access. So you can always do it, of course, in N memory transfers, and sometimes that is slightly better than the sorting bound, because you don't have the log term. And so the minimum of those two is the right answer to Floyd's problem. He got the upper bound right (in his case, M over B is 3, so the log base M over B is just log base 2), but he missed this other term.

OK. So why is the sorting bound correct? I won't go through the permutation bound. The upper bound is clear. Information-theoretically, it's very easy to see why you can't do better than the sorting bound. Let's set up a little bit of ground rules. Let's suppose that whatever you have in cache, you sort it. Because why not? This is only going to help you, and everything you do in cache is free. So always keep the cache sorted. And to clean up the information that's around, I'm going to first do a pass where I read a block, sort the block, stick it back out, and repeat. So each block is presorted, and there's no sorting information inside a block; it's all about how blocks compare to each other.

So when I read a block (let's say this is my cache, and a new block comes in here), what I learn is where those B items live among the M items that I already had. It's just like the analysis before, except now I'm reading B items among M, instead of one item among B. And so the number of possible outcomes is M plus B choose B: you have M plus B things, and B of them are marked as having come from the new block. You take the log of that, and you get basically B log M over B bits learned from each step.

And the total number of bits we need to learn is N log N, as you know. But we already knew a few bits from that presorting pass at the beginning: we knew N log B bits, because each block of B things was presorted.
We have B log B per block, there are N over B blocks, so it's N log B in total. So we need to learn N log N minus N log B bits, which is N log of N over B. And in each step, we learn B log M over B. So you divide those two things, and you get N over B log base M over B of N over B. It's a good exercise in log rules and information theory. But now you see it's sort of the obvious bound, once you check how many bits you're learning in each step.

OK. How do we achieve this bound? What's an upper bound? I'm going to show you two ways to do it. The easy one, to me the conceptually easiest, is mergesort. They're actually kind of symmetric. So you probably know binary mergesort: you take your items, split them in half, recursively sort, merge. But we know that we can merge M over B sorted lists in linear time as well, meaning N over B time. So instead of doing binary mergesort, where we split in half, we're going to split into M over B equal-sized pieces, recursively sort them all, and then merge.

And the recurrence we get from that is (did I get this right? Yeah) M over B subproblems, each of size a factor of M over B smaller than N, and then to do the merge, we pay N over B plus 1. The plus 1 won't end up mattering; to make it not matter, we need to use a base case for this recurrence that's not 1, but B. B will work. You could also do M, but it doesn't really help you. Once we get down to a single block, of course, we can sort in constant time: read it in, sort it, write it back out.
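Here's that M over B-way mergesort as a compact Python sketch, my own illustration; real external mergesorts stream runs from disk, but the recursion structure and the base case are the point here. Single blocks are sorted "in cache" for free, and heapq.merge plays the role of the M over B parallel scans.

```python
import heapq

def external_mergesort(items, B, M):
    # (M/B)-way mergesort: a single block is a base case (sorted in cache);
    # otherwise split into M // B pieces, sort recursively, k-way merge.
    if len(items) <= B:
        return sorted(items)                  # one block: free in the model
    k = max(2, M // B)
    size = -(-len(items) // k)                # ceiling division
    runs = [external_mergesort(items[i:i + size], B, M)
            for i in range(0, len(items), size)]
    return list(heapq.merge(*runs))           # the M/B parallel scans

print(external_mergesort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0], B=2, M=8))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], with recursion depth log_{M/B}(N/B)
```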
So you want to solve this recurrence. The easy way is to draw a recursion tree. At the root, you have a problem of size N, and we're paying N over B to solve it. We have branching factor M over B, and at the leaves, we have problems of size B, each of which has constant cost. I'm removing the big Os to make this diagram both more legible and more correct, because you can't use big Os when you're using dot dot dot. So no big Os for you.

So then you sum these level by level, and you see we have conservation of mass. We have N things here; we still have N things, they just got distributed. They're all being divided by B, so by linearity you get N over B at every level, including the leaves. The leaves you have to check specially, but there are indeed N over B leaves, because we stop when we get down to size B. So you add this up; we just need to know how many levels there are, and that is log base M over B of N over B, because there are N over B leaves and the branching factor is M over B. So you multiply, and you're done. Easy.

So mergesort is pretty cool, and this works really well in practice. It revolutionized the world of sorting in 1988. Here's a different approach, the inverse, more like quicksort, the one that you know is guaranteed to run in [INAUDIBLE] log N usually. Here, you can't do binary quicksort; you do square root of M over B-way quicksort. The square root is necessary just to do step one.

So step one is, I need to split. Now, I'm not splitting my list into arbitrary chunks: in the answer, the sorted answer, I need to find things that are evenly spaced. That's the hard part. Usually, you find the median to do this, but now we have to find square root of M over B median-like elements spread out through the answer. And we don't know the answer, so it's a little tricky.

Then, once we have those partition elements, we can just do it. This is the square root of M over B-way scan again: you scan through the data; for each item, you see how it compares to the partition elements (there aren't very many of them), and then you write it out to the corresponding list. You get square root of M over B plus 1 lists. And so that's efficient, because it's just a scan, or parallel scans.
794 00:33:35,960 --> 00:33:38,180 And then you recurse, and there's no combination.
795 00:33:38,180 --> 00:33:40,040 There's no merging to do.
796 00:33:40,040 --> 00:33:41,540 Once you've got them set up there,
797 00:33:41,540 --> 00:33:43,350 you recursively sort, and you're done.
798 00:33:43,350 --> 00:33:46,790 So the recurrence is exactly the same as mergesort.
799 00:33:46,790 --> 00:33:49,070 And the hard part is, how do you do this partitioning?
800 00:33:49,070 --> 00:33:51,162 And I'll just quickly sketch that.
801 00:33:51,162 --> 00:33:53,120 This is probably the most complicated algorithm
802 00:33:53,120 --> 00:33:56,330 in these slides.
803 00:33:56,330 --> 00:33:57,710 I'll tell you the algorithm.
804 00:33:57,710 --> 00:34:02,990 Exactly why it works will be familiar if you know the Blum
805 00:34:02,990 --> 00:34:07,090 et al. linear-time median-finding algorithm
806 00:34:07,090 --> 00:34:09,540 for regular internal memory.
807 00:34:09,540 --> 00:34:12,600 Here's what we're going to do.
808 00:34:12,600 --> 00:34:17,513 We're going to read M items into our cache and sort them.
809 00:34:17,513 --> 00:34:19,429 So that's a piece of the answer in some sense.
810 00:34:19,429 --> 00:34:21,000 But how it relates to the answer,
811 00:34:21,000 --> 00:34:23,360 which subset of the answer it is, we don't know.
812 00:34:23,360 --> 00:34:26,840 Sample that piece of the answer like this.
813 00:34:26,840 --> 00:34:30,290 Every square root of M over B items, take one guy.
814 00:34:30,290 --> 00:34:32,540 Spit that into an output list of samples.
815 00:34:32,540 --> 00:34:34,929 Do this over and over for all the items--
816 00:34:34,929 --> 00:34:37,699 read in M, sort, sample, spit out--
817 00:34:37,699 --> 00:34:39,770 and you end up with this many items.
818 00:34:39,770 --> 00:34:43,167 This is basically a trick to shrink your input.
819 00:34:43,167 --> 00:34:45,500 So now, we can do inefficient things on this many items,
820 00:34:45,500 --> 00:34:47,690 because there aren't that many of them.
821 00:34:47,690 --> 00:34:49,969 So what do we do?
822 00:34:49,969 --> 00:34:54,500 We just run the regular linear-time selection algorithm
823 00:34:54,500 --> 00:34:57,590 that you know and love from algorithms class
824 00:34:57,590 --> 00:35:02,330 to find the right item.
825 00:35:02,330 --> 00:35:07,370 So if you were splitting into four pieces,
826 00:35:07,370 --> 00:35:10,370 then you'd want the 25%, 50%, and 75% elements.
827 00:35:10,370 --> 00:35:12,684 You know how to find each of those in linear time.
828 00:35:12,684 --> 00:35:14,100 And it turns out if you re-analyze
829 00:35:14,100 --> 00:35:16,040 the regular linear-time selection, indeed,
830 00:35:16,040 --> 00:35:19,010 it runs in N over B time in external memory.
831 00:35:19,010 --> 00:35:21,177 So that's great.
832 00:35:21,177 --> 00:35:23,510 But now, we're doing this just repeatedly, over and over.
833 00:35:23,510 --> 00:35:24,860 You find the 25%.
834 00:35:24,860 --> 00:35:25,670 You find the 50%.
835 00:35:25,670 --> 00:35:27,450 For each of them, you spend linear time.
836 00:35:27,450 --> 00:35:29,300 But you multiply it out.
837 00:35:29,300 --> 00:35:31,400 You're only finding square root of M over B of them.
838 00:35:31,400 --> 00:35:35,799 And linear time on the samples is not N over B; it's N divided by this mess.
839 00:35:35,799 --> 00:35:37,340 You multiply them out, it disappears.
840 00:35:37,340 --> 00:35:39,860 You end up in regular linear time,
841 00:35:39,860 --> 00:35:42,500 N over B. You find a good set of partitions-- see the sketch below.
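A rough Python sketch of that partition-finding step, with hypothetical names: `select` stands in for the Blum et al. linear-time selection routine (plain sorting is used there only to keep the sketch short), and the actual block-by-block I/O is elided.

    def select(items, i):
        # Stand-in for linear-time selection: the i-th smallest item.
        # A real implementation would use median-of-medians.
        return sorted(items)[i]

    def find_partitions(data, M, k):
        """Find k approximately evenly spaced partition elements.

        data: all N input items, conceptually read M at a time.
        M:    how many items fit in the cache.
        k:    number of pivots wanted, ~sqrt(M/B) in the real algorithm.
        """
        stride = max(1, int(M ** 0.5))  # stand-in for sqrt(M/B)
        samples = []
        for start in range(0, len(data), M):
            chunk = sorted(data[start:start + M])  # fits in cache: sort it
            samples.extend(chunk[::stride])        # keep every stride-th item
        # The sample set is small, so run selection on it repeatedly
        # to pull out the 1/(k+1), 2/(k+1), ... quantiles as pivots.
        n = len(samples)
        return [select(samples, i * n // (k + 1)) for i in range(1, k + 1)]

Together with the distribution scan sketched above, that's the whole algorithm.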
842 00:35:42,500 --> 00:35:45,041 Why this is a good set is not totally clear.
843 00:35:45,041 --> 00:35:46,040 I won't justify it here.
844 00:35:46,040 --> 00:35:50,170 But it is good, so don't worry.
845 00:35:50,170 --> 00:35:51,710 OK.
846 00:35:51,710 --> 00:35:54,230 One embellishment to the external memory model
847 00:35:54,230 --> 00:35:59,700 before I go on is to distinguish-- not just saying,
848 00:35:59,700 --> 00:36:01,970 oh, well, every block is equally good.
849 00:36:01,970 --> 00:36:04,370 You want to count how many blocks you read.
850 00:36:04,370 --> 00:36:06,620 When you read one item, you get the whole block.
851 00:36:06,620 --> 00:36:07,934 And you'd better use that block.
852 00:36:07,934 --> 00:36:09,350 But you can furthermore say, well,
853 00:36:09,350 --> 00:36:11,960 it would be really good if I read a whole bunch of blocks
854 00:36:11,960 --> 00:36:12,930 in sequence.
855 00:36:12,930 --> 00:36:15,650 There are lots of reasons for this. In particular,
856 00:36:15,650 --> 00:36:17,990 disks are really good at sequential access,
857 00:36:17,990 --> 00:36:19,220 because they're spinning.
858 00:36:19,220 --> 00:36:21,769 It's very easy to seek to the thing right after you.
859 00:36:21,769 --> 00:36:23,810 First of all, it's easy to read the entire track,
860 00:36:23,810 --> 00:36:25,018 the whole circle of the disk.
861 00:36:25,018 --> 00:36:29,110 And it's easy to keep moving along like that.
862 00:36:29,110 --> 00:36:31,370 So here's a model that captures the idea
863 00:36:31,370 --> 00:36:35,756 that sequential block reads or writes are better than random.
864 00:36:35,756 --> 00:36:37,130 So here's the idea of sequential.
865 00:36:37,130 --> 00:36:44,600 If you read M items, so you read M over B blocks in sequence,
866 00:36:44,600 --> 00:36:47,030 then each of those is considered to be a sequential memory
867 00:36:47,030 --> 00:36:47,960 transfer.
868 00:36:47,960 --> 00:36:51,260 If you break that sequence, then you're starting a new sequence.
869 00:36:51,260 --> 00:36:53,540 Or it's just a random access if you don't fall
870 00:36:53,540 --> 00:36:56,250 into a big run like this.
871 00:36:56,250 --> 00:36:58,650 So there are a couple of results in this model.
872 00:36:58,650 --> 00:37:02,750 One is this harder version of external memory.
873 00:37:02,750 --> 00:37:05,180 So one thing is, what about sorting?
874 00:37:05,180 --> 00:37:06,335 We just covered sorting.
875 00:37:06,335 --> 00:37:09,320 It turns out those are pretty random-access in the algorithms
876 00:37:09,320 --> 00:37:10,100 we saw.
877 00:37:10,100 --> 00:37:15,410 But if you use binary mergesort, it is sequential.
878 00:37:15,410 --> 00:37:18,505 As you binary merge, things are good.
879 00:37:18,505 --> 00:37:19,880 And that's, essentially, the best
880 00:37:19,880 --> 00:37:22,070 you can do, surprisingly, in this model.
881 00:37:22,070 --> 00:37:27,650 If you want the number of random memory transfers
882 00:37:27,650 --> 00:37:29,900 to be little o of the sorting bound--
883 00:37:29,900 --> 00:37:32,300 so you want more than a constant fraction
884 00:37:32,300 --> 00:37:35,660 to be sequential-- then you need to use
885 00:37:35,660 --> 00:37:38,930 at least this many total memory transfers.
886 00:37:38,930 --> 00:37:44,390 And so binary mergesort is optimal in this model,
887 00:37:44,390 --> 00:37:47,780 assuming you want a reasonable number of sequential accesses.
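To make the accounting concrete, here's a tiny Python sketch of one plausible reading of this model: the rule that a run of up to M/B consecutive blocks counts as sequential is from the slide, while treating the transfer that starts a run as random, and all the names, are assumptions of the sketch.

    def count_transfers(block_trace, M_over_B):
        """Classify each block transfer in a trace as sequential or random.

        A transfer is sequential if it touches the block right after the
        previous one and the current run is shorter than M/B blocks;
        otherwise it breaks the run and counts as a random transfer.
        """
        sequential = random = 0
        prev, run = None, 0
        for b in block_trace:
            if prev is not None and b == prev + 1 and run < M_over_B:
                sequential += 1
                run += 1
            else:
                random += 1
                run = 1
            prev = b
        return sequential, random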
888 00:37:47,780 --> 00:37:50,060 And the main point of this paper was
889 00:37:50,060 --> 00:37:51,579 to solve suffix-tree construction
890 00:37:51,579 --> 00:37:52,370 in external memory.
891 00:37:52,370 --> 00:37:54,740 And what they prove is it reduces to sorting,
892 00:37:54,740 --> 00:37:56,000 essentially, and scans.
893 00:37:56,000 --> 00:37:57,740 And scans are good.
894 00:37:57,740 --> 00:38:00,170 So you get this exact same trade-off
895 00:38:00,170 --> 00:38:04,210 for suffix-tree construction-- a fair representation?
896 00:38:04,210 --> 00:38:08,391 I have to be careful, because so many of the authors are in this room.
897 00:38:08,391 --> 00:38:08,890 Cool.
898 00:38:08,890 --> 00:38:10,727 So let's move on to a different model.
899 00:38:10,727 --> 00:38:12,310 This is a model that did not catch on.
900 00:38:12,310 --> 00:38:13,810 But it's fun for historical reasons
901 00:38:13,810 --> 00:38:18,070 to see what it was about.
902 00:38:18,070 --> 00:38:20,960 You can see two issues here.
903 00:38:20,960 --> 00:38:23,800 One is, what about a deeper memory hierarchy?
904 00:38:23,800 --> 00:38:25,630 Two levels is nice.
905 00:38:25,630 --> 00:38:28,480 Yeah, in practice, two levels are all that matter.
906 00:38:28,480 --> 00:38:31,420 But we should really understand multiple levels.
907 00:38:31,420 --> 00:38:33,290 Surely, there's a clean way to do that.
908 00:38:33,290 --> 00:38:36,190 And so there are a bunch of models that try to do this.
909 00:38:36,190 --> 00:38:38,950 And by the end, we get something that's reasonable.
910 00:38:38,950 --> 00:38:43,060 And HMM is probably one of my favorite weird models.
911 00:38:43,060 --> 00:38:44,890 It's "particularly simple."
912 00:38:44,890 --> 00:38:47,670 This is a quote from their own paper,
913 00:38:47,670 --> 00:38:48,910 not that they're boastful.
914 00:38:48,910 --> 00:38:49,870 It is a simple model.
915 00:38:49,870 --> 00:38:51,850 This is true.
916 00:38:51,850 --> 00:38:55,390 And it does model, in some sense, a larger hierarchy.
917 00:38:55,390 --> 00:38:57,430 But the way it's phrased initially
918 00:38:57,430 --> 00:39:01,290 doesn't look like this picture, but they're equivalent.
919 00:39:01,290 --> 00:39:02,440 So it's a RAM model.
920 00:39:02,440 --> 00:39:05,150 So your memory is an array.
921 00:39:05,150 --> 00:39:09,490 If you want to access position x in the array, you pay f of x.
922 00:39:09,490 --> 00:39:12,970 And in the original definition, that's just log x.
923 00:39:12,970 --> 00:39:15,670 So what that corresponds to is the first item is free.
924 00:39:15,670 --> 00:39:17,110 The second item costs 1.
925 00:39:17,110 --> 00:39:19,240 The next two items cost 2.
926 00:39:19,240 --> 00:39:20,800 The next four items cost 3.
927 00:39:20,800 --> 00:39:23,230 The next eight items cost 4, and so on.
928 00:39:23,230 --> 00:39:26,470 So it's exactly this kind of memory hierarchy.
929 00:39:26,470 --> 00:39:27,820 And you can move items.
930 00:39:27,820 --> 00:39:28,490 You can copy.
931 00:39:28,490 --> 00:39:31,180 And you can do all the things you can do in a RAM.
932 00:39:31,180 --> 00:39:34,560 So this is a pretty good model of hierarchical memory.
933 00:39:34,560 --> 00:39:36,310 It's just a little hard.
934 00:39:36,310 --> 00:39:39,310 So, originally, they defined it with log x
935 00:39:39,310 --> 00:39:42,730 based on this book, which is the classic reference on VLSI
936 00:39:42,730 --> 00:39:44,050 at the time by Mead and Conway.
937 00:39:44,050 --> 00:39:47,380 It sort of revolutionized teaching VLSI.
938 00:39:47,380 --> 00:39:49,990 And it has this particular construction
939 00:39:49,990 --> 00:39:52,060 of a hierarchical RAM.
940 00:39:52,060 --> 00:39:54,200 I don't know if RAMs are actually built this way.
941 00:39:54,200 --> 00:39:57,100 But they have a sketch of how to do it
942 00:39:57,100 --> 00:40:00,500 that achieves logarithmic performance.
943 00:40:00,500 --> 00:40:05,980 The deeper you are, the more you pay--
944 00:40:05,980 --> 00:40:08,560 the bigger your space is, you need
945 00:40:08,560 --> 00:40:11,940 to pay logarithmically to access it.
946 00:40:11,940 --> 00:40:12,460 OK.
947 00:40:12,460 --> 00:40:14,650 So here are the results that they get in this model.
948 00:40:14,650 --> 00:40:15,816 I'm not going to prove them,
949 00:40:15,816 --> 00:40:18,970 because, again, they follow from the external-memory results in some sense.
950 00:40:18,970 --> 00:40:23,002 But you've got matrix multiplication, FFT, sorting,
951 00:40:23,002 --> 00:40:25,210 scanning, binary search-- a lot of the usual problems.
952 00:40:25,210 --> 00:40:31,150 You get kind of weird running times-- logs, log-logs, and so on.
953 00:40:31,150 --> 00:40:33,280 Here, it's a matter of slowdown versus speedup,
954 00:40:33,280 --> 00:40:36,864 because everything is going to cost more than constant now.
955 00:40:36,864 --> 00:40:38,280 So you want to minimize slowdowns.
956 00:40:38,280 --> 00:40:39,350 Sometimes you get constant.
957 00:40:39,350 --> 00:40:41,110 The worst slowdown you can get is log N,
958 00:40:41,110 --> 00:40:43,180 because everything can be accessed in, at most,
959 00:40:43,180 --> 00:40:44,870 log N time in this model.
960 00:40:44,870 --> 00:40:49,870 But I would say setting f of x to be log x doesn't really
961 00:40:49,870 --> 00:40:51,670 reveal what we care about.
962 00:40:51,670 --> 00:40:54,580 But in the same paper, they give a better perspective
963 00:40:54,580 --> 00:40:56,150 on their own work.
964 00:40:56,150 --> 00:40:59,290 So they say, well, let's look at the general case.
965 00:40:59,290 --> 00:41:00,890 Maybe log x isn't the right thing.
966 00:41:00,890 --> 00:41:02,890 Let's look at an arbitrary f of x.
967 00:41:02,890 --> 00:41:04,390 Well, you could write an arbitrary f
968 00:41:04,390 --> 00:41:08,320 of x as a weighted sum of threshold functions.
969 00:41:08,320 --> 00:41:10,300 I want to know: is x bigger than xi?
970 00:41:10,300 --> 00:41:12,910 If so, I pay wi.
971 00:41:12,910 --> 00:41:15,989 Well, that is just like this picture.
972 00:41:15,989 --> 00:41:17,530 Any function can be written like that
973 00:41:17,530 --> 00:41:19,390 if it's a discrete function.
974 00:41:19,390 --> 00:41:21,530 But you can also think of it in this form
975 00:41:21,530 --> 00:41:23,530 if the xi's are sorted.
976 00:41:23,530 --> 00:41:26,410 After you get beyond x0 items, you pay w0.
977 00:41:26,410 --> 00:41:31,010 After you get beyond x1 items total, you pay w1, and so on.
978 00:41:31,010 --> 00:41:33,400 So this gives you an arbitrary memory hierarchy,
979 00:41:33,400 --> 00:41:35,134 even with growing and shrinking sizes,
980 00:41:35,134 --> 00:41:36,550 which you'd never see in practice.
981 00:41:36,550 --> 00:41:38,810 But this is the general case.
982 00:41:38,810 --> 00:41:40,630 And we are going to assume here that f
983 00:41:40,630 --> 00:41:44,720 is polynomially bounded to make these functions reasonable-- see the formula below.
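In symbols, the decomposition and the assumption just described are the following (a sketch, with the bracket denoting an indicator function; the notation is mine):

    f(x) \;=\; \sum_i w_i \, [\, x > x_i \,],
    \qquad
    f(2x) \;=\; O\bigl(f(x)\bigr) \quad \text{(polynomially bounded)}.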
984 00:41:44,720 --> 00:41:47,230 So when you double the input, you only change the output
985 00:41:47,230 --> 00:41:48,190 by a constant factor.
986 00:41:51,350 --> 00:41:52,610 OK.
987 00:41:52,610 --> 00:41:53,110 Fine.
988 00:41:53,110 --> 00:41:55,060 So we have to solve this weighted sum.
989 00:41:55,060 --> 00:41:57,010 But let's just look at one of these.
990 00:41:57,010 --> 00:41:59,284 This is kind of the canonical function.
991 00:41:59,284 --> 00:42:00,950 The rest is just a weighted sum of them.
992 00:42:00,950 --> 00:42:03,074 And if you assume this polynomially bounded property,
993 00:42:03,074 --> 00:42:05,860 really it suffices to look at this.
994 00:42:05,860 --> 00:42:12,430 So this is called f sub M. We pay 1
995 00:42:12,430 --> 00:42:16,970 to access anything beyond M. And we pay 0 otherwise.
996 00:42:16,970 --> 00:42:20,230 So they've taken general f with this deep hierarchy,
997 00:42:20,230 --> 00:42:26,170 and they've reduced it to this model, the red-blue pebble
998 00:42:26,170 --> 00:42:28,232 game, which we've already seen.
999 00:42:28,232 --> 00:42:30,190 I don't know if they mentioned this explicitly,
1000 00:42:30,190 --> 00:42:32,440 but it's the same model again.
1001 00:42:32,440 --> 00:42:35,350 And that's good, because a lot of problems-- well,
1002 00:42:35,350 --> 00:42:36,800 they hadn't been solved exactly.
1003 00:42:36,800 --> 00:42:38,674 I would say, now, this paper is the first one
1004 00:42:38,674 --> 00:42:40,780 to really say, OK, sorting, what's the best
1005 00:42:40,780 --> 00:42:43,660 way I can sort in this model?
1006 00:42:43,660 --> 00:42:45,030 And they get something.
1007 00:42:45,030 --> 00:42:46,430 Do I have it here?
1008 00:42:46,430 --> 00:42:47,060 Yeah.
1009 00:42:47,060 --> 00:42:49,150 They aim for uniform optimality.
1010 00:42:49,150 --> 00:42:51,700 This means there's one algorithm that
1011 00:42:51,700 --> 00:42:56,170 works optimally for this threshold function no matter
1012 00:42:56,170 --> 00:42:57,040 what M is.
1013 00:42:57,040 --> 00:42:58,450 The algorithm doesn't get to know
1014 00:42:58,450 --> 00:43:00,340 M. You might say the algorithm is
1015 00:43:00,340 --> 00:43:04,800 oblivious to M. Sound familiar?
1016 00:43:04,800 --> 00:43:05,960 So this is a cool idea.
1017 00:43:05,960 --> 00:43:07,880 Of course, it does not have blocking yet.
1018 00:43:07,880 --> 00:43:10,160 But none of these models has blocking.
1019 00:43:10,160 --> 00:43:12,160 But they prove that if you're uniformly optimal,
1020 00:43:12,160 --> 00:43:16,010 if you work in the red-blue pebble game model for all M
1021 00:43:16,010 --> 00:43:17,810 with one algorithm, then, in fact, you
1022 00:43:17,810 --> 00:43:20,540 are optimal for all f of x, which means,
1023 00:43:20,540 --> 00:43:23,840 in particular for the deep hierarchy, you also work.
1024 00:43:23,840 --> 00:43:27,200 And they achieve tight bounds for a bunch of problems here.
1025 00:43:27,200 --> 00:43:29,270 You should recognize all of these bounds
1026 00:43:29,270 --> 00:43:32,030 are now, in some sense, particular cases
1027 00:43:32,030 --> 00:43:33,950 of the external memory bounds.
1028 00:43:33,950 --> 00:43:35,482 So for sorting, you have this,
1029 00:43:35,482 --> 00:43:37,940 except there's no B. The B has disappeared, because there's
1030 00:43:37,940 --> 00:43:38,960 no B in this model.
1031 00:43:38,960 --> 00:43:41,120 But, otherwise, it is N over B log base M over B
1032 00:43:41,120 --> 00:43:44,030 of N over B with B set to 1, and so on down the line.
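Concretely, the sorting row reads as the external-memory sorting bound with the block size pinned to 1 (a correspondence sketched here in my notation):

    \Theta\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)\Bigg|_{B=1}
    \;=\; \Theta\bigl(N \log_{M} N\bigr).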
1033 00:43:44,030 --> 00:43:47,930 They said, oh, search here is really bad, because caching
1034 00:43:47,930 --> 00:43:49,310 doesn't really help for search.
1035 00:43:49,310 --> 00:43:50,750 But blocks help for search.
1036 00:43:50,750 --> 00:43:53,270 So when there's no B, these are exactly the bounds
1037 00:43:53,270 --> 00:43:55,160 you get for external memory.
1038 00:43:55,160 --> 00:43:56,900 So I mean, some of these were known.
1039 00:43:56,900 --> 00:43:59,240 These were already known by Hong and Kung,
1040 00:43:59,240 --> 00:44:01,760 because it's the same special case.
1041 00:44:01,760 --> 00:44:04,379 And then the others followed from external memory.
1042 00:44:04,379 --> 00:44:05,420 But this is kind of neat.
1043 00:44:05,420 --> 00:44:10,040 They're doing it in a somewhat stronger sense, because it's
1044 00:44:10,040 --> 00:44:14,000 uniform, without knowing M. So the uniformity
1045 00:44:14,000 --> 00:44:16,110 doesn't follow from this.
1046 00:44:16,110 --> 00:44:17,240 But they get uniformity.
1047 00:44:17,240 --> 00:44:20,930 And therefore, it works for all f.
1048 00:44:20,930 --> 00:44:22,860 OK.
1049 00:44:22,860 --> 00:44:24,402 They had another fun fact, which will
1050 00:44:24,402 --> 00:44:25,776 look familiar to those of you who
1051 00:44:25,776 --> 00:44:28,050 know the cache-oblivious model, which we'll get to.
1052 00:44:28,050 --> 00:44:29,967 They have this observation that while we
1053 00:44:29,967 --> 00:44:32,550 have these algorithms that are explicitly moving things around
1054 00:44:32,550 --> 00:44:34,230 in our RAM, it would be nice if we
1055 00:44:34,230 --> 00:44:37,080 didn't have to write that down explicitly in the algorithm.
1056 00:44:37,080 --> 00:44:40,650 Could we just use least-recently-used replacement,
1057 00:44:40,650 --> 00:44:43,550 so move things forward?
1058 00:44:43,550 --> 00:44:45,900 That works great if you know what M is.
1059 00:44:45,900 --> 00:44:49,490 Then you say, OK, if I need to get something from out here,
1060 00:44:49,490 --> 00:44:50,670 I'll move it over here.
1061 00:44:50,670 --> 00:44:53,190 And whatever was least recently used, I'll kick out.
1062 00:44:53,190 --> 00:44:55,020 And at this point-- just a couple
1063 00:44:55,020 --> 00:44:56,280 of years prior to this paper--
1064 00:44:56,280 --> 00:44:59,910 Sleator and Tarjan did the first paper on competitive analysis.
1065 00:44:59,910 --> 00:45:02,470 And they proved that LRU or even first in,
1066 00:45:02,470 --> 00:45:05,310 first out is good in the sense that if you just
1067 00:45:05,310 --> 00:45:08,330 double the size of your cache--
1068 00:45:08,330 --> 00:45:09,870 oh, I got this backwards.
1069 00:45:09,870 --> 00:45:13,290 T LRU of twice the cache is, at most,
1070 00:45:13,290 --> 00:45:15,540 2 times T OPT of the original cache.
1071 00:45:15,540 --> 00:45:16,800 So the 2 should be over here-- written out below.
1072 00:45:20,100 --> 00:45:20,850 Great.
1073 00:45:20,850 --> 00:45:24,180 And assuming you have a polynomially bounded growth
1074 00:45:24,180 --> 00:45:27,130 function, then this is only losing a constant factor.
1075 00:45:27,130 --> 00:45:27,630 OK.
1076 00:45:27,630 --> 00:45:28,830 But we don't know what M is.
1077 00:45:28,830 --> 00:45:31,271 This works for the threshold function f sub M.
1078 00:45:31,271 --> 00:45:33,270 But it doesn't work for an arbitrary function f,
1079 00:45:33,270 --> 00:45:35,330 or it doesn't work uniformly.
1080 00:45:35,330 --> 00:45:36,700 And we want a uniform solution.
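The corrected Sleator–Tarjan resource-augmentation bound, written out (the statement is standard; the notation here is mine):

    T_{\mathrm{LRU}}(2M) \;\le\; 2 \, T_{\mathrm{OPT}}(M).

That is, LRU with a cache of twice the size incurs at most twice the memory transfers of the optimal offline replacement. Back to the uniform solution we wanted.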
1081 00:45:36,700 --> 00:45:37,830 And they gave one.
1082 00:45:37,830 --> 00:45:39,619 I'll just sketch it here.
1083 00:45:39,619 --> 00:45:41,535 The idea is you have this arbitrary hierarchy.
1084 00:45:41,535 --> 00:45:42,960 You don't really know it.
1085 00:45:42,960 --> 00:45:45,720 Here, I'm going to assume I do know what f is.
1086 00:45:45,720 --> 00:45:47,600 So this is not uniform.
1087 00:45:47,600 --> 00:45:49,530 It's achieved in a different way.
1088 00:45:49,530 --> 00:45:52,500 But I'm going to basically rearrange the structure
1089 00:45:52,500 --> 00:45:55,200 to be roughly exponential to say, well,
1090 00:45:55,200 --> 00:45:57,390 I'm going to measure f of x as x increases.
1091 00:45:57,390 --> 00:45:59,714 And whenever f of x doubles, I'll draw a line.
1092 00:45:59,714 --> 00:46:01,380 These are not where the real levels are.
1093 00:46:01,380 --> 00:46:02,790 It's just a conceptual thing.
1094 00:46:02,790 --> 00:46:04,980 And then I do LRU on this structure.
1095 00:46:04,980 --> 00:46:08,100 So if I want to access something here, I pull it out.
1096 00:46:08,100 --> 00:46:08,940 I stick it in here.
1097 00:46:08,940 --> 00:46:10,920 Whatever is least recently used gets kicked out here.
1098 00:46:10,920 --> 00:46:12,000 And whatever is least recently used
1099 00:46:12,000 --> 00:46:13,550 gets kicked out here, here, here.
1100 00:46:13,550 --> 00:46:15,485 And you do a chain of LRUs.
1101 00:46:15,485 --> 00:46:17,610 Then you can prove that is within a constant factor
1102 00:46:17,610 --> 00:46:21,830 of optimal, but you do have to pay a startup cost.
1103 00:46:21,830 --> 00:46:24,080 It's similar to the move-to-front analysis
1104 00:46:24,080 --> 00:46:26,330 from Sleator and Tarjan.
1105 00:46:26,330 --> 00:46:27,200 OK.
1106 00:46:27,200 --> 00:46:30,590 Enough about HMM, sort of.
1107 00:46:30,590 --> 00:46:32,740 The next model is called BT.
1108 00:46:32,740 --> 00:46:35,810 It's the same as HMM, but they add blocks.
1109 00:46:35,810 --> 00:46:39,020 But not the blocks that we know from computer architecture,
1110 00:46:39,020 --> 00:46:41,030 but a different kind of block thing.
1111 00:46:41,030 --> 00:46:43,040 It's kind of similar.
1112 00:46:43,040 --> 00:46:45,800 Probably, [INAUDIBLE] constant factors and not so different.
1113 00:46:45,800 --> 00:46:50,312 So you have the old thing: accessing x costs f of x.
1114 00:46:50,312 --> 00:46:51,770 But, now, you have a new operation,
1115 00:46:51,770 --> 00:46:53,950 which is I can copy any interval, which
1116 00:46:53,950 --> 00:46:57,110 would look something like this, from x minus delta to x.
1117 00:46:57,110 --> 00:47:00,260 And I can copy it to y minus delta to y.
1118 00:47:00,260 --> 00:47:05,210 And I pay the time to seek there, f of max of x and y.
1119 00:47:05,210 --> 00:47:06,920 Or you could do f of x plus f of y.
1120 00:47:06,920 --> 00:47:08,220 It doesn't matter.
1121 00:47:08,220 --> 00:47:09,720 And then you pay plus delta-- see the formula below.
1122 00:47:09,720 --> 00:47:12,915 So you can move a big chunk relatively quickly.
1123 00:47:12,915 --> 00:47:15,290 You just pay once to get there, and then you can move it.
1124 00:47:15,290 --> 00:47:18,560 This is a lot more reasonable than HMM.
1125 00:47:18,560 --> 00:47:21,890 But it makes things a lot messier is the short answer.
1126 00:47:21,890 --> 00:47:24,969 Because-- here's a block move--
1127 00:47:24,969 --> 00:47:26,510 these are the sort of bounds you get.
1128 00:47:26,510 --> 00:47:28,100 They depend now on f.
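Restating that block-copy cost in symbols (a sketch following the description above, in my notation):

    \mathrm{cost}\bigl([\,x-\delta,\,x\,] \to [\,y-\delta,\,y\,]\bigr)
    \;=\; f\bigl(\max(x, y)\bigr) + \delta .

Using f(x) + f(y) in place of the max changes the model by at most a constant factor.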
1129 00:47:28,100 --> 00:47:31,280 And you don't get the same kind of uniformity,
1130 00:47:31,280 --> 00:47:32,780 as far as I can tell.
1131 00:47:32,780 --> 00:47:35,150 You can't just say, oh, it works for all f.
1132 00:47:35,150 --> 00:47:39,080 For each of these problems, this is basically scanning or matrix
1133 00:47:39,080 --> 00:47:40,430 multiplication.
1134 00:47:40,430 --> 00:47:43,460 It doesn't matter much until f of x gets really big, and then
1135 00:47:43,460 --> 00:47:45,170 something changes.
1136 00:47:45,170 --> 00:47:47,810 For dot product, you get log star, log log, or log,
1137 00:47:47,810 --> 00:47:51,950 depending on whether your f of x is log, subpolynomial,
1138 00:47:51,950 --> 00:47:53,450 or linear.
1139 00:47:53,450 --> 00:47:55,130 So I find this kind of unsatisfying.
1140 00:47:55,130 --> 00:47:59,180 So I'm just going to move on to MH, which is probably
1141 00:47:59,180 --> 00:48:01,100 the messiest of the models.
1142 00:48:01,100 --> 00:48:04,050 But in some sense, it's the most realistic of the models.
1143 00:48:04,050 --> 00:48:05,660 Here's the picture which I would draw
1144 00:48:05,660 --> 00:48:07,960 if someone asked me to draw a general memory hierarchy.
1145 00:48:07,960 --> 00:48:09,990 I have a CPU that connects to this cache for free.
1146 00:48:09,990 --> 00:48:12,140 It has blocks of size B0.
1147 00:48:12,140 --> 00:48:16,250 And to go to the next memory, it costs me some time, t0.
1148 00:48:16,250 --> 00:48:20,180 And the blocks that I read here are of size B0, and I write blocks of size B0.
1149 00:48:20,180 --> 00:48:22,070 So the transfers here are size B0.
1150 00:48:22,070 --> 00:48:24,030 And the next level has potentially a different block size.
1151 00:48:24,030 --> 00:48:26,030 It has a different cache size, M1.
1152 00:48:26,030 --> 00:48:27,620 And you pay.
1153 00:48:27,620 --> 00:48:30,410 So these blocks are subdivided into B0-sized blocks,
1154 00:48:30,410 --> 00:48:31,700 which is what happens here.
1155 00:48:31,700 --> 00:48:34,089 This is a generic multi-level memory hierarchy picture.
1156 00:48:34,089 --> 00:48:36,380 It's the obvious extension of the external memory model
1157 00:48:36,380 --> 00:48:38,960 to arbitrarily many levels.
1158 00:48:38,960 --> 00:48:41,660 And-- making it not exactly easy to program--
1159 00:48:41,660 --> 00:48:43,660 all levels can be transferring at once.
1160 00:48:43,660 --> 00:48:48,500 This is realistic, but hard to manipulate.
1161 00:48:48,500 --> 00:48:52,610 And they thought, oh, well, L parameters
1162 00:48:52,610 --> 00:48:54,360 for an L-level hierarchy is too many.
1163 00:48:54,360 --> 00:48:58,340 So let's reduce it to two parameters and one function.
1164 00:48:58,340 --> 00:49:00,800 So assume that the B's grow exponentially--
1165 00:49:00,800 --> 00:49:03,300 that these things grow roughly the same way,
1166 00:49:03,300 --> 00:49:04,940 with some aspect ratio alpha.
1167 00:49:04,940 --> 00:49:06,260 And then the ti--
1168 00:49:06,260 --> 00:49:08,092 this is the part that's hard to guess--
1169 00:49:08,092 --> 00:49:09,050 it grows exponentially.
1170 00:49:09,050 --> 00:49:11,450 And then there's some f of i, which we don't know--
1171 00:49:11,450 --> 00:49:13,070 maybe it's log i.
1172 00:49:13,070 --> 00:49:16,010 And because of that, this doesn't really
1173 00:49:16,010 --> 00:49:17,420 clean up the model enough.
1174 00:49:17,420 --> 00:49:20,600 You get bounds which-- it's interesting.
1175 00:49:20,600 --> 00:49:23,720 You can say as long as f of i is, at most, something,
1176 00:49:23,720 --> 00:49:26,120 then we get optimal bounds.
1177 00:49:26,120 --> 00:49:29,190 But sometimes when f of i grows, things change.
1178 00:49:29,190 --> 00:49:31,280 And it's interesting.
1179 00:49:31,280 --> 00:49:32,870 These algorithms follow approaches
1180 00:49:32,870 --> 00:49:35,570 that we will see in a moment-- divide and conquer.
1181 00:49:35,570 --> 00:49:39,420 But it's hard to state what the answers are.
1182 00:49:39,420 --> 00:49:41,636 What's B4?
1183 00:49:41,636 --> 00:49:42,760 I think that's just a typo.
1184 00:49:42,760 --> 00:49:45,530 That should be blank.
1185 00:49:45,530 --> 00:49:48,520 I mean, it's hard to beat an upper bound of 1.
1186 00:49:48,520 --> 00:49:50,810 It also seems wrong.
1187 00:49:50,810 --> 00:49:53,280 Ignore that row.
1188 00:49:53,280 --> 00:49:53,840 All right.
1189 00:49:53,840 --> 00:49:58,430 Finally, we go to the cache-oblivious model by Frigo
1190 00:49:58,430 --> 00:50:01,140 et al. in 1999.
1191 00:50:01,140 --> 00:50:02,430 This is another clean model.
1192 00:50:02,430 --> 00:50:07,000 And this is the other of the two models that really caught on.
1193 00:50:07,000 --> 00:50:10,400 It's motivated by all the models you've just seen.
1194 00:50:10,400 --> 00:50:13,310 And in particular, it picks up on the other successful model,
1195 00:50:13,310 --> 00:50:15,350 the External Memory Model, and says, OK,
1196 00:50:15,350 --> 00:50:17,750 let's take the External Memory Model-- exactly the same cost
1197 00:50:17,750 --> 00:50:18,410 model.
1198 00:50:18,410 --> 00:50:21,325 But suppose your algorithm doesn't know B or M.
1199 00:50:21,325 --> 00:50:23,450 And we're going to analyze it in this model knowing
1200 00:50:23,450 --> 00:50:24,230 what B and M are.
1201 00:50:24,230 --> 00:50:26,690 But, really, one algorithm has to work for all B and M.
1202 00:50:26,690 --> 00:50:30,560 This is uniformity from the--
1203 00:50:30,560 --> 00:50:33,170 I can't even remember the model names--
1204 00:50:33,170 --> 00:50:36,920 not UMH, but the HMM model.
1205 00:50:36,920 --> 00:50:39,110 So it's taking that idea, but applying it
1206 00:50:39,110 --> 00:50:41,450 to a model that has blocking.
1207 00:50:44,510 --> 00:50:47,750 So for this to be meaningful, block transfers
1208 00:50:47,750 --> 00:50:48,650 have to be automatic,
1209 00:50:48,650 --> 00:50:51,174 because you can't manually move between here and here.
1210 00:50:51,174 --> 00:50:53,090 In HMM, you could manually move things around,
1211 00:50:53,090 --> 00:50:55,040 because your memory is just a sequential thing.
1212 00:50:55,040 --> 00:50:56,570 But now, you don't know where the cutoff
1213 00:50:56,570 --> 00:50:57,710 is between cache and disk.
1214 00:50:57,710 --> 00:50:59,750 So you can't manually manage your memory.
1215 00:50:59,750 --> 00:51:02,360 So you have to assume automatic block replacement.
1216 00:51:02,360 --> 00:51:04,950 But we already know LRU or FIFO is only going
1217 00:51:04,950 --> 00:51:06,350 to lose a constant factor.
1218 00:51:06,350 --> 00:51:09,560 So that's cool.
1219 00:51:09,560 --> 00:51:12,096 I like this model, because it's clean.
1220 00:51:12,096 --> 00:51:14,720 Also-- in a certain sense; it's a little hard to formalize this--
1221 00:51:14,720 --> 00:51:17,840 it works for changing B, because it works for all B.
1222 00:51:17,840 --> 00:51:20,450 And so you can imagine even if B is not a uniform thing--
1223 00:51:20,450 --> 00:51:23,830 like the sizes of tracks on a disk are varying,
1224 00:51:23,830 --> 00:51:26,500 because circles have different sizes--
1225 00:51:26,500 --> 00:51:29,230 it probably works well in that setting.
1226 00:51:29,230 --> 00:51:32,110 It also works if your cache gets smaller, because you've
1227 00:51:32,110 --> 00:51:34,600 got a competing process.
1228 00:51:34,600 --> 00:51:38,470 It'll just adjust, because the analysis will work.
1229 00:51:38,470 --> 00:51:40,750 And the other fun thing is even though you're
1230 00:51:40,750 --> 00:51:43,240 analyzing on a two-level memory hierarchy,
1231 00:51:43,240 --> 00:51:46,850 it works on an arbitrary memory hierarchy, this MH thing.
1232 00:51:46,850 --> 00:51:48,850 This is a clean way to tackle MH.
1233 00:51:48,850 --> 00:51:54,040 You just need a cache-oblivious solution.
1234 00:51:54,040 --> 00:51:55,450 Cool.
1235 00:51:55,450 --> 00:51:59,070 Because you can imagine the levels
1236 00:51:59,070 --> 00:52:01,570 to the left of some point and the levels to the right of that
1237 00:52:01,570 --> 00:52:02,130 point.
1238 00:52:02,130 --> 00:52:03,880 And the cache-oblivious analysis tells you
1239 00:52:03,880 --> 00:52:06,338 that the number of transfers over this boundary is optimal.
1240 00:52:06,338 --> 00:52:08,540 And if that's true for every boundary,
1241 00:52:08,540 --> 00:52:10,990 then the overall thing will be optimal,
1242 00:52:10,990 --> 00:52:15,040 just like for HMM uniformity.
1243 00:52:15,040 --> 00:52:15,670 OK.
1244 00:52:15,670 --> 00:52:17,740 Quickly, some techniques from the cache-oblivious world.
1245 00:52:17,740 --> 00:52:20,050 I don't have much time, so I will just
1246 00:52:20,050 --> 00:52:21,220 give you a couple of sketches.
1247 00:52:21,220 --> 00:52:23,980 Scanning is one that generalizes great from external memory.
1248 00:52:23,980 --> 00:52:25,730 Of course, every cache-oblivious algorithm
1249 00:52:25,730 --> 00:52:27,070 is an external memory algorithm also.
1250 00:52:27,070 --> 00:52:29,980 So we should first try all the external memory techniques.
1251 00:52:29,980 --> 00:52:31,210 You can scan.
1252 00:52:31,210 --> 00:52:33,204 You can't really do M over B parallel scans,
1253 00:52:33,204 --> 00:52:34,870 because you don't know what M over B is.
1254 00:52:34,870 --> 00:52:36,995 But you can do a constant number of parallel scans.
1255 00:52:36,995 --> 00:52:40,770 So you could at least merge two lists.
1256 00:52:40,770 --> 00:52:41,740 OK.
1257 00:52:41,740 --> 00:52:44,980 Searching-- so this is the analog of binary search.
1258 00:52:44,980 --> 00:52:48,987 You'd like to achieve log base B of N query time.
1259 00:52:48,987 --> 00:52:49,820 And you can do that.
1260 00:52:49,820 --> 00:52:52,940 And this is in Harald Prokop's master's thesis.
1261 00:52:52,940 --> 00:52:55,900 So the idea is pretty cool.
1262 00:52:55,900 --> 00:53:00,910 You imagine a binary search tree built on the items.
1263 00:53:00,910 --> 00:53:03,430 We can't do a B-way tree, because we don't know what B is.
1264 00:53:03,430 --> 00:53:05,740 But then we cut it at the middle level,
1265 00:53:05,740 --> 00:53:08,046 recursively store the top part, and then
1266 00:53:08,046 --> 00:53:09,670 recursively store all the bottom parts,
1267 00:53:09,670 --> 00:53:11,950 and get root N chunks of size root
1268 00:53:11,950 --> 00:53:15,610 N. Do that recursively, and you get some kind of layout like this-- sketched in code below.
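Here's a minimal Python sketch of that recursive layout (the van Emde Boas layout) on an implicit perfect binary tree; the names are made up, and a real implementation would write the nodes into an array rather than return index lists.

    def veb_layout(height):
        """Heap indices (root = 1) of a perfect binary tree with
        `height` levels, listed in van Emde Boas order."""
        def layout(root, h):
            if h == 1:
                return [root]
            top_h = h // 2               # cut at (roughly) the middle level
            bot_h = h - top_h
            order = layout(root, top_h)  # store the top piece first...
            for j in range(2 ** top_h):  # ...then each bottom piece in turn
                order += layout(root * 2 ** top_h + j, bot_h)
            return order
        return layout(1, height)

    # For height 4 this gives [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]:
    # each recursive triangle of nodes is stored consecutively.

The analysis that follows counts how many of these consecutively stored triangles a root-to-leaf search path crosses.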
1269 00:53:15,610 --> 00:53:17,290 And it turns out this works very well.
1270 00:53:17,290 --> 00:53:18,940 Because at some level of the recursion,
1271 00:53:18,940 --> 00:53:20,705 whatever B is-- you don't know it when
1272 00:53:20,705 --> 00:53:21,830 you're doing the recursion,
1273 00:53:21,830 --> 00:53:23,412 but B is something.
1274 00:53:23,412 --> 00:53:25,120 And if you look at the level of recursion
1275 00:53:25,120 --> 00:53:27,536 where you straddle B here, these things are size, at most,
1276 00:53:27,536 --> 00:53:30,670 B. And the next level up is size bigger than B.
1277 00:53:30,670 --> 00:53:34,625 Then you look at a root-to-leaf path here.
1278 00:53:34,625 --> 00:53:36,250 It's a matter of how many of these blue
1279 00:53:36,250 --> 00:53:38,032 triangles you visit.
1280 00:53:38,032 --> 00:53:39,490 Well, the height of a blue triangle
1281 00:53:39,490 --> 00:53:41,590 is going to be around half log B,
1282 00:53:41,590 --> 00:53:44,500 because we're dividing in half until we hit log B.
1283 00:53:44,500 --> 00:53:47,800 So we might overshoot by a factor of 2, but that's all.
1284 00:53:47,800 --> 00:53:50,644 And we only have to pay 2 memory transfers to visit each of these,
1285 00:53:50,644 --> 00:53:52,810 because we don't know how it's aligned with a block,
1286 00:53:52,810 --> 00:53:55,520 but, at most, it fits in 2 blocks, certainly.
1287 00:53:55,520 --> 00:53:57,940 It's stored consecutively by the recursion.
1288 00:53:57,940 --> 00:53:59,230 And so you divide.
1289 00:53:59,230 --> 00:54:00,900 I mean, the number of triangles on the path,
1290 00:54:00,900 --> 00:54:03,460 it's going to be log base B of N times 2.
1291 00:54:03,460 --> 00:54:04,250 We pay 2 each.
1292 00:54:04,250 --> 00:54:06,702 So we get an upper bound of 4 times log base B of N.
1293 00:54:06,702 --> 00:54:07,660 Not as good as B-trees.
1294 00:54:07,660 --> 00:54:10,270 B-trees get 1 times log base B of N. Here,
1295 00:54:10,270 --> 00:54:12,484 we get 4 times log base B of N. This problem has
1296 00:54:12,484 --> 00:54:13,150 been considered.
1297 00:54:13,150 --> 00:54:17,900 The right answer is log base 2 of e plus little o-- about 1.44-- times log base B of N.
1298 00:54:17,900 --> 00:54:18,880 And that is tight.
1299 00:54:18,880 --> 00:54:22,370 You can't do better than that bound.
1300 00:54:22,370 --> 00:54:24,520 So cache-oblivious loses a constant factor
1301 00:54:24,520 --> 00:54:27,220 relative to external memory for that problem.
1302 00:54:27,220 --> 00:54:28,702 You can also make this dynamic.
1303 00:54:28,702 --> 00:54:30,160 This is where a bunch of us started
1304 00:54:30,160 --> 00:54:33,880 getting involved in this world, in the cache-oblivious world.
1305 00:54:33,880 --> 00:54:39,850 And this is a sketch of one of the methods, I think this one.
1306 00:54:39,850 --> 00:54:41,314 That's the one I usually teach.
1307 00:54:41,314 --> 00:54:43,480 You might have guessed these are from lecture notes,
1308 00:54:43,480 --> 00:54:45,700 these handwritten things.
1309 00:54:45,700 --> 00:54:47,300 I'll plug that in in a second.
1310 00:54:47,300 --> 00:54:50,260 So sorting is trickier.
1311 00:54:50,260 --> 00:54:51,760 There is an analog to mergesort.
1312 00:54:51,760 --> 00:54:54,290 There is an analog to distribution sort.
1313 00:54:54,290 --> 00:54:55,600 They achieve the sorting bound.
1314 00:54:55,600 --> 00:54:58,100 But they do need an assumption, this tall-cache assumption.
1315 00:54:58,100 --> 00:54:59,850 It's a little different from the last one.
1316 00:54:59,850 --> 00:55:01,720 This is a stronger assumption than before.
1317 00:55:01,720 --> 00:55:05,230 It says the cache is taller than it is wide, roughly,
1318 00:55:05,230 --> 00:55:07,600 up to some epsilon exponent.
1319 00:55:07,600 --> 00:55:11,290 So this is saying M over B is at least B to the epsilon.
1320 00:55:11,290 --> 00:55:14,000 Most caches have that property, so it's not that big a deal.
1321 00:55:14,000 --> 00:55:15,375 But you can prove it's necessary.
1322 00:55:15,375 --> 00:55:17,980 If you don't have it, you can't achieve the sorting bound.
1323 00:55:17,980 --> 00:55:20,320 You could also prove you cannot achieve the permutation bound,
1324 00:55:20,320 --> 00:55:21,569 because you can't do that min--
1325 00:55:21,569 --> 00:55:26,170 you don't know which is better. Same paper.
1326 00:55:26,170 --> 00:55:28,770 Finally, I wanted to plug this class.
1327 00:55:28,770 --> 00:55:31,420 It just got released, if you're interested.
1328 00:55:31,420 --> 00:55:32,810 It's advanced data structures.
1329 00:55:32,810 --> 00:55:35,602 There are video lectures for free streaming online.
1330 00:55:35,602 --> 00:55:37,810 There are three lectures about cache-oblivious stuff,
1331 00:55:37,810 --> 00:55:39,510 mostly on the data structure side, because it's
1332 00:55:39,510 --> 00:55:40,490 a data structures class.
1333 00:55:40,490 --> 00:55:42,323 But if you're interested in data structures,
1334 00:55:42,323 --> 00:55:43,900 you should check it out.
1335 00:55:43,900 --> 00:55:47,025 That is the end of my summary of a zillion models.
1336 00:55:47,025 --> 00:55:48,525 The ones to keep in mind, of course,
1337 00:55:48,525 --> 00:55:49,930 are external memory and cache-oblivious.
1338 00:55:49,930 --> 00:55:51,221 But the others are kind of fun.
1339 00:55:51,221 --> 00:55:54,550 And you really see the genesis of how this was
1340 00:55:54,550 --> 00:55:56,170 the union of these two models.
1341 00:55:56,170 --> 00:55:58,660 And this was sort of the culmination of this effort
1342 00:55:58,660 --> 00:56:02,170 to do multilevel in a clean way.
1343 00:56:02,170 --> 00:56:05,380 So I learned a lot looking at all these papers.
1344 00:56:05,380 --> 00:56:06,410 Hope you enjoyed it.
1345 00:56:06,410 --> 00:56:07,045 Thanks.
1346 00:56:07,045 --> 00:56:11,005 [APPLAUSE]
1347 00:56:11,005 --> 00:56:13,480 PROFESSOR: Are there any questions?
1348 00:56:13,480 --> 00:56:17,440 AUDIENCE: So all these are order-of-magnitude bounds.
1349 00:56:17,440 --> 00:56:20,424 I'm wondering about the constant factors.
1350 00:56:20,424 --> 00:56:22,090 ERIK DEMAINE: Are you guys going to talk
1351 00:56:22,090 --> 00:56:23,920 about that in your final talk?
1352 00:56:23,920 --> 00:56:28,370 Or who knows?
1353 00:56:28,370 --> 00:56:31,780 Or Lars maybe also?
1354 00:56:31,780 --> 00:56:33,490 Some of these papers even evaluated
1355 00:56:33,490 --> 00:56:36,790 that, especially these guys that had the messy models.
1356 00:56:36,790 --> 00:56:39,370 They were getting the parameters of, at that time,
1357 00:56:39,370 --> 00:56:41,890 a [INAUDIBLE] 6,000 processor, which is something I've
1358 00:56:41,890 --> 00:56:44,830 actually used, so not so old.
1359 00:56:44,830 --> 00:56:49,830 And they got very good matching even at that point.
1360 00:56:49,830 --> 00:56:54,100 I'd say external memory does very well for modeling disk.
1361 00:56:54,100 --> 00:56:56,830 I don't know if people use it a lot for cache.
1362 00:56:56,830 --> 00:56:58,420 No, I'm told.
1363 00:56:58,420 --> 00:57:04,010 Cache-oblivious, it's a little harder to measure.
1364 00:57:04,010 --> 00:57:06,370 Because you're not trying to tune to specific things. 1365 00:57:06,370 --> 00:57:10,190 But in practice, it seems to do very well for many problems. 1366 00:57:10,190 --> 00:57:11,507 That's the short answer. 1367 00:57:11,507 --> 00:57:13,006 AUDIENCE: [INAUDIBLE] it runs faster 1368 00:57:13,006 --> 00:57:15,940 than [INAUDIBLE] cache aware. 1369 00:57:15,940 --> 00:57:17,260 ERIK DEMAINE: Yeah. 1370 00:57:17,260 --> 00:57:19,060 It does better than our analysis said 1371 00:57:19,060 --> 00:57:21,880 it should do in some sense, because it's so flexible. 1372 00:57:21,880 --> 00:57:23,710 And the reality is very messy. 1373 00:57:23,710 --> 00:57:26,020 In reality, M is changing, because there's 1374 00:57:26,020 --> 00:57:29,710 all sorts of processes doing useless work. 1375 00:57:29,710 --> 00:57:31,780 And cache-oblivious will adjust to that. 1376 00:57:31,780 --> 00:57:35,069 And it's especially the case in internal memory, 1377 00:57:35,069 --> 00:57:35,860 in the cache world. 1378 00:57:35,860 --> 00:57:38,770 Things are very messy and fussy. 1379 00:57:38,770 --> 00:57:40,810 And the nice thing about cache-oblivious 1380 00:57:40,810 --> 00:57:42,560 is because you're not specifically tuning, 1381 00:57:42,560 --> 00:57:44,939 you have the potential to not die when you mess up. 1382 00:57:44,939 --> 00:57:46,397 AUDIENCE: I'd say that's especially 1383 00:57:46,397 --> 00:57:47,830 the case in the disk world. 1384 00:57:47,830 --> 00:57:49,080 ERIK DEMAINE: Oh, interesting. 1385 00:57:49,080 --> 00:57:50,320 AUDIENCE: [INAUDIBLE] But-- 1386 00:57:50,320 --> 00:57:51,880 ERIK DEMAINE: These are the guys who know. 1387 00:57:51,880 --> 00:57:54,171 AUDIENCE: [INAUDIBLE] people have different [INAUDIBLE] 1388 00:57:54,171 --> 00:57:55,150 ERIK DEMAINE: Yeah. 1389 00:57:55,150 --> 00:57:56,184 They're both relevant. 1390 00:57:56,184 --> 00:57:58,040 AUDIENCE: What's the future? 1391 00:57:58,040 --> 00:57:59,462 This is history. 1392 00:57:59,462 --> 00:58:00,170 ERIK DEMAINE: OK. 1393 00:58:00,170 --> 00:58:03,020 Well, for the future, you should go to the other talks, I guess. 1394 00:58:03,020 --> 00:58:05,600 There's still lots of open problems in both models. 1395 00:58:05,600 --> 00:58:08,330 External memory, I guess, graph algorithms and geometry 1396 00:58:08,330 --> 00:58:11,020 are still the main topics of ongoing research. 1397 00:58:11,020 --> 00:58:13,130 Cache-oblivious is similar. 1398 00:58:13,130 --> 00:58:14,930 At this point, I think-- 1399 00:58:14,930 --> 00:58:17,350 well, also geometry is a big one. 1400 00:58:17,350 --> 00:58:18,950 There's some external memory results 1401 00:58:18,950 --> 00:58:24,880 that have not yet been cache-oblivified in geometry. 1402 00:58:24,880 --> 00:58:25,931 AUDIENCE: Multicore. 1403 00:58:25,931 --> 00:58:26,930 ERIK DEMAINE: Multicore. 1404 00:58:26,930 --> 00:58:28,304 Oh, yeah, I forgot to say I'm not 1405 00:58:28,304 --> 00:58:31,327 going to talk about parallel models here. 1406 00:58:31,327 --> 00:58:32,660 Partly, because of lack of time. 1407 00:58:32,660 --> 00:58:35,930 Also, that's probably the most active-- 1408 00:58:35,930 --> 00:58:39,260 it's an interesting active area of research, something 1409 00:58:39,260 --> 00:58:42,830 I'm interested in particular. 1410 00:58:42,830 --> 00:58:45,350 There are some results about parallel cache-oblivious. 1411 00:58:45,350 --> 00:58:49,460 And all of these papers actually had parallelism. 
1412 00:58:49,460 --> 00:58:53,084 These had parallelism in a single disk.
1413 00:58:53,084 --> 00:58:55,000 There's another model that has multiple disks.
1414 00:58:55,000 --> 00:58:56,458 Those behave more or less the same.
1415 00:58:56,458 --> 00:58:58,759 You basically divide everything by p.
1416 00:58:58,759 --> 00:59:00,800 These models also tried to introduce parallelism.
1417 00:59:00,800 --> 00:59:04,130 Or there's a follow-up to UMH by these guys.
1418 00:59:04,130 --> 00:59:06,680 So there is work on parallel, but I
1419 00:59:06,680 --> 00:59:08,750 think multicore cache-oblivious is probably
1420 00:59:08,750 --> 00:59:11,044 the most exciting unknown or still-in-progress stuff.
1421 00:59:11,044 --> 00:59:12,460 AUDIENCE: Thank the speaker again.
1422 00:59:12,460 --> 00:59:13,360 ERIK DEMAINE: Thanks.
1423 00:59:13,360 --> 00:59:17,910 [APPLAUSE]