The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIK DEMAINE: Welcome to the final week of 6.046. Are you excited?

[CHEERING]

Yeah, today--

AUDIENCE: Oh.

ERIK DEMAINE: Well, and sad, I know. It's tough. But we've got two more lectures, and they're on one topic, which is cache-oblivious algorithms. This is a really cool concept. It was actually originally developed in the context of 6.046, as sort of an interesting way to teach cache-efficient algorithms. But it turned into a whole research program in the late '90s, and now it's its own thing. It's kind of funny to bring it back to 6.046.

The whole idea is that in all of the algorithms we have seen, except maybe distributed algorithms, we've had this view that all of the data we can access costs the same. If we have an array, like a hash table, accessing anything in the hash table is equally costly. If we have a binary search tree, every node costs the same to access. But this is not real. Let me give you some idea of what a real computer looks like. You probably know this, but we've not yet thought about it in an algorithmic context.

These are caches, what are typically called caches, in your computer. Then you have what we've mostly been thinking about, which is main memory, your RAM. And then there's probably more stuff. These days, if you have a fancier computer, you have some big flash, which is maybe caching your disk, which is huge. And then maybe there's the internet at the end, if you like. So the point is, all the data in the world is not on your CPU.
And there's this big thing called the memory hierarchy, which dictates which things are fast and which things are slow-- not exactly which data items; that's up to you. The idea is that on board your CPU you have, these days, probably up to four levels of cache. As I've tried to draw them, they get increasingly big. Typical values: a level-one cache is something on the order of tens of kilobytes-- 32 K, whatever. A level-four cache these days, as introduced by the Haswell architecture, is about 100 megabytes. Main memory you know; this is the thing you usually think about. It's in the gigabytes-- these days you can buy computers with a terabyte of RAM; it's not crazy. Flash gets bigger. Disk-- these days you can buy a 4-terabyte single disk, but if you have a whole RAID of disks, you can have petabytes of data on one computer.

So things are getting bigger as we go farther to the right, but they're also getting slower. And the point of cache-efficient algorithms is to deal with the fact that things get slow when they get far away. This makes sense from a physics standpoint. If you think about how much data you can store in a cubic inch or something, and how much could possibly be near your CPU-- at some point you're just going to run out of space, and you've got to go farther away, and going farther away takes more time. So you can think of it-- I mean, there's the speed-of-light argument, that things that are farther away in your computer are going to take longer. Typical computers are not anywhere near the speed of light, so there's a more real issue, which is how long your traces are. And then, when you have physical moving parts, like a disk-- I don't know if you know, but disks actually spin, and there's a head, and it has to move around. That's called seek time. Moving a head around on the disk is really slow, on the order of milliseconds.
Whereas reading from on-chip cache is on the order of nanoseconds-- whatever your clock rate is, so a few billion times a second. So there's a big spread, a factor of a million or ten million, from level-one cache speed to disk speed. That sucks. And so you might think, well, if your data's big, you're just screwed-- you've got to deal with disk, and disk is slow. But that's not true. Life is not so bad.

In general, there are two notions of speed, and I've been kind of vague about them. One notion is latency: if right now I have the idea that I really need to fetch memory location 2 billion and 73, how long does it take for that data-- say, one word of data-- to come back? That's latency. But there's another issue, which is bandwidth: how fat are these pipes? What's the rate of information that I can pump? If I said, please give me all of main memory, in order, how fast could it pump it back? And that's actually really good.

So latency is like your startup cost: when I ask for something, how long does it take for that one thing to come? But then there's a data rate. And bandwidth you can generally make really large. For example, the bandwidth of a disk is pretty big. But even if it weren't big, you could just add 100 more disks, and then when you ask for some data, all 100 disks could give you data at the same speed-- provided you don't overload your bus; you've got to also make more buses and so on. So you can actually move a really huge amount of data per second. But still, the time to get there, the time for all the disks to seek their heads-- that's slow. It doesn't add up, though, because they're all doing it in parallel. So you can't reduce latency, but you can increase bandwidth. It doesn't quite match physics, but we can get pretty close to arbitrarily high bandwidth. And so, in a well-designed computer, the fatnesses of these pipes are going to increase-- or could increase, if you want.
So you can move lots of data around. But latency we can't get rid of, and this is annoying, because from an algorithmic standpoint, when we ask for something, we'd like it immediately. In a sequential algorithm, we can't do anything until that data arrives.

So cache efficiency is going to fix this by blocking. This is an old idea; it's been around since caches were introduced. When you ask for a single word in main memory, you don't get one word-- you get maybe 32 kilobytes of information, not just 4 bytes or 8 bytes. And we're kind of free to choose these block sizes however we want when we design the system. So we can set them, in a certain sense, to hide latency.

So if you think of amortizing the cost over the block, then you get something like: amortized cost per word equals latency divided by block size, plus one over bandwidth. Essentially, we divide the latency by the block size, and we have to pay one over bandwidth. Bandwidth is how many words a second you can read, say, from your memory, so one over bandwidth is going to be your cost. That term we can't change, but by adding enough disks, or making these pipes fat enough, you can basically make bandwidth big. Latency is the thing we can't control. But if the whole block is sort of useful, then we pay the initial startup time-- hey, give me this block-- and wait for the response only once for the entire block. So if there are block-size words in that block, then per item we're effectively dividing latency by block size. This is kind of rough, but this is the idea of how to reduce latency.

Now, for this to actually work, we need better algorithms. Pretty much every algorithm you've seen in this class so far works horribly in this model. So that's the point of today and next class: to fix that. For this kind of amortization to work-- I'm using "useful" in a vague sense so far; we'll make it formal in a moment.
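(In symbols, the rough accounting on the board-- my rendering of it:)

$$ \text{amortized cost per word} \;\approx\; \frac{\text{latency}}{\text{block size } B} \;+\; \frac{1}{\text{bandwidth}} $$

(With made-up but plausible disk numbers-- say, 10 milliseconds of latency amortized over a block of a million words-- the latency term comes to about 10 nanoseconds per word, the same order as the bandwidth term.)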
When I fetch an entire block, all of the elements in that block should be useful-- we should be able to compute something on them that we needed to compute. Otherwise, if I only needed the one item that I read out of the block, that's not going to help me so much. So I really want to structure my data in such a way that when I access one element, I'm also going to access the elements nearby it. Then this blocking will actually be useful. This is a property normally called spatial locality.

And the other thing we'd like: these caches have some size, so I can store more than just one block. It's not like I read one block, finish processing it, and then read the next block and go on. Some of these caches are actually pretty big-- if you think of main memory as a cache to your disk, that can be really big. So, ideally, the blocks that I'm using relate to each other in some way, or when I access a block, I'm going to access it for a while, along with other blocks. The way this is usually said is that we'd like to reuse the existing blocks in the cache as much as possible. And this you can think of as temporal locality: when I access a particular block, I'm going to access it again fairly soon. That way, it's actually useful to bring it into my cache, and then I use it many times. That would be even better.

I don't have to have both of these, and exactly to what extent I have them is going to dictate the overall time it takes to run my algorithm. But these are the ideal properties you want, in a very informal sense. In the rest of today, we're going to make this formal, and then we're going to develop some algorithms for this model. But this is the motivation. In reality, we're free to choose the block size in the system-- though, in a moment, I'm going to assume it's given to us. You'd normally set the block size so that those two terms come out roughly equal.
Because if you're spending the latency time to go and get something, you might as well get a whole chunk of something, according to whatever your bandwidth is. If it only costs you, say, twice as much to fetch an entire block as to fetch one word, that seems like a pretty good block size. So for something like disk, that block size is on the order of megabytes, maybe even bigger-- hundreds of megabytes. So think of the block sizes as really big. We really want all that data to be useful in some way.

Now, it's really hard to think about a memory hierarchy with so many levels. So we're going to focus on two levels at a time: the sort of cheap and small cache versus the huge thing, which I'll call disk, just for emphasis. I'm going to call this two-level model the external-memory model. It was originally introduced as a model for main memory versus disk, but you could apply it to any pair of levels. In general, you have your problem size N; choose the smallest level that fits N-- typically that's main memory, maybe it's disk-- and just think of the level between that and the previous one, so the last level and the next-to-last level. Often that's what matters. Like, if you run a program and you run out of RAM and start swapping to disk, that's when everything just slows to a crawl. You can see that difference at each of these levels, but it's probably most dramatic at disk, just because it's so slow-- a million times slower than RAM; or at least 1,000 times slower than RAM, I should say.

Anyway, so we have just two levels. Let me draw a more precise picture. We have the CPU-- this is where all of our operations happen, where we add numbers and so on. We'll think of it as having a constant number of registers; each register is one word. And then we have a really fat, low-latency pipe to the cache. The cache is going to be divided into blocks; let's say there are B words per block.
Instead of writing block size, I'll just write capital B. And the number of blocks I'm going to call M over B, so the total size of your cache is capital M. And then there is a relatively thin and slow connection-- this one's fast, this one's slow-- to your disk. Disk we'll think of as huge, essentially infinite in size. It's also divided into blocks of size B-- the same block size.

So this is the picture. Initially, all of the input is over here on the disk-- all of your N data items, whatever. Say you want to sort those items. In order to access those items, you first have to bring them into cache. That's going to be slow, but it's done in a blocked manner. When I want to access an individual item here, I have to request the entire block. When I request that block, it gets sent over here-- it takes a while-- and then I get to choose where to store it. Maybe I'll put it here. And then maybe I'll grab this block and store it here, and so on. Each of those is a block read; these are new instructions the CPU can do.

Eventually, this cache will get full, and then, before I bring in a new block, I have to kick out an old block-- meaning I need to take one of these blocks and write it to some position, maybe to the same place. In fact, we will always assume that you write it back to the same place, overwriting what was on the disk: you made some changes here, send it back. And, in general, what we're going to do is count how many times we read and write blocks. Question?

AUDIENCE: When you talked about how fast the connection is, you're just talking about latency, right?

ERIK DEMAINE: Yes, sorry, this is latency.

AUDIENCE: Yeah, so like the [INAUDIBLE] connections [? just don't have ?] [INAUDIBLE]?

ERIK DEMAINE: Right, this could have huge bandwidth. In this model, we're assuming the block size is fixed, and then latency versus bandwidth-- we're not going to think about bandwidth.
We'll assume the block size has been chosen in some reasonable way, and then all we need to do is count the number of blocks. But underneath, yeah, you have some kind of bandwidth. Presumably you set the block size to make these two terms roughly equal, and then latency and bandwidth are kind of the same thing. That's the idea. But really, we're just going to count latency: how many times do I have to request a block and wait for it to come over, and how many times do I write a block? I'm not going to worry about how much physical time it takes to do either of those things; I'm just going to count them and assume that that is what I need to minimize.

So I'm going to count what we call memory transfers: transfers of blocks between levels-- between these two levels. This is the number of blocks read from, or written to, disk. We're going to view accesses to the cache as free; I'm not going to count those. You don't need to worry about that so much, because we can still count the number of operations we do on the CPU. We can still think about how much regular time it takes to do the computation-- how many comparisons, how many additions, things like that. And that would include things like reading and writing individual elements from cache.

But we're going to view this connection-- let's say these are on the same chip, so reading cache is just as fast as reading from registers, and we're not going to worry about that time. What we're focusing on, for the purposes of this model, is between these two levels; the CPU and the cache are essentially one level, combined. I'll change that in a little bit, but for now, just think about the two levels. We're counting how many memory transfers we have between these two levels, cache and disk, and we want to minimize that. Now, just like before, we also want to minimize the running time in the usual, traditional measure.
And we want to minimize space and all the usual things we minimize. But now we have a new measure, the number of memory transfers, and we want our algorithm to minimize that too, for a given block size and a given cache size.

And at this point-- I'm going to change this in a moment-- the algorithm that we would write in this external-memory model explicitly manages the blocks. It has to explicitly read and write blocks. There's a software system that implements this model, particularly for disk, and lets you do this in a nice, controlled way-- maintain your memory, maintain reading and writing to disk. The operating system tries to do this, but it usually does a really bad job with swapping. There are software systems that let you take control and do much better.

So that's a good model. The external-memory model is especially good for disk. But it's not going to capture the finesse of all these other levels, and it's a little bit annoying to write algorithms in this way, explicitly reading and writing blocks. Today I will not write any such algorithms, although you could think about them. I personally love this other model, which is cache-obliviousness. It's going to lead to, in some sense, cleaner algorithms-- although it's more of a magic trick to get them to work. Writing the algorithms is very simple; analyzing them is more work. And it will capture, in some sense, all of these levels.

In fact, it is basically exactly this model-- almost the same. We're going to change one thing, which is where the "oblivious" comes from: we're going to say that the algorithm doesn't know the cache parameters. It doesn't know B or M. This is a little weird, and we're going to have to make some other changes to make it work. From an analysis perspective, I want to count memory transfers and analyze my algorithm with respect to this memory hierarchy, but the algorithm itself isn't allowed to know what that memory hierarchy looks like.
Another way to say this is that the algorithm has to work simultaneously for all values of B and all values of M. As you might imagine, this is not so easy. But there are some simple things where it's easy, and more complicated things where it's possible. And it gives you all sorts of cool things.

Let me first formalize the model a little bit. The other nice thing about cache-oblivious algorithms is that they correspond much more closely to how these caches actually work. When you write code on your CPU, you may have noticed you don't usually do block reads and block writes, unless you're dealing with flash or disk. All of this is taken care of for you; it's all done internal to the processor. When you access a word, behind the scenes, magically, the system-- the computer-- figures out which block to read, moves the entire block into a higher-level cache, and then just serves you words out of that block. You don't have explicit control over that.

So here's the way that works. When you access a word in memory-- and I'm going to think of memory as everything; this is what's stored on the disk, say, the entire memory system, the entire memory hierarchy. As usual in this class, we're going to think of the entire memory as a giant array of words; each of these squares is one word. But now the memory is also divided into blocks. Let's say every four words is a block boundary-- B equals 4, just for the sake of drawing a figure. When you access a single word, like this one, you get the entire block containing that word. And to emphasize: it's not you personally; the system somehow fetches the block containing that word. It has to do this automatically-- we can't explicitly read and write blocks in this model, because we don't know how big the blocks are, so we couldn't even name them.
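(Concretely-- a toy illustration of the alignment in the figure, with B = 4 as drawn; remember, the algorithm itself never gets to see this number:)

```python
B = 4  # block size in words: known to the system, hidden from the algorithm

def block_containing(i):
    # Accessing word i fetches the whole aligned block around it:
    # words [start, start + B), where start is i rounded down to a
    # multiple of B.
    start = (i // B) * B
    return list(range(start, start + B))

print(block_containing(6))  # [4, 5, 6, 7] -- touching word 6 drags in words 4..7
```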
But internally, on the real system and in your analysis, you think of it this way: whenever you touch something, you actually get all of this into the cache. So you hope that you will use things nearby, because you've already read them in; ideally, they're useful. But you don't know how many you've read in-- you've read in B of them, and you don't know what B is. The algorithm doesn't know.

One more detail: the cache is going to get full pretty quickly, and then, whenever you read something, you have to kick something out. In steady state, the cache might as well always stay full-- no reason to leave anything empty. So which block do you kick out? Any suggestions? Which block should I kick out, if I've been reading and writing to words within these blocks? Yeah?

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: The block that was fetched farthest in the past? Yeah, that is usually called First In, First Out-- FIFO. And that is a good strategy. Any other suggestions? Yeah.

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: The block that has been least recently used. So maybe you fetched it a long time ago, but you use it every clock cycle-- that one you should probably not throw away, because you use it a lot. That's called LRU, and that is also a good strategy. Other suggestions? Those are two good ones. If you go beyond that, I'm worried I won't know. But there are some bad strategies. Yeah?

AUDIENCE: Just random.

ERIK DEMAINE: Random-- yeah, random is probably pretty good. I don't know offhand. There are some randomized strategies that beat both of those. But from this perspective, both of these are good. We've got lots of Frisbees to go through, so-- that's a good answer. Random is definitely a good idea. I know there's a randomized strategy called [? BIT ?] that in certain senses is a little bit better.
But from my perspective, I think all of those are good. Random, I'd have to double-check whether you lose a log factor; in expectation it should be fine. So all of those strategies will work; you could define this model with any of them, and I think it would work fine-- except with random you'd get a bound in expectation.

So the system evicts, let's say, the least recently used page. The least recently loaded page would also work fine; that's FIFO. Sorry, I'm switching to "page," but I've been calling them blocks-- blocks and pages are the same thing for this lecture. Either at the end of this lecture or at the beginning of the next, I'll tell you why that's an OK assumption; let's not worry about it at this point.

So now we have a model: cache-oblivious. We have two models, actually. But now that the cache-oblivious model is complete, we're going to analyze things in it. Again, we're still counting the number of memory transfers; the algorithm is just not allowed to know B and M, and so we had to change the model to make the reading and writing of blocks automatic-- because the algorithm's not allowed to do it, so someone's got to.

The cool thing about the cache-oblivious model is that every algorithm you've seen in this class-- or most of the algorithms you've seen-- are, in a certain sense, cache-oblivious algorithms. They weren't aware of B and M before; they're still not. What changes is that now you can analyze them in this new way, in this new model. Now, as I said, almost all the algorithms we've seen are not going to perform well in this model. But that makes things interesting, and that's why we have some work to do.

I have some reasons for cache obliviousness-- why would you tie your hands behind your back and not know B or M? Reason one: it's cool. I think it's pretty amazing that you can actually do this. I guess reason two is that you can actually do it, for a lot of problems we care about: cache-oblivious algorithms exist that are just as good.
So, I mean, of course cache-oblivious algorithms exist; the point is that there are ones that are optimal-- within a constant factor of the best algorithm that knows B and M. That's surprising; that's the cool part.

In general, the algorithms are easier to write down, because we can use pseudocode just like before; we don't need to worry about blocking in the algorithm. The analysis is going to be harder, but that's unavoidable. In some sense, this makes it easier to write code. It's also a little easier to distribute your code, because every computer has different block sizes that matter. And as you change your value of N, a different level in the memory hierarchy is going to matter. Each of these levels-- I didn't mention-- has a different block size and, of course, a different cache size, so tuning your code every time to a different B or M is annoying.

The big gain here, though, I think, is that you capture the entire hierarchy, in a sense. In the real world, each of these pipes has its own latency-- let's just think about latency. And you'd like to minimize the number of block transfers between here and here, and the number of block transfers between here and here. Well, OK, I can't minimize all of them; that's a multidimensional problem. What I'd like to minimize is some weighted average of those things: latency times number of blocks here, plus latency times number of blocks here, plus latency times number of blocks here, and so on.

If you can find an optimal cache-oblivious algorithm, analyzed just with respect to two levels, then, because the algorithm's not allowed to know B and M, it has to work for all levels. It has to minimize the number of block transfers between all of these levels, and so, in particular, it will minimize the weighted sum of them. It's a bit hand-wavy-- you have to prove something there-- but you can prove it.
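(To make the two-level model concrete, here's a toy simulator of it-- my own sketch, not anything from the lecture: a fully associative cache of M/B blocks with LRU eviction, counting memory transfers. Writebacks are ignored; counting them too would change things by at most a constant factor.)

```python
from collections import OrderedDict

class IdealCache:
    """Toy (B, M) model: the cache holds M/B blocks, evicts the least
    recently used one, and we count block transfers from "disk"."""

    def __init__(self, B, M):
        self.B = B
        self.num_blocks = M // B
        self.resident = OrderedDict()  # block id -> None, LRU first
        self.transfers = 0

    def access(self, addr):
        block = addr // self.B                # block containing this word
        if block in self.resident:
            self.resident.move_to_end(block)  # hit: free, just refresh recency
        else:
            self.transfers += 1               # miss: one memory transfer
            if len(self.resident) >= self.num_blocks:
                self.resident.popitem(last=False)  # evict least recently used
            self.resident[block] = None

# Scanning 10**6 consecutive words with B = 1024 words per block:
c = IdealCache(B=1024, M=2**20)
for i in range(10**6):
    c.access(i)
print(c.transfers)  # 977, i.e. about N/B
```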
So there's a paper about this from 1999, by Frigo, Leiserson, Prokop, and Ramachandran. It's old enough that I remember all the names; after about 2001, when I became a professor, I can't remember anything, but before that, I can remember everything. Frigo we've talked about in the context of FFTW, the Fastest Fourier Transform in the West-- he was a student here, and FFTW uses a cache-oblivious Fast Fourier Transform algorithm. Leiserson you've probably seen on the cover of your textbook, or walking around Stata-- Professor Leiserson, here at MIT. And Prokop-- this is actually his MEng thesis. Pretty awesome MEng thesis.

All right, so, cool-- I think I've said all the things I wanted to say. If you want to see the proof that you can handle the entire memory hierarchy, you can read their paper. You have to make a couple of assumptions, but it's intuitive: a cache-oblivious algorithm has to work for all B and M, so it's going to optimize all the levels simultaneously. Doing that explicitly, with all the different B's and M's, would be really messy code-- probably also slower. Cache-oblivious is just going to do it for free, with the same code.

All right, let's do some algorithms. There's one easy algorithm which works great from a cache-oblivious perspective, which is scanning. Let me give you some Python code. For historical reasons, in this field, N is written with a capital letter-- don't ask, or don't worry about it. So here's some very simple code. Suppose you want to accumulate an array: you want to add up all of the items in the array, or multiply them, or take the min, or whatever. This is a typical kind of thing. And the array, we're going to think of-- so here was my memory. We're going to think of the array as being stored as some contiguous segment of that memory, let's say this segment. So this is important.
Assume the array is stored contiguously-- no holes-- relative to how it's mapped onto memory. And this is a realistic assumption: when you allocate a block of memory, the promise by the system is that it's essentially a contiguous chunk of memory, or disk, or whatever. And when Python makes an array, it does this; it guarantees that these things will be stored contiguously. If you used a dictionary, this would not be true; but for a regular array or list, this is true.

So I'm accessing the items in the array in order: I start at item zero and end at item N minus 1. That seems good, because I read this one, and I get the whole block. Then I read this one-- I already had that block, so it's free. This one's free, this one's free. Here I have to read a new block, but then this one's free. So the first item I access in each block costs one, but as long as my cache stores at least one block, that's enough-- and let's say the sum is in a register. So the cost is going to be-- actually, let's be a little more precise-- ceiling of N over B, almost. Without a big O here, this is right in the external-memory model, but not quite right in the cache-oblivious model. Can someone tell me why? Yeah?

AUDIENCE: If N is two, you could have it span a block boundary [INAUDIBLE].

ERIK DEMAINE: Good: N could be two, but the array could span a block boundary. Remember, the algorithm has no idea where the block boundaries are. And again, in reality, there are block boundaries all over the place, and there's no way to know; you can't request that, when you allocate an array, it always begins at a block boundary. So, great, you can span block boundaries in-- oh, way off. I just spanned a block boundary, sorry. So it's going to be, at most, ceiling of N over B, plus 1, cache-obliviously.
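(The Python on the board isn't captured in the transcript; a minimal version of the accumulation loop presumably looks like this:)

```python
def summation(A, N):
    # One left-to-right scan: each block of A is read exactly once, so
    # this costs at most ceil(N/B) + 1 memory transfers -- for every B
    # and M at once, since the code never mentions either.
    s = 0
    for i in range(N):
        s += A[i]
    return s
```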
So it's just going to hurt you by one. But I want to point out that there's a slight difference between the two models, even with this very simple algorithm. In general, I'm just going to think of this as big O of N over B, plus 1. There's some additive constant. I guess you could even say it's N over B plus big O of 1, but we won't worry about constant factors today. So that's scanning-- cache-oblivious and external-memory, both great. Slightly more interesting--

AUDIENCE: [INAUDIBLE]?

ERIK DEMAINE: Yeah. In the external-memory algorithm, because you're explicitly controlling the blocks-- you're explicitly reading and writing them, and you know where the block boundaries are-- you could, if you wanted to (you don't have to), choose the array to be aligned, to start at a block boundary. So that's the distinction. In the cache-oblivious model, you can't control that, so you have to worry about the worst case. In external memory, you could control it, and you could do better, and maybe you'd want to; otherwise it hurts you by a constant factor. And with disks, for example, you want things to be track-aligned, because if you have to go to an adjacent track, it's a lot more expensive-- you've got to move the head. A track is a circle: what you can read without moving the head. So, great.

So, slightly more interesting: you can do a constant number of parallel scans. That was one scan; here's an example of two scans. Again, we have one array of size N-- in Python notation, A[0:N] would be the whole thing. And what I want to do is swap A[i] with A[N - 1 - i]-- this is not Python, but it's, I think, textbook notation. You know what swap means. What does this do, assuming I got my minus ones right? Yeah?

AUDIENCE: It reverses the array.

ERIK DEMAINE: It reverses the array, good. We'll just run through these Frisbees. So this is a very simple algorithm for reversing the array.
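(The swap loop, written out-- a minimal sketch of what's on the board:)

```python
def reverse(A, N):
    # Two scans moving inward from both ends of A.  As long as the
    # cache holds at least two blocks, this costs O(N/B + 1) memory
    # transfers: each end walks through its blocks once.
    for i in range(N // 2):
        A[i], A[N - 1 - i] = A[N - 1 - i], A[i]
```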
It was originally by John Bentley, who was Charles Leiserson's adviser-- PhD adviser-- back in the day. So, very simple. But what's cool about it: if you think about the array and the order in which you're accessing things, it's like I have two fingers-- and I should have made this smaller, so here, we'll go down here. I start at the very beginning of the array and the very end of the array; then I go to the second element and the next-to-last element; and I advance like this.

So as long as your cache-- the number of blocks in the cache-- is at least two, which is totally reasonable (you can assume it's at least 100, typically; you've got at least 100 blocks, say)... In general, for any fixed constant, we're going to assume M over B is bigger than a constant; we'll only need two or three or four for the algorithms we cover. Then, great: when I access this item, I load in the block that contains it. I don't know how it's aligned, but I don't care so much. Then I load in the block that contains this item. And then the next accesses are free, until I advance to the next block. But once I advance to the next block, on the left or the right, I'll never have to access the old ones. And so, again, the cost here is just going to be the number of blocks, which is big O of N over B, plus 1.

So a constant number of parallel scans is going to cost basically the number of blocks in the array. If N is smaller than B, this is not so hot; but when N is bigger than B, this is just N over B, which is how much it takes just to read the data-- big deal.

So these are boring cache-oblivious algorithms; let's do interesting ones. I would say the central idea in cache-oblivious algorithms is to use divide and conquer. This goes back to the first few lectures of this class, and so we will go back to examples from there. Today we're going to do median finding, in particular, which we did in lecture two-- so really a blast from the past.
788 00:40:54,620 --> 00:40:57,040 But it's good review because the final covers everything, 789 00:40:57,040 --> 00:40:59,570 so you've got to remember that. 790 00:40:59,570 --> 00:41:02,430 Matrix multiplication, we've talked about, but not 791 00:41:02,430 --> 00:41:06,920 usually-- well, I guess we did actually use divide and conquer 792 00:41:06,920 --> 00:41:08,440 for Strassen's algorithm. 793 00:41:08,440 --> 00:41:11,310 We're going to use divide and conquer even for the boring algorithm 794 00:41:11,310 --> 00:41:12,139 today. 795 00:41:12,139 --> 00:41:14,680 And then next class, we're going to go back to van Emde Boas, 796 00:41:14,680 --> 00:41:16,150 but in a completely different way. 797 00:41:16,150 --> 00:41:18,280 So if you don't like van Emde Boas, 798 00:41:18,280 --> 00:41:21,800 don't worry; it's much simpler. 799 00:41:21,800 --> 00:41:24,930 So let's do median finding. 800 00:41:24,930 --> 00:41:29,808 Or actually, sorry, let me first talk about divide 801 00:41:29,808 --> 00:41:33,860 and conquer in general. 802 00:41:33,860 --> 00:41:35,360 You know what divide and conquer is. 803 00:41:35,360 --> 00:41:36,600 You take your problem. 804 00:41:36,600 --> 00:41:39,390 You split it into non-overlapping subproblems, 805 00:41:39,390 --> 00:41:42,665 recursively solve them, combine them. 806 00:41:42,665 --> 00:41:44,040 But what I want to stress here is 807 00:41:44,040 --> 00:41:47,110 what it's going to look like in a cache oblivious context. 808 00:41:47,110 --> 00:41:51,598 So the algorithm is going to look like a regular divide 809 00:41:51,598 --> 00:41:53,380 and conquer algorithm. 810 00:41:53,380 --> 00:42:01,800 So, in particular, the algorithm will recurse all the way to, 811 00:42:01,800 --> 00:42:05,090 let's say, constant size problems, 812 00:42:05,090 --> 00:42:11,900 whatever the base case is. 813 00:42:11,900 --> 00:42:18,410 So same as usual, but what's different is the analysis. 814 00:42:18,410 --> 00:42:23,610 When we analyze a cache oblivious algorithm, 815 00:42:23,610 --> 00:42:25,274 then we get to know what B and M are. 816 00:42:25,274 --> 00:42:27,190 In some sense, we're analyzing for all B and M. 817 00:42:27,190 --> 00:42:29,669 But let's suppose B and M are given to us; then 818 00:42:29,669 --> 00:42:32,451 we'll tell you how many memory transfers you need. 819 00:42:32,451 --> 00:42:33,950 This kind of bound, you need to know 820 00:42:33,950 --> 00:42:37,110 what B is to know what the value of this bound is. 821 00:42:37,110 --> 00:42:39,740 But you learn it as a function of B and, in general, 822 00:42:39,740 --> 00:42:41,500 a function of B and M, and that's 823 00:42:41,500 --> 00:42:46,390 the best you could hope for as a complete characterization. 824 00:42:46,390 --> 00:42:49,470 So in the analysis, let's just look at one value of B 825 00:42:49,470 --> 00:42:57,540 and one value of M. So the analysis knows B and M, 826 00:42:57,540 --> 00:43:11,620 and it's going to look at, let's say, the recursive level, 827 00:43:11,620 --> 00:43:15,230 where one of two things happens. 828 00:43:15,230 --> 00:43:28,660 Either the problem size fits in order one blocks, 829 00:43:28,660 --> 00:43:32,790 meaning it's order B size. 830 00:43:32,790 --> 00:43:34,690 That's an interesting level. 831 00:43:34,690 --> 00:43:39,650 Another interesting level, the more obvious one probably, 832 00:43:39,650 --> 00:43:44,045 is that it fits in cache.
833 00:43:44,045 --> 00:43:46,545 So that means that the size is less than or equal to capital 834 00:43:46,545 --> 00:43:52,870 M. Everything here is counted in terms of words. 835 00:43:52,870 --> 00:43:54,200 This is the more obvious one. 836 00:43:54,200 --> 00:43:57,247 For a lot of problems, the cache size isn't so relevant. 837 00:43:57,247 --> 00:43:58,830 What really matters is the block size. 838 00:43:58,830 --> 00:44:01,430 For example, scanning, you're only looking through the data 839 00:44:01,430 --> 00:44:01,930 once. 840 00:44:01,930 --> 00:44:03,721 So it doesn't matter how big your cache is, 841 00:44:03,721 --> 00:44:05,970 as long as it's not super tiny. 842 00:44:05,970 --> 00:44:08,420 As long as it has a few blocks, then 843 00:44:08,420 --> 00:44:13,360 it's just a function of B and N, no M involved. 844 00:44:13,360 --> 00:44:15,500 So for that kind of problem this would 845 00:44:15,500 --> 00:44:19,800 be more useful-- constant number of blocks. 846 00:44:19,800 --> 00:44:22,740 Because I think of the cache size M as being larger 847 00:44:22,740 --> 00:44:27,140 than any constant times B, this case is strictly smaller-- 848 00:44:27,140 --> 00:44:30,870 it's smaller than or equal to the problem fitting in cache. 849 00:44:30,870 --> 00:44:32,780 So when M is relevant, we'll look 850 00:44:32,780 --> 00:44:35,600 at this level and maybe the adjacent levels 851 00:44:35,600 --> 00:44:37,580 in the recursion. 852 00:44:37,580 --> 00:44:40,150 So the algorithm doesn't know what B and M are, so it's 853 00:44:40,150 --> 00:44:42,610 got to recurse all the way down-- turtles 854 00:44:42,610 --> 00:44:44,040 all the way down. 855 00:44:44,040 --> 00:44:45,890 But in the analysis, because we're only 856 00:44:45,890 --> 00:44:47,780 thinking about one value of B and M at a time, 857 00:44:47,780 --> 00:44:49,830 we can afford to just consider that one level, 858 00:44:49,830 --> 00:44:51,496 and that will be like the critical place 859 00:44:51,496 --> 00:44:52,599 where all the cost is. 860 00:44:52,599 --> 00:44:55,140 Because once things fit in cache and you've loaded things in, 861 00:44:55,140 --> 00:44:56,570 the cost will be zero. 862 00:44:56,570 --> 00:44:58,899 So below that, the base case is kind of trivial. 863 00:44:58,899 --> 00:45:00,440 So basically what this is going to do 864 00:45:00,440 --> 00:45:02,410 is make our base cases larger. 865 00:45:02,410 --> 00:45:04,510 Instead of our base case being constant, 866 00:45:04,510 --> 00:45:11,650 it's going to be order B or M. 867 00:45:11,650 --> 00:45:21,839 What don't I need? 868 00:45:21,839 --> 00:45:45,420 So now let's go on to median finding. 869 00:45:45,420 --> 00:45:47,990 Median finding, you're given an unsorted array. 870 00:45:47,990 --> 00:45:50,560 You want to find the median. 871 00:45:50,560 --> 00:45:55,440 And in lecture two, we had a linear time 872 00:45:55,440 --> 00:46:00,900 worst case algorithm for this. 873 00:46:00,900 --> 00:46:04,230 And so my goal today is to make it this running time. 874 00:46:04,230 --> 00:46:05,890 This is what you might call linear time 875 00:46:05,890 --> 00:46:08,360 in the cache oblivious model because that's how long it 876 00:46:08,360 --> 00:46:12,850 takes just to read the data. 877 00:46:12,850 --> 00:46:15,680 It turns out basically the same algorithm works. 878 00:46:15,680 --> 00:46:17,810 First, you've got to remember the algorithm. 879 00:46:17,810 --> 00:46:20,510 So let me write it down quickly.
880 00:46:20,510 --> 00:46:25,250 This is the sort of 5 by N over 5 array. 881 00:46:25,250 --> 00:46:29,816 So think of the array as being partitioned into, I'll 882 00:46:29,816 --> 00:46:35,540 call them, columns of five. 883 00:46:35,540 --> 00:46:42,990 So this picture of five dots by N over 5 dots-- this is 884 00:46:42,990 --> 00:46:44,361 dot, dot, dot. 885 00:46:44,361 --> 00:46:46,240 So this is five. 886 00:46:46,240 --> 00:46:48,025 Now, we didn't talk about it then, 887 00:46:48,025 --> 00:46:50,150 and there's a few different ways you could actually 888 00:46:50,150 --> 00:46:52,710 implement it, but let's say these-- the actual array is 889 00:46:52,710 --> 00:46:53,810 one-dimensional. 890 00:46:53,810 --> 00:46:55,610 Let's say these are the first five items. 891 00:46:55,610 --> 00:46:57,020 These are the next five items. 892 00:46:57,020 --> 00:47:01,610 So, in other words, this matrix is stored column by column. 893 00:47:01,610 --> 00:47:03,150 This is just a conceptual view. 894 00:47:03,150 --> 00:47:05,880 So we can define it either way, however we want. 895 00:47:05,880 --> 00:47:08,070 So I'm going to view it that way. 896 00:47:08,070 --> 00:47:12,090 And then what the rest of the algorithm did was to sort 897 00:47:12,090 --> 00:47:16,150 each column-- it's only five items, 898 00:47:16,150 --> 00:47:19,137 so you can sort each one in constant time. 899 00:47:19,137 --> 00:47:20,720 But, in particular, what we care about 900 00:47:20,720 --> 00:47:24,010 is the median of those five items. 901 00:47:24,010 --> 00:47:32,370 Then we recursively found the median of the medians-- call it x. 902 00:47:32,370 --> 00:47:41,150 This is the step we're going to have to change a little bit. 903 00:47:41,150 --> 00:47:46,350 Then we-- leave a little bit of space. 904 00:47:46,350 --> 00:47:52,580 Then we partition the array by x. 905 00:47:52,580 --> 00:47:55,190 Meaning we split the array into items less than 906 00:47:55,190 --> 00:48:00,350 or equal to x and things greater than x. 907 00:48:00,350 --> 00:48:03,040 We probably assumed there was only one value equal to x, 908 00:48:03,040 --> 00:48:05,160 but it doesn't matter. 909 00:48:05,160 --> 00:48:13,760 And finally, we recurse on one of those two halves. 910 00:48:13,760 --> 00:48:16,369 So this is a pretty crazy divide and conquer algorithm, one 911 00:48:16,369 --> 00:48:17,867 of the more sophisticated ones. 912 00:48:17,867 --> 00:48:19,700 You don't need to know all the details here, 913 00:48:19,700 --> 00:48:22,980 just that it worked and it ran in linear time. 914 00:48:22,980 --> 00:48:26,030 What's crazy about it is there are two recursive calls. 915 00:48:26,030 --> 00:48:27,460 Usually, like in merge sort, where 916 00:48:27,460 --> 00:48:30,090 you do two recursive calls and spend linear time 917 00:48:30,090 --> 00:48:32,300 to do the stuff, like this partition, 918 00:48:32,300 --> 00:48:34,470 you get n log n time, like merge sort. 919 00:48:34,470 --> 00:48:37,570 Here, because this array is a lot smaller, 920 00:48:37,570 --> 00:48:39,690 this is a size N over 5. 921 00:48:39,690 --> 00:48:41,610 And this one was reasonably small; 922 00:48:41,610 --> 00:48:48,800 it was like 7/10 N. Because 7/10 plus 1/5 923 00:48:48,800 --> 00:48:52,730 is strictly less than 1, this ends up being 924 00:48:52,730 --> 00:48:54,480 linear time instead of n log n. 925 00:48:54,480 --> 00:48:56,820 That's just review.
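Here is a Python sketch of those five steps. The medians are written out to a contiguous list, which is what the cache-oblivious analysis below will require; the list-based partition, rather than the in-place CLRS one, is an assumption for illustration:

    def select(A, k):
        # Return the k-th smallest element of A (0-indexed); the median
        # is select(A, len(A) // 2). Worst-case linear time.
        if len(A) <= 5:
            return sorted(A)[k]
        # Steps 1-2: view A as columns of five, sort each little column,
        # and write its median out to a contiguous array of medians.
        medians = [sorted(A[i:i + 5])[len(A[i:i + 5]) // 2]
                   for i in range(0, len(A), 5)]
        # Step 3: recursively find the median of the medians.
        x = select(medians, len(medians) // 2)
        # Step 4: partition by x, using plain scans.
        less = [a for a in A if a < x]
        equal = [a for a in A if a == x]
        greater = [a for a in A if a > x]
        # Step 5: recurse on the side containing the k-th element.
        if k < len(less):
            return select(less, k)
        if k < len(less) + len(equal):
            return x
        return select(greater, k - len(less) - len(equal))

    # select([7, 1, 5, 3, 9, 2, 8], 3) returns 5, the median.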
926 00:48:56,820 --> 00:49:05,270 Now, what I'd like to do is the same thing, same analysis, 927 00:49:05,270 --> 00:49:08,490 or same algorithm, but now I want to analyze it 928 00:49:08,490 --> 00:49:10,410 in this two-level model. 929 00:49:10,410 --> 00:49:28,740 So actually, I will erase this board. 930 00:49:28,740 --> 00:49:33,820 So now my array has been partitioned into blocks 931 00:49:33,820 --> 00:49:36,414 of size B, like this picture. 932 00:49:36,414 --> 00:49:37,580 In fact, it's quite similar. 933 00:49:37,580 --> 00:49:39,730 Here, we're partitioning things into blocks, 934 00:49:39,730 --> 00:49:40,670 but they're size five. 935 00:49:40,670 --> 00:49:41,620 That's different. 936 00:49:41,620 --> 00:49:44,990 Now someone has partitioned my array into blocks of size B. 937 00:49:44,990 --> 00:49:46,840 I need to count how many things I access. 938 00:49:46,840 --> 00:49:49,720 Well, let's just look line by line at this code 939 00:49:49,720 --> 00:49:50,530 and see what we do. 940 00:49:50,530 --> 00:49:52,490 Step one, we do absolutely nothing. 941 00:49:52,490 --> 00:49:56,440 This is a conceptual picture, so zero cost, great. 942 00:49:56,440 --> 00:50:00,600 Step one is zero, my favorite answer. 943 00:50:00,600 --> 00:50:03,630 Step two, we sort each column. 944 00:50:03,630 --> 00:50:04,630 How long does this take? 945 00:50:04,630 --> 00:50:12,230 What am I doing? 946 00:50:12,230 --> 00:50:17,420 It's right above me. 947 00:50:17,420 --> 00:50:18,350 AUDIENCE: N over B. 948 00:50:18,350 --> 00:50:21,024 ERIK DEMAINE: N over B, because this is a scan. 949 00:50:21,024 --> 00:50:22,440 It's a little bit weird of a scan. 950 00:50:22,440 --> 00:50:25,520 We look at five items, and then we 951 00:50:25,520 --> 00:50:28,370 look at the next five items, and then the next five items. 952 00:50:28,370 --> 00:50:29,550 But it's basically a scan. 953 00:50:29,550 --> 00:50:32,559 You could think of it as almost five parallel scans, I suppose, 954 00:50:32,559 --> 00:50:34,100 or you could just break into the case 955 00:50:34,100 --> 00:50:37,540 where maybe if B is a constant, then 956 00:50:37,540 --> 00:50:38,790 it doesn't matter what you do. 957 00:50:38,790 --> 00:50:42,681 But if B is bigger than a constant, then reading five items, 958 00:50:42,681 --> 00:50:44,680 those are all probably going to be in one block, 959 00:50:44,680 --> 00:50:47,290 except the ones that straddle the block boundaries. 960 00:50:47,290 --> 00:50:51,086 So in all cases, for step two-- maybe 961 00:50:51,086 --> 00:50:53,600 I should rewrite step one-- zero cost. 962 00:50:53,600 --> 00:51:01,240 Step two is order N over B plus 1, to be careful. 963 00:51:01,240 --> 00:51:03,636 That's a scan. 964 00:51:03,636 --> 00:51:05,010 Actually, it's two parallel scans, 965 00:51:05,010 --> 00:51:09,490 because we have to write out these medians somewhere, 966 00:51:09,490 --> 00:51:10,620 so we'll have two. 967 00:51:10,620 --> 00:51:14,320 Step three is to recursively find the median of the medians. 968 00:51:14,320 --> 00:51:22,140 Now, before, we had T of N is T of N over 5 969 00:51:22,140 --> 00:51:30,110 plus T of 7/10 N plus linear. 970 00:51:30,110 --> 00:51:34,002 In this new world-- this is regular running time. 971 00:51:34,002 --> 00:51:35,460 In this new world, I'm going to use 972 00:51:35,460 --> 00:51:38,640 a different notation for the recurrence, MT of N 973 00:51:38,640 --> 00:51:40,630 for memory transfers.
974 00:51:40,630 --> 00:51:42,490 This is a good old fashioned time, 975 00:51:42,490 --> 00:51:44,900 and this is our new modern notion of time-- 976 00:51:44,900 --> 00:51:47,760 how many block transfers do I need to do for problem size N. 977 00:51:47,760 --> 00:51:54,020 So this is a recursion, and should be MT of N over 5. 978 00:51:54,020 --> 00:52:00,500 But, and this is important, for this 979 00:52:00,500 --> 00:52:02,850 to be a subproblem of the same type, 980 00:52:02,850 --> 00:52:05,750 I need to know that the array that we're recursing on 981 00:52:05,750 --> 00:52:08,660 is stored contiguously. 982 00:52:08,660 --> 00:52:10,800 Before, I didn't need to do that. 983 00:52:10,800 --> 00:52:14,200 I could say, well, let's put the medians in the middle. 984 00:52:14,200 --> 00:52:18,341 So now every fifth item in this array is my new subarray. 985 00:52:18,341 --> 00:52:20,090 And so I could recursively call this thing 986 00:52:20,090 --> 00:52:22,690 and say, OK, here's my array, but really only think 987 00:52:22,690 --> 00:52:23,830 about every fifth item. 988 00:52:23,830 --> 00:52:25,650 That's like a stride in the array. 989 00:52:25,650 --> 00:52:27,460 And then the next recursive level, oh, only 990 00:52:27,460 --> 00:52:28,950 worry about every 25th item. 991 00:52:28,950 --> 00:52:32,260 And every 5 cubed item-- I'm going to stop computing-- 992 00:52:32,260 --> 00:52:33,870 and so on. 993 00:52:33,870 --> 00:52:37,360 And that would be fine for regular running time. 994 00:52:37,360 --> 00:52:39,554 But as my stride gets bigger and bigger, 995 00:52:39,554 --> 00:52:40,970 at some point, every item is going 996 00:52:40,970 --> 00:52:42,130 to be in a different block. 997 00:52:42,130 --> 00:52:43,130 That's bad. 998 00:52:43,130 --> 00:52:44,260 I don't want to do that. 999 00:52:44,260 --> 00:52:48,270 So when I find these medians, or when I recurse, 1000 00:52:48,270 --> 00:52:50,290 I need that the medians that I'm recursing on 1001 00:52:50,290 --> 00:52:51,670 are stored in a contiguous array. 1002 00:52:51,670 --> 00:52:52,690 Now, this is easy to do. 1003 00:52:52,690 --> 00:52:54,023 But we didn't have to do it before. 1004 00:52:54,023 --> 00:52:57,790 That's the key difference. 1005 00:52:57,790 --> 00:53:07,060 Make sure they are stored contiguously. 1006 00:53:07,060 --> 00:53:11,760 I can do that because when I sort each column in one scan, 1007 00:53:11,760 --> 00:53:14,250 I can have a second scan which is the output, which 1008 00:53:14,250 --> 00:53:16,360 is the array of medians. 1009 00:53:16,360 --> 00:53:17,960 So as I'm scanning through the input, 1010 00:53:17,960 --> 00:53:19,290 I'm going to output the median. 1011 00:53:19,290 --> 00:53:21,059 It's going to be 1/5 the size. 1012 00:53:21,059 --> 00:53:22,850 Then I've got all the medians nicely stored 1013 00:53:22,850 --> 00:53:25,070 in a contiguous array. 1014 00:53:25,070 --> 00:53:27,430 So with order one parallel scans, 1015 00:53:27,430 --> 00:53:33,810 same time here, this is actually a legitimate recursive call. 1016 00:53:33,810 --> 00:53:35,120 Then we partition. 1017 00:53:35,120 --> 00:53:42,920 Partition, again, is a bunch of parallel scans, I think, three. 1018 00:53:42,920 --> 00:53:44,640 You've got one reading scan, which 1019 00:53:44,640 --> 00:53:46,190 is you're reading through the array, 1020 00:53:46,190 --> 00:53:47,380 and you've got two writing scans.
1021 00:53:47,380 --> 00:53:49,420 You're writing out the elements less than or equal to x, 1022 00:53:49,420 --> 00:53:51,630 and you're writing out the elements greater than x. 1023 00:53:51,630 --> 00:53:53,050 But again, all of those are scans. 1024 00:53:53,050 --> 00:53:55,120 You're always writing the next element right 1025 00:53:55,120 --> 00:53:56,460 after the previous one. 1026 00:53:56,460 --> 00:53:58,350 So if you already have that block in memory 1027 00:53:58,350 --> 00:54:01,750 and if you assume that the number of blocks in cache 1028 00:54:01,750 --> 00:54:06,910 is at least three, then three parallel scans is fine. 1029 00:54:06,910 --> 00:54:09,190 It's different from the CLRS partition algorithm. 1030 00:54:09,190 --> 00:54:11,260 That one was fancy to be in place. 1031 00:54:11,260 --> 00:54:13,720 We're not trying to be in place or fancy at all. 1032 00:54:13,720 --> 00:54:16,310 Let's just do it with a bunch of scans. 1033 00:54:16,310 --> 00:54:18,524 So now we have two arrays-- the elements less than x, 1034 00:54:18,524 --> 00:54:19,690 the elements greater than x. 1035 00:54:19,690 --> 00:54:22,070 Then we recurse on one of them, and those elements 1036 00:54:22,070 --> 00:54:24,260 are consecutive already, so good. 1037 00:54:24,260 --> 00:54:26,630 This is a regular recursive call. 1038 00:54:26,630 --> 00:54:28,320 Again, we're maintaining the invariant 1039 00:54:28,320 --> 00:54:32,350 that the array is stored contiguously. 1040 00:54:32,350 --> 00:54:37,430 And by the old analysis, that array is sized at most 7/10 N. 1041 00:54:37,430 --> 00:54:45,918 So I get a new recurrence, which is MT of N is MT of N over 5 1042 00:54:45,918 --> 00:54:51,090 plus MT of 7/10 N-- this analysis feels very "empty"-- 1043 00:54:51,090 --> 00:54:57,150 sorry, bad joke-- plus N over B plus 1. 1044 00:54:57,150 --> 00:55:01,010 So basically the same recurrence, but now N over B 1045 00:55:01,010 --> 00:55:03,955 plus 1 for what we're doing here. 1046 00:55:03,955 --> 00:55:05,330 But I had to change the algorithm 1047 00:55:05,330 --> 00:55:07,970 a little bit for this recurrence to be correct, 1048 00:55:07,970 --> 00:55:10,430 for it to correctly reflect the number of memory transfers. 1049 00:55:10,430 --> 00:55:13,920 Now all we need to do is solve the recurrence. 1050 00:55:13,920 --> 00:55:18,100 And actually, in some sense, more importantly, 1051 00:55:18,100 --> 00:55:20,520 we need to figure out what the base case is. 1052 00:55:20,520 --> 00:55:25,630 Because we could say, all right, here's the usual base case. 1053 00:55:25,630 --> 00:55:27,170 If I have a constant sized problem, 1054 00:55:27,170 --> 00:55:29,140 well, that's going to be constant. 1055 00:55:29,140 --> 00:55:32,260 This is our base case for every recurrence we've ever done. 1056 00:55:32,260 --> 00:55:34,220 And that's enough usually. 1057 00:55:34,220 --> 00:55:36,900 It's going to give us a really bad answer here. 1058 00:55:36,900 --> 00:56:01,060 So let's go off to the side here and solve that recurrence. 1059 00:56:01,060 --> 00:56:05,534 So if that's my base case, well, in particular-- so 1060 00:56:05,534 --> 00:56:06,700 this is some recursion tree. 1061 00:56:06,700 --> 00:56:09,650 It's very uneven, so it's kind of annoying to draw. 1062 00:56:09,650 --> 00:56:13,500 But what I know with this base case is that 1063 00:56:13,500 --> 00:56:17,160 this overall MT of N is going to be at least the number 1064 00:56:17,160 --> 00:56:19,880 of leaves in the recursion tree.
1065 00:56:19,880 --> 00:56:22,955 So let's say MT of N is at least L 1066 00:56:22,955 --> 00:56:29,300 of N, the number of leaves in the recursion. 1067 00:56:29,300 --> 00:56:31,520 So this is really, if I run the algorithm, 1068 00:56:31,520 --> 00:56:35,630 how many base cases of constant size do I get? 1069 00:56:35,630 --> 00:56:46,050 And that satisfies-- so it's not obvious what that is. 1070 00:56:46,050 --> 00:56:47,199 There's no plus here. 1071 00:56:47,199 --> 00:56:49,490 The number of leaves is just how many leaves are over here, 1072 00:56:49,490 --> 00:56:52,560 plus how many leaves are over here, and L of 1 equals 1, say, 1073 00:56:52,560 --> 00:56:55,120 or some constant equals constant. 1074 00:56:55,120 --> 00:56:59,440 I happen to know, because I've seen lots of recurrences, 1075 00:56:59,440 --> 00:57:03,260 this solves to some N to the alpha. 1076 00:57:03,260 --> 00:57:08,730 I claim that L of N is N to the alpha for some constant alpha. 1077 00:57:08,730 --> 00:57:09,440 Why? 1078 00:57:09,440 --> 00:57:11,650 I'll just prove that it works. 1079 00:57:11,650 --> 00:57:16,250 So this is now N over 5 to the alpha, 1080 00:57:16,250 --> 00:57:19,950 and this is 7/10 N to the alpha. 1081 00:57:19,950 --> 00:57:24,640 If it's going to work, this recurrence should be satisfied. 1082 00:57:24,640 --> 00:57:26,830 And now, if you look at this equation, 1083 00:57:26,830 --> 00:57:30,080 there's a lot of N to the alphas, and they all cancel. 1084 00:57:30,080 --> 00:57:37,305 So I get 1 equals 1/5 to the alpha plus 7/10 to the alpha. 1085 00:57:37,305 --> 00:57:38,680 It's confusing because I was just 1086 00:57:38,680 --> 00:57:42,930 watching the TV show Alphas, but no relation. 1087 00:57:42,930 --> 00:57:45,370 So this is now something purely in terms of alpha. 1088 00:57:45,370 --> 00:57:47,580 You just need to check that there is a real solution. 1089 00:57:47,580 --> 00:57:48,440 There is one. 1090 00:57:48,440 --> 00:57:51,630 You have to plug it into Wolfram Alpha or something, 1091 00:57:51,630 --> 00:57:53,000 no pun intended. 1092 00:57:53,000 --> 00:57:55,967 Wow, they're just coming out today. 1093 00:57:55,967 --> 00:57:58,947 And then alpha is... next page... 1094 00:57:58,947 --> 00:58:01,947 I can't do this by hand. 1095 00:58:01,947 --> 00:58:08,207 Something like .83978. 1096 00:58:08,207 --> 00:58:15,487 So we get L of N is, say, at least N to the 0.8, or bigger. 1097 00:58:15,487 --> 00:58:21,247 It's sublinear, and that was enough when we cared about time. 1098 00:58:21,247 --> 00:58:23,687 But now it's bad news, because 1099 00:58:23,687 --> 00:58:29,087 our goal was to get N over B plus 1. 1100 00:58:29,087 --> 00:58:33,507 If B is huge, if B is bigger than N to the 0.2, 1101 00:58:33,507 --> 00:58:35,843 then we are not achieving this bound. 1102 00:58:35,843 --> 00:58:36,343 Right? 1103 00:58:36,343 --> 00:58:38,947 We're always paying at least N to the 0.8. 1104 00:58:38,947 --> 00:58:43,247 For example, if B is roughly N, we're way off! 1105 00:58:43,247 --> 00:58:45,247 But that's because we used the wrong base case. 1106 00:58:45,247 --> 00:58:49,360 Turns out if you use a better base case, things just work. 1107 00:58:49,360 --> 00:58:51,024 So let's do that. 1108 00:58:51,024 --> 00:58:53,940 I think it's going to be smaller. 1109 00:58:53,940 --> 00:58:55,912 So... the next base case... 1110 00:58:55,912 --> 00:58:56,616 I mean...
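A quick numeric check, by bisection, that 1 = (1/5)^alpha + (7/10)^alpha has its root near 0.8398 (illustrative; any root finder would do):

    def f(alpha):
        # Decreasing in alpha: f(0) = 1 > 0 and f(1) = -0.1 < 0,
        # so there is exactly one root in (0, 1).
        return (1 / 5) ** alpha + (7 / 10) ** alpha - 1

    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid  # the root lies to the right of mid
        else:
            hi = mid
    print(lo)  # about 0.83978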
1111 00:58:56,616 --> 00:58:58,532 When you're doing cache-oblivious analysis, 1112 00:58:58,532 --> 00:58:59,760 you never use this base case. 1113 00:58:59,760 --> 00:59:03,100 The first one you should think about is this one. 1114 00:59:03,100 --> 00:59:04,812 If you have a problem of a size that 1115 00:59:04,812 --> 00:59:06,320 fits in a constant number of blocks-- 1116 00:59:06,320 --> 00:59:08,518 well, of course, that's going to take-- 1117 00:59:08,518 --> 00:59:10,684 once they are read into the cache, 1118 00:59:10,684 --> 00:59:12,100 you're not going to pay anything. 1119 00:59:12,100 --> 00:59:14,524 How long does it take to read a constant number of blocks 1120 00:59:14,524 --> 00:59:15,024 into cache? 1121 00:59:15,024 --> 00:59:16,940 A constant number of memory transfers. 1122 00:59:16,940 --> 00:59:19,415 OK, this is obviously a strictly better base case 1123 00:59:19,415 --> 00:59:20,240 than this one. 1124 00:59:20,240 --> 00:59:23,090 Because we have the same thing on the right hand 1125 00:59:23,090 --> 00:59:26,240 side, a constant, but we've solved a larger problem. 1126 00:59:26,240 --> 00:59:29,600 So clearly you should cut here, instead of there. 1127 00:59:29,600 --> 00:59:34,740 Then the number of leaves in this recursion... 1128 00:59:34,740 --> 00:59:38,998 So same recurrence, different base case. 1129 00:59:38,998 --> 00:59:41,164 So we stop recursing conceptually in the analysis-- 1130 00:59:41,164 --> 00:59:42,980 the algorithm goes all the way down, 1131 00:59:42,980 --> 00:59:45,092 but in the analysis we stop recursing when 1132 00:59:45,092 --> 00:59:46,940 we reach a problem of size B. 1133 00:59:46,940 --> 00:59:54,168 The number of leaves in that new recursion tree will be N over B 1134 00:59:54,168 --> 00:59:55,900 to the alpha. 1135 00:59:55,900 --> 00:59:56,870 That's good! 1136 00:59:56,870 --> 00:59:59,780 That's smaller than N over B. 1137 00:59:59,780 --> 01:00:02,380 OK, now I'm going to wave my hands a little bit 1138 01:00:02,380 --> 01:00:10,780 and say, MT of N is order N over B plus 1. 1139 01:00:10,780 --> 01:00:13,090 I guess to do that, you want to prove it 1140 01:00:13,090 --> 01:00:16,390 the same way we did before when we solved this recurrence, 1141 01:00:16,390 --> 01:00:17,770 which is by substitution. 1142 01:00:17,770 --> 01:00:19,900 You assume this is true, you plug it in, 1143 01:00:19,900 --> 01:00:22,750 verify it can actually be done with some constants. 1144 01:00:22,750 --> 01:00:25,750 The intuition of what's going on is, in general, this recurrence 1145 01:00:25,750 --> 01:00:27,550 is dominated by the root. 1146 01:00:27,550 --> 01:00:31,759 The root cost for this recursion is N over B plus 1. 1147 01:00:31,759 --> 01:00:32,800 So this is the root cost. 1148 01:00:32,800 --> 01:00:34,630 I claim that, up to constant factors, 1149 01:00:34,630 --> 01:00:35,920 that is the overall cost. 1150 01:00:35,920 --> 01:00:38,440 Roughly because, as you go down the recursion tree, 1151 01:00:38,440 --> 01:00:41,431 the cost is decreasing geometrically. 1152 01:00:41,431 --> 01:00:43,180 But that's not obvious for this recurrence 1153 01:00:43,180 --> 01:00:44,694 because it's so uneven. 1154 01:00:44,694 --> 01:00:47,110 But it's kind of like the master method, a little fancier. 1155 01:00:47,110 --> 01:00:52,230 Intuitively, this should be obvious. 1156 01:00:52,230 --> 01:00:54,490 There's the root cost and then there's the other ones.
1157 01:00:54,490 --> 01:00:56,990 But to actually prove it, you should do the substitution method. 1158 01:00:56,990 --> 01:01:02,320 I want to go on to more interesting algorithms instead, 1159 01:01:02,320 --> 01:01:07,610 but any questions before we continue? 1160 01:01:07,610 --> 01:01:08,150 All right. 1161 01:01:08,150 --> 01:01:11,390 So next algorithm, that was median, now 1162 01:01:11,390 --> 01:01:18,670 we're going to do matrix multiplication via divide 1163 01:01:18,670 --> 01:01:32,620 and conquer. 1164 01:01:32,620 --> 01:01:34,060 So what we just saw was an example 1165 01:01:34,060 --> 01:01:37,464 where, in divide and conquer, in the analysis 1166 01:01:37,464 --> 01:01:39,130 we think about the case where things fit 1167 01:01:39,130 --> 01:01:40,810 in a constant number of blocks. 1168 01:01:40,810 --> 01:01:42,280 That was sort of case one. 1169 01:01:42,280 --> 01:01:44,372 The next example, matrix multiplication, 1170 01:01:44,372 --> 01:01:45,330 will be the other case. 1171 01:01:45,330 --> 01:01:53,650 So you get to see both types. 1172 01:01:53,650 --> 01:01:55,890 So multiplying matrices, something 1173 01:01:55,890 --> 01:01:57,270 we've done many times. 1174 01:01:57,270 --> 01:02:01,020 For example, in the FFT lecture and in Strassen's 1175 01:02:01,020 --> 01:02:03,764 algorithm, just to remind you. 1176 01:02:03,764 --> 01:02:05,430 I'm just thinking about the square case, 1177 01:02:05,430 --> 01:02:07,860 although this generalizes. 1178 01:02:07,860 --> 01:02:16,140 We have two square matrices, N by N. 1179 01:02:16,140 --> 01:02:18,420 Normally, I would say C equals A times B, 1180 01:02:18,420 --> 01:02:20,970 but I realized we used B for block size. 1181 01:02:20,970 --> 01:02:26,125 So this is going to be z equals x times y. 1182 01:02:26,125 --> 01:02:31,020 Hopefully that doesn't conflict with anything else, but no B's. 1183 01:02:31,020 --> 01:02:33,510 All right, so standard matrix multiplication. 1184 01:02:33,510 --> 01:02:42,090 Let's start with the standard algorithm. 1185 01:02:42,090 --> 01:02:43,920 Let's start by analyzing that. 1186 01:02:43,920 --> 01:02:46,770 Because if you're reasonably clever, 1187 01:02:46,770 --> 01:02:50,740 the standard algorithm is not so bad. 1188 01:02:50,740 --> 01:02:54,010 So in general, this won't matter too much. 1189 01:02:54,010 --> 01:02:57,150 Let's suppose we're computing z row by row, 1190 01:02:57,150 --> 01:03:03,150 and let's say we're currently computing this product cell. 1191 01:03:03,150 --> 01:03:06,780 So that product cell-- 1192 01:03:06,780 --> 01:03:15,410 this z i,j here-- is the dot product of this row with this column. 1193 01:03:15,410 --> 01:03:17,370 How do I compute dot products? 1194 01:03:17,370 --> 01:03:18,190 Two parallel scans. 1195 01:03:18,190 --> 01:03:18,690 Right? 1196 01:03:18,690 --> 01:03:20,520 I scan through this row and I parallel 1197 01:03:20,520 --> 01:03:22,020 scan through this column. 1198 01:03:22,020 --> 01:03:26,160 Now, it depends on the order in which you store x and y, 1199 01:03:26,160 --> 01:03:30,000 but let's suppose we can store x in row major order, 1200 01:03:30,000 --> 01:03:33,240 meaning row by row, and we store y in column major order, 1201 01:03:33,240 --> 01:03:34,530 meaning column by column. 1202 01:03:34,530 --> 01:03:36,178 Then this will be an honest to goodness 1203 01:03:36,178 --> 01:03:37,560 scan of a contiguous array. 1204 01:03:37,560 --> 01:03:41,550 Again, the order we store things in memory really matters.
1205 01:03:41,550 --> 01:03:42,990 So let's make our life ideal. 1206 01:03:42,990 --> 01:03:48,120 Let's say that this is row by row 1207 01:03:48,120 --> 01:03:52,740 and this one is column by column, then hey, 1208 01:03:52,740 --> 01:03:55,500 this is two parallel scans, so order N over B 1209 01:03:55,500 --> 01:03:58,300 to compute this cell. 1210 01:03:58,300 --> 01:04:07,500 OK, I claim that computing z i,j costs 1211 01:04:07,500 --> 01:04:12,360 N over B, so maybe plus 1. 1212 01:04:12,360 --> 01:04:14,187 Again, these are N by N matrices, 1213 01:04:14,187 --> 01:04:24,480 so total size N squared, which means the total cost is what? 1214 01:04:24,480 --> 01:04:30,480 N cubed over B plus N squared, I guess. 1215 01:04:30,480 --> 01:04:31,350 Seems pretty good. 1216 01:04:31,350 --> 01:04:34,200 I mean, we had a running time of N cubed before, 1217 01:04:34,200 --> 01:04:37,470 and we divided by B. How could you possibly do better? 1218 01:04:37,470 --> 01:04:40,140 Well, by being smarter. 1219 01:04:40,140 --> 01:04:46,500 This is not optimal; you can do better. 1220 01:04:46,500 --> 01:04:50,702 It's not obvious, but let me just spend 1221 01:04:50,702 --> 01:04:53,160 a little more time convincing you this is the right answer. 1222 01:04:53,160 --> 01:04:56,607 Not only is this big O, but for appropriate settings-- 1223 01:04:56,607 --> 01:05:01,110 in the worst case, this is going to be theta. 1224 01:05:01,110 --> 01:05:03,800 Because if you think of the order in which we're-- see, 1225 01:05:03,800 --> 01:05:06,680 we look at these rows several times. 1226 01:05:06,680 --> 01:05:09,330 And if you look at, when I compute this cell and this cell 1227 01:05:09,330 --> 01:05:12,420 and this cell of the z matrix, or the product matrix, 1228 01:05:12,420 --> 01:05:15,750 each of them uses the same row of x. 1229 01:05:15,750 --> 01:05:18,300 So maybe you could reuse that. 1230 01:05:18,300 --> 01:05:21,300 You could reuse that row of x. 1231 01:05:21,300 --> 01:05:23,550 That might actually be free, depending 1232 01:05:23,550 --> 01:05:24,910 on how B and N relate. 1233 01:05:24,910 --> 01:05:30,930 But the columns of y, those are different every time. 1234 01:05:30,930 --> 01:05:32,920 When I compute this one, I use the first column 1235 01:05:32,920 --> 01:05:35,760 of y; when I compute this one, I use the second column of y. 1236 01:05:35,760 --> 01:05:38,010 Unless the cache is so big that it 1237 01:05:38,010 --> 01:05:40,380 can store all of y-- which is like 1238 01:05:40,380 --> 01:05:42,779 storing the entire problem in cache, 1239 01:05:42,779 --> 01:05:44,460 that's unrealistic-- 1240 01:05:44,460 --> 01:05:48,900 so unless M is bigger than N squared, 1241 01:05:48,900 --> 01:05:52,290 in this algorithm at least, you have to read a new column of y 1242 01:05:52,290 --> 01:05:53,970 every single time. 1243 01:05:53,970 --> 01:05:55,960 So that's why it's theta N over B plus 1. 1244 01:05:55,960 --> 01:06:00,430 You need to spend N over B, assuming 1245 01:06:00,430 --> 01:06:06,160 M is less than N squared. 1246 01:06:06,160 --> 01:06:06,660 OK. 1247 01:06:06,660 --> 01:06:09,120 And I claim this is not the best you can do, because we're 1248 01:06:09,120 --> 01:06:10,380 going to do better. 1249 01:06:10,380 --> 01:06:28,490 And we're going to do better by divide and conquer.
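For reference, a Python sketch of the standard algorithm just analyzed, assuming x is stored row by row and y column by column so that each dot product is two parallel scans; the lists of lists standing in for the two flat layouts are an illustrative assumption:

    def matmul_standard(x_rows, y_cols):
        # x_rows[i] is row i of x; y_cols[j] is column j of y.
        # Each output cell is one dot product: Theta(N/B + 1) memory
        # transfers, so Theta(N^3/B + N^2) in total.
        n = len(x_rows)
        z = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                z[i][j] = sum(x_rows[i][t] * y_cols[j][t] for t in range(n))
        return z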
1250 01:06:28,490 --> 01:06:30,540 Now, you've already seen divide and conquer used 1251 01:06:30,540 --> 01:06:36,930 for matrix multiplication to get Strassen's algorithm, 1252 01:06:36,930 --> 01:06:44,100 and the idea there is to use blocks. 1253 01:06:44,100 --> 01:06:46,740 So this is sort of an algorithm you've already seen. 1254 01:06:46,740 --> 01:06:55,080 I'm going to divide the matrix z into N over 2 1255 01:06:55,080 --> 01:06:57,900 by N over 2 sub-matrices. 1256 01:06:57,900 --> 01:07:02,190 Each of these z i,j's is an N over 2 by N over 2 matrix. 1257 01:07:02,190 --> 01:07:15,400 And I do the same thing for x and y. 1258 01:07:15,400 --> 01:07:16,770 Make sure the numbers are right-- 1259 01:07:16,770 --> 01:07:19,190 x 1,2, y 2,1, and so on. 1260 01:07:19,190 --> 01:07:22,190 And you can write this out explicitly. 1261 01:07:22,190 --> 01:07:25,345 I prefer not to do all of it, but let's do one of them. 1262 01:07:25,345 --> 01:07:27,470 You can just think of these as two by two matrices, 1263 01:07:27,470 --> 01:07:29,570 because matrix multiplication is associative 1264 01:07:29,570 --> 01:07:30,740 and good things happen. 1265 01:07:30,740 --> 01:07:32,450 I can just take these two elements-- 1266 01:07:32,450 --> 01:07:34,640 but they're actually matrices, sorry. 1267 01:07:34,640 --> 01:07:38,192 I might take these two and dot product with these two. 1268 01:07:38,192 --> 01:07:46,790 And I get x1,1 y1,1 plus x1,2 y2,1, 1269 01:07:46,790 --> 01:07:50,220 and that's what I should set z1,1 to. 1270 01:07:50,220 --> 01:07:54,020 So this is a formula, but it's also a recursive algorithm. 1271 01:07:54,020 --> 01:07:57,470 It says, if I want to compute z, I'm going to say, 1272 01:07:57,470 --> 01:07:59,510 well, there are four subproblems. 1273 01:07:59,510 --> 01:08:01,399 The first one is to compute z1,1, 1274 01:08:01,399 --> 01:08:03,440 and I'm going to do that by recursively computing 1275 01:08:03,440 --> 01:08:06,500 the product of x1,1 and y1,1, recursively computing 1276 01:08:06,500 --> 01:08:10,034 the product of x1,2 and y2,1, and then adding them together. 1277 01:08:10,034 --> 01:08:10,950 This is not recursive. 1278 01:08:10,950 --> 01:08:13,080 Addition is easy. 1279 01:08:13,080 --> 01:08:13,580 OK. 1280 01:08:13,580 --> 01:08:15,800 And there's two products here, two products here, 1281 01:08:15,800 --> 01:08:17,341 two products here, two products here, 1282 01:08:17,341 --> 01:08:18,830 a total of eight products, so we're 1283 01:08:18,830 --> 01:08:25,055 going to have eight recursive calls of size N over 2. 1284 01:08:25,055 --> 01:08:26,930 If we look at the number of memory transfers, 1285 01:08:26,930 --> 01:08:31,689 this is 8 times a recursive call on N over 2 by N 1286 01:08:31,689 --> 01:08:37,550 over 2 sub-matrices plus the cost of addition. 1287 01:08:37,550 --> 01:08:41,300 And I claim the cost of addition is at most N squared over B 1288 01:08:41,300 --> 01:08:46,461 plus 1, because addition is basically parallel scans. 1289 01:08:46,461 --> 01:08:50,390 I can scan through x, scan through y. 1290 01:08:50,390 --> 01:08:52,609 As long as they're stored in the same order, 1291 01:08:52,609 --> 01:08:55,350 I just am adding them element by element, 1292 01:08:55,350 --> 01:08:59,960 and there's a third scan, which is writing out the z matrix 1293 01:08:59,960 --> 01:09:02,500 once things are linearized.
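A sketch of this divide-and-conquer multiply in Python, matching MT(N) = 8 MT(N/2) + O(N^2/B + 1). It assumes N is a power of 2, and, as discussed next, each quadrant must also be laid out contiguously in memory for the analysis to go through; plain lists of lists are an illustrative stand-in here:

    def matmul_rec(x, y):
        n = len(x)
        if n == 1:
            # The code recurses all the way down; only the *analysis*
            # stops early, at the size where three submatrices fit in cache.
            return [[x[0][0] * y[0][0]]]
        h = n // 2

        def quad(m, r, c):
            # The N/2 x N/2 submatrix with top-left corner at (r, c).
            return [row[c:c + h] for row in m[r:r + h]]

        def add(a, b):
            # Matrix addition: a scan, O(N^2/B + 1) memory transfers.
            return [[a[i][j] + b[i][j] for j in range(h)] for i in range(h)]

        x11, x12, x21, x22 = quad(x, 0, 0), quad(x, 0, h), quad(x, h, 0), quad(x, h, h)
        y11, y12, y21, y22 = quad(y, 0, 0), quad(y, 0, h), quad(y, h, 0), quad(y, h, h)
        # z1,1 = x1,1 y1,1 + x1,2 y2,1, and so on: eight recursive products.
        z11 = add(matmul_rec(x11, y11), matmul_rec(x12, y21))
        z12 = add(matmul_rec(x11, y12), matmul_rec(x12, y22))
        z21 = add(matmul_rec(x21, y11), matmul_rec(x22, y21))
        z22 = add(matmul_rec(x21, y12), matmul_rec(x22, y22))
        # Stitch the four quadrants back together into z.
        return ([z11[i] + z12[i] for i in range(h)] +
                [z21[i] + z22[i] for i in range(h)])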
1294 01:09:02,500 --> 01:09:06,100 Now, for this to work, for this to be a true recursion, 1295 01:09:06,100 --> 01:09:12,279 I need that, say, x1,1 and y1,1 are stored as contiguous things 1296 01:09:12,279 --> 01:09:13,690 in memory. 1297 01:09:13,690 --> 01:09:19,809 So this means that the layout of a matrix, 1298 01:09:19,809 --> 01:09:24,430 let's consider the matrix z, is going to be like the following. 1299 01:09:24,430 --> 01:09:27,370 I'm going to recursively lay out 1,1-- so when I say lay out, 1300 01:09:27,370 --> 01:09:29,920 I mean what order do I store the elements in memory? 1301 01:09:29,920 --> 01:09:32,050 What order do I store the cells in memory? 1302 01:09:32,050 --> 01:09:36,370 And what I'm going to say is, recursively lay out 1303 01:09:36,370 --> 01:09:40,859 the pieces-- there's four pieces-- recursively call 1304 01:09:40,859 --> 01:09:46,282 layout of those and then concatenate them together. 1305 01:09:46,282 --> 01:09:46,990 That's my layout. 1306 01:09:46,990 --> 01:09:48,430 So I'm going to store all of these items, 1307 01:09:48,430 --> 01:09:50,221 then I'm going to store all of these items, 1308 01:09:50,221 --> 01:09:52,510 and then all of these items, then all these items. 1309 01:09:52,510 --> 01:09:55,480 How do I store these items, in what order? 1310 01:09:55,480 --> 01:09:56,107 Recursively. 1311 01:09:56,107 --> 01:09:57,690 So I'm going to divide them like this, 1312 01:09:57,690 --> 01:09:59,650 store these before these before these before these, 1313 01:09:59,650 --> 01:10:00,640 how do I store these? 1314 01:10:00,640 --> 01:10:01,440 Recursively. 1315 01:10:01,440 --> 01:10:02,455 OK, same recursion. 1316 01:10:02,455 --> 01:10:04,510 So it's a really weird order, it's 1317 01:10:04,510 --> 01:10:06,820 a divide and conquer order. 1318 01:10:06,820 --> 01:10:08,380 There's only four things here. 1319 01:10:08,380 --> 01:10:10,510 In what order should I combine the four things? 1320 01:10:10,510 --> 01:10:11,716 Doesn't matter. 1321 01:10:11,716 --> 01:10:13,590 All that matters is that this is consecutive, 1322 01:10:13,590 --> 01:10:15,730 this is consecutive, and this is consecutive, 1323 01:10:15,730 --> 01:10:18,530 so that when I recurse, I'm recursing on consecutive chunks 1324 01:10:18,530 --> 01:10:19,030 of memory. 1325 01:10:19,030 --> 01:10:21,070 Otherwise the analysis just won't work. 1326 01:10:21,070 --> 01:10:24,690 So for this to be right, got to have this layout. 1327 01:10:24,690 --> 01:10:26,500 OK. 1328 01:10:26,500 --> 01:10:32,110 Now we just need to solve the recurrence, and we're done. 1329 01:10:32,110 --> 01:10:34,960 I already told you, the base case we're going to use 1330 01:10:34,960 --> 01:10:35,860 is this one. 1331 01:10:35,860 --> 01:10:37,526 We're going to use this one because it's 1332 01:10:37,526 --> 01:10:40,360 stronger and better, and we'll need it, in this case, 1333 01:10:40,360 --> 01:10:43,442 to get a better analysis. 1334 01:10:43,442 --> 01:10:45,400 You could solve it using the weaker base cases, 1335 01:10:45,400 --> 01:10:47,330 you'll get larger numbers. 1336 01:10:47,330 --> 01:10:51,309 But if you use the strongest base case, MT of-- it's 1337 01:10:51,309 --> 01:10:54,180 not M. Got to be a little careful. 1338 01:10:54,180 --> 01:10:56,980 Because N here is actually just one side length. 
1339 01:10:56,980 --> 01:11:00,445 This is an N by N matrix, so the total size 1340 01:11:00,445 --> 01:11:04,720 is N squared-- actually the total size is 3N squared, 1341 01:11:04,720 --> 01:11:09,460 so this is going to be the square root of M over 3, 1342 01:11:09,460 --> 01:11:12,880 some constant times the square root of M. It actually 1343 01:11:12,880 --> 01:11:14,510 doesn't matter what the constant is. 1344 01:11:14,510 --> 01:11:16,240 But this is the size of-- this is 1345 01:11:16,240 --> 01:11:18,400 the value of N for which all three matrices will 1346 01:11:18,400 --> 01:11:20,890 fit in cache. 1347 01:11:20,890 --> 01:11:27,309 So I claim we know this costs at most M over B memory transfers, 1348 01:11:27,309 --> 01:11:33,550 because we know 1349 01:11:33,550 --> 01:11:35,110 that all of these guys fit in cache 1350 01:11:35,110 --> 01:11:37,540 and because we know that they're stored consecutively 1351 01:11:37,540 --> 01:11:40,840 in memory-- well, three consecutive chunks. 1352 01:11:40,840 --> 01:11:45,940 Now, no matter what I do, there are only M over B blocks there, 1353 01:11:45,940 --> 01:11:47,660 and so at worst I read them all in. 1354 01:11:47,660 --> 01:11:50,470 But once the cache is filled with them, 1355 01:11:50,470 --> 01:11:52,540 for the duration of this recursion, 1356 01:11:52,540 --> 01:11:54,100 I won't be reading any other blocks, 1357 01:11:54,100 --> 01:11:56,990 and so the cache will just stay full with the problem. 1358 01:11:56,990 --> 01:11:59,440 And so I never pay more than this. 1359 01:11:59,440 --> 01:12:00,830 So that's the base case. 1360 01:12:00,830 --> 01:12:05,140 Easy, but you have to think about it for a second. 1361 01:12:05,140 --> 01:12:05,650 Cool. 1362 01:12:05,650 --> 01:12:08,640 Now we have a recurrence and a base case, 1363 01:12:08,640 --> 01:12:11,210 and now we have a good old fashioned recursion tree. 1364 01:12:11,210 --> 01:12:13,809 This one I can actually draw, because it's-- well, 1365 01:12:13,809 --> 01:12:17,120 partly because it's nice and uniform. 1366 01:12:17,120 --> 01:12:19,870 It just explodes rather fast. 1367 01:12:19,870 --> 01:12:25,960 So at the top we have a cost of N squared over B plus 1, 1368 01:12:25,960 --> 01:12:28,750 and we have eight recursive calls. 1369 01:12:28,750 --> 01:12:31,960 And the recursive calls are to something of size 1370 01:12:31,960 --> 01:12:39,970 N over 2, squared, over B, also known as N squared over 4B. 1371 01:12:39,970 --> 01:12:42,170 OK, so if I add up everything on this level, 1372 01:12:42,170 --> 01:12:46,180 I get N squared over B, and if I add up everything on this level 1373 01:12:46,180 --> 01:12:53,200 I'm going to get 8 times N squared over 4B-- is that right? 1374 01:12:53,200 --> 01:12:53,740 Yeah. 1375 01:12:53,740 --> 01:12:58,120 So 2 times N squared over B. 1376 01:12:58,120 --> 01:12:58,870 OK. 1377 01:12:58,870 --> 01:13:02,170 I did that in order to verify that the cost per level 1378 01:13:02,170 --> 01:13:06,430 is increasing geometrically, so all that will matter 1379 01:13:06,430 --> 01:13:08,770 is the leaf level. 1380 01:13:08,770 --> 01:13:11,520 This is the proof of the master theorem.
1381 01:13:11,520 --> 01:13:13,300 When things are doubling at every step-- 1382 01:13:13,300 --> 01:13:15,059 and this was just a special case, 1383 01:13:15,059 --> 01:13:18,529 but every level would look the same-- every level 1384 01:13:18,529 --> 01:13:20,070 of recursion, if you add them all up, 1385 01:13:20,070 --> 01:13:22,653 you're getting twice as much as you had at the previous level. 1386 01:13:22,653 --> 01:13:26,150 So all that will matter is the leaf level. 1387 01:13:26,150 --> 01:13:29,430 OK, the leaf level. 1388 01:13:29,430 --> 01:13:32,610 Actually, maybe I'll do it over here. 1389 01:13:32,610 --> 01:13:34,800 First question is how many leaves are there? 1390 01:13:34,800 --> 01:13:37,900 The leaves are this thing. 1391 01:13:37,900 --> 01:13:41,300 So the way I would think about this is, because everything 1392 01:13:41,300 --> 01:13:44,499 is nice and uniform, it's 8 to the power of the number of levels. 1393 01:13:44,499 --> 01:13:50,790 What's the number of levels? 1394 01:13:50,790 --> 01:13:53,320 Well, we're dividing by 2 each time, 1395 01:13:53,320 --> 01:13:56,559 so it's going to be log of something, 1396 01:13:56,559 --> 01:14:00,130 but it's no longer log N because we're stopping early. 1397 01:14:00,130 --> 01:14:03,450 We're stopping when N reaches this value. 1398 01:14:03,450 --> 01:14:10,856 So it turns out that is log of N divided by that value. 1399 01:14:10,856 --> 01:14:12,230 This is, how many times do I have 1400 01:14:12,230 --> 01:14:14,775 to multiply by 2 before I get to this, which 1401 01:14:14,775 --> 01:14:17,150 is the same thing as how many times do I have to divide N 1402 01:14:17,150 --> 01:14:19,636 by 2 before I get that? 1403 01:14:19,636 --> 01:14:20,780 Think about it. 1404 01:14:20,780 --> 01:14:22,320 OK, but 8 to the log. 1405 01:14:22,320 --> 01:14:24,980 This is 2 to the 3 times log. 1406 01:14:24,980 --> 01:14:27,690 2 to the log is just the thing. 1407 01:14:27,690 --> 01:14:36,830 So this is N over root M over B-- so many overs-- 1408 01:14:36,830 --> 01:14:39,017 to the third power. 1409 01:14:39,017 --> 01:14:40,600 OK, this is starting to look familiar. 1410 01:14:40,600 --> 01:14:46,080 This is N cubed, that should appear somewhere, 1411 01:14:46,080 --> 01:14:48,480 divided by square root of M over B. 1412 01:14:48,480 --> 01:14:50,090 This is the number of leaves. 1413 01:14:50,090 --> 01:14:55,880 Now, for each leaf we're paying this cost, 1414 01:14:55,880 --> 01:15:03,020 so the overall cost of MT of N is going to be this times this. 1415 01:15:03,020 --> 01:15:11,930 So let's do that and simplify. 1416 01:15:11,930 --> 01:15:18,062 So MT of N is going to be big O, because we're taking the leaf 1417 01:15:18,062 --> 01:15:20,020 level, but there are some other levels; that's just 1418 01:15:20,020 --> 01:15:23,530 going to lose us a factor of 2. 1419 01:15:23,530 --> 01:15:26,175 We have this thing multiplied by this thing. 1420 01:15:26,175 --> 01:15:34,220 So we've got N cubed over square root of M over B 1421 01:15:34,220 --> 01:15:40,798 times M over B. 1422 01:15:40,798 --> 01:15:43,360 AUDIENCE: You-- ERIK DEMAINE: I made a mistake. 1423 01:15:43,360 --> 01:15:44,720 Yeah, thank you. 1424 01:15:44,720 --> 01:15:45,970 This was supposed to be cubed. 1425 01:15:45,970 --> 01:15:49,600 So this was M over B to the 1/2, so now we have, 1426 01:15:49,600 --> 01:15:52,720 down here, M over B to the 3/2. 1427 01:15:52,720 --> 01:15:59,850 Thank you, thought that looked weird. 1428 01:15:59,850 --> 01:16:01,890 All right.
1429 01:16:01,890 --> 01:16:06,531 M over B to the 3/2. 1430 01:16:06,531 --> 01:16:07,031 OK. 1431 01:16:07,031 --> 01:16:18,430 AUDIENCE: [INAUDIBLE] ERIK DEMAINE: Yeah. 1432 01:16:18,430 --> 01:16:20,180 What was I doing here? 1433 01:16:20,180 --> 01:16:21,630 This is supposed to be M over 3. 1434 01:16:21,630 --> 01:16:23,605 I was not missing a stroke, thank you. 1435 01:16:23,605 --> 01:16:27,824 M over 3, this is supposed to be M over 3. 1436 01:16:27,824 --> 01:16:29,240 Wow. 1437 01:16:29,240 --> 01:16:32,260 OK, so this is M over 3. 1438 01:16:32,260 --> 01:16:34,780 I'm just going to drop the-- well, I'll put it here. 1439 01:16:34,780 --> 01:16:37,340 But then I'm just going to write theta 1440 01:16:37,340 --> 01:16:39,670 so I can forget about the 3, because that's just 1441 01:16:39,670 --> 01:16:41,360 a square root of 3 factor. 1442 01:16:41,360 --> 01:16:47,405 So now this is going to be M to the 3/2. 1443 01:16:47,405 --> 01:16:50,546 That makes me much happier. 1444 01:16:50,546 --> 01:16:53,410 Did I get it right this time? 1445 01:16:53,410 --> 01:16:54,430 Let's double-check. 1446 01:16:54,430 --> 01:16:57,910 So this is square root of M to the 3rd power, 1447 01:16:57,910 --> 01:17:00,840 so that's M to the 1/2 cubed, M to the 3/2. 1448 01:17:00,840 --> 01:17:04,850 I think that's good; this base case was square root of M. 1449 01:17:04,850 --> 01:17:07,929 OK, get it right. 1450 01:17:07,929 --> 01:17:10,282 So now this is M to the 3/2. 1451 01:17:10,282 --> 01:17:11,740 There is a square root that's going 1452 01:17:11,740 --> 01:17:16,270 to come back; there's M to the 3/2 and there's an M upstairs, 1453 01:17:16,270 --> 01:17:18,270 so one cancels. 1454 01:17:18,270 --> 01:17:23,439 We're going to be left with N cubed over square root of M 1455 01:17:23,439 --> 01:17:26,112 times B. OK. 1456 01:17:26,112 --> 01:17:28,570 There was a lower order term because I dropped this plus 1, 1457 01:17:28,570 --> 01:17:31,180 but let's not worry about that right now. 1458 01:17:31,180 --> 01:17:33,730 Here we had N cubed divided by B; that 1459 01:17:33,730 --> 01:17:35,200 was the standard algorithm. 1460 01:17:35,200 --> 01:17:39,160 Now we've got N cubed divided by B divided by square root of M. 1461 01:17:39,160 --> 01:17:40,635 That's big. 1462 01:17:40,635 --> 01:17:42,010 I mean, this is basically, you're 1463 01:17:42,010 --> 01:17:45,450 dividing by-- well, square root of your cache size. 1464 01:17:45,450 --> 01:17:46,030 Wow. 1465 01:17:46,030 --> 01:17:49,230 So who knows how big that is, but say, 1466 01:17:49,230 --> 01:17:52,430 between memory and disk, we're talking gigabytes. 1467 01:17:52,430 --> 01:17:54,330 So this is like billions. 1468 01:17:54,330 --> 01:17:57,450 Square root of a billion is still pretty big, 1469 01:17:57,450 --> 01:18:01,324 like 10,000 to 100,000, so this is a huge amount faster 1470 01:18:01,324 --> 01:18:02,490 than the standard algorithm. 1471 01:18:02,490 --> 01:18:04,920 You can do way better than scans. 1472 01:18:04,920 --> 01:18:07,650 Basically because we're reusing the same rows and columns 1473 01:18:07,650 --> 01:18:08,726 over and over. 1474 01:18:08,726 --> 01:18:10,559 Now, this is standard matrix multiplication. 1475 01:18:10,559 --> 01:18:12,610 You might ask, what about Strassen's algorithm? 1476 01:18:12,610 --> 01:18:13,770 Well, same thing works. 1477 01:18:13,770 --> 01:18:15,960 You can do the same analysis for Strassen, of course. 1478 01:18:15,960 --> 01:18:19,020 You get a similar improvement over Strassen.
1479 01:18:19,020 --> 01:18:21,960 You can do this for non-square matrices and all 1480 01:18:21,960 --> 01:18:23,580 those good things. 1481 01:18:23,580 --> 01:18:25,170 And one minute left. 1482 01:18:25,170 --> 01:18:27,100 And it's going to be enough, I think, 1483 01:18:27,100 --> 01:18:31,330 to cover LRU block replacement. 1484 01:18:31,330 --> 01:18:39,604 So here's what I want to say about LRU block replacement. 1485 01:18:39,604 --> 01:18:41,520 So in the beginning, we said the model is LRU, 1486 01:18:41,520 --> 01:18:44,670 or it could have been FIFO. 1487 01:18:44,670 --> 01:18:45,870 Remember that? 1488 01:18:45,870 --> 01:18:48,210 And this algorithm will work just fine from an LRU 1489 01:18:48,210 --> 01:18:49,620 perspective or a FIFO perspective, 1490 01:18:49,620 --> 01:18:51,570 if you think about it, but how do 1491 01:18:51,570 --> 01:18:53,700 we know that LRU is as good as anything? 1492 01:18:53,700 --> 01:18:58,790 I claim, if you look at some sequence of block accesses-- 1493 01:18:58,790 --> 01:19:01,890 so suppose you know what B is-- and you count, 1494 01:19:01,890 --> 01:19:07,020 for a cache of size M, how many memory transfers LRU does, 1495 01:19:07,020 --> 01:19:09,930 it's going to be within a factor of 2 of the optimal. 1496 01:19:09,930 --> 01:19:12,780 But not the optimal for a cache of size M, 1497 01:19:12,780 --> 01:19:15,732 the optimal for a cache of size M over 2. 1498 01:19:15,732 --> 01:19:17,190 This is a bit of a weird statement. 1499 01:19:17,190 --> 01:19:19,950 I have a factor of 2 here and a factor of 2 here. 1500 01:19:19,950 --> 01:19:27,450 This is a cool idea called resource augmentation, a 1501 01:19:27,450 --> 01:19:30,380 fancy word for a simple idea. 1502 01:19:30,380 --> 01:19:31,200 This we're used to. 1503 01:19:31,200 --> 01:19:33,370 This is approximation algorithms. 1504 01:19:33,370 --> 01:19:36,090 OK, but this is an approximation in cost. 1505 01:19:36,090 --> 01:19:38,040 Here we're approximating the resources 1506 01:19:38,040 --> 01:19:39,240 available to the algorithm. 1507 01:19:39,240 --> 01:19:42,900 We're changing the machine model, dividing M by 2, 1508 01:19:42,900 --> 01:19:46,540 and we get a nice result. 1509 01:19:46,540 --> 01:19:47,670 Why is this OK? 1510 01:19:47,670 --> 01:19:49,559 Because, if you look at a bound like this, 1511 01:19:49,559 --> 01:19:51,300 if you change M by a factor of 2, 1512 01:19:51,300 --> 01:19:53,130 it will not change the bound by more than a 1513 01:19:53,130 --> 01:19:54,880 factor of square root of 2. 1514 01:19:54,880 --> 01:19:56,910 So as long as you have at most, say, 1515 01:19:56,910 --> 01:19:59,670 a linear or polynomial dependence on M, 1516 01:19:59,670 --> 01:20:01,530 changing M by a constant factor will not 1517 01:20:01,530 --> 01:20:03,110 change the overall cost of the cache 1518 01:20:03,110 --> 01:20:03,690 oblivious algorithm. 1519 01:20:03,690 --> 01:20:05,430 This is why we can assume it's LRU. 1520 01:20:05,430 --> 01:20:07,680 The same is true for FIFO; it's probably 1521 01:20:07,680 --> 01:20:11,660 true in expectation for random replacement. 1522 01:20:11,660 --> 01:20:16,170 And I will leave it at that. 1523 01:20:16,170 --> 01:20:17,670 If you want to see the-- do you want 1524 01:20:17,670 --> 01:20:22,080 to see the proof of this theorem? 1525 01:20:22,080 --> 01:20:22,680 Tomorrow? 1526 01:20:22,680 --> 01:20:24,120 Or, Thursday? 1527 01:20:24,120 --> 01:20:24,620 Yes.
1528 01:20:24,620 --> 01:20:27,370 OK, we'll cover it on Thursday.
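As a companion to that theorem, an illustrative Python miss counter for LRU on a sequence of block accesses; the OrderedDict-based cache and the example sequence are assumptions, not from the lecture. The claim above says this count, run with M over B slots, is at most twice what the optimal offline policy pays with half as many slots:

    from collections import OrderedDict

    def lru_misses(accesses, num_slots):
        # Simulate an LRU cache with num_slots block slots (that's M/B)
        # and count the memory transfers (misses) on the access sequence.
        cache = OrderedDict()  # most recently used entries at the end
        misses = 0
        for block in accesses:
            if block in cache:
                cache.move_to_end(block)  # hit: refresh recency
            else:
                misses += 1
                if len(cache) >= num_slots:
                    cache.popitem(last=False)  # evict least recently used
                cache[block] = True
        return misses

    # lru_misses([0, 1, 2, 0, 1, 2], num_slots=2) returns 6: LRU thrashes,
    # while with 3 slots it would pay only the 3 compulsory misses.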