-- week of 6.046. Woohoo! The topic of this final week, among our advanced topics, is cache oblivious algorithms. This is a particularly fun area, one dear to my heart because I've done a lot of research in this area. This is an area co-founded by Professor Leiserson. In fact, the first context in which I met Professor Leiserson was him giving a talk about cache oblivious algorithms at WADS '99 in Vancouver, I think. Yeah, that has to be an odd year. So I learned about cache oblivious algorithms then, started working in the area, and it's been a fun place to play.

But this topic in some sense was also developed in the context of this class. I think there was one semester, probably also '98-'99, where all of the problem sets were about cache oblivious algorithms. And they were, in particular, working out the research ideas at the same time. So it must have been a fun semester. We considered doing that this semester, but we kept things simpler. We know a lot more about cache oblivious algorithms by now, as you might expect. Right, I think that's all the setting. It was also developed with a bunch of MIT students, in particular an M.Eng. student, Harald Prokop; it was his M.Eng. thesis. Those are all the citations I will give for now. I haven't posted them yet, but there are some lecture notes already on my webpage, and I will link to them from the course website. They give all the references for all the results I'll be talking about. They've all been done in the last five years or so, in particular starting in '99 when the first paper was published. But I won't give the specific citations in lecture.

This topic is related to the topic of last week, multithreaded algorithms, although at a somewhat high level; it's also about dealing with parallelism in modern machines. Throughout this class, including these last two lectures, we've had this very simple model of a computer where we have random access: you can access memory at a cost of one, and you can read and write a word of memory. There are some details about how big a word can be and whatnot.
It's a pretty basic, simple, flat model. And in multithreaded algorithms, the idea is that maybe you have multiple threads of computation running at once, but you still have this very flat memory: everyone can access anything in memory at a constant cost.

We're going to change that model now. We are going to recognize that on a real machine, the memory is some hierarchy. You have some CPU; you have some cache, probably on the same chip, the level 1 cache; you have some level 2 cache; if you're lucky, maybe you have some level 3 cache, before you get to main memory. And then you probably have a really big disk, and probably there's even some cache out there too, but I won't even think about that. So the point is, you have lots of different levels of memory, and what's changing here is that things very close to the CPU are very fast to access. Usually level 1 cache you can access in one clock cycle, or a few. And then things get slower and slower. Main memory still costs something like 70 ns or so to access a chunk out of, and that's a long time; 70 ns is, of course, a very long time. So as we go out here, we get slower. But we also get bigger. If we could put everything in level 1 cache, the problem would be solved: that would be a flat memory, where accessing everything takes the same amount of time, as we assumed. But usually we can't afford that; it's not even possible to put everything in level 1 cache. There's a reason why there is a memory hierarchy. Does anyone have a suggestion on what that reason might be? It's like one of these limits in life. Yeah? Fast memory is expensive. That's the practical limitation, indeed: you could try to build more and more level 1 cache, and maybe you could try to, well, yeah. Expense is a good reason, and practically that's maybe why the sizes are what they are. But suppose really fast memory were really cheap. There is a physical limitation on what's going on. Yeah? The speed of light. Yeah, that's a bit of a problem, right?
No matter how much, let's suppose you can only fit so many bits in an atom; you can only fit so many bits in a particular amount of space. If you want more bits, you need more space, and the more space you have, the longer it's going to take for a round trip. So if you think of your CPU as a point in space, it's relatively small and it has to get the data in, and the bigger the data, the farther away it has to be. You can think of spheres around the CPU (we live in 3-D, though chips are usually 2-D, but never mind): the sphere that's closer to the CPU is a lot faster to access, and as you get farther away it costs more. That's essentially what this model is representing, although it's a bit of an approximation of the intrinsic physics and geometry and whatnot. But that's the idea: the latency, the round-trip time to get to some of this memory, has to be big.

In general, the cost to access memory is made up of two things. There's the latency, the round-trip time, which in particular is limited by the speed of light. And on top of the round-trip time, you also have to get the data out, and depending on how much data you want, that could take longer. So there's a second term: let's say the amount of data divided by the bandwidth, where the bandwidth is the rate at which you can get data out. So the access cost is roughly the latency plus the amount divided by the bandwidth. And if you look at the bandwidth of these various levels of memory, it's all pretty much the same. If you have a well-designed computer, the bandwidths should all be the same. You can still get data off disk really, really fast, usually at about the speed of your bus, and the bus gets to the CPU hopefully as fast as everything else. So even though the outer levels are slower, they're really only slower in terms of latency. So this bandwidth part is maybe reasonable; the bandwidth looks pretty much the same universally. It's the latency that's going up. So, if the latency is going up but we still get to divide by the same bandwidth, what should we do to make the access cost at all these levels about the same? The bandwidth term is fixed.
Let's say the latency is increasing, but the bandwidth is still staying big. What could we do to balance this formula? Change the amount. As the latency goes up, if we increase the amount we fetch, then the amortized cost to access one element goes down. This is amortization in a very simple sense. The cost before was to access a whole block, let's say, and the amount was the size of the block. So the amortized cost to access one element is going to be the latency divided by the size of the block (the amount), plus one over the bandwidth. This is what you should implicitly be thinking in your head. I'm just dividing by the amount because the amount is how many elements you get in one access, let's suppose. So we get this formula for the amortized cost: latency divided by block size, plus one over bandwidth.

The one-over-bandwidth term is going to be good no matter what level we are on, I claim; there's no real fundamental limitation there, except that it might be expensive. And the latency we can amortize away using the amount: whatever the latency is, as the latency gets bigger out here, we just grab more and more stuff per access, and then we make these two terms equal, let's say. That would be a good way to balance things.

So in particular, disk has a really high latency. Not only are there speed-of-light issues, there's actually the speed of the head moving across the tracks of the disk. That takes a long time; there's physical motion. Everything else here doesn't usually have physical motion; it's just electric. So disk is really, really slow in latency, and when you read something off of disk, you might as well read a lot of data, like a megabyte or so. That figure is probably even old these days; maybe you read multiple megabytes when you read anything from disk, if you want these terms to be matched.
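To make that balancing concrete, here is a tiny back-of-the-envelope calculation. The specific numbers (10 ms latency, 100 MB/s bandwidth) and the helper name `amortized_cost` are illustrative assumptions, not figures from the lecture; the point is just that setting the two terms equal gives a block size of latency times bandwidth, which for disk-like numbers comes out around a megabyte, matching the figure above.

```python
# Illustrative numbers only (assumed, not from the lecture):
# a disk-like level with high latency, and a fixed bus bandwidth.
latency = 10e-3          # seconds per access (round trip)
bandwidth = 100e6        # bytes per second

def amortized_cost(block):
    # Amortized cost per element when we fetch `block` bytes per access:
    #   latency / block + 1 / bandwidth
    return latency / block + 1 / bandwidth

# Balancing the two terms (latency / block == 1 / bandwidth) gives
# block = latency * bandwidth, which is about a megabyte for these numbers.
balanced_block = latency * bandwidth
print(balanced_block)                  # 1,000,000 bytes
print(amortized_cost(balanced_block))  # both terms contribute equally
print(amortized_cost(1))               # fetching one byte at a time: ~latency each
```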
OK, there's a bit of a problem with doing that. Any suggestions what the problem would be? You have this algorithm, and whenever it reads something off of disk, it reads an entire megabyte of stuff around the element it asked for. So the amortized cost per access is going to be reasonable, but that's actually assuming something. Yeah? Right: I'm assuming I'm actually going to use the rest of that data. If I'm going to read 10 MB around the one element I asked for (I access A[i], and I get 10 million items of A around position i), it would be kind of good if the algorithm actually used that data for something. That seems reasonable. So this is spatial locality. The goal in this world of cache oblivious algorithms, and cache efficient algorithms in general, is to get algorithms that perform well when this is happening. So this is the idea of blocking, and we want the algorithm to use all, or at least most, of the elements in a block, a consecutive chunk of memory. That is spatial locality. Ideally we'd use all of them right then, but depending on your algorithm, that's a little bit tricky.

There is another issue, though. You read your 10 MB into main memory, let's say, and your main memory these days should be 4 GB or something, so you can actually read a lot of different blocks into main memory. What you'd like is to be able to use those blocks for as long as possible. Maybe you don't even reuse them: if you have a linear-time algorithm, you're probably only going to visit each element a constant number of times, so this is enough. But if your algorithm takes more than linear time, you're going to be accessing elements more than once. So it would be a good idea not only to use all the elements of the blocks, but to use them as many times as you can before you have to throw the block out. That's temporal locality. So ideally you even reuse blocks as much as possible. I mean, we have all these caches. I didn't write that word; just in case, I don't know how to spell it, but it's not the money kind. We should use those caches for something. The fact is that they store more than one block; each cache can store several blocks. How many? Well, we'll get to that in a second.

OK, so this is the general motivation, but at this point the model is still pretty damn ugly.
If you wanted to design an algorithm that runs well on this kind of machine directly, it's possible, but it's pretty difficult, and essentially never done, let's say, even though this is what real machines look like. At least in theory, and pretty much in practice, the main thing to do is to think about two levels at a time. This is a simplification over that model where we can say a lot more about algorithms. In the full model, each of the levels has a different block size and a different total size; it's a mess to deal with and design algorithms for. If you just think about two levels, it's relatively easy.

So, we have our CPU, which we assume has only a constant number of registers. Once it has a couple of data items, it can add them and whatnot. Then we have a really fast pipe, which I draw thick, to some cache. And we have a relatively narrow pipe to some really big other storage, which I will call main memory. That's the general picture. Now, this could represent any two of the levels. It could be between L3 cache and main memory; that's maybe what the naming corresponds to best. Or the cache here could in fact be main memory, what we consider the RAM of the machine, and what's called main memory over there could be the disk. It's whatever you care about. Usually, if you have a program, we assume everything fits in main memory, and then you care about the caching behavior, so you probably look between these two levels. That's probably what matters most in your program, because the cost differential here is really big relative to the cost differential there. If your data doesn't even fit in main memory and you have to go to disk, then you really care about that level, because the cost differential there is huge; it's like six orders of magnitude, let's say. So in practice you think of just the two memory levels that are most relevant.

OK, now I'm going to define some parameters.
I'm going to call them cache and main memory, just for clarity, because I like to think of main memory the way it used to be; now all we have to worry about is this extra thing called cache. It has some bounded size, and there's a block size. The block size is B, and the number of blocks is M/B, so the total size of the cache is M. Main memory is also blocked into blocks of size B, and we assume it has essentially infinite size; we don't care about its size in this picture. It's whatever is big enough to hold your algorithm's data, or data structure, or whatever. So that's the general model. And for strange historical reasons, which I don't want to get into, these things are called capital M and capital B. Even though M sounds a lot like memory, it's really the cache size; don't ask. This is to preserve history.

OK, now what do we do with this model? It seems nice, but what do we measure about it? What I'm going to assume is that the cache is really fast, so the CPU can access cache essentially instantaneously. I still have to pay for the computation that the CPU is doing, but I'm assuming the cache is close enough that I don't care. And main memory is so big that it has to be far away, and therefore this pipe is a problem. What I should really draw is that the pipe is still thick, but really long: the latency is high, the bandwidth is still high. And all transfers here happen as blocks. So the idea is: the CPU asks for A[i], asks for something in memory. If it's in the cache, it gets it; that's free. Otherwise, it has to grab the entire block containing that element from main memory, bring it into cache, maybe kick somebody out if the cache was full, and then the CPU can use that data and keep going, until it accesses something else that's not in cache, and then it has to grab that from main memory. When you kick something out, you actually write it back to memory. That's the model. So, we suppose the accesses to cache are free.
But we can still think about the running time of the algorithm. I'm not going to change the definition of running time; that would be the computation time, or the work, if you want to use multithreaded lingo. So we still have time, and T(N) will still mean what it did before. This is just an extra level of refinement in our understanding of what's going on, essentially measuring the parallelism that we can exploit out of the memory system: when you access something, you actually get B items. So that's the old stuff.

Now, what I want to do is count memory transfers. These are transfers of blocks, so I should say block memory transfers, between the two levels, between the cache and main memory. Memory transfers are either reads or writes; maybe I should say that. This is the number of block reads and writes from and to main memory. So I'm going to introduce some notation. This is new notation, so we'll see how it works out. I want MT(N) to represent the number of memory transfers, instead of just the normal time, on a problem of size N. Really, this is a function that depends not only on N but also on the parameters B and M of our model. So it really should be MT_{B,M}(N), but that's obviously pretty messy, so I'm going to stick to MT(N). Mainly I care about the growth in terms of N; well, I care about the growth in terms of all of these things, but the only thing I can change is N. So most of the time, like when we are writing recurrences, only N is changing. I can't recurse on the block size, and I can't recurse on the size of the cache; those are given to me, they're fixed. So we'll be changing N mainly. But B and M always matter here. They're not constants; they're parameters of the model.

OK, easy enough. This is something called the disk access model, if you like DAM models, or the external memory model, or the cache-aware model. Maybe I should mention that: this is the cache-aware model.
In general, you have some algorithm that runs on this kind of machine model; that's a cache-aware algorithm. We're not too interested in cache-aware algorithms here, but we've seen one: B-trees. B-trees are a cache-aware data structure. You assume that there is some block size B underlying everything. Maybe you didn't see exactly this model; in particular, it didn't really matter how big the cache was, because all you wanted was this: when I read B items, I can use all of them as much as possible and figure out where I fit among those B items. And that gives me log base B of N memory transfers instead of log N, which is what you would get if you just used your favorite balanced binary search tree. Log base B of N is definitely better than log base 2 of N. So B-trees are a cache-aware algorithm.

What we would like to do today and next lecture is get cache oblivious algorithms. There's essentially only one difference between cache-aware algorithms and cache oblivious algorithms: in cache oblivious algorithms, the algorithm doesn't know what B and M are. This is a bit of a subtle point, but a very cool idea. You assume that this is the model of the machine, and you care about the number of memory transfers between this cache of size M with blocking B and main memory with the same blocking B, but the algorithm doesn't actually know the parameters of the model. It looks like this, but you don't know the width and you don't know the height. Why not? Well, the analysis knows what B and M are; the algorithm doesn't. We are going to write algorithms which look just like the boring old algorithms we've seen throughout this class. That's one of the nice things about this model: every algorithm we have seen is a cache oblivious algorithm, because we didn't even know the word cache in this class until today. So we already have lots of algorithms to choose from. The thing is, some of them will perform well in this model, and some of them won't. So we would like to design algorithms, just like our old algorithms, that happen to perform well in this context, no matter what B and M are.
Another way to say this: the same algorithm should work well for all values of B and M, if you have a good cache oblivious algorithm.

There are a few consequences to this assumption. In a cache-aware algorithm, you can explicitly say: OK, I'm blocking my memory into chunks of size B; here they are; I'm going to store these B elements here and these B elements there. Because you know B, you can do that. You can say, now I want to read these B items into my cache, and then write out those ones over there. You can explicitly maintain your cache. With cache oblivious algorithms you can't, because you don't know what the cache is. So it all has to be implicit. And this is pretty much how caches work anyway, except for disk, so it's a pretty reasonable model. In particular, when you access an element that's not in cache, you automatically fetch the block containing that element, and you pay one memory transfer for that if it wasn't already there.

Another bit of a catch here: what if your cache is full? Then you've got to kick some block out of your cache. So we need some model of which block gets kicked out, because we can't control that; our algorithm has no knowledge of what the blocks are. What we're going to assume in this model is the ideal thing: when you fetch a new block, if your cache is full, you evict the block that will be used farthest in the future. Sorry, the furthest: farthest is distance, furthest is time. Furthest in the future. This would be the best possible thing to do. It's a little bit hard to do in practice, because you don't generally know the future, unless you're omniscient. So this is a bit of an idealized model. But it's pretty reasonable, in the sense that if you've read reading handout number 20, the paper by Sleator and Tarjan, they introduce the idea of competitive algorithms. We only talked about a small portion of that paper, the move-to-front heuristic for storing a list. But it also proves that there are good strategies for this problem, and maybe you heard about this in recitation.
Some people covered it, some didn't. These are called paging strategies. You want to maintain some cache of pages, or blocks, and you pay whenever you have to access a block that's not in your cache. The best thing to do is to always kick out the block that will be used furthest in the future, because that way you make the most use of the blocks that are in there. This turns out to be the optimal offline strategy, if you knew the future. But there are algorithms that are essentially constant competitive against this strategy. I don't want to get into details, because they're not exactly constant competitive, but they are sufficiently competitive for the purposes of this lecture that we can make this assumption and not worry about it. Most of the time we don't even really use this assumption, but there it is. That's the cache oblivious model. It makes things cleaner to think about: anything that should be done will be done. And you can simulate that with least recently used, or whatever good heuristic you want that's competitive against the optimal.

OK, that's pretty much the cache oblivious model: once you have the two-level model, you just assume you don't know B and M, and you have this automatic fetching and writing back of blocks, and whatnot. A little bit more to say; I guess it may be obvious at this point, but I've been drawing everything as tables, so it's not really clear what the linear order is. The linear order is just the reading order. Although we don't explicitly say it most of the time, the typical model is that memory is a linear array. Everything that you ever store in your program is written in this linear array. If you've ever programmed in assembly or whatever, that's the model. You have the address space, and any address between here and here is somewhere you can actually write; this is physical memory, and it's all you can write to. It starts at zero and goes out to, let's call it infinity, over here. And if you allocate some array, maybe it occupies some space in the middle; who knows. We usually don't think about that much.
What I care about now is that memory itself is blocked in this view. However your stuff is stored in memory, it's blocked into chunks of length B. So if this is position one (let me call it one, to be a little nicer), then this is position B, this is position B+1, this is 2B, this is 2B+1, and so on. Those are the indexes into memory, and that's how the blocking happens. If you access something here, you get the chunk from your position rounded down to the previous multiple of B, up to the next multiple of B. That's what you always get. So if you think about some array that's allocated, say, here, you have to keep in mind that the array may not be perfectly aligned with the blocks. But more or less it will be, so we don't care too much; it's a bit of a subtlety there.

OK, so that's pretty much the model. Every algorithm we've seen, except B-trees, is a cache oblivious algorithm. And our question is: we know how everything runs in terms of running time; now we want to measure the number of memory transfers, MT(N). I want to mention one other fact, or theorem. I'll put it in brackets because I don't want to state it precisely. If you have an algorithm that is efficient on two levels, in other words, if we just think about the two-level world and your algorithm is cache oblivious, then it is efficient on any number of levels in your memory hierarchy, say L levels. I don't want to define what efficient means, but the intuition is this: if your machine really looks like this and you have a cache oblivious algorithm, you can apply the cache oblivious analysis for all B and M. So you can analyze the number of memory transfers here, here, here, here, and here. And if you have a good cache oblivious algorithm, the performance at all those levels has to be good, and therefore the whole performance is good. Good here means asymptotically optimal up to constant factors, something like that. I don't want to prove that; you can read the cache oblivious papers. That's a nice fact about cache oblivious algorithms. If you have a cache-aware algorithm that tunes to a particular value of B and a particular value of M, you don't get that property. So this is one nice feature of cache obliviousness. Another nice feature is that when you are coding the algorithm, you don't have to put in B and M, so that simplifies things a bit.
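None of this appears as code in the lecture, but here is a minimal sketch of how you could count memory transfers in the two-level model, just to make the definitions concrete. The class name TwoLevelModel is mine; it assumes a fully associative cache of M words with block size B, and it uses LRU eviction as a stand-in for the ideal furthest-in-the-future policy, which is in the spirit of the competitiveness argument above. Reads and writes are treated alike.

```python
from collections import OrderedDict

class TwoLevelModel:
    """Count block transfers between a cache of M words (block size B)
    and an infinite main memory, under LRU eviction."""
    def __init__(self, M, B):
        self.B, self.num_blocks = B, M // B
        self.cache = OrderedDict()   # block id -> None, kept in LRU order
        self.transfers = 0

    def access(self, addr):
        block = addr // self.B
        if block in self.cache:
            self.cache.move_to_end(block)       # cache hit: free
        else:
            self.transfers += 1                 # fetch the block: one memory transfer
            if len(self.cache) == self.num_blocks:
                self.cache.popitem(last=False)  # evict the least recently used block
            self.cache[block] = None

# Example: scanning an array of N words touches about N/B blocks.
mem = TwoLevelModel(M=1024, B=16)
N = 10_000
for i in range(N):          # visit A[0..N-1] in order
    mem.access(i)
print(mem.transfers)        # N/B = 625 transfers here
```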
So, let's do some algorithms; enough about models. We're going to start out with some really simple things, just to get warmed up on the analysis side. The most basic thing you can do that's good in a cache oblivious world is scanning. Scanning is just visiting the items in an array in order: visit A_1 up to A_N, in order, for some notion of visit, presumably some constant-time operation. For example, suppose you want to compute an aggregate of the array; you want to sum all the elements. You have one extra variable you're using, but you can store that in a register or whatever. So that's one simple example: sum the array.

Here's the picture. We have our memory; each of these cells represents one item, one element, log N bits, one word, whatever. Our array is somewhere in here, maybe there, and we go from here to here to here to here, and so on. So what does this cost? What is the number of memory transfers? We know this is a linear-time algorithm; it takes order N time. What does it cost in terms of memory transfers? N over B, pretty much. We like to say it's order N/B plus two, or plus one inside the big O. That plus one is a bit of a worry: N could be smaller than B. We really want to think about all the cases, especially because usually you're not doing this on something of size N; you're doing it on something of size k, where we don't really know much about k. In general it's N/B plus one, because we always need at least one memory transfer to look at anything, unless N is zero. And in particular it's plus two, if you care about the constants. If I don't write the big O, then it's plus two at most: you might essentially waste the first block, and then everything is fine for a while; and then, if you're unlucky, you essentially waste the last block, because there's just one element of the array in that block and you're not getting much out of it. Every block in the middle, though, every block between the first and the last, has to be full, so you're using all of those elements. So out of the N elements, you only have about N/B blocks, because each block has B elements. OK, that's pretty trivial.
Let me do something slightly more interesting, which is two scans at once. Here, by the way, we're not assuming anything about M, nothing about the size of the cache, just that it can hold a single block: the last block that we visited has to be there. But you can also do a constant number of parallel scans. This is not really parallel in the multithreaded sense; it's simulated parallelism. If you have a constant number of scans, you do one, do the other, do the other, come back, come back, come back; you visit them in turn, round robin, whatever.

For example, here's a cute piece of code. If you want to reverse an array, this is a good puzzle: you can do it by essentially two scans, where you repeatedly swap the first and last elements. So I'm swapping A[i] with A[N - i + 1], with i starting at one. So here's your array; suppose this is actually my array. I swap these two guys, then I swap these two guys, and so on. That will reverse my array, and hopefully it handles the middle element correctly as well if N is odd; it shouldn't do anything to it. And you can view this as two scans: there's one scan coming in this way, and there's also a reverse scan (ooh, more sophisticated) coming back this way. Of course, a reverse scan has the same analysis. And as long as your cache is big enough to store at least two blocks, which is a pretty reasonable assumption, so let's write it: assuming the number of blocks in the cache, which is M/B, is at least two, the number of memory transfers for this algorithm is still order N/B plus one. The constant goes up, maybe, but in this case it probably doesn't; who cares. As long as you're doing a constant number of scans over a constant number of arrays, one of which happens to be reversed, whatever, it will take what we call linear time: linear in the number of blocks of your input. OK, great. So now you can reverse an array: exciting.
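Here is roughly what that cute piece of code looks like as a sketch in Python (0-indexed, whereas the board uses a 1-indexed A[1..N]; the function name reverse_array is mine). The two indices are exactly the forward scan and the reverse scan, so as long as the cache holds at least two blocks, the whole thing costs O(N/B + 1) memory transfers.

```python
def reverse_array(A):
    """Reverse A in place with two parallel scans: one from the front,
    one from the back, swapping as they go."""
    i, j = 0, len(A) - 1
    while i < j:
        A[i], A[j] = A[j], A[i]   # swap A[i] with A[N - 1 - i]
        i += 1
        j -= 1
    # If len(A) is odd, the middle element is left alone, as it should be.
    return A

print(reverse_array([1, 2, 3, 4, 5]))   # [5, 4, 3, 2, 1]
```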
Let's try another simple algorithm on another board: binary search. Just like last week, we're going back to basics here. Scanning we didn't even talk about in this class; binary search is something we talked about a little bit. It was a simple divide and conquer algorithm; I hope you all remember it. If we look at an array (I'm not going to draw the cells here, because I want to imagine a really big array), suppose binary search always goes to the left: it starts by visiting the element in the middle, then it goes to the quarter mark, then it goes to the one-eighth mark, and so on. This is one hypothetical execution of a binary search. Eventually it finds the element it's looking for, or at least it finds where that element fits; so x is over here. We know that it takes log N time. How many memory transfers does it take? Now I've blocked this array into chunks of size B, blocks of size B. How many blocks do I touch? This one's a little bit more subtle. It depends on the relative sizes of N and B, yeah. Log base B of N would be a good guess. We would like it, let's say hope, to be log base B of N, because we know that B-trees can search in what's essentially a sorted list of N items in log base B of N memory transfers. That turns out to be optimal: in the cache oblivious model, or even in the two-level model, you've got to pay log base B of N. I won't prove that here; it's for the same reason you need log N comparisons to do binary search in the normal model.
Alas, it is possible to get log base B of N even without knowing B, but binary search does not do it. Log of N over B, yes. The number of memory transfers on N items is order log(N/B), plus one let's say, which is log N minus log B. Whereas log base B of N is log N divided by log B, and clearly dividing is much better than subtracting. So this would be good, but this is bad. Most of the time this is basically log N, which is no better; you're essentially not using blocks at all. The idea is, out here there's some little tiny block that contains the thing you're looking for (how tiny depends on how big B is). Each of these accesses will be in a different block until you get within about one block's worth of x. When you get within one block of x, there's only a constant number of blocks that matter, and so all of those accesses are indeed within the same block. But how many accesses are there in that range? Well, just log B, because if you're within an interval of size k, you're only going to spend log k steps in it. So you're saving log B at the end, but overall you're paying log N, so you only get log N minus log B, plus some constant. So this is bad news for binary search. Not all of the algorithms we've seen are going to work well in this model. We need a lot more thinking before we can solve what is essentially the binary search problem, finding an element in a sorted list, in log base B of N memory transfers without knowing B. We know we could use B-trees; if you knew B, great, that works, and it's optimal. But without knowing B, it's a little bit harder.
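To see how big the gap is, here is a small illustrative calculation; the particular N and B are made up for the example, not values from the lecture.

```python
import math

# Illustrative sizes: a billion-element sorted array, 1024-element blocks.
N = 2**30
B = 2**10

binary_search = math.log2(N) - math.log2(B)   # ~ memory transfers for binary search
btree_search  = math.log2(N) / math.log2(B)   # ~ memory transfers for a B-tree: log base B of N

print(binary_search)   # 20.0, barely better than the 30 of a plain log N
print(btree_search)    # 3.0, what we'd like to match without knowing B
```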
And this gets us into the world of divide and conquer. Like last week, and like the first few weeks of this class, divide and conquer is your friend. It turns out divide and conquer is not the only tool, but it's a really useful tool in designing cache oblivious algorithms. Let me say why. We'll see a bunch of divide and conquer based cache oblivious algorithms, and the intuition is that we can take all the favorite algorithms we have. Obviously it doesn't always work; binary search was a divide and conquer algorithm, and it's not so great. But in general, the idea is that your algorithm just does the normal divide and conquer thing: you divide your problem into subproblems of smaller size, repeatedly, all the way down to problems of constant size, just like before. But if you're recursively dividing your problem into smaller things, at some point you can think about it and say, wait: the algorithm divides all the way down, but in the analysis we can think about the point at which the problem fits in a block, or fits in cache. And that's the analysis. We'll think about the time when your problem is small enough that we can analyze it in some other way. Usually we analyze the algorithm recursively and get a recurrence, and what we're changing, essentially, is the base case. In the base case, we don't want to go down to constant size; that's too far. I'll show you some examples. We want to consider the point in the recursion at which either the problem fits in cache, so it has size less than or equal to M, or it fits in order one blocks, which is another natural place to stop; order one blocks would be even better than fitting in cache, and it means size order B. This will change the base case of the recurrence, and it will turn out to give us good answers instead of bad ones.

So let's do a simple example: our good friend order statistics, in particular finding medians. I hope you all know this by heart. Remember the worst-case linear-time median-finding algorithm by Blum et al. I'll write this fast. It turns out this is a good algorithm as it is. We partition our array conceptually into N/5 five-tuples, little groups of five; this may not be exactly how I wrote it last time, I didn't check, but it's the same algorithm. You compute the median of each five-tuple. Then you recursively compute the median of those medians.
721 00:47:11,000 --> 00:47:15,000 Then, you partition around x. So, that gave us some element 722 00:47:15,000 --> 00:47:20,000 that was roughly in the middle. It was within the middle half, 723 00:47:20,000 --> 00:47:22,000 I think. Partition around x, 724 00:47:22,000 --> 00:47:27,000 and then we show that you could always recurse on just one of 725 00:47:27,000 --> 00:47:29,000 the sides. 726 00:47:38,000 --> 00:47:41,000 OK, this was our good old friend for computing, 727 00:47:41,000 --> 00:47:43,000 order statistics, or medians, or whatnot. 728 00:47:43,000 --> 00:47:47,000 OK, so how much time does this, well, we know how much time 729 00:47:47,000 --> 00:47:50,000 this takes. It should be linear time. 730 00:47:50,000 --> 00:47:52,000 But how many memory transfers does this take? 731 00:47:52,000 --> 00:47:56,000 Well, conceptually partitioning that, I can do, 732 00:47:56,000 --> 00:47:58,000 in zero. Maybe I have to compute N over 733 00:47:58,000 --> 00:48:02,000 five, no big deal here. We're not thinking about 734 00:48:02,000 --> 00:48:05,000 computation. I have to find the median of 735 00:48:05,000 --> 00:48:07,000 each tuple. So, here it matters how my 736 00:48:07,000 --> 00:48:10,000 array is laid out. But, what I'm going to do is 737 00:48:10,000 --> 00:48:13,000 take my array, take the first five elements, 738 00:48:13,000 --> 00:48:16,000 and then the next five elements and so on. 739 00:48:16,000 --> 00:48:20,000 Those will be my five tuples. So, I can implement this just 740 00:48:20,000 --> 00:48:23,000 by scanning, and then computing the median on those five 741 00:48:23,000 --> 00:48:27,000 elements, which I stored in the five registers on my CPU. 742 00:48:27,000 --> 00:48:32,000 I'll assume that there are enough registers for that. 743 00:48:32,000 --> 00:48:35,000 And, I compute the median, write it out to some array out 744 00:48:35,000 --> 00:48:38,000 here. So, it's going to be one 745 00:48:38,000 --> 00:48:40,000 element. So, the median of here goes 746 00:48:40,000 --> 00:48:43,000 into there. The median of these guys goes 747 00:48:43,000 --> 00:48:46,000 into there, and so on. So, I'm scanning in here, 748 00:48:46,000 --> 00:48:50,000 and in parallel, I'm scanning an output in here. 749 00:48:50,000 --> 00:48:54,000 So, it's two parallel scans. So, that takes linear time. 750 00:48:54,000 --> 00:48:59,000 So, this takes order N over B plus one memory transfers. 751 00:48:59,000 --> 00:49:03,000 OK, then we have recursively compute the median of the 752 00:49:03,000 --> 00:49:06,000 medians. This step used to be T of N 753 00:49:06,000 --> 00:49:09,000 over five. Now it's MT of N over five, 754 00:49:09,000 --> 00:49:12,000 OK, with the same values of B and M. 755 00:49:12,000 --> 00:49:17,000 Then we partition around x. Partitioning is also like three 756 00:49:17,000 --> 00:49:19,000 parallel scans if you work it out. 757 00:49:19,000 --> 00:49:24,000 So, this is also going to take linear memory transfers, 758 00:49:24,000 --> 00:49:28,000 N over B plus one. And then, we recurse on one of 759 00:49:28,000 --> 00:49:33,000 the sides, and this is the fun part of the analysis which I 760 00:49:33,000 --> 00:49:37,000 won't repeat here. But, we get MT of, 761 00:49:37,000 --> 00:49:42,000 like, three quarters N. I think originally it was seven 762 00:49:42,000 --> 00:49:45,000 tenths, so we simplified to three quarters, 763 00:49:45,000 --> 00:49:49,000 which is hopefully bigger than seven tenths. 764 00:49:49,000 --> 00:49:52,000 Yeah, it is. 
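As a rough sketch of the two parallel scans just described (this is an illustration, not code from the lecture; it assumes the input array is stored contiguously, so reading it and writing the output are both sequential scans, costing about N/B + 1 memory transfers together):

def medians_of_five_tuples(a):
    """Scan the input array in groups of five and write out the median
    of each group: one sequential read scan plus one sequential write
    scan, so roughly O(N/B + 1) memory transfers on a contiguous array."""
    out = []
    for i in range(0, len(a), 5):
        group = sorted(a[i:i + 5])          # at most 5 elements: constant work, fits in registers
        out.append(group[len(group) // 2])  # median of the little group
    return out

The output array of N/5 medians is what the recursive call then works on.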
OK, so this is the new 765 00:49:52,000 --> 00:49:55,000 analysis. Now we get a recurrence. 766 00:49:55,000 --> 00:49:58,000 So, let's do that. 767 00:50:16,000 --> 00:50:22,000 So, the analysis is we get this MT of N is MT of N over five 768 00:50:22,000 --> 00:50:29,000 plus MT of three quarters N plus, this is just as before. 769 00:50:29,000 --> 00:50:35,000 Before we had linear work here. And now, we have what we call 770 00:50:35,000 --> 00:50:39,000 linear number of memory transfers, linear number of 771 00:50:39,000 --> 00:50:41,000 blocks. OK, I'll sort of ignore this 772 00:50:41,000 --> 00:50:44,000 plus one. It's not too critical. 773 00:50:44,000 --> 00:50:48,000 So, this is our recurrence. Now, it depends what our base 774 00:50:48,000 --> 00:50:51,000 case is. And, usually we would use a 775 00:50:51,000 --> 00:50:55,000 base case of constant size. So, let's see what happens if 776 00:50:55,000 --> 00:51:00,000 we use a base case of constant size just so that it's clear why 777 00:51:00,000 --> 00:51:05,000 this base case is so important. OK, this describes a recurrence 778 00:51:05,000 --> 00:51:07,000 as one of these hairy recurrences. 779 00:51:07,000 --> 00:51:09,000 And, I don't want to use substitution. 780 00:51:09,000 --> 00:51:12,000 I just want the intuition of why this is going to solve to 781 00:51:12,000 --> 00:51:14,000 something rather big. OK, and for me, 782 00:51:14,000 --> 00:51:17,000 the best intuition always comes from recursion trees. 783 00:51:17,000 --> 00:51:20,000 If you don't know the solution to recurrence and you need a 784 00:51:20,000 --> 00:51:24,000 good guess, use recursion trees. And today, I will only give you 785 00:51:24,000 --> 00:51:26,000 good guesses. I don't want to prove anything 786 00:51:26,000 --> 00:51:31,000 with substitution because I want to get to the bigger ideas. 787 00:51:31,000 --> 00:51:34,000 So, this is even messy from a recursion tree point of view 788 00:51:34,000 --> 00:51:38,000 because you have these unbalanced sizes where you start 789 00:51:38,000 --> 00:51:40,000 at the root with some of size N over B. 790 00:51:40,000 --> 00:51:44,000 Then you split it into something size one fifth N over 791 00:51:44,000 --> 00:51:47,000 B, and something of size three quarters N over B, 792 00:51:47,000 --> 00:51:51,000 which is annoying because now this subtree will be a lot 793 00:51:51,000 --> 00:51:54,000 bigger than this one, or this one will terminate 794 00:51:54,000 --> 00:51:56,000 faster. So, it's pretty unbalanced. 795 00:51:56,000 --> 00:52:00,000 But, summing per level doesn't really tell you a lot at this 796 00:52:00,000 --> 00:52:02,000 point. But let's just look at the 797 00:52:02,000 --> 00:52:07,000 bottom level. Look at all the leaves in this 798 00:52:07,000 --> 00:52:10,000 recursion tree. So, that's the base cases. 799 00:52:10,000 --> 00:52:13,000 How many base cases are there? This is an interesting 800 00:52:13,000 --> 00:52:16,000 question. We've never thought about it in 801 00:52:16,000 --> 00:52:21,000 the context of this recurrence. It gives a somewhat surprising 802 00:52:21,000 --> 00:52:23,000 answer. It was surprising to me the 803 00:52:23,000 --> 00:52:27,000 first time I worked it out. So, how many leaves does this 804 00:52:27,000 --> 00:52:32,000 recursion tree have? Well, we can write a 805 00:52:32,000 --> 00:52:35,000 recurrence. 
The number of leaves in a 806 00:52:35,000 --> 00:52:41,000 problem of size N, it's going to be the number of 807 00:52:41,000 --> 00:52:47,000 leaves in this problem plus the number of leaves in this problem 808 00:52:47,000 --> 00:52:52,000 plus zero. So, that's another recurrence. 809 00:52:52,000 --> 00:52:57,000 We'll call this L of N. OK, now the base case is really 810 00:52:57,000 --> 00:53:02,000 relevant. It determines the solution to 811 00:53:02,000 --> 00:53:04,000 this recurrence. And let's, again, 812 00:53:04,000 --> 00:53:08,000 assume that in a problem of size one, we have one leaf. 813 00:53:08,000 --> 00:53:12,000 That's our only base case. Well, it turns out, 814 00:53:12,000 --> 00:53:14,000 and here you need to guess, I think. 815 00:53:14,000 --> 00:53:17,000 This is not particularly obvious. 816 00:53:17,000 --> 00:53:21,000 Any of the TA's have guesses of the form of this solution? 817 00:53:21,000 --> 00:53:25,000 Or anybody, not just TA's. But this is open to everyone. 818 00:53:25,000 --> 00:53:28,000 If Charles were here, I would ask him. 819 00:53:28,000 --> 00:53:31,000 I had to think for a while, and it's not linear, 820 00:53:31,000 --> 00:53:37,000 right, because you're somehow decreasing quite a bit. 821 00:53:37,000 --> 00:53:42,000 So, it's smaller than linear, but it's more than a constant. 822 00:53:42,000 --> 00:53:47,000 OK, it's actually more than polylog, so what's your favorite 823 00:53:47,000 --> 00:53:50,000 function in the middle? N over log N, 824 00:53:50,000 --> 00:53:53,000 that's still too big. Keep going. 825 00:53:53,000 --> 00:53:57,000 You have an oracle here, so you can, N to the k, 826 00:53:57,000 --> 00:54:00,000 yeah, close. I mean, k is usually an 827 00:54:00,000 --> 00:54:04,000 integer. N to the alpha for some real 828 00:54:04,000 --> 00:54:09,000 number between zero and one. Yeah, that's what you meant. 829 00:54:09,000 --> 00:54:11,000 Sorry. It's like the shortest 830 00:54:11,000 --> 00:54:15,000 mathematical joke. Let epsilon be less than zero 831 00:54:15,000 --> 00:54:18,000 or for a sufficiently large epsilon. 832 00:54:18,000 --> 00:54:21,000 I don't know. So, you've got to use the right 833 00:54:21,000 --> 00:54:25,000 letters. So, let's suppose that it's N 834 00:54:25,000 --> 00:54:28,000 to the alpha. Then we would get this N over 835 00:54:28,000 --> 00:54:32,000 five to the alpha, and we'd get three quarters N 836 00:54:32,000 --> 00:54:36,000 to the alpha. When you have a nice recurrence 837 00:54:36,000 --> 00:54:40,000 like this, you can just try plugging in a guess and see 838 00:54:40,000 --> 00:54:42,000 whether it works, OK, and of course this will 839 00:54:42,000 --> 00:54:46,000 work only depending on alpha. So, we should get an equation 840 00:54:46,000 --> 00:54:49,000 on alpha here. So, everything has an N to the 841 00:54:49,000 --> 00:54:51,000 alpha, in fact, all of these terms. 842 00:54:51,000 --> 00:54:53,000 So, I can divide through my N to the alpha. 843 00:54:53,000 --> 00:54:56,000 That's assuming that it's not zero or something. 844 00:54:56,000 --> 00:54:59,000 That seems reasonable. So, we have one equals one 845 00:54:59,000 --> 00:55:04,000 fifth to the alpha plus three quarters to the alpha. 846 00:55:04,000 --> 00:55:10,000 This is something you won't get on a final because I don't know 847 00:55:10,000 --> 00:55:15,000 any good way to solve this except with like Maple or 848 00:55:15,000 --> 00:55:19,000 Mathematica. 
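If you want to see where the constant comes from, here is a quick numerical sketch (not from the lecture) that solves this kind of equation by bisection. Note that with the simplified constants 1/5 and 3/4 the root comes out near 0.91, and it is the original 7/10 that gives a value near 0.84, which is where the "about 0.8" quoted in a moment comes from; either way, the exponent is a constant strictly between zero and one, which is all the argument needs.

def solve_alpha(c1, c2, iters=60):
    """Bisection for the alpha in (0, 2) with c1**alpha + c2**alpha == 1.
    The left-hand side is strictly decreasing in alpha, so the root is unique."""
    lo, hi = 0.0, 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if c1 ** mid + c2 ** mid > 1:
            lo = mid   # still above 1, so alpha must be larger
        else:
            hi = mid
    return (lo + hi) / 2

print(solve_alpha(1 / 5, 3 / 4))    # about 0.91 for the simplified constants
print(solve_alpha(1 / 5, 7 / 10))   # about 0.84 for the original 7/10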
If you're smart I'm sure you 849 00:55:19,000 --> 00:55:24,000 could compute it in a nicer way, but alpha is about 0.8, 850 00:55:24,000 --> 00:55:28,000 it turns out. So, the number of leaves is 851 00:55:28,000 --> 00:55:34,000 this sort of in between constant and linear. 852 00:55:34,000 --> 00:55:37,000 Usually polynomial means you have an integer power. 853 00:55:37,000 --> 00:55:40,000 Let's call it a polynomial. Why not? 854 00:55:40,000 --> 00:55:43,000 There's a lot of leaves, is the point, 855 00:55:43,000 --> 00:55:47,000 and if we say that each leaf costs a constant number of 856 00:55:47,000 --> 00:55:50,000 memory transfers, we're in trouble because then 857 00:55:50,000 --> 00:55:54,000 the number of memory transfers has to be at least this. 858 00:55:54,000 --> 00:55:58,000 If it's at least that, that's potentially bigger than 859 00:55:58,000 --> 00:56:02,000 N over B, I mean, bigger than in an asymptotic 860 00:56:02,000 --> 00:56:06,000 sense. This is little omega of N over 861 00:56:06,000 --> 00:56:10,000 B if B is big. If B is at least N to the 0.2 862 00:56:10,000 --> 00:56:14,000 something, OK, or one seventh something. 863 00:56:14,000 --> 00:56:18,000 But if, in particular, B is at least N to the 0.2, 864 00:56:18,000 --> 00:56:22,000 then this should be bigger than that. 865 00:56:22,000 --> 00:56:27,000 So, this is a bad analysis because we're not going to get 866 00:56:27,000 --> 00:56:32,000 the answer we want, which is N over B. 867 00:56:32,000 --> 00:56:35,000 The best you can do for median is N over B because you have to 868 00:56:35,000 --> 00:56:38,000 read all the element, and you should spend linear 869 00:56:38,000 --> 00:56:40,000 time. So, we want to get N over B. 870 00:56:40,000 --> 00:56:42,000 This algorithm is N over B plus one. 871 00:56:42,000 --> 00:56:45,000 So, this is why you need a good base case, all right? 872 00:56:45,000 --> 00:56:48,000 So that makes the point. So, the question is, 873 00:56:48,000 --> 00:56:51,000 what base case should I use? 874 00:57:04,000 --> 00:57:06,000 So, we have this recurrence 875 00:57:21,000 --> 00:57:25,000 What base case should I use? Constant was too small. 876 00:57:25,000 --> 00:57:30,000 We have a couple of choices listed up here. 877 00:57:46,000 --> 00:57:55,000 Any suggestions? B, OK, MT of B is? 878 00:57:55,000 --> 00:58:01,000 The hard part. So, if my problem, 879 00:58:01,000 --> 00:58:07,000 if the size of my array fits in a block and I do all this stuff 880 00:58:07,000 --> 00:58:11,000 on it, how many memory transfers could that take? 881 00:58:11,000 --> 00:58:15,000 One, or a constant, depending on alignment. 882 00:58:15,000 --> 00:58:20,000 OK, maybe it takes two memory transfers, but constant. 883 00:58:20,000 --> 00:58:23,000 Good. That's clearly a lot better 884 00:58:23,000 --> 00:58:27,000 than this base case, MT of one equals order one, 885 00:58:27,000 --> 00:58:30,000 clearly stronger. So, hopefully, 886 00:58:30,000 --> 00:58:36,000 it gives the right answer, and now indeed it does. 887 00:58:36,000 --> 00:58:39,000 I love this analysis. So, I'm going to wave my hands. 888 00:58:39,000 --> 00:58:43,000 OK, but in particular, what this gives us, 889 00:58:43,000 --> 00:58:47,000 if we do the previous analysis, what is the number of leaves? 890 00:58:47,000 --> 00:58:51,000 So, in the leaves, now L of B equals one instead 891 00:58:51,000 --> 00:58:54,000 of L of one equals one. So, this stops earlier. 892 00:58:54,000 --> 00:58:59,000 When does it stop? 
Well, instead of getting N to 893 00:58:59,000 --> 00:59:02,000 the order of 0.8, whatever, we get N over B to 894 00:59:02,000 --> 00:59:06,000 the power of 0.8 whatever. OK, so it turns out the number 895 00:59:06,000 --> 00:59:10,000 of leaves is N over B to the alpha, which is little o of N 896 00:59:10,000 --> 00:59:12,000 over B. So, we don't care. 897 00:59:12,000 --> 00:59:15,000 It's tiny. And, if you look at the root 898 00:59:15,000 --> 00:59:17,000 cost is N over B in the recursion tree, 899 00:59:17,000 --> 00:59:22,000 the leaf cost is little o of N over B, and if you wave your 900 00:59:22,000 --> 00:59:26,000 hands, and close your eyes, and squint, the cost should be 901 00:59:26,000 --> 00:59:29,000 geometrically decreasing as we go down, I hope, 902 00:59:29,000 --> 00:59:34,000 more or less. It's a bit messy because of all 903 00:59:34,000 --> 00:59:39,000 the things terminating, but let's say cost is roughly 904 00:59:39,000 --> 00:59:42,000 geometric. Don't do this in the final, 905 00:59:42,000 --> 00:59:47,000 but you won't have any messy recurrences like this. 906 00:59:47,000 --> 00:59:50,000 So, don't worry. Down the tree, 907 00:59:50,000 --> 00:59:55,000 so you'd have to prove this formally, but I claim that the 908 00:59:55,000 --> 01:00:01,000 root cost dominates. And, the root cost is N over B. 909 01:00:13,000 --> 01:00:16,591 So, we get N over B. OK, so this is a nice, 910 01:00:16,591 --> 01:00:21,892 linear time algorithm for order statistics for cache oblivious. 911 01:00:21,892 --> 01:00:24,970 Great. This may turn you off a little 912 01:00:24,970 --> 01:00:29,758 bit, but even though this is like the simplest algorithm, 913 01:00:29,758 --> 01:00:34,460 it's also probably the most complicated analysis that we 914 01:00:34,460 --> 01:00:36,846 will do. In the future, 915 01:00:36,846 --> 01:00:40,234 our algorithms will be more complicated, and the analyses 916 01:00:40,234 --> 01:00:42,533 will be relatively simple. And usually, 917 01:00:42,533 --> 01:00:45,255 it's that way with cache oblivious algorithms. 918 01:00:45,255 --> 01:00:48,824 So, I'm giving you this sort of as the intuition of why this 919 01:00:48,824 --> 01:00:51,425 should be enough. Then you have to prove it. 920 01:00:51,425 --> 01:00:54,933 OK, let's go to another problem where divide and conquer is 921 01:00:54,933 --> 01:00:57,716 useful, our good friend, matrix multiplication. 922 01:00:57,716 --> 01:01:01,164 I don't know how many times we've seen this in this class, 923 01:01:01,164 --> 01:01:04,370 but in particular we saw it last week with a recursive 924 01:01:04,370 --> 01:01:08,000 matrix multiply, multithreaded algorithm. 925 01:01:08,000 --> 01:01:11,708 So, I won't give you the algorithm yet again, 926 01:01:11,708 --> 01:01:16,176 but we're going to analyze it in a very different way. 927 01:01:16,176 --> 01:01:20,475 So, we have C and we have A, and actually up to you. 928 01:01:20,475 --> 01:01:24,521 So, I could cover standard matrix multiplication, 929 01:01:24,521 --> 01:01:30,000 which is when you do it row by row, and column by column. 930 01:01:30,000 --> 01:01:32,331 And, we could see why that's bad. 931 01:01:32,331 --> 01:01:36,485 And then, we could do the recursive one and see why that's 932 01:01:36,485 --> 01:01:39,036 good. Or, we could skip the standard 933 01:01:39,036 --> 01:01:41,951 algorithm. So, how many people would like 934 01:01:41,951 --> 01:01:44,866 to see why the standard algorithm is bad? 
935 01:01:44,866 --> 01:01:47,198 Because it's not totally obvious. 936 01:01:47,198 --> 01:01:49,603 One, two, three, four, five, half? 937 01:01:49,603 --> 01:01:53,611 Wow, that's a lot of votes. Now, how many people want to 938 01:01:53,611 --> 01:01:55,433 skip to the chase? No one. 939 01:01:55,433 --> 01:01:58,129 One, OK. And, everyone else is asleep. 940 01:01:58,129 --> 01:02:01,190 So, that's pretty good, 50% awake, not bad. 941 01:02:01,190 --> 01:02:06,000 OK, then, so standard matrix multiplication. 942 01:02:06,000 --> 01:02:10,036 I'll do this fast because it is, I mean, you all know the 943 01:02:10,036 --> 01:02:13,207 algorithm, right? To compute this value of C, 944 01:02:13,207 --> 01:02:17,099 in A you take this row, and in B you take this column. 945 01:02:17,099 --> 01:02:19,477 Sorry, I did that a little bit sloppily. 946 01:02:19,477 --> 01:02:21,927 But this is supposed to be aligned. 947 01:02:21,927 --> 01:02:24,378 Right? So I take all of this stuff, 948 01:02:24,378 --> 01:02:27,837 I multiply it with all of the stuff, add them up, 949 01:02:27,837 --> 01:02:31,949 the dot product. That gives me this element. 950 01:02:31,949 --> 01:02:35,487 And, let's say I do them in this order row by row. 951 01:02:35,487 --> 01:02:39,241 So for every item in C, I loop over this row and this 952 01:02:39,241 --> 01:02:41,624 column, B, multiply them together. 953 01:02:41,624 --> 01:02:44,151 That is an access pattern in memory. 954 01:02:44,151 --> 01:02:48,555 So, exactly how much that costs depends how these matrices are 955 01:02:48,555 --> 01:02:51,732 laid out in memory. OK, this is a subtlety we 956 01:02:51,732 --> 01:02:55,703 haven't had to worry about before because everything was 957 01:02:55,703 --> 01:02:58,519 uniform. I'm going to assume, to give the 958 01:02:58,519 --> 01:03:02,057 standard algorithm the best chances of being good, 959 01:03:02,057 --> 01:03:05,956 I'm going to store C in row major order, A in row major 960 01:03:05,956 --> 01:03:10,000 order, and B in column major order. 961 01:03:10,000 --> 01:03:14,983 So, everything is nice and you're scanning. 962 01:03:14,983 --> 01:03:19,254 So then this inner product is a scan. 963 01:03:19,254 --> 01:03:21,389 Cool. Sounds great, 964 01:03:21,389 --> 01:03:24,711 doesn't it? It's bad, though. 965 01:03:24,711 --> 01:03:31,000 Assume A is row major, and B is column major. 966 01:03:31,000 --> 01:03:33,911 And C, you could assume is really either way, 967 01:03:33,911 --> 01:03:37,750 but if I'm doing it row by row, I'll assume it's row major. 968 01:03:37,750 --> 01:03:41,257 So, this is what I call the layout, the memory layout, 969 01:03:41,257 --> 01:03:43,904 of these matrices. OK, it's good for this 970 01:03:43,904 --> 01:03:46,551 algorithm, but the algorithm is not good. 971 01:03:46,551 --> 01:03:49,000 So, it won't be that great. 972 01:04:12,000 --> 01:04:16,227 So, how long does this take? How many memory transfers? 973 01:04:16,227 --> 01:04:20,533 We know it takes N^3 time. Not going to try and beat N^3 974 01:04:20,533 --> 01:04:22,882 here. Just going to try and get 975 01:04:22,882 --> 01:04:26,249 standard matrix multiplication going faster. 976 01:04:26,249 --> 01:04:30,711 So, well, for each item over here I pay N over B to do the 977 01:04:30,711 --> 01:04:36,801 scans and get the inner product. So, N over B per item. 978 01:04:36,801 --> 01:04:42,659 So, it's N over B, or we could go with the plus 979 01:04:42,659 --> 01:04:49,408 one here, to compute each c_ij.
So that would suggest, 980 01:04:49,408 --> 01:04:54,883 as an upper bound at least, it's N^3 over B. 981 01:04:54,883 --> 01:05:00,996 OK, and indeed that is the right bound, so theta. 982 01:05:00,996 --> 01:05:08,000 This is memory transfers, not time, obviously. 983 01:05:08,000 --> 01:05:12,349 That is indeed the case because if you look at consecutive, 984 01:05:12,349 --> 01:05:14,525 I do this c_ij, then this one, 985 01:05:14,525 --> 01:05:18,125 this one, this one, this one, keep incrementing j 986 01:05:18,125 --> 01:05:20,074 and keeping I fixed, right? 987 01:05:20,074 --> 01:05:23,824 So, the row that I use stays fixed for a long time. 988 01:05:23,824 --> 01:05:27,875 I get to reuse that if it happens, say that that fits a 989 01:05:27,875 --> 01:05:32,150 block maybe, I get to reuse that row several times if that 990 01:05:32,150 --> 01:05:36,631 happens to fit in cache. But the column is changing 991 01:05:36,631 --> 01:05:39,642 every single time. OK, so every time I moved here 992 01:05:39,642 --> 01:05:43,093 and compute the next c_ij, even if a column could fit in 993 01:05:43,093 --> 01:05:45,790 cache, I can't fit all the columns in cache. 994 01:05:45,790 --> 01:05:48,174 And the columns that I'm visiting move, 995 01:05:48,174 --> 01:05:50,119 you know, they just scan across. 996 01:05:50,119 --> 01:05:52,942 So, I'm scanning this whole matrix every time. 997 01:05:52,942 --> 01:05:55,766 And unless you're entire matrix fits in cache, 998 01:05:55,766 --> 01:05:58,840 in which case you could do anything, I don't care, 999 01:05:58,840 --> 01:06:02,353 it will take constant time, or you'll take M over B time, 1000 01:06:02,353 --> 01:06:05,302 enough to read it into the cache, do your stuff, 1001 01:06:05,302 --> 01:06:09,989 and write it back out. Except in that boring case, 1002 01:06:09,989 --> 01:06:14,115 you're going to have to pay N^2 over B for every row here 1003 01:06:14,115 --> 01:06:18,242 because you have to scan the whole collection of columns. 1004 01:06:18,242 --> 01:06:22,589 You have to read this entire matrix for every row over here. 1005 01:06:22,589 --> 01:06:26,494 So, you really do need N^3 over B for the whole thing. 1006 01:06:26,494 --> 01:06:30,043 So, it's usually a theta. So, you might say, 1007 01:06:30,043 --> 01:06:32,766 well, that's great. It's the size of my problem, 1008 01:06:32,766 --> 01:06:34,852 the usual running time, divided by B. 1009 01:06:34,852 --> 01:06:38,329 And that was the case when we are thinking about linear time, 1010 01:06:38,329 --> 01:06:41,168 N versus N over B. It's hard to beat N over B when 1011 01:06:41,168 --> 01:06:44,066 your problem is of size N. But now we have a cubed. 1012 01:06:44,066 --> 01:06:47,137 And, this gets back to, we have good spatial locality. 1013 01:06:47,137 --> 01:06:49,687 When we read a block, we use the whole thing. 1014 01:06:49,687 --> 01:06:51,019 Great. It seems optimal. 1015 01:06:51,019 --> 01:06:53,337 But we don't have good temporal locality. 1016 01:06:53,337 --> 01:06:56,350 It could be that maybe if we stored the right things, 1017 01:06:56,350 --> 01:06:59,074 we kept them around, we could them several times 1018 01:06:59,074 --> 01:07:04,000 because we're using each element like a cubed number of times. 1019 01:07:04,000 --> 01:07:08,990 That's not the right way of saying it, but we're reusing the 1020 01:07:08,990 --> 01:07:11,951 matrices a lot, reusing those items. 1021 01:07:11,951 --> 01:07:16,942 If we are doing N^3 work on N^2 things, we're reusing a lot. 
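For reference, a minimal sketch of the standard algorithm being analyzed here (not the lecture's board code): A is assumed to be given as a list of rows and B as a list of columns, matching the row-major / column-major layout above, so each entry of C is computed by one pair of sequential scans.

def matmul_standard(A_rows, B_cols, n):
    """Standard matrix multiplication, written to expose the access pattern
    discussed above: A_rows[i] is row i of A (row major), B_cols[j] is
    column j of B (column major)."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # dot product: two sequential scans of length n,
            # about N/B + 1 memory transfers per entry of C
            C[i][j] = sum(A_rows[i][k] * B_cols[j][k] for k in range(n))
    return C

Each c_ij costs about N/B + 1 transfers, and the columns of B get rescanned for every row of C, which is exactly the missing temporal locality just discussed.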
1022 01:07:16,942 --> 01:07:21,933 So, we want to do better than this, and that's the recursive 1023 01:07:21,933 --> 01:07:26,416 algorithm, which we've seen. So, we know the algorithm 1024 01:07:26,416 --> 01:07:29,800 pretty much. I just have to tell you what 1025 01:07:29,800 --> 01:07:36,588 the layout is. So, we're going to take C, 1026 01:07:36,588 --> 01:07:42,941 partition of C_1-1, C_1-2, and so on. 1027 01:07:42,941 --> 01:07:52,647 So, I have an N by N matrix, and I'm partitioning into N 1028 01:07:52,647 --> 01:08:02,176 over 2 by N over 2 submatrices, all three of them times 1029 01:08:02,176 --> 01:08:07,377 whatever. And, I could write this out yet 1030 01:08:07,377 --> 01:08:11,058 again but I won't. OK, we can recursively compute 1031 01:08:11,058 --> 01:08:15,200 this thing with eight matrix multiplies, and a bunch of 1032 01:08:15,200 --> 01:08:18,191 matrix additions. I don't care how many, 1033 01:08:18,191 --> 01:08:22,256 but a constant number. We see that at least twice now, 1034 01:08:22,256 --> 01:08:26,091 so I won't show it again. Now, how do I lay out the 1035 01:08:26,091 --> 01:08:29,005 matrices? Any suggestions how I lay out 1036 01:08:29,005 --> 01:08:32,979 the matrices? I could lay them out in row 1037 01:08:32,979 --> 01:08:35,693 major order. I'll call it major order. 1038 01:08:35,693 --> 01:08:38,185 But that might be less natural now. 1039 01:08:38,185 --> 01:08:42,000 We're not doing anything by rows or by columns. 1040 01:08:59,000 --> 01:09:03,014 So, what layout should I use? Yeah? 1041 01:09:03,014 --> 01:09:08,446 Quartet major order, maybe quadrant major order 1042 01:09:08,446 --> 01:09:12,933 unless you're musically inclined, yeah. 1043 01:09:12,933 --> 01:09:17,420 Good idea. You've never seen this order 1044 01:09:17,420 --> 01:09:21,671 before, so it's maybe not so natural. 1045 01:09:21,671 --> 01:09:26,158 Somehow I want to cluster it by blocks. 1046 01:09:26,158 --> 01:09:33,402 OK, I think that's about all. So, I mean, it's a recursive 1047 01:09:33,402 --> 01:09:36,576 layout. This was not an easy question. 1048 01:09:36,576 --> 01:09:39,751 It's OK. Store matrices or lay out the 1049 01:09:39,751 --> 01:09:44,899 matrices recursively by block. OK, I'm cheating a little bit. 1050 01:09:44,899 --> 01:09:49,961 I'm redefining the problem to say, assume that your matrices 1051 01:09:49,961 --> 01:09:54,680 are laid out in this way. But, it doesn't really matter. 1052 01:09:54,680 --> 01:09:56,568 We can cheat, can't we? 1053 01:09:56,568 --> 01:10:02,276 In fact, it doesn't matter. You can turn a matrix into this 1054 01:10:02,276 --> 01:10:06,315 layout without too much linear work, almost linear work. 1055 01:10:06,315 --> 01:10:07,637 Log factors, maybe. 1056 01:10:07,637 --> 01:10:11,676 OK, so if I want to store my matrix A as a linear thing, 1057 01:10:11,676 --> 01:10:15,274 I'm going to recursively defined that layout to be 1058 01:10:15,274 --> 01:10:19,019 recursively store the upper left corner, then store, 1059 01:10:19,019 --> 01:10:21,442 let's say, the upper right corner. 1060 01:10:21,442 --> 01:10:24,380 It doesn't matter which order I do these. 1061 01:10:24,380 --> 01:10:28,492 I should have drawn this wider, then store the lower left 1062 01:10:28,492 --> 01:10:34,000 corner, and then store the lower right corner recursively. 1063 01:10:34,000 --> 01:10:38,025 So, how do you store this? Well, you divide it in four, 1064 01:10:38,025 --> 01:10:40,634 and lay out the top left, and so on. 
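One concrete way to write down the recursive layout just described (a sketch under the assumption that N is a power of two; in the literature this quadrant-by-quadrant ordering is essentially the Z-order, or Morton order, layout, though any fixed order of the four quadrants works):

def quadrant_index(i, j, n):
    """Position of entry (i, j) in the linear array for an n-by-n matrix
    (n a power of two) stored recursively by quadrant: upper-left block
    first, then upper-right, then lower-left, then lower-right, each
    quadrant laid out recursively the same way."""
    if n == 1:
        return 0
    half = n // 2
    quadrant = 2 * (i >= half) + (j >= half)   # 0 = UL, 1 = UR, 2 = LL, 3 = LR
    offset = quadrant * half * half            # each quadrant occupies half*half slots
    return offset + quadrant_index(i % half, j % half, half)

Laying out C, A, and B this way is what guarantees, below, that every recursive multiply touches three contiguous chunks of memory.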
1065 01:10:40,634 --> 01:10:44,511 OK, this is a recursive definition of how the elements 1066 01:10:44,511 --> 01:10:47,046 should be stored in a linear array. 1067 01:10:47,046 --> 01:10:50,326 It's a weird one, but this is a very powerful 1068 01:10:50,326 --> 01:10:52,861 idea in cache oblivious algorithms. 1069 01:10:52,861 --> 01:10:57,408 We'll use this multiple times. OK, so now all we have to do is 1070 01:10:57,408 --> 01:11:00,241 analyze the number of memory transfers. 1071 01:11:00,241 --> 01:11:05,066 How hard could it be? So, we're going to store all 1072 01:11:05,066 --> 01:11:08,978 the matrices in this order, and we want to compute the 1073 01:11:08,978 --> 01:11:12,373 number of memory transfers on an N by N matrix. 1074 01:11:12,373 --> 01:11:15,547 See, I lapsed and I switched to lowercase n. 1075 01:11:15,547 --> 01:11:19,902 I should, throughout this week, be using uppercase N because 1076 01:11:19,902 --> 01:11:23,666 for historical reasons, any external memory kinds of 1077 01:11:23,666 --> 01:11:28,095 algorithms, two-level algorithms, always talk about capital N. 1078 01:11:28,095 --> 01:11:31,785 And, don't ask why. You should see what they define 1079 01:11:31,785 --> 01:11:37,995 little n to be. OK, so, any suggestions on what 1080 01:11:37,995 --> 01:11:45,342 the recurrence should be now? With all this fancy setup, the 1081 01:11:45,342 --> 01:11:49,724 recurrence is actually pretty easy. 1082 01:11:49,724 --> 01:11:57,071 So, definitely it involves multiplying matrices that are N 1083 01:11:57,071 --> 01:12:03,000 over 2 by N over 2. So, what goes here? 1084 01:12:03,000 --> 01:12:05,752 Eight, thank you. That you should know. 1085 01:12:05,752 --> 01:12:08,793 And the tricky part is what goes here. 1086 01:12:08,793 --> 01:12:12,487 OK, what goes here is, now, the fact that I can even 1087 01:12:12,487 --> 01:12:15,384 write this, this is the matrix additions. 1088 01:12:15,384 --> 01:12:18,788 Ignore those for now. Suppose there weren't any. 1089 01:12:18,788 --> 01:12:21,323 I just have to recursively multiply. 1090 01:12:21,323 --> 01:12:25,740 The fact that this actually is eight times memory transfers of 1091 01:12:25,740 --> 01:12:30,670 N over 2 relies on this layout. Right, I'm assuming that the 1092 01:12:30,670 --> 01:12:34,129 arrays that I'm given are given as contiguous intervals in 1093 01:12:34,129 --> 01:12:35,442 memory. If they aren't, 1094 01:12:35,442 --> 01:12:38,066 I mean, if they're scattered all over memory, 1095 01:12:38,066 --> 01:12:40,273 I'm screwed. There's nothing I can do. 1096 01:12:40,273 --> 01:12:43,434 So, but by assuming that I have this recursive layout, 1097 01:12:43,434 --> 01:12:46,835 I know that the recursive multiplies will always deal with 1098 01:12:46,835 --> 01:12:49,519 three consecutive chunks of memory, one for A, 1099 01:12:49,519 --> 01:12:52,202 one for B, one for C, OK, no matter what I do. 1100 01:12:52,202 --> 01:12:54,470 Because these are stored consecutively, 1101 01:12:54,470 --> 01:12:56,438 recursively I have that invariant. 1102 01:12:56,438 --> 01:12:59,540 And I can keep recursing. And I'm always dealing with 1103 01:12:59,540 --> 01:13:03,000 three consecutive chunks of memory. 1104 01:13:03,000 --> 01:13:08,327 That's why I need this layout: to be able to say this. 1105 01:13:08,327 --> 01:13:11,332 OK, now what does addition cost? 1106 01:13:11,332 --> 01:13:14,335 I'll just give you two matrices.
1107 01:13:14,335 --> 01:13:19,858 They're stored in some linear order, the same linear order 1108 01:13:19,858 --> 01:13:25,186 among the three of them. Do I care what the linear order 1109 01:13:25,186 --> 01:13:28,384 is? How should I add two matrices, 1110 01:13:28,384 --> 01:13:31,000 get the output? 1111 01:13:42,000 --> 01:13:43,000 Yeah? 1112 01:13:51,000 --> 01:13:54,850 Right, if each of the three arrays I'm dealing with are 1113 01:13:54,850 --> 01:13:58,559 stored in the same order, I can just scan in parallel 1114 01:13:58,559 --> 01:14:02,909 through all three of them and just add corresponding elements, 1115 01:14:02,909 --> 01:14:07,045 and output it to the third. So, I don't care what the order 1116 01:14:07,045 --> 01:14:10,682 is, as long as it's consistent and I get N^2 over B. 1117 01:14:10,682 --> 01:14:14,390 I'll ignore plus one here. That's just looking at the 1118 01:14:14,390 --> 01:14:16,529 entire matrix. So, there we go: 1119 01:14:16,529 --> 01:14:19,667 another recurrence. We've seen this with N^2, 1120 01:14:19,667 --> 01:14:23,090 and we just got N^3. But, it turns out now we get 1121 01:14:23,090 --> 01:14:26,371 something cooler if we use the right base case. 1122 01:14:26,371 --> 01:14:30,008 So now we get to the base case, ah, the tricky part. 1123 01:14:30,008 --> 01:14:35,000 So, any suggestions what base case I should use? 1124 01:14:35,000 --> 01:14:36,672 The block size, good suggestion. 1125 01:14:36,672 --> 01:14:38,829 So, if we have something of size order B, 1126 01:14:38,829 --> 01:14:41,850 we know that takes a constant number of memory transfers. 1127 01:14:41,850 --> 01:14:44,871 It turns out that's not enough. That won't solve it here. 1128 01:14:44,871 --> 01:14:46,381 But good guess. In this case, 1129 01:14:46,381 --> 01:14:49,294 it's not the right answer. I'll give you some intuition 1130 01:14:49,294 --> 01:14:51,182 why. We are trying to improve on N^3 1131 01:14:51,182 --> 01:14:53,178 over B. If you were just trying to get 1132 01:14:53,178 --> 01:14:55,443 it divided by B, this is a great base case. 1133 01:14:55,443 --> 01:14:58,572 But here, we know that just the improvement afforded by the 1134 01:14:58,572 --> 01:15:03,244 block size is not enough. We have to somehow use the fact 1135 01:15:03,244 --> 01:15:06,864 that the cache is big. It's M, so however big M is, 1136 01:15:06,864 --> 01:15:09,977 it's that big. OK, so if we want to get some 1137 01:15:09,977 --> 01:15:13,307 improvement on this, we've got to have M in the 1138 01:15:13,307 --> 01:15:16,276 formula somewhere, and there's no M's yet. 1139 01:15:16,276 --> 01:15:19,027 So, it's got to involve M. What's that? 1140 01:15:19,027 --> 01:15:21,271 MT of M over B? That would work, 1141 01:15:21,271 --> 01:15:25,108 but MT of M is also OK, I mean, some constant times M, 1142 01:15:25,108 --> 01:15:27,859 let's say. I want to make this constant 1143 01:15:27,859 --> 01:15:33,000 small enough so that the entire problem fits in cache. 1144 01:15:33,000 --> 01:15:37,006 So, it's like one third. I think it's actually, 1145 01:15:37,006 --> 01:15:40,837 oh wait, is it the square root of M actually? 1146 01:15:40,837 --> 01:15:43,537 Right, this is an N by N matrix. 1147 01:15:43,537 --> 01:15:47,456 So, it should be C times the square root of M. 1148 01:15:47,456 --> 01:15:50,330 Sorry. So, the square root of M by 1149 01:15:50,330 --> 01:15:53,552 square root of M matrix has M entries. 
1150 01:15:53,552 --> 01:15:58,603 If I make C like one third or something, then I can fit all 1151 01:15:58,603 --> 01:16:04,372 three matrices in memory. Actually, one over square root 1152 01:16:04,372 --> 01:16:06,903 of three would do, but who cares? 1153 01:16:06,903 --> 01:16:10,621 So, for some constant, C, now everything fits in 1154 01:16:10,621 --> 01:16:13,548 memory. How many memory transfers does 1155 01:16:13,548 --> 01:16:14,497 it take? One? 1156 01:16:14,497 --> 01:16:18,451 It's a bit too small, because I do have to read the 1157 01:16:18,451 --> 01:16:20,587 problem in. And now, I mean, 1158 01:16:20,587 --> 01:16:24,621 here was one because there's only one block to read. 1159 01:16:24,621 --> 01:16:27,548 Now how many blocks are there to read? 1160 01:16:27,548 --> 01:16:30,000 Constants? No. 1161 01:16:30,000 --> 01:16:30,369 B? No. 1162 01:16:30,369 --> 01:16:33,255 M over B, good. Get it right eventually. 1163 01:16:33,255 --> 01:16:37,102 That's the great thing about thinking with an oracle. 1164 01:16:37,102 --> 01:16:41,318 You can just keep guessing. M over B because we have cache 1165 01:16:41,318 --> 01:16:43,908 size M. There are M over B blocks in 1166 01:16:43,908 --> 01:16:46,201 that cache to read each one, OK? 1167 01:16:46,201 --> 01:16:49,382 This is maybe, you forgot what M was because 1168 01:16:49,382 --> 01:16:51,897 we haven't used it for a long time. 1169 01:16:51,897 --> 01:16:54,857 But M is the number of elements in cache. 1170 01:16:54,857 --> 01:16:59,000 This is the number of blocks in cache. 1171 01:16:59,000 --> 01:17:02,537 OK, some of you were saying B, and it's reasonable to assume 1172 01:17:02,537 --> 01:17:05,943 that M over B is about B. That's like a square cache, 1173 01:17:05,943 --> 01:17:08,892 but in general, we don't make that assumption. 1174 01:17:08,892 --> 01:17:11,381 OK, where are we? We're hopefully done, 1175 01:17:11,381 --> 01:17:14,460 just about, good, because we have three minutes. 1176 01:17:14,460 --> 01:17:17,800 So, that's our base case. I have a square root here; 1177 01:17:17,800 --> 01:17:20,815 I just forgot it. Now we just have to solve it. 1178 01:17:20,815 --> 01:17:23,434 Now, this is an easier recurrence, right? 1179 01:17:23,434 --> 01:17:27,497 I don't want to use the master method, because master method is 1180 01:17:27,497 --> 01:17:31,296 not going to handle these B's and M's, and these crazy base 1181 01:17:31,296 --> 01:17:35,271 cases. OK, master method would prove 1182 01:17:35,271 --> 01:17:36,054 N^3. Great. 1183 01:17:36,054 --> 01:17:40,282 Master method doesn't really think about these kinds of 1184 01:17:40,282 --> 01:17:42,789 cases. But with recursion trees, 1185 01:17:42,789 --> 01:17:47,331 if you remember way back to the proof of the master method, 1186 01:17:47,331 --> 01:17:52,030 you just look at whether the recursion tree is geometric going 1187 01:17:52,030 --> 01:17:55,945 up or down, or whether every level is equal, 1188 01:17:55,945 --> 01:17:59,000 and then you just add them up, level by level. The point is that this is a 1189 01:17:59,000 --> 01:18:02,680 nice recurrence. All of the sub problems are the 1190 01:18:02,680 --> 01:18:05,891 same size, and that analysis always works, 1191 01:18:05,891 --> 01:18:12,000 I say, when everything has the same size, all the children. 1192 01:18:12,000 --> 01:18:18,857 So, here's the recursion tree. We have N^2 over B at the top.
1193 01:18:18,857 --> 01:18:24,114 We split into eight subproblems where each one, 1194 01:18:24,114 --> 01:18:27,657 the cost is one half N^2 over B. 1195 01:18:27,657 --> 01:18:32,000 I'm not going to write them all. 1196 01:18:32,000 --> 01:18:34,716 There they are. You add them up. 1197 01:18:34,716 --> 01:18:38,921 How much do you get? Well, there's eight of them. 1198 01:18:38,921 --> 01:18:41,637 Eight times a half is two. Four. 1199 01:18:41,637 --> 01:18:44,265 [LAUGHTER] Thanks. Four, right? 1200 01:18:44,265 --> 01:18:48,909 OK, I'm bad at arithmetic. I probably already said it, 1201 01:18:48,909 --> 01:18:52,675 but there are three kinds of mathematicians, 1202 01:18:52,675 --> 01:18:56,006 those who can add, and those who can't. 1203 01:18:56,006 --> 01:19:01,000 OK, why am I looking at this? It's obvious. 1204 01:19:01,000 --> 01:19:03,800 OK, so we keep going. This looks geometrically 1205 01:19:03,800 --> 01:19:04,858 increasing. Right? 1206 01:19:04,858 --> 01:19:08,405 You just know in your heart that if you work out the first 1207 01:19:08,405 --> 01:19:12,263 two levels, you can tell whether it's geometrically increasing, 1208 01:19:12,263 --> 01:19:15,437 decreasing, or they're all equal, or something else. 1209 01:19:15,437 --> 01:19:18,984 And then you better think. But I see this as geometrically 1210 01:19:18,984 --> 01:19:21,412 increasing. It will indeed be like 16 at 1211 01:19:21,412 --> 01:19:22,843 the next level, I guess. 1212 01:19:22,843 --> 01:19:25,145 OK, it should be. So, it's increasing. 1213 01:19:25,145 --> 01:19:30,000 That means the leaves matter. So, let's work out the leaves. 1214 01:19:30,000 --> 01:19:33,960 And, this is where we use our base case. 1215 01:19:33,960 --> 01:19:38,630 So, we have a problem of size square root of M. 1216 01:19:38,630 --> 01:19:41,981 And so, yeah, you have a question? 1217 01:19:41,981 --> 01:19:45,840 Oh, indeed. I knew there was something. 1218 01:19:45,840 --> 01:19:50,003 I knew it was supposed to be two out here. 1219 01:19:50,003 --> 01:19:53,150 Thanks. This is why you're here. 1220 01:19:53,150 --> 01:19:57,110 It's actually N over two squared over B. 1221 01:19:57,110 --> 01:20:00,867 Thanks. I'm substituting N over 2 into 1222 01:20:00,867 --> 01:20:04,900 this. OK, so this is actually N^2 1223 01:20:04,900 --> 01:20:06,519 over 4 B. So, I get two, 1224 01:20:06,519 --> 01:20:09,546 because there are eight times one over four. 1225 01:20:09,546 --> 01:20:13,416 OK, I wasn't that far off then. It's still geometrically 1226 01:20:13,416 --> 01:20:15,529 increasing, still the case, OK? 1227 01:20:15,529 --> 01:20:17,992 But now, it actually doesn't matter. 1228 01:20:17,992 --> 01:20:21,371 Whatever the cost is, as long as it's bigger than 1229 01:20:21,371 --> 01:20:23,975 one, great. Now we look at the leaves. 1230 01:20:23,975 --> 01:20:26,157 The leaves are root M by root M. 1231 01:20:26,157 --> 01:20:29,958 I substitute root M into this: I get M over B with some 1232 01:20:29,958 --> 01:20:32,903 constants. Who cares? 1233 01:20:32,903 --> 01:20:36,787 So, each leaf is M over B, OK, lots of them. 1234 01:20:36,787 --> 01:20:40,038 How many are there? This is the only annoying thing 1235 01:20:40,038 --> 01:20:45,006 when you deal with recursion trees: counting the number of leaves 1236 01:20:45,006 --> 01:20:48,709 is always the annoying part. Oh boy, well, 1237 01:20:48,709 --> 01:20:53,948 we start with an N by N matrix. We stop when we get down to a 1238 01:20:53,948 --> 01:21:00,000 root M by root M matrix.
So, that sounds like something. 1239 01:21:00,000 --> 01:21:04,141 Oh boy, I'm cheating here. Really? 1240 01:21:04,141 --> 01:21:07,905 That many? It sounds plausible. 1241 01:21:07,905 --> 01:21:11,921 OK, the claim is, and I'll cheat. 1242 01:21:11,921 --> 01:21:19,450 So I'm going to use the oracle here, and we'll figure out why 1243 01:21:19,450 --> 01:21:24,470 this is the case. N over root M, cubed, leaves, 1244 01:21:24,470 --> 01:21:27,231 hey what? I think here, 1245 01:21:27,231 --> 01:21:33,979 it's hard to see the tree. But it's easy to see in the 1246 01:21:33,979 --> 01:21:36,178 matrix. Let's enter the matrix. 1247 01:21:36,178 --> 01:21:39,256 We have our big matrix. We divided in half. 1248 01:21:39,256 --> 01:21:43,654 We recursively divide in half. We recursively divide in half. 1249 01:21:43,654 --> 01:21:45,120 You get the idea, OK? 1250 01:21:45,120 --> 01:21:49,151 Now, at some point these sectors, let's say one of these 1251 01:21:49,151 --> 01:21:52,743 sectors, and each of these sectors, fits in cache. 1252 01:21:52,743 --> 01:21:56,994 And three of them fit in cache. So, that's when we stop the 1253 01:21:56,994 --> 01:22:02,320 recursion in the analysis. The algorithm goes all the way. 1254 01:22:02,320 --> 01:22:05,538 But in the analysis, let's say we stop at M. 1255 01:22:05,538 --> 01:22:08,981 OK, now, how many leaves or problems are there? 1256 01:22:08,981 --> 01:22:11,451 Oh man, this is still not obvious. 1257 01:22:11,451 --> 01:22:14,669 OK, the number of leaf chunks here is, like, 1258 01:22:14,669 --> 01:22:19,010 I mean, the number of these things is something like N over 1259 01:22:19,010 --> 01:22:21,629 root M, right, the number of chunks. 1260 01:22:21,629 --> 01:22:26,195 But, it's a little less clear because I have so many of these. 1261 01:22:26,195 --> 01:22:28,964 But, all right, so let's just suppose, 1262 01:22:28,964 --> 01:22:32,856 now, I think of normal, boring, matrix multiplication 1263 01:22:32,856 --> 01:22:38,119 on chunks of this size. That's essentially what the 1264 01:22:38,119 --> 01:22:42,200 leaves should tell me. I start with this big problem, 1265 01:22:42,200 --> 01:22:45,261 I recurse out to all these little, tiny, 1266 01:22:45,261 --> 01:22:48,950 multiply this by that, OK, this root M by root M 1267 01:22:48,950 --> 01:22:51,305 chunk. OK, how many operations, 1268 01:22:51,305 --> 01:22:54,680 how many multiplies do I do on those things? 1269 01:22:54,680 --> 01:22:57,034 N^3. But now, N, the size of my 1270 01:22:57,034 --> 01:23:00,488 matrix in terms of these little sub matrices, 1271 01:23:00,488 --> 01:23:05,859 is N over root M. So, it should be N over root 1272 01:23:05,859 --> 01:23:10,760 M, cubed, subproblems of this size. If you work it out, 1273 01:23:10,760 --> 01:23:16,478 normally we go down to things of constant size and we get 1274 01:23:16,478 --> 01:23:21,278 exactly N^3 of them. Now we are stopping short at 1275 01:23:21,278 --> 01:23:26,485 this point and saying, well, it's however many there 1276 01:23:26,485 --> 01:23:30,161 are, cubed. OK, this is a bit of hand 1277 01:23:30,161 --> 01:23:35,352 waving. You could work it out with the 1278 01:23:35,352 --> 01:23:39,151 recurrence on the number of leaves. 1279 01:23:39,151 --> 01:23:44,180 But there it is. So, the total here is N over, 1280 01:23:44,180 --> 01:23:49,656 let's work it out.
N^3 over M to the three halves, 1281 01:23:49,656 --> 01:23:56,025 that's this number of leaves, times the cost at each leaf, 1282 01:23:56,025 --> 01:24:01,054 which is M over B. So, some of the M's cancel, 1283 01:24:01,054 --> 01:24:07,759 and we get N^3 over B root M, which is a root M factor better 1284 01:24:07,759 --> 01:24:13,433 than N^3 over B. It's actually quite a lot, 1285 01:24:13,433 --> 01:24:16,522 the square root of the cache size. 1286 01:24:16,522 --> 01:24:20,359 That is optimal. The best two-level matrix 1287 01:24:20,359 --> 01:24:26,162 multiplication algorithm is N^3 over B root M memory transfers. 1288 01:24:26,162 --> 01:24:30,000 Pretty amazing, and I'm over time. 1289 01:24:30,000 --> 01:24:34,979 You can generalize this into all sorts of great things, 1290 01:24:34,979 --> 01:24:39,959 but the bottom line is this is a great way to do matrix 1291 01:24:39,959 --> 01:24:45,308 multiplication as a recursion. We'll see more recursion for 1292 01:24:45,308 --> 01:24:48,000 cache oblivious algorithms on Wednesday.
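Collecting the pieces of this last analysis in one place (the same quantities as above, just written out):
\[
\mathrm{MT}(N) = 8\,\mathrm{MT}\!\left(\frac{N}{2}\right) + \Theta\!\left(\frac{N^2}{B}\right),
\qquad
\mathrm{MT}\!\left(c\sqrt{M}\right) = \Theta\!\left(\frac{M}{B}\right),
\]
\[
\text{number of leaves} = \Theta\!\left(\left(\frac{N}{\sqrt{M}}\right)^{3}\right),
\qquad
\mathrm{MT}(N) = \Theta\!\left(\left(\frac{N}{\sqrt{M}}\right)^{3} \cdot \frac{M}{B}\right) = \Theta\!\left(\frac{N^{3}}{B\sqrt{M}}\right).
\]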