The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Good, we're going to take a detour today into the realm of algorithms.

So when you're trying to make code go fast, of course, there's no holds barred. You can use whatever you need to in order to make it go fast. Today we're going to talk in a little bit more principled way about the memory hierarchy. And to do that we're going to introduce what we call the ideal-cache model.

So as you know, most caches are hacked together to try to provide something that will cache well while still being easy to build and fast to build. The ideal cache is a pretty nice beast, if only we had them. It's got a two-level hierarchy. It's got a cache that has M bytes organized into B-byte cache lines. So each block is B bytes. And it's fully associative.
So you recall, that means that any line can go anywhere in the cache. And probably the most impressive aspect of an ideal cache is that it has an optimal, omniscient replacement algorithm. So what it does is, when it needs to kick something out of cache, it figures out the absolutely best thing it could possibly kick out of cache, and it kicks out that one, looking into the future if need be. It says, oh, is this going to be accessed a lot in the future? I think I'll keep this one. Let's throw out this one; I know it's never going to be used again. So it has that omniscient character to it.

The performance measures we're going to look at in this model: the first one is what we call the work. And that's just the ordinary serial running time, if you just ran the code on one processor and counted up essentially how many processor instructions you would execute. That's essentially the work. The second measure, which is the one that's much more interesting, is cache misses. So the work has to do with the processor. The cache misses have to do with what moves between these two levels of memory.
So in this case, what we're interested in is: how often do I try to access something, it's not in the cache, and I have to go to main memory and bring it back into cache? That's what we'll be counting in this model.

So it's reasonable to ask how reasonable ideal caches are. In particular, the assumption of omniscient replacement, that's pretty powerful stuff. Well, it turns out there's a great lemma due to Sleator and Tarjan that says essentially the following. Suppose that you have an algorithm that incurs Q cache misses on an ideal cache of size M. So you ran the algorithm on your machine, and you had a cache of size M. Then if, instead of having an ideal cache, you have a fully associative cache of size 2M and use the least recently used (LRU) replacement policy, so whenever you're kicking something out of cache, you kick out the thing that was touched longest ago in the past, then it incurs at most 2Q cache misses. So what that says is that LRU is, to within constant factors, essentially the same as optimal. Really quite a remarkable result.

Who's taking 6.046? You've just seen this, right?
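To make the lemma concrete, here is a small simulation sketch of my own, not something from the lecture, that counts misses for LRU and for the omniscient policy (Belady's algorithm) on the same access trace. The trace and cache sizes are made up for illustration: a cyclic trace is the classic worst case for LRU at equal sizes, yet LRU with twice the lines still lands within the promised factor of 2 of optimal.

```python
from collections import OrderedDict

def lru_misses(trace, num_lines):
    """Misses for a fully associative cache of num_lines lines with LRU replacement."""
    cache = OrderedDict()              # keys kept in recency order, oldest first
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)    # a hit makes this line most recently used
        else:
            misses += 1
            if len(cache) == num_lines:
                cache.popitem(last=False)   # evict the least recently used line
            cache[line] = True
    return misses

def opt_misses(trace, num_lines):
    """Misses for the omniscient policy: evict the line reused farthest in the future."""
    cache, misses = set(), 0
    for i, line in enumerate(trace):
        if line in cache:
            continue
        misses += 1
        if len(cache) == num_lines:
            def next_use(l):           # look into the future, as the ideal cache may
                for j in range(i + 1, len(trace)):
                    if trace[j] == l:
                        return j
                return float("inf")    # never used again: the perfect victim
            cache.remove(max(cache, key=next_use))
        cache.add(line)
    return misses

trace = [i % 5 for i in range(50)]     # cycle through 5 lines, 50 accesses
assert lru_misses(trace, 4) == 50      # LRU with 4 lines on a 5-line cycle: every access misses
q = opt_misses(trace, 4)               # Q = misses on the ideal cache of the same size
assert lru_misses(trace, 8) <= 2 * q   # LRU with twice the lines: at most 2Q misses
```

The cyclic trace shows why the doubling matters: at equal size LRU always evicts exactly the line that is about to be reused, while the omniscient policy keeps it.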
Yeah, OK. You've just seen this result in 6.046. See, I do talk to my colleagues occasionally. So much for how this is proved. What's important here is that really it just says, OK, you could dither on the constants, but basically whether you choose LRU or choose an ideal cache with omniscient replacement, asymptotically you're not going to be off at all.

So for most asymptotic analyses, you can assume optimal or LRU replacement, as convenient. And the typical way you use that convenience is: if you're looking at upper bounds, so you're trying to show that a particular algorithm is good, then you assume optimal replacement. If you're trying to show that some algorithm is bad, then you assume that it's LRU, to get a lower bound. Because then you can reason more easily about what's actually in memory; you just say, oh, we'll just kick out the least recently used one. So you tend to use the two for upper bounds and lower bounds, respectively.

Now, the way this relates to software engineering is as follows.
If you're developing a really fast algorithm, it's going to start from a theoretically sound algorithm. And from that, you then have to engineer for detailed performance. So you have to take into account things like: real caches are not fully associative; loads and stores, for example, have different costs with respect to bandwidth and latency. So whether you miss on a load or miss on a store, there's a different impact. But that's all tuning. And as you know, those constant factors can sometimes add up to dramatic numbers, orders of magnitude. And so it's important to do that software engineering. But starting from a good theoretical basis means that you actually have an algorithm that is going to work well across a large variety of real situations.

Now, there's one other assumption we tend to make when we're dealing with ideal caches, and that's called the tall-cache assumption. What the tall-cache assumption says is that you have at least as many lines, essentially, in your cache as you have bytes in a line. So it says the cache is tall.
In other words, this dimension here is bigger than this dimension here. And in particular, you want that to be true with some constant here of slop that we can throw in. Yes, question.

AUDIENCE: Does that [INAUDIBLE] associativity make the cache shorter here?

PROFESSOR: Yes, so this is basically assuming everything is ideal. We're going to go back. When you engineer things, you have to deal with the fact that things aren't ideal. But usually that's just a little bit of a tweak on the ideal algorithm. And for many programs, you don't actually have to tweak the ideal algorithm at all to get a good practical algorithm. So this is just saying the cache should be tall.

Now, just as an example, if we look at the machines that we're using, the cache-line length is 64 bytes and the L1 cache size is 32 kilobytes. And for L2 and L3, it's even bigger, even taller, because they also have 64-byte lines.
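Plugging in those numbers: a tall cache has at least as many lines (M/B) as bytes per line (B), which is the same as saying B squared is at most a constant times M. A quick check, my own arithmetic rather than anything from the lecture:

```python
def is_tall(M, B, c=1):
    """Tall-cache assumption: the number of lines M/B is at least the
    line length B, up to a constant c of slop; equivalently B*B <= c*M."""
    return B * B <= c * M

B = 64                # cache-line length in bytes, as on the machines in the lecture
M_L1 = 32 * 1024      # L1 cache size in bytes, also from the lecture
assert M_L1 // B == 512   # 512 lines of 64 bytes: comfortably taller than wide
assert is_tall(M_L1, B)   # the larger L2/L3 caches, same line size, are taller still
```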
So this is a fairly reasonable assumption to make, that you have more lines in your cache than, essentially, the number of items you can put on a cache line.

Now, why is this an important assumption? What's wrong with short caches? So we're going to look at, surprise, surprise, matrix multiplication. By the end of this class you will know more algorithms than matrix multiplication, but it is a good one to illustrate things.

So the idea here is, suppose that you have an n by n matrix here, and you don't have this tall-cache assumption. So your cache is short: you have a lot of bytes in a line, but very few lines. Then even if your matrix fits, in principle, in the cache, in other words, n squared is less than M by more than a constant amount, so you'd say, oh gee, that ought to fit in. If you have a short cache, it doesn't necessarily fit, because your row length n here is going to be shorter than the number of bytes on a line.
However, if you have a tall cache, it's always the case that if the matrix size is smaller than the cache size by a certain amount, then the matrix will fit in the cache. OK, question?

AUDIENCE: Why wouldn't you fit more than one row per cache line?

PROFESSOR: Well, the issue is you may not have control over the way this is laid out. So, for example, if this is row-major order, and this is a submatrix of a much bigger matrix, then you may not have the freedom to be using these. But if you have the tall-cache assumption, then any section you pull out is going to fit. As long as the data fits mathematically in the cache, it will fit practically in the cache if you have the tall-cache assumption. Whereas if it's short, you basically end up with the cache lines being long and you not having any flexibility as to where the data goes.

So any questions about that before we get into the use of this? We're going to see the use of this and where it comes up.
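As a warm-up sketch of my own, not the lecture's code: one sequential pass over contiguous data misses only on the first touch of each cache line, which is where the "matrix size over line size" loading bound in a moment comes from.

```python
def scan_misses(num_elements, B):
    """Cache misses for one sequential pass over a contiguous array,
    with B elements per cache line (any sane replacement policy)."""
    misses = 0
    prev_line = None
    for addr in range(num_elements):
        line = addr // B          # which cache line this element lives on
        if line != prev_line:     # first touch of a new line costs one miss
            misses += 1
            prev_line = line
    return misses

n, B = 100, 8
# Loading an n x n matrix that fits: about n^2 / B misses, i.e. "linear time"
# in the cache world, one miss per line of data.
assert scan_misses(n * n, B) == -(-n * n // B)   # ceil(n^2 / B)
```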
So one of the things is that, if it does fit in, then it takes at most the size of the matrix divided by the cache-line size misses to load it in. So this is linear time in the cache world. Linear time says you should only take one cache miss for every cache line of data. And that's what you'll have here if you have a tall cache: n squared over B cache misses to load in n squared data. And that's good.

OK, good. So let's take on the problem of multiplying matrices. We're going to look at square matrices, because they're easier to think about than rectangular ones, but almost everything I say today will apply to rectangular matrices as well. And it will generalize beyond matrices, as we'll see next time.

So here's a typical code for multiplying matrices. It's not the most efficient code in the world, but it's good enough to illustrate what I want to show you. So the first thing is, what is the work of this algorithm? This is, by the way, the softball question. What's the work? So the work, remember, is just as if you're analyzing it on an ordinary processor: forget about caches and so forth.
AUDIENCE: n cubed.

PROFESSOR: n cubed, right. Because there's a triply nested loop going up to n, and you're doing constant work in the middle. So it's n times n times n times 1: n cubed work. That was easy.

Now let's analyze caches. So we're going to look at row-major layout. I'm only going to illustrate the cache lines on this side, because B is where all the action is. So we're going to analyze two cases where the matrix doesn't fit in the cache. If the matrix fits in the cache, then there's nothing to analyze, at some level. And the first case is going to be when the side of the matrix is bigger than M over B. So remember, M over B is the height of our cache, the number of lines in our cache.

So for this, I have a choice of assuming optimal omniscient replacement or assuming LRU. Since I want to show this is bad, I'm going to assume LRU. Could somebody please close the back door there? We're getting some noise in from out there. Thank you.
So let's assume LRU. What happens in the code is, basically, I go across a row of A while I go down a column of B. And now if I'm using LRU, what's happening? I read in this cache block, and this one, then this one, et cetera. And if n is bigger than M/B and I'm using least recently used, by the time I get down to the bottom here, what's happened to the first cache block? It's out of there. It's out of there if I used LRU.

So therefore, what happens is I took a miss on every one of those. And then when I go to the second column, I take a miss on every one again. And so as I keep going through, every access to B causes a miss, throughout the whole accessing of B. Now I go on to the second row of A, and the same thing repeats.

So therefore, the number of cache misses is order n cubed, since we miss on matrix B on every access. OK, question.

AUDIENCE: I know that you said it's [INAUDIBLE]. Does B get pushed out due to conflict misses or capacity misses?

PROFESSOR: So in this case, they're capacity misses that we're talking about here.
So there's no conflict misses in a fully associative cache. Conflict misses occur because of direct mapping. So there are no conflict misses in what I'm going to be talking about today. That is an extra concern that you have for real caches, not a concern when you have a fully associative cache.

AUDIENCE: [INAUDIBLE] n needs to be bigger than B?

PROFESSOR: So the number of lines that fit in my cache is M over B, right?

AUDIENCE: So can't you put multiple units of data--

PROFESSOR: Well, there are multiple units of data. But notice this is row-major: what's stored here is B11, B12, B13. That's stored here. The way I'm going through the access, I'm going down the columns of B. So by the time I get up to the top again, that cache block is no longer there. And so when I access B12, assuming indexing from one or whatever, this block is no longer in cache. Because LRU would say: somewhere along here I hit the limit of the size of my cache, let's say around here. Then when this one goes in, that one goes out. When the next one goes in, the next one goes out, et cetera, using least recently used replacement.
So my--

AUDIENCE: Spatial locality.

PROFESSOR: I'm sorry? You don't have any spatial locality here.

AUDIENCE: I'm just wondering why they can't hold units. I guess this is the question: why can't they hold multiple addresses per cache line? So why is it even pushed out? It's being pushed out [UNINTELLIGIBLE] one per cache line, right?

PROFESSOR: No, so it's getting pushed out because the cache can hold M/B blocks, right? So once I've accessed M over B blocks, if I want to access anything else, something has to go out. It's a capacity issue. I access M over B blocks, something has to go out. LRU says the least recently used thing, well, that was the first one, gets knocked out. So what happens is every one causes a miss. Even though I may access that again very nearby in the future, it doesn't take advantage of that, because LRU says knock it out.

AUDIENCE: [INAUDIBLE]

PROFESSOR: Is it row major that's the confusion? This is the way we've been dealing with our matrices. So in row major, there's a good-- that's nice, there's no chalk here.
Oh, there's a big one there, great.

Yeah, so here's B. The order that B is stored is like this in memory. So basically we're storing these elements in this order. So it's a linear block of memory, right? And it's being stored row by row as we go through. So actually, if I do it like this, let me do this a little bit more. So the idea is that the first element is going to be here, and then we get up to n minus 1, and then n is stored here, n plus 1, n plus 2, up to 2n minus 1. So that's the order that they're stored in linear memory.

Now, these guys will all be on the same cache line if it's in row-major storage. So when I'm accessing B, I'm going and I'm accessing 0, then I'm accessing the thing at location n, and I'm going down like this. At some point here I reach the limit of my cache. This is M/B-- notice it's a different B, script B versus the matrix B-- so when I get to M over B, now all these things are sitting in cache. That's great. However, now I go to one more, and it says, OK, all those things are sitting in cache; which one do I kick out?
And the answer is the least recently used one. That's this guy: he goes out.

AUDIENCE: Do you only use one element per cache line?

PROFESSOR: And I've only used one element from each cache line at this point. Then I go to the next one, and it knocks out the second one. By the time I get to the bottom, and then I go up to the top to access 1 here, it's not in cache. And so it repeats the same process, missing every single time. We have a question.

AUDIENCE: Yeah, so my question is, how does the cache know where each row is? To us, we draw the matrix, but the computer doesn't know it's a matrix. To the computer, it's a linear array of numbers.

PROFESSOR: That's correct.

AUDIENCE: So why would it load the first couple of elements in the first row, and the second column is an extended row the second time?

PROFESSOR: So the cache blocks are determined by the locality in memory.

AUDIENCE: So my assumption would be the first cache line would hold, say--

PROFESSOR: Let's say 0 through 3.

AUDIENCE: So yeah, 0 through 3.

PROFESSOR: Let's say we have four items on that cache line.
AUDIENCE: 4 to 6.

PROFESSOR: The next one would hold 4 to 7, I think.

AUDIENCE: 4 to 7, yeah. So that is a--

PROFESSOR: So that would be the next one, right? 4 to 7.

AUDIENCE: When you get a cache line, you are not using the full cache line. There is no spatial locality. You are using one element from the cache line.

PROFESSOR: So this code is using this one and then this one. It's not using the rest. So it's not very efficient code.

AUDIENCE: So the cache line is holding the 0 to 3 and the 4 to 7. [INAUDIBLE] n plus 2 just reading--

[INTERPOSING VOICES]

PROFESSOR: Right. And those are fixed. So if you just dice up memory in our machine into 64-byte sizes, those are the things that come in together whenever you access anything on that line.

AUDIENCE: And on this particular access, you never actually get the 4 to the 7 in the--

PROFESSOR: Well, we eventually do. Until we get there, yes, that's right. Until we get there.
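The exchange above can be checked with a little LRU simulation, my own sketch with toy sizes rather than anything from the lecture. Walking down the columns of a row-major n by n matrix touches one element per line, and if n exceeds the number of lines M/B, LRU evicts every line before it is reused; if the cache is big enough, you pay only one miss per line instead.

```python
from collections import OrderedDict

def columnwise_misses(n, num_lines, B):
    """LRU misses when scanning a row-major n x n matrix column by column,
    on a fully associative cache of num_lines lines of B elements each."""
    cache, misses = OrderedDict(), 0
    for j in range(n):                 # column by column, as the inner matmul loop does
        for i in range(n):             # down one column
            line = (i * n + j) // B    # row-major address of element (i, j)
            if line in cache:
                cache.move_to_end(line)
            else:
                misses += 1
                if len(cache) == num_lines:
                    cache.popitem(last=False)   # evict least recently used
                cache[line] = True
    return misses

n, B = 64, 8
assert columnwise_misses(n, 32, B) == n * n        # n > M/B: every access misses
assert columnwise_misses(n, 128, B) == n * n // B  # n <= M/B: one miss per cache line
```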
Now, of course, we're also accessing A and C, but it turns out for this analysis it's sufficient to show that we're getting n cubed misses just on the matrix B, in order to say, hey, we've got a lot of misses here.

So this was the case where n was bigger than the size of the cache. The situation is a little bit different if n is large, but still actually less than M over B. So in this case, we suppose that n squared is bigger than M, so the matrix doesn't fit in the cache. That's what this part of the equation is: M to the 1/2 less than n is the same as n squared bigger than the cache size. So we still don't fit in the cache, but in fact n is less than some constant times M over B.

And now let's look at the difference in what happens with the caches as we go through the algorithm. So we essentially do the same thing. Once again, we're going to assume LRU. And so what happens is we're going to go down a single column there. But now, notice that by the time I get down to the bottom, I've accessed fewer than some constant times M over B cache lines. And so nothing has gotten kicked out yet.
439 00:24:17,460 --> 00:24:24,880 So when I go back to the top for the next access to B, all 440 00:24:24,880 --> 00:24:26,380 these things are still in cache. 441 00:24:29,150 --> 00:24:35,030 So I don't take a cache miss in those cases. 442 00:24:35,030 --> 00:24:36,470 And so we keep going through. 443 00:24:36,470 --> 00:24:41,040 And basically this is much better because we're actually 444 00:24:41,040 --> 00:24:44,310 getting to take advantage of the spatial locality. 445 00:24:44,310 --> 00:24:45,910 So this algorithm takes advantage 446 00:24:45,910 --> 00:24:48,580 of the spatial locality. 447 00:24:48,580 --> 00:24:57,040 If n is really big it doesn't, but if n is just kind of big, 448 00:24:57,040 --> 00:24:59,300 then it does. 449 00:24:59,300 --> 00:25:03,510 And then if n is small enough, of course, it all fits in 450 00:25:03,510 --> 00:25:06,170 cache and there's no misses other than those needed to 451 00:25:06,170 --> 00:25:07,420 bring it in once. 452 00:25:09,820 --> 00:25:12,610 And then the same thing happens once you go through 453 00:25:12,610 --> 00:25:13,820 the next one. 454 00:25:13,820 --> 00:25:18,610 So in this case, what's happening is we have n squared 455 00:25:18,610 --> 00:25:27,130 over b misses per run through the matrix B, and then we have 456 00:25:27,130 --> 00:25:28,510 n times that we go through. 457 00:25:28,510 --> 00:25:32,506 Once for every row of A. So the total then is n cubed over 458 00:25:32,506 --> 00:25:37,700 b, the cache block size. 459 00:25:37,700 --> 00:25:43,050 So depending upon the size, we can analyze with this that 460 00:25:43,050 --> 00:25:47,730 this is better because we get a factor of b improvement. 461 00:25:47,730 --> 00:25:49,550 But it's still not particularly good. 462 00:25:49,550 --> 00:25:58,280 And it only works, of course, if the side of my matrix fits 463 00:25:58,280 --> 00:26:02,780 in the number of lines of cache that I have. 
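As a concrete reference for this analysis, here is the ordinary triple loop in C (a sketch of mine, not the slide's code; row-major storage assumed). The stride-n walk down B's columns is where the misses come from: roughly n cubed misses when a column's n lines exceed the cache, and n squared over b per pass when they fit.

```c
#include <stddef.h>

/* Naive row-major matrix multiply-add, C += A*B, as analyzed above.
   The innermost loop walks B down a column at stride n, so for large
   n each of the ~n^3 accesses to B can touch a different cache line. */
static void matmul_naive(double *C, const double *A, const double *B,
                         size_t n) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < n; ++k)
                C[i*n + j] += A[i*n + k] * B[k*n + j];  /* stride-n in B */
}
```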
464 00:26:02,780 --> 00:26:03,345 Yeah, question. 465 00:26:03,345 --> 00:26:05,129 AUDIENCE: Can you explain in-- 466 00:26:05,129 --> 00:26:07,494 I don't understand why you have n cubed over b? 467 00:26:07,494 --> 00:26:12,590 PROFESSOR: OK, so we're going through this matrix n times. 468 00:26:12,590 --> 00:26:14,100 And for each one of those, we're running 469 00:26:14,100 --> 00:26:16,440 through this thing. 470 00:26:16,440 --> 00:26:22,170 So this thing basically, I get to go b times through, because 471 00:26:22,170 --> 00:26:24,890 all these things are going to be in cache when I come back 472 00:26:24,890 --> 00:26:26,580 to do them again. 473 00:26:26,580 --> 00:26:32,740 And so it's only once every b columns that I take a miss. 474 00:26:32,740 --> 00:26:36,910 I take a miss, and then the other b minus 1 accesses 475 00:26:36,910 --> 00:26:40,120 that I get are cache hits. 476 00:26:40,120 --> 00:26:44,790 And so the total here is then n squared over b. 477 00:26:44,790 --> 00:26:46,040 So therefore a total of n cubed over b. 478 00:26:49,120 --> 00:26:52,630 So even this is not very good compared to what we can 479 00:26:52,630 --> 00:26:54,435 actually do if we exploit the cache well. 480 00:26:57,780 --> 00:26:59,150 So let's go on and take a look. 481 00:26:59,150 --> 00:27:00,660 We saw this before. 482 00:27:00,660 --> 00:27:03,710 Let's use tiling. 483 00:27:03,710 --> 00:27:09,950 So the idea of tiling is to say, let's break our matrix 484 00:27:09,950 --> 00:27:14,510 into blocks of s times s size. 485 00:27:14,510 --> 00:27:23,540 And essentially what we do is we treat our big matrix as if 486 00:27:23,540 --> 00:27:25,960 we're doing block matrix multiplications of things of 487 00:27:25,960 --> 00:27:28,300 size s by s. 488 00:27:28,300 --> 00:27:32,360 So the inner loop here is doing essentially the matrix 489 00:27:32,360 --> 00:27:33,710 multiplication. 490 00:27:33,710 --> 00:27:39,020 It's actually matrix multiply and add. 
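The tiled loop nest on the slide has roughly this shape (a sketch of mine: the loop-variable names are my own, and for simplicity it assumes s divides n):

```c
#include <stddef.h>

/* Tiled multiply-add, C += A*B, with tile size s -- the tuning
   parameter.  The outer three loops walk s-by-s blocks; the inner
   three do one block multiply-add.  Sketch assumes s divides n. */
static void matmul_tiled(double *C, const double *A, const double *B,
                         size_t n, size_t s) {
    for (size_t ih = 0; ih < n; ih += s)         /* outer three loops:  */
      for (size_t jh = 0; jh < n; jh += s)       /* pick a block triple */
        for (size_t kh = 0; kh < n; kh += s)
          for (size_t i = ih; i < ih + s; ++i)   /* inner three loops:  */
            for (size_t j = jh; j < jh + s; ++j) /* s-by-s multiply-add */
              for (size_t k = kh; k < kh + s; ++k)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```

For any s that divides n this computes the same C as the plain triple loop; only the order of the updates changes.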
491 00:27:39,020 --> 00:27:41,500 The inner three loops are just doing ordinary matrix 492 00:27:41,500 --> 00:27:45,210 multiplication, but on s-sized matrices. 493 00:27:45,210 --> 00:27:54,490 And the outer loop is jumping over matrix by matrix, for each 494 00:27:54,490 --> 00:27:56,350 of those doing a matrix multiply as 495 00:27:56,350 --> 00:27:58,450 its elemental piece. 496 00:27:58,450 --> 00:28:01,430 So this is the tiling solution that you've seen before. 497 00:28:01,430 --> 00:28:03,920 We can analyze it in this model to see 498 00:28:03,920 --> 00:28:07,250 whether this is a good solution. 499 00:28:07,250 --> 00:28:10,740 So everybody clear on what the code does? 500 00:28:10,740 --> 00:28:14,737 So it's a lot of for loops, right? 501 00:28:14,737 --> 00:28:15,226 Yeah. 502 00:28:15,226 --> 00:28:19,138 AUDIENCE: There should be a less than n somewhere? 503 00:28:19,138 --> 00:28:20,610 There's like an n missing or something. 504 00:28:20,610 --> 00:28:22,354 PROFESSOR: Oh yeah. 505 00:28:22,354 --> 00:28:25,100 That must have happened when I copied it. 506 00:28:25,100 --> 00:28:27,490 That should be j less than n here. 507 00:28:27,490 --> 00:28:29,732 It should just follow this pattern. 508 00:28:29,732 --> 00:28:31,770 i less than n, k less than n, that should be 509 00:28:31,770 --> 00:28:33,020 j less than n there. 510 00:28:35,830 --> 00:28:37,880 Good catch. 511 00:28:37,880 --> 00:28:39,270 I did execute this. 512 00:28:39,270 --> 00:28:40,520 That must have happened when I was editing. 513 00:28:43,520 --> 00:28:45,870 So here's the analysis of work. 514 00:28:45,870 --> 00:28:48,990 So what's going on in the work? 515 00:28:48,990 --> 00:28:55,120 So here we have, basically the outer loops are each going n over s 516 00:28:55,120 --> 00:28:57,880 times. 517 00:28:57,880 --> 00:29:01,110 So there's a cube there, times the inner loops here, which are 518 00:29:01,110 --> 00:29:03,430 each going s times. 519 00:29:03,430 --> 00:29:04,790 So times s cubed. 
520 00:29:04,790 --> 00:29:07,250 Multiply that through, n cubed operations. 521 00:29:07,250 --> 00:29:08,500 That's kind of what you'd expect. 522 00:29:10,810 --> 00:29:12,210 What about cache misses? 523 00:29:15,320 --> 00:29:19,520 So the whole idea here is that s becomes a tuning parameter. 524 00:29:19,520 --> 00:29:24,060 And whether we choose s well or poorly influences how well 525 00:29:24,060 --> 00:29:27,430 this algorithm works. 526 00:29:27,430 --> 00:29:31,620 So the idea here is we want to tune s so that the submatrices 527 00:29:31,620 --> 00:29:33,570 just fit into cache. 528 00:29:33,570 --> 00:29:37,230 So in this case, if I want a matrix to fit into cache, I 529 00:29:37,230 --> 00:29:39,830 want s to be about the square root 530 00:29:39,830 --> 00:29:43,350 of the cache size. 531 00:29:43,350 --> 00:29:45,250 And this is where we're going to use the tall-cache 532 00:29:45,250 --> 00:29:47,090 assumption now. 533 00:29:47,090 --> 00:29:52,620 Because I want to say, it fits in cache, therefore I can just 534 00:29:52,620 --> 00:29:54,220 assume it all fits in cache. 535 00:29:54,220 --> 00:29:58,220 It's not like the size fits but the actual data doesn't, 536 00:29:58,220 --> 00:30:01,900 which is what happens with a short cache. 537 00:30:01,900 --> 00:30:05,320 So the tall-cache assumption implies that when I'm 538 00:30:05,320 --> 00:30:09,660 executing one of these inner loops, what's happening? 539 00:30:09,660 --> 00:30:12,110 When I'm executing one of these inner loops, all of the 540 00:30:12,110 --> 00:30:14,600 matrices are going to fit in cache. 541 00:30:14,600 --> 00:30:18,610 So all I have is my cold misses, if 542 00:30:18,610 --> 00:30:22,520 any, on that submatrix. 543 00:30:22,520 --> 00:30:24,820 And how many cold misses can I have? 544 00:30:24,820 --> 00:30:29,300 Well the size of the matrix is s squared and I get to bring 545 00:30:29,300 --> 00:30:33,500 in b bytes of the matrix each time. 
546 00:30:33,500 --> 00:30:37,820 So I get s squared over b misses per submatrix. 547 00:30:37,820 --> 00:30:40,150 So that was a little bit fast, but I just want to make sure-- 548 00:30:43,520 --> 00:30:45,790 it's at one level straightforward, and at the other 549 00:30:45,790 --> 00:30:48,300 level it's a little bit fast. 550 00:30:48,300 --> 00:30:51,680 So the point is that the inner three loops I can analyze if I 551 00:30:51,680 --> 00:30:53,990 know that the s by s submatrix fits in cache. 552 00:30:53,990 --> 00:30:57,340 The inner three loops I can analyze by saying, look, it's s 553 00:30:57,340 --> 00:30:58,610 squared data. 554 00:30:58,610 --> 00:31:01,400 Once I get the data in cache, if I'm using an optimal 555 00:31:01,400 --> 00:31:06,480 replacement, then it's going to stay in there. 556 00:31:06,480 --> 00:31:10,460 And so it will cost me s squared over b misses to bring 557 00:31:10,460 --> 00:31:14,410 that matrix in, for each of the three matrices. 558 00:31:14,410 --> 00:31:17,960 But once it's in there, I can keep going over and over it as 559 00:31:17,960 --> 00:31:19,480 the algorithm does. 560 00:31:19,480 --> 00:31:21,190 I don't get any cache misses. 561 00:31:21,190 --> 00:31:25,720 Because those are all fitting in the cache. 562 00:31:25,720 --> 00:31:28,050 Question? 563 00:31:28,050 --> 00:31:29,770 Everybody with me? 564 00:31:29,770 --> 00:31:31,890 OK. 565 00:31:31,890 --> 00:31:36,020 So then I basically have the outer three loops. 566 00:31:36,020 --> 00:31:38,360 And here I don't make any assumptions whatsoever. 567 00:31:38,360 --> 00:31:43,120 There's n over s iterations for each loop. 568 00:31:43,120 --> 00:31:43,940 And there's three loops. 569 00:31:43,940 --> 00:31:45,690 So that's n over s cubed. 570 00:31:45,690 --> 00:31:48,840 And then the cost of the misses in the inner loop is s 571 00:31:48,840 --> 00:31:50,440 squared over b. 
572 00:31:50,440 --> 00:31:56,210 And that gives me n cubed over b m to the 1/2 if you plug in s 573 00:31:56,210 --> 00:31:57,460 being m to the 1/2. 574 00:32:01,110 --> 00:32:08,530 So this is radically better, because m is usually big. 575 00:32:08,530 --> 00:32:14,800 Especially for a higher level cache, for an L2 or an L3, m 576 00:32:14,800 --> 00:32:15,980 is really big. 577 00:32:15,980 --> 00:32:19,180 What was the value we had before for the best case for 578 00:32:19,180 --> 00:32:22,260 the other algorithm, when it didn't fit in cache? 579 00:32:22,260 --> 00:32:25,280 It was n cubed over b. 580 00:32:25,280 --> 00:32:28,570 b is like 64 bytes. 581 00:32:28,570 --> 00:32:36,160 m is, for the small L1 cache, 32 kilobytes. 582 00:32:36,160 --> 00:32:39,460 So you get to square root the 32 kilobytes. 583 00:32:39,460 --> 00:32:40,710 What's that? 584 00:32:47,470 --> 00:32:52,670 So 32 kilobytes is 2 to the 15th. 585 00:32:52,670 --> 00:32:55,590 So it's 2 to the 7.5. 586 00:32:55,590 --> 00:32:59,120 2 to the 7 is 128. 587 00:32:59,120 --> 00:33:08,020 So it's somewhere between 128 and 256. 588 00:33:08,020 --> 00:33:13,130 So if we said 128, I've got a 64 and a 128 multiplier there. 589 00:33:13,130 --> 00:33:15,400 Much, much better in terms of cache misses. 590 00:33:15,400 --> 00:33:20,920 In fact, this is such that if we tune this properly and then 591 00:33:20,920 --> 00:33:25,470 we say, well, what was the cost of the cache misses here, 592 00:33:25,470 --> 00:33:28,880 you're not going to see the cost of the cache misses when 593 00:33:28,880 --> 00:33:31,300 you do your performance analysis. 594 00:33:31,300 --> 00:33:33,590 It's all going to be the work. 595 00:33:33,590 --> 00:33:37,080 Because the work is still n cubed. 
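For the record, the miss count just derived can be written out as:

```latex
Q(n) \;=\; \underbrace{\left(\tfrac{n}{s}\right)^{3}}_{\text{outer three loops}}
       \cdot \underbrace{\Theta\!\left(\tfrac{s^{2}}{b}\right)}_{\text{misses per block multiply}}
 \;=\; \Theta\!\left(\frac{n^{3}}{b\,s}\right)
 \;=\; \Theta\!\left(\frac{n^{3}}{b\sqrt{m}}\right)
 \quad\text{for } s = \Theta(\sqrt{m}).
```

Compared with the $\Theta(n^{3}/b)$ of the non-tiled loops, the extra $\sqrt{m}$ in the denominator is exactly the factor of 128-or-so being computed here.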
596 00:33:37,080 --> 00:33:40,050 The work is still n cubed, but now the misses are so 597 00:33:40,050 --> 00:33:45,180 infrequent, because we're only getting one every-- 598 00:33:45,180 --> 00:33:50,375 on the order of 64 times 128, which is 2 to the 6th times 2 599 00:33:50,375 --> 00:33:54,200 to the 7th is 2 to the 13th is 8K. 600 00:33:54,200 --> 00:33:58,490 Every 8,000 or so accesses-- there's a constant factor in 601 00:33:58,490 --> 00:34:02,150 there or whatever, but every 8,000 or so accesses we're 602 00:34:02,150 --> 00:34:03,370 getting a cache miss. 603 00:34:03,370 --> 00:34:05,700 Uh, too bad. 604 00:34:05,700 --> 00:34:08,380 If it's L1, that cost is four cycles rather than one. 605 00:34:11,350 --> 00:34:15,170 Or that cost is 10 cycles if I had to go to L2 rather than 606 00:34:15,170 --> 00:34:17,030 one, or whatever. 607 00:34:17,030 --> 00:34:20,190 So the point is, that's a great multiplier to have. 608 00:34:22,940 --> 00:34:26,190 So this is a really good algorithm. 609 00:34:26,190 --> 00:34:28,750 And in fact, this is the optimal behavior you can get 610 00:34:28,750 --> 00:34:30,000 for matrix multiplication. 611 00:34:33,420 --> 00:34:37,989 Hong and Kung proved back in 1981 that this particular 612 00:34:37,989 --> 00:34:40,750 strategy and this bound was the best you could do for 613 00:34:40,750 --> 00:34:44,010 matrix multiplication. 614 00:34:44,010 --> 00:34:45,310 So that's great. 615 00:34:45,310 --> 00:34:47,489 I want you to remember this number because we're going to 616 00:34:47,489 --> 00:34:50,130 come back to it. 617 00:34:50,130 --> 00:34:52,802 So remember it's b times m to the 1/2, b times square root 618 00:34:52,802 --> 00:34:54,634 of m, in the denominator. 619 00:34:57,990 --> 00:35:06,390 Now there's one hitch in this story. 620 00:35:06,390 --> 00:35:08,400 And that is, what do I have to do for this 621 00:35:08,400 --> 00:35:09,650 algorithm to work well? 
622 00:35:12,720 --> 00:35:14,170 It says right up there on the slide. 623 00:35:17,630 --> 00:35:19,560 Tune s. 624 00:35:19,560 --> 00:35:21,660 I've got to tune s. 625 00:35:21,660 --> 00:35:22,910 How do I do that? 626 00:35:25,130 --> 00:35:26,380 How do I tune s? 627 00:35:28,950 --> 00:35:30,810 How would you suggest we tune s? 628 00:35:30,810 --> 00:35:33,995 AUDIENCE: Just run a binary [INAUDIBLE]. 629 00:35:33,995 --> 00:35:39,160 PROFESSOR: Yeah, do binary search on s to find out what's 630 00:35:39,160 --> 00:35:41,990 the best value for s. 631 00:35:41,990 --> 00:35:44,490 Good strategy. 632 00:35:44,490 --> 00:35:45,740 What if we guess wrong? 633 00:35:48,500 --> 00:35:53,000 What happens if, say, we tune s, we get some value for it. 634 00:35:53,000 --> 00:35:56,690 Let's say the value is 100. 635 00:35:56,690 --> 00:35:57,670 So we've tuned it. 636 00:35:57,670 --> 00:35:59,970 We find 100 is our best value. 637 00:35:59,970 --> 00:36:03,570 We run it on our workstation, and somebody else has another 638 00:36:03,570 --> 00:36:04,820 job running. 639 00:36:06,770 --> 00:36:09,730 What happens then? 640 00:36:09,730 --> 00:36:13,520 That other job starts sharing part of that cache. 641 00:36:13,520 --> 00:36:17,450 So the effective cache size is going to be smaller than what 642 00:36:17,450 --> 00:36:18,410 we tuned it for. 643 00:36:18,410 --> 00:36:19,660 And what's going to happen? 644 00:36:23,740 --> 00:36:24,990 What's going to happen in that case? 645 00:36:27,700 --> 00:36:30,540 If I've tuned it for a given size and then I actually have 646 00:36:30,540 --> 00:36:36,340 to run with something that's effectively a smaller cache, 647 00:36:36,340 --> 00:36:37,710 does it matter or doesn't it matter? 648 00:36:37,710 --> 00:36:39,030 AUDIENCE: Is it still tall? 649 00:36:39,030 --> 00:36:41,601 PROFESSOR: Still tall. 
650 00:36:41,601 --> 00:36:42,851 AUDIENCE: [INAUDIBLE] 651 00:36:46,958 --> 00:36:50,600 PROFESSOR: So if you imagine this fit exactly into cache, 652 00:36:50,600 --> 00:36:52,960 and now I only have half that amount. 653 00:36:52,960 --> 00:36:58,090 Then the assumption that these three inner loops are running 654 00:36:58,090 --> 00:37:02,280 with only s squared over b misses is going to be totally 655 00:37:02,280 --> 00:37:04,410 out the window. 656 00:37:04,410 --> 00:37:10,870 In fact, it's going to be just like the case of the first 657 00:37:10,870 --> 00:37:14,460 algorithm, the naive algorithm that I gave. 658 00:37:14,460 --> 00:37:18,180 Because the size of matrix that I'm feeding it, s by s, 659 00:37:18,180 --> 00:37:21,850 isn't fitting in the cache. 660 00:37:21,850 --> 00:37:25,850 And so rather than it being s squared over b accesses, it's 661 00:37:25,850 --> 00:37:27,100 going to be much bigger. 662 00:37:29,290 --> 00:37:36,720 I'm going to end up with essentially s cubed accesses 663 00:37:36,720 --> 00:37:38,575 if the cache, in fact, gets small enough. 664 00:37:44,590 --> 00:37:48,130 There's also one thing you have to put in there, which is what I like 665 00:37:48,130 --> 00:37:51,770 to call voodoo. 666 00:37:51,770 --> 00:37:56,190 Whenever you have a program and you've got some parameters 667 00:37:56,190 --> 00:37:59,940 that, oh good, we've got these parameters we get to tweak to 668 00:37:59,940 --> 00:38:02,170 make it go better. 669 00:38:02,170 --> 00:38:04,970 I call those voodoo parameters. 670 00:38:04,970 --> 00:38:10,320 Because typically setting them is not straightforward. 671 00:38:10,320 --> 00:38:11,590 Now there are different strategies. 672 00:38:11,590 --> 00:38:14,270 One, as you say, is to do binary search by trying it. 673 00:38:14,270 --> 00:38:18,360 There are some programs, in fact, which, when you start 674 00:38:18,360 --> 00:38:21,350 them up, call an initialization routine. 
675 00:38:21,350 --> 00:38:25,920 And what they will do is automatically check to see 676 00:38:25,920 --> 00:38:29,700 what size is my cache and what's the best size I should 677 00:38:29,700 --> 00:38:33,050 do something on, and then use that when you actually run it 678 00:38:33,050 --> 00:38:34,430 later in the program. 679 00:38:34,430 --> 00:38:37,670 So it does that adaptation 680 00:38:37,670 --> 00:38:39,470 automatically when you start. 681 00:38:39,470 --> 00:38:42,280 But the more parameters you get, the more 682 00:38:42,280 --> 00:38:44,640 troublesome it becomes. 683 00:38:44,640 --> 00:38:45,580 So let's take a look. 684 00:38:45,580 --> 00:38:48,290 For example, suppose we have a two-level cache rather than a 685 00:38:48,290 --> 00:38:51,500 one-level cache. 686 00:38:51,500 --> 00:38:55,430 Now I need to have something that I tune for L1 and 687 00:38:55,430 --> 00:38:56,990 something that I tune for L2. 688 00:39:01,210 --> 00:39:06,660 So it turns out that if I want to optimize s and t, I can't 689 00:39:06,660 --> 00:39:08,910 do it anymore with binary search because I have two 690 00:39:08,910 --> 00:39:11,090 parameters. 691 00:39:11,090 --> 00:39:13,550 And binary search won't suffice for figuring out 692 00:39:13,550 --> 00:39:16,430 what's the best combination of s and t. 693 00:39:16,430 --> 00:39:20,170 And generally multidimensional searches are much harder than 694 00:39:20,170 --> 00:39:24,720 one-dimensional searches for optimizing. 695 00:39:24,720 --> 00:39:26,590 Moreover, here's what the code looks like. 696 00:39:31,290 --> 00:39:34,530 So now I've got, how many for loops? 697 00:39:34,530 --> 00:39:37,860 1, 2, 3, 4, 5, 6, 7, 8, 9 nested for loops. 698 00:39:41,160 --> 00:39:45,130 So you can see the voodoo it's starting to 699 00:39:45,130 --> 00:39:46,160 take to make this stuff run. 700 00:39:46,160 --> 00:39:49,720 You really have to be a magician to tune these things 701 00:39:49,720 --> 00:39:51,900 appropriately. 
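The initialization-routine idea mentioned above can be sketched like this (my own illustration, not the class's code). Note the assumptions: `_SC_LEVEL1_DCACHE_SIZE` is a glibc extension rather than standard POSIX and may report nothing on some systems, and both the 32-kilobyte fallback and the factor of three (for keeping one tile each of A, B, and C resident) are choices of mine.

```c
#include <unistd.h>

/* Pick a power-of-two tile size s at startup such that three s-by-s
   tiles of doubles should fit in the L1 data cache.  Cache size is
   queried via a glibc extension; a 32 KB guess is the fallback. */
static long pick_tile(void) {
    long cache = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    if (cache <= 0)
        cache = 32 * 1024;  /* query unavailable: assume 32 KB */
    long s = 1;
    while ((s * 2) * (s * 2) * (long)sizeof(double) * 3 <= cache)
        s *= 2;             /* grow s while 3 tiles still fit */
    return s;
}
```

Even this only helps with the static part of the problem; as discussed next, another job sharing the cache can still shrink the effective size after you've tuned.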
702 00:39:51,900 --> 00:39:53,410 I mean, if you can do it, that's great. 703 00:39:53,410 --> 00:39:56,110 But if you don't do it, OK. 704 00:39:56,110 --> 00:39:59,940 So now what about three levels of cache? 705 00:39:59,940 --> 00:40:04,650 So now we need three tuning parameters. 706 00:40:04,650 --> 00:40:08,190 Here s, t and u, we have 12 nested for loops. 707 00:40:08,190 --> 00:40:12,040 I didn't have the heart to actually write out the code 708 00:40:12,040 --> 00:40:13,400 for the 12 nested for loops. 709 00:40:13,400 --> 00:40:15,940 That just seemed like overhead. 710 00:40:15,940 --> 00:40:17,160 But our new Nehalem machines, they have 711 00:40:17,160 --> 00:40:18,800 three levels of caches. 712 00:40:18,800 --> 00:40:22,750 So let's tune for all the levels of caches. 713 00:40:22,750 --> 00:40:24,690 And as we mentioned, in a multi-programmed environment, you 714 00:40:24,690 --> 00:40:26,720 don't actually know what the cache size is, what other 715 00:40:26,720 --> 00:40:27,960 programs are running. 716 00:40:27,960 --> 00:40:29,985 So it's really easy to mistune these parameters. 717 00:40:33,464 --> 00:40:36,733 AUDIENCE: [INAUDIBLE] don't you have a problem because 718 00:40:36,733 --> 00:40:41,078 you're running the program for a particular n, and you don't 719 00:40:41,078 --> 00:40:42,737 necessarily know whether your program is going to run faster 720 00:40:42,737 --> 00:40:43,922 or slower-- 721 00:40:43,922 --> 00:40:46,090 PROFESSOR: Well, what you're usually doing is you're 722 00:40:46,090 --> 00:40:48,030 tuning for s, not n, right? 723 00:40:48,030 --> 00:40:49,040 So you're assuming-- 724 00:40:49,040 --> 00:40:51,515 AUDIENCE: No, no, a particular n. 725 00:40:51,515 --> 00:40:54,990 PROFESSOR: But the tuning of this is only dependent on s. 726 00:40:54,990 --> 00:40:56,480 It doesn't depend on n. 
727 00:40:56,480 --> 00:40:58,800 So if you run it for a sufficiently large n, I think 728 00:40:58,800 --> 00:41:02,110 it's reasonable to assume that the s you get would be a good 729 00:41:02,110 --> 00:41:05,300 s for any large n. 730 00:41:05,300 --> 00:41:08,830 Because the real question is, what's fitting in cache? 731 00:41:08,830 --> 00:41:09,190 Yeah-- 732 00:41:09,190 --> 00:41:13,519 AUDIENCE: How long does it take to fill up the cache 733 00:41:13,519 --> 00:41:17,848 relative to the context switch time? 734 00:41:17,848 --> 00:41:21,510 PROFESSOR: Generally you can do it pretty quickly. 735 00:41:21,510 --> 00:41:22,810 AUDIENCE: Right. 736 00:41:22,810 --> 00:41:25,450 So why does it matter if you have multiple users, if you 737 00:41:25,450 --> 00:41:27,610 can fill it [INAUDIBLE]. 738 00:41:27,610 --> 00:41:30,130 PROFESSOR: No, because he may not be using all 739 00:41:30,130 --> 00:41:31,850 of the cache, right? 740 00:41:31,850 --> 00:41:36,320 So when you come back, you're going to have it polluted with 741 00:41:36,320 --> 00:41:40,080 a certain amount of stuff. 742 00:41:42,830 --> 00:41:44,280 I think it's a good question. 743 00:41:44,280 --> 00:41:45,530 AUDIENCE: [INAUDIBLE] 744 00:41:49,836 --> 00:41:53,000 PROFESSOR: OK, so anyway, so this is the-- yeah, question. 745 00:41:53,000 --> 00:41:58,240 AUDIENCE: So if n is really large, is it possible that the 746 00:41:58,240 --> 00:42:01,984 second row of the matrix is never loaded? 747 00:42:01,984 --> 00:42:04,115 PROFESSOR: If n is really large-- 748 00:42:04,115 --> 00:42:07,228 AUDIENCE: Because n is really large, right? 749 00:42:07,228 --> 00:42:08,906 Just the first row of the matrix will 750 00:42:08,906 --> 00:42:10,156 fill up all the caches. 751 00:42:13,620 --> 00:42:16,420 PROFESSOR: It's LRU, and in B you're going down this way. 752 00:42:19,210 --> 00:42:23,010 You're accessing things going down. 753 00:42:23,010 --> 00:42:25,480 OK, good. 
754 00:42:25,480 --> 00:42:28,690 So let's look at a solution to these alternatives. 755 00:42:28,690 --> 00:42:33,950 What I want to take a look at in particular is recursive 756 00:42:33,950 --> 00:42:35,200 matrix multiplication. 757 00:42:38,510 --> 00:42:43,460 So the idea is you can do divide and conquer on 758 00:42:43,460 --> 00:42:47,530 multiplying matrices, because if I divide each of these into 759 00:42:47,530 --> 00:42:54,573 four pieces, then essentially I have 8 multiply-adds of n 760 00:42:54,573 --> 00:42:57,450 over 2 by n over 2 matrices. 761 00:42:57,450 --> 00:43:00,250 Because I basically do these eight multiplies, each going 762 00:43:00,250 --> 00:43:01,880 into the correct result. 763 00:43:01,880 --> 00:43:06,090 So multiply A11 B11 and add it into C11. 764 00:43:06,090 --> 00:43:09,540 Multiply A12 B21 and add it into C11, and so forth. 765 00:43:12,310 --> 00:43:14,690 So I can basically do divide and conquer. 766 00:43:14,690 --> 00:43:15,770 And then each of those I 767 00:43:15,770 --> 00:43:17,250 recursively divide and conquer. 768 00:43:20,480 --> 00:43:22,720 So what's the intuition for why this might be a 769 00:43:22,720 --> 00:43:23,970 good scheme to use? 770 00:43:27,624 --> 00:43:30,492 AUDIENCE: [INAUDIBLE] 771 00:43:30,492 --> 00:43:33,730 PROFESSOR: Well, we're not going to do parallel yet. 772 00:43:33,730 --> 00:43:37,028 Just why is this going to use the cache well? 773 00:43:37,028 --> 00:43:39,408 AUDIENCE: [INAUDIBLE] 774 00:43:39,408 --> 00:43:45,580 PROFESSOR: Yeah, eventually I get down to a size where the 775 00:43:45,580 --> 00:43:49,380 matrix that I'm working on fits into cache, and then all 776 00:43:49,380 --> 00:43:51,710 the rest of the operations I do are all 777 00:43:51,710 --> 00:43:54,240 going to be cache hits. 778 00:43:54,240 --> 00:43:59,260 It's doing what the tiling is 779 00:43:59,260 --> 00:44:01,440 doing, but doing it blindly. 
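In C, the divide-and-conquer scheme the professor is about to walk through looks roughly like this (my own sketch: the names are mine, n is assumed to be an exact power of two, and it recurses all the way to 1-by-1 instead of using the coarsened base case discussed below). The quadrant offsets use the corrected arithmetic, with the bottom-right at n/2 times row size plus 1.

```c
#include <stddef.h>

/* Divide-and-conquer multiply-add C += A*B on n-by-n submatrices of a
   matrix whose full row length is rs ("row size").  Sketch only:
   n must be a power of two, and a real version would coarsen the
   base case instead of recursing down to 1-by-1. */
static void matmul_rec(double *C, const double *A, const double *B,
                       size_t n, size_t rs) {
    if (n == 1) {
        C[0] += A[0] * B[0];      /* base case: 1-by-1 multiply-add */
        return;
    }
    size_t h = n / 2;
    size_t o12 = h;               /* top-right quadrant: h columns over */
    size_t o21 = h * rs;          /* bottom-left: h rows down           */
    size_t o22 = h * (rs + 1);    /* bottom-right: h rows down, h over  */
    matmul_rec(C,       A,       B,       h, rs);  /* C11 += A11*B11 */
    matmul_rec(C,       A + o12, B + o21, h, rs);  /* C11 += A12*B21 */
    matmul_rec(C + o12, A,       B + o12, h, rs);  /* C12 += A11*B12 */
    matmul_rec(C + o12, A + o12, B + o22, h, rs);  /* C12 += A12*B22 */
    matmul_rec(C + o21, A + o21, B,       h, rs);  /* C21 += A21*B11 */
    matmul_rec(C + o21, A + o22, B + o21, h, rs);  /* C21 += A22*B21 */
    matmul_rec(C + o22, A + o21, B + o12, h, rs);  /* C22 += A21*B12 */
    matmul_rec(C + o22, A + o22, B + o22, h, rs);  /* C22 += A22*B22 */
}
```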
780 00:44:04,210 --> 00:44:04,980 So let's take a look. 781 00:44:04,980 --> 00:44:08,000 Here's the recursive code. 782 00:44:08,000 --> 00:44:13,890 So here I have the base case: if n is 1, I basically have a 783 00:44:13,890 --> 00:44:16,800 one by one matrix, and I just simply update c 784 00:44:16,800 --> 00:44:19,460 with a times b. 785 00:44:19,460 --> 00:44:21,770 And otherwise what I do is I'm going to do this by 786 00:44:21,770 --> 00:44:22,890 computing offsets. 787 00:44:22,890 --> 00:44:25,590 So generally when you're dealing with matrices, 788 00:44:25,590 --> 00:44:28,810 especially if you want fast code, I usually don't rely on 789 00:44:28,810 --> 00:44:32,680 two-dimensional addressing, but rather do the addressing 790 00:44:32,680 --> 00:44:36,620 myself and rely on the compiler to do common 791 00:44:36,620 --> 00:44:38,470 subexpression elimination. 792 00:44:38,470 --> 00:44:40,280 So, for example, here what I'm going to do 793 00:44:40,280 --> 00:44:42,890 is compute the offsets. 794 00:44:42,890 --> 00:44:44,240 So here's how I do it. 795 00:44:44,240 --> 00:44:46,770 So first of all, in practice what you do is you don't go 796 00:44:46,770 --> 00:44:48,360 down to n equals 1. 797 00:44:48,360 --> 00:44:50,300 You have some cutoff. 798 00:44:50,300 --> 00:44:52,220 Maybe n is 8 or something. 799 00:44:52,220 --> 00:44:55,760 And at that point you go into a specialized routine that 800 00:44:55,760 --> 00:44:58,490 does a really good 8 by 8 multiply. 801 00:44:58,490 --> 00:45:00,820 And the reason for that is you don't want to have the 802 00:45:00,820 --> 00:45:02,200 function call overheads. 803 00:45:02,200 --> 00:45:05,900 It's expensive to make this function call just to do two floating 804 00:45:05,900 --> 00:45:08,040 point operations here. 805 00:45:08,040 --> 00:45:11,200 So you'd like to have a function call and then do 100 806 00:45:11,200 --> 00:45:13,380 floating point operations or something. 
807 00:45:13,380 --> 00:45:15,040 So that you get a better balance. 808 00:45:15,040 --> 00:45:16,370 Do people understand that? 809 00:45:16,370 --> 00:45:18,920 So normally to write recursive codes you want to 810 00:45:18,920 --> 00:45:22,210 coarsen the recursion. 811 00:45:22,210 --> 00:45:23,660 Make it so you're not going all the way 812 00:45:23,660 --> 00:45:24,760 down to n equals 1. 813 00:45:24,760 --> 00:45:28,590 But rather are stopping short and then doing something that 814 00:45:28,590 --> 00:45:33,200 doesn't involve a lot of overhead in the base case of 815 00:45:33,200 --> 00:45:34,230 your recursion. 816 00:45:34,230 --> 00:45:37,260 But here I'll explain it as if we went all the way 817 00:45:37,260 --> 00:45:38,510 down to n equals 1. 818 00:45:40,590 --> 00:45:45,490 So then what we do is, this is a submatrix, which is 819 00:45:45,490 --> 00:45:46,650 basically what I'm showing here. 820 00:45:46,650 --> 00:45:48,310 We have an n by n submatrix. 821 00:45:48,310 --> 00:45:52,070 And it's being pulled out of a matrix of 822 00:45:52,070 --> 00:45:54,690 width row size. 823 00:45:54,690 --> 00:45:59,330 So what I can do is, if I want to know where the beginnings of 824 00:45:59,330 --> 00:46:03,000 the submatrices are, well, the first one is exactly 825 00:46:03,000 --> 00:46:06,770 the same place that the input matrix is. 826 00:46:06,770 --> 00:46:11,950 The second one is basically I have to add n over 2 to the 827 00:46:11,950 --> 00:46:14,990 location in the array. 828 00:46:14,990 --> 00:46:20,950 The third one here, 21, I have to basically add n over 2 rows 829 00:46:20,950 --> 00:46:23,440 to get the starting point of that matrix. 830 00:46:23,440 --> 00:46:27,870 And for the last one I have to add n over 2 and n over 2 rows 831 00:46:27,870 --> 00:46:31,070 to get to that point. 
832 00:46:31,070 --> 00:46:35,690 So I compute those, and now I can recursively multiply with 833 00:46:35,690 --> 00:46:42,620 sizes of n over 2 and perform the program recursively. 834 00:46:42,620 --> 00:46:43,870 Yeah-- 835 00:46:48,883 --> 00:46:50,380 AUDIENCE: So you said it rightly. 836 00:46:50,380 --> 00:46:53,706 You're blindly dividing the matrix up until you get 837 00:46:53,706 --> 00:46:54,380 something that fits in the cache. 838 00:46:54,380 --> 00:46:55,230 So essentially-- 839 00:46:55,230 --> 00:46:56,520 PROFESSOR: Well, and you're continuing. 840 00:46:56,520 --> 00:46:59,460 The algorithm is completely blind all the way 841 00:46:59,460 --> 00:47:00,710 down to n equals 1. 842 00:47:03,418 --> 00:47:07,690 AUDIENCE: This could never be better if the other one-- your 843 00:47:07,690 --> 00:47:09,334 computer's version is well-tuned. 844 00:47:09,334 --> 00:47:11,429 Because the operations are the same, but this one you 845 00:47:11,429 --> 00:47:13,278 have all the overhead from the [INAUDIBLE]. 846 00:47:13,278 --> 00:47:14,757 PROFESSOR: Could be. 847 00:47:14,757 --> 00:47:17,222 AUDIENCE: At the end, you still need to make a 848 00:47:17,222 --> 00:47:19,610 multiplication and then go back and look at all of the-- 849 00:47:19,610 --> 00:47:20,430 PROFESSOR: Could be. 850 00:47:20,430 --> 00:47:28,930 So let's discuss that later at the end, when we talk about the 851 00:47:28,930 --> 00:47:29,420 differences between the algorithms. 852 00:47:29,420 --> 00:47:32,240 At this point, let's just try to understand what's going on 853 00:47:32,240 --> 00:47:33,230 in the algorithm. 854 00:47:33,230 --> 00:47:34,572 Question-- 855 00:47:34,572 --> 00:47:35,822 AUDIENCE: [INAUDIBLE] 856 00:47:46,476 --> 00:47:50,900 PROFESSOR: n over 2 times row size plus-- 857 00:47:50,900 --> 00:47:51,070 plus n over 2. 858 00:47:51,070 --> 00:47:54,040 It should be row size plus 1. 859 00:47:54,040 --> 00:47:54,920 You're right. 
860 00:47:54,920 --> 00:47:57,470 Good, bug. 861 00:47:57,470 --> 00:47:59,000 Should be n over 2 times row size plus 1. 862 00:48:03,940 --> 00:48:06,040 So let's analyze the work, assuming the code 863 00:48:06,040 --> 00:48:07,290 actually did work. 864 00:48:09,780 --> 00:48:14,430 So the work we can write a recurrence for. 865 00:48:14,430 --> 00:48:18,500 So here we have the work to solve an 866 00:48:18,500 --> 00:48:20,980 n by n matrix problem. 867 00:48:20,980 --> 00:48:25,510 Well, if n is 1, then it's just order one work-- 868 00:48:25,510 --> 00:48:28,270 a constant amount of work. 869 00:48:28,270 --> 00:48:31,810 But if n is bigger than 1, then I'm solving eight 870 00:48:31,810 --> 00:48:36,890 problems of size n over 2, plus doing a constant amount 871 00:48:36,890 --> 00:48:41,090 of work to divide all those up. 872 00:48:41,090 --> 00:48:43,580 So everybody understand where I get this recurrence? 873 00:48:43,580 --> 00:48:50,360 Now normally, as you know, when you do algorithmic work, 874 00:48:50,360 --> 00:48:55,210 we usually omit this first line, because we assume a base 875 00:48:55,210 --> 00:48:58,260 case of constant if it's one. 876 00:48:58,260 --> 00:48:59,650 I'm actually going to keep it. 877 00:48:59,650 --> 00:49:02,540 And the reason is because when we do caching, the base cases 878 00:49:02,540 --> 00:49:03,790 are important. 879 00:49:05,890 --> 00:49:09,530 So everybody understand where this recurrence came from? 880 00:49:09,530 --> 00:49:13,600 So I can use the master theorem or something like that 881 00:49:13,600 --> 00:49:14,380 to solve this. 882 00:49:14,380 --> 00:49:17,060 In which case the answer for this is what? 883 00:49:17,060 --> 00:49:18,760 Those of you who have the master 884 00:49:18,760 --> 00:49:20,010 theorem in your hip pocket. 885 00:49:23,170 --> 00:49:24,420 What's the solution of this recurrence? 886 00:49:27,890 --> 00:49:28,500 People remember? 
887 00:49:28,500 --> 00:49:31,920 Who has heard of the master theorem? 888 00:49:31,920 --> 00:49:34,345 I thought that was kind of a prerequisite or something of 889 00:49:34,345 --> 00:49:35,595 this class, right? 890 00:49:38,770 --> 00:49:41,290 So you might want to brush up on the master theorem for the 891 00:49:41,290 --> 00:49:42,540 quiz next week. 892 00:49:45,230 --> 00:49:48,916 So basically it's n to the log base b of a, so it's n to the 893 00:49:48,916 --> 00:49:50,166 log base 2 of 8. 894 00:49:53,080 --> 00:49:54,752 So that's n cubed-- n to the log base 2 of 8 895 00:49:54,752 --> 00:49:56,002 is n cubed. 896 00:49:58,180 --> 00:50:00,780 And that's bigger than the order one here, so the answer 897 00:50:00,780 --> 00:50:02,080 is order n cubed. 898 00:50:02,080 --> 00:50:03,700 Which is a relief, right? 899 00:50:03,700 --> 00:50:09,140 Because if it weren't order n cubed we would be doing a lot 900 00:50:09,140 --> 00:50:15,260 more work than one of the looping algorithms. 901 00:50:15,260 --> 00:50:18,270 However, let's actually go through and understand where 902 00:50:18,270 --> 00:50:21,120 that n cubed comes from. 903 00:50:21,120 --> 00:50:23,680 And to do that I'm going to use the technique of a 904 00:50:23,680 --> 00:50:27,380 recursion tree, which I think all of you have seen. 905 00:50:27,380 --> 00:50:30,060 But let me go through it slowly here to make sure, 906 00:50:30,060 --> 00:50:32,780 because we're going to do it again when we do cache misses 907 00:50:32,780 --> 00:50:35,940 and it's going to be more complicated. 908 00:50:35,940 --> 00:50:37,080 So here's the idea. 909 00:50:37,080 --> 00:50:42,160 I write down on the left hand side the recurrence, W of n. 910 00:50:42,160 --> 00:50:44,610 And now what I do is I substitute, and I 911 00:50:44,610 --> 00:50:46,040 draw it out as a tree. 912 00:50:46,040 --> 00:50:49,580 I have eight problems of size n over 2. 
913 00:50:49,580 --> 00:50:54,350 So what I do is I replace that with the thing that's on the 914 00:50:54,350 --> 00:50:58,170 right hand side. I've dropped the theta here, and basically 915 00:50:58,170 --> 00:50:59,640 put just a constant one here. 916 00:50:59,640 --> 00:51:02,440 Because I'll take into account the thetas at the end. 917 00:51:02,440 --> 00:51:08,510 So I have a one here, and then I have-- oops, 918 00:51:08,510 --> 00:51:10,870 that should be a W. 919 00:51:10,870 --> 00:51:13,110 Should be W of n over 2. 920 00:51:13,110 --> 00:51:16,060 That's a bug there. 921 00:51:16,060 --> 00:51:17,625 And then I replace each of those. 922 00:51:21,420 --> 00:51:27,795 OK, W of n over 2-- sorry, that should be W of n over 4. 923 00:51:27,795 --> 00:51:30,240 Ah, more bugs. 924 00:51:30,240 --> 00:51:33,240 I'll fix them up after lecture. 925 00:51:33,240 --> 00:51:35,760 So this should be W of n over 4. 926 00:51:35,760 --> 00:51:39,260 And we go all the way down to the bottom to where I hit the 927 00:51:39,260 --> 00:51:44,410 base case of theta 1. 928 00:51:44,410 --> 00:51:48,020 So I built out this big tree that represents, if you think 929 00:51:48,020 --> 00:51:50,350 about it, exactly what the algorithm is going to do. 930 00:51:50,350 --> 00:51:53,140 It's going to walk this tree doing the work. 931 00:51:53,140 --> 00:51:55,450 And what I've simply put up here is the work it does at 932 00:51:55,450 --> 00:51:56,700 every level. 933 00:52:00,450 --> 00:52:03,320 So the first thing we want to do is figure out what's the 934 00:52:03,320 --> 00:52:04,480 height of this tree. 935 00:52:04,480 --> 00:52:06,551 Can somebody tell me what the height of the tree is? 936 00:52:10,479 --> 00:52:11,830 It is log n. 937 00:52:11,830 --> 00:52:13,830 What's the base? 938 00:52:13,830 --> 00:52:16,950 Log base 2 of n, because at every level, if I haven't 939 00:52:16,950 --> 00:52:21,280 made a mistake here, I'm actually halving the argument. 
940 00:52:21,280 --> 00:52:24,250 So I'm halving the argument at each level. 941 00:52:24,250 --> 00:52:26,610 So the height is log base 2 of n. 942 00:52:26,610 --> 00:52:29,070 So LG is notation for log base 2. 943 00:52:31,720 --> 00:52:34,930 So if I have log base 2 of n, I can count how many leaves 944 00:52:34,930 --> 00:52:36,940 there are to this tree. 945 00:52:36,940 --> 00:52:40,350 So how many leaves are there? 946 00:52:40,350 --> 00:52:45,200 Well, I'm branching by a factor of eight at every level. 947 00:52:45,200 --> 00:52:49,030 And if I'm going log base 2 levels, the number of leaves 948 00:52:49,030 --> 00:52:52,030 is 8 to the log base 2. 949 00:52:52,030 --> 00:52:54,550 So 8 to the log base 2 of n. 950 00:52:54,550 --> 00:52:57,100 And then with a little bit of algebraic magic it turns out 951 00:52:57,100 --> 00:52:58,360 that's the same as n to the log base 2 of 8. 952 00:53:02,140 --> 00:53:05,710 And that is equal to n cubed. 953 00:53:05,710 --> 00:53:08,820 So I end up with n cubed leaves. 954 00:53:08,820 --> 00:53:12,750 Now let's add up all the work that's in here. 955 00:53:12,750 --> 00:53:14,940 So what I do is I add across the rows. 956 00:53:14,940 --> 00:53:18,020 So at the top level I've got work of one. 957 00:53:18,020 --> 00:53:20,220 At the next level I have work of eight. 958 00:53:20,220 --> 00:53:22,690 At the next I have work of 64. 959 00:53:22,690 --> 00:53:25,340 Do people see the pattern? 960 00:53:25,340 --> 00:53:28,830 The work is growing how? 961 00:53:28,830 --> 00:53:31,190 Geometrically. 962 00:53:31,190 --> 00:53:34,160 And at this level I know that if I add up all the leaves 963 00:53:34,160 --> 00:53:39,000 I've got work of n cubed. 964 00:53:39,000 --> 00:53:40,700 Because I've got n cubed leaves, each of 965 00:53:40,700 --> 00:53:42,270 them taking a constant. 966 00:53:42,270 --> 00:53:44,840 And so this is geometrically increasing, which means that 967 00:53:44,840 --> 00:53:46,570 it's all borne in the leaves. 
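The level-by-level sums just described (1, 8, 64, and so on down to n cubed at the leaves) can be checked with a few lines of arithmetic. This is purely illustrative; the helper name `level_work` is mine.

```python
def level_work(n):
    # per-level work in the recursion tree for W(n) = 8 W(n/2) + Theta(1):
    # level k has 8^k nodes, each charged constant (here, unit) work
    levels = []
    k = 0
    while n >= 1:
        levels.append(8 ** k)
        n //= 2
        k += 1
    return levels

# for n = 8 the levels are 1, 8, 64, 512 -- geometrically increasing,
# so the total is dominated by the n^3 leaf level (within a factor of 8/7)
```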
968 00:53:46,570 --> 00:53:48,480 So the total work is order n cubed. 969 00:53:51,600 --> 00:53:52,430 And that's nice. 970 00:53:52,430 --> 00:53:55,220 It's the same work as the looping versions. 971 00:53:55,220 --> 00:53:56,650 Because we don't want to increase that. 972 00:54:01,140 --> 00:54:01,730 Questions? 973 00:54:01,730 --> 00:54:03,600 Because now we're going to do cache misses and it's going to 974 00:54:03,600 --> 00:54:07,420 get hairy, not too hairy, but hairier. 975 00:54:13,810 --> 00:54:15,540 So here we're going to do cache misses. 976 00:54:15,540 --> 00:54:17,690 So the first thing is coming up with a recurrence. 977 00:54:17,690 --> 00:54:21,960 And this is probably the hardest part, except for the 978 00:54:21,960 --> 00:54:23,675 other hard part, which is solving the recurrence. 979 00:54:26,270 --> 00:54:29,910 So here what we're doing is, we have the same thing, in that 980 00:54:29,910 --> 00:54:35,570 I'm solving eight problems of size n over 2. And to do the 981 00:54:35,570 --> 00:54:36,360 work in here, 982 00:54:36,360 --> 00:54:39,580 I'm taking basically order one cache misses. 983 00:54:39,580 --> 00:54:44,440 However I do it, those things work out. 984 00:54:44,440 --> 00:54:46,920 Plus the cache misses I have in there. 985 00:54:46,920 --> 00:54:51,060 But then at some point, what I'm claiming is that I'm going 986 00:54:51,060 --> 00:54:53,680 to bottom out the recursion early. 987 00:54:53,680 --> 00:54:59,260 Not when I get to n equals 1, but in fact when n squared is 988 00:54:59,260 --> 00:55:03,960 less than some constant times the cache size. 989 00:55:03,960 --> 00:55:07,010 For some sufficiently small constant. 
990 00:55:07,010 --> 00:55:10,200 And what I claim, at that point, is that the number of 991 00:55:10,200 --> 00:55:13,540 cache misses I'm going to take at that point, I can just, 992 00:55:13,540 --> 00:55:17,450 without doing any more recursive stuff, I can just 993 00:55:17,450 --> 00:55:18,700 say it's n squared over b. 994 00:55:21,140 --> 00:55:23,080 So where does that come from? 995 00:55:23,080 --> 00:55:27,450 So this basically comes from the tall-cache assumption. 996 00:55:27,450 --> 00:55:30,460 So the idea is that when n squared is less than a 997 00:55:30,460 --> 00:55:35,660 constant times the size of your cache, a constant times the 998 00:55:35,660 --> 00:55:39,410 size of m, then that means that this fits into-- 999 00:55:39,410 --> 00:55:42,120 the n by n matrices fit within m. 1000 00:55:42,120 --> 00:55:42,960 I've got three of them. 1001 00:55:42,960 --> 00:55:48,660 I've got C, A and B. So that's where I need a constant here. 1002 00:55:48,660 --> 00:55:53,220 So they're all going to fit in the cache. 1003 00:55:53,220 --> 00:55:56,640 And so if I look at it, all I have to do is count up the 1004 00:55:56,640 --> 00:56:05,840 cold misses for bringing in those submatrices at the time 1005 00:56:05,840 --> 00:56:10,970 that n squared hits this threshold here of some constant times m. 1006 00:56:10,970 --> 00:56:13,685 And to bring in those matrices is only going to cost me n 1007 00:56:13,685 --> 00:56:16,580 squared over b cache misses. 1008 00:56:16,580 --> 00:56:19,430 And once I've done that, all of the rest of the recursion 1009 00:56:19,430 --> 00:56:24,350 that's going on down below is all operating out of cache. 1010 00:56:24,350 --> 00:56:29,510 It's not taking any misses if I have an 1011 00:56:29,510 --> 00:56:32,450 optimal replacement algorithm. 1012 00:56:32,450 --> 00:56:36,540 It's not taking any more misses as I get further down. 1013 00:56:36,540 --> 00:56:45,470 Questions about this part of the recurrence here? 
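The recurrence just stated can be played with numerically. In this sketch `Q` counts misses, `M` and `B` play the roles of the lecture's m and b, and the constant `c` is an arbitrary assumed value; all the names and numbers here are mine, not the course's.

```python
def Q(n, M, B, c=0.1):
    # ideal-cache miss count for the recursive multiply:
    # once n^2 < c*M, the three n-by-n submatrices fit in cache, and the
    # only misses are the cold misses to bring them in: about n^2 / B
    if n * n < c * M:
        return n * n // B
    # otherwise, eight half-size subproblems plus constant overhead
    return 8 * Q(n // 2, M, B, c) + 1
```

Evaluating, say, `Q(1024, M=2**15, B=64)` and comparing against n cubed over (B times the square root of M) should show the count tracking that bound up to a constant factor, and a larger cache should yield fewer misses.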
1014 00:56:51,510 --> 00:56:52,760 So people with me? 1015 00:56:55,690 --> 00:56:58,550 So when I get down to something of size n squared, 1016 00:56:58,550 --> 00:57:02,990 where the submatrix is size n squared, the point is that 1017 00:57:02,990 --> 00:57:05,300 I'll bring in the entire submatrix. 1018 00:57:05,300 --> 00:57:08,640 But all the stuff that I have to do in there is never going 1019 00:57:08,640 --> 00:57:10,610 to get kicked out, because it's small 1020 00:57:10,610 --> 00:57:12,340 enough that it all fits. 1021 00:57:12,340 --> 00:57:14,960 And an optimal algorithm for replacement is going to make 1022 00:57:14,960 --> 00:57:17,200 sure that stuff stays in there, because there's plenty 1023 00:57:17,200 --> 00:57:19,160 of room in the cache at that point. 1024 00:57:19,160 --> 00:57:22,350 There's room for three matrices in the cache and a 1025 00:57:22,350 --> 00:57:24,970 couple of other variables that I might need and that's 1026 00:57:24,970 --> 00:57:26,220 basically it. 1027 00:57:30,400 --> 00:57:33,610 Any questions about that? 1028 00:57:33,610 --> 00:57:37,540 So let's then solve this recurrence. 1029 00:57:37,540 --> 00:57:39,480 So we're going to go about it very much the same way. 1030 00:57:39,480 --> 00:57:41,230 We draw a recursion tree. 1031 00:57:41,230 --> 00:57:44,470 So for those of you who are rusty in drawing recursion trees, I can 1032 00:57:44,470 --> 00:57:47,430 promise you there will be a recursion tree on the quiz 1033 00:57:47,430 --> 00:57:49,350 next Thursday. 1034 00:57:49,350 --> 00:57:50,570 I think I can promise that. 1035 00:57:50,570 --> 00:57:51,250 Can I promise that? 1036 00:57:51,250 --> 00:57:52,560 Yeah, OK, I can promise that. 1037 00:57:55,970 --> 00:57:58,450 The way I like to do it, by the way, is not to try to just 1038 00:57:58,450 --> 00:58:02,020 draw it out all at once. 1039 00:58:02,020 --> 00:58:05,500 In my own notes when I do this I always draw it step by step. 
1040 00:58:05,500 --> 00:58:07,980 I copy over and just do it step by step. 1041 00:58:07,980 --> 00:58:09,290 You might think that that's excessive. 1042 00:58:09,290 --> 00:58:12,680 Gee, why do I have to draw every one along the way? 1043 00:58:12,680 --> 00:58:16,060 Well the answer is, it's a geometric process. 1044 00:58:16,060 --> 00:58:20,130 All the ones going up to the last one are a small amount of 1045 00:58:20,130 --> 00:58:23,710 the work to draw out the last one. 1046 00:58:23,710 --> 00:58:27,520 And they help you get it correct the first time. 1047 00:58:27,520 --> 00:58:32,550 So let me encourage you to draw out the tree 1048 00:58:32,550 --> 00:58:33,620 iteration by iteration. 1049 00:58:33,620 --> 00:58:36,200 Here I'm going to just do replacement. 1050 00:58:36,200 --> 00:58:39,940 So what we do is we replace with the right hand side to do 1051 00:58:39,940 --> 00:58:41,890 the recursion. 1052 00:58:41,890 --> 00:58:42,940 And replace that. 1053 00:58:42,940 --> 00:58:46,890 And once again I made the bug-- that should be n over 8. 1054 00:58:46,890 --> 00:58:48,600 Sorry, n over 4 here. 1055 00:58:48,600 --> 00:58:50,140 n over 4. 1056 00:58:50,140 --> 00:58:55,570 And then we keep going down until I get to the base case, 1057 00:58:55,570 --> 00:58:58,780 which is this case here. 1058 00:58:58,780 --> 00:59:02,120 Now comes the first hard part. 1059 00:59:02,120 --> 00:59:03,800 How tall is this tree? 1060 00:59:03,800 --> 00:59:04,130 Yeah-- 1061 00:59:04,130 --> 00:59:06,300 AUDIENCE: [INAUDIBLE] 1062 00:59:06,300 --> 00:59:08,038 square root of n over b. 1063 00:59:08,038 --> 00:59:11,635 You want n squared to be cm, not [INAUDIBLE]. 1064 00:59:11,635 --> 00:59:14,830 PROFESSOR: So here's the thing, let's discuss, first of 1065 00:59:14,830 --> 00:59:17,090 all, why this is what it is. 
1066 00:59:17,090 --> 00:59:22,850 So at the point where n squared is less than cm, that 1067 00:59:22,850 --> 00:59:28,066 says that it's going to cost us n squared over b. 1068 00:59:28,066 --> 00:59:33,720 But n squared is just less than cm, so therefore, this is 1069 00:59:33,720 --> 00:59:34,970 effectively m over b. 1070 00:59:37,420 --> 00:59:39,230 Good question. 1071 00:59:39,230 --> 00:59:40,580 So everybody see that? 1072 00:59:40,580 --> 00:59:43,020 So when I get down to the bottom, it's basically costing 1073 00:59:43,020 --> 00:59:45,630 me something that's about the number of lines I have in my 1074 00:59:45,630 --> 00:59:51,640 cache, the number of misses to fill things up. 1075 00:59:51,640 --> 00:59:53,820 The tricky thing is, what's the height? 1076 00:59:53,820 --> 00:59:56,280 Because this is crucial to getting this kind of 1077 00:59:56,280 --> 00:59:58,110 calculation right. 1078 00:59:58,110 --> 01:00:00,610 So what is the height of this tree? 1079 01:00:04,690 --> 01:00:07,730 So I'm halving every time. 1080 01:00:07,730 --> 01:00:11,506 So one way to think about it is, it's going to be log base 2 1081 01:00:11,506 --> 01:00:17,120 of n, just as before, minus the height of the tree that is 1082 01:00:17,120 --> 01:00:19,510 hidden here that I didn't have to actually go into because 1083 01:00:19,510 --> 01:00:20,760 there are no cache misses in it. 1084 01:00:23,210 --> 01:00:31,980 So that's going to occur when n is approximately m-- cm-- 1085 01:00:31,980 --> 01:00:36,560 sorry, when n is approximately square root of cm. 1086 01:00:36,560 --> 01:00:39,120 So I end up with log of n minus 1/2 log of cm. 1087 01:00:44,600 --> 01:00:45,340 That's the height here. 
1088 01:00:45,340 --> 01:00:47,880 Because the height at this point of the tree that's 1089 01:00:47,880 --> 01:00:51,000 missing-- because there are no cache misses, I don't have to account 1090 01:00:51,000 --> 01:00:56,880 for any cache misses in there-- is log of cm to the one half, 1091 01:00:56,880 --> 01:00:58,130 based on this. 1092 01:01:01,610 --> 01:01:03,400 Does that follow for everybody? 1093 01:01:03,400 --> 01:01:04,650 People comfortable? 1094 01:01:07,080 --> 01:01:08,060 Yeah? 1095 01:01:08,060 --> 01:01:10,090 OK, good. 1096 01:01:10,090 --> 01:01:11,950 So now what do we do? 1097 01:01:11,950 --> 01:01:15,430 We count up how many leaves there are. 1098 01:01:15,430 --> 01:01:17,930 So the number of leaves is 8, because I have a branching 1099 01:01:17,930 --> 01:01:21,150 factor of 8, to whatever the height is. 1100 01:01:21,150 --> 01:01:22,400 Log n minus 1/2 log of cm. 1101 01:01:24,760 --> 01:01:27,760 And then if I do my algebraic magic, well that part is n 1102 01:01:27,760 --> 01:01:31,126 cubed, the minus becomes a divide, and now 8 to the 1/2 1103 01:01:31,126 --> 01:01:40,180 log of cm is the square root of cm cubed, 1104 01:01:40,180 --> 01:01:41,430 which is cm to the 3/2. 1105 01:01:46,561 --> 01:01:49,090 Is that good? 1106 01:01:49,090 --> 01:01:52,050 The rest of it is very similar to what we did before. 1107 01:01:52,050 --> 01:01:57,570 At every level I have a certain number of things that 1108 01:01:57,570 --> 01:01:58,710 I'm adding up. 1109 01:01:58,710 --> 01:02:02,850 And on the bottom level, I take the cost here, m over b, 1110 01:02:02,850 --> 01:02:06,620 and I multiply it by the number of leaves. 1111 01:02:06,620 --> 01:02:09,440 When I do that I get, what? 1112 01:02:09,440 --> 01:02:13,470 I get n cubed over b times m to the 1/2-- n cubed divided by b square root of m. 1113 01:02:16,660 --> 01:02:18,200 This is geometric. 
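The leaf-count algebra here can be sanity-checked numerically. This snippet is illustrative only, with `cM` standing in for the constant times the cache size; the function names are mine.

```python
import math

def leaves_from_height(n, cM):
    # 8^(lg n - 1/2 lg(cM)): branching factor 8 raised to the tree height
    return 8 ** (math.log2(n) - 0.5 * math.log2(cM))

def leaves_closed_form(n, cM):
    # the same count after the minus becomes a divide: n^3 / (cM)^(3/2)
    return n ** 3 / cM ** 1.5

# multiplying the ~m/b cost per leaf by the n^3 / (cM)^(3/2) leaves then
# gives the Theta(n^3 / (b * sqrt(m))) total from the lecture
```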
1114 01:02:18,200 --> 01:02:20,300 So the answer is, in this case, just going to be 1115 01:02:20,300 --> 01:02:27,860 the sum of a constant factor times the largest thing. 1116 01:02:27,860 --> 01:02:29,110 And why does this look familiar? 1117 01:02:31,630 --> 01:02:36,180 That was the optimal result we got from tiling. 1118 01:02:36,180 --> 01:02:37,570 But where's the tuning parameters? 1119 01:02:40,670 --> 01:02:41,920 No tuning parameters. 1120 01:02:44,380 --> 01:02:46,440 No tuning parameters. 1121 01:02:46,440 --> 01:02:48,970 So that means that this analysis that I did for one 1122 01:02:48,970 --> 01:02:54,550 level of caching, it applies even if you have three levels 1123 01:02:54,550 --> 01:02:55,950 of caching. 1124 01:02:55,950 --> 01:03:03,885 At every level you're getting near optimal cache behavior. 1125 01:03:03,885 --> 01:03:06,070 So it's got the same cache misses as with tiling. 1126 01:03:16,450 --> 01:03:19,510 These are called cache-oblivious algorithms. 1127 01:03:19,510 --> 01:03:22,680 Because the algorithm itself has no tuning parameters 1128 01:03:22,680 --> 01:03:23,700 related to cache. 1129 01:03:23,700 --> 01:03:26,670 Unlike the tiling algorithm. 1130 01:03:26,670 --> 01:03:28,895 That's a cache-aware algorithm. 1131 01:03:28,895 --> 01:03:33,830 The cache-oblivious algorithm has no tuning parameters. 1132 01:03:33,830 --> 01:03:36,950 And this is an efficient one. 1133 01:03:36,950 --> 01:03:38,740 So, by the way, our first algorithm was 1134 01:03:38,740 --> 01:03:40,550 cache-oblivious as well-- 1135 01:03:40,550 --> 01:03:41,260 the naive one. 1136 01:03:41,260 --> 01:03:42,510 It's just not efficient. 1137 01:03:45,260 --> 01:03:48,132 So in this case we have an efficient one. 1138 01:03:48,132 --> 01:03:51,970 It's got no voodoo tuning of parameters, no explicit 1139 01:03:51,970 --> 01:03:57,980 knowledge of caches, and it passively autotunes itself. 
1140 01:03:57,980 --> 01:04:01,180 As it goes down, when it fits things into cache it fits them 1141 01:04:01,180 --> 01:04:02,890 and uses things locally. 1142 01:04:02,890 --> 01:04:04,990 And then it goes down and it fits into the next level of 1143 01:04:04,990 --> 01:04:09,080 cache and uses things locally and so forth. 1144 01:04:09,080 --> 01:04:13,260 It handles multi-level caches automatically. 1145 01:04:13,260 --> 01:04:16,660 And it's good in multi-programmed environments. 1146 01:04:16,660 --> 01:04:20,410 Because if you end up taking away some of the cache it 1147 01:04:20,410 --> 01:04:22,650 doesn't matter. 1148 01:04:22,650 --> 01:04:26,510 It still will end up using whatever cache is available 1149 01:04:26,510 --> 01:04:36,160 nearly as well as any other program could use that cache. 1150 01:04:36,160 --> 01:04:38,585 So these are very good in multi-programmed environments. 1151 01:04:43,930 --> 01:04:46,600 The best cache-oblivious matrix multiplication, in fact, 1152 01:04:46,600 --> 01:04:50,430 doesn't do an eight-way split as I described here. 1153 01:04:50,430 --> 01:04:54,060 That was easier to analyze and so forth. 1154 01:04:54,060 --> 01:04:56,030 The best ones that I know work on 1155 01:04:56,030 --> 01:04:57,610 arbitrary rectangular matrices. 1156 01:04:57,610 --> 01:05:00,500 And what they do, is they do binary splitting. 1157 01:05:00,500 --> 01:05:08,440 So you would take your matrix, i times j. So if you take a 1158 01:05:08,440 --> 01:05:11,470 matrix, let's say it's something like this. 1159 01:05:14,640 --> 01:05:22,590 So here we have i, k, k, j. 1160 01:05:22,590 --> 01:05:24,210 And you're going to get something of shape 1161 01:05:29,160 --> 01:05:33,190 i times j, right? 1162 01:05:36,250 --> 01:05:37,940 What it does, is it takes whatever 1163 01:05:37,940 --> 01:05:39,750 is the largest dimension. 1164 01:05:39,750 --> 01:05:42,560 In this case k is the largest dimension. 
1165 01:05:42,560 --> 01:05:45,890 And it partitions either one or both of the 1166 01:05:45,890 --> 01:05:48,560 matrices along k. 1167 01:05:48,560 --> 01:05:50,280 In this case, it doesn't do that. 1168 01:05:50,280 --> 01:05:53,020 And then it recursively solves the two 1169 01:05:53,020 --> 01:05:55,210 sub-rectangular problems. 1170 01:05:55,210 --> 01:05:58,150 And that ends up being a very, very efficient fast code if 1171 01:05:58,150 --> 01:06:00,650 you code that up tightly. 1172 01:06:00,650 --> 01:06:02,590 So it does binary splitting rather than-- 1173 01:06:02,590 --> 01:06:03,660 and it's general. 1174 01:06:03,660 --> 01:06:08,290 And if you analyze this, it's got the same behavior as the 1175 01:06:08,290 --> 01:06:09,130 eight way division. 1176 01:06:09,130 --> 01:06:11,630 It's just more efficient. 1177 01:06:16,060 --> 01:06:16,790 So questions? 1178 01:06:16,790 --> 01:06:20,590 We had a question about now comparing 1179 01:06:20,590 --> 01:06:23,750 with the tiled algorithm. 1180 01:06:23,750 --> 01:06:25,145 Do you want to reprise your question? 1181 01:06:25,145 --> 01:06:28,458 AUDIENCE: What I was saying was, I guess 1182 01:06:28,458 --> 01:06:29,725 this answers my question. 1183 01:06:29,725 --> 01:06:34,742 If you were to tune the previous algorithm properly, 1184 01:06:34,742 --> 01:06:37,634 and you're assuming it's not in a multi-program 1185 01:06:37,634 --> 01:06:41,900 environment, the recursive one, it will never be the one 1186 01:06:41,900 --> 01:06:43,550 that is locked. 1187 01:06:43,550 --> 01:06:44,050 [INAUDIBLE] 1188 01:06:44,050 --> 01:06:50,010 PROFESSOR: So at some level that's true, and at some level 1189 01:06:50,010 --> 01:06:51,260 it's not true. 
1190 01:06:54,400 --> 01:07:02,590 So it is true in that if it's cache-oblivious you can't take 1191 01:07:02,590 --> 01:07:05,550 advantage of all the corner cases that you might be 1192 01:07:05,550 --> 01:07:08,140 able to take advantage of in a tiling algorithm. 1193 01:07:08,140 --> 01:07:09,900 So from that point of view, that's true. 1194 01:07:09,900 --> 01:07:14,050 On the other hand, these algorithms work even as you go 1195 01:07:14,050 --> 01:07:16,705 into paging and disks and so forth. 1196 01:07:16,705 --> 01:07:19,090 And the interesting thing about a disk, if you start 1197 01:07:19,090 --> 01:07:22,290 having a big problem that doesn't fit in memory and, in 1198 01:07:22,290 --> 01:07:26,060 fact, is out of core as they call it, and is paging to 1199 01:07:26,060 --> 01:07:33,680 disk, is that the sizes of the sectors that can be brought 1200 01:07:33,680 --> 01:07:36,050 efficiently off of a disk vary. 1201 01:07:40,430 --> 01:07:47,770 And the reason is because on a disk, if you read a track 1202 01:07:47,770 --> 01:07:52,300 around the outside you can get two or three times as much 1203 01:07:52,300 --> 01:07:57,510 data off the disk as from a track that you read near the inside. 1204 01:07:57,510 --> 01:08:02,960 So the head moves in and out of the disk like this. 1205 01:08:02,960 --> 01:08:05,920 It's typically on a pivot and pivots in and out. 1206 01:08:05,920 --> 01:08:08,940 If it's reading towards the inside, you get blocks that 1207 01:08:08,940 --> 01:08:11,810 are small versus blocks that are large. 1208 01:08:11,810 --> 01:08:14,950 This is effectively the cache-line size that 1209 01:08:14,950 --> 01:08:16,160 gets brought in. 1210 01:08:16,160 --> 01:08:22,569 And so the thing is that there are actually programs for 1211 01:08:22,569 --> 01:08:27,189 which, when you run them on disk, there is no fixed-size 1212 01:08:27,189 --> 01:08:33,529 tuning parameter that beats the cache-oblivious one. 
1213 01:08:33,529 --> 01:08:36,240 So the cache-oblivious one will beat every fixed-size 1214 01:08:36,240 --> 01:08:38,090 tuning parameter you put in. 1215 01:08:38,090 --> 01:08:41,859 Because you don't have any control over where your file 1216 01:08:41,859 --> 01:08:47,130 got laid out on disk, and how much it's bringing in and how 1217 01:08:47,130 --> 01:08:49,720 much it isn't varies. 1218 01:08:49,720 --> 01:08:53,439 On the other hand, for the in-core thing, you're exactly right. 1219 01:08:53,439 --> 01:08:57,240 That, in principle, you could tune it up more if you make it 1220 01:08:57,240 --> 01:08:58,710 more cache-aware. 1221 01:08:58,710 --> 01:09:02,229 But then, of course, you suffer from portability loss 1222 01:09:02,229 --> 01:09:03,964 and from problems if you're in a multi-programmed environment 1223 01:09:03,964 --> 01:09:04,840 and so forth. 1224 01:09:04,840 --> 01:09:07,790 So the answer is, that there are situations where you're 1225 01:09:07,790 --> 01:09:14,340 doing some kind of embedded or dedicated type of application, 1226 01:09:14,340 --> 01:09:17,990 where you can take advantage of a lot of things that you want. 1227 01:09:17,990 --> 01:09:19,180 There are other times where you're doing a 1228 01:09:19,180 --> 01:09:24,109 multi-programmed environment, or where you want to be able 1229 01:09:24,109 --> 01:09:27,279 to move something from one platform to another without 1230 01:09:27,279 --> 01:09:30,590 having to re-engineer all of the tuning and testing. 1231 01:09:30,590 --> 01:09:34,189 In which case it's better to use the cache-oblivious one. 1232 01:09:34,189 --> 01:09:38,410 So as I mentioned, my view of these things is that 1233 01:09:38,410 --> 01:09:41,069 performance is like a currency. 1234 01:09:41,069 --> 01:09:43,540 It's a universal medium of exchange. 
1235 01:09:43,540 --> 01:09:45,490 So one place you might want to pay a little bit of 1236 01:09:45,490 --> 01:09:48,300 performance is to make it so it's very portable, as with the 1237 01:09:48,300 --> 01:09:50,240 cache-oblivious stuff. 1238 01:09:50,240 --> 01:09:54,210 So you get nearly good performance, but now I don't 1239 01:09:54,210 --> 01:09:57,760 have that headache to worry about. 1240 01:09:57,760 --> 01:10:01,740 And then sometimes, in fact, it actually does as well as or 1241 01:10:01,740 --> 01:10:02,880 better than the tuned one. 1242 01:10:02,880 --> 01:10:06,090 For matrix multiplication, the best algorithms are the 1243 01:10:06,090 --> 01:10:07,450 cache-oblivious ones that I'm aware of. 1244 01:10:07,450 --> 01:10:10,348 AUDIENCE: [INAUDIBLE] 1245 01:10:10,348 --> 01:10:12,763 currency and all the different currencies. 1246 01:10:12,763 --> 01:10:13,729 Single currency. 1247 01:10:13,729 --> 01:10:15,670 PROFESSOR: You want a currency for-- 1248 01:10:15,670 --> 01:10:20,210 so in fact the performance for this is, against people who have 1249 01:10:20,210 --> 01:10:22,580 engineered it to take advantage of exactly the cache 1250 01:10:22,580 --> 01:10:26,470 size, we can do just as well with the cache-oblivious one. 1251 01:10:26,470 --> 01:10:28,450 And particularly, if you think about it, when you've got a 1252 01:10:28,450 --> 01:10:32,565 three-level hierarchy, you've got 12 loops. 1253 01:10:36,380 --> 01:10:39,550 And now you're going to tune that. 1254 01:10:39,550 --> 01:10:40,800 It's hard to get it all right. 1255 01:10:44,310 --> 01:10:46,940 So next time we're going to see a bunch of other examples 1256 01:10:46,940 --> 01:10:52,080 of cache-oblivious algorithms that are optimal in terms of 1257 01:10:52,080 --> 01:10:53,110 their use of cache. 1258 01:10:53,110 --> 01:10:55,790 Of course, by the way, those people who are familiar with 1259 01:10:55,790 --> 01:10:59,420 Strassen's algorithm, that's a cache-oblivious algorithm. 
1260 01:10:59,420 --> 01:11:01,500 It takes advantage of the same kind of thing. 1261 01:11:01,500 --> 01:11:05,530 And in fact you can analyze it and come up with good bounds 1262 01:11:05,530 --> 01:11:10,670 on performance for Strassen's algorithm just the same. 1263 01:11:10,670 --> 01:11:11,920 Just as we've done here.
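The binary-splitting scheme sketched at the board earlier -- split the largest of the i, j, and k dimensions in half and recurse on the two sub-rectangular problems -- might look like this in outline. The name `mult_split` and the nested-list layout are my own assumptions; this is not the tuned implementation the professor refers to, just the splitting rule itself.

```python
def mult_split(C, A, B, i0, i1, j0, j1, k0, k1):
    # C[i][j] += sum over k of A[i][k] * B[k][j], on the index box
    # [i0,i1) x [j0,j1) x [k0,k1); always halve the largest dimension
    di, dj, dk = i1 - i0, j1 - j0, k1 - k0
    if di == dj == dk == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]
        return
    if di >= dj and di >= dk:        # split i: the rows of A and C
        mid = i0 + di // 2
        mult_split(C, A, B, i0, mid, j0, j1, k0, k1)
        mult_split(C, A, B, mid, i1, j0, j1, k0, k1)
    elif dj >= dk:                   # split j: the columns of B and C
        mid = j0 + dj // 2
        mult_split(C, A, B, i0, i1, j0, mid, k0, k1)
        mult_split(C, A, B, i0, i1, mid, j1, k0, k1)
    else:                            # split k: the dimension A and B share
        mid = k0 + dk // 2
        mult_split(C, A, B, i0, i1, j0, j1, k0, mid)
        mult_split(C, A, B, i0, i1, j0, j1, mid, k1)
```

Because only the largest dimension is halved, the same code handles arbitrary rectangular shapes, including dimensions that are not powers of two.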